4

I'm trying to parse an HTML page and extract two values from a table row. The HTML for the table row is as follows:

<tr>
<td title="Associated temperature in (ºC)" class="TABLEDATACELL" nowrap="nowrap" align="Left" colspan="1" rowspan="1">Max Temperature (ºC)</td>
<td class="TABLEDATACELLNOTT" nowrap="nowrap" align="Center" colspan="1" rowspan="1">6</td>
<td class="TABLEDATACELLNOTT" nowrap="nowrap" align="Center" colspan="1" rowspan="1"> 13:41:30</td>
</tr>

and the expression I have at the moment is:

<tr>[\s]<td[^<]+?>Max Temperature[\w\s]*</td>[\s]
<td[^<]+?>(?<value>([\d]+))</td>[\s]
<td[^<]+?>(?<time>([\d\:]+))</td>[\s]</tr>

However, I don't seem to be able to extract any matches. Could anyone point me in the right direction? Thanks.

7 Answers

4

Parsing HTML reliably with regular expressions is notoriously difficult.

I would look for an HTML parsing library, or a "screen scraping" library, instead ;)

If the HTML comes from an untrusted source, you have to be extra careful to handle malicious HTML safely. Poor HTML handling is a common source of security vulnerabilities.

1

Try

<tr>\s*
<td[^>]*>.*?</td>\s*
<td[^>]*>\s*(?<value>\d+)\s*</td>\s*
<td[^>]*>\s*(?<time>\d{2}:\d{2}:\d{2})\s*</td>\s*
</tr>\s*
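
To check the suggestion against the row from the question, here is a minimal sketch in Python. It assumes the line breaks in the pattern above are only there for readability, and it rewrites the named groups as (?P<name>...) because that is how Python spells them; the rest of the pattern is unchanged.

```python
import re

# The sample row from the question, reproduced verbatim.
ROW = """<tr>
<td title="Associated temperature in (ºC)" class="TABLEDATACELL" nowrap="nowrap" align="Left" colspan="1" rowspan="1">Max Temperature (ºC)</td>
<td class="TABLEDATACELLNOTT" nowrap="nowrap" align="Center" colspan="1" rowspan="1">6</td>
<td class="TABLEDATACELLNOTT" nowrap="nowrap" align="Center" colspan="1" rowspan="1"> 13:41:30</td>
</tr>"""

# Same pattern as above, one piece per line; only the group syntax differs.
PATTERN = re.compile(
    r"<tr>\s*"
    r"<td[^>]*>.*?</td>\s*"
    r"<td[^>]*>\s*(?P<value>\d+)\s*</td>\s*"
    r"<td[^>]*>\s*(?P<time>\d{2}:\d{2}:\d{2})\s*</td>\s*"
    r"</tr>\s*"
)

match = PATTERN.search(ROW)
if match:
    print(match.group("value"), match.group("time"))  # 6 13:41:30
```

Note that the first cell is matched by .*?, so this pattern is not tied to the "Max Temperature" label. On a full page it would pick the first row whose second cell is a number and whose third cell is a time.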
0

When you write <td[^<]+?> I guess you really mean <td[^>]*>.

That is, "opening angle bracket, td, maybe stuff other than a closing angle bracket..."
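
A small Python sketch of the difference between the two fragments. The test strings bare and with_attrs are made up for illustration and are not taken from the page in the question.

```python
import re

loose = re.compile(r"<td[^<]+?>")   # fragment from the question
strict = re.compile(r"<td[^>]*>")   # fragment suggested here

bare = "<td>6</td>"                                          # hypothetical cell without attributes
with_attrs = '<td class="TABLEDATACELLNOTT" align="Center">6</td>'

# [^<]+? needs at least one character before the closing >, so a bare <td> is missed.
print(loose.match(bare))            # None
print(strict.match(bare).group())   # <td>

# When attributes are present, both fragments stop at the same >.
print(loose.match(with_attrs).group() == strict.match(with_attrs).group())  # True
```

Every cell in the question's sample row carries attributes, so on that row the two fragments match the same text. The stricter form mainly protects against cells without attributes.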

0
<tr>[\s]<td[^<]+?>Max Temperature[\w\s]*</td>[\s]

I haven't looked at all of it yet, but that [^<] probably needs to be [^>], since you're trying to match every character that isn't > up to the > just before Max Temperature.

0

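The " (ºC)" before the closing td is not covered by [\w\s]*, because the parentheses are neither word characters nor whitespace. I matched it with [^<]* instead: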

<tr>[\s]<td[^<]+?>Max Temperature[^<]*</td>[\s]

Note that \w matches a word character, not a word boundary (that is \b). It gets a little tricky there, so I'd use a more general approach.

Also, in the third cell there is a space after the opening td tag. Is that accounted for?

<td[^<]+?>[\s]?(?<time>([\d\:]+))</td>[\s]</tr>
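
The two changes above can be tried side by side in Python. This is a sketch under a few assumptions: the line breaks in the posted expression are display wrapping only, the tags in the HTML are separated by a single newline, and the named groups are rewritten as (?P<name>...) for Python.

```python
import re

# The sample row from the question, reproduced verbatim.
ROW = """<tr>
<td title="Associated temperature in (ºC)" class="TABLEDATACELL" nowrap="nowrap" align="Left" colspan="1" rowspan="1">Max Temperature (ºC)</td>
<td class="TABLEDATACELLNOTT" nowrap="nowrap" align="Center" colspan="1" rowspan="1">6</td>
<td class="TABLEDATACELLNOTT" nowrap="nowrap" align="Center" colspan="1" rowspan="1"> 13:41:30</td>
</tr>"""

# The expression as posted in the question.
original = re.compile(
    r"<tr>[\s]<td[^<]+?>Max Temperature[\w\s]*</td>[\s]"
    r"<td[^<]+?>(?P<value>([\d]+))</td>[\s]"
    r"<td[^<]+?>(?P<time>([\d\:]+))</td>[\s]</tr>"
)

# Two changes: [^<]* after the label, and an optional space before the time.
adjusted = re.compile(
    r"<tr>[\s]<td[^<]+?>Max Temperature[^<]*</td>[\s]"
    r"<td[^<]+?>(?P<value>([\d]+))</td>[\s]"
    r"<td[^<]+?>[\s]?(?P<time>([\d\:]+))</td>[\s]</tr>"
)

print(original.search(ROW))   # None: [\w\s]* stops at the "(" in " (ºC)"
m = adjusted.search(ROW)
print(m.group("value"), m.group("time"))  # 6 13:41:30
```

Because [\s] matches exactly one whitespace character, this version still depends on a single newline between tags. Replacing [\s] with \s* makes it tolerant of \r\n line endings and indentation.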
0

I use https://www.regexbuddy.com/ for checks like this. As far as I have tested, @sgehrig's suggestion is correct.

0

Use the Html Agility Pack or a similar library instead, as @Bjarke Ebert suggests. It's the right tool for the task.
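
For comparison, here is a rough sketch of the parser-based approach. It is written in Python with the standard library's html.parser, not with the Html Agility Pack (which is a .NET library), so treat it as an illustration of the idea only. The names RowCollector and max_temperature are made up for this example.

```python
from html.parser import HTMLParser

# The sample row from the question, reproduced verbatim.
ROW = """<tr>
<td title="Associated temperature in (ºC)" class="TABLEDATACELL" nowrap="nowrap" align="Left" colspan="1" rowspan="1">Max Temperature (ºC)</td>
<td class="TABLEDATACELLNOTT" nowrap="nowrap" align="Center" colspan="1" rowspan="1">6</td>
<td class="TABLEDATACELLNOTT" nowrap="nowrap" align="Center" colspan="1" rowspan="1"> 13:41:30</td>
</tr>"""


class RowCollector(HTMLParser):
    """Collects the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row being read, None outside <tr>
        self._cell = None  # text pieces of the cell being read, None outside <td>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None  # drop any cell left unclosed in this row


def max_temperature(html):
    """Return (value, time) from the row labelled 'Max Temperature', or None."""
    parser = RowCollector()
    parser.feed(html)
    parser.close()
    for cells in parser.rows:
        if len(cells) == 3 and cells[0].startswith("Max Temperature"):
            return cells[1], cells[2]
    return None


print(max_temperature(ROW))  # ('6', '13:41:30')
```

The lookup keys on the cell text rather than on the exact attribute layout, so changes to attributes, whitespace, or line endings in the page do not affect it.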
