2

I'm trying to write a regular expression for my HTML parser.

I want to match an HTML tag with a given attribute (e.g. <div> with class="tab news selected") that contains one or more <a href> tags. The regexp should match the entire tag (from <div> to </div>). I always seem to get "memory exhausted" errors, probably because my program treats every tag it can find as a matching one.

I'm using the Boost regex library.

1

5 Answers

7

You should probably look at this question re. regexps and HTML. The gist is that using regular expressions to parse HTML is not by any means an ideal solution.

2

You may also find these questions helpful:

Can you provide some examples of why it is hard to parse XML and HTML with a regex?

Can you provide an example of parsing HTML with your favorite parser?

2

As others have said, don't use regexes if at all possible. If your code is actually XHTML (i.e. it is also well-formed XML), I can recommend both the Xerces and Expat XML parsers, which will do a much better job for you than regexes.

1

Maybe regexps aren't the best solution, but I'm already using like five different libraries and boost does fine when it comes to locating <a href> tags and keywords.

I'm using these regexps:

/<a[^\n]*/searched attribute/[^\n]*>[^\n]*</a>/ for locating <a href> tags and:

/<a[^\n]*href[^\n]*>/searched keyword/</a>/ for locating links

(BTW can it be done better? - I suck at regex ;))
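On whether the patterns can be tightened: one option is to stop each piece at the next tag boundary (`[^>]*` inside a tag, `[^<]*` for the link text) instead of `[^\n]*`, which can run across several tags on the same line. The sketch below is only an illustration under stated assumptions: it uses `std::regex` rather than Boost (the pattern syntax here should carry over, but that is not verified against Boost), `find_links` is a made-up helper name, and it assumes the link text contains no nested tags.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper (not from the original post): collects every
// <a ...>text</a> whose opening tag carries an href attribute.
// Assumption: the link text itself contains no nested tags.
std::vector<std::string> find_links(const std::string& html) {
    // [^>]* keeps the match inside the opening tag;
    // [^<]* keeps the link text from running into the next tag.
    static const std::regex anchor(
        R"(<a\b[^>]*\bhref\s*=[^>]*>[^<]*</a\s*>)",
        std::regex::icase);

    std::vector<std::string> out;
    for (std::sregex_iterator it(html.begin(), html.end(), anchor), end;
         it != end; ++it) {
        out.push_back(it->str());
    }
    return out;
}
```

Because no part of the pattern can cross a `<` or `>` it does not own, an anchor without href (such as `<a name="x">`) is skipped rather than glued to a later tag.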

What I need now is locating tags containing <a href>'s and I think regexps will do all right - maybe I'll need to write my own parsing function as piotr said.

2
  • It's not that regular expressions are not the best solution - for what you're trying to do, regex is not a valid solution at all. Use an HTML or XML parser instead. Commented Apr 27, 2009 at 13:21
  • OK, so which one do you recommend? I'd prefer an easy one ;)
    – zajcev
    Commented Apr 27, 2009 at 16:41
0

Do as flex does: match <div> case-insensitively, put your parser in a "div matched" state, keep processing input until </div>, then reset the state.

This takes two regexps and a state variable.

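Valid characters for SGML tag names are [A-Za-z_:]

So: /<[A-Za-z_:]+>/ matches a bare opening tag such as <div>. Note that it will not match tags that carry attributes, or names containing digits such as <h1>.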
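The two-regex, one-state-variable idea above might look roughly like this. It is a minimal sketch under assumptions, not a robust parser: it uses `std::regex` as a stand-in for Boost, `divs_with_links` is a hypothetical name, the opening pattern allows attributes (unlike the bare-tag pattern above), and nested <div> elements are not handled.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: returns each <div ...>...</div> block that
// contains at least one <a ... href ...> tag.
// Assumption: divs are not nested, so one boolean state is enough.
std::vector<std::string> divs_with_links(const std::string& html) {
    static const std::regex open_div(R"(<div\b[^>]*>)", std::regex::icase);
    static const std::regex close_div(R"(</div\s*>)", std::regex::icase);
    static const std::regex link(R"(<a\b[^>]*\bhref\b)", std::regex::icase);

    std::vector<std::string> result;
    bool in_div = false;  // the state variable: "div matched" or not
    std::string::const_iterator pos = html.begin();
    std::string::const_iterator block_start = html.begin();
    std::smatch m;

    while (pos != html.end()) {
        if (!in_div) {
            // Look for the next opening <div>; stop if there is none.
            if (!std::regex_search(pos, html.end(), m, open_div)) break;
            block_start = m[0].first;
            pos = m[0].second;
            in_div = true;
        } else {
            // Inside a div: look for its closing tag, then reset state.
            if (!std::regex_search(pos, html.end(), m, close_div)) break;
            std::string block(block_start, m[0].second);
            if (std::regex_search(block, link)) result.push_back(block);
            pos = m[0].second;
            in_div = false;
        }
    }
    return result;
}
```

Each regex only ever looks for one short tag, so the expensive "match everything between <div> and </div>" step is replaced by a plain substring copy. Filtering on a particular class attribute would be an extra check on the opening tag and is left out here.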

1
  • Or, instead of re-inventing the wheel, use an existing parser which has already been written and will already deal with edge cases and so on. Commented Apr 27, 2009 at 13:15
