How to use regular expressions to parse HTML in Java?

Question

Please can someone tell me a simple way to find href and src tags in an html file using regular expressions in Java?
And then, how do I get the URL associated with the tag?

Thanks for any suggestion.

Canonical question: RegEx match open tags except XHTML self-contained tags — Peter Mortensen, Commented Nov 11, 2014 at 0:06

Community · Accepted Answer · 2017-05-23 12:25:02Z

54

Using regular expressions to pull values from HTML is always a mistake. HTML syntax is a lot more complex than it may first appear, and it's very easy for a page to catch out even a very complex regular expression.

Use an HTML Parser instead. See also What are the pros and cons of the leading Java HTML parsers?

edited May 23, 2017 at 12:25

answered Mar 24, 2009 at 11:41

David Webb

4

It depends on what you are doing. If you are processing a lot of HTML from random sources an HTML Parser may well fail on some of them and will likely require more memory and processing than a regex. For example the Heritrix web crawler uses regex for link extraction on HTML pages.
– Kris
Commented Mar 24, 2009 at 12:19
1

Please answer the original question first and then suggest how to optimize. Many people visit this question on SO hoping to learn how to parse HTML using regular expressions, but instead find something they weren't looking for. Using regular expressions is quick and dirty and you do not have to download a separate library for it to work.
– Drupad Panchal
Commented Jul 29, 2011 at 19:26
2

I disagree with this answer, it is by no means always a mistake to use regex on html - as @Kris pointed out: Trying to parse a full html document often requires valid html which is not always given. And it provides a huge overkill in cases where you have a clearly defined case like finding an <a> tag's href attribute value.
– Bachi
Commented Dec 18, 2013 at 10:19

Henryk Konsek · Accepted Answer · 2009-03-24 13:17:37Z

The other answers are right: the Java regex API is not the proper tool for this goal. Use the efficient, secure, and well-tested high-level tools mentioned in the other answers.

If your question is more about the regex API than about a real-life problem (for learning purposes, for example), you can do it with the following code:

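import java.util.regex.Matcher;
import java.util.regex.Pattern;

String html = "foo <a href='link1'>bar</a> baz <a href='link2'>qux</a> foo";
// (.*?) reluctantly captures everything up to the next single quote
Pattern p = Pattern.compile("<a href='(.*?)'>");
Matcher m = p.matcher(html);
while (m.find()) {
    System.out.println(m.group(0)); // the whole matched tag
    System.out.println(m.group(1)); // the captured href value
}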

And the output is:

<a href='link1'>
link1
<a href='link2'>
link2

Please note that the lazy (reluctant) quantifier *? must be used so that each match stops at the end of a single tag. Group 0 is the entire match; group 1 is the first capturing group (the first pair of parentheses).
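
The pattern above only matches an href that directly follows "<a " and is wrapped in single quotes. As a rough sketch of how the same idea could be widened to cover both href and src, either quote style, and any letter case, something like the following might do. The class and method names are made up for illustration, and the usual caveat still applies: this also matches text inside comments or scripts, and it ignores unquoted values.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
final class AttributeUrlExtractor {
    // Group 1: attribute name, group 2: opening quote, group 3: the URL.
    // \2 requires the closing quote to be the same character as the opening one.
    private static final Pattern URL_ATTRIBUTE = Pattern.compile(
            "\\b(href|src)\\s*=\\s*(['\"])(.*?)\\2",
            Pattern.CASE_INSENSITIVE);

    static List<String> extractUrls(String html) {
        List<String> urls = new ArrayList<>();
        Matcher m = URL_ATTRIBUTE.matcher(html);
        while (m.find()) {
            urls.add(m.group(3));
        }
        return urls;
    }

    public static void main(String[] args) {
        String html = "foo <a class=\"x\" href='link1'>bar</a> "
                + "<IMG SRC=\"pic.png\"> <a href=\"link2\">qux</a>";
        System.out.println(extractUrls(html)); // prints [link1, pic.png, link2]
    }
}
```

Matching on the attribute rather than on the whole tag means other attributes before or after href/src do not matter, which is where the original pattern breaks first.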

Thanks. While not a real "works-everywhere" regex this works for data returned from google hot trends and I have been pulling my hair to parse it for a long time... — rjha94, Commented Oct 17, 2010 at 16:01

mP. · Accepted Answer · 2009-03-24 12:40:22Z

6

Don't use regular expressions. Use NekoHTML or TagSoup, which act as a bridge that lets you visit an HTML document through a SAX or DOM API, as you would with XML.
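
To give an idea of what the SAX style looks like, here is a minimal sketch that uses only the XML parser bundled with the JDK. That is an assumption made to keep the example self-contained: the JDK parser accepts well-formed XHTML only, whereas NekoHTML and TagSoup exist precisely to cope with messy real-world HTML. The handler shape is the part that should carry over; the class and method names are invented for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper; uses the JDK's XML parser as a stand-in for an HTML-tolerant one.
final class SaxLinkCollector {
    static List<String> collectUrls(String xhtml) {
        final List<String> urls = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes attrs) {
                // Called once per opening tag; pick up href and src if present.
                for (String name : new String[] {"href", "src"}) {
                    String value = attrs.getValue(name);
                    if (value != null) {
                        urls.add(value);
                    }
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (Exception e) {
            // The JDK parser rejects anything that is not well-formed XML.
            throw new IllegalStateException("Input is not well-formed XML", e);
        }
        return urls;
    }

    public static void main(String[] args) {
        String xhtml = "<html><body><a href=\"link1\">bar</a><img src=\"pic.png\"/></body></html>";
        System.out.println(collectUrls(xhtml)); // prints [link1, pic.png]
    }
}
```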

answered Mar 24, 2009 at 12:40

mP.

Scott Cowan · Accepted Answer · 2009-03-24 11:56:12Z

If you want to go down the HTML parsing route, which Dave and I recommend, here's code that parses a String named data for anchor tags and prints their href values.

Since you're only looking at anchor tags, you should be OK with just a regex, but if you want to do more, go with a parser. The Mozilla HTML Parser is the best out there.

// Locate the native parser library shipped with the Mozilla HTML Parser.
File parserLibraryFile = new File("lib/MozillaHtmlParser/native/bin/MozillaParser" + EnviromentController.getSharedLibraryExtension());
String parserLibrary = parserLibraryFile.getAbsolutePath();
// mozilla.dist.bin directory:
final File mozillaDistBinDirectory = new File("lib/MozillaHtmlParser/mozilla.dist.bin." + EnviromentController.getOperatingSystemName());
MozillaParser.init(parserLibrary, mozillaDistBinDirectory.getAbsolutePath());
MozillaParser parser = new MozillaParser();
// data is the String holding the HTML to parse.
Document domDocument = parser.parse(data);
NodeList list = domDocument.getElementsByTagName("a");
for (int i = 0; i < list.getLength(); i++) {
    Node n = list.item(i);
    NamedNodeMap m = n.getAttributes();
    if (m != null) {
        Node attrNode = m.getNamedItem("href");
        if (attrNode != null) {
            System.out.println(attrNode.getNodeValue());
        }
    }
}

Mark · Accepted Answer · 2009-03-24 11:50:55Z

3

I searched the Regular Expression Library (https://regexlib.com/Search.aspx?k=href and https://regexlib.com/Search.aspx?k=src)

The best I found was

((?<html>(href|src)\s*=\s*")|(?<css>url\())(?<url>.*?)(?(html)"|\))

Check out these links for more expressions:

https://regexlib.com/REDetails.aspx?regexp_id=2261

https://regexlib.com/REDetails.aspx?regexp_id=758

https://regexlib.com/REDetails.aspx?regexp_id=774

https://regexlib.com/REDetails.aspx?regexp_id=1437
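
The expression quoted above relies on a conditional, (?(html)"|\)), which java.util.regex does not support. Named groups, on the other hand, are available from Java 7 onward. One possible Java adaptation, sketched here with an invented class name, replaces the conditional with a plain alternation and gives each branch its own named group:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical adaptation of the regexlib pattern; not an exact equivalent.
final class HtmlAndCssUrlExtractor {
    // Branch 1: href="..." or src="..." (double quotes only, as in the original).
    // Branch 2: CSS url(...).
    private static final Pattern URLS = Pattern.compile(
            "(?:href|src)\\s*=\\s*\"(?<html>.*?)\"|url\\((?<css>.*?)\\)");

    static List<String> extractUrls(String text) {
        List<String> urls = new ArrayList<>();
        Matcher m = URLS.matcher(text);
        while (m.find()) {
            // Exactly one of the two named groups participates in each match.
            String fromHtml = m.group("html");
            urls.add(fromHtml != null ? fromHtml : m.group("css"));
        }
        return urls;
    }

    public static void main(String[] args) {
        String text = "<img src=\"a.png\"> <a href = \"b.html\"> "
                + "<div style=\"background:url(c.gif)\">";
        System.out.println(extractUrls(text)); // prints [a.png, b.html, c.gif]
    }
}
```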

answered Mar 24, 2009 at 11:50

Mark

2

I hate that site. I see they still don't bother to mention which flavor a given regex is targeted at. This regex (id=2261) uses named captures and conditionals, neither of which is supported by Java.
– Alan Moore
Commented Mar 24, 2009 at 17:03

Jörg W Mittag · Accepted Answer · 2009-03-24 21:30:18Z

2

Regular expressions can only parse regular languages, that's why they are called regular expressions. HTML is not a regular language, ergo it cannot be parsed by regular expressions.

HTML parsers, on the other hand, can parse HTML, that's why they are called HTML parsers.
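
A small illustration of where this bites in practice: nested tags. java.util.regex offers more than strictly regular expressions (backreferences, lookaround), but it has no recursion, so it cannot pair each opening tag with its own closing tag. The sketch below, with made-up names, shows a reluctant pattern stopping at the first closing tag it sees and a greedy one running past a sibling.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class showing how tag matching goes wrong on nested or sibling elements.
final class NestedTagDemo {
    // Returns group 1 of the first match of the given regex, or null if there is none.
    static String firstGroup(String regex, String html) {
        Matcher m = Pattern.compile(regex).matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Reluctant: stops at the inner </div>, cutting the outer element short.
        System.out.println(firstGroup("<div>(.*?)</div>", "<div>a<div>b</div>c</div>"));
        // prints a<div>b

        // Greedy: swallows two sibling elements as if they were one.
        System.out.println(firstGroup("<div>(.*)</div>", "<div>x</div><div>y</div>"));
        // prints x</div><div>y
    }
}
```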

You should use your favorite HTML parser instead.

answered Mar 24, 2009 at 21:30

Jörg W Mittag

Guss · Accepted Answer · 2009-03-25 08:49:23Z

1

Contrary to popular opinion, regular expressions are useful tools to extract data from unstructured text (which HTML is).

If you are doing complex HTML data extraction (say, find all paragraphs in a page) then HTML parsing is probably the way to go. But if you just need to get some URLs from HREFs, then a regular expression would work fine and it will be very hard to break it.

Try something like this:

/<a[^>]+href=["']?([^'"> ]+)["']?[^>]*>/i
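
The expression above is written with /.../i delimiters, which Java does not use. A minimal sketch of how it might be carried over: drop the slashes, escape the double quotes for the string literal, and pass Pattern.CASE_INSENSITIVE in place of the i flag. The class and method names are invented for illustration. Note that "<a" also matches the start of tags such as <abbr> or <area>, so this is only a rough filter.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical wrapper around the pattern from the answer.
final class AnchorHrefExtractor {
    // Same pattern, minus the /.../i delimiters; group 1 is the URL.
    private static final Pattern ANCHOR_HREF = Pattern.compile(
            "<a[^>]+href=[\"']?([^'\"> ]+)[\"']?[^>]*>",
            Pattern.CASE_INSENSITIVE);

    static List<String> extractHrefs(String html) {
        List<String> hrefs = new ArrayList<>();
        Matcher m = ANCHOR_HREF.matcher(html);
        while (m.find()) {
            hrefs.add(m.group(1));
        }
        return hrefs;
    }

    public static void main(String[] args) {
        String html = "<A HREF=\"http://example.com/a\">x</A> <a class=c href=/b>y</a>";
        System.out.println(extractHrefs(html)); // prints [http://example.com/a, /b]
    }
}
```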

answered Mar 25, 2009 at 8:49

Guss
