Any good Java HTML parsers?

Question

I have been using Cobra until now because it was so easy to use, but unfortunately it had problems with a few test cases. Can anyone suggest a tried-and-tested library?

I've tried Cobra's built-in parser and HTMLCleaner without any luck.

Judging by your last question, the problem isn't with the XPath evaluator. You were using XPathFactory.newInstance(), which creates the stock Java evaluator that works on any XML document loaded into a DOM model (as an instance of Document). Cobra itself isn't an XPath evaluator - it's an HTML parser which produces a Document, and it did that wrong in your case. So what you actually want is a "good Java HTML parser", not a "good Java XPath evaluator". – Pavel Minaev, Commented Nov 26, 2009 at 23:55
Oops... sorry. I've revised my question... I'm just going nuts with all the HTML in front of my eyes... – Legend, Commented Nov 27, 2009 at 0:05
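
To illustrate the point about the stock evaluator, here is a minimal sketch using only JDK classes. It shows that the XPath object from XPathFactory.newInstance() runs against any DOM Document, regardless of which parser built it. The class name XPathOnDom, the helper method names, and the sample markup are made up for illustration. The input here is well-formed, which real-world HTML often is not; that is the part an HTML parser would have to handle.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class; not part of Cobra or any other library.
public class XPathOnDom {

    // Builds a DOM Document from well-formed markup with the JDK's XML parser.
    // An HTML parser that outputs a DOM Document would take this role for messy HTML.
    public static Document parse(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return builder.parse(new InputSource(new StringReader(xml)));
    }

    // Evaluates an XPath expression with the stock evaluator and returns its string value.
    public static String evaluate(Document doc, String expression) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        return xpath.evaluate(expression, doc);
    }

    public static void main(String[] args) throws Exception {
        // Assumed sample input: well-formed and without namespaces.
        String xhtml = "<html><body><p id=\"greeting\">Hello</p></body></html>";
        Document doc = parse(xhtml);
        System.out.println(evaluate(doc, "//p[@id='greeting']/text()"));
    }
}
```

If the Document handed to evaluate() was built incorrectly by the parser, the XPath results will be wrong even though the evaluator itself is working as intended.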

Pascal Thivent · Accepted Answer · 2009-11-27 00:53:33Z

4

TagSoup is really great when dealing with crappy HTML/XHTML.

Jericho and NekoHTML are also good at parsing invalid HTML.

I have tried and tested TagSoup and Jericho myself; for NekoHTML I'm relying on feedback from a trustworthy source.


Pavel Minaev · Accepted Answer · 2009-11-27 00:11:07Z

1

Mozilla HTML Parser looks rather interesting. By definition, it's supposed to be as good as the Gecko engine itself, which is likely to cover your needs.


Jim Garrison · Accepted Answer · 2009-11-26 23:57:03Z

1

Take a look at Saxon (no, I'm not involved in any way with the product, just a satisfied user).


Saxon is an awesome XSLT 2.0 & XQuery implementation, but it doesn't parse HTML. – Pavel Minaev, Commented Nov 27, 2009 at 0:10
@Pavel - The original question didn't mention HTML. – Jim Garrison, Commented Nov 27, 2009 at 2:31


peter.murray.rust · Accepted Answer · 2009-11-28 06:47:15Z

1

[Answering the title - the overall question and comments are not consistent]

JTidy (https://jtidy.sourceforge.net/) is a port of Dave Raggett's HTMLTidy. It's very useful, though I think development may have slowed or ceased.


Ms2ger · Accepted Answer · 2009-11-28 13:51:31Z

1

I suggest Validator.nu's parser, based on the HTML5 parsing algorithm. (Mozilla is currently in the process of replacing its own HTML parser with this one.)

