Jodd Lagarto

Lagarto HTML parsers suite

Java libraries for HTML/XML parsing

Lagarto Suite is a family of HTML/XML parsers written in Java. It consists of the following libraries:

LagartoParser is a fast, versatile, all-purpose event-based HTML parser. You can use it to modify or analyze markup, which lets you quickly assemble complex custom transformations and code analysis tools. It is performant and follows the rules of the official HTML specification.
LagartoDom builds a DOM tree in memory from the input. The tree is more convenient to manipulate, at a minor performance cost.
Jerry is a "jQuery in Java": you can use the familiar syntax of the JavaScript library inside Java to parse and manipulate HTML.
CSSelly is, finally, a parser of CSS3 selectors.

Each of the Lagarto libraries has its pros and cons. You should check each and use one that suits your requirements.

Lagarto parsers are compatible with Java 8 and newer.

License

The code is released under the BSD-2-Clause license. It has a minimal set of dependencies with the same or similarly open license, so you should be able to use it in any project and for any purpose.

Installation

Tips on how to install Jodd Lagarto library in your app

Jodd Lagarto is released on Maven Central. You can use the following snippets to add it to your project:

That is all!

Jodd Lagarto has only one (small) dependency: the Jodd util library, which is released under the same license.

Snapshots

Snapshots are released manually. Feel free to contact me if you need a new SNAPSHOT release sooner.

Contact

Let's keep in touch!

info@jodd.org

Lagarto parser

LagartoParser

Fast event-based HTML parser

Let's see it in action:

LagartoParser lagartoParser = new LagartoParser("<html><h1>Hello</h1></html>");

TagVisitor tagVisitor = new EmptyTagVisitor() {
    @Override
    public void tag(final Tag tag) {
        if (tag.nameEquals("h1")) {
            System.out.println(tag.getName());
        }
    }
	
    @Override
    public void text(final CharSequence text) {
        System.out.println(text);
    }
};

lagartoParser.parse(tagVisitor);

As the input content is parsed, the callback methods in the visitor get invoked. In this case, the result is:

h1
Hello
h1

Note that the tag() event was emitted twice: first for the open tag, and then for the close tag. In other words, LagartoParser performs the tokenization of the input HTML.

Parsing specification

Text is emitted as a single block, not character by character.
The case of a tag name (and of other tokens) is not changed when emitted.
LagartoParser only tokenizes. The DOM tree is neither created nor validated.
The script tag is emitted separately.
Internet Explorer conditional comments are supported.
XML is supported too.

LagartoParser only performs tokenization and it does not verify if tags make sense. For example, if your HTML has a non-closed tag, LagartoParser will not consider this as an error. LagartoDom, on the other hand, will handle these cases.

Input types

LagartoParser accepts both char[] and CharSequence. This allows the usage of various implementations of inputs, including String, or even a Reader.
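
A Reader is not a CharSequence itself, so its content has to be collected first. Here is a minimal JDK-only sketch of that step. The ReaderInput class and its readAll() helper are made up for this illustration and are not part of Lagarto:

```java
import java.io.CharArrayWriter;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical helper, not part of Lagarto: drains a Reader into a char[].
final class ReaderInput {

    // Reads everything from the reader; the caller stays responsible for closing it.
    static char[] readAll(Reader reader) {
        CharArrayWriter out = new CharArrayWriter();
        char[] buffer = new char[1024];
        try {
            int read;
            while ((read = reader.read(buffer)) != -1) {
                out.write(buffer, 0, read);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toCharArray();
    }

    public static void main(String[] args) {
        char[] content = readAll(new StringReader("<html><h1>Hello</h1></html>"));
        // The array can now be handed to any parser that accepts char[] input.
        System.out.println(new String(content));
    }
}
```

The resulting char[] (or a String built from it) can then be used as the parser input.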

Events

List of events emitted during the parsing

During parsing, LagartoParser calls various callback methods of the provided TagVisitor implementation.

`start()` & `end()`

Invoked before and after the content is parsed.

`text(CharSequence)`

Callback invoked on a block of plain text.

Note that text() is called for all text blocks, including the whitespaces and indentations.

`comment(CharSequence)`

Invoked on an HTML comment. The argument contains the comment content, without the tag boundaries.

`tag(Tag)`

Callback invoked for all HTML tags: open, close, or empty. The argument is a Tag instance that carries various information about the tag: its name, attributes, depth level, etc.

The Tag instance is reused during HTML parsing for better performance! The same Tag instance is passed to all callback methods and for every detected HTML tag. Do not store it internally!
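
Why does this matter? The following JDK-only sketch mimics the situation with a hypothetical ReusableToken class (a stand-in for illustration, not the real Tag). Keeping references to a reused object leaves you with many references to the last value, while copying the data you need inside the callback is safe:

```java
import java.util.ArrayList;
import java.util.List;

final class ReusedInstanceDemo {

    // Hypothetical stand-in for a parser-owned object that is overwritten on every event.
    static final class ReusableToken {
        private String name;

        void set(String name) {
            this.name = name;
        }

        String getName() {
            return name;
        }
    }

    // Wrong: stores the shared instance itself, so every entry ends up showing the last name.
    static List<String> collectByReference(String... names) {
        ReusableToken shared = new ReusableToken();
        List<ReusableToken> stored = new ArrayList<>();
        for (String name : names) {
            shared.set(name);      // the "parser" overwrites the same object
            stored.add(shared);    // only a reference is stored
        }
        List<String> result = new ArrayList<>();
        for (ReusableToken token : stored) {
            result.add(token.getName());
        }
        return result;
    }

    // Right: copies the needed value while the event is being handled.
    static List<String> collectByValue(String... names) {
        ReusableToken shared = new ReusableToken();
        List<String> stored = new ArrayList<>();
        for (String name : names) {
            shared.set(name);
            stored.add(shared.getName());
        }
        return stored;
    }

    public static void main(String[] args) {
        System.out.println(collectByReference("html", "h1", "title")); // [title, title, title]
        System.out.println(collectByValue("html", "h1", "title"));     // [html, h1, title]
    }
}
```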

`script(Tag, CharSequence)`

Invoked on all script tags. The callback method receives the script Tag instance and the script body.

`doctype(Doctype)`

Callback for the doctype tag.

`xml()` and `cdata()`

These two callbacks are invoked for XML-specific tags when parsing the XML content.

`error(String)`

Every error is reported through this callback. Depending on the configuration, the error message may or may not contain the exact error position.

Configuration

Fine-tune the parser

LagartoParser configuration is defined in the LagartoParserConfig class. An instance of this class can be passed to the constructor, or you can modify the configuration later, in a functional style. For example:

LagartoParserConfig cfg = new LagartoParserConfig().setCaseSensitive(true);
LagartoParser lagartoParser = new LagartoParser(cfg, "<html>");

LagartoParser lagartoParser =
    new LagartoParser("<html>")
        .configure(cfg -> {
            cfg.setCaseSensitive(true);
        });

The list of available properties follows.

calculatePosition

Disabled by default. When enabled, LagartoParser calculates the position of elements. The position consists of a line number, an approximate column number, and a total offset in the file.

Calculating position makes processing slower.
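
To get a feeling for what is being calculated, here is a small sketch that derives a line and a column from an offset. It is only an illustration: the PositionSketch class, the 1-based numbering, and the treatment of \n as the only line break are assumptions of this example, not LagartoParser internals.

```java
// Hypothetical illustration, not LagartoParser code.
final class PositionSketch {

    // Returns {line, column}, both 1-based, for the given 0-based offset.
    static int[] lineAndColumn(CharSequence input, int offset) {
        int end = Math.min(offset, input.length());
        int line = 1;
        int lineStart = 0;
        // Every character before the offset has to be inspected for line breaks.
        for (int i = 0; i < end; i++) {
            if (input.charAt(i) == '\n') {
                line++;
                lineStart = i + 1;
            }
        }
        return new int[] {line, end - lineStart + 1};
    }

    public static void main(String[] args) {
        String html = "<html>\n  <h1>";
        int[] position = lineAndColumn(html, html.indexOf("<h1>"));
        System.out.println("line " + position[0] + ", column " + position[1]); // line 2, column 3
    }
}
```

The scan over the preceding characters hints at why position tracking adds work on top of plain tokenization.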

enableConditionalComments

When enabled, LagartoParser detects IE conditional comments. When it's disabled and a conditional comment is found, LagartoParser reports an error for downlevel-revealed conditional comment tags, or treats downlevel-hidden conditional comments as regular comments.

Enabling conditional comments also makes parsing slower.

caseSensitive

Defaults to false. When enabled, the various matching operations become case-sensitive. It should not be used for parsing HTML content.

parseXmlTags

Enables parsing of XML-specific tags. It's disabled by default.

enableRawTextModes

Enabled by default. It tells LagartoParser to apply special handling to all so-called 'raw' text tags, such as style, script, xmp and so on. You can disable it if that content is not important to you, and gain some more speed.

textBufferSize

By default it's set to 1024. This is the size of the internal text buffer used to collect all text blocks. The buffer grows if needed. If your HTML contains large text blocks, you may increase this number for a small performance improvement.

Adapter and Writer

Additional cool classes

Besides TagVisitor, you can also use one of the following classes.

EmptyTagVisitor - a default implementation that does nothing. You will probably use it, as you can override just the methods you need.
TagVisitors - a simple composite of many TagVisitor implementations. They are invoked in the given order.
TagAdapter - an adapter over a target TagVisitor. With such an adapter you can change the behavior of an existing visitor.

Enclosed adapters

You can use the following adapters:

StripHtmlTagAdapter - strips all unnecessary whitespace from text blocks and also removes all comments. For example, multiple spaces are replaced with a single space, etc.
UrlRewriterTagAdapter - as the name implies, lets you change the href values of <a> links.
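
The whitespace rule described for StripHtmlTagAdapter can be approximated in plain Java. The sketch below rests on two assumptions of this example: text that consists only of whitespace is dropped, and any other run of whitespace collapses to one space. WhitespaceSketch and compactText() are hypothetical names, and this is not the adapter's actual code:

```java
// Hypothetical illustration of whitespace compaction, not the adapter's implementation.
final class WhitespaceSketch {

    static String compactText(CharSequence text) {
        String value = text.toString();
        if (value.trim().isEmpty()) {
            return "";                        // whitespace-only block: nothing worth keeping
        }
        return value.replaceAll("\\s+", " "); // collapse every run of whitespace to one space
    }

    public static void main(String[] args) {
        System.out.println("[" + compactText("  Hello  ") + "]"); // [ Hello ]
        System.out.println("[" + compactText("   ") + "]");       // []
    }
}
```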

Writer

TagWriter is a simple TagVisitor that builds HTML from the events. Usually, you use it as the target of some adapter. This way you can modify the input HTML by parsing it, adapting it, and then writing it again to an Appendable.

Example

TagWriter tagWriter = new TagWriter();
StripHtmlTagAdapter adapter = new StripHtmlTagAdapter(tagWriter);
LagartoParser lagartoParser = new LagartoParser(
        "<html> <h1>  Hello  </h1> </html>");

lagartoParser.parse(adapter);

System.out.println(tagWriter.getOutput().toString());

The resulting string would be:

<html><h1> Hello </h1></html>

FAQ

Some common answers

What's going on with my TITLE and TEXTAREA?

Those two tags are different. You can close them ONLY with a closing tag.

For example, the following snippet:

<html><head><title /></head><body>hello world!</body></html>

has only 3 tags: html, head, and title. The text of the title tag is everything from the </head up to the end of the string.

Feel free to ask something...

...and I will answer :)

Lagarto DOM

LagartoDOM

Parse HTML to a DOM tree.

LagartoDOM parses HTML content and creates a DOM tree from it. It is based on LagartoParser. The created DOM is convenient to traverse and manipulate. However, if you need ultimate performance, go with the event-based LagartoParser.

Let's see LagartoDom in action:

Document document = new LagartoDOMBuilder()
    .parse("<html><h1>Hello</h1></html>");

Node html = document.getChild(0);
Node h1 = html.getFirstChild();

System.out.println(h1.getTextContent());    // Hello

Text text = (Text) h1.getFirstChild();
System.out.println(text.getTextValue());    // Hello

System.out.println(text.getCssPath());      // html h1

It's as simple as that. As said, a DOM tree is created. It consists of Node elements. Each Node has a bunch of methods for tree traversal, such as getChild(index), getFirstChild(), getParentNode(), getChildNodes(), etc.

Node also has getters for the node name, attributes, and node value. A Node can be detached from the tree or attached at some point. Element is a special type of Node that represents elements, and there is a whole subset of methods that deal only with elements.

Finally, you can render the DOM tree or any Node back to HTML content.

Parsing specification

LagartoDOM follows only a subset of the official DOM-building specification. Here is why!

By default, LagartoDOM follows all the rules that do not involve any movement of DOM nodes. This is done on purpose. The idea is to get a tree that exactly mirrors what you have provided. For example, if you pass HTML with some tags that are not supposed to be nested, LagartoDOM will not complain and you will get exactly what you have on input.

In most cases, this will be perfectly fine, as developers are probably not using all the tricks of HTML5 for the sake of better readability.

Still, if you need more rules, you can turn them on! In that case, the resulting DOM tree can be modified per HTML5 rules. I have implemented the most common of these rules and exceptions, but haven't covered them all (yet). So if you have some weird HTML, you might get a different tree than what you get in a browser.

LagartoDOM is not (yet) a strict implementation of HTML5 DOM-building rules, but it is good enough for most cases!

Carry on :)

Parsing modes

LagartoDom parses XML, XHTML and HTML

LagartoDom can parse XML, XHTML and HTML content. There are already predefined methods that quickly enable these modes:

enableHtmlMode()
enableXhtmlMode()
enableXmlMode()

HTML plus mode

There is one more mode available:

enableHtmlPlusMode()

The default HTML mode does not change the order of the nodes. However, the HTML5 specification has some rules where nodes are moved around the DOM. For example, tags placed inside a table but outside of the proper table tags are moved in front of the table (the specification calls this foster parenting). Moreover, there are some special rules on which orphan tags may be closed and the scope in which they can be closed.

HTML Plus mode is one that enables these additional parsing rules. These rules require some additional processing and may slow down the processing.

Debug mode

Regardless of the parsing mode, LagartoDom may work in debug mode:

enableDebugMode()

In debugging mode all the errors are collected and their position is calculated. Of course, this slows down the processing.

Configuration

Fine-tune parsing to DOM tree.

LagartoDom configuration is specified in the LagartoDomBuilderConfig class. Besides its own properties, it also holds the configuration instance of the LagartoParser that is used internally.

In most cases, you will just use the predefined modes. Here is the list of properties that you can configure.

ignoreWhitespacesBetweenTags

This flag is used in XML mode to ignore all whitespace content between two opening or two closing tags. Whitespace content between an opening and a closing tag is still not ignored.

ignoreComments

This flag simply defines if the resulting DOM tree should contain comments or not.

enabledVoidTags

Flag to enable/disable void tags.

selfCloseVoidTags

When an element is a void element, this flag defines if it can be self-closed or if it should have the standard end tag.

impliedEndTags

Enables rules for implicit end tags. There are a number of tags that do not require the use of a closing tag for valid HTML (body, li, dd, dt, p, td, tr,...). When this flag is on, these tags are implicitly closed if needed and no error/warning is logged.

This feature somewhat slows down the parsing. If you know that all tags are closed in the input HTML, consider switching this feature off to improve performance.

condCommentIEVersion

The Internet Explorer version that conditional comments are evaluated against.

errorLogger & debugLogger

Custom loggers.

Jerry

jQuery in Java

Look, it's really cool:

Jerry doc = Jerry.of("<html><div id='jodd'><b>Hello</b> Jerry</div></html>");
doc.s("div#jodd b").css("color", "red").addClass("ohmy");

The (formatted) output will be:

<html>
    <div id="jodd">
        <b style="color:red;" class="ohmy">Hello</b> Jerry
    </div>
</html>

I tried to keep the Jerry API as close to jQuery as possible. In some cases, you can simply copy some jQuery code and paste it in Java - and it will work! Of course, there are some differences due to the different nature of the platforms.

The well-known jQuery method $() is renamed to s() in Jerry. The reason is compatibility with different JVMs: GraalVM, for example, does not allow usage of $ in method names.

If you don't like the s() method, use the find() alternative.

Using Jerry

Some differences and add-ons

In Java we do not have the document context as in browsers, so we need to create one first. To do that, simply pass HTML content to Jerry's static factory method. That creates a root Jerry set, containing the Document root node of the parsed DOM tree.

What happens in the background is that LagartoDOM builds a DOM tree and wraps it in the Jerry/jQuery API.

Using CSS selectors

You can use most of the standard CSS selectors and also most of the jQuery CSS selector extensions. CSS selectors are supported by CSSelly.

Differences

As Jerry speaks Java, some parts of the API differ to make it more Java-friendly. For example, the css() method accepts an array of property/value pairs, and not a single string:

Jerry
    .of(html)
    .s("tr:last")
    .css("background-color", "yellow", "fontWeight", "bolder");

Similarly, each() method receives a lambda:

Jerry.of(someHtml)
    .s("select option:selected")
    .each(($this, index) -> {
        System.out.println($this.text());
        return true;
    });
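
The alternating property/value convention of css() shown above is a common way to emulate a JavaScript object literal in Java. A rough JDK-only sketch of the idea follows. StylePairsSketch and toStyle() are invented for this example and do not show how Jerry implements it:

```java
// Hypothetical illustration of consuming alternating name/value varargs.
final class StylePairsSketch {

    static String toStyle(String... propertiesAndValues) {
        if (propertiesAndValues.length % 2 != 0) {
            throw new IllegalArgumentException("Expected property/value pairs");
        }
        StringBuilder style = new StringBuilder();
        // Even indexes hold property names, odd indexes hold their values.
        for (int i = 0; i < propertiesAndValues.length; i += 2) {
            style.append(propertiesAndValues[i])
                 .append(':')
                 .append(propertiesAndValues[i + 1])
                 .append(';');
        }
        return style.toString();
    }

    public static void main(String[] args) {
        System.out.println(toStyle("color", "red")); // color:red;
    }
}
```

The price of this convention is that a missing value is only detected at runtime, which is why the sketch checks the argument count first.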

Unsupported stuff

As Jerry is all about 'static' manipulation of HTML content, all jQuery methods and selectors that are related to any dynamic activity are not supported. This includes animations, Ajax calls, selectors that depend on CSS definitions...

Add-ons

Jerry provides some add-ons that do not exist in jQuery.

First, there are a few methods that return a Node of the parsed DOM tree (similar to JavaScript).

Then there are some new methods that are more meaningful in Java world. One of them is the form() method. It collects all parameters from a given form, allowing easy form handling. Here is an example:

Jerry.of(html)
     .form("#myform", (form, parameters) -> {
         // process form and parameters
     });

Convenient, right!?

CSSelly

Jerry has a girlfriend!

CSSelly is a Java implementation of the W3C Selectors Level 3 specification.

It's small, fast and extendable. CSSelly parses an input containing CSS selectors. The result may then be used by any HTML parser. Still, it works best with the LagartoDOM tree and our Jerry.

Example

CSSelly csselly = new CSSelly("div:nth-child(2n) span#jodd");
List<CssSelector> selectors = csselly.parse();

As said, CSSelly is used as the node selector in LagartoDOM (and therefore in Jerry, too):

NodeSelector nodeSelector = new NodeSelector(document);
LinkedList<Node> selectedNodes = nodeSelector.select("div#jodd li");
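
The nth-child(2n) argument in the first selector uses the an+b notation from the Selectors Level 3 specification: an element matches when its 1-based position among its siblings equals a*n+b for some non-negative integer n. Here is a minimal sketch of that rule (the NthSketch class is hypothetical and is not CSSelly's code):

```java
// Hypothetical illustration of the an+b rule from Selectors Level 3.
final class NthSketch {

    // position is the 1-based index of the element among its siblings.
    static boolean matches(int a, int b, int position) {
        if (a == 0) {
            return position == b;          // plain index, e.g. nth-child(3)
        }
        int diff = position - b;
        // position = a*n + b must hold for some integer n >= 0
        return diff % a == 0 && diff / a >= 0;
    }

    public static void main(String[] args) {
        for (int position = 1; position <= 6; position++) {
            System.out.println(position + " matches 2n: " + matches(2, 0, position));
        }
    }
}
```

For example, 2n (a=2, b=0) selects positions 2, 4, 6 and so on, while -n+3 (a=-1, b=3) selects the first three children.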

Selectors

Available selectors

The list of default selectors supported by CSSelly:

* - any element
E - an element of type E
E[foo] - an E element with a "foo" attribute
E[foo="bar"] - an E element whose "foo" attribute value is exactly equal to "bar"
E[foo~="bar"] - an E element whose "foo" attribute value is a list of whitespace-separated values, one of which is exactly equal to "bar"
E[foo^="bar"] - an E element whose "foo" attribute value begins exactly with the string "bar"
E[foo$="bar"] - an E element whose "foo" attribute value ends exactly with the string "bar"
E[foo*="bar"] - an E element whose "foo" attribute value contains the substring "bar"
E[foo|="en"] - an E element whose "foo" attribute has a hyphen-separated list of values beginning (from the left) with "en"
E:root - an E element, root of the document
E:nth-child(n) - an E element, the n-th child of its parent
E:nth-last-child(n) - an E element, the n-th child of its parent, counting from the last one
E:nth-of-type(n) - an E element, the n-th sibling of its type
E:nth-last-of-type(n) - an E element, the n-th sibling of its type, counting from the last one
E:first-child - an E element, first child of its parent
E:last-child - an E element, last child of its parent
E:first-of-type - an E element, first sibling of its type
E:last-of-type - an E element, last sibling of its type
E:only-child - an E element, only child of its parent
E:only-of-type - an E element, only sibling of its type
E:empty - an E element that has no children (including text nodes)
E#myid - an E element with ID equal to "myid"
E F - an F element descendant of an E element
E > F - an F element child of an E element
E + F - an F element immediately preceded by an E element
E ~ F - an F element preceded by an E element
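
The attribute operators in this list differ only in how the attribute value is compared. The following JDK-only sketch spells out those comparisons as defined by Selectors Level 3. AttributeMatchSketch is a hypothetical class written for this illustration and is not how CSSelly is implemented:

```java
import java.util.Arrays;

// Hypothetical illustration of the attribute operators from Selectors Level 3.
final class AttributeMatchSketch {

    // value is the attribute value found on the element, or null if the attribute is absent.
    static boolean matches(String value, String operator, String expected) {
        if (value == null) {
            return false;
        }
        // Per the specification, an empty expected string never matches for ~=, ^=, $= and *=.
        boolean empty = expected.isEmpty();
        switch (operator) {
            case "=":
                return value.equals(expected);
            case "~=": // one of the whitespace-separated words
                return !empty && Arrays.asList(value.trim().split("\\s+")).contains(expected);
            case "^=": // prefix
                return !empty && value.startsWith(expected);
            case "$=": // suffix
                return !empty && value.endsWith(expected);
            case "*=": // substring
                return !empty && value.contains(expected);
            case "|=": // exact value, or value followed by a hyphen
                return value.equals(expected) || value.startsWith(expected + "-");
            default:
                throw new IllegalArgumentException("Unknown operator: " + operator);
        }
    }

    public static void main(String[] args) {
        System.out.println(matches("en-US", "|=", "en"));   // true
        System.out.println(matches("english", "|=", "en")); // false
    }
}
```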

The list of additional pseudo-classes and pseudo-functions supported by CSSelly:

:first
:last
:button
:checkbox
:file
:header
:image
:input
:parent
:password
:radio
:reset
:selected
:checked
:submit
:text
:even
:odd
:eq(n)
:gt(n)
:lt(n)
:contains(text)

Escaping

CSSelly supports escaping characters with a backslash, e.g.: nspace\:name refers to the tag name nspace:name (that uses namespaces) and not to a pseudo-class name.
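
Keep in mind that inside a Java string literal the backslash itself must be doubled, so the selector above is written as "nspace\\:name". The sketch below illustrates the general idea of skipping escaped characters. SelectorEscapeSketch and its methods are hypothetical and only demonstrate the concept, not CSSelly's lexer:

```java
// Hypothetical illustration of backslash escaping in a selector string.
final class SelectorEscapeSketch {

    // Returns the index of the first target character that is not escaped, or -1.
    static int indexOfUnescaped(String input, char target) {
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c == '\\') {
                i++;            // skip the character that follows the backslash
                continue;
            }
            if (c == target) {
                return i;
            }
        }
        return -1;
    }

    // Removes the escaping backslashes, keeping the escaped characters.
    static String unescape(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c == '\\' && i + 1 < input.length()) {
                c = input.charAt(++i);
            }
            out.append(c);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String selector = "nspace\\:name";                      // the text nspace\:name
        System.out.println(indexOfUnescaped(selector, ':'));    // -1
        System.out.println(unescape(selector));                 // nspace:name
        System.out.println(indexOfUnescaped("div:first", ':')); // 3
    }
}
```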

Customize

Custom pseudo classes and functions

Custom pseudo-class

A custom pseudo-class extends PseudoClass and implements match(Node node). This method should return true if a node is matched. You may also override the getPseudoClassName() method if you don't want the pseudo-class name to be generated from the class name. For example:

public class MyPseudoClass extends PseudoClass {
    @Override
    public boolean match(Node node) {
        return node.hasAttribute("jodd-attr");
    }

    @Override
    public String getPseudoClassName() {
        return "jodd";
    }
}

Then register your pseudo-class with:

    PseudoClassSelector.registerPseudoClass(MyPseudoClass.class);

From that moment you will be able to find all nodes with the attribute jodd-attr using the :jodd pseudo-class.

When a pseudo-class needs to perform an additional match within the range of matched nodes (e.g. first, last, etc.), override the matchInRange() method, too.

Custom pseudo-function

Similar to pseudo-classes, a custom pseudo-function extends the PseudoFunction class. Additionally, you also need to implement a method that parses the input expression. This expression is later passed to the matching method.

Let's make a function that matches all nodes with certain name length:

public class MyPseudoFunction extends PseudoFunction {
    @Override
    public Object parseExpression(String expression) {
        return Integer.valueOf(expression);
    }

    @Override
    public boolean match(Node node, Object expression) {
        Integer size = (Integer) expression;
        return node.getNodeName().length() == size.intValue();
    }

    @Override
    public String getPseudoFunctionName() {
        return "len-fn";
    }
}

PseudoFunctionSelector.registerPseudoFunction(MyPseudoFunction.class);

Start using it! E.g. :len-fn(3) to match all nodes with short names:)
