Jodd Lagarto

Installation

Tips on how to install the Jodd Lagarto library in your app

Jodd Lagarto is released on Maven Central. You can use the following snippets to add it to your project:

Maven:

<dependency>
  <groupId>org.jodd</groupId>
  <artifactId>jodd-lagarto</artifactId>
  <version>x.x.x</version>
</dependency>

Gradle (Groovy):

implementation 'org.jodd:jodd-lagarto:x.x.x'

Gradle (Kotlin):

implementation("org.jodd:jodd-lagarto:x.x.x")

SBT:

libraryDependencies += "org.jodd" % "jodd-lagarto" % "x.x.x"

Ivy:

<dependency org="org.jodd" name="jodd-lagarto" rev="x.x.x" />

Leiningen:

[org.jodd/jodd-lagarto "x.x.x"]

Buildr:

'org.jodd:jodd-lagarto:jar:x.x.x'

That is all!

Jodd Lagarto has only one small dependency: the Jodd util library, which is released under the same license.

Snapshots

Jodd Lagarto snapshots are published on the Maven Central Snapshot repo.

Snapshots are released manually. Feel free to contact me if you need a new SNAPSHOT release sooner.

Lagarto HTML parsers suite

Java libraries for HTML/XML parsing

Lagarto Suite is a family of HTML/XML parsers written in Java. It consists of the following libraries:

  1. LagartoParser is an all-purpose, fast and versatile event-based HTML parser. You can use it to modify or analyze markup content, which lets you quickly assemble custom complex transformations and code analysis tools. It is performant and follows the rules of the official HTML specification.

  2. LagartoDom builds a DOM tree in memory from the input. A tree is more convenient to manipulate, at a minor performance cost.

  3. Jerry is a "jQuery in Java": you can use the familiar syntax of the JavaScript library inside Java to parse and manipulate HTML.

  4. CSSelly is a parser of CSS3 selectors.

Each of the Lagarto libraries has its pros and cons. You should check each and use one that suits your requirements.

Lagarto parsers are compatible with Java 8 and newer.

License

The code is released under the BSD-2-Clause license. It has a minimal set of dependencies with the same or similarly open license, so you should be able to use it in any project and for any purpose.

Events

List of events emitted during parsing

During parsing, LagartoParser calls various callback methods of the provided TagVisitor implementation.

start() & end()

Invoked before and after the content is parsed.

text(CharSequence)

Callback invoked on a block of plain text.

Note that text() is called for all text blocks, including whitespace and indentation.

comment(CharSequence)

Invoked on an HTML comment. The argument contains the comment content, without the tag boundaries.

tag(Tag)

Callback invoked for all HTML tags: open, close, or empty. The argument is a Tag instance that contains various information about the tag: its name, attributes, depth level, etc.

The Tag instance is reused during HTML parsing for better performance! The same Tag instance is passed to all callback methods and for every detected HTML tag. Do not store it internally!
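
To see why a reused instance must not be stored, here is a minimal, self-contained sketch. It does not use Lagarto at all: ReusedTag and TagReuseSketch are hypothetical names that only imitate a parser handing the same mutable object to every callback.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a parser-owned object that is reused for every event.
class ReusedTag {
    private String name;
    void set(String name) { this.name = name; }
    String getName() { return name; }
}

class TagReuseSketch {
    // Wrong: keeps references to the single shared instance.
    static List<String> storeReferences(String... names) {
        ReusedTag shared = new ReusedTag();
        List<ReusedTag> stored = new ArrayList<>();
        for (String n : names) {
            shared.set(n);       // the "parser" overwrites the same object
            stored.add(shared);  // every entry points to that one object
        }
        List<String> result = new ArrayList<>();
        for (ReusedTag t : stored) {
            result.add(t.getName());
        }
        return result;
    }

    // Right: copies the needed data out while the callback is running.
    static List<String> copyValues(String... names) {
        ReusedTag shared = new ReusedTag();
        List<String> result = new ArrayList<>();
        for (String n : names) {
            shared.set(n);
            result.add(shared.getName());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(storeReferences("html", "h1", "p")); // [p, p, p]
        System.out.println(copyValues("html", "h1", "p"));      // [html, h1, p]
    }
}
```

In a real visitor the same rule applies: read the name or attribute values you need inside the callback and keep those, not the Tag itself.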

script(Tag, CharSequence)

Invoked on all script tags. The callback method receives the script Tag instance and the script body.

doctype(Doctype)

Callback for the doctype tag.

xml() and cdata()

These two callbacks are invoked for XML-specific tags when parsing the XML content.

error(String)

Every error is reported through this method. Depending on the configuration, the error message may or may not contain the exact error position.

Contact

Let's keep in touch!

[email protected]

LagartoParser

Fast event-based HTML parser

LagartoParser is an event-based HTML parser. It processes the input and emits events as the content is parsed, using the visitor pattern. This makes parsing very fast and keeps memory usage minimal. However, event-based parsing can sometimes be tedious; in that case, try the LagartoDom parser instead.

Let's see it in action:

As the input content is parsed, the callback methods in the visitor get invoked. In this case, the result is:

Note that the tag() event was emitted twice: first for the open tag, and then for the close tag. In other words, LagartoParser performs the tokenization of the input HTML.
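
The idea of event-based tokenization can be illustrated without the library. The following is a rough sketch and not LagartoParser's implementation: MiniVisitor and MiniTokenizer are hypothetical names, and the tokenizer ignores attributes, comments and everything else the HTML specification requires. It only shows how one pass over the input turns into a stream of callbacks, with one event for the open tag and another for the close tag.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical callback interface; it only mirrors the idea of a tag visitor.
interface MiniVisitor {
    void tag(String name, boolean closing);
    void text(String text);
}

class MiniTokenizer {
    // Walks the input once and emits an event for every tag and text block.
    static void parse(String input, MiniVisitor visitor) {
        int i = 0;
        while (i < input.length()) {
            int lt = input.indexOf('<', i);
            if (lt < 0) {                    // no more tags: the rest is text
                visitor.text(input.substring(i));
                return;
            }
            if (lt > i) {                    // text before the next tag
                visitor.text(input.substring(i, lt));
            }
            int gt = input.indexOf('>', lt);
            if (gt < 0) {                    // unterminated tag: emit it as text
                visitor.text(input.substring(lt));
                return;
            }
            String body = input.substring(lt + 1, gt);
            boolean closing = body.startsWith("/");
            visitor.tag(closing ? body.substring(1) : body, closing);
            i = gt + 1;
        }
    }

    // Collects the events as strings, in the order they are emitted.
    static List<String> events(String input) {
        List<String> out = new ArrayList<>();
        parse(input, new MiniVisitor() {
            @Override
            public void tag(String name, boolean closing) {
                out.add((closing ? "close:" : "open:") + name);
            }

            @Override
            public void text(String text) {
                out.add("text:" + text);
            }
        });
        return out;
    }

    public static void main(String[] args) {
        // [open:html, open:h1, text:Hello, close:h1, close:html]
        System.out.println(events("<html><h1>Hello</h1></html>"));
    }
}
```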

CSSelly

Jerry has a girlfriend!

CSSelly is a Java implementation of the W3C Selectors Level 3 specification.

It's small, fast and extensible. CSSelly parses an input containing CSS selectors. The result may then be used by any HTML parser. However, it works best with the LagartoDOM tree and Jerry.

Example

As said, CSSelly is used as node selector in LagartoDOM (and therefore in Jerry, too):

FAQ

Some common answers

What's going on with my TITLE and TEXTAREA?

Those two tags are different. You can close them ONLY with a closing tag.

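For example, the following snippet:

<html><head><title /></head><body>hello world!</body></html>

has only 3 tags: html, head, and title. The text of the title tag is everything from the </head up to the end of the string.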

Configuration

Fine-tune parsing to DOM tree.

LagartoDom configuration is specified in the LagartoDomBuilderConfig class. Besides its own properties, it also holds the configuration instance of the LagartoParser that is used internally.

In most cases, you will just use the predefined modes. Here is the list of properties that you can configure.

ignoreWhitespacesBetweenTags

This flag is used in XML mode to ignore all whitespace-only content between two opening tags or between two closing tags. Whitespace between an opening tag and a closing tag is still kept.

ignoreComments

This flag simply defines if the resulting DOM tree should contain comments or not.

enabledVoidTags

Flag to enable/disable void tags.

selfCloseVoidTags

When an element is a void element, this flag defines if it can be self-closed or if it should have the standard end tag.

impliedEndTags

Enables rules for implicit end tags. There are a number of tags that do not require the use of a closing tag for valid HTML (body, li, dd, dt, p, td, tr,...). When this flag is on, these tags are implicitly closed if needed and no error/warning is logged.

This feature slows down parsing somewhat. If you know that all tags in the input HTML are closed, consider switching it off to improve performance.

condCommentIEVersion

The Internet Explorer version assumed when evaluating conditional comments.

errorLogger & debugLogger

Custom loggers.

Parsing modes

LagartoDom parses XML, XHTML and HTML

LagartoDom can parse XML, XHTML and HTML content. There are already predefined methods that quickly enable these modes:

  • enableHtmlMode()

  • enableXhtmlMode()

  • enableXmlMode()

HTML plus mode

There is one more mode available:

  • enableHtmlPlusMode()

The default HTML mode does not change the order of the nodes. However, the HTML5 specification has some rules where nodes are moved around the DOM. For example, tags written inside a table but outside of its table-specific tags are moved before the table definition. Moreover, there are some special rules on which orphan tags may be closed and the scope in which they can be closed.

HTML Plus mode enables these additional parsing rules. They require some additional processing and may slow down parsing.

Debug mode

Regardless of the parsing mode, LagartoDom may work in debug mode:

  • enableDebugMode()

In debug mode, all errors are collected and their positions are calculated. Of course, this slows down the processing.

Parsing specification

HTML parsing (i.e. tokenization) is done strictly according to the official HTML5 specification. Note the following:

  • the text is emitted as a single block of text and not character by character.

  • the case of a tag name (and other tokens) is not changed when emitted.

  • LagartoParser does only tokenization. The DOM tree is neither created nor validated.

  • the script tag is emitted separately.

  • Internet Explorer conditional comments are supported.

  • XML is supported too.

LagartoParser only performs tokenization and it does not verify if tags make sense. For example, if your HTML has a non-closed tag, LagartoParser will not consider this as an error. LagartoDom, on the other hand, will handle these cases.

Input types

LagartoParser accepts both char[] and CharSequence. This allows the usage of various implementations of inputs, including String, or even a Reader.
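
If your content comes from a Reader, one straightforward approach is to drain it into a String (which is a CharSequence) or a char[] before handing it to the parser. The following is a minimal sketch that uses only the JDK. ReaderInputSketch and readAll are illustrative names and not part of Lagarto.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

class ReaderInputSketch {
    // Reads everything from the reader into a String, using a fixed-size buffer.
    static String readAll(Reader reader) {
        StringBuilder sb = new StringBuilder();
        char[] buffer = new char[1024];
        try {
            int read;
            while ((read = reader.read(buffer)) != -1) {
                sb.append(buffer, 0, read);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String html = readAll(new StringReader("<html><h1>Hello</h1></html>"));
        char[] chars = html.toCharArray();   // the char[] form of the same input
        System.out.println(html.length() + " " + chars.length);
        // Either 'html' or 'chars' could now be passed to the parser.
    }
}
```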

Parsing CSS selectors with CSSelly:

CSSelly csselly = new CSSelly("div:nth-child(2n) span#jodd");
List<CssSelector> selectors = csselly.parse();

Selecting nodes of a LagartoDOM document with NodeSelector:

NodeSelector nodeSelector = new NodeSelector(document);
LinkedList<Node> selectedNodes = nodeSelector.select("div#jodd li");

Feel free to ask something...

...and I will answer :)

Jerry

jQuery in Java

Jerry is jQuery in Java: a fast and focused Java library that simplifies parsing, traversing and manipulating HTML content, using the same methods and syntax as jQuery.

Look, it's really cool:

Jerry doc = Jerry.of("<html><div id='jodd'><b>Hello</b> Jerry</div></html>");
doc.s("div#jodd b").css("color", "red").addClass("ohmy");

The (formatted) output will be:

<html>
    <div id="jodd">
        <b style="color:red;" class="ohmy">Hello</b> Jerry
    </div>
</html>

I tried to keep the Jerry API as close to jQuery as possible. In some cases, you can simply copy some jQuery code and paste it in Java - and it will work! Of course, there are some differences due to the different nature of the platforms.

The well-known jQuery method $() is renamed to s() in Jerry. The reason is compatibility with different JVMs: GraalVM, for example, does not allow usage of $ in method names.

If you don't like the s() method, use the find() alternative.

LagartoParser example:

LagartoParser lagartoParser = new LagartoParser("<html><h1>Hello</h1></html>");

TagVisitor tagVisitor = new EmptyTagVisitor() {
    @Override
    public void tag(final Tag tag) {
        if (tag.nameEquals("h1")) {
            System.out.println(tag.getName());
        }
    }

    @Override
    public void text(final CharSequence text) {
        System.out.println(text);
    }
};

lagartoParser.parse(tagVisitor);

The output:

h1
Hello
h1

Configuration

Fine-tune the parser

LagartoParser configuration is defined in the LagartoParserConfig class. An instance of this class can be passed to the constructor, or you can modify the configuration later in a functional style. For example:

The list of available properties follows.

calculatePosition

Disabled by default. This property enables the calculation of element positions. A position consists of a line number, an approximate column number, and a total offset in the file.
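
To make the three parts of a position concrete, here is a small stand-alone sketch that derives a line and column from an offset by scanning for line breaks. It is not Lagarto's code: PositionSketch and positionOf are hypothetical names, and the 1-based numbering is an assumption of this example. Extra work of this kind for every reported element is one plausible reason why position tracking costs time.

```java
class PositionSketch {
    // Returns {line, column, offset}; line and column are 1-based here (an assumption).
    static int[] positionOf(CharSequence input, int offset) {
        int line = 1;
        int lineStart = 0;   // offset of the first character of the current line
        for (int i = 0; i < offset && i < input.length(); i++) {
            if (input.charAt(i) == '\n') {
                line++;
                lineStart = i + 1;
            }
        }
        return new int[] { line, offset - lineStart + 1, offset };
    }

    public static void main(String[] args) {
        String html = "<html>\n  <h1>";
        int[] p = positionOf(html, html.indexOf("<h1>"));
        System.out.println(p[0] + ":" + p[1] + " @" + p[2]);   // 2:3 @9
    }
}
```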

Adapter and Writer

Additional cool classes

Besides TagVisitor, you can also use one of the following classes.

  • EmptyTagVisitor - default implementation that does nothing. You will probably use it, as you can override just the methods you need.

  • TagVisitors - a simple composite of many TagVisitor implementations. They are invoked in the given order.

Using Jerry

Some differences and add-ons

In Java we do not have the document context as in browsers, so we need to create one first. To do that, simply pass HTML content to the Jerry static factory method. That creates a root Jerry set, containing the Document root node of the parsed DOM tree.

What happens in the background is that LagartoDOM builds a DOM tree and wraps it in the Jerry/jQuery API.

Using CSS selectors

You can use most of the standard CSS selectors and also most of the jQuery CSS selector extensions. CSS selectors are supported by CSSelly.

Calculating position makes processing slower.

enableConditionalComments

When enabled, LagartoParser detects IE conditional comments. When it's disabled and a conditional comment is found, LagartoParser reports an error for downlevel-revealed conditional comment tags and treats downlevel-hidden conditional comments as regular comments.

Enabling conditional comments also makes parsing slower.

caseSensitive

False by default. When enabled, matching becomes case sensitive. It should not be used for parsing HTML content.

parseXmlTags

Enables parsing of XML specific tags. By default it's disabled.

enableRawTextModes

Enabled by default. It tells LagartoParser to treat all so-called 'raw' text tags specially, such as style, script, xmp and so on. You can disable this if that content is not important, and gain some more speed.

textBufferSize

Set to 1024 by default. It's the size of the internal text buffer used to collect all text blocks. This buffer grows if needed. If your HTML contains large text blocks, you may increase this number for a small performance improvement.

LagartoDOM

Parse HTML to a DOM tree.

LagartoDOM parses HTML content and creates a DOM tree from it. It is based on LagartoParser. The created DOM is convenient to traverse and manipulate. However, if you need ultimate performance, go with the event-based LagartoParser.

Let's see LagartoDom in action:

Document document = new LagartoDOMBuilder()
    .parse("<html><h1>Hello</h1></html>");

Node html = document.getChild(0);
Node h1 = html.getFirstChild();

System.out.println(h1.getTextContent());    // Hello

Text text = (Text) h1.getFirstChild();
System.out.println(text.getTextValue());    // Hello

System.out.println(text.getCssPath());      // html h1

It's as simple as that. The DOM tree consists of Node elements. Each Node contains a bunch of methods related to tree traversing, such as getChild(index), getFirstChild(), getParentNode(), getChildNodes(), etc.
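
Tree navigation of this kind is easy to picture with a tiny stand-alone model. The sketch below is not LagartoDOM: MiniNode and MiniTreeSketch are hypothetical classes that only borrow the method names mentioned above, and path() is an illustrative helper that joins ancestor names from the root down.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical node: a name, a parent link and an ordered list of children.
class MiniNode {
    private final String name;
    private MiniNode parent;
    private final List<MiniNode> children = new ArrayList<>();

    MiniNode(String name) { this.name = name; }

    MiniNode addChild(MiniNode child) {
        child.parent = this;
        children.add(child);
        return child;
    }

    MiniNode getFirstChild() { return children.isEmpty() ? null : children.get(0); }

    MiniNode getParentNode() { return parent; }

    // Joins the names of all ancestors and this node, root first.
    String path() {
        return parent == null ? name : parent.path() + " " + name;
    }
}

class MiniTreeSketch {
    static String demo() {
        MiniNode html = new MiniNode("html");
        html.addChild(new MiniNode("h1"));
        return html.getFirstChild().path();
    }

    public static void main(String[] args) {
        System.out.println(demo());   // html h1
    }
}
```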

Node also contains getters for the node name, attributes, and node value. A Node can be detached from the tree or attached at some point. Element is a special type of Node that represents elements, and there is a whole subset of methods that deal only with elements.

Finally, you can render the DOM tree or any Node back to HTML content.

Parsing specification

LagartoDOM follows only a subset of the official DOM-building specification. Here is why!

By default, LagartoDOM follows all the rules that do not involve any movement of DOM nodes. This is done on purpose. The idea is to get a tree that exactly matches what you provided. For example, if you pass HTML with some tags that are not supposed to be nested, LagartoDOM does not complain and you get exactly what you have on input.

In most cases, this will be perfectly fine, as developers are probably not using all the tricks of HTML5 for the sake of better readability.

Still, if you need more rules, you can turn them on! In that case, the resulting DOM tree can be modified per HTML5 rules. I have implemented the most common of these rules and exceptions, but haven't covered them all (yet). So if you have some weird HTML, you might get a different tree than what you get in a browser.

LagartoDOM is not (yet) a strict implementation of HTML5 DOM-building rules, but it is good enough for most cases!

Carry on :)

  • TagAdapter - an adapter over a target TagVisitor. With such an adapter you can change the behavior of an existing visitor.

  • Enclosed adapters

    You can use the following adapters:

    • StripHtmlTagAdapter - strips all unnecessary whitespace from text blocks and also removes all comments. For example, multiple spaces are replaced with a single space.

    • UrlRewriterTagAdapter - as the name implies, it lets you change the href values of <a> links.

    Writer

    TagWriter is a simple TagVisitor that builds HTML from the events. Usually, you use it as the target of some adapter. This way you can modify the input HTML by parsing it, adapting it, and then writing it again to an Appendable.
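
The adapter-over-writer chain can be sketched without the library. In the following stand-alone example all names (EventVisitor, WriterVisitor, CollapseWhitespaceAdapter, AdapterChainSketch) are hypothetical, the events are fed by hand instead of coming from a parser, and the whitespace rule is only my assumption of what a stripping adapter might do. The point is the shape: the adapter receives each event, changes or drops it, and forwards the rest to its target.

```java
// Hypothetical minimal visitor; a real tag visitor has more callbacks.
interface EventVisitor {
    void tag(String rawTag);
    void text(String text);
}

// Writes every event back out, similar in spirit to a tag writer.
class WriterVisitor implements EventVisitor {
    private final StringBuilder out = new StringBuilder();

    @Override
    public void tag(String rawTag) { out.append(rawTag); }

    @Override
    public void text(String text) { out.append(text); }

    String getOutput() { return out.toString(); }
}

// Adapter: changes events on their way to the target visitor.
class CollapseWhitespaceAdapter implements EventVisitor {
    private final EventVisitor target;

    CollapseWhitespaceAdapter(EventVisitor target) { this.target = target; }

    @Override
    public void tag(String rawTag) { target.tag(rawTag); }

    @Override
    public void text(String text) {
        if (text.trim().isEmpty()) {
            return;                                  // drop whitespace-only blocks
        }
        target.text(text.replaceAll("\\s+", " "));   // collapse runs to one space
    }
}

class AdapterChainSketch {
    static String run() {
        WriterVisitor writer = new WriterVisitor();
        EventVisitor adapter = new CollapseWhitespaceAdapter(writer);
        // Events a tokenizer might emit for "<html> <h1>  Hello  </h1> </html>"
        adapter.tag("<html>");
        adapter.text(" ");
        adapter.tag("<h1>");
        adapter.text("  Hello  ");
        adapter.tag("</h1>");
        adapter.text(" ");
        adapter.tag("</html>");
        return writer.getOutput();
    }

    public static void main(String[] args) {
        System.out.println(run());   // <html><h1> Hello </h1></html>
    }
}
```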

    Example

    The resulting string would be:

    Differences

    As Jerry speaks Java, there are some differences in API made to make Jerry API more Java friendly. For example, css() method accepts an array of property/values, and not a single string:

    Similarly, each() method receives a lambda:

    Unsupported stuff

    As Jerry is all about 'static' manipulation of HTML content, all jQuery methods and selectors that are related to any dynamic activity are not supported. This includes animations, Ajax calls, selectors that depend on CSS definitions...

    Add-ons

    Jerry provides some add-ons that do not exist in jQuery.

    First, there are a few methods that return a Node of the parsed DOM tree (similar to JavaScript).

    Then there are some new methods that are more meaningful in Java world. One of them is the form() method. It collects all parameters from a given form, allowing easy form handling. Here is an example:

    Convenient, right!?

    css() with property/value pairs:

    Jerry
        .of(html)
        .s("tr:last")
        .css("background-color", "yellow", "fontWeight", "bolder");

    LagartoParser configuration passed to the constructor:

    LagartoParserConfig cfg = new LagartoParserConfig().setCaseSensitive(true);
    LagartoParser lagartoParser = new LagartoParser(cfg, "<html>");

    LagartoParser configuration in functional style:

    LagartoParser lagartoParser =
        new LagartoParser("<html>")
            .configure(cfg -> {
                cfg.setCaseSensitive(true);
            });

    StripHtmlTagAdapter with TagWriter:

    TagWriter tagWriter = new TagWriter();
    StripHtmlTagAdapter adapter = new StripHtmlTagAdapter(tagWriter);
    LagartoParser lagartoParser = new LagartoParser(
            "<html> <h1>  Hello  </h1> </html>");

    lagartoParser.parse(adapter);

    System.out.println(tagWriter.getOutput().toString());

    The resulting string:

    <html><h1> Hello </h1></html>

    each() with a lambda:

    Jerry.of(someHtml)
        .s("select option:selected")
        .each(($this, index) -> {
            System.out.println($this.text());
            return true;
        });

    form() add-on:

    Jerry.of(html)
         .form("#myform", (form, parameters) -> {
             // process form and parameters
         });

    Customize

    Custom pseudo classes and functions

    Custom pseudo-class

    A custom pseudo-class extends PseudoClass and implements match(Node node). This method should return true if a node is matched. You may also override the method getPseudoClassName() if you don't want the pseudo-class name to be generated from the class name. For example:

    public class MyPseudoClass extends PseudoClass {
        @Override
        public boolean match(Node node) {
          return node.hasAttribute("jodd-attr");
        }
    
        @Override
        public String getPseudoClassName() {
          return "jodd";
        }
    }

    Then register your pseudo-class with:

    From that moment you will be able to find all nodes with the attribute jodd-attr using the :jodd pseudo-class.

    When a pseudo-class needs to perform an additional match in the range of matched nodes (e.g. first, last etc), then override matchInRange() method, too.

    Custom pseudo-function

    Similar to pseudo-classes, a custom pseudo-function extends the PseudoFunction class. Additionally, you need to implement a method that parses the input expression. This expression is later passed to the matching method.

    Let's make a function that matches all nodes with certain name length:

    Register it with:

    Start using it! E.g. :len-fn(3) to match all nodes with short names:)

    Registering the pseudo-class:

    PseudoClassSelector.registerPseudoClass(MyPseudoClass.class);

    The len-fn pseudo-function:

    public class MyPseudoFunction extends PseudoFunction {
        @Override
        public Object parseExpression(String expression) {
            return Integer.valueOf(expression);
        }

        @Override
        public boolean match(Node node, Object expression) {
            Integer size = (Integer) expression;
            return node.getNodeName().length() == size.intValue();
        }

        @Override
        public String getPseudoFunctionName() {
            return "len-fn";
        }
    }

    Registering the pseudo-function:

    PseudoFunctionSelector.registerPseudoFunction(MyPseudoFunction.class);

    Selectors

    Available selectors

    The list of default selectors supported by CSSelly:

    • * any element

    • E an element of type E

    • E[foo] an E element with a "foo" attribute

    • E[foo="bar"] an E element whose "foo" attribute value is exactly equal to "bar"

    • E[foo~="bar"] an E element whose "foo" attribute value is a list of whitespace-separated values, one of which is exactly equal to "bar"

    • E[foo^="bar"] an E element whose "foo" attribute value begins exactly with the string "bar"

    • E[foo$="bar"] an E element whose "foo" attribute value ends exactly with the string "bar"

    • E[foo*="bar"] an E element whose "foo" attribute value contains the substring "bar"

    • E[foo|="en"] an E element whose "foo" attribute has a hyphen-separated list of values beginning (from the left) with "en"

    • E:root an E element, root of the document

    • E:nth-child(n) an E element, the n-th child of its parent

    • E:nth-last-child(n) an E element, the n-th child of its parent, counting from the last one

    • E:nth-of-type(n) an E element, the n-th sibling of its type

    • E:nth-last-of-type(n) an E element, the n-th sibling of its type, counting from the last one

    • E:first-child an E element, first child of its parent

    • E:last-child an E element, last child of its parent

    • E:first-of-type an E element, first sibling of its type

    • E:last-of-type an E element, last sibling of its type

    • E:only-child an E element, only child of its parent

    • E:only-of-type an E element, only sibling of its type

    • E:empty an E element that has no children (including text nodes)

    • E#myid an E element with ID equal to “myid”.

    • E F an F element descendant of an E element

    • E > F an F element child of an E element

    • E + F an F element immediately preceded by an E element

    • E ~ F an F element preceded by an E element
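
    In Selectors Level 3, the argument of the nth-* selectors above may be a plain number or an expression of the form an+b, such as 2n or 2n+1. The following stand-alone sketch shows how such an expression can be checked against a 1-based position. NthSketch and matches are hypothetical names and this is not CSSelly's code; the expression is assumed to be already split into its a and b parts.

```java
class NthSketch {
    // True when index (1-based) equals a*n + b for some integer n >= 0.
    static boolean matches(int a, int b, int index) {
        if (a == 0) {
            return index == b;               // plain number, e.g. nth-child(5)
        }
        int diff = index - b;
        return diff % a == 0 && diff / a >= 0;
    }

    public static void main(String[] args) {
        System.out.println(matches(2, 0, 4));    // 2n    at position 4 -> true
        System.out.println(matches(2, 1, 4));    // 2n+1  at position 4 -> false
        System.out.println(matches(-1, 3, 2));   // -n+3  at position 2 -> true
        System.out.println(matches(-1, 3, 4));   // -n+3  at position 4 -> false
    }
}
```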

    The list of additional pseudo-classes and pseudo-functions supported by CSSelly:

    • :first

    • :last

    • :button

    Escaping

    CSSelly supports escaping characters with a backslash. For example, nspace\:name refers to the tag name nspace:name (one that uses a namespace) and not to a pseudo-class called name.
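
    The effect of the backslash can be sketched in a few lines of plain Java. EscapeSketch and its two methods are hypothetical and only cover the simple "backslash plus one character" case described above, not the full CSS escaping rules. The first method finds the first colon that is not escaped, the second removes the backslashes from a name.

```java
class EscapeSketch {
    // Index of the first occurrence of ch that is not preceded by a backslash, or -1.
    static int indexOfUnescaped(String s, char ch) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '\\') {
                i++;                 // skip the escaped character
            } else if (c == ch) {
                return i;
            }
        }
        return -1;
    }

    // Removes each backslash and keeps the character that follows it literally.
    static String unescape(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '\\' && i + 1 < s.length()) {
                sb.append(s.charAt(++i));
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String selector = "nspace\\:name:first";            // i.e. nspace\:name:first
        int colon = indexOfUnescaped(selector, ':');
        System.out.println(unescape(selector.substring(0, colon)));  // nspace:name
        System.out.println(selector.substring(colon + 1));           // first
    }
}
```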

  • :checkbox

  • :file

  • :header

  • :image

  • :input

  • :parent

  • :password

  • :radio

  • :reset

  • :selected

  • :checked

  • :submit

  • :text

  • :even

  • :odd

  • :eq(n)

  • :gt(n)

  • :lt(n)

  • :contains(text)