Jodd Lagarto

LagartoParser

Fast event-based HTML parser

Last updated 3 years ago

LagartoParser is an event-based HTML parser. It processes the input and emits events as the content is parsed, using the visitor pattern. This makes parsing very fast and keeps memory usage minimal. However, event-based parsing can sometimes be tedious; in that case, try the LagartoDom parser instead.

Let's see it in action:

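import jodd.lagarto.EmptyTagVisitor;
import jodd.lagarto.LagartoParser;
import jodd.lagarto.Tag;
import jodd.lagarto.TagVisitor;

LagartoParser lagartoParser = new LagartoParser("<html><h1>Hello</h1></html>");

// EmptyTagVisitor has no-op callbacks, so only the events of interest are overridden.
TagVisitor tagVisitor = new EmptyTagVisitor() {
    // Invoked for every tag token, both opening and closing.
    @Override
    public void tag(final Tag tag) {
        if (tag.nameEquals("h1")) {
            System.out.println(tag.getName());
        }
    }

    // Invoked for every block of text between tags.
    @Override
    public void text(final CharSequence text) {
        System.out.println(text);
    }
};

lagartoParser.parse(tagVisitor);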

As the input content is parsed, the callback methods in the visitor get invoked. In this case, the result is:

h1
Hello
h1

Note that the tag() event was emitted twice: first for the open tag, and then for the close tag. In other words, LagartoParser performs the tokenization of the input HTML.
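
To make the event flow concrete, here is a minimal, self-contained sketch of an event-emitting tokenizer written with the visitor pattern. It is not Lagarto's implementation: `MiniVisitor` and `MiniTokenizer` are hypothetical names invented for this illustration, and the toy only understands plain `<name>` and `</name>` tags without attributes. It shows why a tag callback fires once for the open tag and once for the close tag.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical callback interface for this sketch (Lagarto's real one is TagVisitor). */
interface MiniVisitor {
    void tag(String name, boolean closing);
    void text(String text);
}

/** Toy tokenizer: handles only simple <name> and </name> tags, no attributes. */
final class MiniTokenizer {

    /** Walks the input once and reports each token to the visitor as it is found. */
    static void parse(String input, MiniVisitor visitor) {
        int pos = 0;
        while (pos < input.length()) {
            int lt = input.indexOf('<', pos);
            int gt = lt < 0 ? -1 : input.indexOf('>', lt);
            if (gt < 0) {
                // No complete tag left: the remainder is plain text.
                visitor.text(input.substring(pos));
                return;
            }
            if (lt > pos) {
                // The whole text run is reported in a single event.
                visitor.text(input.substring(pos, lt));
            }
            boolean closing = input.charAt(lt + 1) == '/';
            String name = input.substring(lt + (closing ? 2 : 1), gt);
            visitor.tag(name, closing);
            pos = gt + 1;
        }
    }

    /** Convenience helper: collects the emitted events as readable strings. */
    static List<String> events(String input) {
        List<String> out = new ArrayList<>();
        parse(input, new MiniVisitor() {
            @Override
            public void tag(String name, boolean closing) {
                out.add((closing ? "close:" : "open:") + name);
            }

            @Override
            public void text(String text) {
                out.add("text:" + text);
            }
        });
        return out;
    }

    public static void main(String[] args) {
        events("<html><h1>Hello</h1></html>").forEach(System.out::println);
    }
}
```

For the same input as above, the toy emits five events in order: `open:html`, `open:h1`, `text:Hello`, `close:h1`, `close:html`. No tree is built at any point; the visitor decides what to keep.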

Parsing specification

  • Text is emitted as a single block, not character by character.

  • The case of a tag name (and of other tokens) is not changed when emitted.

  • LagartoParser does only tokenization. The DOM tree is neither created nor validated.

  • The script tag is emitted separately.

  • Internet Explorer conditional comments are supported.

  • XML is supported too.

LagartoParser only performs tokenization and it does not verify if tags make sense. For example, if your HTML has a non-closed tag, LagartoParser will not consider this as an error. LagartoDom, on the other hand, will handle these cases.
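
If a visitor needs a rough idea of which tags were left open, it has to keep that state itself. The following is a minimal sketch of such bookkeeping, assuming the visitor can tell opening tags from closing ones. `OpenTagTracker` is a hypothetical helper, not part of Lagarto, and it works on plain tag names because this page does not show how a `Tag` distinguishes the two cases. It is deliberately naive: void elements such as `br` would be reported as unclosed.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical helper: remembers open tags so unclosed ones can be reported. */
final class OpenTagTracker {
    private final Deque<String> open = new ArrayDeque<>();
    private int mismatches = 0;

    /** Call for every opening tag event. */
    public void opened(String name) {
        open.push(name);
    }

    /** Call for every closing tag event. */
    public void closed(String name) {
        if (!open.isEmpty() && open.peek().equals(name)) {
            open.pop();
        } else {
            // Closing tag does not match the innermost open tag.
            mismatches++;
        }
    }

    /** Tags that are still open, innermost first. */
    public List<String> unclosed() {
        return new ArrayList<>(open);
    }

    /** Number of closing tags that did not match the innermost open tag. */
    public int mismatches() {
        return mismatches;
    }

    public static void main(String[] args) {
        // Events a tokenizer would emit for "<html><h1>Hello": no closing tags at all.
        OpenTagTracker tracker = new OpenTagTracker();
        tracker.opened("html");
        tracker.opened("h1");
        System.out.println(tracker.unclosed());
    }
}
```

A tokenizer reports the input above without complaint; only the tracker's state at the end of parsing reveals that `h1` and `html` were never closed.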

Input types

LagartoParser accepts both char[] and CharSequence input. This allows various input implementations to be used, including String, or even a Reader.
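
This page does not show how a Reader is handed to the parser. One option, assuming the content fits in memory, is to drain the Reader into a char[] first using only the standard library. `ReaderInput` is a hypothetical helper name used for this sketch.

```java
import java.io.CharArrayWriter;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

/** Hypothetical helper: drains a Reader into a char[] held fully in memory. */
final class ReaderInput {

    /** Reads everything from the reader; closing the reader is left to the caller. */
    public static char[] toCharArray(Reader reader) {
        CharArrayWriter buffer = new CharArrayWriter();
        char[] chunk = new char[8192];
        try {
            int n;
            while ((n = reader.read(chunk)) != -1) {
                buffer.write(chunk, 0, n);
            }
        } catch (IOException e) {
            // Rethrow unchecked so callers do not need a throws clause.
            throw new UncheckedIOException(e);
        }
        return buffer.toCharArray();
    }

    public static void main(String[] args) {
        char[] content = toCharArray(new StringReader("<html><h1>Hello</h1></html>"));
        System.out.println(content.length);
    }
}
```

The resulting array can then be passed to the parser as char[] input, for example `new LagartoParser(content)`.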

HTML parsing (i.e. tokenization) strictly follows the official HTML5 specification; the points listed under Parsing specification describe how the resulting tokens are emitted.