List of events emitted during the parsing
During parsing, LagartoParser calls various callback methods of the provided TagVisitor implementation.
start() & end()
Invoked before and after the content is parsed.
text(CharSequence)
Callback invoked on a block of plain text.
comment(CharSequence)
Invoked on an HTML comment. The argument contains the comment content, without the tag boundaries.
tag(Tag)
Callback invoked for all HTML tags: open, close, or empty. The argument is a Tag instance containing various information about the tag: its name, attributes, depth level, etc.
The Tag instance is reused during HTML parsing for better performance! The same Tag instance is passed to all callback methods, for every detected HTML tag. Do not store it internally!
script(Tag, CharSequence)
Invoked on all script tags. The callback method receives the script Tag instance and the script body.
doctype(Doctype)
Callback for the doctype tag.
xml() and cdata()
These two callbacks are invoked for XML-specific tags when parsing the XML content.
error(String)
Every error is reported by invoking this callback. Depending on the configuration, the error message may or may not contain the exact error position.
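The warning about the reused Tag instance is easy to overlook, so here is a minimal standalone sketch of the pitfall. MutableTag and both helper methods are hypothetical stand-ins written for this illustration; they are not Lagarto classes. The point is only that a stored reference to a reused object always shows the last event, while a value copied inside the callback stays correct.

```java
// Standalone illustration; MutableTag is a hypothetical stand-in, not Lagarto's Tag class.
final class ReusedTagDemo {

    /** A mutable event object that a parser overwrites for every tag it finds. */
    static final class MutableTag {
        private String name;

        void setName(String name) {
            this.name = name;
        }

        String getName() {
            return name;
        }
    }

    /** Keeps references to the shared instance, so every entry shows the last tag. */
    static String[] storeReferences(String... tagNames) {
        MutableTag shared = new MutableTag();
        MutableTag[] stored = new MutableTag[tagNames.length];
        for (int i = 0; i < tagNames.length; i++) {
            shared.setName(tagNames[i]); // the "parser" reuses one instance
            stored[i] = shared;          // bug: a reference, not a snapshot
        }
        String[] names = new String[stored.length];
        for (int i = 0; i < stored.length; i++) {
            names[i] = stored[i].getName();
        }
        return names;
    }

    /** Copies the needed value while the callback runs, so each entry stays correct. */
    static String[] copyValues(String... tagNames) {
        MutableTag shared = new MutableTag();
        String[] names = new String[tagNames.length];
        for (int i = 0; i < tagNames.length; i++) {
            shared.setName(tagNames[i]);
            names[i] = shared.getName(); // snapshot taken inside the "callback"
        }
        return names;
    }
}
```

In a real visitor, the same idea applies: read the name or attributes you need inside the callback and keep those values, not the Tag itself.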
Java libraries for HTML/XML parsing
Lagarto Suite is a family of HTML/XML parsers written in Java. It consists of the following libraries:
LagartoParser is an all-purpose fast and versatile event-based HTML parser. You can use it to modify or analyze some markup content, allowing you to assemble custom complex transformations and code analysis tools quickly. It is performant and follows the rules of the official HTML specification.
LagartoDom builds a DOM tree in memory from the input. You can manipulate a tree more conveniently, with a minor performance sacrifice.
Jerry is a "jQuery in Java" - you can use the familiar syntax of the JavaScript library inside Java to parse and manipulate HTML.
CSSelly - finally, the parser of CSS3 selectors.
Lagarto parsers are compatible with Java 8 and newer.
The code is released under the BSD-2-Clause license. It has a minimal set of dependencies with the same or similarly open license, so you should be able to use it in any project and for any purpose.
Some common answers
Those two tags are different. You can close them ONLY with a closing tag.
For example, the following snippet:
has only 3 tags: html, body, and title. The text of the title tag is everything from the </head up to the end of the string.
...and I will answer :)
Additional cool classes
Besides TagVisitor, you can also use one of the following classes.
EmptyTagVisitor - a default implementation that does nothing. You will probably use it, as you can override just the methods you need.
TagVisitors - a simple composite of many TagVisitor implementations. They are invoked in the given order.
TagAdapter - an adapter over a target TagVisitor. With such an adapter, you can change the behavior of an existing visitor.
You can use the following adapters:
StripHtmlTagAdapter - strips all unnecessary whitespace from text blocks and also removes all comments. For example, multiple spaces are replaced with a single space, and so on.
UrlRewriterTagAdapter - as the name implies, it lets you change the <a href link values.
TagWriter is a simple TagVisitor that builds HTML from the events. Usually, you use it as the target of some adapter. This way you can modify the input HTML by parsing it, adapting it, and then writing it out again to an Appendable.
The resulting string would be:
Fine-tune the parser
LagartoParser configuration is defined in the LagartoParserConfig class. An instance of this class can be passed to a constructor, or you can just modify it later, functional style. For example:
The list of available properties follows.
Disabled by default, this property enables the calculation of element positions. The position consists of a line number, an approximate column number, and a total offset in the file.
Calculating position makes processing slower.
When enabled, LagartoParser will detect IE conditional comments. When it's disabled and a conditional comment is found, LagartoParser either reports an error for revealed conditional comment tags or treats downlevel-hidden conditional comments as regular comments.
Enabling conditional comments also makes parsing slower.
By default, this is false. When enabled, various matching will be case-sensitive. It should not be used for parsing HTML content.
Enables parsing of XML-specific tags. Disabled by default.
Enabled by default, this tells LagartoParser to apply special handling to all so-called 'raw' text tags, such as style, script, xmp, and so on. You can disable this if that content is not important to you, and gain some more speed.
By default, it's set to 1024. It's the size of the internal text buffer used to collect all text blocks. This buffer will grow if needed. If your HTML contains large text blocks, you may increase this number for a small performance improvement.
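To get a feel for why position calculation costs extra, here is a naive standalone sketch that derives a line and column from a character offset by scanning the input. PositionSketch and lineAndColumn are hypothetical names used only for this illustration; LagartoParser's own implementation is not shown in this document and may work differently.

```java
// Illustration only: a naive way to turn an offset into a line/column pair.
final class PositionSketch {

    /**
     * Returns {line, column}, both 1-based, for a 0-based offset into the text.
     * The scan from the start of the input is the extra work a parser pays
     * for when it has to report positions.
     */
    static int[] lineAndColumn(CharSequence text, int offset) {
        if (offset < 0 || offset > text.length()) {
            throw new IndexOutOfBoundsException("offset: " + offset);
        }
        int line = 1;
        int lineStart = 0;
        for (int i = 0; i < offset; i++) {
            if (text.charAt(i) == '\n') {
                line++;
                lineStart = i + 1; // next line begins after the newline
            }
        }
        return new int[] {line, offset - lineStart + 1};
    }
}
```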
Tips on how to install Jodd Lagarto library in your app
Jodd Lagarto is released on Maven Central. You can use the following snippets to add it to your project:
That is all!
Jodd Lagarto has only one (small) dependency: the Jodd util library, which is released under the same license.
Snapshots are released manually. Feel free to contact me if you need a new SNAPSHOT release sooner.
Jodd Lagarto snapshots are published on .
Fine-tune parsing to DOM tree.
LagartoDom configuration is specified in the LagartoDomBuilderConfig class. Among the new properties, there is also the instance of the LagartoParser configuration, for the parser that is used internally.
In most cases, you will just use the predefined modes. Here is the list of properties that you can configure.
This flag is used for XML mode, to ignore all whitespace content between two starting or two ending tags. Whitespace content between one open and one closed tag is still not ignored.
This flag simply defines if the resulting DOM tree should contain comments or not.
Flag to enable/disable void tags.
When an element is a void element, this flag defines if it can be self-closed or if it should have the standard end tag.
Enables rules for implicit end tags. There are a number of tags that do not require a closing tag in valid HTML (body, li, dd, dt, p, td, tr,...). When this flag is on, these tags are implicitly closed if needed and no error/warning is logged.
This feature somewhat slows down the parsing. If you know that all tags are closed in the input HTML, consider switching this feature off to improve performance.
The version of conditional comments.
Custom loggers.
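The implicit end tag rule described above can be pictured with a small standalone sketch. It covers only a tiny, simplified subset of the HTML rules, it assumes lowercase tag names, and the class and method names are hypothetical; it is not LagartoDom's actual rule table.

```java
// Illustration only: a simplified subset of the HTML "optional end tag" rules.
final class ImplicitEndTagSketch {

    /**
     * Returns true if the start tag `next` implicitly closes the
     * currently open element `open` (both given in lowercase).
     */
    static boolean closesImplicitly(String open, String next) {
        switch (open) {
            case "li":
                return next.equals("li");
            case "dt":
            case "dd":
                return next.equals("dt") || next.equals("dd");
            case "tr":
                return next.equals("tr");
            case "td":
            case "th":
                return next.equals("td") || next.equals("th");
            case "p":
                // only a few of the block-level tags that end a paragraph
                return next.equals("p") || next.equals("div")
                        || next.equals("ul") || next.equals("table");
            default:
                return false;
        }
    }
}
```

With a check like this, a builder that sees a second li while the first one is still open would close the first one silently instead of nesting the two.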
Parse HTML to a DOM tree.
LagartoDOM parses HTML content and creates a DOM tree from it. It is based on LagartoParser. The created DOM is convenient to traverse and manipulate. However, if you need ultimate performance, go with the event-based LagartoParser.
Let's see LagartoDom in action:
It's as simple as that. As said, the DOM tree is created. It consists of Node elements. Each Node contains a bunch of methods related to tree traversal, such as getChild(index), getFirstChild(), getParentNode(), getChildNodes(), etc.
Node also contains getters for the node name, attributes, and node value. A Node can be detached from the tree or attached at some point. Element is a special type of Node that represents elements, and there is a whole subset of methods that deal only with elements.
Finally, you can render the DOM tree or any Node back to HTML content.
LagartoDOM follows only a subset of the official DOM-building specification. Here is why!
By default, LagartoDOM follows all the rules that do not involve any movement of DOM nodes. This is done on purpose. The idea is to get a tree that exactly matches what you have provided. For example, if you pass HTML with some tags that are not supposed to be nested, LagartoDOM will not complain and you will get exactly what you have on input.
In most cases, this will be perfectly fine, as developers are probably not using all the tricks of HTML5 for the sake of better readability.
Still, you can turn on some more rules! In that case, the resulting DOM tree can be modified per HTML5 rules. I have implemented the most common of these rules and exceptions, but haven't covered them all (yet). So if you have some weird HTML, you might get a different tree than what you get in a browser.
Carry on :)
LagartoDom parses XML, XHTML and HTML
LagartoDom can parse XML, XHTML and HTML content. There are already predefined methods that quickly enable these modes:
enableHtmlMode()
enableXhtmlMode()
enableXmlMode()
There is one more mode available:
enableHtmlPlusMode()
The default HTML mode does not change the order of the nodes. However, the HTML5 specification has some rules where nodes are moved around the DOM. For example, all tags written inside a table but outside of the table's own tags are moved before the table definition. Moreover, there are some special rules on which orphan tags may be closed and the scope in which they can be closed.
HTML Plus mode is one that enables these additional parsing rules. These rules require some additional processing and may slow down the processing.
Regardless of the parsing mode, LagartoDom may work in debug mode:
enableDebugMode()
In debugging mode all the errors are collected and their position is calculated. Of course, this slows down the processing.
Some differences and add-ons
In Java, we do not have the document context as in browsers, so we need to create one first. To do that, simply pass the HTML content to the Jerry static factory method. That creates a root Jerry set containing the Document root node of the parsed DOM tree.
What happens in the background is that LagartoDOM builds a DOM tree and wraps it in the Jerry/jQuery API.
You can use most of the standard CSS selectors and also most of the jQuery CSS selectors extensions. CSS selectors are supported by the CSSelly.
As Jerry speaks Java, there are some differences in the API that make Jerry more Java-friendly. For example, the css() method accepts an array of property/value pairs, and not a single string:
Similarly, the each() method receives a lambda:
As Jerry is all about 'static' manipulation of HTML content, all jQuery methods and selectors that are related to any dynamic activity are not supported. This includes animations, Ajax calls, selectors that depend on CSS definitions...
Jerry provides some add-ons that do not exist in jQuery.
First, there are a few methods that return a Node of the parsed DOM tree (similar to JavaScript).
Then there are some new methods that are more meaningful in the Java world. One of them is the form() method. It collects all parameters from a given form, allowing easy form handling. Here is an example:
Convenient, right!?
Jerry has a girlfriend!
CSSelly is a Java implementation of the W3C Selectors Level 3 specification.
It's small, fast, and extendable. CSSelly parses an input containing CSS selectors. The result may then be used by any HTML parser. Yet, it works best with the LagartoDOM tree and our Jerry.
As said, CSSelly is used as node selector in LagartoDOM (and therefore in Jerry, too):
jQuery in Java
Look, it's really cool:
The (formatted) output will be:
I tried to keep Jerry API identical to the jQuery as much as possible. In some cases, you can simply copy some jQuery code and paste it in Java - and it will work! Of course, there are some differences due to the different nature of the platforms.
The well-known jQuery method $() is renamed to s() in Jerry. The reason is compatibility with different JVMs: GraalVM, for example, does not allow the usage of $ in method names.
If you don't like the s() method, use the find() alternative.
Jerry is a jQuery in Java - a fast and focused Java library that simplifies parsing, traversing, and manipulating HTML content, using the same methods and syntax as jQuery.
Custom pseudo classes and functions
A custom pseudo-class extends PseudoClass and implements match(Node node). This method should return true if a node is matched. You may also override the getPseudoClassName() method if you don't want the pseudo-class name to be generated from the class name. For example:
Then register your pseudo-class with:
From that moment, you will be able to find all nodes with the attribute jodd-attr using the :jodd pseudo-class.
When a pseudo-class needs to perform an additional match in the range of matched nodes (e.g. first, last, etc.), override the matchInRange() method, too.
Similar to pseudo-classes, a custom pseudo-function extends the PseudoFunction class. Additionally, you need to implement a method that parses the input expression. The parsed expression is later passed to the matching method.
Let's make a function that matches all nodes with certain name length:
Register it with:
Start using it! E.g. :len-fn(3) to match all nodes with short names :)
Available selectors
The list of default selectors supported by CSSelly:
*
any element
E
an element of type E
E[foo]
an E element with a "foo" attribute
E[foo="bar"]
an E element whose "foo" attribute value is exactly equal to "bar"
E[foo~="bar"]
an E element whose "foo" attribute value is a list of whitespace-separated values, one of which is exactly equal to "bar"
E[foo^="bar"]
an E element whose "foo" attribute value begins exactly with the string "bar"
E[foo$="bar"]
an E element whose "foo" attribute value ends exactly with the string "bar"
E[foo*="bar"]
an E element whose "foo" attribute value contains the substring "bar"
E[foo|="en"]
an E element whose "foo" attribute has a hyphen-separated list of values beginning (from the left) with "en"
E:root
an E element, root of the document
E:nth-child(n)
an E element, the n-th child of its parent
E:nth-last-child(n)
an E element, the n-th child of its parent, counting from the last one
E:nth-of-type(n)
an E element, the n-th sibling of its type
E:nth-last-of-type(n)
an E element, the n-th sibling of its type, counting from the last one
E:first-child
an E element, first child of its parent
E:last-child
an E element, last child of its parent
E:first-of-type
an E element, first sibling of its type
E:last-of-type
an E element, last sibling of its type
E:only-child
an E element, only child of its parent
E:only-of-type
an E element, only sibling of its type
E:empty
an E element that has no children (including text nodes)
E#myid
an E element with ID equal to “myid”.
E F
an F element descendant of an E element
E > F
an F element child of an E element
E + F
an F element immediately preceded by an E element
E ~ F
an F element preceded by an E element
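The attribute operators in the list above differ only in how the attribute value is compared. The following standalone sketch spells out those comparisons as defined by the W3C Selectors Level 3 specification. AttributeSelectorSketch and its matches method are hypothetical names for this illustration, not part of the CSSelly API.

```java
// Illustration only: attribute operator semantics per Selectors Level 3.
final class AttributeSelectorSketch {

    /**
     * @param actual   the attribute value found on the element, or null if absent
     * @param operator one of "=", "~=", "^=", "$=", "*=", "|="
     * @param expected the value written in the selector
     */
    static boolean matches(String actual, String operator, String expected) {
        if (actual == null) {
            return false; // the attribute is not present
        }
        switch (operator) {
            case "=":
                return actual.equals(expected);
            case "~=":
                // one of the whitespace-separated words must be equal
                if (expected.isEmpty() || containsWhitespace(expected)) {
                    return false;
                }
                for (String word : actual.trim().split("\\s+")) {
                    if (word.equals(expected)) {
                        return true;
                    }
                }
                return false;
            case "^=":
                return !expected.isEmpty() && actual.startsWith(expected);
            case "$=":
                return !expected.isEmpty() && actual.endsWith(expected);
            case "*=":
                return !expected.isEmpty() && actual.contains(expected);
            case "|=":
                // exactly the value, or the value followed by a hyphen
                return actual.equals(expected) || actual.startsWith(expected + "-");
            default:
                throw new IllegalArgumentException("Unknown operator: " + operator);
        }
    }

    private static boolean containsWhitespace(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (Character.isWhitespace(s.charAt(i))) {
                return true;
            }
        }
        return false;
    }
}
```

For instance, a lang value of en-US matches [lang|="en"], while a value of english does not.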
The list of additional pseudo-classes and pseudo-functions supported by CSSelly:
:first
:last
:button
:checkbox
:file
:header
:image
:input
:parent
:password
:radio
:reset
:selected
:checked
:submit
:text
:even
:odd
:eq(n)
:gt(n)
:lt(n)
:contains(text)
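The positional filters in this list (:even, :odd, :eq(n), :gt(n), :lt(n)) come from jQuery, where they work on the zero-based index of each node within the current result set. The sketch below assumes that same zero-based convention; it is a standalone illustration with hypothetical names, not CSSelly code.

```java
// Illustration only: jQuery-style positional filters on a zero-based index.
final class IndexFilterSketch {

    /** Decides whether the item at `index` passes the filter; `n` is ignored for even/odd. */
    static boolean matches(String filter, int n, int index) {
        switch (filter) {
            case "even":
                return index % 2 == 0; // indexes 0, 2, 4, ...
            case "odd":
                return index % 2 == 1; // indexes 1, 3, 5, ...
            case "eq":
                return index == n;
            case "gt":
                return index > n;
            case "lt":
                return index < n;
            default:
                throw new IllegalArgumentException("Unknown filter: " + filter);
        }
    }

    /** Applies the filter to a list of items and joins the survivors with commas. */
    static String select(String filter, int n, String... items) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < items.length; i++) {
            if (matches(filter, n, i)) {
                if (out.length() > 0) {
                    out.append(',');
                }
                out.append(items[i]);
            }
        }
        return out.toString();
    }
}
```

Note that with a zero-based index, :even keeps the first, third, fifth item, and so on, which often surprises people who expect one-based counting as in :nth-child.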
CSSelly supports escaping characters using the backslash, e.g.: nspace\:name refers to the tag name nspace:name (that uses namespaces) and not to the pseudo-class name.
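As a rough illustration of how backslash escaping keeps a colon from starting a pseudo-class, here is a standalone sketch. It handles only simple single-character escapes (no hexadecimal escapes), and SelectorEscapeSketch with its two methods is a hypothetical helper, not the CSSelly lexer.

```java
// Illustration only: simple backslash escapes in a selector fragment.
final class SelectorEscapeSketch {

    /** Returns the index of the first ':' that is not escaped, or -1 if there is none. */
    static int indexOfUnescapedColon(String selector) {
        for (int i = 0; i < selector.length(); i++) {
            char c = selector.charAt(i);
            if (c == '\\') {
                i++; // skip the escaped character
                continue;
            }
            if (c == ':') {
                return i;
            }
        }
        return -1;
    }

    /** Removes the backslashes, keeping the characters they escape. */
    static String unescape(String selector) {
        StringBuilder out = new StringBuilder(selector.length());
        for (int i = 0; i < selector.length(); i++) {
            char c = selector.charAt(i);
            if (c == '\\' && i + 1 < selector.length()) {
                out.append(selector.charAt(++i));
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

Here div:first has an unescaped colon at index 3, so first is read as a pseudo-class, while nspace\:name has none and is unescaped to the tag name nspace:name.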
Fast event-based HTML parser
Let's see it in action:
As the input content is parsed, the callback methods in the visitor get invoked. In this case, the result is:
Note that the tag() event was emitted twice: first for the open tag, and then for the close tag. In other words, LagartoParser performs the tokenization of the input HTML.
The text is emitted as a single block of text, not character by character.
The case of a tag name (and other tokens) is not changed when emitted.
LagartoParser does only tokenization. A DOM tree is neither created nor validated.
The script tag is emitted separately.
Internet Explorer conditional comments are supported.
XML is supported too.
LagartoParser only performs tokenization and it does not verify if tags make sense. For example, if your HTML has a non-closed tag, LagartoParser will not consider this as an error. LagartoDom, on the other hand, will handle these cases.
LagartoParser accepts both char[] and CharSequence. This allows the usage of various input implementations, including String, or even a Reader.
LagartoParser is an event-based HTML parser. It processes the input and emits events as they are parsed; using a . This makes parsing very fast and memory-usage is minimal. However, sometimes event-based parsing can be tedious; in that case, try LagartoDom parser instead.
HTML parsing (i.e. tokenization) is done strictly by the official HTML specification. Note the following: