Enables a developer to analyze an HTML document by its structure.
Note: HTML parsing is difficult because of its lenient specification. See Postel's Law.
For our purposes, meeting every criterion of the spec is less important, so this implementation leaves numerous edge cases unaccounted for. Nevertheless, this program should suit many needs of ordinary web applications.
-
Constructor Summary
-
Method Summary
Method: Description
parse: Given any HTML input, scan through and generate a tree of HTML nodes.
search: Search the node tree for matching elements.
-
Constructor Details
-
HtmlParser
public HtmlParser()
-
-
Method Details
-
parse
Given any HTML input, scan through and generate a tree of HTML nodes. Return a list of the roots of the tree.
This parser operates with a very particular paradigm in mind. I'll explain it through examples. Let's look at some typical HTML:
<p>Hello world</p>
The way we will model this is as follows:
<ELEMENT_NAME_AND_DETAILS>content<END_OF_ELEMENT>
We will examine the first part, "ELEMENT_NAME_AND_DETAILS", and grab the element's name and any attributes. Then we will descend into the content section. We know we have hit the end of the element by keeping track of how far we have descended/ascended and whether we are hitting a closing HTML element.
Complicating this, an element may have no content at all: for example, any void element, or an empty tag that a user chooses to create.
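The descend/ascend model described above can be illustrated with a small sketch. This is not the HtmlParser source; the class ParseSketch, its Node type, and the short VOID list are hypothetical names chosen for illustration. It assumes well-formed input, ignores attributes, comments, and doctype declarations, drops text that sits outside any element, and does not check that a closing tag's name matches the element it closes.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; not the library's HtmlParser.
final class ParseSketch {

    /** Hypothetical node: an element name, its direct text, and its child elements. */
    static final class Node {
        final String name;
        final StringBuilder text = new StringBuilder();
        final List<Node> children = new ArrayList<>();
        Node(String name) { this.name = name; }
    }

    // A partial list of void elements, which never have content or a closing tag.
    private static final Set<String> VOID = Set.of("br", "hr", "img", "input", "meta", "link");

    static List<Node> parse(String html) {
        List<Node> roots = new ArrayList<>();
        Deque<Node> open = new ArrayDeque<>(); // stack of open elements; its size is the depth
        int i = 0;
        while (i < html.length()) {
            int lt = html.indexOf('<', i);
            int textEnd = (lt < 0) ? html.length() : lt;
            // Text before the next tag is content of the innermost open element.
            if (textEnd > i && !open.isEmpty()) {
                open.peek().text.append(html, i, textEnd);
            }
            if (lt < 0) break;
            int gt = html.indexOf('>', lt);
            if (gt < 0) break; // unterminated tag: stop scanning
            String tag = html.substring(lt + 1, gt).trim();
            i = gt + 1;
            if (tag.startsWith("/")) {
                // END_OF_ELEMENT: ascend one level (the name is not verified here).
                if (!open.isEmpty()) open.pop();
                continue;
            }
            boolean selfClosed = tag.endsWith("/");
            if (selfClosed) tag = tag.substring(0, tag.length() - 1).trim();
            // ELEMENT_NAME_AND_DETAILS: the name is the first whitespace-delimited token.
            String name = tag.split("\\s+", 2)[0].toLowerCase();
            Node node = new Node(name);
            if (open.isEmpty()) roots.add(node); else open.peek().children.add(node);
            // Descend only when the element can have content.
            if (!selfClosed && !VOID.contains(name)) open.push(node);
        }
        return roots;
    }
}
```

With this sketch, ParseSketch.parse("<p>Hello world</p>") yields one root named "p" whose text is "Hello world", while a void element such as <br> is attached to its parent without changing the depth.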
-
search
public List<HtmlParseNode> search(List<HtmlParseNode> nodes, TagName tagName, Map<String, String> attributes)
Search the node tree for matching elements. If zero nodes are found, returns an empty list.
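As a rough illustration of what a tree search of this shape might look like, here is a minimal sketch. It does not use the real HtmlParseNode or TagName types: SearchSketch and its Node class are hypothetical stand-ins, and the tag name is a plain String. The matching rule is also an assumption, since the documentation above does not define it: a node matches when its name equals the requested name and its attributes include every requested key/value pair.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; the real search method works on HtmlParseNode and TagName.
final class SearchSketch {

    /** Hypothetical stand-in for a parsed element. */
    static final class Node {
        final String name;
        final Map<String, String> attributes;
        final List<Node> children;
        Node(String name, Map<String, String> attributes, List<Node> children) {
            this.name = name;
            this.attributes = attributes;
            this.children = children;
        }
    }

    /** Depth-first search; returns an empty list when nothing matches. */
    static List<Node> search(List<Node> nodes, String tagName, Map<String, String> attributes) {
        List<Node> found = new ArrayList<>();
        collect(nodes, tagName, attributes, found);
        return found;
    }

    private static void collect(List<Node> nodes, String tagName,
                                Map<String, String> wanted, List<Node> found) {
        for (Node node : nodes) {
            // Assumed rule: same name, and every wanted key/value pair is present on the node.
            boolean nameMatches = node.name.equals(tagName);
            boolean attributesMatch = node.attributes.entrySet().containsAll(wanted.entrySet());
            if (nameMatches && attributesMatch) {
                found.add(node);
            }
            // Keep descending, so matches nested inside other matches are also found.
            collect(node.children, tagName, wanted, found);
        }
    }
}
```

Passing an empty attribute map in this sketch matches on the tag name alone.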
-