Class HtmlParser

java.lang.Object
com.renomad.minum.htmlparsing.HtmlParser

public final class HtmlParser extends Object
Converts HTML strings to object trees.

Enables a developer to analyze an HTML document by its structure.

Note: HTML parsing is difficult because of its lenient specification. See Postel's Law.

For our purposes, strict conformance to the specification is less important, so this implementation leaves numerous edge cases unhandled. Nevertheless, it should suit the needs of many ordinary web applications.

  • Constructor Details

    • HtmlParser

      public HtmlParser()
  • Method Details

    • parse

      public List<HtmlParseNode> parse(String input)
      Given any HTML input, scan through and generate a tree of HTML nodes. Return a list of the roots of the tree.

      This parser operates with a very particular paradigm in mind. I'll explain it through examples. Let's look at some typical HTML:

      <p>Hello world</p>

      The way we will model this is as follows:

      <ELEMENT_NAME_AND_DETAILS>content<END_OF_ELEMENT>

      We will examine the first part, "ELEMENT_NAME_AND_DETAILS", and grab the element's name and any attributes. Then we will descend into the content section. We know we have hit the end of the element by keeping track of how far we have descended/ascended and whether we are hitting a closing HTML element.

      Complicating this, some elements have no content: for example, void elements (such as <br>), or an element that the user chooses to leave empty.
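
      The model above can be illustrated with a minimal sketch. This is not the implementation of HtmlParser: the ElementModelSketch class, its Node type, and the short VOID list are hypothetical names introduced only for this example. Attributes are ignored, and closing tags are not checked against the element they close.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Set;

public class ElementModelSketch {

    /** Hypothetical node type for this sketch; not the library's HtmlParseNode. */
    static final class Node {
        final String name;   // element name, or null for character content
        final String text;   // character content, or null for an element
        final List<Node> children = new ArrayList<>();

        Node(String name, String text) {
            this.name = name;
            this.text = text;
        }
    }

    // A few void elements (assumed subset); they never have content or a closing tag.
    static final Set<String> VOID = Set.of("br", "hr", "img", "input", "link", "meta");

    static List<Node> parse(String input) {
        List<Node> roots = new ArrayList<>();
        Deque<Node> open = new ArrayDeque<>(); // stack of open elements; its size is the depth
        int i = 0;
        while (i < input.length()) {
            if (input.charAt(i) == '<') {
                int close = input.indexOf('>', i);
                if (close < 0) {
                    throw new IllegalArgumentException("unterminated tag at index " + i);
                }
                String tag = input.substring(i + 1, close).trim();
                if (tag.startsWith("/")) {
                    // END_OF_ELEMENT: ascend one level (the name is not verified here)
                    if (!open.isEmpty()) {
                        open.pop();
                    }
                } else {
                    // ELEMENT_NAME_AND_DETAILS: the name is the first token
                    boolean selfClosing = tag.endsWith("/");
                    if (selfClosing) {
                        tag = tag.substring(0, tag.length() - 1).trim();
                    }
                    String name = tag.split("\\s+")[0].toLowerCase();
                    Node node = new Node(name, null);
                    attach(roots, open, node);
                    // Only elements that can hold content cause a descent
                    if (!selfClosing && !VOID.contains(name)) {
                        open.push(node);
                    }
                }
                i = close + 1;
            } else {
                // content: everything up to the next '<'
                int next = input.indexOf('<', i);
                if (next < 0) {
                    next = input.length();
                }
                attach(roots, open, new Node(null, input.substring(i, next)));
                i = next;
            }
        }
        return roots;
    }

    // Add the node to the innermost open element, or to the roots at depth zero.
    static void attach(List<Node> roots, Deque<Node> open, Node node) {
        if (open.isEmpty()) {
            roots.add(node);
        } else {
            open.peek().children.add(node);
        }
    }

    public static void main(String[] args) {
        Node p = parse("<p>Hello world</p>").get(0);
        System.out.println(p.name + " -> " + p.children.get(0).text); // prints "p -> Hello world"
    }
}
```

      The stack of open elements plays the role of "how far we have descended": an opening tag pushes, a closing tag pops, and a void element is attached without pushing, because it has no content to descend into.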

    • search

      public List<HtmlParseNode> search(List<HtmlParseNode> nodes, TagName tagName, Map<String,String> attributes)
      Search the node tree for matching elements.

      If zero nodes are found, returns an empty list.
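
      A recursive tree search of this kind might look like the sketch below. It is an illustration only: SearchSketch and its Node type are hypothetical stand-ins for HtmlParseNode and TagName, and the matching rule (same tag name, and every requested attribute present with an equal value) is an assumption rather than documented behavior.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class SearchSketch {

    /** Hypothetical node for this sketch; the library's HtmlParseNode may differ. */
    static final class Node {
        final String tagName;
        final Map<String, String> attributes;
        final List<Node> children;

        Node(String tagName, Map<String, String> attributes, List<Node> children) {
            this.tagName = tagName;
            this.attributes = attributes;
            this.children = children;
        }
    }

    // Depth-first search; returns an empty list when nothing matches.
    static List<Node> search(List<Node> nodes, String tagName, Map<String, String> attributes) {
        List<Node> found = new ArrayList<>();
        for (Node node : nodes) {
            boolean nameMatches = node.tagName.equals(tagName);
            // Assumed rule: the node must carry every requested attribute with an equal value
            boolean attributesMatch = node.attributes.entrySet().containsAll(attributes.entrySet());
            if (nameMatches && attributesMatch) {
                found.add(node);
            }
            found.addAll(search(node.children, tagName, attributes));
        }
        return found;
    }

    public static void main(String[] args) {
        Node link = new Node("a", Map.of("href", "/home", "class", "nav"), List.of());
        Node plain = new Node("a", Map.of(), List.of());
        Node div = new Node("div", Map.of(), List.of(link, plain));

        System.out.println(search(List.of(div), "a", Map.of("class", "nav")).size()); // prints 1
        System.out.println(search(List.of(div), "span", Map.of()).isEmpty());         // prints true
    }
}
```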