com.renomad.minum.htmlparsing.HtmlParser

public final class HtmlParser extends Object

Converts HTML strings to object trees.

Enables a developer to analyze an HTML document by its structure.

Note: HTML is difficult to parse because its specification is lenient about what counts as acceptable input. See Postel's Law.

For our purposes, strict conformance to the spec matters less, so this implementation leaves numerous edge cases unhandled. Nevertheless, it should suit the needs of many ordinary web applications.

Constructor Summary

Constructors

HtmlParser()

Method Summary

List<HtmlParseNode> parse(String input)
  Given any HTML input, scan through and generate a tree of HTML nodes.

List<HtmlParseNode> search(List<HtmlParseNode> nodes, TagName tagName, Map<String,String> attributes)
  Search the node tree for matching elements.

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Constructor Details
- HtmlParser
  
  public HtmlParser()

Method Details
- parse
  
  public List<HtmlParseNode> parse(String input)
  Given any HTML input, scan through and generate a tree of HTML nodes. Return a list of the roots of the tree.
  This parser operates with a very particular paradigm in mind. I'll explain it through examples. Let's look at some typical HTML:
  
  <p>Hello world</p>
  
  The way we will model this is as follows:
  
  <ELEMENT_NAME_AND_DETAILS>content<END_OF_ELEMENT>
  
  We will examine the first part, "ELEMENT_NAME_AND_DETAILS", and grab the element's name and any attributes. Then we will descend into the content section. We know we have hit the end of the element by keeping track of how far we have descended/ascended and whether we are hitting a closing HTML element.
  
  Complicating this, an element may have no content: for example, a void element such as <br>, or an element the user deliberately leaves empty, such as <p></p>.
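The paradigm described above (read the element's name, descend into its content, ascend on a closing tag, and skip the descent for void elements) can be illustrated with a small self-contained sketch. This is not Minum's implementation: the class ToyHtmlSketch, its Node type, and the short VOID list are hypothetical names invented for this example. Attributes are ignored, closing tags are not checked against the element they close, and text outside any element is dropped.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy illustration of the descend/ascend model; NOT Minum's implementation. */
final class ToyHtmlSketch {

    /** Hypothetical node type: element name, direct text content, child elements. */
    static final class Node {
        final String name;
        final StringBuilder text = new StringBuilder();
        final List<Node> children = new ArrayList<>();
        Node(String name) { this.name = name; }
    }

    // A few void elements (assumed subset): they never have content or a closing tag.
    private static final List<String> VOID = List.of("br", "hr", "img", "input", "link", "meta");

    static List<Node> parse(String input) {
        List<Node> roots = new ArrayList<>();
        List<Node> open = new ArrayList<>(); // stack of open elements; its size is the current depth
        int i = 0;
        while (i < input.length()) {
            int lt = input.indexOf('<', i);
            // Text up to the next tag belongs to the innermost open element (dropped at depth 0).
            String text = input.substring(i, lt < 0 ? input.length() : lt);
            if (!open.isEmpty()) open.get(open.size() - 1).text.append(text);
            if (lt < 0) break;
            int gt = input.indexOf('>', lt);
            if (gt < 0) break; // unterminated tag: stop scanning
            String tag = input.substring(lt + 1, gt).trim();
            if (tag.startsWith("/")) {
                // END_OF_ELEMENT: ascend one level (the tag name is not verified here).
                if (!open.isEmpty()) open.remove(open.size() - 1);
            } else {
                // ELEMENT_NAME_AND_DETAILS: the name is the first token; attributes are ignored.
                boolean selfClosing = tag.endsWith("/");
                if (selfClosing) tag = tag.substring(0, tag.length() - 1).trim();
                String name = tag.split("\\s+")[0].toLowerCase();
                Node node = new Node(name);
                if (open.isEmpty()) roots.add(node);
                else open.get(open.size() - 1).children.add(node);
                // Descend only when the element can have content.
                if (!selfClosing && !VOID.contains(name)) open.add(node);
            }
            i = gt + 1;
        }
        return roots;
    }
}
```

In this sketch, the list of open elements plays the role of "how far we have descended": an opening tag pushes onto it, a closing tag pops from it, and a void element is attached to its parent without ever being pushed.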
- search
  
  public List<HtmlParseNode> search(List<HtmlParseNode> nodes, TagName tagName, Map<String,String> attributes)
  
  Search the node tree for matching elements.
  If zero nodes are found, returns an empty list.
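As a rough illustration of what searching a node tree for matching elements can look like, here is a self-contained sketch of a recursive search. It is not Minum's code: ToySearchSketch and its Node record are hypothetical, the tag is a plain String rather than a TagName, and the matching rule (a node matches when it has the requested tag and carries at least the requested attributes) is an assumption, since the documentation above does not spell out the rule. The record syntax requires Java 16 or later.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Toy illustration of a recursive tree search; NOT Minum's implementation. */
final class ToySearchSketch {

    /** Hypothetical node type: a tag name, its attributes, and its child nodes. */
    record Node(String tag, Map<String, String> attributes, List<Node> children) {}

    /**
     * Walks every node depth-first and collects those whose tag equals the
     * requested tag and whose attributes include every requested key/value pair
     * (assumed rule). Returns an empty list when nothing matches.
     */
    static List<Node> search(List<Node> nodes, String tag, Map<String, String> wanted) {
        List<Node> found = new ArrayList<>();
        for (Node n : nodes) {
            boolean attributesMatch = n.attributes().entrySet().containsAll(wanted.entrySet());
            if (n.tag().equals(tag) && attributesMatch) {
                found.add(n);
            }
            // Keep descending: matches may be nested inside matching or non-matching nodes.
            found.addAll(search(n.children(), tag, wanted));
        }
        return found;
    }
}
```

Returning an empty list instead of null, as the method description states for the real search, lets callers iterate over the result or check isEmpty() without a null check.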
