dim

dim is an HTML parser and simple DOM implementation with CSS selector support.

dim

  • is a single module;

  • has no dependencies outside the Python standard library (PSL);

  • is not crazy long;

  • supports Python 3.6 and forward,

so the file can be embedded directly in any Python 3.6+ application, or even in a monolithic source file. dim was designed to ease the development of googler(1), which itself promises to be a single Python script with zero third-party dependencies.

Simple example:

>>> import dim
>>> html = '''
... <html>
... <body>
...   <table id="primary">
...     <thead>
...       <tr><th class="bold">A</th><th>B</th></tr>
...     </thead>
...     <tbody>
...       <tr class="highlight"><td class="bold">1</td><td>2</td></tr>
...       <tr><td class="bold">3</td><td>4</td></tr>
...       <tr><td class="bold">5</td><td>6</td></tr>
...       <tr><td class="bold">7</td><td>8</td></tr>
...     </tbody>
...   </table>
...   <table id="secondary">
...     <thead>
...       <tr><th class="bold">C</th><th>D</th></tr>
...     </thead>
...     <tbody></tbody>
...   </table>
... </body>
... </html>'''
>>> root = dim.parse_html(html)
>>> [elem.text for elem in root.select_all('table#primary th.bold, '
...                                        'table#primary tr.highlight + tr > td.bold')]
['A', '3']
>>> [elem.text for elem in root.select_all('table#primary th.bold, '
...                                        'table#primary tr.highlight ~ tr > td.bold')]
['A', '3', '5', '7']
>>> [elem.text for elem in root.select_all('th.bold, tr.highlight ~ tr > td.bold')]
['A', '3', '5', '7', 'C']

dim.parse_html

Parses HTML string, builds DOM, and returns root node.

dim.DOMBuilder

HTML parser / DOM builder.

dim.Node

Represents a DOM node.

dim.ElementNode

Represents an element node.

dim.TextNode

Represents a text node.

dim.SelectorGroup

Represents a group of CSS selectors.

dim.Selector

Represents a CSS selector.

dim.AttributeSelector

Represents an attribute selector.

dim.AttributeSelectorType

Attribute selector types.

dim.Combinator

Combinator types.

dim.DOMBuilderException

Exception raised when DOMBuilder detects a bad state.

dim.SelectorParserException

Exception raised when the selector parser fails to parse an input.

Parsing HTML and building DOM

dim.parse_html(html, *, ParserClass=<class 'dim.DOMBuilder'>)[source]

Parses HTML string, builds DOM, and returns root node.

The parser may raise DOMBuilderException.

Parameters
Return type

Node

Returns

Root node of the parsed tree. If the HTML string contains multiple top-level elements, only the first is returned and the rest are lost.

class dim.DOMBuilder[source]

HTML parser / DOM builder.

Subclasses html.parser.HTMLParser.

Consumes HTML and builds a Node tree. Once finished, use root to access the root of the tree.

This parser cannot parse malformed HTML with tag mismatch.
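To make the tag-mismatch limitation concrete, here is a minimal sketch of a strict, stack-based tree builder on top of html.parser.HTMLParser. It is not dim's implementation: the class name TinyTreeBuilder, the dict-based nodes, and the use of ValueError are all assumptions made for illustration.

```python
from html.parser import HTMLParser


class TinyTreeBuilder(HTMLParser):
    """Hypothetical stand-in for a DOM builder; not dim's actual code."""

    def __init__(self):
        super().__init__()
        self.root = None   # first top-level element seen
        self.stack = []    # currently open elements, outermost first

    def handle_starttag(self, tag, attrs):
        node = {"tag": tag, "attrs": dict(attrs), "children": []}
        if self.stack:
            self.stack[-1]["children"].append(node)
        elif self.root is None:
            self.root = node
        self.stack.append(node)

    def handle_endtag(self, tag):
        # A strict builder refuses any end tag that does not close the
        # innermost open element.
        if not self.stack or self.stack[-1]["tag"] != tag:
            raise ValueError(f"tag mismatch at {self.getpos()}: </{tag}>")
        self.stack.pop()

    def handle_data(self, data):
        if self.stack:
            self.stack[-1]["children"].append(data)


builder = TinyTreeBuilder()
builder.feed("<ul><li>a</li><li>b</li></ul>")
builder.close()
print([child["tag"] for child in builder.root["children"]])  # ['li', 'li']

bad = TinyTreeBuilder()
try:
    bad.feed("<p><b>text</p>")  # </p> arrives while <b> is still open
except ValueError as exc:
    print("rejected:", exc)
```

A builder of this shape has no recovery strategy for unclosed or misnested tags, which is why such input is rejected rather than repaired.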

handle_starttag(tag, attrs)[source]
Parameters
Return type

None

handle_endtag(tag)[source]
Parameters

tag (str) –

Return type

None

handle_startendtag(tag, attrs)[source]
Parameters
Return type

None

handle_data(text)[source]
Parameters

text (str) –

Return type

None

property root

Finishes processing and returns the root node.

Raises DOMBuilderException if there is no root tag or root tag is not closed yet.

Return type

Node

Nodes and elements

The DOM implementation is exposed through the Node API. There are only two types of Nodes in this implementation: ElementNode and TextNode (both subclass Node and support the full API).

The base class Node should not be manually instantiated; use parse_html() or DOMBuilder. ElementNode and TextNode may be manually instantiated (though not recommended).

class dim.Node[source]

Represents a DOM node.

Parts of JavaScript’s DOM Node API and Element API are mirrored here, with extensions. In particular, querySelector and querySelectorAll are mirrored.

Notable properties and methods: attr(), classes, html, text, ancestors(), descendants(), select(), select_all(), matched_by().

tag
Type

Optional[str]

attrs
Type

Dict[str, str]

parent
Type

Optional[Node]

children
Type

List[Node]

select(selector)[source]

DOM querySelector clone. Returns one match (if any).

Parameters

selector (Union[str, SelectorGroup, Selector]) –

Return type

Optional[Node]

query_selector(selector)[source]

Alias of select().

Parameters

selector (Union[str, SelectorGroup, Selector]) –

Return type

Optional[Node]

select_all(selector)[source]

DOM querySelectorAll clone. Returns all matches in a list.

Parameters

selector (Union[str, SelectorGroup, Selector]) –

Return type

List[Node]

query_selector_all(selector)[source]

Alias of select_all().

Parameters

selector (Union[str, SelectorGroup, Selector]) –

Return type

List[Node]

matched_by(selector, root=None)[source]

Checks whether this node is matched by selector.

See SelectorGroup.matches().

Parameters
Return type

bool

child_nodes()[source]
Return type

List[Node]

first_child()[source]
Return type

Optional[Node]

first_element_child()[source]
Return type

Optional[Node]

last_child()[source]
Return type

Optional[Node]

last_element_child()[source]
Return type

Optional[Node]

next_sibling()[source]

Note

Not O(1), use with caution.

Return type

Optional[Node]

next_siblings()[source]
Return type

List[Node]

next_element_sibling()[source]

Note

Not O(1), use with caution.

Return type

Optional[ElementNode]

previous_sibling()[source]

Note

Not O(1), use with caution.

Return type

Optional[Node]

previous_siblings()[source]

Compared to the natural DOM order, the order of returned nodes is reversed. That is, the adjacent sibling (if any) is the first in the returned list.

Return type

List[Node]
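The reversed ordering of previous_siblings() can be pictured with plain lists. The two helper functions below are hypothetical and operate on a list of sibling labels plus an index, not on dim nodes.

```python
def previous_siblings(children, index):
    """Siblings before children[index], nearest first (reverse document order)."""
    return children[:index][::-1]


def next_siblings(children, index):
    """Siblings after children[index], in document order."""
    return children[index + 1:]


row = ["td1", "td2", "td3", "td4"]
print(previous_siblings(row, 2))  # ['td2', 'td1']
print(next_siblings(row, 2))      # ['td4']
```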

previous_element_sibling()[source]

Note

Not O(1), use with caution.

Return type

Optional[ElementNode]

ancestors(*, root=None)[source]

Ancestors are generated in reverse order of depth, stopping at root.

A RuntimeError is raised if root is not in the ancestral chain.

Parameters

root (Optional[Node]) –

Return type

Generator[Node, None, None]
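A minimal sketch of an ancestor walk with an optional stopping point follows. The class N and the free function ancestors are hypothetical. Two choices here are assumptions rather than documented behavior: the root node itself is yielded, and Python's built-in RuntimeError signals a root that is not in the chain.

```python
class N:
    """Hypothetical node carrying only a name and a parent link."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent


def ancestors(node, *, root=None):
    """Yield ancestors nearest first, stopping after root when one is given."""
    current = node.parent
    while current is not None:
        yield current
        if current is root:
            return
        current = current.parent
    if root is not None:
        # The walk ran off the top of the tree without meeting root.
        raise RuntimeError("root is not in the ancestral chain")


html = N("html")
body = N("body", html)
table = N("table", body)
tr = N("tr", table)

print([n.name for n in ancestors(tr)])             # ['table', 'body', 'html']
print([n.name for n in ancestors(tr, root=body)])  # ['table', 'body']
```

Because this is a generator, the error only surfaces once the caller has consumed every ancestor that was found.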

descendants()[source]

Descendants are generated in depth-first order.

Return type

Generator[Node, None, None]
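Depth-first generation can be sketched with a short recursive generator. The dict-based tree is hypothetical, and the sketch assumes pre-order traversal (a node is yielded before its own children), which coincides with document order.

```python
def descendants(node):
    """Yield every descendant of node, depth first, parents before children."""
    for child in node["children"]:
        yield child
        yield from descendants(child)


tree = {"name": "table", "children": [
    {"name": "thead", "children": [
        {"name": "tr#1", "children": []},
    ]},
    {"name": "tbody", "children": [
        {"name": "tr#2", "children": []},
        {"name": "tr#3", "children": []},
    ]},
]}

print([n["name"] for n in descendants(tree)])
# ['thead', 'tr#1', 'tbody', 'tr#2', 'tr#3']
```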

attr(attr)[source]

Returns the attribute if it exists on the node, otherwise None.

Parameters

attr (str) –

Return type

Optional[str]

property html

HTML representation of the node.

For a TextNode, html returns the escaped version of the text.

Return type

str

outer_html()[source]

Alias of html.

Return type

str

inner_html()[source]

HTML representation of the node’s children.

Return type

str

property text

This property is expected to be implemented by subclasses.

Return type

str

text_content()[source]

Alias of text.

Return type

str

property classes
Return type

List[str]

class_list()[source]
Return type

List[str]

class dim.ElementNode(tag, attrs, *, parent=None, children=None)[source]

Represents an element node.

Note that tag and attribute names are case-insensitive; attribute values are case-sensitive.
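The standard library's html.parser.HTMLParser, which DOMBuilder subclasses, already lowercases tag and attribute names while leaving attribute values untouched. The snippet below demonstrates only that standard-library behavior; the Recorder class is hypothetical, and whether dim relies on this normalization alone is an assumption.

```python
from html.parser import HTMLParser


class Recorder(HTMLParser):
    """Hypothetical helper that records what HTMLParser hands to a subclass."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        self.seen.append((tag, attrs))


rec = Recorder()
rec.feed('<TD Class="Bold" ID="Cell">')
rec.close()
print(rec.seen)  # [('td', [('class', 'Bold'), ('id', 'Cell')])]
```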

Parameters
property text

The concatenation of all descendant text nodes.

Return type

str

class dim.TextNode(text)[source]

Represents a text node.

Subclasses Node and str.

Parameters

text (str) –

__eq__(other)[source]

Two text nodes are equal if and only if they are the same node.

For string comparison, use text.

Parameters

other (object) –

Return type

bool

__ne__(other)[source]

Two text nodes are non-equal if they are not the same node.

For string comparison, use text.

Parameters

other (object) –

Return type

bool
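Identity-based equality on a str subclass can be sketched as follows. IdentityText is a hypothetical class, not dim's TextNode, and restoring hashability through object.__hash__ is a choice made for this sketch only.

```python
class IdentityText(str):
    """Hypothetical str subclass whose equality is object identity."""

    def __eq__(self, other):
        return self is other

    def __ne__(self, other):
        return self is not other

    # Defining __eq__ disables the inherited hash, so restore an identity hash.
    __hash__ = object.__hash__


a = IdentityText("1")
b = IdentityText("1")
print(a == a, a == b, a != b)  # True False True
print(str(a) == str(b))        # True: compare the plain strings instead
```

With this definition, two nodes holding the same characters stay distinguishable, which matters when a tree contains many identical text fragments.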

property text

This property is expected to be implemented by subclasses.

Return type

str

CSS selectors

CSS querying support is implemented mainly through two classes: Selector and SelectorGroup. Both classes have a factory function named from_str() to parse string representations, although one may directly use selector (group) strings with the Node API (notably with Node.select(), Node.select_all(), and Node.matched_by()) and avoid explicitly constructing objects altogether.

class dim.SelectorGroup(selectors)[source]

Represents a group of CSS selectors.

A group of CSS selectors is simply a comma-separated list of selectors [1]. See Selector documentation for the scope of support.

Typically, a SelectorGroup is constructed from a string (e.g., th.center, td.center) using the factory function from_str().

[1] https://www.w3.org/TR/selectors-3/#grouping

Parameters

selectors (Iterable[Selector]) –

__len__()[source]
Return type

int

__getitem__(index)[source]
Parameters

index (int) –

Return type

Selector

__iter__()[source]
Return type

Iterator[Selector]

classmethod from_str(s)[source]

Parses input string into a group of selectors.

SelectorParserException is raised on invalid input. See Selector documentation for the scope of support.

Parameters

s (str) – input string

Return type

SelectorGroup

Returns

Parsed group of selectors.

matches(node, root=None)[source]

Decides whether the group of selectors matches node.

The group of selectors matches node as long as one of the selectors matches node.

If root is provided and child and/or descendant combinators are involved, parent/ancestor lookup terminates at root.

Parameters
Return type

bool

class dim.Selector(*, tag=None, classes=None, id=None, attrs=None, combinator=None, previous=None)[source]

Represents a CSS selector.

Recall that a CSS selector is a chain of one or more sequences of simple selectors separated by combinators [2]. This concept is represented as a cons list of sequences of simple selectors (in right to left order). This class in fact holds a single sequence, with an optional combinator and reference to the previous sequence.

For instance, main#main p.important.definition > a.term[id][href] would be parsed into (schematically) the following structure:

">" tag='a' classes=('term') attrs=([id], [href]) ~>
" " tag='p' classes=('important', 'definition') ~>
tag='main' id='main'

Each line is held in a separate instance of Selector, linked together by the previous attribute.
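The schematic above can be mimicked with a small dataclass. Seq is a hypothetical stand-in for Selector that borrows the documented attribute names; attribute selectors are reduced to bare names and combinators to plain strings for brevity.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Seq:
    """Hypothetical stand-in for one sequence of simple selectors."""
    tag: Optional[str] = None
    id: Optional[str] = None
    classes: List[str] = field(default_factory=list)
    attrs: List[str] = field(default_factory=list)
    combinator: Optional[str] = None   # how this sequence links to `previous`
    previous: Optional["Seq"] = None   # the next sequence to the left


# main#main p.important.definition > a.term[id][href]
main = Seq(tag="main", id="main")
p = Seq(tag="p", classes=["important", "definition"],
        combinator=" ", previous=main)
a = Seq(tag="a", classes=["term"], attrs=["id", "href"],
        combinator=">", previous=p)


def chain(selector):
    """Walk the cons list from the rightmost sequence to the leftmost."""
    while selector is not None:
        yield selector
        selector = selector.previous


print([(s.combinator, s.tag) for s in chain(a)])
# [('>', 'a'), (' ', 'p'), (None, 'main')]
```

Holding the rightmost sequence at the head suits matching, since a candidate node is tested against the last sequence first and the chain is then followed leftwards through parents or siblings.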

Supported grammar (from Selectors Level 3 [2]):

  • Type selectors;

  • Universal selectors;

  • Class selectors;

  • ID selectors;

  • Attribute selectors;

  • Combinators.

Unsupported grammar:

  • Pseudo-classes;

  • Pseudo-elements;

  • Namespace prefixes (ns|, *|, |) in any part of any selector.

Rationale:

  • Pseudo-classes have too many variants, a few of which even come complete with an admittedly not-so-complex minilanguage. These add up to a lot of code.

  • Pseudo-elements are useless outside rendering contexts, hence out of scope.

  • Namespace support is too niche to be worth the parsing headache. Using namespace prefixes may confuse the parser!

Note that the parser only loosely follows the spec and prioritizes ease of parsing (which includes readability and writability of regexes). As a result, some invalid selectors may be accepted; in fact, false positives abound, but accepting valid inputs is a much more important goal than rejecting invalid inputs for this library. Some valid selectors may also be rejected, but as long as you stick to the scope outlined above and common sense you should be fine; the false negatives shouldn’t be used by actual human beings anyway.

In particular, whitespace character is simplified to \s (ASCII mode) despite CSS spec not counting U+000B (VT) as whitespace, identifiers are simplified to [\w-]+ (ASCII mode), and strings (attribute selector values can be either identifiers or strings) allow escaped quotes (i.e., \' inside single-quoted strings and \" inside double-quoted strings) but everything else is interpreted literally. The exact specs for CSS identifiers and strings can be found at [3].
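The effect of these simplifications can be checked directly with the re module. The two patterns below are taken from the description above; the regexes actually compiled inside dim may be composed differently.

```python
import re

WHITESPACE = re.compile(r"\s", re.ASCII)
IDENTIFIER = re.compile(r"[\w-]+", re.ASCII)

# False positive: vertical tab is accepted although CSS does not treat it
# as whitespace.
print(bool(WHITESPACE.fullmatch("\x0b")))        # True

# Ordinary identifiers pass.
print(bool(IDENTIFIER.fullmatch("nav-item_2")))  # True

# False positive: CSS identifiers may not start with a digit.
print(bool(IDENTIFIER.fullmatch("9lives")))      # True

# False negative: CSS allows non-ASCII characters, ASCII-mode \w does not.
print(bool(IDENTIFIER.fullmatch("café")))        # False
```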

Certain selectors and combinators may be implemented in the parser but not implemented in matching and/or selection APIs.

[2] https://www.w3.org/TR/selectors-3/

[3] https://www.w3.org/TR/CSS21/syndata.html

Parameters
tag

Type selector.

Type

Optional[str]

classes

Class selectors.

Type

List[str]

id

ID selector.

Type

Optional[str]

attrs

Attribute selectors.

Type

List[AttributeSelector]

combinator

Combinator with the previous sequence of simple selectors in chain.

Type

Optional[Combinator]

previous

Reference to the previous sequence of simple selectors in chain.

Type

Optional[Selector]

classmethod from_str(s, cursor=0)[source]

Parses input string into selector.

This factory function only parses out one selector (up to a comma or EOS), so partial consumption is allowed — an optional cursor is taken as input (0 by default) and the moved cursor (either after the comma or at EOS) is returned as part of the output.

SelectorParserException is raised on invalid input. See Selector documentation for the scope of support.

If you need to completely consume a string representing (potentially) a group of selectors, use SelectorGroup.from_str().

Parameters
  • s (str) – input string

  • cursor (int) – initial cursor position on s

Return type

Tuple[Selector, int]

Returns

A tuple containing the parsed selector and the moved cursor (either after a comma-delimiter, or at EOS).
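The cursor protocol described above (parse one item, return it together with the moved cursor) can be sketched with a toy parser. parse_one and parse_group are hypothetical helpers that split on commas only; a real selector parser must also handle commas inside quoted attribute values.

```python
def parse_one(s, cursor=0):
    """Toy parser: return one comma-delimited chunk and the moved cursor."""
    end = s.find(",", cursor)
    if end == -1:
        return s[cursor:].strip(), len(s)   # cursor at end of string
    return s[cursor:end].strip(), end + 1   # cursor just after the comma


def parse_group(s):
    """Drive parse_one repeatedly until the whole string is consumed."""
    parts, cursor = [], 0
    while cursor < len(s):
        part, cursor = parse_one(s, cursor)
        parts.append(part)
    return parts


print(parse_one("th.center, td.center"))    # ('th.center', 10)
print(parse_group("th.center, td.center"))  # ['th.center', 'td.center']
```

Returning the cursor instead of slicing the input keeps error positions meaningful relative to the original string.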

matches(node, root=None)[source]

Decides whether the selector matches node.

Each sequence of simple selectors in the selector’s chain must be matched for a positive.

If root is provided and child and/or descendant combinators are involved, parent/ancestor lookup terminates at root.

Parameters
Return type

bool

class dim.AttributeSelector(attr, val, type)[source]

Represents an attribute selector.

Parameters
attr
Type

str

val
Type

Optional[str]

type
Type

AttributeSelectorType

matches(node)[source]
Parameters

node (Node) –

Return type

bool

class dim.AttributeSelectorType(value)[source]

Attribute selector types.

Members correspond to the following forms of attribute selector:

BARE = 1 ([attr])
EQUAL = 2 ([attr=val])
TILDE = 3 ([attr~=val])
PIPE = 4 ([attr|=val])
CARET = 5 ([attr^=val])
DOLLAR = 6 ([attr$=val])
ASTERISK = 7 ([attr*=val])
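Judging by the member names, these correspond to the seven attribute selector forms of Selectors Level 3: [attr], [attr=val], [attr~=val], [attr|=val], [attr^=val], [attr$=val] and [attr*=val]. The sketch below spells out the matching rule the specification gives for each form. AttrOp and attr_matches are hypothetical names, and dim's own AttributeSelector.matches() may differ in edge cases.

```python
from enum import Enum


class AttrOp(Enum):
    """Hypothetical mirror of the member names listed above."""
    BARE = 1
    EQUAL = 2
    TILDE = 3
    PIPE = 4
    CARET = 5
    DOLLAR = 6
    ASTERISK = 7


def attr_matches(op, actual, val=None):
    """Selectors Level 3 rules; `actual` is None when the attribute is absent."""
    if actual is None:
        return False
    if op is AttrOp.BARE:        # [attr]: presence is enough
        return True
    if val is None:
        return False
    if op is AttrOp.EQUAL:       # [attr=val]: exact match
        return actual == val
    if op is AttrOp.PIPE:        # [attr|=val]: val, or val followed by "-"
        return actual == val or actual.startswith(val + "-")
    if not val:                  # the remaining forms never match an empty val
        return False
    if op is AttrOp.TILDE:       # [attr~=val]: one of the space-separated words
        return val in actual.split()
    if op is AttrOp.CARET:       # [attr^=val]: prefix
        return actual.startswith(val)
    if op is AttrOp.DOLLAR:      # [attr$=val]: suffix
        return actual.endswith(val)
    return val in actual         # [attr*=val]: substring


print(attr_matches(AttrOp.TILDE, "bold big", "bold"))  # True
print(attr_matches(AttrOp.PIPE, "en-US", "en"))        # True
print(attr_matches(AttrOp.PIPE, "english", "en"))      # False
```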
class dim.Combinator(value)[source]

Combinator types.

Members correspond to the following combinators:

DESCENDANT = 1 (A B)
CHILD = 2 (A > B)
NEXT_SIBLING = 3 (A + B)
SUBSEQUENT_SIBLING = 4 (A ~ B)
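In Selectors Level 3 terms, the four members correspond to "A B", "A > B", "A + B" and "A ~ B". The relationship each one tests can be sketched on a tiny element tree. El and related are hypothetical, and text nodes are left out so that every child counts as an element.

```python
class El:
    """Hypothetical element with parent and children links only."""

    def __init__(self, tag, children=()):
        self.tag, self.parent, self.children = tag, None, list(children)
        for child in self.children:
            child.parent = self


def related(op, left, right):
    """True if `left <op> right` holds, e.g. related('>', ul, li) for 'ul > li'."""
    if op == ">":    # CHILD: left is the parent of right
        return right.parent is left
    if op == " ":    # DESCENDANT: left is some ancestor of right
        node = right.parent
        while node is not None:
            if node is left:
                return True
            node = node.parent
        return False
    if op not in ("+", "~"):
        raise ValueError(op)
    if right.parent is None or left.parent is not right.parent:
        return False
    siblings = right.parent.children
    i, j = siblings.index(left), siblings.index(right)
    if op == "+":    # NEXT_SIBLING: right immediately follows left
        return j == i + 1
    return j > i     # SUBSEQUENT_SIBLING: right follows left anywhere


tr1, tr2, tr3 = El("tr"), El("tr"), El("tr")
tbody = El("tbody", [tr1, tr2, tr3])
table = El("table", [tbody])

print(related("+", tr1, tr2), related("+", tr1, tr3))      # True False
print(related("~", tr1, tr3))                              # True
print(related(" ", table, tr1), related(">", table, tr1))  # True False
```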

Exceptions

class dim.DOMBuilderException(pos, why)[source]

Exception raised when DOMBuilder detects a bad state.

Parameters
pos

Line number and offset in HTML input.

Type

Tuple[int, int]

why

Reason for the exception.

Type

str

class dim.SelectorParserException(s, cursor, why)[source]

Exception raised when the selector parser fails to parse an input.

Parameters
s

The input string to be parsed.

Type

str

cursor

Cursor position where the failure occurred.

Type

int

why

Reason for the failure.

Type

str