dim
dim is an HTML parser and simple DOM implementation with CSS
selector support.
dim:
is a single module;
has no dependency outside the Python standard library (PSL);
is not crazy long;
supports Python 3.6 and forward,
so the file could be directly embedded in any Python 3.6+ application,
or even in a monolithic source file. dim was designed to ease the
development of googler(1), which
itself promises to be a single Python script with zero third-party dependencies.
Simple example:
>>> import dim
>>> html = '''
... <html>
... <body>
... <table id="primary">
... <thead>
... <tr><th class="bold">A</th><th>B</th></tr>
... </thead>
... <tbody>
... <tr class="highlight"><td class="bold">1</td><td>2</td></tr>
... <tr><td class="bold">3</td><td>4</td></tr>
... <tr><td class="bold">5</td><td>6</td></tr>
... <tr><td class="bold">7</td><td>8</td></tr>
... </tbody>
... </table>
... <table id="secondary">
... <thead>
... <tr><th class="bold">C</th><th>D</th></tr>
... </thead>
... <tbody></tbody>
... </table>
... </body>
... </html>'''
>>> root = dim.parse_html(html)
>>> [elem.text for elem in root.select_all('table#primary th.bold, '
... 'table#primary tr.highlight + tr > td.bold')]
['A', '3']
>>> [elem.text for elem in root.select_all('table#primary th.bold, '
... 'table#primary tr.highlight ~ tr > td.bold')]
['A', '3', '5', '7']
>>> [elem.text for elem in root.select_all('th.bold, tr.highlight ~ tr > td.bold')]
['A', '3', '5', '7', 'C']
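parse_html() | Parses HTML string, builds DOM, and returns root node.
DOMBuilder | HTML parser / DOM builder.
Node | Represents a DOM node.
ElementNode | Represents an element node.
TextNode | Represents a text node.
SelectorGroup | Represents a group of CSS selectors.
Selector | Represents a CSS selector.
AttributeSelector | Represents an attribute selector.
AttributeSelectorType | Attribute selector types.
Combinator | Combinator types.
DOMBuilderException | Exception raised when DOMBuilder fails, e.g. when there is no root tag or the root tag is not closed.
SelectorParserException | Exception raised when the selector parser fails to parse an input.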
Parsing HTML and building DOM
dim.parse_html(html, *, ParserClass=<class 'dim.DOMBuilder'>)
Parses HTML string, builds DOM, and returns root node.
The parser may raise DOMBuilderException.
Parameters
html (str) – input HTML string
ParserClass (type) – DOMBuilder or a subclass
Return type
Node
Returns
Root node of the parsed tree. If the HTML string contains multiple top-level elements, only the first is returned and the rest are lost.
class dim.DOMBuilder
HTML parser / DOM builder.
Subclasses html.parser.HTMLParser.
Consumes HTML and builds a Node tree. Once finished, use root to access the root of the tree.
This parser cannot parse malformed HTML with tag mismatch.
property root
Finishes processing and returns the root node.
Raises DOMBuilderException if there is no root tag or the root tag is not closed yet.
Return type
Node
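To make the description of DOMBuilder more concrete, here is a minimal sketch of how a tree builder can be layered on html.parser.HTMLParser with a stack of open elements. It is an illustration only, not dim's implementation: SketchNode, SketchBuilder and SketchBuilderError are hypothetical names, attributes are ignored, and void elements such as <br> are not handled.

```python
from html.parser import HTMLParser


class SketchNode:
    """Hypothetical stand-in for an element node: a tag plus children."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []  # SketchNode instances or plain str text


class SketchBuilderError(Exception):
    """Hypothetical counterpart of DOMBuilderException."""


class SketchBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self._root = None
        self._stack = []  # currently open elements, outermost first

    def handle_starttag(self, tag, attrs):
        parent = self._stack[-1] if self._stack else None
        node = SketchNode(tag, parent)
        if parent is not None:
            parent.children.append(node)
        elif self._root is None:
            self._root = node  # only the first top-level element is kept
        self._stack.append(node)

    def handle_endtag(self, tag):
        # A closing tag must match the innermost open element.
        if not self._stack or self._stack[-1].tag != tag:
            raise SketchBuilderError(f"unexpected </{tag}>")
        self._stack.pop()

    def handle_data(self, data):
        if self._stack:
            self._stack[-1].children.append(data)

    @property
    def root(self):
        self.close()  # flush any input still buffered by HTMLParser
        if self._root is None:
            raise SketchBuilderError("no root tag")
        if self._stack:
            raise SketchBuilderError("unclosed tag")
        return self._root


builder = SketchBuilder()
builder.feed("<div><p>hi</p><p>there</p></div><span>lost</span>")
print(builder.root.tag)
```

The strict comparison in handle_endtag is what makes such a builder reject mismatched tags instead of trying to repair them, and keeping only the first top-level element mirrors the behaviour documented for parse_html().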
Nodes and elements
The DOM implementation is exposed through the Node API. There are only
two types of Node in this implementation: ElementNode and
TextNode (both subclass Node and support the full API).
The base class Node should not be manually instantiated; use
parse_html() or DOMBuilder. ElementNode and
TextNode may be manually instantiated (though this is not recommended).
class dim.Node
Represents a DOM node.
Parts of JavaScript’s DOM Node API and Element API are mirrored here, with extensions. In particular, querySelector and querySelectorAll are mirrored.
Notable properties and methods: attr(), classes, html, text, ancestors(), descendants(), select(), select_all(), matched_by().
query_selector_all(selector)
Alias of select_all().
previous_siblings()
Compared to the natural DOM order, the order of returned nodes is reversed. That is, the adjacent sibling (if any) is the first in the returned list.
ancestors(*, root=None)
Ancestors are generated in reverse order of depth (the parent first), stopping at root.
A RuntimeException is raised if root is not in the ancestral chain.
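The ordering rules for previous_siblings() and ancestors() can be sketched on a toy tree. TreeNode is a hypothetical class, not part of dim. Two assumptions are made here: root itself is yielded as the last ancestor, and the built-in RuntimeError stands in for the exception named above.

```python
class TreeNode:
    """Hypothetical minimal node with a parent link and ordered children."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = []
        if parent is not None:
            parent.children.append(self)

    def previous_siblings(self):
        if self.parent is None:
            return []
        siblings = self.parent.children
        index = next(i for i, s in enumerate(siblings) if s is self)
        # Reverse the slice so that the adjacent sibling comes first.
        return siblings[:index][::-1]

    def ancestors(self, *, root=None):
        node = self.parent
        while node is not None:
            yield node  # parent first, then grandparent, and so on
            if node is root:
                return  # assumption: root is included, then we stop
            node = node.parent
        if root is not None:
            # Assumption: RuntimeError stands in for the documented exception.
            raise RuntimeError("root is not an ancestor of this node")


html = TreeNode("html")
body = TreeNode("body", html)
a, b, c = (TreeNode(n, body) for n in "abc")

print([n.name for n in c.previous_siblings()])  # ['b', 'a']
print([n.name for n in c.ancestors()])          # ['body', 'html']
```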
class dim.ElementNode(tag, attrs, *, parent=None, children=None)
Represents an element node.
Note that tag and attribute names are case-insensitive; attribute values are case-sensitive.
-
class dim.TextNode(text)
Represents a text node.
Parameters
text (str)
__eq__(other)
Two text nodes are equal if and only if they are the same node.
For string comparison, use text.
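The identity-based equality described for TextNode can be illustrated with a small stand-in class. TextSketch is hypothetical and does not reproduce how dim defines its text nodes; it only shows what "equal if and only if they are the same node" means in practice.

```python
class TextSketch:
    """Hypothetical text node that compares by identity, not by content."""

    def __init__(self, text):
        self.text = text

    def __eq__(self, other):
        return self is other

    # Defining __eq__ disables the inherited hash; restore identity hashing.
    __hash__ = object.__hash__


first = TextSketch("hello")
second = TextSketch("hello")

print(first == second)            # False: two distinct nodes
print(first == first)             # True: the same node
print(first.text == second.text)  # True: content is compared through .text
```

One practical consequence is that lookups such as list.index() find the exact node rather than the first node with the same content.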
CSS selectors
CSS querying support is implemented mainly through two classes:
Selector and SelectorGroup. Both classes have a factory
function named from_str() to parse string representations, although one may
directly use selector (group) strings with the Node API (notably with
Node.select(), Node.select_all(), and Node.matched_by())
and avoid explicitly constructing objects altogether.
class dim.SelectorGroup(selectors)
Represents a group of CSS selectors.
A group of CSS selectors is simply a comma-separated list of selectors. [1] See Selector documentation for the scope of support.
Typically, a SelectorGroup is constructed from a string (e.g., th.center, td.center) using the factory function from_str().
classmethod from_str(s)
Parses input string into a group of selectors.
SelectorParserException is raised on invalid input. See Selector documentation for the scope of support.
Parameters
s (str) – input string
Return type
SelectorGroup
Returns
Parsed group of selectors.
-
class dim.Selector(*, tag=None, classes=None, id=None, attrs=None, combinator=None, previous=None)
Represents a CSS selector.
Recall that a CSS selector is a chain of one or more sequences of simple selectors separated by combinators. [2] This concept is represented as a cons list of sequences of simple selectors (in right to left order). This class in fact holds a single sequence, with an optional combinator and reference to the previous sequence.
For instance, main#main p.important.definition > a.term[id][href] would be parsed into (schematically) the following structure:
    ">" tag='a' classes=('term') attrs=([id], [href]) ~>
    " " tag='p' classes=('important', 'definition') ~>
    tag='main' id='main'
Each line is held in a separate instance of Selector, linked together by the previous attribute.
Supported grammar (from selectors level 3 [2]):
Type selectors;
Universal selectors;
Class selectors;
ID selectors;
Attribute selectors;
Combinators.
Unsupported grammar:
Pseudo-classes;
Pseudo-elements;
Namespace prefixes (ns|, *|, |) in any part of any selector.
Rationale:
Pseudo-classes have too many variants, a few of which even come complete with an admittedly not-so-complex minilanguage. These add up to a lot of code.
Pseudo-elements are useless outside rendering contexts, hence out of scope.
Namespace support is too niche to be worth the parsing headache. Using namespace prefixes may confuse the parser!
Note that the parser only loosely follows the spec and prioritizes ease of parsing (which includes readability and writability of regexes). As a result, some invalid selectors may be accepted (in fact, false positives abound, but accepting valid inputs is a much more important goal than rejecting invalid inputs for this library), and some valid selectors may be rejected (but as long as you stick to the scope outlined above and common sense you should be fine; the false negatives shouldn’t be used by actual human beings anyway).
In particular, the whitespace character is simplified to \s (ASCII mode) despite the CSS spec not counting U+000B (VT) as whitespace, identifiers are simplified to [\w-]+ (ASCII mode), and strings (attribute selector values can be either identifiers or strings) allow escaped quotes (i.e., \' inside single-quoted strings and \" inside double-quoted strings) but everything else is interpreted literally. The exact specs for CSS identifiers and strings can be found at [3].
Certain selectors and combinators may be implemented in the parser but not implemented in matching and/or selection APIs.
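The simplifications listed above can be written down as regular expressions. The patterns below paraphrase the prose and are assumptions for illustration; they are not copied from dim's source, and unquote() is a hypothetical helper.

```python
import re

# Assumption: patterns reconstructed from the description, not from dim itself.
WHITESPACE = re.compile(r"\s+", re.ASCII)      # also matches U+000B (VT)
IDENTIFIER = re.compile(r"[\w-]+", re.ASCII)   # ASCII letters, digits, _ and -
DOUBLE_QUOTED = re.compile(r'"((?:\\"|[^"])*)"')
SINGLE_QUOTED = re.compile(r"'((?:\\'|[^'])*)'")


def unquote(token):
    """Return the value of an identifier or quoted string token."""
    for pattern, quote in ((DOUBLE_QUOTED, '"'), (SINGLE_QUOTED, "'")):
        match = pattern.fullmatch(token)
        if match:
            # Only the escaped quote is unescaped; everything else stays literal.
            return match.group(1).replace("\\" + quote, quote)
    if IDENTIFIER.fullmatch(token):
        return token
    raise ValueError(f"not an identifier or string: {token!r}")


print(unquote(r'"say \"hi\""'))  # say "hi"
print(unquote("data-id"))        # data-id
```

With re.ASCII, \w no longer matches non-ASCII letters, so an identifier such as naïve is rejected by this pattern even though CSS would allow it. That is one example of the valid-but-rejected inputs mentioned above.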
attrs
Attribute selectors.
Type
List[AttributeSelector]
combinator
Combinator with the previous sequence of simple selectors in chain.
Type
Optional[Combinator]
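The cons-list layout described for Selector, where each sequence carries a combinator and a reference to the previous sequence, can be sketched with a small dataclass. Seq and chain() are hypothetical names; the fields loosely follow the constructor parameters above but use plain strings instead of dim's own types.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Seq:
    """Hypothetical stand-in for one sequence of simple selectors."""

    tag: Optional[str] = None
    classes: Tuple[str, ...] = ()
    id: Optional[str] = None
    attrs: Tuple[str, ...] = ()
    combinator: Optional[str] = None  # joins this sequence to `previous`
    previous: Optional["Seq"] = None


# main#main p.important.definition > a.term[id][href], built left to right
main = Seq(tag="main", id="main")
p = Seq(tag="p", classes=("important", "definition"),
        combinator=" ", previous=main)
a = Seq(tag="a", classes=("term",), attrs=("id", "href"),
        combinator=">", previous=p)


def chain(selector):
    """Walk the list from the rightmost sequence back to the leftmost."""
    while selector is not None:
        yield selector
        selector = selector.previous


print([s.tag for s in chain(a)])  # ['a', 'p', 'main']
```

Holding the rightmost sequence as the head of the list suits matching, since a candidate element is first tested against the last sequence and only then checked against its ancestors or siblings.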
classmethod from_str(s, cursor=0)
Parses input string into selector.
This factory function only parses out one selector (up to a comma or EOS), so partial consumption is allowed: an optional cursor is taken as input (0 by default) and the moved cursor (either after the comma or at EOS) is returned as part of the output.
SelectorParserException is raised on invalid input. See Selector documentation for the scope of support.
If you need to completely consume a string representing (potentially) a group of selectors, use SelectorGroup.from_str().
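The cursor convention can be illustrated without a real selector parser. In the sketch below, take_one() and take_all() are hypothetical helpers, and the (text, cursor) tuple is an assumed return shape, since the exact output of from_str() is not spelled out here. The splitting is deliberately naive: it would mishandle a comma inside a quoted attribute value and does not reject a trailing comma.

```python
def take_one(s, cursor=0):
    """Return (selector_text, new_cursor) for the selector starting at cursor."""
    comma = s.find(",", cursor)
    if comma == -1:
        return s[cursor:].strip(), len(s)      # stopped at end of string
    return s[cursor:comma].strip(), comma + 1  # stopped just after the comma


def take_all(s):
    """Consume a whole group by calling take_one until the cursor reaches EOS."""
    parts, cursor = [], 0
    while cursor < len(s):
        part, cursor = take_one(s, cursor)
        parts.append(part)
    return parts


print(take_one("th.bold, td.bold"))  # ('th.bold', 8)
print(take_all("th.bold, td.bold"))  # ['th.bold', 'td.bold']
```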
-
class dim.AttributeSelector(attr, val, type)
Represents an attribute selector.
Parameters
attr (str)
type (AttributeSelectorType)
type
-
class dim.AttributeSelectorType(value)
Attribute selector types.
Members correspond to the following forms of attribute selector:
BARE: [attr];
EQUAL: [attr=val];
TILDE: [attr~=val];
PIPE: [attr|=val];
CARET: [attr^=val];
DOLLAR: [attr$=val];
ASTERISK: [attr*=val].
BARE = 1
EQUAL = 2
TILDE = 3
PIPE = 4
CARET = 5
DOLLAR = 6
ASTERISK = 7
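For reference, the seven forms can be paired with the matching rules that Selectors Level 3 assigns to them. This sketch follows the CSS specification rather than dim's source; AttrType and attr_matches() are hypothetical names, and whether dim implements every form in its matching APIs is not confirmed here.

```python
from enum import Enum


class AttrType(Enum):
    """Hypothetical mirror of the member names listed above."""

    BARE = 1
    EQUAL = 2
    TILDE = 3
    PIPE = 4
    CARET = 5
    DOLLAR = 6
    ASTERISK = 7


def attr_matches(kind, actual, val=None):
    """Test an attribute value (None if the attribute is absent) against val."""
    if actual is None:
        return False
    if kind is AttrType.BARE:
        return True                    # [attr]: presence is enough
    if kind is AttrType.EQUAL:
        return actual == val           # [attr=val]: exact match
    if kind is AttrType.TILDE:
        return val in actual.split()   # [attr~=val]: one of the words
    if kind is AttrType.PIPE:
        # [attr|=val]: exactly val, or val followed by a hyphen
        return actual == val or actual.startswith(val + "-")
    if not val:
        return False                   # ^=, $= and *= never match an empty val
    if kind is AttrType.CARET:
        return actual.startswith(val)  # [attr^=val]: prefix
    if kind is AttrType.DOLLAR:
        return actual.endswith(val)    # [attr$=val]: suffix
    return val in actual               # [attr*=val]: substring


print(attr_matches(AttrType.TILDE, "btn btn-primary", "btn"))  # True
print(attr_matches(AttrType.PIPE, "en-US", "en"))              # True
```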
-
class dim.Combinator(value)
Combinator types.
Members correspond to the following combinators:
DESCENDANT: A B;
CHILD: A > B;
NEXT_SIBLING: A + B;
SUBSEQUENT_SIBLING: A ~ B.
DESCENDANT = 1
CHILD = 2
NEXT_SIBLING = 3
SUBSEQUENT_SIBLING = 4
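The meaning of each combinator can be sketched as a question about element B: which elements are allowed to play the role of A? The code below uses the standard CSS definitions on a toy tree that contains elements only. Comb, El and candidates_for_a() are hypothetical names, not dim APIs.

```python
from enum import Enum


class Comb(Enum):
    """Hypothetical mirror of the member names listed above."""

    DESCENDANT = 1
    CHILD = 2
    NEXT_SIBLING = 3
    SUBSEQUENT_SIBLING = 4


class El:
    """Hypothetical element with a parent link and ordered element children."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = []
        if parent is not None:
            parent.children.append(self)


def candidates_for_a(b, combinator):
    """Return the elements that may stand in for A, given B and a combinator."""
    if combinator is Comb.CHILD:
        return [b.parent] if b.parent is not None else []
    if combinator is Comb.DESCENDANT:
        found, node = [], b.parent
        while node is not None:  # every ancestor qualifies
            found.append(node)
            node = node.parent
        return found
    if b.parent is None:
        return []
    siblings = b.parent.children
    index = next(i for i, s in enumerate(siblings) if s is b)
    before = siblings[:index]
    if combinator is Comb.NEXT_SIBLING:
        return before[-1:]       # only the immediately preceding sibling
    return before                # SUBSEQUENT_SIBLING: any preceding sibling


root = El("table")
x, y, z = (El(n, root) for n in ("x", "y", "z"))
w = El("w", z)

print([e.name for e in candidates_for_a(z, Comb.NEXT_SIBLING)])        # ['y']
print([e.name for e in candidates_for_a(z, Comb.SUBSEQUENT_SIBLING)])  # ['x', 'y']
```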
