Metadata-Version: 2.0
Name: htmlement
Version: 0.2
Summary: Python HTMLParser extension with ElementTree support.
Home-page: https://github.com/willforde/python-htmlement
Author: William Forde
Author-email: willforde@gmail.com
License: MIT License
Keywords: html html5 parsehtml htmlparser elementtree dom
Platform: OS Independent
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Natural Language :: English
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 2
Classifier: Programming Language :: Python :: 2.7
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.3
Classifier: Programming Language :: Python :: 3.4
Classifier: Programming Language :: Python :: 3.5
Classifier: Programming Language :: Python :: 3.6
Classifier: Topic :: Text Processing :: Markup :: HTML
Classifier: Topic :: Software Development :: Libraries :: Python Modules

.. image:: https://badge.fury.io/py/htmlement.svg
    :target: https://pypi.python.org/pypi/htmlement

.. image:: https://readthedocs.org/projects/python-htmlement/badge/?version=latest
    :target: http://python-htmlement.readthedocs.io/en/latest/?badge=latest

.. image:: https://travis-ci.org/willforde/python-htmlement.svg?branch=master
    :target: https://travis-ci.org/willforde/python-htmlement

.. image:: https://coveralls.io/repos/github/willforde/python-htmlement/badge.svg?branch=master
    :target: https://coveralls.io/github/willforde/python-htmlement?branch=master

.. image:: https://api.codacy.com/project/badge/Grade/6b46406e1aa24b95947b3da6c09a4ab5
    :target: https://www.codacy.com/app/willforde/python-htmlement?utm_source=github.com&utm_medium=referral&utm_content=willforde/python-htmlement&utm_campaign=Badge_Grade

Installation
------------
::

    pip install htmlement

-or- ::

    pip install git+https://github.com/willforde/python-htmlement.git

Why another Python HTML Parser?
-------------------------------

There is no full HTML parser in the Python standard library.
Strictly speaking, there is html.parser.HTMLParser_, but it only tokenizes the markup and notifies me as
each tag is encountered; it does not build a tree. Usually, when parsing HTML, I want to query its elements and extract data from them.

There are a few third-party HTML parsers available, such as lxml, html5lib and beautifulsoup:

* lxml is the best parser available, fast and reliable, but since it requires C libraries, it's not always possible to install.
* html5lib is a pure-Python library designed to conform to the WHATWG HTML specification, but it is very slow at parsing HTML.
* beautifulsoup is also a pure-Python library, but it is considered by most to be very slow.

The objective of this project is to be a pure-Python HTML parser that is also faster than beautifulsoup
and that, like beautifulsoup, will also parse invalid HTML.
The simplest way to query a parsed document is to use `XPath expressions`__.
Python supports a simple (read: limited) XPath engine inside its ElementTree module.
A benefit of using ElementTree is that it can use a C implementation whenever available.
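
To give a feel for what "limited" means here, the following is a minimal sketch that uses only the standard library's ElementTree on a small, well-formed document of my own invention (it does not use htmlement). Descendant searches and simple attribute predicates work, while XPath functions such as ``contains()`` are rejected.

```python
import xml.etree.ElementTree as ET

# A small well-formed document, made up for this illustration
doc = ET.fromstring(
    "<html><body>"
    "<a href='/one' class='nav'>One</a>"
    "<div><a href='/two'>Two</a></div>"
    "</body></html>"
)

# Supported: descendant search and attribute predicates
all_links = [a.get("href") for a in doc.findall(".//a")]
nav_links = [a.text for a in doc.findall(".//a[@class='nav']")]
print(all_links)  # ['/one', '/two']
print(nav_links)  # ['One']

# Not supported: XPath functions such as contains()
try:
    doc.findall(".//a[contains(@href, 'one')]")
    unsupported_raised = False
except SyntaxError:
    unsupported_raised = True
print(unsupported_raised)
```

For queries beyond this subset, the usual workaround is to select a broader set of elements and filter them in plain Python.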

This HTML parser extends html.parser.HTMLParser_ to build a tree of ElementTree.Element_ instances.
The returned root element natively supports the ElementTree API.
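
The general idea can be sketched with the standard library alone. The ``TreeSketch`` class below is a hypothetical illustration written for this README, not htmlement's actual implementation: it forwards ``HTMLParser`` events to an ``ElementTree.TreeBuilder`` and closes any tags left open. It assumes Python 3 and handles only a handful of void tags.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# A few elements that never have a closing tag in HTML (not exhaustive)
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class TreeSketch(HTMLParser):
    """Hypothetical illustration only; not htmlement's actual code."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._builder = ET.TreeBuilder()
        self._open = []  # stack of currently open tag names

    def handle_starttag(self, tag, attrs):
        # Attributes without a value arrive as None; store them as ""
        self._builder.start(tag, {k: v or "" for k, v in attrs})
        if tag in VOID_TAGS:
            self._builder.end(tag)
        else:
            self._open.append(tag)

    def handle_endtag(self, tag):
        # Ignore stray end tags; otherwise close unclosed children first
        if tag in self._open:
            while self._open:
                closed = self._open.pop()
                self._builder.end(closed)
                if closed == tag:
                    break

    def handle_data(self, data):
        # Text outside the root element is dropped
        if self._open:
            self._builder.data(data)

    def close(self):
        super().close()
        # Close whatever is still open, then hand back the root element
        while self._open:
            self._builder.end(self._open.pop())
        return self._builder.close()


parser = TreeSketch()
parser.feed('<html><body><p>Hello<br>world</p><a href="/x">link</body></html>')
root = parser.close()
print(root.tag, root.find(".//a").get("href"))
```

Because the result is an ordinary ``Element``, the unclosed ``<a>`` tag above can still be found with ``root.find(".//a")``.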


Parsing HTML
------------
Here I’ll be using a sample HTML document that will be parsed using htmlement: ::

    html = """
    <html>
      <head>
        <title>GitHub</title>
      </head>
      <body>
        <a href="https://github.com/marmelo">GitHub</a>
        <a href="https://github.com/marmelo/python-htmlparser">GitHub Project</a>
      </body>
    </html>
    """

    # Parse the document
    import htmlement
    root = htmlement.fromstring(html)

``root`` is an ElementTree.Element_ and supports the ElementTree API
with XPath expressions. With this I can easily get both the title and all anchors in the document. ::

    # Get title
    title = root.find("head/title").text
    print("Parsing: %s" % title)

    # Get all anchors
    for a in root.iterfind(".//a"):
        print(a.get("href"))

And the output is as follows: ::

    Parsing: GitHub
    https://github.com/marmelo
    https://github.com/marmelo/python-htmlparser


Parsing HTML with a filter
--------------------------
Here I’ll be using a slightly more complex HTML document that will be parsed using htmlement with a filter to fetch
only the menu items. This can be very useful when dealing with large HTML documents, since it can be a lot faster to
parse only the required section and ignore everything else. ::

    html = """
    <html>
      <head>
        <title>Coffee shop</title>
      </head>
      <body>
        <ul class="menu">
          <li>Coffee</li>
          <li>Tea</li>
          <li>Milk</li>
        </ul>
        <ul class="extras">
          <li>Sugar</li>
          <li>Cream</li>
        </ul>
      </body>
    </html>
    """

    # Parse the document
    import htmlement
    root = htmlement.fromstring(html, "ul", attrs={"class": "menu"})

In this case I'm unable to get the title, since all elements outside the filter were ignored.
But this allows me to extract all list item elements within the menu list and nothing else. ::

    # Get all list items
    for item in root.iterfind(".//li"):
        # Get the text of each list item
        print(item.text)

And the output is as follows: ::

    Coffee
    Tea
    Milk
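
To illustrate why filtering can save work, here is a minimal standard-library sketch of the idea. The ``FilterSketch`` class is hypothetical and is not how htmlement is implemented: it skips every event until a start tag matches the requested tag and attributes, builds a tree for that section only, and ignores the rest. It assumes Python 3, well-formed markup inside the wanted section, and that the wanted element is present.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET


class FilterSketch(HTMLParser):
    """Hypothetical illustration of tag filtering; not htmlement's code."""

    def __init__(self, tag, attrs):
        super().__init__(convert_charrefs=True)
        self._want_tag = tag
        self._want_attrs = attrs
        self._builder = ET.TreeBuilder()
        self._depth = 0     # open elements inside the wanted section
        self._done = False  # True once the wanted section has closed

    def handle_starttag(self, tag, attrs):
        if self._done:
            return
        attrs = {k: v or "" for k, v in attrs}
        if self._depth == 0:
            # Skip everything until the wanted element shows up
            wanted = tag == self._want_tag and all(
                attrs.get(k) == v for k, v in self._want_attrs.items()
            )
            if not wanted:
                return
        self._builder.start(tag, attrs)
        self._depth += 1

    def handle_endtag(self, tag):
        if self._depth:
            self._builder.end(tag)
            self._depth -= 1
            self._done = self._depth == 0

    def handle_data(self, data):
        if self._depth:
            self._builder.data(data)

    def close(self):
        super().close()
        return self._builder.close()


html = """
<html>
  <head><title>Coffee shop</title></head>
  <body>
    <ul class="menu"><li>Coffee</li><li>Tea</li><li>Milk</li></ul>
    <ul class="extras"><li>Sugar</li><li>Cream</li></ul>
  </body>
</html>
"""

parser = FilterSketch("ul", {"class": "menu"})
parser.feed(html)
root = parser.close()
items = [li.text for li in root.iterfind(".//li")]
print(root.tag, items)  # ul ['Coffee', 'Tea', 'Milk']
```

Everything before and after the matching ``<ul>`` produces no elements at all, which is where the time saving on large documents would come from.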


Compatibility
-------------
* python 2.7
* python 3.3
* python 3.4
* python 3.5
* python 3.6
* pypy

.. _html.parser.HTMLParser: https://docs.python.org/3.6/library/html.parser.html#html.parser.HTMLParser
.. _ElementTree.Element: https://docs.python.org/3.6/library/xml.etree.elementtree.html#xml.etree.ElementTree.Element
.. _examples.py: https://github.com/willforde/python-htmlement/blob/master/examples.py
.. _Xpath: https://docs.python.org/3.6/library/xml.etree.elementtree.html#xpath-support
__ XPath_


