Metadata-Version: 1.1
Name: htmlement
Version: 0.2
Summary: Python HTMLParser extension with ElementTree support.
Home-page: https://github.com/willforde/python-htmlement
Author: William Forde
Author-email: willforde@gmail.com
License: MIT License
Description: .. image:: https://badge.fury.io/py/htmlement.svg
            :target: https://pypi.python.org/pypi/htmlement
        
        .. image:: https://readthedocs.org/projects/python-htmlement/badge/?version=latest
            :target: http://python-htmlement.readthedocs.io/en/latest/?badge=latest
        
        .. image:: https://travis-ci.org/willforde/python-htmlement.svg?branch=master
            :target: https://travis-ci.org/willforde/python-htmlement
        
        .. image:: https://coveralls.io/repos/github/willforde/python-htmlement/badge.svg?branch=master
            :target: https://coveralls.io/github/willforde/python-htmlement?branch=master
        
        .. image:: https://api.codacy.com/project/badge/Grade/6b46406e1aa24b95947b3da6c09a4ab5
            :target: https://www.codacy.com/app/willforde/python-htmlement?utm_source=github.com&utm_medium=referral&utm_content=willforde/python-htmlement&utm_campaign=Badge_Grade
        
        Installation
        ------------
        ::
        
            pip install htmlement
        
        -or- ::
        
            pip install git+https://github.com/willforde/python-htmlement.git
        
        Why another Python HTML Parser?
        -------------------------------
        
        There is no HTML parser in the Python Standard Library that builds a queryable tree.
        Strictly speaking, there is html.parser.HTMLParser_, but it only walks through the markup and notifies me as
        each tag is being parsed. Usually, when parsing HTML I want to query its elements and extract data from them.
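        
        To make the "notified as each tag is being parsed" point concrete, here is a minimal Python 3 sketch that uses only
        the standard library. The ``TagLogger`` class and its ``events`` list are hypothetical names chosen for this
        illustration; they are not part of htmlement.
        
        ```python
        from html.parser import HTMLParser
        
        
        class TagLogger(HTMLParser):
            """Hypothetical helper: record each parser event in the order it arrives."""
        
            def __init__(self):
                super().__init__()
                self.events = []
        
            def handle_starttag(self, tag, attrs):
                # Called once for every opening tag
                self.events.append(("start", tag))
        
            def handle_endtag(self, tag):
                # Called once for every closing tag
                self.events.append(("end", tag))
        
            def handle_data(self, data):
                # Called for text between tags; skip whitespace-only chunks
                if data.strip():
                    self.events.append(("data", data.strip()))
        
        
        logger = TagLogger()
        logger.feed("<p>Hello <b>world</b></p>")
        logger.close()
        print(logger.events)
        ```
        
        The result is a flat list of events and nothing more, so any tree structure or querying has to be built on top.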
        
        There are a few third-party HTML parsers available, such as lxml, html5lib and beautifulsoup:
        
        * lxml is fast and reliable and is generally the best parser available, but since it requires C libraries, it's not always possible to install.
        * html5lib is a pure-Python library designed to conform to the WHATWG HTML specification, but it is very slow at parsing HTML.
        * beautifulsoup is also a pure-Python library, but it is widely considered to be slow.
        
        The objective of this project is to be a pure-Python HTML parser that is faster than beautifulsoup
        and, like beautifulsoup, can also parse invalid HTML.
        The simplest way to query the parsed document is to use `XPath expressions`__.
        Python supports a simple (read: limited) XPath engine inside its ElementTree module.
        A benefit of using ElementTree is that it can use a C implementation whenever available.
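        
        As a rough illustration of what that limited XPath engine can do, the sketch below queries a small, well-formed
        XML snippet with the standard library alone. The document and the variable names are made up for this example.
        
        ```python
        import xml.etree.ElementTree as ET
        
        # A tiny well-formed document, invented for this example
        doc = ET.fromstring(
            "<root>"
            "<item kind='a'>one</item>"
            "<group><item kind='b'>two</item></group>"
            "</root>"
        )
        
        # A plain tag name only matches direct children
        direct = [el.text for el in doc.findall("item")]
        
        # ".//" searches all descendants
        anywhere = [el.text for el in doc.findall(".//item")]
        
        # Attribute predicates are supported as well
        kind_b = [el.text for el in doc.findall(".//item[@kind='b']")]
        
        print(direct, anywhere, kind_b)
        ```
        
        This prints ``['one'] ['one', 'two'] ['two']``. Richer XPath features, such as functions like ``contains()``,
        are outside what ElementTree offers.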
        
        This HTML parser extends html.parser.HTMLParser_ to build a tree of ElementTree.Element_ instances.
        The returned root element natively supports the ElementTree API.
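        
        The general idea can be sketched in a few lines of Python 3. This is only an illustration of the technique and
        not htmlement's actual implementation: ``SketchParser`` is a hypothetical class that forwards parser events to
        the standard library's ``TreeBuilder``, and it assumes well-nested input. A real parser also has to cope with
        void elements, unclosed tags and other invalid markup.
        
        ```python
        from html.parser import HTMLParser
        from xml.etree.ElementTree import TreeBuilder
        
        
        class SketchParser(HTMLParser):
            """Hypothetical sketch: forward HTMLParser events to a TreeBuilder."""
        
            def __init__(self):
                super().__init__()
                self._builder = TreeBuilder()
        
            def handle_starttag(self, tag, attrs):
                # attrs is a list of (name, value) pairs; value is None for bare attributes
                self._builder.start(tag, {k: v or "" for k, v in attrs})
        
            def handle_endtag(self, tag):
                self._builder.end(tag)
        
            def handle_data(self, data):
                self._builder.data(data)
        
            def close(self):
                # Flush any buffered input, then return the finished root element
                super().close()
                return self._builder.close()
        
        
        parser = SketchParser()
        parser.feed('<div><a href="/one">One</a><a href="/two">Two</a></div>')
        root = parser.close()
        
        # The result is a normal Element, so the ElementTree API applies
        links = [a.get("href") for a in root.iterfind(".//a")]
        print(links)
        ```
        
        Because the result is an ordinary ``Element``, the same ``find`` and ``iterfind`` calls used in the sections
        below work on it unchanged.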
        
        
        Parsing HTML
        ------------
        Here I’ll be using a sample HTML document that will be parsed using htmlement: ::
        
            html = """
            <html>
              <head>
                <title>GitHub</title>
              </head>
              <body>
                <a href="https://github.com/willforde">GitHub</a>
                <a href="https://github.com/willforde/python-htmlement">GitHub Project</a>
              </body>
            </html>
            """
        
            # Parse the document
            import htmlement
            root = htmlement.fromstring(html)
        
        The returned root is an ElementTree.Element_ and supports the ElementTree API,
        including XPath expressions. With this I can easily get both the title and all anchors in the document. ::
        
            # Get title
            title = root.find("head/title").text
            print("Parsing: %s" % title)
        
            # Get all anchors
            for a in root.iterfind(".//a"):
                print(a.get("href"))
        
        And the output is as follows: ::
        
            Parsing: GitHub
            https://github.com/willforde
            https://github.com/willforde/python-htmlement
        
        
        Parsing HTML with a filter
        --------------------------
        Here I’ll be using a slightly more complex HTML document that will be parsed using htmlement with a filter to fetch
        only the menu items. This can be very useful when dealing with large HTML documents, since it can be a lot faster to
        parse only the required section and ignore everything else. ::
        
            html = """
            <html>
              <head>
                <title>Coffee shop</title>
              </head>
              <body>
                <ul class="menu">
                  <li>Coffee</li>
                  <li>Tea</li>
                  <li>Milk</li>
                </ul>
                <ul class="extras">
                  <li>Sugar</li>
                  <li>Cream</li>
                </ul>
              </body>
            </html>
            """
        
            # Parse the document
            import htmlement
            root = htmlement.fromstring(html, "ul", attrs={"class": "menu"})
        
        In this case I'm unable to get the title, since all elements outside the filter were ignored.
        But this allows me to extract all list item elements within the menu list and nothing else. ::
        
            # Get all list items
            for item in root.iterfind(".//li"):
                # Get text from list item
                print(item.text)
        
        And the output is as follows: ::
        
            Coffee
            Tea
            Milk
        
        
        Compatibility
        -------------
        * Python 2.7
        * Python 3.3
        * Python 3.4
        * Python 3.5
        * Python 3.6
        * PyPy
        
        .. _html.parser.HTMLParser: https://docs.python.org/3.6/library/html.parser.html#html.parser.HTMLParser
        .. _ElementTree.Element: https://docs.python.org/3.6/library/xml.etree.elementtree.html#xml.etree.ElementTree.Element
        .. _examples.py: https://github.com/willforde/python-htmlement/blob/master/examples.py
        .. _Xpath: https://docs.python.org/3.6/library/xml.etree.elementtree.html#xpath-support
        __ XPath_
        
Keywords: html html5 parsehtml htmlparser elementtree dom
Platform: OS Independent
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Natural Language :: English
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 2
Classifier: Programming Language :: Python :: 2.7
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.3
Classifier: Programming Language :: Python :: 3.4
Classifier: Programming Language :: Python :: 3.5
Classifier: Programming Language :: Python :: 3.6
Classifier: Topic :: Text Processing :: Markup :: HTML
Classifier: Topic :: Software Development :: Libraries :: Python Modules
