Metadata-Version: 2.1
Name: html2json
Version: 0.2.3
Summary: Parsing HTML to JSON
Home-page: https://github.com/chuanconggao/html2json
Author: Chuancong Gao
Author-email: chuancong@gmail.com
License: MIT
Download-URL: https://github.com/chuanconggao/html2json/tarball/0.2.3
Description: [![PyPi version](https://img.shields.io/pypi/v/html2json.svg)](https://pypi.python.org/pypi/html2json/)
        [![PyPi pyversions](https://img.shields.io/pypi/pyversions/html2json.svg)](https://pypi.python.org/pypi/html2json/)
        [![PyPi license](https://img.shields.io/pypi/l/html2json.svg)](https://pypi.python.org/pypi/html2json/)
        
        Convert an HTML webpage to JSON data using a template defined in JSON.
        
        Installation
        ----
        
        This package is available on PyPI. Install it with `pip install -U html2json`, then import it with `from html2json import collect`.
        
        API
        ----
        
        The entry point is `collect(html, template)`. `html` is the HTML of the page as a string, and `template` is the template JSON loaded as Python objects.
        
        Note that the HTML must contain a root node, such as `<html>...</html>` or `<div>...</div>`. The root node itself cannot be matched.
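        
        A minimal usage sketch of the call described above. The sample HTML, the template, and the `extract` helper are hypothetical; only `collect(html, template)` comes from this package. The import is kept inside the helper so the snippet can be loaded even where `html2json` is not installed.
        
        ```python
        import json
        
        # Hypothetical page; the <html> root itself cannot be matched,
        # so the template selects an element inside it.
        HTML = "<html><body><h1>Hello</h1></body></html>"
        
        # The template is written as JSON and loaded into Python objects.
        TEMPLATE_JSON = '{"Title": ["h1", null, []]}'
        
        
        def extract(html, template_json):
            """Load the JSON template and pass it to collect()."""
            from html2json import collect  # imported lazily (assumption: installed)
        
            template = json.loads(template_json)
            return collect(html, template)
        ```
        
        Calling `extract(HTML, TEMPLATE_JSON)` is then expected to return the collected data as Python objects.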
        
        Template Syntax
        ----
        
        - The basic syntax is `keyName: [selector, attr, [listOfRegexes]]`.
            1. `selector` is a CSS selector (supported by [lxml](http://lxml.de/)).
            2. `attr` is the name of the attribute whose value is extracted. Set it to `null` to extract the inner text instead, or the outer text when the inner text is empty.
            3. The list of regexes `[listOfRegexes]` supports two forms of regex operations. The operations within the list are executed sequentially.
                - Replacement: `s/regex/replacement/g`. The trailing `g` is optional and enables multiple replacements.
                - Extraction: `/regex/`.
        
        For example:
        
        ```json
        {
            "Color": ["head link:nth-of-type(1)", "href", ["/\\w+(?=\\.css)/"]]
        }
        ```
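        
        The two regex operations can be mimicked with Python's `re` module. This is only a sketch of the behavior described above, not the package's implementation. `apply_ops` is a hypothetical helper, and it assumes that extraction keeps the first match, that a replacement without `g` changes only the first match, that a replacement always ends with `/`, and that no `/` appears inside the regex or the replacement.
        
        ```python
        import re
        
        
        def apply_ops(value, ops):
            """Apply template-style regex operations to a string, in order."""
            for op in ops:
                if op.startswith("s/"):
                    # Replacement: s/regex/replacement/ with an optional trailing g.
                    _, pattern, replacement, flags = op.split("/")
                    count = 0 if "g" in flags else 1  # count=0 replaces every match
                    value = re.sub(pattern, replacement, value, count=count)
                else:
                    # Extraction: /regex/ (assumed here to keep the first match).
                    match = re.search(op[1:-1], value)
                    value = match.group(0) if match else ""
            return value
        
        
        # Same regex as in the "Color" example; the href value is made up.
        color = apply_ops("/static/blue.css", [r"/\w+(?=\.css)/"])
        
        # Operations run sequentially: replace every "-", then extract.
        joined = apply_ops("a-b-c", ["s/-/+/g", r"/\w\+\w/"])
        ```
        
        Note that in a JSON template each backslash is doubled (`\\w`), whereas the raw Python strings above use a single one.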
        
        - Because the template is JSON, nested structures can be constructed easily.
        
        ```json
        {
            "Cover": {
                "URL": [".cover img", "src", []],
                "Number of Favorites": [".cover .favorites", "value", []]
            }
        }
        ```
        
        - An alternative simplified syntax `keyName: [subRoot, subTemplate]` can be used.
            1. `subRoot` is the CSS selector of the new root for each sub entry.
            2. `subTemplate` is a sub-template for each entry, recursively.
        
        For example, the previous example can be simplified as follows.
        
        ```json
        {
            "Cover": [".cover", {
                "URL": ["img", "src", []],
                "Number of Favorites": [".favorites", "value", []]
            }]
        }
        ```
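        
        The relationship between the simplified form and the fully nested form can be sketched in Python. The `expand` helper below is hypothetical and is not how the package evaluates templates; it only assumes that a sub-template selector behaves like a descendant of `subRoot`, and it rewrites each selector accordingly.
        
        ```python
        def expand(sub_root, sub_template):
            """Prefix every selector in sub_template with sub_root."""
            expanded = {}
            for key, value in sub_template.items():
                if isinstance(value, dict):
                    # Nested object: recurse with the same root.
                    expanded[key] = expand(sub_root, value)
                else:
                    # Basic entry: [selector, attr, [listOfRegexes]].
                    selector, attr, regexes = value
                    expanded[key] = [(sub_root + " " + selector).strip(), attr, regexes]
            return expanded
        
        
        simplified = {
            "URL": ["img", "src", []],
            "Number of Favorites": [".favorites", "value", []],
        }
        cover = expand(".cover", simplified)
        ```
        
        Here `cover` holds the same entries as the `"Cover"` object in the nested example, with selectors `.cover img` and `.cover .favorites`.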
        
        - To extract a list of sub-entries following the same sub-template, the list syntax is `keyName: [[subRoot, subTemplate]]`. Note the extra surrounding `[` and `]` compared with the previous syntax.
            1. `subRoot` is the CSS selector of the new root for each sub entry.
            2. `subTemplate` is the sub-template for each entry, recursively.
        
        For example:
        
        ```json
        {
            "Comments": [[".comments", {
                "From": [".from", null, []],
                "Content": [".content", null, []],
                "Photos": [["img", {
                    "URL": ["", "src", []]
                }]]
            }]]
        }
        ```
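        
        Since the forms above differ only in their shape, a template loaded into Python can be inspected with a small shape check. The `classify` helper is hypothetical and inferred purely from the syntax descriptions above; it is not the package's own parser.
        
        ```python
        import json
        
        
        def classify(entry):
            """Name the template form of an entry, judging by its shape alone."""
            if isinstance(entry, dict):
                return "nested"
            if len(entry) == 1 and isinstance(entry[0], list):
                return "list"          # [[subRoot, subTemplate]]
            if len(entry) == 2 and isinstance(entry[1], dict):
                return "sub-template"  # [subRoot, subTemplate]
            if len(entry) == 3 and isinstance(entry[2], list):
                return "value"         # [selector, attr, [listOfRegexes]]
            raise ValueError("unrecognized template entry: {!r}".format(entry))
        
        
        template = json.loads("""
        {
            "Comments": [[".comments", {
                "From": [".from", null, []],
                "Photos": [["img", {"URL": ["", "src", []]}]]
            }]]
        }
        """)
        
        comments = template["Comments"]
        sub_template = comments[0][1]
        forms = {
            "Comments": classify(comments),
            "From": classify(sub_template["From"]),
            "Photos": classify(sub_template["Photos"]),
        }
        ```
        
        In this sketch, `"Comments"` and `"Photos"` are recognized as list entries and `"From"` as a basic value entry.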
        
Keywords: parser,html,json
Platform: UNKNOWN
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 2
Classifier: Programming Language :: Python :: 3
Description-Content-Type: text/markdown
