Hodor |PyPI|
============

A simple HTML scraper driven by XPath or CSS selectors.

Install
-------

``pip install hodorlive``

Usage
-----

As a Python package
~~~~~~~~~~~~~~~~~~~

**WARNING:** By default this package does not verify SSL certificates.
Set ``ssl_verify`` (see `arguments <#arguments>`__) to enable
verification.

Sample code
^^^^^^^^^^^

.. code:: python

    from hodor import Hodor
    from dateutil.parser import parse


    def date_convert(data):
        return parse(data)

    url = 'http://www.nasdaq.com/markets/stocks/symbol-change-history.aspx'

    CONFIG = {
        'old_symbol': {
            'css': '#SymbolChangeList_table tr td:nth-child(1)',
            'many': True
        },
        'new_symbol': {
            'css': '#SymbolChangeList_table tr td:nth-child(2)',
            'many': True
        },
        'effective_date': {
            'css': '#SymbolChangeList_table tr td:nth-child(3)',
            'many': True,
            'transform': date_convert
        },
        '_groups': {
            'data': '__all__',
            'ticker_changes': ['old_symbol', 'new_symbol']
        },
        '_paginate_by': {
            'xpath': '//*[@id="two_column_main_content_lb_NextPage"]/@href',
            'many': False
        }
    }

    h = Hodor(url=url, config=CONFIG, pagination_max_limit=5)

    h.data

Sample output
^^^^^^^^^^^^^

.. code:: python

    {'data': [{'effective_date': datetime.datetime(2016, 11, 1, 0, 0),
               'new_symbol': 'ARNC',
               'old_symbol': 'AA'},
              {'effective_date': datetime.datetime(2016, 11, 1, 0, 0),
               'new_symbol': 'ARNC$',
               'old_symbol': 'AA$'},
              {'effective_date': datetime.datetime(2016, 8, 16, 0, 0),
               'new_symbol': 'MALN8',
               'old_symbol': 'AHUSDN2018'},
              {'effective_date': datetime.datetime(2016, 8, 16, 0, 0),
               'new_symbol': 'MALN9',
               'old_symbol': 'AHUSDN2019'},
              {'effective_date': datetime.datetime(2016, 8, 16, 0, 0),
               'new_symbol': 'MALQ6',
               'old_symbol': 'AHUSDQ2016'},
              {'effective_date': datetime.datetime(2016, 8, 16, 0, 0),
               'new_symbol': 'MALQ7',
               'old_symbol': 'AHUSDQ2017'},
              {'effective_date': datetime.datetime(2016, 8, 16, 0, 0),
               'new_symbol': 'MALQ8',
               'old_symbol': 'AHUSDQ2018'}]}
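
Judging from the sample output, the ``data`` group (declared as
``'__all__'``) combines the per-rule result lists row by row into one
dict per table row. The snippet below is a minimal pure-Python sketch of
that shape only. It is not Hodor's implementation, and ``group_rows`` is
a hypothetical helper introduced for illustration.

.. code:: python

    from datetime import datetime

    # Per-rule results as parallel lists (values copied from the sample output).
    fields = {
        'old_symbol': ['AA', 'AA$'],
        'new_symbol': ['ARNC', 'ARNC$'],
        'effective_date': [datetime(2016, 11, 1), datetime(2016, 11, 1)],
    }


    def group_rows(fields, names):
        """Zip the named per-rule lists into one dict per row (hypothetical helper)."""
        columns = [fields[name] for name in names]
        return [dict(zip(names, row)) for row in zip(*columns)]


    # Combine every rule, which is what '__all__' appears to do in the sample.
    data = group_rows(fields, ['old_symbol', 'new_symbol', 'effective_date'])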

Arguments
^^^^^^^^^

-  ``ua`` (User-Agent)
-  ``proxies`` (check requesocks)
-  ``auth``
-  ``crawl_delay`` (crawl delay in seconds across pagination - default:
   3 seconds)
-  ``pagination_max_limit`` (max number of pages to crawl - default:
   100)
-  ``ssl_verify`` (default: False)
-  ``robots`` (if set, respects robots.txt - default: True)
-  ``reppy_capacity`` (robots cache LRU capacity - default: 100)
-  ``trim_values`` (if set, trims leading and trailing whitespace from
   output values - default: True)
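
As a rough sketch, the arguments above could be collected in a dict and
passed to the constructor as keyword arguments. This assumes they are
accepted the same way as ``pagination_max_limit`` in the sample code. The
values shown, including the User-Agent string, are illustrative only.

.. code:: python

    # Argument names come from the list above; the values are illustrative.
    HODOR_OPTIONS = {
        'ua': 'my-scraper/1.0',       # hypothetical User-Agent string
        'ssl_verify': True,           # turn certificate verification on
        'crawl_delay': 5,             # seconds to wait between paginated requests
        'pagination_max_limit': 10,   # stop after at most 10 pages
        'robots': True,               # keep respecting robots.txt
        'trim_values': True,          # strip surrounding whitespace
    }

    # Assumed usage, mirroring the sample code:
    # h = Hodor(url=url, config=CONFIG, **HODOR_OPTIONS)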

Config parameters
^^^^^^^^^^^^^^^^^

-  By default, every key in the config is a rule to parse.

   -  Each rule is either an ``xpath`` or a ``css`` selector.
   -  Each rule extracts ``many`` values by default, unless ``many`` is
      explicitly set to ``False``.
   -  Each rule can ``transform`` its result with a function, if one is
      provided.

-  Extra parameters include grouping (``_groups``) and pagination
   (``_paginate_by``); the latter also uses the rule format.
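
A small config that exercises each of these options might look like the
following. The selectors, the key names and the ``to_upper`` transform
are hypothetical, chosen only to show the rule format described above.

.. code:: python

    def to_upper(value):
        """Hypothetical transform: upper-case a scraped string."""
        return value.upper()


    CONFIG = {
        # XPath rule returning a single value, post-processed by a function.
        'title': {
            'xpath': '//h1/text()',
            'many': False,
            'transform': to_upper,
        },
        # CSS rule; 'many' is omitted, so it keeps its default.
        'headings': {
            'css': 'h2',
        },
        # Group every rule under one key, as in the sample code.
        '_groups': {
            'page': '__all__',
        },
    }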

.. |PyPI| image:: https://img.shields.io/pypi/v/hodorlive.svg?maxAge=2592000&style=plastic
   :target: https://pypi.python.org/pypi/hodorlive/
