Hodor |PyPI|
============

A simple HTML scraper that extracts data using XPath or CSS selectors.

Install
-------

``pip install hodorlive``

Usage
-----

As a Python package
~~~~~~~~~~~~~~~~~~~

**WARNING: By default, this package does not verify SSL certificates.
See the** `arguments <#arguments>`__ **to enable verification.**

Sample code
^^^^^^^^^^^

.. code:: python

   from hodor import Hodor
   from dateutil.parser import parse


   def date_convert(data):
       return parse(data)

   url = 'http://www.nasdaq.com/markets/stocks/symbol-change-history.aspx'

   CONFIG = {
       'old_symbol': {
           'css': '#SymbolChangeList_table tr td:nth-child(1)',
           'many': True
       },
       'new_symbol': {
           'css': '#SymbolChangeList_table tr td:nth-child(2)',
           'many': True
       },
       'effective_date': {
           'css': '#SymbolChangeList_table tr td:nth-child(3)',
           'many': True,
           'transform': date_convert
       },
       '_groups': {
           'data': '__all__',
           'ticker_changes': ['old_symbol', 'new_symbol']
       },
       '_paginate_by': {
           'xpath': '//*[@id="two_column_main_content_lb_NextPage"]/@href',
           'many': False
       }
   }

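   # Crawl at most 5 pages, following the link matched by `_paginate_by`.
   h = Hodor(url=url, config=CONFIG, pagination_max_limit=5)

   # The scraped results (see the sample output below).
   h.data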

Sample output
^^^^^^^^^^^^^

.. code:: python

   {'data': [{'effective_date': datetime.datetime(2016, 11, 1, 0, 0),
              'new_symbol': 'ARNC',
              'old_symbol': 'AA'},
             {'effective_date': datetime.datetime(2016, 11, 1, 0, 0),
              'new_symbol': 'ARNC$',
              'old_symbol': 'AA$'},
             {'effective_date': datetime.datetime(2016, 8, 16, 0, 0),
              'new_symbol': 'MALN8',
              'old_symbol': 'AHUSDN2018'},
             {'effective_date': datetime.datetime(2016, 8, 16, 0, 0),
              'new_symbol': 'MALN9',
              'old_symbol': 'AHUSDN2019'},
             {'effective_date': datetime.datetime(2016, 8, 16, 0, 0),
              'new_symbol': 'MALQ6',
              'old_symbol': 'AHUSDQ2016'},
             {'effective_date': datetime.datetime(2016, 8, 16, 0, 0),
              'new_symbol': 'MALQ7',
              'old_symbol': 'AHUSDQ2017'},
             {'effective_date': datetime.datetime(2016, 8, 16, 0, 0),
              'new_symbol': 'MALQ8',
              'old_symbol': 'AHUSDQ2018'}]}

Arguments
^^^^^^^^^

-  ``ua`` (User-Agent)
-  ``proxies`` (see requesocks)
-  ``auth``
-  ``crawl_delay`` (delay in seconds between paginated requests -
   default: 3)
-  ``pagination_max_limit`` (max number of pages to crawl - default:
   100)
-  ``ssl_verify`` (default: False)
-  ``robots`` (if set, respects robots.txt - default: True)
-  ``reppy_capacity`` (robots cache LRU capacity - default: 100)
-  ``trim_values`` (if set, trims leading and trailing whitespace from
   output values - default: True)
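
The sketch below shows one way to pass several of these arguments
together, assuming each is accepted as a keyword argument of ``Hodor``,
as ``pagination_max_limit`` is in the sample code. ``HODOR_OPTIONS`` and
``build_scraper`` are hypothetical names, and the values are only
illustrative.

.. code:: python

   # Argument names come from the list above; the values are examples.
   HODOR_OPTIONS = {
       'ssl_verify': True,          # verify SSL certificates (off by default)
       'crawl_delay': 5,            # seconds to wait between paginated requests
       'pagination_max_limit': 10,  # stop after 10 pages
       'robots': True,              # respect robots.txt
       'trim_values': True,         # strip leading and trailing whitespace
   }


   def build_scraper(url, config):
       """Hypothetical helper: create a Hodor instance with the options above."""
       # Imported lazily so the options dict is usable without hodor installed.
       from hodor import Hodor
       return Hodor(url=url, config=config, **HODOR_OPTIONS)

Calling ``build_scraper(url, CONFIG).data`` would then behave like the
sample code, but with SSL verification turned on.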

Config parameters
^^^^^^^^^^^^^^^^^

-  By default, any key in the config is a rule to parse.

   -  Each rule takes either an ``xpath`` or a ``css`` selector.
   -  Each rule extracts ``many`` values by default, unless ``many`` is
      explicitly set to ``False``.
   -  Each rule can ``transform`` its result with a function, if one is
      provided.

-  Extra parameters include grouping (``_groups``) and pagination
   (``_paginate_by``); the latter uses the same rule format.
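
As a rough illustration of the rule format described above, the sketch
below defines one ``css`` rule, one ``xpath`` rule and a ``transform``
function. The selectors, the rule names ``title`` and ``published`` and
the ``parse_us_date`` helper are hypothetical; only the ``css``,
``xpath``, ``many`` and ``transform`` keys come from this document.

.. code:: python

   from datetime import datetime


   def parse_us_date(value):
       """Hypothetical transform: turn 'MM/DD/YYYY' text into a datetime."""
       return datetime.strptime(value.strip(), '%m/%d/%Y')


   CONFIG = {
       # CSS rule; `many` is False, so a single value is requested
       'title': {
           'css': 'h1.title',
           'many': False
       },
       # XPath rule; `many` is omitted, so the default (many values) applies
       'published': {
           'xpath': '//table[@id="releases"]//td[1]/text()',
           'transform': parse_us_date
       }
   }

A config like this would be passed to ``Hodor`` in the same way as the
one in the sample code.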

.. |PyPI| image:: https://img.shields.io/pypi/v/hodorlive.svg?maxAge=2592000&style=plastic
   :target: https://pypi.python.org/pypi/hodorlive/
