Metadata-Version: 2.1
Name: scrapinghub-autoextract
Version: 0.1
Summary: Python interface to Scrapinghub Automatic Extraction API
Home-page: https://github.com/scrapinghub/scrapinghub-autoextract
Author: Mikhail Korobov
Author-email: kmike84@gmail.com
License: UNKNOWN
Platform: UNKNOWN
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: BSD License
Classifier: Natural Language :: English
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.5
Classifier: Programming Language :: Python :: 3.6
Classifier: Programming Language :: Python :: 3.7
Requires-Dist: requests
Requires-Dist: tenacity ; python_version >= "3.6"
Requires-Dist: aiohttp (>=3.6.0) ; python_version >= "3.6"
Requires-Dist: tqdm ; python_version >= "3.6"

=======================
scrapinghub-autoextract
=======================

.. image:: https://img.shields.io/pypi/v/scrapinghub-autoextract.svg
   :target: https://pypi.python.org/pypi/scrapinghub-autoextract
   :alt: PyPI Version

.. image:: https://img.shields.io/pypi/pyversions/scrapinghub-autoextract.svg
   :target: https://pypi.python.org/pypi/scrapinghub-autoextract
   :alt: Supported Python Versions

.. image:: https://travis-ci.org/scrapinghub/scrapinghub-autoextract.svg?branch=master
   :target: https://travis-ci.org/scrapinghub/scrapinghub-autoextract
   :alt: Build Status

.. image:: https://codecov.io/github/scrapinghub/scrapinghub-autoextract/coverage.svg?branch=master
   :target: https://codecov.io/gh/scrapinghub/scrapinghub-autoextract
   :alt: Coverage report


Python client libraries for the `Scrapinghub AutoExtract API`_.
They let you extract product and article information from any website.

Both synchronous and asyncio wrappers are provided by this package.

License is BSD 3-clause.

.. _Scrapinghub AutoExtract API: https://scrapinghub.com/autoextract


Installation
============

::

    pip install scrapinghub-autoextract

scrapinghub-autoextract requires Python 3.6+ for the CLI tool and for
the asyncio API; the basic synchronous API also works with Python 3.5.

Usage
=====

First, make sure you have an API key. To avoid passing it in the ``api_key``
argument with every call, you can set the ``SCRAPINGHUB_AUTOEXTRACT_KEY``
environment variable to the key.
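
The lookup order described above can be sketched as follows. This is only an
illustration, not the library's own code: ``resolve_api_key`` is a
hypothetical helper, and the only names taken from this README are the
``api_key`` argument and the ``SCRAPINGHUB_AUTOEXTRACT_KEY`` variable::

    import os

    ENV_VAR = 'SCRAPINGHUB_AUTOEXTRACT_KEY'

    def resolve_api_key(api_key=None, environ=None):
        """Return an explicit key if given, else the one from the environment."""
        # hypothetical helper; ``environ`` is injectable to ease testing
        environ = os.environ if environ is None else environ
        key = api_key or environ.get(ENV_VAR)
        if not key:
            raise RuntimeError('No API key: pass api_key or set ' + ENV_VAR)
        return key

Failing early like this gives a clearer error than a rejected request would.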

Command-line interface
----------------------

The most basic way to use the client is from the command line.
First, create a file with URLs, one URL per line (e.g. ``urls.txt``).
Second, set the ``SCRAPINGHUB_AUTOEXTRACT_KEY`` environment variable to your
AutoExtract API key (you can also pass the API key as the ``--api-key``
script argument).

Then run the script to get the results::

    python -m autoextract urls.txt --page-type article > res.jl

Run ``python -m autoextract --help`` to get a description of all supported
options.
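
To post-process the output, a small reader like the one below may help.
It assumes that ``res.jl`` is in JSON Lines format (one JSON document per
line), which the file extension suggests but this README does not state;
``read_jsonlines`` is a hypothetical helper name::

    import json

    def read_jsonlines(fp):
        """Yield one parsed object per non-empty line of a text stream."""
        for line in fp:
            line = line.strip()
            if line:
                yield json.loads(line)

    # Example use (assumes res.jl was produced by the command above):
    #
    #     with open('res.jl') as f:
    #         for item in read_jsonlines(f):
    #             print(item)

Reading line by line keeps memory use flat even for large result files.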

Synchronous API
---------------

The synchronous API provides an easy way to try AutoExtract in a script.
For production use, the asyncio API is strongly recommended.

You can send requests as described in `API docs`_::

    from autoextract.sync import request_raw
    query = [{'url': 'http://example.com/foo', 'pageType': 'article'}]
    results = request_raw(query)

Note that if there are several URLs in the query, results can be returned in
arbitrary order.
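
If you need the original order back, one option is to index the results by
URL. The sketch below assumes that every result echoes the requested URL
under ``result['query']['userQuery']['url']``; that layout is an assumption
here, so check the `API docs`_ before relying on it. ``order_by_query`` is a
hypothetical helper::

    def order_by_query(urls, results):
        """Return results arranged to match ``urls`` (None where missing)."""
        # assumed response layout: result['query']['userQuery']['url']
        by_url = {r['query']['userQuery']['url']: r for r in results}
        return [by_url.get(url) for url in urls]

Duplicate URLs in the query all map to the same result in this sketch.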

There is also an ``autoextract.sync.request_batch`` helper, which accepts URLs
and a page type, and ensures results are in the same order as the requested
URLs::

    from autoextract.sync import request_batch
    urls = ['http://example.com/foo', 'http://example.com/bar']
    results = request_batch(urls, page_type='article')

.. note::
    Currently ``request_batch`` is limited to 100 URLs at a time.
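
One way to work within this limit is to split a longer URL list into chunks
and call the helper once per chunk. The sketch below assumes that
``request_batch`` returns a list-like sequence of results; ``chunks`` and
``request_in_chunks`` are hypothetical helpers, not part of the package::

    def chunks(items, size=100):
        """Yield consecutive slices of ``items`` with at most ``size`` elements."""
        for start in range(0, len(items), size):
            yield items[start:start + size]

    def request_in_chunks(urls, page_type, request_fn, size=100):
        """Call ``request_fn`` once per chunk and concatenate the results."""
        results = []
        for chunk in chunks(urls, size):
            results.extend(request_fn(chunk, page_type=page_type))
        return results

    # Example use (request_batch imported from autoextract.sync):
    #
    #     results = request_in_chunks(urls, 'article', request_batch)

The chunks are sent one after another; for large jobs the asyncio API is
the better fit.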

.. _API docs: https://doc.scrapinghub.com/autoextract.html


asyncio API
-----------

Basic usage is similar to the sync API (``request_raw``),
but an asyncio event loop is used::

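    from autoextract.aio import request_raw

    query = [{'url': 'http://example.com/foo', 'pageType': 'article'}]

    async def foo():
        # request_raw is a coroutine here, so it must be awaited
        results1 = await request_raw(query)
        # ...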
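
A coroutine such as ``foo`` only runs once it is scheduled on an event loop.
A minimal sketch of driving one from ordinary synchronous code, using only
the standard library (``run_sync`` is a hypothetical helper name)::

    import asyncio

    def run_sync(coro_fn, *args, **kwargs):
        """Run a coroutine function to completion on a fresh event loop."""
        loop = asyncio.new_event_loop()
        try:
            return loop.run_until_complete(coro_fn(*args, **kwargs))
        finally:
            loop.close()

    # Example use:
    #
    #     results = run_sync(request_raw, query)

On Python 3.7+, ``asyncio.run(request_raw(query))`` covers the same case.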

There is also a ``request_parallel`` function, which lets you process
many URLs in parallel, using both batching and multiple connections::

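    import sys
    # ApiError is assumed here to be importable from autoextract.aio
    from autoextract.aio import request_parallel, create_session, ApiError

    urls = ['http://example.com/foo', 'http://example.com/bar']

    async def foo():
        async with create_session() as session:
            res_iter = request_parallel(urls, page_type='article',
                                        n_conn=10, batch_size=3,
                                        session=session)
            for f in res_iter:
                try:
                    # each item is awaited to obtain one batch of results
                    batch_result = await f
                    for res in batch_result:
                        # do something with a result
                        print(res)
                except ApiError as e:
                    print(e, file=sys.stderr)
                    raise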

The ``request_parallel`` and ``request_raw`` functions handle throttling
(HTTP 429 errors) and network errors by retrying the request in these cases.
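
You do not need to write this retry logic yourself. Purely to illustrate the
general idea of retrying a throttled request with exponential backoff, here
is a standard-library sketch. It is not how the package implements retries;
``Throttled`` and ``retry_throttled`` are hypothetical names::

    import asyncio

    class Throttled(Exception):
        """Hypothetical stand-in for an HTTP 429 response."""

    async def retry_throttled(make_request, max_attempts=5, base_delay=1.0):
        """Await ``make_request()``, retrying on Throttled with backoff."""
        for attempt in range(max_attempts):
            try:
                return await make_request()
            except Throttled:
                if attempt == max_attempts - 1:
                    raise  # give up after the last attempt
                # wait base_delay, then 2x, 4x, ... before the next try
                await asyncio.sleep(base_delay * 2 ** attempt)

Doubling the delay after each failure gives a throttling server
progressively more room to recover.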

The CLI implementation (``autoextract/__main__.py``) can serve
as a usage example.

Contributing
============

* Source code: https://github.com/scrapinghub/scrapinghub-autoextract
* Issue tracker: https://github.com/scrapinghub/scrapinghub-autoextract/issues

Use tox_ to run tests with different Python versions::

    tox

The command above also runs type checks; we use mypy.

.. _tox: https://tox.readthedocs.io


Changes
=======

TBA
---

Initial release.

