Metadata-Version: 2.1
Name: scrapy-autoextract
Version: 0.1
Summary: Scrapinghub AutoExtract API integration for Scrapy
Home-page: https://github.com/scrapinghub/scrapy-autoextract
Author: Scrapinghub Inc
Author-email: info@scrapinghub.com
Maintainer: Scrapinghub Inc
Maintainer-email: info@scrapinghub.com
License: UNKNOWN
Keywords: scrapy autoextract middleware
Platform: UNKNOWN
Classifier: Development Status :: 4 - Beta
Classifier: Environment :: Console
Classifier: Intended Audience :: Developers
Classifier: Natural Language :: English
Classifier: Operating System :: OS Independent
Classifier: License :: OSI Approved :: BSD License
Classifier: Programming Language :: Python :: 3 :: Only
Classifier: Programming Language :: Python :: 3.6
Classifier: Programming Language :: Python :: 3.7
Classifier: Framework :: Scrapy

====================================
Scrapy & AutoExtract API integration
====================================

This library integrates Scrapinghub's AutoExtract API, an AI-enabled automatic
data extraction service, into a Scrapy spider through a downloader middleware.
The middleware stores the AutoExtract result in ``response.meta['autoextract']``
for the spider to consume.


Installation
============

::

    pip install scrapy-autoextract

scrapy-autoextract requires Python 3.5+.


Configuration
=============

Add the AutoExtract downloader middleware in the settings file::

    DOWNLOADER_MIDDLEWARES = {
        'scrapy_autoextract.AutoExtractMiddleware': 543,
    }

Note that this should be the last downloader middleware to be executed.


Usage
=====

The middleware is opt-in and must be explicitly enabled per request
by setting ``{'autoextract': {'enabled': True}}`` in the request meta.
All the options below can be set either in the project settings file
or, for specific spiders only, in the ``custom_settings`` dict.

Available settings:

- ``AUTOEXTRACT_USER`` [mandatory] is your AutoExtract API key.
- ``AUTOEXTRACT_URL`` [optional] is the AutoExtract service URL. Defaults to autoextract.scrapinghub.com.
- ``AUTOEXTRACT_TIMEOUT`` [optional] sets the response timeout from AutoExtract. Defaults to 660 seconds.
  Can also be defined by setting ``download_timeout`` in the request meta.
- ``AUTOEXTRACT_PAGE_TYPE`` [mandatory] defines the kind of document to be extracted.
  The currently available options are ``"product"`` and ``"article"``.
  Can also be defined on ``spider.page_type``, or with the ``{'autoextract': {'pageType': '...'}}`` request meta.
  This is required for the AutoExtract classifier to know what kind of page needs to be extracted.
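
As a rough sketch of how these options fit together, the snippet below builds
per-spider settings and per-request meta as plain dictionaries. The names
``CUSTOM_SETTINGS`` and ``autoextract_meta`` and the placeholder API key are
illustrative assumptions, not part of the library; only the setting names and
meta keys come from the list above.

```python
# Per-spider settings, meant to be assigned to a spider's ``custom_settings``.
# The API key value is a placeholder.
CUSTOM_SETTINGS = {
    "AUTOEXTRACT_USER": "<your-api-key>",
    "AUTOEXTRACT_PAGE_TYPE": "article",
    "AUTOEXTRACT_TIMEOUT": 660,  # seconds, the documented default
}


def autoextract_meta(page_type=None, timeout=None):
    """Build request meta that opts a single request in to AutoExtract.

    Hypothetical helper, not provided by scrapy-autoextract.
    """
    meta = {"autoextract": {"enabled": True}}
    if page_type is not None:
        # Per-request way to define the page type.
        meta["autoextract"]["pageType"] = page_type
    if timeout is not None:
        # Per-request way to define the response timeout, in seconds.
        meta["download_timeout"] = timeout
    return meta
```

In a spider, the returned dictionary would typically be passed along as
``scrapy.Request(url, meta=autoextract_meta("product"))``.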


Within the spider, consuming the AutoExtract result is as easy as::

    def parse(self, response):
        yield response.meta['autoextract']
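
Because the middleware is opt-in, a callback may also receive responses for
requests that never enabled it. A slightly more defensive variant, sketched
below, checks for the key first. The function name ``iter_autoextract_items``
is hypothetical, and the sketch assumes that requests which did not opt in
carry no ``autoextract`` key in their meta. Nothing is assumed about the
structure of the AutoExtract result itself.

```python
from types import SimpleNamespace


def iter_autoextract_items(response):
    """Yield the AutoExtract result if the response carries one.

    Hypothetical helper; assumes requests that did not opt in have no
    'autoextract' key in their meta.
    """
    result = response.meta.get("autoextract")
    if result is not None:
        yield result


# Stand-in objects that only mimic the ``meta`` attribute of a Scrapy response.
extracted = SimpleNamespace(meta={"autoextract": {"example": "payload"}})
plain = SimpleNamespace(meta={})

print(list(iter_autoextract_items(extracted)))
print(list(iter_autoextract_items(plain)))
```

Inside a spider, ``parse`` could delegate to it with
``yield from iter_autoextract_items(response)``.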


Limitations
===========

When using the AutoExtract middleware, there are some limitations.

* The incoming spider request is rendered by AutoExtract, not just downloaded by Scrapy,
  which can change the result - the IP is different, headers are different, etc.
* Only GET requests are supported
* Custom headers and cookies are not supported (i.e. Scrapy features to set them don't work)
* Proxies are not supported (they would work incorrectly,
  sitting between Scrapy and AutoExtract, instead of AutoExtract and website)
* The AutoThrottle extension can work incorrectly for AutoExtract requests,
  because AutoExtract response times can be much longer than the time required to download a page,
  so it's best to set ``AUTOTHROTTLE_ENABLED=False`` in the settings.
* Redirects are handled by AutoExtract, not by Scrapy,
  so Scrapy's redirect middlewares might have no effect
* Retries should be disabled, because AutoExtract handles them internally
  (use ``RETRY_ENABLED=False`` in the settings).
  The exception is when too many requests are sent in a short amount of time
  and AutoExtract returns HTTP code 429.
  For that case it's best to use ``RETRY_HTTP_CODES=[429]``.
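
Taken together, the settings-related points above might translate into a
settings fragment like the following. This is a sketch, not a configuration
recommended by the library authors. The second variant assumes retries stay
enabled, since Scrapy ignores ``RETRY_HTTP_CODES`` when ``RETRY_ENABLED`` is
``False``.

```python
# settings.py fragment (sketch)

# AutoExtract response times would skew AutoThrottle's latency measurements.
AUTOTHROTTLE_ENABLED = False

# Variant 1: let AutoExtract handle retries internally.
RETRY_ENABLED = False

# Variant 2 (use instead of variant 1 when hitting rate limits): retry only
# on HTTP 429.
# RETRY_ENABLED = True
# RETRY_HTTP_CODES = [429]
```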


