Metadata-Version: 2.1
Name: scrapy-toolbox
Version: 0.2.2
Summary: Error Handling and Processing for your Scrapy Exceptions
Home-page: https://github.com/janwendt/scrapy-toolbox
Author: Jan Wendt
License: UNKNOWN
Download-URL: https://github.com/janwendt/scrapy-toolbox/archive/0.2.2.tar.gz
Platform: UNKNOWN
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Requires-Python: >=3.6
Description-Content-Type: text/markdown
Requires-Dist: scrapy
Requires-Dist: sqlalchemy
Requires-Dist: sqlalchemy-utils
Requires-Dist: gitpython
Requires-Dist: pygithub

scrapy-toolbox
=============

A Python library that extends Scrapy with the following features:
- Error saving to the database table "__errors" for manual error analysis (including traceback and response) and automated request reconstruction. The table contains the following columns:
  - failed_at
  - spider
  - traceback
  - url (original url)
  - request_method
  - request_url
  - request_meta (json dump that can be loaded with json.loads())
  - request_cookies (json dump that can be loaded with json.loads())
  - request_headers (json dump that can be loaded with json.loads())
  - request_body
  - response_status
  - response_url
  - response_headers (json dump that can be loaded with json.loads())
  - response_body
- Error Processing with request reconstruction
- DatabasePipeline for SQLAlchemy
- Mail notification when an exception occurs. HTTP errors (404, 502, ...) are excluded and only stored in the database.
- Automatic GitHub issue creation when an exception occurs. HTTP errors (404, 502, ...) are excluded and only stored in the database.
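
The JSON-encoded columns listed above can be decoded with the standard library. The following is a minimal sketch, not the library's own reconstruction code: the `row` dictionary, its values, and the `request_kwargs_from_row` helper are hypothetical and only illustrate how a stored row might be turned back into request arguments.

```python
import json

# Hypothetical row as it might be read from the "__errors" table.
# The column names follow the list above; the values are invented.
row = {
    "spider": "spider_xyz",
    "request_method": "GET",
    "request_url": "https://example.com/page?id=1",
    "request_meta": json.dumps({"depth": 2}),
    "request_cookies": json.dumps({"session": "abc"}),
    "request_headers": json.dumps({"Accept": ["text/html"]}),
    "request_body": "",
}

# Columns that are stored as JSON dumps according to the list above.
JSON_COLUMNS = ("request_meta", "request_cookies", "request_headers")


def request_kwargs_from_row(row):
    """Decode the JSON columns and collect keyword arguments for a new request."""
    decoded = {name: json.loads(row[name]) for name in JSON_COLUMNS}
    return {
        "url": row["request_url"],
        "method": row["request_method"],
        "meta": decoded["request_meta"],
        "cookies": decoded["request_cookies"],
        "headers": decoded["request_headers"],
        "body": row["request_body"],
    }


kwargs = request_kwargs_from_row(row)
print(kwargs["meta"])
```

The keys of the resulting dictionary match parameter names of `scrapy.Request`, so in a spider it could be passed on as `scrapy.Request(**kwargs)`.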

Requisites
-----------

* Set the environment variable "PRODUCTION" to enable production mode, for instance in your Dockerfile.
* The ErrorSavingMiddleware defines an errback callback for your Requests. To use this feature, do not define an errback of your own.
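
As a rough illustration of the production switch, the sketch below picks between the `DATABASE` and `DATABASE_DEV` settings from the Setup section. It rests on assumptions: that any non-empty value of `PRODUCTION` enables production mode and that production mode selects `DATABASE`. The `select_database` helper is hypothetical and not part of scrapy-toolbox.

```python
import os


def select_database(settings, environ=os.environ):
    """Return DATABASE when PRODUCTION is set, otherwise DATABASE_DEV."""
    # Assumption: any non-empty value of PRODUCTION enables production mode.
    if environ.get("PRODUCTION"):
        return settings["DATABASE"]
    return settings["DATABASE_DEV"]


# Invented settings, reduced to the host for brevity.
settings = {
    "DATABASE": {"host": "db.example.com"},
    "DATABASE_DEV": {"host": "127.0.0.1"},
}

print(select_database(settings, environ={})["host"])
print(select_database(settings, environ={"PRODUCTION": "1"})["host"])
```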

Installation
------------

  ```
  pip install --upgrade scrapy-toolbox
  ```

Setup
-----

Add the scrapy_toolbox middlewares to the `settings.py` of your Scrapy project and set `DATABASE` and `DATABASE_DEV`.

  ```
  # settings.py
  SPIDER_MIDDLEWARES = {
      'scrapy_toolbox.database.DatabasePipeline': 999,
      'scrapy_toolbox.error_handling.ErrorSavingMiddleware': 1000,
      'scrapy_toolbox.error_processing.ErrorProcessingMiddleware': 1000,
  }

  # Example when using MySQL
  DATABASE = {
      'drivername': 'mysql+pymysql',
      'username': '...',
      'password': '...',
      'database': '...',
      'host': '...',
      'port': '3306'
  }

  DATABASE_DEV = {
      'drivername': 'mysql+pymysql',
      'username': '...',
      'password': '...',
      'database': '...',
      'host': '127.0.0.1',
      'port': '3306'
  }

  CREATE_GITHUB_ISSUE = True # Toggle GitHub Issue creation
  GITHUB_TOKEN = "..."
  GITHUB_REPO = "janwendt/scrapy-toolbox" # for instance

  SEND_MAILS = True # Toggle Mail Notification
  MAIL_HOST = "..."
  MAIL_FROM = "..."
  MAIL_TO = "..."
  ```
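
The keys of `DATABASE` and `DATABASE_DEV` correspond to the components of a SQLAlchemy database URL. How scrapy-toolbox builds its connection internally is not stated here, so the following is only a sketch that assembles such a URL by hand with the standard library. The `build_database_url` helper and the example values are hypothetical.

```python
from urllib.parse import quote


def build_database_url(config):
    """Assemble drivername://username:password@host:port/database."""
    # Percent-encode the credentials so characters like "@" or "/" stay unambiguous.
    user = quote(config["username"], safe="")
    password = quote(config["password"], safe="")
    return (
        f"{config['drivername']}://{user}:{password}"
        f"@{config['host']}:{config['port']}/{config['database']}"
    )


# Invented values in the same shape as the DATABASE_DEV setting.
example = {
    "drivername": "mysql+pymysql",
    "username": "scraper",
    "password": "p@ss/word",
    "database": "cars",
    "host": "127.0.0.1",
    "port": "3306",
}

print(build_database_url(example))
```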

Usage
-----
Spider (import ErrorCatcher first, before scrapy):
  ```
  from scrapy_toolbox.error_handling import ErrorCatcher
  import scrapy
  ...

  class XyzSpider(scrapy.Spider, metaclass=ErrorCatcher):
  ...
  ```

Database Pipeline:
  ```
  # pipelines.py
  from scrapy_toolbox.database import DatabasePipeline

  class ScraperXYZPipeline(DatabasePipeline):
    def process_item(self, item, spider):
        session = self.session
        ...
  ```

  ```
  # models.py
  import scrapy_toolbox.database as db

  # then use db.DeclarativeBase as your declarative base
  class Car(db.DeclarativeBase):
    ...
  ```

Query Data:
  ```
  # spiderXYZ.py
  session = self.crawler.database_session
  session.query(models.Market.id, models.Market.zip_code).all()
  ```

Process Errors:
  ```
  scrapy crawl spider_xyz -a process_errors=True
  ```

Limitations
------------------
Syntax Errors in your settings.py are not handled.

Supported versions
------------------
This package works with Python 3. It has been tested with Scrapy up to version 1.4.0.

Tasklist
------------------
- [ ] Catalog exceptions so that each exception creates only one new GitHub issue

Build Release
------------------
```
python setup.py sdist bdist_wheel
twine upload dist/*
```


