Metadata-Version: 1.1
Name: scrapy-eagle
Version: 0.0.10
Summary: Run Scrapy Distributed
Home-page: http://github.com/rafaelcapucho/scrapy-eagle
Author: Rafael Alfredo Capucho
Author-email: rafael.capucho@gmail.com
License: BSD
Description: .. image:: docs/images/logo_readme.jpg
        ======================================
        
        .. image:: https://travis-ci.org/rafaelcapucho/scrapy-eagle.svg?branch=master
            :target: https://travis-ci.org/rafaelcapucho/scrapy-eagle
        
        Scrapy Eagle is a tool that lets you run any project based on Scrapy_ in a distributed fashion and monitor its progress and the resources it consumes on each server.
        
        .. _Scrapy: http://scrapy.org
        
        **This project is under development; do not use it yet.**
        
        Requirements
        ------------
        
        Scrapy Eagle uses Redis_ as its distributed queue, so you will need a running Redis instance.
        
        .. _Redis: http://redis.io
        
        Installation
        ------------
        
        Install the package by running the command below:
        
        .. code-block:: console
        
            $ pip install scrapy-eagle
            
        You should create a ``configparser`` configuration file (e.g. ``/etc/scrapy-eagle.ini``) containing:
        
        .. code-block:: ini
        
            [redis]
            host = 10.10.10.10
            port = 6379
            db = 0
            
        Then you will be able to run the ``eagle_server`` command:
        
        .. code-block:: console
        
            eagle_server --config-file=/etc/scrapy-eagle.ini
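        
        As a rough illustration of what the configuration file holds, the sketch below parses the same ``[redis]`` section with Python 3's standard ``configparser`` module. The ``load_redis_settings`` helper and the ``SAMPLE_CONFIG`` string are hypothetical names used only for this example; this is not necessarily how ``eagle_server`` reads its configuration.
        
        .. code-block:: python
        
            import configparser
            
            # Same content as the example configuration file (hypothetical constant).
            SAMPLE_CONFIG = """
            [redis]
            host = 10.10.10.10
            port = 6379
            db = 0
            """
            
            
            def load_redis_settings(text):
                """Parse the [redis] section and return typed connection settings."""
                parser = configparser.ConfigParser()
                parser.read_string(text)
                section = parser["redis"]
                return {
                    "host": section.get("host"),
                    "port": section.getint("port"),  # convert "6379" to an int
                    "db": section.getint("db"),
                }
            
            
            print(load_redis_settings(SAMPLE_CONFIG))
        
        To check a file on disk instead, ``parser.read("/etc/scrapy-eagle.ini")`` can replace the ``read_string`` call.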
            
        Changes into your Scrapy project
        --------------------------------
        
        Enable the components in the ``settings.py`` of your Scrapy project:
        
        .. code-block:: python
        
          # Enables scheduling storing requests queue in redis.
          SCHEDULER = "scrapy_eagle.worker.scheduler.DistributedScheduler"
        
          # Ensure all spiders share same duplicates filter through redis.
          DUPEFILTER_CLASS = "scrapy_eagle.worker.dupefilter.RFPDupeFilter"
        
          # Schedule requests using a priority queue. (default)
          SCHEDULER_QUEUE_CLASS = 'scrapy_eagle.worker.queue.SpiderPriorityQueue'
        
          # Alternative: schedule requests using a queue (FIFO).
          # SCHEDULER_QUEUE_CLASS = 'scrapy_eagle.worker.queue.SpiderQueue'
        
          # Alternative: schedule requests using a stack (LIFO).
          # SCHEDULER_QUEUE_CLASS = 'scrapy_eagle.worker.queue.SpiderStack'
        
          # Max idle time to prevent the spider from being closed during distributed crawling.
          # This only works if the queue class is SpiderQueue or SpiderStack,
          # and it may also block for the same time when your spider starts for the first time (because the queue is empty).
          SCHEDULER_IDLE_BEFORE_CLOSE = 0
        
          # Specify the host and port to use when connecting to Redis (optional).
          REDIS_HOST = 'localhost'
          REDIS_PORT = 6379
        
          # Specify the full Redis URL for connecting (optional).
          # If set, this takes precedence over the REDIS_HOST and REDIS_PORT settings.
          REDIS_URL = 'redis://user:pass@hostname:9001'
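        
        To make the three queue disciplines concrete, the standalone sketch below mimics their ordering with Python's standard library. It does not use Redis or the ``scrapy_eagle`` queue classes; the function names and sample requests are hypothetical, and it assumes, as in Scrapy, that a higher priority value is served first.
        
        .. code-block:: python
        
            import heapq
            from collections import deque
            
            # Hypothetical pending requests as (name, priority) pairs, in arrival order.
            REQUESTS = [("page-a", 0), ("page-b", 10), ("page-c", 5)]
            
            
            def fifo_order(items):
                """Queue (FIFO): the oldest request is served first."""
                queue = deque(items)
                return [queue.popleft()[0] for _ in range(len(items))]
            
            
            def lifo_order(items):
                """Stack (LIFO): the newest request is served first."""
                stack = list(items)
                return [stack.pop()[0] for _ in range(len(items))]
            
            
            def priority_order(items):
                """Priority queue: the highest priority value is served first."""
                # heapq is a min-heap, so the priority is negated; the index keeps
                # arrival order among requests with equal priority.
                heap = [(-prio, index, name) for index, (name, prio) in enumerate(items)]
                heapq.heapify(heap)
                return [heapq.heappop(heap)[2] for _ in range(len(items))]
            
            
            print(fifo_order(REQUESTS))      # ['page-a', 'page-b', 'page-c']
            print(lifo_order(REQUESTS))      # ['page-c', 'page-b', 'page-a']
            print(priority_order(REQUESTS))  # ['page-b', 'page-c', 'page-a']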
          
        Once the configuration is finished, you should adapt each spider to use our Mixin:
        
        .. code-block:: python
        
            from scrapy.spiders import CrawlSpider, Rule
            from scrapy_eagle.worker.spiders import DistributedMixin
            
            class YourSpider(DistributedMixin, CrawlSpider):
            
                name = "domain.com"
            
                # start_urls = ['http://www.domain.com/']
                redis_key = 'domain.com:start_urls'
                
                rules = (
                    Rule(...),
                    Rule(...),
                )
                
                def _set_crawler(self, crawler):
                    CrawlSpider._set_crawler(self, crawler)
                    DistributedMixin.setup_redis(self)
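        
        Because ``start_urls`` is commented out, the spider presumably takes its start URLs from the Redis key named in ``redis_key``. The sketch below shows one way to seed that key with the ``redis`` Python package. It assumes the key is a Redis list, as in scrapy-redis; ``seed_start_urls`` and ``connect`` are hypothetical helpers, not part of Scrapy Eagle.
        
        .. code-block:: python
        
            def seed_start_urls(client, redis_key, urls):
                """Push start URLs onto the list stored at ``redis_key``.
            
                ``client`` is any object with a redis-py style ``lpush`` method.
                Assumption: the spider reads its start URLs from a Redis list.
                """
                return client.lpush(redis_key, *urls)
            
            
            def connect(host="localhost", port=6379, db=0):
                """Create a Redis client (requires the ``redis`` package)."""
                import redis  # imported lazily so the helper above has no hard dependency
            
                return redis.StrictRedis(host=host, port=port, db=db)
        
        With a Redis instance running, ``seed_start_urls(connect(), 'domain.com:start_urls', ['http://www.domain.com/'])`` would queue the first URL for the spider above.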
        
        
        Dashboard Development
        ---------------------
        
        If you would like to change the client side, you will need NPM_ installed, because we use ReactJS_ to build our interface. To install all dependencies locally:
        
        .. _ReactJS: https://facebook.github.io/react/
        .. _NPM: https://www.npmjs.com/
        
        .. code-block:: console
        
            cd scrapy-eagle/dashboard
            npm install 
        
        Then you can run ``npm start`` to compile the code, watch for changes, and recompile automatically.
        
        To make the dashboard easier to test, you can use a simple HTTP server instead of running ``eagle_server``, for example:
        
        .. code-block:: console
        
            sudo npm install -g http-server
            cd scrapy-eagle/dashboard
            http-server templates/
        
        **Note**: So far, Scrapy Eagle is mostly based on https://github.com/rolando/scrapy-redis.
        
Platform: UNKNOWN
Classifier: Development Status :: 3 - Alpha
Classifier: Framework :: Scrapy
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 2.7
Classifier: Programming Language :: Python :: 3.5
Classifier: Intended Audience :: Developers
