Metadata-Version: 1.1
Name: ftw.crawler
Version: 1.4.0
Summary: Crawl sites, extract text and metadata, index it in Solr
Home-page: https://github.com/4teamwork/ftw.crawler
Author: 4teamwork AG
Author-email: info@4teamwork.ch
License: GPL2
Description: ftw.crawler
        ===========
        
        Installation
        ------------
        
        The easiest way to install ``ftw.crawler`` is to create a buildout that
        contains the configuration, pulls in the egg using ``zc.recipe.egg``, and
        creates a script in the ``bin/`` directory that launches the crawler
        directly with the respective configuration as an argument:
        
        - First, create a configuration file for the crawler. You can base your
          configuration on `ftw/crawler/tests/assets/basic_config.py <https://github.com/4teamwork/ftw.crawler/blob/master/ftw/crawler/tests/assets/basic_config.py>`_ by copying
          it to your buildout and adapting it as needed.
        
          Make sure to configure at least the ``tika`` and ``solr`` URLs to point to
          the correct locations of the respective services, and to adapt the ``sites``
          list to your needs.
        
        - Create a buildout config that installs ``ftw.crawler`` using ``zc.recipe.egg``:
        
          ``crawler.cfg``
        
        	.. code:: ini
        
        		[buildout]
        		parts +=
        		    crawler
        		    crawl-foo-org
        
        		[crawler]
        		recipe = zc.recipe.egg
        		eggs = ftw.crawler
        
        
        - Further define a buildout section that creates a ``bin/crawl-foo-org``
          script, which will call ``bin/crawl foo_org_config.py`` using absolute paths
          (for easier use from cron jobs):
        
        	.. code:: ini
        
        		[crawl-foo-org]
        		recipe = collective.recipe.scriptgen
        		cmd = ${buildout:bin-directory}/crawl
        		arguments =
        		    ${buildout:directory}/foo_org_config.py
        		    --tika http://localhost:9998/
        		    --solr http://localhost:8983/solr
        
          (The ``--tika`` and ``--solr`` command line arguments are optional; they
          can also be set in the configuration file. If given, the command line
          arguments take precedence over any parameters in the config file.)
        
        
        - Add a buildout config that downloads and configures a Tika JAXRS server:
        
          ``tika-server.cfg``
        
        	.. code:: ini
        
        		[buildout]
        		parts +=
        		    supervisor
        		    tika-server-download
        		    tika-server
        
        		[supervisor]
        		recipe = collective.recipe.supervisor
        		plugins =
        		      superlance
        		port = 8091
        		user = supervisor
        		password = admin
        		programs =
        		    10 tika-server (stopasgroup=true) ${buildout:bin-directory}/tika-server true your_os_user
        
        		[tika-server-download]
        		recipe = hexagonit.recipe.download
        		url = http://repo1.maven.org/maven2/org/apache/tika/tika-server/1.5/tika-server-1.5.jar
        		md5sum = 0f70548f233ead7c299bf7bc73bfec26
        		download-only = true
        		filename = tika-server.jar
        
        		[tika-server]
        		port = 9998
        		recipe = collective.recipe.scriptgen
        		cmd = java
        		arguments = -jar ${tika-server-download:destination}/${tika-server-download:filename} --port ${:port}
        
          Modify ``your_os_user`` and the supervisor and Tika ports as needed.
        
        
        - Finally, add a `bootstrap.py <http://downloads.buildout.org/2/bootstrap.py>`_
          and create the ``buildout.cfg`` that pulls all of the above together:
        
          ``buildout.cfg``
        
        	.. code:: ini
        
        		[buildout]
        		extensions = mr.developer
        
        		extends =
        		    tika-server.cfg
        		    crawler.cfg
        
        
        - Bootstrap and run buildout:
        
        	.. code:: bash
        
        		python bootstrap.py
        		bin/buildout
        
        
        Running the crawler
        -------------------
        
        If you created the ``bin/crawl-foo-org`` script with the buildout described
        above, that's all you need to run the crawler:
        
        - Make sure Tika and Solr are running
        - Run ``bin/crawl-foo-org`` *(with either a relative or absolute path, working
          directory doesn't matter, so it can easily be called from a cron job)*
        
        
        Running ``bin/crawl`` directly
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        
        The ``bin/crawl-foo-org`` script is just a tiny wrapper that calls the
        ``bin/crawl`` script, generated by ``ftw.crawler``'s setuptools
        ``console_scripts`` entry point, with the absolute path to the configuration
        file as the only argument. Any other arguments to the ``bin/crawl-foo-org``
        script will be forwarded to ``bin/crawl``.
        
        Therefore running ``bin/crawl-foo-org [args]`` is equivalent to
        ``bin/crawl foo_org_config.py [args]``.
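        
        The forwarding behaviour described above can be pictured with a small Python
        sketch. It is not the script that ``collective.recipe.scriptgen`` actually
        generates; the paths and the ``build_command`` helper are hypothetical and
        only illustrate how the configuration path is placed in front of any extra
        arguments:
        
        .. code:: python
        
            import sys
        
            # Hypothetical absolute paths; in a real buildout they would be derived
            # from ${buildout:bin-directory} and ${buildout:directory}.
            CRAWL_SCRIPT = '/path/to/buildout/bin/crawl'
            CONFIG_FILE = '/path/to/buildout/foo_org_config.py'
        
        
            def build_command(extra_args):
                """Return the argument list the wrapper would hand to bin/crawl."""
                # The config path always comes first; anything passed to the
                # wrapper is appended unchanged.
                return [CRAWL_SCRIPT, CONFIG_FILE] + list(extra_args)
        
        
            if __name__ == '__main__':
                # Print the command instead of executing it, so the sketch can be
                # run without a crawler installation.
                print(' '.join(build_command(sys.argv[1:])))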
        
        Provide known sitemap URLs in site configs
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        
        If you know the sitemap URL, you can configure one or more sitemap URLs
        statically:
        
        .. code:: python
        
            Site('http://example.org/foo/',
                 sitemap_urls=['http://example.org/foo/the_sitemap.xml'])
        
        
        Configure site ID for purging
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        
        For purging to work smoothly, it is recommended to configure a crawler
        site ID. Make sure that each site ID is unique per Solr core!
        Candidate documents for purging are identified by this crawler site ID.
        
        .. code:: python
        
            Site('http://example.org/',
                 crawler_site_id='example.org-news')
        
        Be aware that your Solr core must provide a string field named
        ``crawler_site_id``.
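        
        Because site IDs have to be unique per Solr core, it can help to check a
        configuration for accidental duplicates before crawling. The following is a
        minimal sketch and not part of ``ftw.crawler``: the ``duplicate_site_ids``
        helper and the example IDs are hypothetical, and the IDs are plain strings
        collected by hand from all sites that index into the same core:
        
        .. code:: python
        
            from collections import Counter
        
        
            def duplicate_site_ids(site_ids):
                """Return the sorted site IDs that occur more than once."""
                counts = Counter(site_ids)
                return sorted(site_id for site_id, n in counts.items() if n > 1)
        
        
            # Hypothetical IDs of all sites that share one Solr core.
            site_ids = ['example.org-news', 'example.org-blog', 'example.org-news']
        
            # prints ['example.org-news']
            print(duplicate_site_ids(site_ids))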
        
        
        Indexing only a particular URL
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        
        If you only want to index a particular URL, pass that URL as the first
        argument to ``bin/crawl-foo-org``. The crawler will then only fetch and index
        that specific URL.
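        
        How the optional URL argument and the ``--tika`` / ``--solr`` overrides fit
        together can be sketched with ``argparse``. This is a stand-in and not the
        crawler's real argument parser: ``build_parser``, ``effective_settings``
        and the example values are hypothetical, and only the options named in this
        README are modelled. Values given on the command line win over values from
        the configuration file:
        
        .. code:: python
        
            import argparse
        
        
            def build_parser():
                # Stand-in parser, not the one shipped with ftw.crawler.
                parser = argparse.ArgumentParser(prog='crawl')
                parser.add_argument('config', help='path to the crawler config')
                parser.add_argument('url', nargs='?', default=None,
                                    help='optional single URL to fetch and index')
                parser.add_argument('--tika', default=None)
                parser.add_argument('--solr', default=None)
                return parser
        
        
            def effective_settings(file_settings, args):
                """Merge config file values with command line overrides."""
                settings = dict(file_settings)
                for key in ('tika', 'solr'):
                    value = getattr(args, key)
                    if value is not None:
                        # A value given on the command line takes precedence.
                        settings[key] = value
                return settings
        
        
            # Hypothetical values as they might be set in foo_org_config.py.
            file_settings = {'tika': 'http://localhost:9998/',
                             'solr': 'http://localhost:8983/solr'}
        
            args = build_parser().parse_args(
                ['foo_org_config.py', 'http://example.org/foo/page',
                 '--solr', 'http://solr.example.org:8983/solr'])
        
            print(args.url)
            print(effective_settings(file_settings, args))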
        
        
        Slack notifications
        -------------------
        
        ``ftw.crawler`` supports Slack notifications, which can be used to monitor
        the crawler for errors that occur while crawling.
        To enable Slack notifications for your environment:
        
        - Install ``ftw.crawler`` with the ``slack`` extra.
        - Set the ``SLACK_TOKEN`` and ``SLACK_CHANNEL`` parameters in your crawler
          config, or pass the ``--slacktoken`` and ``--slackchannel`` arguments on
          the command line when calling the ``bin/crawl`` script.
        
        To generate a valid Slack token for your integration, create a new bot in
        your Slack team. Once the bot is created, Slack automatically generates a
        valid token for it, which you can then use for your integration.
        You can also generate a test token to try out your integration, but don't
        forget to create a bot before your application goes to production!
        
        
        Development
        -----------
        
        To start hacking on ``ftw.crawler``, use the ``development.cfg`` buildout:
        
        
        .. code:: bash
        
        	ln -s development.cfg buildout.cfg
        	python bootstrap.py
        	bin/buildout
        
        This will build a Tika JAXRS server and a Solr instance for you. The Solr
        configuration is set up to be compatible with the testing / example
        configuration at `ftw/crawler/tests/assets/basic_config.py <https://github.com/4teamwork/ftw.crawler/blob/master/ftw/crawler/tests/assets/basic_config.py>`_.
        
        To run the crawler against the example configuration:
        
        .. code:: bash
        
        	bin/tika-server
        	bin/solr-instance fg
        	bin/crawl ftw/crawler/tests/assets/basic_config.py
        
        
        Links
        -----
        
        - GitHub: https://github.com/4teamwork/ftw.crawler
        - Issues: https://github.com/4teamwork/ftw.crawler/issues
        - PyPI: http://pypi.python.org/pypi/ftw.crawler
        - Continuous integration: https://jenkins.4teamwork.ch/search?q=ftw.crawler
        
        
        Copyright
        ---------
        
        This package is copyright by `4teamwork <http://www.4teamwork.ch/>`_.
        
        ``ftw.crawler`` is licensed under the GNU General Public License, version 2.
        
        Changelog
        =========
        
        
        1.4.0 (2017-11-08)
        ------------------
        
        - Add crawler_site_id option for improving purging. [jone]
        
        1.3.0 (2017-11-03)
        ------------------
        
        - Fix purging problem.
          Warning: updating "ftw.crawler" to this version breaks your existing crawlers
          when you set the site url to a sitemap url. Please use the "sitemap_urls"
          attribute instead. You also need to purge the Solr index manually and reindex.
          [jone]
        
        1.2.1 (2017-10-30)
        ------------------
        
        - Encode URL in UTF-8 before generating MD5-Hash.
          [raphael-s]
        
        
        1.2.0 (2017-06-22)
        ------------------
        
        - Support Slack notifications.
          [raphael-s]
        
        
        1.1.0 (2016-10-04)
        ------------------
        
        - Support configuration of absolute sitemap urls. [jone]
        
        - Slow down on too many requests. [jone]
        
        
        1.0 (2015-11-09)
        ----------------
        
        - Initial implementation.
          [lgraf]
        
Keywords: crawling extraction solr
Platform: UNKNOWN
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: GNU General Public License (GPL)
Classifier: Programming Language :: Python
Classifier: Topic :: Software Development
