Metadata-Version: 1.1
Name: basiccrawler
Version: 0.1.0
Summary: Basic web crawler that automates website exploration and produces web resource trees.
Home-page: https://github.com/learningequality/BasicCrawler
Author: Learning Equality
Author-email: ivan@learningequality.org
License: MIT license
Description-Content-Type: UNKNOWN
Description: BasicCrawler
        ============
        
        Basic web crawler that automates website exploration and produces
        web resource trees.
        
        Version 0.2 TODO
        ----------------
        
        -  Finish "is file" logic to check content-type before downloading to
           avoid large downloads

           -  infer the file type from the extension in the URL
        
        -  Make a single IGNORE\_URLS list that accepts (see the sketch after
           this list):

           -  full URLs (strings)
           -  compiled RE objects
           -  functions for deciding what to ignore (anything callable)
        
        -  path to url / vice versa (and possibly elsewhere): consider
           ``urllib.parse.urlparse``? [e.g. ``url.startswith(source_domain)``
           could be ``source_domain in url.domain`` to make it more flexible
           with subdomains]
        -  Additional valid domains can be specified but ``url_to_path_list``
           assumes adding CHANNEL\_ROOT\_DOMAIN [we may wish to expand all links
           based on parent URL]
        -  refactor and remove need for MAIN\_SOURCE\_DOMAIN and use only
           SOURCE\_DOMAINS instead
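
        A minimal sketch of how such a mixed-type IGNORE\_URLS list could be
        matched (the ``should_ignore`` helper and the list entries are
        hypothetical, not part of the current API)::

            import re

            # Hypothetical mixed-type ignore list: strings, compiled REs, callables.
            IGNORE_URLS = [
                'https://learningequality.org/about/',   # exact URL (string)
                re.compile(r'.*/contact.*'),             # compiled RE object
                lambda url: url.endswith('.zip'),        # anything callable
            ]

            def should_ignore(url, ignore_list=IGNORE_URLS):
                """Return True if `url` matches any entry of the ignore list."""
                for rule in ignore_list:
                    if isinstance(rule, str):
                        if url == rule:
                            return True
                    elif hasattr(rule, 'match'):
                        if rule.match(url):
                            return True
                    elif callable(rule):
                        if rule(url):
                            return True
                return False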
        
        Feature ideas
        -------------
        
        -  Asynchronous download (not necessary but might be good for
           performance on large sites)
        -  don't block on HTTP requests
        -  allow multiple workers consuming from the crawl queue
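
        A rough sketch of the multiple-workers idea, using threads pulling
        from a shared queue (all names here are illustrative; none of this is
        part of the current API)::

            import queue
            import threading

            import requests

            def worker(url_queue, results):
                # Pull URLs from the shared queue and fetch them until it drains.
                while True:
                    try:
                        url = url_queue.get_nowait()
                    except queue.Empty:
                        return
                    results[url] = requests.get(url).text

            url_queue = queue.Queue()
            url_queue.put('https://learningequality.org/')

            results = {}
            threads = [threading.Thread(target=worker, args=(url_queue, results))
                       for _ in range(4)]
            for t in threads:
                t.start()
            for t in threads:
                t.join()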
        
        Usage
        -----
        
        The goal of the ``BasicCrawler`` class is to help with the initial
        exploration of the source website. It is your responsibility to write a
        subclass that uses the HTML, URL structure, and content to guide the
        crawling and produce the web resource tree.
        
        The workflow is as follows:
        
        1. Create your subclass
        
        -  set the following attributes:
        
           -  ``MAIN_SOURCE_DOMAIN`` e.g. ``'https://learningequality.org'``
           -  ``START_PAGE`` e.g. ``'https://learningequality.org/'``
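
        A minimal subclass might look like this (assuming the class is
        importable as ``basiccrawler.crawler.BasicCrawler``; check the package
        source for the exact import path)::

            from basiccrawler.crawler import BasicCrawler

            class MySiteCrawler(BasicCrawler):
                MAIN_SOURCE_DOMAIN = 'https://learningequality.org'
                START_PAGE = 'https://learningequality.org/'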
        
        2. Run for the first time by calling ``crawler.crawl()`` or as a command
           line script
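
        For example, a minimal way to run the subclass sketched in step 1 from
        its own module (assuming the constructor takes no required
        arguments)::

            if __name__ == '__main__':
                crawler = MySiteCrawler()
                crawler.crawl()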
        
        -  The BasicCrawler has basic logic for visiting pages and will print
           out a summary of the auto-inferred site structure findings and
           recommendations based on the URL structure observed during the
           initial crawl.
        -  Based on the number of times a link appears on different pages of
           the site, the crawler will suggest candidates for global navigation
           links. Most websites have an /about page, a /contact page, and
           other such non-content-containing pages, which we do not want to
           include in the web resource tree. You should inspect these
           suggestions and decide which should be ignored (i.e., not crawled
           or included in the web\_resource\_tree output). To ignore URLs you
           can edit the attributes:
        
           -  ``IGNORE_URLS`` (list of strings): the crawler will ignore these
              URLs
           -  ``IGNORE_URL_PATTERNS`` (list of RE objects): regular
              expressions that do the same thing

           Edit your crawler subclass' code and append to ``IGNORE_URLS`` and
           ``IGNORE_URL_PATTERNS`` the URLs you want to skip (anything that is
           not likely to contain content).
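
           For example, extending the hypothetical ``MySiteCrawler`` subclass
           sketched in step 1 (illustrative URLs only)::

               import re

               class MySiteCrawler(BasicCrawler):
                   MAIN_SOURCE_DOMAIN = 'https://learningequality.org'
                   START_PAGE = 'https://learningequality.org/'
                   # Pages unlikely to contain content, found during the crawl.
                   IGNORE_URLS = [
                       'https://learningequality.org/about/',
                       'https://learningequality.org/jobs/',
                   ]
                   IGNORE_URL_PATTERNS = [
                       re.compile(r'.*/contact.*'),
                   ]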
        
        3. Run the crawler again; this time there should be less noise in the
           output.
        
        -  Note the suggestions for different paths that you might want to
           handle specially (e.g. ``/course``, ``/lesson``, ``/content``,
           etc.). You can define class methods to handle each of these URL
           types:
        
           ::
        
                 def on_course(self, url, page, context):
                     # What you want the crawler to do when it visits the course
                     # at `url` in the given `context` (used for extra metadata;
                     # contains a reference to the parent node). The BeautifulSoup
                     # parsed contents of the `url` are provided as `page`.
                     pass

                 def on_lesson(self, url, page, context):
                     # What you want the crawler to do when it visits the lesson.
                     pass

                 def on_content(self, url, page, context):
                     # What you want the crawler to do when it visits the content URL.
                     pass
        
Keywords: basiccrawler
Platform: UNKNOWN
Classifier: Development Status :: 2 - Pre-Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Natural Language :: English
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.4
Classifier: Programming Language :: Python :: 3.5
