Metadata-Version: 2.0
Name: scrape
Version: 0.0.11
Summary: a webpage scraping tool
Home-page: https://github.com/huntrar/scrape
Author: Hunter Hammond
Author-email: huntrar@gmail.com
License: MIT
Keywords: scrape webpage website pdf text keyword crawl save page
Platform: UNKNOWN
Classifier: Development Status :: 4 - Beta
Classifier: Environment :: Console
Classifier: Environment :: Web Environment
Classifier: Intended Audience :: End Users/Desktop
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: System Administrators
Classifier: Programming Language :: Python :: 2
Classifier: Programming Language :: Python :: 2.6
Classifier: Programming Language :: Python :: 2.7
Requires-Dist: lxml
Requires-Dist: pdfkit

# scrape

a web scraping tool

## Installation
* `pip install scrape`

## Usage
    usage: scrape [-h] [-f [FILTER [FILTER ...]]] [-c [CRAWL [CRAWL ...]]] [-ca]
                  [-l LIMIT] [-t] [-vb] [-v]
                  [urls [urls ...]]

    a web scraping tool

    positional arguments:
      urls                  urls to scrape

    optional arguments:
      -h, --help            show this help message and exit
      -f [FILTER [FILTER ...]], --filter [FILTER [FILTER ...]]
                            filter lines by keywords, text only
      -c [CRAWL [CRAWL ...]], --crawl [CRAWL [CRAWL ...]]
                            enter keywords to crawl links
      -ca, --crawl-all      crawl all links
      -l LIMIT, --limit LIMIT
                            crawl page limit
      -t, --text            write to text instead of pdf
      -vb, --verbose        show pdfkit messages
      -v, --version         display current version

## Author
* Hunter Hammond (huntrar@gmail.com)

## Notes
* Unless the --text flag is given, all webpages are saved as PDF files using pdfkit.

* The --filter flag may be used together with --text to save only the lines that match one or more of the provided keywords.

* Links found on scraped pages may be followed by passing --crawl-all or --crawl. --crawl accepts a list of substrings that control which URLs are crawled, while --crawl-all attempts to follow links indefinitely.

* There is no limit on the number of pages crawled unless one is set with the --limit flag. To cancel crawling and begin processing, press Ctrl-C.
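
The filtering and crawl-selection behaviour described in these notes can be pictured with a minimal Python sketch. This is not the tool's actual implementation: the function names `filter_lines` and `select_links` are hypothetical, and case-insensitive keyword matching and link de-duplication are assumptions made for illustration.

```python
def filter_lines(lines, keywords):
    """Keep only the lines that contain at least one keyword.

    Assumption: matching is case-insensitive; the real tool may differ.
    """
    lowered = [keyword.lower() for keyword in keywords]
    return [line for line in lines
            if any(keyword in line.lower() for keyword in lowered)]


def select_links(links, substrings=None, limit=None):
    """Choose which links to crawl.

    substrings=None mimics --crawl-all (every link is kept); a list of
    substrings mimics --crawl (a link is kept if it contains any of them).
    limit mimics --limit by capping the number of links returned.
    Assumption: duplicate links are visited only once.
    """
    seen = set()
    selected = []
    for link in links:
        # Stop as soon as the page limit is reached.
        if limit is not None and len(selected) >= limit:
            break
        if link in seen:
            continue
        if substrings is not None and not any(s in link for s in substrings):
            continue
        seen.add(link)
        selected.append(link)
    return selected


if __name__ == "__main__":
    text = ["Python is fun", "nothing here", "I like PDF files"]
    print(filter_lines(text, ["python", "pdf"]))

    found = ["http://example.com/docs", "http://example.com/blog"]
    print(select_links(found, substrings=["docs"], limit=5))
```

Returning a list in the original order keeps the crawl deterministic, which makes a page limit easier to reason about.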



News
====

0.0.11
------

 - fixed missing comma in install_requires in setup.py
 - also labeled now as beta as there are still some kinks with crawling

0.0.10
------

 - pdfkit load errors are now ignored only when more than one link is given, to avoid creating an empty pdf when an error occurs

0.0.9
------

 - pdfkit now ignores load errors and writes as many pages as possible

0.0.8
------

 - better implementation of crawler, can now scrape entire websites
 - added OrderedSet class to utils.py

0.0.7
------

 - changed --keywords to --filter and positional arg url to urls

0.0.6
------

 - use --keywords flag for filtering text
 - can pass multiple links now
 - will not write empty files anymore

0.0.5
------

 - added --verbose argument for use with pdfkit
 - improved output file name processing

0.0.4
------

 - accepts 0 or 1 URLs, allowing a call with just --version

0.0.3
------

 - Moved utils.py to scrape/

0.0.2
------

 - First entry




