Metadata-Version: 2.1
Name: robotspy
Version: 0.3.3
Summary: Robots Exclusion Protocol File Parser
Home-page: https://github.com/andreburgaud/robotspy
Author: Andre Burgaud
Author-email: andre.burgaud@gmail.com
License: MIT
Platform: UNKNOWN
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Description-Content-Type: text/markdown

# Robots Exclusion Standard Parser for Python

The `robots` Python module implements a parser for `robots.txt` files. The recommended class to use is
`robots.RobotsParser`. A thin facade, `robots.RobotFileParser`, can also be used as
a substitute for [`urllib.robotparser.RobotFileParser`](https://docs.python.org/3/library/urllib.robotparser.html),
available in the Python standard library. The facade exposes an API that is
mostly compatible with `urllib.robotparser.RobotFileParser`.

The main reasons for this rewrite are the following:

1. It was initially intended to experiment with parsing `robots.txt` for a link checker project
(not implemented).
1. It is attempting to follow the latest internet draft
[Robots Exclusion Protocol](https://tools.ietf.org/html/draft-koster-rep-00).
1. It does not try to be compliant with commonly accepted directives that are not in the current
[specs](https://tools.ietf.org/html/draft-koster-rep-00), such as `request-rate` and `crawl-delay`,
but it currently supports `sitemaps`.
1. It satisfies the same tests as the [Google Robots.txt Parser](https://github.com/google/robotstxt),
except for some custom behaviors specific to Google Robots.

## Installation

**Note**: Python 3.8.x is required

Preferably, install the `robots` package after creating a Python virtual environment
in a newly created directory, as follows:

```
$ mkdir project && cd project
$ python -m venv .venv --prompt robotspy
$ . .venv/bin/activate
(robotspy) $ python -m pip install --upgrade pip
(robotspy) $ python -m pip install --upgrade setuptools
(robotspy) $ python -m pip install robotspy
```

## Usage

The `robots` package can be imported as a module and also exposes an executable invokable with
`python -m`.

### Execute the Package

After installing `robotspy`, you can validate the installation by running the following command:

```
(robotspy) $ python -m robots --help
usage: robots (<robots_path>|<robots_url>) <user_agent> <URI>

Shows whether the given user agent and URI combination are allowed or
disallowed by the given robots.txt file.

positional arguments:
  robotstxt      robots.txt file path or URL
  useragent      User agent name
  uri            Path or URI

optional arguments:
  -h, --help     show this help message and exit
  -v, --version  show program's version number and exit
```

### Examples

The content of http://www.pythontest.net/elsewhere/robots.txt is the following:

```
# Used by NetworkTestCase in Lib/test/test_robotparser.py

User-agent: Nutch
Disallow: /
Allow: /brian/

User-agent: *
Disallow: /webstats/
```

To check if the user agent `Nutch` can fetch the path `/brian/` you can execute:

```
(robotspy) $ python -m robots http://www.pythontest.net/elsewhere/robots.txt Nutch /brian/
user-agent 'Nutch' with URI '/brian/': ALLOWED
```

You can also pass the full URL, http://www.pythontest.net/brian/:

```
(robotspy) $ python -m robots http://www.pythontest.net/elsewhere/robots.txt Nutch http://www.pythontest.net/brian/
user-agent 'Nutch' with URI 'http://www.pythontest.net/brian/': ALLOWED
```

Can user agent `Nutch` fetch the path `/brian`?

```
(robotspy) $ python -m robots http://www.pythontest.net/elsewhere/robots.txt Nutch /brian
user-agent 'Nutch' with URI '/brian': DISALLOWED
```

Or, `/`?

```
(robotspy) $ python -m robots http://www.pythontest.net/elsewhere/robots.txt Nutch /
user-agent 'Nutch' with URI '/': DISALLOWED
```

How about user agent `Johnny`?

```
(robotspy) $ python -m robots http://www.pythontest.net/elsewhere/robots.txt Johnny /
user-agent 'Johnny' with URI '/': ALLOWED
```
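The results above follow from the precedence rule in the Robots Exclusion Protocol draft: among the rules that match a path, the most specific (longest) one wins, and an allow rule is preferred on a tie. The following is a minimal sketch of that rule only, not the actual `robotspy` implementation. The names `Rule`, `NUTCH_RULES`, `DEFAULT_RULES`, and `is_allowed` are hypothetical, and wildcard handling (`*`, `$`) is left out.

```python
from typing import List, Tuple

# Each rule is (allow, path_prefix). Hypothetical representation for this sketch.
Rule = Tuple[bool, str]

# Rules from the sample robots.txt shown above.
NUTCH_RULES: List[Rule] = [(False, "/"), (True, "/brian/")]
DEFAULT_RULES: List[Rule] = [(False, "/webstats/")]


def is_allowed(rules: List[Rule], path: str) -> bool:
    """Return True if the longest matching rule allows the path."""
    best_len = -1
    allowed = True  # a path that matches no rule is allowed
    for allow, prefix in rules:
        if not path.startswith(prefix):
            continue
        # The longer prefix wins; on a tie, prefer the allow rule.
        if len(prefix) > best_len or (len(prefix) == best_len and allow):
            best_len = len(prefix)
            allowed = allow
    return allowed


print(is_allowed(NUTCH_RULES, "/brian/"))  # longest match is "Allow: /brian/"
print(is_allowed(NUTCH_RULES, "/brian"))   # only "Disallow: /" matches
print(is_allowed(DEFAULT_RULES, "/"))      # no rule matches
```

With these inputs the sketch agrees with the command-line output: `/brian/` is allowed for `Nutch`, `/brian` is not, and `/` is allowed for any agent that falls back to the `*` group.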

### Use the Module in a Project

Here is an example with the same data as above, using the `robots` package from the Python shell:

```
(robotspy) $ python
>>> import robots
>>> parser = robots.RobotsParser.from_uri('http://www.pythontest.net/elsewhere/robots.txt')
>>> useragent = 'Nutch'
>>> path = '/brian/'
>>> result = parser.can_fetch(useragent, path)
>>> print(f'Can {useragent} fetch {path}? {result}')
Can Nutch fetch /brian/? True
>>>
```
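When several paths need to be checked, the `can_fetch(useragent, path)` call shown above can be wrapped in a small helper. This is only a sketch: `check_paths` and `PrefixDenyParser` are hypothetical names, and `PrefixDenyParser` is a stand-in object so that the example runs without network access. In a real project you would pass the parser returned by `robots.RobotsParser.from_uri(...)` instead.

```python
from typing import Dict, Iterable


def check_paths(parser, useragent: str, paths: Iterable[str]) -> Dict[str, bool]:
    """Map each path to the parser's can_fetch() verdict for the user agent."""
    return {path: parser.can_fetch(useragent, path) for path in paths}


class PrefixDenyParser:
    """Hypothetical stand-in exposing can_fetch(); denies a single path prefix."""

    def __init__(self, denied_prefix: str) -> None:
        self.denied_prefix = denied_prefix

    def can_fetch(self, useragent: str, path: str) -> bool:
        return not path.startswith(self.denied_prefix)


report = check_paths(PrefixDenyParser("/webstats/"), "Johnny", ["/", "/webstats/", "/brian/"])
for path, allowed in report.items():
    print(f"{path}: {'ALLOWED' if allowed else 'DISALLOWED'}")
```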

### Bug in the Python standard library

There is a bug in [`urllib.robotparser`](https://docs.python.org/3/library/urllib.robotparser.html)
from the Python standard library that causes the following test to differ from the example above with `robotspy`.

The example with `urllib.robotparser` is the following:

```
$ python
>>> import urllib.robotparser
>>> rp = urllib.robotparser.RobotFileParser()
>>> rp.set_url('http://www.pythontest.net/elsewhere/robots.txt')
>>> rp.read()
>>> rp.can_fetch('Nutch', '/brian/')
False
```

Notice that the result is `False` whereas `robotspy` returns `True`.

Bug [bpo-39187](https://bugs.python.org/issue39187) was opened to raise awareness of this issue and PR
https://github.com/python/cpython/pull/17794 was submitted as a possible fix. `robotspy` does not
exhibit this problem.
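The comparison above needs network access. As a minimal offline sketch, the same `robots.txt` content can be fed to the standard library parser through its `parse()` method, which takes a list of lines. The `ROBOTS_TXT` constant is an assumption of this sketch (a copy of the file content shown earlier), and the value printed for `Nutch` depends on whether your Python version includes a fix for the bug.

```python
import urllib.robotparser

# Copy of the sample robots.txt shown earlier, so no network access is needed.
ROBOTS_TXT = """\
# Used by NetworkTestCase in Lib/test/test_robotparser.py

User-agent: Nutch
Disallow: /
Allow: /brian/

User-agent: *
Disallow: /webstats/
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())  # parse() accepts the file content as a list of lines

# The session above reports False here; the result may differ on a fixed Python.
result = rp.can_fetch("Nutch", "/brian/")
print(f"Nutch /brian/: {result}")

# Agents without their own group fall back to the "*" group.
print(f"Johnny /: {rp.can_fetch('Johnny', '/')}")
print(f"Johnny /webstats/: {rp.can_fetch('Johnny', '/webstats/')}")
```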

## Development

The main development dependency is `pytest` for executing the tests. It is automatically
installed if you perform the following steps:

```
$ git clone https://github.com/andreburgaud/robotspy
$ cd robotspy
$ python -m venv .venv --prompt robotspy
$ . .venv/bin/activate
(robotspy) $ python -m pip install -r requirements.txt
(robotspy) $ python -m pip install -e .
(robotspy) $ make test
(robotspy) $ deactivate
$
```

Other dependencies are intended for deployment to the [Cheese Shop](https://wiki.python.org/moin/CheeseShop) ([PyPI](https://pypi.org/)):

* [Wheel](https://pypi.org/project/wheel/0.22.0/)
* [twine](https://pypi.org/project/twine/)

See the build file, `Makefile`, for the commands and parameters.

The `Makefile` also invokes the following tools:

* [Black](https://github.com/psf/black)
* [Mypy](http://mypy-lang.org/)
* [Pylint](https://www.pylint.org/)

At this stage of the development, version 0.3.1, the three development tools above are expected to be installed globally.

### Dependency Tree

To display the dependency tree:

```
$ pipdeptree
```

or

```
$ make tree
```

To display the reverse dependency tree of a particular package, `idna` in the example below:

```
$ pipdeptree --reverse --packages idna
```

## Release History

* 0.3.3:
  * Upgraded `tqdm` and `cryptography` packages
* 0.3.2:
  * Upgraded `bleach`, `tqdm`, and `setuptools` packages
* 0.3.1:
  * Updated `idna` and `wcwidth` packages
  * Added `pipdeptree` package to provide visibility on dependencies
  * Fixed `mypy` errors
  * Explicitly ignored `pylint` errors related to commonly used names like `f`, `m`, or `T`
* 0.3.0: Updated `bleach` package to address CVE-2020-6802
* 0.2.0: Updated the documentation
* 0.1.0: Initial release

## License

MIT License

