Metadata-Version: 2.0
Name: cdx-toolkit
Version: 0.9.4
Summary: A toolkit for working with CDX indices
Home-page: https://github.com/cocrawler/cdx_toolkit
Author: Greg Lindahl and others
Author-email: lindahl@pbm.com
License: Apache 2.0
Description-Content-Type: UNKNOWN
Platform: UNKNOWN
Classifier: Development Status :: 4 - Beta
Classifier: Environment :: Console
Classifier: Intended Audience :: Information Technology
Classifier: Intended Audience :: Developers
Classifier: Natural Language :: English
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 3.3
Classifier: Programming Language :: Python :: 3.4
Classifier: Programming Language :: Python :: 3.5
Classifier: Programming Language :: Python :: 3.6
Classifier: Programming Language :: Python :: 3 :: Only

cdx\_toolkit
============

|Build Status| |Coverage Status| |Apache License 2.0|

cdx\_toolkit is a set of tools for working with CDX indices of web
crawls and archives, including those at CommonCrawl and the Internet
Archive's Wayback Machine.

CommonCrawl uses Ilya Kreymer's pywb to serve the CDX API, which is
somewhat different from the Internet Archive's CDX API. cdx\_toolkit
hides these differences as best it can. cdx\_toolkit also knits together
the monthly Common Crawl CDX indices into a single, virtual index.

Installing
----------

::

    $ pip install cdx_toolkit

or clone this repo and use ``python setup.py install``.

Example
-------

::

    import cdx_toolkit

    cdx = cdx_toolkit.CDXFetcher(source='cc')
    url = 'commoncrawl.org/*'

    print(url, 'size estimate', cdx.get_size_estimate(url))

    for obj in cdx.items(url, limit=10):
        print(obj)

At the time of writing, this prints:

::

    commoncrawl.org/* size estimate 6000
    http://commoncrawl.org/ 200
    http://commoncrawl.org/ 200
    http://commoncrawl.org/ 200
    http://www.commoncrawl.org/ 301
    https://www.commoncrawl.org/ 301
    http://www.commoncrawl.org/ 301
    http://commoncrawl.org/ 200
    http://commoncrawl.org/2011/12/mapreduce-for-the-masses/ 200
    http://commoncrawl.org/2012/03/data-2-0-summit/ 200
    http://commoncrawl.org/2012/03/twelve-steps-to-running-your-ruby-code-across-five-billion-web-pages/ 200

Command-line tools
------------------

The above example can also be run from the command line:

::

    $ cdx_size 'commoncrawl.org/*' --cc
    $ cdx_iter 'commoncrawl.org/*' --cc --limit 10 --cc-duration='90d'

or

::

    $ cdx_size 'commoncrawl.org/*' --ia
    $ cdx_iter 'commoncrawl.org/*' --ia --limit 10

cdx\_iter can generate jsonl or csv outputs; see

::

    $ cdx_iter --help

for details. Set the environment variable LOGLEVEL=DEBUG if you'd like
more details about what's going on inside cdx\_iter.
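
Since jsonl is one JSON object per line, the output is easy to
post-process. The following is a minimal sketch, not part of
cdx\_toolkit: the sample lines are made up from the example listing
above, and the field names ``url`` and ``status`` (with the status
stored as a string) are assumptions about what the output contains.

::

    import io
    import json

    # Stand-in for a file produced by cdx_iter in jsonl mode (made-up sample).
    sample = io.StringIO(
        '{"url": "http://commoncrawl.org/", "status": "200"}\n'
        '{"url": "http://www.commoncrawl.org/", "status": "301"}\n'
    )

    def read_jsonl(stream):
        """Yield one dict per non-empty line of a jsonl stream."""
        for line in stream:
            line = line.strip()
            if line:
                yield json.loads(line)

    # Keep only the urls of captures that were redirects.
    redirects = [rec['url'] for rec in read_jsonl(sample) if rec['status'] == '301']
    print(redirects)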

CDX Jargon, Field Names, and such
---------------------------------

cdx\_toolkit supports all of the options and fields discussed in the CDX
API documentation:

-  https://github.com/webrecorder/pywb/wiki/CDX-Server-API
-  https://github.com/internetarchive/wayback/tree/master/wayback-cdx-server

A **capture** is a single crawled url, be it a copy of a webpage, a
redirect to another page, an error such as 404 (page not found), or a
revisit record (page identical to a previous capture).

The **url** used by cdx\_toolkit can be wildcarded in two ways. One way is
``*.example.com``, which in CDX jargon sets **matchType='domain'**, and
will return captures for blog.example.com, support.example.com, etc. The
other, ``example.com/*``, will return captures for any page on
example.com.
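
The two wildcard forms can be told apart mechanically. The sketch below
uses a hypothetical helper, ``match_type``, that is not part of
cdx\_toolkit. The names ``prefix`` and ``exact`` are taken from the pywb
CDX Server API documentation rather than from this README, so treat
them as assumptions.

::

    def match_type(pattern):
        """Guess the CDX matchType implied by a url pattern (illustrative only)."""
        if pattern.startswith('*.'):
            return 'domain'   # *.example.com: the host and all of its subdomains
        if pattern.endswith('*'):
            return 'prefix'   # example.com/*: every url that starts with the prefix
        return 'exact'        # no wildcard: this url only

    for pattern in ('*.example.com', 'example.com/*', 'example.com/about'):
        print(pattern, '->', match_type(pattern))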

A **timestamp** represents year-month-day-time as a string of digits run
together. Example: January 5, 2016 at 12:34:56 UTC is 20160105123456.
These timestamps are a field in the index, and are also used to specify
the dates used by **--from**, **--to**, and **--closest** on the
command-line. (Programmatically, use from\_ts=, to=, and closest=.)
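
Converting between Python datetimes and these 14-digit strings takes
only the standard library. This is a minimal sketch; the helper names
``to_cdx_timestamp`` and ``from_cdx_timestamp`` are hypothetical and not
part of cdx\_toolkit.

::

    from datetime import datetime, timezone

    CDX_FORMAT = '%Y%m%d%H%M%S'  # year, month, day, hour, minute, second

    def to_cdx_timestamp(dt):
        """Format a timezone-aware datetime as a 14-digit UTC timestamp."""
        return dt.astimezone(timezone.utc).strftime(CDX_FORMAT)

    def from_cdx_timestamp(ts):
        """Parse a 14-digit timestamp into a UTC datetime."""
        return datetime.strptime(ts, CDX_FORMAT).replace(tzinfo=timezone.utc)

    moment = datetime(2016, 1, 5, 12, 34, 56, tzinfo=timezone.utc)
    print(to_cdx_timestamp(moment))  # 20160105123456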

A **urlkey** is a SURT, which is a munged-up url suitable for
deduplication and sorting. This sort order is how CDX indices
efficiently support queries like ``*.example.com``. The SURTs for
www.example.com and example.com are identical, which is handy when these
two hosts actually serve identical web content. The original url should
be present in every record, in case you need to know exactly what it
was.
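
To see why this sort order helps, here is a deliberately simplified
SURT-like key. It is an illustration only, not the canonicalization that
pywb or the Wayback Machine actually performs: it ignores schemes,
ports, and query strings, and the name ``simple_surt`` is hypothetical.

::

    from urllib.parse import urlsplit

    def simple_surt(url):
        """Build a simplified SURT-like key: reversed host, then the path."""
        parts = urlsplit(url.lower())
        host = parts.hostname or ''
        if host.startswith('www.'):
            host = host[4:]               # www.example.com and example.com collapse
        reversed_host = ','.join(reversed(host.split('.')))
        return reversed_host + ')' + (parts.path or '/')

    urls = [
        'http://blog.example.com/post',
        'http://www.example.com/',
        'http://other.org/',
        'http://example.com/about',
    ]
    # After sorting by key, every example.com host sits in one contiguous run.
    for url in sorted(urls, key=simple_surt):
        print(simple_surt(url), url)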

The **limit** argument limits how many captures will be returned. There
is a default limit of 1,000 captures.

A **filter** allows a user to select a subset of CDX records, reducing
network traffic between the CDX API server and the user. For example,
``filter='!=status:200'`` will only show captures whose HTTP status is
not 200. Filters and **limit** work together, with the limit applying to
the count of captures after the filter is applied.
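
The order of operations matters: filter first, then count. The real
filtering happens on the CDX server; the local imitation below only
demonstrates that ordering. The records and the helper
``filter_then_limit`` are made up for illustration.

::

    from itertools import islice

    # Made-up captures; the status is kept as a string here by assumption.
    captures = [
        {'url': 'http://example.com/', 'status': '200'},
        {'url': 'http://example.com/old', 'status': '301'},
        {'url': 'http://example.com/a', 'status': '200'},
        {'url': 'http://example.com/gone', 'status': '404'},
        {'url': 'http://example.com/moved', 'status': '302'},
    ]

    def filter_then_limit(records, predicate, limit):
        """Apply the filter first, then stop after `limit` surviving records."""
        return list(islice((r for r in records if predicate(r)), limit))

    # Equivalent in spirit to filter='!=status:200' combined with limit=2.
    not_ok = filter_then_limit(captures, lambda r: r['status'] != '200', limit=2)
    print([r['url'] for r in not_ok])

Had the limit been applied first, the two records inspected would have
included a 200 capture, and only one result would have survived the
filter.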

CDX API servers support a **paged interface** for efficient access to
large sets of URLs. cdx\_toolkit iterators always use the paged
interface. cdx\_toolkit is also polite to CDX servers by being
single-threaded and serial. If it's not fast enough for you, consider
downloading Common Crawl's index files directly.

A **digest** is a SHA-1 checksum of the contents of a capture. The
purpose of a digest is to make it easy to tell whether two captures
have identical content.
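
A digest comparison can be sketched with the standard library. Two
things here are assumptions rather than facts from this README: that the
digest is shown base32-encoded, and exactly which bytes of a capture are
hashed. The helper name ``content_digest`` is hypothetical.

::

    import base64
    import hashlib

    def content_digest(payload):
        """Return a base32-encoded SHA-1 of some bytes (encoding is an assumption)."""
        return base64.b32encode(hashlib.sha1(payload).digest()).decode('ascii')

    first = content_digest(b'<html>hello</html>')
    second = content_digest(b'<html>hello</html>')
    changed = content_digest(b'<html>hello!</html>')

    print(first == second)   # identical content gives identical digests
    print(first == changed)  # any change in content gives a different digest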

Common Crawl publishes a new index each month. cdx\_toolkit will start
using new indices as soon as they are published. By default, cdx\_toolkit
will use the previous year of Common Crawl; you can change that using
**--from** or **from\_ts=** and **--to** or **to=**.

CDX implementations do not efficiently support reversed sort orders, so
cdx\_toolkit results will be ordered by ascending SURT and by ascending
timestamp. However, since CC has an individual index for each month, and
because most users want more recent results, cdx\_toolkit defaults to
querying CC's CDX indices in decreasing month order, but each month's
result will be in ascending SURT and ascending timestamp. If you'd like
pure ascending, set **--cc-sort** or **cc\_sort=** to 'ascending'. You
may want to also specify **--from** or **from\_ts=** to set a starting
timestamp.
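
The resulting order can be pictured with a toy model. This sketch does
not use cdx\_toolkit and does not reflect its internals: the monthly
data is made up, and the function ``iterate`` with its ``ascending``
flag is a hypothetical stand-in for the sort option described above.

::

    # Made-up (urlkey, timestamp) pairs, grouped by monthly index.
    monthly_indices = {
        '2018-01': [('com,example)/a', '20180120000000'), ('com,example)/', '20180103000000')],
        '2018-02': [('com,example)/', '20180211000000')],
        '2018-03': [('com,example)/b', '20180309000000'), ('com,example)/', '20180302000000')],
    }

    def iterate(indices, ascending=False):
        """Visit months newest-first by default; sort ascending within each month."""
        for month in sorted(indices, reverse=not ascending):
            # Tuples sort by urlkey first, then by timestamp.
            yield from sorted(indices[month])

    for urlkey, timestamp in iterate(monthly_indices):
        print(timestamp, urlkey)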

The main problem with this ascending sort order is that it's a pain to
get the most recent N captures: --limit and limit= will return the
oldest N captures.
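
One workaround, sketched here under the assumption that you are willing
to walk the entire result set, is to keep only the tail of an ascending
stream. The helper ``most_recent`` is hypothetical, and the timestamps
are made up.

::

    from collections import deque

    def most_recent(captures, n):
        """Keep the last n items of an iterable sorted by ascending timestamp."""
        # A bounded deque discards the oldest items as newer ones arrive.
        return list(deque(captures, maxlen=n))

    ascending_timestamps = ['20160105123456', '20170301000000',
                            '20180211000000', '20180309000000']
    print(most_recent(ascending_timestamps, 2))

This uses constant memory but still transfers every capture from the
server, so it suits small result sets only.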

TODO
----

Add a call to download a capture from ia or cc, given a URL and a
timestamp.

Status
------

cdx\_toolkit has reached the beta-testing stage of development.

License
-------

Apache 2.0

.. |Build Status| image:: https://travis-ci.org/cocrawler/cdx_toolkit.svg?branch=master
   :target: https://travis-ci.org/cocrawler/cdx_toolkit
.. |Coverage Status| image:: https://coveralls.io/repos/github/cocrawler/cdx_toolkit/badge.svg?branch=master
   :target: https://coveralls.io/github/cocrawler/cdx_toolkit?branch=master
.. |Apache License 2.0| image:: https://img.shields.io/github/license/cocrawler/cdx_toolkit.svg
   :target: LICENSE


