Metadata-Version: 2.1
Name: swh.scrubber
Version: 2.3.0
Summary: Software Heritage datastore scrubber
Author-email: Software Heritage developers <swh-devel@inria.fr>
Project-URL: Homepage, https://gitlab.softwareheritage.org/swh/devel/swh-scrubber
Project-URL: Bug Reports, https://gitlab.softwareheritage.org/swh/devel/swh-scrubber/-/issues
Project-URL: Funding, https://www.softwareheritage.org/donate
Project-URL: Documentation, https://docs.softwareheritage.org/devel/swh-scrubber/
Project-URL: Source, https://gitlab.softwareheritage.org/swh/devel/swh-scrubber.git
Classifier: Programming Language :: Python :: 3
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: GNU General Public License v3 (GPLv3)
Classifier: Operating System :: OS Independent
Classifier: Development Status :: 3 - Alpha
Requires-Python: >=3.7
Description-Content-Type: text/x-rst
License-File: LICENSE
License-File: AUTHORS
Requires-Dist: dulwich
Requires-Dist: humanize
Requires-Dist: psycopg2
Requires-Dist: tenacity
Requires-Dist: swh.core[http] >=3.0.0
Requires-Dist: swh.loader.git >=1.4.0
Requires-Dist: swh.model >=5.0.0
Requires-Dist: swh.storage >=2.0.0
Requires-Dist: swh.journal >=1.3.0
Provides-Extra: testing
Requires-Dist: msgpack ; extra == 'testing'
Requires-Dist: pytest ; extra == 'testing'
Requires-Dist: pytest-mock ; extra == 'testing'
Requires-Dist: pyyaml ; extra == 'testing'
Requires-Dist: swh.core[testing] >=3.0.0 ; extra == 'testing'
Requires-Dist: swh.graph ; extra == 'testing'

Software Heritage - Datastore Scrubber
======================================

Tools to periodically check data integrity in swh-storage and swh-objstorage,
report errors, and (try to) fix them.

This is a work in progress; some of the components described below do not
exist yet (Cassandra storage checker, objstorage checker, recovery, and
reinjection).

The Scrubber package is made of the following parts:


Checking
--------

Highly parallel processes continuously read objects from a data store,
compute checksums, and record any failure in a database, along with the data
of the corrupt object.
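
As a rough illustration of what such a check involves, the sketch below
recomputes a Git-style SHA-1 for a blob and compares it with the identifier
the object was stored under. It is not the scrubber's actual code:
``check_blob`` and ``CorruptObject`` are hypothetical names, and the real
checkers handle more object types than plain blobs.

.. code-block:: python

   import hashlib
   from dataclasses import dataclass
   from typing import Optional


   def git_blob_sha1(data: bytes) -> str:
       """Return the Git blob identifier of ``data`` as a hex string."""
       # Git hashes a "blob <size>\0" header followed by the raw content.
       header = b"blob %d\x00" % len(data)
       return hashlib.sha1(header + data).hexdigest()


   @dataclass
   class CorruptObject:
       """Hypothetical failure record, kept together with the corrupt data."""

       claimed_id: str
       computed_id: str
       data: bytes


   def check_blob(claimed_id: str, data: bytes) -> Optional[CorruptObject]:
       """Return None when the checksum matches, else a failure record."""
       computed_id = git_blob_sha1(data)
       if computed_id == claimed_id:
           return None
       return CorruptObject(claimed_id, computed_id, data)

A caller would store every non-``None`` result, so that the corrupt bytes
remain available to later recovery attempts.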

There is one "checker" for each datastore package: storage (PostgreSQL and Cassandra),
journal (Kafka), and objstorage.

The journal is "crawled" using its native streaming; others are crawled by range,
reusing swh-storage's backfiller utilities, and checkpointed from time to time
to the scrubber's database (in the ``checked_range`` table).
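
The range-based crawl with periodic checkpoints can be pictured as follows.
This is a simplified sketch under assumed names: ranges are modelled as
numbered partitions, and ``crawl_partitions``, ``check_partition`` and the
in-memory ``checkpoints`` set stand in for the real workers and the
``checked_range`` table, whose actual schema is not described here.

.. code-block:: python

   from typing import Callable, Iterable, List, Set


   def crawl_partitions(
       partition_ids: Iterable[int],
       check_partition: Callable[[int], List[str]],
       checkpoints: Set[int],
   ) -> List[str]:
       """Check every partition not yet checkpointed; return corrupt IDs."""
       corrupt: List[str] = []
       for partition_id in partition_ids:
           if partition_id in checkpoints:
               # Already checked in an earlier run: skip it.
               continue
           corrupt.extend(check_partition(partition_id))
           # Record progress so an interrupted crawl can resume from here.
           checkpoints.add(partition_id)
       return corrupt

Because progress is recorded per partition, several workers sharing the same
checkpoint store could divide the work, and a restarted worker would not
re-check ranges that are already done.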

Storage
+++++++

For the storage checker, a checking configuration must be created before any
checkers can be spawned.

A new configuration is created using the ``swh scrubber check init`` tool:

.. code-block:: bash

   $ swh scrubber check init --object-type snapshot --nb-partitions 65536 --name chk-snp
   Created configuration chk-snp [2] for checking snapshot in datastore storage postgresql

One or more checking workers can then be spawned using the ``swh scrubber
check storage`` command:

.. code-block:: bash

   $ swh scrubber check storage chk-snp
   [...]


.. note:: A configuration file is expected, as for most ``swh`` tools.
   This file must have a ``scrubber`` section with the configuration of
   the scrubber database. For storage checking operations, this
   configuration file must also have a ``storage`` configuration section.
   See the `swh-storage documentation`_ for more details on this. A
   typical configuration file could look like:

   .. code-block:: yaml

      scrubber:
        cls: postgresql
        db: postgresql://localhost/postgres?host=/tmp/tmpk9b4wkb5&port=9824

      storage:
        cls: postgresql
        db: service=swh
        objstorage:
          cls: noop

.. _`swh-storage documentation`: https://docs.softwareheritage.org/devel/swh-storage/index.html

.. note:: The configuration section ``scrubber_db`` was renamed to
          ``scrubber`` in ``swh-scrubber`` version 2.0.0.
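
For configuration files written before that rename, a small migration step
along the following lines may help. ``migrate_scrubber_config`` is a
hypothetical helper, not something ``swh-scrubber`` provides; it works on an
already parsed configuration dictionary.

.. code-block:: python

   from typing import Any, Dict


   def migrate_scrubber_config(config: Dict[str, Any]) -> Dict[str, Any]:
       """Return a copy of ``config`` that uses the ``scrubber`` section name."""
       migrated = dict(config)
       if "scrubber_db" in migrated:
           if "scrubber" in migrated:
               # Refuse to guess which of the two sections should win.
               raise ValueError("both 'scrubber_db' and 'scrubber' are set")
           migrated["scrubber"] = migrated.pop("scrubber_db")
       return migrated

The input dictionary is left untouched, so the caller can compare the old and
new layouts before writing the file back.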

Recovery
--------

Then, from time to time, jobs go through the list of known corrupt objects,
and try to recover the original objects, through various means:

* Brute-forcing variations until they match their checksum
* Recovering from another data store
* As a last resort, recovering from known origins, if any
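
To make the first strategy concrete, here is a minimal sketch of one possible
family of variations: flipping a single bit at a time until the checksum
matches. The recovery component does not exist yet, so the function names and
the choice of single-bit flips are assumptions made for illustration only.

.. code-block:: python

   import hashlib
   from typing import Callable, Iterator, Optional


   def single_bit_flips(data: bytes) -> Iterator[bytes]:
       """Yield every variant of ``data`` that differs by exactly one bit."""
       for index in range(len(data)):
           for bit in range(8):
               variant = bytearray(data)
               variant[index] ^= 1 << bit
               yield bytes(variant)


   def sha1_hex(data: bytes) -> str:
       """Plain SHA-1 of ``data``; a real checker may use another hash."""
       return hashlib.sha1(data).hexdigest()


   def recover_by_bit_flip(
       corrupt: bytes,
       expected_hex: str,
       hash_fn: Callable[[bytes], str] = sha1_hex,
   ) -> Optional[bytes]:
       """Return the one-bit variant whose hash is ``expected_hex``, if any."""
       for candidate in single_bit_flips(corrupt):
           if hash_fn(candidate) == expected_hex:
               return candidate
       return None

Each candidate costs one hash computation, so a payload of ``n`` bytes needs
at most ``8 * n`` hashes; richer families of variations grow the search space
quickly, which is why the other two strategies remain useful.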


Reinjection
-----------

Finally, when an original object is recovered, it is reinjected into the
original data store, replacing the corrupt one.
