Metadata-Version: 2.1
Name: swh.indexer
Version: 0.3.0
Summary: Software Heritage Content Indexer
Home-page: https://forge.softwareheritage.org/diffusion/78/
Author: Software Heritage developers
Author-email: swh-devel@inria.fr
License: UNKNOWN
Project-URL: Bug Reports, https://forge.softwareheritage.org/maniphest
Project-URL: Funding, https://www.softwareheritage.org/donate
Project-URL: Source, https://forge.softwareheritage.org/source/swh-indexer
Project-URL: Documentation, https://docs.softwareheritage.org/devel/swh-indexer/
Platform: UNKNOWN
Classifier: Programming Language :: Python :: 3
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: GNU General Public License v3 (GPLv3)
Classifier: Operating System :: OS Independent
Classifier: Development Status :: 5 - Production/Stable
Requires-Python: >=3.7
Description-Content-Type: text/markdown
Requires-Dist: vcversioner
Requires-Dist: click
Requires-Dist: python-magic (>=0.4.13)
Requires-Dist: pyld
Requires-Dist: xmltodict
Requires-Dist: swh.core[db,http] (>=0.3)
Requires-Dist: swh.model (>=0.0.15)
Requires-Dist: swh.objstorage (>=0.0.43)
Requires-Dist: swh.scheduler (>=0.5.2)
Requires-Dist: swh.storage (>=0.12.0)
Requires-Dist: swh.journal (>=0.1.0)
Provides-Extra: testing
Requires-Dist: confluent-kafka ; extra == 'testing'
Requires-Dist: pytest ; extra == 'testing'
Requires-Dist: pytest-mock ; extra == 'testing'
Requires-Dist: hypothesis (>=3.11.0) ; extra == 'testing'
Requires-Dist: swh.scheduler[testing] (>=0.5.0) ; extra == 'testing'
Requires-Dist: swh.storage[testing] (>=0.10.0) ; extra == 'testing'

swh-indexer
============

Tools to compute multiple indexes on SWH's raw contents:
- content:
  - mimetype
  - ctags
  - language
  - fossology-license
  - metadata
- revision:
  - metadata

An indexer is in charge of:
- looking up objects
- extracting information from those objects
- storing that information in the swh-indexer db

There are multiple indexers working on different object types:
  - content indexer: works with content sha1 hashes
  - revision indexer: works with revision sha1 hashes
  - origin indexer: works with origin identifiers

Indexation procedure:
- receive a batch of ids
- retrieve the associated data depending on object type
- compute an index for each object
- store the result in swh's storage
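
The procedure above can be sketched as a small loop. This is only an
illustration, not the swh.indexer API: `run_indexer`, `fetch`, `compute`, and
the in-memory dicts standing in for the object storage and the indexer db are
all hypothetical names.

```python
from typing import Callable, Dict, Iterable, List


def run_indexer(
    ids: Iterable[bytes],
    fetch: Callable[[bytes], bytes],
    compute: Callable[[bytes], dict],
    results: Dict[bytes, dict],
) -> List[bytes]:
    """Index a batch of ids and return the ids that were actually indexed.

    All parameters are hypothetical stand-ins: `fetch` plays the role of
    the object lookup, `compute` the indexing step, `results` the db.
    """
    indexed = []
    for obj_id in ids:  # receive a batch of ids
        try:
            data = fetch(obj_id)  # retrieve the associated data
        except KeyError:
            continue  # skip objects that cannot be found
        results[obj_id] = compute(data)  # compute the index and store it
        indexed.append(obj_id)
    return indexed


# Toy usage with in-memory dicts instead of real storage.
objects = {b"id1": b"hello", b"id2": b"#!/bin/sh\n"}
index_db: Dict[bytes, dict] = {}
done = run_indexer(
    [b"id1", b"id2", b"missing"],
    fetch=objects.__getitem__,
    compute=lambda data: {"length": len(data)},
    results=index_db,
)
```

Passing the lookup and compute steps as callables keeps the loop identical
for content, revision, and origin ids; only the two callables would change.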

Current content indexers:

- mimetype (queue swh_indexer_content_mimetype): detect the encoding
  and mimetype

- language (queue swh_indexer_content_language): detect the
  programming language

- ctags (queue swh_indexer_content_ctags): compute tags information

- fossology-license (queue swh_indexer_fossology_license): compute the
  license

- metadata: translate a file into a translated_metadata dict

Current revision indexers:

- metadata: detect files containing metadata, then retrieve their
  translated_metadata from the content_metadata table in storage, or run the
  content indexer to translate the files.
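
The lookup-or-translate behaviour of the revision metadata indexer might look
roughly like the sketch below. Everything here is an assumption made for
illustration: the `METADATA_FILENAMES` set, the `revision_metadata` function,
the dict standing in for the content_metadata table, and the `translate`
callable standing in for the content indexer are not taken from the actual
code base.

```python
from typing import Callable, Dict, Iterable, List, Tuple

# Hypothetical detection rule: the real indexer may recognise other files.
METADATA_FILENAMES = {"package.json", "codemeta.json"}


def revision_metadata(
    files: Iterable[Tuple[str, bytes]],
    content_metadata: Dict[bytes, dict],
    translate: Callable[[bytes], dict],
) -> List[dict]:
    """Collect translated_metadata for the metadata files of one revision.

    `files` is an iterable of (filename, content sha1) pairs.
    """
    collected = []
    for name, sha1 in files:
        if name not in METADATA_FILENAMES:
            continue  # not a metadata file, nothing to translate
        translated = content_metadata.get(sha1)
        if translated is None:
            # Not indexed yet: fall back to the content indexer
            # (storing the result here is an assumption of this sketch).
            translated = translate(sha1)
            content_metadata[sha1] = translated
        collected.append(translated)
    return collected


# Toy usage: one file is already indexed, one needs translating.
table = {b"aaa": {"name": "cached-project"}}
calls = []


def fake_translate(sha1: bytes) -> dict:
    calls.append(sha1)
    return {"name": "fresh-project"}


found = revision_metadata(
    [("README.md", b"zzz"), ("package.json", b"aaa"), ("codemeta.json", b"bbb")],
    table,
    fake_translate,
)
```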


