Metadata-Version: 2.1
Name: contentmap
Version: 0.5.0
Summary: 
Author: Philippe Oger
Author-email: phil.oger@gmail.com
Requires-Python: >=3.9,<4.0
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Requires-Dist: aiohttp (>=3.9.1,<4.0.0)
Requires-Dist: langchain (>=0.1.8,<0.2.0)
Requires-Dist: lxml (==4.9.4)
Requires-Dist: requests (>=2.31.0,<3.0.0)
Requires-Dist: sentence-transformers (>=2.3.1,<3.0.0)
Requires-Dist: sqlite-vss (>=0.1.2,<0.2.0)
Requires-Dist: tqdm (>=4.66.1,<5.0.0)
Requires-Dist: trafilatura (>=1.6.4,<2.0.0)
Description-Content-Type: text/markdown

# Content map

Content map shares the content of a specific domain as a SQLite database, as an
alternative to RSS feeds. The library builds a dataset of all the content on your
website, using the XML sitemap as a starting point.

Vector similarity search can be added to the dataset with a single option (`include_vss=True`).

The rationale behind this type of dataset is explained in [this article](https://philippeoger.com/pages/can-we-rag-the-whole-web/).
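
To illustrate what "using the XML sitemap as a starting point" means, here is a minimal sketch of extracting page URLs from a sitemap document with the Python standard library. It is not the library's own implementation: the `extract_urls` helper and the example URLs are hypothetical, and it assumes a plain `<urlset>` sitemap that follows the sitemaps.org protocol.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def extract_urls(sitemap_xml: str) -> list[str]:
    """Return every <loc> URL found in a sitemap document (hypothetical helper)."""
    root = ET.fromstring(sitemap_xml)
    return [
        loc.text.strip()
        for loc in root.findall(".//sm:loc", SITEMAP_NS)
        if loc.text
    ]


# Made-up sitemap used only for illustration.
example = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://yourblog.com/first-post/</loc></url>
  <url><loc>https://yourblog.com/second-post/</loc></url>
</urlset>"""

print(extract_urls(example))
```

Each URL collected this way is a candidate page whose content can then be fetched and stored in the dataset.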


## Installation

```bash
pip install contentmap
```

## Quickstart

To build a `contentmap.db` file that contains all your content and supports vector
search, using your XML sitemap as a starting point, you only need to write the
following:

```python
from contentmap.sitemap import SitemapToContentDatabase

database = SitemapToContentDatabase(
    sitemap_sources=["https://yourblog.com/sitemap.xml"],
    concurrency=10,
    include_vss=True
)
database.build()
```

This will automatically create the SQLite database file with vector search
capabilities, relying on LangChain's sqlite-vss integration.
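
Once the file exists, you can inspect it with any SQLite client. The sketch below uses Python's built-in `sqlite3` module to list the tables in a database file. The `list_tables` helper is hypothetical and makes no assumption about the table names or schema that contentmap actually creates.

```python
import sqlite3


def list_tables(db_path: str) -> list[str]:
    """Return the names of all tables stored in a SQLite file (hypothetical helper)."""
    # Note: sqlite3.connect creates an empty file if db_path does not exist yet.
    connection = sqlite3.connect(db_path)
    try:
        rows = connection.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        connection.close()
    return [name for (name,) in rows]
```

Calling `list_tables("contentmap.db")` after `database.build()` should show what was written to the file. Listing table names works without any extension, but querying the vector search tables themselves will likely require loading the sqlite-vss extension first.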

Thanks to @medoror for contributing.

