Metadata-Version: 2.1
Name: context-converter
Version: 1.0.2
Summary: Convert HTML to Markdown using Regex, BeautifulSoup4, and filter repeating characters with Jina Embeddings and a similarity threshold.
Author-Email: Daethyra <109057945+Daethyra@users.noreply.github.com>
License: MIT
Requires-Python: >=3.10
Requires-Dist: beautifulsoup4>=4.12.2
Requires-Dist: markdownify>=0.11.6
Requires-Dist: transformers>=4.36.2
Requires-Dist: torch>=2.1.2
Requires-Dist: aiofiles>=23.2.1
Requires-Dist: asyncio>=3.4.3
Description-Content-Type: text/markdown

# Convert and Format HTML to Markdown

## Purpose

This package converts HTML to Markdown and formats a dataset of HTML content into structured Markdown. It can also compute text embeddings to identify and remove repetitive content.
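
The repetition filter can be pictured with a minimal sketch. This is not the package's actual code: the function names, the toy two-dimensional vectors (standing in for real Jina embeddings), and the `0.9` default threshold are all assumptions made for illustration. A chunk is kept only if its cosine similarity to every chunk already kept stays below the threshold.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def drop_repetitive(chunks, embeddings, threshold=0.9):
    """Hypothetical helper: keep a chunk only if it is not too similar to any kept chunk."""
    kept_chunks = []
    kept_vectors = []
    for chunk, vector in zip(chunks, embeddings):
        if all(cosine_similarity(vector, kv) < threshold for kv in kept_vectors):
            kept_chunks.append(chunk)
            kept_vectors.append(vector)
    return kept_chunks


# Toy vectors stand in for embeddings produced by a real model.
chunks = ["# Intro", "# Intro (copy)", "## Details"]
embeddings = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]]
print(drop_repetitive(chunks, embeddings))  # ['# Intro', '## Details']
```

A higher threshold removes only near-exact repeats, while a lower one also removes loosely related chunks.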

## Installation & Setup

First, clone the repository:

`git clone https://github.com/daethyra/context-converter.git`

Then install the package:

`pip install context-converter`

* Run `jina_embeddings.py` to download the embeddings model ahead of time.

**Example integration**:

* For example usage, see [gpt-crawler](https://github.com/Daethyra/gpt-crawler). This fork of `gpt-crawler` integrates the `context-converter` package into its processing pipeline.

**Configuration**:

* Clone the package repository to configure the similarity threshold for removing content, the chunk size, the maximum number of threads, the file pattern used to match files for conversion, and the output file's name:

`git clone https://github.com/daethyra/context-converter.git`
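
As a rough illustration of what these settings control, the sketch below groups them in a dataclass. Every name and default value here (`ConverterSettings`, `similarity_threshold`, `chunk_size`, `max_threads`, `file_pattern`, `output_name`, and the helper functions) is hypothetical; check the cloned source for the real variable names and defaults.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass
class ConverterSettings:
    """Hypothetical container for the configurable values; not the package's API."""
    similarity_threshold: float = 0.9   # chunks at or above this similarity are dropped
    chunk_size: int = 1000              # characters per chunk
    max_threads: int = 4                # upper bound on worker threads
    file_pattern: str = "*.json"        # glob used to find input files
    output_name: str = "output.md"      # name of the combined Markdown file


def find_inputs(directory, settings):
    """Return the files in `directory` that match the configured pattern."""
    return sorted(Path(directory).glob(settings.file_pattern))


def split_into_chunks(text, settings):
    """Split text into consecutive pieces of at most `chunk_size` characters."""
    size = settings.chunk_size
    return [text[i:i + size] for i in range(0, len(text), size)]


settings = ConverterSettings(chunk_size=4)
print(split_into_chunks("abcdefghij", settings))  # ['abcd', 'efgh', 'ij']
```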