Metadata-Version: 2.1
Name: indexify-extractor-sdk
Version: 0.0.47
Summary: Indexify Extractor SDK to build new extractors for extraction from unstructured data
Author: Diptanu Gon Choudhury
Author-email: diptanu@tensorlake.ai
Requires-Python: >=3.10,<4.0
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Requires-Dist: azure-identity (>=1.15.0,<2.0.0)
Requires-Dist: azure-storage-blob (>=12.19.0,<13.0.0)
Requires-Dist: boto3 (>=1.34.37,<2.0.0)
Requires-Dist: docker (>=7.0.0,<8.0.0)
Requires-Dist: fastapi (>=0.109.2,<0.110.0)
Requires-Dist: fsspec (>=2024.2.0,<2025.0.0)
Requires-Dist: genson (>=1.2.2,<2.0.0)
Requires-Dist: google-cloud-storage (>=2.14.0,<3.0.0)
Requires-Dist: grpcio (>=1.60.1,<2.0.0)
Requires-Dist: httpx (>=0.26.0,<0.27.0)
Requires-Dist: indexify_text_splitter (>=0.1.1,<0.2.0)
Requires-Dist: jinja2 (>=3.1.3,<4.0.0)
Requires-Dist: nanoid (>=2.0.0,<3.0.0)
Requires-Dist: netifaces2 (==0.0.21)
Requires-Dist: protobuf (>=4.25.2,<5.0.0)
Requires-Dist: pydantic (>=2.6.1,<3.0.0)
Requires-Dist: rich (>=13.7.1,<14.0.0)
Requires-Dist: typer[all] (>=0.9.0,<0.10.0)
Requires-Dist: uvicorn (>=0.27.0.post1,<0.28.0)
Requires-Dist: websockets (>=12.0,<13.0)
Description-Content-Type: text/markdown

# Indexify Extractor SDK 

[![PyPI version](https://badge.fury.io/py/indexify-extractor-sdk.svg)](https://badge.fury.io/py/indexify-extractor-sdk)

The Indexify Extractor SDK lets you build new extractors that pull information out of unstructured data sources.

Several extractors are already available at https://github.com/tensorlakeai/indexify. If none of them fits your use case, use this SDK to build your own.



## Install the SDK
Install the SDK from PyPI:
```bash
virtualenv ve
source ve/bin/activate
pip install indexify-extractor-sdk
```

## Implement the extractor
Implement the `Extractor` interface in a Python file, for example `my_extractor.py`:
```python
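import json
from typing import List

# Assumed import path for the SDK types; adjust if your installed version differs.
from indexify_extractor_sdk import Content, Extractor, Feature
from pydantic import BaseModel


class InputParams(BaseModel):
    # Placeholder parameter model (an assumption); add the fields your extractor needs.
    pass


class MyExtractor(Extractor):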
    input_mime_types = ["text/plain", "application/pdf", "image/jpeg"]

    def __init__(self):
        super().__init__()

    def extract(self, content: Content, params: InputParams) -> List[Content]:
        return [
            Content.from_text(
                text="Hello World",
                features=[
                    Feature.embedding(values=[1, 2, 3]),
                    Feature.metadata(json.loads('{"a": 1, "b": "foo"}')),
                ],
                labels={"url": "test.com"},
            ),
            Content.from_text(
                text="Pipe Baz",
                features=[Feature.embedding(values=[1, 2, 3])],
                labels={"url": "test.com"},
            ),
        ]

    def sample_input(self) -> Content:
        return Content.from_text("hello world")

```

## Test the extractor
Run the extractor locally with the command-line tool that ships with the SDK, passing it either some text or a file:
```bash
indexify-extractor local my_extractor.py:MyExtractor --text "hello"
```
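The `my_extractor.py:MyExtractor` argument names a Python file and a class inside it. As a rough illustration of how such a `file.py:ClassName` string can be resolved, here is a minimal standard-library sketch. The helper `load_class` is hypothetical and is not part of the SDK; the command-line tool's own loading logic may differ.

```python
import importlib.util
from pathlib import Path


def load_class(spec: str):
    """Load a class from a 'path/to/file.py:ClassName' string (hypothetical helper)."""
    # Split on the last colon so paths that contain colons still work.
    file_path, sep, class_name = spec.rpartition(":")
    if not sep or not file_path or not class_name:
        raise ValueError(f"expected 'file.py:ClassName', got {spec!r}")

    # Build a module from the file without requiring it to be on sys.path.
    module_name = Path(file_path).stem
    module_spec = importlib.util.spec_from_file_location(module_name, file_path)
    if module_spec is None or module_spec.loader is None:
        raise ImportError(f"cannot load module from {file_path!r}")
    module = importlib.util.module_from_spec(module_spec)
    module_spec.loader.exec_module(module)

    # Raises AttributeError if the class is not defined in the file.
    return getattr(module, class_name)
```

Calling `load_class("my_extractor.py:MyExtractor")` would return the class object, which can then be instantiated and exercised directly in a unit test.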

## Deploy the extractor

Once you are ready to deploy the new extractor and build pipelines with it, package the extractor, deploy as many copies as you want, and point them at the Indexify server. The server exposes two addresses: the coordinator address, from which your extractor receives extraction tasks, and the ingestion address, to which your extractor writes the extracted content.
```bash
indexify-extractor join-server my_extractor.py:MyExtractor --coordinator-addr localhost:8950 --ingestion-addr localhost:8900
```
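Before joining the server, it can help to confirm that both addresses accept TCP connections. The following is a minimal sketch using only the standard library. The helpers `parse_addr` and `is_reachable` are hypothetical and not part of the SDK, and a successful connection only shows that something is listening on the port, not that it is an Indexify server.

```python
import socket


def parse_addr(addr: str) -> tuple[str, int]:
    """Split a 'host:port' string into (host, port). Hypothetical helper."""
    host, sep, port = addr.rpartition(":")
    if not sep or not host or not port.isdigit():
        raise ValueError(f"expected 'host:port', got {addr!r}")
    return host, int(port)


def is_reachable(addr: str, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to 'host:port' succeeds within the timeout."""
    host, port = parse_addr(addr)
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


if __name__ == "__main__":
    # Addresses taken from the join-server example; adjust to your deployment.
    for name, addr in [("coordinator", "localhost:8950"), ("ingestion", "localhost:8900")]:
        status = "reachable" if is_reachable(addr) else "not reachable"
        print(f"{name} at {addr}: {status}")
```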


