Metadata-Version: 2.1
Name: timescale-vector
Version: 0.0.1
Summary: Python library for storing vector data in Postgres
Home-page: https://github.com/timescale/python-vector
Author: Matvey Arye
Author-email: mat@timescale.com
License: Apache Software License 2.0
Keywords: nbdev jupyter notebook python
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Natural Language :: English
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: License :: OSI Approved :: Apache Software License
Requires-Python: >=3.7
Description-Content-Type: text/markdown
License-File: LICENSE
License-File: NOTICE
Requires-Dist: asyncpg
Requires-Dist: psycopg2
Requires-Dist: pgvector
Provides-Extra: dev
Requires-Dist: python-dotenv ; extra == 'dev'

# Timescale-vector

<!-- WARNING: THIS FILE WAS AUTOGENERATED! DO NOT EDIT! -->

A Python library for storing vector data in Postgres.

## Install

``` sh
pip install timescale_vector
```

## Basic Usage

Load your Postgres credentials. The safest way is with a `.env` file:

``` python
from dotenv import load_dotenv, find_dotenv
import os
```

``` python
_ = load_dotenv(find_dotenv())
service_url = os.environ['TIMESCALE_SERVICE_URL']
```
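If the variable is missing, `os.environ[...]` raises a bare `KeyError`. As a minimal sketch, a small helper can fail with a clearer message instead. The name `require_env` is hypothetical and is not part of timescale_vector:

```python
import os


def require_env(name: str) -> str:
    """Return an environment variable's value or raise a descriptive error.

    Hypothetical helper for illustration; not part of timescale_vector.
    """
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(
            f"{name} is not set; add it to your .env file or export it in your shell"
        )
    return value
```

After `load_dotenv(find_dotenv())` has run, `service_url = require_env("TIMESCALE_SERVICE_URL")` would then replace the direct `os.environ` lookup.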

Next, create the client.

This takes three arguments:

- A connection string

- The name of the collection

- The number of dimensions

In this tutorial, we use the async client, but a sync client with an
almost identical interface is also available.

``` python
from timescale_vector import client
```

``` python
vec = client.Async(service_url, "my_data", 2)
```

Next, create the tables for the collection:

``` python
await vec.create_tables()
```

Next, insert some data. The data record contains:

- A UUID to uniquely identify the embedding
- A json blob of metadata about the embedding
- The text the embedding represents
- The embedding itself

Because this data already includes UUIDs, we only allow upserts:

``` python
import uuid
```

``` python
await vec.upsert([
    (uuid.uuid4(), '''{"animal":"fox"}''', "the brown fox", [1.0, 1.3]),
    (uuid.uuid4(), '''{"animal":"fox", "action":"jump"}''', "jumped over the", [1.0, 10.8]),
])
```
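When the metadata starts out as Python dictionaries, the records can be assembled programmatically. The following is a sketch only: `make_records` is a hypothetical helper rather than part of the library, and the embeddings are the same placeholder values used above.

```python
import json
import uuid


def make_records(items):
    """Turn (metadata_dict, text, embedding) triples into upsert-ready tuples.

    Hypothetical helper for illustration; not part of timescale_vector.
    """
    return [
        # Serialize the metadata to a JSON string, matching the example above.
        (uuid.uuid4(), json.dumps(metadata), text, embedding)
        for metadata, text, embedding in items
    ]


records = make_records([
    ({"animal": "fox"}, "the brown fox", [1.0, 1.3]),
    ({"animal": "fox", "action": "jump"}, "jumped over the", [1.0, 10.8]),
])
```

The resulting list has the same shape as the literal passed to `vec.upsert` above.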

Now you can query for similar items:

``` python
await vec.search([1.0, 9.0])
```

    [<Record id=UUID('e5dbaa7c-081b-4131-be18-c81ce47fc864') metadata={'action': 'jump', 'animal': 'fox'} contents='jumped over the' embedding=array([ 1. , 10.8], dtype=float32) distance=0.00016793422934946456>,
     <Record id=UUID('2cdb8cbd-5dd7-4555-926a-5efafb4b1cf0') metadata={'animal': 'fox'} contents='the brown fox' embedding=array([1. , 1.3], dtype=float32) distance=0.14489260377438218>]
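The `distance` values in this output are consistent with cosine distance (1 minus cosine similarity) between the query and each stored embedding. That the metric is cosine distance is an inference from these numbers, not something this README states. A pure-Python sketch reproduces them to within float32 rounding:

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


query = [1.0, 9.0]
near = cosine_distance(query, [1.0, 10.8])  # roughly 0.000168
far = cosine_distance(query, [1.0, 1.3])    # roughly 0.1449
```

Because cosine distance depends only on direction, `[1.0, 10.8]` ranks first: it points almost the same way as the query even though its magnitude differs.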

You can specify the number of records to return.

``` python
await vec.search([1.0, 9.0], limit=1)
```

    [<Record id=UUID('e5dbaa7c-081b-4131-be18-c81ce47fc864') metadata={'action': 'jump', 'animal': 'fox'} contents='jumped over the' embedding=array([ 1. , 10.8], dtype=float32) distance=0.00016793422934946456>]

You can also specify a filter on the metadata as a simple dictionary:

``` python
await vec.search([1.0, 9.0], limit=1, filter={"action": "jump"})
```

    [<Record id=UUID('e5dbaa7c-081b-4131-be18-c81ce47fc864') metadata={'action': 'jump', 'animal': 'fox'} contents='jumped over the' embedding=array([ 1. , 10.8], dtype=float32) distance=0.00016793422934946456>]

You can also specify a list of filter dictionaries, where an item is
returned if it matches any of the dictionaries:

``` python
await vec.search([1.0, 9.0], limit=2, filter=[{"action": "jump"}, {"animal": "fox"}])
```

    [<Record id=UUID('e5dbaa7c-081b-4131-be18-c81ce47fc864') metadata={'action': 'jump', 'animal': 'fox'} contents='jumped over the' embedding=array([ 1. , 10.8], dtype=float32) distance=0.00016793422934946456>,
     <Record id=UUID('2cdb8cbd-5dd7-4555-926a-5efafb4b1cf0') metadata={'animal': 'fox'} contents='the brown fox' embedding=array([1. , 1.3], dtype=float32) distance=0.14489260377438218>]
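The filter semantics can be modeled in plain Python. This sketch assumes that a single dictionary matches when every one of its key/value pairs appears in the record's metadata (a containment rule, which is an assumption here), and that a list matches when any of its dictionaries does. The `matches` function is illustrative and not part of the library:

```python
def matches(metadata, flt):
    """Return True if `metadata` satisfies the filter `flt`.

    A dict filter matches when all of its key/value pairs are present in
    the metadata (assumed containment rule). A list of dicts matches when
    at least one of its dicts matches.
    """
    if isinstance(flt, dict):
        return all(k in metadata and metadata[k] == v for k, v in flt.items())
    return any(matches(metadata, f) for f in flt)


rows = [{"animal": "fox"}, {"animal": "fox", "action": "jump"}]
single = [matches(m, {"action": "jump"}) for m in rows]
either = [matches(m, [{"action": "jump"}, {"animal": "fox"}]) for m in rows]
```

Under this model, only the second row satisfies `{"action": "jump"}`, while both rows satisfy the two-dictionary list, which agrees with the search results shown above.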

You can access the fields of a record using the index constants on `client`:

``` python
records = await vec.search([1.0, 9.0], limit=1, filter={"action": "jump"})
records[0][client.SEARCH_RESULT_ID_IDX]
```

    UUID('e5dbaa7c-081b-4131-be18-c81ce47fc864')

``` python
records[0][client.SEARCH_RESULT_METADATA_IDX]
```

    {'action': 'jump', 'animal': 'fox'}

``` python
records[0][client.SEARCH_RESULT_CONTENTS_IDX]
```

    'jumped over the'

``` python
records[0][client.SEARCH_RESULT_EMBEDDING_IDX]
```

    array([ 1. , 10.8], dtype=float32)

``` python
records[0][client.SEARCH_RESULT_DISTANCE_IDX]
```

    0.00016793422934946456

You can delete by ID:

``` python
await vec.delete_by_ids([records[0][client.SEARCH_RESULT_ID_IDX]])
```

    []

Or you can delete by metadata filters:

``` python
await vec.delete_by_metadata({"action": "jump"})
```

    []

To delete all records use:

``` python
await vec.delete_all()
```

## Advanced Usage

### Indexing

Indexing speeds up queries over your data.

By default, we set up indexes to query your data by the UUID and the
metadata.

If you have many rows, you also need to set up an index on the embedding.
You can create a timescale-vector index on the table with:

``` python
await vec.create_embedding_index(client.TimescaleVectorIndex())
```

Please see the
[`TimescaleVectorIndex`](https://timescale.github.io/python-vector/vector.html#timescalevectorindex)
documentation for advanced options. You can drop the index with:

``` python
await vec.drop_embedding_index()
```

While we recommend the timescale-vector index type, two more index
types are available:

- The pgvector ivfflat index with
  [`IvfflatIndex`](https://timescale.github.io/python-vector/vector.html#ivfflatindex)
- The pgvector hnsw index with
  [`HNSWIndex`](https://timescale.github.io/python-vector/vector.html#hnswindex)

Usage examples below:

``` python
await vec.create_embedding_index(client.IvfflatIndex())
await vec.drop_embedding_index()
await vec.create_embedding_index(client.HNSWIndex())
await vec.drop_embedding_index()
```

Please note that it is very important to create the ivfflat index only
after you have data in the table.

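One way to make the ordering for ivfflat hard to get wrong is to wrap both steps in a single coroutine. This is a sketch: `load_then_index` is a hypothetical helper, and it only calls the `upsert` and `create_embedding_index` methods shown earlier.

```python
async def load_then_index(vec, records, index):
    """Insert the records first, then build the embedding index.

    Hypothetical helper for illustration; `vec` is a client such as the
    one created with client.Async above, and `index` is an index object
    such as client.IvfflatIndex().
    """
    # Populate the table so the index is built over real data.
    await vec.upsert(records)
    # Only now create the index on the embedding column.
    await vec.create_embedding_index(index)
```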
Please note the community is actively working on new indexing methods
for embeddings. As they become available, we will add them to our client
as well.

### Time-partitioning

In many use cases involving large numbers of embeddings, time is an
important component associated with the embeddings. For example, when
embedding news stories, you often search by time as well as similarity
(e.g. stories related to bitcoin in the past week, or stories about
Clinton in November 2016).

Yet, traditionally, searching by two components, “similarity” and
“time”, is challenging for approximate nearest neighbor (ANN) indexes
and makes the similarity-search index less effective.

One approach to solving this is partitioning the data by time and
creating ANN indexes on each partition individually. Then, during search
you can:

- Step 1: filter out partitions that don’t match the time predicate
- Step 2: perform the similarity search on all matching partitions
- Step 3: combine all the results from each partition in step 2, rerank,
  and filter out results by time.

Step 1 makes the search a lot more efficient by filtering out whole
swaths of data in one go.

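The three steps can be sketched in plain Python. This is an illustration under stated assumptions, not how the library or TimescaleDB implements the search: partitions are modeled as dictionaries with `start`, `end`, and `rows` keys (hypothetical names), a brute-force scan stands in for each partition's ANN index, and cosine distance is assumed as the metric.

```python
import math
from datetime import datetime


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def search_partitions(partitions, query, start, end, limit):
    # Step 1: skip partitions whose time range cannot overlap [start, end).
    candidates = [p for p in partitions if p["start"] < end and p["end"] > start]

    # Step 2: similarity search inside each remaining partition.
    hits = []
    for p in candidates:
        scored = [(cosine_distance(query, emb), ts, text) for ts, text, emb in p["rows"]]
        hits.extend(sorted(scored)[:limit])

    # Step 3: merge, drop rows outside the time range, rerank by distance.
    in_range = [h for h in hits if start <= h[1] < end]
    return sorted(in_range)[:limit]


partitions = [
    {"start": datetime(2018, 8, 10, 0), "end": datetime(2018, 8, 10, 6),
     "rows": [(datetime(2018, 8, 10, 1), "old story", [1.0, 9.0])]},
    {"start": datetime(2018, 8, 10, 12), "end": datetime(2018, 8, 10, 18),
     "rows": [(datetime(2018, 8, 10, 13), "too early", [1.0, 9.0]),
              (datetime(2018, 8, 10, 15, 30), "the brown fox", [1.0, 1.2])]},
]
results = search_partitions(partitions, [1.0, 2.0],
                            start=datetime(2018, 8, 10, 15),
                            end=datetime(2018, 8, 10, 18), limit=4)
```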
Timescale-vector supports time partitioning using TimescaleDB’s
hypertables. To use this feature, simply indicate the length in time for
each partition when creating the client:

``` python
from datetime import timedelta
from datetime import datetime
```

``` python
vec = client.Async(service_url, "my_data_with_time_partition", 2, time_partition_interval=timedelta(hours=6))
await vec.create_tables()
```

Then insert data where the IDs are version 1 UUIDs, whose time
component specifies the time of the embedding. For example, to create an
embedding for the current time, simply do:

``` python
id = uuid.uuid1()
await vec.upsert([(id, {"key": "val"}, "the brown fox", [1.0, 1.2])])
```
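To see what the time component of a version 1 UUID contains, you can decode it with the standard library. This is a sketch of the UUID v1 format itself (a count of 100-nanosecond intervals since 1582-10-15 UTC); `uuid1_to_datetime` is a hypothetical helper, and how the library decodes the timestamp internally is not shown here.

```python
import uuid
from datetime import datetime, timedelta, timezone

# UUID v1 timestamps count 100-nanosecond intervals since 1582-10-15 (UTC).
GREGORIAN_EPOCH = datetime(1582, 10, 15, tzinfo=timezone.utc)


def uuid1_to_datetime(u: uuid.UUID) -> datetime:
    """Return the UTC time embedded in a version 1 UUID."""
    if u.version != 1:
        raise ValueError("only version 1 UUIDs carry a timestamp")
    # Integer-divide by 10 to convert 100-ns ticks to microseconds.
    return GREGORIAN_EPOCH + timedelta(microseconds=u.time // 10)


new_id = uuid.uuid1()
embedded_time = uuid1_to_datetime(new_id)
```

A random `uuid.uuid4()` has no such timestamp, which is why time-partitioned collections need version 1 UUIDs.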

To insert data for a specific time in the past, create the UUID using
our
[`uuid_from_time`](https://timescale.github.io/python-vector/vector.html#uuid_from_time)
function:

``` python
specific_datetime = datetime(2018, 8, 10, 15, 30, 0)
await vec.upsert([(client.uuid_from_time(specific_datetime), {"key": "val"}, "the brown fox", [1.0, 1.2])])
```

You can then query the data by specifying a `uuid_time_filter` in the
search call:

``` python
rec = await vec.search([1.0, 2.0], limit=4, uuid_time_filter=client.UUIDTimeRange(specific_datetime-timedelta(days=7), specific_datetime+timedelta(days=7)))
```

## Development

Please note that this project is developed with
[nbdev](https://nbdev.fast.ai/). Please see that website for the
development process.
