Metadata-Version: 2.1
Name: comparisonframe
Version: 0.0.0
Summary: A simple tool to compare textual data against validation sets.
Author: Kyrylo Mordan
Author-email: parachute.repo@gmail.com
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: License :: OSI Approved :: MIT License
Classifier: Topic :: Scientific/Engineering
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: sentence-transformers ==2.2.2
Requires-Dist: dill ==5.0.1
Requires-Dist: pandas
Requires-Dist: attrs >=22.2.0
Requires-Dist: scikit-learn ==1.3.1

# Comparisonframe

Comparison Frame automates the comparison of textual data against a recorded validation set,
using metrics such as character and word counts, punctuation usage, and semantic similarity.
It is useful wherever consistent text analysis is required,
such as evaluating the output of natural language processing models, monitoring content quality,
or tracking changes in textual data over time alongside manual evaluation.

```python
from comparisonframe import ComparisonFrame
```

## Usage examples

The examples cover:
1. creating a validation set and saving it for reuse
2. comparing newly generated data with expected results
3. recording test statuses
4. resetting statuses and flushing records and comparison results

### 1. Creating validation set

#### 1.1 Initialize comparison class


```python
comparer = ComparisonFrame(
    # all arguments are optional
    ## name of a model from the sentence-transformers package
    model_name = "all-mpnet-base-v2",
    ## filenames used to persist state
    record_file = "record_file.csv",  # file where queries and expected results are stored
    results_file = "comparison_results.csv", # file where comparison results will be stored
    embeddings_file = "embeddings.dill", # file where embeddings will be stored
    ## embedder to reuse if one was already defined externally
    embedder = None
)
```

    INFO:sentence_transformers.SentenceTransformer:Load pretrained SentenceTransformer: all-mpnet-base-v2
    INFO:sentence_transformers.SentenceTransformer:Use pytorch device: cpu


#### 1.2 Recording queries and expected responses (validation set)


```python
comparer.record_query(query = "Black metal",
                      expected_text = "Black metal is an extreme subgenre of heavy metal music.")
comparer.record_query(query = "Tribulation",
                      expected_text = "Tribulation are a Swedish heavy metal band from Arvika that formed in 2005.")
```

    Batches:   0%|          | 0/1 [00:00<?, ?it/s]

    Batches: 100%|██████████| 1/1 [00:00<00:00,  1.76it/s]
    Batches: 100%|██████████| 1/1 [00:00<00:00,  8.03it/s]


### 2. Comparing with expected results

#### 2.1 Initialize new comparison class


```python
comparer = ComparisonFrame()
```

    INFO:sentence_transformers.SentenceTransformer:Load pretrained SentenceTransformer: all-mpnet-base-v2
    INFO:sentence_transformers.SentenceTransformer:Use pytorch device: cpu


#### 2.2 Show validation set


```python
untested_queries = comparer.get_all_queries(
    ## optionally
    untested_only=True)
print(untested_queries)
```

    ['Black metal', 'Tribulation']



```python
comparer.get_all_records()
```




<div>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>id</th>
      <th>timestamp</th>
      <th>query</th>
      <th>expected_text</th>
      <th>tested</th>
      <th>test_status</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>1</td>
      <td>2023-11-04 03:28:48</td>
      <td>Black metal</td>
      <td>Black metal is an extreme subgenre of heavy me...</td>
      <td>no</td>
      <td>NaN</td>
    </tr>
    <tr>
      <th>1</th>
      <td>2</td>
      <td>2023-11-04 03:28:48</td>
      <td>Tribulation</td>
      <td>Tribulation are a Swedish heavy metal band fro...</td>
      <td>no</td>
      <td>NaN</td>
    </tr>
  </tbody>
</table>
</div>



#### 2.3 Compare newly generated with recorded


```python
valid_answer_query_1 = "Black metal is an extreme subgenre of heavy metal music."
very_similar_answer_query_1 = "Black metal is a subgenre of heavy metal music."
unexpected_answer_query_1 = "Black metals are beautiful and are often used in jewelry design."
```


```python
# compare without marking the record as tested
comparer.compare_with_record(query = "Black metal",
                             provided_text = valid_answer_query_1,
                             mark_as_tested=False)
comparer.compare_with_record(query = "Black metal",
                             provided_text = very_similar_answer_query_1,
                             mark_as_tested=False)
comparer.compare_with_record(query = "Black metal",
                             provided_text = unexpected_answer_query_1,
                             mark_as_tested=False)
```

    Batches:   0%|          | 0/1 [00:00<?, ?it/s]

    Batches: 100%|██████████| 1/1 [00:00<00:00,  1.66it/s]
    Batches: 100%|██████████| 1/1 [00:00<00:00,  9.88it/s]
    Batches: 100%|██████████| 1/1 [00:00<00:00, 10.24it/s]
    Batches: 100%|██████████| 1/1 [00:00<00:00, 10.22it/s]
    Batches: 100%|██████████| 1/1 [00:00<00:00, 10.25it/s]
    Batches: 100%|██████████| 1/1 [00:00<00:00, 10.38it/s]
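
When the validation set grows, the same comparison can be run over every untested query in a loop. The helper below is a minimal sketch that relies only on the `get_all_queries` and `compare_with_record` calls shown in this README; `compare_untested` and `generate_answer` are hypothetical names introduced for illustration and are not part of the package.

```python
def compare_untested(comparer, generate_answer, mark_as_tested=False):
    """Compare a freshly generated answer against every untested record.

    comparer        -- an object exposing get_all_queries / compare_with_record,
                       for example a ComparisonFrame instance
    generate_answer -- hypothetical callable mapping a query string to the text
                       that should be compared with the expected text
    """
    queries = comparer.get_all_queries(untested_only=True)
    for query in queries:
        comparer.compare_with_record(
            query=query,
            provided_text=generate_answer(query),
            mark_as_tested=mark_as_tested,
        )
    # Return the queries that were processed so the caller can log them.
    return queries
```

With a real `ComparisonFrame`, a call could look like `compare_untested(comparer, my_model_answer)`, where `my_model_answer` stands for whatever function produces the text under test.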


#### 2.4 Check comparison results


```python
comparer.get_comparison_results()
```




<div>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>query</th>
      <th>char_count_diff</th>
      <th>word_count_diff</th>
      <th>line_count_diff</th>
      <th>punctuation_diff</th>
      <th>semantic_similarity</th>
      <th>expected_text</th>
      <th>provided_text</th>
      <th>id</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>Black metal</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>1.000000</td>
      <td>Black metal is an extreme subgenre of heavy me...</td>
      <td>Black metal is an extreme subgenre of heavy me...</td>
      <td>1</td>
    </tr>
    <tr>
      <th>1</th>
      <td>Black metal</td>
      <td>9</td>
      <td>1</td>
      <td>0</td>
      <td>0</td>
      <td>0.974236</td>
      <td>Black metal is an extreme subgenre of heavy me...</td>
      <td>Black metal is a subgenre of heavy metal music.</td>
      <td>1</td>
    </tr>
    <tr>
      <th>2</th>
      <td>Black metal</td>
      <td>8</td>
      <td>1</td>
      <td>0</td>
      <td>0</td>
      <td>0.499244</td>
      <td>Black metal is an extreme subgenre of heavy me...</td>
      <td>Black metals are beautiful and are often used ...</td>
      <td>1</td>
    </tr>
  </tbody>
</table>
</div>
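
To make the columns above easier to interpret, here is a minimal sketch of how such metrics could be computed by hand. It is not the package's implementation: it assumes the `*_diff` columns are absolute differences of simple counts (which agrees with the values in the table) and that `semantic_similarity` is a cosine similarity between sentence embeddings. `surface_diffs` and `cosine_similarity` are hypothetical helper names.

```python
import math
import string


def surface_diffs(expected_text, provided_text):
    """Absolute differences in simple surface statistics of two texts."""
    def stats(text):
        return {
            "char_count": len(text),
            "word_count": len(text.split()),
            "line_count": len(text.splitlines()),
            "punctuation": sum(ch in string.punctuation for ch in text),
        }

    exp, prov = stats(expected_text), stats(provided_text)
    return {
        "char_count_diff": abs(exp["char_count"] - prov["char_count"]),
        "word_count_diff": abs(exp["word_count"] - prov["word_count"]),
        "line_count_diff": abs(exp["line_count"] - prov["line_count"]),
        "punctuation_diff": abs(exp["punctuation"] - prov["punctuation"]),
    }


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length, non-zero numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


diffs = surface_diffs(
    "Black metal is an extreme subgenre of heavy metal music.",
    "Black metal is a subgenre of heavy metal music.",
)
print(diffs)
# {'char_count_diff': 9, 'word_count_diff': 1, 'line_count_diff': 0, 'punctuation_diff': 0}
```

In practice the vectors passed to `cosine_similarity` would be embeddings produced by the configured sentence-transformers model rather than hand-written lists.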



### 3. Record test statuses


```python
comparer.compare_with_record(query = "Black metal",
                             provided_text = very_similar_answer_query_1,
                             mark_as_tested=True)
```

    Batches: 100%|██████████| 1/1 [00:00<00:00,  9.84it/s]
    Batches: 100%|██████████| 1/1 [00:00<00:00,  8.83it/s]



```python
comparer.get_all_records()
```




<div>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>id</th>
      <th>timestamp</th>
      <th>query</th>
      <th>expected_text</th>
      <th>tested</th>
      <th>test_status</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>1</td>
      <td>2023-11-04 03:28:48</td>
      <td>Black metal</td>
      <td>Black metal is an extreme subgenre of heavy me...</td>
      <td>yes</td>
      <td>pass</td>
    </tr>
    <tr>
      <th>1</th>
      <td>2</td>
      <td>2023-11-04 03:28:48</td>
      <td>Tribulation</td>
      <td>Tribulation are a Swedish heavy metal band fro...</td>
      <td>no</td>
      <td>NaN</td>
    </tr>
  </tbody>
</table>
</div>



### 4. Resetting and flushing results

#### 4.1 Reset test statuses


```python
comparer.reset_record_statuses(
    # optionally
    record_ids = [1]
)
```


```python
comparer.get_all_records()
```




<div>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>id</th>
      <th>timestamp</th>
      <th>query</th>
      <th>expected_text</th>
      <th>tested</th>
      <th>test_status</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>1</td>
      <td>2023-11-04 03:28:48</td>
      <td>Black metal</td>
      <td>Black metal is an extreme subgenre of heavy me...</td>
      <td>no</td>
      <td>NaN</td>
    </tr>
    <tr>
      <th>1</th>
      <td>2</td>
      <td>2023-11-04 03:28:48</td>
      <td>Tribulation</td>
      <td>Tribulation are a Swedish heavy metal band fro...</td>
      <td>no</td>
      <td>NaN</td>
    </tr>
  </tbody>
</table>
</div>



#### 4.2 Flush comparison results


```python
comparer.flush_comparison_results()
```


```python
comparer.get_comparison_results()
```

    ERROR:ComparisonFrame:No results file found. Please perform some comparisons first.


#### 4.3 Flush records


```python
comparer.flush_records()
```


```python
comparer.get_all_records()
```




<div>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>id</th>
      <th>timestamp</th>
      <th>query</th>
      <th>expected_text</th>
      <th>tested</th>
      <th>test_status</th>
    </tr>
  </thead>
  <tbody>
  </tbody>
</table>
</div>
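
The three cleanup calls from this section can be bundled into one helper, for instance to restore a clean state between test runs. This is a sketch under assumptions: `reset_validation_state` is a hypothetical name, and it assumes that calling `reset_record_statuses()` without `record_ids` resets every record, which the README implies by marking the argument as optional but does not state outright.

```python
def reset_validation_state(comparer, drop_records=False):
    """Reset test statuses and comparison results; optionally drop records.

    comparer     -- an object exposing the reset/flush methods used in
                    section 4, for example a ComparisonFrame instance
    drop_records -- when True, also remove the recorded validation set
    """
    # Assumption: with no record_ids, all statuses are reset.
    comparer.reset_record_statuses()
    comparer.flush_comparison_results()
    if drop_records:
        comparer.flush_records()
```

Keeping `drop_records` off by default means the validation set survives a routine reset and is removed only on explicit request.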


