Metadata-Version: 2.1
Name: corpus-patterns
Version: 0.1.2
Summary: Building blocks for spacy Matcher patterns
Home-page: https://mv3.dev
Author: Marcelino G. Veloso III
Author-email: contact@mv3.dev
Requires-Python: >=3.11,<4.0
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Requires-Dist: environs (>=9.5.0,<10.0.0)
Requires-Dist: inflect (>=7.0,<8.0)
Requires-Dist: rich (>=13.7.0,<14.0.0)
Requires-Dist: roman (>=4.1,<5.0)
Requires-Dist: spacy[apple] (>=3.7.2,<4.0.0)
Requires-Dist: sqlite-utils (>=3.36,<4.0)
Project-URL: Documentation, https://justmars.github.io/corpus-patterns
Project-URL: Repository, https://github.com/justmars/corpus-patterns
Description-Content-Type: text/markdown

# corpus-patterns

![Github CI](https://github.com/justmars/corpus-patterns/actions/workflows/main.yml/badge.svg)

A preparatory utils library.

## Create a custom tokenizer

```py
from corpus_patterns import set_tokenizer

nlp = spacy.blank("en")
nlp.tokenizer = set_tokenizer(nlp)
```

The tokenizer:

1. Removes dashes from infixes
2. Adds prefix/suffix rules for parenthesis/brackets
3. Adds special exceptions to treat dotted text as a single token

Use with modified config file:

```py
@spacy.registry.tokenizers("test")  # type: ignore
def create_corpus_tokenizer():
    def create_tokenizer(nlp):
        return set_tokenizer(nlp)
    return create_tokenizer

nlp = spacy.load("en_core_web_sm", config={"nlp": {"tokenizer": {"@tokenizers": "test"}}},
)
```

## Add .jsonl files to directory

Each file will contain lines of spacy matcher patterns.

```py
from corpus_patterns import create_rules
from pathlib import Path

create_rules(folder=Path("location-here"))  # check directory
```

## Utils

1. `annotate_fragments()` - given an nlp object and some `*.txt` files, create a single annotation `*.jsonl` file
2. `extract_lines_from_txt_files()` - accepts an iterator of `*.txt` files and yields each line (after sorting the same and ensuring uniqueness of content).
3. `split_data()` - given a list of text strings, split the same into two groups and return a dictionary containing these groups based on the ratio provided (defaults to 0.80)

