Metadata-Version: 2.1
Name: corpus-preprocess
Version: 0.0.4
Summary: Utility functions to preprocess Phil. legalese in weasel-based flows.
Home-page: https://mv3.dev
Author: Marcelino G. Veloso III
Author-email: contact@mv3.dev
Requires-Python: >=3.11,<4.0
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Requires-Dist: inflect (>=7.0,<8.0)
Requires-Dist: ipykernel (>=6.27.1,<7.0.0)
Requires-Dist: rich (>=13.7.0,<14.0.0)
Requires-Dist: roman (>=4.1,<5.0)
Requires-Dist: spacy[apple] (>=3.7.2,<4.0.0)
Requires-Dist: sqlite-utils (>=3.36,<4.0)
Project-URL: Documentation, https://justmars.github.io/corpus-preprocess
Project-URL: Repository, https://github.com/justmars/corpus-preprocess
Description-Content-Type: text/markdown

# corpus-preprocess

![Github CI](https://github.com/justmars/corpus-preprocess/actions/workflows/main.yml/badge.svg)

Utility functions to preprocess Phil. legalese in [weasel](https://github.com/explosion/weasel)-based flows:

1. [lexcat-proj](https://github.com/justmars/lexcat-proj); and
2. [lexcat-multi](https://github.com/justmars/lexcat-multi)

> [!IMPORTANT]
> Relies on a private [corpus-assets](https://github.com/justmars/corpus-assets) repository that must be cloned locally.

```yml
- corpus-assets: #  folder should have the following structure:
  - data: # used as data folder in tokenization
    - single_tokens.json
    - report_publishers.json
  - ents: # collected in `setup_span_ruler.py`
    - casenames.txt # each line is a clean case
    - clean_statute_titles.txt # each line is a clean title
  - concepts: # collected in `setup_span_ruler.py`
    - political: # main subject category
      - bill_of_rights: # sub-topic
        - patterns.json # contains matcher files
        - q.txt # contains lines which can be used to query the database
  - metas: # collected in `setup_span_ruler.py`
    - artifacts:
      - axiom:
        - patterns.json # same
        - q.txt # same
```
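
Before building a pipeline, it can help to confirm that the local clone matches the layout above. The following is a minimal sketch, not part of this package: `EXPECTED_PATHS` and `missing_asset_paths()` are hypothetical names, and the listed paths are simply copied from the tree above.

```python
from pathlib import Path

# Assumption: these relative paths mirror the tree above; adjust them to your clone.
EXPECTED_PATHS = (
    "data/single_tokens.json",
    "data/report_publishers.json",
    "ents/casenames.txt",
    "ents/clean_statute_titles.txt",
    "concepts",
    "metas",
)


def missing_asset_paths(assets_dir: Path) -> list[str]:
    """Return the expected relative paths that do not exist under `assets_dir`."""
    return [rel for rel in EXPECTED_PATHS if not (assets_dir / rel).exists()]
```

An empty list means every expected file and folder was found.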

## Custom tokenizer / span ruler

```py
import spacy
from spacy.language import Language
from spacy.util import filter_spans

from .setup_span_ruler import set_patterns_from_assets
from .setup_tokenizer import customize_tokenizer
from .tokens_single import import_data_tokens
from .utils import validated_path

# drop duplicate / overlapping spans; "ruler" is the default spans key of span_ruler
@Language.component(name="filter_added_spans")
def filter_added_spans(doc):
    doc.spans["ruler"] = filter_spans(doc.spans["ruler"])
    return doc

# initialize model, get special rules for tokenization, here: tokens_dir = /corpus_assets/data
rules_file = validated_path(tokens_dir)
special_rules = import_data_tokens(data_path=rules_file)
nlp = spacy.load("en_core_web_sm", exclude=("ner", "senter"))
nlp.tokenizer = customize_tokenizer(nlp, special_rules)

# prepare patterns for span ruler, here assets_dir = /corpus_assets
span_patterns = set_patterns_from_assets(path=validated_path(assets_dir))
ruler = nlp.add_pipe("span_ruler", config={"phrase_matcher_attr": "LOWER"})
ruler.add_patterns(span_patterns)
nlp.add_pipe("filter_added_spans")
nlp.to_disk("models/")  # will save entire directory which includes the pipeline
```

> [!NOTE]
> Loading the model can take a while when `set_patterns_from_assets()` includes many patterns, e.g.
> 130k pattern files take about 90 seconds.

## Processes

### Generate queries

The `q.txt` lines will be used as criteria to fetch relevant segments from the database.
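
How the lines become a single query is handled by `create_fts_expr()`, whose implementation is not shown here. As a rough illustration only, the sketch below uses a hypothetical `build_fts_expr()` and assumes each `q.txt` line is a plain phrase that can be quoted and OR-ed into one FTS5 expression; the package's real function may differ.

```python
from pathlib import Path


def build_fts_expr(path: Path) -> str:
    """Join every non-empty `q.txt` line under `path` into one FTS5 OR-expression."""
    phrases: list[str] = []
    for q_file in sorted(path.glob("**/q.txt")):
        for line in q_file.read_text().splitlines():
            phrase = line.strip()
            if not phrase:
                continue
            # Double any embedded quotes, then wrap the line as an FTS5 string.
            quoted = '"' + phrase.replace('"', '""') + '"'
            if quoted not in phrases:  # skip repeated phrases
                phrases.append(quoted)
    return " OR ".join(phrases)
```

Quoting each line keeps multi-word phrases together and avoids FTS5 interpreting stray punctuation as query syntax.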

The db file should have an "opinion_segments" table with full-text search (FTS) enabled on the "text" column. `/scripts/extract.py`
utilizes [table.search()](https://sqlite-utils.datasette.io/en/stable/python-api.html#searching-with-table-search). See [code](./corpus_preprocess/asset_extractors.py):

```py
from pathlib import Path

from sqlite_utils import Database


def extract_txt_from_db(
    source_db_file: str,
    path: Path,
    max_segments: int,
    min_char_segment: int = 100,
    max_char_segment: int = 3000,
    is_unique_txt: bool = True,
):
    """An fts expression is auto-generated by `q.txt` files found in the `path`. This
    expression is used to generate strings of text that match the aggregated query."""
    db = Database(source_db_file)
    tbl = db["opinion_segments"]
    rows = tbl.search(  # type: ignore
        q=create_fts_expr(path),  # an sqlite fts5 expression is made via q.txt files
        where="category='ruling' and char_count > :min_char and char_count < :max_char",
        where_args={"min_char": min_char_segment, "max_char": max_char_segment},
        limit=max_segments,
        columns=["text", "id"],
    )
    if is_unique_txt:
        rows = filter_unique_texts(rows)
    return rows
```
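
To make the table requirement concrete, here is a self-contained sketch that uses only the standard-library `sqlite3` module as a stand-in for sqlite-utils. The sample rows and the `search_rulings()` helper are made up for illustration, and the `opinion_segments_fts` index name follows the `<table>_fts` convention that sqlite-utils is understood to use. It assumes the local SQLite build includes FTS5.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE opinion_segments (
        id TEXT PRIMARY KEY, text TEXT, category TEXT, char_count INTEGER
    );
    -- External-content FTS5 index over the "text" column.
    CREATE VIRTUAL TABLE opinion_segments_fts
        USING fts5(text, content='opinion_segments');
    """
)
sample_rows = [  # made-up segments, for illustration only
    ("a1", "The accused was denied due process of law.", "ruling"),
    ("a2", "Petitioner invokes equal protection.", "ruling"),
    ("a3", "Due process is discussed in this annex.", "annex"),
]
conn.executemany(
    "INSERT INTO opinion_segments VALUES (?, ?, ?, length(?))",
    [(id_, text, category, text) for id_, text, category in sample_rows],
)
# Populate the index from the content table.
conn.execute(
    "INSERT INTO opinion_segments_fts (rowid, text) "
    "SELECT rowid, text FROM opinion_segments"
)


def search_rulings(expr: str, min_char: int, max_char: int, limit: int) -> list[tuple]:
    """Return (text, id) pairs of rulings matching the FTS5 expression."""
    sql = """
        SELECT opinion_segments.text, opinion_segments.id
        FROM opinion_segments
        JOIN opinion_segments_fts
          ON opinion_segments_fts.rowid = opinion_segments.rowid
        WHERE opinion_segments_fts MATCH :q
          AND category = 'ruling'
          AND char_count > :min_char AND char_count < :max_char
        ORDER BY opinion_segments_fts.rank
        LIMIT :limit
    """
    params = {"q": expr, "min_char": min_char, "max_char": max_char, "limit": limit}
    return conn.execute(sql, params).fetchall()
```

The `WHERE` clause repeats the same `category` and `char_count` filters that `extract_txt_from_db()` passes to `table.search()`.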

### Create matcher patterns

A [SpanRuler](https://spacy.io/api/spanruler) component will be based on `patterns.json` (with `q.txt` as phrases). These patterns are aggregated via `set_patterns_from_assets()` but can be used individually. See [code](./corpus_preprocess/setup_span_ruler.py):

```py
def set_patterns_from_assets(path: Path):
    axioms = axiom.collect_patterns(path.joinpath("meta"))
    concepts = create_concept_patterns(path.joinpath("concepts"))
    ents = extract_ents(path.joinpath("ents"))
    return axioms + concepts + ents
```
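
For orientation, a SpanRuler phrase pattern is a dict with a `label`, a `pattern` and an optional `id`. The sketch below is an assumption about the general shape of the concept patterns, not the code behind `create_concept_patterns()`: the helper name `concept_phrase_patterns()` and the `"concept"` label are hypothetical, and only `q.txt` phrases are handled (`patterns.json` is ignored).

```python
from pathlib import Path


def concept_phrase_patterns(concepts_dir: Path, label: str = "concept") -> list[dict]:
    """Turn each `q.txt` line into a phrase pattern whose id is its folder location."""
    patterns = []
    for q_file in sorted(concepts_dir.glob("*/*/q.txt")):
        # e.g. concepts/political/bill_of_rights/q.txt -> "political/bill_of_rights"
        concept_id = q_file.parent.relative_to(concepts_dir).as_posix()
        for line in q_file.read_text().splitlines():
            if phrase := line.strip():
                patterns.append({"label": label, "pattern": phrase, "id": concept_id})
    return patterns
```

A list in this shape can be passed to `ruler.add_patterns()`.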

### Enabling textcat_multilabel

Each pattern returned by `create_concept_patterns()` carries an id that is its location in corpus-assets, e.g. `political/bill_of_rights`. This makes it possible to create a textcat_multilabel component using the `span.id`, e.g.:

```py
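from spacy.language import Language
from spacy.tokens import Doc

# concept_patterns: the output of create_concept_patterns()
textcat_options = [concept["id"].split("/")[0] for concept in concept_patterns]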

@Language.factory(name="add_cats_from_spans")
class AddTextCatComponent:
    def __init__(self, nlp: Language, name: str, options: list[str]):
        self.nlp = nlp
        self.options = options

    def __call__(self, doc) -> Doc:
        doc.cats = {op: 0.0 for op in self.options}
        for span in doc.spans["sc"]:
            if span.id:  # some spans won't have an id
                value = self.nlp.vocab.strings[span.id]
                if "/" in value:  # e.g. political/bill_of_rights
                    main_topic = value.split("/")[0]  # just political
                    if main_topic in self.options:
                        if doc.cats[main_topic] == 0.0:
                            doc.cats[main_topic] = 1.0
        return doc
```
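
The mapping from span ids to categories can also be checked without loading a model. The sketch below restates the loop above on plain strings; `cats_from_span_ids()` is a hypothetical helper, not part of this package.

```python
def cats_from_span_ids(span_ids: list[str], options: list[str]) -> dict[str, float]:
    """Score an option 1.0 if any span id starts with `<option>/`, else 0.0."""
    cats = {option: 0.0 for option in options}
    for value in span_ids:
        if "/" not in value:
            continue  # ids without a sub-topic are skipped, as in the component
        main_topic = value.split("/")[0]  # e.g. "political"
        if main_topic in cats:
            cats[main_topic] = 1.0
    return cats
```

In a pipeline, the factory's `options` argument would presumably be supplied through the config, e.g. `nlp.add_pipe("add_cats_from_spans", config={"options": textcat_options})`.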

