Metadata-Version: 2.1
Name: corpus-preprocess
Version: 0.0.3
Summary: Utility functions to preprocess Phil. legalese in weasel-based flows.
Home-page: https://mv3.dev
Author: Marcelino G. Veloso III
Author-email: contact@mv3.dev
Requires-Python: >=3.11,<4.0
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Requires-Dist: inflect (>=7.0,<8.0)
Requires-Dist: ipykernel (>=6.27.1,<7.0.0)
Requires-Dist: rich (>=13.7.0,<14.0.0)
Requires-Dist: roman (>=4.1,<5.0)
Requires-Dist: spacy[apple] (>=3.7.2,<4.0.0)
Requires-Dist: sqlite-utils (>=3.36,<4.0)
Project-URL: Documentation, https://justmars.github.io/corpus-preprocess
Project-URL: Repository, https://github.com/justmars/corpus-preprocess
Description-Content-Type: text/markdown

# corpus-preprocess

![Github CI](https://github.com/justmars/corpus-preprocess/actions/workflows/main.yml/badge.svg)

Utility functions to preprocess Phil. legalese in [weasel](https://github.com/explosion/weasel)-based flows:

1. [lexcat-proj](https://github.com/justmars/lexcat-proj); and
2. [lexcat-multi](https://github.com/justmars/lexcat-multi)

> [!IMPORTANT]
> Relies on the private [corpus-assets](https://github.com/justmars/corpus-assets) repository being cloned locally.

```yml
- corpus-assets: #  folder should have the following structure:
  - data: # used as data folder in tokenization
    - single_tokens.json
    - report_publishers.json
  - ents: # collected in `setup_span_ruler.py`
    - casenames.txt # each line is a clean case
    - clean_statute_titles.txt # each line is a clean title
  - concepts: # collected in `setup_span_ruler.py`
    - political: # main subject category
      - bill_of_rights: # sub-topic
        - patterns.json # contains matcher files
        - q.txt # contains lines which can be used to query the database
  - metas: # collected in `setup_span_ruler.py`
    - artifacts:
      - axiom:
        - patterns.json # same
        - q.txt # same
```
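
Before building a pipeline, it may help to confirm that the local clone matches the layout above. The following is a minimal sketch, not part of this package: `EXPECTED` and `missing_asset_paths()` are hypothetical names, and the list of entries is simply copied from the tree shown above.

```python
from pathlib import Path

# Entries copied from the tree above (an assumption; adjust if your clone differs).
EXPECTED = (
    "data/single_tokens.json",
    "data/report_publishers.json",
    "ents/casenames.txt",
    "ents/clean_statute_titles.txt",
    "concepts",
    "metas",
)


def missing_asset_paths(root: Path) -> list[str]:
    """Return the expected files/folders that do not exist under `root`."""
    return [rel for rel in EXPECTED if not (root / rel).exists()]
```

An empty list from `missing_asset_paths(Path("../corpus-assets"))` would suggest the folder is laid out as expected.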

## Custom tokenizer / span ruler

```py
import spacy
from spacy.language import Language
from spacy.util import filter_spans

from .setup_span_ruler import set_patterns_from_assets
from .setup_tokenizer import customize_tokenizer
from .tokens_single import import_data_tokens
from .utils import validated_path

# drop overlapping spans (filter_spans prefers the longest); "ruler" is the span ruler's default spans key
@Language.component(name="filter_added_spans")
def filter_added_spans(doc):
    doc.spans["ruler"] = filter_spans(doc.spans["ruler"])
    return doc

# initialize model, get special rules for tokenization, here: tokens_dir = /corpus_assets/data
rules_file = validated_path(tokens_dir)
special_rules = import_data_tokens(data_path=rules_file)
nlp = spacy.load("en_core_web_sm", exclude=("ner", "senter"))
nlp.tokenizer = customize_tokenizer(nlp, special_rules)

# prepare patterns for the span ruler, here assets_dir = /corpus_assets
span_patterns = set_patterns_from_assets(path=validated_path(assets_dir))
ruler = nlp.add_pipe("span_ruler", config={"phrase_matcher_attr": "LOWER"})
ruler.add_patterns(span_patterns)
nlp.add_pipe("filter_added_spans")
nlp.to_disk("models/")  # will save entire directory which includes the pipeline
```

> [!NOTE]
> Loading the model can take a while as more patterns are included via `set_patterns_from_assets()`, e.g.
> 130k pattern files take about 90 seconds.

## Processes

### Generate queries

The `q.txt` lines will be used as criteria to fetch relevant segments from the database.

The db file should have an "opinion_segments" table with full-text search (FTS) enabled on the "text" column. `/scripts/extract.py`
utilizes [table.search()](https://sqlite-utils.datasette.io/en/stable/python-api.html#searching-with-table-search).
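
As a rough illustration of how `q.txt` lines could be combined into a single full-text query, here is a minimal sketch. `sketch_fts_expr()` is a hypothetical stand-in and not the package's `create_fts_expr()`, whose actual logic may differ; it assumes each line should be matched as a quoted phrase (FTS5-style quoting) and that phrases are OR-ed together.

```python
from pathlib import Path


def sketch_fts_expr(path: Path) -> str:
    """Join the non-empty lines of every `q.txt` under `path` into one OR-ed query."""
    phrases: dict[str, None] = {}  # dict keeps insertion order and drops repeats
    for q_file in sorted(path.glob("**/q.txt")):
        for line in q_file.read_text(encoding="utf-8").splitlines():
            if phrase := line.strip():
                phrases[phrase] = None
    # wrap each phrase in double quotes, doubling any embedded quote
    return " OR ".join('"' + p.replace('"', '""') + '"' for p in phrases)
```

The resulting string could then be passed as the `q` argument of `table.search()`.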

See [code](./corpus_preprocess/asset_extractors.py)

```py
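from pathlib import Path

from sqlite_utils import Database

# `create_fts_expr` and `filter_unique_texts` are helpers from this package


def extract_txt_from_db(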
    source_db_file: str,
    path: Path,
    max_segments: int,
    min_char_segment: int = 100,
    max_char_segment: int = 3000,
    is_unique_txt: bool = True,
):
    """An fts expression is auto-generated from `q.txt` files found in the `path`. This
    expression is used to fetch rows of text that match the aggregated query."""
    db = Database(source_db_file)
    tbl = db["opinion_segments"]
    rows = tbl.search(  # type: ignore
        q=create_fts_expr(path),
        where="category='ruling' and char_count > :min_char and char_count < :max_char ",
        where_args={"min_char": min_char_segment, "max_char": max_char_segment},
        limit=max_segments,
        columns=["text", "id"],
    )
    if is_unique_txt:
        rows = filter_unique_texts(rows)
    return rows
```
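
The `is_unique_txt` flag implies some form of de-duplication of the returned rows. A minimal sketch of what that step might look like follows; `sketch_filter_unique_texts()` is a hypothetical name, and the choice to compare texts after collapsing whitespace and lowercasing is an assumption, not necessarily what `filter_unique_texts()` does.

```python
from collections.abc import Iterable, Iterator


def sketch_filter_unique_texts(rows: Iterable[dict]) -> Iterator[dict]:
    """Yield only the first row seen for each normalized `text` value."""
    seen: set[str] = set()
    for row in rows:
        # assumption: texts differing only in spacing or case count as duplicates
        key = " ".join(row["text"].split()).lower()
        if key not in seen:
            seen.add(key)
            yield row
```

Because it is a generator, rows are still consumed lazily, which matches how `table.search()` yields its results.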

### Create matcher patterns

A [SpanRuler](https://spacy.io/api/spanruler) component will be based on `patterns.json` (with `q.txt` as phrases). These patterns are aggregated via `set_patterns_from_assets()`. See [code](./corpus_preprocess/setup_span_ruler.py):

```py
def set_patterns_from_assets(path: Path):
    axioms = axiom.collect_patterns(path.joinpath("meta"))
    concepts = create_concept_patterns(path.joinpath("concepts"))
    ents = extract_ents(path.joinpath("ents"))
    return axioms + concepts + ents
```
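
To make the aggregation more concrete, the following sketch shows one way a concepts folder could be turned into span ruler patterns. It is illustrative only: `sketch_concept_patterns()` is a hypothetical name, the `"concept"` label is made up, and it assumes that `patterns.json` holds a JSON list of token patterns and that each pattern `id` takes the form `category/sub-topic`.

```python
import json
from pathlib import Path


def sketch_concept_patterns(concepts_dir: Path, label: str = "concept") -> list[dict]:
    """Collect token patterns and phrase patterns from each category/sub-topic folder."""
    patterns: list[dict] = []
    for sub in sorted(d for d in concepts_dir.glob("*/*") if d.is_dir()):
        span_id = f"{sub.parent.name}/{sub.name}"  # e.g. "political/bill_of_rights"
        matcher_file = sub / "patterns.json"
        if matcher_file.exists():
            # assumed shape: [[{"LOWER": "due"}, {"LOWER": "process"}], ...]
            for token_pattern in json.loads(matcher_file.read_text(encoding="utf-8")):
                patterns.append({"label": label, "pattern": token_pattern, "id": span_id})
        phrase_file = sub / "q.txt"
        if phrase_file.exists():
            for line in phrase_file.read_text(encoding="utf-8").splitlines():
                if phrase := line.strip():
                    # a plain string is treated as a phrase pattern by the span ruler
                    patterns.append({"label": label, "pattern": phrase, "id": span_id})
    return patterns
```

Each dictionary follows the `label` / `pattern` / `id` shape accepted by `ruler.add_patterns()`.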

### Categorize queried segments via patterns found

A [TextCategorizer](https://spacy.io/api/textcategorizer) component can be trained using the results of the span ruler. Sample code:

```py
from collections import Counter

from spacy.language import Language
from spacy.tokens import Doc


@Language.factory(name="add_cats_from_spans")
class AddTextCatComponent:
    def __init__(self, nlp: Language, name: str, path: str):
        self.nlp = nlp
        options = list({p["id"].split("/")[0] for p in create_patterns(path)})  # type: ignore
        if len(options) == 1:
            options.append(f"not_{options[0]}")
        self.options = options

    def __call__(self, doc) -> Doc:
        default = {op: 0.0 for op in self.options}
        cats = [self.nlp.vocab.strings[s.id].split("/")[0] for s in doc.spans["sc"]]
        doc.cats = default | {k: 1.0 for k, _ in Counter(cats).items()}
        return doc
```

> [!NOTE]
> If `textcat` is in the pipeline and only one label is found, it will error out, hence the need for a `not_` option. If `textcat_multilabel` is used, then a single category is fine.
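
The scoring logic of the component can be tried without spaCy. The sketch below mirrors the `__call__` body above using plain strings in place of span ids; `cats_from_span_ids()` is a hypothetical helper written only for illustration.

```python
def cats_from_span_ids(options: list[str], span_ids: list[str]) -> dict[str, float]:
    """Score 1.0 for every option matched by a span id prefix, 0.0 otherwise."""
    default = {op: 0.0 for op in options}
    # the text before the first "/" is the category, e.g. "political/review" -> "political"
    found = {span_id.split("/")[0] for span_id in span_ids}
    return default | {label: 1.0 for label in found}
```

As in the component above, a document with no matching spans keeps every score at 0.0, including the `not_` option.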

## Prerequisites to lexcat-*

item | desc | `project.yml` declaration
-- | -- | --
db | sqlite database to fetch segments[^1] | `db_file`
corpus-assets | A folder to retrieve q.txt and patterns.json files | `patterns_dir`
corpus-preprocess | This toolkit | see usage in `/scripts/build.py` and `/scripts/extract.py`

[^1]: Although it might be better to allow segment access via lawsql's API.

## Installation of lexcat-*

Clone the above repos, then create and activate a virtual environment and install from `requirements.txt`:

```sh
python -m venv .venv && \
source .venv/bin/activate && \
python -m pip install -U pip && \
python -W ignore -m pip install -r requirements.txt && \
weasel run init
```

## lexcat-proj

- Results in a model trained on a specific **concept** category
- Need to adjust [project.yml](project.yml)'s **name**, **topic_dir**, and **total_segments** variables (`vars`).
- Running `weasel run all` produces packages/en_lex_`name`_`total_segments`-0.0.0/dist
- The output is based on _q.txt_ and _patterns.json_ files sourced from e.g. ../patterns/`topic_dir`.
- Alternatively, override the variables on the command line, e.g. `weasel run all . --vars.topic_dir <value> --vars.name <value>`

### broad implementation

topic | name | status
-- | -- | --
political | pol | ok
labor | labor | ok
criminal | crim | ok
civil | civ | ok
remedial | rem | -
commercial | com | -
ethics | eths | -

Example use on command line (note `.`)[^2]:

```sh
weasel run all . \
    --vars.topic_dir criminal \
    --vars.name crim \
    --vars.total_segments 5000
```

[^2]: The override of weasel project variables `vars` on the command line [requires tinkering](https://github.com/explosion/spaCy/issues/8818)

### granular implementation

topic | name | status
-- | -- | --
political/review | pol_rev | ok
political/sovereignty | pol_sov | ok
political/bill_of_rights | pol_bill | ok
political/administrative | pol_adm | ok

Example use on command line (note `.`):

```sh
weasel run all . \
    --vars.topic_dir political/administrative \
    --vars.name pol_adm \
    --vars.total_segments 250
```
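
When several sub-topics need to be built in sequence, the command above can be generated programmatically. This is a minimal sketch under stated assumptions: `GRANULAR_RUNS` and `weasel_command()` are hypothetical names, the topic/name pairs are taken from the granular table, and 250 segments is reused from the example rather than being a recommended value.

```python
import shlex

# topic_dir -> name, taken from the granular implementation table
GRANULAR_RUNS = {
    "political/review": "pol_rev",
    "political/sovereignty": "pol_sov",
    "political/bill_of_rights": "pol_bill",
    "political/administrative": "pol_adm",
}


def weasel_command(topic_dir: str, name: str, total_segments: int) -> list[str]:
    """Build the argument list for one `weasel run all .` invocation with overrides."""
    return [
        "weasel", "run", "all", ".",
        "--vars.topic_dir", topic_dir,
        "--vars.name", name,
        "--vars.total_segments", str(total_segments),
    ]


for topic_dir, name in GRANULAR_RUNS.items():
    # printed for review; pass the list to subprocess.run(..., check=True) to execute
    print(shlex.join(weasel_command(topic_dir, name, 250)))
```

Printing the commands first makes it easy to check the overrides before committing to a long training run.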

## lexcat-multi

- Results in a model trained on all **concept** categories
- Each category's example files are found in assets
- Running `weasel run all` produces packages/en_`lexcat`-0.0.0/dist
- The output is based on _q.txt_ and _patterns.json_ files sourced from e.g. ../`patterns` (the parent directory)

## Models

There are two models to consider; both are created under `/training`.

### rule-based, weak supervision via keywords

1. The first model is a _rule-based_ temporary model.
2. Basic pipeline makes use of a tokenizer and SpanRuler to make adjustments to _doc.spans_.
3. The pipeline is applied to segments fetched from the database.
4. The model is built via `scripts/build.py`.
5. The config of this model can be seen in `/training/{name}_ruler/config.cfg`
6. The purpose of this model is seen in `weasel run bin`, which uses it to output a `corpus/train.spacy`.

### lexcat, generate model to test on prodigy

1. The second model is the _statistical_ "training" `lexcat`.
2. It utilizes a separate `lexcat_proj/config.cfg` together with the `corpus/train.spacy` output by the _rule-based_ model.
3. The purpose of this model, found in `training/lexcat/model-best` after `weasel run train`, is to package it for later use.
4. This model becomes a weak supervision model that can be checked by human annotators later.

## Packaged models

Install via filepath, e.g.

```sh
pip install ../lexcat-proj/packages/en_lex_labor_5000-0.0.0/dist/en_lex_labor_5000-0.0.0.tar.gz # poetry add
```

This will enable:

```py
import spacy

nlp = spacy.load('en_lex_labor_5000')
```

## Gotchas

### weasel

1. See CLI overrides in [weasel, previously spacy projects](https://github.com/explosion/spaCy/issues/8818)
2. There are too many warnings, hence the `-W ignore` option used when running Python command-line scripts.
3. Do not name a script or function file `tokenize.py`; it shadows the standard library's `tokenize` module and results in `AttributeError: partially initialized module 'inspect' has no attribute 'getmro' (most likely due to a circular import)`
4. Although `project.yml` generates Markdown output, it does not respect full Markdown formatting (e.g. headers, tables, enumerations) within `project.yml` fields like description, hence the need for this NOTES.md file as a supplement.
5. In creating Language.factories configs, using `Path` as a type results in `Fatal Python error: Segmentation fault`.

