Metadata-Version: 2.1
Name: pdf-struct
Version: 0.3.2
Summary: Logical structure analysis of visually structured documents.
Home-page: https://github.com/stanfordnlp/pdf-struct
Author: Yuta Koreeda
Author-email: yuta.koreeda@hal.hitachi.com
Maintainer: Yuta Koreeda
Maintainer-email: yuta.koreeda@hal.hitachi.com
License: Apache
Platform: UNKNOWN
Classifier: Environment :: Console
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Education
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3.8
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Scientific/Engineering :: Information Analysis
Description-Content-Type: text/markdown
Requires-Dist: click (==7.1.2)
Requires-Dist: numpy (==1.19.1)
Requires-Dist: pdfminer.six (==20200726)
Requires-Dist: regex (==2020.7.14)
Requires-Dist: torch (==1.9.0)
Requires-Dist: tqdm (==4.48.0)
Requires-Dist: transformers (==4.9.1)
Requires-Dist: sklearn (==0.0)
Requires-Dist: joblib (==1.0.0)
Requires-Dist: editdistance (==0.5.3)
Requires-Dist: beautifulsoup4 (==4.9.3)
Requires-Dist: sentencepiece (==0.1.96)
Requires-Dist: wheel
Requires-Dist: twine

# pdf-struct: Logical structure analysis for visually structured documents

This is a tool for extracting fine-grained logical structures (such as boundaries and their hierarchies) from visually structured documents (VSDs) such as PDFs.
pdf-struct is easily customizable to different types of VSDs, and it significantly outperformed baselines in identifying different structures in VSDs.
For example, our system obtained a paragraph boundary detection F1 score of 0.953, which is significantly better than the 0.739 obtained by a popular PDF-to-text tool.
Please note that the current pdf-struct has several limitations:

* It is intended for single-column documents. It does not support multi-column documents.
* Published models are trained on contracts. It may work on general documents, but it has not been tested. Nevertheless, you can train your own model using a corpus of your choice.

Details of pdf-struct can be found in our [paper](https://aclanthology.org/2021.nllp-1.15/), which was published at the Natural Legal Language Processing Workshop 2021.
You can find the dataset for reproducing the paper [here](https://stanfordnlp.github.io/pdf-struct-dataset/).

## Basic Usage

This program runs on Python 3 (tested on 3.8.5).
Install pdf-struct, then run prediction on a PDF file:

```
pip install pdf-struct
```

```
pdf-struct predict --model PDFContractEnFeatureExtractor ${PATH_TO_PDF_FILE}
```

You may choose a pretrained model from https://github.com/stanfordnlp/pdf-struct-models
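
If you need to process many PDF files, the command above can be wrapped in a short script. The following is a minimal sketch, not part of pdf-struct itself: the helper names `build_predict_command` and `predict_directory` are hypothetical, and it assumes that `pdf-struct predict` writes its result to standard output.

```python
import subprocess
from pathlib import Path
from typing import List


def build_predict_command(
    pdf_path: Path, model: str = "PDFContractEnFeatureExtractor"
) -> List[str]:
    """Build the same command line as shown above for a single PDF file."""
    return ["pdf-struct", "predict", "--model", model, str(pdf_path)]


def predict_directory(pdf_dir: Path, out_dir: Path) -> None:
    """Run the predict command on every *.pdf file in pdf_dir.

    Assumption: the prediction is printed to standard output, so it is
    captured and saved as one text file per input PDF.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    for pdf_path in sorted(pdf_dir.glob("*.pdf")):
        result = subprocess.run(
            build_predict_command(pdf_path),
            capture_output=True,
            text=True,
            check=True,  # raise if pdf-struct exits with an error
        )
        (out_dir / (pdf_path.stem + ".txt")).write_text(result.stdout)
```

Calling `predict_directory(Path("contracts"), Path("predictions"))` would then process every PDF in a hypothetical `contracts/` directory.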

## Advanced Usage

This section explains how to create your own dataset and train your own models.

### Prerequisite

To install dependencies, run:

```bash
pip install -r requirements.txt
```

### Getting data ready

First, place your raw documents in a directory of your choice.
They must have one of the following extensions:

* `*.pdf`: PDF files with embedded text. PDFs without embedded text (i.e., those that require OCR) and two-column PDFs are not supported.
* `*.txt`: Plain text files that are visually structured with spaces/line breaks.
* `*.hocr`: HOCR files.

You may handle HTML files by turning them into PDF files:

```bash
# Make sure the output directory exists before printing into it
mkdir -p data/raw
find my_input_directory/ -type f | \
  grep -P 'html$|htm$|HTML$|HTM$' | \
  while IFS= read -r f; do \
    chrome --headless --disable-gpu --print-to-pdf-no-header --print-to-pdf="data/raw/$(basename "$f").pdf" "$f"; \
  done
```


### Creating TSV files for annotation

Create TSV files for annotation:

```bash
pdf-struct init-dataset ${FILE_TYPE} ${RAW_DOCUMENTS_DIR} ${OUTPUT_DIR}
```

where `${FILE_TYPE}` should be one of `pdf`, `txt` or `hocr`.

This will output TSV files to `${OUTPUT_DIR}`.

### Annotating TSV files

Annotate the TSV files that were generated with the `init-dataset` command.

Each line of a TSV file is organized as follows:

```
text<tab>pointer<tab>label
```

`text` is the text extracted from the input document. It should roughly correspond to a line in the document.

`label` (default empty) denotes the transition relationship between that line and the next line.
It should be one of the following:

* c(ontinuous): The next line is part of the same paragraph
* a(ddress): The next line is part of the same paragraph BUT the line break is meaningful. This is intended to be used for things like addresses.
* b(lock): The next line is the start of a new paragraph BUT within the same "block"
* s(ame level): The next line is the start of a new block (thus a new paragraph)
* d(rop): The next line is the start of a new block (thus a new paragraph) that is a child of the current block
* e(liminate): The current line should be deleted
* x (excluded): The current line should be excluded from both training and evaluation
    - In our experiments, we removed things like temporarily two-column lines, signatures, titles, etc.

In the annotation, we introduced the concept of a `block`. It is intended for cases where we want to distinguish listings from paragraphs.
e.g.,

```
Each party must:

    1. Blah blah blah ....
    blah blah blah....
      Blah blah blah....
    blah blah blah....

    2. Blah blah blah...
```

Here, the new paragraph within `1.` at the fifth line is definitely meaningful, and it should not be treated in the same way as the start of `2.` at the eighth line.
We say that the relationship between the fourth and fifth lines (i.e., the label for the fourth line) is `b`.

That being said, we currently treat the `b` and `s` labels in the same way.
In fact, some other labels are also merged in training/evaluation:

* `c` and `a` are merged
* `b` and `s` are merged
* `c`, `b`, `d` and `s` with a pointer are merged
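
The merging rules above can be sketched as a small function. This is only an illustration of the rules as stated here, not pdf-struct's actual implementation: the function name `merge_label` and the merged class names (`"continuous"`, `"same_level"`, `"down"`, `"up"`, `"eliminate"`) are hypothetical. A non-zero `pointer` (explained below) means that a pointer is set.

```python
from typing import Optional

# Labels that may be combined with a pointer according to this README
POINTER_LABELS = {"c", "b", "d", "s"}


def merge_label(label: str, pointer: int) -> Optional[str]:
    """Map an annotated (label, pointer) pair to a merged class.

    Returns None for `x`, which is excluded from training and evaluation.
    The class names are hypothetical and chosen only for readability.
    """
    if label == "x":
        return None
    if pointer != 0 and label in POINTER_LABELS:
        return "up"  # `c`, `b`, `d` and `s` with a pointer are merged
    if label in {"c", "a"}:
        return "continuous"  # `c` and `a` are merged
    if label in {"b", "s"}:
        return "same_level"  # `b` and `s` are merged
    if label == "d":
        return "down"
    if label == "e":
        return "eliminate"
    raise ValueError(f"unknown label: {label!r}")
```

For example, `merge_label("a", 0)` and `merge_label("c", 0)` both give `"continuous"`, while `merge_label("c", 5)` gives `"up"`.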

`pointer` (default 0) is used when the hierarchy goes up.
It should be used along with `c`, `b`, `d` or `s`.
We use a pointer along with different labels because there are occasions where the hierarchy rises AND the next line either continues a paragraph or starts a different one.

e.g.,

```
Blah blah blah...:<tab>0<tab>d
  a. Blah blah blah...<tab>0<tab>s
  b. Blah blah blah...<tab>-1<tab>s
Blah blah blah...:<tab>0<tab>d
  1. Blah blah blah...<tab>0<tab>d
    a) Blah blah blah...<tab>0<tab>c
     blah blah blah...<tab>0<tab>s
    b) Blah blah blah...<tab>5<tab>c
    but this does not include ...<tab>5<tab>s
                       PAGE 1/2<tab>0<tab>e
  2. Blah blah blah...<tab>0<tab>d
```

As you can see, the eighth line uses a pointer along with `c` because the ninth line is actually a continuation of the paragraph from the fifth line.
Pointers are 1-indexed (they start from 1) and 0 denotes no pointer.
A pointer can be set to `-1` to return to the top level of the hierarchy.
The last line should be annotated with pointer `-1` and label `s` (though it is ignored internally).
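
Before training, it can help to check annotated files against the rules described in this section. The following is a minimal sketch of such a check, not a tool shipped with pdf-struct: the function name `validate_annotations` and the wording of its messages are hypothetical, and it assumes that the pointer and label are always the last two tab-separated fields of a line.

```python
from typing import List, Tuple

VALID_LABELS = {"c", "a", "b", "s", "d", "e", "x"}
POINTER_LABELS = {"c", "b", "d", "s"}  # labels that may carry a pointer


def validate_annotations(lines: List[str]) -> List[str]:
    """Return a list of human-readable problems found in one TSV file."""
    problems: List[str] = []
    parsed: List[Tuple[int, int, str]] = []
    for number, line in enumerate(lines, start=1):
        try:
            # Assumption: text<tab>pointer<tab>label, split from the right
            _text, pointer_field, label = line.rstrip("\n").rsplit("\t", 2)
            pointer = int(pointer_field)
        except ValueError:
            problems.append(f"line {number}: expected text<tab>pointer<tab>label")
            continue
        parsed.append((number, pointer, label))
        if label not in VALID_LABELS:
            problems.append(f"line {number}: unknown or empty label {label!r}")
        if pointer != 0 and label in VALID_LABELS and label not in POINTER_LABELS:
            problems.append(f"line {number}: pointer used with label {label!r}")
        # Pointers are 1-indexed line numbers, 0 (none) or -1 (top level)
        if pointer < -1 or pointer > len(lines):
            problems.append(f"line {number}: pointer {pointer} is out of range")
    # The last line should carry pointer -1 and label s
    if parsed and parsed[-1][0] == len(lines) and parsed[-1][1:] != (-1, "s"):
        problems.append("last line: expected pointer -1 and label s")
    return problems
```

An empty list means that no problem was found. Lines that are still unannotated are reported because their label is empty.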

### Evaluating models

You can run experiments with the following command:

```bash
pdf-struct evaluate ${FILE_TYPE} ${FEATURE_EXTRACTOR_NAME} ${RAW_DOCUMENTS_DIR} ${ANNOTATED_DOCUMENTS_DIR}
```

Refer to `pdf-struct evaluate --help` for the list of feature extractors.
This will run k-fold cross-validation over the data.
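
For readers unfamiliar with k-fold cross-validation, the idea can be sketched as follows. This README does not state how pdf-struct forms its folds, so the sketch assumes that whole documents are assigned to folds; the function name `k_fold_splits` is hypothetical and this is not pdf-struct's own code.

```python
from typing import List, Sequence, Tuple


def k_fold_splits(
    documents: Sequence[str], k: int
) -> List[Tuple[List[str], List[str]]]:
    """Split documents into k (train, test) pairs.

    Each document is used for testing exactly once and for training
    in the remaining k - 1 splits.
    """
    if not 2 <= k <= len(documents):
        raise ValueError("k must be between 2 and the number of documents")
    # Assign documents to folds in a round-robin fashion
    folds = [list(documents[i::k]) for i in range(k)]
    splits = []
    for i, test in enumerate(folds):
        train = [doc for j, fold in enumerate(folds) if j != i for doc in fold]
        splits.append((train, test))
    return splits
```

A model is trained on each `train` list and scored on the matching `test` list, and the scores are then aggregated over the k splits.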

### Training models

You can train a new model on your dataset.

```bash
pdf-struct train ${FILE_TYPE} ${FEATURE_EXTRACTOR_NAME} ${RAW_DOCUMENTS_DIR} ${ANNOTATED_DOCUMENTS_DIR} ${MODEL_OUTPUT_PATH}
```

You can then feed `${MODEL_OUTPUT_PATH}` to the `--path` option of `pdf-struct predict`.

## Customizing feature extractor

Coming soon!


