Metadata-Version: 2.1
Name: edu-segmentation
Version: 0.0.115
Summary: To improve EDU segmentation performance using Segbot. As Segbot has an encoder-decoder model architecture, we can replace bidirectional GRU encoder with generative pretraining models such as BART and T5. Evaluate the new model using the RST dataset by using few-shot based settings (e.g. 100 examples) to train the model, instead of using the full dataset.
Author: Your Name
Author-email: you@example.com
Requires-Python: >=3.9,<4.0
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Requires-Dist: CacheControl (==0.12.11)
Requires-Dist: Jinja2 (==3.1.2)
Requires-Dist: MarkupSafe (==2.1.2)
Requires-Dist: PyYAML (==6.0)
Requires-Dist: Pygments (==2.15.1)
Requires-Dist: attrs (==23.1.0)
Requires-Dist: bleach (==6.0.0)
Requires-Dist: build (==0.10.0)
Requires-Dist: certifi (==2022.12.7)
Requires-Dist: charset-normalizer (==3.1.0)
Requires-Dist: cleo (==2.0.1)
Requires-Dist: click (==8.1.3)
Requires-Dist: colorama (==0.4.6)
Requires-Dist: crashtest (==0.4.1)
Requires-Dist: distlib (==0.3.6)
Requires-Dist: docutils (==0.19)
Requires-Dist: dulwich (==0.21.3)
Requires-Dist: filelock (==3.12.0)
Requires-Dist: fsspec (==2023.4.0)
Requires-Dist: html5lib (==1.1)
Requires-Dist: huggingface-hub (==0.14.1)
Requires-Dist: idna (==3.4)
Requires-Dist: importlib-metadata (==6.6.0)
Requires-Dist: installer (==0.7.0)
Requires-Dist: joblib (==1.2.0)
Requires-Dist: jsonschema (==4.17.3)
Requires-Dist: keyring (==23.13.1)
Requires-Dist: lockfile (==0.12.2)
Requires-Dist: markdown-it-py (==2.2.0)
Requires-Dist: mdurl (==0.1.2)
Requires-Dist: more-itertools (==9.1.0)
Requires-Dist: mpmath (==1.3.0)
Requires-Dist: msgpack (==1.0.5)
Requires-Dist: networkx (==3.1)
Requires-Dist: nltk (==3.8.1)
Requires-Dist: numpy (==1.24.3)
Requires-Dist: packaging (==23.1)
Requires-Dist: pexpect (==4.8.0)
Requires-Dist: pkginfo (==1.9.6)
Requires-Dist: platformdirs (==2.6.2)
Requires-Dist: poetry (==1.4.2)
Requires-Dist: poetry-core (==1.5.2)
Requires-Dist: poetry-plugin-export (==1.3.1)
Requires-Dist: ptyprocess (==0.7.0)
Requires-Dist: pyproject_hooks (==1.0.0)
Requires-Dist: pyrsistent (==0.19.3)
Requires-Dist: pywin32-ctypes (==0.2.0)
Requires-Dist: rapidfuzz (==2.15.1)
Requires-Dist: readme-renderer (==37.3)
Requires-Dist: regex (==2023.3.23)
Requires-Dist: requests (==2.29.0)
Requires-Dist: requests-toolbelt (==0.10.1)
Requires-Dist: rfc3986 (==2.0.0)
Requires-Dist: rich (==13.3.5)
Requires-Dist: shellingham (==1.5.0.post1)
Requires-Dist: six (==1.16.0)
Requires-Dist: sympy (==1.11.1)
Requires-Dist: tokenizers (==0.13.3)
Requires-Dist: tomlkit (==0.11.8)
Requires-Dist: torch (==2.0.0)
Requires-Dist: tqdm (==4.65.0)
Requires-Dist: transformers (==4.28.1)
Requires-Dist: trove-classifiers (==2023.4.25)
Requires-Dist: twine (==4.0.2)
Requires-Dist: typing_extensions (==4.5.0)
Requires-Dist: urllib3 (==1.26.15)
Requires-Dist: virtualenv (>20.4.5)
Requires-Dist: webencodings (==0.5.1)
Requires-Dist: zipp (==3.15.0)
Description-Content-Type: text/markdown

Final Year Project on EDU Segmentation:

This project aims to improve EDU (elementary discourse unit) segmentation performance by building on Segbot. Because Segbot has an encoder-decoder architecture, its bidirectional GRU encoder can be replaced with a generative pretrained model such as BART or T5. The new model is evaluated on the RST dataset, and is trained in a few-shot setting (e.g. 100 examples) rather than on the full dataset.

Segbot: <br>
http://138.197.118.157:8000/segbot/ <br>
https://www.ijcai.org/proceedings/2018/0579.pdf
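
The few-shot setting described above can be pictured as drawing a small, reproducible subset of the training data. The snippet below is only a sketch of that idea: the helper `sample_few_shot`, the fixed seed, and the `(sentence, boundary indices)` example format are assumptions made for illustration, not part of this package or of the RST dataset's actual layout.

```python
import random


def sample_few_shot(examples, k=100, seed=0):
    """Return a reproducible random subset of at most k training examples."""
    examples = list(examples)
    if k >= len(examples):
        return examples
    # A dedicated Random instance keeps the draw independent of global state.
    rng = random.Random(seed)
    return rng.sample(examples, k)


# Hypothetical stand-in for the full training set:
# each item is a (sentence, boundary indices) pair.
full_train = [(f"sentence {i}", [i % 5]) for i in range(1000)]

few_shot_train = sample_few_shot(full_train, k=100)
print(len(few_shot_train))
```

Fixing the seed makes repeated runs use the same 100 examples, which keeps comparisons between encoders fair.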

----
### Installation

To use the EDUSegmentation module, follow these steps:

1. Import `download_models` from the `download` module and call it to download all models:<br>
```
from edu_segmentation.download import download_models
download_models()
```

2. Import the `edu_segmentation` module and its related classes:<br>
```
from edu_segmentation.main import EDUSegmentation, ModelFactory, BERTUncasedModel, BERTCasedModel, BARTModel
```

### Usage
The edu_segmentation module provides an easy-to-use interface to perform EDU segmentation using different strategies and models. Follow these steps to use it:

1. Choose a segmentation strategy:<br><br>
You can choose between the default segmentation strategy and a conjunction-based segmentation strategy. <br><br>
<strong>Conjunction-based segmentation strategy:</strong> After the text has been EDU-segmented, any conjunction at the start or end of a segment is split off into its own segment.<br><br>
<strong>Default segmentation strategy:</strong> No post-processing occurs after the text has been EDU-segmented.<br><br>
```
from edu_segmentation.main import DefaultSegmentation, ConjunctionSegmentation
```

2. Create a model using the `ModelFactory`. <br><br>
Choose from BERT Uncased, BERT Cased, or BART models.

```
model_type = "bert_uncased"  # or "bert_cased", "bart"
model = ModelFactory.create_model(model_type)
```

3. Create an instance of `EDUSegmentation` using the chosen model: <br>
```
edu_segmenter = EDUSegmentation(model)
```

4. Segment the text using the chosen strategy: <br>
```
text = "Your input text here."
granularity = "conjunction_words"  # or "default"
conjunctions = ["and", "but", "however"]  # Customize conjunctions if needed
device = 'cpu'  # Choose your device, e.g., 'cuda:0'

segmented_output = edu_segmenter.run(text, granularity, conjunctions, device)
```
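
To make the conjunction-based strategy from step 1 more concrete, here is a minimal sketch of that kind of post-processing. It is not the package's actual implementation: the function name `isolate_conjunctions`, the assumption that segments arrive as a list of strings, and the punctuation handling are all hypothetical choices made for illustration.

```python
def isolate_conjunctions(segments, conjunctions):
    """Split a leading or trailing conjunction off each segment.

    Assumes `segments` is a list of strings; the real output format of
    `EDUSegmentation.run` may differ.
    """
    conj = {c.lower() for c in conjunctions}
    result = []
    for segment in segments:
        words = segment.split()
        if not words:
            continue
        leading, trailing = [], []
        # Compare case-insensitively and ignore attached punctuation.
        if len(words) > 1 and words[0].strip(",.;:").lower() in conj:
            leading = [words.pop(0)]
        if len(words) > 1 and words[-1].strip(",.;:").lower() in conj:
            trailing = [words.pop()]
        result.extend(leading)
        result.append(" ".join(words))
        result.extend(trailing)
    return result


segments = ["The food is good,", "but the service is bad."]
print(isolate_conjunctions(segments, ["and", "but", "however"]))
# ['The food is good,', 'but', 'the service is bad.']
```

With the default strategy, the two input segments would be returned unchanged.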


### Example

Here's a simple example demonstrating how to use the edu_segmentation module:

```
from edu_segmentation.download import download_models
from edu_segmentation.main import ModelFactory, EDUSegmentation

download_models()

# Create a BART model
model = ModelFactory.create_model("bart")  # or "bert_cased" or "bert_uncased"

# Create an instance of EDUSegmentation using the model
edu_segmenter = EDUSegmentation(model)

# Segment the text using the conjunction-based segmentation strategy
text = "The food is good, but the service is bad."
granularity = "conjunction_words"  # or "default"
conjunctions = ["and", "but", "however"]  # customise as needed
device = 'cpu'  # or 'cuda:0'

segmented_output = edu_segmenter.run(text, granularity, conjunctions, device)
print(segmented_output)
```
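
The example above hard-codes `device = 'cpu'`. One possible way to pick the device automatically is sketched below; the helper name `pick_device` is hypothetical and not part of this package. It uses only `torch.cuda.is_available()` and falls back to the CPU when PyTorch cannot be imported.

```python
def pick_device():
    """Return 'cuda:0' when a CUDA GPU is usable, otherwise 'cpu'."""
    try:
        import torch
    except ImportError:
        # Without PyTorch there is no GPU backend to query.
        return "cpu"
    return "cuda:0" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(device)
```

The resulting string can then be passed as the `device` argument, e.g. `edu_segmenter.run(text, granularity, conjunctions, device)`.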
