Metadata-Version: 2.1
Name: extract_sfm
Version: 2.0
Summary: Knowledge Graph Extraction for SFM dataset
Home-page: https://github.com/Panmani/KGE
Author: Yueen Ma
License: UNKNOWN
Description: # Knowledge Graph Extraction
        
        We designed a pipeline that extracts a special kind of knowledge graph in which a person's name is recognized and his/her rank, role, title, and organization are related to that person. It is not expected to perform perfectly, recognizing every relevant person and excluding every irrelevant one. Rather, it is intended as a first step toward reducing the workload involved in manually extracting such knowledge by combing through a large number of documents.
        
        This pipeline consists of two major components: Named Entity Recognition (NER) and Relation Extraction (RE). Named Entity Recognition uses a BiLSTM-CNNs-CRF model to recognize names, ranks, roles, titles, and organizations in raw text files. Relation Extraction then relates each name to its corresponding rank, role, title, or organization.
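        As a toy illustration of the two stages (the entity mentions, labels, and linking rule below are hypothetical examples, not the output of the actual models):

        ```python
        # Toy sketch of the two pipeline stages on hypothetical data.
        # Stage 1: NER produces labeled entity mentions over the raw text.
        entities = [
            ("Jane Doe", "PERSON"),
            ("General", "RANK"),
            ("3rd Brigade", "ORGANIZATION"),
        ]

        # Stage 2: RE links each non-person entity to a person,
        # yielding knowledge-graph triples.
        person = next(text for text, label in entities if label == "PERSON")
        triples = [(person, "has_" + label.lower(), text)
                   for text, label in entities if label != "PERSON"]
        print(triples)
        ```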
        
        Example:
        ![Example](images/brat_stn.png)
        
        ## Dependencies
        TensorFlow 2.2.0 <br>
        TensorFlow-addons <br>
        spaCy <br>
        NumPy <br>
        DyNet <br>
        pathlib <br>
        
        ## Install
        Package: https://pypi.org/project/extract-sfm/
        ```shell
        $ pip install extract_sfm
        ```
        
        
        ## Usage
        
        ### Method 1
        
        Create a python file and write:
        ```python
        import extract_sfm
        
        extract_sfm.extract("/PATH/TO/DIRECTORY/OF/INPUT/FILES")
        ```
        Then run the Python file. This may take a while to finish.
        
        ### Method 2
        
        Download this GitHub repository.
        Then, under the project root directory, run the Python script:
        
        ```shell
        $ python pipeline.py /PATH/TO/DIRECTORY/OF/INPUT/FILES
        ```
        > Note: Use absolute path.
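        If you have a relative path, the absolute path can be obtained with the standard library (a minimal sketch; the directory name "input_files" is hypothetical):

        ```python
        import os

        # Resolve a (possibly relative) input directory to an absolute path
        # before passing it to the pipeline.
        input_dir = os.path.abspath("input_files")
        print(input_dir)
        ```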
        
        
        ## Website
        1. Copy NER_v2, RE, pipeline.py into the "SERVER/KGE" directory
        2. Install npm dependencies under the "SERVER" directory: express, path, multer
        ```
          $ npm install <package name>
        ```
        3. Run the server by typing in:
        ```
          $ node server.js
        ```
        
        ![Example](images/website.jpeg)
        
        ## Environment Setup
        ```
        tensorflow 2.2.0
          pip install tensorflow
          pip install tensorflow-addons
        
        spaCy (macOS)
          pip install -U spacy
          python3 -m spacy download en_core_web_sm
        
        DyNet
          pip install dynet
        
        pathlib
          pip install pathlib
        ```
        
        
        ## NER Documentation
        ```
        TRAINING
          Dataset:
            1. SFM starter dataset: https://github.com/security-force-monitor/nlp_starter_dataset
            2. CONLL2003: https://github.com/guillaumegenthial/tf_ner/tree/master/data/example
            3. A set of known organizations from the starter dataset
            Note: Title and role were collapsed into one class
        
          Usage:
            1) Prepare data
              $ python process.py
              $ cd SFM_STARTER
              $ python build_vocab.py
              $ python build_glove.py
              $ cd ..
        
            2) Train model
              $ python train.py
        
            3) Make predictions
              $ python pred.py
        
            4) Evaluate model
              $ python eval.py
              $ python eval_class.py
        
          Files:
            process.py: 1) preprocesses the dataset by recording info in dicts,
                              which are saved in two pickle files: dataset_labels.pickle, dataset_sentences.pickle
                        2) converts the SFM starter dataset to a format that can be used by the model,
                              written to the files {}.words.txt and {}.tags.txt, where {} can be train, valid or test.
            pred.py: generates predictions using the trained model
            eval.py: evaluates the predictions made by the model, which are generated by running pred.py
            eval_class.py: gets precision, recall and F1 score for each class
        
            Other files are from https://github.com/guillaumegenthial/tf_ner
              train.py, tf_metrics.py, SFM_STARTER/build_vocab.py, SFM_STARTER/build_glove.py
        
        PREDICTING
          Usage:
            $ python ner.py <doc_id>.txt
        
          File:
            ner.py: get BRAT format prediction for a text file.
        ```
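        The `{}.words.txt` / `{}.tags.txt` files mentioned above pair one sentence per line with space-separated tokens and tags aligned one-to-one. A minimal sketch of writing such a pair (the sentence and tag names here are hypothetical):

        ```python
        # Write a hypothetical one-sentence training split in the
        # {}.words.txt / {}.tags.txt format: tokens and tags are
        # space-separated and aligned one-to-one, one sentence per line.
        words = ["Gen.", "Jane", "Doe", "leads", "the", "brigade"]
        tags = ["B-RANK", "B-PER", "I-PER", "O", "O", "O"]
        assert len(words) == len(tags)

        with open("train.words.txt", "w") as fw, open("train.tags.txt", "w") as ft:
            fw.write(" ".join(words) + "\n")
            ft.write(" ".join(tags) + "\n")
        ```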
        
        ## RE Documentation
        ```
        jPTDP:
          Before running the following 3 methods, you need to run the dependency parser first, which some of the methods rely on.
          Usage: Go to the jPTDP directory and run
            $ python fast_parse.py <path_to_txt>.txt
          The output will be placed alongside the input text file, in a directory whose name is the same as that of the text file.
        
        
        
        --- METHOD 1: nearest person:
            Assign each non-person named entity to the nearest person that follows it in the text.
        
            Usage:
              1. To extract relations in a single text file:
                (extracted relations will be appended to the .ann file)
                $ python relation_np.py <doc_id>.txt <doc_id>.ann
              2. To generate annotations for a set of text files under <directory>:
                Set "output_dir" in pipeline.sh to <directory> and run:
                $ source pipeline.sh
        
        
        
        --- METHOD 2: dependency parsing
            Assign each non-person named entity to the closest person, where distance is the length of the dependency path between the named entity and the person.
            Constraint: if we choose only between the two persons that appear immediately to the left and to the right, the results could be improved, but the drawbacks are also obvious.
        
            Usage:
              1. To extract relations in a single text file:
                (extracted relations will be appended to the .ann file)
                $ python relation_dep.py <jPTDP_buffer_path> <doc_id>.txt <doc_id>.ann
              2. To generate annotations for a set of text files under <directory>:
                $ source pipeline.sh <directory>
        
        
        
        --- METHOD 3: neural networks
            Use the dependency path and its length as features to predict which person in the sentence is the best option.
            The best model is saved in "model_86.h5"
        
            Usage:
              Predictions are made on the files in "pred_path" and are written in place; "pred_path" can be set in config.py.
              $ python pred.py
        ```
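        Method 1's nearest-person rule can be sketched as follows (the entity offsets, labels, and fallback rule are hypothetical simplifications, not the actual relation_np.py logic):

        ```python
        # Sketch of the nearest-person heuristic (Method 1): assign each
        # non-person entity to the person mention that follows it most
        # closely in the text. Offsets and entities are hypothetical.
        entities = [
            (0, "RANK", "General"),          # (start offset, label, text)
            (8, "PERSON", "Jane Doe"),
            (30, "ORGANIZATION", "3rd Brigade"),
            (50, "PERSON", "John Smith"),
        ]

        persons = [(s, t) for s, lab, t in entities if lab == "PERSON"]

        relations = []
        for start, label, text in entities:
            if label == "PERSON":
                continue
            # Nearest person at or after the entity; fall back to the last one.
            following = [(s, t) for s, t in persons if s >= start]
            person = following[0][1] if following else persons[-1][1]
            relations.append((person, label, text))

        print(relations)
        ```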
        
Platform: UNKNOWN
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Requires-Python: >=3.6
Description-Content-Type: text/markdown
