Metadata-Version: 2.1
Name: toolkit-bert-ner
Version: 1.0.2
Summary: Use Google's BERT for Chinese natural language processing tasks such as named entity recognition and provide server services
Home-page: https://github.com/wxl18039675170
Author: Allen WU
Author-email: allen.wu18621039969@gmail.com
License: MIT
Description: # toolkit-bert-ner
        Builds on Google's pre-trained BERT model, adds a BiLSTM layer and a CRF layer, and trains a Chinese named entity recognition model.
        
        ## Download project and install  
        Install the project with pip:  
        ```
        pip install -i https://test.pypi.org/simple/ toolkit-bert-ner==1.0.0
        ```
        
        Or clone the repository and install it from source:
        ```
        git clone http://git.huimeimt.net:8008/ds/toolkit-bert-ner.git
        cd toolkit-bert-ner/
        python3 setup.py install
        ```
        
        If you do not want to install the package, clone the project and use `run.py` to train the model or start the service.   
            
        ## Train model:
        Use -help to list the parameters for training the named entity recognition model. The parameters data_dir, bert_config_file, output_dir, init_checkpoint, and vocab_file are required.
        ```
        toolkit-bert-ner-train -help
        ```
        
        The train/dev/test datasets look like this:
        ```
        海 O
        钓 O
        比 O
        赛 O
        地 O
        点 O
        在 O
        厦 B-LOC
        门 I-LOC
        与 O
        金 B-LOC
        门 I-LOC
        之 O
        间 O
        的 O
        海 O
        域 O
        。 O
        ```
        Each line holds a token followed by its label, and sentences are separated by a blank line. The maximum length of each sentence is set by the max_seq_length parameter.  
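        
        As a rough illustration of this format, the sketch below groups token/label lines into sentences. The function name `read_sentences` is hypothetical and is not part of this project; it only assumes that each non-blank line contains a token and a label separated by whitespace.
        
        ```python
        from typing import Iterable, List, Tuple
        
        Sentence = Tuple[List[str], List[str]]
        
        
        def read_sentences(lines: Iterable[str]) -> List[Sentence]:
            """Group 'token label' lines into (tokens, labels) pairs, one per sentence."""
            sentences: List[Sentence] = []
            tokens: List[str] = []
            labels: List[str] = []
            for line in lines:
                parts = line.split()
                if not parts:
                    # A blank line closes the current sentence.
                    if tokens:
                        sentences.append((tokens, labels))
                        tokens, labels = [], []
                    continue
                if len(parts) < 2:
                    raise ValueError("expected 'token label', got: %r" % line)
                # Assumption: the token is the first field and the label is the last.
                tokens.append(parts[0])
                labels.append(parts[-1])
            if tokens:
                # The input may not end with a blank line.
                sentences.append((tokens, labels))
            return sentences
        
        
        sample = ["厦 B-LOC", "门 I-LOC", "与 O", "", "金 B-LOC", "门 I-LOC"]
        for sent_tokens, sent_labels in read_sentences(sample):
            print("".join(sent_tokens), sent_labels)
        ```
        
        To read a real dataset file, pass an open file object (opened with `encoding="utf-8"`) instead of the sample list.
        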
        Training data is available from the git repositories listed under Reference.  
        Train the NER model by running the command below:  
        ```
        toolkit_bert_ner_training \
            -data_dir {your dataset dir} \
            -output_dir {training output dir} \
            -init_checkpoint {Google BERT model dir} \
            -bert_config_file {bert_config.json under the Google BERT model dir} \
            -vocab_file {vocab.txt under the Google BERT model dir}
        ```
        For example, my init_checkpoint is:
        ```
        init_checkpoint = {$HOME}/pre-trained-models/chinese_L-12_H-768_A-12/bert_model.ckpt
        ```
        You can specify the labels with the -label_list parameter; otherwise the project reads the labels from the training data.  
        ```
        # comma-separated list
        -labels 'B-LOC, I-LOC ...'
        # or save the labels in a file such as labels.txt, one label per line
        -labels labels.txt
        ```
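        
        How the project itself parses this argument is not shown here. A minimal sketch of handling both forms, using a hypothetical helper `parse_labels` that is not part of this project, might look like this:
        
        ```python
        import os
        from typing import List
        
        
        def parse_labels(arg: str) -> List[str]:
            """Return labels from a comma-separated string or a one-label-per-line file."""
            if os.path.isfile(arg):
                # File form: one label per line.
                with open(arg, encoding="utf-8") as fh:
                    raw = fh.read().splitlines()
            else:
                # Inline form: labels separated by commas.
                raw = arg.split(",")
            # Strip surrounding whitespace and drop empty entries.
            return [item.strip() for item in raw if item.strip()]
        
        
        print(parse_labels("B-LOC, I-LOC, O"))
        ```
        
        Checking for an existing file first means a value such as labels.txt is read as a file only when that file is actually present.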
        
        After training, the NER model is saved in the {output_dir} you specified on the command line above.  
        ##### My training environment: Tesla P40, 24 GB memory  
        
        ## As Service
        ```
        toolkit-bert-ner-serving-start -help
        ```
        
        Then start the NER service with the command below:
        ```
        toolkit_bert_ner_serving \
            -model_dir C:\workspace\python\BERT_Base\output\ner2 \
            -bert_model_dir F:\chinese_L-12_H-768_A-12 \
            -model_pb_dir C:\workspace\python\BERT_Base\model_pb_dir \
            -mode NER
        ```
        
        Test the client with the code below:  
        #### 1. NER Client
        ```python
        import time
        from bert_base.client import BertClient
        
        with BertClient(show_server_config=False, check_version=False, check_length=False, mode='NER') as bc:
            start_t = time.perf_counter()
            # Named `text` rather than `str` to avoid shadowing the built-in type.
            text = '1月24日，新华社对外发布了中央对雄安新区的指导意见，洋洋洒洒1.2万多字，17次提到北京，4次提到天津，信息量很大，其实也回答了人们关心的很多问题。'
            rst = bc.encode([text, text])
            print('rst:', rst)
            print(time.perf_counter() - start_t)
        ```
        To send input that is already split into characters, pass is_tokenized=True:
        ```python
        rst = bc.encode([list(text), list(text)], is_tokenized=True)
        ```  
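        
        The exact structure of `rst` is not shown above. Assuming the service returns one BIO tag per character (the same scheme as the training data), a hypothetical helper such as `bio_to_spans`, which is not part of this project, could merge the tags into entity strings:
        
        ```python
        from typing import List, Tuple
        
        
        def bio_to_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
            """Merge B-/I- tagged tokens into (entity_text, entity_type) pairs."""
            spans: List[Tuple[str, str]] = []
            current: List[str] = []
            current_type = ""
            for token, tag in zip(tokens, tags):
                if tag.startswith("B-"):
                    # A B- tag always starts a new entity.
                    if current:
                        spans.append(("".join(current), current_type))
                    current, current_type = [token], tag[2:]
                elif tag.startswith("I-") and current and tag[2:] == current_type:
                    # An I- tag of the same type extends the open entity.
                    current.append(token)
                else:
                    # "O", special labels, or a stray I- tag close the open entity.
                    if current:
                        spans.append(("".join(current), current_type))
                    current, current_type = [], ""
            if current:
                spans.append(("".join(current), current_type))
            return spans
        
        
        tokens = list("厦门与金门")
        tags = ["B-LOC", "I-LOC", "O", "B-LOC", "I-LOC"]
        print(bio_to_spans(tokens, tags))
        ```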
        
        ## License
        MIT.  
        
        ## How to train
        #### 1. Download the BERT Chinese model:  
         ```
         wget https://storage.googleapis.com/bert_models/2018_11_03/chinese_L-12_H-768_A-12.zip  
         ```
        #### 2. Unzip the BERT Chinese model into $HOME/pre-trained-models/:  
         ```
        mkdir -p $HOME/pre-trained-models/
        unzip chinese_L-12_H-768_A-12.zip -d $HOME/pre-trained-models/
         ```
        
        #### 3. Train model
        
        ##### First method
        ```
        python3 bert_lstm_ner.py \
            --task_name="NER" \
            --do_train=True \
            --do_eval=True \
            --do_predict=True \
            --data_dir=NERdata \
            --vocab_file=checkpoint/vocab.txt \
            --bert_config_file=checkpoint/bert_config.json \
            --init_checkpoint=checkpoint/bert_model.ckpt \
            --max_seq_length=128 \
            --train_batch_size=32 \
            --learning_rate=2e-5 \
            --num_train_epochs=3.0 \
            --output_dir=./output
        ```
        ##### Or replace the BERT path and project path in bert_lstm_ner.py
        ```
        if os.name == 'nt':  # Windows path config
            bert_path = '{your BERT model path}'
            root_path = '{project path}'
        else:  # Linux path config
            bert_path = '{your BERT model path}'
            root_path = '{project path}'
        ```
        Then run:
        ```
        python3 bert_lstm_ner.py
        ```
        
        ### USING BLSTM-CRF OR ONLY CRF FOR DECODING
        Edit bert_lstm_ner.py around line 450 and set the crf_only argument of add_blstm_crf_layer to True or False.  
        
        CRF-only output layer:
        ```
            blstm_crf = BLSTM_CRF(embedded_chars=embedding, hidden_unit=FLAGS.lstm_size, cell_type=FLAGS.cell, num_layers=FLAGS.num_layers,
                                  dropout_rate=FLAGS.droupout_rate, initializers=initializers, num_labels=num_labels,
                                  seq_length=max_seq_length, labels=labels, lengths=lengths, is_training=is_training)
            rst = blstm_crf.add_blstm_crf_layer(crf_only=True)
        ```
          
          
        BiLSTM with CRF output layer:
        ```
            blstm_crf = BLSTM_CRF(embedded_chars=embedding, hidden_unit=FLAGS.lstm_size, cell_type=FLAGS.cell, num_layers=FLAGS.num_layers,
                                  dropout_rate=FLAGS.droupout_rate, initializers=initializers, num_labels=num_labels,
                                  seq_length=max_seq_length, labels=labels, lengths=lengths, is_training=is_training)
            rst = blstm_crf.add_blstm_crf_layer(crf_only=False)
        ```
        
        ## ONLINE PREDICT
        Once the model has finished training, run:
        ```
        python3 terminal_predict.py
        ```
         
         ## Using NER as Service
        
        #### Service 
        Running NER as a service is simple: run the Python script below from the project root:
        ```
        python3 runs.py \
            -mode NER \
            -bert_model_dir /home/macan/ml/data/chinese_L-12_H-768_A-12 \
            -ner_model_dir /home/macan/ml/data/bert_ner \
            -model_pd_dir /home/macan/ml/workspace/BERT_Base/output/predict_optimizer \
            -num_worker 8
        ```
        
        
        #### Client
        For client usage, see the client_test.py script:
        ```python
        import time
        from client.client import BertClient
        
        # Raw string so the backslashes in the Windows path are not treated as escapes.
        ner_model_dir = r'C:\workspace\python\BERT_Base\output\predict_ner'
        with BertClient(ner_model_dir=ner_model_dir, show_server_config=False, check_version=False, check_length=False, mode='NER') as bc:
            start_t = time.perf_counter()
            # Named `text` rather than `str` to avoid shadowing the built-in type.
            text = '1月24日，新华社对外发布了中央对雄安新区的指导意见，洋洋洒洒1.2万多字，17次提到北京，4次提到天津，信息量很大，其实也回答了人们关心的很多问题。'
            rst = bc.encode([text])
            print('rst:', rst)
            print(time.perf_counter() - start_t)
        ```
        
        
        NOTE: for the input format, refer to the bert-as-service project.    
        Contributions of client code in other languages, such as Java, are welcome.  
        
        ## Training with your own data
        To train the NER model on your own data, modify the get_labels function:
        ```python
        def get_labels(self):
            return ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC", "X", "[CLS]", "[SEP]"]
        ```
        NOTE: "X", "[CLS]", and "[SEP]" are required; replace the remaining entries with the labels used in your data.  
        Alternatively, use the code below to let the program collect the labels from the training data automatically:
        ```python
        def get_labels(self):
            # Reading the labels from the train file carries some risk.
            if os.path.exists(os.path.join(FLAGS.output_dir, 'label_list.pkl')):
                with codecs.open(os.path.join(FLAGS.output_dir, 'label_list.pkl'), 'rb') as rf:
                    self.labels = pickle.load(rf)
            else:
                if len(self.labels) > 0:
                    self.labels = self.labels.union(set(["X", "[CLS]", "[SEP]"]))
                    with codecs.open(os.path.join(FLAGS.output_dir, 'label_list.pkl'), 'wb') as rf:
                        pickle.dump(self.labels, rf)
                else:
                    self.labels = ["O", 'B-TIM', 'I-TIM', "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC", "X", "[CLS]", "[SEP]"]
            return self.labels
        ```
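        
        Because a Python set has no guaranteed iteration order, a label list built from a set can map labels to different ids between runs unless it is sorted or cached. The sketch below uses a hypothetical helper `build_label_map`, which is not part of this project, to show one way of building a stable label list that always includes the three required special labels:
        
        ```python
        from typing import Dict, Iterable, List, Tuple
        
        SPECIAL_LABELS = ["X", "[CLS]", "[SEP]"]
        
        
        def build_label_map(data_labels: Iterable[str]) -> Tuple[List[str], Dict[str, int]]:
            """Return a stable label list plus a label -> id mapping."""
            # Sorting makes the order independent of set iteration order.
            labels = sorted(set(data_labels) - set(SPECIAL_LABELS)) + SPECIAL_LABELS
            # Assumption: ids start at 0; the project's own numbering may differ.
            label_to_id = {label: idx for idx, label in enumerate(labels)}
            return labels, label_to_id
        
        
        labels, label_to_id = build_label_map({"O", "B-LOC", "I-LOC"})
        print(labels)
        print(label_to_id)
        ```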
        
        
        ## Reference: 
        + The evaluation code comes from: https://github.com/guillaumegenthial/tf_metrics/blob/master/tf_metrics/__init__.py
        
        + [https://github.com/google-research/bert](https://github.com/google-research/bert)
              
        + [https://github.com/kyzhouhzau/BERT-NER](https://github.com/kyzhouhzau/BERT-NER)
        
        + [https://github.com/zjy-ucas/ChineseNER](https://github.com/zjy-ucas/ChineseNER)
        
        + [https://github.com/hanxiao/bert-as-service](https://github.com/hanxiao/bert-as-service)
        
        + [https://github.com/macanv/BERT-BiLSTM-CRF-NER](https://github.com/macanv/BERT-BiLSTM-CRF-NER)
Keywords: toolkit_bert_ner nlp ner NER named entity recognition bilstm crf tensorflow machine learning sentence encoding embedding serving
Platform: UNKNOWN
Classifier: Programming Language :: Python :: 3.4
Classifier: Programming Language :: Python :: 3.5
Classifier: Programming Language :: Python :: 3.6
Classifier: Programming Language :: Python :: 3.7
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Description-Content-Type: text/markdown
Provides-Extra: cpu
Provides-Extra: gpu
Provides-Extra: http
