Metadata-Version: 2.1
Name: pipeline_diacritizer
Version: 1.0.1
Summary: Command-line application to automatically restore the diacritics of an Arabic text.
Home-page: https://github.com/Hamza5/Pipeline-diacritizer
Author: Hamza Abbad
Author-email: hamza.abbad@whut.edu.cn
License: UNKNOWN
Description: # Multi-components system for automatic Arabic diacritics restoration
        
        ## About
        
        This tool is a command-line application written in Python 3 that automatically adds diacritics to
        raw undiacritized Arabic text. To accomplish this task, it combines several techniques: deep learning, rule-based
        corrections, and statistical corrections. The deep learning part was implemented using TensorFlow. It was released
        in support of the research paper
        ["Multi-components system for automatic Arabic diacritization" at ECIR2020](https://ecir2020.org/accepted-papers/).
        
        ## Installation
        
        This tool is available as a Python 3 package, `pipeline-diacritizer`, installable through `pip`. For installation
        instructions, see the
        [**Download and installation** wiki page](https://github.com/Hamza5/Pipeline-diacritizer/wiki/Download-and-installation).
        
        ## Functions
        
        This tool has 4 main functions: preprocessing the data, training on the processed data, testing, and
        restoring the diacritics of an undiacritized text. In addition, it can calculate some statistics on a given dataset and
        the ratio of Out-of-Vocabulary words in a testing set relative to a training set.
        
        The sections below give a quick introduction to each function, without covering every available option. For
        additional options, call any subcommand with the option `--help` or `-h` (e.g.
        `python3 pipeline_diacritizer.py train --help`) or [check the wiki](https://github.com/Hamza5/Pipeline-diacritizer/wiki)
        for more details.
        
        ### Preprocessing
        
        Before new data is fed to this application for training or testing, it needs to be converted to the application's
        standard format: one sentence per line, where a sentence is delimited by a dot, a comma, or an end-of-line
        character.
        
        ```
        $ python3 pipeline_diacritizer.py process <source_file> <destination_file>
        ```
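        
        As a rough illustration of the target format, and not the tool's actual implementation, the following Python
        sketch splits raw text into one sentence per line. The function name `split_sentences` is hypothetical, treating
        the Arabic comma `،` as a delimiter alongside `.` and `,` is an assumption, and the sketch drops the delimiters,
        which the real `process` subcommand may or may not do.
        
        ```python
        import re
        
        # Assumed delimiters: dot, Latin comma, Arabic comma (U+060C), and line breaks.
        _DELIMITERS = re.compile(r"[.,\u060C\n]+")
        
        
        def split_sentences(text):
            """Return the non-empty segments of `text`, one per output sentence."""
            return [part.strip() for part in _DELIMITERS.split(text) if part.strip()]
        
        
        if __name__ == "__main__":
            sample = "first clause, second clause. third clause\nfourth clause"
            # Print one sentence per line, as the standard format requires.
            print("\n".join(split_sentences(sample)))
        ```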
        
        If the data is not yet partitioned into training, validation and testing sets, the program can help in this task using
        the following command:
        
        ```
        $ python3 pipeline_diacritizer.py partition <dataset_file>
        ```
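        
        For readers who want to see what such a split amounts to, here is a minimal Python sketch of a random
        train/validation/test partition. It is an assumption-laden illustration: the function name `partition`, the
        10%/10% ratios, and the fixed seed are all hypothetical and are not taken from the tool itself.
        
        ```python
        import random
        
        
        def partition(lines, val_ratio=0.1, test_ratio=0.1, seed=0):
            """Shuffle `lines` and split them into (train, val, test) lists.
        
            The ratios and the seed are illustrative defaults, not the tool's.
            """
            shuffled = list(lines)
            random.Random(seed).shuffle(shuffled)  # deterministic shuffle for a given seed
            n_val = int(len(shuffled) * val_ratio)
            n_test = int(len(shuffled) * test_ratio)
            val = shuffled[:n_val]
            test = shuffled[n_val:n_val + n_test]
            train = shuffled[n_val + n_test:]
            return train, val, test
        
        
        if __name__ == "__main__":
            sentences = ["sentence %d" % i for i in range(100)]
            train, val, test = partition(sentences)
            print(len(train), len(val), len(test))
        ```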
        
        ### Training
        
        To run training and validation on the selected training and validation sets, use the following command:
        
        ```
        $ python3 pipeline_diacritizer.py train --train-data <train_file> --val-data <val_file>
        ```
        
        ### Testing
        
        To evaluate the performance of the application on a testing set, use this command:
        
        ```
        $ python3 pipeline_diacritizer.py test <test_file>
        ```
        
        ### Diacritization
        
        The following command restores the diacritics of the Arabic words from the supplied text file and outputs a diacritized
        copy:
        
        ```
        $ python3 pipeline_diacritizer.py diacritize <text_file>
        ```
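        
        To make "undiacritized" concrete, the following Python sketch performs the inverse operation: it strips the
        common Arabic diacritic marks (Unicode U+064B to U+0652, i.e. the tanwin forms, fatha, damma, kasra, shadda and
        sukun) from a string. The function name `strip_diacritics` is hypothetical, and the exact set of marks the tool
        itself handles is not stated here.
        
        ```python
        import re
        
        # Fathatan (U+064B) through sukun (U+0652).
        _DIACRITICS = re.compile(r"[\u064B-\u0652]")
        
        
        def strip_diacritics(text):
            """Return `text` with the common Arabic diacritic marks removed."""
            return _DIACRITICS.sub("", text)
        
        
        if __name__ == "__main__":
            diacritized = "\u0643\u064E\u062A\u064E\u0628\u064E"  # kaf, ta, ba, each followed by a fatha
            print(strip_diacritics(diacritized))
        ```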
        
        ### Statistics
        
        To get some statistics about the dataset, such as the counts of tokens, Arabic words, and numbers, use the
        following command:
        
        ```
        $ python3 pipeline_diacritizer.py stat <dataset_file>
        ``` 
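        
        The kind of counts mentioned above can be approximated with a few lines of Python. This is only a sketch under
        stated assumptions: the function name `count_stats`, whitespace tokenization, and the two regular expressions are
        hypothetical choices, and the `stat` subcommand may define tokens, words and numbers differently.
        
        ```python
        import re
        
        # Arabic letters, tatweel and common diacritics (U+0621 to U+0652).
        ARABIC_WORD = re.compile(r"[\u0621-\u0652]+")
        # Integers or decimals with a single dot or comma separator.
        NUMBER = re.compile(r"\d+(?:[.,]\d+)?")
        
        
        def count_stats(text):
            """Count whitespace-separated tokens, Arabic words and numbers in `text`."""
            tokens = text.split()
            return {
                "tokens": len(tokens),
                "arabic_words": sum(1 for t in tokens if ARABIC_WORD.fullmatch(t)),
                "numbers": sum(1 for t in tokens if NUMBER.fullmatch(t)),
            }
        
        
        if __name__ == "__main__":
            print(count_stats("\u0630\u0647\u0628 3 \u0623\u0648\u0644\u0627\u062F ok 2.5"))
        ```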
        
        ### OoV Counting
        
        To calculate the ratio of Out-of-Vocabulary words in the validation/test set relative to the train set, use the
        following command:
        
        ```
        $ python3 pipeline_diacritizer.py oov <train_file> <test_file>
        ```
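        
        The idea behind this ratio can be sketched in Python as follows. The function name `oov_ratio` is hypothetical,
        and two choices here are assumptions rather than the tool's documented behavior: words are obtained by whitespace
        splitting, and the ratio is computed over distinct words (types) instead of word occurrences (tokens).
        
        ```python
        def oov_ratio(train_text, test_text):
            """Fraction of distinct test words that never occur in the training text."""
            train_vocab = set(train_text.split())
            test_vocab = set(test_text.split())
            if not test_vocab:
                return 0.0  # an empty test set has no OoV words by convention
            return len(test_vocab - train_vocab) / len(test_vocab)
        
        
        if __name__ == "__main__":
            # "d" is the only test word missing from the train vocabulary.
            print(oov_ratio("a b c", "a d"))
        ```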
        
        ## License
        
        The Pipeline-diacritizer code is licensed under the
        [MIT License](https://github.com/Hamza5/Pipeline-diacritizer/blob/master/LICENSE.txt).
        
Platform: UNKNOWN
Classifier: Development Status :: 4 - Beta
Classifier: Environment :: Console
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Text Processing :: Linguistic
Requires-Python: >=3.4,<3.8
Description-Content-Type: text/markdown
