Metadata-Version: 2.1
Name: openspeechcorpus
Version: 0.1.3
Summary: The CLI for performing actions on the Open Speech Corpus
Home-page: https://github.com/open-speech-org/openspeechcorpus-cli
Author: contraslash S.A.S.
Author-email: ma0@contraslash.com
License: MIT
Project-URL: Bug Reports, https://github.com/open-speech-org/openspeechcorpus-cli/issues
Project-URL: Source, https://github.com/open-speech-org/openspeechcorpus-cli
Project-URL: Contraslash, https://contraslash.com/
Description: # Open Speech Corpus CLI
        
        This repository contains the code required to download audio data from
        [openspeechcorpus.com](http://openspeechcorpus.contraslash.com).
        
        So far, the Open Speech Corpus is composed of three subcorpora:
        
        - Tales: A crowdsourced corpus based on readings of Latin American short tales
        - Aphasia: A crowdsourced corpus based on words categorized into 4 levels of difficulty
        - Isolated words: A crowdsourced corpus based on isolated words
        
        To download files from the Tales Project, use:
        
        ```bash
        ops  \
            --output_folder tales/ \
            --output_file tales.txt  \
            --corpus tales
        ```
        
        To download files from the Isolated Words Project, use:
        
        ```bash
        ops  \
            --output_folder isolated_words/ \
            --output_file isolated_words.txt  \
            --corpus words
        ```
        
        To download files from the Aphasia Project, use:
        
        ```bash
        ops  \
            --output_folder aphasia/ \
            --output_file aphasia.txt  \
            --corpus aphasia
        ```
        
        By default the page size is 500. To change the range that is downloaded, use the arguments `--from` and `--to`, e.g.:
        
        ```bash
        ops  \
            --from 500 \
            --to 1000 \
            --output_folder aphasia/ \
            --output_file aphasia.txt  \
            --corpus aphasia
        ```
        
        You can download the whole corpus using the flag `--download_all`:
        
        ```bash
        ops  \
            --output_folder aphasia/ \
            --output_file aphasia.txt  \
            --corpus aphasia \
            --download_all
        ```
        
        If you combine the flag `--download_all` with the flag `--from`, the download starts at the position given by
        `--from` and proceeds with a page size of 500.
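        
        To picture how a range is covered in pages of 500, the sketch below prints consecutive `--from`/`--to` windows.
        It is only an illustration: the helper name `print_pages` is hypothetical, and it assumes that each window ends
        where the next one begins, which this README does not confirm.
        
        ```bash
        # Hypothetical helper: print the --from/--to windows that cover a range
        # in steps of 500 (the default page size mentioned above).
        # Usage: print_pages START END [PAGE_SIZE]
        print_pages() {
            start=$1
            end=$2
            size=${3:-500}
            while [ "$start" -lt "$end" ]; do
                stop=$((start + size))
                # Clamp the last window to the requested end.
                if [ "$stop" -gt "$end" ]; then
                    stop=$end
                fi
                echo "--from $start --to $stop"
                start=$stop
            done
        }
        
        print_pages 500 1700
        # --from 500 --to 1000
        # --from 1000 --to 1500
        # --from 1500 --to 1700
        ```
        
        Each printed pair can be added to an `ops` call like the ones shown above. Whether `--to` is inclusive, and
        whether repeated runs append to or overwrite the `--output_file`, is not stated here, so verify both before
        scripting real downloads.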
        
        ## Recursive Convert
        
        The Open Speech Corpus stores its files in mp4 format, which is undesirable for most audio processing tasks. To
        convert the files to wav format, first install [ffmpeg](https://www.ffmpeg.org/download.html), then run the
        `recursive_convert` utility, which takes the source folder containing the mp4 files as its first argument and
        the output folder as its second argument, e.g.:
        
        ```bash
        recursive_convert aphasia aphasia_wav
        ```
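        
        For readers curious about what such a conversion involves, here is a minimal sketch that does a similar job
        with `find` and `ffmpeg` directly. It is not the implementation of `recursive_convert`; the helper names
        `wav_path_for` and `convert_tree` are hypothetical, and mirroring the source folder layout in the output
        folder is an assumption.
        
        ```bash
        # Hypothetical helper: map a source .mp4 path to its .wav path under DST_ROOT.
        # Pass SRC_ROOT without a trailing slash.
        # Usage: wav_path_for SRC_ROOT DST_ROOT FILE
        wav_path_for() {
            rel=${3#"$1"/}              # path relative to the source root
            echo "$2/${rel%.mp4}.wav"   # same relative path, .wav extension
        }
        
        # Hypothetical helper: convert every .mp4 under SRC_ROOT into a .wav under DST_ROOT.
        # Usage: convert_tree SRC_ROOT DST_ROOT
        convert_tree() {
            find "$1" -type f -name '*.mp4' | while IFS= read -r file; do
                out=$(wav_path_for "$1" "$2" "$file")
                mkdir -p "$(dirname "$out")"
                # -nostdin keeps ffmpeg from consuming the file list on stdin;
                # -y overwrites existing output files without asking.
                ffmpeg -nostdin -loglevel error -y -i "$file" "$out"
            done
        }
        ```
        
        With these definitions loaded, `convert_tree aphasia aphasia_wav` would walk `aphasia` and write one wav file
        per mp4 file into `aphasia_wav`.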
        
        ## CMU Sphinx Configuration
        
        The Open Speech Corpus also provides scripts to generate configurations for
        [CMU Sphinx](https://cmusphinx.github.io/). To generate a configuration, use the command `configure_sphinx`:
        
        ```bash
        configure_sphinx simple_words \
            --transcription_file words.txt \
            --etc_folder simple_words/etc \
            --test_size 0.5
        ```
        
Platform: UNKNOWN
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Description-Content-Type: text/markdown
