Metadata-Version: 2.1
Name: espnet
Version: 0.9.6
Summary: ESPnet: end-to-end speech processing toolkit
Home-page: http://github.com/espnet/espnet
Author: Shinji Watanabe
Author-email: shinjiw@ieee.org
License: Apache Software License
Description: <div align="left"><img src="doc/image/espnet_logo1.png" width="550"/></div>
        
        # ESPnet: end-to-end speech processing toolkit
        
        |system/pytorch ver.|1.0.1|1.1.0|1.2.0|1.3.1|1.4.0|1.5.1|1.6.0|1.7.0|
        | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
        |ubuntu18/python3.8/pip||||||||[![Github Actions](https://github.com/espnet/espnet/workflows/CI/badge.svg)](https://github.com/espnet/espnet/actions)|
        |ubuntu18/python3.7/pip|[![Github Actions](https://github.com/espnet/espnet/workflows/CI/badge.svg)](https://github.com/espnet/espnet/actions)|[![Github Actions](https://github.com/espnet/espnet/workflows/CI/badge.svg)](https://github.com/espnet/espnet/actions)|[![Github Actions](https://github.com/espnet/espnet/workflows/CI/badge.svg)](https://github.com/espnet/espnet/actions)|[![Github Actions](https://github.com/espnet/espnet/workflows/CI/badge.svg)](https://github.com/espnet/espnet/actions)|[![Github Actions](https://github.com/espnet/espnet/workflows/CI/badge.svg)](https://github.com/espnet/espnet/actions)|[![Github Actions](https://github.com/espnet/espnet/workflows/CI/badge.svg)](https://github.com/espnet/espnet/actions)|[![Github Actions](https://github.com/espnet/espnet/workflows/CI/badge.svg)](https://github.com/espnet/espnet/actions)|[![Github Actions](https://github.com/espnet/espnet/workflows/CI/badge.svg)](https://github.com/espnet/espnet/actions)|
        |ubuntu18/python3.6/conda||||||||[![CircleCI](https://circleci.com/gh/espnet/espnet.svg?style=svg)](https://circleci.com/gh/espnet/espnet)|
        |ubuntu20/python3.6/conda||||||||[![CircleCI](https://circleci.com/gh/espnet/espnet.svg?style=svg)](https://circleci.com/gh/espnet/espnet)|
        |debian9/python3.6/conda||||||||[![CircleCI](https://circleci.com/gh/espnet/espnet.svg?style=svg)](https://circleci.com/gh/espnet/espnet)|
        |centos7/python3.6/conda||||||||[![CircleCI](https://circleci.com/gh/espnet/espnet.svg?style=svg)](https://circleci.com/gh/espnet/espnet)|
        |debian9/python3.6/conda||||||||[![debian9](https://github.com/espnet/espnet/workflows/debian9/badge.svg)](https://github.com/espnet/espnet/actions?query=workflow%3Adebian9)|
        |centos7/python3.6/conda||||||||[![centos7](https://github.com/espnet/espnet/workflows/centos7/badge.svg)](https://github.com/espnet/espnet/actions?query=workflow%3Acentos7)|
        |[docs/coverage] python3.8||||||||[![Build Status](https://travis-ci.org/espnet/espnet.svg?branch=master)](https://travis-ci.org/espnet/espnet)|
        
        [![PyPI version](https://badge.fury.io/py/espnet.svg)](https://badge.fury.io/py/espnet)
        [![Python Versions](https://img.shields.io/pypi/pyversions/espnet.svg)](https://pypi.org/project/espnet/)
        [![Downloads](https://pepy.tech/badge/espnet)](https://pepy.tech/project/espnet)
        [![GitHub license](https://img.shields.io/github/license/espnet/espnet.svg)](https://github.com/espnet/espnet)
        [![codecov](https://codecov.io/gh/espnet/espnet/branch/master/graph/badge.svg)](https://codecov.io/gh/espnet/espnet)
        [![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
        [![Mergify Status](https://img.shields.io/endpoint.svg?url=https://gh.mergify.io/badges/espnet/espnet&style=flat)](https://mergify.io)
        [![Gitter](https://badges.gitter.im/espnet-en/community.svg)](https://gitter.im/espnet-en/community?utm_source=badge&utm_medium=badge&utm_campaign=pr-badge)
        
        [**Docs**](https://espnet.github.io/espnet/)
        | [**Example**](https://github.com/espnet/espnet/tree/master/egs)
        | [**Example (ESPnet2)**](https://github.com/espnet/espnet/tree/master/egs2)
        | [**Docker**](https://github.com/espnet/espnet/tree/master/docker)
        | [**Notebook**](https://github.com/espnet/notebook)
        | [**Tutorial (2019)**](https://github.com/espnet/interspeech2019-tutorial)
        
        ESPnet is an end-to-end speech processing toolkit that mainly focuses on end-to-end speech recognition and end-to-end text-to-speech.
        ESPnet uses [chainer](https://chainer.org/) and [pytorch](http://pytorch.org/) as its main deep learning engines,
        and also follows [Kaldi](http://kaldi-asr.org/) style data processing, feature extraction/format, and recipes to provide a complete setup for speech recognition and other speech processing experiments.
        
        ## Key Features
        
        ### Kaldi style complete recipe
        - Supports a number of `ASR` recipes (WSJ, Switchboard, CHiME-4/5, Librispeech, TED, CSJ, AMI, HKUST, Voxforge, REVERB, etc.)
        - Supports a number of `TTS` recipes in a similar manner to the ASR recipes (LJSpeech, LibriTTS, M-AILABS, etc.)
        - Supports a number of `ST` recipes (Fisher-CallHome Spanish, Libri-trans, IWSLT'18, How2, Must-C, Mboshi-French, etc.)
        - Supports a number of `MT` recipes (IWSLT'16, the above ST recipes, etc.)
        - Supports a speech separation and recognition recipe (WSJ-2mix)
        - Supports a voice conversion recipe (VCC2020 baseline) (new!)
        
        
        ### ASR: Automatic Speech Recognition
        - **State-of-the-art performance** in several ASR benchmarks (comparable/superior to hybrid DNN/HMM and CTC)
        - **Hybrid CTC/attention** based end-to-end ASR
          - Fast/accurate training with CTC/attention multitask training
          - CTC/attention joint decoding to boost monotonic alignment decoding
          - Encoder: VGG-like CNN + BiRNN (LSTM/GRU), sub-sampling BiRNN (LSTM/GRU) or Transformer
        - Attention: Dot product, location-aware attention, variants of multihead
        - Incorporate RNNLM/LSTMLM/TransformerLM/N-gram trained only with text data
        - Batch GPU decoding
        - **Transducer** based end-to-end ASR
          - Available: RNN-based encoder/decoder and Transformer-based encoder/decoder w/ customizable architecture.
          - Also support: mixed RNN/Transformer architecture, attention mechanism (RNN decoder), VGG2L (RNN/Transformer encoder), Conformer (Transformer encoder), TDNN (Transformer encoder), Causal Conv1d (Transformer decoder) and various decoding algorithms.
          > Please refer to the [tutorial page](https://espnet.github.io/espnet/tutorial.html#transducer) for complete documentation.
        - CTC segmentation
        - Non-autoregressive based on Mask CTC
        - ASR examples for supporting endangered language documentation (Please refer to egs/puebla_nahuatl and egs/yoloxochitl_mixtec for details)
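        
        The hybrid CTC/attention items above combine two models during training (multitask loss) and during decoding (joint score). The sketch below shows only the interpolation arithmetic. The function names and the parameter `mtl_alpha` are hypothetical, and the default weights are illustrative assumptions rather than ESPnet's actual implementation or settings.
        
        ```python
        def joint_loss(loss_ctc, loss_att, mtl_alpha=0.3):
            """Multitask training loss: interpolate the CTC and attention losses.
        
            mtl_alpha is a hypothetical name for the CTC weight (0 <= mtl_alpha <= 1).
            """
            return mtl_alpha * loss_ctc + (1.0 - mtl_alpha) * loss_att
        
        
        def joint_score(logp_ctc, logp_att, ctc_weight=0.3, logp_lm=0.0, lm_weight=0.0):
            """Joint decoding score of one hypothesis from log-probabilities.
        
            The CTC and attention scores are interpolated, and an optional
            language-model score is added with its own weight.
            """
            return (
                ctc_weight * logp_ctc
                + (1.0 - ctc_weight) * logp_att
                + lm_weight * logp_lm
            )
        ```
        
        With `mtl_alpha=0.0` the loss reduces to attention-only training, and with `mtl_alpha=1.0` to CTC-only training.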
        
        ### TTS: Text-to-speech
        - Tacotron2
        - Transformer-TTS
        - FastSpeech
        - FastSpeech2 (in ESPnet2)
        - Conformer-based FastSpeech & FastSpeech2 (in ESPnet2)
        - Multi-speaker model with pretrained speaker embedding
        - Multi-speaker model with GST (in ESPnet2)
        - Phoneme-based training (En, Jp, and Zh)
        - Integration with neural vocoders (WaveNet, ParallelWaveGAN, and MelGAN)
        
        You can try the demo online now!
        - Real-time TTS demo with ESPnet2  [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/espnet/notebook/blob/master/espnet2_tts_realtime_demo.ipynb)
        - Real-time TTS demo with ESPnet1  [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/espnet/notebook/blob/master/tts_realtime_demo.ipynb)
        
        To train the neural vocoder, please check the following repositories:
        - [kan-bayashi/ParallelWaveGAN](https://github.com/kan-bayashi/ParallelWaveGAN)
        - [r9y9/wavenet_vocoder](https://github.com/r9y9/wavenet_vocoder)
        
        > **NOTE**:
        > - We are moving to ESPnet2-based development for TTS.
        > - If you are a beginner, we recommend using [ESPnet2-TTS](https://github.com/espnet/espnet/tree/master/egs2/TEMPLATE/tts1).
        
        ### ST: Speech Translation & MT: Machine Translation
        - **State-of-the-art performance** in several ST benchmarks (comparable/superior to cascaded ASR and MT)
        - Transformer based end-to-end ST (new!)
        - Transformer based end-to-end MT (new!)
        
        ### VC: Voice conversion
        - Transformer and Tacotron2 based parallel VC using melspectrogram (new!)
        - End-to-end VC based on cascaded ASR+TTS (Baseline system for Voice Conversion Challenge 2020!)
        
        ### DNN Framework
        - Flexible network architecture thanks to chainer and pytorch
        - Flexible front-end processing thanks to [kaldiio](https://github.com/nttcslab-sp/kaldiio) and HDF5 support
        - Tensorboard based monitoring
        
        ### ESPnet2
        See [ESPnet2](https://espnet.github.io/espnet/espnet2_tutorial.html).
        
        - Independent of Kaldi/Chainer
        - On the fly feature extraction and text processing when training
        - Multi GPUs training on single/multi nodes (Distributed training)
        - A template recipe which can be applied for all corpora
        - Possible to train on a corpus of any size without CPU memory errors
        - (Under development) [ESPnet Model Zoo](https://github.com/espnet/espnet_model_zoo)
        
        ## Installation
        - If you intend to do full experiments including DNN training, then see [Installation](https://espnet.github.io/espnet/installation.html).
        - If you just need the Python module only:
            ```sh
            pip install espnet
            # To install latest
            # pip install git+https://github.com/espnet/espnet
            ```
        
            You need to install some packages.
        
            ```sh
            pip install torch
            pip install chainer==6.0.0 cupy==6.0.0    # [Option] If you'll use ESPnet1
            pip install torchaudio                    # [Option] If you'll use enhancement task
            pip install torch_optimizer               # [Option] If you'll use additional optimizers in ESPnet2
            ```
        
            Some tasks require additional packages beyond those listed above. If you encounter an ImportError, please install the missing package at that time.
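        
        To see in advance which of the optional packages are missing, a check like the following may help. This is a minimal sketch using only the standard library. The helper name `missing_packages` is hypothetical, and the module names are assumed to match the pip package names listed above.
        
        ```python
        import importlib.util
        
        
        def missing_packages(names):
            """Return the top-level module names that cannot be found for import."""
            return [name for name in names if importlib.util.find_spec(name) is None]
        
        
        if __name__ == "__main__":
            # Assumed import names for the packages mentioned in this section.
            optional = ["torch", "chainer", "cupy", "torchaudio", "torch_optimizer"]
            for name in missing_packages(optional):
                print(f"not installed: {name}")
        ```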
        
        ## Usage
        See [Usage](https://espnet.github.io/espnet/tutorial.html).
        
        ## Docker Container
        
        Go to [docker/](docker/) and follow the [instructions](https://espnet.github.io/espnet/docker.html).
        
        ## Contribution
        Thank you for taking the time to contribute to ESPnet! Any contributions to ESPnet are welcome, and feel free to post any questions or requests to [issues](https://github.com/espnet/espnet/issues).
        If this is your first contribution to ESPnet, please follow the [contribution guide](CONTRIBUTING.md).
        
        ## Results and demo
        
        You can find useful tutorials and demos in [Interspeech 2019 Tutorial](https://github.com/espnet/interspeech2019-tutorial)
        
        ### ASR results
        
        <details><summary>expand</summary><div>
        
        
        We list the character error rate (CER) and word error rate (WER) of major ASR tasks.
        
        | Task                   | CER (%) | WER (%) | Pretrained model|
        | -----------            | :----:  | :----:  | :----:                                                                                                                                                                |
        | Aishell dev/test            | 4.6/5.1    | N/A     | [link](https://github.com/espnet/espnet/blob/master/egs/aishell/asr1/RESULTS.md#conformer-kernel-size--15--specaugment--lm-weight--00-result) |
        | **ESPnet2** Aishell dev/test            | 4.4/4.7    | N/A     | [link](https://github.com/espnet/espnet/tree/master/egs2/aishell/asr1#conformer--specaug--speed-perturbation-featsraw-n_fft512-hop_length128) |
        | Common Voice dev/test       | 1.7/1.8     | 2.2/2.3     | [link](https://github.com/espnet/espnet/blob/master/egs/commonvoice/asr1/RESULTS.md#first-results-default-pytorch-transformer-setting-with-bpe-100-epochs-single-gpu) |
        | CSJ eval1/eval2/eval3              | 5.7/3.8/4.2     | N/A     | [link](https://github.com/espnet/espnet/blob/master/egs/csj/asr1/RESULTS.md#pytorch-backend-transformer-without-any-hyperparameter-tuning)                            |
        | **ESPnet2** CSJ eval1/eval2/eval3              | 4.5/3.3/3.6     | N/A     | [link](https://github.com/espnet/espnet/tree/master/egs2/csj/asr1#initial-conformer-results)                            |
        | HKUST dev              | 23.5    | N/A     | [link](https://github.com/espnet/espnet/blob/master/egs/hkust/asr1/RESULTS.md#transformer-only-20-epochs)                                                             |
        | Librispeech dev_clean/dev_other/test_clean/test_other  | N/A     | 1.9/4.9/2.1/4.9     | [link](https://github.com/espnet/espnet/blob/master/egs/librispeech/asr1/RESULTS.md#pytorch-large-conformer-with-specaug--speed-perturbation-8-gpus--transformer-lm-4-gpus)             |
        | Switchboard (eval2000) callhm/swbd           | N/A     | 14.0/6.8     | [link](https://github.com/espnet/espnet/blob/master/egs/swbd/asr1/RESULTS.md#conformer-with-bpe-2000-specaug-speed-perturbation-transformer-lm-decoding)   |
        | TEDLIUM2 dev/test           | N/A     | 8.6/7.2     | [link](https://github.com/espnet/espnet/blob/master/egs/tedlium2/asr1/RESULTS.md#conformer-large-model--specaug--speed-perturbation--rnnlm)   |
        | TEDLIUM3 dev/test           | N/A     | 9.6/7.6     | [link](https://github.com/espnet/espnet/blob/master/egs/tedlium3/asr1/RESULTS.md)                   |
        | WSJ dev93/eval92              | 3.2/2.1     | 7.0/4.7     | N/A |
        |  **ESPnet2** WSJ dev93/eval92              | 2.7/1.8     | 6.6/4.6     | [link](https://github.com/espnet/espnet/tree/master/egs2/wsj/asr1#using-transformer-lm-asr-model-is-same-as-the-above-lm_weight12-ctc_weight03-beam_size20) |
        
        Note that the performance of the CSJ, HKUST, and Librispeech tasks was significantly improved by using the wide network (#units = 1024) and, if necessary, large subword units, as reported by [RWTH](https://arxiv.org/pdf/1805.03294.pdf).
        
        If you want to check the results of the other recipes, please check `egs/<name_of_recipe>/asr1/RESULTS.md`.
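        
        For reference, CER and WER are both edit-distance-based error rates, computed over characters and words respectively. The sketch below illustrates the idea. It is not the scoring script used to produce the table, and removing spaces before computing CER is an assumption of this example.
        
        ```python
        def edit_distance(ref, hyp):
            """Levenshtein distance between two token sequences."""
            prev = list(range(len(hyp) + 1))
            for i, r in enumerate(ref, start=1):
                curr = [i]
                for j, h in enumerate(hyp, start=1):
                    cost = 0 if r == h else 1
                    curr.append(min(prev[j] + 1,          # deletion
                                    curr[j - 1] + 1,      # insertion
                                    prev[j - 1] + cost))  # substitution or match
                prev = curr
            return prev[-1]
        
        
        def error_rate(ref_tokens, hyp_tokens):
            """Error rate in percent: edit distance divided by reference length."""
            if not ref_tokens:
                raise ValueError("reference must not be empty")
            return 100.0 * edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)
        
        
        def wer(ref, hyp):
            """Word error rate over whitespace-separated words."""
            return error_rate(ref.split(), hyp.split())
        
        
        def cer(ref, hyp):
            """Character error rate; spaces are dropped (an assumption of this sketch)."""
            return error_rate(list(ref.replace(" ", "")), list(hyp.replace(" ", "")))
        ```
        
        Because insertions count as errors, both rates can exceed 100% when the hypothesis is much longer than the reference.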
        
        </div></details>
        
        
        ### ASR demo
        
        <details><summary>expand</summary><div>
        
        You can recognize speech in a WAV file using pretrained models.
        Go to a recipe directory and run `utils/recog_wav.sh` as follows:
        ```sh
        # go to recipe directory and source path of espnet tools
        cd egs/tedlium2/asr1 && . ./path.sh
        # let's recognize speech!
        recog_wav.sh --models tedlium2.transformer.v1 example.wav
        ```
        where `example.wav` is a WAV file to be recognized.
        The sampling rate must be consistent with that of data used in training.
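        
        A quick way to verify the sampling rate before decoding is to read the WAV header. The sketch below uses Python's standard `wave` module. The helper names are hypothetical, and the expected rate (for example `16000`) is a placeholder that you should replace with the rate of the model's training data.
        
        ```python
        import wave
        
        
        def wav_sample_rate(path):
            """Return the sampling rate (Hz) stored in a PCM WAV file header."""
            with wave.open(path, "rb") as wav_file:
                return wav_file.getframerate()
        
        
        def check_sample_rate(path, expected_hz):
            """Raise ValueError if the file's sampling rate differs from expected_hz."""
            actual_hz = wav_sample_rate(path)
            if actual_hz != expected_hz:
                raise ValueError(
                    f"{path}: sampling rate is {actual_hz} Hz, expected {expected_hz} Hz"
                )
            return actual_hz
        ```
        
        For example, `check_sample_rate("example.wav", 16000)` returns the rate if it matches and raises otherwise, in which case the file needs to be resampled first.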
        
        Available pretrained models in the demo script are listed below.
        
        | Model                                                                                            | Notes                                                      |
        | :------                                                                                          | :------                                                    |
        | [tedlium2.rnn.v1](https://drive.google.com/open?id=1UqIY6WJMZ4sxNxSugUqp3mrGb3j6h7xe)            | Streaming decoding based on CTC-based VAD                  |
        | [tedlium2.rnn.v2](https://drive.google.com/open?id=1cac5Uc09lJrCYfWkLQsF8eapQcxZnYdf)            | Streaming decoding based on CTC-based VAD (batch decoding) |
        | [tedlium2.transformer.v1](https://drive.google.com/open?id=1cVeSOYY1twOfL9Gns7Z3ZDnkrJqNwPow)    | Joint-CTC attention Transformer trained on Tedlium 2       |
        | [tedlium3.transformer.v1](https://drive.google.com/open?id=1zcPglHAKILwVgfACoMWWERiyIquzSYuU)    | Joint-CTC attention Transformer trained on Tedlium 3       |
        | [librispeech.transformer.v1](https://drive.google.com/open?id=1BtQvAnsFvVi-dp_qsaFP7n4A_5cwnlR6) | Joint-CTC attention Transformer trained on Librispeech     |
        | [commonvoice.transformer.v1](https://drive.google.com/open?id=1tWccl6aYU67kbtkm8jv5H6xayqg1rzjh) | Joint-CTC attention Transformer trained on CommonVoice     |
        | [csj.transformer.v1](https://drive.google.com/open?id=120nUQcSsKeY5dpyMWw_kI33ooMRGT2uF)         | Joint-CTC attention Transformer trained on CSJ             |
        | [csj.rnn.v1](https://drive.google.com/open?id=1ALvD4nHan9VDJlYJwNurVr7H7OV0j2X9)                 | Joint-CTC attention VGGBLSTM trained on CSJ                |
        
        </div></details>
        
        ### ST results
        
        <details><summary>expand</summary><div>
        
        We list 4-gram BLEU of major ST tasks.
        
        #### end-to-end system
        | Task | BLEU | Pretrained model |
        | ---- | :----: | :----: |
        | Fisher-CallHome Spanish fisher_test (Es->En)      | 51.03 | [link](https://github.com/espnet/espnet/blob/master/egs/fisher_callhome_spanish/st1/RESULTS.md#train_spen_lcrm_pytorch_train_pytorch_transformer_bpe_short_long_bpe1000_specaug_asrtrans_mttrans) |
        | Fisher-CallHome Spanish callhome_evltest (Es->En) | 20.44 | [link](https://github.com/espnet/espnet/blob/master/egs/fisher_callhome_spanish/st1/RESULTS.md#train_spen_lcrm_pytorch_train_pytorch_transformer_bpe_short_long_bpe1000_specaug_asrtrans_mttrans) |
        | Libri-trans test (En->Fr)                         | 16.70 | [link](https://github.com/espnet/espnet/blob/master/egs/libri_trans/st1/RESULTS.md#train_spfr_lc_pytorch_train_pytorch_transformer_bpe_short_long_bpe1000_specaug_asrtrans_mttrans-1) |
        | How2 dev5 (En->Pt)                                | 45.68 | [link](https://github.com/espnet/espnet/blob/master/egs/how2/st1/RESULTS.md#trainpt_tc_pytorch_train_pytorch_transformer_short_long_bpe8000_specaug_asrtrans_mttrans-1) |
        | Must-C tst-COMMON (En->De)                        | 22.91 | [link](https://github.com/espnet/espnet/blob/master/egs/must_c/st1/RESULTS.md#train_spen-dede_tc_pytorch_train_pytorch_transformer_short_long_bpe8000_specaug_asrtrans_mttrans) |
        | Mboshi-French dev (Fr->Mboshi)                    | 6.18  | N/A  |
        
        #### cascaded system
        | Task | BLEU | Pretrained model |
        | ---- | :----: | :----: |
        | Fisher-CallHome Spanish fisher_test (Es->En)      | 42.16 | N/A  |
        | Fisher-CallHome Spanish callhome_evltest (Es->En) | 19.82 | N/A  |
        | Libri-trans test (En->Fr)                         | 16.96 | N/A  |
        | How2 dev5 (En->Pt)                                | 44.90 | N/A  |
        | Must-C tst-COMMON (En->De)                        | 23.65 | N/A  |
        
        If you want to check the results of the other recipes, please check `egs/<name_of_recipe>/st1/RESULTS.md`.
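        
        As a reminder of what 4-gram BLEU measures, the sketch below computes corpus-level BLEU from clipped n-gram precisions (n = 1 to 4) and a brevity penalty. It assumes one reference per hypothesis, whitespace tokenization, and no smoothing. The numbers in the tables above come from the recipes' own scoring, whose tokenization may differ, so this function is illustrative only.
        
        ```python
        import math
        from collections import Counter
        
        
        def ngrams(tokens, n):
            """Count all n-grams of order n in a token list."""
            return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
        
        
        def corpus_bleu(references, hypotheses, max_n=4):
            """Corpus-level BLEU in percent (single reference, no smoothing)."""
            matches = [0] * max_n
            totals = [0] * max_n
            ref_len = hyp_len = 0
            for ref, hyp in zip(references, hypotheses):
                ref_tok, hyp_tok = ref.split(), hyp.split()
                ref_len += len(ref_tok)
                hyp_len += len(hyp_tok)
                for n in range(1, max_n + 1):
                    ref_counts = ngrams(ref_tok, n)
                    hyp_counts = ngrams(hyp_tok, n)
                    # Clip each hypothesis n-gram count by its count in the reference.
                    matches[n - 1] += sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
                    totals[n - 1] += sum(hyp_counts.values())
            if hyp_len == 0 or min(matches) == 0:
                return 0.0
            log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
            brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1.0 - ref_len / hyp_len)
            return 100.0 * brevity_penalty * math.exp(log_precision)
        ```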
        
        </div></details>
        
        
        ### ST demo
        
        <details><summary>expand</summary><div>
        
        (**New!**) We made a new real-time E2E-ST + TTS demonstration in Google Colab.
        Please access the notebook from the following button and enjoy the real-time speech-to-speech translation!
        
        [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/espnet/notebook/blob/master/st_demo.ipynb)
        
        ---
        
        You can translate speech in a WAV file using pretrained models.
        Go to a recipe directory and run `utils/translate_wav.sh` as follows:
        ```sh
        # go to recipe directory and source path of espnet tools
        cd egs/fisher_callhome_spanish/st1 && . ./path.sh
        # download example wav file
        wget -O - https://github.com/espnet/espnet/files/4100928/test.wav.tar.gz | tar zxvf -
        # let's translate speech!
        translate_wav.sh --models fisher_callhome_spanish.transformer.v1.es-en test.wav
        ```
        where `test.wav` is a WAV file to be translated.
        The sampling rate must be consistent with that of data used in training.
        
        Available pretrained models in the demo script are listed below.
        
        | Model                                                                                            | Notes                                                      |
        | :------                                                                                          | :------                                                    |
        | [fisher_callhome_spanish.transformer.v1](https://drive.google.com/open?id=1hawp5ZLw4_SIHIT3edglxbKIIkPVe8n3)            | Transformer-ST trained on Fisher-CallHome Spanish Es->En                  |
        
        </div></details>
        
        
        ### MT results
        
        <details><summary>expand</summary><div>
        
        | Task | BLEU | Pretrained model |
        | ---- | :----: | :----: |
        | Fisher-CallHome Spanish fisher_test (Es->En)      | 61.45 | [link](https://github.com/espnet/espnet/blob/master/egs/fisher_callhome_spanish/mt1/RESULTS.md#trainen_lcrm_lcrm_pytorch_train_pytorch_transformer_bpe_bpe1000) |
        | Fisher-CallHome Spanish callhome_evltest (Es->En) | 29.86 | [link](https://github.com/espnet/espnet/blob/master/egs/fisher_callhome_spanish/mt1/RESULTS.md#trainen_lcrm_lcrm_pytorch_train_pytorch_transformer_bpe_bpe1000) |
        | Libri-trans test (En->Fr)                         | 18.09 | [link](https://github.com/espnet/espnet/blob/master/egs/libri_trans/mt1/RESULTS.md#trainfr_lcrm_tc_pytorch_train_pytorch_transformer_bpe1000) |
        | How2 dev5 (En->Pt)                                | 58.61 | [link](https://github.com/espnet/espnet/blob/master/egs/how2/mt1/RESULTS.md#trainpt_tc_tc_pytorch_train_pytorch_transformer_bpe8000) |
        | Must-C tst-COMMON (En->De)                        | 27.63 | [link](https://github.com/espnet/espnet/blob/master/egs/must_c/mt1/RESULTS.md#summary-4-gram-bleu) |
        | IWSLT'14 test2014 (En->De)                        | 24.70 | [link](https://github.com/espnet/espnet/blob/master/egs/iwslt16/mt1/RESULTS.md#result) |
        | IWSLT'14 test2014 (De->En)                        | 29.22 | [link](https://github.com/espnet/espnet/blob/master/egs/iwslt16/mt1/RESULTS.md#result) |
        | IWSLT'16 test2014 (En->De)                        | 24.05 | [link](https://github.com/espnet/espnet/blob/master/egs/iwslt16/mt1/RESULTS.md#result) |
        | IWSLT'16 test2014 (De->En)                        | 29.13 | [link](https://github.com/espnet/espnet/blob/master/egs/iwslt16/mt1/RESULTS.md#result) |
        
        </div></details>
        
        ### TTS results
        
        <details><summary>ESPnet2</summary><div>
        
        You can listen to the generated samples at the following URL.
        - [ESPnet2 TTS generated samples](https://drive.google.com/drive/folders/1H3fnlBbWMEkQUfrHqosKN_ZX_WjO29ma?usp=sharing)
        
        > Note that in the generation we use Griffin-Lim (`wav/`) and [Parallel WaveGAN](https://github.com/kan-bayashi/ParallelWaveGAN) (`wav_pwg/`).
        
        You can download pretrained models via `espnet_model_zoo`.
        - [ESPnet model zoo](https://github.com/espnet/espnet_model_zoo)
        - [Pretrained model list](https://github.com/espnet/espnet_model_zoo/blob/master/espnet_model_zoo/table.csv)
        
        You can download pretrained vocoders via `kan-bayashi/ParallelWaveGAN`.
        - [kan-bayashi/ParallelWaveGAN](https://github.com/kan-bayashi/ParallelWaveGAN)
        - [Pretrained vocoder list](https://github.com/kan-bayashi/ParallelWaveGAN#results)
        
        </div></details>
        
        <details><summary>ESPnet1</summary><div>
        
        > NOTE: We are moving to ESPnet2-based development for TTS. Please check the latest results in the above ESPnet2 results.
        
        You can listen to our samples on the demo page [espnet-tts-sample](https://espnet.github.io/espnet-tts-sample/).
        Here we list some notable ones:
        
        - [Single English speaker Tacotron2](https://drive.google.com/open?id=18JgsOCWiP_JkhONasTplnHS7yaF_konr)
        - [Single Japanese speaker Tacotron2](https://drive.google.com/open?id=1fEgS4-K4dtgVxwI4Pr7uOA1h4PE-zN7f)
        - [Single other language speaker Tacotron2](https://drive.google.com/open?id=1q_66kyxVZGU99g8Xb5a0Q8yZ1YVm2tN0)
        - [Multi English speaker Tacotron2](https://drive.google.com/open?id=18S_B8Ogogij34rIfJOeNF8D--uG7amz2)
        - [Single English speaker Transformer](https://drive.google.com/open?id=14EboYVsMVcAq__dFP1p6lyoZtdobIL1X)
        - [Single English speaker FastSpeech](https://drive.google.com/open?id=1PSxs1VauIndwi8d5hJmZlppGRVu2zuy5)
        - [Multi English speaker Transformer](https://drive.google.com/open?id=1_vrdqjM43DdN1Qz7HJkvMQ6lCMmWLeGp)
        - [Single Italian speaker FastSpeech](https://drive.google.com/open?id=13I5V2w7deYFX4DlVk1-0JfaXmUR2rNOv)
        - [Single Mandarin speaker Transformer](https://drive.google.com/open?id=1mEnZfBKqA4eT6Bn0eRZuP6lNzL-IL3VD)
        - [Single Mandarin speaker FastSpeech](https://drive.google.com/open?id=1Ol_048Tuy6BgvYm1RpjhOX4HfhUeBqdK)
        - [Multi Japanese speaker Transformer](https://drive.google.com/open?id=1fFMQDF6NV5Ysz48QLFYE8fEvbAxCsMBw)
        - [Single English speaker models with Parallel WaveGAN](https://drive.google.com/open?id=1HvB0_LDf1PVinJdehiuCt5gWmXGguqtx)
        - [Single English speaker knowledge distillation-based FastSpeech](https://drive.google.com/open?id=1wG-Y0itVYalxuLAHdkAHO7w1CWFfRPF4)
        
        You can download all of the pretrained models and generated samples:
        - [All of the pretrained E2E-TTS models](https://drive.google.com/open?id=1k9RRyc06Zl0mM2A7mi-hxNiNMFb_YzTF)
        - [All of the generated samples](https://drive.google.com/open?id=1bQGuqH92xuxOX__reWLP4-cif0cbpMLX)
        
        Note that in the generated samples we use the following vocoders: Griffin-Lim (**GL**), WaveNet vocoder (**WaveNet**), Parallel WaveGAN (**ParallelWaveGAN**), and MelGAN (**MelGAN**).
        The neural vocoders are based on the following repositories.
        - [kan-bayashi/ParallelWaveGAN](https://github.com/kan-bayashi/ParallelWaveGAN): Parallel WaveGAN / MelGAN / Multi-band MelGAN
        - [r9y9/wavenet_vocoder](https://github.com/r9y9/wavenet_vocoder): 16 bit mixture of Logistics WaveNet vocoder
        - [kan-bayashi/PytorchWaveNetVocoder](https://github.com/kan-bayashi/PytorchWaveNetVocoder): 8 bit Softmax WaveNet Vocoder with the noise shaping
        
        If you want to build your own neural vocoder, please check the above repositories.
        [kan-bayashi/ParallelWaveGAN](https://github.com/kan-bayashi/ParallelWaveGAN) provides [the manual](https://github.com/kan-bayashi/ParallelWaveGAN#decoding-with-espnet-tts-models-features) about how to decode ESPnet-TTS model's features with neural vocoders. Please check it.
        
        Here we list all of the pretrained neural vocoders. Please download and enjoy the generation of high quality speech!
        
        | Model link                                                                                           | Lang  | Fs [Hz] | Mel range [Hz] | FFT / Shift / Win [pt] | Model type                                                              |
        | :------                                                                                              | :---: | :----:  | :--------:     | :---------------:      | :------                                                                 |
        | [ljspeech.wavenet.softmax.ns.v1](https://drive.google.com/open?id=1eA1VcRS9jzFa-DovyTgJLQ_jmwOLIi8L) | EN    | 22.05k  | None           | 1024 / 256 / None      | [Softmax WaveNet](https://github.com/kan-bayashi/PytorchWaveNetVocoder) |
        | [ljspeech.wavenet.mol.v1](https://drive.google.com/open?id=1sY7gEUg39QaO1szuN62-Llst9TrFno2t)        | EN    | 22.05k  | None           | 1024 / 256 / None      | [MoL WaveNet](https://github.com/r9y9/wavenet_vocoder)                  |
        | [ljspeech.parallel_wavegan.v1](https://drive.google.com/open?id=1tv9GKyRT4CDsvUWKwH3s_OfXkiTi0gw7)   | EN    | 22.05k  | None           | 1024 / 256 / None      | [Parallel WaveGAN](https://github.com/kan-bayashi/ParallelWaveGAN)      |
        | [ljspeech.wavenet.mol.v2](https://drive.google.com/open?id=1es2HuKUeKVtEdq6YDtAsLNpqCy4fhIXr)        | EN    | 22.05k  | 80-7600        | 1024 / 256 / None      | [MoL WaveNet](https://github.com/r9y9/wavenet_vocoder)                  |
        | [ljspeech.parallel_wavegan.v2](https://drive.google.com/open?id=1Grn7X9wD35UcDJ5F7chwdTqTa4U7DeVB)   | EN    | 22.05k  | 80-7600        | 1024 / 256 / None      | [Parallel WaveGAN](https://github.com/kan-bayashi/ParallelWaveGAN)      |
        | [ljspeech.melgan.v1](https://drive.google.com/open?id=1ipPWYl8FBNRlBFaKj1-i23eQpW_W_YcR)             | EN    | 22.05k  | 80-7600        | 1024 / 256 / None      | [MelGAN](https://github.com/kan-bayashi/ParallelWaveGAN)                |
        | [ljspeech.melgan.v3](https://drive.google.com/open?id=1_a8faVA5OGCzIcJNw4blQYjfG4oA9VEt)             | EN    | 22.05k  | 80-7600        | 1024 / 256 / None      | [MelGAN](https://github.com/kan-bayashi/ParallelWaveGAN)                |
        | [libritts.wavenet.mol.v1](https://drive.google.com/open?id=1jHUUmQFjWiQGyDd7ZeiCThSjjpbF_B4h)        | EN    | 24k     | None           | 1024 / 256 / None      | [MoL WaveNet](https://github.com/r9y9/wavenet_vocoder)                  |
        | [jsut.wavenet.mol.v1](https://drive.google.com/open?id=187xvyNbmJVZ0EZ1XHCdyjZHTXK9EcfkK)            | JP    | 24k     | 80-7600        | 2048 / 300 / 1200      | [MoL WaveNet](https://github.com/r9y9/wavenet_vocoder)                  |
        | [jsut.parallel_wavegan.v1](https://drive.google.com/open?id=1OwrUQzAmvjj1x9cDhnZPp6dqtsEqGEJM)       | JP    | 24k     | 80-7600        | 2048 / 300 / 1200      | [Parallel WaveGAN](https://github.com/kan-bayashi/ParallelWaveGAN)      |
        | [csmsc.wavenet.mol.v1](https://drive.google.com/open?id=1PsjFRV5eUP0HHwBaRYya9smKy5ghXKzj)           | ZH    | 24k     | 80-7600        | 2048 / 300 / 1200      | [MoL WaveNet](https://github.com/r9y9/wavenet_vocoder)                  |
        | [csmsc.parallel_wavegan.v1](https://drive.google.com/open?id=10M6H88jEUGbRWBmU1Ff2VaTmOAeL8CEy)      | ZH    | 24k     | 80-7600        | 2048 / 300 / 1200      | [Parallel WaveGAN](https://github.com/kan-bayashi/ParallelWaveGAN)      |
        
        If you want to use the above pretrained vocoders, make sure your feature settings exactly match theirs.
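        
        One way to catch a mismatch early is to compare your feature-extraction settings against the corresponding table row before running the vocoder. The sketch below copies two rows from the table. The dictionary keys (`fs`, `fmin`, `fmax`, `n_fft`, `n_shift`, `win_length`) and the helper name are hypothetical and need to be mapped to whatever names your own configuration uses.
        
        ```python
        # Two rows from the table above (Fs, Mel range, FFT / Shift / Win).
        VOCODER_SETTINGS = {
            "ljspeech.parallel_wavegan.v2": {
                "fs": 22050, "fmin": 80, "fmax": 7600,
                "n_fft": 1024, "n_shift": 256, "win_length": None,
            },
            "jsut.parallel_wavegan.v1": {
                "fs": 24000, "fmin": 80, "fmax": 7600,
                "n_fft": 2048, "n_shift": 300, "win_length": 1200,
            },
        }
        
        
        def setting_mismatches(vocoder_name, feature_setting):
            """Return {key: (yours, expected)} for every setting that differs."""
            expected = VOCODER_SETTINGS[vocoder_name]
            return {
                key: (feature_setting.get(key), value)
                for key, value in expected.items()
                if feature_setting.get(key) != value
            }
        ```
        
        An empty result means every listed setting matches. Any entry shows your value next to the one the vocoder expects.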
        
        </div></details>
        
        ### TTS demo
        
        <details><summary>ESPnet2</summary><div>
        
        You can try the real-time demo in Google Colab.
        Please access the notebook from the following button and enjoy the real-time synthesis!
        
        - Real-time TTS demo with ESPnet2  [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/espnet/notebook/blob/master/espnet2_tts_realtime_demo.ipynb)
        
        English, Japanese, and Mandarin models are available in the demo.
        
        </div></details>
        
        <details><summary>ESPnet1</summary><div>
        
        > NOTE: We are moving to ESPnet2-based development for TTS. Please check the latest demo in the above ESPnet2 demo.
        
        You can try the real-time demo in Google Colab.
        Please access the notebook from the following button and enjoy the real-time synthesis.
        
        - Real-time TTS demo with ESPnet1  [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/espnet/notebook/blob/master/tts_realtime_demo.ipynb)
        
        We also provide a shell script to perform synthesis.
        Go to a recipe directory and run `utils/synth_wav.sh` as follows:
        
        ```sh
        # go to recipe directory and source path of espnet tools
        cd egs/ljspeech/tts1 && . ./path.sh
        # we use upper-case char sequence for the default model.
        echo "THIS IS A DEMONSTRATION OF TEXT TO SPEECH." > example.txt
        # let's synthesize speech!
        synth_wav.sh example.txt
        
        # you can also use multiple sentences
        echo "THIS IS A DEMONSTRATION OF TEXT TO SPEECH." > example_multi.txt
        echo "TEXT TO SPEECH IS A TECHNIQUE TO CONVERT TEXT INTO SPEECH." >> example_multi.txt
        synth_wav.sh example_multi.txt
        ```
        
        You can change the pretrained model as follows:
        
        ```sh
        synth_wav.sh --models ljspeech.fastspeech.v1 example.txt
        ```
        
        Waveform synthesis is performed with the Griffin-Lim algorithm and neural vocoders (WaveNet and ParallelWaveGAN).
        You can change the pretrained vocoder model as follows:
        
        ```sh
        synth_wav.sh --vocoder_models ljspeech.wavenet.mol.v1 example.txt
        ```
        
        The WaveNet vocoder produces very high-quality speech, but generation takes time.
        
        For more details and the list of available models, see `--help`.
        
        ```sh
        synth_wav.sh --help
        ```
        
        </div></details>
        
        ### VC results
        
        <details><summary>expand</summary><div>
        
        - Transformer and Tacotron2 based VC
        
        You can listen to some samples on the [demo webpage](https://unilight.github.io/Publication-Demos/publications/transformer-vc/).
        
        - Cascade ASR+TTS as one of the baseline systems of VCC2020
        
        The [Voice Conversion Challenge 2020](http://www.vc-challenge.org/) (VCC2020) adopts ESPnet to build an end-to-end based baseline system.
        In VCC2020, the objective is intra/cross-lingual nonparallel VC.
        You can download converted samples of the cascade ASR+TTS baseline system [here](https://drive.google.com/drive/folders/1oeZo83GrOgtqxGwF7KagzIrfjr8X59Ue?usp=sharing).
        
        </div></details>
        
        ### CTC Segmentation demo
        
        <details><summary>expand</summary><div>
        
        [CTC segmentation](https://arxiv.org/abs/2007.09127) determines utterance segments within audio files.
        Aligned utterance segments constitute the labels of speech datasets.
        
        As a demo, we align the start and end of utterances within the audio file `ctc_align_test.wav`, using the example script `utils/asr_align_wav.sh`.
        For preparation, set up a data directory:
        
        ```sh
        cd egs/tedlium2/align1/
        # data directory
        align_dir=data/demo
        mkdir -p ${align_dir}
        # wav file
        base=ctc_align_test
        wav=../../../test_utils/${base}.wav
        # recipe files
        echo "batchsize: 0" > ${align_dir}/align.yaml
        
        cat << EOF > ${align_dir}/utt_text
        ${base} THE SALE OF THE HOTELS
        ${base} IS PART OF HOLIDAY'S STRATEGY
        ${base} TO SELL OFF ASSETS
        ${base} AND CONCENTRATE
        ${base} ON PROPERTY MANAGEMENT
        EOF
        ```
        
        Here, `utt_text` is the file containing the list of utterances.
        Choose a pre-trained ASR model that includes a CTC layer to find utterance segments:
        
        ```sh
        # pre-trained ASR model
        model=wsj.transformer_small.v1
        mkdir -p ./conf && cp ../../wsj/asr1/conf/no_preprocess.yaml ./conf
        
        ../../../utils/asr_align_wav.sh \
            --models ${model} \
            --align_dir ${align_dir} \
            --align_config ${align_dir}/align.yaml \
            ${wav} ${align_dir}/utt_text
        ```
        
        Segments are written to `aligned_segments` as a list of file/utterance names, utterance start and end times in seconds, and a confidence score.
        The confidence score is a probability in log space that indicates how well the utterance was aligned. If needed, remove bad utterances:
        
        ```sh
        min_confidence_score=-5
        awk -v ms=${min_confidence_score} '{ if ($5 > ms) {print} }' ${align_dir}/aligned_segments
        ```
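        
        The same filter can also report the duration of each utterance that passes the threshold. The following is a minimal sketch on made-up sample lines; it assumes the confidence score is the fifth column (as in the `awk` command above) and that the start and end times are the third and fourth columns. The utterance names, times, and scores are hypothetical.
        
        ```sh
        # hypothetical sample lines: <utterance> <file> <start> <end> <score>
        segments=$(mktemp)
        printf '%s\n' \
            "utt_0000 ctc_align_test 0.26 1.73 -0.0154" \
            "utt_0001 ctc_align_test 1.73 3.19 -0.7674" \
            "utt_0002 ctc_align_test 3.19 4.20 -7.5421" > "${segments}"
        
        min_confidence_score=-5
        # keep utterances above the threshold and print their duration (end - start)
        filtered=$(awk -v ms="${min_confidence_score}" \
            '$5 > ms { printf "%s %.2f\n", $1, $4 - $3 }' "${segments}")
        echo "${filtered}"
        rm -f "${segments}"
        ```
        
        With this sample, the third utterance is dropped because its score is below the threshold.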
        
        The demo script `utils/asr_align_wav.sh` uses a pretrained ASR model (see the list above for more models).
        For aligning large audio files, models with RNN-based encoders (such as BLSTMP) are recommended
        over Transformer models, which consume a lot of memory on longer audio data.
        The sample rate of the audio must be consistent with that of the data used in training; adjust with `sox` if needed.
        A full example recipe is in `egs/tedlium2/align1/`.
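        
        As a sketch of the `sox` adjustment, the following checks the sample rate of a file and resamples it only when it differs from the target. The file names and the 16 kHz target are placeholders, not values prescribed by ESPnet; use the rate of the data your chosen model was trained on.
        
        ```sh
        target_rate=16000        # assumption: replace with the model's training rate
        in_wav=input.wav         # hypothetical input file
        out_wav=input_16k.wav    # hypothetical output file
        
        # run only if sox is installed and the input file exists
        if command -v sox > /dev/null 2>&1 && [ -f "${in_wav}" ]; then
            # "sox --i -r" prints the sample rate of the file
            cur_rate=$(sox --i -r "${in_wav}")
            if [ "${cur_rate}" != "${target_rate}" ]; then
                # "-r" before the output file sets the output sample rate
                sox "${in_wav}" -r "${target_rate}" "${out_wav}"
            fi
        fi
        ```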
        
        </div></details>
        
        
        ## References
        
        [1] Shinji Watanabe, Takaaki Hori, Shigeki Karita, Tomoki Hayashi, Jiro Nishitoba, Yuya Unno, Nelson Enrique Yalta Soplin, Jahn Heymann, Matthew Wiesner, Nanxin Chen, Adithya Renduchintala, and Tsubasa Ochiai, "ESPnet: End-to-End Speech Processing Toolkit," *Proc. Interspeech'18*, pp. 2207-2211 (2018)
        
        [2] Suyoun Kim, Takaaki Hori, and Shinji Watanabe, "Joint CTC-attention based end-to-end speech recognition using multi-task learning," *Proc. ICASSP'17*, pp. 4835-4839 (2017)
        
        [3] Shinji Watanabe, Takaaki Hori, Suyoun Kim, John R. Hershey and Tomoki Hayashi, "Hybrid CTC/Attention Architecture for End-to-End Speech Recognition," *IEEE Journal of Selected Topics in Signal Processing*, vol. 11, no. 8, pp. 1240-1253, Dec. 2017
        
        ## Citations
        
        ```
        @inproceedings{watanabe2018espnet,
          author={Shinji Watanabe and Takaaki Hori and Shigeki Karita and Tomoki Hayashi and Jiro Nishitoba and Yuya Unno and Nelson {Enrique Yalta Soplin} and Jahn Heymann and Matthew Wiesner and Nanxin Chen and Adithya Renduchintala and Tsubasa Ochiai},
          title={{ESPnet}: End-to-End Speech Processing Toolkit},
          year={2018},
          booktitle={Proceedings of Interspeech},
          pages={2207--2211},
          doi={10.21437/Interspeech.2018-1456},
          url={http://dx.doi.org/10.21437/Interspeech.2018-1456}
        }
        @inproceedings{hayashi2020espnet,
          title={{ESPnet-TTS}: Unified, reproducible, and integratable open source end-to-end text-to-speech toolkit},
          author={Hayashi, Tomoki and Yamamoto, Ryuichi and Inoue, Katsuki and Yoshimura, Takenori and Watanabe, Shinji and Toda, Tomoki and Takeda, Kazuya and Zhang, Yu and Tan, Xu},
          booktitle={Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
          pages={7654--7658},
          year={2020},
          organization={IEEE}
        }
        @inproceedings{inaguma-etal-2020-espnet,
            title = "{ESP}net-{ST}: All-in-One Speech Translation Toolkit",
            author = "Inaguma, Hirofumi  and
              Kiyono, Shun  and
              Duh, Kevin  and
              Karita, Shigeki  and
              Yalta, Nelson  and
              Hayashi, Tomoki  and
              Watanabe, Shinji",
            booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations",
            month = jul,
            year = "2020",
            address = "Online",
            publisher = "Association for Computational Linguistics",
            url = "https://www.aclweb.org/anthology/2020.acl-demos.34",
            pages = "302--311",
        }
        ```
        
Platform: UNKNOWN
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.6
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Classifier: Development Status :: 5 - Production/Stable
Classifier: Intended Audience :: Science/Research
Classifier: Operating System :: POSIX :: Linux
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Python: >=3.6.0
Description-Content-Type: text/markdown
Provides-Extra: test
Provides-Extra: doc
