[GSoC 2017 with CMUSphinx] Post 8#: Grapheme to Phoneme Conversion
We have now collected pronunciation dictionaries for several languages. Since the dictionaries are collected from web data, most of them are not large enough to cover all the words needed for speech recognition. One solution to that problem is to generate pronunciations for out-of-vocabulary words from learned rules, that is, grapheme-to-phoneme conversion. CMUSphinx provides a grapheme-to-phoneme (G2P) toolkit based on a recurrent neural network (RNN) with long short-term memory (LSTM) units, implemented in the TensorFlow framework, which provides state-of-the-art conversion accuracy.
During this week I mainly used the CMUSphinx g2p-seq2seq toolkit to extend the collected English dictionaries by training a G2P model.
- Installation
tensorflow and tensorflow-gpu are two different packages in pip, so be sure to change tensorflow to tensorflow-gpu in g2p-seq2seq/setup.py#L41 if you use a GPU-supported version. Install the toolkit with python setup.py install --prefix=/path/to/install_dir/ and test it with python setup.py test. Note that using TensorFlow 1.2 will fail the tests because of both RNN API and seq2seq issues.
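As an illustration, the installation steps above roughly form the shell session below (the install prefix is a placeholder, and the sed line is only needed for a GPU build):

    # get the toolkit from the CMUSphinx GitHub repository
    git clone https://github.com/cmusphinx/g2p-seq2seq.git
    cd g2p-seq2seq
    # switch the dependency on line 41 of setup.py to the GPU package (skip on CPU-only machines)
    sed -i '41s/tensorflow/tensorflow-gpu/' setup.py
    # install into a custom prefix and run the test suite
    python setup.py install --prefix=/path/to/install_dir/
    python setup.py test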
- Preparation
After installation, one should be able to run g2p-seq2seq from the command line. If not, please make sure that the installation directory is in the system PATH environment variable. There exists a pretrained G2P model, a 2-layer LSTM with 512 hidden units, which can be downloaded at sourceforge. The model is trained on the unstressed CMUDict, which is different from the stressed one at GitHub; the corresponding unstressed dictionary is also available at sourceforge. After downloading and uncompressing the pretrained model, one should be able to follow the simple examples at GitHub.
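For example, once the pretrained archive from sourceforge is unpacked (the archive name below is a guess; the model directory name matches the one used later in this post), the toolkit's interactive mode is a quick sanity check:

    # unpack the pretrained 2-layer, 512-unit model (actual archive name may differ)
    tar xzf g2p-seq2seq-cmudict.tar.gz
    # interactive decoding: type a word, get its phoneme sequence back
    g2p-seq2seq --interactive --model g2p-seq2seq-cmudict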
- Training
Run g2p-seq2seq --train train.dict --model /path/to/model to train a G2P model. The default learning_rate and learning_rate_decay_factor are 0.5 and 0.8 respectively, which are rather large for training. We can use --learning_rate and --learning_rate_decay_factor to set them to 0.1 and 0.5 respectively, and use --size and --num_layers to change the model architecture. For example, run g2p-seq2seq --train train.dict --max_steps 0 --size 512 --num_layers 3 --learning_rate 0.1 --learning_rate_decay_factor 0.5 --model g2p-512-3 to train a 3-layer LSTM with 512 hidden units. Training will use all available GPUs by default; to run on specific GPUs, or on just one, set the CUDA_VISIBLE_DEVICES environment variable before training. Here's the result of the 2-layer LSTM model on the collected English dictionary:

    Loading vocabularies from g2p-512-2
    Creating 2 layers of 512 units.
    Reading model parameters from g2p-512-2
    Beginning calculation word error rate (WER) on test sample.
    Words: 4507
    Errors: 2248
    WER: 0.499
    Accuracy: 0.501
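Putting the flags together, a training run pinned to a single GPU and followed by an accuracy check might look like the sketch below (test.dict is a hypothetical held-out dictionary; --evaluate is the toolkit's evaluation mode):

    # train a 3-layer, 512-unit LSTM on GPU 0 only
    CUDA_VISIBLE_DEVICES=0 g2p-seq2seq --train train.dict --max_steps 0 \
        --size 512 --num_layers 3 --learning_rate 0.1 --learning_rate_decay_factor 0.5 \
        --model g2p-512-3
    # measure word error rate on a held-out dictionary
    g2p-seq2seq --evaluate test.dict --model g2p-512-3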
- Testing
First, collect the out-of-vocabulary words into a file oov.vocab with one word per line. Then, generate the corresponding pronunciations using the pretrained model by running g2p-seq2seq --decode oov.vocab --output oov.dict --model g2p-seq2seq-cmudict, where oov.dict is the output pronunciation dictionary. Finally, we can run cat train.dict oov.dict | sort -u > dictionary.dict to merge the original dictionary and the generated dictionary for training the acoustic model.
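For completeness, here is one rough way to build oov.vocab from transcriptions, assuming a hypothetical transcript.txt with one utterance per line and a dictionary whose entries are upper-case words followed by phones (adjust the case conversion if your dictionary is lower-case):

    # list every distinct word that appears in the transcriptions
    tr ' ' '\n' < transcript.txt | grep . | tr '[:lower:]' '[:upper:]' | sort -u > words.vocab
    # list the words already covered by the training dictionary
    cut -d' ' -f1 train.dict | sort -u > known.vocab
    # out-of-vocabulary words = transcription words not in the dictionary
    comm -23 words.vocab known.vocab > oov.vocab
    # generate pronunciations for them and merge with the original dictionary
    g2p-seq2seq --decode oov.vocab --output oov.dict --model g2p-seq2seq-cmudict
    cat train.dict oov.dict | sort -u > dictionary.dict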
Next, I will train the acoustic model on the voxforge dataset to compare the difference between CMUDict and the collected dictionary. The G2P model trained on the collected dictionary will be used to generate pronunciations for out-of-vocabulary words in the transcriptions.