[GSoC 2017 with CMUSphinx] Post #7: Pronunciation Dictionary Collection
This week I mainly updated the {{ru-IPA}} module and fixed some bugs in the existing code. The toolkit has now been used to collect pronunciation dictionaries for ten languages: English, French, German, Spanish, Italian, Russian, Hindi, Greek, Dutch and Mandarin, which can be downloaded at Dropbox.

The Wiktionary {{ru-IPA}} Lua module has been converted into Python scripts. Because the code is large, there are still some bugs in this module that cause unit tests to fail. I have integrated this module, together with {{fr-IPA}}, {{hi-IPA}} and {{zh-pron}}, into the existing Parser to collect pronunciation dictionaries first; detailed errors in the {{ru-IPA}} module can be located later.

To use the existing toolkit to collect a pronunciation dictionary for a certain language, here's an example code snippet:
```python
# First, import the module and create an instance of the Wiktionary class,
# specifying the lang parameter to the language type.
>>> from pywiktionary import Wiktionary
>>> wikt = Wiktionary(lang="LANGUAGE", XSAMPA=True)
# Extract IPA from a Wiktionary XML dump file.
>>> dump_file = "/path/to/enwiktionary-latest-pages-articles-multistream.xml"
>>> pron = wikt.extract_IPA(dump_file)
# Clean the result and select entries with IPA.
>>> dic = []
>>> for each in pron:
...     if isinstance(each["pronunciation"], list):
...         dic.append(each)
```
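As a possible follow-up to the snippet above (this continues from the `dic` list built there; the file name mirrors the en.dict.v0.json mentioned at the end of this post, and the exact output layout is just illustrative, not necessarily what the toolkit writes):

```python
import json

# Illustrative only: dump the filtered entries to a JSON file for reuse later.
with open("en.dict.v0.json", "w", encoding="utf-8") as f:
    json.dump(dic, f, ensure_ascii=False, indent=2)
```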
The ten collected dictionaries can be downloaded at Dropbox. Here's a summary of the dictionaries:
| Language | English | French | German | Spanish | Italian |
|---|---|---|---|---|---|
| # Entries | 45675 | 41819 | 21418 | 6851 | 4718 |

| Language | Russian | Hindi | Greek | Dutch | Mandarin |
|---|---|---|---|---|---|
| # Entries | 338784 | 1702 | 2090 | 8646 | 68261 |
Some languages have few words with IPA, like Greek and Hindi. For Hindi, a solution is to apply the {{hi-IPA}} module to words without pronunciation. These attempts will be adopted in the next version of the pronunciation dictionaries.

There were some bugs in the code when I first collected the pronunciation dictionaries. Here are some details:
- Nested brackets failed to be parsed. Nested brackets in IPA, like être: {{IPA|[aɛ̯t{{x2i|X}}]|lang=fr}}, returned the wrong result [aɛ̯t{{x2i instead of [aɛ̯tχ], because the Python regex in the Parser didn't handle nested brackets. This issue has been solved with a recursive regex pattern, "(?P<brackets>{{(?:[^{}]+|(?&brackets))*}})", in the Parser. The new pattern matches a {{...}} template directly or recursively inside the same pattern (see the first sketch after this list).
- Unicode language section titles failed to be parsed. A Unicode section title like Norwegian Bokmål in hijab failed to be parsed, so its pronunciations were attributed to the previous language section, Italian. This issue has been solved by supporting Unicode wiki text section titles in the Parser: the regex package replaces the default Python re library, and the Perl-like \p{L} pattern is used to match letters instead of \w (see the second sketch after this list).
- Some pronunciation sections in Chinese and Japanese failed to be parsed. Most pronunciation sections in Wiktionary use the "* ..." format to list pronunciations, like the English entry present, while some pronunciation sections in Chinese and Japanese don't, like Chinese 一 and Japanese 一. The solution is to stop matching the "* ..." pattern in pronunciation sections and only match "{{ ... }}" templates in the Parser (the first sketch after this list covers this case as well).
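Here's a minimal sketch of the recursive template matching described in the first and third items above (the surrounding code is illustrative, not the Parser's actual implementation):

```python
# Requires the third-party "regex" package (pip install regex);
# the standard "re" module does not support recursive patterns.
import regex

# Match a balanced {{ ... }} template: runs of non-brace characters,
# or a nested template matched by recursing into the "brackets" group.
TEMPLATE = regex.compile(r"(?P<brackets>{{(?:[^{}]+|(?&brackets))*}})")

wikitext = "* {{IPA|[aɛ̯t{{x2i|X}}]|lang=fr}}"
# findall picks up every template in a pronunciation section,
# whether or not the line uses the "* ..." list format.
print(TEMPLATE.findall(wikitext))
# ['{{IPA|[aɛ̯t{{x2i|X}}]|lang=fr}}']
```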
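And a small sketch of the Unicode section title matching from the second item (the exact heading pattern used in the Parser is an assumption):

```python
import regex

# \p{L} matches any Unicode letter, so a title like "Norwegian Bokmål"
# is recognized as a level-2 language section heading.
SECTION = regex.compile(r"^==\s*([\p{L} ]+?)\s*==$", regex.MULTILINE)

page = "==Italian==\nsome wiki text\n==Norwegian Bokmål==\nmore wiki text\n"
print(SECTION.findall(page))
# ['Italian', 'Norwegian Bokmål']
```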
Next, I will get some statistics on the IPA distribution in the collected English dictionary and start training an acoustic model with the collected en.dict.v0.json, to compare its performance on benchmarks with the existing dictionary.
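A first rough look at that IPA distribution could be a simple symbol count over the collected entries. The en.dict.v0.json layout assumed below (a list of entries whose "pronunciation" field is a list of IPA strings) follows the snippet earlier in this post and is otherwise an assumption; proper IPA tokenization would also need to group diacritics and multi-character symbols.

```python
import json
from collections import Counter

# Rough character-level count of IPA symbols in the collected dictionary.
with open("en.dict.v0.json", encoding="utf-8") as f:
    entries = json.load(f)

counts = Counter()
for entry in entries:
    for ipa in entry["pronunciation"]:
        # Skip delimiters and stress marks, count everything else.
        counts.update(ch for ch in ipa if ch not in "/[]ˈˌ. ")

for symbol, freq in counts.most_common(20):
    print(symbol, freq)
```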