[GSoC 2017 with CMUSphinx] Post 7#: Pronunciation Dictionary Collection


This week I mainly updated the {{ru-IPA}} module and fixed some bugs in the existing code. The toolkit has now been used to collect pronunciation dictionaries for ten languages: English, French, German, Spanish, Italian, Russian, Hindi, Greek, Dutch and Mandarin, which can be downloaded from Dropbox.

The Wiktionary {{ru-IPA}} Lua module has been converted into Python scripts. Because the module is large, it still contains some bugs that cause unit tests to fail. I have integrated this module, together with {{fr-IPA}}, {{hi-IPA}} and {{zh-pron}}, into the existing Parser so that pronunciation dictionaries can be collected first; the remaining errors in the {{ru-IPA}} module can be located later.

To use the existing toolkit to collect a pronunciation dictionary for a given language, here's an example code snippet:

# First, import the module and create an instance of the Wiktionary class,
# setting the lang parameter to the target language.
>>> from pywiktionary import Wiktionary
>>> wikt = Wiktionary(lang="LANGUAGE", XSAMPA=True)

# extract IPA from a Wiktionary XML dump file
>>> dump_file = "/path/to/enwiktionary-latest-pages-articles-multistream.xml"
>>> pron = wikt.extract_IPA(dump_file)

# clean the result and keep only entries with IPA
>>> dic = []
>>> for each in pron:
...     if isinstance(each["pronunciation"], list):
...         dic.append(each)
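The cleaning step above can be sketched end to end with toy data. The entry shape and the output file name below are illustrative assumptions, not the parser's exact output format:

```python
import json

# Hypothetical entries mimicking the parser output: each entry has a word
# and either a list of IPA pronunciations or an error string.
pron = [
    {"word": "hello", "pronunciation": ["/həˈloʊ/", "/hɛˈloʊ/"]},
    {"word": "rare", "pronunciation": "IPA not found"},
]

# Keep only entries whose pronunciation field is a list of IPA strings
dic = [each for each in pron if isinstance(each["pronunciation"], list)]

# Save the cleaned dictionary as JSON (the posts name files like en.dict.v0.json)
with open("dict.v0.json", "w", encoding="utf-8") as f:
    json.dump(dic, f, ensure_ascii=False, indent=2)
```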

All ten collected dictionaries can be downloaded from Dropbox; here's a summary:

   Language    English   French   German   Spanish   Italian
   # Entries   45675     41819    21418    6851      4718

   Language    Russian   Hindi    Greek    Dutch     Mandarin
   # Entries   338784    1702     2090     8646      68261

Some languages have few words with IPA, such as Greek and Hindi. For Hindi, one solution is to apply the {{hi-IPA}} module to words without a pronunciation. These additions will be included in the next version of the pronunciation dictionaries.


There were some bugs in the code when I first collected the pronunciation dictionaries; here are the details:
  1. Nested brackets failed to be parsed. A nested template inside IPA, like être: {{IPA|[aɛ̯t{{x2i|X}}]|lang=fr}}, returned the wrong result [aɛ̯t{{x2i instead of [aɛ̯tχ], because the Python regex in the Parser did not handle nested brackets. This issue has been solved with a recursive regex pattern, "(?P<brackets>{{(?:[^{}]+|(?&brackets))*}})", in the Parser: the pattern matches a {{...}} template and calls itself recursively to match templates nested inside it.
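For illustration, here is an equivalent standard-library sketch of nested-template matching that uses an explicit depth counter instead of the regex package's recursive pattern (the helper name is my own):

```python
def find_templates(text):
    """Find top-level {{...}} templates, handling nesting.

    A stdlib sketch; the toolkit itself uses a recursive pattern
    from the third-party regex package.
    """
    results = []
    depth = 0
    start = 0
    i = 0
    while i < len(text) - 1:
        if text[i:i + 2] == "{{":
            if depth == 0:
                start = i  # remember where the outermost template begins
            depth += 1
            i += 2
        elif text[i:i + 2] == "}}" and depth > 0:
            depth -= 1
            if depth == 0:
                results.append(text[start:i + 2])
            i += 2
        else:
            i += 1
    return results

# The nested template from the être example above
line = "* {{IPA|[aɛ̯t{{x2i|X}}]|lang=fr}}"
print(find_templates(line))
# → ['{{IPA|[aɛ̯t{{x2i|X}}]|lang=fr}}']
```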

  2. Unicode language section titles failed to be parsed. A Unicode section title like Norwegian Bokmål on the hijab page failed to parse, so its pronunciations were attributed to the previous language section, Italian. This issue has been solved by supporting Unicode section titles in the Parser: the regex package replaces the default Python re library, and the Perl-like \p{L} pattern matches letters instead of \w.
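A minimal standard-library sketch of splitting a page into language sections; the heading pattern, function name, and sample IPA values are illustrative assumptions (the toolkit itself relies on the regex package):

```python
import re

def split_language_sections(wiki_text):
    """Group page lines under level-2 headings like '==Norwegian Bokmål=='."""
    sections = {}
    current = None
    for line in wiki_text.splitlines():
        # level-2 heading: exactly two '=' on each side
        m = re.match(r"^==([^=].*?)==\s*$", line)
        if m:
            current = m.group(1).strip()
            sections[current] = []
        elif current is not None:
            sections[current].append(line)
    return sections

# Hypothetical excerpt of a page with a Unicode section title
page = """==Italian==
* {{IPA|/iˈdʒab/|lang=it}}
==Norwegian Bokmål==
* {{IPA|/hiˈdʑɑːb/|lang=nb}}"""
print(list(split_language_sections(page)))
# → ['Italian', 'Norwegian Bokmål']
```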

  3. Some pronunciation sections in Chinese and Japanese failed to be parsed. Most pronunciation sections on Wiktionary list pronunciations in a "* ..." format, like the English entry present, but some Chinese and Japanese sections, like Chinese 一 and Japanese 一, do not. The solution is to stop matching the "* ..." pattern in pronunciation sections and match only "{{ ... }}" templates in the Parser.
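A small sketch of template-only matching in a pronunciation section, assuming a simplified multi-line {{zh-pron}} block with no leading "* " bullet (the field values are illustrative, and nesting is handled separately):

```python
import re

# A pronunciation section that lists no "* ..." bullets, only a template
section = """{{zh-pron
|m=yī
|c=jat1
}}"""

# Match "{{...}}" templates anywhere in the section; re.DOTALL lets "."
# cross newlines so multi-line templates are captured (flat templates only)
templates = re.findall(r"\{\{.*?\}\}", section, re.DOTALL)
print(templates)
```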

As the next step, I will gather statistics on the IPA distribution in the collected English dictionary and start training an acoustic model with the collected en.dict.v0.json, to compare its performance on benchmarks against the existing dictionary.
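A sketch of the planned IPA-distribution statistic, assuming the entry shape shown earlier; the sample words and the set of stripped delimiter characters are illustrative:

```python
from collections import Counter

# Hypothetical cleaned dictionary in the shape of en.dict.v0.json entries
dic = [
    {"word": "hello", "pronunciation": ["/həˈloʊ/"]},
    {"word": "cat", "pronunciation": ["/kæt/"]},
]

counts = Counter()
for entry in dic:
    for ipa in entry["pronunciation"]:
        # strip delimiters and stress marks before counting symbols
        counts.update(ch for ch in ipa if ch not in "/[]ˈˌ")

print(counts.most_common(3))
```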
