[GSoC 2017 with CMUSphinx] Post #3: IPA Extraction and X-SAMPA Conversion

This week I finished the code for IPA extraction and IPA to X-SAMPA conversion, ran it on the whole Wiktionary XML dump, and prepared to release the first version of the toolkit.
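To give a rough idea of what the IPA to X-SAMPA conversion involves, here is a minimal sketch using a character-level lookup table. The table is only a small subset of the standard X-SAMPA mapping, and the names `IPA_TO_XSAMPA` and `ipa_to_xsampa` are made up for this illustration; they are not the toolkit's actual API. Symbols that need more than one IPA character (tie bars, diacritics) are not handled here.

```python
# A small subset of the standard IPA -> X-SAMPA correspondences.
# Hypothetical names for illustration; not the toolkit's real code.
IPA_TO_XSAMPA = {
    "ə": "@", "ɜ": "3", "ɪ": "I", "ʊ": "U",
    "æ": "{", "ɑ": "A", "ɔ": "O", "ɛ": "E",
    "ŋ": "N", "ʃ": "S", "ʒ": "Z", "θ": "T", "ð": "D",
    "ː": ":",   # length mark
    "ˈ": '"',   # primary stress
    "ˌ": "%",   # secondary stress
}


def ipa_to_xsampa(ipa):
    """Convert an IPA string one character at a time.

    Characters missing from the table (plain ASCII letters, slashes,
    brackets) are passed through unchanged.
    """
    return "".join(IPA_TO_XSAMPA.get(ch, ch) for ch in ipa)


print(ipa_to_xsampa("/wɜːd/"))
```

A real converter needs the full table and has to match multi-character sequences before single characters.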


Progress:

  • Finished the code for wikitext parsing and IPA extraction. There are currently two main classes in the code, `Wiktionary` and `Parser`:
    • `Wiktionary`: Provides functions to set the language (by default, all languages are parsed) and to choose whether IPA is converted to X-SAMPA (off by default). It exposes `Wiktionary.get_entry_pronunciation(wiki_text)` to parse wikitext (for the enwiktionary dump) and `Wiktionary.lookup(word)` to look up the pronunciation of a word on https://en.wiktionary.org/ through the MediaWiki API.
    • `Parser`: The main part of the toolkit. `Parser.parse(wiki_text)` searches for the level 3 heading "===Pronunciation===" or the level 4 heading "====Pronunciation====", finds the "{{IPA}}" or "{{*-IPA}}" templates, and extracts their content. The pronunciation in an {{IPA}} template can be extracted directly, but a {{*-IPA}} template such as {{ru-IPA}} is expanded by a template call into Lua scripts, in this case Module:ru-pron. Those Lua scripts cannot easily be run locally, so for now such {{*-IPA}} templates are expanded online through the MediaWiki API, which is much slower. I plan to fix this later.
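The extraction steps described above can be sketched roughly as follows. This is a simplified illustration, not the toolkit's actual `Parser` code: the function names `extract_ipa` and `expand_url`, the regular expressions, and the sample entry are all assumptions made for this example. It treats any {{IPA}} argument that starts with "/" or "[" as a transcription, and for {{*-IPA}} templates it only builds the URL for the MediaWiki `expandtemplates` API action without sending the request.

```python
import re
from urllib.parse import urlencode

# A level 3 or level 4 "Pronunciation" heading, capturing the section
# body up to the next heading line (or the end of the text).
SECTION_RE = re.compile(
    r"^={3,4}[ \t]*Pronunciation[ \t]*={3,4}[ \t]*$(.*?)(?=^=|\Z)",
    re.MULTILINE | re.DOTALL,
)
# {{IPA|...}} or {{xx-IPA|...}}; nested templates are not handled.
TEMPLATE_RE = re.compile(r"\{\{((?:[a-z]+-)?IPA)(?:\|([^{}]*))?\}\}")

API_URL = "https://en.wiktionary.org/w/api.php"


def extract_ipa(wiki_text):
    """Return (transcriptions, templates_needing_expansion)."""
    plain, needs_expansion = [], []
    for section in SECTION_RE.findall(wiki_text):
        for match in TEMPLATE_RE.finditer(section):
            name, args = match.group(1), match.group(2) or ""
            if name == "IPA":
                # Assumption: transcriptions start with "/" or "[";
                # other arguments (e.g. lang=en) are skipped.
                for arg in args.split("|"):
                    arg = arg.strip()
                    if arg.startswith(("/", "[")):
                        plain.append(arg)
            else:
                # Language-specific template: its output comes from a
                # Lua module, so it has to be expanded by MediaWiki.
                needs_expansion.append(match.group(0))
    return plain, needs_expansion


def expand_url(template_text):
    """Build (but do not send) an expandtemplates API request URL."""
    params = {
        "action": "expandtemplates",
        "text": template_text,
        "prop": "wikitext",
        "format": "json",
    }
    return API_URL + "?" + urlencode(params)


# Made-up sample entry, loosely modeled on dump wikitext.
SAMPLE = """==English==
===Pronunciation===
* {{IPA|/wɜːd/|/wɝd/|lang=en}}

===Noun===
{{en-noun}}

==Russian==
===Pronunciation===
* {{ru-IPA|сло́во}}
"""

plain, pending = extract_ipa(SAMPLE)
print(plain)
print([expand_url(t) for t in pending])
```

Separating the two cases this way keeps the fast path (plain {{IPA}} templates) fully offline and makes the slow online expansion explicit, so it can later be cached or replaced.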

Plans for next week:

  • Complete the docstrings in the Python code and write a detailed README for the toolkit.
  • Finish packaging the toolkit for distribution and release the first version.
  • Check the IPA and X-SAMPA results in detail for several languages and collect the target pronunciation dictionaries.
  • Prepare for the first evaluation.
