[GSoC 2017 with CMUSphinx] Post 1#: Parsing from Wiktionary XML Dump

This week I surveyed MediaWiki XML dump parsing tools, parsed the enwiktionary dump, and integrated the previously implemented pronunciation extraction functions into the new framework.

The code for "Collect Pronunciation Dictionaries from Wiktionary" project is available at https://github.com/abuccts/enwiktionary2cmudict and discussions can be made at the corresponding gitter room.

Progress:
  • Survey on MediaWiki XML dump parsing tools:
    • wikt2dict: a Wiktionary parser supporting many languages, but it does not work at present and is built for extracting the translation section only.
    • dictionary-builder: built to unmarshal very large XML documents with a low memory footprint, allowing dictionaries to be built from Wiktionary entries. The tool is written in Java, works fine with the enwiktionary dump, and converts the whole XML dump into more than 600,000 Wiktionary entry files in one pass. However, it supports only one language per pass, and integrating Java code into this Python project is inflexible; it could only be used as a black box executed once.
    • SPICE: the Interspeech 2010 paper "Wiktionary as a Source for Automatic Pronunciation Extraction" is similar in spirit to this project and has an accompanying toolkit, the Rapid Language Adaptation Toolkit (RLAT), but it dates from several years ago and is no longer online.
    • python-mwxml: a set of Python 3 utilities for processing MediaWiki XML dump data. It works well and performs better than the previous tools, but it supports only Python 3 and explicitly drops Python 2 support. I will use this tool to parse the enwiktionary dump from XML into wikitext, and make it compatible with Python 2 later.
  • I have integrated the wikitext pronunciation-section parser into the new project code. I tested it on the enwiktionary dump and it works fine on the first few thousand entries; it still needs improvement to support more languages.
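The pipeline above can be sketched roughly as follows, assuming mwxml's `Dump.from_file` API. The helper names (`iter_entries`, `extract_ipa`) and the regexes are illustrative only, not the project's actual code, and the regexes cover only the simple `{{IPA|...}}` template forms:

```python
import re

# Capture the body of a ===Pronunciation=== section up to the next heading.
PRON_SECTION = re.compile(r"===\s*Pronunciation\s*===\n(.*?)(?=\n==|\Z)", re.S)
# Capture the arguments of an {{IPA|...}} template.
IPA_TEMPLATE = re.compile(r"\{\{IPA\|([^}]*)\}\}")

def extract_ipa(wikitext):
    """Return IPA strings found in an entry's ===Pronunciation=== section."""
    m = PRON_SECTION.search(wikitext)
    if not m:
        return []
    ipa = []
    for args in IPA_TEMPLATE.findall(m.group(1)):
        # e.g. {{IPA|/.../|lang=en}} -- keep /.../ or [...] transcriptions,
        # drop named parameters such as lang=en
        for part in args.split("|"):
            if part.startswith("/") or part.startswith("["):
                ipa.append(part)
    return ipa

def iter_entries(dump_path):
    """Yield (title, wikitext) pairs from a MediaWiki XML dump via mwxml."""
    import mwxml  # third-party, Python 3 only: pip install mwxml
    dump = mwxml.Dump.from_file(open(dump_path))
    for page in dump:
        for revision in page:  # each page yields its revisions
            if revision.text:
                yield page.title, revision.text
```

A caller would then combine the two, e.g. `{title: extract_ipa(text) for title, text in iter_entries("enwiktionary.xml")}`, to collect raw IPA pronunciations per headword.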

Plans for next week:
  • We have now extracted words' pronunciation sections from Wiktionary; the next problem is converting formats such as IPA into a CMUBET-compatible format. I will survey mappings from each target language's phonemes to the nearest CMUBET phonemes. Phonemic pronunciations should be mapped to phoneme sets similar to CMUBET so that the resulting pronunciation dictionaries can be used in CMUSphinx.
  • Collect a full list of target languages' words from the enwiktionary dump.
  • Test IPA extraction for different languages and fix the buggy ones.
  • Make my fork of mwxml compatible with Python 2 (low priority; can be fixed later).
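The IPA-to-CMUBET conversion could start as a simple lookup table. The sketch below is illustrative only: the table covers just a handful of English phones with ARPABET-style symbols, not the project's actual mapping, and real IPA requires handling multi-character phones rather than the single-character greedy pass shown here:

```python
# -*- coding: utf-8 -*-
# Tiny illustrative IPA -> CMUBET-like table (assumption, not the real set).
IPA2CMUBET = {
    u"ɪ": "IH", u"i": "IY", u"ɛ": "EH", u"æ": "AE",
    u"ə": "AH", u"ʃ": "SH", u"ɹ": "R",
    u"d": "D", u"k": "K", u"n": "N", u"t": "T", u"s": "S",
}

def ipa_to_cmubet(ipa):
    """Map an IPA transcription to a list of CMUBET-like phones,
    character by character, dropping stress and syllable marks."""
    phones = []
    for ch in ipa.strip(u"/[]"):          # remove /.../ or [...] delimiters
        if ch in u"ˈˌː. ":                # stress, length, syllable marks
            continue                       # have no CMUBET counterpart
        phones.append(IPA2CMUBET.get(ch, ch))  # pass unknown symbols through
    return phones
```

Unknown symbols are passed through unchanged here so that gaps in the table show up in the output; a production version would more likely log them for review.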
