[GSoC 2017 with CMUSphinx] Post #2: Survey of Phoneme Sets

This week I collected a full word list from the Wiktionary XML dump, brought the existing Python code in line with PEP 8, and surveyed the phoneme sets that could be used in the collected dictionaries.

Progress:
  • I collected a full word list from the enwiktionary 20170601 XML dump using the enwiktionary2cmudict/test.py script. It took about 30 minutes to traverse the whole 4.8 GB XML file. The processing did not go smoothly: unexpected errors in the XML file caused mwxml to stop about halfway through. It raised a MalformedXML error at around pages No. 2,113,400 to 2,113,500 because of an unexpected tag. As a workaround, I changed the raise statement into a return statement that returns an invalid Page object, after which the whole traversal can finish. Here are some statistics from the Wiktionary XML dump:
    • There are 5,245,844 entries with "namespace=0" in the XML dump, close to the figure Wiktionary itself reports (5,240,252 entries).
    • Number of entries for each of the ten selected languages:
      Figure 1. Number of Entries in Wiktionary XML Dump
    • Number of entries with pronunciation for each of the ten selected languages:
      Figure 2. Number of Entries with Pronunciation in Wiktionary XML Dump
  • To make the code easier to read and use, I brought its style in line with PEP 8 this week: I changed the code layout, used specific namespaces, and refactored some discouraged usages. Comments and docstrings still need to be updated. Pylint is now also included to check the code style.
  • Survey on phoneme sets:
    • IPA: The International Phonetic Alphabet (IPA) is an alphabetic system of phonetic notation based primarily on the Latin alphabet, devised by linguists to accurately and uniquely represent each of the wide variety of sounds (phones or phonemes) used in spoken human language. Wiktionary mainly uses IPA for pronunciation because it can be applied to all languages. Category:Pronunciation_by_language records the IPA representations used for each language.
    • enPR: enPR symbols represent the various pronunciations of English, so enPR is also common in enwiktionary. There is a simple mapping between enPR and IPA for English, so it is not necessary to use enPR in the collected dictionaries.
    • ARPAbet: ARPAbet represents each phoneme of General American English with a distinct sequence of ASCII characters. The phoneme set that the current English cmudict uses is based on ARPAbet symbols. There is also a simple mapping between ARPAbet and the standard IPA symbol set.
    • SAMPA: The Speech Assessment Methods Phonetic Alphabet (SAMPA) is a computer-readable phonetic script based on IPA that uses 7-bit printable ASCII characters. It supports multiple languages, but its symbol set varies from language to language.
    • X-SAMPA: The Extended Speech Assessment Methods Phonetic Alphabet (X-SAMPA) is designed to unify the individual language SAMPA alphabets, and extend SAMPA to cover the entire range of characters in the IPA. It remaps IPA into 7-bit ASCII and supports all languages.
    • CMUSphinx dict: The dictionaries used in CMUSphinx have different phoneme sets, but all of them use letters only; special symbols are not supported. Since the only pronunciation source we can collect from Wiktionary is IPA, we are going to map the IPA symbols, or their X-SAMPA equivalents, to letter-only symbols so that they can be used in CMUSphinx.
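
As a rough illustration of the IPA to letter-only mapping described in the survey, here is a minimal Python sketch that converts an English IPA string into ARPAbet-style phones. The table is deliberately partial, and the names ARPABET_TO_IPA and ipa_to_arpabet are made up for this example; they are not part of the project code. Stress marks are simply dropped, which a real converter would probably not want to do.

```python
# Partial ARPAbet -> IPA table for General American English.
# Illustrative only: many phones are missing.
ARPABET_TO_IPA = {
    "AA": "ɑ", "AE": "æ", "AH": "ʌ", "EH": "ɛ", "IH": "ɪ", "IY": "i",
    "OW": "oʊ", "UW": "u", "AY": "aɪ", "CH": "tʃ", "JH": "dʒ", "SH": "ʃ",
    "TH": "θ", "DH": "ð", "NG": "ŋ", "HH": "h", "K": "k", "L": "l",
    "T": "t", "D": "d", "S": "s", "N": "n", "M": "m",
}

# Reverse table: IPA -> letter-only symbol.
IPA_TO_ARPABET = {ipa: arpa for arpa, ipa in ARPABET_TO_IPA.items()}
IPA_TO_ARPABET["ə"] = "AH"  # schwa is written as unstressed AH in ARPAbet

# Try longer IPA sequences first so that "tʃ" wins over "t".
_KEYS = sorted(IPA_TO_ARPABET, key=len, reverse=True)


def ipa_to_arpabet(ipa):
    """Convert an IPA transcription to a list of letter-only phones."""
    text = ipa.strip("/[]")
    for mark in ("ˈ", "ˌ"):  # assumption: stress is ignored in this sketch
        text = text.replace(mark, "")
    phones = []
    i = 0
    while i < len(text):
        for key in _KEYS:
            if text.startswith(key, i):
                phones.append(IPA_TO_ARPABET[key])
                i += len(key)
                break
        else:
            raise ValueError("unmapped IPA symbol: %r" % text[i])
    return phones


print(ipa_to_arpabet("/həˈloʊ/"))
```

Raising an error on unmapped symbols, rather than skipping them, makes gaps in the table visible early when running over a large word list.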

Plans for next week:
  • Package and prepare to distribute the existing Python project.
  • Extract the IPA of all words in the ten selected languages and convert it to X-SAMPA.
  • Write phoneme converters from IPA/X-SAMPA to the dictionary phoneme sets for several languages.
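
The planned IPA to X-SAMPA step could start from something like the sketch below. It assumes a character-by-character translation is enough, which holds for the small subset of symbols listed here but not for tie bars or most diacritics. The names IPA_TO_XSAMPA and ipa_to_xsampa are hypothetical and only serve this example.

```python
# Partial IPA -> X-SAMPA table; illustrative only, far from the full alphabet.
IPA_TO_XSAMPA = {
    "ə": "@", "ɛ": "E", "ɪ": "I", "ʊ": "U", "ɔ": "O", "æ": "{",
    "ɑ": "A", "ʌ": "V", "ŋ": "N", "ʃ": "S", "ʒ": "Z", "θ": "T",
    "ð": "D", "ɹ": "r\\",
    "ˈ": '"',  # primary stress
    "ˌ": "%",  # secondary stress
    "ː": ":",  # length mark
}


def ipa_to_xsampa(ipa):
    """Translate an IPA string to X-SAMPA one character at a time."""
    out = []
    for ch in ipa.strip("/[]"):
        if ch in IPA_TO_XSAMPA:
            out.append(IPA_TO_XSAMPA[ch])
        elif "a" <= ch <= "z":
            # Plain lower-case Latin letters keep their value in X-SAMPA.
            out.append(ch)
        else:
            raise ValueError("unmapped IPA symbol: %r" % ch)
    return "".join(out)


print(ipa_to_xsampa("/həˈloʊ/"))
```

Because the output is pure 7-bit ASCII, it can then be fed to a second, per-language table that maps X-SAMPA symbols to the letter-only phones a dictionary expects.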
