[GSoC 2017 with CMUSphinx] Post 6#: IPA Template Generation


UPDATE: Because CMUdict covers only North American English pronunciation, and the current project name "enwiktionary2cmudict" is too long, we decided to rename the project to "wikt2pron". The project's code can be found on GitHub.

This week I mainly converted several Lua pronunciation modules from Wiktionary to Python scripts and added test cases so that those modules can be used in this project. {{IPA}} templates contain the IPA of an entry directly, while other IPA-related templates such as {{ru-IPA}} do not: these special templates generate the IPA from the entry text using a complex Lua module. The current solution for templates other than {{IPA}} is to expand them through the Wiktionary API, which takes a lot of time and may get blocked by Wiktionary. Once the Wiktionary Lua modules are converted to offline Python scripts and integrated into the existing code, pronunciations parsed from the XML dump can be processed quickly.
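
The offline approach described above amounts to looking up each pronunciation template by name and handing its text to the matching converter. The sketch below only illustrates that dispatch and is not code from wikt2pron: `parse_template`, `expand`, and the `CONVERTERS` table are hypothetical names, the function registered for `fr-IPA` is a stub standing in for a converted module, and nested templates are not handled.

```python
import re

# Matches a flat template such as {{fr-IPA|portions|pos=v}} (no nesting).
TEMPLATE_RE = re.compile(r"\{\{([^{}|]+)\|([^{}]*)\}\}")


def parse_template(wikitext):
    """Return (name, positional_args, named_args) for the first template."""
    match = TEMPLATE_RE.search(wikitext)
    if match is None:
        return None
    positional, named = [], {}
    for part in match.group(2).split("|"):
        if "=" in part:
            key, value = part.split("=", 1)
            named[key.strip()] = value.strip()
        else:
            positional.append(part.strip())
    return match.group(1).strip(), positional, named


def stub_fr_to_ipa(text, pos=None):
    # Placeholder only: a real converter would apply the ported Lua rules.
    return "<fr:{}:{}>".format(text, pos)


# Hypothetical registry: template name -> offline converter.
CONVERTERS = {"fr-IPA": stub_fr_to_ipa}


def expand(wikitext):
    """Return the list of pronunciations found for one template."""
    parsed = parse_template(wikitext)
    if parsed is None:
        return []
    name, positional, named = parsed
    if name == "IPA":
        # {{IPA}} already carries the transcription, so just pick it out.
        return [p for p in positional if p.startswith(("/", "["))]
    converter = CONVERTERS.get(name)
    if converter is None or not positional or not positional[0]:
        return []  # no offline converter registered, or nothing to convert
    return [converter(positional[0], pos=named.get("pos"))]


print(expand("{{IPA|/test/|lang=en}}"))
print(expand("{{fr-IPA|portions|pos=v}}"))
```

Keeping the converters in a plain dict means a newly ported module only needs one extra entry, and templates without a converter simply yield no result.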

This post introduces some tips on converting Lua scripts to Python and gives some examples of how to expand text to IPA for {{*-IPA}} templates.

Tips on converting Lua script to Python script:
  • string.gsub(s, pattern, replace [, n])

    As shown in the fr-pron module, the Lua code contains many string.gsub calls, which substitute matches of a pattern in a string. The Python counterpart is re.sub(pattern, repl, string, count=0, flags=0). Note that the string is the first argument in Lua but the third in Python.
    The string.gsub function in Lua is very flexible: the replace parameter can be another function, which can be used like this:
    text = string.gsub(text, 'm([bp])⁀', function(bp)
      local capbp = {b='B', p='P'}
      return 'm' .. capbp[bp] .. '⁀'
     end)
    The replace function can be implemented using a lambda expression in Python, which looks like this:
    >>> text = re.sub("m([bp])⁀", lambda x: "m" + x.group(1).upper() + "⁀", text)
    When the replace function in string.gsub becomes complex, like:
    word = string.gsub(word, '(.?)ं(.)', function(succ, prev)
      return succ .. (succ..prev == "a" and "्म" or
      (succ == "" and match(prev, '[' .. vowel .. ']') and "̃" or nasal_assim[succ] or "n")) .. prev
     end)
    A regular named function can be passed as the replace argument, like this:
    >>> def repl(match):
    ...     succ, prev = match.group(1), match.group(2)
    ...     if succ + prev == "a":
    ...         return succ + "्म" + prev
    ...     if succ == "" and re.match("[" + vowel + "]", prev):
    ...         return succ + "̃" + prev
    ...     if succ in nasal_assim:
    ...         return succ + nasal_assim[succ] + prev
    ...     return succ + "n" + prev
    ...
    >>> word = re.sub("(.?)ं(.)", repl, word)
    Lua patterns are very similar to Python regular expressions, except that Lua uses % to escape characters while Python uses \. So a replacement string that references groups, such as "%1abc%2", can be converted to r"\1abc\2" in Python.

    Please also note that the string.gsub function used in Wiktionary Lua modules is mw.ustring.gsub, which is extended for Unicode strings. So in Python we use the regex package, which is backwards-compatible with the standard re module, like this: import regex as re.

  • Lua tables

    Many variables in the Lua pronunciation modules are defined using tables, for example,
    local remove_diaeresis_from_vowel =
     {['ä']='a', ['ë']='e', ['ï']='i', ['ö']='o', ['ü']='u', ['ÿ']='i'}
    Lua tables are similar to Python dictionaries, but not the same. The obvious difference is the syntax, so the definition should be converted to
    >>> remove_diaeresis_from_vowel = {
    ...     'ä': 'a', 'ë': 'e', 'ï': 'i', 'ö': 'o', 'ü': 'u', 'ÿ': 'i'}
    An important difference when converting is that indexing a Lua table with an undefined key returns nil, while a Python dict raises an error. For example, remove_diaeresis_from_vowel['x'] returns nil in Lua but raises a KeyError in Python. So where the Lua code uses a table lookup in a condition, the Python code must check that the key exists before looking it up.
    function repl (vowel)
     local undo_diaeresis = remove_diaeresis_from_vowel[vowel]
     return undo_diaeresis and 'gu' .. undo_diaeresis or 'g' .. vowel
    end
    In Python, first check whether the vowel variable is a key of the remove_diaeresis_from_vowel dict:
    >>> def repl(vowel):
    ...     if (vowel in remove_diaeresis_from_vowel
    ...             and remove_diaeresis_from_vowel[vowel]):
    ...         return "gu" + remove_diaeresis_from_vowel[vowel]
    ...     return "g" + vowel
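
The two tips above can be combined in one small helper. The following is a minimal sketch, not code from wikt2pron: `convert_repl` and `lua_gsub` are hypothetical names, and the helper assumes the pattern has already been rewritten by hand as a Python regular expression (Lua patterns themselves are not translated). It mirrors three behaviours of Lua's string.gsub: `%1`-style group references in a replacement string, the optional `n` limit, and a table used as the replacement, where a missing key leaves the match unchanged instead of raising a KeyError.

```python
import re


def convert_repl(lua_repl):
    """Rewrite a Lua replacement string such as "%1abc%2" for re.sub."""
    def fix(match):
        ch = match.group(1)
        # "%%" is a literal percent sign; "%0".."%9" reference groups.
        # \g<1> is used instead of \1 so a following digit is not ambiguous.
        return "%" if ch == "%" else "\\g<" + ch + ">"
    # Backslashes are ordinary characters in Lua but special in Python.
    escaped = lua_repl.replace("\\", "\\\\")
    return re.sub(r"%([0-9%])", fix, escaped)


def lua_gsub(text, pattern, repl, n=None):
    """Approximate string.gsub; returns (new_text, number_of_matches)."""
    if n == 0:
        return text, 0  # in Python, count=0 would mean "replace all"
    if isinstance(repl, dict):
        table = repl

        def lookup(match):
            # Lua uses the first capture as the key (or the whole match if
            # the pattern has no captures) and keeps the match on nil.
            key = match.group(1) if match.re.groups else match.group(0)
            return table.get(key, match.group(0))
        repl = lookup
    elif isinstance(repl, str):
        repl = convert_repl(repl)
    # Callables are passed through and receive a match object, as in re.sub.
    return re.subn(pattern, repl, text, count=n or 0)


print(lua_gsub("hello world", r"(\w+) (\w+)", "%2 %1"))  # ('world hello', 1)
print(lua_gsub("mb mp mx", "m([bpx])", {"b": "B", "p": "P"}))  # ('B P mx', 3)
print(lua_gsub("aaa", "a", "b", 2))  # ('bba', 2)
```

Using dict.get with the original match as the default is what replaces Lua's silent nil: the lookup never raises, so no separate membership check is needed in this case.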
Examples of how to expand text to IPA in {{*-IPA}} templates:
  1.  {{hi-IPA}}
    >>> from IPA import hi_pron
    >>> text = "इकट्ठा"
    >>> hi_pron.to_IPA(text)
    "ɪ.kəʈ.ʈʰɑː"
  2.  {{fr-IPA}}
    >>> from IPA import fr_pron
    >>> text = "arable"
    >>> fr_pron.to_IPA(text)
    "a.ʁabl"
    
    # For verbs, using `pos` parameter
    >>> from IPA import fr_pron
    >>> text = "portions"
    >>> fr_pron.to_IPA(text, pos="v")
    "pɔʁ.tjɔ̃"
  3.  {{zh-pron}}
    >>> from IPA import cmn_pron
    >>> text = "pīnyīn"
    >>> cmn_pron.to_IPA(text)
    "pʰin⁵⁵ in⁵⁵"

Next steps:
  1. Finish converting the last Lua module, ru-pron, to Python.
  2. Integrate the current {{IPA}} template modules into the parser.
  3. Extract all IPA pronunciations for 10 selected languages from the Wiktionary XML dump.

