[GSoC 2017 with CMUSphinx] Post #6: IPA Template Generation
UPDATE: Because CMUdict covers only North American English pronunciation, and the current project name "enwiktionary2cmudict" is too long, we decided to rename the project to "wikt2pron". The project's code can be found on GitHub.
This week I mainly converted several Lua pronunciation modules from Wiktionary into Python scripts and added test cases, so that those modules can be used in this project.
{{IPA}} templates contain the entry's IPA directly, while other IPA-related templates such as {{ru-IPA}} don't. Those special templates generate the IPA from the entry text using a complex Lua module. The current solution for templates other than {{IPA}} is to expand them through the Wiktionary API, which takes a lot of time and may get blocked by Wiktionary. Once the Wiktionary Lua modules are converted to offline Python scripts and integrated into the existing code, pronunciations parsed from the XML dump can be processed quickly.
This post introduces some tips on converting Lua scripts to Python scripts and some examples of how to expand text to IPA in {{*-IPA}} templates.
Tips on converting Lua script to Python script:
- string.gsub(s, pattern, replace [, n])
As shown in the fr-pron module, the Lua code contains many string.gsub calls, which substitute a pattern in a string. The Python counterpart, which also substitutes a regex pattern in a string, is re.sub(pattern, repl, string, count=0, flags=0).
The string.gsub function in Lua is very flexible: the replace parameter can be another function, which can be used like this:
text = string.gsub(text, 'm([bp])⁀', function(bp)
    local capbp = {b='B', p='P'}
    return 'm' .. capbp[bp] .. '⁀'
end)
The replace function can be implemented using a lambda expression in Python, which looks like this:
>>> text = re.sub("m([bp])⁀", lambda x: "m" + x.group(1).upper() + "⁀", text)
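For reference, here is a self-contained version of the same substitution that can be run directly. The sample input string is made up for illustration; only the pattern and the lambda come from the example above.

```python
import re

# Hypothetical sample input; the "⁀" character is copied from the pattern above.
text = "camp⁀ombre⁀"

# The callable receives a match object; group(1) is the captured "b" or "p".
result = re.sub("m([bp])⁀", lambda m: "m" + m.group(1).upper() + "⁀", text)
print(result)
```

Only the "mp⁀" at the end of the first word matches; the "mb" in the second word is followed by "r", so it is left alone.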
When the replace function in string.gsub becomes complex, like:
word = string.gsub(word, '(.?)ं(.)', function(succ, prev)
    return succ .. (succ..prev == "a" and "्म" or
        (succ == "" and match(prev, '[' .. vowel .. ']') and "̃" or nasal_assim[succ] or "n")) .. prev
end)
a named function can be passed as the repl argument instead, like this:
>>> def repl(match):
...     succ, prev = match.group(1), match.group(2)
...     if succ + prev == "a":
...         return succ + "्म" + prev
...     if succ == "" and re.match("[" + vowel + "]", prev):
...         return succ + "̃" + prev
...     if succ in nasal_assim:
...         return succ + nasal_assim[succ] + prev
...     return succ + "n" + prev
...
>>> word = re.sub("(.?)ं(.)", repl, word)
Patterns in Lua are very similar to regular expressions in Python, except that Lua uses % to escape characters while Python uses \. So a replacement string that references groups, such as "%1abc%2", can be converted to r"\1abc\2" in Python.
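When many replacement strings need converting, the rewrite of group references can be automated. The helper below is only a sketch: lua_repl_to_python is a hypothetical name, not part of the project, and it handles replacement strings only. It emits \g<N> rather than \N so that a reference followed by a literal digit stays unambiguous.

```python
import re

DIGITS = "0123456789"

def lua_repl_to_python(lua_repl):
    """Translate a Lua gsub replacement string into a Python re.sub template."""
    out = []
    i = 0
    while i < len(lua_repl):
        ch = lua_repl[i]
        if ch == "%" and i + 1 < len(lua_repl):
            nxt = lua_repl[i + 1]
            if nxt in DIGITS:
                # %0 is the whole match and %1..%9 are captures.
                out.append("\\g<" + nxt + ">")
            else:
                # "%%" stands for a literal percent sign; a backslash must
                # still be doubled for Python's template syntax.
                out.append("\\\\" if nxt == "\\" else nxt)
            i += 2
        elif ch == "\\":
            # Backslashes are special in Python templates, so double them.
            out.append("\\\\")
            i += 1
        else:
            out.append(ch)
            i += 1
    return "".join(out)

template = lua_repl_to_python("%1abc%2")
print(template)
print(re.sub("(x)-(y)", template, "x-y"))
```

Lua character classes inside the pattern itself (for example %a or %w) are not covered here and still need to be translated by hand.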
Please also note that the string.gsub function used in Wiktionary Lua modules is mw.ustring.gsub, which is extended for Unicode strings. So we use the regex package, which is backwards-compatible with the standard re module in Python, like this: import regex as re.
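A small sketch of that drop-in import is shown below. The fallback to the standard re module is my own addition so the snippet also runs where the third-party regex package is not installed, and the sample word is made up for illustration.

```python
try:
    import regex as re  # third-party package, assumed to be installed
except ImportError:
    import re  # standard-library fallback, enough for this small example

# The call looks the same whichever module ends up bound to the name "re".
word = re.sub("([aeiou])ï", r"\1i", "naïf")
print(word)
```

Because the module is imported under the name re, converted code can keep calling re.sub and re.match unchanged.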
- Lua tables
Many variables in Lua pronunciation modules are defined using tables, for example,
local remove_diaeresis_from_vowel = {['ä']='a', ['ë']='e', ['ï']='i', ['ö']='o', ['ü']='u', ['ÿ']='i'}
Lua tables are similar to Python dictionaries, but not the same. The obvious difference is the syntax: the definition above should be converted to
>>> remove_diaeresis_from_vowel = {'ä': 'a', 'ë': 'e', 'ï': 'i', 'ö': 'o', 'ü': 'u', 'ÿ': 'i'}
An important difference when converting is that looking up a missing key in a Lua table returns nil, while a Python dict raises an error. For example, remove_diaeresis_from_vowel['x'] returns nil in Lua but raises a KeyError in Python. So where the Lua code uses a table lookup in a condition, the Python code should check that the key exists before looking it up. For example, take this Lua function:
function repl(vowel)
    local undo_diaeresis = remove_diaeresis_from_vowel[vowel]
    return undo_diaeresis and 'gu' .. undo_diaeresis or 'g' .. vowel
end
In the Python version, the code should first check whether the vowel variable is a key of the remove_diaeresis_from_vowel dict:
>>> def repl(vowel):
...     if vowel in remove_diaeresis_from_vowel \
...             and remove_diaeresis_from_vowel[vowel]:
...         return "gu" + remove_diaeresis_from_vowel[vowel]
...     return "g" + vowel
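An alternative that stays closer to the Lua original is dict.get, which returns None for a missing key much like a Lua table returns nil. This is a sketch of that variant, not the code used in the project:

```python
remove_diaeresis_from_vowel = {"ä": "a", "ë": "e", "ï": "i", "ö": "o", "ü": "u", "ÿ": "i"}

def repl(vowel):
    # dict.get returns None instead of raising KeyError for a missing key.
    undo_diaeresis = remove_diaeresis_from_vowel.get(vowel)
    if undo_diaeresis:
        return "gu" + undo_diaeresis
    return "g" + vowel

print(repl("ü"))  # key present: the diaeresis is removed
print(repl("a"))  # key missing: the vowel is kept as-is
```

This keeps the lookup and the existence check in one step, at the cost of treating an empty-string value the same as a missing key, which matches the truthiness test in the version above.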
Examples of expanding text to IPA in {{*-IPA}} templates:
{{hi-IPA}}
>>> from IPA import hi_pron
>>> text = "इकट्ठा"
>>> hi_pron.to_IPA(text)
"ɪ.kəʈ.ʈʰɑː"
{{fr-IPA}}
>>> from IPA import fr_pron
>>> text = "arable"
>>> fr_pron.to_IPA(text)
"a.ʁabl"
>>> # For verbs, use the `pos` parameter
>>> from IPA import fr_pron
>>> text = "portions"
>>> fr_pron.to_IPA(text, pos="v")
"pɔʁ.tjɔ̃"
{{zh-pron}}
>>> from IPA import cmn_pron
>>> text = "pīnyīn"
>>> cmn_pron.to_IPA(text)
"pʰin⁵⁵ in⁵⁵"
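To connect these converters with template calls found in a dump, something like the following dispatch could be used. This is only a sketch under my own assumptions: parse_template, expand, CONVERTERS and the stand-in converter are hypothetical names, the template strings are illustrative, and nested templates or wiki links inside arguments are not handled.

```python
def parse_template(call):
    """Split '{{name|arg|key=value}}' into (name, positional args, keyword args)."""
    inner = call.strip()[2:-2]  # drop the surrounding "{{" and "}}"
    name, *parts = inner.split("|")
    args, kwargs = [], {}
    for part in parts:
        if "=" in part:
            key, value = part.split("=", 1)
            kwargs[key.strip()] = value.strip()
        else:
            args.append(part.strip())
    return name.strip(), args, kwargs

def fake_fr_to_ipa(text, pos=None):
    # Hypothetical stand-in for a real converter such as fr_pron.to_IPA.
    return "<fr:{}:{}>".format(text, pos)

# Maps a template name to the function that expands it offline.
CONVERTERS = {"fr-IPA": fake_fr_to_ipa}

def expand(call):
    name, args, kwargs = parse_template(call)
    converter = CONVERTERS.get(name)
    if converter is None or not args:
        return None  # unknown template or no text to convert
    return converter(args[0], **kwargs)

print(expand("{{fr-IPA|portions|pos=v}}"))
print(expand("{{ru-IPA|привет}}"))
```

Templates without a registered converter return None here, so a caller could still fall back to the Wiktionary API for them.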
Next step:
- Continue converting the last ru-pron Lua module to Python.
- Integrate the current {{IPA}} template modules into the parser.
- Extract all IPA pronunciations for the 10 selected languages from the Wiktionary XML dump.