Cited by Lee Sonogan
Abstract by Heting Gao, Junrui Ni, Yang Zhang, Kaizhi Qian, Shiyu Chang, Mark Hasegawa-Johnson
Many existing languages are too sparsely resourced for monolingual deep learning networks to achieve high accuracy. Multilingual phonetic recognition systems mitigate data sparsity issues by training models on data from multiple languages and learning a speech-to-phone or speech-to-text model universal to all languages. However, despite their good performance on the seen training languages, multilingual systems have poor performance on unseen languages. This paper argues that in the real world, even an unseen language has metadata: linguists can tell us the language name, its language family and, usually, its phoneme inventory. Even with no transcribed speech, it is possible to train a language embedding using only data from language typologies (phylogenetic node and phoneme inventory) that reduces ASR error rates. Experiments on a 20-language corpus show that our methods achieve phonetic token error rate (PTER) reduction on all the unseen test languages. An ablation study shows that using the wrong language embedding usually harms PTER if the two languages are from different language families. However, even the wrong language embedding often improves PTER if the language embedding belongs to another member of the same language family.
Publication: University of Illinois at Urbana-Champaign and MIT-IBM Watson AI Lab (Peer-Reviewed)
Pub Date: 2021 URL: http://www.isle.illinois.edu/speech_web_lg/pubs/2021/gao2021zero.pdf
Keywords: speech recognition, phonetic recognition, external linguistic knowledge
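The abstract's core idea, building a language representation from metadata alone (phylogenetic node plus phoneme inventory), can be illustrated with a minimal sketch. This is not the authors' code; the family list, phoneme set, and function name below are hypothetical, and a real system would project this vector through a learned layer and condition a multilingual acoustic model on it.

```python
# Hypothetical sketch: encode a language purely from typological metadata
# as a one-hot phylogenetic-family code concatenated with binary
# phoneme-inventory indicators. The family list and phoneme set here are
# toy placeholders, not the paper's actual inventories.

FAMILIES = ["Indo-European", "Sino-Tibetan", "Afro-Asiatic"]
PHONEMES = ["p", "t", "k", "a", "i", "u", "s", "m"]

def typology_vector(family, inventory):
    """Concatenate a one-hot family code with binary phoneme indicators."""
    family_onehot = [1.0 if f == family else 0.0 for f in FAMILIES]
    phoneme_bits = [1.0 if p in inventory else 0.0 for p in PHONEMES]
    return family_onehot + phoneme_bits

# An unseen language needs only its metadata, no transcribed speech:
vec = typology_vector("Indo-European", {"p", "t", "a", "i", "m"})
print(vec)  # [1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0]
```

The ablation finding in the abstract is intuitive under this encoding: two languages from the same family share the one-hot family block and often much of the phoneme-indicator block, so their vectors lie close together, while cross-family vectors do not.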