Improving the Use of Topic Models in Speech Recognition (主題模型於語音辨識使用之改進)
Date
2010
Abstract
This thesis investigates the co-occurrence relationships between words in natural language under a variety of conditions, derives several language models that describe those relationships, and applies them to Mandarin large vocabulary continuous speech recognition (LVCSR). The conventional way to measure the co-occurrence relationship between two words is to count how often they appear together within a fixed-size moving window slid over the entire training corpus, and to estimate their joint probability distribution from those frequencies. Rather than relying only on corpus-wide co-occurrence counts, this study analyzes how a pair of words co-occurs under different conditions, for example whether the two words frequently co-occur within particular topics, documents, or document clusters, and derives several language models, together with their estimation methods, to characterize such relationships. All experiments are conducted on a Mandarin broadcast news corpus compiled in Taiwan. A series of LVCSR experiments shows that each of the proposed language models clearly improves the performance of the baseline speech recognition system.
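The conventional fixed-size moving window estimate mentioned in the abstract can be illustrated with a minimal sketch. This is not the thesis's implementation: the function name `cooccurrence_joint_prob`, the `window_size` parameter, the treatment of word pairs as unordered, and the plain relative-frequency estimate without smoothing are all assumptions made for illustration.

```python
from collections import Counter


def cooccurrence_joint_prob(tokens, window_size=3):
    """Estimate joint probabilities of unordered word pairs.

    Hypothetical helper: two words co-occur when their positions are fewer
    than `window_size` apart. Each pair of positions is counted once, so
    overlapping windows do not inflate the counts.
    """
    pair_counts = Counter()
    for start, head in enumerate(tokens):
        # Pair the head word with every later word that shares its window.
        for other in tokens[start + 1 : start + window_size]:
            if other != head:
                pair_counts[tuple(sorted((head, other)))] += 1
    total = sum(pair_counts.values())
    if total == 0:
        return {}
    # Relative frequency as an unsmoothed estimate of the joint probability.
    return {pair: count / total for pair, count in pair_counts.items()}


if __name__ == "__main__":
    corpus = ["speech", "recognition", "uses", "language", "model"]
    for pair, prob in sorted(cooccurrence_joint_prob(corpus).items()):
        print(pair, round(prob, 3))
```

Conditioning the same counts on a topic, document, or document cluster, as the thesis proposes, would amount to keeping a separate counter per condition; the exact estimators are not given in this record and are not reproduced here.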
Keywords
Mandarin large vocabulary continuous speech recognition, co-occurrence relationships, language model