
Title: Sentence Segmentation based on Language Model Performance





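To examine whether the proposed method can improve language model performance, we compared a language model built with the proposed method against one built with an existing segmentation method, using per-character perplexity on a relatively small amount of transcribed lecture speech. The language model built on segmentation by the existing morphological analysis system ChaSen (茶筌) scored 26.410, and the language model built with the proposed segmentation method scored 24.032, so the proposed method performed comparably to ChaSen-based segmentation. We also repeated the experiment while varying the amount of training data and confirmed that the proposed method improves more than the existing method as the amount of data increases.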

Statistical language modeling is a fundamental technique in many applications of natural language processing, such as speech recognition and statistical machine translation.
When constructing a language model, the word is commonly used as the linguistic unit for segmenting a sentence. However, it has been reported that language models can achieve higher performance with segmentation methods adapted to the target domain or style.
The performance of a statistical language model can be measured by perplexity, the mean number of probable choices in a context as given by the model. When a language model is viewed as an information source, the logarithm of the perplexity is essentially equivalent to the entropy of the source.
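The relation between perplexity and entropy can be illustrated with a minimal sketch. It assumes that the model supplies a base-2 log probability for each unit of a segmented text and that the total is normalized by the number of characters, which is how a per-character perplexity lets models with different segmentations be compared. The function name and the toy probabilities are hypothetical and are not taken from the experiments.

```python
import math


def per_character_perplexity(unit_log2_probs, num_characters):
    """Return 2 ** (bits per character) for one segmented text.

    unit_log2_probs: log2 probability the model assigns to each unit.
    num_characters: length of the unsegmented text in characters.
    """
    total_bits = -sum(unit_log2_probs)          # code length of the text
    entropy_per_char = total_bits / num_characters
    return 2.0 ** entropy_per_char


# Toy case (assumed values): 8 units, each predicted with probability 1/4.
log2_probs = [math.log2(0.25)] * 8

# If every unit is one character, the text has 8 characters.
ppl_single_char_units = per_character_perplexity(log2_probs, num_characters=8)

# If every unit is two characters long, the same bits cover 16 characters.
ppl_two_char_units = per_character_perplexity(log2_probs, num_characters=16)

print(ppl_single_char_units)  # 4.0
print(ppl_two_char_units)     # 2.0
```

Taking the base-2 logarithm of the returned value gives back the entropy in bits per character, which is the equivalence stated above.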
In this research, we propose a segmentation method that approximately minimizes perplexity based on the Minimum Description Length principle. The method encodes the training data with a code based on a dictionary constructed during the process and on a first-order Markov approximation.
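As a rough illustration of a description-length criterion of this kind, the sketch below scores a candidate segmentation by a two-part code: the bits needed to spell out the dictionary plus the bits needed to encode the data under a first-order Markov (bigram) model estimated from that same segmentation. The dictionary coding, the maximum-likelihood bigram estimate, the boundary symbols, and the function name `description_length` are all assumptions made for this example; the actual coding scheme and search procedure of the proposed method are not specified here.

```python
import math
from collections import Counter


def description_length(segmented_sentences, alphabet_size):
    """Two-part code length in bits for one candidate segmentation."""
    # Dictionary cost (assumed coding): spell each distinct unit character
    # by character, plus one end-of-unit symbol per entry.
    vocabulary = {unit for sentence in segmented_sentences for unit in sentence}
    bits_per_symbol = math.log2(alphabet_size + 1)
    dictionary_bits = sum((len(unit) + 1) * bits_per_symbol for unit in vocabulary)

    # Data cost: negative log2 likelihood under a maximum-likelihood
    # bigram model with sentence boundary symbols.
    context_counts = Counter()
    bigram_counts = Counter()
    for sentence in segmented_sentences:
        padded = ["<s>"] + list(sentence) + ["</s>"]
        for previous, current in zip(padded, padded[1:]):
            context_counts[previous] += 1
            bigram_counts[(previous, current)] += 1
    data_bits = -sum(
        count * math.log2(count / context_counts[previous])
        for (previous, _), count in bigram_counts.items()
    )
    return dictionary_bits + data_bits


# Two segmentations of the same toy corpus "abab", "ab" (alphabet {a, b}).
character_units = [["a", "b", "a", "b"], ["a", "b"]]
merged_units = [["ab", "ab"], ["ab"]]

char_bits = description_length(character_units, alphabet_size=2)
merged_bits = description_length(merged_units, alphabet_size=2)
print(char_bits, merged_bits)
```

On this toy corpus the merged segmentation has the shorter description, because its dictionary is smaller while the data cost stays the same. A search over segmentations would prefer it under this assumed criterion.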
A language model constructed with the proposed method achieved performance comparable to that of a language model based on an existing dictionary-based segmentation method.
Future work includes reducing the computation time, evaluating performance on data of a more realistic scale, extending the model to a second-order Markov property, and widening the search range to N-best.