[Academic Talk Announcement] San Duanmu: Collection, Processing, and Research Uses of Bilingual Parallel Corpora

Author: Research Office; Source: 今日语言学; Date: 2017-12-06

　　Title: Collection, Processing, and Research Uses of Bilingual Parallel Corpora

　　Speaker: Professor San Duanmu (Department of Linguistics, University of Michigan, USA)

　　Time: Thursday, December 7, 14:00

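　　Venue: Main Conference Room, Institute of Linguistics, 6th floor, Research Building, Chinese Academy of Social Sciences, 5 Jianguomennei Dajie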

　　Abstract:

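　　Documenting and preserving endangered languages requires collecting large amounts of speech data and finding new methods that can process these data quickly. Abney & Bird (2010; 2011) and Bird (2010) proposed a conceptual framework for this. Its core is to collect bilingual parallel speech data, convert the data into bilingual parallel text, and then apply the Rosetta Stone method, using the text in the known language to decipher the text in the unknown language.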

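　　This proposal has not yet been tested in practice. For example, the Rosetta Stone method relies on parallel text of the "orthography + orthography" type, whereas the text obtained from recordings of the unknown language is only a transcription of the actual pronunciation (a narrow phonetic transcription). Processing "orthography + narrow transcription" parallel text is therefore clearly much harder than processing "orthography + orthography" parallel text.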
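　　The difficulty described above can be illustrated with a toy sketch. It assumes a simple co-occurrence measure (the Dice coefficient) as a stand-in for one step of Rosetta-Stone-style decipherment; this is an assumption made for illustration, not the method used in the talk or in the cited papers. The sentence pairs and transcriptions are invented and are not drawn from any corpus, and the names `dice_scores` and `best_match` are hypothetical.

```python
from collections import Counter
from itertools import product


def dice_scores(pairs):
    """Dice association between known-language words and unknown-language tokens.

    pairs: list of (known_tokens, unknown_tokens), one tuple per aligned utterance.
    """
    known_count, unknown_count, joint_count = Counter(), Counter(), Counter()
    for known, unknown in pairs:
        known_set, unknown_set = set(known), set(unknown)
        known_count.update(known_set)
        unknown_count.update(unknown_set)
        # Every known word co-occurs with every unknown token in the same utterance.
        joint_count.update(product(known_set, unknown_set))
    return {
        (k, u): 2 * c / (known_count[k] + unknown_count[u])
        for (k, u), c in joint_count.items()
    }


def best_match(scores, known_word):
    """Return (token, score) for the unknown token most associated with known_word."""
    candidates = {u: s for (k, u), s in scores.items() if k == known_word}
    return max(candidates.items(), key=lambda item: item[1])


# Invented toy data: "orthography + orthography" parallel text.
text_pairs = [
    (["我", "和", "你"], ["me", "and", "you"]),
    (["猫", "和", "狗"], ["cat", "and", "dog"]),
    (["茶", "和", "水"], ["tea", "and", "water"]),
]

# Same utterances as "orthography + narrow transcription": the word "and"
# surfaces as three different (illustrative) reduced forms.
phone_pairs = [
    (["我", "和", "你"], ["mi", "ænd", "ju"]),
    (["猫", "和", "狗"], ["kæt", "ən", "dɔg"]),
    (["茶", "和", "水"], ["ti", "n̩", "wɔtɚ"]),
]

text_scores = dice_scores(text_pairs)
phone_scores = dice_scores(phone_pairs)

print(best_match(text_scores, "和"))      # ('and', 1.0)
print(best_match(phone_scores, "和")[1])  # 0.5
```

　　With orthographic tokens, 和 is linked to a single token with a perfect score. With narrow transcription, the evidence is split across pronunciation variants, so no candidate stands out; some step that groups variants of the same word would be needed before co-occurrence counts become informative.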

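　　We use The Buckeye Corpus (American English; 40 hours of spontaneous speech from 40 speakers, with narrow phonetic transcription), treating Chinese as the hypothetical known language and English as the hypothetical unknown language. This yields parallel text of the "Chinese orthography + English narrow transcription" type, which we use as a test case to explore the technical difficulties and possible solutions. We also discuss the research value of such data, for example in foreign language teaching and in the study of phonological rules.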

　　References:

　　Abney, Steven, and Steven Bird. 2010. The Human Language Project: Building a universal corpus of the world's languages. Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, 88-97, Uppsala, Sweden.

　　Abney, Steven, and Steven Bird. 2011. Towards a data model for the Universal Corpus. Proceedings of the 4th Workshop on Building and Using Comparable Corpora, 120-127, Portland, USA.

　　Bird, Steven. 2010. A scalable method for preserving oral literature from small languages. Proceedings of the 12th International Conference on Asia-Pacific Digital Libraries, 5-14.
