Research on Bilingual Corpus Based Machine Translation

2014, Vol. 687-691, pp. 1683-1686
Author(s):  
Shuang Wang

This thesis proposes several methods for automatically acquiring bilingual corpora from different websites: the "iciba" site, CNKI, and a patent network. It describes the methods and procedures used to acquire each kind of corpus. Because the sites have different characteristics, a different acquisition method is used for each, which enables fast and accurate automatic collection of a large-scale bilingual corpus. For the "iciba" site, the main tool is the Nutch crawler, which performs relatively well, retrieving accurately and returning highly relevant results. In addition, rather than trying to harvest bilingual text from the entire Internet, we take a new approach: we use the basic information of scholarly theses in CNKI to obtain a large-scale, high-quality English-Chinese bilingual corpus. The result is an aligned bilingual corpus on the order of gigabytes, which manual evaluation found to be very accurate. The corpus lays the groundwork for further research on cross-language information retrieval.
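As a rough illustration of the CNKI idea, the sketch below pairs the Chinese and English fields of paper metadata records into aligned text pairs. It is not the author's implementation: the record schema, the field names (`title_zh`, `title_en`, `abstract_zh`, `abstract_en`), and the function `build_aligned_pairs` are all assumptions made for the example, and the crawling step itself is omitted.

```python
from typing import Iterable

# Hypothetical field names; the abstract does not specify a record schema.
FIELD_PAIRS = [("title_zh", "title_en"), ("abstract_zh", "abstract_en")]


def build_aligned_pairs(records: Iterable[dict]) -> list[tuple[str, str]]:
    """Collect (Chinese, English) text pairs from paper metadata records."""
    pairs = []
    seen = set()
    for record in records:
        for zh_key, en_key in FIELD_PAIRS:
            zh = (record.get(zh_key) or "").strip()
            en = (record.get(en_key) or "").strip()
            # Keep a pair only when both language sides are present.
            if not zh or not en:
                continue
            # Skip exact duplicates, e.g. the same paper fetched twice.
            if (zh, en) in seen:
                continue
            seen.add((zh, en))
            pairs.append((zh, en))
    return pairs


# Made-up sample records, for illustration only.
records = [
    {
        "title_zh": "机器翻译研究",
        "title_en": "Research on Machine Translation",
        "abstract_zh": "本文提出若干方法。",
        "abstract_en": "This paper proposes several methods.",
    },
    {"title_zh": "只有中文标题", "title_en": None},
]
pairs = build_aligned_pairs(records)
```

Because each record already carries both language versions of the same title or abstract, alignment at the field level comes for free; a real pipeline would likely still need sentence-level alignment and quality filtering on top of this.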

Author(s):  
Diana Irina Tanase ◽  
Epaminondas Kapetanios

Combining existing advancements in cross-language information retrieval (CLIR) with the new user-centered Web paradigm could allow tapping into Web-based multilingual clusters of language information that are rich, up-to-date in terms of language usage, growing in size, and potentially able to cater for all languages. In this chapter, we set out to explore existing CLIR systems and their limitations, and we argue that in the current context of a widely adopted social Web, the future of large-scale CLIR and iCLIR systems is linked to the use of the Web as a lexical resource, as a distribution infrastructure, and as a channel of communication between users. Such a synergy will lead to systems that grow organically as more users with different linguistic skills join the network, and that improve in terms of translation disambiguation and coverage.

