Constructing a Large-Scale English-Persian Parallel Corpus

In recent years, the exploitation of large text corpora to solve various kinds of linguistic problems, including translation problems, has become commonplace. Yet a large-scale English-Persian corpus is still unavailable, because of certain difficulties and the amount of work required to overcome them. The project reported here is an attempt to build an English-Persian parallel corpus composed of digital texts and Web documents containing little or no noise. The Internet is useful because translations of existing texts are often published on the Web. The task is to find parallel pages in English and Persian, to judge their translation quality, and to download and align them. The corpus so created is open; that is, more material can be added as the need arises. One of the main activities associated with building such a corpus is developing software for parallel concordancing, in which a user can enter a search string in one language and see all the citations for that string in that language, together with the corresponding sentences in the target language. Our intention is to construct general translation memory software using the present English-Persian parallel corpus.
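
The parallel concordancing described above can be illustrated with a minimal sketch. It assumes the corpus is already sentence-aligned and held in memory as (English, Persian) pairs; the function name `parallel_concordance` and the three toy sentence pairs are hypothetical and are not taken from the project.

```python
from typing import List, Tuple

SentencePair = Tuple[str, str]  # (source sentence, aligned target sentence)


def parallel_concordance(pairs: List[SentencePair], query: str) -> List[SentencePair]:
    """Return every aligned pair whose source sentence contains the query.

    Matching is a plain case-insensitive substring test; a real concordancer
    would more likely match on tokens or lemmas.
    """
    needle = query.casefold()
    return [(src, tgt) for src, tgt in pairs if needle in src.casefold()]


# Toy aligned corpus (illustrative data only, not from the project).
corpus = [
    ("The book is on the table.", "کتاب روی میز است."),
    ("I read a book.", "من یک کتاب خواندم."),
    ("She is a teacher.", "او معلم است."),
]

hits = parallel_concordance(corpus, "book")
for english, persian in hits:
    print(english, "|", persian)
```

Each hit shows the citation in the search language next to its aligned sentence in the other language, which is the behaviour the abstract attributes to the concordancer.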

Research on the Creation of Small-Scale English-Chinese Parallel Corpus for Manufacturing Systems

Applied Mechanics and Materials ◽

10.4028/www.scientific.net/amm.421.725 ◽

2013 ◽

Vol 421 ◽

pp. 725-730

Author(s):

Song Bin Bao

Keyword(s):

Teaching And Learning ◽

Manufacturing Systems ◽

Large Scale ◽

Small Scale ◽

Chinese Word Segmentation ◽

Translation Memory ◽

Parallel Corpus ◽

Novel Method ◽

Chinese Texts ◽

Education In China

The English used in the field of manufacturing systems belongs to ESP (English for Specific Purposes). To improve the effectiveness of ESP education in China, it is necessary to create an English-Chinese parallel corpus to aid ESP teaching and learning. In this paper, a novel method is presented for creating a small-scale English-Chinese parallel corpus by means of a TMS (translation memory system). First, suitable English and Chinese texts are collected from the Internet, publications, and human translation; second, the English and Chinese texts are aligned and formatted using the relevant TMS functions; then the Chinese texts are split into words using ICWSS (Intelligent Chinese Word Segmentation System); finally, the English-Chinese corpus is stored in a cloud database. This small-scale English-Chinese parallel corpus can be searched through ParaConc and meets the basic needs of ESP teaching and learning. Since the method requires neither designing a new algorithm nor developing a new software system, constructing the corpus is much easier and more flexible than constructing a general large-scale corpus.

The Web as a Parallel Corpus

Computational Linguistics ◽

10.1162/089120103322711578 ◽

2003 ◽

Vol 29 (3) ◽

pp. 349-380 ◽

Cited By ~ 178

Author(s):

Philip Resnik ◽

Noah A. Smith

Keyword(s):

Language Processing ◽

Large Scale ◽

Structural Features ◽

Classification Performance ◽

Internet Archive ◽

Parallel Corpora ◽

Parallel Corpus ◽

Original Algorithm ◽

Parallel Text ◽

The Web

Parallel corpora have become an essential resource for work in multilingual natural language processing. In this article, we report on our work using the STRAND system for mining parallel text on the World Wide Web, first reviewing the original algorithm and results and then presenting a set of significant enhancements. These enhancements include the use of supervised learning based on structural features of documents to improve classification performance, a new content-based measure of translational equivalence, and adaptation of the system to take advantage of the Internet Archive for mining parallel text from the Web on a large scale. Finally, the value of these techniques is demonstrated in the construction of a significant parallel corpus for a low-density language pair.
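
As a rough illustration of what a structural comparison of candidate page pairs might look like, the sketch below reduces each HTML page to a sequence of tag tokens and measures how similar the two sequences are. This is an assumption-laden sketch, not the STRAND implementation: the token scheme, the use of Python's `difflib`, and the names `TagLinearizer`, `linearize`, and `structural_similarity` are all choices made here for illustration.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser
from typing import List


class TagLinearizer(HTMLParser):
    """Flatten an HTML page into a sequence of structural tokens."""

    def __init__(self) -> None:
        super().__init__()
        self.tokens: List[str] = []

    def handle_starttag(self, tag, attrs):
        self.tokens.append(f"<{tag}>")

    def handle_endtag(self, tag):
        self.tokens.append(f"</{tag}>")

    def handle_data(self, data):
        # The text itself is language-dependent, so only its presence is kept.
        if data.strip():
            self.tokens.append("TEXT")


def linearize(html: str) -> List[str]:
    parser = TagLinearizer()
    parser.feed(html)
    parser.close()
    return parser.tokens


def structural_similarity(html_a: str, html_b: str) -> float:
    """Similarity in [0, 1] between the tag sequences of two pages."""
    matcher = SequenceMatcher(None, linearize(html_a), linearize(html_b), autojunk=False)
    return matcher.ratio()


page_en = "<html><body><h1>Hello</h1><p>Welcome.</p></body></html>"
page_fr = "<html><body><h1>Bonjour</h1><p>Bienvenue.</p></body></html>"
page_other = "<html><body><ul><li>a</li><li>b</li></ul></body></html>"

print(structural_similarity(page_en, page_fr))
print(structural_similarity(page_en, page_other))
```

A score like this could serve as one input feature to a supervised classifier of candidate pairs; which features and which learner the authors actually use is not stated in the abstract.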

Automatic Acquisition of Large-Scale Academic Bilingual Parallel Corpus from the Web

2009 International Conference on Asian Language Processing ◽

10.1109/ialp.2009.75 ◽

2009 ◽

Author(s):

Han Yong ◽

Li Yu ◽

He Xiaoning ◽

Yang Muyun ◽

Lei Guohua

Keyword(s):

Large Scale ◽

Parallel Corpus ◽

Automatic Acquisition ◽

The Web

Large-scale Word Alignment Using Soft Dependency Cohesion Constraints

Transactions of the Association for Computational Linguistics ◽

10.1162/tacl_a_00228 ◽

2013 ◽

Vol 1 ◽

pp. 291-300 ◽

Cited By ~ 1

Author(s):

Zhiguo Wang ◽

Chengqing Zong

Keyword(s):

Large Scale ◽

Target Language ◽

Model Parameters ◽

Word Alignment ◽

Soft Constraint ◽

Alignment Quality ◽

Source Language ◽

Discriminative Models ◽

Translation Quality ◽

Gibbs Sampling Algorithm

Dependency cohesion refers to the observation that phrases dominated by disjoint dependency subtrees in the source language generally do not overlap in the target language. It has been verified to be a useful constraint for word alignment. However, previous work either treats this as a hard constraint or uses it as a feature in discriminative models, which is ineffective for large-scale tasks. In this paper, we take dependency cohesion as a soft constraint, and integrate it into a generative model for large-scale word alignment experiments. We also propose an approximate EM algorithm and a Gibbs sampling algorithm to estimate model parameters in an unsupervised manner. Experiments on large-scale Chinese-English translation tasks demonstrate that our model achieves improvements in both alignment quality and translation quality.
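
The cohesion observation can be made concrete with a small check that counts how often it is violated for a given alignment. The sketch below is only an illustration of the definition, under several assumptions: the source parse is given as a list of head indices (with -1 for the root), the alignment maps each source index to a list of target indices, and only sibling subtrees are compared. It counts hard violations and does not reproduce the paper's soft constraint, generative model, or estimation algorithms; all function names are hypothetical.

```python
from itertools import combinations
from typing import Dict, List, Optional, Set, Tuple


def subtree(heads: List[int], root: int) -> Set[int]:
    """Indices of all tokens in the dependency subtree rooted at `root`."""
    nodes = {root}
    changed = True
    while changed:
        changed = False
        for index, head in enumerate(heads):
            if head in nodes and index not in nodes:
                nodes.add(index)
                changed = True
    return nodes


def target_span(nodes: Set[int], alignment: Dict[int, List[int]]) -> Optional[Tuple[int, int]]:
    """Smallest target interval covering everything the nodes align to."""
    targets = [t for s in nodes for t in alignment.get(s, ())]
    return (min(targets), max(targets)) if targets else None


def cohesion_violations(heads: List[int], alignment: Dict[int, List[int]]) -> int:
    """Count pairs of sibling subtrees whose target spans overlap."""
    count = 0
    for head in range(len(heads)):
        children = [i for i, h in enumerate(heads) if h == head]
        spans = [target_span(subtree(heads, c), alignment) for c in children]
        spans = [s for s in spans if s is not None]
        for (a_lo, a_hi), (b_lo, b_hi) in combinations(spans, 2):
            if a_lo <= b_hi and b_lo <= a_hi:
                count += 1
    return count


# Toy source sentence: "small dogs chase cats" with "chase" (index 2) as root.
heads = [1, 2, -1, 2]
monotone = {0: [0], 1: [1], 2: [2], 3: [3]}
crossing = {0: [0], 1: [3], 2: [2], 3: [1]}

print(cohesion_violations(heads, monotone))
print(cohesion_violations(heads, crossing))
```

In the second alignment the span of the "small dogs" subtree swallows the target position of "cats", so the two disjoint subtrees overlap on the target side. A soft constraint would penalize such an alignment rather than forbid it.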

A FRAME WORK FOR WEB INFORMATION EXTRACTION AND ANALYSIS

INTERNATIONAL JOURNAL OF COMPUTERS & TECHNOLOGY ◽

10.24297/ijct.v7i2.3459 ◽

2013 ◽

Vol 7 (2) ◽

pp. 574-579 ◽

Cited By ~ 3

Author(s):

Dr Sunitha Abburu ◽

G. Suresh Babu

Keyword(s):

Information Extraction ◽

Data Extraction ◽

Research Work ◽

Web Pages ◽

Web Documents ◽

E Learning ◽

Structured Information ◽

Frame Work ◽

Effective Decision ◽

The Web

The volume of information available on the web is growing significantly day by day. Information on the web comes in several forms: structured, semi-structured, and unstructured. The majority of it is presented in web pages, and the information presented in web pages is semi-structured. However, the information required for a given context is scattered across different web documents. It is difficult to analyze the large volumes of semi-structured information presented in web pages and to make decisions based on that analysis. The current research work proposes a framework for a system that extracts information from various sources and prepares reports based on the knowledge built from the analysis. This simplifies data extraction, data consolidation, data analysis, and decision making based on the information presented in web pages. The proposed framework integrates web crawling, information extraction, and data mining technologies for better information analysis that supports effective decision making. It enables people and organizations to extract information from various web sources and to analyze the extracted data effectively. The proposed framework is applicable to any application domain; manufacturing, sales, tourism, and e-learning are a few examples. The framework was implemented and tested for effectiveness, and the results are promising.

Mapping global water projects: improving access to donor investment information on the web

Water Policy ◽

10.2166/wp.2003.0012 ◽

2003 ◽

Vol 5 (3) ◽

pp. 203-212

Author(s):

J. Lisa Jorgenson

Keyword(s):

Water Resources ◽

Web Sites ◽

Large Scale ◽

Technical Assistance ◽

Water Withdrawal ◽

Small Scale ◽

Water Projects ◽

Essential Information ◽

Water Project ◽

The Web

This paper discusses how web sites now report international water project information, and maps the combined donor investment in more than 6000 water projects active since 1995. The maps show that donor investment:
• has addressed water scarcity,
• has improved access to improvised water resources,
• correlates with growth in GDP,
• appears to show a correlation with growth in net private capital flow,
• does NOT appear to correlate with growth in GNI.
Evaluation indicates problems in the combined water project portfolios of major donor organizations:
• difficulties in grouping projects across differing sector classifications; food security and agriculture/irrigation are the most difficult.
• inability to map donor projects at the country or river basin level, because 60% of the donor projects include no location data (town, province, watershed) in the titles or abstracts available on the web sites.
• no means to distinguish donor projects involving utilization of water resources from those providing training or technical assistance.
• no information on the source of water (river, aquifer, rainwater catchment).
• no identifiable quantity of water (withdrawal amounts, or increased water efficiency) is provided.
• no differentiation between large-scale and small-scale projects.
Recommendation: Major donors need to look at how the web harvests and combines their information, and look at ways to agree on a standard template for project titles that includes more essential information. The Japanese (JICA) and the Asian Development Bank provide good models.

Building an Italian-Chinese Parallel Corpus for Machine Translation from the Web

Proceedings of the 6th EAI International Conference on Smart Objects and Technologies for Social Good ◽

10.1145/3411170.3411258 ◽

2020 ◽

Author(s):

Rita Tse ◽

Silvia Mirri ◽

Su-Kit Tang ◽

Giovanni Pau ◽

Paola Salomoni

Keyword(s):

Machine Translation ◽

Parallel Corpus ◽

The Web

Stochastic Process for Analyzing Speech on the Web with Consideration of Media Mediation in Large-scale Broadcast Events in Japan

2020 IEEE International Conference on Big Data (Big Data) ◽

10.1109/bigdata50022.2020.9378344 ◽

2020 ◽

Author(s):

Yasuko Kawahata

Keyword(s):

Stochastic Process ◽

Large Scale ◽

The Web

Evaluation of dietary intake assessed by the Dutch self-administered web-based dietary 24-h recall tool (Compl-eat™) against interviewer-administered telephone-based 24-h recalls

Journal of Nutritional Science ◽

10.1017/jns.2017.45 ◽

2017 ◽

Vol 6 ◽

Cited By ~ 13

Author(s):

Saskia Meijboom ◽

Martinette T. van Houts-Streppel ◽

Corine Perenboom ◽

Els Siebelink ◽

Anne M. van de Wiel ◽

...

Keyword(s):

Dietary Intake ◽

Large Scale ◽

Alcohol Intake ◽

Lower Number ◽

Vitamin B ◽

Food Groups ◽

Web Based ◽

The Web

Self-administered web-based 24-h dietary recalls (24 hR) may save considerable time and money compared with interviewer-administered telephone-based 24 hR interviews and may therefore be useful in large-scale studies. Within the Nutrition Questionnaires plus (NQplus) study, the web-based 24 hR tool Compl-eat™ was developed to assess Dutch participants’ dietary intake. The aim of the present study was to evaluate the performance of this tool against the interviewer-administered telephone-based 24 hR method. A subgroup of participants of the NQplus study (20–70 years, n 514) completed three self-administered web-based 24 hR and three telephone 24 hR interviews administered by a dietitian over a 1-year period. Both Compl-eat™ and the dietitians guided the participants to report all foods consumed the previous day. Compared with telephone recalls, Compl-eat™ on average underestimated the intake of energy by 8 %, of macronutrients by 10 % and of micronutrients by 13 %. The agreement between the two methods, estimated using Lin's concordance coefficients (LCC), ranged from 0·15 for vitamin B1 to 0·70 for alcohol intake (mean LCC 0·38). The lower estimates from Compl-eat™ can be explained by a lower total number of reported foods and by lower estimated intakes of the food groups ‘fats, oils and savoury sauces’, ‘sugar and confectionery’ and ‘dairy and cheese’. The performance of the tool may be improved by, for example, adding an option to automatically select frequently used foods and including more recall cues. We conclude that Compl-eat™ may be a useful tool in large-scale Dutch studies after the suggested improvements have been implemented and evaluated.
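
For readers unfamiliar with Lin's concordance coefficient, the sketch below computes it for two lists of paired measurements using the standard definition 2·s_xy / (s_x² + s_y² + (mean_x − mean_y)²). It is a minimal illustration, assuming population (divide-by-n) variances; the abstract does not state which estimator or software the study used, and the function name `lin_concordance` and the toy numbers are hypothetical.

```python
from typing import Sequence


def lin_concordance(x: Sequence[float], y: Sequence[float]) -> float:
    """Lin's concordance correlation coefficient for paired measurements.

    Uses population (divide-by-n) variances and covariance. Unlike Pearson's
    correlation, the result drops when one method is systematically shifted.
    """
    n = len(x)
    if n == 0 or n != len(y):
        raise ValueError("x and y must be non-empty and of equal length")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    return 2 * cov_xy / (var_x + var_y + (mean_x - mean_y) ** 2)


# Toy values only: identical readings versus a constant offset of one unit.
reference = [1.0, 2.0, 3.0]
print(lin_concordance(reference, [1.0, 2.0, 3.0]))
print(lin_concordance(reference, [2.0, 3.0, 4.0]))
```

The second call involves perfectly correlated series, yet the coefficient falls to 4/7 because of the constant offset. This is why the measure suits a comparison in which one tool systematically underestimates intake relative to the other.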

CREATION OF THE WEB-MAP OF THE BOGUSLAVSKY DISTRICT

GEOGRAPHY AND TOURISM ◽

10.17721/2308-135x.2019.47.129-139 ◽

2019 ◽

pp. 129-139

Author(s):

Tamara Mykolayivna Kurach ◽

Iryna Aleksandrovna Pidlisetskaya

Keyword(s):

Cultural Heritage ◽

Theoretical Basis ◽

Large Scale ◽

Web Mapping ◽

Heritage Sites ◽

Web Map ◽

Thematic Mapping ◽

Interactive Map ◽

The Web

Goal. To develop an interactive tourist map, "Landmarks of Bohuslav". Methodology. The methodological and theoretical basis of the study is modern geographical and cartographic science in the field of thematic mapping, with the involvement of web-mapping technologies. Results. A large-scale tourist web map of the cultural heritage of the Boguslavsky region, “Sights of Boguslavshchina”, was created. Scientific novelty. Validation of the methodology and technology for developing interactive large-scale tourist web maps using the Leaflet JavaScript library. Practical value. An interactive tourist web map of historical and cultural heritage sites, “Sights of Bohuslavshchina”, will be published on the website of the sanatorium-type health institution “Chaika”. Convenient use, visualization, and prompt access to information will help increase the attractiveness of tourist routes in Boguslavshchina.
