Constructing a Large-Scale English-Persian Parallel Corpus

2009 ◽  
Vol 54 (1) ◽  
pp. 181-188 ◽  
Author(s):  
Tayebeh Mosavi Miangah

In recent years the exploitation of large text corpora in solving various kinds of linguistic problems, including those of translation, has become commonplace. Yet a large-scale English-Persian corpus is still unavailable, because of certain difficulties and the amount of work required to overcome them. The project reported here is an attempt to build an English-Persian parallel corpus composed of digital texts and Web documents containing little or no noise. The Internet is useful because translations of existing texts are often published on the Web. The task is to find parallel pages in English and Persian, to judge their translation quality, and to download and align them. The corpus so created is open; that is, more material can be added as the need arises. One of the main activities associated with building such a corpus is developing software for parallel concordancing, in which a user can enter a search string in one language and see all citations of that string in that language together with the corresponding sentences in the target language. Our intention is to construct general translation memory software using the present English-Persian parallel corpus.
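The parallel concordancing described in this abstract can be illustrated with a minimal sketch. It assumes the corpus is already sentence-aligned and held in memory as (source, target) pairs; the function name `concordance` and the toy data are hypothetical, and the target strings are placeholders rather than real Persian translations. It is not the software the author describes.

```python
from typing import List, Tuple

AlignedPair = Tuple[str, str]  # (source sentence, aligned target sentence)


def concordance(corpus: List[AlignedPair], query: str) -> List[AlignedPair]:
    """Return every aligned pair whose source sentence contains the query.

    Matching is a plain case-insensitive substring test; a real concordancer
    would more likely tokenize and index the text.
    """
    needle = query.lower()
    return [(src, tgt) for src, tgt in corpus if needle in src.lower()]


# Hypothetical toy corpus; target sides are placeholders, not translations.
corpus = [
    ("The book is on the table.", "<Persian sentence 1>"),
    ("She reads a book every week.", "<Persian sentence 2>"),
    ("The weather is cold today.", "<Persian sentence 3>"),
]

hits = concordance(corpus, "book")
for src, tgt in hits:
    print(src, "|", tgt)
```

A linear scan like this is only practical for small corpora; for a large-scale corpus an inverted index over the source side would be the usual design choice.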

2013 ◽  
Vol 421 ◽  
pp. 725-730
Author(s):  
Song Bin Bao

The English used in the field of manufacturing systems belongs to ESP (English for Specific Purposes). To improve the effectiveness of ESP education in China, it is necessary to create an English-Chinese parallel corpus to aid ESP teaching and learning. In this paper, a novel method is presented for creating a small-scale English-Chinese parallel corpus by means of a TMS (translation memory system). First, suitable English and Chinese texts are collected from the Internet, publications, and human translation; second, the English and Chinese texts are aligned and formatted using the relevant TMS functions; then the Chinese texts are split into words using ICWSS (Intelligent Chinese Word Segmentation System); finally, the English-Chinese corpus is stored in a cloud database. This small-scale English-Chinese parallel corpus can be searched through ParaConc and meets the basic needs of ESP teaching and learning. Since the method requires neither designing a new algorithm nor developing a new software system, constructing the corpus is much easier and more flexible than building a general large-scale corpus.


2003 ◽  
Vol 29 (3) ◽  
pp. 349-380 ◽  
Author(s):  
Philip Resnik ◽  
Noah A. Smith

Parallel corpora have become an essential resource for work in multilingual natural language processing. In this article, we report on our work using the STRAND system for mining parallel text on the World Wide Web, first reviewing the original algorithm and results and then presenting a set of significant enhancements. These enhancements include the use of supervised learning based on structural features of documents to improve classification performance, a new content-based measure of translational equivalence, and adaptation of the system to take advantage of the Internet Archive for mining parallel text from the Web on a large scale. Finally, the value of these techniques is demonstrated in the construction of a significant parallel corpus for a low-density language pair.
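The idea of comparing documents by structural features can be sketched as follows. This is a rough illustration and not the STRAND implementation: it reduces each HTML page to a sequence of tag tokens plus a generic `TEXT` marker, then scores two pages with Python's `difflib.SequenceMatcher`. The names `TagLinearizer`, `linearize`, and `structural_similarity` are hypothetical, and the similarity ratio stands in for whatever structural measures the system actually uses.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser
from typing import List


class TagLinearizer(HTMLParser):
    """Flatten an HTML page into a sequence of structural tokens."""

    def __init__(self) -> None:
        super().__init__()
        self.tokens: List[str] = []

    def handle_starttag(self, tag, attrs):
        self.tokens.append(f"<{tag}>")

    def handle_endtag(self, tag):
        self.tokens.append(f"</{tag}>")

    def handle_data(self, data):
        # The text itself is language-specific, so keep only a marker.
        if data.strip():
            self.tokens.append("TEXT")


def linearize(html: str) -> List[str]:
    parser = TagLinearizer()
    parser.feed(html)
    parser.close()
    return parser.tokens


def structural_similarity(html_a: str, html_b: str) -> float:
    """Similarity in [0, 1] between the tag sequences of two pages."""
    matcher = SequenceMatcher(None, linearize(html_a), linearize(html_b), autojunk=False)
    return matcher.ratio()


# Hypothetical pages: the first two share markup but differ in text.
page_en = "<html><body><h1>Title</h1><p>Hello</p></body></html>"
page_xx = "<html><body><h1>X</h1><p>Y</p></body></html>"
page_other = "<html><body><ul><li>a</li><li>b</li></ul></body></html>"

print(structural_similarity(page_en, page_xx))
print(structural_similarity(page_en, page_other))
```

Pages that are translations of each other often share markup even when their text differs, which is why a purely structural score can serve as a first filter before any content-based check.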


2013 ◽  
Vol 1 ◽  
pp. 291-300 ◽  
Author(s):  
Zhiguo Wang ◽  
Chengqing Zong

Dependency cohesion refers to the observation that phrases dominated by disjoint dependency subtrees in the source language generally do not overlap in the target language. It has been verified to be a useful constraint for word alignment. However, previous work either treats this as a hard constraint or uses it as a feature in discriminative models, which is ineffective for large-scale tasks. In this paper, we take dependency cohesion as a soft constraint, and integrate it into a generative model for large-scale word alignment experiments. We also propose an approximate EM algorithm and a Gibbs sampling algorithm to estimate model parameters in an unsupervised manner. Experiments on large-scale Chinese-English translation tasks demonstrate that our model achieves improvements in both alignment quality and translation quality.
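The cohesion property defined at the start of this abstract can be checked on a single sentence pair with a small sketch. It is a simplified illustration under stated assumptions, not the authors' generative model: the source parse is given as a head array (`heads[i]` is the parent of word `i`, or `-1` for the root), the word alignment maps each source index to a list of target indices, only sibling subtrees are compared, and two subtrees count as overlapping whenever their target spans intersect. All function names are hypothetical.

```python
from typing import Dict, List, Optional, Set, Tuple


def subtree(heads: List[int], root: int) -> Set[int]:
    """Indices of `root` and all of its descendants in the dependency tree."""
    nodes = {root}
    changed = True
    while changed:
        changed = False
        for child, head in enumerate(heads):
            if head in nodes and child not in nodes:
                nodes.add(child)
                changed = True
    return nodes


def target_span(nodes: Set[int], alignment: Dict[int, List[int]]) -> Optional[Tuple[int, int]]:
    """Smallest target interval covering every word aligned to `nodes`."""
    positions = [t for s in nodes for t in alignment.get(s, [])]
    return (min(positions), max(positions)) if positions else None


def cohesion_violations(heads: List[int], alignment: Dict[int, List[int]]) -> List[Tuple[int, int]]:
    """Pairs of sibling subtree roots whose target spans overlap."""
    violations = []
    for a in range(len(heads)):
        for b in range(a + 1, len(heads)):
            if heads[a] != heads[b]:
                continue  # only siblings are compared; their subtrees are disjoint
            span_a = target_span(subtree(heads, a), alignment)
            span_b = target_span(subtree(heads, b), alignment)
            if span_a and span_b and span_a[0] <= span_b[1] and span_b[0] <= span_a[1]:
                violations.append((a, b))
    return violations


# Hypothetical 4-word source sentence: word 1 is the root, words 0 and 3
# depend on it, and word 2 depends on word 3.
heads = [1, -1, 3, 1]
cohesive = {0: [0], 1: [1], 2: [2], 3: [3]}
crossing = {0: [1], 1: [3], 2: [0], 3: [2]}

print(cohesion_violations(heads, cohesive))
print(cohesion_violations(heads, crossing))
```

Treating cohesion as a soft constraint, as the abstract proposes, would mean penalizing alignments in proportion to such violations instead of rejecting them outright.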


2013 ◽  
Vol 7 (2) ◽  
pp. 574-579 ◽  
Author(s):  
Dr Sunitha Abburu ◽  
G. Suresh Babu

The volume of information available on the web grows significantly day by day. This information comes in several forms: structured, semi-structured, and unstructured. Most of it is presented in web pages, which are semi-structured, and the information required for a given context is scattered across different web documents. It is therefore difficult to analyze the large volumes of semi-structured information in web pages and to make decisions based on that analysis. This work proposes a framework for a system that extracts information from various sources and prepares reports based on the knowledge built from the analysis. This simplifies data extraction, data consolidation, data analysis, and decision making based on the information presented in web pages. The proposed framework integrates web crawling, information extraction, and data mining technologies for better information analysis that supports effective decision making. It enables people and organizations to extract information from various web sources and to analyze the extracted data effectively. The proposed framework is applicable to any application domain; manufacturing, sales, tourism, and e-learning are a few examples. The framework was implemented and tested for effectiveness, and the results are promising.


Water Policy ◽  
2003 ◽  
Vol 5 (3) ◽  
pp. 203-212
Author(s):  
J. Lisa Jorgenson

This paper discusses how web sites now report international water project information, and maps the combined donor investment in more than 6000 water projects active since 1995. The maps show that donor investment:
• has addressed water scarcity,
• has improved access to improved water resources,
• correlates with growth in GDP,
• appears to correlate with growth in net private capital flow,
• does NOT appear to correlate with growth in GNI.
Evaluation indicates problems in the combined water project portfolios for major donor organizations:
• difficulty grouping projects across differing sector classifications, of which food security and agriculture/irrigation are the most difficult,
• inability to map donor projects at the country or river basin level, because 60% of the donor projects include no location data (town, province, watershed) in the titles or abstracts available on the web sites,
• no means of distinguishing donor projects that utilize water resources from those that provide training or technical assistance,
• no information on the source of water (river, aquifer, rainwater catchment),
• no identifiable quantity of water (withdrawal amounts, or increased water efficiency),
• no differentiation between large-scale and small-scale projects.
Recommendation: Major donors need to look at how the web harvests and combines their information, and at ways to agree on a standard template for project titles that includes more essential information. The Japanese (JICA) and the Asian Development Bank provide good models.


2017 ◽  
Vol 6 ◽  
Author(s):  
Saskia Meijboom ◽  
Martinette T. van Houts-Streppel ◽  
Corine Perenboom ◽  
Els Siebelink ◽  
Anne M. van de Wiel ◽  
...  

Self-administered web-based 24-h dietary recalls (24 hR) may save considerable time and money compared with interviewer-administered telephone-based 24 hR interviews and may therefore be useful in large-scale studies. Within the Nutrition Questionnaires plus (NQplus) study, the web-based 24 hR tool Compl-eat™ was developed to assess Dutch participants’ dietary intake. The aim of the present study was to evaluate the performance of this tool against the interviewer-administered telephone-based 24 hR method. A subgroup of participants of the NQplus study (20–70 years, n 514) completed three self-administered web-based 24 hR and three telephone 24 hR interviews administered by a dietitian over a 1-year period. Compl-eat™ as well as the dietitians guided the participants to report all foods consumed the previous day. Compl-eat™ on average underestimated the intake of energy by 8 %, of macronutrients by 10 % and of micronutrients by 13 % as compared with telephone recalls. The agreement between the two methods, estimated using Lin's concordance coefficients (LCC), ranged from 0·15 for vitamin B1 to 0·70 for alcohol intake (mean LCC 0·38). The lower estimations by Compl-eat™ can be explained by a lower number of total reported foods and lower estimated intakes of the food groups fats, oils and savoury sauces; sugar and confectionery; and dairy and cheese. The performance of the tool may be improved by, for example, adding an option to automatically select frequently used foods and including more recall cues. We conclude that Compl-eat™ may be a useful tool in large-scale Dutch studies after suggested improvements have been implemented and evaluated.
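For readers unfamiliar with the agreement measure used here, Lin's concordance coefficient between paired measurements x and y is commonly defined as 2·cov(x, y) / (var(x) + var(y) + (mean(x) − mean(y))²). The sketch below computes it with population (1/n) moments; the function name `lins_ccc` and the intake values are hypothetical illustrations, not data from the study, and the study's own estimation procedure may differ.

```python
from statistics import fmean
from typing import Sequence


def lins_ccc(x: Sequence[float], y: Sequence[float]) -> float:
    """Lin's concordance coefficient using population (1/n) moments."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    n = len(x)
    mean_x, mean_y = fmean(x), fmean(y)
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    denominator = var_x + var_y + (mean_x - mean_y) ** 2
    if denominator == 0:
        raise ValueError("coefficient is undefined for identical constant series")
    return 2 * cov_xy / denominator


# Made-up energy intakes (kcal) for five participants, for illustration only.
web_recall = [1900.0, 2100.0, 1750.0, 2400.0, 2000.0]
phone_recall = [2050.0, 2200.0, 1900.0, 2500.0, 2300.0]

print(round(lins_ccc(web_recall, phone_recall), 3))
```

Unlike a plain correlation, the (mean(x) − mean(y))² term lowers the coefficient when one method is systematically lower than the other, which is the situation the abstract reports for the web-based tool.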


2019 ◽  
pp. 129-139
Author(s):  
Tamara Mykolayivna Kurach ◽  
Iryna Aleksandrovna Pidlisetskaya

Goal. To develop an interactive tourist map, "Landmarks of Bohuslav". Methodology. The methodological and theoretical basis of the study is modern geographical and cartographic science in the field of thematic mapping, with the involvement of web-mapping technologies. Results. A large-scale tourist web map of the cultural heritage of the Bohuslav region, “Sights of Bohuslavshchina”, was created. Scientific novelty. Approbation of the methodology and technology for developing interactive large-scale tourism web maps using the Leaflet JavaScript library. Practical value. The interactive tourist web map of historical and cultural heritage sites, “Sights of Bohuslavshchina”, will be published on the website of the sanatorium-type health-improving institution “Chaika”. Convenient use, visualization, and prompt access to information will help increase the attractiveness of tourist routes in the Bohuslav region.

