Arabic-English Parallel Corpus: A New Resource for Translation Training and Language Teaching

Mapping Intimacies ◽

10.31235/osf.io/rek3w ◽

2017 ◽

Author(s):

Arab World English Journal ◽

Hind M. Alotaibi

Keyword(s):

Language Teaching ◽

Data Driven ◽

Text Segmentation ◽

Web Interface ◽

King Saud University ◽

Parallel Corpora ◽

Parallel Corpus ◽

Source Language ◽

User Friendly ◽

Ongoing Project

Parallel corpora can be defined as collections of aligned, translated texts of two or more languages. They play a major role in translation and contrastive studies, and are also becoming popular in translation training and language teaching, with the advent of the data-driven learning (DDL) approach. Despite their significance, however, Arabic seems to lack a satisfactory general-use parallel corpus resource. The literature describes few Arabic–English parallel corpora, and these few are usually inaccurate and/or expensive. Some are small in size, while others are restricted in terms of genre, failing to meet the requirements of many academics and researchers. This paper describes an ongoing project at the College of Languages and Translation, King Saud University, to compile a 10-million-word Arabic–English parallel corpus to be used as a resource for translation training and language teaching. The bidirectional corpus can be used to compare translated and source language and identify differences. The corpus has been manually verified at different stages, including translation, text segmentation, alignment, and file preparation; it is available as full-text in XML format and through a user-friendly web interface that provides a concordancer to support bilingual search queries and several filtering options.
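As a rough illustration of what a bilingual concordancer over an aligned corpus does, the sketch below searches one side of a list of aligned segment pairs and returns each hit in context together with its aligned translation. The data layout, the sample sentences, and the `concordance` function are assumptions made for this example; they are not the project's actual XML schema or web interface.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical aligned data: (Arabic segment, English segment) pairs.
AlignedPair = Tuple[str, str]


def concordance(pairs: List[AlignedPair], query: str,
                side: str = "en", width: int = 30) -> List[Dict[str, str]]:
    """Return keyword-in-context hits for `query` on one side of the corpus,
    each paired with the aligned segment from the other side."""
    idx = 0 if side == "ar" else 1
    pattern = re.compile(re.escape(query), re.IGNORECASE)
    hits = []
    for pair in pairs:
        text = pair[idx]
        for m in pattern.finditer(text):
            hits.append({
                "left": text[max(0, m.start() - width):m.start()],
                "match": m.group(0),
                "right": text[m.end():m.end() + width],
                "aligned": pair[1 - idx],
            })
    return hits


# Toy sample, invented for illustration only.
pairs = [
    ("قرأ الطالب الكتاب", "The student read the book"),
    ("الكتاب على الطاولة", "The book is on the table"),
    ("ذهب الولد إلى المدرسة", "The boy went to school"),
]

for hit in concordance(pairs, "book", side="en"):
    print(hit["left"] + "[" + hit["match"] + "]" + hit["right"], "|", hit["aligned"])
```

Searching with `side="ar"` runs the same query against the Arabic segments and returns the English segments as the aligned text, which is the sense in which such a search is bilingual.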

Developing annotation solutions for online Data Driven Learning

ReCALL ◽

10.1017/s0958344009000093 ◽

2009 ◽

Vol 21 (1) ◽

pp. 55-75 ◽

Cited By ~ 6

Author(s):

Pascual Pérez-Paredes ◽

Jose M. Alcaraz-Calero

Keyword(s):

Corpus Linguistics ◽

Language Teaching ◽

Language Education ◽

Data Driven ◽

Online Data ◽

Analysis And Design ◽

Language Classroom ◽

Youth Language ◽

User Friendly

Although annotation is a widely researched topic in Corpus Linguistics (CL), its potential role in Data Driven Learning (DDL) has not been addressed in depth by Foreign Language Teaching (FLT) practitioners. Furthermore, most of the research in the use of DDL methods pays little attention to annotation in the design and implementation of corpus-based/driven language teaching. In this paper, we set out to examine the process of development of SACODEYL Annotator, an application that seeks to assist SACODEYL system users in annotating XML multilingual corpora. First, we discuss the role of annotation in DDL and the dominating paradigm in general corpus applications. In the context of the language classroom, we argue that it is essential that corpora should be pedagogically motivated (Braun, 2005 and 2007a). Then, we move on to deal with the analysis and design stages of our annotation solution by illustrating its main features. Some of these include a user-friendly, hierarchical and extensible taxonomy tree to facilitate the learner-oriented annotation of the corpora; real-time graphics representation of the annotated corpus matching the XML TEI-compliant (Text Encoding Initiative) standard; and intuitive management of the different data sections and associated metadata. SACODEYL (System Aided Compilation and Open Distribution of European Youth Language) is an EU-funded MINERVA project which aims to develop an ICT-based system for the assisted compilation and open distribution of multimedia European teen talk in the context of language education. This research lays emphasis on the functionalities of the application within the SACODEYL context. However, our paper similarly addresses the needs of potential multimedia language corpus administrators in general who are on the lookout for powerful annotation-assisting software. SACODEYL Annotator is free to use and can be downloaded from our website.

Deep Learning-based Roman-Urdu to Urdu Transliteration

International Journal of Pattern Recognition and Artificial Intelligence ◽

10.1142/s0218001421520017 ◽

2020 ◽

pp. 2152001

Author(s):

Mehreen Alam ◽

Sibt ul Hussain

Keyword(s):

Machine Translation ◽

State Of The Art ◽

Research Problem ◽

Attention Mechanism ◽

Data Driven ◽

Neural Machine Translation ◽

Parallel Corpus ◽

Source Language ◽

Data Driven Approach ◽

Modern Machine

Attention-based encoder-decoder models have superseded conventional techniques due to their unmatched performance on many neural machine translation problems. Usually, the encoder and decoder are two recurrent neural networks, where the decoder is directed to focus on relevant parts of the source language using an attention mechanism. This data-driven approach leads to generic and scalable solutions with no reliance on hand-crafted features. To the best of our knowledge, none of the modern machine translation approaches has been applied to the research problem of Urdu machine transliteration. Ours is the first attempt to apply a deep neural network-based encoder-decoder with an attention mechanism to this problem, using a Roman-Urdu and Urdu parallel corpus. To this end, we present (i) the first ever Roman-Urdu to Urdu parallel corpus of 1.1 million sentences, (ii) three state-of-the-art encoder-decoder models, and (iii) a detailed empirical analysis of these three models on the Roman-Urdu to Urdu parallel corpus. Overall, the attention-based model gives state-of-the-art performance, with a benchmark BLEU score of 70. Our qualitative experimental evaluation shows that our models generate coherent transliterations which are grammatically and logically correct.
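To make the idea of a decoder "focusing on relevant parts of the source" concrete, here is a minimal sketch of a single attention step in plain Python. It assumes dot-product scoring and tiny hand-written vectors purely for illustration; the abstract does not say which scoring function or dimensions the authors used, and the names `softmax` and `attend` are hypothetical.

```python
import math
from typing import List, Tuple


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(decoder_state: List[float],
           encoder_states: List[List[float]]) -> Tuple[List[float], List[float]]:
    """One attention step with dot-product scoring (an assumption here).

    Returns the weight assigned to each source position and the context
    vector, i.e. the weighted sum of the encoder states."""
    scores = [sum(d * h for d, h in zip(decoder_state, hs))
              for hs in encoder_states]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    context = [sum(w * hs[i] for w, hs in zip(weights, encoder_states))
               for i in range(dim)]
    return weights, context


# Toy vectors, invented for illustration: three source positions, two dimensions.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_state = [2.0, 0.0]

weights, context = attend(decoder_state, encoder_states)
print(weights)   # positions 0 and 2 receive more weight than position 1
print(context)
```

In a trained model the encoder and decoder states would come from the recurrent networks, and the context vector would feed into the prediction of the next output character or token.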

Dutch Parallel Corpus: A Balanced Copyright-Cleared Parallel Corpus

Meta Journal des traducteurs ◽

10.7202/1006182ar ◽

2011 ◽

Vol 56 (2) ◽

pp. 374-390 ◽

Cited By ~ 23

Author(s):

Lieve Macken ◽

Orphée De Clercq ◽

Hans Paulussen

Keyword(s):

Research Community ◽

Web Interface ◽

Parallel Corpora ◽

Parallel Corpus ◽

Text Type ◽

Language Technology ◽

Text Types ◽

French And English ◽

Class Information ◽

Research Domains

This paper presents the Dutch Parallel Corpus, a high-quality parallel corpus for Dutch, French and English consisting of more than ten million words. The corpus contains five different text types and is balanced with respect to text type and translation direction. All texts included in the corpus have been cleared from copyright. We discuss the importance of parallel corpora in various research domains and contrast the Dutch Parallel Corpus with existing parallel corpora. The Dutch Parallel Corpus distinguishes itself from other parallel corpora by having a balanced composition and by its availability to the wide research community, thanks to its copyright clearance. All texts in the corpus are sentence-aligned and further enriched with basic linguistic annotations (lemmas and word class information). Approximately 25,000 words of the Dutch-English part have been manually aligned at the sub-sentential level. Rich metadata facilitates the navigability of the corpus and enables users to select the texts that satisfy their needs. The entire corpus is released as full texts in XML format and is also available via a web interface, which supports basic and complex search queries and presents the results as parallel concordances. The corpus will be distributed by the Flemish-Dutch Human Language Technology Agency (TST-Centrale).

MetaADEDB 2.0: a comprehensive database on adverse drug events

Bioinformatics ◽

10.1093/bioinformatics/btaa973 ◽

2020 ◽

Author(s):

Zhuohang Yu ◽

Zengrui Wu ◽

Weihua Li ◽

Guixia Liu ◽

Yun Tang

Keyword(s):

Safety Assessment ◽

Adverse Drug Events ◽

Adverse Event Reporting System ◽

Adverse Event Reporting ◽

Supplementary Information ◽

Online Database ◽

Web Interface ◽

Drug Discovery And Development ◽

Comprehensive Information ◽

User Friendly

Summary: MetaADEDB is an online database we developed to integrate comprehensive information on adverse drug events (ADEs). The first version of MetaADEDB was released in 2013 and has been widely used by researchers. However, it has not been updated for more than seven years. Here, we report its second version, built by collecting more and newer data from the U.S. FDA Adverse Event Reporting System (FAERS) and the Canada Vigilance Adverse Reaction Online Database, in addition to the original three sources. The new version consists of 744 709 drug–ADE associations between 8498 drugs and 13 193 ADEs, an increase of over 40% in drug–ADE associations compared with the previous version. We also developed a new and user-friendly web interface for data search and analysis. We hope that MetaADEDB 2.0 will provide a useful tool for drug safety assessment and related studies in drug discovery and development. Availability and implementation: The database is freely available at: http://lmmd.ecust.edu.cn/metaadedb/. Supplementary information: Supplementary data are available at Bioinformatics online.

Units of Meaning, Parallel Corpora, and their Implications for Language Teaching

Applied Corpus Linguistics ◽

10.1163/9789004333772_010 ◽

2004 ◽

pp. 171-189 ◽

Cited By ~ 4

Keyword(s):

Language Teaching ◽

Parallel Corpora

VIP-HL: Semi-automated ACMG/AMP variant interpretation platform for genetic hearing loss

10.22541/au.160682407.73344037/v1 ◽

2020 ◽

Author(s):

Jiguang Peng ◽

Jiale Xiang ◽

Xiangqian Jin ◽

Junhua Meng ◽

Nana Song ◽

...

Keyword(s):

Hearing Loss ◽

Expert Panel ◽

Sequence Variant ◽

Variant Interpretation ◽

Web Interface ◽

Online Tool ◽

Genetic Hearing Loss ◽

Genetics And Genomics ◽

Evidence Based Guidelines ◽

User Friendly

The American College of Medical Genetics and Genomics and the Association for Molecular Pathology (ACMG/AMP) have proposed a set of evidence-based guidelines to support sequence variant interpretation. The ClinGen hearing loss expert panel (HL-EP) introduced further specifications into the ACMG/AMP framework for genetic hearing loss. This study developed a tool named VIP-HL, which aims to semi-automate the HL ACMG/AMP rules. VIP-HL aggregates information from external databases to automate 13 out of 24 ACMG/AMP rules specified by HL-EP, namely PVS1, PS1, PM1, PM2, PM4, PM5, PP3, BA1, BS1, BS2, BP3, BP4, and BP7. We benchmarked VIP-HL using 50 variants for which 83 rules were activated by the ClinGen HL-EP. VIP-HL concordantly activated 96% (80/83) of the rules, significantly more than InterVar (47%; 39/83). Of 4948 ClinVar star 2+ variants from 142 deafness-related genes, VIP-HL achieved an overall variant interpretation concordance of 88.0% (4353/4948). VIP-HL is an integrated online tool for reliable automated variant classification in hearing loss genes. It assists curators in variant interpretation and provides a platform for users to share classifications with each other. VIP-HL is available with a user-friendly web interface at http://hearing.genetics.bgi.com/.
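The percentages quoted above follow directly from the reported counts. A minimal check, using only the figures given in the abstract (the helper name `concordance_rate` is made up for this example):

```python
def concordance_rate(matched: int, total: int) -> float:
    """Percentage of items in agreement, rounded to one decimal place."""
    return round(100 * matched / total, 1)


# Counts taken from the abstract.
vip_hl_rules = concordance_rate(80, 83)         # rules activated by VIP-HL
intervar_rules = concordance_rate(39, 83)       # rules activated by InterVar
vip_hl_variants = concordance_rate(4353, 4948)  # ClinVar star 2+ variants

print(vip_hl_rules, intervar_rules, vip_hl_variants)  # 96.4 47.0 88.0
```

The abstract reports the first two figures as whole percentages (96% and 47%) and the third to one decimal place (88.0%).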

The Microbe Directory: An annotated, searchable inventory of microbes’ characteristics

Gates Open Research ◽

10.12688/gatesopenres.12772.1 ◽

2018 ◽

Vol 2 ◽

pp. 3 ◽

Cited By ~ 5

Author(s):

Heba Shaaban ◽

David A. Westfall ◽

Rawhi Mohammad ◽

David Danko ◽

Daniela Bezdan ◽

...

Keyword(s):

Biofilm Formation ◽

Large Scale ◽

Research Effort ◽

Gram Stain ◽

Web Interface ◽

Ongoing Effort ◽

Student Researchers ◽

User Friendly ◽

Optimal Ph ◽

Online Web

The Microbe Directory is a collective research effort to profile and annotate more than 7,500 unique microbial species from the MetaPhlAn2 database that includes bacteria, archaea, viruses, fungi, and protozoa. By collecting and summarizing data on various microbes’ characteristics, the project comprises a database that can be used downstream of large-scale metagenomic taxonomic analyses, allowing one to interpret and explore their taxonomic classifications to have a deeper understanding of the microbial ecosystem they are studying. Such characteristics include, but are not limited to: optimal pH, optimal temperature, Gram stain, biofilm-formation, spore-formation, antimicrobial resistance, and COGEM class risk rating. The database has been manually curated by trained student-researchers from Weill Cornell Medicine and CUNY—Hunter College, and its analysis remains an ongoing effort with open-source capabilities so others can contribute. Available in SQL, JSON, and CSV (i.e. Excel) formats, the Microbe Directory can be queried for the aforementioned parameters by a microorganism’s taxonomy. In addition to the raw database, The Microbe Directory has an online counterpart (https://microbe.directory/) that provides a user-friendly interface for storage, retrieval, and analysis into which other microbial database projects could be incorporated. The Microbe Directory was primarily designed to serve as a resource for researchers conducting metagenomic analyses, but its online web interface should also prove useful to any individual who wishes to learn more about any particular microbe.

COVID-19 preVIEW: Semantic Search to Explore COVID-19 Research Preprints

Studies in Health Technology and Informatics - Public Health and Informatics ◽

10.3233/shti210124 ◽

2021 ◽

Author(s):

Lisa Langnickel ◽

Roman Baum ◽

Johannes Darms ◽

Sumit Madan ◽

Juliane Fluck

Keyword(s):

Search Engine ◽

Access Point ◽

Semantic Search ◽

Web Interface ◽

Disease Trajectory ◽

Human Genes ◽

Central Access ◽

Semantic Information Retrieval ◽

User Friendly ◽

Semantic Search Engine

During the current COVID-19 pandemic, the rapid availability of profound information is crucial in order to derive information about diagnosis, disease trajectory and treatment, or to adapt the rules of conduct in public. The increased importance of preprints for COVID-19 research initiated the design of the preprint search engine preVIEW. Conceptually, it is a lightweight semantic search engine that focuses on easy inclusion of specialized COVID-19 textual collections and provides a user-friendly web interface for semantic information retrieval. In order to support semantic search functionality, we integrated a text mining workflow for indexing with relevant terminologies. Currently, diseases, human genes and SARS-CoV-2 proteins are annotated, and more will be added in the future. The system integrates collections from several different preprint servers that are used in the biomedical domain to publish non-peer-reviewed work, thereby providing one central access point for users. In addition, our service offers facet searching, export functionality and API access. COVID-19 preVIEW is publicly available at https://preview.zbmed.de.

Semantics, contrastive linguistics and parallel corpora

Cognitive Studies | Études cognitives ◽

10.11649/cs.2014.009 ◽

2014 ◽

pp. 85-100

Author(s):

Violetta Koseska

Keyword(s):

Lexical Semantics ◽

Semantic Annotation ◽

Semantic Structure ◽

Automatic Annotation ◽

Parallel Corpora ◽

Parallel Corpus ◽

Linguistic Form ◽

Semantic Categories ◽

Contrastive Linguistics

In view of the ambiguity of the term “semantics”, the author shows the differences between traditional lexical semantics and contemporary semantics in the light of various semantic schools. She examines semantics differently in connection with contrastive studies, where the description must necessarily go from the meaning towards the linguistic form, whereas in traditional contrastive studies the description proceeded from the form towards the meaning. This requirement regarding theoretical contrastive studies necessitates the construction of a semantic interlanguage, rather than only singling out universal semantic categories expressed by various language means. Such studies can be strongly supported by parallel corpora. However, in order to make them useful for linguists in manual and computer translation, as well as in the development of dictionaries, including online ones, we need not only formal, often automatic, annotation of texts, but also semantic annotation, which is unfortunately manual. In the article we focus on semantic annotation concerning time, aspect and quantification of names and predicates in the whole semantic structure of the sentence, using the example of the “Polish-Bulgarian-Russian parallel corpus”.

Corpus linguistics and its applications in higher education

Revista Alicantina de Estudios Ingleses ◽

10.14198/raei.2010.23.04 ◽

2010 ◽

pp. 51 ◽

Cited By ~ 3

Author(s):

Miguel Fuster Márquez ◽

Begoña Clavel Arroitia

Keyword(s):

Higher Education ◽

Corpus Linguistics ◽

Teaching Practices ◽

Language Teaching ◽

Applied Linguistics ◽

Data Driven ◽

Future Success ◽

Different Types ◽

Theoretical Linguistics ◽

Relevant Factors

The aim of this paper is to review and analyse relevant factors related to the implementation of corpus linguistics (CL) in higher education. First we set out to describe underlying principles of CL and its developments in relation to theoretical linguistics and its applications in modern teaching practices. Then we attempt to establish how different types of corpora have contributed to the development of direct and indirect approaches in language teaching. We single out Data Driven Learning (DDL) due to its relevance in applied linguistics literature, and examine in detail advantages and drawbacks. Finally, we outline problems concerning the implementation of CL in the classroom since awareness of the limitations of CL is vital for its future success.