Parallel Data Extraction using Word Embeddings

2020 ◽  
Author(s):  
Pintu Lohar ◽  
Andy Way

Building a robust MT system requires a sufficiently large parallel corpus to be available as training data. In this paper, we propose to automatically extract parallel sentences from comparable corpora without using any MT system or even any parallel corpus at all. Instead, we use crosslingual information retrieval (CLIR), average word embeddings, text similarity and a bilingual dictionary, thus saving a significant amount of time and effort as no MT system is involved in this process. We conduct experiments on two different kinds of data: (i) formal texts from the news domain, and (ii) user-generated content (UGC) from hotel reviews. The automatically extracted sentence pairs are then added to the already available parallel training data and the extended translation models are built from the concatenated data sets. Finally, we compare the performance of our new extended models against the baseline models built from the available data. The experimental evaluation reveals that our proposed approach is capable of improving the translation outputs for both the formal texts and the UGC.
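The core matching step can be illustrated with a short sketch: average the word vectors of each sentence and keep cross-lingual pairs whose cosine similarity clears a threshold. This is a minimal illustration, not the authors' released code; the function names, the whitespace tokenisation and the 0.7 threshold are assumptions, and the full pipeline additionally uses CLIR and a bilingual dictionary to narrow the candidate space.

```python
import numpy as np

def avg_embedding(tokens, embeddings, dim=300):
    """Average the word vectors of a sentence; OOV tokens are skipped."""
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def cosine(u, v):
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v) / denom if denom else 0.0

def extract_pairs(src_sents, tgt_sents, src_emb, tgt_emb, threshold=0.7):
    """Pair each source sentence with its most similar target sentence,
    keeping only pairs whose similarity clears the threshold."""
    tgt_vecs = [avg_embedding(s.split(), tgt_emb) for s in tgt_sents]
    pairs = []
    for s in src_sents:
        sv = avg_embedding(s.split(), src_emb)
        scores = [cosine(sv, tv) for tv in tgt_vecs]
        best = int(np.argmax(scores))
        if scores[best] >= threshold:
            pairs.append((s, tgt_sents[best], scores[best]))
    return pairs
```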

Terminology ◽  
2001 ◽  
Vol 7 (1) ◽  
pp. 63-83 ◽  
Author(s):  
Hiroshi Nakagawa

Bilingual machine-readable dictionaries are important and indispensable sources of information for cross-language information retrieval and machine translation. Recently, these cross-language information activities have begun to focus on specific academic or technological domains. In this paper, we describe a bilingual dictionary acquisition system which extracts translations from non-parallel but comparable corpora of a specific academic domain and disambiguates the extracted translations. The proposed method is two-fold. In the first stage, candidate terms are extracted from Japanese and English corpora, respectively, and ranked according to their importance as terms. In the second stage, ambiguous translations are resolved by selecting the target-language translation which is nearest in rank to the source-language term. Finally, we evaluate the proposed method in an experiment.
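The second-stage disambiguation lends itself to a compact sketch: pick the candidate translation whose termhood rank in the target corpus is nearest to the source term's rank. A minimal sketch under the assumption that the rank tables are plain dictionaries; the function name and example data are hypothetical.

```python
def select_translation(src_term, candidates, src_rank, tgt_rank):
    """Rank-distance disambiguation: choose the candidate whose termhood
    rank in the target corpus is nearest to the source term's rank."""
    r = src_rank[src_term]
    return min(candidates, key=lambda c: abs(tgt_rank.get(c, float("inf")) - r))

# Hypothetical example: the English term ranked 12th is paired with the
# Japanese candidate whose rank (10) is nearest to 12.
print(select_translation("cell", ["saibou", "denchi"],
                         {"cell": 12}, {"saibou": 10, "denchi": 85}))  # saibou
```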


2020 ◽  
Vol 46 (2) ◽  
pp. 257-288
Author(s):  
Tasnim Mohiuddin ◽  
Shafiq Joty

Crosslingual word embeddings learned from monolingual embeddings play a crucial role in many downstream tasks, ranging from machine translation to transfer learning. Adversarial training has shown impressive success in learning crosslingual embeddings and the associated word translation task without any parallel data, by mapping monolingual embeddings into a shared space. However, recent work has shown superior performance for non-adversarial methods in more challenging language pairs. In this article, we investigate the adversarial autoencoder for unsupervised word translation and propose two novel extensions to it that yield more stable training and improved results. Our method includes regularization terms to enforce cycle consistency and input reconstruction, and pits the target encoders as adversaries against the corresponding discriminators. We apply two types of refinement procedures sequentially after obtaining the trained encoders and mappings from the adversarial training, namely, refinement with the Procrustes solution and refinement with symmetric re-weighting. Extensive experiments with high- and low-resource languages from two different data sets show that our method achieves better performance than existing adversarial and non-adversarial approaches and is also competitive with supervised systems. Along with performing comprehensive ablation studies to understand the contribution of the different components of our adversarial model, we also conduct a thorough analysis of the refinement procedures to understand their effects.
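The first refinement step, the Procrustes solution, has a standard closed form: given row-aligned embedding matrices for a seed dictionary, the best orthogonal map comes from an SVD. A minimal sketch of that step alone (the paper's adversarial training and symmetric re-weighting are not shown):

```python
import numpy as np

def procrustes_refine(X, Y):
    """Orthogonal Procrustes step: for row-aligned source/target embedding
    matrices X and Y (n x d), return the orthogonal W that minimizes
    ||X @ W - Y||_F, computed from the SVD of X^T Y."""
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt
```

In typical refinement loops of this kind, the row alignment comes from a dictionary induced under the previously learned mapping (for example, mutual nearest neighbours), and the step is iterated until the induced dictionary stabilises.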


2020 ◽  
Author(s):  
Beakcheol Jang ◽  
Inhwan Kim

BACKGROUND: Each year, influenza affects 3 to 5 million people and causes 290,000 to 650,000 fatalities worldwide. To reduce these fatalities, several countries have established influenza surveillance systems to collect early-warning data. However, proper and timely warnings are hindered by a one- to two-week delay between actual disease outbreaks and the publication of surveillance data. To avoid this delay of traditional monitoring methods, novel methods have been proposed for influenza surveillance and prediction using real-time internet data (such as search queries, microblogging, and news). Some of the currently popular approaches extract online data and use machine learning to predict influenza occurrences in a classification mode. However, many of these methods extract training data subjectively, and it is difficult to capture the latent characteristics of the data correctly. There is a critical need for new approaches that focus on extracting training data by reflecting the latent characteristics of the data.

OBJECTIVE: In this paper, we propose an effective training data extraction method that reflects the hidden features of the data and improves performance by filtering and selecting only the keywords related to influenza before prediction.

METHODS: Although word embeddings provide a distributed representation of words by encoding the hidden relationships between tokens, we enhance the word embeddings by selecting keywords related to the influenza outbreak and sorting the extracted keywords by their Pearson correlation coefficient (PCC) with the influenza outbreak. The keyword extraction process is followed by a predictive model based on long short-term memory (LSTM) that predicts the influenza outbreak. To assess the performance of the proposed predictive model, we use and compare a variety of word embeddings.

RESULTS: Word embeddings without our proposed sorting process achieved a prediction accuracy of 0.8705 with an average of 50.2 keywords selected. In contrast, word embeddings with our proposed sorting process achieved a prediction accuracy of 0.8868, a 12.6% improvement, even though a smaller amount of training data was selected, with only 20.6 keywords on average.

CONCLUSIONS: The sorting process empowers the embedding process, which improves feature extraction because it acts as a knowledge base for the prediction component. The model outperforms other current approaches that use flat extraction before prediction.
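A minimal sketch of the PCC-based sorting step, assuming the weekly keyword-frequency series and the reported-influenza series are already available as plain arrays; the function name and the top_k cut-off are illustrative, not the authors' code:

```python
import numpy as np

def rank_keywords_by_pcc(keyword_freqs, flu_counts, top_k=20):
    """Sort candidate keywords by the Pearson correlation between their
    frequency series and the reported influenza series; keep the top_k."""
    scored = [(kw, np.corrcoef(series, flu_counts)[0, 1])
              for kw, series in keyword_freqs.items()]
    scored.sort(key=lambda kv: kv[1], reverse=True)
    return scored[:top_k]
```

The surviving keyword series would then be fed to the LSTM predictor in place of the full, unsorted keyword set.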


2020 ◽  
Vol 17 (1) ◽  
pp. 54-60
Author(s):  
B. S. Sowmya Lakshmi ◽  
B. R. Shambhavi

Visvesvaraya Technological University, Belagavi, Karnataka, India

Parallel corpora are said to be one of the most promising resources from which to extract dictionaries. The majority of substantial works are based on parallel corpora, whereas for resource-scarce language pairs, building a parallel corpus is a challenging task. To overcome this issue, researchers have found that comparable corpora can serve as an alternative for dictionary extraction. The proposed approach extracts a dictionary for the low-resource language pair English-Kannada using comparable corpora obtained from Wikipedia dumps and a corpus received from the Indian Language Corpus Initiative (ILCI). The constructed dictionary comprises both translation and transliteration entries with term-level associations from English to Kannada. The resultant dictionary is of size 77,545 tokens with a precision score of 0.79. The proposed work is language-independent and could be extended to other language pairs.
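The abstract does not detail how the transliteration entries are associated, but a crude way to illustrate the idea is string similarity between an English term and a romanized Kannada candidate; everything here (the romanization step, the SequenceMatcher scorer) is an assumption for illustration only:

```python
from difflib import SequenceMatcher

def transliteration_score(en_term, kn_term_romanized):
    """Crude string similarity between an English term and a romanized
    Kannada candidate; high scores suggest a transliteration pair."""
    return SequenceMatcher(None, en_term.lower(),
                           kn_term_romanized.lower()).ratio()

print(transliteration_score("Bangalore", "bengaluru"))  # ~0.67
```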


Sensors ◽  
2020 ◽  
Vol 20 (9) ◽  
pp. 2689 ◽  
Author(s):  
Artur Wilkowski ◽  
Maciej Stefańczyk ◽  
Włodzimierz Kasprzak

Police and various security services use video analysis for securing public space, mass events, and when investigating criminal activity. Due to the huge amount of data supplied to surveillance systems, some automatic data processing is a necessity. In one typical scenario, an operator marks an object in an image frame and searches for all occurrences of the object in other frames or even image sequences. This problem is hard in general. Algorithms supporting this scenario must reconcile several seemingly contradictory factors: training and detection speed, detection reliability, and learning from small data sets. In the system proposed here, we use a two-stage detector. The first, region-proposal stage is based on a Cascade Classifier, while the second, classification stage is based either on Support Vector Machines (SVMs) or Convolutional Neural Networks (CNNs). The proposed configuration ensures both speed and detection reliability. In addition, an object tracking and background-foreground separation algorithm is used, supported by the GrabCut algorithm and a sample synthesis procedure, in order to collect rich training data for the detector. Experiments show that the system is effective, useful, and applicable to practical surveillance tasks.
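A rough sketch of the two-stage flow using OpenCV and scikit-learn; the cascade file path, raw-pixel features and patch size are placeholders, and the authors' actual feature extraction and CNN variant are not reproduced here:

```python
import cv2
from sklearn.svm import SVC

# Stage 1: region proposals from a trained cascade (path is a placeholder).
cascade = cv2.CascadeClassifier("object_cascade.xml")

def detect(frame, svm: SVC, patch_size=(64, 64)):
    """Propose regions with the cascade, then verify each with the SVM."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    accepted = []
    for (x, y, w, h) in cascade.detectMultiScale(gray, scaleFactor=1.1,
                                                 minNeighbors=3):
        # Stage 2: classify the proposed patch (raw pixels as a stand-in
        # for whatever features the trained SVM expects).
        patch = cv2.resize(gray[y:y + h, x:x + w], patch_size).flatten() / 255.0
        if svm.predict([patch])[0] == 1:
            accepted.append((x, y, w, h))
    return accepted
```

The split mirrors the design trade-off described above: the cascade keeps proposal generation fast, while the second-stage classifier supplies the reliability.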


2021 ◽  
Article 016555152199275
Author(s):  
Juryong Cheon ◽  
Youngjoong Ko

Translation language resources, such as bilingual word lists and parallel corpora, are important factors affecting the effectiveness of cross-language information retrieval (CLIR) systems. In particular, when large domain-appropriate parallel corpora are not available, developing an effective CLIR system is particularly difficult. Furthermore, creating a large parallel corpus is costly and requires considerable effort. Therefore, we here demonstrate the construction of parallel corpora from Wikipedia as well as improved query translation, wherein the queries are used for a CLIR system. To do so, we first constructed a bilingual dictionary, termed WikiDic. Then, we evaluated individual language resources and combinations of them in terms of their ability to extract parallel sentences; the combination of our proposed WikiDic with the translation probability from the Web's bilingual example sentence pairs was found to be best suited to parallel sentence extraction. Finally, to evaluate the parallel corpus generated from this best combination of language resources, we compared its performance in query translation for CLIR to that of a manually created English–Korean parallel corpus. As a result, the corpus generated by our proposed method achieved better performance than the manually created corpus, thus demonstrating the effectiveness of the proposed method for automatic parallel corpus extraction. Not only can the method demonstrated herein be used to inform the construction of other parallel corpora from readily available language resources, but the parallel sentence extraction method will also naturally improve as Wikipedia continues to be used and its content develops.
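A minimal sketch of how a dictionary such as WikiDic combined with translation probabilities might score candidate sentence pairs; the data structures and the 0.5 fallback weight are assumptions, not the paper's actual scoring function:

```python
def sentence_pair_score(en_tokens, ko_tokens, wikidic, trans_prob):
    """Score an English-Korean sentence pair by how many English tokens
    have a dictionary translation present in the Korean sentence,
    weighted by translation probability when one is known."""
    score = 0.0
    for e in en_tokens:
        matches = [k for k in wikidic.get(e, []) if k in ko_tokens]
        if matches:
            score += max(trans_prob.get((e, k), 0.5) for k in matches)
    return score / max(len(en_tokens), 1)
```

Candidate pairs drawn from linked Wikipedia articles would be ranked by this score, with the highest-scoring pairs kept as extracted parallel sentences.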


2021 ◽  
Vol 8 (1) ◽  
Author(s):  
Yahya Albalawi ◽  
Jim Buckley ◽  
Nikola S. Nikolov

This paper presents a comprehensive evaluation of data pre-processing and word embedding techniques in the context of Arabic document classification in the domain of health-related communication on social media. We evaluate 26 text pre-processing techniques applied to Arabic tweets within the process of training a classifier to identify health-related tweets. For this task we use the (traditional) machine learning classifiers KNN, SVM, Multinomial NB and Logistic Regression. Furthermore, we report experimental results with the deep learning architectures BLSTM and CNN for the same text classification problem. Since word embeddings are more typically used as the input layer in deep networks, in the deep learning experiments we evaluate several state-of-the-art pre-trained word embeddings with the same text pre-processing applied. To achieve these goals, we use two data sets: one for both training and testing, and another for testing the generality of our models only. Our results point to the conclusion that only four of the 26 pre-processing techniques improve the classification accuracy significantly. For the first data set of Arabic tweets, we found that Mazajak CBOW pre-trained word embeddings as the input to a BLSTM deep network led to the most accurate classifier, with an F1 score of 89.7%. For the second data set, Mazajak Skip-Gram pre-trained word embeddings as the input to a BLSTM led to the most accurate model, with an F1 score of 75.2% and an accuracy of 90.7%, compared to an F1 score of 90.8% achieved by Mazajak CBOW for the same architecture but with a lower accuracy of 70.89%. Our results also show that the performance of the best of the traditional classifiers we trained is comparable to that of the deep learning methods on the first data set, but significantly worse on the second data set.
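A minimal sketch of the best-performing configuration's shape, pre-trained embeddings feeding a BLSTM, in tf.keras; the layer width, the frozen embedding layer and the binary output are assumptions, and the Mazajak matrix would be loaded separately:

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_blstm(vocab_size, embed_dim, embed_matrix, lstm_units=128):
    """Binary tweet classifier: frozen pre-trained embeddings -> BLSTM -> sigmoid."""
    model = tf.keras.Sequential([
        layers.Embedding(
            vocab_size, embed_dim,
            embeddings_initializer=tf.keras.initializers.Constant(embed_matrix),
            trainable=False),                      # e.g. a Mazajak matrix
        layers.Bidirectional(layers.LSTM(lstm_units)),
        layers.Dense(1, activation="sigmoid"),     # health-related or not
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model
```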


2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Eleanor F. Miller ◽  
Andrea Manica

Background: Today an unprecedented amount of genetic sequence data is stored in publicly available repositories. For decades now, mitochondrial DNA (mtDNA) has been the workhorse of genetic studies, and as a result, there is a large volume of mtDNA data available in these repositories for a wide range of species. Indeed, whilst whole-genome sequencing is an exciting prospect for the future, for most non-model organisms classical markers such as mtDNA remain widely used. By compiling existing data from multiple original studies, it is possible to build powerful new datasets capable of exploring many questions in ecology, evolution and conservation biology. One key question that these data can help inform is what happened in a species' demographic past. However, compiling data in this manner is not trivial; there are many complexities associated with data extraction, data quality and data handling.

Results: Here we present the mtDNAcombine package, a collection of tools developed to manage some of the major decisions associated with handling multi-study sequence data, with a particular focus on preparing sequence data for Bayesian skyline plot demographic reconstructions.

Conclusions: There is now more genetic information available than ever before, and large meta-datasets offer great opportunities to explore new and exciting avenues of research. However, compiling multi-study datasets remains a technically challenging prospect. The mtDNAcombine package provides a pipeline to streamline the process of downloading, curating, and analysing sequence data, guiding the process of compiling data sets from the online database GenBank.
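mtDNAcombine itself is an R package, so the following is not its API; purely to illustrate the first step of such a pipeline (bulk retrieval of mtDNA records from GenBank), here is a hedged sketch using Biopython's Entrez utilities:

```python
from Bio import Entrez, SeqIO

Entrez.email = "you@example.org"  # NCBI requires a contact address

def fetch_mtdna(term, retmax=100):
    """Search GenBank for records matching `term` (e.g. a species name
    plus a gene) and return the parsed SeqRecords."""
    ids = Entrez.read(Entrez.esearch(db="nucleotide", term=term,
                                     retmax=retmax))["IdList"]
    handle = Entrez.efetch(db="nucleotide", id=ids, rettype="gb",
                           retmode="text")
    return list(SeqIO.parse(handle, "genbank"))
```

The curation and alignment decisions that follow this download step are where a dedicated pipeline such as mtDNAcombine adds its value.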

