Scalable Reordering Models for SMT based on Multiclass SVM

Prague Bulletin of Mathematical Linguistics ◽

10.1515/pralin-2015-0004 ◽

2015 ◽

Vol 103 (1) ◽

pp. 65-84 ◽

Cited By ~ 1

Author(s):

Abdullah Alrajeh ◽

Mahesan Niranjan

Keyword(s):

Large Scale ◽

Statistical Machine Translation ◽

Classification Problem ◽

Training Data ◽

Support Vector ◽

Recent Developments ◽

Dual Coordinate Descent ◽

Baseline System ◽

Multiclass Svm ◽

Translation Systems

Abstract: In state-of-the-art phrase-based statistical machine translation systems, modelling phrase reordering is important for the naturalness of the translated output, particularly when the grammatical structures of the two languages differ significantly. Posing phrase movements as a classification problem, we exploit recent developments in solving large-scale multiclass support vector machines. Using dual coordinate descent methods for learning, we provide a mechanism to shrink the amount of training data required for each iteration. Hence, we produce significant computational savings while preserving the accuracy of the models. Our approach is a couple of times faster than the maximum entropy approach and more memory-efficient (50% reduction). Experiments were carried out on an Arabic-English corpus with more than a quarter of a billion words. We achieve BLEU score improvements on top of a strong baseline system with sparse reordering features.
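To make the idea of dual coordinate descent with shrinking concrete, the following is a minimal sketch and not the authors' implementation. It is simplified to a binary L1-loss linear SVM, whereas the paper works with a multiclass formulation. The function names, the toy stopping rule, and all parameter defaults are assumptions chosen for illustration.

```python
import random


def train_dcd_svm(X, y, C=1.0, epochs=100, tol=1e-6, seed=0):
    """Binary L1-loss linear SVM trained by dual coordinate descent.

    Illustrative simplification: examples whose dual variable sits at a
    bound and whose gradient says it will stay there are dropped from
    the active set ("shrinking"), so later passes touch less data.
    """
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    w = [0.0] * d                  # primal weights, kept in sync with alpha
    alpha = [0.0] * n              # dual variables, each in [0, C]
    q = [sum(v * v for v in x) for x in X]   # diagonal of the Gram matrix
    active = list(range(n))
    inf = float("inf")
    pg_max_old, pg_min_old = inf, -inf
    for _ in range(epochs):
        rng.shuffle(active)
        pg_max, pg_min = -inf, inf
        kept = []
        for i in active:
            # Gradient of the dual objective with respect to alpha[i].
            g = y[i] * sum(a * b for a, b in zip(w, X[i])) - 1.0
            if alpha[i] <= 0.0:
                if g > pg_max_old:
                    continue       # shrink: alpha[i] would stay at 0
                pg = min(g, 0.0)
            elif alpha[i] >= C:
                if g < pg_min_old:
                    continue       # shrink: alpha[i] would stay at C
                pg = max(g, 0.0)
            else:
                pg = g
            kept.append(i)
            pg_max, pg_min = max(pg_max, pg), min(pg_min, pg)
            if abs(pg) > 1e-12 and q[i] > 0.0:
                # Exact one-variable minimisation, clipped to [0, C].
                new_alpha = min(max(alpha[i] - g / q[i], 0.0), C)
                step = (new_alpha - alpha[i]) * y[i]
                w = [a + step * b for a, b in zip(w, X[i])]
                alpha[i] = new_alpha
        active = kept
        if not active or pg_max - pg_min <= tol:
            if len(active) == n:
                break              # converged on the full training set
            # Converged on the shrunken set only: re-check every example.
            active = list(range(n))
            pg_max_old, pg_min_old = inf, -inf
            continue
        pg_max_old = pg_max if pg_max > 0.0 else inf
        pg_min_old = pg_min if pg_min < 0.0 else -inf
    return w, alpha


def predict(w, x):
    """Sign of the linear score, mapped to the labels +1 / -1."""
    return 1 if sum(a * b for a, b in zip(w, x)) >= 0.0 else -1
```

Labels are expected to be +1 or -1, and a constant feature can be appended to each input vector to act as a bias term. The shrinking test here follows the commonly used projected-gradient heuristic; the paper's own data-reduction mechanism may differ.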

Source-Side Discontinuous Phrases for Machine Translation: A Comparative Study on Phrase Extraction and Search

Prague Bulletin of Mathematical Linguistics ◽

10.2478/pralin-2013-0002 ◽

2013 ◽

Vol 99 (1) ◽

pp. 17-38

Author(s):

Matthias Huck ◽

Erik Scharwächter ◽

Hermann Ney

Keyword(s):

Machine Translation ◽

Large Scale ◽

Search Algorithm ◽

Statistical Machine Translation ◽

Empirical Evaluation ◽

Training Data ◽

Beam Search ◽

Phrase Extraction ◽

System Configurations ◽

Translation Systems

Abstract: Standard phrase-based statistical machine translation systems generate translations based on an inventory of continuous bilingual phrases. In this work, we extend a phrase-based decoder with the ability to make use of phrases that are discontinuous in the source part. Our dynamic programming beam search algorithm supports separate pruning of coverage hypotheses per cardinality and of lexical hypotheses per coverage, as well as coverage constraints that impose restrictions on the possible reorderings. In addition to investigating these aspects, which are related to the decoding procedure, we also concentrate our attention on the question of how to obtain source-side discontinuous phrases from parallel training data. Two approaches (hierarchical and discontinuous extraction) are presented and compared. On a large-scale Chinese→English translation task, we conduct a thorough empirical evaluation in order to study a number of system configurations with source-side discontinuous phrases, and to compare them to setups which employ continuous phrases only.

Linguistically Annotated Reordering: Evaluation and Analysis

Computational Linguistics ◽

10.1162/coli_a_00009 ◽

2010 ◽

Vol 36 (3) ◽

pp. 535-568 ◽

Cited By ~ 2

Author(s):

Deyi Xiong ◽

Min Zhang ◽

Aiti Aw ◽

Haizhou Li

Keyword(s):

Large Scale ◽

Statistical Machine Translation ◽

Training Data ◽

Linguistic Knowledge ◽

Analysis Method ◽

Complementary Information ◽

New Approach ◽

Parse Trees ◽

New Challenges ◽

Insight Into

Linguistic knowledge plays an important role in phrase movement in statistical machine translation. To efficiently incorporate linguistic knowledge into phrase reordering, we propose a new approach: Linguistically Annotated Reordering (LAR). In LAR, we build hard hierarchical skeletons and inject soft linguistic knowledge from source parse trees to nodes of hard skeletons during translation. The experimental results on large-scale training data show that LAR is comparable to boundary word-based reordering (BWR) (Xiong, Liu, and Lin 2006), which is a very competitive lexicalized reordering approach. When combined with BWR, LAR provides complementary information for phrase reordering, which collectively improves the BLEU score significantly. To further understand the contribution of linguistic knowledge in LAR to phrase reordering, we introduce a syntax-based analysis method to automatically detect constituent movement in both reference and system translations, and summarize syntactic reordering patterns that are captured by reordering models. With the proposed analysis method, we conduct a comparative analysis that not only provides insight into how linguistic knowledge affects phrase movement but also reveals new challenges in phrase reordering.

AUGMENTATIVE AND ALTERNATIVE COMMUNICATION METHOD BASED ON TONGUE CLICKING FOR MUTE DISABILITIES

IIUM Engineering Journal ◽

10.31436/iiumej.v20i1.1021 ◽

2019 ◽

Vol 20 (1) ◽

pp. 119-128

Author(s):

NIK NUR WAHIDAH NIK HASHIM ◽

MUHAMMAD AMIRUL AMIN AZMI ◽

HAZLINA MD. YUSOF

Keyword(s):

Amplitude Modulation ◽

Augmentative And Alternative Communication ◽

Training Data ◽

Support Vector ◽

Classification Rate ◽

Data Set ◽

Zero Crossing ◽

Svm Classification ◽

Development Data ◽

Multiclass Svm

This paper presents a pilot study of a novel application that converts tongue-clicking sounds to words for people who are unable to speak. Fifteen speech features related to speech timing patterns, amplitude modulation, zero crossing, and peak detection were extracted. The experiments were conducted with three different patterns using binary Support Vector Machine (SVM) classification, with 10 recordings as training data and 10 recordings as development data. Peak size outperformed all other features, with an 85% classification rate for pattern P1-P3, whereas multiple features produced a 100% classification rate for P1-P2 and P2-P3. A GUI-based system was developed to validate the trained classifier. A multiclass SVM was constructed based on the best features obtained from the binary SVM classification outcome, namely peak size and skewness amplitude modulation, and then tested on 15 recordings. The GUI-based multiclass SVM obtained a satisfactory performance of 67% correct classification on the test data set.

A new parallel data geometry analysis algorithm to select training data for support vector machine

AIMS Mathematics ◽

10.3934/math.2021806 ◽

2021 ◽

Vol 6 (12) ◽

pp. 13931-13953

Author(s):

Yunfeng Shi ◽

Shu Lv ◽

Kaibo Shi ◽

...

Keyword(s):

Support Vector Machine ◽

Large Scale ◽

Computational Cost ◽

Training Data ◽

Support Vector ◽

Training Set ◽

Redundant Data ◽

Parallel Data ◽

Low Efficiency ◽

Geometry Analysis

Support vector machine (SVM) is one of the most powerful machine learning techniques and has attracted wide attention because of its remarkable performance. However, on classification problems with large-scale datasets, the high complexity of the SVM model leads to low efficiency and makes it impractical. Exploiting the sparsity of SVM in the sample space, this paper presents a new parallel data geometry analysis (PDGA) algorithm to reduce the SVM training set, which helps to improve the efficiency of SVM training. PDGA introduces the Mahalanobis distance to measure the distance from each sample to its centroid and, based on this, proposes a method that identifies non-support vectors and outliers at the same time to help remove redundant data. When the training set is further reduced, a cosine angle distance analysis method is proposed to determine whether samples are redundant, ensuring that valuable data are not removed. Unlike previous data geometry analysis methods, the PDGA algorithm is implemented in parallel, which greatly reduces the computational cost. Experimental results on an artificial dataset and 6 real datasets show that the algorithm can adapt to different sample distributions, significantly reducing training time and memory requirements without sacrificing classification accuracy, and that its performance is clearly better than that of five other competing algorithms.
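As an illustration of the distance measure named in this abstract, the sketch below computes the Mahalanobis distance from each sample to the centroid of its set. It is restricted to two-dimensional points so that the covariance matrix can be inverted by hand without external libraries. The function name is an assumption, and the rule PDGA uses to turn these distances into a decision about which samples to remove is not stated in the abstract, so it is not reproduced here.

```python
import math


def mahalanobis_to_centroid(points):
    """Mahalanobis distance of each 2-D point to the centroid of `points`.

    Uses the sample covariance (divisor n - 1) and the closed-form
    inverse of a 2x2 matrix.
    """
    n = len(points)
    if n < 3:
        raise ValueError("need at least 3 points for a 2-D covariance")
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    det = sxx * syy - sxy * sxy
    if det <= 0.0:
        raise ValueError("covariance matrix is singular")
    distances = []
    for px, py in points:
        dx, dy = px - mx, py - my
        # (dx, dy) @ inverse(covariance) @ (dx, dy)^T, expanded for 2x2.
        squared = (dx * dx * syy - 2.0 * dx * dy * sxy + dy * dy * sxx) / det
        distances.append(math.sqrt(squared))
    return distances
```

Unlike the Euclidean distance, this measure rescales each direction by the spread of the data, so a point far out along a high-variance direction is not automatically treated as unusual.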

Neural machine translation of low-resource languages using SMT phrase pair injection

Natural Language Engineering ◽

10.1017/s1351324920000303 ◽

2020 ◽

pp. 1-22

Author(s):

Sukanta Sen ◽

Mohammed Hasanuzzaman ◽

Asif Ekbal ◽

Pushpak Bhattacharyya ◽

Andy Way

Keyword(s):

Machine Translation ◽

Large Scale ◽

Production Systems ◽

Statistical Machine Translation ◽

Training Data ◽

Original Training ◽

Neural Machine Translation ◽

Parallel Corpus ◽

Low Resource ◽

Better Than

Abstract: Neural machine translation (NMT) has recently shown promising results on publicly available benchmark datasets and is being rapidly adopted in various production systems. However, it requires a high-quality large-scale parallel corpus, and it is not always possible to obtain a sufficiently large corpus, as building one requires time, money, and professionals. Hence, many existing large-scale parallel corpora are limited to specific languages and domains. In this paper, we propose an effective approach to improve an NMT system in a low-resource scenario without using any additional data. Our approach aims at augmenting the original training data by means of parallel phrases extracted from the original training data itself using a statistical machine translation (SMT) system. Our proposed approach is based on the gated recurrent unit (GRU) and transformer networks. We choose the Hindi–English and Hindi–Bengali datasets for the Health, Tourism, and Judicial (only for Hindi–English) domains. We train our NMT models for 10 translation directions, each using only 5–23k parallel sentences. Experiments show improvements in the range of 1.38–15.36 BiLingual Evaluation Understudy points over the baseline systems. Experiments also show that transformer models perform better than GRU models in low-resource scenarios. In addition, we find that our proposed method outperforms SMT, which is known to work better than neural models in low-resource scenarios, for some translation directions. To further show the effectiveness of our proposed model, we also apply our approach to another interesting NMT task, old-to-modern English translation, using a tiny parallel corpus of only 2.7K sentences. For this task, we use publicly available old-modern English text which is approximately 1000 years old. Evaluation for this task shows significant improvement over the baseline NMT.

A Hierarchical Approach to Protein Fold Prediction

Journal of Integrative Bioinformatics ◽

10.1515/jib-2011-185 ◽

2011 ◽

Vol 8 (1) ◽

pp. 66-77 ◽

Cited By ~ 1

Author(s):

Tabrez Anwar Shamim Mohammad ◽

Hampapathalu Adimurthy Nagarajaram

Keyword(s):

Prediction Accuracy ◽

Large Scale ◽

Fold Recognition ◽

Classification Problem ◽

New Method ◽

Support Vector ◽

Protein Fold ◽

Hierarchical Approach ◽

Structural Class ◽

Fold Prediction

Summary: Fold recognition, assigning novel proteins to known structures, forms an important component of the overall protein structure discovery process. The available methods for protein fold recognition are limited by low fold coverage and/or low prediction accuracy. We describe here a new Support Vector Machine (SVM)-based method for protein fold prediction with high prediction accuracy and high fold coverage. The new method was developed by training and testing on a large number of folds in order to make it suitable for large-scale fold prediction. However, the presence of a large number of folds in the training set made the classification task difficult as a consequence of the increased complexity involved in the binary classifications of SVMs. To overcome this complexity, we adopted a hierarchical approach where fold prediction is made in two steps. In the first step the structural class of the query is predicted, and in the second step the fold is predicted within the predicted structural class. This decreased the complexity of the classification problem and also improved the overall fold prediction accuracy. To the best of our knowledge, this is the first taxonomic fold recognition method to cover over 700 protein folds, and it gives a prediction accuracy of around 70% on a benchmark dataset. Since the new method gives state-of-the-art prediction performance, it can be very useful for structural characterization of proteins discovered in various genomes.
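The two-step scheme described in this summary can be sketched as a small dispatch routine. This is only an outline of the control flow under stated assumptions: the classifiers are passed in as plain callables, and the toy models, feature names, and labels (`class_A`, `fold_A1`, and so on) are hypothetical placeholders, not the SVMs or the fold taxonomy used in the paper.

```python
def predict_fold(query, class_model, fold_models):
    """Two-step hierarchical prediction.

    Step 1: predict the structural class of the query.
    Step 2: predict the fold using only the model trained for that class.
    """
    structural_class = class_model(query)
    fold_model = fold_models[structural_class]
    return structural_class, fold_model(query)


# Hypothetical stand-ins for trained classifiers, for illustration only.
def toy_class_model(query):
    return "class_A" if query["f1"] > 0.5 else "class_B"


toy_fold_models = {
    "class_A": lambda query: "fold_A1" if query["f2"] > 0.0 else "fold_A2",
    "class_B": lambda query: "fold_B1",
}
```

Because each second-stage model only has to separate the folds inside one structural class, it faces far fewer competing labels than a single flat classifier over all folds would.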

Klasifikasi Jenis Pantun Dengan Metode Support Vector Machines (SVM) [Classification of Pantun Types Using the Support Vector Machines (SVM) Method]

Jurnal RESTI (Rekayasa Sistem dan Teknologi Informasi) ◽

10.29207/resti.v4i5.2313 ◽

2020 ◽

Vol 4 (5) ◽

pp. 915-922

Author(s):

Helena Nurramdhani Irmanda ◽

Ria Astriratma

Keyword(s):

Support Vector Machines ◽

Training Data ◽

Classification Model ◽

Support Vector ◽

Processing Stage ◽

Stop Word ◽

Vector Machines ◽

Testing Data ◽

Extraction Stage ◽

Multiclass Svm

This study aims to create a model for categorizing pantun types and to analyze the accuracy of support vector machines (SVM). The first stage is collecting pantun that have been labeled with a pantun category. The categories consist of pantun for children, pantun for young people, and pantun for elders. After data collection, the next stage is pre-processing, which prepares the data for the extraction stage. The pre-processing stage consists of text segmentation, case folding, tokenization, stop word removal, and stemming. The feature extraction stage is intended to analyze potential information and represent terms as a vector. The data must be split into training data and testing data before the classification process. Classification is then performed using a multiclass SVM. The classification results are evaluated to obtain the accuracy and analyzed to determine whether the classification model is suitable for use. The results showed that SVM classified the types of pantun with an accuracy of 81.91%.
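Part of the pre-processing and feature extraction pipeline listed in this abstract can be sketched as follows. The sketch covers case folding, tokenization, stop word removal, and a simple term-count vector. Text segmentation and stemming are omitted because they depend on language-specific resources, and the stop word list, function names, and raw-count weighting are assumptions made for illustration rather than details taken from the study.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop word list for illustration only.
STOP_WORDS = {"di", "ke", "yang", "dan"}


def preprocess(text, stop_words=STOP_WORDS):
    """Case folding, tokenization, and stop word removal."""
    lowered = text.lower()                  # case folding
    tokens = re.findall(r"[a-z]+", lowered)  # keep alphabetic tokens only
    return [t for t in tokens if t not in stop_words]


def term_frequencies(tokens, vocabulary):
    """Represent a token list as a count vector over a fixed vocabulary."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]
```

A vector produced this way, one per pantun and all over the same vocabulary, is the kind of input a multiclass SVM would then be trained on.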

Support Vector Machine Classifier with WHM Offset for Unbalanced Data

Journal of Advanced Computational Intelligence and Intelligent Informatics ◽

10.20965/jaciii.2008.p0094 ◽

2008 ◽

Vol 12 (1) ◽

pp. 94-101 ◽

Cited By ~ 3

Author(s):

Boyang Li ◽

Jinglu Hu ◽

Kotaro Hirasawa

Keyword(s):

Support Vector Machine ◽

Real World ◽

Support Vector Machine Classifier ◽

Classification Problem ◽

Training Data ◽

Support Vector ◽

Svm Classifier ◽

Real World Data ◽

Unbalanced Classification ◽

Improved Support Vector Machine

We propose an improved support vector machine (SVM) classifier by introducing a new offset, for solving the real-world unbalanced classification problem. The new offset is calculated based on the unbalanced support vectors resulting from the unbalanced training data. We developed a weighted harmonic mean (WHM) algorithm to further reduce the effects of noise on offset calculation. We apply the proposed approach to classify real-world data. Results of simulation demonstrate the effectiveness of our proposed approach.
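For reference, the standard definition of a weighted harmonic mean is sketched below. The abstract does not say which quantities are averaged, how the weights are chosen, or how the result enters the offset calculation, so the function shows only the generic formula; its name and input checks are assumptions.

```python
def weighted_harmonic_mean(values, weights):
    """Weighted harmonic mean: sum(w_i) / sum(w_i / x_i).

    All values must be positive, all weights non-negative, and at least
    one weight must be positive.
    """
    if len(values) != len(weights) or not values:
        raise ValueError("values and weights must be non-empty and equal in length")
    if any(v <= 0 for v in values):
        raise ValueError("values must be positive")
    if any(w < 0 for w in weights) or sum(weights) <= 0:
        raise ValueError("weights must be non-negative with a positive sum")
    return sum(weights) / sum(w / v for w, v in zip(weights, values))
```

The harmonic mean is pulled toward the smaller inputs, so a few large, possibly noisy values influence it less than they would influence an arithmetic mean.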

DOA estimation based on support vector machine: Large scale multiclass classification problem

2011 IEEE International Conference on Signal Processing, Communications and Computing (ICSPCC) ◽

10.1109/icspcc.2011.6061689 ◽

2011 ◽

Cited By ~ 1

Author(s):

Du Jin-xiang ◽

Feng Xi-an ◽

Ma Yan

Keyword(s):

Support Vector Machine ◽

Large Scale ◽

Classification Problem ◽

Multiclass Classification ◽

Support Vector

Tree Kernels for Semantic Role Labeling

Computational Linguistics ◽

10.1162/coli.2008.34.2.193 ◽

2008 ◽

Vol 34 (2) ◽

pp. 193-224 ◽

Cited By ~ 58

Author(s):

Alessandro Moschitti ◽

Daniele Pighin ◽

Roberto Basili

Keyword(s):

Language Processing ◽

Large Scale ◽

Kernel Functions ◽

Feature Representation ◽

Training Data ◽

Support Vector ◽

Feature Engineering ◽

Semantic Role ◽

Learning Approaches ◽

Semantic Role Labeling

The availability of large scale data sets of manually annotated predicate-argument structures has recently favored the use of machine learning approaches to the design of automated semantic role labeling (SRL) systems. The main research in this area relates to the design choices for feature representation and for effective decompositions of the task in different learning models. Regarding the former choice, structural properties of full syntactic parses are largely employed as they represent ways to encode different principles suggested by the linking theory between syntax and semantics. The latter choice relates to several learning schemes over global views of the parses. For example, re-ranking stages operating over alternative predicate-argument sequences of the same sentence have been shown to be very effective. In this article, we propose several kernel functions to model parse tree properties in kernel-based machines, for example, perceptrons or support vector machines. In particular, we define different kinds of tree kernels as general approaches to feature engineering in SRL. Moreover, we extensively experiment with such kernels to investigate their contribution to individual stages of an SRL architecture both in isolation and in combination with other traditional manually coded features. The results for boundary recognition, classification, and re-ranking stages provide systematic evidence about the significant impact of tree kernels on the overall accuracy, especially when the amount of training data is small. As a conclusive result, tree kernels allow for a general and easily portable feature engineering method which is applicable to a large family of natural language processing tasks.