Machine Learning

Author(s):  
Raymond J. Mooney

This chapter introduces symbolic machine learning in which decision trees, rules, or case-based classifiers are induced from supervised training examples. It describes the representation of knowledge assumed by each of these approaches and reviews basic algorithms for inducing such representations from annotated training examples and using the acquired knowledge to classify future instances. It also briefly reviews unsupervised learning, in which new concepts are formed from unannotated examples by clustering them into coherent groups. These techniques can be applied to learn knowledge required for a variety of problems in computational linguistics ranging from part-of-speech tagging and syntactic parsing to word sense disambiguation and anaphora resolution. Applications to a variety of these problems are reviewed.

Author(s):  
Raymond J. Mooney

This article introduces symbolic machine learning, in which decision trees, rules, or case-based classifiers are induced from supervised training examples. It describes the representation of knowledge assumed by each of these approaches and reviews basic algorithms for inducing such representations from annotated training examples and using the acquired knowledge to classify future instances. Machine learning is the study of computational systems that improve performance on some task with experience. Most machine learning methods concern the task of categorizing examples described by a set of features. These techniques can be applied to learn the knowledge required for a variety of problems in computational linguistics. Finally, this article reviews applications to several such problems: morphology, part-of-speech tagging, word-sense disambiguation, syntactic parsing, semantic parsing, information extraction, and anaphora resolution.
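The case-based approach mentioned above can be illustrated with a minimal sketch: a new instance receives the class of its most similar training example, with similarity measured as feature overlap. The feature names and toy examples below are hypothetical, not the article's own data.

```python
def overlap(a, b):
    """Similarity = number of features on which two instances agree."""
    return sum(1 for f in a if a[f] == b.get(f))

def classify(instance, training):
    """Case-based (1-nearest-neighbour) classification by feature overlap."""
    best = max(training, key=lambda ex: overlap(instance, ex[0]))
    return best[1]

# Hypothetical toy data: guessing a token's POS from simple contextual features
training = [
    ({"prev": "the", "suffix": "ing"}, "NOUN"),   # e.g. "the meeting"
    ({"prev": "is",  "suffix": "ing"}, "VERB"),   # e.g. "is running"
    ({"prev": "the", "suffix": "s"},   "NOUN"),   # e.g. "the cats"
]

print(classify({"prev": "is", "suffix": "ing"}, training))  # VERB
```

Decision-tree and rule induction differ in how they generalize from the training set, but share this supervised classify-by-features framing.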


Author(s):  
Mark Stevenson ◽  
Yorick Wilks

Word-sense disambiguation (WSD) is the process of identifying the meanings of words in context. This article begins by discussing the origins of the problem in the earliest machine translation systems. Early attempts to solve the WSD problem suffered from a lack of coverage. The main approaches to the problem have been dictionary-based, connectionist, and statistical strategies. The article concludes with a review of evaluation strategies for WSD and possible applications of the technology. WSD is an ‘intermediate’ task in language processing: like part-of-speech tagging or syntactic analysis, it is unlikely that anyone other than linguists would be interested in its results for their own sake. ‘Final’ tasks, by contrast, produce results of use to those without a specific interest in language, and often make use of ‘intermediate’ tasks such as WSD. WSD is thus a long-standing and important problem in the field of language processing.
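The dictionary-based strategy can be sketched along the lines of the simplified Lesk algorithm: choose the sense whose dictionary gloss shares the most words with the context. The glosses and context below are hypothetical toy data, not from any particular dictionary.

```python
def simplified_lesk(context_words, glosses):
    """Pick the sense whose gloss shares the most words with the context."""
    ctx = set(context_words)
    return max(glosses, key=lambda sense: len(ctx & glosses[sense]))

# Hypothetical glosses for the ambiguous word "bank"
glosses = {
    "bank/finance": {"money", "deposit", "institution", "loan"},
    "bank/river":   {"sloping", "land", "beside", "river", "water"},
}
context = "he sat on the land beside the river".split()
print(simplified_lesk(context, glosses))  # bank/river
```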


2021 ◽  
Vol 8 (5) ◽  
pp. 1039
Author(s):  
Ilham Firmansyah ◽  
Putra Pandu Adikara ◽  
Sigit Adinugroho

Natural language is the language humans use in written or spoken form. The field of computer science that processes natural language is called Natural Language Processing (NLP), the study of how to process and extract information from human language. One of its core tasks is Part-of-Speech (POS) Tagging: the automatic assignment of a predefined set of tags (word classes) to the words of a sentence. Among other uses, this process helps identify words that have more than one meaning (ambiguity), and it underpins other NLP tasks such as machine translation, word sense disambiguation, and sentiment analysis. This study applies POS tagging to Madurese, a regional language spoken by the Madurese people whose morphology is similar to Indonesian. The POS tagger uses the Viterbi algorithm and consists of three steps: pre-processing of the training and testing data, training a Hidden Markov Model on the training data, and word-class classification with the Viterbi algorithm. The tagset used to classify Madurese words comprises 19 classes, designed by an expert. The system was evaluated with a multiclass confusion matrix, achieving a micro-averaged accuracy of 0.96 and equal micro-averaged precision and recall of 0.68. Precision and recall could be improved further by adding more training data.
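The Viterbi decoding step of an HMM tagger can be sketched as follows. This is a minimal pure-Python illustration with a hypothetical two-tag model, not the paper's 19-class Madurese tagset; unknown words are given a small floor emission probability.

```python
import math

def viterbi(words, tags, start, trans, emit):
    """Most likely tag sequence under an HMM, computed in log space."""
    cols = [{t: (math.log(start[t]) + math.log(emit[t].get(words[0], 1e-6)), [t])
             for t in tags}]
    for w in words[1:]:
        prev = cols[-1]
        cols.append({})
        for t in tags:
            # Best predecessor tag for t, extending its back-pointer path
            score, path = max(
                (prev[p][0] + math.log(trans[p][t]) +
                 math.log(emit[t].get(w, 1e-6)), prev[p][1])
                for p in tags)
            cols[-1][t] = (score, path + [t])
    return max(cols[-1].values())[1]

# Hypothetical two-tag model (noun vs. verb)
tags = ["N", "V"]
start = {"N": 0.7, "V": 0.3}
trans = {"N": {"N": 0.3, "V": 0.7}, "V": {"N": 0.6, "V": 0.4}}
emit = {"N": {"dogs": 0.5, "fish": 0.5}, "V": {"fish": 0.4, "swim": 0.6}}
print(viterbi(["dogs", "fish"], tags, start, trans, emit))  # ['N', 'V']
```

In a real tagger the start, transition, and emission tables would be estimated from the training corpus rather than written by hand.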


Terminology ◽  
2001 ◽  
Vol 7 (1) ◽  
pp. 31-48 ◽  
Author(s):  
Jorge Vivaldi ◽  
Horacio Rodríguez

Two different reasons suggest that combining the performance of several term extractors could lead to an improvement in overall system accuracy. On the one hand, there is no clear agreement on whether to follow statistical, linguistic or hybrid approaches for (semi-) automatic term extraction. On the other hand, combining different knowledge sources (e.g. classifiers) has proved successful in improving the performance of individual sources on several NLP tasks (some of them closely related to or involved in term extraction), such as context-sensitive spelling correction, part-of-speech tagging, word sense disambiguation, parsing, text classification and filtering, etc. In this paper, we present a proposal for combining a number of different term extraction techniques in order to improve the accuracy of the resulting system. The approach has been applied to the domain of medicine for the Spanish language. A number of tests have been carried out with encouraging results.
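The combination idea can be sketched as a simple majority vote over component extractors. The component extractors below are hypothetical stubs standing in for statistical, linguistic, and lexicon-based modules; they are not the paper's actual extractors.

```python
from collections import Counter

def vote(candidate, extractors):
    """Accept a candidate term if a strict majority of extractors accept it."""
    votes = Counter(e(candidate) for e in extractors)
    return votes[True] > len(extractors) / 2

# Hypothetical stand-ins for the component extractors
def statistical(term):   # e.g. a frequency/length filter
    return len(term.split()) <= 3

def linguistic(term):    # e.g. a POS-pattern filter
    return term[0].isalpha()

def lexicon(term):       # e.g. lookup in a medical lexicon
    return term in {"miocardio", "arteria coronaria"}

extractors = [statistical, linguistic, lexicon]
print(vote("arteria coronaria", extractors))  # True
print(vote("1 2 3 4 5", extractors))          # False
```

More refined combination schemes weight each source by its estimated reliability instead of counting votes equally.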


Author(s):  
Marina Sokolova ◽  
Stan Szpakowicz

This chapter presents applications of machine learning techniques to traditional problems in natural language processing, including part-of-speech tagging, entity recognition and word-sense disambiguation. People usually solve such problems without difficulty or at least do a very good job. Linguistics may suggest labour-intensive ways of manually constructing rule-based systems. It is, however, the easy availability of large collections of texts that has made machine learning a method of choice for processing volumes of data well above the human capacity. One of the main purposes of text processing is all manner of information extraction and knowledge extraction from such large text. Machine learning methods discussed in this chapter have stimulated wide-ranging research in natural language processing and helped build applications with serious deployment potential.


2002 ◽  
Vol 8 (4) ◽  
pp. 293-310 ◽  
Author(s):  
DAVID YAROWSKY ◽  
RADU FLORIAN

This paper presents a comprehensive empirical exploration and evaluation of a diverse range of data characteristics which influence word sense disambiguation performance. It focuses on a set of six core supervised algorithms, including three variants of Bayesian classifiers, a cosine model, non-hierarchical decision lists, and an extension of the transformation-based learning model. Performance is investigated in detail with respect to the following parameters: (a) target language (English, Spanish, Swedish and Basque); (b) part of speech; (c) sense granularity; (d) inclusion and exclusion of major feature classes; (e) variable context width (further broken down by part-of-speech of keyword); (f) number of training examples; (g) baseline probability of the most likely sense; (h) sense distributional entropy; (i) number of senses per keyword; (j) divergence between training and test data; (k) degree of (artificially introduced) noise in the training data; (l) the effectiveness of an algorithm's confidence rankings; and (m) a full keyword breakdown of the performance of each algorithm. The paper concludes with a brief analysis of similarities, differences, strengths and weaknesses of the algorithms and a hierarchical clustering of these algorithms based on agreement of sense classification behavior. Collectively, the paper constitutes the most comprehensive survey of evaluation measures and tests yet applied to sense disambiguation algorithms, and it does so over a diverse range of supervised algorithms, languages and parameter spaces in a single unified experimental framework.
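The non-hierarchical decision-list model can be sketched as follows: each feature is ranked by the magnitude of its log-likelihood ratio between two senses, and a test instance is classified by the highest-ranked rule whose feature it contains. The training examples below are hypothetical toy data, not the paper's corpora.

```python
import math

def decision_list(examples, smoothing=0.1):
    """Yarowsky-style decision list: rank each feature by the magnitude of
    its log-likelihood ratio log(P(sense1|f) / P(sense2|f))."""
    counts = {}
    for feats, sense in examples:
        for f in feats:
            c = counts.setdefault(f, {"s1": 0, "s2": 0})
            c[sense] += 1
    rules = []
    for f, c in counts.items():
        llr = math.log((c["s1"] + smoothing) / (c["s2"] + smoothing))
        rules.append((abs(llr), f, "s1" if llr > 0 else "s2"))
    return sorted(rules, reverse=True)

def classify(feats, rules, default="s1"):
    """Apply the single highest-ranked rule whose feature is present."""
    for _, f, sense in rules:
        if f in feats:
            return sense
    return default

# Hypothetical training data for "plant": s1 = factory, s2 = flora
examples = [
    ({"worker", "union"}, "s1"),
    ({"manufacturing", "worker"}, "s1"),
    ({"leaf", "green"}, "s2"),
    ({"grow", "leaf"}, "s2"),
]
rules = decision_list(examples)
print(classify({"leaf", "soil"}, rules))     # s2
print(classify({"worker", "shift"}, rules))  # s1
```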


Author(s):  
GEORGIOS PETASIS ◽  
GEORGIOS PALIOURAS ◽  
VANGELIS KARKALETSIS ◽  
CONSTANTINE D. SPYROPOULOS ◽  
ION ANDROUTSOPOULOS

2011 ◽  
Vol 18 (4) ◽  
pp. 521-548 ◽  
Author(s):  
SANDRA KÜBLER ◽  
EMAD MOHAMED

This paper presents an investigation of part of speech (POS) tagging for Arabic as it occurs naturally, i.e. unvocalized text (without diacritics). We also do not assume any prior tokenization, although this was used previously as a basis for POS tagging. Arabic is a morphologically complex language, i.e. there is a high number of inflections per word; and the tagset is larger than the typical tagset for English. Both factors, the second one being partly dependent on the first, increase the number of word/tag combinations, for which the POS tagger needs to find estimates, and thus they contribute to data sparseness. We present a novel approach to Arabic POS tagging that does not require any pre-processing, such as segmentation or tokenization: whole word tagging. In this approach, the complete word is assigned a complex POS tag, which includes morphological information. A competing approach investigates the effect of segmentation and vocalization on POS tagging to alleviate data sparseness and ambiguity. In the segmentation-based approach, we first automatically segment words and then POS-tag the segments. The complex tagset encompasses 993 POS tags, whereas the segment-based tagset encompasses only 139 tags. However, segments are also more ambiguous, thus there are more possible combinations of segment tags. In realistic situations, in which we have no information about segmentation or vocalization, whole word tagging reaches the highest accuracy of 94.74%. If gold standard segmentation or vocalization is available, including this information improves POS tagging accuracy. However, while our automatic segmentation and vocalization modules reach state-of-the-art performance, their performance is not reliable enough for POS tagging and actually impairs POS tagging performance. Finally, we investigate whether a reduction of the complex tagset to the Extra-Reduced Tagset as suggested by Habash and Rambow (Habash, N., and Rambow, O. 2005. 
Arabic tokenization, part-of-speech tagging and morphological disambiguation in one fell swoop. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL), Ann Arbor, MI, USA, pp. 573–80) will alleviate the data sparseness problem. While the POS tagging accuracy increases due to the smaller tagset, a closer look shows that using a complex tagset for POS tagging and then converting the resulting annotation to the smaller tagset results in a higher accuracy than tagging using the smaller tagset directly.
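The tag-then-convert result can be illustrated with a minimal sketch: tag with the complex tagset, then map each prediction onto the reduced tagset before evaluation. The tag strings and the '+'-delimited format below are hypothetical; the actual complex tagset contains 993 tags.

```python
def to_reduced(tag):
    """Hypothetical conversion: keep only the core POS of a complex
    morphological tag, e.g. 'NOUN+fem+sg+gen' -> 'NOUN'."""
    return tag.split("+", 1)[0]

# Predictions made with the complex tagset, then converted for scoring
predicted = ["NOUN+fem+sg+gen", "VERB+perf+3+masc+sg", "NOUN+masc+pl+nom"]
gold_reduced = ["NOUN", "VERB", "NOUN"]

converted = [to_reduced(t) for t in predicted]
accuracy = sum(p == g for p, g in zip(converted, gold_reduced)) / len(gold_reduced)
print(accuracy)  # 1.0
```

The paper's observation is that scoring converted complex-tagset predictions this way beats tagging with the reduced tagset directly.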


2016 ◽  
Vol 42 (2) ◽  
pp. 245-275 ◽  
Author(s):  
Diana McCarthy ◽  
Marianna Apidianaki ◽  
Katrin Erk

Word sense disambiguation and the related field of automated word sense induction traditionally assume that the occurrences of a lemma can be partitioned into senses. But this seems to be a much easier task for some lemmas than others. Our work builds on recent work that proposes describing word meaning in a graded fashion rather than through a strict partition into senses; in this article we argue that not all lemmas may need the more complex graded analysis, depending on their partitionability. Although there is plenty of evidence from previous studies and from the linguistics literature that there is a spectrum of partitionability of word meanings, this is the first attempt to measure the phenomenon and to couple the machine learning literature on clusterability with word usage data used in computational linguistics. We propose to operationalize partitionability as clusterability, a measure of how easy the occurrences of a lemma are to cluster. We test two ways of measuring clusterability: (1) existing measures from the machine learning literature that aim to measure the goodness of optimal k-means clusterings, and (2) the idea that if a lemma is more clusterable, two clusterings based on two different “views” of the same data points will be more congruent. The two views that we use are two different sets of manually constructed lexical substitutes for the target lemma, on the one hand monolingual paraphrases, and on the other hand translations. We apply automatic clustering to the manual annotations. We use manual annotations because we want the representations of the instances that we cluster to be as informative and “clean” as possible. We show that when we control for polysemy, our measures of clusterability tend to correlate with partitionability, in particular some of the type-(1) clusterability measures, and that these measures outperform a baseline that relies on the amount of overlap in a soft clustering.
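The two-views congruence idea, the second clusterability measure above, can be sketched with the Rand index: the fraction of item pairs on which two clusterings of the same instances agree about whether the pair belongs together. The cluster assignments below are hypothetical toy data, not the article's annotated substitutes.

```python
from itertools import combinations

def rand_index(a, b):
    """Agreement of two clusterings over the same items: the fraction of
    item pairs placed same-cluster/different-cluster consistently."""
    pairs = list(combinations(range(len(a)), 2))
    agree = sum((a[i] == a[j]) == (b[i] == b[j]) for i, j in pairs)
    return agree / len(pairs)

# Hypothetical usages of one lemma, clustered once via paraphrase
# substitutes and once via translation substitutes
by_paraphrase  = [0, 0, 1, 1, 2, 2]
by_translation = [0, 0, 1, 1, 2, 2]   # fully congruent -> highly partitionable
print(rand_index(by_paraphrase, by_translation))  # 1.0

incongruent = [0, 1, 0, 1, 2, 2]      # the two views disagree
print(rand_index(by_paraphrase, incongruent))
```

In practice a chance-corrected variant such as the adjusted Rand index is preferred, since the raw index is inflated when clusters are small.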


2020 ◽  
Vol 34 (05) ◽  
pp. 7724-7731
Author(s):  
Daniel Fernández-González ◽  
Carlos Gómez-Rodríguez

Among the most complex syntactic representations used in computational linguistics and NLP are discontinuous constituent trees, which are crucial for representing all grammatical phenomena of languages such as German. Recent advances in dependency parsing have shown that Pointer Networks excel at efficiently parsing syntactic relations between the words in a sentence. This kind of sequence-to-sequence model achieves outstanding accuracies in building non-projective dependency trees, but its potential has not yet been proved on a more difficult task. We propose a novel neural network architecture that, by means of Pointer Networks, is able to generate the most accurate discontinuous constituent representations to date, even without the need for part-of-speech tagging information. To do so, we internally model discontinuous constituent structures as augmented non-projective dependency structures. The proposed approach achieves state-of-the-art results on the two widely used NEGRA and TIGER benchmarks, outperforming previous work by a wide margin.
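The non-projectivity underlying the augmented dependency encoding can be illustrated with a minimal crossing-arcs check: a dependency tree is projective iff no two of its arcs cross. The toy trees below are hypothetical and unrelated to the paper's model.

```python
def is_projective(heads):
    """heads[i-1] = head index of word i (0 = root, words are 1-based).
    A dependency tree is projective iff no two arcs cross."""
    arcs = [(min(h, d), max(h, d)) for d, h in enumerate(heads, start=1)]
    for a1, b1 in arcs:
        for a2, b2 in arcs:
            if a1 < a2 < b1 < b2:   # arcs (a1,b1) and (a2,b2) cross
                return False
    return True

# 3-word tree: 1->2, 2->root, 3->2 (nested arcs, no crossing)
print(is_projective([2, 0, 2]))     # True
# 4-word tree: arcs 1-3 and 2-4 cross -> discontinuity
print(is_projective([3, 4, 0, 3]))  # False
```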

