Machine Learning

Author(s):  
Raymond J. Mooney

This article introduces the type of symbolic machine learning in which decision trees, rules, or case-based classifiers are induced from supervised training examples. It describes the representation of knowledge assumed by each of these approaches and reviews basic algorithms for inducing such representations from annotated training examples and for using the acquired knowledge to classify future instances. Machine learning is the study of computational systems that improve their performance on some task with experience. Most machine learning methods concern the task of categorizing examples described by a set of features. These techniques can be applied to learn the knowledge required for a variety of problems in computational linguistics, ranging from part-of-speech tagging and syntactic parsing to word-sense disambiguation and anaphora resolution. Finally, the article reviews applications to several of these problems, including morphology, part-of-speech tagging, word-sense disambiguation, syntactic parsing, semantic parsing, information extraction, and anaphora resolution.

Author(s):  
Raymond J. Mooney

This chapter introduces symbolic machine learning in which decision trees, rules, or case-based classifiers are induced from supervised training examples. It describes the representation of knowledge assumed by each of these approaches and reviews basic algorithms for inducing such representations from annotated training examples and using the acquired knowledge to classify future instances. It also briefly reviews unsupervised learning, in which new concepts are formed from unannotated examples by clustering them into coherent groups. These techniques can be applied to learn knowledge required for a variety of problems in computational linguistics ranging from part-of-speech tagging and syntactic parsing to word sense disambiguation and anaphora resolution. Applications to a variety of these problems are reviewed.
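As a rough illustration of how a decision tree can be induced from supervised examples, the sketch below scores candidate features by information gain, a criterion commonly used in ID3-style learners. The abstract does not say which splitting criterion the chapter covers, so this choice is an assumption, and the toy examples, feature names, and helper names (`entropy`, `information_gain`, `best_split`) are invented for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(examples, labels, feature):
    """Reduction in label entropy obtained by splitting the examples on one feature."""
    partitions = {}
    for example, label in zip(examples, labels):
        partitions.setdefault(example[feature], []).append(label)
    remainder = sum(len(part) / len(labels) * entropy(part) for part in partitions.values())
    return entropy(labels) - remainder


def best_split(examples, labels):
    """Pick the feature with the highest information gain (the root test of the tree)."""
    return max(examples[0].keys(), key=lambda f: information_gain(examples, labels, f))


# Invented toy data: tagging an ambiguous word from the tags of its neighbours.
examples = [
    {"prev_tag": "DET", "next_tag": "VERB"},
    {"prev_tag": "DET", "next_tag": "PUNCT"},
    {"prev_tag": "PRON", "next_tag": "VERB"},
    {"prev_tag": "PRON", "next_tag": "PUNCT"},
]
labels = ["NOUN", "NOUN", "AUX", "AUX"]

root_feature = best_split(examples, labels)
```

A full learner would apply `best_split` recursively to each partition until the labels in a partition are pure or no features remain.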


Author(s):  
Mark Stevenson ◽  
Yorick Wilks

Word-sense disambiguation (WSD) is the process of identifying the meanings of words in context. This article begins by discussing the origins of the problem in the earliest machine translation systems. Early attempts to solve the WSD problem suffered from a lack of coverage. The main approaches to tackling the problem have been dictionary-based, connectionist, and statistical strategies. The article concludes with a review of evaluation strategies for WSD and possible applications of the technology. WSD is an ‘intermediate’ task in language processing: as with part-of-speech tagging or syntactic analysis, it is unlikely that anyone other than linguists would be interested in its results for their own sake. ‘Final’ tasks produce results of use to those without a specific interest in language and often make use of ‘intermediate’ tasks. WSD is a long-standing and important problem in the field of language processing.


2002 ◽  
Vol 8 (4) ◽  
pp. 293-310 ◽  
Author(s):  
DAVID YAROWSKY ◽  
RADU FLORIAN

This paper presents a comprehensive empirical exploration and evaluation of a diverse range of data characteristics which influence word sense disambiguation performance. It focuses on a set of six core supervised algorithms, including three variants of Bayesian classifiers, a cosine model, non-hierarchical decision lists, and an extension of the transformation-based learning model. Performance is investigated in detail with respect to the following parameters: (a) target language (English, Spanish, Swedish and Basque); (b) part of speech; (c) sense granularity; (d) inclusion and exclusion of major feature classes; (e) variable context width (further broken down by part-of-speech of keyword); (f) number of training examples; (g) baseline probability of the most likely sense; (h) sense distributional entropy; (i) number of senses per keyword; (j) divergence between training and test data; (k) degree of (artificially introduced) noise in the training data; (l) the effectiveness of an algorithm's confidence rankings; and (m) a full keyword breakdown of the performance of each algorithm. The paper concludes with a brief analysis of similarities, differences, strengths and weaknesses of the algorithms and a hierarchical clustering of these algorithms based on agreement of sense classification behavior. Collectively, the paper constitutes the most comprehensive survey of evaluation measures and tests yet applied to sense disambiguation algorithms, and it does so over a diverse range of supervised algorithms, languages and parameter spaces in a single unified experimental framework.


2021 ◽  
Vol 8 (5) ◽  
pp. 1039
Author(s):  
Ilham Firmansyah ◽  
Putra Pandu Adikara ◽  
Sigit Adinugroho

Natural language is the language humans use, in written or spoken form. Many technologies and applications process it; the field concerned is called Natural Language Processing, the study of how to process and extract information from human language. One Natural Language Processing task is part-of-speech tagging, the automatic assignment of a word class (tag) from a predefined set to each word in a sentence. Among other things, this helps identify words that have more than one meaning depending on context (ambiguity). Part-of-speech tagging is the basis for other Natural Language Processing tasks, such as machine translation, word sense disambiguation, and sentiment analysis. It can be applied to any natural language, including Madurese, a regional language spoken by the Madurese people whose morphology is similar to that of Indonesian. This study applies the Viterbi algorithm to part-of-speech tagging for Madurese in three steps: pre-processing the training and testing data, estimating a Hidden Markov Model from the training data, and classifying word classes with the Viterbi algorithm. The tagset used for Madurese contains 19 word classes and was designed by an expert. The system was evaluated using a multiclass confusion matrix. It achieved a micro-averaged accuracy of 0.96, with micro-averaged precision and recall both equal to 0.68. Precision and recall could still be improved by adding more training data.
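The decoding step described above can be sketched as follows: a minimal first-order HMM Viterbi decoder working in log space. This is an illustrative sketch, not the authors' implementation. The toy English words, the three-tag tagset, all probability values, and the `floor` smoothing constant for unseen events are assumptions; the paper uses a 19-tag Madurese tagset with probabilities estimated from its own training corpus.

```python
import math


def viterbi(words, tags, start_p, trans_p, emit_p, floor=1e-6):
    """Return the most probable tag sequence for `words` under a first-order HMM.

    Probabilities missing from a table fall back to `floor` (an assumed,
    very crude form of smoothing for unseen words and transitions).
    """
    if not words:
        return []

    def logp(table, key):
        return math.log(table.get(key, floor))

    # best[i][t] = (log-probability of the best path ending in tag t at position i,
    #               the previous tag on that path)
    best = [{t: (logp(start_p, t) + logp(emit_p[t], words[0]), None) for t in tags}]
    for word in words[1:]:
        column = {}
        for t in tags:
            score, prev = max((best[-1][p][0] + logp(trans_p[p], t), p) for p in tags)
            column[t] = (score + logp(emit_p[t], word), prev)
        best.append(column)

    # Follow the back-pointers from the best final tag to recover the path.
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for column in reversed(best[1:]):
        tag = column[tag][1]
        path.append(tag)
    return path[::-1]


# Invented toy model; in practice these tables are estimated from tagged training data.
tags = ["DET", "NOUN", "VERB"]
start_p = {"DET": 0.6, "NOUN": 0.3, "VERB": 0.1}
trans_p = {
    "DET": {"DET": 0.05, "NOUN": 0.9, "VERB": 0.05},
    "NOUN": {"DET": 0.1, "NOUN": 0.3, "VERB": 0.6},
    "VERB": {"DET": 0.5, "NOUN": 0.4, "VERB": 0.1},
}
emit_p = {
    "DET": {"the": 0.9},
    "NOUN": {"dog": 0.5, "barks": 0.1},
    "VERB": {"dog": 0.05, "barks": 0.6},
}

predicted = viterbi(["the", "dog", "barks"], tags, start_p, trans_p, emit_p)
```

Working with log-probabilities avoids numerical underflow on longer sentences, which is why the scores are summed rather than multiplied.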


Terminology ◽  
2001 ◽  
Vol 7 (1) ◽  
pp. 31-48 ◽  
Author(s):  
Jorge Vivaldi ◽  
Horacio Rodríguez

Two different reasons suggest that combining the performance of several term extractors could lead to an improvement in overall system accuracy. On the one hand, there is no clear agreement on whether to follow statistical, linguistic or hybrid approaches for (semi-)automatic term extraction. On the other hand, combining different knowledge sources (e.g. classifiers) has proved successful in improving on the performance of individual sources in several NLP tasks (some of them closely related to or involved in term extraction), such as context-sensitive spelling correction, part-of-speech tagging, word sense disambiguation, parsing, and text classification and filtering. In this paper, we present a proposal for combining a number of different term extraction techniques in order to improve the accuracy of the resulting system. The approach has been applied to the domain of medicine for the Spanish language. A number of tests have been carried out with encouraging results.
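One simple way to combine several extractors is voting: keep a candidate term only if enough extractors propose it. The abstract does not state which combination scheme the paper uses, so the voting rule, the function name `combine_by_vote`, and the extractor outputs below are all assumptions made for illustration.

```python
from collections import Counter


def combine_by_vote(candidate_sets, min_votes):
    """Keep the terms proposed by at least `min_votes` of the extractors."""
    votes = Counter(term for candidates in candidate_sets for term in set(candidates))
    return {term for term, n in votes.items() if n >= min_votes}


# Invented outputs of three hypothetical extractors on a Spanish medical text.
statistical = {"infarto de miocardio", "paciente", "presión arterial"}
linguistic = {"infarto de miocardio", "presión arterial", "día"}
hybrid = {"infarto de miocardio", "paciente"}

majority_terms = combine_by_vote([statistical, linguistic, hybrid], min_votes=2)
unanimous_terms = combine_by_vote([statistical, linguistic, hybrid], min_votes=3)
```

Raising `min_votes` trades recall for precision: the unanimous setting keeps only the terms every extractor agrees on.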


2007 ◽  
Vol 29 ◽  
pp. 79-103 ◽  
Author(s):  
C. Orasan ◽  
R. J. Evans

In anaphora resolution for English, animacy identification can play an integral role in the application of agreement restrictions between pronouns and candidates, and as a result, can improve the accuracy of anaphora resolution systems. In this paper, two methods for animacy identification are proposed and evaluated using intrinsic and extrinsic measures. The first method is a rule-based one which uses information about the unique beginners in WordNet to classify NPs on the basis of their animacy. The second method relies on a machine learning algorithm which exploits a WordNet enriched with animacy information for each sense. The effect of word sense disambiguation on the two methods is also assessed. The intrinsic evaluation reveals that the machine learning method reaches human levels of performance. The extrinsic evaluation demonstrates that animacy identification can be beneficial in anaphora resolution, especially in the cases where animate entities are identified with high precision.


Author(s):  
Marina Sokolova ◽  
Stan Szpakowicz

This chapter presents applications of machine learning techniques to traditional problems in natural language processing, including part-of-speech tagging, entity recognition and word-sense disambiguation. People usually solve such problems without difficulty, or at least do a very good job. Linguistics may suggest labour-intensive ways of manually constructing rule-based systems. It is, however, the easy availability of large collections of texts that has made machine learning a method of choice for processing volumes of data well above human capacity. One of the main purposes of text processing is information extraction and knowledge extraction, in all their forms, from such large text collections. The machine learning methods discussed in this chapter have stimulated wide-ranging research in natural language processing and helped build applications with serious deployment potential.


2017 ◽  
Vol 21 (1) ◽  
pp. 515-522 ◽  
Author(s):  
Muhammad Abid ◽  
Asad Habib ◽  
Jawad Ashraf ◽  
Abdul Shahid

Author(s):  
GEORGIOS PETASIS ◽  
GEORGIOS PALIOURAS ◽  
VANGELIS KARKALETSIS ◽  
CONSTANTINE D. SPYROPOULOS ◽  
ION ANDROUTSOPOULOS
