Parsing | ScienceGate

Parsing algorithms that process the input from left to right and construct a single derivation have often been considered inadequate for natural language parsing because of the massive ambiguity typically found in natural language grammars. Nevertheless, it has been shown that such algorithms, combined with treebank-induced classifiers, can be used to build highly accurate disambiguating parsers, in particular for dependency-based syntactic representations. In this article, we first present a general framework for describing and analyzing algorithms for deterministic incremental dependency parsing, formalized as transition systems. We then describe and analyze two families of such algorithms: stack-based and list-based algorithms. In the former family, which is restricted to projective dependency structures, we describe an arc-eager and an arc-standard variant; in the latter family, we present a projective and a non-projective variant. For each of the four algorithms, we give proofs of correctness and complexity. In addition, we perform an experimental evaluation of all algorithms in combination with SVM classifiers for predicting the next parsing action, using data from thirteen languages. We show that all four algorithms give competitive accuracy, although the non-projective list-based algorithm generally outperforms the projective algorithms for languages with a non-negligible proportion of non-projective constructions. However, the projective algorithms often produce comparable results when combined with the technique known as pseudo-projective parsing. The linear time complexity of the stack-based algorithms gives them an advantage with respect to efficiency both in learning and in parsing, but the projective list-based algorithm turns out to be equally efficient in practice. 
Moreover, when the projective algorithms are used to implement pseudo-projective parsing, they sometimes become less efficient in parsing (but not in learning) than the non-projective list-based algorithm. Although most of the algorithms have been partially described in the literature before, this is the first comprehensive analysis and evaluation of the algorithms within a unified framework.
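As an illustration of the stack-based family mentioned in the abstract above, here is a minimal sketch of an arc-standard transition system. It assumes the common textbook formulation in which arcs are drawn between the two topmost stack items, which may differ in detail from the article's own definition. The function name `parse_arc_standard` and the hand-written transition sequence are hypothetical; in the article, the next transition is predicted by a trained classifier rather than supplied by hand.

```python
from typing import List, Set, Tuple

Arc = Tuple[int, int]  # (head, dependent); token 0 is an artificial root


def parse_arc_standard(n_words: int, transitions: List[str]) -> Set[Arc]:
    """Apply a given transition sequence and return the resulting arcs.

    Assumed formulation (not taken from the article):
      SHIFT     moves the first buffer token onto the stack
      LEFT-ARC  makes the stack top the head of the item below it
      RIGHT-ARC makes the item below the stack top the head of the top
    """
    stack: List[int] = [0]
    buffer: List[int] = list(range(1, n_words + 1))
    arcs: Set[Arc] = set()

    for t in transitions:
        if t == "SHIFT":
            if not buffer:
                raise ValueError("SHIFT with empty buffer")
            stack.append(buffer.pop(0))
        elif t == "LEFT-ARC":
            # The artificial root may never become a dependent.
            if len(stack) < 2 or stack[-2] == 0:
                raise ValueError("LEFT-ARC not permitted here")
            dependent = stack.pop(-2)
            arcs.add((stack[-1], dependent))
        elif t == "RIGHT-ARC":
            if len(stack) < 2:
                raise ValueError("RIGHT-ARC not permitted here")
            dependent = stack.pop()
            arcs.add((stack[-1], dependent))
        else:
            raise ValueError(f"unknown transition: {t}")
    return arcs


# "She reads books": tokens 1, 2, 3, with "reads" heading the other two.
sequence = ["SHIFT", "SHIFT", "LEFT-ARC", "SHIFT", "RIGHT-ARC", "RIGHT-ARC"]
tree = parse_arc_standard(3, sequence)
```

In this formulation each word is shifted once and attached once, so a complete parse of n words takes 2n transitions, which is consistent with the linear time complexity the abstract attributes to the stack-based algorithms.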


Splittability of Bilexical Context-Free Grammars is Undecidable

Computational Linguistics ◽

10.1162/coli_a_00079 ◽

2011 ◽

Vol 37 (4) ◽

pp. 867-879

Author(s):

Mark-Jan Nederhof ◽

Giorgio Satta

Keyword(s):

Dynamic Programming ◽

Natural Language ◽

Input String ◽

Running Time ◽

Natural Language Parsing ◽

Central Interest ◽

The Right ◽

Programming Algorithms ◽

Context Free ◽

Context Free Grammars

Bilexical context-free grammars (2-LCFGs) have proved to be accurate models for statistical natural language parsing. Existing dynamic programming algorithms used to parse sentences under these models have running time of O(|w|^4), where w is the input string. A 2-LCFG is splittable if the left arguments of a lexical head are always independent of the right arguments, and vice versa. When a 2-LCFG is splittable, parsing time can be asymptotically improved to O(|w|^3). Testing this property is therefore of central interest to parsing efficiency. In this article, however, we show the negative result that splittability of 2-LCFGs is undecidable.


Relation Extraction With Clause-Based Open Information Extraction

10.32920/17303840.v1 ◽

2021 ◽

Author(s):

Duc Thuan Vo

Keyword(s):

Natural Language ◽

Information Extraction ◽

Language Processing ◽

Question Answering ◽

Relation Extraction ◽

Linguistic Knowledge ◽

Dependency Parsing ◽

Grammatical Structure ◽

Open Information Extraction ◽

Wide Range

Information Extraction (IE) is one of the challenging tasks in natural language processing. The goal of relation extraction is to discover the relevant segments of information in large numbers of textual documents such that they can be used for structuring data. IE aims at discovering various semantic relations in natural language text and has a wide range of applications such as question answering, information retrieval, and knowledge presentation, among others. This thesis proposes approaches for relation extraction with clause-based Open Information Extraction that use linguistic knowledge to capture a variety of information, including semantic concepts, words, POS tags, shallow and full syntax, and dependency parsing in rich syntactic and semantic structures.
Among the many Open Information Extraction systems that rely on syntactic and dependency parsing to detect relations, incoherent and uninformative relation extractions can still be found. The extracted relations can be erroneous at times and fail to have a meaningful interpretation. As such, we first propose refinements to the grammatical structure of syntactic and dependency parsing with clause structures and clause types in an effort to generate propositions that can be deemed meaningful extractable relations. Second, considering that choosing the most efficient seeds is pivotal to the success of the bootstrapping process when extracting relations, we propose an extended clause-based pattern extraction method with self-training for unsupervised relation extraction. The proposed self-training algorithm relies on the clause-based approach to extract a small set of seed instances in order to identify and derive new patterns. Third, we employ matrix factorization and collaborative filtering for relation extraction. To avoid the need for manually predefined schemas, we employ the notion of universal schemas, formed as a collection of patterns derived from Open Information Extraction tools as well as from relation schemas of pre-existing datasets. While previous systems have trained relations only for entities, we exploit advanced features from relation characteristics such as clause types and semantic topics for predicting new relation instances. Finally, we present an event network representation for temporal and causal event relation extraction that benefits from existing Open IE systems to generate a set of triple relations that are then used to build an event network. The event network is bootstrapped by labeling the temporal and causal disposition of events that are directly linked to each other. The event network can be systematically traversed to identify temporal and causal relations between indirectly connected events.


Head-Driven Statistical Models for Natural Language Parsing

Computational Linguistics ◽

10.1162/089120103322753356 ◽

2003 ◽

Vol 29 (4) ◽

pp. 589-637 ◽

Cited By ~ 151

Author(s):

Michael Collins

Keyword(s):

Natural Language ◽

Statistical Models ◽

Wall Street Journal ◽

Parse Tree ◽

Top Down ◽

Wall Street ◽

Natural Language Parsing ◽

The Difference ◽

Context Free ◽

Probabilistic Context

This article describes three statistical models for natural language parsing. The models extend methods from probabilistic context-free grammars to lexicalized grammars, leading to approaches in which a parse tree is represented as the sequence of decisions corresponding to a head-centered, top-down derivation of the tree. Independence assumptions then lead to parameters that encode the X-bar schema, subcategorization, ordering of complements, placement of adjuncts, bigram lexical dependencies, wh-movement, and preferences for close attachment. All of these preferences are expressed by probabilities conditioned on lexical heads. The models are evaluated on the Penn Wall Street Journal Treebank, showing that their accuracy is competitive with other models in the literature. To gain a better understanding of the models, we also give results on different constituent types, as well as a breakdown of precision/recall results in recovering various types of dependencies. We analyze various characteristics of the models through experiments on parsing accuracy, by collecting frequencies of various structures in the treebank, and through linguistically motivated examples. Finally, we compare the models to others that have been applied to parsing the treebank, aiming to give some explanation of the difference in performance of the various models.


PRINCIPAL PROBLEMS OF NATURAL LANGUAGE PROCESSING SYSTEMS

Studia Philologica ◽

10.28925/2311-2425.2018.11.5 ◽

2018 ◽

pp. 35-38

Author(s):

O. Hyryn

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Syntactic Analysis ◽

Syntactic Ambiguity ◽

Grammatical Structure ◽

English Sentence ◽

Analysis Methods ◽

The Way

The article deals with natural language processing, specifically the processing of an English sentence. It describes the problems that may arise during this process and that are connected with graphic, semantic, and syntactic ambiguity. It describes how these problems were solved before automatic syntactic analysis was applied, and how such analysis methods could be helpful in developing new analysis algorithms. The analysis focuses on the issues that hinder parsing, the basis of natural language processing. Parsing is the process of analyzing sentences according to their structure, content, and meaning; it aims to determine the grammatical structure of a sentence, divide it into constituent components, and define the links between them.


Multiobjective Genetic Programming for Natural Language Parsing and Tagging

Parallel Problem Solving from Nature - PPSN IX - Lecture Notes in Computer Science ◽

10.1007/11844297_44 ◽

2006 ◽

pp. 433-442 ◽

Cited By ~ 3

Author(s):

L. Araujo

Keyword(s):

Genetic Programming ◽

Natural Language ◽

Natural Language Parsing


Linguistic, Philosophical, and Pragmatic Aspects of Type-Directed Natural Language Parsing

Logical Aspects of Computational Linguistics - Lecture Notes in Computer Science ◽

10.1007/3-540-48975-4_4 ◽

1999 ◽

pp. 70-91 ◽

Cited By ~ 1

Author(s):

Sebastian Shaumyan ◽

Paul Hudak

Keyword(s):

Natural Language ◽

Natural Language Parsing ◽

Pragmatic Aspects


Finding Optimal 1-Endpoint-Crossing Trees

Transactions of the Association for Computational Linguistics ◽

10.1162/tacl_a_00206 ◽

2013 ◽

Vol 1 ◽

pp. 13-24 ◽

Cited By ~ 9

Author(s):

Emily Pitler ◽

Sampath Kannan ◽

Mitchell Marcus

Keyword(s):

Natural Language ◽

Common Vertex ◽

Dependency Parsing ◽

Dependency Tree ◽

Parsing Algorithm

Dependency parsing algorithms capable of producing the types of crossing dependencies seen in natural language sentences have traditionally been orders of magnitude slower than algorithms for projective trees. For 95.8–99.8% of dependency parses in various natural language treebanks, whenever an edge is crossed, the edges that cross it all have a common vertex. The optimal dependency tree that satisfies this 1-Endpoint-Crossing property can be found with an O(n^4) parsing algorithm that recursively combines forests over intervals with one exterior point. 1-Endpoint-Crossing trees also have natural connections to linguistics and another class of graphs that has been studied in NLP.
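The 1-Endpoint-Crossing property stated above can be checked directly from its definition. The sketch below is a brute-force test under stated assumptions, not the article's O(n^4) parsing algorithm: edges are treated as undirected pairs of word positions, and the helper names `crosses` and `is_one_endpoint_crossing` are hypothetical.

```python
from typing import Iterable, List, Tuple

Edge = Tuple[int, int]  # pair of word positions; direction is ignored


def crosses(e: Edge, f: Edge) -> bool:
    """True if exactly one endpoint of f lies strictly inside the span of e."""
    a, b = sorted(e)
    c, d = sorted(f)
    return a < c < b < d or c < a < d < b


def is_one_endpoint_crossing(edges: Iterable[Edge]) -> bool:
    """For every edge, all edges crossing it must share a common vertex."""
    edge_list: List[Edge] = [tuple(e) for e in edges]
    for e in edge_list:
        crossing = [f for f in edge_list if crosses(e, f)]
        if not crossing:
            continue
        # Intersect the endpoint sets of all crossing edges.
        common = set(crossing[0])
        for f in crossing[1:]:
            common &= set(f)
        if not common:
            return False
    return True


# Edges (2, 5) and (2, 6) both cross (1, 4) and share vertex 2.
ok = is_one_endpoint_crossing([(1, 4), (2, 5), (2, 6)])
# Edges (2, 5) and (3, 6) both cross (1, 4) but share no vertex.
bad = is_one_endpoint_crossing([(1, 4), (2, 5), (3, 6)])
```

This check compares every pair of edges, so it is quadratic in the number of edges; it only verifies the property for a given tree and does not search for the optimal one.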


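Identificação de Pragas e Doenças na Cultura da Soja por meio de um Sistema Computacional em Linguagem Natural [Identification of Pests and Diseases in Soybean Crops by Means of a Natural Language Computer System]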

10.14210/cotb.v12.p324-331 ◽

2021 ◽

Author(s):

Carolinne Roque e Faria ◽

Cinthyan Renata Sachs Camerlengo de Barb

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Computer System ◽

Language Processing ◽

Agricultural Area ◽

Syntactic Analysis ◽

Dependency Parsing ◽

Named Entities ◽

Pests And Diseases ◽

Improve Production

Technology is becoming increasingly popular among agribusiness producers and is advancing in all agricultural areas. One of the difficulties in this context is handling natural language data to solve problems in the field of agriculture. In order to build up dialogs and provide rich researchers, the present work uses Natural Language Processing (NLP) techniques to develop an automatic and effective computer system that interacts with the user and assists in identifying pests and diseases in soybean farming. The information is stored in a database repository to provide accurate diagnoses, simplifying the work of agricultural professionals and of those who deal with a lot of information in this area. Information on 108 pests and 19 diseases that damage Brazilian soybean was collected from Brazilian bibliographic manuals with the purpose of optimizing the data and improving production. The spaCy library was used for syntactic analysis; it made it possible to pre-process the texts, recognize named entities, calculate the similarity between words, and verify dependency parsing. It also supported the development requirements of the CAROLINA tool (Robotized Agronomic Conversation in Natural Language), which uses the language of the agricultural area.
