Parsing

Author(s):  
John Carroll

This chapter introduces key concepts and techniques for natural-language parsing: that is, finding the grammatical structure of sentences. The chapter introduces the fundamental algorithms for parsing with context-free (CF) phrase structure grammars, how these deal with ambiguous grammars, and how CF grammars and associated disambiguation models can be derived from syntactically annotated text. It goes on to consider dependency analysis, and outlines the main approaches to dependency parsing based both on manually written grammars and on learning from text annotated with dependency structures. It finishes with an overview of techniques used for parsing with grammars that use feature structures to encode linguistic information.

2008 ◽  
Vol 34 (4) ◽  
pp. 513-553 ◽  
Author(s):  
Joakim Nivre

Parsing algorithms that process the input from left to right and construct a single derivation have often been considered inadequate for natural language parsing because of the massive ambiguity typically found in natural language grammars. Nevertheless, it has been shown that such algorithms, combined with treebank-induced classifiers, can be used to build highly accurate disambiguating parsers, in particular for dependency-based syntactic representations. In this article, we first present a general framework for describing and analyzing algorithms for deterministic incremental dependency parsing, formalized as transition systems. We then describe and analyze two families of such algorithms: stack-based and list-based algorithms. In the former family, which is restricted to projective dependency structures, we describe an arc-eager and an arc-standard variant; in the latter family, we present a projective and a non-projective variant. For each of the four algorithms, we give proofs of correctness and complexity. In addition, we perform an experimental evaluation of all algorithms in combination with SVM classifiers for predicting the next parsing action, using data from thirteen languages. We show that all four algorithms give competitive accuracy, although the non-projective list-based algorithm generally outperforms the projective algorithms for languages with a non-negligible proportion of non-projective constructions. However, the projective algorithms often produce comparable results when combined with the technique known as pseudo-projective parsing. The linear time complexity of the stack-based algorithms gives them an advantage with respect to efficiency both in learning and in parsing, but the projective list-based algorithm turns out to be equally efficient in practice. Moreover, when the projective algorithms are used to implement pseudo-projective parsing, they sometimes become less efficient in parsing (but not in learning) than the non-projective list-based algorithm. Although most of the algorithms have been partially described in the literature before, this is the first comprehensive analysis and evaluation of the algorithms within a unified framework.


2011 ◽  
Vol 37 (4) ◽  
pp. 867-879
Author(s):  
Mark-Jan Nederhof ◽  
Giorgio Satta

Bilexical context-free grammars (2-LCFGs) have proved to be accurate models for statistical natural language parsing. Existing dynamic programming algorithms used to parse sentences under these models have running time of O(∣w∣4), where w is the input string. A 2-LCFG is splittable if the left arguments of a lexical head are always independent of the right arguments, and vice versa. When a 2-LCFGs is splittable, parsing time can be asymptotically improved to O(∣w∣3). Testing this property is therefore of central interest to parsing efficiency. In this article, however, we show the negative result that splittability of 2-LCFGs is undecidable.


2021 ◽  
Author(s):  
Duc Thuan Vo

Information Extraction (IE) is one of the challenging tasks in natural language processing. The goal of relation extraction is to discover the relevant segments of information in large numbers of textual documents such that they can be used for structuring data. IE aims at discovering various semantic relations in natural language text and has a wide range of applications such as question answering, information retrieval, knowledge presentation, among others. This thesis proposes approaches for relation extraction with clause-based Open Information Extraction that use linguistic knowledge to capture a variety of information including semantic concepts, words, POS tags, shallow and full syntax, dependency parsing in rich syntactic and semantic structures.<div>Within the plethora of Open Information Extraction that focus on the use of syntactic and dependency parsing for the purposes of detecting relations, incoherent and uninformative relation extractions can still be found. The extracted relations can be erroneous at times and fail to have a meaningful interpretation. As such, we first propose refinements to the grammatical structure of syntactic and dependency parsing with clause structures and clause types in an effort to generate propositions that can be deemed as meaningful extractable relations. Second, considering that choosing the most efficient seeds are pivotal to the success of the bootstrapping process when extracting relations, we propose an extended clause-based pattern extraction method with selftraining for unsupervised relation extraction. The proposed self-training algorithm relies on the clause-based approach to extract a small set of seed instances in order to identify and derive new patterns. Third, we employ matrix factorization and collaborative filtering for relation extraction. To avoid the need for manually predefined schemas, we employ the notion of universal schemas that is formed as a collection of patterns derived from Open Information Extraction tools as well as from relation schemas of pre-existing datasets. While previous systems have trained relations only for entities, we exploit advanced features from relation characteristics such as clause types and semantic topics for predicting new relation instances. Finally, we present an event network representation for temporal and causal event relation extraction that benefits from existing Open IE systems to generate a set of triple relations that are then used to build an event network. The event network is bootstrapped by labeling the temporal and causal disposition of events that are directly linked to each other. The event network can be systematically traversed to identify temporal and causal relations between indirectly connected events. <br></div>


2003 ◽  
Vol 29 (4) ◽  
pp. 589-637 ◽  
Author(s):  
Michael Collins

This article describes three statistical models for natural language parsing. The models extend methods from probabilistic context-free grammars to lexicalized grammars, leading to approaches in which a parse tree is represented as the sequence of decisions corresponding to a head-centered, top-down derivation of the tree. Independence assumptions then lead to parameters that encode the X-bar schema, subcategorization, ordering of complements, placement of adjuncts, bigram lexical dependencies, wh-movement, and preferences for close attachment. All of these preferences are expressed by probabilities conditioned on lexical heads. The models are evaluated on the Penn Wall Street Journal Treebank, showing that their accuracy is competitive with other models in the literature. To gain a better understanding of the models, we also give results on different constituent types, as well as a breakdown of precision/recall results in recovering various types of dependencies. We analyze various characteristics of the models through experiments on parsing accuracy, by collecting frequencies of various structures in the treebank, and through linguistically motivated examples. Finally, we compare the models to others that have been applied to parsing the treebank, aiming to give some explanation of the difference in performance of the various models.


2021 ◽  
Author(s):  
Duc Thuan Vo

Information Extraction (IE) is one of the challenging tasks in natural language processing. The goal of relation extraction is to discover the relevant segments of information in large numbers of textual documents such that they can be used for structuring data. IE aims at discovering various semantic relations in natural language text and has a wide range of applications such as question answering, information retrieval, knowledge presentation, among others. This thesis proposes approaches for relation extraction with clause-based Open Information Extraction that use linguistic knowledge to capture a variety of information including semantic concepts, words, POS tags, shallow and full syntax, dependency parsing in rich syntactic and semantic structures.<div>Within the plethora of Open Information Extraction that focus on the use of syntactic and dependency parsing for the purposes of detecting relations, incoherent and uninformative relation extractions can still be found. The extracted relations can be erroneous at times and fail to have a meaningful interpretation. As such, we first propose refinements to the grammatical structure of syntactic and dependency parsing with clause structures and clause types in an effort to generate propositions that can be deemed as meaningful extractable relations. Second, considering that choosing the most efficient seeds are pivotal to the success of the bootstrapping process when extracting relations, we propose an extended clause-based pattern extraction method with selftraining for unsupervised relation extraction. The proposed self-training algorithm relies on the clause-based approach to extract a small set of seed instances in order to identify and derive new patterns. Third, we employ matrix factorization and collaborative filtering for relation extraction. To avoid the need for manually predefined schemas, we employ the notion of universal schemas that is formed as a collection of patterns derived from Open Information Extraction tools as well as from relation schemas of pre-existing datasets. While previous systems have trained relations only for entities, we exploit advanced features from relation characteristics such as clause types and semantic topics for predicting new relation instances. Finally, we present an event network representation for temporal and causal event relation extraction that benefits from existing Open IE systems to generate a set of triple relations that are then used to build an event network. The event network is bootstrapped by labeling the temporal and causal disposition of events that are directly linked to each other. The event network can be systematically traversed to identify temporal and causal relations between indirectly connected events. <br></div>


2018 ◽  
pp. 35-38
Author(s):  
O. Hyryn

The article deals with natural language processing, namely that of an English sentence. The article describes the problems, which might arise during the process and which are connected with graphic, semantic, and syntactic ambiguity. The article provides the description of how the problems had been solved before the automatic syntactic analysis was applied and the way, such analysis methods could be helpful in developing new analysis algorithms. The analysis focuses on the issues, blocking the basis for the natural language processing — parsing — the process of sentence analysis according to their structure, content and meaning, which aims to analyze the grammatical structure of the sentence, the division of sentences into constituent components and defining links between them.


Author(s):  
Emily Pitler ◽  
Sampath Kannan ◽  
Mitchell Marcus

Dependency parsing algorithms capable of producing the types of crossing dependencies seen in natural language sentences have traditionally been orders of magnitude slower than algorithms for projective trees. For 95.8–99.8% of dependency parses in various natural language treebanks, whenever an edge is crossed, the edges that cross it all have a common vertex. The optimal dependency tree that satisfies this 1-Endpoint-Crossing property can be found with an O( n4) parsing algorithm that recursively combines forests over intervals with one exterior point. 1-Endpoint-Crossing trees also have natural connections to linguistics and another class of graphs that has been studied in NLP.


2021 ◽  
Author(s):  
Carolinne Roque e Faria ◽  
Cinthyan Renata Sachs Camerlengo de Barb

Technology is becoming expressively popular among agribusiness producers and is progressing in all agricultural area. One of the difficulties in this context is to handle data in natural language to solve problems in the field of agriculture. In order to build up dialogs and provide rich researchers, the present work uses Natural Language Processing (NLP) techniques to develop an automatic and effective computer system to interact with the user and assist in the identification of pests and diseases in the soybean farming, stored in a database repository to provide accurate diagnoses to simplify the work of the agricultural professional and also for those who deal with a lot of information in this area. Information on 108 pests and 19 diseases that damage Brazilian soybean was collected from Brazilian bibliographic manuals with the purpose to optimize the data and improve production, using the spaCy library for syntactic analysis of NLP, which allowed the pre-process the texts, recognize the named entities, calculate the similarity between the words, verify dependency parsing and also provided the support for the development requirements of the CAROLINA tool (Robotized Agronomic Conversation in Natural Language) using the language belonging to the agricultural area.


Sign in / Sign up

Export Citation Format

Share Document