Parsing Models for Identifying Multiword Expressions

2013 ◽  
Vol 39 (1) ◽  
pp. 195-227 ◽  
Author(s):  
Spence Green ◽  
Marie-Catherine de Marneffe ◽  
Christopher D. Manning

Multiword expressions lie at the syntax/semantics interface and have motivated alternative theories of syntax like Construction Grammar. Until now, however, syntactic analysis and multiword expression identification have been modeled separately in natural language processing. We develop two structured prediction models for joint parsing and multiword expression identification. The first is based on context-free grammars and the second uses tree substitution grammars, a formalism that can store larger syntactic fragments. Our experiments show that both models can identify multiword expressions with much higher accuracy than a state-of-the-art system based on word co-occurrence statistics. We experiment with Arabic and French, which both have pervasive multiword expressions. Relative to English, they also have richer morphology, which induces lexical sparsity in finite corpora. To combat this sparsity, we develop a simple factored lexical representation for the context-free parsing model. Morphological analyses are automatically transformed into rich feature tags that are scored jointly with lexical items. This technique, which we call a factored lexicon, improves both standard parsing and multiword expression identification accuracy.
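
To make the factored-lexicon idea concrete: the generative score of a terminal decomposes into a lemma factor and a morphological-tag factor, so an inflected form never seen in training still receives probability mass from its independently observed parts. The following is a minimal illustrative sketch in Python, not the authors' implementation; the toy counts, smoothing constants, and tag notation are assumptions.

from collections import defaultdict

# Toy treebank-style counts: (preterminal, lemma, morph_tag) observations.
# In the paper's setup morphological analyses come from an analyzer; here
# they are hand-written assumptions for illustration.
observations = [
    ("NOUN", "kitab", "Gender=Masc|Number=Sing"),
    ("NOUN", "kitab", "Gender=Masc|Number=Plur"),
    ("NOUN", "madrasa", "Gender=Fem|Number=Sing"),
]

lemma_counts = defaultdict(lambda: defaultdict(int))
tag_counts = defaultdict(lambda: defaultdict(int))
pretag_totals = defaultdict(int)

for pre, lemma, tag in observations:
    lemma_counts[pre][lemma] += 1
    tag_counts[pre][tag] += 1
    pretag_totals[pre] += 1

def factored_score(pre, lemma, tag, alpha=0.1, v_lemma=1000, v_tag=50):
    """P(lemma, tag | preterminal) ~= P(lemma | pre) * P(tag | pre),
    each factor smoothed additively so unseen combinations keep mass."""
    total = pretag_totals[pre]
    p_lemma = (lemma_counts[pre][lemma] + alpha) / (total + alpha * v_lemma)
    p_tag = (tag_counts[pre][tag] + alpha) / (total + alpha * v_tag)
    return p_lemma * p_tag

# An unseen (lemma, tag) pairing still scores above zero:
print(factored_score("NOUN", "madrasa", "Gender=Fem|Number=Plur"))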

Author(s):  
John Carroll

This article introduces the concepts and techniques of natural language (NL) parsing: using a grammar to assign a syntactic analysis to a string of words, or to a lattice of word hypotheses output by a speech recognizer or similar component. The level of detail required depends on the language processing task being performed and the particular approach to the task that is being pursued. The article first describes approaches that produce ‘shallow’ analyses, and outlines approaches that analyse the input in terms of labelled dependencies between words. Producing hierarchical phrase structure requires grammars with at least context-free (CF) power, and the CF algorithms widely used in NL parsing are described. To support detailed semantic interpretation, more powerful grammar formalisms are required, but these are usually parsed using extensions of CF parsing algorithms; the article therefore also describes unification-based parsing. Finally, it discusses three important issues that must be tackled in real-world applications of parsing: evaluation of parser accuracy, parser efficiency, and measurement of grammar/parser coverage.
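
As a concrete instance of the CF algorithms the article surveys, the sketch below implements CKY recognition for a grammar in Chomsky normal form. The toy grammar and lexicon are illustrative assumptions.

# Minimal CKY recognizer for a grammar in Chomsky normal form (CNF).
# Binary rules map (B, C) -> {A}; lexical rules map word -> {A}.
binary = {("NP", "VP"): {"S"}, ("Det", "N"): {"NP"}, ("V", "NP"): {"VP"}}
lexical = {"the": {"Det"}, "dog": {"N"}, "cat": {"N"}, "saw": {"V"}}

def cky_recognize(words):
    n = len(words)
    # chart[i][j] holds the nonterminals that span words[i:j].
    chart = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    for i, w in enumerate(words):
        chart[i][i + 1] = set(lexical.get(w, ()))
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):          # try every split point
                for b in chart[i][k]:
                    for c in chart[k][j]:
                        chart[i][j] |= binary.get((b, c), set())
    return "S" in chart[0][n]

print(cky_recognize("the dog saw the cat".split()))  # True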


2014 ◽  
Vol 22 (1) ◽  
pp. 73-95 ◽  
Author(s):  
GÁBOR BEREND

Abstract: Keyphrases are the most important phrases of a document, and as such they are suitable for improving natural language processing tasks, including information retrieval, document classification, document visualization, summarization, and categorization. Here, we propose a supervised framework augmented by novel extra-textual information derived primarily from Wikipedia. Unlike most other methods relying on Wikipedia, our approach does not require a full textual index of all Wikipedia articles; we exploit only the category hierarchy and a list of multiword expressions derived from Wikipedia. This makes the approach less resource-intensive while producing results comparable or superior to previous similar works. Our thorough evaluations also suggest that the proposed framework performs consistently well on multiple datasets, being competitive with, or even outperforming, other state-of-the-art methods. Besides introducing features that incorporate extra-textual information, we also experiment with a novel way of representing features derived from the POS tagging of the keyphrase candidates.
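
A minimal sketch of the candidate-featurization step, assuming only the two Wikipedia-derived resources the framework needs (a multiword-expression list and a category hierarchy); the particular lists, feature names, and POS patterns are illustrative assumptions rather than the paper's exact feature set.

# Sketch of keyphrase-candidate featurization in the spirit of the framework:
# the only extra-textual resources are a Wikipedia multiword-expression list
# and a category hierarchy, not a full-text index. Values are assumptions.
wiki_mwes = {"information retrieval", "natural language processing"}
wiki_category_depth = {"information retrieval": 3}   # depth in category tree

def featurize(candidate, document_phrases, pos_pattern):
    return {
        "tf": document_phrases.count(candidate),      # in-document frequency
        "length": len(candidate.split()),
        "is_wiki_mwe": candidate in wiki_mwes,        # MWE-list signal
        "category_depth": wiki_category_depth.get(candidate, -1),
        "pos_pattern": pos_pattern,                   # e.g. "NN NN" (POS feature)
    }

print(featurize("information retrieval",
                ["information retrieval", "parsing"], "NN NN"))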


2020 ◽  
Vol 34 (04) ◽  
pp. 5005-5012 ◽  
Author(s):  
You Lu ◽  
Bert Huang

Traditional structured prediction models try to learn the conditional likelihood, i.e., p(y|x), to capture the relationship between the structured output y and the input features x. For many models, computing the likelihood is intractable, which makes them hard to train and requires the use of surrogate objectives or variational inference to approximate the likelihood. In this paper, we propose conditional Glow (c-Glow), a conditional generative flow for structured output learning. C-Glow benefits from the ability of flow-based models to compute p(y|x) exactly and efficiently, so learning with c-Glow requires neither a surrogate objective nor inference during training. Once trained, the model can directly and efficiently generate conditional samples. We develop a sample-based prediction method that exploits this ability to perform efficient and effective inference. In our experiments, we test c-Glow on five different tasks. C-Glow outperforms state-of-the-art baselines on some tasks and predicts comparable outputs on the others. The results show that c-Glow is versatile and applicable to many different structured prediction problems.
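
The exactness claim rests on the change-of-variables formula: if an invertible transform z = f(y; x) maps y to a simple base density, then log p(y|x) = log p_base(z) + log |det dz/dy|. The PyTorch sketch below shows a single x-conditioned affine layer with this exact likelihood; a real c-Glow stacks many coupling layers in a multi-scale Glow architecture, so this is a minimal illustration, not the paper's model.

import torch
import torch.nn as nn

class ConditionalAffineFlow(nn.Module):
    """One x-conditioned affine layer: z = y * exp(s(x)) + t(x).
    Exact log p(y|x) follows from the change-of-variables formula."""
    def __init__(self, x_dim, y_dim):
        super().__init__()
        self.net = nn.Linear(x_dim, 2 * y_dim)  # predicts scale and shift

    def log_prob(self, y, x):
        s, t = self.net(x).chunk(2, dim=-1)
        z = y * torch.exp(s) + t
        base = torch.distributions.Normal(0.0, 1.0)
        # log p(y|x) = log p_base(z) + log|det dz/dy| = sum log N(z) + sum s
        return base.log_prob(z).sum(-1) + s.sum(-1)

flow = ConditionalAffineFlow(x_dim=4, y_dim=3)
x, y = torch.randn(8, 4), torch.randn(8, 3)
loss = -flow.log_prob(y, x).mean()   # train by exact maximum likelihood
loss.backward()
print(loss.item())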


2020 ◽  
Vol 21 (S23) ◽  
Author(s):  
Jenna Kanerva ◽  
Filip Ginter ◽  
Sampo Pyysalo

Abstract
Background: Syntactic analysis, or parsing, is a key task in natural language processing and a required component for many text mining approaches. In recent years, Universal Dependencies (UD) has emerged as the leading formalism for dependency parsing. While a number of recent tasks centering on UD have substantially advanced the state of the art in multilingual parsing, there has been little study of parsing texts from specialized domains such as biomedicine.
Methods: We explore the application of state-of-the-art neural dependency parsing methods to biomedical text using the recently introduced CRAFT-SA shared task dataset. The CRAFT-SA task broadly follows the UD representation and recent UD task conventions, allowing us to fine-tune the UD-compatible Turku Neural Parser and UDify neural parsers to the task. We further evaluate the effect of transfer learning using a broad selection of BERT models, including several models pre-trained specifically for biomedical text processing.
Results: We find that recently introduced neural parsing technology is capable of generating highly accurate analyses of biomedical text, substantially improving on the best performance reported in the original CRAFT-SA shared task. We also find that initialization using a deep transfer learning model pre-trained on in-domain texts is key to maximizing the performance of the parsing methods.
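
Both fine-tuned parsers score dependency arcs with biaffine attention over contextual token embeddings. The sketch below shows just that scoring step, with random vectors standing in for the output of a (possibly biomedical) BERT encoder; the dimensions and names are assumptions.

import torch
import torch.nn as nn

class BiaffineArcScorer(nn.Module):
    """Biaffine arc scoring over contextual token embeddings, the core of
    graph-based parsers such as the Turku Neural Parser and UDify."""
    def __init__(self, enc_dim, arc_dim=128):
        super().__init__()
        self.head = nn.Linear(enc_dim, arc_dim)   # token as potential head
        self.dep = nn.Linear(enc_dim, arc_dim)    # token as potential dependent
        self.U = nn.Parameter(torch.randn(arc_dim, arc_dim) * 0.01)
        self.bias = nn.Linear(arc_dim, 1)

    def forward(self, h):                          # h: (batch, seq, enc_dim)
        head, dep = self.head(h), self.dep(h)
        # score[b, i, j] = dep_i . U . head_j + bias(head_j)
        scores = torch.einsum("bid,de,bje->bij", dep, self.U, head)
        return scores + self.bias(head).squeeze(-1).unsqueeze(1)

scorer = BiaffineArcScorer(enc_dim=768)
h = torch.randn(2, 10, 768)                        # stand-in for BERT output
arc_scores = scorer(h)                             # (2, 10, 10) arc scores
print(arc_scores.argmax(-1)[0])                    # predicted head per token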


Author(s):  
P. SENGUPTA ◽  
B.B. CHAUDHURI

A lexical subsystem that contains a morphological-level parser is necessary for processing natural languages in general and inflectional languages in particular. Such a subsystem should be able to generate the surface form (i.e., as it appears in a natural sentence) of a word, given the sequence of morphemes constituting the word. Conversely, and more importantly, the subsystem should be able to parse a word into its constituent morphemes. A formalism which enables the lexicon writer to specify the lexicon of an inflectional language is discussed. The specifications are used to build a lexical description in the form of a lexical database on the one hand and a formulation of derivational morphology, called Augmented Finite State Automata (AFSA), on the other. A compact lexical representation has been achieved, where both generation of the surface forms of a word and parsing of a word are performed in a computationally attractive manner. The output produced by parsing is suitable as input to the next stage of analysis in a Natural Language Processing (NLP) environment, which, in our case, is based on a generalization of Lexical Functional Grammar (LFG). The application of the formalism to inflectional Indian languages is considered, with Bengali, a modern Indian language, as a case study.
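
A minimal sketch of the parsing direction, in the spirit of a finite-state morphological analyzer: match a stem from the lexical database, then walk suffix transitions until the word is consumed. The tiny stem and suffix tables are illustrative assumptions, not the paper's AFSA formulation or its Bengali lexicon.

# Sketch of finite-state morphological parsing: match a stem, then follow
# suffix transitions; an unmatched remainder is a dead state. The toy
# Bengali-flavoured lexicon below is an illustrative assumption.
stems = {"boi": "NOUN", "ghor": "NOUN"}               # toy stems
suffixes = {"ta": "DEF", "gulo": "PLUR", "r": "GEN"}  # toy inflections

def parse_word(word):
    """Return (stem, POS, [suffix glosses]) or None if no parse exists."""
    for stem, pos in stems.items():
        if not word.startswith(stem):
            continue
        rest, glosses = word[len(stem):], []
        while rest:
            for suf, gloss in suffixes.items():
                if rest.startswith(suf):
                    glosses.append(gloss)
                    rest = rest[len(suf):]
                    break
            else:
                break      # no suffix matched; dead state
        if not rest:
            return stem, pos, glosses
    return None

print(parse_word("boigulor"))   # ('boi', 'NOUN', ['PLUR', 'GEN'])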


2013 ◽  
Vol 846-847 ◽  
pp. 1376-1379
Author(s):  
Li Fei Geng ◽  
Hong Lian Li

Syntactic analysis is a core technology of natural language processing and the cornerstone for further linguistic analysis. This paper first introduces the basic grammatical systems and surveys current parsing technology. It then analyses the characteristics of probabilistic context-free grammars in depth and introduces methods for improving probabilistic context-free parsing. Finally, we point out the difficulties of Chinese parsing.
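
A probabilistic context-free grammar attaches a probability to each rewrite rule, conventionally estimated by relative frequency over a treebank so that the probabilities of all rules sharing a left-hand side sum to one. A minimal sketch with toy rule counts (assumptions):

from collections import Counter, defaultdict

# Maximum-likelihood PCFG estimation: P(A -> beta) = count(A -> beta) / count(A).
# The toy rule counts below are illustrative assumptions.
rule_counts = Counter({
    ("S", ("NP", "VP")): 100,
    ("NP", ("Det", "N")): 70,
    ("NP", ("N",)): 30,
    ("VP", ("V", "NP")): 60,
    ("VP", ("V",)): 40,
})

lhs_totals = defaultdict(int)
for (lhs, rhs), c in rule_counts.items():
    lhs_totals[lhs] += c

pcfg = {rule: c / lhs_totals[rule[0]] for rule, c in rule_counts.items()}
print(pcfg[("NP", ("Det", "N"))])   # 0.7

# The probability of a derivation is the product of its rule probabilities,
# which is the quantity a probabilistic CKY parser maximizes.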


2021 ◽  
Vol 11 (23) ◽  
pp. 11328
Author(s):  
Nader Essam ◽  
Abdullah M. Moussa ◽  
Khaled M. Elsayed ◽  
Sherif Abdou ◽  
Mohsen Rashwan ◽  
...  

The recent surge of social media networks has provided a channel to gather and publish vital medical and health information. The focal role of these networks has become more prominent in periods of crisis, such as the recent COVID-19 pandemic. These social networks have been the leading platform for broadcasting health news updates, precaution instructions, and governmental procedures. They also provide an effective means for gathering public opinion and tracking breaking events and stories. To achieve location-based analysis of social media input, the location information of the users must be captured. Most of the time, this information is either missing or hidden. For some languages, such as Arabic, the users’ location can be predicted from their dialects. The Arabic language has many local dialects across most Arab countries. Natural Language Processing (NLP) techniques have provided several approaches for dialect identification. Recent advanced language models using contextual word representations in the continuous domain, such as BERT models, have provided significant improvements for many NLP applications. In this work, we present our efforts to use BERT-based models to improve dialect identification of Arabic text. We report the results of the developed models in recognizing the source country, or the broader Arabic region, from Twitter data. Our results show a 3.4% absolute improvement in dialect identification accuracy at the regional level over the state-of-the-art result. When we excluded the Modern Standard Arabic (MSA) set, which is the formal register of Arabic, we achieved a 3% absolute gain in accuracy in distinguishing the three major Arabic dialects over the state-of-the-art level. Finally, we applied the developed models to a recently collected resource of COVID-19 Arabic tweets to recognize the source country from the users’ tweets. We achieved a weighted average accuracy of 97.36%, suggesting a tool that policymakers could use to support country-level disaster-related activities.
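
A minimal sketch of the modelling setup, using the Hugging Face transformers API for sequence classification; the multilingual checkpoint (a stand-in for a dedicated Arabic BERT) and the three-way label set are assumptions, not the paper's exact models or data.

# Sketch of BERT-based dialect classification with Hugging Face transformers.
# The checkpoint and the label set are illustrative assumptions.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

labels = ["Egyptian", "Gulf", "Levantine"]          # assumed dialect classes
name = "bert-base-multilingual-cased"               # stand-in; a dedicated
tokenizer = AutoTokenizer.from_pretrained(name)     # Arabic BERT would be
model = AutoModelForSequenceClassification.from_pretrained(  # used in practice
    name, num_labels=len(labels))

batch = tokenizer(["تصبح على خير"], return_tensors="pt",
                  padding=True, truncation=True)
logits = model(**batch).logits                      # fine-tune with cross-entropy
print(labels[logits.argmax(-1).item()])             # untrained: arbitrary guess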


2018 ◽  
pp. 35-38
Author(s):  
O. Hyryn

The article deals with natural language processing, specifically the processing of an English sentence. It describes problems that may arise during the process, connected with graphic, semantic, and syntactic ambiguity. The article describes how these problems were solved before automatic syntactic analysis was applied, and how such analysis methods could help in developing new analysis algorithms. The analysis focuses on the issues underlying natural language parsing: the process of analysing sentences according to their structure, content, and meaning, which aims to determine the grammatical structure of the sentence, divide it into constituent components, and define the links between them.
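
A concrete case of the syntactic ambiguity discussed above: under a small hand-written grammar (an illustrative assumption), a chart parser returns two trees for one sentence, depending on where the prepositional phrase attaches.

# Classic PP-attachment ambiguity: "with the telescope" can modify the
# man (NP attachment) or the seeing (VP attachment). Requires nltk.
import nltk

grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> Pro | Det N | NP PP
VP -> V NP | VP PP
PP -> P NP
Pro -> 'I'
Det -> 'the'
N -> 'man' | 'telescope'
V -> 'saw'
P -> 'with'
""")

parser = nltk.ChartParser(grammar)
sentence = "I saw the man with the telescope".split()
for tree in parser.parse(sentence):   # yields two trees, one per attachment
    print(tree)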


2019 ◽  
Vol 53 (2) ◽  
pp. 3-10
Author(s):  
Muthu Kumar Chandrasekaran ◽  
Philipp Mayr

The 4th joint BIRNDL workshop was held at the 42nd ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2019) in Paris, France. BIRNDL 2019 intended to stimulate IR researchers and digital library professionals to elaborate on new approaches in natural language processing, information retrieval, scientometrics, and recommendation techniques that can advance the state-of-the-art in scholarly document understanding, analysis, and retrieval at scale. The workshop incorporated different paper sessions and the 5th edition of the CL-SciSumm Shared Task.


2021 ◽  
pp. 1-13
Author(s):  
Qingtian Zeng ◽  
Xishi Zhao ◽  
Xiaohui Hu ◽  
Hua Duan ◽  
Zhongying Zhao ◽  
...  

Word embeddings have been successfully applied in many natural language processing tasks due to their effectiveness. However, the state-of-the-art algorithms for learning word representations from large amounts of text documents ignore emotional information, which is a significant research problem that must be addressed. To solve this problem, we propose an emotional word embedding (EWE) model for sentiment analysis. The method first applies pre-trained word vectors to represent document features using two different linear weighting methods. The resulting document vectors are then input to a classification model and used to train a neural-network-based text sentiment classifier. In this way, the emotional polarity of the text is propagated into the word vectors. Experimental results on three kinds of real-world data sets demonstrate that the proposed EWE model achieves superior performance on text sentiment prediction, text similarity calculation, and word emotional expression tasks compared to other state-of-the-art models.
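
A minimal sketch of the document-representation step, assuming tf-idf weighting as one of the two linear weighting schemes; the toy vectors and weights are illustrative, and the downstream neural classifier is omitted.

import numpy as np

# Pre-trained word vectors are combined by a linear weighting into a
# document vector that feeds the sentiment classifier. The vectors and
# tf-idf weights below are toy assumptions.
word_vectors = {w: np.random.rand(50) for w in ["great", "movie", "boring"]}
tfidf = {"great": 0.9, "movie": 0.2, "boring": 0.8}

def doc_vector(tokens, weighted=True):
    """Weighted average of word vectors (one of the two linear schemes)."""
    vecs, weights = [], []
    for t in tokens:
        if t in word_vectors:
            vecs.append(word_vectors[t])
            weights.append(tfidf.get(t, 1.0) if weighted else 1.0)
    return np.average(vecs, axis=0, weights=weights)

doc = doc_vector(["great", "movie"])
print(doc.shape)   # (50,) -- the input to the neural sentiment classifier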

