Graph-Based Word Alignment for Clinical Language Evaluation

2015 · Vol 41 (4) · pp. 549-578
Author(s): Emily Prud'hommeaux, Brian Roark

Among the more recent applications for natural language processing algorithms has been the analysis of spoken language data for diagnostic and remedial purposes, fueled by the demand for simple, objective, and unobtrusive screening tools for neurological disorders such as dementia. The automated analysis of narrative retellings in particular shows potential as a component of such a screening tool, since the ability to produce accurate and meaningful narratives is noticeably impaired in individuals with dementia and its frequent precursor, mild cognitive impairment, as well as in other neurodegenerative and neurodevelopmental disorders. In this article, we present a method for extracting narrative recall scores automatically and with high accuracy from a word-level alignment between a retelling and the source narrative. We propose improvements to existing machine translation–based systems for word alignment, including a novel method relying on random walks on a graph that achieves alignment accuracy superior to that of standard expectation maximization–based techniques in a fraction of the time they require. In addition, the narrative recall score features extracted from these high-quality word alignments yield diagnostic classification accuracy comparable to that achieved using manually assigned scores and significantly higher than that achieved with the summary-level text similarity metrics used in other areas of NLP. These methods can be trivially adapted to spontaneous language samples elicited with non-linguistic stimuli, demonstrating their flexibility and generalizability.
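The graph-based alignment idea can be illustrated with a toy sketch: link retelling words to source words in a bipartite graph weighted by lexical similarity, then score candidate alignments by where a short random walk tends to land. This is only a minimal illustration of the general technique, not the authors' implementation; the similarity function, walk length, and example sentences are invented:

```python
import numpy as np

def lexical_sim(w1: str, w2: str) -> float:
    """Toy lexical similarity: exact match scores 1.0; otherwise a long
    shared prefix earns partial credit. (An assumption standing in for the
    lexical association measures the real system would use.)"""
    if w1 == w2:
        return 1.0
    prefix = 0
    for a, b in zip(w1, w2):
        if a != b:
            break
        prefix += 1
    return prefix / max(len(w1), len(w2)) if prefix >= 3 else 0.0

def align_by_random_walk(source, retelling, steps=3):
    """Align retelling words to source words by bouncing a short random walk
    across a bipartite word graph and keeping the most-visited source node."""
    W = np.array([[lexical_sim(r, s) for s in source] for r in retelling])
    # Transition matrices: retelling -> source and source -> retelling.
    P_rs = W / np.maximum(W.sum(axis=1, keepdims=True), 1e-12)
    P_sr = (W / np.maximum(W.sum(axis=0, keepdims=True), 1e-12)).T
    dist = P_rs.copy()             # one walker starts on each retelling word
    for _ in range(steps - 1):
        dist = dist @ P_sr @ P_rs  # one more round trip across the graph
    pairs = []
    for i in range(len(retelling)):
        j = int(np.argmax(dist[i]))
        if dist[i, j] > 0:         # unconnected words stay unaligned
            pairs.append((retelling[i], source[j]))
    return pairs

source = "the boy chased the little dog".split()
retelling = "a boy chases a small dog".split()
print(align_by_random_walk(source, retelling))
# [('boy', 'boy'), ('chases', 'chased'), ('dog', 'dog')]
```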

2010 · Vol 36 (3) · pp. 481-504
Author(s): João V. Graça, Kuzman Ganchev, Ben Taskar

Word-level alignment of bilingual text is a critical resource for a growing variety of tasks. Probabilistic models for word alignment present a fundamental trade-off between the richness of captured constraints and correlations and the efficiency and tractability of inference. In this article, we use the Posterior Regularization framework (Graça, Ganchev, and Taskar 2007) to incorporate complex constraints into probabilistic models during learning without changing the efficiency of the underlying model. We focus on the simple and tractable hidden Markov model and present an efficient learning algorithm for incorporating approximate bijectivity and symmetry constraints. Models estimated with these constraints produce a significant boost in performance, as measured by both precision and recall against manually annotated alignments for six language pairs. We also report experiments on two tasks where word alignments are required, phrase-based machine translation and syntax transfer, and show promising improvements over standard methods.
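The core of Posterior Regularization is a projection in the E-step: the model posterior is projected onto a constraint set by solving min_q KL(q ‖ p) subject to expectation constraints, which yields q ∝ p·exp(−λ·φ) with dual variables λ. The sketch below illustrates this for an approximate bijectivity constraint (each source word's expected fan-in at most 1), using independent IBM Model 1–style posteriors for simplicity rather than the paper's HMM, where the same projection would run forward–backward inside the loop:

```python
import numpy as np

def project_bijective(P, n_iters=200, lr=0.5):
    """Posterior Regularization projection, sketched for Model 1-style
    posteriors. P[i, j] = p(target word i aligns to source word j); rows sum
    to 1. We enforce E_q[#links to source j] <= 1 via dual gradient ascent:
    q_i(j) is proportional to p_i(j) * exp(-lambda_j), with lambda_j >= 0."""
    lam = np.zeros(P.shape[1])                  # one dual variable per source word
    Q = P.copy()
    for _ in range(n_iters):
        Q = P * np.exp(-lam)                    # tilt the posterior by the duals
        Q /= Q.sum(axis=1, keepdims=True)       # renormalize each target row
        grad = Q.sum(axis=0) - 1.0              # expected fan-in minus the bound
        lam = np.maximum(0.0, lam + lr * grad)  # ascent, clipped at lambda >= 0
    return Q

# Three target words all attracted to source word 0; the projection spreads
# the expected fan-in so no source word exceeds 1 in expectation.
P = np.array([[0.8, 0.1, 0.1],
              [0.7, 0.2, 0.1],
              [0.6, 0.2, 0.2]])
Q = project_bijective(P)
print(P.sum(axis=0))   # fan-in before: [2.1, 0.5, 0.4]
print(Q.sum(axis=0))   # fan-in after:  approximately [1.0, 1.0, 1.0]
```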


2001 · Vol 7 (2) · pp. 189-190
Author(s): Afzal Ballim, Vincenzo Pallotta

The automated analysis of natural language data has become a central issue in the design of Intelligent Information Systems. The term natural language is intended to cover all possible modalities of human communication and is not restricted to written or spoken language. Processing unrestricted natural language is still considered an AI-hard task. However, various analysis techniques have been proposed to address specific aspects of natural language. In particular, recent interest has focused on approximate analysis techniques, which assume that perfect analysis is not possible but that partial results are still very useful.


2021 · Vol 21 (2) · pp. 1-25
Author(s): Pin Ni, Yuming Li, Gangmin Li, Victor Chang

Cyber-Physical Systems (CPS), as multi-dimensional complex systems that connect the physical world and the cyber world, have a strong demand for processing large amounts of heterogeneous data. These tasks include Natural Language Inference (NLI) over text from different sources. However, current research on natural language processing in CPS has not explored this area. This study therefore proposes a Siamese network structure that combines stacked residual bidirectional Long Short-Term Memory (LSTM) with an attention mechanism and a Capsule Network for the NLI module in CPS, used to infer the relationship between text/language data from different sources. The model serves as the basic semantic understanding module in CPS; we use it to implement NLI tasks and evaluate it in detail on three main NLI benchmarks. Comparative experiments show that the proposed method achieves competitive performance, has a degree of generalization ability, and balances performance against the number of trained parameters.
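As a rough sketch of this encoder design (hyperparameters invented, and the capsule component omitted for brevity), a Siamese NLI model shares one sentence encoder, built from stacked bidirectional LSTMs with residual connections and attention pooling, between premise and hypothesis, then classifies their combined representation:

```python
import torch
import torch.nn as nn

class SiameseNLI(nn.Module):
    """Sketch of a Siamese NLI encoder: stacked residual BiLSTMs with
    attention pooling, shared between premise and hypothesis. Dimensions are
    assumptions, and the paper's Capsule Network head is omitted."""

    def __init__(self, vocab_size, emb_dim=128, hidden=128, n_layers=2, n_classes=3):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.proj = nn.Linear(emb_dim, 2 * hidden)   # match BiLSTM output width
        self.lstms = nn.ModuleList([
            nn.LSTM(2 * hidden, hidden, batch_first=True, bidirectional=True)
            for _ in range(n_layers)
        ])
        self.attn = nn.Linear(2 * hidden, 1)         # additive attention scores
        self.clf = nn.Sequential(
            nn.Linear(8 * hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_classes),
        )

    def encode(self, ids):
        h = self.proj(self.emb(ids))
        for lstm in self.lstms:
            out, _ = lstm(h)
            h = out + h                              # residual across each layer
        w = torch.softmax(self.attn(h), dim=1)       # weights over time steps
        return (w * h).sum(dim=1)                    # attention-pooled sentence vector

    def forward(self, premise, hypothesis):
        u, v = self.encode(premise), self.encode(hypothesis)   # shared weights
        feats = torch.cat([u, v, (u - v).abs(), u * v], dim=-1)
        return self.clf(feats)

model = SiameseNLI(vocab_size=10_000)
premise = torch.randint(1, 10_000, (4, 20))    # batch of 4, 20 token ids each
hypothesis = torch.randint(1, 10_000, (4, 20))
print(model(premise, hypothesis).shape)        # torch.Size([4, 3])
```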


Author(s): Shumin Shi, Dan Luo, Xing Wu, Congjun Long, Heyan Huang

Dependency parsing is an important task for Natural Language Processing (NLP). However, a mature parser requires a large treebank for training, which is still extremely costly to create. Tibetan is an extremely low-resource language for NLP: no Tibetan dependency treebank is available, and existing annotations are produced manually. Furthermore, there is little related research on treebank construction. We propose a novel method of multi-level chunk-based syntactic parsing to perform constituent-to-dependency treebank conversion for Tibetan under these scarce conditions. Our method mines more dependencies from Tibetan sentences, builds a high-quality Tibetan dependency tree corpus, and makes fuller use of the inherent regularities of the language itself. We train dependency parsing models on the dependency treebank obtained by this preliminary conversion. The model achieves 86.5% accuracy, 96% LAS, and 97.85% UAS, exceeding the best results of existing conversion methods. The experimental results show that our method is well suited to low-resource settings: it not only addresses the scarcity of Tibetan dependency treebanks but also avoids needless manual annotation. The method exemplifies strongly knowledge-guided linguistic analysis, which is of great significance for advancing research on Tibetan information processing.
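The general mechanics of constituent-to-dependency conversion can be shown with a head-percolation sketch: a table of head rules picks a head child in each constituent, and every non-head child's lexical head attaches to it. The rules, tree format, and example below are invented stand-ins; the paper's multi-level chunk rules for Tibetan are what would populate the table in practice:

```python
# Hypothetical head-percolation table (phrase label -> preferred head child
# labels). The paper's Tibetan chunk-level rules would go here instead.
HEAD_RULES = {"S": ["VP"], "VP": ["V"], "NP": ["N"]}

def head_of(tree):
    """Return the token index of the lexical head of a constituency subtree.
    Trees are (label, children); leaves are (POS, token_index)."""
    label, children = tree
    if isinstance(children, int):            # leaf node
        return children
    for want in HEAD_RULES.get(label, []):
        for child in children:
            if child[0] == want:
                return head_of(child)
    return head_of(children[-1])             # fallback: rightmost child heads

def to_dependencies(tree, deps=None):
    """Collect (dependent, head) token-index arcs: each non-head child's
    lexical head depends on the constituent's lexical head."""
    if deps is None:
        deps = []
    _, children = tree
    if isinstance(children, int):
        return deps
    h = head_of(tree)
    for child in children:
        ch = head_of(child)
        if ch != h:
            deps.append((ch, h))
        to_dependencies(child, deps)
    return deps

# Toy SOV sentence, tokens 0..2 ("boy", "dog", "saw"):
tree = ("S", [("NP", [("N", 0)]),
              ("VP", [("NP", [("N", 1)]), ("V", 2)])])
print(to_dependencies(tree))   # [(0, 2), (1, 2)] -- both arguments attach to the verb
```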


Languages · 2021 · Vol 6 (3) · pp. 123
Author(s): Thomas A. Leddy-Cecere

The Arabic dialectology literature repeatedly asserts the existence of a macro-level classificatory relationship binding the Arabic speech varieties of the combined Egypto-Sudanic area. This proposal, though oft-encountered, has not previously been formulated in reference to extensive linguistic criteria; it is instead framed primarily on the nonlinguistic premise of historical demographic and genealogical relationships joining the Arabic-speaking communities of the region. The present contribution provides a linguistically based evaluation of this proposed dialectal grouping, assessing whether the postulated dialectal unity is meaningfully borne out by available language data. Isoglosses from the domains of segmental phonology, phonological processes, pronominal morphology, verbal inflection, and syntax are analyzed across six dialects representing Arabic speech in the region. These are shown to offer minimal support for a unified Egypto-Sudanic dialect classification and instead to indicate a significant north–south differentiation within the sample, a finding further qualified via application of the novel method of Historical Glottometry developed by François and Kalyan. The investigation concludes with reflection on the implications of these results for our understanding of the correspondence between linguistic and human genealogical relationships, both in the history of Arabic and in dialectological practice more broadly.
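Historical Glottometry quantifies subgroup support from overlapping isoglosses rather than forcing a tree. In one common formulation, paraphrased from Kalyan and François (2018), a subgroup's cohesiveness is the share of relevant isoglosses compatible with it, and its subgroupiness multiplies cohesiveness by the number of exclusively shared innovations. A toy sketch with invented data (the paper's actual isogloss inventory is not reproduced here):

```python
def glottometry(subgroup, isoglosses):
    """Toy Historical Glottometry scores for one candidate subgroup.

    subgroup:   set of variety names
    isoglosses: list of sets, each the varieties sharing one innovation

    Definitions paraphrased (as an assumption) from Kalyan & François:
      - an isogloss *supports* the subgroup if it contains all its members;
      - it *conflicts* if it crosscuts: some members plus some outsiders;
      - cohesiveness  = supporting / (supporting + conflicting);
      - subgroupiness = cohesiveness * number of *exclusive* isoglosses
        (those identical to the subgroup itself)."""
    support = conflict = exclusive = 0
    for iso in isoglosses:
        inside = len(iso & subgroup)
        if inside == 0:
            continue                     # irrelevant to this subgroup
        if subgroup <= iso:
            support += 1
            if iso == subgroup:
                exclusive += 1
        elif iso - subgroup:
            conflict += 1                # crosscutting isogloss
    kappa = support / (support + conflict) if support + conflict else 0.0
    return {"cohesiveness": kappa, "subgroupiness": kappa * exclusive}

# Four toy varieties; is {A, B} a good subgroup given five innovations?
isoglosses = [{"A", "B"}, {"A", "B", "C"}, {"B", "C"}, {"C", "D"}, {"A", "B"}]
print(glottometry({"A", "B"}, isoglosses))
# {'cohesiveness': 0.75, 'subgroupiness': 1.5}
```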


2021 · Vol 12 (1)
Author(s): Olga Majewska, Charlotte Collins, Simon Baker, Jari Björne, Susan Windisch Brown, ...

Background: Recent advances in representation learning have enabled large strides in natural language understanding; however, verbal reasoning remains a challenge for state-of-the-art systems. External sources of structured, expert-curated verb-related knowledge have been shown to boost model performance in different Natural Language Processing (NLP) tasks where accurate handling of verb meaning and behaviour is critical. The cost and time required for manual lexicon construction have been a major obstacle to porting the benefits of such resources to NLP in specialised domains, such as biomedicine. To address this issue, we combine a neural classification method with expert annotation to create BioVerbNet. This new resource comprises 693 verbs assigned to 22 top-level and 117 fine-grained semantic-syntactic verb classes. We make this resource available complete with semantic roles and VerbNet-style syntactic frames.

Results: We demonstrate the utility of the new resource in boosting model performance in document- and sentence-level classification in biomedicine. We apply an established retrofitting method to harness the verb class membership knowledge from BioVerbNet, transforming a pretrained word embedding space by pulling together verbs belonging to the same semantic-syntactic class. The BioVerbNet knowledge-aware embeddings surpass the non-specialised baseline by a significant margin on both tasks.

Conclusion: This work introduces the first large, annotated semantic-syntactic classification of biomedical verbs, providing a detailed account of the annotation process, the key differences in verb behaviour between the general and biomedical domains, and the design choices made to accurately capture the meaning and properties of verbs used in biomedical texts. The demonstrated benefits of leveraging BioVerbNet in text classification suggest the resource could help systems better tackle challenging NLP tasks in biomedicine.
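Retrofitting in the style of Faruqui et al. (2015) iteratively pulls each vector toward the vectors of its lexicon neighbours while anchoring it to its pretrained position. A minimal sketch, assuming neighbourhoods are defined by shared verb class (the class data below is invented, not from BioVerbNet):

```python
import numpy as np

def retrofit(embeddings, classes, n_iters=10, alpha=1.0):
    """Faruqui-style retrofitting sketch: verbs sharing a class are pulled
    together; alpha controls attachment to the original pretrained vector.

    embeddings: dict word -> np.ndarray (pretrained vectors)
    classes:    list of lists of words (each inner list = one verb class)"""
    new = {w: v.copy() for w, v in embeddings.items()}
    # Neighbour sets: every other verb in the same class.
    neighbours = {w: set() for w in embeddings}
    for cls in classes:
        for w in cls:
            if w in neighbours:
                neighbours[w] |= set(cls) - {w}
    for _ in range(n_iters):
        for w, nbrs in neighbours.items():
            nbrs = [n for n in nbrs if n in new]
            if not nbrs:
                continue
            # Weighted average of class neighbours and the original vector.
            num = alpha * embeddings[w] + sum(new[n] for n in nbrs)
            new[w] = num / (alpha + len(nbrs))
    return new

emb = {"inhibit": np.array([1.0, 0.0]),
       "suppress": np.array([0.0, 1.0]),
       "express": np.array([1.0, 1.0])}
retro = retrofit(emb, [["inhibit", "suppress"]])
print(retro["inhibit"], retro["suppress"])  # pulled together; "express" unchanged
```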


2002 · Vol 8 (2-3) · pp. 93-96
Author(s): Afzal Ballim, Vincenzo Pallotta

The automated analysis of natural language data has become a central issue in the design of intelligent information systems. Processing unconstrained natural language data is still considered an AI-hard task. However, various analysis techniques have been proposed to address specific aspects of natural language. In particular, recent interest has focused on approximate analysis techniques, which assume that when perfect analysis is not possible, partial results may still be very useful.


2018 · Vol 6 · pp. 451-465
Author(s): Daniela Gerz, Ivan Vulić, Edoardo Ponti, Jason Naradowsky, Roi Reichart, ...

Neural architectures are prominent in the construction of language models (LMs). However, word-level prediction is typically agnostic of subword-level information (characters and character sequences) and operates over a closed vocabulary consisting of a limited word set. Indeed, while subword-aware models boost performance across a variety of NLP tasks, previous work has not evaluated the ability of these models to assist next-word prediction in language modeling. Such subword-informed models should be particularly effective for morphologically rich languages (MRLs), which exhibit high type-to-token ratios. In this work, we present a large-scale LM study on 50 typologically diverse languages covering a wide variety of morphological systems, and offer the community new subword-aware LM benchmarks. The main technical contribution of our work is a novel method for injecting subword-level information into semantic word vectors, integrated into neural language model training, to facilitate word-level prediction. We conduct experiments in an LM setting where the number of infrequent words is large, and demonstrate strong perplexity gains across our 50 languages, especially for morphologically rich languages. Our code and data sets are publicly available.
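The general idea of subword-informed word vectors can be sketched with fastText-style character n-grams: a word's representation mixes a word-level vector with hashed n-gram vectors, so inflected forms of the same stem share most of their representation. This is a generic sketch of the technique, not the paper's exact injection method; dimensions, bucket count, and example words are invented:

```python
import zlib
import numpy as np

def char_ngrams(word, n_min=3, n_max=5):
    """Character n-grams of a word with boundary markers, fastText-style."""
    w = f"<{word}>"
    return [w[i:i + n] for n in range(n_min, n_max + 1)
            for i in range(len(w) - n + 1)]

class SubwordVectors:
    """A word's vector is the mean of its word-level vector and its hashed
    character n-gram vectors."""

    def __init__(self, dim=64, buckets=100_000, seed=0):
        self.rng = np.random.default_rng(seed)
        self.ngrams = self.rng.normal(scale=0.1, size=(buckets, dim))
        self.words = {}
        self.dim, self.buckets = dim, buckets

    def vector(self, word):
        if word not in self.words:
            self.words[word] = self.rng.normal(scale=0.1, size=self.dim)
        pieces = [self.words[word]]
        for g in char_ngrams(word):  # deterministic hashing into the table
            pieces.append(self.ngrams[zlib.crc32(g.encode()) % self.buckets])
        return np.mean(pieces, axis=0)

sv = SubwordVectors()
cos = lambda a, b: a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
# Morphological variants share n-grams, so their vectors correlate even
# though their word-level vectors are independent random draws:
print(cos(sv.vector("talossa"), sv.vector("talossakin")))   # noticeably > 0
print(cos(sv.vector("talossa"), sv.vector("koira")))        # near 0, no shared n-grams
```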


2019
Author(s): Roberta Rocca, Kenny R. Coventry, Kristian Tylén, Marlene Staib, Torben E. Lund, ...

Spatial demonstratives are powerful linguistic tools used to establish joint attention. Identifying the meaning of semantically underspecified expressions like "this one" hinges on the integration of linguistic and visual cues, attentional orienting, and pragmatic inference. This synergy between language and extralinguistic cognition is pivotal to language comprehension in general, but especially prominent in demonstratives.

In this study, we aimed to elucidate which neural architectures enable this intertwining between language and extralinguistic cognition, using a naturalistic fMRI paradigm. In our experiment, 28 participants listened to a specially crafted dialogical narrative with a controlled number of spatial demonstratives. A fast multiband-EPI acquisition sequence (TR = 388 ms) combined with finite impulse response (FIR) modelling of the hemodynamic response was used to capture signal changes at word-level resolution.

We found that spatial demonstratives bilaterally engage a network of parietal areas, including the supramarginal gyrus, the angular gyrus, and the precuneus, implicated in information integration and visuospatial processing. Moreover, demonstratives recruit frontal regions, including the right frontal eye field (FEF), implicated in attentional orienting and reference frame shifts. Finally, using multivariate similarity analyses, we provide evidence for a general involvement of the dorsal ("where") stream in the processing of spatial expressions, as opposed to ventral pathways encoding object semantics.

Overall, our results suggest that language processing relies on a distributed architecture, recruiting neural resources for perception, attention, and extralinguistic aspects of cognition in a dynamic and context-dependent fashion.
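FIR modelling estimates the hemodynamic response shape directly instead of assuming a canonical HRF: each event contributes one indicator regressor per post-stimulus time bin. A minimal sketch of such a design matrix; the onsets, scan count, and bin count below are illustrative, with only the 388 ms TR taken from the study:

```python
import numpy as np

def fir_design_matrix(onsets_s, n_scans, tr=0.388, n_bins=20):
    """One column per post-stimulus time bin: X[t, b] = 1 if some event
    occurred b scans before scan t. Regressing a voxel time series on X
    recovers the response curve without assuming its shape."""
    X = np.zeros((n_scans, n_bins))
    for onset in onsets_s:
        start = int(round(onset / tr))       # event onset in scan units
        for b in range(n_bins):
            if start + b < n_scans:
                X[start + b, b] = 1.0
    return X

# Two word onsets (seconds); with TR = 388 ms, 20 bins span roughly the
# first 7.8 s of the response.
X = fir_design_matrix([2.0, 10.0], n_scans=100)
print(X.shape)   # (100, 20)
# For a voxel time series y, beta[b] estimates the BOLD response b*TR
# seconds after a word:  beta = np.linalg.lstsq(X, y, rcond=None)[0]
```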

