Identification of Multiword Expressions by Combining Multiple Linguistic Information Sources

2014 ◽  
Vol 40 (2) ◽  
pp. 449-468 ◽  
Author(s):  
Yulia Tsvetkov ◽  
Shuly Wintner

We propose a framework for using multiple sources of linguistic information in the task of identifying multiword expressions in natural language texts. We define various linguistically motivated classification features and introduce novel ways for computing them. We then manually define interrelationships among the features, and express them in a Bayesian network. The result is a powerful classifier that can identify multiword expressions of various types and multiple syntactic constructions in text corpora. Our methodology is unsupervised and language-independent; it requires relatively few language resources and is thus suitable for a large number of languages. We report results on English, French, and Hebrew, and demonstrate a significant improvement in identification accuracy, compared with less sophisticated baselines.

2020 ◽  
Vol 9 (1) ◽  
pp. 1040-1044

Text summarization generates a condensed version of information on a particular topic from various sources while preserving its original meaning. It is essential for extracting information from a large repository of data and eliminating irrelevant information. Manual summarization consumes a large amount of time, and hence an automated text summarization model is required. Summarization can be performed on a single source or on multiple sources. Natural Language Processing (NLP) based text summarization methods can be broadly categorized as abstractive or extractive. Extractive methods select the essential text from the document, whereas abstractive methods summarize the document by rewriting it. Extractive summarization methods rely on the topics and centrality of the document. Abstractive techniques transform the sentences based on the available language resources. This paper studies both extractive and abstractive strategies in text summarization. The overall objective of this paper is to give researchers direction in learning about the different strategies applied in text summarization.
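
The centrality idea mentioned in this abstract can be illustrated with a minimal sketch. The abstract does not say which centrality measure the surveyed methods use, so the choice here (summed Jaccard word overlap between sentences) is an assumption made for illustration, and the function names `tokenize`, `jaccard`, and `extractive_summary` are hypothetical.

```python
import re


def tokenize(sentence):
    """Return the set of lowercase alphabetic words in a sentence."""
    return set(re.findall(r"[a-z]+", sentence.lower()))


def jaccard(a, b):
    """Jaccard similarity of two word sets (0.0 if either set is empty)."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def extractive_summary(sentences, k=1):
    """Pick the k most central sentences, returned in document order.

    Centrality is assumed here to be the sum of a sentence's Jaccard
    similarity to every other sentence; ties go to the earlier sentence.
    """
    bags = [tokenize(s) for s in sentences]
    scores = [
        sum(jaccard(bags[i], bags[j]) for j in range(len(bags)) if j != i)
        for i in range(len(bags))
    ]
    top = sorted(range(len(sentences)), key=lambda i: (-scores[i], i))[:k]
    return [sentences[i] for i in sorted(top)]


if __name__ == "__main__":
    doc = [
        "Cats chase mice.",
        "Cats chase mice and birds.",
        "Birds eat seeds.",
        "The weather is nice.",
    ]
    print(extractive_summary(doc, k=1))
```

An extractive method of this kind only selects existing sentences; an abstractive method would instead generate new wording, which this sketch does not attempt.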


2021 ◽  
Vol 8 (1) ◽  
Author(s):  
Lisa Grossman Liu ◽  
Raymond H. Grossman ◽  
Elliot G. Mitchell ◽  
Chunhua Weng ◽  
Karthik Natarajan ◽  
...  

The recognition, disambiguation, and expansion of medical abbreviations and acronyms is of utmost importance to prevent medically dangerous misinterpretation in natural language processing. To support recognition, disambiguation, and expansion, we present the Medical Abbreviation and Acronym Meta-Inventory, a deep database of medical abbreviations. A systematic harmonization of eight source inventories across multiple healthcare specialties and settings identified 104,057 abbreviations with 170,426 corresponding senses. Automated cross-mapping of synonymous records using state-of-the-art machine learning reduced redundancy, which simplifies future application. Additional features include semi-automated quality control to remove errors. The Meta-Inventory demonstrated high completeness or coverage of abbreviations and senses in new clinical text, a substantial improvement over the next largest repository (6–14% increase in abbreviation coverage; 28–52% increase in sense coverage). To our knowledge, the Meta-Inventory is the most complete compilation of medical abbreviations and acronyms in American English to date. The multiple sources and high coverage support application in varied specialties and settings. This allows for cross-institutional natural language processing, which previous inventories did not support. The Meta-Inventory is available at https://bit.ly/github-clinical-abbreviations.


2021 ◽  
pp. 114-131
Author(s):  
Alexander Antopolsky ◽  

The author defines the concept of linguistic information resources and gives an overview of their classifications. The article describes the most significant Russian catalogs of linguistic information resources and the country's leading organizations in the field of computational linguistics. It also discusses the priority tasks in creating a Russian infrastructure for linguistic information resources.


Author(s):  
John Carroll

This chapter introduces key concepts and techniques for natural-language parsing: that is, finding the grammatical structure of sentences. The chapter introduces the fundamental algorithms for parsing with context-free (CF) phrase structure grammars, how these deal with ambiguous grammars, and how CF grammars and associated disambiguation models can be derived from syntactically annotated text. It goes on to consider dependency analysis, and outlines the main approaches to dependency parsing based both on manually written grammars and on learning from text annotated with dependency structures. It finishes with an overview of techniques used for parsing with grammars that use feature structures to encode linguistic information.
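
As an illustration of how a chart parser handles an ambiguous context-free grammar, here is a minimal sketch of the CKY algorithm that counts parse trees. The abstract does not name the algorithms the chapter covers, so choosing CKY is an assumption, and the toy grammar, the lexicon, and the function name `cky_count` are hypothetical.

```python
from collections import defaultdict


def cky_count(words, lexical, binary, start="S"):
    """Count parse trees of `words` under a grammar in Chomsky normal form.

    lexical: maps a word to the nonterminals that can produce it.
    binary:  maps a pair (B, C) to the nonterminals A with a rule A -> B C.
    """
    n = len(words)
    # chart[i][j][A] = number of distinct trees rooted in A over words[i:j]
    chart = [[defaultdict(int) for _ in range(n + 1)] for _ in range(n + 1)]
    for i, word in enumerate(words):
        for lhs in lexical.get(word, ()):
            chart[i][i + 1][lhs] += 1
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):  # split point between the two children
                for (b, c), parents in binary.items():
                    left = chart[i][k].get(b, 0)
                    right = chart[k][j].get(c, 0)
                    if left and right:
                        for a in parents:
                            chart[i][j][a] += left * right
    return chart[0][n].get(start, 0)


# Toy grammar (an assumption for illustration): the PP can attach to VP or NP.
BINARY = {
    ("NP", "VP"): ["S"],
    ("V", "NP"): ["VP"],
    ("VP", "PP"): ["VP"],
    ("NP", "PP"): ["NP"],
    ("Det", "N"): ["NP"],
    ("P", "NP"): ["PP"],
}
LEXICAL = {
    "I": ["NP"], "saw": ["V"], "the": ["Det"], "a": ["Det"],
    "man": ["N"], "telescope": ["N"], "with": ["P"],
}

if __name__ == "__main__":
    sentence = "I saw the man with a telescope".split()
    print(cky_count(sentence, LEXICAL, BINARY))
```

The chart stores every analysis of every substring once, so the two attachment readings of the prepositional phrase share all their common sub-analyses; a disambiguation model would then rank the competing trees rather than merely count them.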


2020 ◽  
Vol 21 (26) ◽  
Author(s):  
Marin Laak

The Estonian Literary Museum has been one of the pioneers in the field of digital humanities since the 1990s, when computer culture began to spread more widely. In managing its valuable collections, its mission has been to make them accessible to the public. Cultural heritage was opened to a wider audience in two directions: content-based searchable databases and relation-based data environments. The aim of this article is to show the present-day possibilities of computational literary studies and the related creation of literary language resources in cooperation with corpus linguists. In the article I analyse the possibilities of using cultural heritage content environments and data collections as machine-readable language resources. As the first such experiments, annotated language corpora of correspondence and criticism have been completed in the query system KORP. Using the problem of influence criticism in the early 20th century as an example, the present study highlights the potential of literary language corpora for research into cultural heritage.
Estonia can soon expect an explosive growth in digital heritage and text resources due to the current project of mass digitisation of national cultural heritage (printed books, archival documents, photos, art, audiovisual, and ethnographic artifacts) (2019–2023). This will give new opportunities for different fields of digital humanities and make digitised heritage accessible to everyone in the form of open data. The project will focus on the usage of the heritage, on the needs of education, e-learning, and the creative industry, including digital creative arts. The aim of this article is to examine some research possibilities that have opened up for literary history due to the digitisation of literary works and archival sources and to put them in the general context of digital humanities. Although the field of digital humanities is broad, the meaning of DH is often reduced to methods of computational language-centered analyses, mainly based on using different tools and programming languages (R, Stylo, Python, Gephi, topic modelling, etc.). 
While corpus-based research is already a professional standard in linguistics, literary scholars are still more used to working with traditional methods. This article introduces two digital literary history projects belonging to the field of digital humanities and analyses them as language resources for creating text corpora. It also presents some results of a case study of Estonian criticism from the Young Estonia movement up to the 1920s, carried out using the literary text corpora in the corpus query system KORP (https://korp.keeleressursid.ee) of the Centre of Estonian Language Resources. During the past twenty years, I have mainly focussed on developing large-scale implementation projects for digital representation of Estonian literary history. The objective of these experimental projects has been to develop fundamentally new non-linear models of Estonian literary history for the digital environment. These activities were based on my research into the intertextual relations between authors, literary works, and critical texts using traditional methods. The first content-based literary history project “ERNI. Estonian Literary History in Texts 1924–1925” (www2.kirmus.ee/erni) was based on a hypertextual network of literary source texts and reviews. We re-conceptualised literary history as a non-linear narrative and a gallery with many entrances. The task of the project was also to ensure its usability in education: a significant number of study materials have been added in cooperation with schoolteachers. In 2004, we initiated our long-term and still running project “Kreutzwald’s Century: the Estonian Cultural History Web” (http://kreutzwald.kirmus.ee) at the Estonian Literary Museum. The objective of this project was to make literary sources of the period accessible as a dynamic, interactive information environment. 
This was a hybrid project which synthesised the classical study of Estonian literary history, the needs of the digital media user, and the expanding digital resources from different memory institutions; its underlying idea was to link together all the works of fiction of an author, as well as their biography, manuscripts, and photos, and to make them visible for the user on five interactive time axes. The project uses a specially created platform. Today, this platform is extensively used by schoolteachers: in 2020 (Jan.–Dec.) it had about 8, 986.555 million clicks and over seven years (2013 Dec.–2020 Dec.) it collected 64, 627.380 million clicks. To find out how we can fit such content-based models of literary heritage into the context of Digital Humanities, we need to compare the previous modelling practices with our current experimental project in the corpus query system KORP. Our interdisciplinary project “Literary Studies Meet Corpus Linguistics” (2017–2020) concentrated on studying literary history sources with linguistic methods. As a result of the project, two literary text corpora were created: “Epistolary text corpus of Estonian writers Johannes Semper and Johannes Vares-Barbarus” and “Corpus of the Estonian literary criticism, Noor-Eesti and the 1920s”. Both were pilot projects in the field and started with converting the digitised archival and printed sources into machine-readable format before text and data mining for corpus creation. The query system KORP allows us to organise the language data by all the categories used in the corpus, for example, to learn who mentioned the name of the French writer André Gide and in what context. The second currently running project is the morphologically annotated corpus of literary criticism. This corpus contains texts of literary reviews and criticism in different genres, drawn from the projects ERNI and “Kreutzwald’s Century”. 
The first results in studying the dynamics of literary values can already be seen. A query in KORP about the word ‘mõju’ (‘influence’) revealed that the manifesto “More of European culture!” of the group Young Estonia, voiced in 1905, was replaced during the independent Estonian Republic by the valuing of a specific national character. The corpus query showed a change in the meaning of the word: in the criticism contemporary to Young Estonia, the word ‘mõju’ was only associated with the historical pressure from Russian and German cultures. The foundation for modern comparative linguistics at the University of Tartu was laid in the 1920s by the professorship in Estonian literature.


2021 ◽  
Author(s):  
Mahir Morshed

In the lead-up to the launch of Abstract Wikipedia, a sufficient body of linguistic information, from which the text within it can be generated for a given language, must be in place so that different sets of functions, some working with concepts and others turning these into word sequences, can work together to produce something natural in that language. Developing that body of information requires more thorough consideration of a number of linguistic aspects sooner rather than later. This session will thus discuss aspects of language planning with respect to Wikidata lexicographical data and natural language generation, including the compositionality and manipulability of lexical units, the breadth and interconnectedness of units of meaning, and the treatment of variation among a language’s lects broadly construed. Special reference will be made to the handling of each of these aspects for Bengali and those linguistic varieties often grouped with it.


Author(s):  
Tamás Iványi

In recent years, festivals have become an essential part of summer activities for many members of Generation Z. Programs that last several days also mean a significant financial burden for young people, so they gather information from multiple sources before making a decision. The purpose of the study is to examine which information sources, especially social media, and which motivations have become significant in the decision process of festival tourism. An online survey was conducted as part of an exploratory research project over four consecutive years, dealing with the use of information sources and the importance of music festivals' characteristics and targeting Hungarian Generation Z festival attendees. Besides descriptive statistics, cluster analysis and ANOVA tables were used. The results show that in festival tourism, the influence and usage of social media and reliance on the opinions of acquaintances and friends are much more significant in the decision-making phase than in traditional tourism. The program and the leading performers are not the only important factors; meeting friends, the atmosphere of the festival, and reasonable value for money are also significant. Three groups of users could be identified: those who mainly browse official websites and search engines, those who try to make decisions based on earlier experiences, and those who also look at social media sites and digest several types of content to make the decision. Organisers of festivals should understand the differences among these groups to create better communication strategies.

