Cross-Context News Corpus for Protest Event-Related Knowledge Base Construction

Abstract We describe a gold standard corpus of protest events that comprise various local and international English language sources from various countries. The corpus contains document-, sentence-, and token-level annotations. This corpus facilitates creating machine learning models that automatically classify news articles and extract protest event-related information, constructing knowledge bases that enable comparative social and political science studies. For each news source, the annotation starts with random samples of news articles and continues with samples drawn using active learning. Each batch of samples is annotated by two social and political scientists, adjudicated by an annotation supervisor, and improved by identifying annotation errors semi-automatically. We found that the corpus possesses the variety and quality that are necessary to develop and benchmark text classification and event extraction systems in a cross-context setting, contributing to the generalizability and robustness of automated text processing systems. This corpus and the reported results will establish a common foundation in automated protest event collection studies, which is currently lacking in the literature.

Download Full-text

Automated News Classification using N-gram Model and Key Features of Nepali Language

SCITECH Nepal ◽

10.3126/scitech.v13i1.23504 ◽

2018 ◽

Vol 13 (1) ◽

pp. 64-69

Author(s):

Dinesh Dangol ◽

Rupesh Dahi Shrestha ◽

Arun Timalsina

Keyword(s):

Text Classification ◽

English Language ◽

Promising Result ◽

Text Processing ◽

Automatic Text Classification ◽

Key Features ◽

Research Experiment ◽

N Gram ◽

Automated Text Processing ◽

Automatic Text

With an increasing trend of publishing news online on website, automatic text processing becomes more and more important. Automatic text classification has been a focus of many researchers in different languages for decades. There is a huge amount of research repository on features of English language and their uses on automated text processing. This research implements Nepali language key features for automatic text classification of Nepali news. In particular, the study on impact of Nepali language based features, which are extremely different than English language is more challenging because of the higher level of complexity to be resolved. The research experiment using vector space model, n-gram model and key feature based processing specific to Nepali language shows promising result compared to bag-of-words model for the task of automated Nepali news classification.

Download Full-text

Automated Travel History Extraction From Clinical Notes for Informing the Detection of Emergent Infectious Disease Events: Algorithm Development and Validation

JMIR Public Health and Surveillance ◽

10.2196/26719 ◽

2021 ◽

Vol 7 (3) ◽

pp. e26719

Author(s):

Kelly S Peterson ◽

Julia Lewis ◽

Olga V Patterson ◽

Alec B Chapman ◽

Daniel W Denhalter ◽

...

Keyword(s):

Public Health ◽

Infectious Disease ◽

United States ◽

Text Processing ◽

The United States ◽

Language Models ◽

Travel History ◽

Patient Travel ◽

Electronic Health ◽

Automated Text Processing

Background Patient travel history can be crucial in evaluating evolving infectious disease events. Such information can be challenging to acquire in electronic health records, as it is often available only in unstructured text. Objective This study aims to assess the feasibility of annotating and automatically extracting travel history mentions from unstructured clinical documents in the Department of Veterans Affairs across disparate health care facilities and among millions of patients. Information about travel exposure augments existing surveillance applications for increased preparedness in responding quickly to public health threats. Methods Clinical documents related to arboviral disease were annotated following selection using a semiautomated bootstrapping process. Using annotated instances as training data, models were developed to extract from unstructured clinical text any mention of affirmed travel locations outside of the continental United States. Automated text processing models were evaluated, involving machine learning and neural language models for extraction accuracy. Results Among 4584 annotated instances, 2659 (58%) contained an affirmed mention of travel history, while 347 (7.6%) were negated. Interannotator agreement resulted in a document-level Cohen kappa of 0.776. Automated text processing accuracy (F1 85.6, 95% CI 82.5-87.9) and computational burden were acceptable such that the system can provide a rapid screen for public health events. Conclusions Automated extraction of patient travel history from clinical documents is feasible for enhanced passive surveillance public health systems. Without such a system, it would usually be necessary to manually review charts to identify recent travel or lack of travel, use an electronic health record that enforces travel history documentation, or ignore this potential source of information altogether. The development of this tool was initially motivated by emergent arboviral diseases. More recently, this system was used in the early phases of response to COVID-19 in the United States, although its utility was limited to a relatively brief window due to the rapid domestic spread of the virus. Such systems may aid future efforts to prevent and contain the spread of infectious diseases.

Download Full-text

Ethnocultural Competence Development in Teaching the English Language to Kalmyk High School Students

Proceedings of Southern Federal University Philology ◽

10.18522/1995-0640-2021-1-182-192 ◽

2021 ◽

Vol 2021 (1) ◽

pp. 182-192

Author(s):

Tatiana V. Basanova

Keyword(s):

High School ◽

High School Students ◽

Identity Development ◽

English Language ◽

Competence Development ◽

English Language Teaching ◽

Complex Nature ◽

School Students ◽

Socialization Process ◽

Related Information

Developing ethnocultural competence by teaching a foreign language is considered to be a contribution to ethnic identity development that is the aim of ethnic socialization process. The present article describes the content of the English language teaching in the process of ethnocultural competence development. Thematic and procedural aspects are distinguished. Each one has a complex nature and contributes to profound considering ethnic related information by Kalmyk students at high school.

Download Full-text

Exploring the Adaptation of Recurrent Neural Network Approaches for Extracting Drug–Drug Interactions from Biomedical Text

International Journal of Machine Learning and Computing ◽

10.18178/ijmlc.2021.11.4.1046 ◽

2021 ◽

Vol 11 (4) ◽

pp. 267-273

Author(s):

Wen-Juan Hou ◽

◽

Bamfa Ceesay

Keyword(s):

Text Processing ◽

Named Entity Recognition ◽

Event Extraction ◽

Entity Recognition ◽

Biomedical Text ◽

Automatic Extraction ◽

Named Entity ◽

Structured Information ◽

Network Approaches ◽

Form Information

Information extraction (IE) is the process of automatically identifying structured information from unstructured or partially structured text. IE processes can involve several activities, such as named entity recognition, event extraction, relationship discovery, and document classification, with the overall goal of translating text into a more structured form. Information on the changes in the effect of a drug, when taken in combination with a second drug, is known as drug–drug interaction (DDI). DDIs can delay, decrease, or enhance absorption of drugs and thus decrease or increase their efficacy or cause adverse effects. Recent research trends have shown several adaptation of recurrent neural networks (RNNs) from text. In this study, we highlight significant challenges of using RNNs in biomedical text processing and propose automatic extraction of DDIs aiming at overcoming some challenges. Our results show that the system is competitive against other systems for the task of extracting DDIs.

Download Full-text

Gender Bias in the News: A Scalable Topic Modelling and Visualization Framework

Frontiers in Artificial Intelligence ◽

10.3389/frai.2021.664737 ◽

2021 ◽

Vol 4 ◽

Author(s):

Prashanth Rao ◽

Maite Taboada

Keyword(s):

English Language ◽

Latent Dirichlet Allocation ◽

Topic Modelling ◽

Monthly Basis ◽

Policy Makers ◽

Gender Parity ◽

News Organizations ◽

Gender Based ◽

News Corpus ◽

Key Events

We present a topic modelling and data visualization methodology to examine gender-based disparities in news articles by topic. Existing research in topic modelling is largely focused on the text mining of closed corpora, i.e., those that include a fixed collection of composite texts. We showcase a methodology to discover topics via Latent Dirichlet Allocation, which can reliably produce human-interpretable topics over an open news corpus that continually grows with time. Our system generates topics, or distributions of keywords, for news articles on a monthly basis, to consistently detect key events and trends aligned with events in the real world. Findings from 2 years worth of news articles in mainstream English-language Canadian media indicate that certain topics feature either women or men more prominently and exhibit different types of language. Perhaps unsurprisingly, topics such as lifestyle, entertainment, and healthcare tend to be prominent in articles that quote more women than men. Topics such as sports, politics, and business are characteristic of articles that quote more men than women. The data shows a self-reinforcing gendered division of duties and representation in society. Quoting female sources more frequently in a caregiving role and quoting male sources more frequently in political and business roles enshrines women’s status as caregivers and men’s status as leaders and breadwinners. Our results can help journalists and policy makers better understand the unequal gender representation of those quoted in the news and facilitate news organizations’ efforts to achieve gender parity in their sources. The proposed methodology is robust, reproducible, and scalable to very large corpora, and can be used for similar studies involving unsupervised topic modelling and language analyses.

Download Full-text

The Importance of Multilingual Information and Plain English in Response to the COVID-19 Pandemic

European Scientific Journal ESJ ◽

10.19044/esj.2021.v17n30p37 ◽

2021 ◽

Vol 17 (30) ◽

pp. 37

Author(s):

Barbara Cappuzzo

Keyword(s):

Limited English Proficiency ◽

English Language ◽

Human Beings ◽

Health Crisis ◽

Demographic Groups ◽

Related Information ◽

Information Accessibility ◽

Essential Components ◽

One Year ◽

The Impact

Health is a common issue for all human beings. As a consequence, everyone in the world has in some way to cope with the language of medicine. This is true now more than ever due to the global health crisis caused by the current COVID-19 pandemic, which has introduced a great amount of terms, previously mostly used by epidemiologists and statisticians, but which now have entered the daily lexicon of many languages. As the medium of international scientific communication, English is the language of worldwide information about the pandemic, and the main source of terms and expressions for other languages. The impact of the COVID-19 pandemic on English lexicon has been so deep that the Oxford English Dictionary Online (OED) released special updates in 2020 to fulfil the need to document the phenomenon. However, previous studies (Khan et al. 2020; Deang and Salazar 2021) have highlighted the important question concerning the existence of several ethnic minorities who have Limited English Proficiency (LEP) and therefore do not receive sufficient and appropriate information to defend themselves adequately against SARS-CoV-2, the virus we have all been fighting for more than one year now. The aim of this study is to highlight the importance of language and translation as essential components to provide all demographic groups/communities with access to COVID-19-related information in languages other than English and enable them to follow official health key rules. The main websites of Italian governmental and nongovernmental institutions were investigated, and the analysis focused on the availability and type of content of the multilingual material, as well as on information accessibility and clarity. The results showed important differences in the number of available languages and, even more, in the level of intelligibility of COVID-19 material in the English language. In this respect, this study intends to foster the use of plain English in the dissemination material provided by the websites of the main healthcare public institutions in Italy, a country with an ever-increasing number of registered foreigners, the majority born in non-EU countries.

Download Full-text

Topic Modelling in Bangla Language: An LDA Approach to Optimize Topics and News Classification

Computer and Information Science ◽

10.5539/cis.v11n4p77 ◽

2018 ◽

Vol 11 (4) ◽

pp. 77 ◽

Cited By ~ 2

Author(s):

Malek Mouhoub ◽

Mustakim Al Helal

Keyword(s):

Topic Modeling ◽

Text Categorization ◽

English Language ◽

Latent Dirichlet Allocation ◽

Similarity Measures ◽

Document Collections ◽

Statistical Language Modeling ◽

Document Models ◽

Wide Range ◽

News Corpus

Topic modeling is a powerful technique for unsupervised analysis of large document collections. Topic models have a wide range of applications including tag recommendation, text categorization, keyword extraction and similarity search in the text mining, information retrieval and statistical language modeling. The research on topic modeling is gaining popularity day by day. There are various efficient topic modeling techniques available for the English language as it is one of the most spoken languages in the whole world but not for the other spoken languages. Bangla being the seventh most spoken native language in the world by population, it needs automation in different aspects. This paper deals with finding the core topics of Bangla news corpus and classifying news with similarity measures. The document models are built using LDA (Latent Dirichlet Allocation) with bigram.

Download Full-text

Data mining for building knowledge bases: techniques, architectures and applications

The Knowledge Engineering Review ◽

10.1017/s0269888916000047 ◽

2016 ◽

Vol 31 (2) ◽

pp. 97-123 ◽

Cited By ~ 4

Author(s):

Alfred Krzywicki ◽

Wayne Wobcke ◽

Michael Bain ◽

John Calvo Martinez ◽

Paul Compton

Keyword(s):

Data Mining ◽

Knowledge Base ◽

Question Answering ◽

Knowledge Bases ◽

Event Extraction ◽

Data Sources ◽

Small Scale ◽

Knowledge Mining ◽

Practical Applications ◽

Unstructured Text

AbstractData mining techniques for extracting knowledge from text have been applied extensively to applications including question answering, document summarisation, event extraction and trend monitoring. However, current methods have mainly been tested on small-scale customised data sets for specific purposes. The availability of large volumes of data and high-velocity data streams (such as social media feeds) motivates the need to automatically extract knowledge from such data sources and to generalise existing approaches to more practical applications. Recently, several architectures have been proposed for what we callknowledge mining: integrating data mining for knowledge extraction from unstructured text (possibly making use of a knowledge base), and at the same time, consistently incorporating this new information into the knowledge base. After describing a number of existing knowledge mining systems, we review the state-of-the-art literature on both current text mining methods (emphasising stream mining) and techniques for the construction and maintenance of knowledge bases. In particular, we focus on mining entities and relations from unstructured text data sources, entity disambiguation, entity linking and question answering. We conclude by highlighting general trends in knowledge mining research and identifying problems that require further research to enable more extensive use of knowledge bases.

Download Full-text

Cascading Forgetting in Product Development Challenges and Evaluation

Proceedings of the Design Society: International Conference on Engineering Design ◽

10.1017/dsi.2019.259 ◽

2019 ◽

Vol 1 (1) ◽

pp. 2527-2536

Author(s):

Patricia Kügler ◽

Claudia Schon ◽

Benjamin Schleich ◽

Steffen Staab ◽

Sandro Wartzack

Keyword(s):

Product Design ◽

Information Overload ◽

Relevant Information ◽

Knowledge Bases ◽

Wear Behaviour ◽

Combustion Engines ◽

The Novel ◽

Related Information ◽

Intentional Forgetting ◽

Novel Approach

AbstractVast amounts of information and knowledge is produced and stored within product design projects. Especially for reuse and adaptation there exists no suitable method for product designers to handle this information overload. Due to this, the selection of relevant information in a specific development situation is time-consuming and inefficient. To tackle this issue, the novel approach Intentional Forgetting (IF) is applied for product design, which aims to support reuse and adaptation by reducing the vast amount of information to the relevant. Within this contribution an IF-operator called Cascading Forgetting is introduced and evaluated, which was implemented for forgetting related information elements in ontology knowledge bases. For the evaluation the development process of a test-rig for studying friction and wear behaviour of the cam/tappet contact in combustion engines is analysed. Due to the interdisciplinary task of the evaluation and the characteristics of semantic model, challenges are discussed. In conclusion, the focus of the evaluation is to consider how reliable the Cascading Forgetting works and how intuitive ontology-based representations appear to engineers.

Download Full-text

Lost in Space: Geolocation in Event Data

Political Science Research and Methods ◽

10.1017/psrm.2018.23 ◽

2018 ◽

Vol 7 (04) ◽

pp. 871-888 ◽

Cited By ~ 6

Author(s):

Sophie J. Lee ◽

Howard Liu ◽

Michael D. Ward

Keyword(s):

Learning Algorithm ◽

Text Processing ◽

Contextual Information ◽

Training Data ◽

Supervised Machine Learning ◽

Model Parameters ◽

Event Data ◽

Data Set ◽

N Gram ◽

Automated Text Processing

Improving geolocation accuracy in text data has long been a goal of automated text processing. We depart from the conventional method and introduce a two-stage supervised machine-learning algorithm that evaluates each location mention to be either correct or incorrect. We extract contextual information from texts, i.e., N-gram patterns for location words, mention frequency, and the context of sentences containing location words. We then estimate model parameters using a training data set and use this model to predict whether a location word in the test data set accurately represents the location of an event. We demonstrate these steps by constructing customized geolocation event data at the subnational level using news articles collected from around the world. The results show that the proposed algorithm outperforms existing geocoders even in a case added post hoc to test the generality of the developed algorithm.

Download Full-text