AUTOMATED TEXT PROCESSING: TOPIC SEGMENTATION OF EDUCATIONAL TEXTS

Author(s):  
Marina Solnyshkina ◽  
Iskander Yarmakeev ◽  
Elzara Gafiyatova ◽  
Farida Ismaeva
10.2196/26719 ◽  
2021 ◽  
Vol 7 (3) ◽  
pp. e26719
Author(s):  
Kelly S Peterson ◽  
Julia Lewis ◽  
Olga V Patterson ◽  
Alec B Chapman ◽  
Daniel W Denhalter ◽  
...  

Background Patient travel history can be crucial in evaluating evolving infectious disease events. Such information can be challenging to acquire in electronic health records, as it is often available only in unstructured text. Objective This study aims to assess the feasibility of annotating and automatically extracting travel history mentions from unstructured clinical documents in the Department of Veterans Affairs across disparate health care facilities and among millions of patients. Information about travel exposure augments existing surveillance applications for increased preparedness in responding quickly to public health threats. Methods Clinical documents related to arboviral disease were annotated following selection using a semiautomated bootstrapping process. Using annotated instances as training data, models were developed to extract from unstructured clinical text any mention of affirmed travel locations outside of the continental United States. Automated text processing models, including machine learning and neural language models, were evaluated for extraction accuracy. Results Among 4584 annotated instances, 2659 (58%) contained an affirmed mention of travel history, while 347 (7.6%) were negated. Interannotator agreement resulted in a document-level Cohen kappa of 0.776. Automated text processing accuracy (F1 85.6, 95% CI 82.5-87.9) and computational burden were acceptable such that the system can provide a rapid screen for public health events. Conclusions Automated extraction of patient travel history from clinical documents is feasible for enhanced passive surveillance public health systems. Without such a system, it would usually be necessary to manually review charts to identify recent travel or lack of travel, use an electronic health record that enforces travel history documentation, or ignore this potential source of information altogether. The development of this tool was initially motivated by emergent arboviral diseases. More recently, this system was used in the early phases of response to COVID-19 in the United States, although its utility was limited to a relatively brief window due to the rapid domestic spread of the virus. Such systems may aid future efforts to prevent and contain the spread of infectious diseases.
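As a rough illustration of the kind of model the study evaluates (not the VA system itself), the sketch below trains a simple TF-IDF classifier to separate affirmed travel mentions from negated or absent ones; the snippets, labels, and model choice are assumptions for illustration only.

```python
# A minimal sketch of the core task: classify clinical-text snippets as
# affirmed travel mentions vs. not. The toy examples and labels are invented
# purely for illustration and do not reflect the study's actual data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_snippets = [
    "Patient returned from a two-week trip to Brazil last month.",
    "Recent travel to the Dominican Republic for a family visit.",
    "Patient denies any travel outside the continental United States.",
    "No recent travel history reported.",
]
train_labels = [1, 1, 0, 0]  # 1 = affirmed travel mention, 0 = negated/absent

# TF-IDF over word unigrams and bigrams feeding a linear classifier, standing
# in for the machine learning models evaluated in the study.
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
model.fit(train_snippets, train_labels)

print(model.predict(["Traveled to Puerto Rico two weeks before symptom onset."]))
```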


SCITECH Nepal ◽  
2018 ◽  
Vol 13 (1) ◽  
pp. 64-69
Author(s):  
Dinesh Dangol ◽  
Rupesh Dahi Shrestha ◽  
Arun Timalsina

With the increasing trend of publishing news online, automated text processing is becoming more and more important. Automatic text classification has been a focus of many researchers in different languages for decades, and there is a large body of research on features of the English language and their use in automated text processing. This research applies key features of the Nepali language to the automatic classification of Nepali news. Studying the impact of Nepali-language features, which differ substantially from those of English, is particularly challenging because of the higher level of linguistic complexity involved. Experiments using a vector space model, an n-gram model, and processing based on features specific to Nepali show promising results compared with a bag-of-words model for the task of automated Nepali news classification.
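The reported comparison can be sketched as follows; the placeholder documents stand in for real Devanagari news text with Nepali-specific preprocessing, and the model choices are assumptions rather than the authors' exact setup.

```python
# A rough sketch of comparing a bag-of-words baseline against n-gram features
# for news classification. Real Devanagari news text and Nepali-specific
# preprocessing (stemming, stopword removal) would replace the placeholders.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

docs = ["sports headline text ...", "politics headline text ...",
        "sports report text ...", "politics report text ..."]
labels = ["sports", "politics", "sports", "politics"]

bow_model = make_pipeline(CountVectorizer(), MultinomialNB())                       # bag-of-words baseline
ngram_model = make_pipeline(TfidfVectorizer(ngram_range=(1, 3)), MultinomialNB())   # word n-gram features

for name, model in [("bag-of-words", bow_model), ("n-gram", ngram_model)]:
    model.fit(docs, labels)
    print(name, model.predict(["new sports headline ..."]))
```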


2018 ◽  
Vol 7 (04) ◽  
pp. 871-888 ◽  
Author(s):  
Sophie J. Lee ◽  
Howard Liu ◽  
Michael D. Ward

Improving geolocation accuracy in text data has long been a goal of automated text processing. We depart from the conventional method and introduce a two-stage supervised machine-learning algorithm that evaluates each location mention as either correct or incorrect. We extract contextual information from texts, namely n-gram patterns for location words, mention frequency, and the context of sentences containing location words. We then estimate model parameters using a training data set and use this model to predict whether a location word in the test data set accurately represents the location of an event. We demonstrate these steps by constructing customized geolocation event data at the subnational level using news articles collected from around the world. The results show that the proposed algorithm outperforms existing geocoders even in a case added post hoc to test the generality of the developed algorithm.
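A simplified sketch of the second stage, classifying each candidate location mention as correct or incorrect from contextual features, might look like the following; the feature set and toy data are illustrative assumptions, not the authors' implementation.

```python
# Given candidate location mentions, predict whether each one correctly
# represents the event location. Features follow the abstract's description
# (surrounding words, mention frequency); the data is invented for illustration.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def mention_features(mention, sentence, doc_mention_counts):
    """Contextual features: neighbouring words and mention frequency in the document."""
    tokens = sentence.lower().split()
    i = tokens.index(mention.lower())
    return {
        "prev_word": tokens[i - 1] if i > 0 else "<s>",
        "next_word": tokens[i + 1] if i + 1 < len(tokens) else "</s>",
        "mention_freq": doc_mention_counts.get(mention, 1),
    }

# Toy training data: (mention, containing sentence, per-document counts, label).
train = [
    ("Aleppo", "clashes erupted in Aleppo on monday", {"Aleppo": 5}, 1),
    ("Geneva", "talks were announced in Geneva about the conflict", {"Geneva": 1}, 0),
]
X = [mention_features(m, s, c) for m, s, c, _ in train]
y = [label for *_, label in train]

clf = make_pipeline(DictVectorizer(), LogisticRegression())
clf.fit(X, y)
print(clf.predict([mention_features("Aleppo", "shelling hit Aleppo overnight", {"Aleppo": 3})]))
```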


Author(s):  
Amir Adel Mabrouk Eldeib ◽  
Moulay Ibrahim El-Khalil Ghembaza

The science of diacritical marks is closely tied to the Holy Quran, where it was used to remove confusion and error from the reader's pronunciation. Introducing any technique for processing Quranic texts therefore facilitates the work of researchers in the field of Quranic studies, whether the reader of the Quran, who is helped to recite accurately and correctly, or the tutor, who is helped to compile a suitable set of training examples. The importance of this research lies in employing automated text-processing algorithms to determine the locations of the Nunation vowelization types in the Holy Quran and in the possibility of computerizing them, both to facilitate accurate recitation and to collect training examples in a database or corpus for future use in research and software applications for the Holy Quran and its sciences. This research presents a new idea: a framework architecture that automatically identifies and discovers the locations and types of Nunation in the Holy Quran. It is based on a part-of-speech tagging algorithm for Arabic that determines the type of each word, followed by a knowledge base that discovers the appropriate Nunation words and their locations, and finally a step that determines the type of Nunation, i.e., the vowelization of the last letter of each Nunation word according to the science of Quranic diacritical marks. A further benefit is linking search processes with Quranic texts to extract the composition Nunation and the sequence Nunation that emerge from the science of Quranic diacritical marks, and displaying them as data according to a set of options selected by the user through suitable application interfaces. The basic elements that the results of searching Quranic texts should display are highlighted, in order to extract the positions and types of Nunation vowelizations, and a template for the results of searching all types of Nunation in a specific Quranic chapter is given, with several options for retrieving all data in detail.
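One step of the proposed framework, classifying the tanween (Nunation) type from the diacritics of an already diacritized word, can be sketched as below; the part-of-speech tagging and knowledge-base stages are only stubbed in comments, and the function is an illustrative assumption rather than the authors' system.

```python
# A minimal, framework-level sketch: locate words carrying a tanween (Nunation)
# mark in diacritized Arabic text and classify the tanween type. The POS-tagging
# and knowledge-base stages described in the abstract are not implemented here.

# Unicode code points for the three tanween (Nunation) diacritics.
TANWEEN_TYPES = {
    "\u064B": "fathatan",
    "\u064C": "dammatan",
    "\u064D": "kasratan",
}

def find_nunation(text: str):
    """Return (word, position, tanween type) for each word carrying a tanween mark."""
    results = []
    for position, word in enumerate(text.split()):
        # In the full framework, a POS tagger and knowledge base would first
        # confirm the word as a Nunation candidate; here we only inspect the
        # diacritics present in the word itself.
        for mark, name in TANWEEN_TYPES.items():
            if mark in word:
                results.append((word, position, name))
    return results

# Usage with any diacritized Quranic text string:
# for word, pos, kind in find_nunation(diacritized_text):
#     print(pos, kind, word)
```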


2012 ◽  
Vol 19 (5) ◽  
pp. 833-839 ◽  
Author(s):  
Kavishwar B Wagholikar ◽  
Kathy L MacLaughlin ◽  
Michael R Henry ◽  
Robert A Greenes ◽  
Ronald A Hankey ◽  
...  

Author(s):  
V. P. Leonov

In recent times, marked by the rapid and large-scale advance of computer technology, numerous libraries and scientific and educational centers around the world are creating their own extensive databases of literary and bibliographic texts. Faced with such databases, the close reading method, designed to work with specific texts, would seem to lose its meaning. The Italian sociologist and literary critic Franco Moretti became the main critic of close reading. He presented his ideas in the book "Distant Reading", which can be viewed as a program for updating the methodology of studying world literature. Moretti believes that world literature should be studied not by looking at details but by examining it from a long distance: studying hundreds and thousands of texts. He suggests using Digital Humanities (DH) methods, i.e., applying digital (computer) methods in the humanities. To show the reasons for the survival of certain types of texts, Moretti compares literary processes with biological ones and draws an analogy between natural selection and reader selection. Moretti's predecessor, who first used quantitative methods in literary studies and saw common ground between literary and biological processes, was B. I. Yarkho (1889-1942), author of the fundamental monograph "Methodology of an Exact Study of Literature". Moretti's book "Distant Reading" shatters stereotypes of the bibliographic environment. It is directed not at close (slow) reading but at the study of the entire world documentary flow. This approach opens the way to the use of quantitative methods in the study of world bibliography. A new research strategy, the "exact study of bibliography", will be formed as part of digital and automated text processing.


2021 ◽  
pp. 1-28
Author(s):  
Ali Hürriyetoğlu ◽  
Erdem Yörük ◽  
Osman Mutlu ◽  
Fırat Duruşan ◽  
Çağrı Yoltar ◽  
...  

Abstract We describe a gold standard corpus of protest events that comprises local and international English-language sources from a range of countries. The corpus contains document-, sentence-, and token-level annotations. It facilitates the creation of machine learning models that automatically classify news articles and extract protest event-related information, and of knowledge bases that enable comparative social and political science studies. For each news source, annotation starts with random samples of news articles and continues with samples drawn using active learning. Each batch of samples is annotated by two social and political scientists, adjudicated by an annotation supervisor, and improved by identifying annotation errors semi-automatically. We found that the corpus possesses the variety and quality necessary to develop and benchmark text classification and event extraction systems in a cross-context setting, contributing to the generalizability and robustness of automated text processing systems. This corpus and the reported results will establish a common foundation for automated protest event collection studies, which is currently lacking in the literature.
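The active learning step in the annotation workflow can be sketched with least-confidence sampling, as below; the data, model, and batch size are assumptions for illustration and not the authors' pipeline.

```python
# A schematic active learning loop: train on the labelled batch, then pick the
# articles the current classifier is least certain about for the next batch.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

labelled_docs = ["workers staged a strike at the plant",
                 "the council approved the annual budget"]
labelled_y = [1, 0]  # 1 = protest-related, 0 = not
unlabelled_docs = ["students marched against tuition fees",
                   "the museum opened a new exhibition",
                   "farmers blocked the highway in protest"]

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(labelled_docs, labelled_y)

# Least-confidence sampling: a lower maximum class probability means higher
# uncertainty, so those articles are sent to the annotators next.
probs = model.predict_proba(unlabelled_docs)
uncertainty = 1.0 - probs.max(axis=1)
batch_size = 2
next_batch = [unlabelled_docs[i] for i in np.argsort(-uncertainty)[:batch_size]]
print(next_batch)
```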

