Automated Text Processing
Recently Published Documents

TOTAL DOCUMENTS: 18 (FIVE YEARS: 10)
H-INDEX: 3 (FIVE YEARS: 0)

2021 ◽  
pp. 1-28
Author(s):  
Ali Hürriyetoğlu ◽  
Erdem Yörük ◽  
Osman Mutlu ◽  
Fırat Duruşan ◽  
Çağrı Yoltar ◽  
...  

Abstract We describe a gold standard corpus of protest events drawn from local and international English-language news sources across multiple countries. The corpus contains document-, sentence-, and token-level annotations. It facilitates building machine learning models that automatically classify news articles and extract protest event-related information, and it supports the construction of knowledge bases that enable comparative social and political science studies. For each news source, annotation starts with random samples of news articles and continues with samples drawn using active learning. Each batch of samples is annotated by two social and political scientists, adjudicated by an annotation supervisor, and improved by identifying annotation errors semi-automatically. We found that the corpus possesses the variety and quality necessary to develop and benchmark text classification and event extraction systems in a cross-context setting, contributing to the generalizability and robustness of automated text processing systems. This corpus and the reported results establish a common foundation for automated protest event collection studies, which is currently lacking in the literature.
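
The batch construction described in the abstract follows a standard active-learning pattern: a model trained on the labelled pool scores the unlabelled pool, and the least confident articles go to the annotators next. A minimal uncertainty-sampling sketch (the classifier, features, and batch size here are illustrative assumptions, not the authors' pipeline):

```python
# Minimal uncertainty-sampling loop (illustrative; not the authors' code).
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def next_annotation_batch(labelled_texts, labels, unlabelled_texts, batch_size=50):
    """Pick the unlabelled articles the current model is least sure about."""
    vec = TfidfVectorizer()
    X_lab = vec.fit_transform(labelled_texts)
    X_unlab = vec.transform(unlabelled_texts)
    clf = LogisticRegression(max_iter=1000).fit(X_lab, labels)
    confidence = clf.predict_proba(X_unlab).max(axis=1)
    picked = np.argsort(confidence)[:batch_size]   # least confident first
    return [unlabelled_texts[i] for i in picked]
```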


10.2196/26719 ◽  
2021 ◽  
Vol 7 (3) ◽  
pp. e26719
Author(s):  
Kelly S Peterson ◽  
Julia Lewis ◽  
Olga V Patterson ◽  
Alec B Chapman ◽  
Daniel W Denhalter ◽  
...  

Background Patient travel history can be crucial in evaluating evolving infectious disease events. Such information can be challenging to acquire in electronic health records, as it is often available only in unstructured text. Objective This study aims to assess the feasibility of annotating and automatically extracting travel history mentions from unstructured clinical documents in the Department of Veterans Affairs across disparate health care facilities and among millions of patients. Information about travel exposure augments existing surveillance applications for increased preparedness in responding quickly to public health threats. Methods Clinical documents related to arboviral disease were selected using a semiautomated bootstrapping process and then annotated. Using the annotated instances as training data, models were developed to extract from unstructured clinical text any mention of affirmed travel to locations outside the continental United States. Automated text processing models, including machine learning and neural language models, were evaluated for extraction accuracy. Results Among 4584 annotated instances, 2659 (58%) contained an affirmed mention of travel history, while 347 (7.6%) were negated. Interannotator agreement yielded a document-level Cohen kappa of 0.776. Automated text processing accuracy (F1 85.6, 95% CI 82.5-87.9) and computational burden were acceptable, such that the system can provide a rapid screen for public health events. Conclusions Automated extraction of patient travel history from clinical documents is feasible and can enhance passive surveillance in public health systems. Without such a system, it would usually be necessary to manually review charts to identify recent travel or lack of travel, use an electronic health record that enforces travel history documentation, or ignore this potential source of information altogether. The development of this tool was initially motivated by emergent arboviral diseases. More recently, the system was used in the early phases of the response to COVID-19 in the United States, although its utility was limited to a relatively brief window due to the rapid domestic spread of the virus. Such systems may aid future efforts to prevent and contain the spread of infectious diseases.
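
The extraction task pairs location recognition with an affirmed/negated status, which is what the reported 58% affirmed vs 7.6% negated split refers to. A toy rule-based baseline for this kind of screen might look as follows (the gazetteer, negation cues, and sentence windowing are illustrative assumptions, not the VA system):

```python
import re

# Tiny illustrative gazetteer and negation cues (assumptions, not the VA lexicons).
LOCATIONS = {"mexico", "brazil", "puerto rico", "philippines", "italy"}
NEGATION_RE = re.compile(r"\b(denies|denied|no|without|negative for)\b")

def extract_travel_mentions(note):
    """Return (location, affirmed) pairs, checked sentence by sentence."""
    mentions = []
    for sentence in re.split(r"(?<=[.!?])\s+", note.lower()):
        for loc in LOCATIONS:
            if loc in sentence:
                mentions.append((loc, NEGATION_RE.search(sentence) is None))
    return mentions

print(extract_travel_mentions(
    "Patient returned from Brazil last week. Denies travel to Mexico."))
# -> [('brazil', True), ('mexico', False)]
```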


Author(s):  
Amir Adel Mabrouk Eldeib ◽  
Moulay Ibrahim El-Khalil Ghembaza

The science of diacritical marks is closely tied to the Holy Quran, where it was introduced to remove confusion and error from the reader's pronunciation; any technique applied to the processing of Quranic texts therefore facilitates the work of researchers in Quranic studies, whether that of the reciter, who is helped toward accurate and correct recitation, or that of the tutor, who can compile suitable training examples. The importance of this research lies in employing automated text-processing algorithms to locate the types of Nunation vowelization in the Holy Quran and in computerizing them, both to facilitate accurate recitation and to collect training examples in a database or corpus for future use in research and software applications for the Holy Quran and its sciences. This research proposes a framework architecture that automatically identifies the locations and types of Nunation in the Holy Quran: a part-of-speech tagging algorithm for Arabic determines the type of each word, a knowledge base then identifies candidate Nunation words and their locations, and finally the type of Nunation is determined, fixing the vowelization of the last letter of each Nunation word according to the science of Quranic diacritical marks. A further benefit is linking search processes over Quranic texts to the extraction of composition Nunation and sequence Nunation, displaying them as data according to options the user selects through suitable application interfaces. The basic elements that search results over Quranic texts should display in order to extract the positions and types of Nunation vowelizations are highlighted, and a template for the results of searching all types of Nunation in a specific Quranic chapter is given, with several options to retrieve all data in detail.
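
At the orthographic level, Nunation (tanween) is marked by three diacritics with fixed Unicode code points, so a first-pass locator can be written directly against vowelized text; the proposed framework's part-of-speech and knowledge-base stages would then refine such candidates. A minimal sketch (the tokenization and output format are assumptions):

```python
# Locate Nunation (tanween) marks in vowelized Arabic text.
# The three tanween diacritics have fixed Unicode code points.
TANWEEN = {
    "\u064B": "fathatan (accusative Nunation)",
    "\u064C": "dammatan (nominative Nunation)",
    "\u064D": "kasratan (genitive Nunation)",
}

def find_nunation(text):
    """Yield (word_index, word, nunation_type) for each nunated word."""
    for i, word in enumerate(text.split()):
        for mark, label in TANWEEN.items():
            if mark in word:
                yield i, word, label

phrase = "إِنَّ اللَّهَ بِكُلِّ شَيْءٍ عَلِيمٌ"  # a Quranic phrase with two nunated words
for hit in find_nunation(phrase):
    print(hit)
# (3, 'شَيْءٍ', 'kasratan (genitive Nunation)')
# (4, 'عَلِيمٌ', 'dammatan (nominative Nunation)')
```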


Author(s):  
V. P. Leonov

In recent times, marked by the large-scale advance of computer technology, numerous libraries and scientific and educational centers around the world have been creating extensive databases of literary and bibliographic texts. Confronted with such databases, the close reading method, designed for work with individual texts, would seem to lose its meaning. The Italian sociologist and literary critic Franco Moretti became the principal critic of close reading. He presented his ideas in the book "Distant Reading", which can be viewed as a program for updating the methodology of studying world literature. Moretti believes that world literature should be studied not by looking at details but by examining it from a long distance, analyzing hundreds and thousands of texts. He suggests using Digital Humanities (DH) methods, i.e., applying digital (computer) methods in the humanities. To explain why certain types of texts survive, Moretti compares literary processes with biological ones and draws an analogy between natural selection and reader selection. Moretti's predecessor, who first used quantitative methods in literary studies and saw common ground between literary and biological processes, was B. I. Yarkho (1889–1942), the author of the fundamental monograph "Methodology of an Exact Study of Literature". Moretti's "Distant Reading" shatters the stereotypes of the bibliographic environment: it is directed not at close (slow) reading but at the study of the entire world documentary flow. This approach opens the way to the use of quantitative methods in the study of world bibliography. A new research strategy, the "exact study of bibliography", will take shape as part of digital and automated text processing.


2019 ◽  
Vol 9 (3) ◽  
Author(s):  
Yi-fang Brook Wu ◽  
Xin Chen

Research on distance learning and computer-aided grading has developed in parallel, yet little past work has joined the two areas to solve the problem of automated learning assessment in virtual classrooms. This paper presents a model for learning assessment that uses an automated text processing technique to analyze class messages, with an emphasis on the course topics produced in an online class. Students should be evaluated on many dimensions, including learning artifacts such as submitted course work and class participation. Taking these grading criteria into consideration, we design a model that combines three grading factors for evaluating student performance: the quality of course work, the quantity of effort, and the activeness of participation. These three factors are measured on the basis of keyword contribution, message length, and message count, and a score is derived from the class messages to evaluate students' performance. An assessment model is then constructed from these three measures to compute a performance indicator score for each student. The experiment shows a high correlation between the performance indicator scores and the actual grades assigned by instructors, and the rank orders of students by the two are highly correlated as well. This evidence suggests that the computer grader can be a valuable supplementary teaching and grading tool for distance learning instructors.
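
As a minimal sketch of how the three measures might be combined into a performance indicator (the weights, min-max normalization, and keyword scoring below are assumptions; the paper defines its own formulation):

```python
# Minimal sketch: weighted combination of keyword contribution,
# message length, and message count (weights and normalization assumed).
from collections import Counter

def raw_measures(messages, course_keywords):
    tokens = [t.lower() for m in messages for t in m.split()]
    counts = Counter(tokens)
    quality = sum(counts[k.lower()] for k in course_keywords) / max(len(tokens), 1)
    return quality, len(tokens), len(messages)  # quality, effort, participation

def performance_indicators(students, course_keywords, weights=(0.5, 0.25, 0.25)):
    """students: dict mapping name -> list of class messages.
    Each measure is min-max normalized across the class, then weighted."""
    raw = {s: raw_measures(m, course_keywords) for s, m in students.items()}
    scores = {s: 0.0 for s in raw}
    for dim, w in enumerate(weights):
        vals = [r[dim] for r in raw.values()]
        lo, hi = min(vals), max(vals)
        for s in raw:
            scores[s] += w * ((raw[s][dim] - lo) / (hi - lo) if hi > lo else 0.0)
    return scores
```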


2019 ◽  
Author(s):  
Herwig Unger ◽  
Mario M. Kubek

Centroid terms are single, descriptive words that semantically and topically characterise text documents and can thus act as a very compact representation of them in automated text processing tasks that rely strongly on the semantic similarity of texts; algorithms for classifying and clustering documents make use of this information. In this book, the novel, brain- and physics-inspired concept of centroid terms is introduced and discussed in depth, and their unique properties and practical usage in major natural language processing and text mining tasks are covered. In this regard, a new graph-based method for their fast calculation is presented as well. In contrast to methods relying on the bag-of-words model, the derived centroid distance measure can uncover a topical relationship between texts even when their wording differs. As centroid terms can also represent short texts, the presented first fully integrated, P2P-based web search engine, called "WebEngine", therefore makes heavy use of...
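
One common reading of such a graph-based method: on a word co-occurrence graph, the centroid term of a document is the term with the smallest average distance to the document's terms. A minimal sketch under that reading (the graph construction and distance definition are assumptions, not necessarily the book's exact algorithm):

```python
# Sketch of a centroid-term computation on a word co-occurrence graph.
import networkx as nx

def cooccurrence_graph(documents, window=2):
    """Build a word co-occurrence graph from tokenized documents."""
    G = nx.Graph()
    for doc in documents:
        for i, w in enumerate(doc):
            for v in doc[i + 1 : i + window + 1]:
                if w != v:
                    G.add_edge(w, v)
    return G

def centroid_term(G, doc_terms):
    """Term in G minimizing the average shortest-path distance
    to the document's terms."""
    terms = [t for t in doc_terms if t in G]
    best, best_avg = None, float("inf")
    for cand in G.nodes:
        dist = nx.single_source_shortest_path_length(G, cand)
        ds = [dist[t] for t in terms if t in dist]
        if ds and sum(ds) / len(ds) < best_avg:
            best, best_avg = cand, sum(ds) / len(ds)
    return best
```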


Author(s):  
Galina Khodiakova

The article describes one approach to teaching a course in quantitative linguistics. Although the course is relatively new, some teaching traditions have already formed: usually the emphasis is on a range of mathematical methods and methods of mathematical statistics, and students studying quantitative linguistics under such programs are required to have a deep understanding of the corresponding subject areas. From the beginning of the 2000s, text-processing computer programs were actively developed, and there are examples of their use in teaching. Supporting and updating those earlier programs is no longer of current interest: online services developed by large corporations are now widely used for the analysis and processing of linguistic information, and recent ones offer much better quality, reliability, and availability than their predecessors, so they can be used successfully in teaching. The goal of this article is to describe the capabilities of modern computer tools for the analysis of textual information and the methods of their use in teaching students a course in quantitative linguistics. The functionality of a number of popular online text processing and analysis services is described, followed by examples of practical work on the following topics: frequency characteristics of a text, Zipf's law, semantic text analysis, Greenberg's typological indices, grammatical text analysis, and the building of semantic graphs. Computer text processing is also used in the phonosemantic analysis of words and texts, in identifying the author of a text, and in finding the amount of information in a linguistic unit. In the curriculum for students specializing in applied linguistics, the discipline "Quantitative Linguistics" is allocated 3 credits, 10 hours of lectures, and 20 hours of workshops. In the future, the course may be developed further by extending the list of topics on computer text-processing methods and by deepening knowledge through the study of automated text-processing algorithms; this can be a subject for further research in the teaching of quantitative linguistics.
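
Several of the listed exercises (frequency characteristics of a text, Zipf's law) reduce to a rank-frequency computation that students can reproduce in a few lines. A minimal sketch (the corpus path is a placeholder; the check that rank times frequency stays roughly constant is the usual classroom formulation, not a specific online service):

```python
# Rank-frequency table; under Zipf's law, rank * frequency stays roughly constant.
import re
from collections import Counter

def rank_frequency(text, top=10):
    words = re.findall(r"[a-zа-яё']+", text.lower())  # Latin and Cyrillic words
    return [(rank, word, freq, rank * freq)
            for rank, (word, freq) in enumerate(Counter(words).most_common(top), 1)]

# "corpus.txt" is a placeholder path for any plain-text corpus.
for rank, word, freq, product in rank_frequency(open("corpus.txt").read()):
    print(f"{rank:>4} {word:<15} {freq:>6} {product:>8}")
```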

