Grammatic and semantic normativity of linguistic units and features as a factor of automatic text processing

Author(s): Z. M. Shalyapina

Author(s): Valeriya K. Marchenko

This article studies the role of the question-answer form of writing in F.M. Dostoevsky's "Writer’s Diary", using automatic text processing programs.


Author(s): Francisco M. Couto, Mário J. Silva, Vivian Lee, Emily Dimmer, Evelyn Camon, ...

Molecular biology research projects have produced vast amounts of data, part of which has been preserved in a variety of public databases. However, a large portion of the data contains a significant number of errors and therefore requires careful verification by curators, a painful and costly task, before it is reliable enough to support valid conclusions. At the same time, research in biomedical information retrieval and information extraction is now delivering text mining solutions that can help curators work more efficiently and deliver better data resources. Over the past decades, automatic text processing systems have successfully exploited the biomedical scientific literature to reduce the effort researchers need to keep up to date, but many of these systems still rely on manually integrated domain knowledge, which leads to unnecessary overheads and restrictions on their use. A more efficient approach would acquire the domain knowledge automatically from publicly available biological sources, such as BioOntologies, rather than relying on manually inserted domain knowledge. An example of this approach is GOAnnotator, a tool that assists the verification of uncurated protein annotations. It provided correct evidence text to the curators at 93% precision and thus achieved promising results. GOAnnotator was implemented as a web tool that is freely available at http://xldb.di.fc.ul.pt/rebil/tools/goa/.


SCITECH Nepal, 2018, Vol 13 (1), pp. 64-69
Author(s): Dinesh Dangol, Rupesh Dahi Shrestha, Arun Timalsina

With the growing trend of publishing news online, automatic text processing is becoming increasingly important. Automatic text classification has been a focus of researchers working on many languages for decades. A large body of research exists on features of the English language and their use in automated text processing. This research implements key features of the Nepali language for the automatic classification of Nepali news. Studying the impact of Nepali-specific features, which differ greatly from those of English, is more challenging because of the greater complexity involved. Experiments using the vector space model, the n-gram model, and key-feature-based processing specific to Nepali show promising results compared to the bag-of-words model for automated Nepali news classification.
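The contrast between a bag-of-words representation and an n-gram representation in a vector space model can be illustrated with a minimal sketch. This is not the method of the paper above: the functions `word_ngrams`, `vectorize`, and `cosine` and the two toy documents are hypothetical, and whitespace tokenization stands in for whatever Nepali-specific processing the authors used.

```python
from collections import Counter
from math import sqrt


def word_ngrams(text, n):
    """Return the word n-grams (as tuples) of a whitespace-tokenized text."""
    tokens = text.split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def vectorize(text, n=1):
    """Map a text to a sparse count vector over its word n-grams.

    With n=1 this is a plain bag-of-words vector.
    """
    return Counter(word_ngrams(text, n))


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = sqrt(sum(c * c for c in u.values()))
    norm_v = sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy documents: same words, different order.
doc_a = "team wins final match"
doc_b = "final match wins team"

# Bag-of-words ignores word order, so the two documents look identical.
unigram_similarity = cosine(vectorize(doc_a, 1), vectorize(doc_b, 1))

# Bigrams keep local word order, so only the shared pair
# ("final", "match") contributes to the similarity.
bigram_similarity = cosine(vectorize(doc_a, 2), vectorize(doc_b, 2))
```

In this toy case the unigram similarity is 1 while the bigram similarity is about one third, which shows the kind of order information that n-gram features add on top of a bag-of-words model.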


1985, Vol 104 (4), pp. 696
Author(s): Peter C. Patton, Ferenc Postma, Eep Talstra, Marc Vervenne

Author(s): Valery A. Mishlanov, Liudmila A. Kadzhaya, Vladimir A. Salimovskiy, Ivan V. Smirnov, ...

This paper discusses how to improve the structure of a predicate word dictionary used for knowledge acquisition and text analysis. It presents the principle of an open dictionary architecture, which takes into account the stylistic differentiation of speech and describes the predicate word subsystems that function in separate speech varieties.


2020
Author(s): Eduardo Rosado, Miguel Garcia-Remesal Sr, Sergio Paraiso-Medina Sr, Alejandro Pazos Sr, Victor Maojo Sr

BACKGROUND: Existing biomedical literature repositories do not commonly provide users with specific means to locate and remotely access biomedical databases.
OBJECTIVE: To address this issue, we developed BiDI (Biomedical Database Inventory), a repository linking to biomedical databases automatically extracted from the scientific literature. BiDI provides an index of data resources and a seamless path to access them.
METHODS: We designed an ensemble of deep learning methods to extract database mentions. To train the system, we annotated a set of 1,242 articles that included mentions of database publications. This dataset was used, along with transfer learning techniques, to train an ensemble of deep learning NLP models on the task of database publication detection.
RESULTS: The system obtained an F1-score of 0.929 on database detection, showing high precision and recall values. Applying this model to the PubMed and PubMed Central databases, we identified over 10,000 unique databases. The ensemble also extracts the web links to the reported databases, discarding irrelevant links. For the extraction of web links, the model achieved a cross-validated F1-score of 0.908. We show two use cases, related to “omics” and the COVID-19 pandemic.
CONCLUSIONS: BiDI enables access to biomedical resources over the Internet and facilitates data-driven research and other scientific initiatives. The repository is available at http://gib.fi.upm.es/bidi/ and will be regularly updated with an automatic text processing pipeline. The approach can be reused to create repositories of different types (biomedical and others).
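For readers unfamiliar with the F1-score reported above, the following sketch shows how it is derived from precision and recall. The function name `precision_recall_f1` and the example counts are hypothetical and are not taken from the BiDI evaluation; the abstract does not report the underlying counts.

```python
def precision_recall_f1(true_pos, false_pos, false_neg):
    """Compute precision, recall, and their harmonic mean (F1).

    Each ratio falls back to 0.0 when its denominator is zero.
    """
    predicted = true_pos + false_pos
    actual = true_pos + false_neg
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / actual if actual else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical counts for illustration only: 90 correct detections,
# 10 spurious detections, 30 missed database mentions.
precision, recall, f1 = precision_recall_f1(true_pos=90, false_pos=10, false_neg=30)
```

Because F1 is a harmonic mean, it is pulled toward the lower of the two values, so a high F1-score can only be reached when both precision and recall are high.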


Author(s):  
Andrei Mikheev

Electronic text is essentially just a sequence of characters, but the majority of text processing tools operate in terms of linguistic units such as words and sentences. Tokenization is the process of segmenting text into words, and sentence splitting is the process of determining sentence boundaries in the text. In this chapter we describe the major challenges for text tokenization and sentence splitting in different languages, and outline various computational approaches to tackling them.
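The two tasks can be sketched with a small rule-based example for English. This is only an illustration of the problem, not an approach taken from the chapter: the functions `tokenize` and `split_sentences` and the tiny `ABBREVIATIONS` list are hypothetical, and real systems need far richer rules or trained models, especially for languages without whitespace word boundaries.

```python
import re

# Hypothetical, deliberately tiny abbreviation list (lowercase, with period).
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "prof.", "e.g.", "i.e.", "etc."}


def tokenize(text):
    """Split text into word tokens and single-character punctuation tokens."""
    # \w+(?:[-']\w+)* keeps hyphenated words and contractions together;
    # [^\w\s] emits any other non-space character as its own token.
    return re.findall(r"\w+(?:[-']\w+)*|[^\w\s]", text)


def split_sentences(text):
    """Split text into sentences at '.', '!' or '?' unless the
    whitespace-delimited chunk is a known abbreviation."""
    sentences, current = [], []
    for chunk in text.split():
        current.append(chunk)
        ends_sentence = chunk.endswith((".", "!", "?"))
        if ends_sentence and chunk.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:  # trailing text without final punctuation
        sentences.append(" ".join(current))
    return sentences


tokens = tokenize("Don't split state-of-the-art, please.")
sentences = split_sentences("Dr. Smith arrived. He was late!")
```

The abbreviation check illustrates why a period is an ambiguous boundary marker: without it, "Dr." would be treated as the end of a sentence.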

