Disambiguating the species of biomedical named entities using natural language parsers

Technology is becoming increasingly popular among agribusiness producers and is advancing in every area of agriculture. One difficulty in this context is handling natural-language data to solve agricultural problems. To build dialogs and support researchers, the present work uses Natural Language Processing (NLP) techniques to develop an automatic and effective computer system that interacts with the user and assists in identifying pests and diseases in soybean farming. The information is stored in a database repository so that the system can provide accurate diagnoses, simplifying the work of agricultural professionals and of others who deal with large amounts of information in this area. Information on 108 pests and 19 diseases that damage Brazilian soybean crops was collected from Brazilian bibliographic manuals in order to optimize the data and improve production. The spaCy library was used for the syntactic analysis: it made it possible to pre-process the texts, recognize named entities, calculate the similarity between words, and verify dependency parses, and it supported the development requirements of the CAROLINA tool (Robotized Agronomic Conversation in Natural Language), which uses the language of the agricultural domain.

Automated Trait Extraction using ClearEarth, a Natural Language Processing System for Text Mining in Natural Sciences

Biodiversity Information Science and Standards ◽ 10.3897/biss.2.26080 ◽ 2018 ◽ Vol 2 ◽ pp. e26080 ◽ Cited By ~ 1

Author(s): Anne Thessen ◽ Jenette Preciado ◽ Payoj Jain ◽ James Martin ◽ Martha Palmer ◽ ...

Keyword(s): Natural Language Processing ◽ Natural Language ◽ Language Processing ◽ Processing System ◽ Natural Sciences ◽ Phenotypic Traits ◽ Good Precision ◽ Named Entities ◽ Clinical Notes ◽ Linguistic Annotation

The cTAKES package (using the ClearTK Natural Language Processing toolkit; Bethard et al. 2014, http://cleartk.github.io/cleartk/) has been successfully used to automatically read clinical notes in the medical field (Albright et al. 2013, Styler et al. 2014). Dozens of medical institutions use it daily to process clinical notes automatically and extract relevant information. ClearEarth is a collaborative project that brings together computational linguists and domain scientists to port Natural Language Processing (NLP) modules trained on the same types of linguistic annotation to the fields of geology, cryology, and ecology. The goal for ClearEarth in the ecology domain is the extraction of ecologically relevant terms, including eco-phenotypic traits, from text and the assignment of those traits to taxa. Four annotators used Anafora (an annotation software; https://github.com/weitechen/anafora) to mark seven entity types (biotic, aggregate, abiotic, locality, quality, unit, value) and six reciprocal property types (synonym of/has synonym, part of/has part, subtype/supertype) in 133 documents, primarily from the Encyclopedia of Life (EOL) and Wikipedia, according to project guidelines (https://github.com/ClearEarthProject/AnnotationGuidelines). Inter-annotator agreement ranged from 43% to 90%. Performance of ClearEarth on identifying named entities in biology text overall was good (precision: 85.56%; recall: 71.57%). The named entities with the best performance were organisms and their parts/products (biotic entities - precision: 72.09%; recall: 54.17%) and systems and environments (aggregate entities - precision: 79.23%; recall: 75.34%). Terms and their relationships extracted by ClearEarth can be embedded in the new ecocore ontology after vetting (http://www.obofoundry.org/ontology/ecocore.html).
This project enables use of advanced industry and research software within natural sciences for downstream operations such as data discovery, assessment, and analysis. In addition, ClearEarth uses the NLP results to generate domain-specific ontologies and other semantic resources.
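
For readers unfamiliar with the precision and recall figures quoted in this abstract, the following minimal sketch shows how the two measures are computed from match counts and how they combine into an F1 score. The function names and the example counts are hypothetical; the abstract reports only precision and recall, so any F1 value is a derived figure, not one stated by the authors.

```python
def precision_recall_f1(true_pos, false_pos, false_neg):
    """Compute precision, recall, and F1 from entity match counts."""
    predicted = true_pos + false_pos   # entities the system proposed
    actual = true_pos + false_neg      # entities in the gold annotation
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / actual if actual else 0.0
    return precision, recall, f1_from(precision, recall)


def f1_from(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    print(precision_recall_f1(true_pos=8, false_pos=2, false_neg=4))
    # Overall figures quoted in the abstract, as fractions.
    print(round(f1_from(0.8556, 0.7157), 4))
```

Applied to the overall figures quoted above (precision 85.56%, recall 71.57%), the harmonic mean comes out at roughly 0.78.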

Detecting Multiword Expressions and Named Entities in Natural Language Texts

10.14232/phd.2434 ◽ 2015

Author(s): István Nagy

Keyword(s): Natural Language ◽ Named Entities ◽ Multiword Expressions

Natural language processing systems for data extraction and mapping on the basis of unstructured text blocks

Proceedings of the International conference “InterCarto/InterGIS” ◽ 10.35595/2414-9179-2020-1-26-375-384 ◽ 2020 ◽ Vol 26 (1) ◽ pp. 375-384

Author(s): Alexey Kolesnikov ◽ Pavel Kikin ◽ Giovanni Niko ◽ Elena Komissarova

Keyword(s): Natural Language Processing ◽ Natural Language ◽ Language Processing ◽ Data Extraction ◽ Named Entity Recognition ◽ Point Of View ◽ Entity Recognition ◽ Named Entities ◽ Processing Technologies ◽ Plain Text

Modern natural language processing technologies make it possible to work with texts without being a specialist in linguistics. Because linguistic models are developed and used on popular data processing platforms, they can be integrated into popular geographic information systems. This significantly expands the functionality and improves the accuracy of standard geocoding functions. The article compares the most popular methods, and the software built on them, using the problem of extracting geographical names from plain text as an example. This task is an extended version of the geocoding operation: the result still includes the coordinates of the point features of interest, but the addresses or geographical names of the objects do not have to be extracted from the text separately in advance. In computational linguistics, this problem is solved by named entity recognition methods. Among the most modern approaches, the authors chose algorithms based on rules, maximum entropy models, and convolutional neural networks. The selected algorithms and methods were evaluated not only for the accuracy with which they find geographical objects in the text, but also for how easily their basic rules or mathematical models can be refined using the user's own text corpora. Reports on technological violations, accidents, and incidents at the facilities of the heat and power complex of the Ministry of Energy of the Russian Federation were selected as the initial data for testing these methods and software solutions. The article also presents a study of a method for improving the quality of named entity recognition through additional training of a neural network model on a specialized text corpus.
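
As an illustration of the rule-based end of the spectrum described in this abstract, the sketch below matches place names from a small gazetteer and returns them together with coordinates, so that name extraction and geocoding happen in one pass. It is a simplified stand-in, not the authors' implementation: the gazetteer contents, the rounded coordinates, and the function names are all assumptions made for the example.

```python
import re

# Hypothetical gazetteer: name -> (latitude, longitude), rounded approximations.
GAZETTEER = {
    "Moscow": (55.75, 37.62),
    "Novosibirsk": (55.03, 82.92),
    "Nizhny Novgorod": (56.33, 44.00),
}


def build_pattern(names):
    """Compile one alternation, longest names first so multiword names win."""
    ordered = sorted(names, key=len, reverse=True)
    return re.compile(r"\b(?:" + "|".join(re.escape(n) for n in ordered) + r")\b")


def extract_places(text, gazetteer=GAZETTEER):
    """Return (name, character offset, coordinates) for each gazetteer hit."""
    pattern = build_pattern(gazetteer)
    return [
        (m.group(0), m.start(), gazetteer[m.group(0)])
        for m in pattern.finditer(text)
    ]


if __name__ == "__main__":
    report = "A heating main failed in Novosibirsk; crews from Nizhny Novgorod assisted."
    for name, offset, coords in extract_places(report):
        print(name, offset, coords)
```

A rule set like this is easy to refine with one's own data, which is one of the evaluation criteria the abstract mentions, but it cannot recognize names missing from the gazetteer; that gap is what the statistical and neural approaches address.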

Deep neural networks for Arabic information extraction

Smart and Sustainable Built Environment ◽ 10.1108/sasbe-03-2019-0031 ◽ 2020 ◽ Vol 9 (4) ◽ pp. 467-482

Author(s): Abdelhalim Saadi ◽ Hacene Belhadef

Keyword(s): Neural Network ◽ Neural Networks ◽ Natural Language ◽ Information Extraction ◽ Language Processing ◽ Deep Neural Networks ◽ Relevant Information ◽ Arabic Language ◽ Named Entities ◽ Content Type

Purpose: The purpose of this paper is to present a system based on deep neural networks that extracts particular entities from natural language text, given that a massive amount of textual information is now available electronically. The sheer volume of electronic text makes it difficult to find or extract relevant information.

Design/methodology/approach: This study presents an original system to extract Arabic named entities by combining a deep neural network-based part-of-speech tagger and a neural network-based named entity extractor. First, the system extracts the grammatical classes of the words with high precision, depending on the context of each word. This module plays the role of the disambiguation process. Then, a second module extracts the named entities.

Findings: Using deep neural networks in natural language processing requires tuning many hyperparameters, which is a time-consuming process. To deal with this problem, statistical methods such as the Taguchi method are in high demand. In this study, the system is successfully applied to Arabic named entity recognition, where an accuracy of 96.81 per cent was reported, which is better than the state-of-the-art results.

Research limitations/implications: The system is designed and trained for the Arabic language, but the architecture can be used for other languages.

Practical implications: Information extraction systems are developed for different applications, such as analysing newspaper articles and databases for commercial, political and social objectives. Information extraction systems can also be built over an information retrieval (IR) system. The IR system eliminates irrelevant documents and paragraphs.

Originality/value: The proposed system can be regarded as the first attempt to use double deep neural networks to increase the accuracy. It can also be built over an IR system. The IR system eliminates irrelevant documents and paragraphs. This process reduces the large number of documents from which the authors wish to extract the relevant information using an information extraction system.

The structure of content

Proceedings of the International Symposium on Quality Assurance and Quality Control in XML ◽ 10.4242/balisagevol9.derose01 ◽ 2012 ◽ Cited By ~ 1

Author(s): Steven J. DeRose

Keyword(s): Natural Language ◽ Statistical Methods ◽ High Volume ◽ Language Identification ◽ Text Analytics ◽ Named Entities ◽ Distinctive Features ◽ Trade Offs ◽ In The Wild

Text analytics involves extracting features of meaning from natural language texts and making them explicit, much as markup does. It uses linguistics, AI, and statistical methods to get at a level of "meaning" that markup generally does not: down in the leaves of what to XML may be unanalyzed "content". This suggests potential for new kinds of error, consistency, and quality checking. However, text analytics can also discover features that markup is used for; this suggests that text analytics can also contribute to the markup process itself. Perhaps the simplest example of text analytics' potential for checking is xml:lang. Language identification is well-developed technology, and xml:lang attributes "in the wild" could be much improved. More interestingly, the distribution of named entities (people, places, organizations, etc.), topics, and emphasis interacts closely with documents' markup structures. Summaries, abstracts, conclusions, and the like all have distinctive features which can be measured. This paper provides an overview of how text analytics works, what it can do, and how that relates to the things we typically mark up in XML. It also discusses the trade-offs and decisions involved in just what we choose to mark up, and how that interacts with automation. It presents several specific ways that text analytics can help create, check, and enhance XML components, and exemplifies some cases using a high-volume analytics tool.
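
The xml:lang check mentioned in this abstract can be sketched in a few lines. The example below compares each declared xml:lang value against a guess made from the element's text. The stopword-counting identifier is a deliberately toy stand-in for real language identification, and the word lists and function names are assumptions made for the example; the sketch also ignores xml:lang inheritance from ancestor elements.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Toy stopword lists; a real checker would use a trained language identifier.
STOPWORDS = {
    "en": {"the", "and", "of", "is", "in", "to"},
    "es": {"el", "la", "los", "de", "y", "en", "es"},
}


def guess_language(text):
    """Return the language whose stopwords occur most often, or None."""
    tokens = text.lower().split()
    scores = {
        lang: sum(token in words for token in tokens)
        for lang, words in STOPWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


def check_xml_lang(xml_source):
    """List (tag, declared, guessed) for elements whose xml:lang looks wrong."""
    mismatches = []
    for elem in ET.fromstring(xml_source).iter():
        declared = elem.get(XML_LANG)
        if declared is None:
            continue  # only elements that declare xml:lang themselves
        guessed = guess_language("".join(elem.itertext()))
        primary = declared.split("-")[0].lower()  # "en-US" -> "en"
        if guessed is not None and guessed != primary:
            mismatches.append((elem.tag, declared, guessed))
    return mismatches


if __name__ == "__main__":
    doc = (
        '<doc><p xml:lang="en">The structure of the text is in the markup.</p>'
        '<p xml:lang="en">El contenido de los documentos es la clave.</p></doc>'
    )
    print(check_xml_lang(doc))
```

Only the second paragraph is flagged, because its declared language is English while its text scores as Spanish.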

Complex named entities in Spanish texts

Lingvisticae Investigationes ◽ 10.1075/li.30.1.06gal ◽ 2007 ◽ Vol 30 (1) ◽ pp. 69-94

Author(s): Sofía N. Galicia-Haro ◽ Alexander Gelbukh

Keyword(s): Natural Language Processing ◽ Natural Language ◽ Language Processing ◽ Semantic Analysis ◽ Linguistic Analysis ◽ Named Entities ◽ Prepositional Phrases ◽ Named Entity ◽ Open Class

We present a linguistic analysis of Named Entities in Spanish texts. Our work focuses on determining the structure of complex proper names: names with coordinated constituents, names with prepositional phrases, and names formed by several content words that begin with a capital letter. We present the analysis of circa 49,000 examples obtained from Mexican newspapers. We detail their structure and give some notions about the context surrounding them. Since named entities belong to an open class of words, new ones are created daily, so the challenge for a named entity recognizer is to determine precisely the boundaries of new entity names in any text and to analyze their components thoroughly for deep semantic analysis. Knowing their general classes of structure, it should be possible to derive useful heuristics or a specific grammar for natural language processing applications.
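
To make the three structural classes concrete, here is a minimal sketch of the kind of heuristic the abstract suggests could be derived: a regular expression that joins capitalized words, optionally linked by a preposition or by the conjunction "y", into one candidate name. The pattern, the list of linking words, and the function name are assumptions made for this example, not the grammar proposed by the authors.

```python
import re

# A capitalized Spanish word (accented capitals and lowercase included).
CAP = r"[A-ZÁÉÍÓÚÑ][a-záéíóúñü]+"
# Linking words allowed inside a name: prepositional forms and coordination.
LINK = r"(?:de\s+la|de\s+los|del|de|y)"

NAME_PATTERN = re.compile(rf"\b{CAP}(?:\s+(?:{LINK}\s+)?{CAP})*")
CAP_WORD = re.compile(CAP)


def find_complex_names(text):
    """Return candidate names containing at least two capitalized words.

    Requiring two capitalized words drops isolated sentence-initial words,
    at the cost of also dropping genuine one-word names.
    """
    candidates = (m.group(0) for m in NAME_PATTERN.finditer(text))
    return [c for c in candidates if len(CAP_WORD.findall(c)) >= 2]


if __name__ == "__main__":
    sentence = (
        "Ayer el informe de la Secretaría de Educación Pública llegó "
        "a la Universidad Nacional Autónoma de México."
    )
    print(find_complex_names(sentence))
```

The heuristic is intentionally greedy: it treats any "X y Y" sequence as a single coordinated name, so two separate people joined by "y" would be merged. Deciding such boundaries correctly is exactly the difficulty the abstract describes.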
