Focused named entity recognition using machine learning

Criminal investigations collect and analyze the facts related to a crime, from which investigators can deduce evidence to be used in court. Criminal investigation is a multidisciplinary applied science that includes interviews, interrogations, evidence collection, preservation of the chain of custody, and other investigative methods and techniques. These techniques produce both digital and paper documents that have to be carefully analyzed to identify correlations and interactions among suspects, places, license plates, and other entities mentioned in the investigation. Computerized processing of these documents supports the criminal investigation, as it allows the automatic identification of entities and their relations, some of which are difficult to identify manually. A wide set of dedicated tools exists, but they share a major limitation: they are unable to process criminal reports in the Portuguese language, because no annotated corpus exists for that purpose. This paper presents an annotated corpus composed of a collection of anonymized crime-related documents extracted from official and open sources. The dataset was produced as the result of an exploratory initiative to collect crime-related data from websites and conditioned-access police reports. In the evaluation, classification of the annotated named entities present in the crime-related documents obtained a mean precision of 0.808, recall of 0.722, and F1-score of 0.733. This corpus can be employed to benchmark Machine Learning (ML) and Natural Language Processing (NLP) methods and tools that detect and correlate entities in the documents. Example tasks are sentence detection, named-entity recognition, and identification of terms related to the criminal domain.
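
As a reading aid for the reported scores, the sketch below computes per-class precision, recall, and F1 and then macro-averages them over entity classes. It assumes the reported means are averages over classes, which the abstract does not state; under that reading a mean F1 (0.733) need not equal the harmonic mean of the mean precision and mean recall (about 0.763 for 0.808 and 0.722). The label names and the toy data are made up for illustration.

```python
from collections import Counter


def per_class_scores(gold, pred):
    """Return {label: (precision, recall, f1)} for two aligned label lists."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label p was wrong
            fn[g] += 1  # gold label g was missed
    scores = {}
    for label in sorted(set(gold) | set(pred)):
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[label] = (precision, recall, f1)
    return scores


def macro_average(scores):
    """Average precision, recall, and F1 over classes with equal weight."""
    n = len(scores)
    return tuple(sum(s[i] for s in scores.values()) / n for i in range(3))


# Toy data (hypothetical labels, not taken from the corpus).
gold = ["PERSON", "PERSON", "PERSON", "PERSON", "PLACE", "PLACE", "PLATE", "PLATE"]
pred = ["PERSON", "PERSON", "PERSON", "PLACE", "PLACE", "PLATE", "PLATE", "PLATE"]

macro_p, macro_r, macro_f1 = macro_average(per_class_scores(gold, pred))
harmonic_of_means = 2 * macro_p * macro_r / (macro_p + macro_r)
print(f"P={macro_p:.3f} R={macro_r:.3f} F1={macro_f1:.3f} "
      f"harmonic(P, R)={harmonic_of_means:.3f}")
```

On this toy data the macro-averaged F1 differs from the harmonic mean of the macro-averaged precision and recall, which is the effect described above.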

Obtaining Knowledge in Pathology Reports Through a Natural Language Processing Approach With Classification, Named-Entity Recognition, and Relation-Extraction Heuristics

JCO Clinical Cancer Informatics ◽ 10.1200/cci.19.00008 ◽ 2019 ◽ pp. 1-8 ◽ Cited By ~ 2

Author(s): Tomasz Oliwa ◽ Steven B. Maron ◽ Leah M. Chase ◽ Samantha Lomnicki ◽ Daniel V.T. Catenacci ◽ ...

Keyword(s): Machine Learning ◽ Natural Language Processing ◽ Natural Language ◽ Language Processing ◽ Named Entity Recognition ◽ Entity Recognition ◽ Classification Model ◽ Supervised Machine Learning ◽ Named Entity ◽ Pathology Reports

PURPOSE Robust institutional tumor banks depend on continuous sample curation or else subsequent biopsy or resection specimens are overlooked after initial enrollment. Curation automation is hindered by semistructured free-text clinical pathology notes, which complicate data abstraction. Our motivation is to develop a natural language processing method that dynamically identifies existing pathology specimen elements necessary for locating specimens for future use in a manner that can be re-implemented by other institutions.

PATIENTS AND METHODS Pathology reports from patients with gastroesophageal cancer enrolled in The University of Chicago GI oncology tumor bank were used to train and validate a novel composite natural language processing-based pipeline with a supervised machine learning classification step to separate notes into internal (primary review) and external (consultation) reports; a named-entity recognition step to obtain label (accession number), location, date, and sublabels (block identifiers); and a results proofreading step.

RESULTS We analyzed 188 pathology reports, including 82 internal reports and 106 external consult reports, and successfully extracted named entities grouped as sample information (label, date, location). Our approach identified up to 24 additional unique samples in external consult notes that could have been overlooked. Our classification model obtained 100% accuracy on the basis of 10-fold cross-validation. Precision, recall, and F1 for class-specific named-entity recognition models show strong performance.

CONCLUSION Through a combination of natural language processing and machine learning, we devised a re-implementable and automated approach that can accurately extract specimen attributes from semistructured pathology notes to dynamically populate a tumor registry.
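
The two-stage shape of the pipeline described above (classify the note, then extract specimen attributes) can be sketched as follows. This is not the authors' implementation: a keyword check stands in for their supervised classifier, regular expressions stand in for their named-entity recognition models, and every pattern, cue word, and sample note is a hypothetical placeholder, since real accession-number and date formats vary by institution.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical patterns; adjust to the formats an institution actually uses.
ACCESSION_RE = re.compile(r"\b[A-Z]{1,2}\d{2}-\d{3,6}\b")
DATE_RE = re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b")
BLOCK_RE = re.compile(r"\bblock\s+([A-Z]\d{1,2})\b", re.IGNORECASE)
# Hypothetical cue words suggesting an external consultation report.
EXTERNAL_CUES = ("outside", "consult", "referring", "submitted by")


@dataclass
class Specimen:
    report_type: str        # "internal" or "external"
    label: Optional[str]    # accession number, if one was found
    date: Optional[str]
    blocks: List[str]       # block identifiers (sublabels)


def classify_report(text: str) -> str:
    """Keyword stand-in for the supervised internal/external classifier."""
    lowered = text.lower()
    return "external" if any(cue in lowered for cue in EXTERNAL_CUES) else "internal"


def extract_specimen(text: str) -> Specimen:
    """Regex stand-in for the NER step: first label, first date, all blocks."""
    label = ACCESSION_RE.search(text)
    date = DATE_RE.search(text)
    return Specimen(
        report_type=classify_report(text),
        label=label.group(0) if label else None,
        date=date.group(0) if date else None,
        blocks=BLOCK_RE.findall(text),
    )


# Made-up notes for illustration only.
external_note = ("Outside consultation. Slides S19-12345 received 03/14/2019, "
                 "block A1 and block B2 reviewed.")
internal_note = "Gastric biopsy, accession S18-00421, collected 11/02/2018."

print(extract_specimen(external_note))
print(extract_specimen(internal_note))
```

Classifying first lets the extraction step apply class-specific rules or models, which matches the class-specific recognition models mentioned in the results.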

A Comparative Study of Dictionary-based and Machine Learning-based Named Entity Recognition in Pashto

Proceedings of the 4th International Conference on Natural Language Processing and Information Retrieval ◽ 10.1145/3443279.3443307 ◽ 2020

Author(s): Rafiullah Momand ◽ Shakirullah Waseeb ◽ Ahmad Masood Latif Rai

Keyword(s): Machine Learning ◽ Comparative Study ◽ Named Entity Recognition ◽ Entity Recognition ◽ Named Entity

SCIENTIFIC NAMED ENTITY RECOGNITION WITH THE HELP OF MODERN METHODS

Bulletin Series of Physics & Mathematical Sciences ◽ 10.51889/2021-3.1728-7901.11 ◽ 2021 ◽ Vol 75 (3) ◽ pp. 94-99

Author(s): A.M. Yelenov ◽ A.B. Jaxylykova

Keyword(s): Machine Learning ◽ Language Processing ◽ Named Entity Recognition ◽ Recognition Task ◽ Entity Recognition ◽ Support Vector ◽ Scientific Article ◽ Natural Languages ◽ Named Entity ◽ Learning Area

This research presents a comparative study of the Named Entity Recognition task for scientific article texts. Natural language processing can be considered one of the cornerstones of the machine learning area; it addresses problems connected with the understanding of different natural languages and with linguistic analysis. Current deep learning techniques have already been shown to perform well in areas such as image recognition, pattern recognition, and computer vision, which suggests that such technology would probably be successful in natural language processing too and would lead to a dramatic increase in research interest in this topic. For a very long time, fairly simple algorithms were used in this area, such as support vector machines or various types of regression, together with basic encodings of text data, and these did not yield strong results. The following dataset was used for the experiments: Dataset Scientific Entity Relation Core. The algorithms used were Long Short-Term Memory, Random Forest Classifier with Conditional Random Fields, and named-entity recognition with Bidirectional Encoder Representations from Transformers. In the findings, the metric scores of all models are compared with each other. This research is devoted to the processing of scientific articles concerning the machine learning area, because the subject has not been investigated thoroughly enough. Addressing this task can help machines understand natural languages better, so that they can also solve other natural language processing tasks better, improving results overall.
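
Comparing models of this kind usually requires scoring their predictions on a common footing. The sketch below assumes, without confirmation from the abstract, that each model emits BIO tags per token and that scores are computed over exact-match entity spans. The tag names and the example sentence are illustrative only.

```python
from typing import List, Tuple

Span = Tuple[str, int, int]  # (entity_type, start, end_exclusive)


def bio_to_spans(tags: List[str]) -> List[Span]:
    """Convert a BIO tag sequence into entity spans."""
    spans: List[Span] = []
    start, current = None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing span
        continues = tag.startswith("I-") and current == tag[2:]
        if continues:
            continue
        if current is not None:
            spans.append((current, start, i))
            start, current = None, None
        if tag.startswith("B-") or tag.startswith("I-"):
            # Lenient choice: an I- tag with no matching B- opens a new span.
            start, current = i, tag[2:]
    return spans


def span_f1(gold_tags: List[str], pred_tags: List[str]) -> float:
    """Exact-match entity-level F1 between two BIO sequences."""
    gold, pred = set(bio_to_spans(gold_tags)), set(bio_to_spans(pred_tags))
    tp = len(gold & pred)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(gold)
    return 2 * precision * recall / (precision + recall)


# Illustrative sentence and hypothetical tags.
tokens = ["We", "use", "conditional", "random", "fields", "on", "SciERC"]
gold = ["O", "O", "B-Method", "I-Method", "I-Method", "O", "B-Material"]
pred = ["O", "O", "B-Method", "I-Method", "O", "O", "B-Material"]

print(bio_to_spans(gold))
print(span_f1(gold, pred))
```

Exact-match scoring penalizes the truncated "Method" span in full, so a model can label most tokens correctly and still lose an entity.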

A systematic exposition of Punjabi Named Entity Recognition using different Machine Learning models

10.1109/icirca51532.2021.9544894 ◽ 2021

Author(s): Amandeep Kaur ◽ Sonam Khattar

Keyword(s): Machine Learning ◽ Named Entity Recognition ◽ Entity Recognition ◽ Learning Models ◽ Named Entity ◽ Systematic Exposition ◽ Machine Learning Models

Machine Learning Algorithms for Portuguese Named Entity Recognition

INTELIGENCIA ARTIFICIAL ◽ 10.4114/ia.v11i36.893 ◽ 2007 ◽ Vol 11 (36) ◽ Cited By ~ 7

Author(s): R. L. Milidiú ◽ J. C. Duarte ◽ R. Cavalcante

Keyword(s): Machine Learning ◽ Learning Algorithms ◽ Named Entity Recognition ◽ Machine Learning Algorithms ◽ Entity Recognition ◽ Named Entity

A Feature Based Simple Machine Learning Approach with Word Embeddings to Named Entity Recognition on Tweets

Natural Language Processing and Information Systems - Lecture Notes in Computer Science ◽ 10.1007/978-3-319-59569-6_30 ◽ 2017 ◽ pp. 254-259 ◽ Cited By ~ 2

Author(s): Mete Taşpınar ◽ Murat Can Ganiz ◽ Tankut Acarman

Keyword(s): Machine Learning ◽ Named Entity Recognition ◽ Entity Recognition ◽ Learning Approach ◽ Word Embeddings ◽ Named Entity ◽ Simple Machine ◽ Machine Learning Approach ◽ Feature Based

Using machine learning to maintain rule-based named-entity recognition and classification systems

10.3115/1073012.1073067 ◽ 2001 ◽ Cited By ~ 17

Author(s): Georgios Petasis ◽ Frantz Vichot ◽ Francis Wolinski ◽ Georgios Paliouras ◽ Vangelis Karkaletsis ◽ ...

Keyword(s): Machine Learning ◽ Named Entity Recognition ◽ Classification Systems ◽ Entity Recognition ◽ Rule Based ◽ Named Entity

Evaluating Word Representation Features in Biomedical Named Entity Recognition Tasks

BioMed Research International ◽ 10.1155/2014/240403 ◽ 2014 ◽ Vol 2014 ◽ pp. 1-6 ◽ Cited By ~ 49

Author(s): Buzhou Tang ◽ Hongxin Cao ◽ Xiaolong Wang ◽ Qingcai Chen ◽ Hua Xu

Keyword(s): Machine Learning ◽ Language Processing ◽ Named Entity Recognition ◽ Entity Recognition ◽ Biomedical Domain ◽ Crucial Step ◽ Named Entity ◽ Different Types ◽ Word Representation ◽ Biomedical Named Entity Recognition

Biomedical Named Entity Recognition (BNER), which extracts important entities such as genes and proteins, is a crucial step of natural language processing in the biomedical domain. Various machine learning-based approaches have been applied to BNER tasks and have shown good performance. In this paper, we systematically investigated three different types of word representation (WR) features for BNER: clustering-based representation, distributional representation, and word embeddings. We selected one algorithm from each of the three types of WR features and applied them to the JNLPBA and BioCreAtIvE II BNER tasks. Our results showed that all three WR algorithms were beneficial to machine learning-based BNER systems. Moreover, combining these different types of WR features further improved BNER performance, indicating that they are complementary to each other. By combining all three types of WR features, the improvements in F-measure on the BioCreAtIvE II GM and JNLPBA corpora were 3.75% and 1.39%, respectively, when compared with the systems using baseline features. To the best of our knowledge, this is the first study to systematically evaluate the effect of three different types of WR features for BNER tasks.
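
To make the idea of combining word representation features with baseline features concrete, the sketch below builds a per-token feature dictionary of the kind a feature-based sequence tagger could consume. The lookup tables, their values, and the feature names are hypothetical stand-ins for resources that would be learned from unlabeled text; the abstract does not name the specific algorithms, so none are implied here.

```python
from typing import Dict

# Hypothetical resources, keyed by lowercased token.
CLUSTER_PATHS = {"p53": "110100", "kinase": "110111", "binds": "0110"}   # hierarchical cluster bit strings
DISTRIBUTIONAL_CLASS = {"p53": "d17", "kinase": "d17", "binds": "d03"}   # discretized distributional IDs
EMBEDDINGS = {"p53": [0.8, -0.1], "kinase": [0.7, 0.2], "binds": [-0.3, 0.9]}


def baseline_features(token: str) -> Dict[str, float]:
    """Simple surface features standing in for a baseline feature set."""
    return {
        "lower=" + token.lower(): 1.0,
        "has_digit": float(any(c.isdigit() for c in token)),
        "is_upper": float(token.isupper()),
    }


def wr_features(token: str, prefix_lengths=(2, 4)) -> Dict[str, float]:
    """Clustering-based, distributional, and embedding features for one token."""
    key = token.lower()
    feats: Dict[str, float] = {}
    path = CLUSTER_PATHS.get(key)
    if path is not None:
        # Prefixes of the cluster path give coarser and finer groupings.
        for n in prefix_lengths:
            feats[f"cluster{n}={path[:n]}"] = 1.0
    if key in DISTRIBUTIONAL_CLASS:
        feats["dist=" + DISTRIBUTIONAL_CLASS[key]] = 1.0
    for i, value in enumerate(EMBEDDINGS.get(key, [])):
        feats[f"emb{i}"] = value
    return feats


def token_features(token: str, use_wr: bool = True) -> Dict[str, float]:
    """Baseline features, optionally extended with all three WR feature types."""
    feats = baseline_features(token)
    if use_wr:
        feats.update(wr_features(token))
    return feats


print(token_features("p53"))
print(token_features("p53", use_wr=False))
```

Toggling `use_wr` mirrors the comparison in the abstract between systems using baseline features alone and systems using the combined features.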

Named Entity Recognition for a Low Resource Language

International Journal of Recent Technology and Engineering - 2 ◽ 10.35940/ijrte.b2085.098319 ◽ 2019 ◽ Vol 8 (3) ◽ pp. 587-590

Keyword(s): Machine Learning ◽ Named Entity Recognition ◽ Training Data ◽ Entity Recognition ◽ Linguistic Knowledge ◽ Rule Based ◽ Low Resource ◽ Named Entity ◽ The North ◽ Rule Based Approach

This paper studies Kokborok named entity recognition using a rule-based approach. Named entity recognition is one of the applications of natural language processing and is considered a subtask of information extraction. It is the task of identifying named entities, such as the names of persons, organizations, and locations, for some specific task. We have studied a named entity recognition system for the Kokborok language. Kokborok is the official language of the state of Tripura, situated in the north-eastern part of India. It is also widely spoken in other parts of the north-eastern states of India and in adjoining areas of Bangladesh. Named entity recognition is studied using the machine learning approach, the rule-based approach, or a hybrid approach combining the two. Rule-based named entity recognition is influenced by the linguistic knowledge of the language. The machine learning approach requires a large amount of training data, and Kokborok, being a low-resource language, has very limited training data. The rule-based approach requires linguistic rules, and its results do not depend on the size of the data available. We have framed heuristic rules for identifying named entities based on linguistic knowledge of the language. An encouraging result was obtained when we tested our data with the rule-based approach. We also study and frame rules for the counting system in Kokborok in this paper. The rule-based approach to named entity recognition is found to be suitable for a low-resource language with limited digital work and no named-entity-tagged data. We have framed a suitable algorithm that uses the rules to solve the named entity recognition task and obtain a desirable result.
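
The general shape of a heuristic, rule-based tagger that needs no training data can be sketched as below. The rules are not the paper's: the cue words, the small gazetteer, and the example sentence are hypothetical English placeholders, whereas a real system for Kokborok would rely on cues drawn from that language's own grammar and vocabulary.

```python
from typing import List, Tuple

# Hypothetical cue lists and gazetteer (placeholders, not Kokborok rules).
PERSON_TITLES = {"mr", "mrs", "dr"}
LOCATION_SUFFIXES = {"town", "village", "district"}
ORG_SUFFIXES = {"school", "bank", "college"}
GAZETTEER = {"tripura": "LOCATION", "agartala": "LOCATION"}


def tag_entities(tokens: List[str]) -> List[Tuple[str, str]]:
    """Tag tokens with PERSON, LOCATION, ORGANIZATION, or O using cue rules."""
    tags = ["O"] * len(tokens)
    for i, tok in enumerate(tokens):
        low = tok.lower().rstrip(".")
        nxt = tokens[i + 1] if i + 1 < len(tokens) else ""
        prev_is_capitalized = i > 0 and tokens[i - 1][:1].isupper()
        if low in PERSON_TITLES and nxt[:1].isupper():
            tags[i + 1] = "PERSON"            # title followed by a capitalized name
        elif low in LOCATION_SUFFIXES and prev_is_capitalized:
            tags[i - 1] = "LOCATION"          # "<Name> town"
        elif low in ORG_SUFFIXES and prev_is_capitalized:
            tags[i - 1] = tags[i] = "ORGANIZATION"  # "<Name> College"
        elif low in GAZETTEER and tags[i] == "O":
            tags[i] = GAZETTEER[low]          # fall back to a list lookup
    return list(zip(tokens, tags))


# Made-up sentence for illustration only.
sentence = "Dr Debbarma teaches at Udaipur College in Tripura".split()
for token, tag in tag_entities(sentence):
    print(token, tag)
```

Because every decision comes from a hand-written cue, the output does not depend on the amount of annotated data available, which is the property the abstract highlights for low-resource languages.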
