What's in this Collection Dataset? Semantic Annotation with GATE

Semantic annotations of datasets are very useful to support quality assurance, discovery, interpretability, linking and integration of datasets. However, providing such annotations manually is often a time-consuming task. If the process is to be at least partially automated and still provide good semantic annotations, precise information extraction is needed. The recognition of entity names (e.g., person, organization, location) in textual resources is the first step before linking the identified term or phrase to other semantic resources such as concepts in ontologies. A multitude of tools and techniques have been developed for information extraction. One of the big players is the text mining framework GATE (Cunningham et al. 2013), which supports annotation rules, semantic techniques and machine learning approaches. We will run GATE's default ANNIE pipeline on collection datasets to automatically detect persons, locations and time expressions. We will also present extensions to extract organisms (Naderi et al. 2011), environmental terms, data parameters and biological processes, and show how to link them to ontologies and LOD resources, e.g., DBpedia (Sateli and Witte 2015). We would like to discuss the results with the conference participants and welcome comments and feedback on the current solution. The audience is also welcome to provide their own datasets in preparation for this session.

Supervised learning for the detection of negation and of its scope in French and Brazilian Portuguese biomedical corpora

Natural Language Engineering ◽

10.1017/s1351324920000352 ◽

2020 ◽

pp. 1-21 ◽

Cited By ~ 2

Author(s):

Clément Dalloux ◽

Vincent Claveau ◽

Natalia Grabar ◽

Lucas Emanuel Silva Oliveira ◽

Claudia Maria Cabral Moro ◽

...

Keyword(s):

Machine Learning ◽

Information Extraction ◽

State Of The Art ◽

Automatic Detection ◽

Brazilian Portuguese ◽

Supervised Machine Learning ◽

Biomedical Domain ◽

Learning Approaches ◽

Cross Domain ◽

Automatic Methods

Automatic detection of negated content is often a prerequisite in information extraction systems in various domains. In the biomedical domain especially, this task is important because negation plays an important role. In this work, two main contributions are proposed. First, we work with languages which have been poorly addressed up to now: Brazilian Portuguese and French. We therefore developed new corpora for these two languages, manually annotated with negation cues and their scope. Second, we propose methods based on supervised machine learning for the automatic detection of negation cues and of their scopes. The methods prove robust in both languages (Brazilian Portuguese and French) and in cross-domain (general and biomedical language) contexts. The approach is also validated on English data from the state of the art: it yields very good results and outperforms other existing approaches. In addition, the application is accessible and usable online. We expect that these contributions (new annotated corpora, an application accessible online, and cross-domain robustness) will improve the reproducibility of the results and the robustness of NLP applications.

Code Clone Detection Using Machine Learning Techniques

International Journal of Open Source Software and Processes ◽

10.4018/ijossp.2020040104 ◽

2020 ◽

Vol 11 (2) ◽

pp. 49-75

Author(s):

Amandeep Kaur ◽

Sandeep Sharma ◽

Munish Saini

Keyword(s):

Machine Learning ◽

Literature Review ◽

Systematic Literature Review ◽

Machine Learning Techniques ◽

Clone Detection ◽

Learning Approaches ◽

Code Clone ◽

Online Databases ◽

Learning Techniques ◽

Tools And Techniques

Code clones are code snippets that are copied and pasted with or without modifications. In recent years, traditional approaches to clone detection have been combined with techniques from other domains to improve detection. This paper presents a systematic literature review of machine learning techniques used in code clone detection. The study provides insights into the tools and techniques developed for clone detection using machine learning approaches and into how effectively those tools and techniques identify clones. The authors perform a systematic literature review on studies selected from popular computer science-related online digital databases, covering January 2004 to January 2020. The software systems and datasets used for analyzing the tools and techniques are mentioned. Neural networks are the machine learning technique most commonly used for clone identification. Clone detection based on program dependency graphs should be explored in the future because such graphs carry semantic information about code fragments.

Discovering pathway and cell-type signatures in transcriptomic compendia with machine learning

10.7287/peerj.preprints.27229 ◽

2018 ◽

Author(s):

Gregory P Way ◽

Casey S Greene

Keyword(s):

Machine Learning ◽

Machine Learning Techniques ◽

Specific Cell ◽

Biological Processes ◽

Learning Approaches ◽

Transcriptome Data ◽

Cell Type ◽

Learning Techniques ◽

Pathway Expression ◽

Machine Learning Applications

Pathway and cell-type signatures are patterns present in transcriptome data that are associated with biological processes or phenotypic consequences. These signatures result from specific cell-type and pathway expression, but can require large transcriptomic compendia to detect. Machine learning techniques can be powerful tools in a practitioner’s toolkit for signature discovery through their ability to provide accurate and interpretable results. In the following review, we discuss various machine learning applications to extract pathway and cell-type signatures from transcriptomic compendia. We focus on the biological motivations and interpretation for both supervised and unsupervised learning approaches in this setting. We consider recent advances, including deep learning, and their applications to expanding bulk and single cell RNA data. As data and compute resources increase, opportunities for machine learning to aid in revealing biological signatures will continue to grow.

A Fine-Tuned Bidirectional Encoder Representations From Transformers Model for Food Named-Entity Recognition: Algorithm Development and Validation (Preprint)

10.2196/preprints.28229 ◽

2021 ◽

Author(s):

Riste Stojanov ◽

Gorjan Popovski ◽

Gjorgjina Cenikj ◽

Barbara Koroušić Seljak ◽

Tome Eftimov

Keyword(s):

Machine Learning ◽

Information Extraction ◽

State Of The Art ◽

Named Entity Recognition ◽

Fine Tuning ◽

Entity Recognition ◽

Food Science ◽

Named Entity ◽

Semantic Resources ◽

Systematized Nomenclature Of Medicine

BACKGROUND: Recently, food science has been garnering a lot of attention. There are many open research questions on food interactions, as one of the main environmental factors, with other health-related entities such as diseases, treatments, and drugs. In the last 2 decades, a large amount of work has been done in natural language processing and machine learning to enable biomedical information extraction. However, machine learning in food science domains remains inadequately resourced, which brings to attention the problem of developing methods for food information extraction. There are only a few food semantic resources and a few rule-based methods for food information extraction, which often depend on some external resources. However, an annotated corpus with food entities along with their normalization was published in 2019 by using several food semantic resources.
OBJECTIVE: In this study, we investigated how the recently published bidirectional encoder representations from transformers (BERT) model, which provides state-of-the-art results in information extraction, can be fine-tuned for food information extraction.
METHODS: We introduce FoodNER, which is a collection of corpus-based food named-entity recognition methods. It consists of 15 different models obtained by fine-tuning 3 pretrained BERT models on 5 groups of semantic resources: food versus nonfood entity, 2 subsets of Hansard food semantic tags, FoodOn semantic tags, and Systematized Nomenclature of Medicine Clinical Terms food semantic tags.
RESULTS: All BERT models provided very promising results with 93.30% to 94.31% macro F1 scores in the task of distinguishing food versus nonfood entity, which represents the new state-of-the-art technology in food information extraction. Considering the tasks where semantic tags are predicted, all BERT models obtained very promising results once again, with their macro F1 scores ranging from 73.39% to 78.96%.
CONCLUSIONS: FoodNER can be used to extract and annotate food entities in 5 different tasks: food versus nonfood entities and distinguishing food entities on the level of food groups by using the closest Hansard semantic tags, the parent Hansard semantic tags, the FoodOn semantic tags, or the Systematized Nomenclature of Medicine Clinical Terms semantic tags.
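The results above are reported as macro F1 scores. As a rough illustration of how such a score can be computed, the following minimal Python sketch averages the per-class F1 over all classes. The abstract does not say which macro F1 variant or which evaluation granularity (token or entity level) the authors used, so the unweighted mean of per-class F1, the function name `macro_f1`, and the toy labels are all assumptions made here for illustration only.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (one common macro F1 definition)."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for gold, pred in zip(y_true, y_pred):
        if gold == pred:
            tp[gold] += 1
        else:
            fp[pred] += 1  # predicted class gains a false positive
            fn[gold] += 1  # gold class gains a false negative

    scores = []
    for label in set(y_true) | set(y_pred):
        precision = tp[label] / (tp[label] + fp[label]) if tp[label] + fp[label] else 0.0
        recall = tp[label] / (tp[label] + fn[label]) if tp[label] + fn[label] else 0.0
        denom = precision + recall
        scores.append(2 * precision * recall / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy labels, not data from the paper.
gold = ["food", "food", "other", "other"]
pred = ["food", "other", "other", "other"]
score = macro_f1(gold, pred)
```

Because every class contributes equally to the average, a rare class that is predicted poorly lowers the macro score as much as a frequent one would, which is why macro F1 is often preferred for imbalanced tag sets.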

Discovering pathway and cell-type signatures in transcriptomic compendia with machine learning

10.7287/peerj.preprints.27229v1 ◽

2018 ◽

Author(s):

Gregory P Way ◽

Casey S Greene

Keyword(s):

Machine Learning ◽

Machine Learning Techniques ◽

Specific Cell ◽

Biological Processes ◽

Learning Approaches ◽

Transcriptome Data ◽

Cell Type ◽

Learning Techniques ◽

Pathway Expression ◽

Machine Learning Applications

A Fine-Tuned Bidirectional Encoder Representations From Transformers Model for Food Named-Entity Recognition: Algorithm Development and Validation

Journal of Medical Internet Research ◽

10.2196/28229 ◽

2021 ◽

Vol 23 (8) ◽

pp. e28229

Author(s):

Riste Stojanov ◽

Gorjan Popovski ◽

Gjorgjina Cenikj ◽

Barbara Koroušić Seljak ◽

Tome Eftimov

Keyword(s):

Machine Learning ◽

Information Extraction ◽

State Of The Art ◽

Named Entity Recognition ◽

Fine Tuning ◽

Entity Recognition ◽

Food Science ◽

Named Entity ◽

Semantic Resources ◽

Systematized Nomenclature Of Medicine

Background: Recently, food science has been garnering a lot of attention. There are many open research questions on food interactions, as one of the main environmental factors, with other health-related entities such as diseases, treatments, and drugs. In the last 2 decades, a large amount of work has been done in natural language processing and machine learning to enable biomedical information extraction. However, machine learning in food science domains remains inadequately resourced, which brings to attention the problem of developing methods for food information extraction. There are only a few food semantic resources and a few rule-based methods for food information extraction, which often depend on some external resources. However, an annotated corpus with food entities along with their normalization was published in 2019 by using several food semantic resources.
Objective: In this study, we investigated how the recently published bidirectional encoder representations from transformers (BERT) model, which provides state-of-the-art results in information extraction, can be fine-tuned for food information extraction.
Methods: We introduce FoodNER, which is a collection of corpus-based food named-entity recognition methods. It consists of 15 different models obtained by fine-tuning 3 pretrained BERT models on 5 groups of semantic resources: food versus nonfood entity, 2 subsets of Hansard food semantic tags, FoodOn semantic tags, and Systematized Nomenclature of Medicine Clinical Terms food semantic tags.
Results: All BERT models provided very promising results with 93.30% to 94.31% macro F1 scores in the task of distinguishing food versus nonfood entity, which represents the new state-of-the-art technology in food information extraction. Considering the tasks where semantic tags are predicted, all BERT models obtained very promising results once again, with their macro F1 scores ranging from 73.39% to 78.96%.
Conclusions: FoodNER can be used to extract and annotate food entities in 5 different tasks: food versus nonfood entities and distinguishing food entities on the level of food groups by using the closest Hansard semantic tags, the parent Hansard semantic tags, the FoodOn semantic tags, or the Systematized Nomenclature of Medicine Clinical Terms semantic tags.

Discovering Pathway and Cell Type Signatures in Transcriptomic Compendia with Machine Learning

Annual Review of Biomedical Data Science ◽

10.1146/annurev-biodatasci-072018-021348 ◽

2019 ◽

Vol 2 (1) ◽

pp. 1-17

Author(s):

Gregory P. Way ◽

Casey S. Greene

Keyword(s):

Machine Learning ◽

Machine Learning Techniques ◽

Specific Cell ◽

Biological Processes ◽

Learning Approaches ◽

Cell Type ◽

Learning Techniques ◽

Pathway Expression ◽

Machine Learning Applications ◽

Computational Resources

Pathway and cell type signatures are patterns present in transcriptome data that are associated with biological processes or phenotypic consequences. These signatures result from specific cell type and pathway expression but can require large transcriptomic compendia to detect. Machine learning techniques can be powerful tools for signature discovery through their ability to provide accurate and interpretable results. In this review, we discuss various machine learning applications to extract pathway and cell type signatures from transcriptomic compendia. We focus on the biological motivations and interpretation for both supervised and unsupervised learning approaches in this setting. We consider recent advances, including deep learning, and their applications to expanding bulk and single-cell RNA data. As data and computational resources increase, there will be more opportunities for machine learning to aid in revealing biological signatures.

Supplemental Material for Psychometric and Machine Learning Approaches for Diagnostic Assessment and Tests of Individual Classification

Psychological Methods ◽

10.1037/met0000317.supp ◽

2020 ◽

Keyword(s):

Machine Learning ◽

Diagnostic Assessment ◽

Learning Approaches

Machine Learning Approaches for the Analysis of Non-Metallic Inclusion Data Sets

AISTech2019 Proceedings of the Iron and Steel Technology Conference ◽

10.33313/377/275 ◽

2019 ◽

Author(s):

M. Webler ◽

B. Abdulsalam

Keyword(s):

Machine Learning ◽

Data Sets ◽

Learning Approaches ◽

Metallic Inclusion

Multiple vehicles detection and tracking for intelligent transport systems using machine learning approaches

Transport and Communication Science Journal ◽

10.25073/tcsj.70.3.7 ◽

2019 ◽

Vol 70 (3) ◽

pp. 214-224

Author(s):

Bui Ngoc Dung ◽

Manh Dzung Lai ◽

Tran Vu Hieu ◽

Nguyen Binh T. H.

Keyword(s):

Machine Learning ◽

Gaussian Mixture ◽

Research Field ◽

Transport Systems ◽

Learning Approaches ◽

Subtraction Method ◽

Intelligent Transport Systems ◽

Intelligent Transport ◽

Detection And Tracking ◽

Multiple Vehicles

Video surveillance is an emerging research field within intelligent transport systems. This paper presents several techniques that use machine learning and computer vision for vehicle detection and tracking. First, machine learning approaches to vehicle detection using Haar-like features and the AdaBoost algorithm are presented. Second, approaches that detect vehicles using a background subtraction method based on a Gaussian Mixture Model and track them using optical flow and multiple Kalman filters are presented. The method has the advantage of distinguishing and tracking multiple vehicles individually. The experimental results demonstrate the high accuracy of the method.
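To make the tracking step more concrete, the following is a minimal Python sketch of a single Kalman filter that follows one vehicle along one image axis under a constant-velocity motion model. It is not the authors' implementation: the abstract does not describe their state vector, noise settings, or how detections are assigned to filters, so the class name `ConstantVelocityKalman`, the helper `track`, and all numeric parameters below are assumptions chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ConstantVelocityKalman:
    """1-D Kalman filter with state [position, velocity] and position-only measurements."""

    pos: float = 0.0
    vel: float = 0.0
    # 2x2 state covariance; large values encode an uninformative starting guess.
    cov: list = field(default_factory=lambda: [[100.0, 0.0], [0.0, 100.0]])
    process_var: float = 1e-3  # assumed process noise added to the diagonal
    measure_var: float = 1.0   # assumed measurement noise variance

    def predict(self, dt: float = 1.0) -> None:
        # State transition F = [[1, dt], [0, 1]]: x <- F x, P <- F P F^T + Q.
        self.pos += self.vel * dt
        (a, b), (c, d) = self.cov
        self.cov = [
            [a + dt * (b + c) + dt * dt * d + self.process_var, b + dt * d],
            [c + dt * d, d + self.process_var],
        ]

    def update(self, measured_pos: float) -> None:
        # Measurement matrix H = [1, 0]: only the position is observed.
        (a, b), (c, d) = self.cov
        innovation_var = a + self.measure_var
        gain_pos, gain_vel = a / innovation_var, c / innovation_var
        residual = measured_pos - self.pos
        self.pos += gain_pos * residual
        self.vel += gain_vel * residual
        # P <- (I - K H) P
        self.cov = [
            [(1 - gain_pos) * a, (1 - gain_pos) * b],
            [c - gain_vel * a, d - gain_vel * b],
        ]


def track(measurements, dt: float = 1.0):
    """Run predict/update over a sequence of detected positions for one vehicle."""
    kf = ConstantVelocityKalman()
    estimates = []
    for z in measurements:
        kf.predict(dt)
        kf.update(z)
        estimates.append((kf.pos, kf.vel))
    return estimates


# Hypothetical detections: a vehicle centroid moving 2 pixels per frame.
estimates = track([2.0 * t for t in range(1, 31)])
```

Tracking several vehicles individually would then amount to keeping one such filter per vehicle and assigning each new detection to the filter whose predicted position is closest, a step this sketch leaves out.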
