DeepEventMine: end-to-end neural nested event extraction from biomedical texts

Hai-Long Trieu; Thy Thy Tran; Khoa N A Duong; Anh Nguyen; Makoto Miwa; Sophia Ananiadou

doi:10.1093/bioinformatics/btaa540

Bioinformatics ◽

10.1093/bioinformatics/btaa540 ◽

2020 ◽

Vol 36 (19) ◽

pp. 4910-4917

Author(s):

Hai-Long Trieu ◽

Thy Thy Tran ◽

Khoa N A Duong ◽

Anh Nguyen ◽

Makoto Miwa ◽

...

Keyword(s):

Directed Acyclic Graph ◽

State Of The Art ◽

Event Extraction ◽

Supplementary Information ◽

Supplementary Data ◽

General Domain ◽

Acyclic Graph ◽

End To End ◽

Biomedical Texts ◽

Extraction Model

Abstract Motivation Recent neural approaches to event extraction from text mainly focus on flat events in the general domain, while there have been fewer attempts to detect nested and overlapping events. These existing systems are built on given entities and depend on external syntactic tools. Results We propose an end-to-end neural nested event extraction model named DeepEventMine that extracts multiple overlapping directed acyclic graph structures from a raw sentence. On top of the bidirectional encoder representations from transformers (BERT) model, our model detects nested entities and triggers, roles, nested events and their modifications in an end-to-end manner without any syntactic tools. Our DeepEventMine model achieves new state-of-the-art performance on seven biomedical nested event extraction tasks. Even when gold entities are unavailable, our model can detect events from raw text with promising performance. Availability and implementation Our code and models to reproduce the results are available at: https://github.com/aistairc/DeepEventMine. Supplementary information Supplementary data are available at Bioinformatics online.

CorGAT: a tool for the functional annotation of SARS-CoV-2 genomes

Bioinformatics ◽

10.1093/bioinformatics/btaa1047 ◽

2020 ◽

Author(s):

Matteo Chiara ◽

Federico Zambelli ◽

Marco Antonio Tangaro ◽

Pietro Mandreoli ◽

David S Horner ◽

...

Keyword(s):

Functional Annotation ◽

Ad Hoc ◽

State Of The Art ◽

Supplementary Information ◽

Genomic Sequences ◽

Supplementary Data ◽

Evolutionary Patterns ◽

Genomic Variants ◽

Art Methods ◽

Available Resources

Abstract Summary While over 200 000 genomic sequences are currently available through dedicated repositories, ad hoc methods for the functional annotation of SARS-CoV-2 genomes do not harness all currently available resources for the annotation of functionally relevant genomic sites. Here, we present CorGAT, a novel tool for the functional annotation of SARS-CoV-2 genomic variants. By comparison with other state-of-the-art methods, we demonstrate that, by providing a more comprehensive and richer annotation, our method can facilitate the identification of evolutionary patterns in the genome of SARS-CoV-2. Availability and implementation Galaxy: http://corgat.cloud.ba.infn.it/galaxy; software: https://github.com/matteo14c/CorGAT/tree/Revision_V1; docker: https://hub.docker.com/r/laniakeacloud/galaxy_corgat. Supplementary information Supplementary data are available at Bioinformatics online.

CPS analysis: self-contained validation of biomedical data clustering

Bioinformatics ◽

10.1093/bioinformatics/btaa165 ◽

2020 ◽

Vol 36 (11) ◽

pp. 3516-3521 ◽

Cited By ~ 1

Author(s):

Lixiang Zhang ◽

Lin Lin ◽

Jia Li

Keyword(s):

Data Clustering ◽

State Of The Art ◽

R Package ◽

Research Community ◽

Supplementary Information ◽

Biomedical Data ◽

Data Generation ◽

Supplementary Data ◽

Point Set ◽

Class Labels

Abstract Motivation Cluster analysis is widely used to identify interesting subgroups in biomedical data. Since true class labels are unknown in the unsupervised setting, it is challenging to validate any cluster obtained computationally, an important problem barely addressed by the research community. Results We have developed a toolkit called covering point set (CPS) analysis to quantify uncertainty at the levels of individual clusters and overall partitions. Functions have been developed to effectively visualize the inherent variation in any cluster for high-dimensional data and to provide a more comprehensive view of potentially interesting subgroups in the data. Applying it to three usage scenarios for biomedical data, we demonstrate that CPS analysis is more effective for evaluating the uncertainty of clusters than state-of-the-art measurements. We also showcase how to use CPS analysis to select data generation technologies or visualization methods. Availability and implementation The method is implemented in an R package called OTclust, available on CRAN. Contact [email protected] or [email protected] Supplementary information Supplementary data are available at Bioinformatics online.

Temporal network alignment via GoT-WAVE

Bioinformatics ◽

10.1093/bioinformatics/btz119 ◽

2019 ◽

Vol 35 (18) ◽

pp. 3527-3529 ◽

Cited By ~ 3

Author(s):

David Aparício ◽

Pedro Ribeiro ◽

Tijana Milenković ◽

Fernando Silva

Keyword(s):

User Interface ◽

State Of The Art ◽

Source Code ◽

Network Alignment ◽

Supplementary Information ◽

Temporal Network ◽

Temporal Networks ◽

Supplementary Data ◽

Node Similarity ◽

User Friendly

Abstract Motivation Network alignment (NA) finds conserved regions between two networks. NA methods optimize node conservation (NC) and edge conservation. Dynamic graphlet degree vectors are a state-of-the-art dynamic NC measure, used within the fastest and most accurate NA method for temporal networks: DynaWAVE. Here, we use graphlet-orbit transitions (GoTs), a different graphlet-based measure of temporal node similarity, as a new dynamic NC measure within DynaWAVE, resulting in GoT-WAVE. Results On synthetic networks, GoT-WAVE improves DynaWAVE’s accuracy by 30% and speed by 64%. On real networks, when optimizing only dynamic NC, the methods are complementary. Furthermore, only GoT-WAVE supports directed edges. Hence, GoT-WAVE is a promising new temporal NA algorithm, which efficiently optimizes dynamic NC. We provide a user-friendly interface and source code for GoT-WAVE. Availability and implementation http://www.dcc.fc.up.pt/got-wave/ Supplementary information Supplementary data are available at Bioinformatics online.

PATO: Pangenome Analysis Toolkit

10.1101/2021.01.30.428878 ◽

2021 ◽

Author(s):

Miguel D. Fernández-de-Bobadilla ◽

Alba Talavera-Rodríguez ◽

Lucía Chacón ◽

Fernando Baquero ◽

Teresa M. Coque ◽

...

Keyword(s):

Population Structure ◽

Statistical Analysis ◽

Core Genome ◽

State Of The Art ◽

Source Code ◽

Supplementary Information ◽

Complete Analysis ◽

Large Set ◽

Supplementary Data ◽

Desktop Computer

Abstract Motivation Comparative genomics is a growing field but one that will eventually be overtaken by sample size studies and the increase of available genomes in public databases. We present the Pangenome Analysis Toolkit (PATO), designed to simultaneously analyze thousands of genomes using a desktop computer. The tool performs common tasks of pangenome analysis such as core-genome definition and accessory genome properties and includes new features that help characterize population structure, annotate pathogenic features and create gene sharedness networks. PATO has been developed in R to integrate with the large set of tools available for genetic, phylogenetic and statistical analysis in this environment. Results PATO can perform the most demanding bioinformatic analyses in minutes with an accuracy comparable to state-of-the-art software but 20–30 times faster. PATO also integrates all the necessary functions for the complete analysis of the most common objectives in microbiology studies. Lastly, PATO includes the necessary tools for visualizing the results and can be integrated with other analytical packages available in R. Availability The source code for PATO is freely available at https://github.com/irycisBioinfo/PATO under the GPLv3 [email protected] Supplementary information Supplementary data are available at Bioinformatics online.

A path recorder algorithm for Multiple Longest Common Subsequences (MLCS) problems

Bioinformatics ◽

10.1093/bioinformatics/btaa134 ◽

2020 ◽

Vol 36 (10) ◽

pp. 3035-3042

Author(s):

Shiwei Wei ◽

Yuping Wang ◽

Yuanchao Yang ◽

Sen Liu

Keyword(s):

Dna Sequences ◽

Directed Acyclic Graph ◽

Large Scale ◽

Search Time ◽

Longest Common Subsequence ◽

Supplementary Information ◽

Acyclic Graph ◽

Longest Common Subsequences ◽

Longest Path ◽

Longest Paths

Abstract Motivation Finding the longest common subsequences of many sequences is known as the Multiple Longest Common Subsequence (MLCS) problem, a fundamental and challenging problem in many fields of data mining. The existing algorithms are not applicable to problems with long and large-scale sequences because of their huge time and space consumption. To efficiently handle large-scale MLCS problems, a Path Recorder Directed Acyclic Graph (PRDAG) model and a novel Path Recorder Algorithm (PRA) are proposed. Results In PRDAG, we transform the MLCS problem into searching for the longest path in a Directed Acyclic Graph (DAG), where each longest path in the DAG corresponds to an MLCS. To tackle the problem efficiently, we eliminate all redundant and repeated nodes during the construction of the DAG, and for each node we maintain only the longest paths from the source node to it, ignoring all non-longest paths. As a result, the DAG becomes very small, and memory space and search time are greatly reduced. Empirical experiments have been performed on a standard benchmark set of both DNA sequences and protein sequences. The experimental results demonstrate that our model and algorithm outperform the related leading algorithms, especially for large-scale MLCS problems. Availability and implementation The program code was written by the first author and is available at https://www.ncbi.nlm.nih.gov/nuccore and https://blog.csdn.net/wswguilin. Supplementary information Supplementary data are available at Bioinformatics online.
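
The reduction of MLCS to a longest-path search in a DAG can be illustrated with a minimal Python sketch. It is not the authors' PRA or PRDAG construction, only a plain memoized longest-path search under stated assumptions: a node is a tuple of positions (one per sequence), an edge labeled with a character leads to the earliest later position at which that character occurs in every sequence, and the function name `mlcs` is hypothetical. It returns a single longest common subsequence and is suitable only for small inputs.

```python
from functools import lru_cache


def mlcs(seqs):
    """Return one longest common subsequence shared by every string in seqs.

    Illustrative only: assumes at least one sequence and small inputs.
    """
    # Only characters present in every sequence can appear in the answer.
    alphabet = sorted(set.intersection(*(set(s) for s in seqs)))

    def successor(node, ch):
        # Earliest match point for ch at or after each position in node.
        nxt = []
        for s, start in zip(seqs, node):
            i = s.find(ch, start)
            if i < 0:
                return None  # ch no longer occurs in at least one sequence
            nxt.append(i + 1)
        return tuple(nxt)

    @lru_cache(maxsize=None)
    def longest(node):
        # Longest path, spelled out as a string, from node to a sink.
        best = ""
        for ch in alphabet:
            succ = successor(node, ch)
            if succ is not None:
                candidate = ch + longest(succ)
                if len(candidate) > len(best):
                    best = candidate
        return best

    # The source node starts every sequence at position 0.
    return longest(tuple(0 for _ in seqs))
```

Memoizing on the node means each node's longest outgoing path is computed once, which is the same general idea of keeping only longest paths per node, though without the node-elimination steps the abstract describes.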

HaploGrouper: a generalized approach to haplogroup classification

Bioinformatics ◽

10.1093/bioinformatics/btaa729 ◽

2020 ◽

Author(s):

Anuradha Jagadeesan ◽

S Sunna Ebenesersdóttir ◽

Valdis B Guðmundsdóttir ◽

Elisabet Linda Thordardottir ◽

Kristjan H S Moore ◽

...

Keyword(s):

Mitochondrial Dna ◽

Phylogenetic Tree ◽

Y Chromosome ◽

State Of The Art ◽

Supplementary Information ◽

Sequence Variants ◽

Use Case ◽

Supplementary Data ◽

Human Mitochondrial Dna ◽

Comparable Accuracy

Abstract Motivation We introduce HaploGrouper, versatile software for classifying haplotypes into haplogroups on the basis of a known phylogenetic tree. A typical use case for this software is the assignment of haplogroups to human mitochondrial DNA (mtDNA) or Y-chromosome haplotypes. Existing state-of-the-art haplogroup-calling software is typically hard-wired to work only with either mtDNA or Y-chromosome haplotypes from humans. Results HaploGrouper exhibits comparable accuracy in these instances and has the advantage of being able to assign haplogroups to any kind of haplotype from any species, given an extant annotated phylogenetic tree defined by sequence variants. Availability and implementation The software is available at https://gitlab.com/bio_anth_decode/haploGrouper. Supplementary information Supplementary data are available at Bioinformatics online.

A fast and memory efficient MLCS algorithm by character merging for DNA sequences alignment

Bioinformatics ◽

10.1093/bioinformatics/btz725 ◽

2019 ◽

Author(s):

Sen Liu ◽

Yuping Wang ◽

Wuning Tong ◽

Shiwei Wei

Keyword(s):

Dna Sequences ◽

Directed Acyclic Graph ◽

Large Scale ◽

State Of The Art ◽

Longest Common Subsequence ◽

Sequence Length ◽

Acyclic Graph ◽

Character Sequences ◽

Common Subsequence ◽

Memory Efficient

Abstract Motivation The multiple longest common subsequence (MLCS) problem is to find all longest common subsequences of multiple character sequences. It appears in many fields such as data mining, DNA alignment, bioinformatics and text editing. As sequence length and the number of sequences increase, the existing dynamic programming algorithms and dominant point-based algorithms become ineffective and inefficient, especially for large-scale MLCS problems. Results In this paper, exploiting the fact that DNA sequences contain many consecutively repeated characters, we first design a character merging scheme which merges the consecutively repeated characters in the sequences. As a result, it shortens the sequences considered and saves the space needed to store them. To further reduce the space and time costs, we construct a weighted directed acyclic graph which is much smaller than the widely used directed acyclic graph for MLCS problems. Based on these techniques, we propose a fast and memory-efficient algorithm for MLCS problems. Finally, experiments are conducted and the proposed algorithm is compared with several state-of-the-art algorithms. The experimental results show that the proposed algorithm performs better than the compared state-of-the-art algorithms in both time and space costs. Availability and implementation https://www.ncbi.nlm.nih.gov/nuccore and https://github.com/liusen1006/MLCS.
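
The abstract does not spell out the merging scheme, so the following Python sketch only illustrates the general idea of collapsing consecutively repeated characters, using ordinary run-length encoding as a stand-in. The function names `merge_repeats` and `expand` are hypothetical and are not taken from the authors' code.

```python
from itertools import groupby


def merge_repeats(seq):
    """Collapse each run of identical characters into a (char, run_length) pair."""
    return [(ch, sum(1 for _ in run)) for ch, run in groupby(seq)]


def expand(merged):
    """Invert merge_repeats, recovering the original sequence."""
    return "".join(ch * count for ch, count in merged)
```

For example, `merge_repeats("AAACCGTTTT")` yields four pairs instead of ten characters, and `expand` restores the original string, so no information is lost. The run lengths are the kind of quantity that could serve as weights in a weighted graph, although how the paper uses them is not stated in the abstract.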

QPARSE: searching for long-looped or multimeric G-quadruplexes potentially distinctive and druggable

Bioinformatics ◽

10.1093/bioinformatics/btz569 ◽

2019 ◽

Cited By ~ 1

Author(s):

Michele Berselli ◽

Enrico Lavezzo ◽

Stefano Toppo

Keyword(s):

Human Gene ◽

State Of The Art ◽

Comprehensive Analysis ◽

Supplementary Information ◽

Gene Promoters ◽

Supplementary Data ◽

Stem Loop ◽

Hiv 1 ◽

Rna And Dna ◽

The Web

Abstract Motivation G-quadruplexes (G4s) are non-canonical nucleic acid conformations that are widespread in all kingdoms of life and are emerging as important regulators in both RNA and DNA. Recently, two new higher-order architectures have been reported: adjacent interacting G4s, and G4s with stable long loops forming stem-loop structures. As there are no specialized tools to identify these conformations, we developed QPARSE. Results QPARSE can exhaustively search for degenerate potential quadruplex-forming sequences (PQSs) containing bulges and/or mismatches at the genomic level, as well as either multimeric or long-looped PQSs (MPQS and LLPQS, respectively). While its performance on known reference datasets is comparable with the state of the art, more interesting is its performance in identifying MPQS and LLPQS, which present algorithms are not designed to search for. We report a comprehensive analysis of MPQS in human gene promoters and the analysis of LLPQS in three experimentally validated case studies from HIV-1, BCL2 and hTERT. Availability QPARSE is freely accessible on the web at http://www.medcomp.medicina.unipd.it/qparse/index or downloadable from GitHub as a Python 2.7 program: https://github.com/B3rse/qparse. Supplementary information Supplementary data are available at Bioinformatics online.
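
For context, the kind of search that QPARSE generalizes can be sketched in Python with the widely used canonical PQS motif, four runs of at least three guanines separated by loops of 1 to 7 nucleotides. This is not QPARSE's algorithm: it handles no bulges, mismatches, long loops or multimers, it scans only the given strand, and the names `G4_PATTERN` and `find_canonical_g4` are hypothetical.

```python
import re

# Canonical motif G3+ N1-7 G3+ N1-7 G3+ N1-7 G3+ (DNA or RNA alphabet).
G4_PATTERN = re.compile(r"G{3,}(?:[ACGTU]{1,7}G{3,}){3}")


def find_canonical_g4(seq):
    """Return (start, end, match) for each non-overlapping canonical PQS in seq."""
    seq = seq.upper()
    return [(m.start(), m.end(), m.group()) for m in G4_PATTERN.finditer(seq)]
```

A sequence such as `TTGGGAGGGTGGGAGGGTT` contains one such motif, while a sequence with only three G-runs contains none. Degenerate or long-looped candidates of the kind the abstract describes would be missed by this pattern, which is the gap the tool is said to address.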

FQSqueezer: k-mer-based compression of sequencing data

10.1101/559807 ◽

2019 ◽

Cited By ~ 1

Author(s):

Sebastian Deorowicz

Keyword(s):

Data Compression ◽

State Of The Art ◽

Genomic Data ◽

General Purpose ◽

Supplementary Information ◽

Supplementary Data ◽

Sequencing Data ◽

Partial Matching ◽

Supplementary Material ◽

Better Than

Abstract Motivation The amount of genomic data that needs to be stored is huge. It is therefore not surprising that a lot of work has been done on specialized compression of FASTQ files. The existing algorithms are, however, still imperfect, and even the best tools produce quite large archives. Results We present FQSqueezer, a novel compression algorithm for sequencing data that can process single- and paired-end reads of variable lengths. It is based on ideas from the well-known prediction by partial matching and dynamic Markov coder algorithms from the world of general-purpose compressors. The compression ratios are often tens of percent better than those offered by state-of-the-art tools. Availability and Implementation https://github.com/refresh-bio/[email protected] Supplementary information Supplementary data are available at publisher’s Web site.

MolRep: A Deep Representation Learning Library for Molecular Property Prediction

10.1101/2021.01.13.426489 ◽

2021 ◽

Author(s):

Jiahua Rao ◽

Shuangjia Zheng ◽

Ying Song ◽

Jianwen Chen ◽

Chengtao Li ◽

...

Keyword(s):

State Of The Art ◽

Source Code ◽

Representation Learning ◽

Supplementary Information ◽

Data Sets ◽

Supplementary Data ◽

Property Prediction ◽

Average Rank ◽

Benchmark Data ◽

Classification Tasks

Abstract Summary Recently, novel representation learning algorithms have shown potential for predicting molecular properties. However, unified frameworks have not yet emerged for fairly measuring algorithmic progress, and experimental procedures for different representation models often lack rigor and are hardly reproducible. Herein, we have developed MolRep by unifying 16 state-of-the-art models across 4 popular molecular representations for application and comparison. Furthermore, we ran more than 12.5 million experiments to optimize hyperparameters for each method on 12 common benchmark data sets. As a result, CMPNN achieves the best results, ranking 1st in 5 out of 12 tasks with an average rank of 1.75. By comparison, ECC performs well on classification tasks and MAT on regression (each ranked 1st on 3 tasks), with average ranks of 2.71 and 2.6, respectively. Availability The source code is available at: https://github.com/biomed-AI/MolRep Supplementary information Supplementary data are available online.
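
The "average rank" summary statistic quoted above can be computed as in the following Python sketch. It is not MolRep code: the function name `average_rank` and the input layout (a mapping from task to per-model scores) are assumptions made for illustration, and ties are broken by insertion order rather than by any rule stated in the abstract.

```python
from collections import defaultdict


def average_rank(scores_by_task, higher_is_better=True):
    """Map each model to its mean rank (1 = best) across all tasks.

    scores_by_task: {task_name: {model_name: score}} (hypothetical layout).
    """
    ranks = defaultdict(list)
    for scores in scores_by_task.values():
        # Order the models from best to worst on this task.
        ordered = sorted(scores, key=scores.get, reverse=higher_is_better)
        for position, model in enumerate(ordered, start=1):
            ranks[model].append(position)
    return {model: sum(r) / len(r) for model, r in ranks.items()}
```

For error-based metrics on regression tasks, where lower scores are better, `higher_is_better=False` reverses the ordering. A benchmark that mixes both kinds of metric would need to rank each task with its own direction before averaging.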
