Machine learning can differentiate venom toxins from other proteins having non-toxic physiological functions

PeerJ Computer Science ◽

10.7717/peerj-cs.90 ◽

2016 ◽

Vol 2 ◽

pp. e90 ◽

Cited By ~ 24

Author(s):

Ranko Gacesa ◽

David J. Barlow ◽

Paul F. Long

Keyword(s):

Machine Learning ◽

Sequence Data ◽

Biological Data ◽

Biological Databases ◽

Web Based ◽

Physiological Functions ◽

Link Type ◽

Venom Toxins ◽

Venomous Animals ◽

Toxin Protein

Ascribing function to sequence in the absence of biological data is an ongoing challenge in bioinformatics. Differentiating the toxins of venomous animals from homologues having other physiological functions is particularly problematic as there are no universally accepted methods by which to attribute toxin function using sequence data alone. Bioinformatics tools that do exist are difficult to implement for researchers with little bioinformatics training. Here we announce a machine learning tool called ‘ToxClassifier’ that enables simple and consistent discrimination of toxins from non-toxin sequences with >99% accuracy and compare it to commonly used toxin annotation methods. ‘ToxClassifier’ also reports the best-hit annotation allowing placement of a toxin into the most appropriate toxin protein family, or relates it to a non-toxic protein having the closest homology, giving enhanced curation of existing biological databases and new venomics projects. ‘ToxClassifier’ is available for free, either to download (https://github.com/rgacesa/ToxClassifier) or to use on a web-based server (http://bioserv7.bioinfo.pbf.hr/ToxClassifier/).
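The abstract above does not describe the features or model that ToxClassifier uses, so the following is only a minimal sketch of the general idea of classifying protein sequences from sequence data alone. It assumes amino-acid composition as the feature vector and a nearest-centroid rule as the classifier; both are illustrative choices, not the authors' method. The training sequences are made-up toy strings, not real proteins, and the function names are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(seq):
    """Return the fraction of each of the 20 standard amino acids in seq."""
    seq = seq.upper()
    return [seq.count(a) / len(seq) for a in AMINO_ACIDS]


def centroid(vectors):
    """Element-wise mean of a list of equal-length feature vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def sq_dist(u, v):
    """Squared Euclidean distance between two feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def train(labelled):
    """Build one composition centroid per class label."""
    by_label = {}
    for seq, label in labelled:
        by_label.setdefault(label, []).append(composition(seq))
    return {label: centroid(vs) for label, vs in by_label.items()}


def predict(model, seq):
    """Assign seq to the class whose centroid is closest."""
    x = composition(seq)
    return min(model, key=lambda label: sq_dist(model[label], x))


# Toy, invented training data (assumption: cysteine-rich strings stand in
# for "toxin-like" sequences purely for illustration).
training = [
    ("CCKCGCCRCC", "toxin"),
    ("CKCCGKCCNC", "toxin"),
    ("MAEELLKKAL", "non_toxin"),
    ("MSLEEALKKL", "non_toxin"),
]
model = train(training)
label_a = predict(model, "CCGCKCCRKC")
label_b = predict(model, "MALEEKLKAL")
```

A real tool would need curated training sets, richer features, and held-out evaluation before any accuracy figure could be quoted.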


Peer Review #1 of "Machine learning can differentiate venom toxins from other proteins having non-toxic physiological functions (v0.1)"

10.7287/peerj-cs.90v0.1/reviews/1 ◽

2016 ◽

Author(s):

JM Izarzugaza

Keyword(s):

Machine Learning ◽

Peer Review ◽

Physiological Functions ◽

Venom Toxins


Prediction of Compound-Protein Interactions with Machine Learning Methods

Chemoinformatics and Advanced Machine Learning Perspectives ◽

10.4018/978-1-61520-911-8.ch016 ◽

2011 ◽

pp. 304-317

Author(s):

Yoshihiro Yamanishi ◽

Hisashi Kashima

Keyword(s):

Machine Learning ◽

Protein Interactions ◽

Chemical Structure ◽

Genomic Sequence ◽

Sequence Data ◽

Binary Classification ◽

Biological Data ◽

Supervised Machine Learning ◽

Learning Methods ◽

Machine Learning Methods

In silico prediction of compound-protein interactions from heterogeneous biological data is critical in the process of drug development. In this chapter the authors review several supervised machine learning methods to predict unknown compound-protein interactions from chemical structure and genomic sequence information simultaneously. The authors review several kernel-based algorithms from two different viewpoints: binary classification and dimension reduction. In the results, they demonstrate the usefulness of the methods on the prediction of drug-target interactions and ligand-protein interactions from chemical structure data and genomic sequence data.
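As a rough illustration of how a kernel method can use chemical structure and sequence information simultaneously, the sketch below builds a kernel on (compound, protein) pairs as the product of a compound kernel and a protein kernel. This product construction is one common choice in the field and is an assumption here; the abstract does not say which kernels the chapter reviews. The Tanimoto set similarity, the 2-mer spectrum kernel, the substructure keys, and the sequences are all hypothetical stand-ins.

```python
from collections import Counter


def tanimoto(a, b):
    """Tanimoto similarity between two sets of substructure keys."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def spectrum_kernel(s, t, k=2):
    """Inner product of k-mer count vectors of two sequences."""
    cs = Counter(s[i:i + k] for i in range(len(s) - k + 1))
    ct = Counter(t[i:i + k] for i in range(len(t) - k + 1))
    return sum(cs[m] * ct[m] for m in cs)


def pair_kernel(pair1, pair2):
    """Product kernel on (compound, protein) pairs."""
    (c1, p1), (c2, p2) = pair1, pair2
    return tanimoto(c1, c2) * spectrum_kernel(p1, p2)


# Invented toy inputs: compounds as sets of substructure keys,
# proteins as short amino-acid strings.
compound_1 = {"ring", "ester", "acid"}
compound_2 = {"ring", "acid", "amide"}
protein_1 = "MKTAYIAK"
protein_2 = "MKTAYLAK"

x = (compound_1, protein_1)
y = (compound_2, protein_2)
k_xy = pair_kernel(x, y)
k_xx = pair_kernel(x, x)
```

A kernel matrix built this way over known interacting and non-interacting pairs could then be passed to any kernel classifier for the binary classification setting the abstract mentions.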


Prediction of Compound-protein Interactions with Machine Learning Methods

Machine Learning ◽

10.4018/978-1-60960-818-7.ch315 ◽

2012 ◽

pp. 616-630

Author(s):

Yoshihiro Yamanishi ◽

Hisashi Kashima

Keyword(s):

Machine Learning ◽

Protein Interactions ◽

Chemical Structure ◽

Genomic Sequence ◽

Sequence Data ◽

Binary Classification ◽

Biological Data ◽

Supervised Machine Learning ◽

Learning Methods ◽

Machine Learning Methods


A repository of web-based bioinformatics resources developed in India

10.1101/2020.01.21.855627 ◽

2020 ◽

Author(s):

Abhishek Agarwal ◽

Piyush Agrawal ◽

Aditi Sharma ◽

Vinod Kumar ◽

Chirag Mugdal ◽

...

Keyword(s):

Scientific Community ◽

Biological Databases ◽

Web Interface ◽

Web Based ◽

Link Type ◽

Complete Detail ◽

User Friendly

Abstract: IndiaBioDb (https://webs.iiitd.edu.in/raghava/indiabiodb/) is a manually curated, comprehensive repository of bioinformatics resources developed and maintained by Indian researchers. The repository holds information about 543 freely accessible functional resources, including around 258 biological databases. Each entry gives complete details of a resource: its name, web link, publication details, corresponding author, institute, and resource type. A user-friendly search module allows users to search the repository on any field, and a browsing facility retrieves categorized information. The database can be used to extract useful information about the current state of bioinformatics across all research labs in India funded by government and private bodies. In addition to the web interface, we also developed a mobile app to facilitate the scientific community.


Peer Review #2 of "Machine learning can differentiate venom toxins from other proteins having non-toxic physiological functions (v0.1)"

10.7287/peerj-cs.90v0.1/reviews/2 ◽

2016 ◽

Keyword(s):

Machine Learning ◽

Peer Review ◽

Physiological Functions ◽

Venom Toxins


BtToxin_Digger: a comprehensive and high-throughput pipeline for mining toxin protein genes from Bacillus thuringiensis

10.1101/2020.05.26.114520 ◽

2020 ◽

Author(s):

Hualin Liu ◽

Jinshui Zheng ◽

Dexin Bo ◽

Yun Yu ◽

Weixing Ye ◽

...

Keyword(s):

Bacillus Thuringiensis ◽

High Throughput ◽

Large Scale ◽

Web Based ◽

Bt Toxin ◽

Link Type ◽

Toxin Genes ◽

Toxin Protein ◽

Mining Tool ◽

Downstream Analysis

Summary: Bacillus thuringiensis (Bt), a spore-forming Gram-positive bacterium, has been used as the most successful microbial pesticide for decades. Its toxin genes (cry) have been successfully used for the development of GM crops against pests. We previously developed a web-based insecticidal gene mining tool, BtToxin_scanner, which has proved to be the most important method for mining cry genes from Bt genome sequences. To facilitate efficient mining of major toxin genes and novel virulence factors from large-scale Bt genomic data, we redesigned this tool with a new workflow. Here we present BtToxin_Digger, a comprehensive, high-throughput, and easy-to-use Bt toxin mining tool. It runs fast and produces rich, accurate, and useful results for downstream analysis and experiment design. Moreover, it can also be used to mine other target genes from large-scale genome and metagenome data with the addition of other query sequences.

Availability and Implementation: The BtToxin_Digger code and instructions are freely available at https://github.com/BMBGenomics/BtToxin_Digger. A web server of BtToxin_Digger can be found at http://bcam.hzau.edu.cn/.


GEOMetaCuration: A web-based application for accurate manual curation of Gene Expression Omnibus metadata

10.1101/257444 ◽

2018 ◽

Author(s):

Zhao Li ◽

Jin Li ◽

Peng Yu

Keyword(s):

Gene Expression ◽

Large Scale ◽

Gene Expression Omnibus ◽

Biological Data ◽

Use Case ◽

Web Based ◽

Link Type ◽

Manual Curation ◽

Development Framework ◽

Biological Discovery

Abstract: Metadata curation has become increasingly important for biological discovery and biomedical research because a large amount of heterogeneous biological data is currently freely available. To facilitate efficient metadata curation, we developed an easy-to-use web-based curation application, GEOMetaCuration, for curating the metadata of Gene Expression Omnibus datasets. It can eliminate mechanical operations that consume precious curation time and can help coordinate curation efforts among multiple curators. It improves the curation process by introducing various features that are critical to metadata curation, such as a back-end curation management system and a curator-friendly front-end. The application is based on a commonly used web development framework of Python/Django and is open-sourced under the GNU General Public License V3. GEOMetaCuration is expected to benefit the biocuration community and to contribute to computational generation of biological insights using large-scale biological data. An example use case can be found at the demo website: http://geometacuration.yubiolab.org. Source code URL: https://bitbucket.com/yubiolab/GEOMetaCuration


Peer Review #1 of "Machine learning can differentiate venom toxins from other proteins having non-toxic physiological functions (v0.2)"

10.7287/peerj-cs.90v0.2/reviews/1 ◽

2016 ◽

Author(s):

JM Izarzugaza

Keyword(s):

Machine Learning ◽

Peer Review ◽

Physiological Functions ◽

Venom Toxins


D-graph clusters flaviviruses and β-coronaviruses according to their hosts, disease type and human cell receptors

10.1101/2020.08.13.249649 ◽

2020 ◽

Author(s):

Benjamin A. Braun ◽

Catherine H. Schein ◽

Werner Braun

Keyword(s):

Phylogenetic Trees ◽

Sequence Data ◽

Biological Data ◽

Supplementary Information ◽

Disease Phenotypes ◽

Link Type ◽

Physico Chemical ◽

Protein Sequence Data ◽

Large Groups ◽

Physico Chemical Property

Abstract

Motivation: There is a need for rapid and easy-to-use, alignment-free methods to cluster large groups of protein sequence data. Commonly used phylogenetic trees based on alignments can be used to visualize only a limited number of protein sequences. DGraph, introduced here, is a dynamic programming application developed to generate 2D-maps based on similarity scores for sequences. The program automatically calculates and graphically displays property distance (PD) scores based on physico-chemical property (PCP) similarities from an unaligned list of FASTA files. Such “PD-graphs” show the interrelatedness of the sequences, whereby clusters can reveal deeper connectivities.

Results: PD-graphs generated for flavivirus (FV), enterovirus (EV), and coronavirus (CoV) sequences from complete polyproteins or individual proteins are consistent with biological data on vector types, hosts, cellular receptors and disease phenotypes. PD-graphs separate the tick-borne from the mosquito-borne FV and cluster viruses that infect bats, camels, seabirds and humans separately, and the clusters correlate with disease phenotype. The PD method segregates the β-CoV spike proteins of SARS, SARS-CoV-2, and MERS sequences from other human pathogenic CoV, with clustering consistent with cellular receptor usage. The graphs also suggest evolutionary relationships that may be difficult to determine with conventional bootstrapping methods that require postulating an ancestral sequence.

Availability and implementation: DGraph is written in Java and is compatible with the Java 5 runtime or newer. Source code and an executable are available from the GitHub website (https://github.com/bjmnbraun/DGraph/releases). Documentation for installation and use of the software is available from the Readme.md file at https://github.com/bjmnbraun/DGraph.

Supplementary information: Table S1 and Fig. S1 are available online.


Enabling Semantic Queries Across Federated Bioinformatics Databases

10.1101/686600 ◽

2019 ◽

Cited By ~ 1

Author(s):

Ana Claudia Sima ◽

Tarcisio Mendes de Farias ◽

Erich Zbinden ◽

Maria Anisimova ◽

Manuel Gil ◽

...

Keyword(s):

Gene Expression ◽

Data Integration ◽

Heterogeneous Data ◽

Biological Data ◽

Data Sources ◽

Biological Knowledge ◽

Biological Databases ◽

Semantic Level ◽

Sparql Endpoint ◽

Link Type

Motivation: Data integration promises to be one of the main catalysts in enabling new insights to be drawn from the wealth of biological data available publicly. However, the heterogeneity of the different data sources, both at the syntactic and the semantic level, still poses significant challenges for achieving interoperability among biological databases.

Results: We introduce an ontology-based federated approach for data integration. We applied this approach to three heterogeneous data stores that span different areas of biological knowledge: 1) Bgee, a gene expression relational database; 2) OMA, a Hierarchical Data Format 5 (HDF5) orthology data store, and 3) UniProtKB, a Resource Description Framework (RDF) store containing protein sequence and functional information. To enable federated queries across these sources, we first defined a new semantic model for gene expression called GenEx. We then show how the relational data in Bgee can be expressed as a virtual RDF graph, instantiating GenEx, through dedicated relational-to-RDF mappings. By applying these mappings, Bgee data are now accessible through a public SPARQL endpoint. Similarly, the materialised RDF data of OMA, expressed in terms of the Orthology ontology, is made available in a public SPARQL endpoint. We identified and formally described intersection points (i.e. virtual links) among the three data sources. These allow performing joint queries across the data stores. Finally, we lay the groundwork to enable nontechnical users to benefit from the integrated data, by providing a natural language template-based search interface.

Project URL: http://biosoda.expasy.org, https://github.com/biosoda/bioquery
