biological dataset Latest Research Papers

Hierarchical confounder discovery in the experiment–machine learning cycle

10.1101/2021.05.11.443616 ◽

2021 ◽

Author(s):

Alex Rogozhnikov ◽

Pavan Ramkumar ◽

Saul Kato ◽

Sean Escola

Keyword(s):

Machine Learning ◽

Experimental Design ◽

Natural Phenomena ◽

Biomedical Image ◽

Biological Dataset ◽

Confounding Variables ◽

Derived Data ◽

High Dimensional Datasets ◽

Practical Constraints ◽

Frequent Presence

The promise of using machine learning (ML) to extract scientific insights from high dimensional datasets is tempered by the frequent presence of confounding variables, and it behooves scientists to determine whether or not a model has extracted the desired information or instead may have fallen prey to bias. Due both to features of many natural phenomena and to practical constraints of experimental design, complex bioscience datasets tend to be organized in nested hierarchies which can obfuscate the origin of a confounding effect and obviate traditional methods of confounder amelioration. We propose a simple non-parametric statistical method called the Rank-to-Group (RTG) score that can identify hierarchical confounder effects in raw data and ML-derived data embeddings. We show that RTG scores correctly assign the effects of hierarchical confounders in cases where linear methods such as regression fail. In a large public biomedical image dataset, we discover unreported effects of experimental design. We then use RTG scores to discover cross-modal correlated variability in a complex multi-phenotypic biological dataset. This approach should be of general use in experiment–analysis cycles and to ensure confounder robustness in ML models.
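The abstract above does not define the RTG score, so the following is only an illustrative sketch of a generic rank-based group-separation measure, not the authors' method. It assumes Euclidean distance and a hypothetical function name `rank_group_auc`. For each sample it estimates the probability that a same-group sample lies closer than an other-group sample, then averages over samples. Values near 0.5 suggest the grouping variable leaves no trace in the data, while values near 1.0 suggest strong grouping (a possible confounder).

```python
from math import dist, nan


def rank_group_auc(points, groups):
    """Average per-sample AUC of 'same-group samples rank closer than others'.

    Illustrative stand-in only; the published RTG definition may differ.
    """
    per_sample = []
    for i, (p, g) in enumerate(zip(points, groups)):
        same, other = [], []
        for j, (q, h) in enumerate(zip(points, groups)):
            if j == i:
                continue
            (same if h == g else other).append(dist(p, q))
        if not same or not other:
            continue  # AUC is undefined without both kinds of neighbours
        # Pairwise comparison form of the AUC; ties count as half a win.
        wins = sum((ds < do) + 0.5 * (ds == do) for ds in same for do in other)
        per_sample.append(wins / (len(same) * len(other)))
    return sum(per_sample) / len(per_sample) if per_sample else nan


# Two tight, well-separated batches: the batch label is fully recoverable.
separated = rank_group_auc([(0, 0), (0, 1), (10, 10), (10, 11)], ["a", "a", "b", "b"])
# Alternating labels along a line: same-group samples are not the closest ones.
interleaved = rank_group_auc([(0,), (1,), (2,), (3,)], ["a", "b", "a", "b"])
print(separated, interleaved)
```

Because the measure uses only distance ranks, it is non-parametric in the same spirit as the abstract describes; applying it once per level of a nested hierarchy (for example plate, then well) is one possible way to ask which level carries the effect.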

Discovering Effective Connectivity in Neural Circuits: Analysis Based on Machine Learning Methodology

Frontiers in Neuroinformatics ◽

10.3389/fninf.2021.561012 ◽

2021 ◽

Vol 15 ◽

Author(s):

Pedro Pozo-Jimenez ◽

Javier Lucas-Romero ◽

Jose A. Lopez-Garcia

Keyword(s):

Effective Connectivity ◽

Spike Trains ◽

Proof Of Concept ◽

Biological Dataset ◽

Classical Statistics ◽

Array Technology ◽

Favorable Conditions ◽

Analytical Tools ◽

Potential Use

As multielectrode array technology increases in popularity, accessible analytical tools become necessary. Simultaneous recordings from multiple neurons may produce huge amounts of information. Traditional tools based on classical statistics are either insufficient to analyze multiple spike trains or sophisticated and expensive in computing terms. In this communication, we put to the test the idea that AI algorithms may be useful to gather information about the effective connectivity of neurons in local nuclei at a relatively low computing cost. To this end, we decided to explore the capacity of the algorithm C5.0 to retrieve information from a large series of spike trains obtained from a simulated neuronal circuit with a known structure. Combinatory, iterative and recursive processes using C5.0 were built to examine possibilities of increasing the performance of a direct application of the algorithm. Furthermore, we tested the applicability of these processes to a reduced dataset obtained from original biological recordings with unknown connectivity. This was obtained in house from a mouse in vitro preparation of the spinal cord. Results show that this algorithm can retrieve neurons monosynaptically connected to the target in simulated datasets within a single run. Iterative and recursive processes can identify monosynaptic neurons and disynaptic neurons under favorable conditions. Application of these processes to the biological dataset gives clues to identify neurons monosynaptically connected to the target. We conclude that the work presented provides substantial proof of concept for the potential use of AI algorithms to the study of effective connectivity.

FOCAL3D: A 3-dimensional clustering package for single-molecule localization microscopy

PLoS Computational Biology ◽

10.1371/journal.pcbi.1008479 ◽

2020 ◽

Vol 16 (12) ◽

pp. e1008479

Author(s):

Daniel F. Nino ◽

Daniel Djayakarsana ◽

Joshua N. Milstein

Keyword(s):

Single Molecule ◽

Clustering Algorithm ◽

Spatial Clustering ◽

Cluster Complex ◽

Nuclear Pore ◽

Data Sets ◽

Localization Microscopy ◽

Biological Dataset ◽

Open Source Software Package ◽

Single Molecule Localization Microscopy

Single-molecule localization microscopy (SMLM) is a powerful tool for studying intracellular structure and macromolecular organization at the nanoscale. The increasingly massive pointillistic data sets generated by SMLM require the development of new and highly efficient quantification tools. Here we present FOCAL3D, an accurate, flexible and exceedingly fast (scaling linearly with the number of localizations) density-based algorithm for quantifying spatial clustering in large 3D SMLM data sets. Unlike DBSCAN, which is perhaps the most commonly employed density-based clustering algorithm, an optimum set of parameters for FOCAL3D may be objectively determined. We initially validate the performance of FOCAL3D on simulated datasets at varying noise levels and for a range of cluster sizes. These simulated datasets are used to illustrate the parametric insensitivity of the algorithm, in contrast to DBSCAN, and clustering metrics such as the F1 and Silhouette score indicate that FOCAL3D is highly accurate, even in the presence of significant background noise and mixed populations of variable sized clusters, once optimized. We then apply FOCAL3D to 3D astigmatic dSTORM images of the nuclear pore complex (NPC) in human osteosarcoma cells, illustrating both the validity of the parameter optimization and the ability of the algorithm to accurately cluster complex, heterogeneous 3D clusters in a biological dataset. FOCAL3D is provided as an open source software package written in Python.

A machine learning framework for the prediction of chromatin folding in Drosophila using epigenetic features

PeerJ Computer Science ◽

10.7717/peerj-cs.307 ◽

2020 ◽

Vol 6 ◽

pp. e307

Author(s):

Michal B. Rozenwald ◽

Aleksandra A. Galitsyna ◽

Grigory V. Sapunov ◽

Ekaterina E. Khrameeva ◽

Mikhail S. Gelfand

Keyword(s):

Machine Learning ◽

Cell Lines ◽

Short Term Memory ◽

Regulation Of Gene Expression ◽

Gradient Boosting ◽

Linear Regression Models ◽

Biologically Relevant ◽

Technological Advances ◽

Biological Dataset ◽

Chromatin Folding

Technological advances have led to the creation of large epigenetic datasets, including information about DNA binding proteins and DNA spatial structure. Hi-C experiments have revealed that chromosomes are subdivided into sets of self-interacting domains called Topologically Associating Domains (TADs). TADs are involved in the regulation of gene expression activity, but the mechanisms of their formation are not yet fully understood. Here, we focus on machine learning methods to characterize DNA folding patterns in Drosophila based on chromatin marks across three cell lines. We present linear regression models with four types of regularization, gradient boosting, and recurrent neural networks (RNN) as tools to study chromatin folding characteristics associated with TADs given epigenetic chromatin immunoprecipitation data. The bidirectional long short-term memory RNN architecture produced the best prediction scores and identified biologically relevant features. Distribution of protein Chriz (Chromator) and histone modification H3K4me3 were selected as the most informative features for the prediction of TAD characteristics. This approach may be adapted to any similar biological dataset of chromatin features across various cell lines and species. The code for the implemented pipeline, Hi-ChIP-ML, is publicly available: https://github.com/MichalRozenwald/Hi-ChIP-ML

Molecular modelling studies on thiazole-based α-Glucosidase Inhibitor using docking and CoMFA, CoMSIA and HQSAR

Current Drug Discovery Technologies ◽

10.2174/1570163817666201022111213 ◽

2020 ◽

Vol 17 ◽

Author(s):

Deshbandhu Joshi ◽

Shourya Yadav ◽

Rajesh Sharma ◽

Mitrunjaya Pandya ◽

Raghvendra Singh Bhadauria

Keyword(s):

Correlation Coefficient ◽

Training Set ◽

Glucosidase Inhibitors ◽

Biological Dataset ◽

Donor And Acceptor ◽

Molecular Modelling Studies ◽

Ic50 Values ◽

Modelling Studies ◽

Standard Error Estimation ◽

Leave One Out

Aims and Objective: The biological dataset was retrieved from two series of α-glucosidase inhibitors synthesized by Rahim et al. and Taha et al. and consisted of a total of 46 (forty-six) α-glucosidase inhibitors. Methods: The α-glucosidase inhibitory IC50 values (µM; performed against α-glucosidase from Saccharomyces cerevisiae) were converted into negative logarithmic units (pIC50). The CoMFA and CoMSIA models were created using 37 inhibitors as the training set and externally validated using 9 inhibitors as a test set. The CoMFA models (MMFF94) were generated, ranging from 3.4661 to 5.2749, using leave-one-out PLS analysis, with a cross-validated correlation coefficient q² of 0.787, a high non-cross-validated correlation coefficient r² of 0.819, a low standard error estimation (SEE) of 0.041, an F value of 1316.074 and an r²pred of 0.996. Results: The steric and electrostatic field contributions were 0.507 and 0.493, respectively. The CoMSIA model attained q² 0.805 and r² 0.833, with SEE 0.065, F value 520.302 and r²pred 0.990. The contributions of the steric, electrostatic, hydrophobic, donor and acceptor fields were 0.151, 0.268, 0.223, 0.234 and 0.124, respectively. Conclusion: The HQSAR model of the training set exhibits a significant cross-validated correlation coefficient q² of 0.800 and a non-cross-validated correlation coefficient r² of 0.943.
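The two quantities that recur in this abstract can be sketched in a few lines. This is a minimal illustration rather than the authors' workflow: the function names are hypothetical, pIC50 is taken as the negative base-10 logarithm of the molar IC50 (so a value given in µM becomes 6 minus its log10), and the cross-validated coefficient is taken in its common form, 1 minus PRESS divided by the total sum of squares, with the leave-one-out predictions assumed to be supplied by whatever model was refitted.

```python
import math


def pic50_from_micromolar(ic50_um):
    """Convert an IC50 given in micromolar to pIC50 = -log10(IC50 in mol/L)."""
    if ic50_um <= 0:
        raise ValueError("IC50 must be positive")
    return 6.0 - math.log10(ic50_um)


def cross_validated_q2(observed, loo_predicted):
    """q2 = 1 - PRESS / SS_total, given leave-one-out predictions.

    loo_predicted[i] is assumed to come from a model fitted without sample i.
    """
    mean_observed = sum(observed) / len(observed)
    press = sum((y - y_hat) ** 2 for y, y_hat in zip(observed, loo_predicted))
    ss_total = sum((y - mean_observed) ** 2 for y in observed)
    return 1.0 - press / ss_total


# Made-up activities, for illustration only (not values from the study).
activities_um = [1.0, 10.0, 100.0]
pic50_values = [pic50_from_micromolar(v) for v in activities_um]
print(pic50_values)
print(cross_validated_q2(pic50_values, pic50_values))
```

A q2 of 1.0 means the held-out predictions reproduce the observations exactly, while a q2 of 0.0 means they do no better than predicting the mean activity for every compound.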

SAMA: A Fast Self-Adaptive Memetic Algorithm for Detecting SNP-SNP Interactions Associated with Disease

BioMed Research International ◽

10.1155/2020/5610658 ◽

2020 ◽

Vol 2020 ◽

pp. 1-11

Author(s):

Ying Yin ◽

Boxin Guan ◽

Yuhai Zhao ◽

Yuan Li

Keyword(s):

Genome Wide Association Study ◽

Memetic Algorithm ◽

Search Algorithm ◽

The Other ◽

Detection Power ◽

Running Time ◽

Genome Wide ◽

Biological Dataset ◽

Selection Of ◽

Self Adaptive

Detecting SNP-SNP interactions associated with disease is an important task in genome-wide association studies (GWAS). Owing to the intensive computational burden and the diversity of disease models, existing methods suffer from low detection power and long running time. To tackle these drawbacks, a fast self-adaptive memetic algorithm (SAMA) is proposed in this paper. In this method, the crossover, mutation, and selection of the standard memetic algorithm are improved to make SAMA adapt to the detection of SNP-SNP interactions associated with disease. Furthermore, a self-adaptive local search algorithm is introduced to enhance the detecting power of the proposed method. SAMA is evaluated on a variety of simulated datasets and a real-world biological dataset, and is compared with four other recently developed methods based on evolutionary algorithms (FHSA-SED, AntEpiSeeker, IEACO, and DESeeker). The results of extensive experiments show that SAMA outperforms the four compared methods in terms of detection power and running time.

Shedding light on the underlying characteristics of genomes using Kronecker model families of codon evolution

10.1101/2020.08.12.247890 ◽

2020 ◽

Author(s):

Maryam Zaheri ◽

Nicolas Salamin

Keyword(s):

Transition Rate ◽

Accurate Estimation ◽

Mechanistic Models ◽

Rate Matrix ◽

Nucleotide Level ◽

Codon Models ◽

Biological Dataset ◽

Transition Rate Matrix ◽

Simplifying Assumptions ◽

Codon Evolution

The mechanistic models of codon evolution rely on some simplistic assumptions in order to reduce the computational complexity of estimating the high number of parameters of the models. This paper is an attempt to investigate how misleading these simplistic assumptions are when they violate the nature of the biological dataset at hand. We particularly focus on three simplistic assumptions made by most of the current mechanistic codon models: 1) only single substitutions between nucleotides within codons are allowed in the codon transition rate matrix; 2) mutation is homogeneous across nucleotides within a codon; 3) the HKY nucleotide model is good enough at the nucleotide level. For this purpose, we developed a framework of mechanistic codon models, in which each model holds or relaxes some of the mentioned simplifying assumptions. Holding or relaxing the three simplistic assumptions results in a total of eight different mechanistic models in the framework. Through several experiments on biological datasets and simulations, we show that the three simplistic assumptions are unrealistic for most of the biological datasets and that relaxing these assumptions leads to accurate estimation of evolutionary parameters such as selection pressure.
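The count of eight models follows from three independent hold-or-relax choices (2 × 2 × 2). A small sketch makes the enumeration explicit; the assumption labels below are hypothetical shorthand for the three assumptions listed in the abstract, not names used by the authors.

```python
from itertools import product

# Hypothetical labels for the three assumptions named in the abstract.
ASSUMPTIONS = (
    "single_substitutions_only",
    "homogeneous_mutation",
    "hky_nucleotide_model",
)


def enumerate_models():
    """Return every hold (True) / relax (False) combination of the assumptions."""
    return [
        dict(zip(ASSUMPTIONS, held))
        for held in product((True, False), repeat=len(ASSUMPTIONS))
    ]


for index, model in enumerate(enumerate_models(), start=1):
    relaxed = [name for name, held in model.items() if not held]
    print(index, "relaxes:", ", ".join(relaxed) or "nothing")
```

The first entry holds all three assumptions and the last relaxes all three; the six in between relax one or two of them.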

An efficient exact algorithm for computing all pairwise distances between reconciliations in the duplication-transfer-loss model

BMC Bioinformatics ◽

10.1186/s12859-019-3203-9 ◽

2019 ◽

Vol 20 (S20) ◽

Cited By ~ 1

Author(s):

Santi Santichaivekin ◽

Ross Mawhorter ◽

Ran Libeskind-Hadas

Keyword(s):

Polynomial Time ◽

Maximum Parsimony ◽

Exact Algorithm ◽

Polynomial Time Algorithm ◽

Time Algorithm ◽

Distance Metrics ◽

Worst Case ◽

Loss Model ◽

Biological Dataset ◽

Wide Range

Background: Maximum parsimony reconciliation in the duplication-transfer-loss model is widely used in studying the evolutionary histories of genes and species and in studying coevolution of parasites and their hosts and pairs of symbionts. While efficient algorithms are known for finding maximum parsimony reconciliations, the number of reconciliations can grow exponentially in the size of the trees. An understanding of the space of maximum parsimony reconciliations is necessary to determine whether a single reconciliation can adequately represent the space or whether multiple representative reconciliations are needed. Results: We show that for any instance of the reconciliation problem, the distribution of pairwise distances can be computed exactly by an efficient polynomial-time algorithm with respect to several different distance metrics. We describe the algorithm, analyze its asymptotic worst-case running time, and demonstrate its utility and viability on a large biological dataset. Conclusions: This result provides new insights into the structure of the space of maximum parsimony reconciliations. These insights are likely to be useful in the wide range of applications that employ reconciliation methods.

WiPP: Workflow for Improved Peak Picking for Gas Chromatography-Mass Spectrometry (GC-MS) Data

Metabolites ◽

10.3390/metabo9090171 ◽

2019 ◽

Vol 9 (9) ◽

pp. 171 ◽

Cited By ~ 5

Author(s):

Borgsmüller ◽

Gloaguen ◽

Opialla ◽

Blanc ◽

Sicard ◽

...

Keyword(s):

Mass Spectrometry ◽

Gas Chromatography ◽

Large Scale ◽

Automated Analysis ◽

Peak Detection ◽

Performance Comparison ◽

Gas Chromatography Mass Spectrometry ◽

Peak Picking ◽

Chromatography Mass Spectrometry ◽

Biological Dataset

Lack of reliable peak detection impedes automated analysis of large-scale gas chromatography-mass spectrometry (GC-MS) metabolomics datasets. Performance and outcome of individual peak-picking algorithms can differ widely depending on both algorithmic approach and parameters, as well as data acquisition method. Therefore, comparing and contrasting between algorithms is difficult. Here we present a workflow for improved peak picking (WiPP), a parameter optimising, multi-algorithm peak detection for GC-MS metabolomics. WiPP evaluates the quality of detected peaks using a machine learning-based classification scheme based on seven peak classes. The quality information returned by the classifier for each individual peak is merged with results from different peak detection algorithms to create one final high-quality peak set for immediate down-stream analysis. Medium- and low-quality peaks are kept for further inspection. By applying WiPP to standard compound mixes and a complex biological dataset, we demonstrate that peak detection is improved through the novel way to assign peak quality, an automated parameter optimisation, and results in integration across different embedded peak picking algorithms. Furthermore, our approach can provide an impartial performance comparison of different peak picking algorithms. WiPP is freely available on GitHub (https://github.com/bihealth/WiPP) under MIT licence.

HetEnc: a deep learning predictive model for multi-type biological dataset

BMC Genomics ◽

10.1186/s12864-019-5997-2 ◽

2019 ◽

Vol 20 (1) ◽

Author(s):

Leihong Wu ◽

Xiangwen Liu ◽

Joshua Xu

Keyword(s):

Deep Learning ◽

Predictive Model ◽

Biological Dataset
