Machine Learning Techniques for High-Throughput Structure and Function Analysis for Proteomics and Genomics

Background: Heterogeneity in disease populations complicates the discovery of risk factors. To identify risk factors for subpopulations of diseases, we need analytical methods that can deal with unidentified disease subgroups.

Objectives: Inspired by successful approaches from the Big Data field, we developed a high-throughput approach to identify subpopulations within patients with heterogeneous, complex diseases using the wealth of information available in Electronic Medical Records (EMRs).

Methods: We extracted longitudinal healthcare-interaction records, coded by 1,853 PheCodes [1], for the 64,819 patients in Boston's Partners Biobank. Through dimensionality reduction using t-SNE [2] we created a 2D embedding of 32,424 of these patients (set A). We then identified distinct clusters in the t-SNE embedding using DBSCAN [3] and visualized the relative importance of individual PheCodes within them using specialized spectrographs. We replicated this procedure in the remaining 32,395 records (set B).

Results: Summary statistics of both sets were comparable (Table 1).

Table 1. Summary statistics of the total Partners Biobank dataset and the two partitions.

                      Set A         Set B         Total
Entries               12,200,311    12,177,131    24,377,442
Patients              32,424        32,395        64,819
Patient-years         369,546.33    368,597.92    738,144.2
Unique ICD codes      25,056        24,953        26,305
Unique PheCodes       1,851         1,853         1,853

We found 284 clusters in set A and 295 in set B; 63.4% of the clusters from set A could be mapped to a cluster in set B, with a median (range) correlation of 0.24 (0.03–0.58). Clusters represented similar yet distinct clinical phenotypes; e.g. patients diagnosed with "other headache syndrome" were separated into four distinct clusters characterized by migraines, neurofibromatosis, epilepsy or brain cancer, all resulting in patients presenting with headaches (Fig. 1 & 2). Though EMR databases tend to be noisy, our method was also able to differentiate misclassification from true cases; SLE patients with RA codes clustered separately from true RA cases.

Figure 1. Two-dimensional representation of set A generated using dimensionality reduction (t-SNE) and clustering (DBSCAN).

Figure 2. Phenotype Spectrographs (PheSpecs) of four clusters characterized by "Other headache syndromes", driven by codes relating to migraine, epilepsy, neurofibromatosis or brain cancer.

Conclusion: We have shown that EMR data can be used to identify and visualize latent structure in patient categorizations, using an approach based on dimension reduction and clustering machine learning techniques. Our method can identify misclassified patients as well as separate patients with similar problems into subsets with different associated medical problems. Our approach adds a new and powerful tool to aid in the discovery of novel risk factors in complex, heterogeneous diseases.

References:
[1] Denny, J.C. et al. Bioinformatics (2010).
[2] van der Maaten et al. Journal of Machine Learning Research (2008).
[3] Ester, M. et al. Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (1996).

Disclosure of Interests: Marc Maurits: None declared. Thomas Huizinga: Grant/research support from Ablynx, Bristol-Myers Squibb, Roche, Sanofi; Consultant of Ablynx, Bristol-Myers Squibb, Roche, Sanofi. Marcel Reinders: None declared. Soumya Raychaudhuri: None declared. Elizabeth Karlson: None declared. Erik van den Akker: None declared. Rachel Knevel: None declared.
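The clustering step of this abstract (density-based clustering of points in a 2D embedding) can be illustrated with a minimal pure-Python DBSCAN sketch. It is not the authors' implementation. The coordinates below are made up and stand in for an embedding that a t-SNE run would already have produced, and `eps` and `min_samples` are illustrative values, since the abstract does not report the parameters used.

```python
import math

NOISE = -1  # label for points that belong to no cluster


def region_query(points, i, eps):
    """Indices of all points within distance eps of points[i] (including i)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def dbscan(points, eps, min_samples):
    """Label each point with a cluster id (0, 1, ...) or NOISE."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue  # already assigned or marked as noise
        neighbours = region_query(points, i, eps)
        if len(neighbours) < min_samples:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbours if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins, but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbours = region_query(points, j, eps)
            if len(j_neighbours) >= min_samples:
                queue.extend(j_neighbours)  # core point: keep growing the cluster
    return labels


# Hypothetical 2D embedding: two dense groups and one isolated point.
embedding = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
    (5.0, 5.0), (5.1, 5.0), (5.0, 5.1),
    (10.0, 0.0),
]
labels = dbscan(embedding, eps=0.5, min_samples=3)
```

The isolated point is labelled as noise rather than forced into a cluster, which is one reason density-based clustering suits noisy record data. The number of clusters found depends strongly on `eps` and `min_samples`.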


Editorial (Thematic Issue: Machine Learning Techniques for Protein Structure, Genomics Function Analysis and Disease Prediction)

Current Proteomics ◽

10.2174/157016461302160513235846 ◽

2016 ◽

Vol 13 (2) ◽

pp. 77-78 ◽

Cited By ~ 10

Author(s):

Quan Zou

Keyword(s):

Machine Learning ◽

Protein Structure ◽

Function Analysis ◽

Thematic Issue ◽

Machine Learning Techniques ◽

Disease Prediction ◽

Learning Techniques ◽

Structure Genomics


Recognition of Automated Hand-written Digits on Document Images Making Use of Machine Learning Techniques

European Journal of Engineering and Technology Research ◽

10.24018/ejers.2021.6.4.2460 ◽

2021 ◽

Vol 6 (4) ◽

pp. 37-44

Author(s):

Hiral Raja ◽

Aarti Gupta ◽

Rohit Miri

Keyword(s):

Machine Learning ◽

Deep Learning ◽

Machine Learning Techniques ◽

Learning Approaches ◽

Digit Recognition ◽

Learning Techniques ◽

Effective System ◽

Handwritten Digit ◽

Digit String ◽

And Function

The purpose of this study is to create an automated framework that can recognize handwritten digit strings. The digit strings are first segmented into individual digits, and each segmented digit is then passed to a digit recognition module. This research applies several machine learning techniques to the digit string recognition task, including SVM, ANN, and CNN architectures. These approaches train SVM, ANN, and CNN models on HOG feature vectors extracted from images of digit strings. The deep learning methods segment the images by sliding a fixed-size window over them and classifying each sub-image as a digit or a non-digit. Once segmentation is complete, the individual handwritten digits are recognized. The models are first trained on digit data and then evaluated with the chosen machine learning method. The experimental findings indicate that SVM and ANN have disadvantages in precision and efficiency for recognizing text in images, whereas CNN performs better and is more accurate. This research also examines how emerging machine learning and deep learning approaches, such as SVM, ANN, and CNN, adapt to various datasets. The test results indicate that the CNN approach is significantly more effective than the ANN and SVM approaches, ranking 71% higher. The suggested architecture is composed of three major components: image pre-processing, feature extraction, and classification. The overall aim is an effective system that significantly improves the precision of automatic handwritten digit recognition. As will be demonstrated, pre-processing and feature extraction are significant elements of this study for obtaining the best results.
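The fixed-size scanning step described in this abstract can be sketched as follows. This is an illustration only. The image is a tiny binary grid, and `looks_like_digit` is a hypothetical stand-in for a trained classifier (the abstract's SVM, ANN, or CNN), implemented here as a simple ink-fraction threshold.

```python
def sliding_windows(image, win_h, win_w, stride):
    """Yield (top, left, patch) for every window position in a 2D list image."""
    rows, cols = len(image), len(image[0])
    for top in range(0, rows - win_h + 1, stride):
        for left in range(0, cols - win_w + 1, stride):
            patch = [row[left:left + win_w] for row in image[top:top + win_h]]
            yield top, left, patch


def looks_like_digit(patch, min_ink=0.2):
    """Hypothetical classifier stand-in: true if enough pixels are inked."""
    pixels = [p for row in patch for p in row]
    return sum(pixels) / len(pixels) >= min_ink


# Toy binary image (1 = ink) with a single 2x2 blob.
image = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 0],
    [0, 1, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]

# Keep the positions whose window is fully inked.
hits = [
    (top, left)
    for top, left, patch in sliding_windows(image, 2, 2, 1)
    if looks_like_digit(patch, min_ink=1.0)
]
```

In a real system the window size and stride would be chosen to match the expected digit size, and overlapping detections would need to be merged before recognition.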


Learning Binding Affinity from Augmented High Throughput Screening Data

Bioinformatics ◽

10.4018/978-1-4666-3604-0.ch020 ◽

2013 ◽

pp. 364-385

Author(s):

Nicos Angelopoulos ◽

Andreas Hadjiprocopis ◽

Malcolm D. Walkinshaw

Keyword(s):

Machine Learning ◽

Binding Affinity ◽

High Throughput ◽

High Throughput Screening ◽

Large Scale ◽

Bayesian Model Averaging ◽

Model Averaging ◽

Machine Learning Techniques ◽

High Dimensional ◽

Learning Techniques

In high-throughput screening, a large number of molecules are tested against a single target protein to determine the binding affinity of each molecule to the target. The objective of such tests within the pharmaceutical industry is to identify potential drug-like lead molecules. Current technology allows thousands of molecules to be tested inexpensively. Linking such biological data with molecular properties is thus becoming a major goal in both academic and pharmaceutical research. This chapter details how screening data can be augmented with high-dimensional descriptor data and how machine learning techniques can be utilised to build predictive models. The pyruvate kinase protein is used as a model target throughout the chapter. Binding affinity data from a public repository provide binding information on a large set of screened molecules. The authors consider three machine learning paradigms: Bayesian model averaging, Neural Networks, and Support Vector Machines. They apply algorithms from the three paradigms to three subsets of the data and comment on the relative merits of each. They also use the learnt models to classify the molecules in a large in-house molecular database that holds commercially available chemical structures from a large number of suppliers, and discuss the degree of agreement in the compounds selected and ranked by the three algorithms. Details of the technical challenges in such large-scale classification and the ability of each paradigm to cope with these are put forward. The application of machine learning techniques to binding data augmented by high-dimensional descriptors can provide a powerful tool in compound testing. The emphasis of this work is on making very few assumptions or technical choices with regard to the machine learning techniques, to facilitate application of such techniques by non-experts.
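One simple way to quantify the "degree of agreement" between the compounds selected by several models is the Jaccard overlap of their top-ranked sets. The sketch below is an assumption about how such a comparison could be made, not the measure used in the chapter. The compound ids and scores are invented for illustration.

```python
from itertools import combinations


def top_k(scores, k):
    """Return the set of the k compound ids with the highest scores."""
    ranked = sorted(scores, key=lambda cid: scores[cid], reverse=True)
    return set(ranked[:k])


def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    return len(a & b) / len(a | b) if a | b else 1.0


# Hypothetical predicted binding scores from three models.
predictions = {
    "bma": {"c1": 0.90, "c2": 0.80, "c3": 0.10, "c4": 0.40},
    "nn":  {"c1": 0.70, "c2": 0.20, "c3": 0.30, "c4": 0.90},
    "svm": {"c1": 0.95, "c2": 0.85, "c3": 0.20, "c4": 0.30},
}

selected = {name: top_k(scores, k=2) for name, scores in predictions.items()}
agreement = {
    (a, b): jaccard(selected[a], selected[b])
    for a, b in combinations(sorted(selected), 2)
}
```

Set overlap ignores the order within the top k. A rank correlation would be needed to compare how the models order the compounds they agree on.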


Assessing the Role of Machine Learning in Robotics

Regular Issue - International Journal of Innovative Science and Modern Engineering ◽

10.35940/ijisme.e1202.036520 ◽

2020 ◽

Vol 6 (5) ◽

pp. 13-15

Keyword(s):

Machine Learning ◽

The Body ◽

Machine Learning Algorithms ◽

Machine Learning Techniques ◽

Robot Perception ◽

Learning Techniques ◽

Complete Detail ◽

And Function ◽

The Brain

Machine learning is concerned with algorithms, called artificial neural networks, that are inspired by the structure and function of the brain. A neural framework offers broad support for machine learning algorithms: it is an interface, library, or tool that allows developers to build machine learning models easily, without going into the depth of the underlying algorithms. The nervous system is an exceptionally intricate part of a person that coordinates its activities and sensory data by transmitting signals to and from various parts of the body. Neural frameworks are applied to object gathering and grasp planning tasks. Machine learning techniques have been applied to many subproblems in robot perception, such as pattern recognition and self-organisation. A modern robot framework, which demands a complete specification of each movement of the robot, breaks the pick-and-place problem into nearly independent, computationally feasible subproblems as a step toward a comprehensive task-level framework.


Classifying protein structures into folds by convolutional neural networks, distance maps, and persistent homology

10.1101/2020.04.15.042739 ◽

2020 ◽

Author(s):

Yechan Hong ◽

Yongyu Deng ◽

Haofan Cui ◽

Jan Segert ◽

Jianlin Cheng

Keyword(s):

Machine Learning ◽

Persistent Homology ◽

Protein Structures ◽

Distance Matrix ◽

Significant Loss ◽

Machine Learning Techniques ◽

Tertiary Structures ◽

Distance Map ◽

Learning Techniques ◽

And Function

Abstract: The fold classification of a protein reveals valuable information about its shape and function. It is important to find a mapping between protein structures and their folds. There are numerous machine learning techniques to predict protein folds from 1-dimensional (1D) protein sequences, but there are few machine learning methods that directly classify protein 3D (tertiary) structures into predefined folds (e.g. folds defined in the SCOP database). We develop a 2D convolutional neural network to classify any protein structure into one of 1232 folds. We extract two classes of input features for each protein: the residue-residue distance matrix and persistent homology images derived from 3D protein structures. Due to restrictions in computing resources, we sample every other point in the alpha-carbon chain to generate a reduced distance map representation. We find that this does not lead to a significant loss in accuracy. Using the distance matrix, we achieve an accuracy of 95.2% on the SCOP dataset. With persistent homology images of 100 × 100 resolution, we achieve an accuracy of 56% on the SCOPe 2.07 dataset. Combining the two kinds of features further improves classification accuracy. The source code of our method (PRO3DCNN) is available at https://github.com/jianlin-cheng/PRO3DCNN.
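The reduced distance map described in this abstract can be sketched in a few lines. This is not the PRO3DCNN code. The coordinates are made up, and the function name `distance_map` and its `step` parameter are hypothetical.

```python
import math


def distance_map(ca_coords, step=1):
    """Pairwise Euclidean distances between every `step`-th alpha-carbon atom."""
    sampled = ca_coords[::step]
    return [[math.dist(a, b) for b in sampled] for a in sampled]


# Hypothetical alpha-carbon coordinates (in angstroms) for a four-residue chain.
ca_coords = [
    (0.0, 0.0, 0.0),
    (3.8, 0.0, 0.0),
    (7.6, 0.0, 0.0),
    (11.4, 0.0, 0.0),
]

full = distance_map(ca_coords)             # 4 x 4 matrix
reduced = distance_map(ca_coords, step=2)  # every other residue: 2 x 2 matrix
```

Sampling every other residue halves each side of the matrix, so the map shrinks to roughly a quarter of its original number of entries.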


Statistical and Machine Learning Techniques in Human Microbiome Studies: Contemporary Challenges and Solutions

Frontiers in Microbiology ◽

10.3389/fmicb.2021.635781 ◽

2021 ◽

Vol 12 ◽

Author(s):

Isabel Moreno-Indias ◽

Leo Lahti ◽

Miroslava Nedyalkova ◽

Ilze Elbere ◽

Gennady Roshchupkin ◽

...

Keyword(s):

Machine Learning ◽

High Throughput ◽

Data Science ◽

Human Microbiome ◽

Machine Learning Techniques ◽

Central Research ◽

Learning Approaches ◽

Design Data ◽

Microbiome Composition ◽

Learning Techniques

The human microbiome has emerged as a central research topic in human biology and biomedicine. Current microbiome studies generate high-throughput omics data across different body sites, populations, and life stages. Many of the challenges in microbiome research are similar to those in other high-throughput studies: the quantitative analyses need to address the heterogeneity of data, specific statistical properties, and the remarkable variation in microbiome composition across individuals and body sites. This has led to a broad spectrum of statistical and machine learning challenges that range from study design, data processing, and standardization to analysis, modeling, cross-study comparison, prediction, data science ecosystems, and reproducible reporting. Although many statistical and machine learning approaches and tools have been developed, new techniques are needed to deal with emerging applications and the vast heterogeneity of microbiome data. We review and discuss emerging applications of statistical and machine learning techniques in human microbiome studies and introduce the COST Action CA18131 "ML4Microbiome", which brings together microbiome researchers and machine learning experts to address current challenges such as standardization of analysis pipelines for reproducibility of data analysis results, benchmarking, improvement, or development of existing and new tools and ontologies.


Learning Binding Affinity from Augmented High Throughput Screening Data

Chemoinformatics and Advanced Machine Learning Perspectives ◽

10.4018/978-1-61520-911-8.ch011 ◽

2011 ◽

pp. 212-234

Author(s):

Nicos Angelopoulos ◽

Andreas Hadjiprocopis ◽

Malcolm D. Walkinshaw

Keyword(s):

Machine Learning ◽

Binding Affinity ◽

High Throughput ◽

High Throughput Screening ◽

Large Scale ◽

Bayesian Model Averaging ◽

Model Averaging ◽

Machine Learning Techniques ◽

High Dimensional ◽

Learning Techniques

In high-throughput screening, a large number of molecules are tested against a single target protein to determine the binding affinity of each molecule to the target. The objective of such tests within the pharmaceutical industry is to identify potential drug-like lead molecules. Current technology allows thousands of molecules to be tested inexpensively. Linking such biological data with molecular properties is thus becoming a major goal in both academic and pharmaceutical research. This chapter details how screening data can be augmented with high-dimensional descriptor data and how machine learning techniques can be utilised to build predictive models. The pyruvate kinase protein is used as a model target throughout the chapter. Binding affinity data from a public repository provide binding information on a large set of screened molecules. The authors consider three machine learning paradigms: Bayesian model averaging, Neural Networks, and Support Vector Machines. They apply algorithms from the three paradigms to three subsets of the data and comment on the relative merits of each. They also use the learnt models to classify the molecules in a large in-house molecular database that holds commercially available chemical structures from a large number of suppliers, and discuss the degree of agreement in the compounds selected and ranked by the three algorithms. Details of the technical challenges in such large-scale classification and the ability of each paradigm to cope with these are put forward. The application of machine learning techniques to binding data augmented by high-dimensional descriptors can provide a powerful tool in compound testing. The emphasis of this work is on making very few assumptions or technical choices with regard to the machine learning techniques, to facilitate application of such techniques by non-experts.


Using machine learning techniques to reduce data annotation time

PsycEXTRA Dataset ◽

10.1037/e577762012-020 ◽

2006 ◽

Author(s):

Christopher Schreiner ◽

Kari Torkkola ◽

Mike Gardner ◽

Keshu Zhang

Keyword(s):

Machine Learning ◽

Machine Learning Techniques ◽

Data Annotation ◽

Learning Techniques


Using Machine Learning Algorithms on Prediction of Stock Price

Journal of Modeling and Optimization ◽

10.32732/jmo.2020.12.2.84 ◽

2020 ◽

Vol 12 (2) ◽

pp. 84-99

Author(s):

Li-Pang Chen

Keyword(s):

Machine Learning ◽

Stock Price ◽

Short Term Memory ◽

Machine Learning Algorithms ◽

Machine Learning Techniques ◽

Support Vector ◽

Short Term ◽

Learning Techniques ◽

Historical Database ◽

Long Short Term Memory

In this paper, we investigate the analysis and prediction of time-dependent data. We focus our attention on four different stocks selected from the Yahoo Finance historical database. To build models and predict the future stock price, we consider three different machine learning techniques: Long Short-Term Memory (LSTM), Convolutional Neural Networks (CNN) and Support Vector Regression (SVR). By treating close price, open price, daily low, daily high, adjusted close price, and volume of trades as predictors in the machine learning methods, it can be shown that the prediction accuracy is improved.
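A minimal sketch of how the six predictors named in this abstract could be arranged into supervised training pairs is given below. The choice of next-day close as the target, the `lookback` parameter, and the field names are assumptions made for illustration, since the abstract does not specify them. The price rows are invented.

```python
# Predictor names follow the abstract; the dictionary keys are hypothetical.
FEATURES = ("open", "high", "low", "close", "adj_close", "volume")


def make_supervised(rows, lookback=1):
    """Pair the last `lookback` days of predictors with the next day's close."""
    samples = []
    for t in range(lookback, len(rows)):
        window = rows[t - lookback:t]
        x = [day[name] for day in window for name in FEATURES]
        y = rows[t]["close"]  # assumed target: the following day's close
        samples.append((x, y))
    return samples


# Invented daily records, oldest first.
rows = [
    {"open": 10.0, "high": 11.0, "low": 9.5, "close": 10.5, "adj_close": 10.5, "volume": 1000},
    {"open": 10.5, "high": 11.5, "low": 10.0, "close": 11.0, "adj_close": 11.0, "volume": 1200},
    {"open": 11.0, "high": 12.0, "low": 10.8, "close": 11.8, "adj_close": 11.8, "volume": 900},
]

samples = make_supervised(rows, lookback=2)
```

Each feature vector only contains days strictly before the target day, which avoids leaking the value being predicted into the inputs. The resulting pairs could then be fed to any of the regressors the abstract mentions.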
