DCMD: Distance-based classification using mixture distributions on microbiome data

2021 ◽  
Vol 17 (3) ◽  
pp. e1008799
Author(s):  
Konstantin Shestopaloff ◽  
Mei Dong ◽  
Fan Gao ◽  
Wei Xu

Current advances in next-generation sequencing techniques have allowed researchers to conduct comprehensive research on the microbiome and human diseases, with recent studies identifying associations between the human microbiome and health outcomes for a number of chronic conditions. However, microbiome data structure, characterized by sparsity and skewness, presents challenges to building effective classifiers. To address this, we present an innovative approach for distance-based classification using mixture distributions (DCMD). The method aims to improve classification performance using microbiome community data, where the predictors are composed of sparse and heterogeneous count data. This approach models the inherent uncertainty in sparse counts by estimating a mixture distribution for the sample data and representing each observation as a distribution, conditional on observed counts and the estimated mixture, which are then used as inputs for distance-based classification. The method is implemented into a k-means classification and k-nearest neighbours framework. We develop two distance metrics that produce optimal results. The performance of the model is assessed using simulated and human microbiome study data, with results compared against a number of existing machine learning and distance-based classification approaches. The proposed method is competitive when compared to the other machine learning approaches, and shows a clear improvement over commonly used distance-based classifiers, underscoring the importance of modelling sparsity for achieving optimal results. The range of applicability and robustness make the proposed method a viable alternative for classification using sparse microbiome count data. The source code is available at https://github.com/kshestop/DCMD for academic use.
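The idea of representing each observation as a distribution, conditional on its observed count and an estimated mixture, can be illustrated with a minimal sketch. This is not the authors' implementation (see the DCMD repository for that). The Poisson likelihood, the fixed grid of rates, the mixture weights, and the CDF-based L1 distance are all assumptions chosen for illustration, and the names `posterior_over_rates` and `l1_cdf_distance` are hypothetical.

```python
from math import lgamma

import numpy as np


def posterior_over_rates(count, depth, rates, weights):
    """Posterior weights of mixture components given one observed count.

    Assumes count ~ Poisson(depth * rate) within each component; this
    likelihood is an illustrative choice, not taken from the paper.
    All rates and weights must be strictly positive.
    """
    rates = np.asarray(rates, dtype=float)
    weights = np.asarray(weights, dtype=float)
    lam = depth * rates
    log_lik = count * np.log(lam) - lam - lgamma(count + 1)
    log_post = np.log(weights) + log_lik
    log_post -= log_post.max()  # stabilise before exponentiating
    post = np.exp(log_post)
    return post / post.sum()


def l1_cdf_distance(p, q):
    """L1 distance between the CDFs of two distributions on the same grid."""
    return float(np.abs(np.cumsum(p) - np.cumsum(q)).sum())


# Hypothetical mixture for one taxon: three abundance rates and their weights.
rates = np.array([1e-4, 1e-3, 1e-2])
weights = np.array([0.6, 0.3, 0.1])

zero_a = posterior_over_rates(0, 1000, rates, weights)
zero_b = posterior_over_rates(0, 1000, rates, weights)
abundant = posterior_over_rates(12, 1000, rates, weights)

d_same = l1_cdf_distance(zero_a, zero_b)
d_diff = l1_cdf_distance(zero_a, abundant)
```

A zero count still yields a full distribution over plausible rates rather than a single point, which is how this kind of representation carries the uncertainty of sparse counts into the distance. Pairwise distances of this form could then be passed to a k-nearest-neighbours or k-means style classifier.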

2019 ◽  
Vol 13 (Supplement_1) ◽  
pp. S99-S100
Author(s):  
M Madgwick ◽  
P Sudhakar ◽  
T Korcsmáros

2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Wei-Chung Shia ◽  
Li-Sheng Lin ◽  
Dar-Ren Chen

Abstract: Traditional computer-aided diagnosis (CAD) processes include feature extraction, selection, and classification. Effective feature extraction in CAD is important for improving classification performance. We introduce a machine-learning method and an analysis procedure for classifying benign and malignant breast tumours in ultrasound (US) images without a priori tumour region selection, thereby decreasing clinical diagnosis effort while maintaining high classification performance. Our dataset comprised 677 US images (benign: 312, malignant: 365). From the two-dimensional US images, pyramid histogram of oriented gradients descriptors were extracted and used as feature vectors. The correlation-based feature selection method was used to evaluate and select significant feature sets for further classification. Sequential minimal optimisation, combined with local weight learning, was used for classification and performance enhancement. Classification of the image dataset showed 81.64% sensitivity and 87.76% specificity for malignant images (area under the curve = 0.847). The positive and negative predictive values were 84.1% and 85.8%, respectively. We thus propose a new workflow that uses machine learning to recognise malignant US images. Physician diagnoses and the automatic machine-learning classifications yielded similar outcomes, which indicates the potential applicability of machine learning in clinical diagnosis.


2021 ◽  
Vol 5 (Supplement_1) ◽  
pp. A288-A288
Author(s):  
Alicia Arredondo Eve ◽  
Elif Tunc ◽  
Yu-Jeh Liu ◽  
Saumya Agrawal ◽  
Huriye Erbak Yilmaz ◽  
...  

Abstract Introduction: Coronary microvascular disease (CMD) affects small arteries that feed the heart and is more prevalent in postmenopausal women. Because CMD and coronary artery disease (CAD) have distinct pathologies but are treated the same way, the majority of patients with CMD do not receive a proper diagnosis and treatment, which in turn results in higher rates of adverse future events such as heart failure, sudden cardiac death, and acute coronary syndrome (ACS). Previously, we performed full metabolite profiling of plasma samples using GC-MS analysis and tested their classification performance using machine learning approaches. This initial proof-of-concept study showed that plasma metabolite profiles can be used to develop diagnostic signatures for CMD. In the current study, we hypothesize that plasma metabolite and protein composition differs among postmenopausal women with no heart disease, with CAD, or with CMD. Methods: We obtained plasma samples from 70 postmenopausal women who were healthy, had CMD, or had CAD at the time of blood collection. In addition to GC-MS metabolite profiles, we performed LC-MS metabolomic profiling and proteomic profiling of a panel of 92 proteins that have been implicated in cardiometabolic disease. We identified a combination of metabolites and proteins, and further tested their classification performance using machine learning approaches to identify potential circulating biomarkers for CMD. Results: We identified a comprehensive list of metabolites and proteins involved in endothelial cell function, nitric oxide metabolism, and inflammation that differed significantly in plasma from women with CMD. We further validated differences in the levels of several protein biomarkers, such as RAGE, PTX3, AGRP, CNTN1, and MMP-3, which were statistically significantly higher in postmenopausal women with CMD than in healthy women or women with CAD.
Conclusion: Our research identified a group of potential molecules that can be used in the design of easy and low-cost blood biomarkers for the clinical diagnosis of CMD.


2020 ◽  
Author(s):  
Shuang Jiang ◽  
Guanghua Xiao ◽  
Andrew Young Koh ◽  
Bo Yao ◽  
Qiwei Li ◽  
...  

Abstract: The human microbiome is a collection of microorganisms that form complex communities and collectively affect host health. Recent advances in next-generation sequencing technology enable high-throughput profiling of the human microbiome. This calls for a statistical model to construct microbial networks from microbiome sequencing count data. Microbiome count data are high-dimensional and suffer from uneven sampling depth, over-dispersion, and zero-inflation; these characteristics can bias network estimation and require specialized analytical tools. Here we propose a general framework, HARMONIES, a Hybrid Approach foR MicrobiOme Network Inferences via Exploiting Sparsity, to infer a sparse microbiome network. HARMONIES first uses a zero-inflated negative binomial (ZINB) distribution to model the skewness and excess zeros in the microbiome data, and incorporates a stochastic process prior for sample-wise normalization. It then infers a sparse and stable network by imposing non-trivial regularizations based on the Gaussian graphical model. In comprehensive simulation studies, HARMONIES outperformed four other commonly used methods. When applied to published microbiome data from a colorectal cancer study, it discovered a novel community with disease-enriched bacteria. In summary, HARMONIES is a novel and useful statistical framework for microbiome network inference, and it is available at https://github.com/shuangj00/HARMONIES.
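For readers unfamiliar with the ZINB distribution mentioned above, the following is a minimal sketch of its probability mass function. It assumes the common mean/size parameterisation of the negative binomial, which the abstract does not specify. The function names and parameter values are hypothetical, and the sketch does not reproduce the HARMONIES model or its priors.

```python
from math import exp, lgamma, log


def nb_pmf(y, mu, size):
    """Negative binomial pmf with mean mu > 0 and dispersion (size) > 0."""
    log_p = (
        lgamma(y + size) - lgamma(size) - lgamma(y + 1)
        + size * log(size / (size + mu))
        + y * log(mu / (size + mu))
    )
    return exp(log_p)


def zinb_pmf(y, pi, mu, size):
    """Zero-inflated NB: with probability pi the count is a structural zero."""
    base = (1.0 - pi) * nb_pmf(y, mu, size)
    return pi + base if y == 0 else base


# Hypothetical parameters: 30% structural zeros, mean 2, size 1.
p_zero = zinb_pmf(0, pi=0.3, mu=2.0, size=1.0)
total = sum(zinb_pmf(y, 0.3, 2.0, 1.0) for y in range(200))
```

The extra mass `pi` at zero is what lets the model account for more zeros than a plain negative binomial with the same mean and dispersion would produce.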


2021 ◽  
Vol 11 (21) ◽  
pp. 10244
Author(s):  
Minki Kim ◽  
Daehan Kim ◽  
Changha Hwang ◽  
Seongje Cho ◽  
Sangchul Han ◽  
...  

Malware family classification groups malware samples that have the same or similar characteristics into the same family. It plays a crucial role in understanding notable malicious patterns and recovering from malware infections. Although many machine learning approaches have been devised for this problem, several questions remain open, including “Which features, classifiers, and evaluation metrics are better for malware family classification?” In this paper, we propose a machine learning approach to Android malware family classification using built-in and custom permissions. Each Android app must declare proper permissions to access restricted resources or to perform restricted actions. Permission declaration is an efficient and obfuscation-resilient feature for malware analysis. We developed a malware family classification technique using permissions and conducted extensive experiments with several classifiers on a well-known dataset, DREBIN. We then evaluated the classifiers in terms of four metrics: macro-level F1-score, accuracy, balanced accuracy (BAC), and the Matthews correlation coefficient (MCC). BAC and the MCC are known to be appropriate for evaluating imbalanced data classification. Our experimental results showed that: (i) custom permissions had a positive impact on classification performance; (ii) even when the same classifier and the same feature information were used, there was a difference of up to 3.67% between accuracy and BAC; (iii) LightGBM and AdaBoost performed better than the other classifiers we considered.
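The gap between accuracy and BAC on imbalanced data can be seen in a small worked example. The sketch below uses the standard binary definitions of accuracy, balanced accuracy, and MCC. The confusion-matrix counts are invented for illustration and are not taken from the paper, which evaluates a multi-class problem.

```python
from math import sqrt


def binary_metrics(tp, fp, fn, tn):
    """Return (accuracy, balanced accuracy, MCC) for a 2x2 confusion matrix."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    sensitivity = tp / (tp + fn)  # recall on the positive class
    specificity = tn / (tn + fp)  # recall on the negative class
    balanced_accuracy = (sensitivity + specificity) / 2
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return accuracy, balanced_accuracy, mcc


# Hypothetical imbalanced test set: 10 positives, 90 negatives.
acc, bac, mcc = binary_metrics(tp=5, fp=5, fn=5, tn=85)
```

With these counts accuracy is 0.90, while balanced accuracy is about 0.72 and MCC about 0.44, because only half of the minority class is recovered. In the multi-class setting, balanced accuracy is usually taken as the mean of the per-class recalls.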


2021 ◽  
Author(s):  
Yao Miao ◽  
Yasushi Iimura ◽  
Hidenori Sugano ◽  
Kosuke Fukumori ◽  
Toshihisa Tanaka

Automatic seizure onset zone (SOZ) localization using interictal electrocorticogram (ECoG) improves the diagnosis and treatment of patients with medically refractory epilepsy. This study aimed to investigate the characteristics of phase-amplitude coupling (PAC) extracted from interictal ECoG and the feasibility of PAC as a biomarker for SOZ identification. We applied the mean vector length modulation index approach to 20-s ECoG windows to calculate PAC features between low-frequency rhythms (0.5–24 Hz) and high-frequency oscillations (HFOs) (80–560 Hz). We used statistical measures to test for significant differences in PAC between the SOZ and the non-seizure onset zone (NSOZ). To overcome the drawbacks of handcrafted feature engineering, we established novel machine learning models that automatically learn the characteristics of the PAC features and classify them to identify the SOZ. To address dataset imbalance, we introduced novel feature-wise and class-wise re-weighting strategies in conjunction with the classifiers. In addition, we proposed time-series nested cross-validation to provide more accurate and unbiased evaluations of the model. Seven patients with focal cortical dysplasia were included in this study. The experimental results show that significant coupling between slow-wave and HFO band pairs exists in the SOZ compared with the NSOZ, and indicate the effectiveness of the PAC features and the proposed models, which achieved better classification performance.
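The mean vector length modulation index is commonly defined as the magnitude of the average of the complex numbers formed from the high-frequency amplitude and the low-frequency phase. The sketch below assumes that definition and synthetic inputs. In practice the phase and amplitude would come from band-pass filtering and a Hilbert transform, which are omitted here, and the study's exact preprocessing is not given in the abstract. The function name is hypothetical.

```python
import numpy as np


def mvl_modulation_index(phase, amplitude):
    """Mean vector length: |mean(A(t) * exp(i * phi(t)))|."""
    phase = np.asarray(phase, dtype=float)
    amplitude = np.asarray(amplitude, dtype=float)
    return float(np.abs(np.mean(amplitude * np.exp(1j * phase))))


# Synthetic 20-s window sampled at 1 kHz with a 4 Hz low-frequency phase.
t = np.linspace(0.0, 20.0, 20000, endpoint=False)
phase = 2.0 * np.pi * 4.0 * t

coupled_amp = 1.0 + np.cos(phase)  # amplitude peaks when the phase is 0
flat_amp = np.ones_like(phase)     # amplitude independent of phase

mi_coupled = mvl_modulation_index(phase, coupled_amp)
mi_flat = mvl_modulation_index(phase, flat_amp)
```

When the amplitude is locked to a preferred phase, the vectors add up and the index is clearly above zero. When the amplitude is unrelated to phase, the vectors cancel over whole cycles and the index is close to zero.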


2020 ◽  
Vol 36 (17) ◽  
pp. 4544-4550 ◽  
Author(s):  
Divya Sharma ◽  
Andrew D Paterson ◽  
Wei Xu

Abstract Motivation: Research supports the potential use of microbiome as a predictor of some diseases. Motivated by the findings that microbiome data is complex in nature, and there is an inherent correlation due to hierarchical taxonomy of microbial Operational Taxonomic Units (OTUs), we propose a novel machine learning method incorporating a stratified approach to group OTUs into phylum clusters. Convolutional Neural Networks (CNNs) were used to train within each of the clusters individually. Further, through an ensemble learning approach, features obtained from each cluster were then concatenated to improve prediction accuracy. Our two-step approach comprising stratification prior to combining multiple CNNs, aided in capturing the relationships between OTUs sharing a phylum efficiently, as compared to using a single CNN ignoring OTU correlations. Results: We used simulated datasets containing 168 OTUs in 200 cases and 200 controls for model testing. Thirty-two OTUs, potentially associated with risk of disease were randomly selected and interactions between three OTUs were used to introduce non-linearity. We also implemented this novel method in two human microbiome studies: (i) Cirrhosis with 118 cases, 114 controls; (ii) type 2 diabetes (T2D) with 170 cases, 174 controls; to demonstrate the model’s effectiveness. Extensive experimentation and comparison against conventional machine learning techniques yielded encouraging results. We obtained mean AUC values of 0.88, 0.92, 0.75, showing a consistent increment (5%, 3%, 7%) in simulations, Cirrhosis and T2D data, respectively, against the next best performing method, Random Forest. Availability and implementation: https://github.com/divya031090/TaxoNN_OTU. Supplementary information: Supplementary data are available at Bioinformatics online.


2017 ◽  
Vol 56 (03) ◽  
pp. 209-216 ◽  
Author(s):  
Said Ouatik El Alaoui ◽  
Mourad Sarrouti

Summary Background and Objective: Biomedical question type classification is one of the important components of an automatic biomedical question answering system. The performance of the latter depends directly on the performance of its biomedical question type classification system, which assigns a category to each question in order to determine the appropriate answer extraction algorithm. This study aims to automatically classify biomedical questions into one of four categories: (1) yes/no, (2) factoid, (3) list, and (4) summary. Methods: In this paper, we propose a biomedical question type classification method based on machine learning approaches to automatically assign a category to a biomedical question. First, we extract features from biomedical questions using the proposed handcrafted lexico-syntactic patterns. Then, we feed these features to machine-learning algorithms. Finally, the class label is predicted using the trained classifiers. Results: Experimental evaluations performed on large standard annotated datasets of biomedical questions, provided by the BioASQ challenge, demonstrated that our method exhibits significantly improved performance when compared to four baseline systems. The proposed method achieves a roughly 10-point increase over the best baseline in terms of accuracy. Moreover, the results show that using handcrafted lexico-syntactic patterns as features for a support vector machine (SVM) leads to the highest accuracy, 89.40%. Conclusion: The proposed method can automatically classify BioASQ questions into one of the four categories: yes/no, factoid, list, and summary. Furthermore, the results demonstrated that our method produced the best classification performance compared to four baseline systems.


2019 ◽  
Vol 26 (2) ◽  
pp. 945-962 ◽  
Author(s):  
Okyaz Eminaga ◽  
Omran Al-Hamad ◽  
Martin Boegemann ◽  
Bernhard Breil ◽  
Axel Semjonow

This study introduces, as a proof of concept, a combination model for classification of prostate cancer using deep learning approaches. We included patients with prostate cancer who underwent surgical treatment, representing the various conditions of disease progression. All possible combinations of significant variables from logistic regression and correlation analyses were determined from the study data sets. The combination possibility and deep learning model was developed to predict these combinations, which represented clinically meaningful patient subgroups. The observed relative frequencies of different tumor stages and of Gleason score (Gls) changes from biopsy to prostatectomy were available for each group. Deep learning models and seven machine learning approaches were compared on classification performance for Gleason score changes and pT2 stage. Deep models achieved the highest F1 scores for pT2 tumors (0.849) and Gls change (0.574). The combination possibility and deep learning model is a useful decision-aid tool for prostate cancer and for grouping patients with prostate cancer into clinically meaningful groups.


2018 ◽  
Vol 2018 ◽  
pp. 1-11 ◽  
Author(s):  
Zichuan Fan ◽  
Fanchen Kong ◽  
Yang Zhou ◽  
Yiqing Chen ◽  
Yalan Dai

Mass spectrometry (MS) is an important technique in protein research. Effective classification methods for MS data could contribute to early and less-invasive diagnosis and also facilitate developments in the bioinformatics field. As MS data are high-dimensional, methods that can effectively handle large amounts of MS data have been widely studied. In this paper, applications of methods based on intelligent algorithms are investigated. First, classification and biomarker analysis methods using typical machine learning approaches are discussed, followed by ensemble strategy algorithms. Simple, basic machine learning algorithms alone can hardly address the various needs of protein MS classification. Preprocessing algorithms are also studied, as these methods are useful for feature selection or feature extraction to improve classification performance. As protein MS data grow in volume and complexity, further work should emphasize improvements in classification methods in terms of classifier selection and combinations of different classification and preprocessing algorithms.

