Machine learning algorithms in big data analyses identify determinants of insulin gene transcription

2021
Author(s): Wilson KM Wong, Vinod Thorat, Mugdha V Joglekar, Charlotte X Dong, Hugo Lee, ...

Machine learning (ML) workflows enable unbiased and robust evaluation of complex datasets and are increasingly sought for analyzing transcriptome-based big datasets. Here, we analysed over 490,000,000 data points to compare 10 different ML algorithms in a large (N=11,652) training dataset of single-cell RNA-sequencing of human pancreatic cells, to identify features (genes) associated with the presence or absence of insulin gene transcript(s). Prediction accuracy and sensitivity of the models were tested in a separate validation dataset (N=2,913 single-cell transcriptomes), and the efficacy of each ML workflow in accurately identifying insulin-producing cells was assessed. Overall, ensemble ML workflows, and in particular the Random Forest algorithm, delivered high predictive power in a receiver operating characteristic (ROC) curve analysis (AUC=0.83) at the highest sensitivity (0.98) compared with the other nine algorithms. The top 10 features (including IAPP, ADCYAP1, LDHA and SST) common to the three ensemble ML workflows were found to localize to human islet β-cells as well as non-β cells, and were significantly dysregulated both in scRNA-seq datasets from Ire-1αβ-/- mice, which demonstrate de-differentiation of pancreatic β-cells, and in pancreatic single cells from individuals with type 2 diabetes. Our findings provide a direct comparison of ML workflows in big data analyses, identify key determinants of insulin transcription, and provide workflows for other regulatory analyses to identify/validate novel genes/features of endocrine pancreatic gene transcription.
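As an illustration of the kind of workflow described above, here is a minimal sketch of training a Random Forest classifier on a cells × genes matrix and evaluating it by ROC AUC and sensitivity. The data, parameters and variable names are placeholders, not the authors' actual pipeline.

```python
# Minimal sketch of an ensemble classification workflow: predict insulin
# transcript presence from single-cell expression data. X (cells x genes)
# and y are synthetic stand-ins, not the study's dataset.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, recall_score

rng = np.random.default_rng(0)
X = rng.poisson(1.0, size=(1000, 200)).astype(float)  # stand-in expression matrix
y = rng.integers(0, 2, size=1000)                     # 1 = insulin transcript detected

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

clf = RandomForestClassifier(n_estimators=500, random_state=0)
clf.fit(X_train, y_train)

proba = clf.predict_proba(X_val)[:, 1]
print("AUC:", roc_auc_score(y_val, proba))
print("Sensitivity:", recall_score(y_val, clf.predict(X_val)))

# Rank features (genes) by importance, analogous to the top-10 list reported.
top10 = np.argsort(clf.feature_importances_)[::-1][:10]
print("Top 10 feature indices:", top10)
```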

2021
Author(s): Daisha Van Der Watt, Hannah Boekweg, Thy Truong, Amanda J Guise, Edward D Plowey, ...

Single-cell proteomics is an emerging sub-field within proteomics with the potential to revolutionize our understanding of cellular heterogeneity and interactions. Recent efforts have largely focused on technological advancements in sample preparation, chromatography and instrumentation to enable measuring the proteins present in these ultra-limited samples. Although advancements in data acquisition have rapidly improved our ability to analyze single cells, the software pipelines used in data analysis were originally written for traditional bulk samples, and their performance on single-cell data has not been investigated. We benchmarked five popular peptide identification tools on single-cell proteomics data. We found that MetaMorpheus achieved the greatest number of peptide spectrum matches at a 1% false discovery rate. Depending on the tool, we also found that post-processing machine learning rescoring can improve spectrum identification results by up to ∼40%. Although rescoring leads to a greater number of peptide spectrum matches, these new results are typically generated by third-party tools and cannot be used by the primary pipeline for quantification. Exploration of novel metrics for machine learning algorithms will continue to improve performance.
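The 1% FDR threshold above rests on target-decoy competition; the following is a minimal sketch of decoy-based q-value estimation, with synthetic scores standing in for real PSM scores. It illustrates the principle only, not any specific tool's implementation.

```python
# Minimal sketch of target-decoy FDR estimation, the principle behind the
# 1% FDR threshold used in the benchmark above. Scores here are synthetic.
import numpy as np

rng = np.random.default_rng(1)
target_scores = rng.normal(1.0, 1.0, 5000)  # PSM scores for target matches
decoy_scores = rng.normal(0.0, 1.0, 5000)   # PSM scores for decoy matches

scores = np.concatenate([target_scores, decoy_scores])
is_decoy = np.concatenate([np.zeros(5000, bool), np.ones(5000, bool)])

order = np.argsort(scores)[::-1]                 # best scores first
decoy_count = np.cumsum(is_decoy[order])
target_count = np.cumsum(~is_decoy[order])
fdr = decoy_count / np.maximum(target_count, 1)  # decoy-based FDR estimate
qvals = np.minimum.accumulate(fdr[::-1])[::-1]   # enforce monotone q-values

passing = qvals <= 0.01
n_psms = int(target_count[passing][-1]) if passing.any() else 0
print("Target PSMs accepted at 1% FDR:", n_psms)
```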


Author(s): Xabier Rodríguez-Martínez, Enrique Pascual-San-José, Mariano Campoy-Quiles

This review article presents the state of the art in high-throughput computational and experimental screening routines with applications in organic solar cells, including materials discovery, device optimization and machine-learning algorithms.


Metabolites, 2021, Vol 11 (6), pp. 363
Author(s): Louise Cottle, Ian Gilroy, Kylie Deng, Thomas Loudovaris, Helen E. Thomas, ...

Pancreatic β cells secrete the hormone insulin into the bloodstream and are critical in the control of blood glucose concentrations. β cells are clustered in the micro-organs of the islets of Langerhans, which have a rich capillary network. Recent work has highlighted the intimate spatial connections between β cells and these capillaries, which lead to the targeting of insulin secretion to the region where the β cells contact the capillary basement membrane. In addition, β cells orientate with respect to the capillary contact point and many proteins are differentially distributed at the capillary interface compared with the rest of the cell. Here, we set out to develop an automated image analysis approach to identify individual β cells within intact islets and to determine if the distribution of insulin across the cells was polarised. Our results show that a U-Net machine learning algorithm correctly identified β cells and their orientation with respect to the capillaries. Using this information, we then quantified insulin distribution across the β cells to show enrichment at the capillary interface. We conclude that machine learning is a useful analytical tool to interrogate large image datasets and analyse sub-cellular organisation.
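A minimal sketch of the quantification step described above: given a β-cell mask and a capillary mask (such as a U-Net might produce), measure insulin enrichment in a band at the capillary interface relative to the rest of the cell. The masks, image and 10-pixel band width are illustrative assumptions, not the authors' parameters.

```python
# Minimal sketch: quantify whether insulin signal is enriched at the
# capillary interface of a segmented beta cell. All inputs are placeholders.
import numpy as np
from scipy import ndimage

rng = np.random.default_rng(2)
insulin = rng.random((256, 256))  # stand-in insulin fluorescence channel

cell_mask = np.zeros((256, 256), bool)
cell_mask[64:192, 64:192] = True       # one segmented beta cell
capillary_mask = np.zeros((256, 256), bool)
capillary_mask[:, :64] = True          # adjacent capillary region

# Distance from every pixel to the nearest capillary pixel.
dist_to_capillary = ndimage.distance_transform_edt(~capillary_mask)

# Compare mean insulin intensity in a band at the capillary contact
# against the rest of the cell (10 px band width is an arbitrary choice).
interface = cell_mask & (dist_to_capillary < 10)
rest = cell_mask & ~interface
enrichment = insulin[interface].mean() / insulin[rest].mean()
print(f"Insulin enrichment at capillary interface: {enrichment:.2f}")
```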


2021
Author(s): Fang He, John H Page, Kerry R Weinberg, Anirban Mishra

BACKGROUND The current COVID-19 pandemic is unprecedented; in resource-constrained settings, predictive algorithms can help stratify disease severity and alert physicians to high-risk patients. However, few risk scores have been derived from a substantially large EHR dataset using simplified predictors as input. OBJECTIVE To develop and validate simplified machine learning algorithms that predict COVID-19 adverse outcomes; to evaluate the AUC (area under the receiver operating characteristic curve), sensitivity, specificity and calibration of the algorithms; and to derive clinically meaningful thresholds. METHODS We conducted machine learning model development and validation via a cohort study using multi-center, patient-level, longitudinal electronic health records (EHR) from the Optum® COVID-19 database, which provides anonymized, longitudinal EHR from across the US. The models were developed based on clinical characteristics to predict 28-day in-hospital mortality, ICU admission, respiratory failure, and mechanical ventilator usage in the inpatient setting. Data from patients admitted prior to Sep 7, 2020, were randomly sampled into development, test and validation datasets; data collected from Sep 7, 2020 through Nov 15, 2020 were reserved as a prospective validation dataset. RESULTS Of the 3.7M patients in the analysis, a total of 585,867 patients were diagnosed with or tested positive for SARS-CoV-2, and 50,703 adult patients were hospitalized with COVID-19 between Feb 1 and Nov 15, 2020. Among the study cohort (N=50,703), there were 6,204 deaths, 9,564 ICU admissions, 6,478 mechanically ventilated or ECMO patients, and 25,169 patients who developed ARDS or respiratory failure within 28 days of hospital admission. The algorithms demonstrated high accuracy (AUC = 0.89 (0.89 - 0.89) on the validation dataset (N=10,752)), consistent prediction through the second wave of the pandemic from September to November (AUC = 0.85 (0.85 - 0.86) on post-development validation (N=14,863)), and strong clinical relevance and utility. In addition, a comprehensive set of 386 input covariates from baseline and at admission was included in the analysis; the end-to-end pipeline automates the feature selection and model development process, producing 10 key predictors as input, such as age, blood urea nitrogen and oxygen saturation, which are both commonly measured and concordant with recognized risk factors for COVID-19. CONCLUSIONS The systematic approach and rigorous validations demonstrate consistent model performance to predict even beyond the time period of data collection, with satisfactory discriminatory power and strong clinical utility. Overall, the study offers an accurate, validated and reliable prediction model based on only ten clinical features as a prognostic tool for stratifying COVID-19 patients into intermediate-, high- and very-high-risk groups. This simple predictive tool could be shared with the wider healthcare community to serve as an early warning system, alerting physicians to possible high-risk patients, or as a resource-triaging tool to optimize healthcare resources. CLINICALTRIAL N/A
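For illustration, here is a minimal sketch of the general approach: a classifier over roughly ten clinical features, evaluated by AUC, with score cut-points defining risk groups. The data, features, model choice and thresholds are synthetic assumptions, not the study's actual pipeline.

```python
# Minimal sketch of a simplified risk model: ~10 clinical features, AUC
# evaluation, and score thresholds defining risk strata. All inputs and
# cut-points here are synthetic/illustrative.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(3)
X = rng.normal(size=(5000, 10))   # e.g. age, BUN, SpO2, ... (placeholders)
y = (X[:, 0] + X[:, 1] + rng.normal(size=5000) > 1).astype(int)  # adverse outcome

X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.2, random_state=0)
model = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)

risk = model.predict_proba(X_va)[:, 1]
print("AUC:", roc_auc_score(y_va, risk))

# Stratify into intermediate / high / very high risk (illustrative cut-points).
groups = np.digitize(risk, [0.3, 0.6])  # 0: intermediate, 1: high, 2: very high
print("Patients per risk group:", np.bincount(groups, minlength=3))
```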


2017, Vol 47 (10), pp. 2625-2626
Author(s): Fuchun Sun, Guang-Bin Huang, Q. M. Jonathan Wu, Shiji Song, Donald C. Wunsch II

2021, Vol 2021, pp. 1-9
Author(s): Yao Huimin

With the development of cloud computing and distributed cluster technology, the concept of big data has been expanded and extended in terms of capacity and value, and machine learning technology has also received unprecedented attention in recent years. Traditional machine learning algorithms cannot be parallelized effectively, so a parallelized support vector machine based on the Spark big data platform is proposed. Firstly, the big data platform is designed with the Lambda architecture, which is divided into three layers: the Batch Layer, Serving Layer, and Speed Layer. Secondly, to improve the training efficiency of support vector machines on large-scale data, "special points" other than the support vectors are considered when merging two support vector machines, namely the points at which the non-support vectors of one subset violate the training results of the other subset, and a cross-validation merging algorithm is proposed. Then, a parallelized support vector machine based on cross-validation is proposed, and the parallelization process is realized on the Spark platform. Finally, experiments on different datasets verify the effectiveness and stability of the proposed method. Experimental results show that the proposed parallelized support vector machine performs well in terms of speed-up ratio, training time, and prediction accuracy.
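A minimal sketch of distributed SVM training on Spark follows, using the stock LinearSVC from Spark MLlib; the paper's cross-validation merging of per-partition SVMs is a custom algorithm and is not reproduced here.

```python
# Minimal sketch of training an SVM in parallel on Spark using Spark MLlib's
# built-in LinearSVC. This is not the paper's cross-validation merging method.
from pyspark.sql import SparkSession
from pyspark.ml.classification import LinearSVC
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.appName("parallel-svm-sketch").getOrCreate()

# Toy two-class dataset; a real job would load a large distributed dataset.
data = spark.createDataFrame(
    [(0.0, Vectors.dense([0.0, 1.1])),
     (1.0, Vectors.dense([2.0, 1.0])),
     (0.0, Vectors.dense([0.2, 1.3])),
     (1.0, Vectors.dense([2.2, 0.8]))],
    ["label", "features"])

svm = LinearSVC(maxIter=50, regParam=0.1)
model = svm.fit(data)  # training is distributed across Spark executors
print("Coefficients:", model.coefficients)
spark.stop()
```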


Author(s): C.S.R. Prabhu, Aneesh Sreevallabh Chivukula, Aditya Mogadala, Rohit Ghosh, L.M. Jenila Livingston

Author(s): Suriya Murugan, Sumithra M. G.

Cognitive radio has emerged as a promising candidate solution for improving spectrum utilization in next-generation wireless networks. Spectrum sensing is one of the main challenges encountered by cognitive radio, and the application of big data techniques is a powerful way to address it. As spectrum resources become increasingly scarce, big-data-based prediction for cognitive radio is an inevitable trend. Signal data from various sources are analyzed using the big data cognitive radio framework, and efficient data analytics can be performed using different types of machine learning techniques. This chapter analyses the process of spectrum sensing in cognitive radio, the challenges of processing spectrum data, and the need for dynamic machine learning algorithms in the decision-making process.
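As a hedged illustration of ML-driven spectrum sensing, the sketch below classifies simulated signal windows as occupied or vacant from simple energy and spectral features; the signal model and feature set are assumptions for demonstration, not a specific framework's method.

```python
# Minimal sketch of ML-based spectrum sensing: label short signal windows
# as "occupied" vs "vacant" from simple statistics. Signals are simulated.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(4)

def window(occupied, n=128):
    """Simulate one sensing window: noise, plus a narrowband tone if occupied."""
    x = rng.normal(0.0, 1.0, n)
    if occupied:
        x += np.sin(2 * np.pi * 0.1 * np.arange(n))
    return x

def features(x):
    """Simple energy and spectral features for the classifier."""
    return [np.mean(x**2), np.var(x), np.abs(np.fft.rfft(x)).max()]

labels = rng.integers(0, 2, 2000).astype(bool)
X = np.array([features(window(occ)) for occ in labels])
y = labels.astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
print("Sensing accuracy:", accuracy_score(y_te, clf.predict(X_te)))
```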

