Circular RNA as a potential biomarker for forensic age prediction using multiple machine learning models: A preliminary study

Mapping Intimacies ◽

10.1101/2020.11.10.376418 ◽

2020 ◽

Author(s):

Junyan Wang ◽

Chunyan Wang ◽

Lihong Fu ◽

Qian Wang ◽

Guangping Fu ◽

...

Keyword(s):

Random Forest ◽

Age Estimation ◽

Prediction Models ◽

Circular RNA ◽

Support Vector ◽

Random Forest Regression ◽

Age Related ◽

Potential Biomarker ◽

Preliminary Study ◽

Age Prediction

Abstract: In forensic science, accurate estimation of the age of a victim or suspect can help investigators narrow a search and aid in solving a crime. Aging is a complex process associated with various forms of molecular regulation at the DNA and RNA levels. Recent studies have shown that circular RNAs (circRNAs) are globally upregulated during aging in multiple organisms, such as mice and C. elegans, because of their ability to resist degradation by exoribonucleases. In the current study, we investigated the potential of circRNAs for age prediction. We identified more than 40,000 circRNAs in the blood of thirteen unrelated healthy Chinese individuals aged 20-62 years from their circRNA-seq profiles. Three methods were applied to select age-related circRNA candidates: false discovery rate, lasso regression, and support vector machine. The analysis uncovered a strong bias toward circRNA upregulation during aging in human blood. A total of 28 circRNAs were chosen for further validation by RT-qPCR in 50 healthy unrelated subjects aged between 19 and 72 years, and 7 age-related circRNAs were finally selected for the age prediction models. Several algorithms, including multivariate linear regression (MLR), regression tree, bagging regression, random forest regression (RFR), and support vector regression (SVR), were compared based on root mean square error (RMSE) and mean absolute error (MAE) values. Among the five modeling methods, RFR performed best, with an RMSE of 5.072 years and an MAE of 4.065 years (R2 = 0.902). In this preliminary study, we used circRNAs for the first time as additional novel age-related biomarkers for developing forensic age estimation models. We propose that the use of circRNAs to obtain additional clues for forensic investigations and to serve as aging indicators for age prediction could become a promising field of interest.

Author summary: In forensic investigations, age estimation from biological evidence recovered from crime scenes can provide additional information, such as the chronological age or appearance of a culprit, which can give valuable investigative leads, especially when no eyewitness is available. Hence, generating an accurate model for age prediction from body fluids commonly found at crime scenes, such as blood, can be of vital importance. Various molecular changes at the DNA and RNA levels have been found to be upregulated or downregulated over a person’s lifetime. Although some biomarkers have been shown to be associated with aging and used to predict age, they have several disadvantages, such as low sensitivity and prediction accuracy, instability, and susceptibility to disease or immune state, which limit their applicability in age estimation. Here, we used a novel biomarker, circular RNA (circRNA), to generate highly accurate age prediction models. We propose that circRNA is more suitable for degraded forensic samples because of its unique molecular structure. This preliminary research offers a new direction for exploring potential biomarkers for age prediction.
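
The three figures used to compare the models in the abstract above (RMSE, MAE, and R2) can be sketched in plain Python as follows. This is only an illustration of the standard definitions; the function names and the toy ages are assumptions made for the example and are not taken from the study.

```python
import math

def rmse(y_true, y_pred):
    """Root mean square error: square root of the mean squared residual."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mae(y_true, y_pred):
    """Mean absolute error: mean of the absolute residuals."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Hypothetical chronological and predicted ages in years (made up for illustration).
ages_true = [20.0, 30.0, 40.0, 50.0]
ages_pred = [22.0, 28.0, 43.0, 49.0]

print(f"RMSE = {rmse(ages_true, ages_pred):.3f} years")
print(f"MAE  = {mae(ages_true, ages_pred):.3f} years")
print(f"R2   = {r2(ages_true, ages_pred):.3f}")
```

RMSE penalizes large individual errors more heavily than MAE, which is one reason for reporting both when ranking models.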

Machine Learning-Based Prediction of Air Quality

Applied Sciences ◽

10.3390/app10249151 ◽

2020 ◽

Vol 10 (24) ◽

pp. 9151

Author(s):

Yun-Chia Liang ◽

Yona Maimury ◽

Angela Hsiang-Ling Chen ◽

Josue Rodolfo Cuevas Juarez

Keyword(s):

Machine Learning ◽

Air Quality ◽

Random Forest ◽

Prediction Models ◽

Superior Performance ◽

Support Vector ◽

Economic Activities ◽

Adaptive Boosting ◽

Series Of Experiments ◽

Artificial Neural Network (ANN)

Air, an essential natural resource, has been compromised in terms of quality by economic activities. Considerable research has been devoted to predicting instances of poor air quality, but most studies are limited by insufficient longitudinal data, making it difficult to account for seasonal and other factors. Several prediction models have been developed using an 11-year dataset collected by Taiwan’s Environmental Protection Administration (EPA). Machine learning methods, including adaptive boosting (AdaBoost), artificial neural network (ANN), random forest, stacking ensemble, and support vector machine (SVM), produce promising results for air quality index (AQI) level predictions. A series of experiments, using datasets for three different regions to obtain the best prediction performance from the stacking ensemble, AdaBoost, and random forest, found the stacking ensemble delivers consistently superior performance for R2 and RMSE, while AdaBoost provides best results for MAE.

Empirical methods for the estimation of Southern Ocean CO2: support vector and random forest regression

Biogeosciences ◽

10.5194/bg-14-5551-2017 ◽

2017 ◽

Vol 14 (23) ◽

pp. 5551-5569 ◽

Cited By ~ 10

Author(s):

Luke Gregor ◽

Schalk Kok ◽

Pedro M. S. Monteiro

Keyword(s):

Random Forest ◽

Southern Ocean ◽

Synthetic Data ◽

Good Effect ◽

Support Vector ◽

CO2 Uptake ◽

Empirical Methods ◽

Random Forest Regression ◽

Proxy Variables ◽

The Impact

Abstract. The Southern Ocean accounts for 40 % of oceanic CO2 uptake, but the estimates are bound by large uncertainties due to a paucity of observations. Gap-filling empirical methods have been used to good effect to approximate pCO2 from satellite-observable variables in other parts of the ocean, but many of these methods are not in agreement in the Southern Ocean. In this study we propose two additional methods that perform well in the Southern Ocean: support vector regression (SVR) and random forest regression (RFR). The methods are used to estimate ΔpCO2 in the Southern Ocean based on SOCAT v3, achieving similar trends to the SOM-FFN method by Landschützer et al. (2014). Results show that the SOM-FFN and RFR approaches have RMSEs of similar magnitude (14.84 and 16.45 µatm, where 1 atm = 101 325 Pa), whereas the SVR method has a larger RMSE (24.40 µatm). However, the larger errors for SVR and RFR are, in part, due to an increase in coastal observations from SOCAT v2 to v3, where the SOM-FFN method used v2 data. The success of both SOM-FFN and RFR depends on the ability to adapt to different modes of variability. The SOM-FFN achieves this by having independent regression models for each cluster, while this flexibility is intrinsic to the RFR method. Analyses of the estimates show that the SVR and RFR's respective sensitivity and robustness to outliers define the outcome significantly. Further analyses of the methods were performed using a synthetic dataset to assess the following questions: which method (RFR or SVR) has the best performance? What is the effect of using time, latitude and longitude as proxy variables on ΔpCO2? What is the impact of the sampling bias in the SOCAT v3 dataset on the estimates? We find that while RFR is indeed better than SVR, the ensemble of the two methods outperforms either one, due to complementary strengths and weaknesses of the methods. Results also show that for the RFR and SVR implementations, it is better to include coordinates as proxy variables, as RMSE scores are lowered and the phasing of the seasonal cycle is more accurate. Lastly, we show that there is only a weak bias due to undersampling. The synthetic data provide a useful framework to test methods in regions of sparse data coverage and show potential as a useful tool to evaluate methods in future studies.
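
The finding that an ensemble of two methods can outperform either one can be illustrated with a toy average of two sets of predictions. This sketch assumes the simplest possible ensemble (an unweighted mean), which the abstract does not confirm as the authors' approach. All numbers are hypothetical and chosen so that the two models err in opposite directions; `svr_pred` and `rfr_pred` are placeholder lists, not outputs of fitted models.

```python
import math

def rmse(y_true, y_pred):
    """Root mean square error between two equal-length sequences."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

# Hypothetical ΔpCO2 values in µatm (not taken from the paper).
observed = [10.0, -5.0, 0.0, 8.0]
svr_pred = [14.0, -1.0, 4.0, 12.0]  # placeholder model that overestimates by 4
rfr_pred = [8.0, -7.0, -2.0, 6.0]   # placeholder model that underestimates by 2

# Unweighted mean of the two predictions (an assumed ensemble rule).
ensemble_pred = [(s + r) / 2 for s, r in zip(svr_pred, rfr_pred)]

print("SVR RMSE:     ", rmse(observed, svr_pred))
print("RFR RMSE:     ", rmse(observed, rfr_pred))
print("Ensemble RMSE:", rmse(observed, ensemble_pred))
```

Averaging only helps this much when the errors of the two models partly cancel; models that err in the same direction gain little from it.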

Statistical and Electrical Features Evaluation for Electrical Appliances Energy Disaggregation

Sustainability ◽

10.3390/su11113222 ◽

2019 ◽

Vol 11 (11) ◽

pp. 3222 ◽

Cited By ~ 15

Author(s):

Pascal Schirmer ◽

Iosif Mporas

Keyword(s):

Random Forest ◽

Machine Learning Algorithms ◽

Support Vector ◽

Random Forest Regression ◽

Nearest Neighbours ◽

Energy Disaggregation ◽

Vector Machines ◽

Non Linear ◽

Load Monitoring ◽

Sinusoidal Current

In this paper we evaluate several well-known and widely used machine learning algorithms for regression in the energy disaggregation task. Specifically, the Non-Intrusive Load Monitoring approach was considered and the K-Nearest-Neighbours, Support Vector Machines, Deep Neural Networks and Random Forest algorithms were evaluated across five datasets using seven different sets of statistical and electrical features. The experimental results demonstrated the importance of selecting both appropriate features and regression algorithms. Analysis on device level showed that linear devices can be disaggregated using statistical features, while for non-linear devices the use of electrical features significantly improves the disaggregation accuracy, as non-linear appliances have non-sinusoidal current draw and thus cannot be well parametrized only by their active power consumption. The best performance in terms of energy disaggregation accuracy was achieved by the Random Forest regression algorithm.

Algorithmic and data modeling: Will algorithmic modeling improve predictions of traits evaluated on ordinal scales?

10.1101/2020.10.07.329466 ◽

2020 ◽

Author(s):

Zhanyou Xu ◽

Andreomar Kurek ◽

Steven B. Cannon ◽

Williams D. Beavis

Keyword(s):

Support Vector Machine ◽

Random Forest ◽

Ridge Regression ◽

Genomic Prediction ◽

Ordinal Data ◽

Prediction Models ◽

Characteristic Curve ◽

Gradient Boosting ◽

Support Vector ◽

Data Types

Abstract: Selection of markers linked to alleles at quantitative trait loci (QTL) for tolerance to Iron Deficiency Chlorosis (IDC) has not been successful. Genomic selection has been advocated for continuous numeric traits such as yield and plant height. For ordinal data types such as IDC, genomic prediction models have not been systematically compared. The objectives of the research reported in this manuscript were to compare the most commonly used genomic prediction method, ridge regression, and its equivalent logistic ridge regression method with algorithmic modeling methods, including random forest, gradient boosting, support vector machine, K-nearest neighbors, Naïve Bayes, and artificial neural network, using the usual comparator metric of prediction accuracy. In addition, we compared the methods using metrics of greater importance for decisions about selecting and culling lines for use in variety development and genetic improvement projects. These metrics include specificity, sensitivity, precision, decision accuracy, and area under the receiver operating characteristic curve. We found that Support Vector Machine provided the best specificity for culling IDC susceptible lines, while Random Forest GP models provided the best combined set of decision metrics for retaining IDC tolerant and culling IDC susceptible lines.
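
The decision metrics named in the abstract above (sensitivity, specificity, precision, accuracy) follow from the counts of a binary confusion matrix, as in the sketch below. The binary coding (1 = susceptible, 0 = tolerant) is an assumption made for this example; the study itself works with ordinal IDC scores, and how those were thresholded is not stated here. The function name and the labels are hypothetical.

```python
def decision_metrics(y_true, y_pred):
    """Compute binary decision metrics from true and predicted 0/1 labels.

    Assumes each class occurs at least once in y_true and at least one
    positive prediction exists; otherwise a division by zero would occur.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    return {
        "sensitivity": tp / (tp + fn),   # share of true positives recovered
        "specificity": tn / (tn + fp),   # share of true negatives recovered
        "precision": tp / (tp + fp),     # share of positive calls that are right
        "accuracy": (tp + tn) / len(pairs),
    }

# Hypothetical labels: 1 = susceptible line, 0 = tolerant line.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

metrics = decision_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Two models with the same accuracy can differ in these metrics, which is why they are reported separately when the costs of wrongly culling and wrongly retaining a line are unequal.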

Systematic Framework to Predict Early-Stage Liver Carcinoma Using Hybrid of Feature Selection Techniques and Regression Techniques

Complexity ◽

10.1155/2022/7816200 ◽

2022 ◽

Vol 2022 ◽

pp. 1-11

Author(s):

Marium Mehmood ◽

Nasser Alshammari ◽

Saad Awadh Alanazi ◽

Fahad Ahmad

Keyword(s):

Feature Selection ◽

Random Forest ◽

Liver Diseases ◽

Early Stage ◽

Support Vector ◽

Liver Carcinoma ◽

Random Forest Regression ◽

Soft Computing Techniques ◽

Regression Algorithms ◽

Regression Techniques

The liver is a vital organ of the human body, but detecting liver disease at an early stage is very difficult because its symptoms are hidden. Liver diseases may cause loss of energy or weakness once irregularities in liver function become apparent. Cancer is one of the most common liver diseases and also the most fatal: harmful cells grow uncontrollably inside the liver, and a late diagnosis may lead to death. Treating liver disease at an early stage is therefore an important issue, as is designing a model to diagnose the disease early. First, appropriate features that play a significant part in the early detection of liver cancer should be identified, which requires extracting the essential features from thousands of unwanted ones. These features are mined using data mining and soft computing techniques, which give optimized results that help in diagnosing the disease at an early stage. We use feature selection methods, namely Filter, Wrapper, and Embedded methods, to reduce the number of features in the dataset. Different regression algorithms are then applied to each of these methods individually to evaluate the results: Linear Regression, Ridge Regression, LASSO Regression, Support Vector Regression, Decision Tree Regression, Multilayer Perceptron Regression, and Random Forest Regression. We evaluated our results based on the accuracy and error rates of these regression algorithms. The results show that Random Forest Regression with the Wrapper method performs best among all the deployed regression techniques, giving the highest R2 score of 0.8923 and the lowest MSE of 0.0618.

Classification models using circulating neutrophil transcripts can detect unruptured intracranial aneurysm

Journal of Translational Medicine ◽

10.1186/s12967-020-02550-2 ◽

2020 ◽

Vol 18 (1) ◽

Author(s):

Kerry E. Poppenberg ◽

Vincent M. Tutino ◽

Lu Li ◽

Muhammad Waqas ◽

Armond June ◽

...

Keyword(s):

Machine Learning ◽

Random Forest ◽

Prediction Models ◽

Model Performance ◽

Supervised Machine Learning ◽

Support Vector ◽

Learning Methods ◽

Training Cohort ◽

Network Analyses ◽

Machine Learning Methods

Abstract

Background: Intracranial aneurysms (IAs) are dangerous because of their potential to rupture. We previously found significant RNA expression differences in circulating neutrophils between patients with and without unruptured IAs and trained machine learning models to predict presence of IA using 40 neutrophil transcriptomes. Here, we aim to develop a predictive model for unruptured IA using neutrophil transcriptomes from a larger population and more robust machine learning methods.

Methods: Neutrophil RNA extracted from the blood of 134 patients (55 with IA, 79 IA-free controls) was subjected to next-generation RNA sequencing. In a randomly-selected training cohort (n = 94), the Least Absolute Shrinkage and Selection Operator (LASSO) selected transcripts, from which we constructed prediction models via 4 well-established supervised machine-learning algorithms (K-Nearest Neighbors, Random Forest, and Support Vector Machines with Gaussian and cubic kernels). We tested the models in the remaining samples (n = 40) and assessed model performance by receiver-operating-characteristic (ROC) curves. Real-time quantitative polymerase chain reaction (RT-qPCR) of 9 IA-associated genes was used to verify gene expression in a subset of 49 neutrophil RNA samples. We also examined the potential influence of demographics and comorbidities on model prediction.

Results: Feature selection using LASSO in the training cohort identified 37 IA-associated transcripts. Models trained using these transcripts had a maximum accuracy of 90% in the testing cohort. The testing performance across all methods had an average area under ROC curve (AUC) = 0.97, an improvement over our previous models. The Random Forest model performed best across both training and testing cohorts. RT-qPCR confirmed expression differences in 7 of 9 genes tested. Gene ontology and IPA network analyses performed on the 37 model genes reflected dysregulated inflammation, cell signaling, and apoptosis processes. In our data, demographics and comorbidities did not affect model performance.

Conclusions: We improved upon our previous IA prediction models based on circulating neutrophil transcriptomes by increasing sample size and by implementing LASSO and more robust machine learning methods. Future studies are needed to validate these models in larger cohorts and further investigate effect of covariates.
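
The area under the ROC curve reported above has a simple interpretation that can be sketched directly: it is the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as half. The function below computes it by comparing all positive/negative pairs. The function name, labels, and classifier scores are hypothetical and are not data from the study.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    labels: 0/1 class labels; scores: classifier scores, higher = more positive.
    Ties between a positive and a negative score count as half a correct pair.
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    correct = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(pos) * len(neg))

# Hypothetical labels (1 = IA, 0 = control) and made-up model scores.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]

auc = roc_auc(labels, scores)
print(f"AUC = {auc:.3f}")
```

The pairwise loop is quadratic in the number of samples, which is acceptable for cohorts of this size; rank-based formulas give the same value more efficiently on large datasets.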

The age-related expression decline of ERCC1 and XPF for forensic age estimation: A preliminary study

Journal of Forensic and Legal Medicine ◽

10.1016/j.jflm.2017.05.005 ◽

2017 ◽

Vol 49 ◽

pp. 15-19 ◽

Cited By ~ 2

Author(s):

Xiao-Dong Deng ◽

Qin Gao ◽

Wei Zhang ◽

Bo Zhang ◽

Ying Ma ◽

...

Keyword(s):

Age Estimation ◽

Forensic Age Estimation ◽

Age Related ◽

Preliminary Study

Review of Gregor et al: Empirical methods for the estimation of Southern Ocean CO2: Support Vector and Random Forest Regression

10.5194/bg-2017-215-rc1 ◽

2017 ◽

Author(s):

Anonymous

Keyword(s):

Random Forest ◽

Southern Ocean ◽

Support Vector ◽

Empirical Methods ◽

Random Forest Regression

Data-Driven-Based Forecasting of Two-Phase Flow Parameters in Rectangular Channel

Frontiers in Energy Research ◽

10.3389/fenrg.2021.641661 ◽

2021 ◽

Vol 9 ◽

Author(s):

Qingyu Huang ◽

Yang Yu ◽

Yaoyi Zhang ◽

Bo Pang ◽

Yafeng Wang ◽

...

Keyword(s):

Random Forest ◽

Two Phase Flow ◽

Interfacial Area ◽

Machine Learning Algorithms ◽

Data Driven ◽

Support Vector ◽

Phase Flow ◽

Random Forest Regression ◽

Two Phase ◽

Flow Parameters

In the current nuclear reactor system analysis codes, the interfacial area concentration and void fraction are mainly obtained through empirical relations based on different flow regime maps. In the present research, a data-driven method is proposed that uses four machine learning algorithms (lasso regression, support vector regression, random forest regression and back propagation neural network) to predict important two-phase flow parameters in rectangular channels, and the performance of the different models is evaluated using multiple metrics. The random forest regression algorithm was found to have the strongest ability to learn from the experimental data in this study. Test results show that the prediction errors of the random forest regression model for interfacial area concentrations and void fractions are all less than 20%, which means the target parameters have been forecasted with good accuracy.

EEG-based age-prediction models as stable and heritable indicators of brain maturational level in children and adolescents

10.1101/407049 ◽

2018 ◽

Author(s):

Marjolein M.L.J.Z. Vandenbosch ◽

Dennis van’t Ent ◽

Dorret I. Boomsma ◽

Andrey P. Anokhin ◽

Dirk J.A. Smit

Keyword(s):

Random Forest ◽

Children And Adolescents ◽

Prediction Error ◽

Prediction Models ◽

Brain Activity ◽

Childhood And Adolescence ◽

Functional Brain ◽

EEG Recordings ◽

Functional Brain Activity ◽

Age Prediction

Abstract: The human brain shows remarkable development of functional brain activity from childhood to adolescence. Here, we investigated whether electroencephalogram (EEG) recordings are suitable for predicting the age of children and adolescents. Moreover, we investigated whether over- or underestimation of age was stable over longer time periods, as a stable prediction error can be interpreted as reflecting individual brain maturational level. Finally, we established whether the age-prediction error was genetically determined. Three minutes of eyes-closed resting-state EEG data from the longitudinal EEG studies of the Netherlands Twin Register (n = 836) and Washington University in St. Louis (n = 702) were used at ages 5, 7, 12, 14, 16 and 18. Longitudinal data were available within childhood and adolescence. We calculated power in 1 Hz wide bins (1 to 24 Hz). Random Forest regression and Relevance Vector Machine with 6-fold cross-validation were applied. The best mean absolute prediction error was obtained with Random Forest (1.22 years). Classification of childhood vs. adolescence reached over 94% accuracy. Prediction errors were moderately to highly stable over periods of 1.5 to 2.1 years (0.53 < r < 0.74) and significantly affected by genetic factors (heritability between 42% and 79%). Our results show that age prediction from low-cost EEG recordings is comparable in accuracy to that obtained with MRI. Children and adolescents showed stable over- or underestimation of their age, which means that some participants have stable brain activity patterns that reflect those of an older or younger age, and could therefore reflect individual brain maturational level. This prediction error is heritable, suggesting that genes underlie the maturational level of functional brain activity. We propose that age prediction based on EEG recordings can be used for tracking neurodevelopment in typically developing children, in preterm children, and in children with neurodevelopmental disorders.
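
The 6-fold cross-validation mentioned in the abstract above can be sketched as a plain index split: the samples are divided into six folds, and each fold serves once as the test set while the other five are used for training. The abstract does not say how the folds were formed (for example, whether twins or repeated measurements of one participant were kept in the same fold), so the contiguous, unshuffled split below is an assumption, and `k_fold_indices` is a hypothetical helper name.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for k contiguous folds.

    Fold sizes differ by at most one: the first n_samples % k folds
    receive one extra sample.
    """
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size

# Hypothetical dataset of 20 recordings split into 6 folds.
folds = list(k_fold_indices(20, 6))
for i, (train, test) in enumerate(folds, start=1):
    print(f"fold {i}: {len(train)} training samples, test indices {test}")
```

With related or repeated samples, a split that ignores family or participant identity can let near-duplicates appear on both sides of a fold and make the reported error look better than it is, so grouping by family or participant is a common precaution.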
