Delineating the impact of machine learning elements in pre-microRNA detection

PeerJ ◽  
2017 ◽  
Vol 5 ◽  
pp. e3131 ◽  
Author(s):  
Müşerref Duygu Saçar Demirci ◽  
Jens Allmer

Gene regulation modulates RNA expression via transcription factors. Post-transcriptional gene regulation in turn influences the amount of protein product through, for example, microRNAs (miRNAs). Experimentally establishing miRNAs and their effects is complicated, and even futile when the aim is to capture the entirety of miRNA-target interactions. Therefore, computational approaches have been proposed. Many such tools rely on machine learning (ML), which involves example selection, feature extraction, model training, algorithm selection, and parameter optimization. Different ML algorithms have been used for model training on various example sets, more than 1,000 features describing pre-miRNAs have been proposed, and different training and testing schemes have been used for model establishment. For pre-miRNA detection, negative examples cannot easily be established, which poses a problem for two-class classification algorithms. There is also no consensus on which ML approach works best; we therefore set out to establish the impact of the different parts involved in ML on model performance. Furthermore, we established two new negative datasets and analyzed the impact of using them for training and testing. Our aim was to rank the parts involved in ML for pre-miRNA detection by importance, but instead we found that all parts are intricately connected and their contributions cannot easily be untangled, leading us to suggest that attempts at ML-based pre-miRNA detection should explore many scenarios.
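
To make the described pipeline concrete, here is a minimal two-class sketch in Python of the kind of setup the abstract outlines, with scikit-learn standing in for the ML stage. The feature matrices are random placeholders for real pre-miRNA descriptors (which in practice would come from tools such as RNAfold), and swapping the negative set illustrates the entanglement the authors report.

```python
# A minimal sketch of the two-class pre-miRNA classification pipeline:
# hand-crafted features, a positive set, and a putative negative set
# (all placeholders here; real features such as minimum free energy or
# GC content would be computed from hairpin sequences).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X_pos = rng.normal(loc=1.0, size=(200, 10))  # known pre-miRNAs (placeholder)
X_neg = rng.normal(loc=0.0, size=(200, 10))  # putative negatives (placeholder)
X = np.vstack([X_pos, X_neg])
y = np.array([1] * 200 + [0] * 200)

# Swapping the negative set (e.g., shuffled hairpins vs. mRNA-derived
# hairpins) changes both training and the apparent test performance,
# which is exactly the entanglement the study highlights.
clf = RandomForestClassifier(n_estimators=200, random_state=0)
print(cross_val_score(clf, X, y, cv=5, scoring="roc_auc").mean())
```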

2020 ◽  
Vol 12 (6) ◽  
pp. 934 ◽  
Author(s):  
Eriita G. Jones ◽  
Sebastien Wong ◽  
Anthony Milton ◽  
Joseph Sclauzero ◽  
Holly Whittenbury ◽  
...  

Precision viticulture benefits from the accurate detection of vineyard vegetation from remote sensing, without a priori knowledge of vine locations. Vineyard detection enables efficient, and potentially automated, derivation of spatial measures such as the length and area of crop, and hence the required volumes of water, fertilizer, and other resources. Machine learning techniques have provided significant advances in recent years in image segmentation, classification, and object detection, with neural networks shown to perform well in the detection of vineyards and other crops. However, the extent to which the initial choice of input imagery impacts detection and segmentation accuracy has not been quantitatively examined in depth. Here, we use a standard deep convolutional neural network (CNN) to detect and segment vineyards across Australia using DigitalGlobe Worldview-2 images at ∼50 cm (panchromatic) and ∼2 m (multispectral) spatial resolution. A quantitative assessment of the variation in model performance with input parameters during model training is presented from a remote sensing perspective, considering combinations of panchromatic, multispectral, and pan-sharpened multispectral imagery and the spectral Normalised Difference Vegetation Index (NDVI). The impact of image acquisition parameters, namely the off-nadir angle and the solar elevation angle, on the quality of pan-sharpening is also assessed. The results are synthesised into a ‘recipe’ for optimising the accuracy of vineyard segmentation, which can guide others aiming to implement or improve automated crop detection and classification.
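
As a concrete illustration of one of the input layers compared in the study, the sketch below computes NDVI from a red and a near-infrared band; the arrays are random placeholders for co-registered Worldview-2 reflectance data.

```python
# NDVI = (NIR - Red) / (NIR + Red), computed pixel-wise over two
# co-registered reflectance arrays (placeholders below).
import numpy as np

def ndvi(red: np.ndarray, nir: np.ndarray) -> np.ndarray:
    """Normalised Difference Vegetation Index for two reflectance bands."""
    denom = nir + red
    # Guard against division by zero over water/shadow pixels.
    return np.where(denom == 0, 0.0, (nir - red) / np.where(denom == 0, 1, denom))

red = np.random.rand(256, 256).astype(np.float32)  # placeholder reflectance
nir = np.random.rand(256, 256).astype(np.float32)
print(ndvi(red, nir).shape)  # (256, 256), values roughly in [-1, 1]
```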


2020 ◽  
Vol 34 (10) ◽  
pp. 13971-13972 ◽  
Author(s):  
Yang Qi ◽  
Farseev Aleksandr ◽  
Filchenkov Andrey

Nowadays, social networks play a crucial role in everyday human life and are no longer associated purely with spare-time activities. In fact, instant communication with friends and colleagues has become an essential component of our daily interaction, giving rise to the emergence of multiple new types of social network. By participating in such networks, individuals generate a multitude of data points that describe their activities from different perspectives and can, for example, be further used for applications such as personalized recommendation or user profiling. However, the impact of different social media networks on machine learning model performance has not yet been studied comprehensively. In particular, the literature on modeling multi-modal data from multiple social networks is relatively sparse, which inspired us to take a deeper dive into the topic in this preliminary study. Specifically, in this work, we study the performance of different machine learning models trained on multi-modal data from different social networks. Our initial experimental results reveal that the choice of social network impacts performance and that proper selection of the data source is crucial.
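
A hedged sketch of the comparison the study describes: the same classifiers trained separately on feature blocks from different networks. All matrices and labels below are random placeholders, and the source names are hypothetical.

```python
# Train identical models on feature blocks from different (hypothetical)
# social networks and compare cross-validated performance.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
y = rng.integers(0, 2, size=300)  # e.g., a binary user-profiling label
sources = {
    "network_A_text": rng.normal(size=(300, 50)),   # placeholder modality
    "network_B_image": rng.normal(size=(300, 30)),  # placeholder modality
}
models = {
    "logreg": LogisticRegression(max_iter=1000),
    "gbm": GradientBoostingClassifier(),
}
for src, X in sources.items():
    for name, model in models.items():
        auc = cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
        print(f"{src} + {name}: AUC = {auc:.3f}")
```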


2021 ◽  
Author(s):  
Ali Sakhaee ◽  
Anika Gebauer ◽  
Mareike Ließ ◽  
Axel Don

Abstract. Soil organic carbon (SOC), as the largest terrestrial carbon pool, has the potential to influence climate change and its mitigation; consequently, SOC monitoring is important in the frameworks of different international treaties, and high-resolution SOC maps are needed. Machine learning (ML) offers new opportunities here because of its capacity to mine large datasets. The aim of this study was therefore to test three algorithms commonly used in digital soil mapping, random forest (RF), boosted regression trees (BRT) and support vector machine for regression (SVR), on the first German Agricultural Soil Inventory to model agricultural topsoil SOC content. Nested cross-validation was implemented for model evaluation and parameter tuning. Moreover, grid search and the differential evolution algorithm were applied to ensure that each algorithm was suitably tuned and optimised. The SOC content of the German Agricultural Soil Inventory was highly variable, ranging from 4 g kg−1 to 480 g kg−1. However, only 4 % of all soils contained more than 87 g kg−1 SOC and were considered organic or degraded organic soils. The results show that SVR provided the best performance, with an RMSE of 32 g kg−1, when the algorithms were trained on the full dataset. However, the average RMSE of all algorithms decreased by 34 % when mineral and organic soils were modelled separately, with the best result again from SVR (RMSE of 21 g kg−1). Model performance is often limited by the size and quality of the soil dataset available for calibration and validation. Therefore, the impact of enlarging the training data was tested by including 1223 data points from the European Land Use/Land Cover Area Frame Survey for agricultural sites in Germany. Model performance improved by at most 1 % for mineral soils and 2 % for organic soils. Despite the general capability of machine learning algorithms, and of SVR in particular, for modelling SOC at the national scale, the study showed that the most important step for improving model performance was the separate modelling of mineral and organic soils.
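
The nested cross-validation plus grid search scheme mentioned above can be sketched as follows; the covariates and SOC values are synthetic placeholders, and only the grid search branch of the tuning (not differential evolution) is shown.

```python
# Nested CV: an inner grid search tunes the SVR hyperparameters, while an
# outer loop gives an unbiased performance estimate.
import numpy as np
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.svm import SVR

rng = np.random.default_rng(2)
X = rng.normal(size=(400, 12))                   # terrain/climate covariates (placeholder)
y = rng.gamma(shape=2.0, scale=15.0, size=400)   # SOC in g/kg (placeholder)

inner = KFold(n_splits=5, shuffle=True, random_state=0)
outer = KFold(n_splits=5, shuffle=True, random_state=1)
tuned_svr = GridSearchCV(
    SVR(),
    param_grid={"C": [1, 10, 100], "gamma": ["scale", 0.01, 0.1]},
    cv=inner,
    scoring="neg_root_mean_squared_error",
)
rmse = -cross_val_score(tuned_svr, X, y, cv=outer,
                        scoring="neg_root_mean_squared_error")
print(f"nested-CV RMSE: {rmse.mean():.1f} g/kg")
```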


Neurosurgery ◽  
2020 ◽  
Vol 67 (Supplement_1) ◽  
Author(s):  
Syed M Adil ◽  
Cyrus Elahi ◽  
Robert Gramer ◽  
Charis A Spears ◽  
Anthony Fuller ◽  
...  

Abstract INTRODUCTION Traumatic brain injury (TBI) disproportionately affects low- and middle-income countries (LMICs). In these low-resource settings, effective triage of TBI patients, including the decision of whether or not to perform neurosurgery, is critical to optimizing both patient outcomes and healthcare resource utilization. METHODS Data from TBI patients of all ages were prospectively collected at Mulago National Referral Hospital in Kampala, Uganda, from 2016 to 2019. Seven machine learning models (based on 1 linear and 6 non-linear algorithms) designed to predict good vs poor outcome near hospital discharge were developed and internally validated using 5-fold cross-validation. Predictors included clinical variables easily acquired on admission (demographics, physical exam, and mechanism of injury) and whether or not the patient received surgery. Using the elastic-net regularized logistic regression model (GLMnet), the probability of poor outcome was calculated for each patient both with and without surgery, quantifying the “treatment benefit” of surgery. A relative treatment benefit was then calculated as this benefit divided by the probability of poor outcome without surgery. Predictions were calibrated using Platt scaling. RESULTS Ultimately, 1766 patients were included. Areas under the receiver operating characteristic curve (AUCs) ranged from 81.7% (k-nearest neighbors) to 88.0% (random forest). The GLMnet had the second-best AUC at 87.7%. For the entire cohort, the median relative treatment benefit was 37.6% (IQR, 31.0% to 46.0%); similarly, among just those receiving surgery, it was 38.0% (IQR, 31.4% to 47.0%). The top four variables promoting good outcomes in the GLMnet model were a high GCS, being fully alert, having both pupils reactive, and receiving surgery. CONCLUSION We provide the first deployable machine learning-based model to predict TBI outcomes with and without surgery in LMICs, thus enabling more effective surgical decision making in resource-limited settings. Currently, patients are not being optimally chosen for neurosurgical intervention. Future studies should externally validate the model, improve model performance by combining data across countries, and explore the use of more advanced algorithms.
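
A minimal sketch of the relative-treatment-benefit computation defined above: score each patient with the surgery indicator toggled off and on, then divide the difference by the no-surgery risk. The elastic-net logistic regression here is a generic scikit-learn stand-in for the authors' GLMnet fit, and all data are placeholders.

```python
# Relative treatment benefit per the abstract's definition:
#   benefit = P(poor | no surgery) - P(poor | surgery)
#   relative benefit = benefit / P(poor | no surgery)
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
X_clin = rng.normal(size=(500, 8))            # admission variables (placeholder)
surgery = rng.integers(0, 2, size=(500, 1))   # observed treatment
X = np.hstack([X_clin, surgery])
y = rng.integers(0, 2, size=500)              # poor-outcome indicator (placeholder)

model = LogisticRegression(penalty="elasticnet", solver="saga",
                           l1_ratio=0.5, C=1.0, max_iter=5000).fit(X, y)

p_no_surg = model.predict_proba(np.hstack([X_clin, np.zeros((500, 1))]))[:, 1]
p_surg = model.predict_proba(np.hstack([X_clin, np.ones((500, 1))]))[:, 1]
benefit = p_no_surg - p_surg            # absolute treatment benefit
relative_benefit = benefit / p_no_surg  # as defined in the abstract
print(np.median(relative_benefit))
```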


2019 ◽  
Author(s):  
Douglas Spangler ◽  
Thomas Hermansson ◽  
David Smekal ◽  
Hans Blomberg

Abstract Background The triage of patients in pre-hospital care is a difficult task, and improved risk assessment tools are needed both at the dispatch center and on the ambulance to differentiate between low- and high-risk patients. This study develops and validates a machine learning-based approach to predicting hospital outcomes based on routinely collected prehospital data. Methods Dispatch, ambulance, and hospital data were collected in one Swedish region from 2016 to 2017. Dispatch center and ambulance records were used to develop gradient boosting models predicting hospital admission, critical care (defined as admission to an intensive care unit or in-hospital mortality), and two-day mortality. Model predictions were used to generate composite risk scores, which were compared to National Early Warning Score (NEWS) values and actual dispatched priorities in a similar but prospectively gathered dataset from 2018. Results A total of 38,203 patients were included from 2016 to 2018. Concordance indexes (area under the receiver operating characteristic curve) for dispatched priorities ranged from 0.51 to 0.66, while those for NEWS scores ranged from 0.66 to 0.85. Concordance ranged from 0.71 to 0.80 for risk scores based only on dispatch data, and from 0.79 to 0.89 for risk scores that also included ambulance data. Dispatch data-based risk scores consistently outperformed dispatched priorities in predicting hospital outcomes, while models including ambulance data also consistently outperformed NEWS scores. Model performance in the prospective test dataset was similar to that found using cross-validation, and calibration was comparable to that of NEWS scores. Conclusions Machine learning-based risk scores outperformed a widely used rule-based triage algorithm and human prioritization decisions in predicting hospital outcomes. Performance was robust in a prospectively gathered dataset, and scores demonstrated adequate calibration. Future research should investigate the generalizability of these results to prehospital triage in other settings, and establish the impact of triage tools based on these methods by means of randomized trials.
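
The evaluation described above, a gradient boosting risk score compared against a baseline score by concordance (here ROC AUC for a binary outcome), might look like the following sketch; all variables are random placeholders.

```python
# Fit a gradient boosting model on prehospital variables, derive a risk
# score, and compare its AUC against a baseline (placeholder NEWS) score.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(4)
X = rng.normal(size=(1000, 15))         # dispatch + ambulance variables (placeholder)
y = rng.integers(0, 2, size=1000)       # e.g., hospital admission (placeholder)
news = rng.integers(0, 15, size=1000)   # placeholder NEWS scores

X_tr, X_te, y_tr, y_te, _, news_te = train_test_split(
    X, y, news, test_size=0.3, random_state=0)
risk = GradientBoostingClassifier().fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
print("model AUC:", roc_auc_score(y_te, risk))
print("NEWS AUC: ", roc_auc_score(y_te, news_te))
```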


2019 ◽  
Vol 26 (10) ◽  
pp. 977-988 ◽  
Author(s):  
Gang Fang ◽  
Izabela E Annis ◽  
Jennifer Elston-Lafata ◽  
Samuel Cykert

Abstract Objective We aimed to investigate bias in applying machine learning to predict real-world individual treatment effects. Materials and Methods Using a virtual patient cohort, we simulated real-world healthcare data and applied random forest and gradient boosting classifiers to develop prediction models. Treatment effect was estimated as the difference between the predicted outcomes of a treatment and a control. We evaluated the impact on predicting individual outcomes of predictors with known effects on treatment and outcome derived from real-world data (ie, treatment predictors [X1], confounders [X2], treatment effect modifiers [X3], and other outcome risk factors [X4]) and of outcome imbalance. Using counterfactuals, we evaluated the percentage of patients with biased predicted individual treatment effects. Results The X4 had relatively more impact on model performance than X2 and X3 did. No effects were observed from X1. Moderate-to-severe outcome imbalance had a significantly negative impact on model performance, particularly among subgroups in which an outcome occurred. Bias in predicting individual treatment effects was significant and persisted even when the models had 100% accuracy in predicting the health outcome. Discussion Inadequate inclusion of X2, X3, and X4 and moderate-to-severe outcome imbalance may affect model performance in predicting individual outcomes and subsequently bias the predicted individual treatment effects. Machine learning models with all features and high performance in predicting individual outcomes still yielded biased individual treatment effects. Conclusions Direct application of machine learning might not adequately address bias in predicting individual treatment effects. Further method development is needed to advance machine learning to support individualized treatment selection.
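
A sketch of the counterfactual individual-treatment-effect estimation the study probes for bias, using a single model fit on covariates plus a treatment flag (an S-learner-style setup, which may differ in detail from the authors' implementation); data are placeholders with no real signal.

```python
# Predict each (simulated) patient's outcome under treatment and under
# control, and take the difference as the individual treatment effect.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(5)
n = 1000
X = rng.normal(size=(n, 6))       # X2-X4 style covariates (placeholder)
t = rng.integers(0, 2, size=n)    # treatment assignment
y = rng.integers(0, 2, size=n)    # outcome (placeholder, no real signal)

rf = RandomForestClassifier(n_estimators=300, random_state=0)
rf.fit(np.column_stack([X, t]), y)

p_treated = rf.predict_proba(np.column_stack([X, np.ones(n)]))[:, 1]
p_control = rf.predict_proba(np.column_stack([X, np.zeros(n)]))[:, 1]
ite = p_treated - p_control  # predicted individual treatment effect
# The study's point: even when outcome prediction is accurate, `ite` can
# be systematically biased if confounders or effect modifiers are mishandled.
print(ite[:5])
```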


2021 ◽  
Vol 13 (12) ◽  
pp. 2242 ◽  
Author(s):  
Jianzhao Liu ◽  
Yunjiang Zuo ◽  
Nannan Wang ◽  
Fenghui Yuan ◽  
Xinhao Zhu ◽  
...  

The net ecosystem CO2 exchange (NEE) is a critical parameter for quantifying terrestrial ecosystems and their contributions to ongoing climate change. The accumulation of ecological data calls for more advanced quantitative approaches to assist NEE prediction. In this study, we applied two widely used machine learning algorithms, Random Forest (RF) and Extreme Gradient Boosting (XGBoost), to build models simulating NEE in major biomes based on the FLUXNET dataset. Both models accurately predicted NEE in all biomes, while XGBoost had higher computational efficiency (6~62 times faster than RF). Among the environmental variables, both models showed that net solar radiation, soil water content, and soil temperature were the most important, while precipitation and wind speed were less important, in simulating temporal variations of site-level NEE. Model performance under extreme climate conditions varied among biomes. Extreme heat and dryness led to much worse model performance in grassland (extreme heat: R2 = 0.66~0.71, normal: R2 = 0.78~0.81; extreme dryness: R2 = 0.14~0.30, normal: R2 = 0.54~0.55), but the impact on forest was smaller (extreme heat: R2 = 0.50~0.78, normal: R2 = 0.59~0.87; extreme dryness: R2 = 0.86~0.90, normal: R2 = 0.81~0.85). Extreme wet conditions did not change model performance in forest ecosystems (R2 changing by −0.03~0.03 compared with normal conditions) but led to a substantial reduction in model performance in cropland (R2 decreasing by 0.20~0.27). Extreme cold conditions did not lead to much change in model performance in forest and woody savannas (R2 decreasing by 0.01~0.08 and 0.09, respectively). Our study showed that both models need more than 2.5 years of daily training samples to reach good performance and more than 5.4 years to reach optimal performance. In summary, both RF and XGBoost are applicable machine learning algorithms for predicting ecosystem NEE, and XGBoost is the more feasible of the two in terms of accuracy and efficiency.
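
The RF-versus-XGBoost comparison can be sketched as below, timing each fit on identical synthetic data; this assumes the xgboost package and its scikit-learn wrapper are available.

```python
# Compare accuracy (held-out R^2) and wall-clock fit time for RF vs XGBoost
# on the same synthetic regression task.
import time
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor

rng = np.random.default_rng(6)
X = rng.normal(size=(5000, 8))  # e.g., radiation, soil water, temperature (placeholder)
y = X[:, 0] * 0.5 + rng.normal(scale=0.5, size=5000)  # synthetic NEE-like target

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
for name, model in [("RF", RandomForestRegressor(n_estimators=300)),
                    ("XGBoost", XGBRegressor(n_estimators=300))]:
    t0 = time.perf_counter()
    model.fit(X_tr, y_tr)
    print(f"{name}: R2 = {r2_score(y_te, model.predict(X_te)):.2f}, "
          f"fit time = {time.perf_counter() - t0:.2f}s")
```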


Biostatistics ◽  
2020 ◽  
Author(s):  
W Katherine Tan ◽  
Patrick J Heagerty

Summary Scalable and accurate identification of specific clinical outcomes has been enabled by machine learning applied to electronic medical record systems. The development of classification models requires the collection of a complete labeled dataset, where true clinical outcomes are obtained by manual review by human experts. For example, the development of natural language processing algorithms requires the abstraction of clinical text data to obtain the outcome information necessary for training models. However, if the outcome is rare, then simple random sampling yields very few cases and insufficient information for developing accurate classifiers. Since large-scale detailed abstraction is often expensive, time-consuming, and infeasible, more efficient strategies are needed. Under such resource-constrained settings, we propose a class of enrichment sampling designs in which selection for abstraction is stratified by auxiliary variables related to the true outcome of interest. Stratified sampling on highly specific variables produces targeted samples that are more enriched with cases, which we show translates into increased model discrimination and better statistical learning performance. We provide mathematical details and simulation evidence linking sampling designs to the resulting prediction model performance. We discuss the impact of the proposed sampling on both model training and validation. Finally, we illustrate the proposed designs for outcome label collection and subsequent machine learning using radiology report text data from the Lumbar Imaging with Reporting of Epidemiology study.
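
A small sketch of the enrichment sampling idea: stratify the labeling pool on an auxiliary flag correlated with the outcome and oversample the flag-positive stratum within a fixed abstraction budget. The 70/30 budget split below is an arbitrary illustration, not the paper's design.

```python
# Enrichment sampling: oversample records whose auxiliary variable
# (e.g., a keyword hit in the report text) suggests a likely case.
import numpy as np

rng = np.random.default_rng(7)
n = 20000
aux = rng.random(n) < 0.05  # auxiliary flag correlated with the rare outcome
budget = 500                # records we can afford to have experts label

idx_pos = np.flatnonzero(aux)
idx_neg = np.flatnonzero(~aux)
n_pos = min(int(0.7 * budget), idx_pos.size)  # spend ~70% on enriched stratum
sample = np.concatenate([
    rng.choice(idx_pos, size=n_pos, replace=False),
    rng.choice(idx_neg, size=budget - n_pos, replace=False),
])
# Downstream training/validation must account for the stratified design,
# e.g., by weighting with inverse sampling probabilities.
print(f"sampled {sample.size} records, {aux[sample].mean():.0%} aux-positive "
      f"vs {aux.mean():.0%} in the full pool")
```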


2020 ◽  
Vol 10 (1) ◽  
Author(s):  
Rinu Chacko ◽  
Deepak Jain ◽  
Manasi Patwardhan ◽  
Abhishek Puri ◽  
Shirish Karande ◽  
...  

Abstract Machine learning and data analytics are increasingly being used for quantitative structure-property relationship (QSPR) applications in the chemical domain, where the traditional Edisonian approach to knowledge discovery has not been fruitful. The perception of odorant stimuli is one such application, as olfaction is the least understood of the senses. In this study, we employ machine learning algorithms and data analytics to assess the efficacy of a data-driven approach to predicting the perceptual attributes of an odorant, namely the odor characters (OC) “sweet” and “musky”. We first analyze a psychophysical dataset containing perceptual ratings from 55 subjects to reveal patterns in the subjects' ratings. We then use the data to train several machine learning algorithms, such as random forest, gradient boosting, and support vector machine, to predict the odor characters, and we report the structural features that correlate well with the odor characters based on the optimal model. Furthermore, we analyze the impact of data quality on model performance by comparing the semantic descriptors generally associated with a given odorant to its perception by the majority of subjects. The study presents a methodology for developing models of odor perception and provides insights into the perception of odorants by untrained human subjects and into the effect of inherent bias in the perception data on model performance. The models and methodology developed here could be used to predict the odor characters of new odorants.
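
A minimal sketch of the modelling step, with several classifiers cross-validated on placeholder molecular descriptors to predict a majority-vote “sweet” label; the real study would use curated descriptors and the psychophysical ratings.

```python
# Cross-validate the three classifier families named in the abstract on
# placeholder structure-derived features and a binary odor-character label.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(8)
X = rng.normal(size=(480, 100))   # molecular descriptors (placeholder)
y = rng.integers(0, 2, size=480)  # majority-vote "sweet" label (placeholder)

for name, model in [("RF", RandomForestClassifier()),
                    ("GBM", GradientBoostingClassifier()),
                    ("SVM", SVC(probability=True))]:
    auc = cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
    print(f"{name}: AUC = {auc:.3f}")
```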

