mLoc-mRNA: predicting multiple sub-cellular localization of mRNAs using random forest algorithm coupled with feature selection via elastic net

2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Prabina Kumar Meher ◽  
Anil Rai ◽  
Atmakuri Ramakrishna Rao

Abstract Background Localization of messenger RNAs (mRNAs) plays a crucial role in the growth and development of cells. In particular, it plays a major role in regulating spatio-temporal gene expression. In situ hybridization is a promising experimental technique for determining the localization of mRNAs, but it is costly and laborious. It is also known that a single mRNA can be present in more than one location, whereas existing computational tools are capable of predicting only a single location for such mRNAs. Thus, a high-end computational tool is required for reliable and timely prediction of multiple subcellular locations of mRNAs. Hence, we developed the present computational model to predict the multiple localizations of mRNAs. Results The mRNA sequences from 9 different localizations were considered. Each sequence was first transformed into a numeric feature vector of size 5460, based on k-mer features of sizes 1–6. Out of the 5460 k-mer features, 1812 important features were selected by the Elastic Net statistical model. The Random Forest supervised learning algorithm was then employed to predict the localizations using the selected features. Five-fold cross-validation accuracies of 70.87, 68.32, 68.36, 68.79, 96.46, 73.44, 70.94, 97.42 and 71.77% were obtained for the cytoplasm, cytosol, endoplasmic reticulum, exosome, mitochondrion, nucleus, pseudopodium, posterior and ribosome, respectively. With an independent test set, accuracies of 65.33, 73.37, 75.86, 72.99, 94.26, 70.91, 65.53, 93.60 and 73.45% were obtained for the respective localizations. The developed approach also achieved higher accuracies than the existing localization prediction tools. Conclusions This study presents a novel computational tool for predicting the multiple localizations of mRNAs. Based on the proposed approach, an online prediction server "mLoc-mRNA" is accessible at http://cabgrid.res.in:8080/mlocmrna/. The developed approach is believed to supplement the existing tools and techniques for localization prediction of mRNAs.
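
Below is a minimal, hypothetical sketch of the pipeline described in this abstract — k-mer counting for k = 1–6 (4 + 16 + ... + 4096 = 5460 features), elastic-net-based feature selection, and a random forest classifier — using scikit-learn. The toy sequences, labels, and parameter values are placeholders, not the authors' data or settings.

```python
from itertools import product
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline

def kmer_vector(seq, k_max=6):
    """Relative k-mer frequencies for k = 1..k_max (4 + 16 + ... + 4096 = 5460 features)."""
    feats = []
    for k in range(1, k_max + 1):
        counts = {"".join(p): 0 for p in product("ACGU", repeat=k)}
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if kmer in counts:
                counts[kmer] += 1
        total = max(len(seq) - k + 1, 1)
        feats.extend(v / total for v in counts.values())
    return np.array(feats)

# toy data: replace with real mRNA sequences and a per-localization binary label
seqs = ["AUGGCUACGUAGCUAGCUAGGCAU", "GCGCAUAUAGCGCGAUAUCCGUAA"] * 20
y = np.random.RandomState(0).randint(0, 2, len(seqs))        # hypothetical labels
X = np.vstack([kmer_vector(s) for s in seqs])                # shape (n_sequences, 5460)

pipe = Pipeline([
    ("select", SelectFromModel(ElasticNet(alpha=0.01, l1_ratio=0.5))),  # keep influential k-mers
    ("rf", RandomForestClassifier(n_estimators=500, random_state=0)),   # predict localization
])
pipe.fit(X, y)
print("features kept:", pipe.named_steps["select"].get_support().sum())
```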

2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Sunhae Kim ◽  
Hye-Kyung Lee ◽  
Kounseok Lee

Abstract The Minnesota Multiphasic Personality Inventory-2 (MMPI-2) is a widely used tool for early detection of psychological maladjustment and for assessing the level of adaptation of large groups in clinical settings, schools, and corporations. This study aims to evaluate the utility of the MMPI-2 in assessing suicidal risk using the results of the MMPI-2 and a suicidal risk evaluation. A total of 7,824 datasets collected from college students were analyzed. The MMPI-2 Restructured Clinical Scales (MMPI-2-RF) and the responses to each question of the Mini International Neuropsychiatric Interview (MINI) suicidality module were used. For statistical analysis, random forest and K-Nearest Neighbors (KNN) techniques were used with suicidal ideation and suicide attempt as dependent variables and 50 MMPI-2 scale scores as predictors. On applying the random forest method to suicidal ideation and suicide attempts, the accuracy was 92.9% and 95%, respectively, and the Areas Under the Curve (AUCs) were 0.844 and 0.851, respectively. When the KNN method was applied, the accuracy was 91.6% and 94.7%, respectively, and the AUCs were 0.722 and 0.639, respectively. The study confirmed that machine learning using the MMPI-2 for a large group provides reliable accuracy in classifying and predicting subjects' suicidal ideation and past suicide attempts.
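
A minimal sketch of the kind of analysis reported above, assuming a matrix of 50 MMPI-2 scale scores and a binary suicidal-ideation flag; the data, split, and hyperparameters are placeholders rather than the study's protocol.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, roc_auc_score

rng = np.random.default_rng(0)
X = rng.normal(50, 10, size=(7824, 50))   # placeholder for 50 MMPI-2 scale scores
y = rng.integers(0, 2, size=7824)         # placeholder suicidal-ideation labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

models = [("random forest", RandomForestClassifier(n_estimators=300, random_state=0)),
          ("KNN", KNeighborsClassifier(n_neighbors=5))]
for name, model in models:
    model.fit(X_tr, y_tr)
    proba = model.predict_proba(X_te)[:, 1]
    print(f"{name}: accuracy={accuracy_score(y_te, model.predict(X_te)):.3f}, "
          f"AUC={roc_auc_score(y_te, proba):.3f}")
```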


Water ◽  
2021 ◽  
Vol 13 (9) ◽  
pp. 1237
Author(s):  
Vanesa Mateo Pérez ◽  
José Manuel Mesa Fernández ◽  
Joaquín Villanueva Balsera ◽  
Cristina Alonso Álvarez

The content of fats, oils, and greases (FOG) in wastewater, resulting from food preparation both in homes and in different commercial and industrial activities, is a growing problem. In addition to the blockages generated in sewer networks, it also hampers the performance of wastewater treatment plants (WWTPs), increasing energy and maintenance costs and worsening the performance of downstream treatment processes. The pretreatment stage of these facilities is responsible for removing most of the FOG to avoid these problems. So far, however, optimization has been limited to correct design and initial sizing of the installation. Proper management of this initial stage is left to the operators' experience in adjusting the process when the characteristics of the influent wastewater change. The main difficulty is the large number of factors influencing these changes. In this work, a prediction model of the FOG content in the inlet water is presented. The model correctly predicts 98.45% of the cases in training and 72.73% in testing, with a relative error of 10%. It was developed using random forest (RF), and the good results obtained (R2 = 0.9348 and RMSE = 0.089 in test) will make it possible to improve operations in this initial stage. The strengths of this machine learning algorithm had not previously been exploited in the modeling of pretreatment parameters. This novel approach will result in a global improvement in the performance of this type of facility, allowing early adjustments to the pretreatment process to remove the maximum amount of FOG.
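
As an illustration of the modelling setup described above, the sketch below trains a random forest regressor on synthetic influent features and reports R2 and RMSE; the predictors, data, and hyperparameters are assumptions, not the plant's actual variables or the authors' model.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score, mean_squared_error

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 8))    # stand-ins for influent characteristics (flow, pH, solids, ...)
y = X @ rng.normal(size=8) + rng.normal(scale=0.3, size=1000)   # synthetic FOG content signal

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=1)
rf = RandomForestRegressor(n_estimators=500, random_state=1).fit(X_tr, y_tr)

pred = rf.predict(X_te)
print("R2   =", round(r2_score(y_te, pred), 4))
print("RMSE =", round(float(np.sqrt(mean_squared_error(y_te, pred))), 4))
```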


Atmosphere ◽  
2021 ◽  
Vol 12 (2) ◽  
pp. 238
Author(s):  
Pablo Contreras ◽  
Johanna Orellana-Alvear ◽  
Paul Muñoz ◽  
Jörg Bendix ◽  
Rolando Célleri

The Random Forest (RF) algorithm, a decision-tree-based technique, has become a promising approach for applications addressing runoff forecasting in remote areas. This machine learning approach can overcome the limitations of scarce spatio-temporal data and physical parameters needed for process-based hydrological models. However, the influence of RF hyperparameters is still uncertain and needs to be explored. Therefore, the aim of this study is to analyze the sensitivity of RF runoff forecasting models of varying lead time to the hyperparameters of the algorithm. For this, models were trained using (a) default hyperparameters and (b) an extensive grid search over hyperparameter combinations that allowed reaching the optimal set. Model performances were assessed based on the R2, %Bias, and RMSE metrics. We found that: (i) The most influential hyperparameter is the number of trees in the forest; however, the combination of tree depth and number of features produced the greatest variability (instability) in the models. (ii) Hyperparameter optimization significantly improved model performance for higher lead times (12- and 24-h). For instance, the performance of the 12-h forecasting model under default RF hyperparameters improved to R2 = 0.41 after optimization (a gain of 0.17). However, for short lead times (4-h) there was no significant model improvement (0.69 < R2 < 0.70). (iii) There is a range of values for each hyperparameter in which the performance of the model is not significantly affected but remains close to optimal. Thus, a compromise between hyperparameter interactions (i.e., their values) can produce similarly high model performance. Model improvements after optimization can be explained from a hydrological point of view: for lead times larger than the concentration time of the catchment, the generalization ability of the models tends to rely more on hyperparameterization than on what they can learn from the input data. This insight can help in the development of operational early warning systems.
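
A hedged sketch of the grid-search idea: tuning the hyperparameters the study highlights (number of trees, tree depth, number of features) for a lagged-runoff regression with time-series cross-validation. The synthetic discharge series, lag construction, and grid values are illustrative assumptions, not the study's configuration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

rng = np.random.default_rng(2)
runoff = np.cumsum(rng.normal(size=2000))        # synthetic hourly discharge series
lead, lags = 12, 24                              # 12-h forecast from the previous 24 hours

X = np.array([runoff[t - lags:t] for t in range(lags, len(runoff) - lead)])
y = runoff[lags + lead:]

grid = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid={"n_estimators": [100, 300],      # number of trees in the forest
                "max_depth": [5, None],          # depth of the trees
                "max_features": ["sqrt", 1.0]},  # number of features per split
    scoring="r2",
    cv=TimeSeriesSplit(n_splits=3),
)
grid.fit(X, y)
print("best hyperparameters:", grid.best_params_, "CV R2:", round(grid.best_score_, 2))
```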


2021 ◽  
Vol 13 (11) ◽  
pp. 2211
Author(s):  
Shuo Xu ◽  
Jie Cheng ◽  
Quan Zhang

Land surface temperature (LST) is an important parameter for mirroring the water–heat exchange and balance on the Earth's surface. Passive microwave (PMW) LST can make up for the lack of thermal infrared (TIR) LST caused by cloud contamination, but its spatial resolution is relatively low. In this study, we developed a TIR and PMW LST fusion method based on the random forest (RF) machine learning algorithm to obtain all-weather LST with high spatial resolution. Since LST is closely related to land cover (LC) type, terrain, vegetation conditions, moisture conditions, and solar radiation, these variables were selected as candidate auxiliary variables to establish the best model and obtain the fusion results for mainland China during 2010. In general, the fused LST had higher spatial completeness than the MODIS LST and higher accuracy than the downscaled AMSR-E LST. Additionally, the magnitude of the LST in the fusion results was consistent with the general spatiotemporal variations of LST. Compared with in situ observations, the RMSEs of the clear-sky and cloudy-sky fused LST were 2.12–4.50 K and 3.45–4.89 K, respectively. Combining the RF method and the DINEOF method, a complete all-weather LST with a spatial resolution of 0.01° can be obtained.
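
The sketch below illustrates the general fusion strategy described above: train an RF on clear-sky pixels to map coarse PMW LST plus auxiliary variables to TIR LST, then fill cloudy pixels with the RF prediction. All arrays and variable choices are synthetic placeholders, not the study's data or exact feature set.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(3)
n_pixels = 50_000
pmw_lst   = rng.normal(290, 8, n_pixels)         # downscaled PMW (AMSR-E) LST, in K
elevation = rng.uniform(0, 4000, n_pixels)       # terrain auxiliary variable
ndvi      = rng.uniform(0.0, 0.9, n_pixels)      # vegetation auxiliary variable
land_cov  = rng.integers(0, 10, n_pixels)        # land-cover class id
tir_lst   = pmw_lst - 0.003 * elevation + rng.normal(0, 2, n_pixels)   # synthetic TIR (MODIS) LST

clear = rng.random(n_pixels) > 0.4               # mask of clear-sky pixels where TIR LST exists
X = np.column_stack([pmw_lst, elevation, ndvi, land_cov])

rf = RandomForestRegressor(n_estimators=200, random_state=3)
rf.fit(X[clear], tir_lst[clear])                 # learn TIR LST ~ PMW LST + auxiliaries
lst_all_weather = np.where(clear, tir_lst, rf.predict(X))   # keep TIR where clear, RF elsewhere
print("filled pixels:", int((~clear).sum()))
```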


Entropy ◽  
2021 ◽  
Vol 23 (7) ◽  
pp. 859
Author(s):  
Abdulaziz O. AlQabbany ◽  
Aqil M. Azmi

We are living in the age of big data, a majority of which is stream data. The real-time processing of these data requires careful consideration from different perspectives. Concept drift, a change in the data's underlying distribution, is a significant issue, especially when learning from data streams; it requires learners to be adaptive to dynamic changes. Random forest is an ensemble approach that is widely used in classical, non-streaming machine learning applications, while the Adaptive Random Forest (ARF) is a stream learning algorithm that has shown promising results in terms of accuracy and its ability to deal with various types of drift. The continuity of the incoming instances allows their binomial distribution to be approximated by a Poisson(1) distribution. In this study, we propose a mechanism to increase such streaming algorithms' efficiency by focusing on resampling. Our measure, resampling effectiveness (ρ), fuses the two most essential aspects of online learning: accuracy and execution time. We use six different synthetic data sets, each having a different type of drift, to empirically select the parameter λ of the Poisson distribution that yields the best value of ρ. By comparing the standard ARF with its tuned variations, we show that ARF performance can be enhanced by tackling this important aspect. Finally, we present three case studies from different contexts to test our proposed enhancement method and demonstrate its effectiveness in processing large data sets: (a) Amazon customer reviews (written in English), (b) hotel reviews (in Arabic), and (c) real-time aspect-based sentiment analysis of COVID-19-related tweets in the United States during April 2020. Results indicate that our proposed enhancement method exhibited considerable improvement in most situations.
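
A minimal illustration of the Poisson(λ) resampling step that online and adaptive random forests use in place of batch bootstrapping: each tree trains on each incoming instance k times, with k drawn from Poisson(λ). The paper tunes λ and scores the outcome with its resampling-effectiveness measure ρ, whose exact definition is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(4)

def resampling_weights(n_trees, lam=1.0):
    """Number of times each of the n_trees trains on the current stream instance."""
    return rng.poisson(lam, size=n_trees)

# with lambda = 1 (the classical approximation of the bootstrap), roughly 37% of trees
# skip a given instance (k = 0); larger lambda means heavier reuse and more compute
for lam in (1.0, 3.0, 6.0):
    k = resampling_weights(n_trees=100, lam=lam)
    print(f"lambda={lam}: mean weight {k.mean():.2f}, skipped by {np.mean(k == 0):.0%} of trees")
```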


2021 ◽  
Vol 11 (13) ◽  
pp. 6237
Author(s):  
Azharul Islam ◽  
KyungHi Chang

Unstructured data from the internet constitute a large source of information, which needs to be formatted in a user-friendly way. This research develops a model that classifies unstructured data obtained through data mining into labeled data, and builds an informational and decision-making support system (DMSS). We often have assortments of information collected by mining data from various sources, where the key challenge is to extract the valuable information. We observe substantial classification accuracy enhancement for our datasets with both machine learning and deep learning algorithms. The highest classification accuracy (99% in training, 96% in testing) was achieved on a Covid corpus processed using a long short-term memory (LSTM) network. Furthermore, we conducted tests on large datasets relevant to the Disaster corpus, with an LSTM classification accuracy of 98%. In addition, random forest (RF), a machine learning algorithm, provides a reasonable 84% accuracy. The main objective of this research is to increase the application's robustness by integrating intelligence into the developed DMSS, which provides insight into the user's intent despite dealing with a noisy dataset. Our designed model compares the F1 scores of the random forest and stochastic gradient descent (SGD) algorithms, where the RF method outperforms, improving accuracy by 2% (from 81% to 83%) compared with a conventional method.
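
The sketch below mirrors the RF-versus-SGD comparison on F1 score mentioned above, using a TF-IDF representation of short texts in scikit-learn; the toy corpus and labels stand in for the Covid and Disaster corpora used in the study.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

texts = ["flood warning issued for the coastal region",
         "new vaccine trial shows promising results",
         "earthquake damages several buildings downtown",
         "hospital reports a drop in infection rates"] * 25
labels = [0, 1, 0, 1] * 25                       # 0 = disaster-related, 1 = health-related

for name, clf in [("RF", RandomForestClassifier(n_estimators=200, random_state=0)),
                  ("SGD", SGDClassifier(random_state=0))]:
    pipe = make_pipeline(TfidfVectorizer(), clf)
    f1 = cross_val_score(pipe, texts, labels, cv=5, scoring="f1").mean()
    print(f"{name}: mean F1 = {f1:.2f}")
```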


2021 ◽  
Vol 39 (3) ◽  
Author(s):  
Román Mínguez Salido ◽  
Raúl Del Pozo Rubio ◽  
María del Carmen García Centeno

One of the most widely analyzed topics in recent decades has been financial catastrophism caused by the out-of-pocket payments (Pagos de Bolsillo, PDB) that households make to access and use health systems. This work pursues two main objectives. The first focuses on predicting the rate of financial catastrophism and obtaining the importance of the variables for predicting this rate for high, medium, and low income levels across the different Autonomous Communities. To this end, two machine learning algorithms are compared: one based on elastic-net regressions to estimate generalized linear models, and another based on random forest algorithms, which can capture the possible non-linearities and interactions present in the data. The results show that the random forest is the more suitable. Building on these results, the second objective focuses on establishing a ranking of the different Autonomous Communities, according to their income level, for the different categories of catastrophism rates, using a discrete multi-criteria decision model (the PROMETHEE method).
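
A minimal sketch of the comparison described above — an elastic-net generalized linear model against a random forest, both predicting a catastrophism rate from household/region covariates — with illustrative synthetic data; the variables and target are assumptions, not the study's dataset.

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(5)
X = rng.normal(size=(500, 10))                   # stand-ins for income level, OOP payments, region dummies
rate = 1 / (1 + np.exp(-(X[:, 0] * X[:, 1] + 0.5 * X[:, 2])))   # nonlinear synthetic catastrophism rate

models = [("elastic net", ElasticNetCV(l1_ratio=[0.2, 0.5, 0.8], cv=5)),
          ("random forest", RandomForestRegressor(n_estimators=300, random_state=5))]
for name, model in models:
    r2 = cross_val_score(model, X, rate, cv=5, scoring="r2").mean()
    print(f"{name}: mean cross-validated R2 = {r2:.2f}")
# the forest typically wins on this toy target because it contains an interaction
# term that the linear elastic-net model cannot capture
```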


2021 ◽  
Author(s):  
Julien Baerenzung ◽  
Matthias Holschneider

We present a new high-resolution model of the geomagnetic field spanning the last 121 years. The model derives from a large set of data taken by low-orbiting satellites, ground-based observatories, marine vessels, airplanes, and land surveys. It is obtained by combining a Kalman filter with a smoothing algorithm. Seven different magnetic sources are taken into account. Three of them are of internal origin: the core, the lithospheric, and the induced/residual ionospheric fields. The other four sources are of external origin: a close, a remote, and a fluctuating magnetospheric field, as well as a source associated with field-aligned currents. The dynamical evolution of each source is prescribed by an autoregressive process of either first or second order, except for the lithospheric field, which is assumed to be static. The parameters of the processes were estimated through a machine learning algorithm with a sample of data taken by the low-orbiting satellites of the CHAMP and Swarm missions. In this presentation we will mostly focus on the rapid variations of the core field and the small-scale lithospheric field. We will also discuss the nature of the model uncertainties and the limitations they imply.
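
As a heavily simplified illustration of the filtering idea (not the authors' model), the sketch below estimates a single scalar state evolving as an AR(1) process from noisy observations with a Kalman filter; the real model is multivariate, uses AR(2) dynamics for some sources, and adds a smoothing pass, and none of the numbers below come from the paper.

```python
import numpy as np

rng = np.random.default_rng(6)
phi, q, r = 0.95, 0.1, 0.5          # AR(1) coefficient, process variance, observation variance

truth = np.zeros(200)
for t in range(1, 200):             # simulate the AR(1) state, e.g. one field coefficient
    truth[t] = phi * truth[t - 1] + rng.normal(scale=np.sqrt(q))
obs = truth + rng.normal(scale=np.sqrt(r), size=200)

x, P, est = 0.0, 1.0, []
for z in obs:
    x, P = phi * x, phi**2 * P + q        # predict using the AR(1) dynamics
    K = P / (P + r)                       # Kalman gain
    x, P = x + K * (z - x), (1 - K) * P   # update with the observation
    est.append(x)

rmse = float(np.sqrt(np.mean((np.array(est) - truth) ** 2)))
print("RMS error of the filtered estimate:", round(rmse, 3))
```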

