Feature selection for RNA cleavage efficiency at specific sites using the LASSO regression model in Arabidopsis thaliana

2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Daishin Ueno ◽  
Harunori Kawabe ◽  
Shotaro Yamasaki ◽  
Taku Demura ◽  
Ko Kato

Abstract Background: RNA degradation is important for the regulation of gene expression. Despite the identification of proteins and sequences related to deadenylation-dependent RNA degradation in plants, endonucleolytic cleavage-dependent RNA degradation has not been studied in detail. Here, we developed truncated RNA end sequencing in Arabidopsis thaliana to identify cleavage sites and evaluate the efficiency of cleavage at each site. Although several features are related to RNA cleavage efficiency, the effect of each feature on cleavage efficiency has not been evaluated by considering multiple putative determinants in A. thaliana. Results: Cleavage site information was acquired from a previous study, and cleavage efficiency at the site level (CSsite value), which indicates the number of reads at each cleavage site normalized to RNA abundance, was calculated. To identify features related to cleavage efficiency at the site level, multiple putative determinants (features) were used to perform feature selection using the Least Absolute Shrinkage and Selection Operator (LASSO) regression model. The results indicated that whole RNA features were important for the CSsite value, in addition to features around cleavage sites. Whole RNA features related to the translation process and nucleotide frequency around cleavage sites were major determinants of cleavage efficiency. The results were verified in a model constructed using only sequence features, which showed that the prediction accuracy was similar to that determined using all features including the translation process, suggesting that cleavage efficiency can be predicted using only sequence information. The LASSO regression model was validated in exogenous genes, which showed that the model constructed using only sequence information can predict cleavage efficiency in both endogenous and exogenous genes. Conclusions: Feature selection using the LASSO regression model in A. thaliana identified 155 features. Correlation coefficients revealed that whole RNA features are important for determining cleavage efficiency in addition to features around the cleavage sites. The LASSO regression model can predict cleavage efficiency in endogenous and exogenous genes using only sequence information. The model revealed the significance of the effect of multiple determinants on cleavage efficiency, suggesting that sequence features are important for RNA degradation mechanisms in A. thaliana.
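
The feature-selection behaviour of LASSO described in this abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the coordinate-descent solver, the synthetic data, and the penalty value `lam=0.1` are all assumptions chosen for illustration. The point is only that the L1 penalty drives the coefficients of uninformative features to exactly zero, which is what makes LASSO usable as a selection step.

```python
import random


def soft_threshold(rho, lam):
    """Soft-thresholding operator used in the LASSO coordinate update."""
    if rho > lam:
        return rho - lam
    if rho < -lam:
        return rho + lam
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_iter=50):
    """Minimise (1/(2n)) * ||y - Xw||^2 + lam * ||w||_1 by cyclic coordinate descent.

    X is a list of rows and y a list of targets. No intercept is fitted,
    so the illustrative data below are generated without one.
    """
    n, p = len(X), len(X[0])
    w = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            rho = 0.0  # correlation of feature j with the partial residual
            z = 0.0    # squared norm of feature j
            for i in range(n):
                pred_without_j = sum(X[i][k] * w[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - pred_without_j)
                z += X[i][j] ** 2
            w[j] = soft_threshold(rho / n, lam) / (z / n) if z > 0 else 0.0
    return w


# Hypothetical synthetic data: only features 0 and 1 influence the target.
random.seed(0)
n_samples, n_features = 100, 5
X = [[random.gauss(0.0, 1.0) for _ in range(n_features)] for _ in range(n_samples)]
y = [3.0 * row[0] - 2.0 * row[1] + random.gauss(0.0, 0.1) for row in X]

weights = lasso_coordinate_descent(X, y, lam=0.1)
selected = [j for j, w in enumerate(weights) if w != 0.0]
print("weights:", [round(w, 3) for w in weights])
print("selected feature indices:", selected)
```

In practice the penalty strength would be chosen by cross-validation rather than fixed by hand, and a library solver would replace this hand-written loop.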

2021 ◽  
Author(s):  
Daishin Ueno ◽  
Shotaro Yamasaki ◽  
Yuta Sadakiyo ◽  
Takumi Teruyama ◽  
Taku Demura ◽  
...  

ABSTRACT RNA degradation is critical for control of gene expression, and endonucleolytic cleavage-dependent RNA degradation is conserved among eukaryotes. Some cleavage sites are secondarily capped in the cytoplasm and identified using the CAGE method. Although uncapped cleavage sites are widespread in eukaryotes, comparatively little information has been obtained about these sites using CAGE-based degradome analysis. Previously, we developed the truncated RNA-end sequencing (TREseq) method in plant species and used it to acquire comprehensive information about uncapped cleavage sites; we observed G-rich sequences near cleavage sites. However, it remains unclear whether this finding is general to other eukaryotes. In this study, we conducted TREseq analyses in fruit flies (Drosophila melanogaster) and budding yeast (Saccharomyces cerevisiae). The results revealed specific sequence features related to RNA cleavage in D. melanogaster and S. cerevisiae that were similar to sequence patterns in Arabidopsis thaliana. Although previous studies suggest that ribosome movements are important for determining cleavage position, feature selection using a random forest classifier showed that sequences around cleavage sites were the major determinant of whether a site was cleaved or uncleaved. Together, our results suggest that sequence features around cleavage sites are critical for determining cleavage position, and that sequence-specific endonucleolytic cleavage-dependent RNA degradation is highly conserved across eukaryotes.


2019 ◽  
Vol 121 ◽  
pp. 99-110 ◽  
Author(s):  
Ricardo Rendall ◽  
Ivan Castillo ◽  
Alix Schmidt ◽  
Swee-Teng Chin ◽  
Leo H. Chiang ◽  
...  

2021 ◽  
Vol 9 ◽  
Author(s):  
Qiao-Ying Xie ◽  
Ming-Wei Wang ◽  
Zu-Ying Hu ◽  
Cheng-Jian Cao ◽  
Cong Wang ◽  
...  

Aim: Metabolic syndrome (MS) screening is essential for early detection in the occupational population. This study aimed to screen out biomarkers related to MS and establish a risk assessment and prediction model for the routine physical examination of an occupational population. Methods: The least absolute shrinkage and selection operator (Lasso) regression algorithm of machine learning was used to screen biomarkers related to MS. Then, the accuracy of the logistic regression model was further verified based on the Lasso regression algorithm. The areas under the receiver operating characteristic curves were used to evaluate the selection accuracy of biomarkers in identifying subjects at risk of MS. The screened biomarkers were used to establish a logistic regression model and calculate the odds ratio (OR) of the corresponding biomarkers. A nomogram risk prediction model was established based on the selected biomarkers, and the consistency index (C-index) and calibration curve were derived. Results: A total of 2,844 occupational workers were included, and 10 biomarkers related to MS were screened. The number of non-MS cases was 2,189 and that of MS was 655. The area under the curve (AUC) value for non-Lasso and Lasso logistic regression was 0.652 and 0.907, respectively. The established risk assessment model revealed that the main risk biomarkers were absolute basophil count (OR: 3.38, CI: 1.05–6.85), platelet packed volume (OR: 2.63, CI: 2.31–3.79), leukocyte count (OR: 2.01, CI: 1.79–2.19), red blood cell count (OR: 1.99, CI: 1.80–2.71), and alanine aminotransferase level (OR: 1.53, CI: 1.12–1.98). Furthermore, favorable results with C-indexes (0.840) and calibration curves closer to ideal curves indicated the accurate predictive ability of this nomogram. Conclusions: The risk assessment model based on the Lasso logistic regression algorithm helped identify MS with high accuracy in the physical examination of an occupational population.
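
The two summary statistics reported in this abstract, the AUC and the odds ratio, can be sketched as follows. This is an illustration under stated assumptions, not the study's code: the function names `roc_auc` and `odds_ratio` and the four-sample toy data are hypothetical. The AUC is computed as the fraction of positive/negative pairs in which the positive case receives the higher score (ties count one half), and the odds ratio of a logistic-regression biomarker is the exponential of its fitted coefficient.

```python
import math


def roc_auc(labels, scores):
    """AUC as the probability that a random positive outscores a random negative.

    labels are 0/1; ties between a positive and a negative count as 0.5.
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present to compute AUC")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def odds_ratio(beta):
    """Odds ratio for a one-unit increase in a predictor with logistic coefficient beta."""
    return math.exp(beta)


# Hypothetical toy example: 3 of the 4 positive/negative pairs are ranked correctly.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auc = roc_auc(toy_labels, toy_scores)
print("AUC:", toy_auc)
```

The pairwise loop is quadratic in the number of samples; for a cohort of a few thousand subjects it remains practical, but a rank-based formulation would be preferable for larger data.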


2018 ◽  
Vol 7 (4.30) ◽  
pp. 498 ◽  
Author(s):  
Seng Jia Xin ◽  
Kamil Khalid

House price prediction is important for governments, finance companies, the real estate sector, and homeowners. House price data from Ames, Iowa, in the United States, covering 2006 to 2010, are used for multivariate analysis. However, multicollinearity commonly occurs in multivariate analysis and can seriously affect the model. Therefore, this study investigates the performance of the Ridge and Lasso regression models, as both can deal with multicollinearity. Ridge and Lasso regression models are constructed and compared. The root mean square error (RMSE) and adjusted R-squared are used to evaluate the performance of the models. This comparative study found that the Lasso regression model performs better than the Ridge regression model. Based on this analysis, the selected variables cover house size, age of the house, condition of the house, and location of the house.
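
The two evaluation metrics named in this abstract can be written out in a short sketch. The function names and the five-point toy data are assumptions for illustration only and are unrelated to the Ames data set. RMSE is the square root of the mean squared residual, and adjusted R-squared rescales R-squared by the number of predictors `p` so that adding uninformative variables is penalised.

```python
import math


def rmse(y_true, y_pred):
    """Root mean square error between observed and predicted values."""
    n = len(y_true)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / n)


def adjusted_r2(y_true, y_pred, n_predictors):
    """Adjusted R-squared: 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    n = len(y_true)
    if n - n_predictors - 1 <= 0:
        raise ValueError("need more observations than predictors plus one")
    mean_y = sum(y_true) / n
    ss_res = sum((a - b) ** 2 for a, b in zip(y_true, y_pred))
    ss_tot = sum((a - mean_y) ** 2 for a in y_true)
    r2 = 1.0 - ss_res / ss_tot
    return 1.0 - (1.0 - r2) * (n - 1) / (n - n_predictors - 1)


# Hypothetical toy values, standing in for observed and predicted prices.
observed = [3.0, 5.0, 7.0, 9.0, 11.0]
predicted = [2.5, 5.5, 7.0, 9.5, 10.5]
toy_rmse = rmse(observed, predicted)
toy_adj_r2 = adjusted_r2(observed, predicted, n_predictors=1)
print("RMSE:", round(toy_rmse, 4))
print("adjusted R^2:", round(toy_adj_r2, 4))
```

When comparing Ridge and Lasso fits, `n_predictors` for the Lasso model would typically be the number of nonzero coefficients, which is one reason the two metrics can rank the models differently.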


2020 ◽  
Vol 109 (8) ◽  
pp. 2585-2593
Author(s):  
Atsushi Kosugi ◽  
Kok Hoong Leong ◽  
Hinako Tsuji ◽  
Yoshihiro Hayashi ◽  
Shungo Kumada ◽  
...  
