Meta-i6mA: an interspecies predictor for identifying DNA N6-methyladenine sites of plant genomes by exploiting informative features in an integrative machine-learning framework

Author(s):  
Md Mehedi Hasan ◽  
Shaherin Basith ◽  
Mst Shamima Khatun ◽  
Gwang Lee ◽  
Balachandran Manavalan ◽  
...  

DNA N6-methyladenine (6mA) is an important epigenetic modification involved in various cellular processes. Accurate identification of 6mA sites is a challenging task in genome analysis and a prerequisite for understanding their biological functions. To date, several species-specific machine learning (ML)-based models have been proposed, but the majority of them were not tested on other species. Hence, their practical application to other plant species is quite limited. In this study, we explored 10 different feature encoding schemes, with the goal of capturing key characteristics around 6mA sites. We selected five feature encoding schemes, based on physicochemical and position-specific information, that possess high discriminative capability. The resultant feature sets were fed into six commonly used ML methods (random forest, support vector machine, extremely randomized tree, logistic regression, naïve Bayes and AdaBoost). The Rosaceae genome was employed to train the above classifiers, which generated 30 baseline models. To integrate their individual strengths, we propose Meta-i6mA, which combines the baseline models using a meta-predictor approach. In extensive independent tests, Meta-i6mA showed high Matthews correlation coefficient values of 0.918, 0.827 and 0.635 on Rosaceae, rice and Arabidopsis thaliana, respectively, and outperformed the existing predictors. We anticipate that Meta-i6mA can be applied across different plant species. Furthermore, we developed a user-friendly online web server, which is available at http://kurata14.bio.kyutech.ac.jp/Meta-i6mA/.
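The Matthews correlation coefficient (MCC) reported above can be computed directly from the four counts of a binary confusion matrix. The following is a minimal sketch of the standard formula, not the authors' evaluation code; the function name and the example counts are hypothetical.

```python
import math


def matthews_corrcoef(tp, tn, fp, fn):
    """Matthews correlation coefficient from binary confusion-matrix counts."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Common convention: report 0.0 when a marginal total is zero,
    # because the coefficient is undefined in that case.
    return numerator / denominator if denominator else 0.0


# Hypothetical counts for illustration only (not taken from the paper).
print(round(matthews_corrcoef(tp=90, tn=85, fp=15, fn=10), 3))
```

MCC ranges from -1 (total disagreement) to 1 (perfect prediction) and uses all four cells of the confusion matrix, which is one reason it is often preferred over plain accuracy when comparing predictors across datasets.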

Author(s):  
Ke Wang ◽  
Qingwen Xue ◽  
Jian John Lu

Identifying high-risk drivers before an accident happens is necessary for traffic accident control and prevention. Because driving data are inherently class-imbalanced, high-risk samples, as the minority class, are usually handled poorly by standard classification algorithms. Instead of applying preset sampling or cost-sensitive learning, this paper proposes a novel automated machine learning framework that simultaneously and automatically searches for the optimal sampling, cost-sensitive loss function, and probability calibration to handle the class-imbalance problem in the recognition of risky drivers. The hyperparameters that control sampling ratio and class weight, along with other hyperparameters, are optimized by Bayesian optimization. To demonstrate the performance of the proposed automated learning framework, we establish a risky driver recognition model as a case study, using video-extracted vehicle trajectory data of 2427 private cars on a German highway. Based on rear-end collision risk evaluation, only 4.29% of all drivers are labeled as risky drivers. The inputs of the recognition model are the discrete Fourier transform coefficients of the target vehicle’s longitudinal speed, lateral speed, and the gap between the target vehicle and its preceding vehicle. Among 12 sampling methods, 2 cost-sensitive loss functions, and 2 probability calibration methods, the result of automated machine learning is consistent with manual searching but much more computationally efficient. We find that the combination of Support Vector Machine-based Synthetic Minority Oversampling TEchnique (SVMSMOTE) sampling, cost-sensitive cross-entropy loss function, and isotonic regression can significantly improve the recognition ability and reduce the error of predicted probability.
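As an illustration of the model inputs described above, the sketch below turns three trajectory series into a fixed-length feature vector of discrete Fourier transform coefficients. It is a sketch under assumptions: the abstract does not say how many coefficients are kept or whether magnitudes or complex values are used, so the function names, the choice of normalized magnitudes, and the default `n_coeffs=4` are all hypothetical.

```python
import cmath


def dft_magnitudes(signal, n_coeffs):
    """Normalized magnitudes of the first n_coeffs DFT coefficients of a real series."""
    n = len(signal)
    features = []
    for k in range(n_coeffs):
        coeff = sum(
            x * cmath.exp(-2j * cmath.pi * k * t / n)
            for t, x in enumerate(signal)
        )
        features.append(abs(coeff) / n)
    return features


def trajectory_features(long_speed, lat_speed, gap, n_coeffs=4):
    """Concatenate DFT features of longitudinal speed, lateral speed, and gap."""
    # Assumption: each series contributes the same number of coefficients.
    return (
        dft_magnitudes(long_speed, n_coeffs)
        + dft_magnitudes(lat_speed, n_coeffs)
        + dft_magnitudes(gap, n_coeffs)
    )


# Hypothetical 8-sample trajectory: steady speed, no lateral motion, fixed gap.
print(trajectory_features([30.0] * 8, [0.0] * 8, [25.0] * 8))
```

With this normalization the first coefficient is the mean of the series, and the remaining coefficients capture how strongly the series oscillates at each frequency.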


2021 ◽  
Vol 22 (16) ◽  
pp. 8958
Author(s):  
Phasit Charoenkwan ◽  
Chanin Nantasenamat ◽  
Md. Mehedi Hasan ◽  
Mohammad Ali Moni ◽  
Pietro Lio’ ◽  
...  

Accurate identification of bitter peptides is of great importance for better understanding their biochemical and biophysical properties. To date, machine learning-based methods have become effective approaches for identifying potential bitter peptides from large-scale protein datasets. Although a few machine learning-based predictors have been developed for identifying the bitterness of peptides, their prediction performance could be improved. In this study, we developed a new predictor (named iBitter-Fuse) for achieving more accurate identification of bitter peptides. In the proposed iBitter-Fuse, we have integrated a variety of feature encoding schemes to provide sufficient information from different aspects, namely compositional information and physicochemical properties. To enhance the predictive performance, a customized genetic algorithm utilizing self-assessment-report (GA-SAR) was employed to identify informative features, and the optimal ones were then fed into a support vector machine (SVM)-based classifier to develop the final model (iBitter-Fuse). Benchmarking experiments based on both 10-fold cross-validation and independent tests indicated that iBitter-Fuse achieved more accurate performance than state-of-the-art methods. To facilitate the high-throughput identification of bitter peptides, the iBitter-Fuse web server was established and made freely available online. It is anticipated that iBitter-Fuse will be a useful tool for aiding the discovery and de novo design of bitter peptides.
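A simple example of a compositional feature encoding is amino acid composition, the fraction of each of the 20 standard residues in a peptide. The abstract does not list the exact encodings used in iBitter-Fuse, so this sketch is only an assumed illustration of the general idea; the function name and the example peptide are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, one-letter codes


def amino_acid_composition(peptide):
    """Return a 20-dimensional vector of residue frequencies for a peptide."""
    peptide = peptide.upper()
    if not peptide:
        raise ValueError("peptide must not be empty")
    # Non-standard letters count toward the length but have no slot of their own.
    return [peptide.count(aa) / len(peptide) for aa in AMINO_ACIDS]


# Hypothetical peptide sequence for illustration only.
vector = amino_acid_composition("GPFPIIV")
print(dict(zip(AMINO_ACIDS, vector)))
```

Every peptide maps to a vector of the same length regardless of sequence length, which is what lets a classifier such as an SVM consume peptides of different sizes.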


2020 ◽  
Vol 9 (4) ◽  
pp. 252 ◽  
Author(s):  
Kwanele Phinzi ◽  
Dávid Abriha ◽  
László Bertalan ◽  
Imre Holb ◽  
Szilárd Szabó

Gullies reduce both the quality and quantity of productive land, posing a serious threat to sustainable agriculture and hence to food security. Machine Learning (ML) algorithms are essential tools in the identification of gullies and can assist in strategic decision-making relevant to soil conservation. Nevertheless, accurate identification of gullies depends on the selected ML algorithm, the image, and the number of classes used, i.e., binary (two classes) or multiclass. We applied Linear Discriminant Analysis (LDA), Support Vector Machine (SVM), and Random Forest (RF) to a Systeme Pour l’Observation de la Terre (SPOT-7) image to extract gullies and investigated whether the multiclass (m) approach can offer better classification accuracy than the binary (b) approach. Using repeated k-fold cross-validation, we generated 36 models. Our findings revealed that, of these models, both RFb (98.70%) and SVMm (98.01%) outperformed the LDA in terms of overall accuracy (OA). However, the LDAb (99.51%) recorded the highest producer’s accuracy (PA) but had a low corresponding user’s accuracy (UA) of 18.5%. The binary approach was generally better than the multiclass approach; however, at the class level, the multiclass approach outperformed the binary approach in gully identification. Despite low spectral resolution, the pan-sharpened SPOT-7 product successfully identified gullies. The proposed methodology is relatively simple, but practically sound, and can be used to monitor gullies within and beyond the study region.
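The three accuracy measures mentioned above come from the same confusion matrix. In the sketch below, rows are assumed to be reference (ground-truth) classes and columns the classified output; the function name and the counts are hypothetical and chosen only to show how a class can combine a high PA with a low UA.

```python
def accuracy_report(matrix, cls):
    """Overall, producer's, and user's accuracy for class index `cls`.

    matrix[i][j] is the number of samples of reference class i
    that the classifier assigned to class j.
    """
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    reference_total = sum(matrix[cls])                   # row sum
    classified_total = sum(row[cls] for row in matrix)   # column sum
    return {
        "OA": correct / total,
        "PA": matrix[cls][cls] / reference_total,    # share of true gullies found
        "UA": matrix[cls][cls] / classified_total,   # share of predicted gullies that are real
    }


# Hypothetical binary case: class 0 = gully, class 1 = non-gully.
confusion = [
    [99, 1],       # reference gully pixels
    [400, 9500],   # reference non-gully pixels
]
print(accuracy_report(confusion, cls=0))
```

Here almost every real gully pixel is detected (high PA), but many non-gully pixels are also labeled as gully, so only a small share of the predicted gully pixels are correct (low UA).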


2020 ◽  
Author(s):  
Abel Szkalisity ◽  
Filippo Piccinini ◽  
Attila Beleon ◽  
Tamas Balassa ◽  
Istvan Gergely Varga ◽  
...  

Biological processes are inherently continuous, and the chance of phenotypic discovery is significantly restricted by discretising them. Using multi-parametric active regression we introduce a novel concept to describe and explore biological data in a continuous manner. We have implemented Regression Plane (RP), the first user-friendly discovery tool enabling class-free phenotypic supervised machine learning.


2019 ◽  
Author(s):  
Sheng-Yong Niu ◽  
Binqiang Liu ◽  
Qin Ma ◽  
Wen-Chi Chou

A transcription unit (TU) is composed of one or multiple adjacent genes on the same strand that are co-transcribed, mostly in prokaryotes. Accurate identification of TUs is a crucial first step to delineate the transcriptional regulatory networks and elucidate the dynamic regulatory mechanisms encoded in various prokaryotic genomes. Many genomic features, e.g., gene intergenic distance, and transcriptomic features, including continuous and stable RNA-seq read count signals, have been collected from a large amount of experimental data and integrated into classification techniques to computationally predict genome-wide TUs. Although some tools and web servers are able to predict TUs based on bacterial RNA-seq data and genome sequences, there is a need for an improved machine-learning prediction approach and a more comprehensive pipeline handling QC, TU prediction, and TU visualization. To enable users to efficiently perform TU identification on their local computers or high-performance clusters and to provide more accurate predictions, we developed an R package named rSeqTU. rSeqTU uses a random forest algorithm to select essential features describing TUs and then uses a support vector machine (SVM) to build TU prediction models. rSeqTU (available at https://s18692001.github.io/rSeqTU/) has six computational functionalities: read quality control, read mapping, training set generation, random-forest-based feature selection, TU prediction, and TU visualization.


2019 ◽  
Vol 116 (16) ◽  
pp. 7847-7856 ◽  
Author(s):  
Akira Shiraishi ◽  
Toshimi Okuda ◽  
Natsuko Miyasaka ◽  
Tomohiro Osugi ◽  
Yasushi Okuno ◽  
...  

Neuropeptides play pivotal roles in various biological events in the nervous, neuroendocrine, and endocrine systems, and are correlated with both physiological functions and unique behavioral traits of animals. Elucidation of functional interaction between neuropeptides and receptors is a crucial step for the verification of their biological roles and evolutionary processes. However, most receptors for novel peptides remain to be identified. Here, we show the identification of multiple G protein-coupled receptors (GPCRs) for species-specific neuropeptides of the vertebrate sister group, Ciona intestinalis Type A, by combining machine learning and experimental validation. We developed an original peptide descriptor-incorporated support vector machine and used it to predict 22 neuropeptide–GPCR pairs. Of note, signaling assays of the predicted pairs identified 1 homologous and 11 Ciona-specific neuropeptide–GPCR pairs for a 41% hit rate: the respective GPCRs for Ci-GALP, Ci-NTLP-2, Ci-LF-1, Ci-LF-2, Ci-LF-5, Ci-LF-6, Ci-LF-7, Ci-LF-8, Ci-YFV-1, and Ci-YFV-3. Interestingly, molecular phylogenetic tree analysis revealed that these receptors, excluding the Ci-GALP receptor, were evolutionarily unrelated to any other known peptide GPCRs, confirming that these GPCRs constitute unprecedented neuropeptide receptor clusters. Altogether, these results verified the neuropeptide–GPCR pairs in the protochordate and evolutionary lineages of neuropeptide GPCRs, and pave the way for investigating the endogenous roles of novel neuropeptides in the closest relatives of vertebrates and the evolutionary processes of neuropeptidergic systems throughout chordates. In addition, the present study also indicates the versatility of the machine-learning–assisted strategy for the identification of novel peptide–receptor pairs in various organisms.


2020 ◽  
pp. 1420326X2093157
Author(s):  
Yu Huang ◽  
Zhi Gao ◽  
Hongguang Zhang

The accurate identification of the characteristics of pollutant sources can effectively prevent the loss of human life and property damage caused by the sudden release of harmful chemicals in emergency situations. Machine learning algorithms, namely artificial neural network (ANN), support vector machine (SVM), k-nearest neighbour (KNN) and naive Bayesian (NB) classification, can be used to identify the location of pollutant sources with limited sensor data inputs. In this study, the identification accuracy of these four machine learning algorithms was investigated and compared, considering different sensor layouts, eigenvector inputs, meteorological parameters and numbers of samples. The results show that collecting pollutant concentrations over an extended period of time could improve identification accuracy. Additional sensors were required to reach the same identification accuracy after the introduction of distributed meteorological parameters. Increasing the number of training samples by a factor of five improved the identification accuracy of KNN by 22% and that of SVM by 1.7%; however, the accuracy of ANN and NB classification remained basically unchanged. When identifying the release mass of the pollutant source, multiple linear, ANN and SVM regression models were adopted. Results show that ANN performs best, whereas SVM performs worst.
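To make the classification setting concrete, the sketch below shows a k-nearest-neighbour classifier that maps a vector of sensor concentrations to a candidate source location. It is not the study's implementation: the function name, the two-sensor layout, the readings, and the location labels are all hypothetical.

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, query, k=3):
    """Predict a label by majority vote among the k nearest training samples."""
    ranked = sorted(
        zip(train_x, train_y),
        key=lambda pair: math.dist(pair[0], query),  # Euclidean distance
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical readings from two sensors for releases at locations "A" and "B".
readings = [(0.90, 0.10), (0.80, 0.20), (0.85, 0.15),
            (0.10, 0.90), (0.20, 0.80), (0.15, 0.85)]
locations = ["A", "A", "A", "B", "B", "B"]

print(knn_predict(readings, locations, query=(0.70, 0.30)))
```

Because KNN predicts from stored samples rather than fitted parameters, its accuracy tends to depend strongly on how densely the training set covers the input space, which fits the sensitivity to sample count reported above.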


Author(s):  
Phasit Charoenkwan ◽  
Nuttapat Anuwongcharoen ◽  
Chanin Nantasenamat ◽  
Md. Mehedi Hasan ◽  
Watshara Shoombuatong

In light of the growing resistance toward current antiviral drugs, the discovery of novel and effective antiviral therapeutic agents remains a pressing scientific effort. Antiviral peptides (AVPs) represent promising therapeutic agents due to their extraordinary advantages in terms of potency, efficacy and pharmacokinetic properties. The growing volume of newly discovered peptide sequences in the post-genomic era requires computational approaches for timely and accurate identification of AVPs. Machine learning (ML) methods such as random forest and support vector machine are robust learning algorithms that are instrumental in successful peptide-based drug discovery. Therefore, this review summarizes the current state of the art in the application of ML methods for identifying AVPs directly from sequence information. We compare the efficiency of these methods in terms of the underlying characteristics of the dataset used, along with feature encoding methods, ML algorithms, cross-validation methods and prediction performance. Finally, guidelines for the development of robust AVP models are also discussed. It is anticipated that this review will serve as a useful guide for the design and development of robust AVP and related therapeutic peptide predictors in the future.

