iLearn: an integrated platform and meta-learner for feature engineering, machine-learning analysis and modeling of DNA, RNA and protein sequence data

Abstract With the explosive growth of biological sequences generated in the post-genomic era, one of the most challenging problems in bioinformatics and computational biology is to computationally characterize sequences, structures and functions in an efficient, accurate and high-throughput manner. A number of online web servers and stand-alone tools have been developed to address this to date; however, all these tools have their limitations and drawbacks in terms of their effectiveness, user-friendliness and capacity. Here, we present iLearn, a comprehensive and versatile Python-based toolkit, integrating the functionality of feature extraction, clustering, normalization, selection, dimensionality reduction, predictor construction, best descriptor/model selection, ensemble learning and results visualization for DNA, RNA and protein sequences. iLearn was designed for users that only want to upload their data set and select the functions they need calculated from it, while all necessary procedures and optimal settings are completed automatically by the software. iLearn includes a variety of descriptors for DNA, RNA and proteins, and four feature output formats are supported so as to facilitate direct output usage or communication with other computational tools. In total, iLearn encompasses 16 different types of feature clustering, selection, normalization and dimensionality reduction algorithms, and five commonly used machine-learning algorithms, thereby greatly facilitating feature analysis and predictor construction. iLearn is made freely available via an online web server and a stand-alone toolkit.
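As a rough illustration of what a sequence descriptor in such a feature-extraction step can look like, the sketch below computes a normalized k-mer composition vector for a DNA sequence. It is not iLearn's code or API: the function name `kmer_composition`, its parameters and the handling of ambiguous symbols are assumptions made for this example only.

```python
from itertools import product


def kmer_composition(sequence, k=2, alphabet="ACGT"):
    """Return normalized k-mer frequencies, ordered lexicographically by k-mer.

    Hypothetical helper for illustration; not part of iLearn.
    """
    sequence = sequence.upper()
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = 0
    for i in range(len(sequence) - k + 1):
        window = sequence[i:i + k]
        # Assumption: windows with symbols outside the alphabet (e.g. N) are skipped.
        if window in counts:
            counts[window] += 1
            total += 1
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[kmer] / total for kmer in kmers]


# A toy sequence yields a 16-dimensional vector for k=2 (4**2 dinucleotides).
features = kmer_composition("ACGTACGT", k=2)
```

A fixed-length vector like this is what makes sequences of different lengths comparable for the downstream normalization, selection and classification steps.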


Customer Segment Prognostic System by Machine Learning using Principal Component and Linear Discriminant Analysis

International Journal of Recent Technology and Engineering - 2 ◽

10.35940/ijrte.b2290.078219 ◽

2019 ◽

Vol 8 (2) ◽

pp. 6198-6203

Keyword(s):

Machine Learning ◽

Discriminant Analysis ◽

Dimensionality Reduction ◽

Linear Discriminant Analysis ◽

Principal Component ◽

Customer Behavior ◽

Machine Learning Algorithms ◽

Data Set ◽

Linear Discriminant ◽

Customer Group

Recently, the manufacturing industry has faced many problems in predicting customer behavior and customer groups so as to match its output with profit. Organizations find it difficult to identify customer behavior for the purpose of predicting product design so as to increase profit. Predicting the customer group is a challenging task for every organization because of the growing number of entrepreneurs. This motivates the use of machine learning algorithms to cluster customer groups and predict customer demand, which supports decision making in product manufacturing. This paper attempts to predict the customer group for the wine data set from the UCI Machine Learning Repository. The wine data set is subjected to dimensionality reduction with principal component analysis and linear discriminant analysis. A performance analysis is carried out with various classification algorithms, and a comparative study is made using performance metrics such as accuracy, precision, recall and F-score. Experimental results show that, after dimensionality reduction, the two-component LDA-reduced wine data set with the kernel SVM and Random Forest classifiers is the most effective, reaching an accuracy of 100% compared with other classifiers.


Identifying genetic determinants of complex phenotypes from whole genome sequence data

10.1101/181222 ◽

2017 ◽

Cited By ~ 1

Author(s):

George S. Long ◽

Mohammed Hussen ◽

Jonathan Dench ◽

Stéphane Aris-Brosou

Keyword(s):

Machine Learning ◽

Sequence Data ◽

Association Studies ◽

Machine Learning Algorithms ◽

Whole Genome Sequence ◽

Genome Wide Association Studies ◽

Genetic Determinants ◽

Data Set ◽

Adaptive Boosting ◽

Complex Phenotypes

Abstract A critical goal in biology is to relate the phenotype to the genotype, that is, to find the genetic determinants of various traits. However, while simple monofactorial determinants are relatively easy to identify, the underpinnings of complex phenotypes are harder to predict. While traditional approaches rely on genome-wide association studies based on Single Nucleotide Polymorphism data, the ability of machine learning algorithms to find these determinants in whole proteome data is still not well known. To better understand the applicability of machine learning in this case, we implemented two such algorithms, adaptive boosting (AB) and repeated random forest (RRF), and developed a chunking layer that facilitates the analysis of whole proteome data. We first assessed the performance of these algorithms and tuned them on an influenza data set, for which the determinants of three complex phenotypes (infectivity, transmissibility, and pathogenicity) are known based on experimental evidence. This allowed us to show that chunking improves runtimes by an order of magnitude. Based on simulations, we showed that chunking also increases sensitivity of the predictions, reaching 100% with as few as 20 sequences in a small proteome as in the influenza case (5k sites), but may require at least 30 sequences to reach 90% on larger alignments (500k sites). While RRF has less specificity than RF, it was never < 50%, and RRF sensitivity was significantly higher at smaller chunk sizes. We then used these algorithms to predict the determinants of three types of drug resistance (to Ciprofloxacin, Ceftazidime, and Gentamicin) in a bacterium, Pseudomonas aeruginosa. While both algorithms performed well in the case of the influenza data, results were more nuanced in the bacterial case, with RRF making more sensible predictions, with smaller error rates, than AB.
Altogether, we demonstrated that ML algorithms can be used to identify genetic determinants in small proteomes (viruses), even when trained on small numbers of individuals. We further showed that our RRF algorithm may deserve more scrutiny, which should be facilitated by the decreasing costs of both sequencing and phenotyping of large cohorts of individuals.
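The abstract above does not say how its chunking layer is built. The sketch below shows one plausible reading, assumed here purely for illustration: the alignment columns (sites) are split into contiguous, non-overlapping blocks so that a classifier can be run on each block separately. The name `chunk_sites` and the toy alignment are hypothetical.

```python
def chunk_sites(alignment, chunk_size):
    """Split an alignment into contiguous, non-overlapping blocks of columns.

    `alignment` is a list of equal-length sequences (one per individual).
    Returns a list of (start, end, sub_alignment) tuples, where `end` is exclusive.
    Assumption: this contiguous split is an illustrative guess, not the
    authors' documented procedure.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    n_sites = len(alignment[0])
    if any(len(seq) != n_sites for seq in alignment):
        raise ValueError("all sequences must have the same length")
    chunks = []
    for start in range(0, n_sites, chunk_size):
        end = min(start + chunk_size, n_sites)
        chunks.append((start, end, [seq[start:end] for seq in alignment]))
    return chunks


# Toy alignment: 3 individuals, 10 sites, split into chunks of 4 sites.
toy_alignment = ["MKTAYIAKQR", "MKTAYLAKQR", "MRTAYIAKQK"]
chunks = chunk_sites(toy_alignment, chunk_size=4)
```

Each sub-alignment can then be encoded and passed to a model independently, which is one way smaller chunks could reduce the cost of fitting on very wide alignments.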


Evolution of Machine Learning Algorithms in the Prediction and Design of Anticancer Peptides

Current Protein and Peptide Science ◽

10.2174/1389203721666200117171403 ◽

2020 ◽

Vol 21 (12) ◽

pp. 1242-1250 ◽

Cited By ~ 5

Author(s):

Shaherin Basith ◽

Balachandran Manavalan ◽

Tae Hwan Shin ◽

Da Yeon Lee ◽

Gwang Lee

Keyword(s):

Machine Learning ◽

Anticancer Agents ◽

Model Building ◽

Sequence Data ◽

Machine Learning Algorithms ◽

Great Promise ◽

Future Directions ◽

Design And Synthesis ◽

Anticancer Peptides ◽

Protein Sequence Data

Peptides act as promising anticancer agents due to their ease of synthesis and modification, enhanced tumor penetration, and lower systemic toxicity. However, only limited success has been achieved so far, as the experimental design and synthesis of anticancer peptides (ACPs) are prohibitively costly and time-consuming. Furthermore, the steady increase in protein sequence data from high-throughput sequencing makes it difficult to identify ACPs through experimentation alone, which often involves months or years of speculation and failure. All these limitations could be overcome by applying machine learning (ML) approaches, a field of artificial intelligence that automates analytical model building for rapid and accurate outcome predictions. ML approaches have recently shown great promise in the rapid discovery of ACPs, as witnessed by the growing number of ML-based anticancer prediction tools. In this review, we aim to provide a comprehensive view of the existing ML approaches for ACP prediction. Initially, we briefly discuss the currently available ACP databases. This is followed by the main text, where the working principles of state-of-the-art ML approaches and their performance, grouped by ML algorithm, are reviewed. Lastly, we discuss the limitations and future directions of ML methods in the prediction of ACPs.


DeepView: Visualizing Classification Boundaries of Deep Neural Networks as Scatter Plots Using Discriminative Dimensionality Reduction

Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence ◽

10.24963/ijcai.2020/319 ◽

2020 ◽

Author(s):

Alexander Schulz ◽

Fabian Hinder ◽

Barbara Hammer

Keyword(s):

Neural Network ◽

Machine Learning ◽

Dimensionality Reduction ◽

Deep Neural Network ◽

Deep Neural Networks ◽

Decision Function ◽

Two Dimensions ◽

Machine Learning Algorithms ◽

Data Set ◽

Scatter Plots

Machine learning algorithms using deep architectures have been able to implement increasingly powerful and successful models. However, they also become increasingly more complex, more difficult to comprehend and easier to fool. So far, most methods in the literature investigate the decision of the model for a single given input datum. In this paper, we propose to visualize a part of the decision function of a deep neural network together with a part of the data set in two dimensions with discriminative dimensionality reduction. This enables us to inspect how different properties of the data are treated by the model, such as outliers, adversaries or poisoned data. Further, the presented approach is complementary to the mentioned interpretation methods from the literature and hence might be even more useful in combination with those. Code is available at https://github.com/LucaHermes/DeepView


Swarm Intelligence Optimization: An Exploration and Application of Machine Learning Technology

Journal of Intelligent Systems ◽

10.1515/jisys-2020-0084 ◽

2021 ◽

Vol 30 (1) ◽

pp. 460-469

Author(s):

Yinying Cai ◽

Amit Sharma

Keyword(s):

Machine Learning ◽

Swarm Intelligence ◽

Research Result ◽

Machine Learning Algorithms ◽

Learning Technology ◽

Data Set ◽

Rice Pests ◽

Machine Leaning ◽

Smart Agriculture ◽

Swarm Intelligence Optimization

Abstract Efficient machinery and equipment play an important role in agricultural development and growth. Various research studies and patents aim to aid smart agriculture, and authors and reviewers note that machine learning technologies provide the best support for this growth. To explore machine learning technology and machine learning algorithms, most of the applications are studied on the basis of swarm intelligence optimization. An optimized V3CFOA-RF model is built through V3CFOA. The algorithm is tested on a data set collected on rice pests, then analyzed and compared in detail with other existing algorithms. The research result shows that the proposed model and algorithm are not only more accurate in recognition and prediction, but also solve the time-lag problem to a degree. The model and algorithm helped achieve a higher accuracy in crop pest prediction, which ensures a more stable and higher output of rice. Thus they can be employed as an important decision-making instrument in the agricultural production sector.


Why machine learning algorithms fail in misuse detection on KDD intrusion detection data set

Intelligent Data Analysis ◽

10.3233/ida-2004-8406 ◽

2004 ◽

Vol 8 (4) ◽

pp. 403-415 ◽

Cited By ~ 72

Author(s):

Maheshkumar Sabhnani ◽

Gursel Serpen

Keyword(s):

Machine Learning ◽

Intrusion Detection ◽

Learning Algorithms ◽

Machine Learning Algorithms ◽

Misuse Detection ◽

Data Set


Birds Sound Classification Based on Machine Learning Algorithms

Asian Journal of Research in Computer Science ◽

10.9734/ajrcos/2021/v9i430227 ◽

2021 ◽

pp. 1-11

Author(s):

Aska E. Mehyadin ◽

Adnan Mohsin Abdulazeez ◽

Dathar Abas Hasan ◽

Jwan N. Saeed

Keyword(s):

Machine Learning ◽

Noise Suppression ◽

Bird Species ◽

Machine Learning Algorithms ◽

Data Sets ◽

Learning Technology ◽

Species Classification ◽

Data Set ◽

Sound Classification ◽

Mel Frequency Cepstral Coefficient

The bird classifier is a system that uses machine learning technology to store and classify bird calls. Bird species can be identified from a recording of the bird's sound alone, which makes the system easier to manage. The system also provides species classification resources to allow automated species detection from observations, which can teach a machine how to recognize or classify the species. Undesirable noises are filtered out and the sounds are sorted into data sets, where each sound is run through a noise suppression filter and a separate classification procedure so that the most useful data set can be easily processed. Mel-frequency cepstral coefficients (MFCC) are used and tested with different algorithms, namely Naïve Bayes, J4.8 and multilayer perceptron (MLP), to classify bird species. J4.8 has the highest accuracy (78.40%) and is the best; the reported elapsed time is 39.4 seconds.
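MFCC features rest on the mel scale, which spaces frequency bands roughly the way pitch is perceived. The sketch below shows only that one ingredient: converting between hertz and mels and placing filterbank band edges evenly on the mel axis. It uses the common 2595 · log10(1 + f/700) variant of the scale; which variant, frequency range and number of filters the paper used is not stated, so the values here are assumptions.

```python
import math


def hz_to_mel(frequency_hz):
    """Convert a frequency in Hz to mels (2595 * log10(1 + f / 700) variant)."""
    return 2595.0 * math.log10(1.0 + frequency_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_band_edges(low_hz, high_hz, n_filters):
    """Return n_filters + 2 band edges in Hz, evenly spaced on the mel scale."""
    low_mel, high_mel = hz_to_mel(low_hz), hz_to_mel(high_hz)
    step = (high_mel - low_mel) / (n_filters + 1)
    return [mel_to_hz(low_mel + i * step) for i in range(n_filters + 2)]


# Assumed settings for illustration: 10 filters between 300 Hz and 8000 Hz.
edges = mel_band_edges(300.0, 8000.0, n_filters=10)
```

A full MFCC pipeline would go on to apply triangular filters between these edges to a power spectrum, take logarithms and apply a discrete cosine transform; those steps are omitted here.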


PERFORMANCE COMPARISON OF MACHINE LEARNING ALGORITHMS FOR PREDICTIVE MAINTENANCE

Informatyka Automatyka Pomiary w Gospodarce i Ochronie Środowiska ◽

10.35784/iapgos.1834 ◽

2020 ◽

Vol 10 (3) ◽

pp. 32-35

Author(s):

Jakub Gęca

Keyword(s):

Machine Learning ◽

Learning Algorithms ◽

Performance Comparison ◽

Machine Learning Algorithms ◽

Predictive Maintenance ◽

Model Parameters ◽

Data Set ◽

Reduction Techniques ◽

Machine Reliability ◽

Dimensionality Reduction Techniques

The consequences of failures and unscheduled maintenance are the reasons why engineers have been trying to increase the reliability of industrial equipment for years. In modern solutions, predictive maintenance is a frequently used method. It makes it possible to forecast failures and to raise alerts about their likelihood. This paper presents a summary of the machine learning algorithms that can be used in predictive maintenance and a comparison of their performance. The analysis was based on a data set from the Microsoft Azure AI Gallery. The paper presents a comprehensive approach to the issue, including feature engineering, preprocessing, dimensionality reduction techniques, and tuning of model parameters in order to obtain the highest possible performance. The research led to the conclusion that, in the analysed case, the best algorithm achieved 99.92% accuracy on over 122 thousand test data records. In conclusion, predictive maintenance based on machine learning represents the future of machine reliability in industry.


Non-Invasive Risk Stratification of Hypertension: A Systematic Comparison of Machine Learning Algorithms

Journal of Sensor and Actuator Networks ◽

10.3390/jsan9030034 ◽

2020 ◽

Vol 9 (3) ◽

pp. 34

Author(s):

Giovanna Sannino ◽

Ivanoe De Falco ◽

Giuseppe De Pietro

Keyword(s):

Machine Learning ◽

Blood Pressure ◽

Risk Stratification ◽

Learning Algorithms ◽

Circulatory System ◽

Machine Learning Algorithms ◽

Learning Mechanisms ◽

Data Set ◽

Non Invasive ◽

Blood Pressure Estimation

One of the most important physiological parameters of the cardiovascular circulatory system is blood pressure. Several diseases are related to long-term abnormal blood pressure, i.e., hypertension; therefore, the early detection and assessment of this condition are crucial. The identification of hypertension and, even more, the evaluation of its risk stratification by using wearable monitoring devices are now more realistic thanks to advances in the Internet of Things, improvements in digital sensors, which are becoming more and more miniaturized, and the development of new signal processing and machine learning algorithms. In this scenario, a suitable biomedical signal is the PhotoPlethysmoGraphy (PPG) signal. It can be acquired with a simple, cheap, wearable device, and can be used to evaluate several aspects of the cardiovascular system, e.g., the detection of abnormal heart rate, respiration rate, blood pressure, oxygen saturation, and so on. In this paper, we consider the Cuff-Less Blood Pressure Estimation Data Set, which contains, among others, PPG signals from a set of subjects, as well as the blood pressure values of those subjects, that is, their hypertension level. Our aim is to investigate whether or not machine learning methods applied to these PPG signals can provide better results for the non-invasive classification and evaluation of subjects’ hypertension levels. To this end, we have used a wide set of machine learning algorithms, based on different learning mechanisms, and have compared their results in terms of the effectiveness of the classification obtained.


Spoken words as biomarkers: using machine learning to gain insight into communication as a predictor of anxiety

Journal of the American Medical Informatics Association ◽

10.1093/jamia/ocaa049 ◽

2020 ◽

Vol 27 (6) ◽

pp. 929-933

Author(s):

George Demiris ◽

Kristin L Corey Magan ◽

Debra Parker Oliver ◽

Karla T Washington ◽

Chad Chadwick ◽

...

Keyword(s):

Machine Learning ◽

Secondary Data ◽

Health Indicators ◽

Machine Learning Algorithms ◽

Standardized Assessments ◽

Learning Tools ◽

Data Set ◽

Problem Solving Therapy ◽

Audio Communication ◽

The Impact

Abstract Objective: The goal of this study was to explore whether features of recorded and transcribed audio communication data extracted by machine learning algorithms can be used to train a classifier for anxiety. Materials and Methods: We used a secondary data set generated by a clinical trial examining problem-solving therapy for hospice caregivers consisting of 140 transcripts of multiple, sequential conversations between an interviewer and a family caregiver along with standardized assessments of anxiety prior to each session; 98 of these transcripts (70%) served as the training set, holding the remaining 30% of the data for evaluation. Results: A classifier for anxiety was developed relying on language-based features. An 86% precision, 78% recall, 81% accuracy, and 84% specificity were achieved with the use of the trained classifiers. High anxiety inflections were found among recently bereaved caregivers and were usually connected to issues related to transitioning out of the caregiving role. This analysis highlighted the impact of lowering anxiety by increasing reciprocity between interviewers and caregivers. Conclusion: Verbal communication can provide a platform for machine learning tools to highlight and predict behavioral health indicators and trends.
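The four figures reported in this abstract are all derived from a binary confusion matrix, and the 70/30 split follows from simple arithmetic on the 140 transcripts. The sketch below shows both calculations. The confusion-matrix counts are made up for illustration and are not taken from the study; only the totals of 140 transcripts and 98 training transcripts come from the abstract.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute standard binary classification metrics from confusion-matrix counts."""
    return {
        "precision": tp / (tp + fp),      # of predicted positives, fraction correct
        "recall": tp / (tp + fn),         # of actual positives, fraction found
        "specificity": tn / (tn + fp),    # of actual negatives, fraction found
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
    }


# Split sizes stated in the abstract: 70% of 140 transcripts for training.
n_transcripts = 140
n_train = round(0.70 * n_transcripts)
n_test = n_transcripts - n_train

# Hypothetical counts on a held-out set of n_test transcripts (NOT the study's data).
metrics = classification_metrics(tp=18, fp=3, tn=16, fn=5)
```

Reporting specificity alongside recall matters here because a classifier can score well on accuracy alone while missing most cases of one class.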
