prediction problems
Recently Published Documents


Total documents: 374 (five years: 149)

H-index: 22 (five years: 6)

2022 ◽  
Vol 54 (8) ◽  
pp. 1-36
Author(s):  
Shubhra Kanti Karmaker (“Santu”) ◽  
Md. Mahadi Hassan ◽  
Micah J. Smith ◽  
Lei Xu ◽  
Chengxiang Zhai ◽  
...  

As big data becomes ubiquitous across domains, and more and more stakeholders aspire to make the most of their data, demand for machine learning tools has spurred researchers to explore the possibilities of automated machine learning (AutoML). AutoML tools aim to make machine learning accessible for non-machine learning experts (domain experts), to improve the efficiency of machine learning, and to accelerate machine learning research. But although automation and efficiency are among AutoML’s main selling points, the process still requires human involvement at a number of vital steps, including understanding the attributes of domain-specific data, defining prediction problems, creating a suitable training dataset, and selecting a promising machine learning technique. These steps often require a prolonged back-and-forth that makes this process inefficient for domain experts and data scientists alike and keeps so-called AutoML systems from being truly automatic. In this review article, we introduce a new classification system for AutoML systems, using a seven-tiered schematic to distinguish these systems based on their level of autonomy. We begin by describing what an end-to-end machine learning pipeline actually looks like, and which subtasks of the machine learning pipeline have been automated so far. We highlight those subtasks that are still done manually—generally by a data scientist—and explain how this limits domain experts’ access to machine learning. Next, we introduce our novel level-based taxonomy for AutoML systems and define each level according to the scope of automation support provided. Finally, we lay out a roadmap for the future, pinpointing the research required to further automate the end-to-end machine learning pipeline and discussing important challenges that stand in the way of this ambitious goal.


2022 ◽  
Vol 12 (1) ◽  
Author(s):  
Manyun Guo ◽  
Yucheng Ma ◽  
Wanyuan Liu ◽  
Zuyi Yuan

Abstract Nucleocapsid protein (NC) in the group-specific antigen (gag) of retroviruses is essential to the interactions of most retroviral gag proteins with RNAs. A computational method to predict NCs would benefit subsequent structural analysis and functional studies. However, no computational method to predict the exact locations of NCs in retroviruses has yet been proposed, and the wide variation in NC length adds to the difficulty. In this paper, a computational method to identify NCs in retroviruses is proposed. All available retrovirus sequences with NC annotations were collected from NCBI. Models based on random forest (RF) and weighted support vector machine (WSVM) were built to predict the initiation and termination sites of NCs. Factor analysis scales of generalized amino acid information, along with a position weight matrix, were used to generate the feature space. Homology-based gene prediction methods were also compared and integrated to improve prediction performance. Predicted candidate initiation and termination sites were then combined and screened according to their intervals, decision values and alignment scores. All available gag sequences without NC annotations were scanned with the model to detect putative NCs. The geometric means of sensitivity and specificity for the prediction of initiation and termination sites under fivefold cross-validation are 0.9900 and 0.9548, respectively. Of all the collected retrovirus sequences with NC annotations, 90.91% could be predicted entirely correctly by the model combining WSVM, RF and simple alignment. The composite model performs better than the individual ones. The model detected 235 putative NCs in unannotated gags. Our prediction method performs well on NC recognition and could also be extended to other gene prediction problems, especially those whose training samples vary widely in length.
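The evaluation metric named in the abstract above, the geometric mean of sensitivity and specificity, can be illustrated with a minimal sketch. The function name `g_mean` and the confusion-matrix counts are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
import math


def g_mean(tp, fn, tn, fp):
    """Geometric mean of sensitivity and specificity from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)  # fraction of true sites that were found
    specificity = tn / (tn + fp)  # fraction of non-sites correctly rejected
    return math.sqrt(sensitivity * specificity)


# Hypothetical counts, not results from the paper.
score = g_mean(tp=90, fn=10, tn=80, fp=20)
print(f"G-mean: {score:.4f}")
```

Because it multiplies the two rates, this metric stays low when either class is predicted poorly, which is one reason it is often preferred over plain accuracy for imbalanced site-prediction data.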


Author(s):  
Metin Mutlu Aydın ◽  

Passengers’ boarding times at bus stops are of great importance in calculating dwell and travel times for the scheduling process in public transport operations. However, few observed boarding time data are available for actual bus transport systems, which may cause prediction problems in the scheduling process. Accurate estimation of boarding times therefore ensures correct calculation of dwell and total travel time for bus transport systems. Based on this idea, this study aims to model the boarding time of each passenger by evaluating different parameters using two different methods (statistical and optimization analysis). For this purpose, a comprehensive data collection process was conducted in a total of seven cities in Turkey, selected on the basis of their population. Two new models were developed for boarding time estimation by evaluating various parameters using multiple Ordinary Least Squares (OLS) regression and the Artificial Bee Colony (ABC) algorithm as the statistical and optimization methods, respectively. The study results showed that modeling boarding times by considering various parameters, using the two developed models, is an effective strategy for improving the performance of bus transport systems.
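As a rough illustration of the statistical half of the approach described above, the following sketch fits a multiple OLS regression with NumPy. The helper names `fit_ols` and `predict_ols`, the two predictors, and the synthetic data are all assumptions made for this example; the study's actual parameters and data are not reproduced here, and the ABC optimization step is not shown.

```python
import numpy as np


def fit_ols(X, y):
    """Return [intercept, slope_1, ..., slope_k] minimizing the squared error."""
    X = np.asarray(X, dtype=float)
    design = np.column_stack([np.ones(len(X)), X])  # prepend an intercept column
    coef, *_ = np.linalg.lstsq(design, np.asarray(y, dtype=float), rcond=None)
    return coef


def predict_ols(coef, X):
    """Predict targets for new rows of predictors."""
    return coef[0] + np.asarray(X, dtype=float) @ coef[1:]


# Synthetic, noise-free data with two hypothetical predictors.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 5.0, size=(50, 2))
y = 2.0 + 1.5 * X[:, 0] + 0.5 * X[:, 1]

coef = fit_ols(X, y)
print("intercept and slopes:", coef)
print("prediction for [1.0, 2.0]:", predict_ols(coef, [[1.0, 2.0]]))
```

With real observations the fit would not be exact, and the residuals would indicate how much boarding-time variation the chosen parameters leave unexplained.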


2021 ◽  
Vol 6 (2(62)) ◽  
pp. 15-17
Author(s):  
Eduard Kinshakov ◽  
Yuliia Parfenenko ◽  
Vira Shendryk

The object of research is the process of choosing a method for predicting continuous numerical features on big datasets. The study is important because many subject areas today need to predict performance indicators from data collected from different sources and presented in different formats, which is a big data analysis task. To solve the problem, methods of statistical analysis were considered, namely multiple linear regression, decision trees and random forest. A large dataset was built without reference to a specific subject area, preprocessed, and analyzed to establish the correlation between the features. The dataset was processed using parallel computing by means of the Dask library for Python. Although working with big data normally requires significant computing resources, this approach does not require powerful hardware. Prediction models were built using multiple linear regression, decision trees and random forest; the prediction results were visualized and the reliability of the constructed models was analyzed. Based on the calculated prediction errors, the random forest method was found to have the greatest prediction accuracy among the considered methods. With this method, the prediction accuracy for a dataset of numerical features was approximately 97 %, which indicates high reliability of the constructed model. Thus, it is possible to conclude that the random forest method is suitable for solving prediction problems on large datasets; it can be used for datasets with a large number of features and is not sensitive to data scaling. The developed Python software application can be used to predict numerical features from different subject areas; the prediction results are written to a text file.


Author(s):  
Bao-fei Feng ◽  
Yin-shan Xu ◽  
Tao Zhang ◽  
Xiao Zhang

Abstract In general, accurate hydrological time series prediction is of great significance for the rational planning and management of water resource systems. The extreme learning machine (ELM) is an effective tool proposed for single-layer feedforward neural networks in regression and classification problems. However, the standard ELM model falls into local minima with high probability in hydrological prediction problems, since the randomly assigned parameters (such as input-hidden weights and hidden biases) often remain unchanged during the learning process. To effectively improve prediction accuracy, this paper develops a hybrid hydrological forecasting model in which the emerging sparrow search algorithm (SSA) is first used to determine satisfactory parameter combinations for the ELM model, and the Moore-Penrose generalized inverse method is then used to analytically obtain the weight matrix between the hidden layer and the output layer. The proposed method is used to forecast the long-term daily runoff series collected from three real-world hydrological stations in China. Based on several performance evaluation indexes, the results show that the proposed method outperforms several ELM variants optimized by other evolutionary algorithms in both the training and testing phases. Hence, an effective evolutionary machine learning tool is developed for accurate hydrological time series forecasting. Highlights: hydrologic forecasting, sparrow search algorithm, extreme learning machine.
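The analytic step described above, obtaining the hidden-to-output weights with the Moore-Penrose generalized inverse, can be sketched as a standard ELM in NumPy. This is a minimal sketch under stated assumptions: the hidden parameters are drawn at random and left fixed (the paper's SSA search over them is not implemented), the `tanh` activation and the names `train_elm` and `predict_elm` are choices made for this example, and the sine-curve data are synthetic rather than runoff records.

```python
import numpy as np


def train_elm(X, y, n_hidden=20, seed=0):
    """Fit a single-hidden-layer ELM: random hidden layer, analytic output weights."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(X.shape[1], n_hidden))  # input-hidden weights, never updated
    b = rng.normal(size=n_hidden)                # hidden biases, never updated
    H = np.tanh(X @ W + b)                       # hidden-layer output matrix
    beta = np.linalg.pinv(H) @ y                 # least-squares output weights via pseudo-inverse
    return W, b, beta


def predict_elm(model, X):
    """Apply a trained ELM to new inputs."""
    W, b, beta = model
    return np.tanh(X @ W + b) @ beta


# Synthetic one-dimensional regression problem (assumption: a sine curve, not runoff data).
X = np.linspace(-3.0, 3.0, 100).reshape(-1, 1)
y = np.sin(X[:, 0])

model = train_elm(X, y)
mse = np.mean((predict_elm(model, X) - y) ** 2)
print("training MSE:", mse)
```

Since only `beta` is solved for, the quality of the fit depends on the random draw of `W` and `b`; that dependence is what a search procedure such as SSA is meant to address.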


2021 ◽  
Vol 11 (1) ◽  
Author(s):  
István Gyertyán ◽  
Jana Lubec ◽  
Alíz Judit Ernyey ◽  
Christopher Gerner ◽  
Ferenc Kassai ◽  
...  

Abstract The lack of novel cognitive enhancer drugs in the clinic highlights the prediction problems of animal assays. The objective of the current study was to test a putative cognitive enhancer in a rodent cognitive test system with improved translational validity and clinical predictivity. Cognitive profiling was complemented with post mortem proteomic analysis. Twenty-seven male Lister Hooded rats (26 months old) that had learned several cognitive tasks were subchronically treated with S-CE-123 (CE-123) in a randomized blind experiment. Rats were sacrificed after the last behavioural procedure, and plasma and brains were collected. A label-free quantification approach was used to characterize proteomic changes in the synaptosomal fraction of the prefrontal cortex. CE-123 markedly enhanced motivation, which resulted in superior performance in a new-to-learn operant discrimination task and in a cooperation assay of social cognition, and mildly increased impulsivity. The compound did not affect attention or spatial and motor learning. Proteomic quantification revealed 182 protein groups that differed significantly between treatment groups, including several proteins associated with aging and neurodegeneration. Bioinformatic analysis showed that the most relevant clusters delineated synaptic vesicle recycling, synapse organisation and antioxidant activity. The cognitive profile of CE-123 mapped by the test system resembles that of modafinil in the clinic, showing the translational validity of the test system. The findings of modulated synaptic systems parallel the behavioural results and are in line with previous evidence for the role of altered synaptosomal protein groups in mechanisms of cognitive function.


2021 ◽  
Vol 13 (1) ◽  
Author(s):  
Xinyu Bai ◽  
Yuxin Yin

Abstract Predicting compound–protein interactions (CPIs) is of great importance for drug discovery and repositioning, yet still challenging mainly due to the sparse nature of CPI matrixes, resulting in poor generalization performance. Hence, unlike typical CPI prediction models focused on representation learning or model selection, we propose a deep neural network-based strategy, PCM-AAE, that re-explores and augments the pharmacological space of kinase inhibitors by introducing the adversarial auto-encoder model (AAE) to improve the generalization of the prediction model. To complete the data space, we constructed Ensemble of PCM-AAE (EPA), an ensemble model that quickly and accurately yields quantitative predictions of binding affinity between any human kinase and inhibitor. In rigorous internal validation, EPA showed excellent performance, consistently outperforming the model trained with the imbalanced set, especially for targets with relatively fewer training data points. Improved prediction accuracy of EPA for external datasets enhances its generalization ability, making it possible to gracefully handle previously unseen kinases and inhibitors. EPA showed promising potential when directly applied to virtual screening and off-target prediction, exhibiting its practicality in hit prediction. Our strategy is expected to facilitate kinase-centric drug development, as well as to solve more challenging prediction problems with insufficient data points.


Author(s):  
Satish Tirumalapudi

Abstract: Chatbots are software applications that help users communicate with a machine and get the required result; this is where Natural Language Processing (NLP) comes into the picture. Natural language processing is based on deep learning that enables computers to acquire meaning from inputs given by users. NLP techniques make it possible to use natural language to express ideas, thus drastically increasing accessibility. NLP engines rely on the elements of intent, utterance, entity, context, and session. In this project, we use deep learning techniques trained on a dataset that contains categories, patterns, and responses. Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) that is capable of learning order dependence in sequence prediction problems. LSTM is one of the most popular RNN approaches for identifying and controlling dynamic systems. We use an RNN to classify the category to which the user’s message belongs and then return a response from the list of responses. Keywords: NLP – Natural Language Processing, LSTM – Long Short-Term Memory, RNN – Recurrent Neural Networks.


Author(s):  
Taekyeong Jeong ◽  
Janggon Yoo ◽  
Daegyoum Kim

Abstract Inspired by the lateral line systems of various aquatic organisms that are capable of hydrodynamic imaging using ambient flow information, this study develops a deep learning-based object localization model that can detect the location of objects using flow information measured from a moving sensor array. In numerical simulations with the assumption of a potential flow, a two-dimensional hydrofoil navigates around four stationary cylinders in a uniform flow and obtains two types of sensory data during a simulation, namely flow velocity and pressure, from an array of sensors located on the surface of the hydrofoil. Several neural network models are constructed using the flow velocity and pressure data, and these are used to detect the positions of the hydrofoil and surrounding objects. The model based on a long short-term memory network, which is capable of learning order dependence in sequence prediction problems, outperforms the other models. The number of sensors is then optimized using feature selection techniques. This sensor optimization leads to a new object localization model that achieves impressive accuracy in predicting the locations of the hydrofoil and objects with only 40% of the sensors used in the original model.


2021 ◽  
Vol 13 (23) ◽  
pp. 4822
Author(s):  
Waytehad Rose Moskolaï ◽  
Wahabou Abdou ◽  
Albert Dipanda ◽  
Kolyang

Satellite image time series (SITS) is a sequence of satellite images that record a given area at several consecutive times. The aim of such sequences is to use not only spatial information but also the temporal dimension of the data, which is used for multiple real-world applications, such as classification, segmentation, anomaly detection, and prediction. Several traditional machine learning algorithms have been developed and successfully applied to time series for predictions. However, these methods have limitations in some situations, thus deep learning (DL) techniques have been introduced to achieve the best performance. Reviews of machine learning and DL methods for time series prediction problems have been conducted in previous studies. However, to the best of our knowledge, none of these surveys have addressed the specific case of works using DL techniques and satellite images as datasets for predictions. Therefore, this paper concentrates on the DL applications for SITS prediction, giving an overview of the main elements used to design and evaluate the predictive models, namely the architectures, data, optimization functions, and evaluation metrics. The reviewed DL-based models are divided into three categories, namely recurrent neural network-based models, hybrid models, and feed-forward-based models (convolutional neural networks and multi-layer perceptron). The main characteristics of satellite images and the major existing applications in the field of SITS prediction are also presented in this article. These applications include weather forecasting, precipitation nowcasting, spatio-temporal analysis, and missing data reconstruction. Finally, current limitations and proposed workable solutions related to the use of DL for SITS prediction are also highlighted.

