Embedded Data Imputation for Environmental Intelligent Sensing: A Case Study

Sensors ◽  
2021 ◽  
Vol 21 (23) ◽  
pp. 7774
Author(s):  
Laura Erhan ◽  
Mario Di Mauro ◽  
Ashiq Anjum ◽  
Ovidiu Bagdasar ◽  
Wei Song ◽  
...  

Recent developments in cloud computing and the Internet of Things have enabled smart environments, in terms of both monitoring and actuation. Unfortunately, this often results in unsustainable cloud-based solutions, whereby, in the interest of simplicity, a wealth of raw (unprocessed) data are pushed from sensor nodes to the cloud. Herein, we advocate the use of machine learning at sensor nodes to perform essential data-cleaning operations, to avoid the transmission of corrupted (often unusable) data to the cloud. Starting from a public pollution dataset, we investigate how two machine learning techniques (kNN and missForest) may be embedded on Raspberry Pi to perform data imputation, without impacting the data collection process. Our experimental results demonstrate the accuracy and computational efficiency of edge-learning methods for filling in missing data values in corrupted data series. We find that kNN and missForest correctly impute up to 40% of randomly distributed missing values, with a density distribution of values that is indistinguishable from the benchmark. We also show a trade-off analysis for the case of bursty missing values, with recoverable blocks of up to 100 samples. Computation times are shorter than sampling periods, allowing for data imputation at the edge in a timely manner.
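As a rough illustration of the kind of edge-side imputation this abstract describes, the following sketch uses scikit-learn's KNNImputer to fill randomly missing values in a small synthetic sensor series; the column names, missing rate, and distributions are illustrative assumptions, not values taken from the paper (which also evaluates missForest).

```python
# Minimal sketch (not the authors' code): kNN imputation of randomly
# missing pollution readings, as could run on an edge device.
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

rng = np.random.default_rng(0)

# Hypothetical sensor frame; column names and distributions are illustrative only.
data = pd.DataFrame({
    "pm25": rng.normal(35, 10, 500),
    "pm10": rng.normal(50, 15, 500),
    "no2": rng.normal(20, 5, 500),
})

# Corrupt ~40% of one channel at random to mimic missing samples.
corrupted = data.copy()
mask = rng.random(len(data)) < 0.40
corrupted.loc[mask, "pm25"] = np.nan

# Impute the gaps with kNN using the remaining channels.
imputer = KNNImputer(n_neighbors=5)
imputed = pd.DataFrame(imputer.fit_transform(corrupted), columns=data.columns)

# Compare the recovered values against the benchmark (original) series.
mae = np.abs(imputed.loc[mask, "pm25"] - data.loc[mask, "pm25"]).mean()
print(f"Mean absolute error on imputed pm25 values: {mae:.2f}")
```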

2020 ◽  
Vol 21 ◽  
Author(s):  
Sukanya Panja ◽  
Sarra Rahem ◽  
Cassandra J. Chu ◽  
Antonina Mitrofanova

Background: In recent years, the availability of high-throughput technologies, the establishment of large molecular patient data repositories, and advances in computing power and storage have allowed the elucidation of complex mechanisms implicated in therapeutic response in cancer patients. The breadth and depth of such data, alongside experimental noise and missing values, require a sophisticated human-machine interaction that would allow effective learning from complex data and accurate forecasting of future outcomes, ideally embedded in the core of machine learning design. Objective: In this review, we will discuss machine learning techniques utilized for modeling treatment response in cancer, including random forests, support vector machines, neural networks, and linear and logistic regression. We will review their mathematical foundations and discuss their limitations and alternative approaches, all in light of their application to therapeutic response modeling in cancer. Conclusion: We hypothesize that the increase in the number of patient profiles and the potential for temporal monitoring of patient data will define even more complex techniques, such as deep learning and causal analysis, as central players in therapeutic response modeling.


Electronics ◽  
2021 ◽  
Vol 10 (21) ◽  
pp. 2717
Author(s):  
Nusrat Rouf ◽  
Majid Bashir Malik ◽  
Tasleem Arif ◽  
Sparsh Sharma ◽  
Saurabh Singh ◽  
...  

With the advent of technological marvels like global digitization, the prediction of the stock market has entered a technologically advanced era, revamping the old model of trading. With the ceaseless increase in market capitalization, stock trading has become a center of investment for many financial investors. Many analysts and researchers have developed tools and techniques that predict stock price movements and help investors make proper decisions. Advanced trading models enable researchers to predict the market using non-traditional textual data from social platforms. The application of advanced machine learning approaches, such as text data analytics and ensemble methods, has greatly increased prediction accuracy. Meanwhile, the analysis and prediction of stock markets remain one of the most challenging research areas due to their dynamic, erratic, and chaotic data. This study explains the systematics of machine learning-based approaches for stock market prediction based on the deployment of a generic framework. Findings from the last decade (2011–2021), retrieved from online digital libraries and databases such as the ACM Digital Library and Scopus, were critically analyzed. Furthermore, an extensive comparative analysis was carried out to identify the directions of significance. The study should help emerging researchers understand the basics and advancements of this emerging area and carry on further research in promising directions.


Webology ◽  
2021 ◽  
Vol 18 (Special Issue 01) ◽  
pp. 183-195
Author(s):  
Thingbaijam Lenin ◽  
N. Chandrasekaran

A student’s academic performance is one of the most important parameters for evaluating the standard of any institute. It has become of paramount importance for any institute to identify students at risk of underperforming, failing, or even dropping out of a course. Machine learning techniques may be used to develop a model for predicting a student’s performance as early as at the time of admission. The task, however, is challenging because the educational data available for modelling are usually imbalanced. We explore ensemble machine learning techniques, namely a bagging algorithm (random forest, rf) and boosting algorithms (adaptive boosting, adaboost; stochastic gradient boosting, gbm; extreme gradient boosting, xgbTree), in an attempt to develop a model for predicting student performance at a private university in Meghalaya using three categories of data: demographic, prior academic record, and personality. The collected data are highly imbalanced and also contain missing values. We employ k-nearest neighbour (knn) data imputation to tackle the missing values. The models are developed on the imputed data with 10-fold cross-validation and are evaluated using precision, specificity, recall, and kappa metrics. As the data are imbalanced, we avoid using accuracy as the metric for evaluating the models and instead use balanced accuracy and F-score. We compare the ensemble techniques with the single classifier C4.5. The best results are provided by random forest and adaboost, with an F-score of 66.67%, balanced accuracy of 75%, and accuracy of 96.94%.
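A minimal Python sketch of the pipeline outlined above, assuming scikit-learn equivalents of the R/caret-style learners named in the abstract (rf, adaboost, gbm, xgbTree): kNN imputation of missing values feeding a random forest, evaluated with 10-fold cross-validation using balanced accuracy and F-score rather than plain accuracy. The synthetic data stand in for the student records.

```python
# Illustrative sketch only: kNN imputation + random forest on an
# imbalanced dataset, scored with balanced accuracy and F-score.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import KNNImputer
from sklearn.model_selection import cross_validate
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(42)

# Synthetic stand-in for the student dataset: imbalanced classes.
X, y = make_classification(n_samples=600, n_features=12, weights=[0.9, 0.1],
                           random_state=42)

# Inject missing values to mimic the incomplete records.
mask = rng.random(X.shape) < 0.05
X[mask] = np.nan

pipeline = Pipeline([
    ("impute", KNNImputer(n_neighbors=5)),
    ("rf", RandomForestClassifier(n_estimators=200, random_state=42)),
])

# 10-fold cross-validation with imbalance-aware metrics.
scores = cross_validate(pipeline, X, y, cv=10,
                        scoring=["balanced_accuracy", "f1"])
print("Balanced accuracy:", scores["test_balanced_accuracy"].mean())
print("F-score:", scores["test_f1"].mean())
```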


2021 ◽  
Vol 102 ◽  
pp. 04004
Author(s):  
Jesse Jeremiah Tanimu ◽  
Mohamed Hamada ◽  
Mohammed Hassan ◽  
Saratu Yusuf Ilu

With the advent of new technologies in the medical field, huge amounts of cancer data have been collected and are readily accessible to the medical research community. Over the years, researchers have employed advanced data mining and machine learning techniques to develop better models that can analyze datasets to extract conceived patterns, ideas, and hidden knowledge. The mined information can be used to support decision making in diagnostic processes. These techniques, while able to effectively predict future outcomes of certain diseases, can also discover and identify patterns and relationships within complex datasets. In this research, a predictive model for patients’ cervical cancer outcomes has been developed, given risk patterns from individual medical records and preliminary screening tests. This work presents a Decision Tree (DT) classification algorithm and shows the advantage of feature selection, using the recursive feature elimination technique for dimensionality reduction, in improving the accuracy, sensitivity, and specificity of the model. The dataset employed here suffers from missing values and is highly imbalanced; therefore, a combination of under- and oversampling techniques called SMOTETomek was employed. A comparative analysis of the proposed model has been performed to show the effectiveness of feature selection and class-imbalance handling based on the classifier’s accuracy, sensitivity, and specificity. The DT with the selected features and SMOTETomek achieves better results, with an accuracy of 98%, sensitivity of 100%, and specificity of 97%. The Decision Tree classifier is thus shown to perform excellently on the classification task when the features are reduced and the class imbalance problem is addressed.
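A hedged sketch of the described workflow, assuming the imbalanced-learn library and a synthetic dataset in place of the cervical cancer records: SMOTETomek resampling of the training set, recursive feature elimination around a decision tree, and evaluation by accuracy, sensitivity, and specificity.

```python
# Illustrative sketch: SMOTETomek resampling + RFE feature selection
# feeding a decision tree, evaluated by accuracy/sensitivity/specificity.
import numpy as np
from imblearn.combine import SMOTETomek
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced stand-in for the screening dataset.
X, y = make_classification(n_samples=800, n_features=20, weights=[0.93, 0.07],
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y,
                                                    random_state=0)

# Balance the training set with combined over- and undersampling.
X_res, y_res = SMOTETomek(random_state=0).fit_resample(X_train, y_train)

# Recursive feature elimination wrapped around the decision tree.
selector = RFE(DecisionTreeClassifier(random_state=0), n_features_to_select=8)
selector.fit(X_res, y_res)

clf = DecisionTreeClassifier(random_state=0)
clf.fit(selector.transform(X_res), y_res)

y_pred = clf.predict(selector.transform(X_test))
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Sensitivity:", tp / (tp + fn))
print("Specificity:", tn / (tn + fp))
```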


2021 ◽  
Author(s):  
Cao Truong Tran

Classification is a major task in machine learning and data mining. Many real-world datasets suffer from the unavoidable issue of missing values. Classification with incomplete data has to be handled carefully because inadequate treatment of missing values causes large classification errors. Most existing research on classification with incomplete data has focused on improving effectiveness, but has not adequately addressed the efficiency of applying classifiers to unseen instances, which is much more important than the act of creating classifiers. A common approach to classification with incomplete data is to use imputation methods to replace missing values with plausible values before building classifiers and classifying unseen instances. This approach provides complete data that can then be used by any classification algorithm, but sophisticated imputation methods are usually computationally intensive, especially during the application phase of classification. Another approach is to build a classifier that can work directly with missing values. This approach requires no time for estimating missing values, but it often generates inaccurate and complex classifiers when faced with numerous missing values. A recent approach, which also avoids estimating missing values, is to build a set of classifiers from which applicable classifiers are selected for classifying unseen instances. However, this approach is also often inaccurate and takes a long time to find applicable classifiers when faced with numerous missing values.

The overall goal of the thesis is to simultaneously improve the effectiveness and efficiency of classification with incomplete data by using evolutionary machine learning techniques for feature selection, clustering, ensemble learning, feature construction, and constructing classifiers.

The thesis develops approaches for improving imputation for classification with incomplete data by integrating clustering and feature selection with imputation. These approaches improve both the effectiveness and the efficiency of using imputation for classification with incomplete data.

The thesis develops wrapper-based feature selection methods to improve the input space for classification algorithms that can work directly with incomplete data. The methods not only improve classification accuracy, but also reduce the complexity of such classifiers.

The thesis develops a feature construction method to improve the input space for classification algorithms with incomplete data by proposing interval genetic programming: genetic programming with a set of interval functions. The method improves classification accuracy and reduces the complexity of classifiers.

The thesis develops an ensemble approach to classification with incomplete data by integrating imputation, feature selection, and ensemble learning. The results show that this approach is more accurate and faster than previous common methods for classification with incomplete data.

The thesis develops interval genetic programming to directly evolve classifiers for incomplete data. The results show that classifiers generated by interval genetic programming can be more effective and efficient than classifiers generated by the combination of imputation and traditional genetic programming. Interval genetic programming is also more effective than common classification algorithms able to work directly with incomplete data.

In summary, the thesis develops a range of approaches for simultaneously improving the effectiveness and efficiency of classification with incomplete data by using a range of evolutionary machine learning techniques.
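The thesis relies on evolutionary techniques (genetic programming with interval functions) that are not reproduced here; the sketch below only illustrates, under stated assumptions, the conventional baseline it builds on, namely imputation followed by feature selection and an ensemble, using scikit-learn on synthetic incomplete data.

```python
# Simplified baseline sketch (not the thesis' evolutionary methods):
# imputation -> feature selection -> bagging ensemble for incomplete data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
X, y = make_classification(n_samples=500, n_features=15, random_state=1)
X[rng.random(X.shape) < 0.10] = np.nan   # 10% of values missing

pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),       # estimate missing values
    ("select", SelectKBest(f_classif, k=8)),           # keep the 8 best features
    ("ensemble", BaggingClassifier(DecisionTreeClassifier(), n_estimators=30,
                                   random_state=1)),   # bagged trees
])

print("CV accuracy:", cross_val_score(pipeline, X, y, cv=5).mean())
```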


Author(s):  
Tolga Ensari ◽  
Melike Günay ◽  
Yağız Nalçakan ◽  
Eyyüp Yildiz

Machine learning is one of the most popular research areas, and it is commonly used in wireless communications and networks. Security and fast communication are among the key requirements for next-generation wireless networks. Machine learning techniques are becoming more important day by day, since the types, amount, and structure of data are continuously changing. Recent developments in smartphones and other devices such as drones, wearables, and sensor-equipped machines require reliable communication within Internet of Things (IoT) systems. For this purpose, artificial intelligence can increase security and reliability and manage the data generated by wireless systems. In this chapter, the authors investigate several machine learning techniques for wireless communications, including deep learning, which represents a branch of artificial neural networks.


Author(s):  
Pratik Vyas ◽  
Diptangshu Pandit

The use of machine learning techniques in predictive health care is on the rise, with minimal data used to train machine learning models that deliver highly accurate predictions. In this paper, we propose such a system, which utilizes Heart Rate Variability (HRV) as features for training machine learning models. The paper further benchmarks the usefulness of HRV features calculated from basic heart-rate data using a window-shifting method. The benchmarking has been conducted with different machine learning classifiers, such as an artificial neural network, decision tree, k-nearest neighbour, and naive Bayes classifier. Empirical results using the MIT-BIH Arrhythmia database show that the proposed system can be used for highly efficient prediction of abnormalities in heartbeat data series.
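A hedged sketch of the window-shifting idea described above: simple HRV statistics (mean RR, SDNN, RMSSD) computed over a sliding window of inter-beat intervals, which could then serve as features for the listed classifiers. The window length and the synthetic RR series are assumptions, not values from the paper.

```python
# Illustrative sketch: sliding-window HRV features from RR intervals (ms).
import numpy as np

def hrv_features(rr_intervals, window=32, step=1):
    """Return [mean RR, SDNN, RMSSD] for each shifted window."""
    features = []
    for start in range(0, len(rr_intervals) - window + 1, step):
        w = rr_intervals[start:start + window]
        diffs = np.diff(w)
        features.append([
            w.mean(),                      # mean RR interval
            w.std(ddof=1),                 # SDNN
            np.sqrt(np.mean(diffs ** 2)),  # RMSSD
        ])
    return np.asarray(features)

# Synthetic RR series standing in for beat intervals derived from ECG data.
rng = np.random.default_rng(7)
rr = 800 + 50 * rng.standard_normal(500)

X = hrv_features(rr)
print(X.shape)  # (n_windows, 3) feature matrix for a downstream classifier
```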


Author(s):  
Karel Diéguez-Santana ◽  
Gerardo M. Casañola-Martin ◽  
James R. Green ◽  
Bakhtiyor Rasulev ◽  
Humberto González-Díaz

Background: Checking the connectivity (structure) of complex Metabolic Reaction Network (MRN) models proposed for new microorganisms with promising properties is an important goal for chemical biology. Objective: In principle, this checking can be performed by hand (manual curation). However, this is a hard task due to the high number of combinations of pairs of nodes (possible metabolic reactions). Method: In this work, we used Combinatorial, Perturbation Theory, and Machine Learning (CPTML) techniques to seek a CPTML model for the MRNs of >40 organisms compiled by Barabási’s group. First, we quantified the local structure of a very large set of nodes in each MRN using a new class of node index called Markov linear indices fk. Next, we calculated CPT operators for 150,000 combinations of query and reference nodes of the MRNs. Last, we used these CPT operators as inputs to different ML algorithms. Results: The CPTML linear model obtained using the LDA algorithm is able to discriminate nodes (metabolites) with correct assignation of reactions from incorrect nodes, with values of accuracy, specificity, and sensitivity in the range of 85-100% in both training and external validation data series. Conclusion: Meanwhile, PTML models based on the Bayesian network, J48 decision tree, and random forest algorithms were identified as the three best non-linear models, with accuracy greater than 97.5%. The present work opens a door to the study of the MRNs of multiple organisms using PTML models.


2019 ◽  
Vol 8 (9) ◽  
pp. 1298 ◽  
Author(s):  
Giulia Lorenzoni ◽  
Stefano Santo Sabato ◽  
Corrado Lanera ◽  
Daniele Bottigliengo ◽  
Clara Minto ◽  
...  

The present study aims to compare the performance of eight Machine Learning Techniques (MLTs) in the prediction of hospitalization among patients with heart failure, using data from the Gestione Integrata dello Scompenso Cardiaco (GISC) study. The GISC project is an ongoing study that takes place in the region of Puglia, Southern Italy. Patients with a diagnosis of heart failure are enrolled in a long-term assistance program that includes the adoption of an online platform for data sharing between general practitioners and cardiologists working in hospitals and community health districts. Logistic regression, generalized linear model net (GLMN), classification and regression tree, random forest, adaboost, logitboost, support vector machine, and neural networks were applied to evaluate the feasibility of such techniques in predicting hospitalization of 380 patients enrolled in the GISC study, using data about demographic characteristics, medical history, and clinical characteristics of each patient. The MLTs were compared both without and with missing data imputation. Overall, models trained without missing data imputation showed higher predictive performances. The GLMN showed better performance in predicting hospitalization than the other MLTs, with an average accuracy, positive predictive value and negative predictive value of 81.2%, 87.5%, and 75%, respectively. Present findings suggest that MLTs may represent a promising opportunity to predict hospital admission of heart failure patients by exploiting health care information generated by the contact of such patients with the health care system.
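A minimal sketch of the kind of comparison reported above: several of the listed classifiers evaluated by cross-validation on the same feature matrix, once on complete cases and once after simple imputation. The synthetic data stand in for the GISC records, and the scikit-learn implementations are assumptions rather than the exact models used in the study.

```python
# Illustrative comparison sketch: a few of the listed classifiers
# evaluated with cross-validation, with and without imputation.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

rng = np.random.default_rng(3)
X, y = make_classification(n_samples=380, n_features=10, random_state=3)
X_missing = X.copy()
X_missing[rng.random(X.shape) < 0.08] = np.nan   # mimic incomplete records

models = {
    "logistic": LogisticRegression(max_iter=1000),
    "random forest": RandomForestClassifier(random_state=3),
    "adaboost": AdaBoostClassifier(random_state=3),
    "svm": SVC(),
}

for name, model in models.items():
    # Complete-case analysis (drop rows with any missing value).
    complete = ~np.isnan(X_missing).any(axis=1)
    acc_cc = cross_val_score(model, X_missing[complete], y[complete], cv=5).mean()
    # Same model after mean imputation of missing values.
    acc_imp = cross_val_score(make_pipeline(SimpleImputer(), model),
                              X_missing, y, cv=5).mean()
    print(f"{name}: complete-case {acc_cc:.3f}, imputed {acc_imp:.3f}")
```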

