IMPLEMENTATION OF MISSING VALUES HANDLING METHOD FOR EVALUATING THE SYSTEM/COMPONENT MAINTENANCE HISTORICAL DATA

2017 ◽  
Vol 19 (1) ◽  
pp. 11 ◽  
Author(s):  
Entin Hartini

Missing values are a problem in data evaluation. Missing value analysis can resolve the problem of incomplete data that has not been stored properly. Missing data reduce the precision of calculations, since the amount of information is incomplete. The purpose of this study is to implement missing value handling methods for the evaluation of system/component maintenance historical data at RSG GAS. Statistical methods, namely listwise deletion and mean substitution, and a machine learning method (KNNI) were used to determine the missing data corresponding to the system/component maintenance historical data. Mean substitution and KNNI were chosen because they do not require building a predictive model for each item with missing data. When missing data analysis was applied to the system/component maintenance data, the KNNI method produced the smallest RMSE. The results show that KNNI is the best method for handling missing values compared with listwise deletion or mean substitution. Keywords: missing value, data evaluation, algorithm, implementation
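As a rough illustration of the comparison described in this abstract, the sketch below applies mean substitution and KNN imputation (KNNI) to a small synthetic numeric matrix and scores them by RMSE on artificially masked entries, with listwise deletion shown only as a surviving row count; the toy data, missing rate, and neighbour count are assumptions, not the RSG GAS maintenance records.

```python
# Hedged illustration: compare mean substitution and KNN imputation (KNNI)
# by RMSE on artificially masked entries; listwise deletion simply drops rows.
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer

rng = np.random.default_rng(0)
complete = rng.normal(loc=50, scale=10, size=(200, 5))   # hypothetical maintenance records

# Mask 10% of entries completely at random so the ground truth is known.
mask = rng.random(complete.shape) < 0.10
incomplete = complete.copy()
incomplete[mask] = np.nan

def rmse(imputed):
    return np.sqrt(np.mean((imputed[mask] - complete[mask]) ** 2))

mean_filled = SimpleImputer(strategy="mean").fit_transform(incomplete)
knn_filled = KNNImputer(n_neighbors=5).fit_transform(incomplete)

print("mean substitution RMSE:", rmse(mean_filled))
print("KNNI RMSE:            ", rmse(knn_filled))
print("rows left after listwise deletion:",
      np.sum(~np.isnan(incomplete).any(axis=1)))
```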

2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Nishith Kumar ◽  
Md. Aminul Hoque ◽  
Masahiro Sugimoto

Abstract Mass spectrometry is a modern and sophisticated high-throughput analytical technique that enables large-scale metabolomic analyses. It yields a high-dimensional, large-scale matrix (samples × metabolites) of quantified data that often contains missing cells as well as outliers, which arise for several reasons, including technical and biological sources. Although several missing data imputation techniques are described in the literature, all existing conventional techniques solve only the missing value problem; they do not relieve the problem of outliers. Outliers in the dataset therefore decrease the accuracy of the imputation. We developed a new kernel weight function-based missing data imputation technique that resolves both missing values and outliers. We evaluated the performance of the proposed method and other conventional and recently developed missing imputation techniques using both artificially generated data and experimentally measured data, in both the absence and presence of different rates of outliers. Performances based on both artificial data and real metabolomics data indicate the superiority of our proposed kernel weight-based missing data imputation technique over the existing alternatives. For user convenience, an R package of the proposed kernel weight-based missing value imputation technique was developed, which is available at https://github.com/NishithPaul/tWLSA.
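The core idea, giving distant (potentially outlying) samples less influence when estimating a missing cell, can be illustrated with a Gaussian-kernel-weighted neighbour average. The sketch below is only a schematic analogue of kernel weight-based imputation, not the tWLSA implementation; the bandwidth and toy matrix are assumptions.

```python
# Illustrative kernel-weighted imputation: a missing cell is estimated as a
# weighted mean over other samples, with Gaussian-kernel weights computed
# from distances on the jointly observed columns, so outliers contribute less.
import numpy as np

def kernel_impute(X, bandwidth=1.0):
    """Fill each NaN with a Gaussian-kernel-weighted average over rows
    that observe that column; weights decay with row-to-row distance."""
    X = np.asarray(X, dtype=float)
    filled = X.copy()
    for i, j in zip(*np.where(np.isnan(X))):
        donors = np.where(~np.isnan(X[:, j]))[0]
        weights, values = [], []
        for k in donors:
            shared = ~np.isnan(X[i]) & ~np.isnan(X[k])   # columns both rows observe
            if not shared.any():
                continue
            d = np.linalg.norm(X[i, shared] - X[k, shared]) / np.sqrt(shared.sum())
            weights.append(np.exp(-d**2 / (2 * bandwidth**2)))
            values.append(X[k, j])
        if weights:
            w = np.array(weights)
            filled[i, j] = np.dot(w, values) / w.sum()
    return filled

# toy metabolomics-style matrix (samples x metabolites) with one outlier row
X = np.array([[1.0, 2.0, 3.0],
              [1.1, np.nan, 3.2],
              [0.9, 2.1, np.nan],
              [9.0, 9.5, 9.8]])     # outlier gets a near-zero weight
print(kernel_impute(X, bandwidth=1.0))
```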


Author(s):  
Caio Ribeiro ◽  
Alex A. Freitas

Abstract Longitudinal datasets of human ageing studies usually have a high volume of missing data, and one way to handle missing values in a dataset is to replace them with estimations. However, there are many methods to estimate missing values, and no single method is the best for all datasets. In this article, we propose a data-driven missing value imputation approach that performs a feature-wise selection of the best imputation method, using known information in the dataset to rank the five methods we selected based on their estimation error rates. We evaluated the proposed approach in two sets of experiments: a classifier-independent scenario, where we compared the applicability and error rates of each imputation method; and a classifier-dependent scenario, where we compared the predictive accuracy of Random Forest classifiers generated with datasets prepared using each imputation method and a baseline approach of doing no imputation (letting the classification algorithm handle the missing values internally). Based on our results from both sets of experiments, we concluded that the proposed data-driven missing value imputation approach generally resulted in models with more accurate estimations for missing data and better-performing classifiers in longitudinal datasets of human ageing. We also observed that imputation methods devised specifically for longitudinal data had very accurate estimations. This reinforces the idea that using the temporal information intrinsic to longitudinal data is a worthwhile endeavour for machine learning applications, and that this can be achieved through the proposed data-driven approach.
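A rough sketch of the feature-wise selection idea: for each column, hide a sample of its known entries, score a few candidate imputers by the error they make on those hidden values, and keep the best per column. The candidate set, hiding fraction, and toy data below are assumptions, not the five methods or the ageing datasets used in the article.

```python
# Feature-wise imputer selection (schematic): rank candidate imputers per
# column by the error they make on deliberately hidden known values.
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer

def best_imputer_per_feature(X, candidates, hide_frac=0.2, seed=0):
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    choices = {}
    for j in range(X.shape[1]):
        known = np.where(~np.isnan(X[:, j]))[0]
        hidden = rng.choice(known, size=max(1, int(hide_frac * known.size)),
                            replace=False)
        probe = X.copy()
        probe[hidden, j] = np.nan                       # hide known values as a test set
        errors = {}
        for name, make in candidates.items():
            filled = make().fit_transform(probe)
            errors[name] = np.sqrt(np.mean((filled[hidden, j] - X[hidden, j]) ** 2))
        choices[j] = min(errors, key=errors.get)        # best method for this feature
    return choices

candidates = {"mean": lambda: SimpleImputer(strategy="mean"),
              "median": lambda: SimpleImputer(strategy="median"),
              "knn": lambda: KNNImputer(n_neighbors=3)}
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 4))
X[rng.random(X.shape) < 0.1] = np.nan
print(best_imputer_per_feature(X, candidates))
```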


2021 ◽  
Author(s):  
Markus Deppner ◽  
Bedartha Goswami

The impact of the El Niño Southern Oscillation (ENSO) on rivers is well known, but most existing studies involving streamflow data are severely limited by data coverage. Time series of gauging stations fade in and out over time, which makes large-scale, long-term hydrological analyses and studies of rarely occurring extreme events challenging. Here, we use a machine learning approach to infer missing streamflow data based on the temporal correlations of stations with missing values to those with data. Using 346 stations from the Global Streamflow Indices and Metadata archive (GSIM) that cover the full 40-year timespan, in conjunction with Gaussian processes, we were able to extend our data by estimating missing values for an additional 646 stations, allowing us to include a total of 992 stations. We then investigate the impact of the 6 strongest El Niño (EN) events on rivers in South America between 1960 and 2000. Our analysis shows a strong correlation between ENSO events and extreme river dynamics in southeastern Brazil, Caribbean South America, and parts of the Amazon basin. Furthermore, we see a peak in the number of stations showing maximum river discharge all over Brazil during the EN of 1982/83, which has been linked to severe floods in eastern Brazil and parts of Uruguay and Paraguay. However, EN events of similar intensity in other years did not evoke floods of such magnitude, so the additional drivers of the 1982/83 floods need further investigation. By using machine learning methods to infer data for gauging stations with missing data, we were able to extend our data almost three-fold, revealing a possibly heavier and spatially larger impact of the 1982/83 EN on South America's hydrology than indicated in the literature.
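A minimal sketch of the gap-filling step, using Gaussian process regression to estimate missing values of one synthetic gauging series from a temporally correlated neighbour; the kernel, toy series, and missing rate are assumptions, not the GSIM data or the authors' exact model.

```python
# Hedged sketch: infer missing streamflow values of station B from a
# correlated station A using Gaussian process regression.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(0)
t = np.arange(365)
station_a = 10 + 3 * np.sin(2 * np.pi * t / 365) + rng.normal(0, 0.3, t.size)
station_b = 0.8 * station_a + 2 + rng.normal(0, 0.3, t.size)   # correlated neighbour

missing = rng.random(t.size) < 0.4                    # 40% of B is unobserved
gp = GaussianProcessRegressor(kernel=RBF() + WhiteKernel(), normalize_y=True)
gp.fit(station_a[~missing].reshape(-1, 1), station_b[~missing])
b_filled = gp.predict(station_a[missing].reshape(-1, 1))

rmse = np.sqrt(np.mean((b_filled - station_b[missing]) ** 2))
print(f"reconstruction RMSE on held-out days: {rmse:.3f}")
```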


2019 ◽  
Author(s):  
Ananya Bhattacharjee ◽  
Md. Shamsuzzoha Bayzid

Abstract
Background: Due to the recent advances in sequencing technologies and species tree estimation methods capable of taking gene tree discordance into account, notable progress has been achieved in constructing large-scale phylogenetic trees from genome-wide data. However, substantial challenges remain in leveraging this huge amount of molecular data. One of the foremost among these challenges is the need for efficient tools that can handle missing data. Popular distance-based methods such as neighbor joining and UPGMA require that the input distance matrix contain no missing values.
Results: We introduce two highly accurate machine learning based distance imputation techniques. One of our approaches is based on matrix factorization, and the other is an autoencoder-based deep learning technique. We evaluate these two techniques on a collection of simulated and biological datasets, and show that our techniques match or improve upon the best alternate techniques for distance imputation. Moreover, our proposed techniques can handle a substantial amount of missing data, to the extent where the best alternate methods fail.
Conclusions: This study shows for the first time the power and feasibility of applying deep learning techniques to the imputation of distance matrices. The autoencoder-based deep learning technique is highly accurate and scalable to large datasets. We have made these techniques freely available as cross-platform software (available at https://github.com/Ananya-Bhattacharjee/ImputeDistances).
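The first of the two techniques, distance-matrix completion by matrix factorization, can be sketched as fitting a low-rank product to the observed entries by gradient descent and reading the missing entries off that product. The rank, learning rate, and toy matrix below are illustrative assumptions, not the paper's implementation.

```python
# Low-rank matrix factorization imputation (schematic): approximate the
# distance matrix D ~ U @ V.T using only observed entries, then fill the rest.
import numpy as np

def mf_impute(D, rank=2, lr=0.01, epochs=2000, seed=0):
    rng = np.random.default_rng(seed)
    n = D.shape[0]
    observed = ~np.isnan(D)
    U = 0.1 * rng.standard_normal((n, rank))
    V = 0.1 * rng.standard_normal((n, rank))
    for _ in range(epochs):
        E = np.where(observed, D - U @ V.T, 0.0)       # residual on observed cells only
        U, V = U + lr * E @ V, V + lr * E.T @ U        # simultaneous gradient step
    return np.where(observed, D, U @ V.T)              # keep observed, fill missing

D = np.array([[0.0, 2.0, np.nan, 5.0],
              [2.0, 0.0, 3.0, np.nan],
              [np.nan, 3.0, 0.0, 4.0],
              [5.0, np.nan, 4.0, 0.0]])
print(np.round(mf_impute(D), 2))
```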


2019 ◽  
Author(s):  
Tabea Kossen ◽  
Michelle Livne ◽  
Vince I Madai ◽  
Ivana Galinovic ◽  
Dietmar Frey ◽  
...  

Abstract
Background and purpose: Handling missing values is a prevalent challenge in the analysis of clinical data. The rise of data-driven models demands an efficient use of the available data, so methods to impute missing values are crucial. Here, we developed a publicly available framework to test different imputation methods and compared their impact in a typical clinical stroke dataset as a use case.
Methods: A clinical dataset based on the 1000Plus stroke study with 380 patients with complete entries was used. Thirteen common clinical parameters, including numerical and categorical values, were selected. Missing values were simulated in a missing-at-random (MAR) and missing-completely-at-random (MCAR) fashion at rates from 0% to 60% and subsequently imputed using the mean, hot-deck imputation, multiple imputation by chained equations, the expectation maximization method, and listwise deletion. Performance was assessed by the root mean squared error, the absolute bias, and the performance of a linear model for discharge mRS prediction.
Results: Listwise deletion was the worst performing method and became significantly worse than every imputation method from 2% (MAR) and 3% (MCAR) missing values onward. The underlying missing value mechanism seemed to have a crucial influence on which imputation method performed best; consequently, no single imputation method outperformed all others. A significant performance drop of the linear model started from 11% (MAR+MCAR) and 18% (MCAR) missing values.
Conclusions: In the presented case study of a typical clinical stroke dataset, we confirmed that listwise deletion should be avoided for dealing with missing values. Our findings indicate that the underlying missing value mechanism and other dataset characteristics strongly influence the best choice of imputation method. For future studies with similar data structure, we therefore suggest using the framework developed in this study to select the most suitable imputation method for a given dataset prior to analysis.
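A condensed sketch of that kind of framework: simulate MCAR missingness at increasing rates on data with known ground truth and score two imputation strategies by RMSE. The dataset, rates, and the pair of methods are assumptions, not the 1000Plus data or the full method set evaluated in the study.

```python
# Hedged sketch: simulate MCAR missing values at several rates and compare
# mean imputation against MICE-style iterative imputation by RMSE.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import SimpleImputer, IterativeImputer

rng = np.random.default_rng(0)
X = rng.multivariate_normal([0, 0, 0], [[1, .6, .3], [.6, 1, .5], [.3, .5, 1]], 500)

for rate in (0.05, 0.20, 0.40):
    mask = rng.random(X.shape) < rate                  # MCAR: independent of the values
    X_miss = X.copy()
    X_miss[mask] = np.nan
    for name, imp in (("mean", SimpleImputer()),
                      ("MICE", IterativeImputer(max_iter=10, random_state=0))):
        filled = imp.fit_transform(X_miss)
        rmse = np.sqrt(np.mean((filled[mask] - X[mask]) ** 2))
        print(f"{rate:.0%} missing, {name}: RMSE={rmse:.3f}")
```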


2022 ◽  
Vol 9 (3) ◽  
pp. 0-0

Missing data is a universal problem across most research fields, and it introduces uncertainty into data analysis. It can arise for many reasons, such as mishandled samples, failure to collect an observation, measurement errors, deletion of aberrant values, or simply a lack of study. The nutrition field is no exception to the problem of missing data. Most frequently, the problem is handled by computing means or medians from the existing datasets, an approach that needs improvement. This paper proposes a hybrid scheme of MICE and ANN, referred to as extended ANN, to detect and analyze missing values and perform imputation in a given dataset. The proposed mechanism efficiently identifies blank entries and fills them by examining their neighboring records, in order to improve the accuracy of the dataset. To validate the proposed scheme, the extended ANN is compared against several recent algorithms and mechanisms in terms of the efficiency and accuracy of its results.
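The MICE-plus-ANN idea can be approximated with an iterative (chained-equations) imputer that uses a small neural network as the per-feature regressor. The sketch below is only a rough analogue of the proposed extended ANN; the network size, toy data, and missing rate are assumptions.

```python
# Hedged analogue of a MICE + ANN hybrid: iterative (chained-equation)
# imputation where each feature is regressed on the others by a small MLP.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
X[:, 3] = 0.5 * X[:, 0] - 0.8 * X[:, 1] + rng.normal(0, 0.1, 300)  # learnable structure
X_miss = X.copy()
X_miss[rng.random(X.shape) < 0.15] = np.nan

ann_mice = IterativeImputer(
    estimator=MLPRegressor(hidden_layer_sizes=(16,), max_iter=500, random_state=0),
    max_iter=5, random_state=0)
filled = ann_mice.fit_transform(X_miss)
print("imputed matrix shape:", filled.shape)
```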


2021 ◽  
Author(s):  
Tlamelo Emmanuel ◽  
Thabiso Maupong ◽  
Dimane Mpoeleng ◽  
Thabo Semong ◽  
Mphago Banyatsang ◽  
...  

Abstract Machine learning has been the cornerstone of analysing and extracting information from data, and a problem of missing values is often encountered. Missing values arise under various mechanisms, such as missing completely at random, missing at random, or missing not at random, and may result from system malfunction during data collection or human error during data pre-processing. Nevertheless, it is important to deal with missing values before analysing data, since ignoring or omitting them may result in biased or misinformed analysis. In the literature there have been several proposals for handling missing values. In this paper we survey some of the literature on missing data, focusing particularly on machine learning techniques. We also give insight into how the machine learning approaches work by highlighting the key features of the proposed techniques, how they perform, their limitations, and the kind of data they are most suitable for. Finally, we experiment with the K-nearest-neighbour and random forest imputation techniques on novel power plant induced fan data and offer some possible directions for future research.
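A compact sketch of the two imputers used in the experiment, KNN imputation and a random-forest-based iterative imputer in the spirit of missForest; the synthetic sensor-like matrix stands in for the power plant induced fan data, which is not available here.

```python
# Hedged sketch: KNN imputation vs. random-forest-based iterative imputation.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import KNNImputer, IterativeImputer
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 6))                        # stand-in for fan sensor readings
X_miss = X.copy()
mask = rng.random(X.shape) < 0.1
X_miss[mask] = np.nan

knn = KNNImputer(n_neighbors=5).fit_transform(X_miss)
rf = IterativeImputer(estimator=RandomForestRegressor(n_estimators=50, random_state=0),
                      max_iter=5, random_state=0).fit_transform(X_miss)

for name, filled in (("KNN", knn), ("random forest", rf)):
    print(name, "RMSE:", np.sqrt(np.mean((filled[mask] - X[mask]) ** 2)))
```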


2021 ◽  
Vol 9 ◽  
Author(s):  
Chao-Yu Guo ◽  
Ying-Chen Yang ◽  
Yi-Hau Chen

Adequate imputation of missing data significantly preserves statistical power and avoids erroneous conclusions. In the era of big data, machine learning is a great tool for inferring missing values. The root mean square error (RMSE) and the proportion of falsely classified entries (PFC) are two standard statistics for evaluating imputation accuracy. However, using the Cox proportional hazards model with various types of imputed data requires deliberate study, and its validity under different missing mechanisms is unknown. In this research, we propose supervised and unsupervised imputations and examine four machine learning-based imputation strategies. We conducted a simulation study under various scenarios with several parameters, such as sample size, missing rate, and missing mechanism. The results reveal the type-I error rates of the different imputation techniques in survival data. The simulation results show that the non-parametric missForest, based on unsupervised imputation, is the only robust method without inflated type-I errors under all missing mechanisms. In contrast, the other methods do not yield valid tests when the missing pattern is informative. Improperly conducted statistical analysis with missing data may lead to erroneous conclusions. This research provides a clear guideline for valid survival analysis using the Cox proportional hazards model with machine learning-based imputations.
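The two accuracy statistics named here can be written down directly: RMSE for continuous entries and the proportion of falsely classified entries (PFC) for categorical ones. The sketch below assumes the ground truth, the imputed matrix, and the missingness mask are available; the toy values are illustrative.

```python
# RMSE (continuous) and PFC (categorical) for scoring an imputation, given
# the ground-truth values, the imputed values, and the mask of missing cells.
import numpy as np

def rmse(truth, imputed, mask):
    return np.sqrt(np.mean((imputed[mask] - truth[mask]) ** 2))

def pfc(truth, imputed, mask):
    return np.mean(imputed[mask] != truth[mask])

# toy example: one numeric column, one categorical column coded as integers
num_truth = np.array([1.0, 2.0, 3.0, 4.0])
num_imp   = np.array([1.0, 2.5, 3.0, 3.5])
num_mask  = np.array([False, True, False, True])
cat_truth = np.array([0, 1, 1, 2])
cat_imp   = np.array([0, 1, 2, 2])
cat_mask  = np.array([False, False, True, True])
print("RMSE:", rmse(num_truth, num_imp, num_mask))   # 0.5
print("PFC: ", pfc(cat_truth, cat_imp, cat_mask))    # 0.5
```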


Author(s):  
Karim H. Erian ◽  
Pedro H. Regalado ◽  
James M. Conrad

This paper discusses a novel algorithm for solving a missing data problem in the machine learning pre-processing stage. A model built to help lenders evaluate home loans based on numerous factors, learned from available user data, is adopted as an example. If one of the factors is missing for a person in the dataset, currently used methods delete the whole entry, thereby reducing the size of the dataset and lowering the accuracy of the machine learning model. The novel algorithm avoids losing entries with missing factors by breaking the dataset into multiple subsets, building a different machine learning model for each subset, and then combining the models into one. In this manner, the model makes use of all available data and neglects only the missing values. Overall, the new algorithm improved the prediction accuracy by 5 percentage points, from 93% to 98%, in the home loan example.
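A rough sketch of the subset-and-combine idea: group rows by which factors are present, train one classifier per group on exactly those factors, and route each new case to the model matching its available columns. The loan-style feature names, labels, and classifier below are illustrative assumptions, not the paper's model.

```python
# Hedged sketch: train one model per missingness pattern so rows with a
# missing factor are kept; each model uses only the columns its rows observe.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
df = pd.DataFrame({"income": rng.normal(50, 10, 300),
                   "credit_score": rng.normal(650, 50, 300),
                   "loan_amount": rng.normal(200, 40, 300)})
df.loc[rng.random(300) < 0.3, "credit_score"] = np.nan      # some applicants lack a score
y = ((df["income"] - 0.5 * df["loan_amount"] + rng.normal(0, 5, 300)) > -50).astype(int)

patterns = df.notna().apply(tuple, axis=1)                  # which columns each row has
models = {}
for pattern in patterns.unique():
    rows = patterns.apply(lambda p: p == pattern)
    cols = [c for c, present in zip(df.columns, pattern) if present]
    models[pattern] = RandomForestClassifier(n_estimators=50, random_state=0)
    models[pattern].fit(df.loc[rows, cols], y[rows])

# route a new applicant to the model matching its observed columns
new_case = pd.DataFrame({"income": [55.0], "credit_score": [np.nan], "loan_amount": [180.0]})
pattern = tuple(new_case.notna().iloc[0])
cols = [c for c, present in zip(df.columns, pattern) if present]
print("approved:", bool(models[pattern].predict(new_case[cols])[0]))
```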


2021 ◽  
Author(s):  
Nishith Kumar ◽  
Md. Hoque ◽  
Masahiro Sugimoto

Abstract Mass spectrometry is a modern and sophisticated high-throughput analytical technique that enables large-scale metabolomics analyses. It yields a high-dimensional, large-scale matrix (samples × metabolites) of quantified data that often contains missing cells as well as outliers, which originate from several sources, including technical and biological ones. Although several missing data imputation techniques can be found in the literature, all existing conventional techniques only solve the missing value problem and do not relieve the problem of outliers. Therefore, outliers in the dataset deteriorate the accuracy of imputation. To overcome both the missing data and outlier problems, we developed a new kernel weight function-based missing data imputation technique that resolves both missing values and outliers. We evaluated the performance of the proposed method and nine other conventional missing value imputation techniques using both artificially generated data and experimentally measured data, in both the absence and presence of different rates of outliers. Performance based on both artificial data and real metabolomics data indicates that our proposed kernel weight-based missing data imputation technique performs better than the existing alternatives. For user convenience, an R package of the proposed kernel weight-based missing value imputation technique has been developed, which is available at https://github.com/NishithPaul/tWLSA.

