A systematic review of machine learning-based missing value imputation techniques

2021 ◽  
Vol ahead-of-print (ahead-of-print) ◽  
Author(s):  
Tressy Thomas ◽  
Enayat Rajabi

PurposeThe primary aim of this study is to review the studies from different dimensions including type of methods, experimentation setup and evaluation metrics used in the novel approaches proposed for data imputation, particularly in the machine learning (ML) area. This ultimately provides an understanding about how well the proposed framework is evaluated and what type and ratio of missingness are addressed in the proposals. The review questions in this study are (1) what are the ML-based imputation methods studied and proposed during 2010–2020? (2) How the experimentation setup, characteristics of data sets and missingness are employed in these studies? (3) What metrics were used for the evaluation of imputation method?Design/methodology/approachThe review process went through the standard identification, screening and selection process. The initial search on electronic databases for missing value imputation (MVI) based on ML algorithms returned a large number of papers totaling at 2,883. Most of the papers at this stage were not exactly an MVI technique relevant to this study. The literature reviews are first scanned in the title for relevancy, and 306 literature reviews were identified as appropriate. Upon reviewing the abstract text, 151 literature reviews that are not eligible for this study are dropped. This resulted in 155 research papers suitable for full-text review. From this, 117 papers are used in assessment of the review questions.FindingsThis study shows that clustering- and instance-based algorithms are the most proposed MVI methods. Percentage of correct prediction (PCP) and root mean square error (RMSE) are most used evaluation metrics in these studies. For experimentation, majority of the studies sourced the data sets from publicly available data set repositories. A common approach is that the complete data set is set as baseline to evaluate the effectiveness of imputation on the test data sets with artificially induced missingness. The data set size and missingness ratio varied across the experimentations, while missing datatype and mechanism are pertaining to the capability of imputation. Computational expense is a concern, and experimentation using large data sets appears to be a challenge.Originality/valueIt is understood from the review that there is no single universal solution to missing data problem. Variants of ML approaches work well with the missingness based on the characteristics of the data set. Most of the methods reviewed lack generalization with regard to applicability. Another concern related to applicability is the complexity of the formulation and implementation of the algorithm. Imputations based on k-nearest neighbors (kNN) and clustering algorithms which are simple and easy to implement make it popular across various domains.

2019 ◽  
Vol 26 (6) ◽  
pp. 1709-1727
Author(s):  
Prafulla Bafna ◽  
Dhanya Pramod ◽  
Shailaja Shrwaikar ◽  
Atiya Hassan

Purpose Document management is growing in importance proportionate to the growth of unstructured data, and its applications are increasing from process benchmarking to customer relationship management and so on. The purpose of this paper is to improve important components of document management that is keyword extraction and document clustering. It is achieved through knowledge extraction by updating the phrase document matrix. The objective is to manage documents by extending the phrase document matrix and achieve refined clusters. The study achieves consistency in cluster quality in spite of the increasing size of data set. Domain independence of the proposed method is tested and compared with other methods. Design/methodology/approach In this paper, a synset-based phrase document matrix construction method is proposed where semantically similar phrases are grouped to reduce the dimension curse. When a large collection of documents is to be processed, it includes some documents that are very much related to the topic of interest known as model documents and also the documents that deviate from the topic of interest. These non-relevant documents may affect the cluster quality. The first step in knowledge extraction from the unstructured textual data is converting it into structured form either as term frequency-inverse document frequency matrix or as phrase document matrix. Once in structured form, a range of mining algorithms from classification to clustering can be applied. Findings In the enhanced approach, the model documents are used to extract key phrases with synset groups, whereas the other documents participate in the construction of the feature matrix. It gives a better feature vector representation and improved cluster quality. Research limitations/implications Various applications that require managing of unstructured documents can use this approach by specifically incorporating the domain knowledge with a thesaurus. Practical implications Experiment pertaining to the academic domain is presented that categorizes research papers according to the context and topic, and this will help academicians to organize and build knowledge in a better way. The grouping and feature extraction for resume data can facilitate the candidate selection process. Social implications Applications like knowledge management, clustering of search engine results, different recommender systems like hotel recommender, task recommender, and so on, will benefit from this study. Hence, the study contributes to improving document management in business domains or areas of interest of its users from various strata’s of society. Originality/value The study proposed an improvement to document management approach that can be applied in various domains. The efficacy of the proposed approach and its enhancement is validated on three different data sets of well-articulated documents from data sets such as biography, resume and research papers. These results can be used for benchmarking further work carried out in these areas.


2020 ◽  
Vol 6 ◽  
Author(s):  
Jaime de Miguel Rodríguez ◽  
Maria Eugenia Villafañe ◽  
Luka Piškorec ◽  
Fernando Sancho Caparrini

Abstract This work presents a methodology for the generation of novel 3D objects resembling wireframes of building types. These result from the reconstruction of interpolated locations within the learnt distribution of variational autoencoders (VAEs), a deep generative machine learning model based on neural networks. The data set used features a scheme for geometry representation based on a ‘connectivity map’ that is especially suited to express the wireframe objects that compose it. Additionally, the input samples are generated through ‘parametric augmentation’, a strategy proposed in this study that creates coherent variations among data by enabling a set of parameters to alter representative features on a given building type. In the experiments that are described in this paper, more than 150 k input samples belonging to two building types have been processed during the training of a VAE model. The main contribution of this paper has been to explore parametric augmentation for the generation of large data sets of 3D geometries, showcasing its problems and limitations in the context of neural networks and VAEs. Results show that the generation of interpolated hybrid geometries is a challenging task. Despite the difficulty of the endeavour, promising advances are presented.


2021 ◽  
Vol ahead-of-print (ahead-of-print) ◽  
Author(s):  
Jiawei Lian ◽  
Junhong He ◽  
Yun Niu ◽  
Tianze Wang

Purpose The current popular image processing technologies based on convolutional neural network have the characteristics of large computation, high storage cost and low accuracy for tiny defect detection, which is contrary to the high real-time and accuracy, limited computing resources and storage required by industrial applications. Therefore, an improved YOLOv4 named as YOLOv4-Defect is proposed aim to solve the above problems. Design/methodology/approach On the one hand, this study performs multi-dimensional compression processing on the feature extraction network of YOLOv4 to simplify the model and improve the feature extraction ability of the model through knowledge distillation. On the other hand, a prediction scale with more detailed receptive field is added to optimize the model structure, which can improve the detection performance for tiny defects. Findings The effectiveness of the method is verified by public data sets NEU-CLS and DAGM 2007, and the steel ingot data set collected in the actual industrial field. The experimental results demonstrated that the proposed YOLOv4-Defect method can greatly improve the recognition efficiency and accuracy and reduce the size and computation consumption of the model. Originality/value This paper proposed an improved YOLOv4 named as YOLOv4-Defect for the detection of surface defect, which is conducive to application in various industrial scenarios with limited storage and computing resources, and meets the requirements of high real-time and precision.


2021 ◽  
Vol ahead-of-print (ahead-of-print) ◽  
Author(s):  
Lam Hoang Viet Le ◽  
Toan Luu Duc Huynh ◽  
Bryan S. Weber ◽  
Bao Khac Quoc Nguyen

PurposeThis paper aims to identify the disproportionate impacts of the COVID-19 pandemic on labor markets.Design/methodology/approachThe authors conduct a large-scale survey on 16,000 firms from 82 industries in Ho Chi Minh City, Vietnam, and analyze the data set by using different machine-learning methods.FindingsFirst, job loss and reduction in state-owned enterprises have been significantly larger than in other types of organizations. Second, employees of foreign direct investment enterprises suffer a significantly lower labor income than those of other groups. Third, the adverse effects of the COVID-19 pandemic on the labor market are heterogeneous across industries and geographies. Finally, firms with high revenue in 2019 are more likely to adopt preventive measures, including the reduction of labor forces. The authors also find a significant correlation between firms' revenue and labor reduction as traditional econometrics and machine-learning techniques suggest.Originality/valueThis study has two main policy implications. First, although government support through taxes has been provided, the authors highlight evidence that there may be some additional benefit from targeting firms that have characteristics associated with layoffs or other negative labor responses. Second, the authors provide information that shows which firm characteristics are associated with particular labor market responses such as layoffs, which may help target stimulus packages. Although the COVID-19 pandemic affects most industries and occupations, heterogeneous firm responses suggest that there could be several varieties of targeted policies-targeting firms that are likely to reduce labor forces or firms likely to face reduced revenue. In this paper, the authors outline several industries and firm characteristics which appear to more directly be reducing employee counts or having negative labor responses which may lead to more cost–effect stimulus.


2018 ◽  
Vol 11 (2) ◽  
pp. 53-67
Author(s):  
Ajay Kumar ◽  
Shishir Kumar

Several initial center selection algorithms are proposed in the literature for numerical data, but the values of the categorical data are unordered so, these methods are not applicable to a categorical data set. This article investigates the initial center selection process for the categorical data and after that present a new support based initial center selection algorithm. The proposed algorithm measures the weight of unique data points of an attribute with the help of support and then integrates these weights along the rows, to get the support of every row. Further, a data object having the largest support is chosen as an initial center followed by finding other centers that are at the greatest distance from the initially selected center. The quality of the proposed algorithm is compared with the random initial center selection method, Cao's method, Wu method and the method introduced by Khan and Ahmad. Experimental analysis on real data sets shows the effectiveness of the proposed algorithm.


Author(s):  
Aska E. Mehyadin ◽  
Adnan Mohsin Abdulazeez ◽  
Dathar Abas Hasan ◽  
Jwan N. Saeed

The bird classifier is a system that is equipped with an area machine learning technology and uses a machine learning method to store and classify bird calls. Bird species can be known by recording only the sound of the bird, which will make it easier for the system to manage. The system also provides species classification resources to allow automated species detection from observations that can teach a machine how to recognize whether or classify the species. Non-undesirable noises are filtered out of and sorted into data sets, where each sound is run via a noise suppression filter and a separate classification procedure so that the most useful data set can be easily processed. Mel-frequency cepstral coefficient (MFCC) is used and tested through different algorithms, namely Naïve Bayes, J4.8 and Multilayer perceptron (MLP), to classify bird species. J4.8 has the highest accuracy (78.40%) and is the best. Accuracy and elapsed time are (39.4 seconds).


Electronics ◽  
2022 ◽  
Vol 11 (2) ◽  
pp. 245
Author(s):  
Konstantinos G. Liakos ◽  
Georgios K. Georgakilas ◽  
Fotis C. Plessas ◽  
Paris Kitsos

A significant problem in the field of hardware security consists of hardware trojan (HT) viruses. The insertion of HTs into a circuit can be applied for each phase of the circuit chain of production. HTs degrade the infected circuit, destroy it or leak encrypted data. Nowadays, efforts are being made to address HTs through machine learning (ML) techniques, mainly for the gate-level netlist (GLN) phase, but there are some restrictions. Specifically, the number and variety of normal and infected circuits that exist through the free public libraries, such as Trust-HUB, are based on the few samples of benchmarks that have been created from circuits large in size. Thus, it is difficult, based on these data, to develop robust ML-based models against HTs. In this paper, we propose a new deep learning (DL) tool named Generative Artificial Intelligence Netlists SynthesIS (GAINESIS). GAINESIS is based on the Wasserstein Conditional Generative Adversarial Network (WCGAN) algorithm and area–power analysis features from the GLN phase and synthesizes new normal and infected circuit samples for this phase. Based on our GAINESIS tool, we synthesized new data sets, different in size, and developed and compared seven ML classifiers. The results demonstrate that our new generated data sets significantly enhance the performance of ML classifiers compared with the initial data set of Trust-HUB.


2017 ◽  
Vol 24 (4) ◽  
pp. 1052-1064 ◽  
Author(s):  
Yong Joo Lee ◽  
Seong-Jong Joo ◽  
Hong Gyun Park

Purpose The purpose of this paper is to measure the comparative efficiency of 18 Korean commercial banks under the presence of negative observations and examine performance differences among them by grouping them according to their market conditions. Design/methodology/approach The authors employ two data envelopment analysis (DEA) models such as a Banker, Charnes, and Cooper (BCC) model and a modified slacks-based measure of efficiency (MSBM) model, which can handle negative data. The BCC model is proven to be translation invariant for inputs or outputs depending on output or input orientation. Meanwhile, the MSBM model is unit invariant in addition to translation invariant. The authors compare results from both models and choose one for interpreting results. Findings Most Korean banks recovered from the worst performance in 2011 and showed similar performance in recent years. Among three groups such as national banks, regional banks, and special banks, the most special banks demonstrated superb performance across models and years. Especially, the performance difference between the special banks and the regional banks was statistically significant. The authors concluded that the high performance of the special banks was due to their nationwide market access and ownership type. Practical implications This study demonstrates how to analyze and measure the efficiency of entities when variables contain negative observations using a data set for Korean banks. The authors have tried two major DEA models that are able to handle negative data and proposed a practical direction for future studies. Originality/value Although there are research papers for measuring the performance of banks in Korea, all of the papers in the topic have studied efficiency or productivity using positive data sets. However, variables such as net incomes and growth rates frequently include negative observations in bank data sets. This is the first paper to investigate the efficiency of bank operations in the presence of negative data in Korea.


2021 ◽  
Vol ahead-of-print (ahead-of-print) ◽  
Author(s):  
Samar Ali Shilbayeh ◽  
Sunil Vadera

Purpose This paper aims to describe the use of a meta-learning framework for recommending cost-sensitive classification methods with the aim of answering an important question that arises in machine learning, namely, “Among all the available classification algorithms, and in considering a specific type of data and cost, which is the best algorithm for my problem?” Design/methodology/approach This paper describes the use of a meta-learning framework for recommending cost-sensitive classification methods for the aim of answering an important question that arises in machine learning, namely, “Among all the available classification algorithms, and in considering a specific type of data and cost, which is the best algorithm for my problem?” The framework is based on the idea of applying machine learning techniques to discover knowledge about the performance of different machine learning algorithms. It includes components that repeatedly apply different classification methods on data sets and measures their performance. The characteristics of the data sets, combined with the algorithms and the performance provide the training examples. A decision tree algorithm is applied to the training examples to induce the knowledge, which can then be used to recommend algorithms for new data sets. The paper makes a contribution to both meta-learning and cost-sensitive machine learning approaches. Those both fields are not new, however, building a recommender that recommends the optimal case-sensitive approach for a given data problem is the contribution. The proposed solution is implemented in WEKA and evaluated by applying it on different data sets and comparing the results with existing studies available in the literature. The results show that a developed meta-learning solution produces better results than METAL, a well-known meta-learning system. The developed solution takes the misclassification cost into consideration during the learning process, which is not available in the compared project. Findings The proposed solution is implemented in WEKA and evaluated by applying it to different data sets and comparing the results with existing studies available in the literature. The results show that a developed meta-learning solution produces better results than METAL, a well-known meta-learning system. Originality/value The paper presents a major piece of new information in writing for the first time. Meta-learning work has been done before but this paper presents a new meta-learning framework that is costs sensitive.


2021 ◽  
Vol ahead-of-print (ahead-of-print) ◽  
Author(s):  
Hendri Murfi

PurposeThe aim of this research is to develop an eigenspace-based fuzzy c-means method for scalable topic detection.Design/methodology/approachThe eigenspace-based fuzzy c-means (EFCM) combines representation learning and clustering. The textual data are transformed into a lower-dimensional eigenspace using truncated singular value decomposition. Fuzzy c-means is performed on the eigenspace to identify the centroids of each cluster. The topics are provided by transforming back the centroids into the nonnegative subspace of the original space. In this paper, we extend the EFCM method for scalability by using the two approaches, i.e. single-pass and online. We call the developed topic detection methods as oEFCM and spEFCM.FindingsOur simulation shows that both oEFCM and spEFCM methods provide faster running times than EFCM for data sets that do not fit in memory. However, there is a decrease in the average coherence score. For both data sets that fit and do not fit into memory, the oEFCM method provides a tradeoff between running time and coherence score, which is better than spEFCM.Originality/valueThis research produces a scalable topic detection method. Besides this scalability capability, the developed method also provides a faster running time for the data set that fits in memory.


Sign in / Sign up

Export Citation Format

Share Document