missing values
Recently Published Documents





2022 ◽  
Vol 9 (3) ◽  
pp. 0-0

Missing data is a universal complication across most research fields and introduces uncertainty into data analysis. It can arise for many reasons, such as mishandled samples, failure to collect an observation, measurement errors, deletion of aberrant values, or simply a shortfall in the study. The field of nutrition is no exception to the problem of missing data. Most often, the problem is handled by computing means or medians from the existing dataset, an approach that needs improvement. The paper proposes a hybrid scheme combining MICE and ANN, called extended ANN, to find and analyze missing values and impute them in a given dataset. The proposed mechanism analyzes the blank entries and fills them by examining their neighboring records, with the aim of improving the accuracy of the dataset. To validate the proposed scheme, the extended ANN is compared against various recent algorithms in terms of both efficiency and accuracy of the results.

Jesmeen Mohd Zebaral Hoque ◽  
Jakir Hossen ◽  
Shohel Sayeed ◽  
Chy. Mohammed Tawsif K. ◽  
Jaya Ganesan ◽  

Recently, the healthcare industry has started generating large volumes of data. If hospitals could make use of these data, they could predict outcomes more easily and provide better treatment at an early stage and at low cost. Here, data analytics (DA) was used to support correct decisions through proper analysis and prediction. However, inappropriate data may lead to flawed analysis and thus to unacceptable conclusions. Hence, transforming the improper data in a data set into useful data is essential. Machine learning (ML) techniques were used to overcome the issues caused by incomplete data. A new architecture, automatic missing value imputation (AMVI), was developed to predict missing values in the dataset, including data sampling and feature selection. Four prediction models (logistic regression, support vector machine (SVM), AdaBoost, and random forest) were selected from well-known classification algorithms. The performance of the complete AMVI architecture was evaluated using a structured data set obtained from the UCI repository. An accuracy of around 90% was achieved. Cross-validation also confirmed that the trained ML model is suitable and not over-fitted. The trained model is built from the dataset alone and does not depend on a specific environment. It trains on the available data and selects the best-performing model.

2022 ◽  
Vol 16 (4) ◽  
pp. 1-24
Kui Yu ◽  
Yajing Yang ◽  
Wei Ding

Causal feature selection aims at learning the Markov blanket (MB) of a class variable for feature selection. The MB of a class variable implies the local causal structure among the class variable and its MB, and all other features are probabilistically independent of the class variable conditioned on its MB. This enables causal feature selection to identify potential causal features for building robust and physically meaningful prediction models. Missing data, ubiquitous in many real-world applications, remains an open research problem in causal feature selection due to its technical complexity. In this article, we discuss a novel multiple imputation MB (MimMB) framework for causal feature selection with missing data. MimMB integrates Data Imputation with MB Learning in a unified framework to enable the two key components to engage with each other. MB Learning enables Data Imputation in a potentially causal feature space for achieving accurate data imputation, while accurate Data Imputation in turn helps MB Learning identify a reliable MB of the class variable. Then, we further design an enhanced kNN estimator for imputing missing values and instantiate the MimMB. In our comprehensive experimental evaluation, our new approach can effectively learn the MB of a given variable in a Bayesian network and outperforms rival algorithms on synthetic and real-world datasets.
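To make the idea of nearest-neighbour imputation concrete, here is a minimal sketch of plain kNN imputation in pure Python. It is not the enhanced kNN estimator or the MimMB framework from the abstract above; the function name `knn_impute`, the use of `None` to mark missing entries, and the rescaled Euclidean distance are all assumptions made for illustration.

```python
import math


def knn_impute(rows, k=2):
    """Fill None entries with the mean of that feature over the k nearest rows.

    Distance is Euclidean over the features both rows have observed,
    rescaled by the share of features compared so that rows with few
    shared features are not favoured. (Illustrative choice, not MimMB.)
    """
    n_features = len(rows[0])

    def distance(a, b):
        shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        if not shared:
            return math.inf  # nothing to compare on
        sq = sum((x - y) ** 2 for x, y in shared)
        return math.sqrt(sq * n_features / len(shared))

    imputed = [list(r) for r in rows]  # copy; the input is left untouched
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Candidate donors: other rows that observed feature j.
            donors = [(distance(row, other), other[j])
                      for m, other in enumerate(rows)
                      if m != i and other[j] is not None]
            donors = [d for d in donors if d[0] != math.inf]
            donors.sort(key=lambda d: d[0])
            nearest = donors[:k]
            if nearest:  # leave the gap if no comparable donor exists
                imputed[i][j] = sum(v for _, v in nearest) / len(nearest)
    return imputed


rows = [
    [1.0, 2.0, None],
    [1.1, 2.1, 3.0],
    [0.9, 1.9, 3.2],
    [8.0, 9.0, 10.0],
]
filled = knn_impute(rows, k=2)
```

In this toy input, the missing third feature of the first row is filled from the two rows closest to it on the first two features, and the distant fourth row is ignored.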

2022 ◽  
Vol 16 (2) ◽  
pp. 1-28
Liang Zhao ◽  
Yuyang Gao ◽  
Jieping Ye ◽  
Feng Chen ◽  
Yanfang Ye ◽  

The forecasting of significant societal events such as civil unrest and economic crisis is an interesting and challenging problem that requires timeliness, precision, and comprehensiveness. Significant societal events are influenced and indicated jointly by multiple aspects of a society, including its economics, politics, and culture. Traditional forecasting methods based on a single data source find it hard to cover all these aspects comprehensively, thus limiting model performance. Multi-source event forecasting has proven promising but still suffers from several challenges, including (1) geographical hierarchies in multi-source data features, (2) hierarchical missing values, (3) characterization of structured feature sparsity, and (4) difficulty in updating the model online with incomplete multiple sources. This article proposes a novel feature learning model that concurrently addresses all the above challenges. Specifically, given multi-source data from different geographical levels, we design a new forecasting model by characterizing the lower-level features' dependence on higher-level features. To handle the correlations among structured feature sets and deal with missing values among the coupled features, we propose a novel feature learning model based on an Nth-order strong hierarchy and fused-overlapping group Lasso. An efficient algorithm is developed to optimize model parameters and ensure global optima. More importantly, to enable real-time model updates, an online learning algorithm is formulated and active set techniques are leveraged to handle new patterns of missing features as they appear in real time. Extensive experiments on 10 datasets in different domains demonstrate the effectiveness and efficiency of the proposed models.

2022 ◽  
Vol 22 (1) ◽  
Huimin Wang ◽  
Jianxiang Tang ◽  
Mengyao Wu ◽  
Xiaoyu Wang ◽  
Tao Zhang

Background: Medical data often contain many missing values, which directly affect the accuracy of clinical decision making. Discharge assessment is an important part of clinical decision making. Taking the discharge assessment of patients with spontaneous supratentorial intracerebral hemorrhage as an example, this study adopted missing data processing evaluation criteria better suited to clinical decision making. It aimed to systematically explore the performance and applicability of single machine learning algorithms and ensemble learning (EL) under different missing data scenarios, and whether they had advantages over traditional methods, so as to provide a basis and reference for selecting a suitable missing data processing method in practical clinical decision making. Methods: The whole process consisted of four main steps: (1) Based on the original complete data set, missing data were generated by simulation under different missing scenarios (missing mechanisms, missing proportions and ratios of missing proportions of each group). (2) Machine learning and traditional methods (eight methods in total) were applied to impute missing values. (3) The performance of the imputation techniques was evaluated and compared by estimating the sensitivity, AUC and Kappa values of prediction models. (4) Statistical tests were used to evaluate whether the observed performance differences were statistically significant. Results: The performance of the missing data processing methods differed to some extent across missing scenarios. On the whole, machine learning had better imputation performance than traditional methods, especially in scenarios with high missing proportions. Compared with single machine learning algorithms, EL performed best, followed by neural networks. Meanwhile, EL was most suitable for imputation under the MAR mechanism (ratio of missing proportions 2:1), where its average sensitivity, AUC and Kappa values reached 0.908, 0.924 and 0.596 respectively. Conclusions: In clinical decision making, the characteristics of missing data should be actively explored before formulating missing data processing strategies. The outstanding imputation performance of machine learning methods, especially EL, sheds light on the development of missing data processing technology and provides methodological support for clinical decision making in the presence of incomplete data.
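Step (1) of the methods, generating missing data from a complete data set under different mechanisms, can be sketched as follows. This is a minimal illustration and not the study's simulation protocol: the function `make_missing`, the column names, and the 2:1 weighting between the two strata in the MAR branch are hypothetical choices. Under MCAR every row is equally likely to lose its value; under MAR the probability depends on another, fully observed column.

```python
import random


def make_missing(records, target, rate, mechanism="MCAR", driver=None, seed=0):
    """Return a copy of records with `target` set to None in some rows.

    MCAR: every row has the same probability `rate` of losing the value.
    MAR:  the probability depends on an observed column `driver`; rows whose
          driver value is above the median are twice as likely to lose the
          value (an illustrative 2:1 weighting), while the overall expected
          missing rate stays near `rate`.
    """
    rng = random.Random(seed)
    out = [dict(r) for r in records]  # copy; the input is left untouched
    if mechanism == "MCAR":
        probs = [rate] * len(out)
    elif mechanism == "MAR":
        values = sorted(r[driver] for r in out)
        median = values[len(values) // 2]
        high = [r[driver] > median for r in out]
        n_high = sum(high)
        n_low = len(out) - n_high
        # Solve n_high * 2p + n_low * p = rate * n for the base probability p.
        base = rate * len(out) / (2 * n_high + n_low)
        probs = [min(1.0, 2 * base) if h else base for h in high]
    else:
        raise ValueError("mechanism must be 'MCAR' or 'MAR'")
    for row, p in zip(out, probs):
        if rng.random() < p:
            row[target] = None
    return out


# Hypothetical complete data set: "age" is always observed, "score" may go missing.
records = [{"age": i, "score": float(i)} for i in range(1000)]
mcar = make_missing(records, "score", 0.3, "MCAR")
mar = make_missing(records, "score", 0.3, "MAR", driver="age")
```

Because the complete data are kept, any imputation method can then be scored against the true values that were removed.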

2022 ◽  
pp. 016555152110695
Ahmed Hamed ◽  
Mohamed Tahoun ◽  
Hamed Nassar

The original K-nearest neighbour (KNN) algorithm was meant to classify homogeneous complete data, that is, data with only numerical features whose values exist completely. Thus, it faces problems when used with heterogeneous incomplete (HI) data, which also has categorical features and is plagued with missing values. Many solutions have been proposed over the years but most have pitfalls. For example, some solve heterogeneity by converting categorical features into numerical ones, inflicting structural damage. Others solve incompleteness by imputation or elimination, causing semantic disturbance. Almost all use the same K for all query objects, leading to misclassification. In the present work, we introduce KNNHI, a KNN-based algorithm for HI data classification that avoids all these pitfalls. Leveraging rough set theory, KNNHI preserves both categorical and numerical features, leaves missing values untouched and uses a different K for each query. The end result is an accurate classifier, as demonstrated by extensive experimentation on nine datasets, mostly from the University of California Irvine repository, using 10-fold cross-validation. We show that KNNHI outperforms six recently published KNN-based algorithms in terms of precision, recall, accuracy and F-Score. In addition to its function as a mighty classifier, KNNHI can also serve as a K calculator, helping KNN-based algorithms that use a single K value for all queries to find the best such value. Indeed, we show how four such algorithms improve their performance using the K obtained by KNNHI. Finally, KNNHI exhibits impressive resilience to the degree of incompleteness, the degree of heterogeneity and the metric used to measure distance.
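To illustrate what comparing heterogeneous incomplete records involves, the sketch below computes a Gower-style dissimilarity, a widely used alternative that handles mixed feature types and skips missing entries without imputing them. It is not KNNHI and does not use rough set theory; the function name `gower_distance` and the `spans` argument are assumptions made for this example.

```python
def gower_distance(a, b, spans):
    """Gower-style dissimilarity between two records with mixed feature types.

    `spans[j]` is the observed range (max - min) of numeric feature j, or
    None when feature j is categorical. Features that are missing (None) in
    either record are skipped, so nothing is imputed.
    """
    total, compared = 0.0, 0
    for x, y, span in zip(a, b, spans):
        if x is None or y is None:
            continue  # leave missing values untouched
        if span is None:
            # Categorical feature: simple match / mismatch.
            total += 0.0 if x == y else 1.0
        else:
            # Numeric feature: absolute difference scaled to [0, 1].
            total += abs(x - y) / span if span > 0 else 0.0
        compared += 1
    # No shared features means the records cannot be compared.
    return total / compared if compared else float("inf")


# Hypothetical records: (age, colour, dose); dose is missing in the first one.
a = (30, "red", None)
b = (40, "blue", 5.0)
d = gower_distance(a, b, spans=(50, None, 10))
```

A KNN classifier built on such a distance keeps categorical features as they are, but it still uses whatever single K the caller picks, which is one of the limitations the abstract attributes to earlier approaches.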

2022 ◽  
Vol 1049 ◽  
pp. 75-84
Sergei Kurashkin ◽  
Daria Rogova ◽  
Alexander Lavrishchev ◽  
Vadim Sergeevich Tynchenko ◽  
Alexander Murygin

The article addresses the problem of obtaining the dependence of the product strength parameter on welding time, welding temperature and pressure from mechanical tests and leak tests. The work is relevant because field experiments to identify these dependencies are difficult to carry out; in particular, diffusion welding is time-consuming and costly. Regression analysis based on a second-order non-compositional design for three factors makes it possible to recover the dependence of the product strength parameter on the duration of welding, the temperature at which diffusion welding was or could be carried out, and the applied pressure at which mechanical tests were carried out. In the current study, a second-order non-compositional design for three factors was used, which makes it possible to recover the missing values of the product strength. The aim of the research is to improve the quality of mathematical modeling. The proposed approach makes it possible to obtain the strength distribution function depending on time, temperature and pressure, using the example of a product made of VT14 titanium alloy and 12X18H10T stainless steel. This in turn makes it possible to obtain optimal parameters for the diffusion welding mode and to improve the quality of the resulting products.

Motohide Miyahara

In a population-based developmental screening program, healthcare providers face a practical problem with respect to the formation of groups to efficiently address the needs of the parents whose children are screened positive. This small-scale pilot study explored the usefulness of cluster analysis to form type-specific support groups based on the Family Needs Survey (FNS) scores. All parents (N = 68), who accompanied their 5-year-old children to appointments for formal assessment and diagnostic interviews in the second phase of screening, completed the FNS as part of a developmental questionnaire package. The FNS scores of a full dataset (N = 55) without missing values were subjected to hierarchical and K-means cluster analyses. As the final solution, hierarchical clustering with a three-cluster solution was selected over K-means clustering because the hierarchical clustering solution produced three clusters that were similar in size and meaningful in each profile pattern: Cluster 1—high need for information and professional support (N = 20); cluster 2—moderate need for information support (N = 16); cluster 3—high need for information and moderate need for other support (N = 19). The range of cluster sizes was appropriate for managing and providing tailored services and support for each group. Thus, this pilot study demonstrated the utility of cluster analysis to classify parents into support groups, according to their needs.

2022 ◽  
Vol 13 (1) ◽  
Sofani Tafesse Gebreyesus ◽  
Asad Ali Siyal ◽  
Reta Birhanu Kitata ◽  
Eric Sheng-Wen Chen ◽  
Bayarmaa Enkhbayar ◽  

Single-cell proteomics can reveal cellular phenotypic heterogeneity and cell-specific functional networks underlying biological processes. Here, we present a streamlined workflow combining microfluidic chips for all-in-one proteomic sample preparation and data-independent acquisition (DIA) mass spectrometry (MS) for proteomic analysis down to the single-cell level. The proteomics chips enable multiplexed and automated cell isolation/counting/imaging and sample processing in a single device. Combining chip-based sample handling with DIA-MS using project-specific mass spectral libraries, we profile on average ~1,500 protein groups across 20 single mammalian cells. Applying the chip-DIA workflow to profile the proteomes of adherent and non-adherent malignant cells, we cover a dynamic range of 5 orders of magnitude with good reproducibility and <16% missing values between runs. Taken together, the chip-DIA workflow offers all-in-one cell characterization, analytical sensitivity and robustness, and the option to add additional functionalities in the future, thus providing a basis for advanced single-cell proteomics applications.

2022 ◽  
Alexandre Perez-Lebel ◽  
Gaël Varoquaux ◽  
Marine Le Morvan ◽  
Julie Josse ◽  
Jean-Baptiste Poline

Background: As databases grow larger, it becomes harder to fully control their collection, and they frequently come with missing values: incomplete observations. These large databases are well suited to train machine-learning models, for instance for forecasting or to extract biomarkers in biomedical settings. Such predictive approaches can use discriminative, rather than generative, modeling, and thus open the door to new missing-values strategies. Yet existing empirical evaluations of strategies to handle missing values have focused on inferential statistics. Results: Here we conduct a systematic benchmark of missing-values strategies in predictive models with a focus on large health databases: four electronic health record datasets, a population brain imaging one, a health survey and two intensive care ones. Using gradient-boosted trees, we compare native support for missing values with simple and state-of-the-art imputation prior to learning. We investigate prediction accuracy and computational time. For prediction after imputation, we find that adding an indicator to express which values have been imputed is important, suggesting that the data are missing not at random. Elaborate missing-values imputation can improve prediction compared to simple strategies but requires longer computational time on large data. Learning trees that model missing values (with the missing incorporated attribute strategy) leads to robust, fast, and well-performing predictive modeling. Conclusions: Native support for missing values in supervised machine learning predicts better than state-of-the-art imputation with much less computational cost. When using imputation, it is important to add indicator columns expressing which values have been imputed.
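The recommendation to add indicator columns when imputing can be sketched in a few lines of pure Python. This is an illustrative example only, not the benchmark's code: the function `impute_with_indicator` and the use of `None` for missing entries are assumptions. Each column is mean-imputed, and a 0/1 flag per column records which values were filled in, so a downstream model can still see the missingness pattern.

```python
def impute_with_indicator(rows):
    """Mean-impute each column and append one 0/1 indicator per column.

    The output row is [filled values..., indicators...], where indicator j
    is 1 if feature j was missing (None) in the input row.
    """
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [r[j] for r in rows if r[j] is not None]
        # Fall back to 0.0 for a column with no observed values at all.
        means.append(sum(observed) / len(observed) if observed else 0.0)
    out = []
    for r in rows:
        filled = [means[j] if r[j] is None else r[j] for j in range(n_cols)]
        flags = [1 if r[j] is None else 0 for j in range(n_cols)]
        out.append(filled + flags)
    return out


rows = [
    [1.0, None],
    [3.0, 4.0],
    [None, 6.0],
]
augmented = impute_with_indicator(rows)
# augmented[0] is [1.0, 5.0, 0, 1]: the second feature was imputed with the
# column mean 5.0, and its indicator is set to 1.
```

In a real pipeline the column means would be computed on the training split only and reused on the test split. For readers using scikit-learn, `SimpleImputer(add_indicator=True)` provides the same behaviour.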
