Method to Design Pattern Classification Model with Block Missing Training Data

Author(s):  
Won-Chol Yang

Missing data is a common problem in many real-world applications of pattern classification. Methods for pattern classification with missing data fall into four groups: (a) deletion of incomplete samples and design of the classifier using only the complete portion of the data, (b) imputation of the missing values and training of the classifier on the edited set, (c) model-based procedures, and (d) machine learning procedures. These methods can be useful when the amount of missing data is small, but they may be unsuitable when it is relatively large. We propose a method to design a pattern classification model from block missing training data. First, we separate complete submatrices from the block missing training data. Second, we design a classification submodel from each submatrix. Third, we build the final classification model as a linear combination of these submodels. We evaluated the classification accuracy rate and data usage rate of models designed with the proposed method in simulation experiments on several datasets, and verified that the method is effective in terms of both measures.
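As a rough illustration of the three steps described above (and not the authors' exact formulation), the following Python sketch separates complete submatrices from a block-missing matrix, fits a logistic-regression submodel on each, and combines the submodels' class probabilities linearly, weighting each by the number of samples it was trained on; the helper names and the weighting scheme are illustrative assumptions.

```python
# Sketch only: logistic-regression submodels and sample-count weights are
# illustrative choices, not the published method.
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_submodels(X, y, feature_blocks):
    """Fit one classifier per complete submatrix of the block-missing data."""
    submodels = []
    for cols in feature_blocks:
        # Rows that are fully observed on this block of features.
        rows = ~np.isnan(X[:, cols]).any(axis=1)
        clf = LogisticRegression().fit(X[np.ix_(rows, cols)], y[rows])
        submodels.append((cols, clf, rows.sum()))
    return submodels

def combine(submodels, x_new):
    """Linear combination of submodel class probabilities for one sample."""
    total, score = 0.0, None
    for cols, clf, n in submodels:
        if np.isnan(x_new[cols]).any():
            continue  # skip submodels whose features are missing for this sample
        p = clf.predict_proba(x_new[cols].reshape(1, -1))[0]
        score = p * n if score is None else score + p * n
        total += n
    return score / total  # weighted average of class probabilities

# Example: two feature blocks observed on different subsets of samples.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + X[:, 2] > 0).astype(int)
X[:100, 2:] = np.nan          # block missing in the last two features
X[100:150, :2] = np.nan       # block missing in the first two features
models = fit_submodels(X, y, feature_blocks=[np.array([0, 1]), np.array([2, 3])])
print(combine(models, np.array([0.5, -0.2, 1.0, 0.3])))
```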

2012 ◽  
Vol 8 (1) ◽  
pp. 1-23 ◽  
Author(s):  
Philicity K. Williams ◽  
Caio V. Soares ◽  
Juan E. Gilbert

Predictive models such as rule-based classifiers often have difficulty with incomplete data (e.g., erroneous or missing values). This work presents a technique that uses divisive data clustering to reduce the effect of missing data on the performance of rule-based classifiers. The Clustering Rule based Approach (CRA) clusters the original training data and builds a separate rule-based model on each cluster's data. The individual models are combined into a larger model and evaluated against test data. The effect of missing attribute information is evaluated for ordered and unordered rule sets, and experiments show that the collective model (CRA) is less affected than the traditional model when the test data contain missing attribute values, making it more resilient and robust to missing data.
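A minimal sketch of the cluster-then-model idea, under loose assumptions: KMeans stands in for the divisive clustering step, a shallow decision tree stands in for the rule-based learner, and routing a test sample to the model of its nearest cluster is one simple way of combining the per-cluster models. None of these choices is claimed to match CRA's actual components.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier

class ClusterWiseModel:
    """Cluster the training data and fit one simple model per cluster."""
    def __init__(self, n_clusters=3):
        self.km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)

    def fit(self, X, y):
        labels = self.km.fit_predict(X)
        # One rule-like model (here: a shallow tree) per cluster.
        self.models_ = {
            c: DecisionTreeClassifier(max_depth=3).fit(X[labels == c], y[labels == c])
            for c in np.unique(labels)
        }
        return self

    def predict(self, X):
        labels = self.km.predict(X)
        return np.array([self.models_[c].predict(x.reshape(1, -1))[0]
                         for c, x in zip(labels, X)])

# Usage on toy data:
rng = np.random.default_rng(7)
X = rng.normal(size=(300, 4))
y = (X[:, 0] + X[:, 3] > 0).astype(int)
print(ClusterWiseModel().fit(X, y).predict(X[:5]))
```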


Author(s):  
Fatma Karem ◽  
Mounir Dhibi ◽  
Arnaud Martin ◽  
Med Salim Bouhlel

This paper reports on an investigation of classification techniques for noisy and uncertain data. Classification is not an easy task, and discovering knowledge from uncertain data is a significant challenge, for several reasons. Often there is no good or large learning database for supervised classification. When the training data contain noise or missing values, classification accuracy is affected dramatically. Extracting groups from the data is also difficult, because the groups overlap and are not well separated. A further problem is the uncertainty introduced by measuring devices. As a consequence, the classification model is not robust enough to classify new objects. In this work, we present a novel classification algorithm that addresses these problems. The main idea is to use belief function theory to combine classification and clustering, since this theory handles the imprecision and uncertainty linked to classification well. Experimental results show that our approach can significantly improve classification quality on generic databases.
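The core operation of belief function (Dempster-Shafer) theory used for this kind of combination is Dempster's rule. The sketch below combines two made-up mass functions over a two-class frame {a, b}, one standing for classifier evidence and one for clustering evidence; the masses are invented for the example and do not come from the paper.

```python
# Dempster's rule of combination on two mass functions over frame {a, b}.
from itertools import product

def dempster(m1, m2):
    """Combine two mass functions defined on frozensets of a common frame."""
    combined, conflict = {}, 0.0
    for (s1, v1), (s2, v2) in product(m1.items(), m2.items()):
        inter = s1 & s2
        if inter:
            combined[inter] = combined.get(inter, 0.0) + v1 * v2
        else:
            conflict += v1 * v2
    # Normalise by the non-conflicting mass.
    return {s: v / (1.0 - conflict) for s, v in combined.items()}

A, B, AB = frozenset("a"), frozenset("b"), frozenset("ab")
m_classifier = {A: 0.6, B: 0.1, AB: 0.3}   # evidence from the classifier
m_clustering = {A: 0.5, B: 0.2, AB: 0.3}   # evidence from the clustering
print(dempster(m_classifier, m_clustering))
```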


Marketing ZFP ◽  
2019 ◽  
Vol 41 (4) ◽  
pp. 21-32
Author(s):  
Dirk Temme ◽  
Sarah Jensen

Missing values are ubiquitous in empirical marketing research. If missing data are not dealt with properly, this can lead to a loss of statistical power and distorted parameter estimates. While traditional approaches for handling missing data (e.g., listwise deletion) are still widely used, researchers can nowadays choose among various advanced techniques such as multiple imputation or full-information maximum likelihood estimation. Thanks to the available software, using these modern missing data methods no longer poses a major obstacle. Still, their application requires a sound understanding of their prerequisites and limitations, as well as a deeper understanding of the processes that have led to missing values in an empirical study. This article is Part 1; it first introduces Rubin's classical definition of missing data mechanisms and an alternative, variable-based taxonomy that provides a graphical representation. Second, it presents a selection of visualization tools, available in different R packages, for describing and exploring missing data structures.
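The article itself works with R packages; purely as an illustration of Rubin's three mechanisms, the following Python sketch generates MCAR, MAR and MNAR missingness on a toy two-variable dataset and shows how the observed mean of the incomplete variable behaves under each mechanism. The variables and missingness models are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000
x = rng.normal(size=n)             # fully observed covariate
y = 2 * x + rng.normal(size=n)     # variable that will receive missing values

# MCAR: missingness independent of both x and y.
mcar = rng.random(n) < 0.3
# MAR: missingness depends only on the observed covariate x.
mar = rng.random(n) < 1 / (1 + np.exp(-x))
# MNAR: missingness depends on the (unobserved) value of y itself.
mnar = rng.random(n) < 1 / (1 + np.exp(-y))

for name, mask in [("MCAR", mcar), ("MAR", mar), ("MNAR", mnar)]:
    print(name, "share missing:", mask.mean().round(2),
          "mean of observed y:", y[~mask].mean().round(2))
```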


2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Rahi Jain ◽  
Wei Xu

Abstract
Background: Developing statistical and machine learning methods for studies with missing information is a ubiquitous challenge in real-world biological research. Strategies in the literature rely either on removing samples with missing values, as in complete case analysis (CCA), or on imputing the missing information, as in predictive mean matching (PMM), for example via MICE. Limitations of these strategies include information loss and the question of how close the imputed values are to the true, unobserved values. Further, in scenarios with piecemeal medical data, these strategies must wait for the data collection process to finish before a complete dataset is available for statistical modelling.
Method and results: This study proposes a dynamic model updating (DMU) approach, a different strategy for developing statistical models with missing data. DMU uses only the information available in the dataset to prepare the statistical models. It uses hierarchical clustering to segment the original dataset into small complete datasets and then fits a Bayesian regression on each of them; predictor estimates are updated using the posterior estimates from each dataset. The performance of DMU is evaluated on both simulated data and real studies and shows results that are better than, or on par with, approaches such as CCA and PMM.
Conclusion: The DMU approach provides an alternative to the existing strategies of information elimination and imputation when processing datasets with missing values. While the study applied the approach to continuous cross-sectional data, it can also be applied to longitudinal, categorical and time-to-event biological data.
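One simplified reading of the approach, as a sketch only: samples are grouped by their missingness pattern with hierarchical clustering, a conjugate Bayesian linear regression is fitted on the complete cases of each group, and the posterior from one group becomes the prior for the next. The known noise variance, the shared predictor set across groups, and the use of within-group complete cases are simplifying assumptions of this sketch, not details of the published DMU method.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def posterior_update(X, y, prior_mean, prior_prec, noise_var=1.0):
    """Conjugate update for Bayesian linear regression with known noise."""
    prec = prior_prec + X.T @ X / noise_var
    mean = np.linalg.solve(prec, prior_prec @ prior_mean + X.T @ y / noise_var)
    return mean, prec

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=300)
M = rng.random((300, 3)) < 0.2               # missingness indicators (True = missing)

# Group samples with similar missingness patterns.
groups = fcluster(linkage(M.astype(float), method="ward"), t=4, criterion="maxclust")

mean, prec = np.zeros(3), np.eye(3) * 1e-2   # weak prior on the coefficients
for g in np.unique(groups):
    # Complete cases within the group stand in for that group's complete block.
    rows = (groups == g) & ~M.any(axis=1)
    if rows.sum() > 0:
        mean, prec = posterior_update(X[rows], y[rows], mean, prec)
print("posterior mean of coefficients:", mean.round(2))
```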


Author(s):  
Ahmad R. Alsaber ◽  
Jiazhu Pan ◽  
Adeeba Al-Hurban 

In environmental research, missing data are often a challenge for statistical modeling. This paper addresses advanced techniques for dealing with missing values in an air quality data set using a multiple imputation (MI) approach. Missing data are introduced into the data set under the MCAR, MAR, and NMAR mechanisms, at five levels of missingness: 5%, 10%, 20%, 30%, and 40%. The imputation method used in this paper is missForest, an iterative imputation method based on the random forest approach. Air quality data sets were gathered from five monitoring stations in Kuwait and aggregated to a daily basis. A logarithm transformation was applied to all pollutant data in order to normalize their distributions and minimize skewness. We found high levels of missing values for NO2 (18.4%), CO (18.5%), PM10 (57.4%), SO2 (19.0%), and O3 (18.2%). Climatological data (i.e., air temperature, relative humidity, wind direction, and wind speed) were used as control variables for better estimation. The results show that the MAR mechanism yielded the lowest RMSE and MAE. We conclude that MI using the missForest approach estimates missing values with a high level of accuracy. MissForest had the lowest imputation error (RMSE and MAE) among the imputation methods compared and can therefore be considered appropriate for analyzing air quality data.
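missForest itself is an R package; a commonly used Python analogue is scikit-learn's IterativeImputer with a random forest regressor. The sketch below applies it to made-up pollutant columns with artificially removed cells and reports the error on those cells; the column relationships and the 20% missingness rate are illustrative assumptions only.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(3)
n = 500
# Columns stand in for log-transformed NO2, CO, O3 plus air temperature.
temp = rng.normal(25, 8, n)
no2 = 0.05 * temp + rng.normal(2.0, 0.4, n)
co = 0.5 * no2 + rng.normal(0.5, 0.3, n)
o3 = -0.03 * temp + rng.normal(3.0, 0.5, n)
X_true = np.column_stack([no2, co, o3, temp])

mask = rng.random(X_true.shape) < 0.2            # remove 20% of cells at random
X_miss = np.where(mask, np.nan, X_true)

# Random-forest-based iterative imputation, a missForest-like procedure.
imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=100, random_state=0),
    max_iter=10, random_state=0)
X_hat = imputer.fit_transform(X_miss)

rmse = np.sqrt(np.mean((X_hat[mask] - X_true[mask]) ** 2))
mae = np.mean(np.abs(X_hat[mask] - X_true[mask]))
print(f"RMSE={rmse:.3f}, MAE={mae:.3f} on the artificially removed cells")
```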


2020 ◽  
Vol 41 (Supplement_2) ◽  
Author(s):  
S Gao ◽  
D Stojanovski ◽  
A Parker ◽  
P Marques ◽  
S Heitner ◽  
...  

Abstract
Background: Correctly identifying the views acquired in a 2D echocardiographic examination is paramount to the post-processing and quantification steps performed in most clinical workflows. In many exams, particularly in stress echocardiography, microbubble contrast is used, which greatly affects the appearance of the cardiac views. Here we present a bespoke, fully automated convolutional neural network (CNN) which identifies apical 2-, 3-, and 4-chamber and short axis (SAX) views acquired with and without contrast. The CNN was tested on a completely independent, external dataset acquired in a different country from the data used to train the network.
Methods: Training data comprised 2D echocardiograms from 1014 subjects in a prospective multisite, multi-vendor UK trial, with more than 17,500 frames per view. Prior to view classification model training, images were processed using standard techniques to ensure homogeneous and normalised inputs to the training pipeline. A bespoke CNN was built with the minimum number of convolutional layers required, using batch normalisation and dropout to reduce overfitting. The data were split into 90% for model training (211,958 frames) and 10% for validation (23,946 frames), with frames from any given subject assigned entirely to either the training or the validation set. A separate trial dataset of 240 studies acquired in the USA was used as an independent test dataset (39,401 frames).
Results: Figure 1 shows the confusion matrices for the validation data (left) and the independent test data (right), with overall accuracies of 96% and 95%, respectively. The accuracy for the non-contrast cardiac views exceeded 99%, higher than that reported in other work. The combined datasets included images acquired across ultrasound manufacturers and models from 12 clinical sites.
Conclusion: We have developed a CNN capable of automatically and accurately identifying all relevant cardiac views used in "real world" echo exams, including views acquired with contrast. Use of the CNN in a routine clinical workflow could improve the efficiency of quantification steps performed after image acquisition. The model was tested on an independent dataset acquired in a different country from the training data and performed similarly, indicating its generalisability.
Figure 1. Confusion matrices.
Funding Acknowledgement: Type of funding source: private company. Main funding source(s): Ultromics Ltd.
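The exact architecture is not given in the abstract. The following PyTorch sketch only illustrates what a small CNN with batch normalisation and dropout for view classification could look like, assuming greyscale input frames and eight output classes (four views, each with and without contrast), which is one plausible reading of the label set rather than the authors' design.

```python
import torch
import torch.nn as nn

class ViewClassifier(nn.Module):
    """Small CNN with batch normalisation and dropout for view classification."""
    def __init__(self, n_classes=8):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.BatchNorm2d(16), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(), nn.Dropout(0.5), nn.Linear(64, n_classes))

    def forward(self, x):
        return self.classifier(self.features(x))

model = ViewClassifier()
dummy = torch.randn(8, 1, 128, 128)   # batch of 8 normalised greyscale frames
print(model(dummy).shape)             # torch.Size([8, 8])
```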


2021 ◽  
Vol 13 (12) ◽  
pp. 2301
Author(s):  
Zander Venter ◽  
Markus Sydenham

Land cover maps are important tools for quantifying the human footprint on the environment and facilitate reporting and accounting to international agreements addressing the Sustainable Development Goals. Widely used European land cover maps such as CORINE (Coordination of Information on the Environment) are produced at medium spatial resolutions (100 m) and rely on diverse data with complex workflows requiring significant institutional capacity. We present a 10 m resolution land cover map (ELC10) of Europe based on a satellite-driven machine learning workflow that is annually updatable. A random forest classification model was trained on 70K ground-truth points from the LUCAS (Land Use/Cover Area Frame Survey) dataset. Within the Google Earth Engine cloud computing environment, the ELC10 map can be generated from approx. 700 TB of Sentinel imagery within approx. 4 days from a single research user account. The map achieved an overall accuracy of 90% across eight land cover classes and could account for statistical unit land cover proportions within 3.9% (R2 = 0.83) of the actual value. These accuracies are higher than those of CORINE (100 m) and other 10 m land cover maps, including S2GLC and FROM-GLC10. Spectro-temporal metrics that capture the phenology of land cover classes were most important in producing high mapping accuracies. We found that the atmospheric correction of Sentinel-2 and the speckle filtering of Sentinel-1 imagery had a minimal effect on enhancing the classification accuracy (<1%). However, combining optical and radar imagery increased accuracy by 3% compared to Sentinel-2 alone and by 10% compared to Sentinel-1 alone. The addition of auxiliary data (terrain, climate and night-time lights) increased accuracy by an additional 2%. By using the centroid pixels from the LUCAS Copernicus module polygons we increased accuracy by <1%, revealing that random forests are robust against contaminated training data. Furthermore, the model requires very little training data to achieve moderate accuracies: the difference between 5K and 50K LUCAS points is only 3% (86 vs. 89%). This implies that significantly fewer resources are necessary for making in situ survey data (such as LUCAS) suitable for satellite-based land cover classification. At 10 m resolution, the ELC10 map can distinguish detailed landscape features like hedgerows and gardens, and therefore holds potential for areal statistics at the city borough level and for monitoring property-level environmental interventions (e.g., tree planting). Due to the reliance on purely satellite-based input data, the ELC10 map can be continuously updated independent of any country-specific geographic datasets.
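The published workflow runs in Google Earth Engine; as a purely conceptual stand-in, the sketch below trains a scikit-learn random forest on an invented table of spectro-temporal metrics (band percentiles, radar statistics, terrain) at hypothetical ground-truth points and inspects the variable importances. All feature names, the synthetic labels, and the accuracy figures it prints are illustrative, not results from ELC10.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Invented feature table standing in for per-point spectro-temporal metrics.
rng = np.random.default_rng(4)
n = 5000
X = pd.DataFrame({
    "B4_p50": rng.normal(size=n), "B8_p50": rng.normal(size=n),
    "NDVI_p90": rng.normal(size=n), "VV_mean": rng.normal(size=n),
    "elevation": rng.normal(size=n),
})
# Synthetic eight-class labels loosely tied to two of the features.
y = pd.qcut(X["NDVI_p90"] + 0.5 * X["B8_p50"] + rng.normal(scale=0.5, size=n),
            q=8, labels=False)

clf = RandomForestClassifier(n_estimators=300, random_state=0)
print("CV accuracy on synthetic points:", cross_val_score(clf, X, y, cv=3).mean().round(2))
# Variable importances indicate which spectro-temporal metrics drive the map.
clf.fit(X, y)
print(dict(zip(X.columns, clf.feature_importances_.round(3))))
```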


Author(s):  
Maria Lucia Parrella ◽  
Giuseppina Albano ◽  
Cira Perna ◽  
Michele La Rocca

Abstract
Missing data reconstruction is a critical step in the analysis and mining of spatio-temporal data. However, few studies comprehensively consider missing data patterns, sample selection and spatio-temporal relationships. To take into account the uncertainty in the point forecast, prediction intervals may be of interest; in particular, for (possibly long) missing sequences of consecutive time points, joint prediction regions are desirable. In this paper we propose a bootstrap resampling scheme to construct joint prediction regions that approximately contain the missing paths of the time components in a spatio-temporal framework, with global probability 1 − α. In many applications, requiring coverage of the whole missing sample path may be too restrictive. To obtain more informative inference, we also derive smaller joint prediction regions that are only required to contain all but a small number k of the elements of the missing paths, with probability 1 − α. A simulation experiment is performed to validate the empirical performance of the proposed joint bootstrap prediction regions and to compare them with alternative procedures based on a simple nominal coverage correction, loosely inspired by the Bonferroni approach, which are expected to work well in standard scenarios.
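A generic illustration of the calibration idea, not the authors' spatio-temporal procedure: the sketch below bootstraps residuals of a fitted AR(1) model to simulate paths over a gap of k consecutive points, scales the deviations from the point forecast, and picks one multiplier so that the whole path lies inside the band with probability 1 − α, then compares the band widths with a Bonferroni-style pointwise correction. The AR(1) model and all settings are assumptions of the sketch.

```python
import numpy as np

rng = np.random.default_rng(5)
n, k, alpha, B = 300, 5, 0.05, 2000

# Simulate an AR(1) series and fit it by least squares.
eps = rng.normal(size=n)
x = np.zeros(n)
for t in range(1, n):
    x[t] = 0.7 * x[t - 1] + eps[t]
phi = np.sum(x[1:] * x[:-1]) / np.sum(x[:-1] ** 2)
resid = x[1:] - phi * x[:-1]

# Bootstrap future paths of length k starting from the last observed value.
paths = np.empty((B, k))
for b in range(B):
    cur = x[-1]
    for j in range(k):
        cur = phi * cur + rng.choice(resid)
        paths[b, j] = cur

point = np.array([phi ** (j + 1) * x[-1] for j in range(k)])  # point forecasts
dev = np.abs(paths - point)
scale = dev.std(axis=0)

# Joint region: one multiplier c so the whole path is covered with prob 1 - alpha.
c = np.quantile((dev / scale).max(axis=1), 1 - alpha)
joint_lo, joint_hi = point - c * scale, point + c * scale

# Bonferroni-style alternative: pointwise (1 - alpha/k) bands.
q = np.quantile(dev, 1 - alpha / k, axis=0)
bonf_lo, bonf_hi = point - q, point + q
print("joint width:", (joint_hi - joint_lo).round(2))
print("Bonferroni width:", (bonf_hi - bonf_lo).round(2))
```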


Agriculture ◽  
2021 ◽  
Vol 11 (8) ◽  
pp. 727
Author(s):  
Yingpeng Fu ◽  
Hongjian Liao ◽  
Longlong Lv

UNSODA, a free international soil database, is very popular and has been used in many fields. However, missing soil property data have limited the utility of this dataset, especially for data-driven models. Here, three machine learning-based methods, i.e., random forest (RF) regression, support vector regression (SVR), and artificial neural network (ANN) regression, and two statistics-based methods, i.e., mean imputation and multiple imputation (MI), were used to impute the missing soil property data, including pH, saturated hydraulic conductivity (SHC), organic matter content (OMC), porosity (PO), and particle density (PD). The missing upper depths (DU) and lower depths (DL) of the sampling locations were also imputed. Before imputing the missing values in UNSODA, a missing value simulation was performed and evaluated quantitatively. Next, nonparametric tests and multiple linear regression were performed to qualitatively evaluate the reliability of the five imputation methods. The results showed that the RMSEs and MAEs of all features fluctuated within acceptable ranges. RF imputation and MI produced the lowest RMSEs and MAEs; both methods are good at explaining the variability of the data. The standard error, coefficient of variation, and standard deviation decreased after imputation, and no significant differences were found between the data before and after imputation. Together, DU, pH, SHC, OMC, PO, and PD explained 91.0%, 63.9%, 88.5%, 59.4%, and 90.2% of the variation in BD using RF, SVR, ANN, mean, and MI, respectively; this value was 99.8% when missing values were discarded. This study suggests that the RF and MI methods may be better suited for imputing the missing data in UNSODA.
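The evaluation protocol described above, masking known cells and scoring imputations by RMSE and MAE, can be sketched as follows. The imputers compared here (column mean, k-nearest neighbours, and random-forest-based iterative imputation) are stand-ins for the five methods used in the paper, and the data are synthetic.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import SimpleImputer, KNNImputer, IterativeImputer
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(6)
n = 400
# Synthetic, correlated "soil property" columns.
latent = rng.normal(size=(n, 2))
X_true = np.column_stack([latent @ w + rng.normal(scale=0.3, size=n)
                          for w in ([1, 0.5], [0.8, -0.6], [0.2, 1.0], [1.2, 0.1])])

mask = rng.random(X_true.shape) < 0.15           # simulate 15% missing cells
X_miss = np.where(mask, np.nan, X_true)

imputers = {
    "mean": SimpleImputer(strategy="mean"),
    "kNN": KNNImputer(n_neighbors=5),
    "RF": IterativeImputer(estimator=RandomForestRegressor(n_estimators=100,
                                                           random_state=0),
                           max_iter=5, random_state=0),
}
for name, imp in imputers.items():
    X_hat = imp.fit_transform(X_miss)
    err = X_hat[mask] - X_true[mask]
    print(f"{name}: RMSE={np.sqrt(np.mean(err ** 2)):.3f}, "
          f"MAE={np.mean(np.abs(err)):.3f}")
```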


2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Nishith Kumar ◽  
Md. Aminul Hoque ◽  
Masahiro Sugimoto

Abstract
Mass spectrometry is a modern and sophisticated high-throughput analytical technique that enables large-scale metabolomic analyses. It yields a high-dimensional, large-scale matrix (samples × metabolites) of quantified data that often contains missing cells as well as outliers originating from several sources, both technical and biological. Although several missing data imputation techniques are described in the literature, conventional techniques only address the missing values and do nothing to mitigate the outliers; outliers in the dataset therefore decrease the accuracy of the imputation. We developed a new kernel weight function-based missing data imputation technique that addresses both missing values and outliers. We evaluated the performance of the proposed method and of other conventional and recently developed imputation techniques on both artificially generated data and experimentally measured data, in the absence and presence of different rates of outliers. Performance on both the artificial data and real metabolomics data indicates that the proposed kernel weight-based imputation technique is superior to the existing alternatives. For user convenience, an R package of the proposed kernel weight-based missing value imputation technique was developed, which is available at https://github.com/NishithPaul/tWLSA.
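Purely as a generic illustration of how kernel weights can make imputation less sensitive to outliers, the sketch below imputes one missing cell as a kernel-weighted average over donor rows, with a Gaussian kernel on row similarity that downweights dissimilar (including outlying) donors. This is not the authors' tWLSA algorithm, which is available in the linked R package.

```python
import numpy as np

def kernel_impute(X, i, j, bandwidth=1.0):
    """Impute X[i, j] from rows with column j observed, weighted by similarity."""
    obs_cols = ~np.isnan(X[i])
    donors = np.where(~np.isnan(X[:, j]))[0]
    donors = donors[donors != i]
    # Distance to the target row on the columns the target row has observed.
    d = np.nanmean((X[donors][:, obs_cols] - X[i, obs_cols]) ** 2, axis=1)
    w = np.exp(-d / (2 * bandwidth ** 2))          # Gaussian kernel weights
    return np.sum(w * X[donors, j]) / np.sum(w)

X = np.array([[1.0, 2.0, np.nan],
              [1.1, 2.1, 3.0],
              [0.9, 1.8, 2.8],
              [5.0, 9.0, 40.0]])                   # last row is an outlier
print(kernel_impute(X, i=0, j=2))                  # close to 2.9, outlier downweighted
```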

