Missing value imputation and data cleaning in untargeted food chemical safety assessment by LC-HRMS

2019 ◽  
Vol 188 ◽  
pp. 54-62 ◽  
Author(s):  
Grégoire Delaporte ◽  
Mathieu Cladière ◽  
Valérie Camel
2021 ◽  
Vol ahead-of-print (ahead-of-print) ◽  
Author(s):  
Zhenyuan Wang ◽  
Chih-Fong Tsai ◽  
Wei-Chao Lin

Purpose
Class imbalance learning, which arises in many domain problem datasets, is an important research topic in data mining and machine learning. One-class classification techniques, which aim to identify anomalies (the minority class) against normal data (the majority class), are one representative solution for class-imbalanced datasets. Since one-class classifiers are trained using only normal data to build a decision boundary for later anomaly detection, the quality of the training set, i.e. the majority class, is a key factor affecting the performance of one-class classifiers.
Design/methodology/approach
This paper focuses on two data cleaning or preprocessing methods for class-imbalanced datasets. The first examines whether performing instance selection to remove noisy data from the majority class can improve the performance of one-class classifiers. The second combines instance selection with missing value imputation, where the latter handles incomplete datasets that contain missing values.
Findings
Experiments on 44 class-imbalanced datasets, using three instance selection algorithms (IB3, DROP3 and the GA), the CART decision tree for missing value imputation, and three one-class classifiers (OCSVM, IFOREST and LOF), show that if the instance selection algorithm is carefully chosen, this step can improve the quality of the training data, allowing one-class classifiers to outperform the baselines trained without instance selection. Moreover, when class-imbalanced datasets contain missing values, combining missing value imputation with instance selection, regardless of which step is performed first, can maintain data quality similar to that of datasets without missing values.
Originality/value
The novelty of this paper is to investigate the effect of instance selection on the performance of one-class classifiers, which has not been done before. Moreover, this study is the first to consider the scenario of missing values in the training set when training one-class classifiers; in this case, performing missing value imputation and instance selection in different orders is compared.
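The pipeline the abstract describes, cleaning the majority class before fitting a one-class classifier, can be sketched as below. This is a minimal illustration, not the paper's method: the distance-based filter is a simplified stand-in for IB3/DROP3/GA instance selection, and scikit-learn's OneClassSVM stands in for the OCSVM classifier.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
# Majority (normal) class: a tight Gaussian cluster plus a few noisy points.
normal = rng.normal(0.0, 1.0, size=(200, 2))
noise = rng.uniform(-8, 8, size=(10, 2))
train = np.vstack([normal, noise])

def distance_filter(X, k=5, quantile=0.9):
    """Drop instances whose mean distance to their k nearest neighbours
    is unusually large (a crude noise filter, not IB3 or DROP3)."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    dist, _ = nn.kneighbors(X)
    score = dist[:, 1:].mean(axis=1)  # skip column 0, the self-distance
    return X[score <= np.quantile(score, quantile)]

cleaned = distance_filter(train)
# Train on the cleaned majority class only; predict() returns +1 for
# inliers and -1 for anomalies.
clf = OneClassSVM(nu=0.05, gamma="scale").fit(cleaned)
print(clf.predict([[0.0, 0.0], [7.0, 7.0]]))
```

Removing the sparse, far-away points before fitting tightens the learned boundary around the true normal cluster, which is the effect the paper measures across 44 datasets.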


2018 ◽  
Author(s):  
Stefan Bischof ◽  
Andreas Harth ◽  
Benedikt Kämpgen ◽  
Axel Polleres ◽  
Patrik Schneider

2020 ◽  
Vol 30 (Supplement_5) ◽  
Author(s):  
F Madia ◽  
A Worth ◽  
M Whelan ◽  
R Corvi

Abstract The rising rates of cancer incidence and prevalence identified by the WHO are of serious concern. The scientific advances of the past twenty years have helped to describe the major properties of cancer, enabling more sophisticated therapies. It has become clear that managing relevant risk factors can also significantly reduce cancer occurrence worldwide. Public health policy actions cannot be decoupled from environmental policy actions, since exposure to chemicals through air, soil, water and food can contribute to cancer as well as other chronic diseases. Furthermore, given the increasing global trend in chemical production, including novel compounds, chemical exposure patterns are expected to change, placing high demands on chemical safety assessment and creating potential protection gaps. The safety assessment of carcinogenicity needs to evolve to keep pace with changes in the chemical environment and cancer epidemiology. The presentation focuses on EC-JRC recommendations and future strategies for carcinogenicity safety assessment. This includes discussion of how the traditional data streams of regulatory toxicology, together with newly available assessment methods and indicators of public health status based on biomonitoring and clinical data, can inform a more holistic, human-relevant and impactful approach to carcinogenicity assessment and the overall prevention of cancer.


2021 ◽  
pp. 126438
Author(s):  
Luana de Morais e Silva ◽  
Vinicius M. Alves ◽  
Edilma R.B. Dantas ◽  
Luciana Scotti ◽  
Wilton Silva Lopes ◽  
...  

Author(s):  
Caio Ribeiro ◽  
Alex A. Freitas

Abstract Longitudinal datasets from human ageing studies usually have a high volume of missing data, and one way to handle missing values in a dataset is to replace them with estimations. However, there are many methods to estimate missing values, and no single method is the best for all datasets. In this article, we propose a data-driven missing value imputation approach that performs a feature-wise selection of the best imputation method, using known information in the dataset to rank the five methods we selected based on their estimation error rates. We evaluated the proposed approach in two sets of experiments: a classifier-independent scenario, where we compared the applicability and error rates of each imputation method; and a classifier-dependent scenario, where we compared the predictive accuracy of Random Forest classifiers trained on datasets prepared with each imputation method and with a baseline of no imputation (letting the classification algorithm handle the missing values internally). Based on the results from both sets of experiments, we concluded that the proposed data-driven missing value imputation approach generally produced more accurate estimations for missing data and better-performing classifiers on longitudinal datasets of human ageing. We also observed that imputation methods devised specifically for longitudinal data produced very accurate estimations. This reinforces the idea that exploiting the temporal information intrinsic to longitudinal data is a worthwhile endeavour for machine learning applications, and one that can be achieved through the proposed data-driven approach.
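The core idea of ranking candidate imputers by their estimation error on known values can be sketched as follows. This is an illustrative simplification, not the authors' five-method implementation: it hides a fraction of one feature's observed values, scores each candidate imputer by RMSE on those hidden cells, and picks the winner, using scikit-learn imputers as hypothetical candidates.

```python
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 4))
# Make feature 1 strongly predictable from feature 0, so a
# neighbour-based imputer has real signal to exploit.
X[:, 1] = 2 * X[:, 0] + rng.normal(scale=0.1, size=300)

# Hide 20% of feature 1's values, keeping the truth for scoring.
mask = rng.random(300) < 0.2
truth = X[mask, 1].copy()
X_miss = X.copy()
X_miss[mask, 1] = np.nan

def rmse_of(imputer):
    """Estimation error of an imputer on the artificially hidden cells."""
    filled = imputer.fit_transform(X_miss)
    return float(np.sqrt(np.mean((filled[mask, 1] - truth) ** 2)))

candidates = {
    "mean": SimpleImputer(strategy="mean"),
    "median": SimpleImputer(strategy="median"),
    "knn": KNNImputer(n_neighbors=5),
}
errors = {name: rmse_of(imp) for name, imp in candidates.items()}
best = min(errors, key=errors.get)
print(best, errors)
```

Because feature 1 is almost a linear function of feature 0, the kNN imputer, which borrows values from rows with similar observed features, should beat the column-wise mean and median, mirroring the feature-wise selection the article proposes.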


2021 ◽  
Vol ahead-of-print (ahead-of-print) ◽  
Author(s):  
Tressy Thomas ◽  
Enayat Rajabi

Purpose
The primary aim of this study is to review the literature along several dimensions, including the type of methods, experimentation setup and evaluation metrics used in novel approaches proposed for data imputation, particularly in the machine learning (ML) area. This ultimately provides an understanding of how well the proposed frameworks are evaluated and what type and ratio of missingness they address. The review questions in this study are: (1) What ML-based imputation methods were studied and proposed during 2010–2020? (2) How were the experimentation setup, data set characteristics and missingness employed in these studies? (3) What metrics were used to evaluate the imputation methods?
Design/methodology/approach
The review followed the standard identification, screening and selection process. The initial search of electronic databases for missing value imputation (MVI) based on ML algorithms returned a large number of papers, totaling 2,883. Most of the papers at this stage did not describe an MVI technique relevant to this study. Papers were first screened by title for relevance, and 306 were identified as appropriate. Upon reviewing the abstracts, 151 papers not eligible for this study were dropped, leaving 155 research papers for full-text review. Of these, 117 papers were used to assess the review questions.
Findings
This study shows that clustering- and instance-based algorithms are the most frequently proposed MVI methods. Percentage of correct prediction (PCP) and root mean square error (RMSE) are the most used evaluation metrics in these studies. For experimentation, the majority of studies sourced their data sets from publicly available repositories. A common approach is to use the complete data set as a baseline and evaluate the effectiveness of imputation on test data sets with artificially induced missingness. Data set sizes and missingness ratios varied across experiments, while the missing data type and mechanism relate to the capability of the imputation method. Computational expense is a concern, and experimentation with large data sets appears to be a challenge.
Originality/value
It is understood from the review that there is no single universal solution to the missing data problem. Variants of ML approaches work well with particular kinds of missingness, depending on the characteristics of the data set. Most of the methods reviewed lack generalization with regard to applicability. Another concern related to applicability is the complexity of the formulation and implementation of the algorithm. Imputation based on k-nearest neighbors (kNN) and clustering algorithms, being simple and easy to implement, is popular across various domains.
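The evaluation protocol the review identifies as most common, taking a complete data set as the baseline, inducing missingness artificially, then scoring the imputer, can be sketched for the PCP metric on a categorical column. This is a generic illustration of the protocol, not any specific reviewed method; the 15% MCAR rate and most-frequent imputer are arbitrary choices.

```python
import numpy as np
from sklearn.impute import SimpleImputer

rng = np.random.default_rng(2)
# Complete categorical column serves as the ground-truth baseline.
col = rng.choice(["a", "b", "c"], size=500, p=[0.6, 0.3, 0.1])

# Artificially induce 15% missing-completely-at-random (MCAR) cells.
mask = rng.random(500) < 0.15
col_miss = col.astype(object)
col_miss[mask] = np.nan

# Impute with the modal category (a deliberately simple baseline imputer).
imputed = SimpleImputer(strategy="most_frequent").fit_transform(
    col_miss.reshape(-1, 1)).ravel()

# Percentage of correct prediction (PCP): fraction of the artificially
# removed cells whose imputed value matches the known truth.
pcp = float(np.mean(imputed[mask] == col[mask]))
print(round(pcp, 3))
```

Because the hidden cells' true values are known, PCP (and RMSE for numeric columns) can be computed exactly, which is why the reviewed studies favour this complete-data-plus-induced-missingness design.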


2016 ◽  
Vol 16 (2) ◽  
pp. 77-85 ◽  
Author(s):  
Sergio Arciniegas-Alarcón ◽  
Marisol García-Peña ◽  
Wojtek Krzanowski
