incomplete datasets
Recently Published Documents

TOTAL DOCUMENTS: 73 (FIVE YEARS: 31)
H-INDEX: 9 (FIVE YEARS: 3)

2022 · Vol 14 (1) · pp. 0-0

Many real-world datasets may contain missing values for various reasons. These incomplete datasets can pose severe problems for the underlying machine learning algorithms and decision support systems, resulting in high computational cost, skewed output and invalid deductions. Various solutions exist to mitigate this issue; the most popular strategy is to estimate the missing values by applying inferential techniques such as linear regression, decision trees or Bayesian inference. In this paper, the missing data problem is discussed in detail with a comprehensive review of the approaches to tackle it. The paper concludes with a discussion on the effectiveness of three imputation methods, namely imputation based on Multiple Linear Regression (MLR), Predictive Mean Matching (PMM) and Classification And Regression Trees (CART), in the context of subspace clustering. The experimental results obtained on real benchmark datasets and high-dimensional synthetic datasets highlight that the MLR-based imputation method is more efficient on high-dimensional incomplete datasets.
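As a rough illustration of the regression- and tree-based imputation the abstract compares, the sketch below uses scikit-learn's IterativeImputer with a linear-regression estimator (MLR-like) and a decision-tree estimator (CART-like) on synthetic data; PMM is omitted because it is not part of scikit-learn, and the data, missingness rate and dimensions are placeholders rather than the paper's benchmarks.

```python
# Minimal sketch of MLR-like and CART-like imputation; not the paper's code.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))                      # stand-in for a high-dimensional dataset
mask = rng.random(X.shape) < 0.2                   # 20% of values missing completely at random
X_missing = np.where(mask, np.nan, X)

# Multiple-linear-regression-style imputation (akin to MLR)
mlr_imputer = IterativeImputer(estimator=LinearRegression(), max_iter=10, random_state=0)
X_mlr = mlr_imputer.fit_transform(X_missing)

# Tree-based imputation (akin to CART)
cart_imputer = IterativeImputer(estimator=DecisionTreeRegressor(max_depth=5), max_iter=10, random_state=0)
X_cart = cart_imputer.fit_transform(X_missing)

# Compare reconstruction error on the artificially masked entries
for name, X_hat in [("MLR", X_mlr), ("CART", X_cart)]:
    rmse = np.sqrt(np.mean((X_hat[mask] - X[mask]) ** 2))
    print(f"{name} imputation RMSE: {rmse:.3f}")
```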


Author(s): Vincent Graves, Bridgette Cooper, Jonathan Tennyson

Abstract There are many measurements and calculations of total electron impact ionisation cross sections. However, many applications, particularly in plasma physics, also require fragmentation patterns. Approximate methods of deducing partial cross sections are tested, based on the use of total cross sections computed within the widely used Binary Encounter Bethe (BEB) approximation. Partial ionisation cross sections for three series of molecules, comprising CH₄, CF₄ and CCl₄; SiH₄ and SiCl₄; NH₃ and PH₃, were estimated using two methods. Method one is semi-empirical and uses mass spectroscopy data to fix the partial cross sections at a single electron energy. The second is a fully computational method proposed by Huber et al. (2019, J. Chem. Phys., 150, 024306). Comparisons with experimental results suggest that the mass spectroscopy method is more accurate. However, as Huber's method requires no experimental input, it could be used as a first approximation when no experimental data are available. As mass spectroscopy sometimes provides incomplete datasets, a hybrid method based on the use of both methods is also explored.
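The semi-empirical idea of the first method can be sketched as follows: mass-spectrometry fragment intensities measured at one reference energy define branching ratios that distribute a BEB total cross section over the fragments. In the simplest variant the ratios are taken as energy-independent. The cross-section expression, fragment names and numbers below are illustrative stand-ins, not values from the paper.

```python
# Hedged sketch: partial ionisation cross sections from a total cross section
# plus mass-spec branching ratios fixed at a single reference energy.
import numpy as np

energies = np.linspace(20.0, 200.0, 10)                            # electron energies in eV (illustrative grid)
sigma_total = 4.0 * np.log(energies / 15.0) / (energies / 15.0)    # stand-in for a BEB total cross section

# Illustrative fragment intensities measured at one reference energy (e.g. 70 eV)
fragment_intensities = {"CH4+": 0.48, "CH3+": 0.40, "CH2+": 0.08, "CH+": 0.04}
total_intensity = sum(fragment_intensities.values())
branching_ratios = {f: i / total_intensity for f, i in fragment_intensities.items()}

# Partial cross sections: energy-independent branching ratios scale the total
partial_sigma = {f: r * sigma_total for f, r in branching_ratios.items()}

for f, sigma in partial_sigma.items():
    print(f"{f}: sigma(100 eV) ~ {np.interp(100.0, energies, sigma):.3f} (arb. units)")
```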


2021 · Vol 11 (12) · pp. 1356
Author(s): Carlos Traynor, Tarjinder Sahota, Helen Tomkinson, Ignacio Gonzalez-Garcia, Neil Evans, ...

Missing data is a universal problem in analysing Real-World Evidence (RWE) datasets, where there is a need to understand which features best correlate with clinical outcomes. In this context, the missing status of several biomarkers may appear as gaps in the dataset that hide meaningful values for analysis. Imputation methods are general strategies that replace missing values with plausible values. Using the Flatiron NSCLC dataset, which includes more than 35,000 subjects, we compare the imputation performance of six such methods: predictive mean matching, expectation-maximisation, factorial analysis, random forest, generative adversarial networks and multivariate imputation with tabular networks. We also conduct extensive synthetic data experiments with structural causal models. Statistical learning from incomplete datasets should select an appropriate imputation algorithm, accounting for the nature of missingness, the impact of missing data, and the distribution shift induced by the imputation algorithm. In our synthetic data experiments, tabular networks had the best overall performance. Methods using neural networks are promising for complex datasets with non-linearities. However, conventional methods such as predictive mean matching work well for the Flatiron NSCLC biomarker dataset.
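Predictive mean matching, the conventional method found competitive here, can be sketched in a few lines: a regression model fitted on complete cases predicts the incomplete column, and each missing entry is filled with an observed value drawn from the donors whose predictions are closest. The code below is a minimal single-column illustration on synthetic data, not the study's multi-method pipeline.

```python
# Minimal predictive mean matching (PMM) sketch for one numeric column.
import numpy as np

rng = np.random.default_rng(1)
n = 500
X = rng.normal(size=(n, 3))                       # fully observed covariates
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.5, size=n)
missing = rng.random(n) < 0.3                     # 30% of the target column missing
y_obs = y.copy()
y_obs[missing] = np.nan

# 1. Fit a linear model on the complete cases
obs = ~missing
beta, *_ = np.linalg.lstsq(np.c_[np.ones(obs.sum()), X[obs]], y_obs[obs], rcond=None)
pred_all = np.c_[np.ones(n), X] @ beta

# 2. For each missing row, pick a donor among the k observed rows with the
#    closest predicted mean and copy the donor's *observed* value
k = 5
y_imputed = y_obs.copy()
obs_idx = np.flatnonzero(obs)
for i in np.flatnonzero(missing):
    distances = np.abs(pred_all[obs_idx] - pred_all[i])
    donors = obs_idx[np.argsort(distances)[:k]]
    y_imputed[i] = y_obs[rng.choice(donors)]

rmse = np.sqrt(np.mean((y_imputed[missing] - y[missing]) ** 2))
print(f"PMM imputation RMSE on masked values: {rmse:.3f}")
```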


Author(s): Ghazanfar Ali Shah, Jean-Philippe Pernot, Arnaud Polette, Franca Giannini, Marina Monti

Abstract This paper introduces a novel reverse engineering technique for the reconstruction of editable CAD models of mechanical part assemblies. The input is a point cloud of a mechanical part assembly that has been acquired as a whole, i.e. without disassembling it prior to its digitization. The proposed framework allows for the reconstruction of the parametric CAD assembly model through a multi-step reconstruction and fitting approach. It is modular and supports various exploitation scenarios depending on the available data and starting point. It also handles incomplete datasets. The reconstruction process starts from roughly sketched and parameterized geometries (i.e. 2D sketches, 3D parts or assemblies) that are then used as input to a simulated annealing-based fitting algorithm, which minimizes the deviation between the point cloud and the reconstructed geometries. The coherence of the CAD models is maintained by a CAD modeler that performs the updates and satisfies the geometric constraints as the fitting process goes on. The optimization process leverages a two-level filtering technique able to capture and manage the boundaries of the geometries inside the overall point cloud, in order to allow for local fitting and interface detection. It is a user-driven approach in which the user decides on the most suitable steps and their sequence. It has been tested and validated on both real scanned point clouds and as-scanned virtually generated point clouds incorporating several artifacts that would appear with real acquisition devices.
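The fitting stage can be illustrated, in a very reduced form, by simulated annealing over the parameters of a single primitive: the sketch below fits a sphere (standing in for a parameterized part) to a synthetic point cloud by minimizing the mean point-to-surface deviation. It only illustrates the optimization loop, under assumed step sizes and cooling schedule, and omits the CAD modeler, constraints and two-level filtering of the actual framework.

```python
# Simplified simulated-annealing fit of a parameterized primitive to a point cloud.
import numpy as np

rng = np.random.default_rng(2)

# Synthetic "scan": noisy points on a sphere of radius 2 centred at (1, 0, -1)
true_c, true_r = np.array([1.0, 0.0, -1.0]), 2.0
dirs = rng.normal(size=(500, 3))
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
cloud = true_c + true_r * dirs + rng.normal(scale=0.02, size=(500, 3))

def deviation(params):
    """Mean squared point-to-surface distance for a sphere (centre, radius)."""
    c, r = params[:3], params[3]
    return np.mean((np.linalg.norm(cloud - c, axis=1) - r) ** 2)

# Simulated annealing over the sphere parameters (centre + radius)
params = np.array([0.0, 0.0, 0.0, 1.0])           # rough initial guess (the "sketched" geometry)
energy = deviation(params)
T = 1.0
for step in range(5000):
    candidate = params + rng.normal(scale=0.05, size=4)
    e_new = deviation(candidate)
    # Accept improvements always, worse moves with a temperature-dependent probability
    if e_new < energy or rng.random() < np.exp((energy - e_new) / T):
        params, energy = candidate, e_new
    T *= 0.999                                    # geometric cooling schedule

print("fitted centre:", np.round(params[:3], 2), "radius:", round(params[3], 2))
```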


IUCrJ · 2021 · Vol 8 (6)
Author(s): D. Tchoń, A. Makal

Sufficiently high completeness of diffraction data is necessary to correctly determine the space group, observe solid-state structural transformations or investigate charge density distribution under pressure. Regrettably, experiments performed at high pressure in a diamond anvil cell (DAC) yield inherently incomplete datasets. The present work systematizes the combined influence of radiation wavelength, DAC opening angle and sample orientation in a DAC on the completeness of diffraction data collected in a single-crystal high-pressure (HP) experiment with the help of dedicated software. In particular, the impact of the sample orientation on the achievable data completeness is quantified and proved to be substantial. Graphical guides for estimating the most beneficial sample orientation depending on the sample Laue class and assuming a few commonly used experimental setups are proposed. The usefulness of these guides has been tested in the case of luminescent 1,3-diacetylpyrene, suspected to undergo transitions from the α phase (Pnma) to the γ phase (Pn2₁a) and δ phase (P112₁/a) under pressure. Effective sample orientation has ensured over 90% coverage even for the monoclinic system and enabled unrestrained structure refinements and access to complete systematic extinction patterns.
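As a rough, orientation-averaged illustration of why wavelength and DAC opening angle cap the achievable completeness, the sketch below estimates the fraction of a resolution shell that satisfies a simplified, idealized DAC accessibility condition, assumed here as |ψ − 90°| ≤ α − θ (ψ the angle between the scattering vector and the DAC axis, α the opening half-angle, θ the Bragg angle). It ignores the sample-orientation and Laue-class effects quantified in the paper and is not the dedicated software mentioned there; the numbers are illustrative.

```python
# Monte Carlo estimate of orientation-averaged shell coverage in an idealized DAC.
# Assumes the simplified accessibility criterion |psi - 90 deg| <= alpha - theta.
import numpy as np

rng = np.random.default_rng(3)

def shell_coverage(d_spacing, wavelength, alpha_deg, n_samples=200_000):
    """Fraction of a resolution shell accessible in an idealized DAC geometry."""
    theta = np.arcsin(wavelength / (2.0 * d_spacing))   # Bragg angle (rad)
    alpha = np.radians(alpha_deg)                       # DAC opening half-angle (rad)
    if theta > alpha:
        return 0.0                                      # beyond the DAC resolution limit
    # Uniform random scattering-vector directions; psi = angle to the DAC axis (z)
    psi = np.arccos(rng.uniform(-1.0, 1.0, n_samples))
    return float(np.mean(np.abs(psi - np.pi / 2) <= alpha - theta))

for wl in (0.71, 0.41):                                 # Mo Kalpha vs. a shorter synchrotron wavelength (angstroms)
    cov = shell_coverage(d_spacing=0.8, wavelength=wl, alpha_deg=40.0)
    print(f"lambda = {wl} A, d = 0.8 A, 40 deg opening: shell coverage ~ {cov:.2f}")
```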


Author(s): Jyothi Vishnu Vardhan Kola

In many real-world scenarios, regression is a commonly used technique for predicting continuous variables. For noisy (inconsistent) and incomplete datasets, many previous works have adopted complex, non-traditional machine learning approaches to obtain accurate predictions, at the cost of time and space overheads. In this paper, we handle such complex data with traditional machine learning regression algorithms by applying data cleaning and data transformation tailored to the working principles of those algorithms.
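A minimal sketch of this kind of pipeline, assuming a tabular dataset with a numeric target: outliers are clipped, gaps are imputed, features are scaled, and a traditional regressor is fitted. The dataset, column names and thresholds below are placeholder assumptions, not the paper's data or exact preprocessing.

```python
# Sketch: simple cleaning + transformation feeding a traditional regressor.
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(4)
df = pd.DataFrame(rng.normal(size=(300, 4)), columns=["f1", "f2", "f3", "f4"])
df["target"] = 2 * df["f1"] - df["f3"] + rng.normal(scale=0.3, size=300)
df.loc[rng.random(300) < 0.1, "f2"] = np.nan          # simulate incompleteness

X, y = df.drop(columns="target"), df["target"]
low, high = X.quantile(0.01), X.quantile(0.99)
X = X.clip(lower=low, upper=high, axis=1)             # tame noisy outliers per column

model = Pipeline([
    ("impute", SimpleImputer(strategy="median")),     # fill remaining gaps
    ("scale", StandardScaler()),                      # transformation step
    ("reg", Ridge(alpha=1.0)),                        # a traditional regressor
])
print("CV R^2:", cross_val_score(model, X, y, cv=5).mean().round(3))
```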


2021
Author(s): Tal Einav, Brian Cleary

Summary Characterizing the antibody response against large panels of viral variants provides unique insight into key processes that shape viral evolution and host antibody repertoires, and has become critical to the development of new vaccine strategies. Given the enormous diversity of circulating virus strains and antibody responses, exhaustive testing of all antibody-virus interactions is unfeasible. However, prior studies have demonstrated that, despite the complexity of these interactions, their functional phenotypes can be characterized in a vastly simpler and lower-dimensional space, suggesting that matrix completion of relatively few measurements could accurately predict unmeasured antibody-virus interactions. Here, we combine available data from several of the largest-scale studies for both influenza and HIV-1 and demonstrate how matrix completion can substantially expedite experiments. We explore how prediction accuracy evolves as the number of available measurements changes and approximate the number of additional measurements necessary in several highly incomplete datasets (suggesting ∼250,000 measurements could be saved). In addition, we show how the method can be used to combine disparate datasets, even when the number of available measurements is below the theoretical limit for successful prediction. Our results suggest new approaches to improve ongoing experimental design, and could be readily generalized to other viruses or more broadly to other low-dimensional biological datasets.
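The matrix-completion idea can be sketched with an iterative soft-thresholded SVD (SoftImpute-style scheme): measured entries are held fixed while a low-rank reconstruction fills the gaps. The antibody-virus matrix below is synthetic and low-rank by construction; the matrix sizes, rank and shrinkage value are arbitrary assumptions, and this is not the authors' implementation.

```python
# SoftImpute-style low-rank matrix completion on a synthetic interaction matrix.
import numpy as np

rng = np.random.default_rng(5)
n_antibodies, n_viruses, rank = 60, 40, 3
truth = rng.normal(size=(n_antibodies, rank)) @ rng.normal(size=(rank, n_viruses))
observed = rng.random(truth.shape) < 0.3            # only 30% of interactions measured
Y = np.where(observed, truth, np.nan)

def soft_impute(Y, observed, shrinkage=1.0, n_iter=200):
    X = np.where(observed, Y, 0.0)                  # start with zeros in the gaps
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(X, full_matrices=False)
        s = np.maximum(s - shrinkage, 0.0)          # soft-threshold the singular values
        X_low = (U * s) @ Vt                        # low-rank reconstruction
        X = np.where(observed, Y, X_low)            # keep measured entries fixed
    return X_low

completed = soft_impute(Y, observed)
rmse = np.sqrt(np.mean((completed[~observed] - truth[~observed]) ** 2))
print(f"RMSE on unmeasured interactions: {rmse:.3f}")
```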


PLoS ONE · 2021 · Vol 16 (7) · pp. e0254720
Author(s): Maritza Mera-Gaona, Ursula Neumann, Rubiel Vargas-Canas, Diego M. López

Handling missing values is a crucial step in preprocessing data for machine learning. Most available algorithms for feature selection and for classification or estimation assume complete datasets. Consequently, in many cases, the strategy for dealing with missing values is to use only instances with full data or to replace missing values with a mean, mode, median, or constant value. Usually, discarding incomplete samples or replacing missing values by means of such basic techniques biases subsequent analyses of the datasets. Aim: Demonstrate the positive impact of multivariate imputation in the feature selection process on datasets with missing values. Results: We compared the effects of the feature selection process using complete datasets, incomplete datasets with missingness rates between 5 and 50%, and datasets imputed by basic techniques and by multivariate imputation. The feature selection algorithms used are well-known methods. The results showed that the datasets imputed by multivariate imputation obtained the best results in feature selection compared to datasets imputed by basic techniques or non-imputed incomplete datasets. Conclusions: Considering the results obtained in the evaluation, applying multivariate imputation by MICE reduces bias in the feature selection process.
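A compact sketch of the workflow evaluated here, using scikit-learn's IterativeImputer as a MICE-style single imputation followed by a standard univariate feature selection; the synthetic data, missingness rate and choice of selector are illustrative assumptions, not the study's exact setup.

```python
# Impute with a MICE-style imputer, then compare selected features
# against those chosen on the complete data.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(6)
n, p = 400, 10
X = rng.normal(size=(n, p))
y = (X[:, 0] - X[:, 3] + 0.5 * rng.normal(size=n) > 0).astype(int)   # only features 0 and 3 matter

X_missing = X.copy()
X_missing[rng.random(X.shape) < 0.3] = np.nan        # 30% missingness

selector = SelectKBest(f_classif, k=2)

selector.fit(X, y)
print("selected on complete data:          ", np.flatnonzero(selector.get_support()))

X_imputed = IterativeImputer(max_iter=10, random_state=0).fit_transform(X_missing)
selector.fit(X_imputed, y)
print("selected after MICE-style imputation:", np.flatnonzero(selector.get_support()))
```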


2021 · Vol 4
Author(s): Sebastian Jäger, Arndt Allhorn, Felix Bießmann

With the increasing importance and complexity of data pipelines, data quality has become one of the key challenges in modern software applications. The importance of data quality has been recognized beyond the field of data engineering and database management systems (DBMSs). For machine learning (ML) applications, too, high data quality standards are crucial to ensure robust predictive performance and responsible usage of automated decision making. One of the most frequent data quality problems is missing values. Incomplete datasets can break data pipelines and, when undetected, can have a devastating impact on downstream ML applications. While statisticians and, more recently, ML researchers have introduced a variety of approaches to impute missing values, comprehensive benchmarks comparing classical and modern imputation approaches under fair and realistic conditions are underrepresented. Here, we aim to fill this gap. We conduct a comprehensive suite of experiments on a large number of datasets with heterogeneous data and realistic missingness conditions, comparing both novel deep learning approaches and classical ML imputation methods when either only the test data or both the train and test data are affected by missing values. Each imputation method is evaluated with respect to the imputation quality and the impact the imputation has on a downstream ML task. Our results provide valuable insights into the performance of a variety of imputation methods under realistic conditions. We hope that our results help researchers and engineers guide their data preprocessing method selection for automated data quality improvement.
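A condensed sketch of the evaluation protocol described here: missing values are injected either only into the test data or into both train and test data, an imputer runs inside the model pipeline, and the downstream classifier's accuracy is compared across the two scenarios. The single mean imputer, dataset and missingness mechanism below are placeholder assumptions standing in for the benchmark's many methods and datasets.

```python
# Evaluate the downstream impact of imputation under two missingness scenarios.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(7)
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

def inject_mcar(A, rate):
    """Set a fraction of entries to NaN completely at random."""
    A = A.copy()
    A[rng.random(A.shape) < rate] = np.nan
    return A

scenarios = {"test only": (0.0, 0.3), "train and test": (0.3, 0.3)}
for name, (train_rate, test_rate) in scenarios.items():
    model = make_pipeline(SimpleImputer(strategy="mean"), LogisticRegression(max_iter=1000))
    model.fit(inject_mcar(X_train, train_rate), y_train)
    acc = model.score(inject_mcar(X_test, test_rate), y_test)
    print(f"{name}: downstream accuracy = {acc:.3f}")
```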


2021 · Vol 42 (Supplement_1) · pp. S70-S70
Author(s): Kevin N Foster, Larisa M Krueger, Karen J Richey

Abstract Introduction Evidence-based criteria for burn patient admission are poorly defined. Attempts have been made by commercial entities to align payors and providers with evidence-based admission criteria to optimize resource use. However, these admission criteria have not been examined critically to see whether they are appropriate and effective. We developed an admission criteria algorithm based on these existing standards and have utilized it for nearly 18 months. The purpose of this study is to retrospectively review this algorithm with respect to inpatient needs and outcomes to assess its effectiveness. Methods A retrospective chart review of patients admitted to the burn center over a 1-year period was performed. Incomplete datasets were excluded. Patients were grouped by TBSA: < 10%, 10–20% and > 20%. Appropriateness of admission was measured using length of stay (LOS) as a surrogate marker; hospitalizations of < 3 days, unless the patient was deceased, were deemed inappropriate (IAP) and those of 3 days or more appropriate (AP). Results There were complete datasets for 530 patients: < 10% (n=423), 10–20% (n=72), > 20% (n=35). There were no significant differences in age, gender, or payor sources between the groups. Patients with larger TBSA burns were more likely to have suffered a flame/flash injury. All patients in the two larger TBSA groups met admission criteria per the algorithm. All IAP were in the < 10% group. When compared to AP, IAP were younger, 31.6 vs. 44.0 years (p< .0001), had smaller TBSA injuries, 2.8% vs. 3.5% (p=.0045), had fewer clinical findings, 1.4 vs. 1.8 (p< .0001), and fewer interventions, 1.8 vs. 2.6 (p< .0001), but were more likely to have suffered burns to the head, 30% vs. 13% (p< .00001), and neck, 9% vs. 3% (p=.0164). AP patients were more likely to have suffered contact burns, 27% vs. 17% (p=.0323), full-thickness injuries, 39% vs. 14% (p< .0001), involvement of a major joint, 42% vs. 29% (p=.0085), combined burn and trauma, 3% vs. 0% (p=.0444), and burns to the buttocks, 7% vs. 2% (p=.0357). AP patients were also more likely to require IV analgesia, 82% vs. 71% (p=.0107), and to be evaluated as likely needing surgery, 82% vs. 15% (p< .00001). Conclusions The admission criteria algorithm performed perfectly in patients with a ≥ 10% TBSA injury. For patients with burns < 10% TBSA the algorithm was not followed as closely, leading to some inappropriate admissions. Patients with smaller burns who were admitted appropriately were more likely to have full-thickness burns, contact burns, burns over joints, and to require surgery. The algorithm was highly accurate in patients with large burns; however, additional refinement is needed for patients with smaller burn injuries.

