Completing missing values using discovered formal concepts

Author(s):  
S. Ben Yahia ◽  
Kh. Arour ◽  
A. Jaoua
Marketing ZFP ◽  
2019 ◽  
Vol 41 (4) ◽  
pp. 21-32
Author(s):  
Dirk Temme ◽  
Sarah Jensen

Missing values are ubiquitous in empirical marketing research. If missing data are not dealt with properly, the result can be a loss of statistical power and distorted parameter estimates. While traditional approaches for handling missing data (e.g., listwise deletion) are still widely used, researchers can nowadays choose among various advanced techniques such as multiple imputation or full-information maximum likelihood estimation. Thanks to readily available software, applying these modern missing-data methods no longer poses a major obstacle. Still, their application requires a sound understanding of their prerequisites and limitations as well as a deeper understanding of the processes that led to missing values in an empirical study. This article, Part 1, first introduces Rubin’s classical definition of missing-data mechanisms and an alternative, variable-based taxonomy that provides a graphical representation. Second, a selection of visualization tools available in different R packages for describing and exploring missing-data structures is presented.
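As a rough illustration of Rubin’s taxonomy (MCAR, MAR, MNAR), the following minimal Python sketch simulates the three mechanisms on two synthetic variables and summarizes the resulting missingness. The variables (age, income), the dropout probabilities, and the use of Python instead of the R packages discussed in the article are illustrative assumptions, not material from the study.

```python
# Minimal sketch (not the article's R workflow): simulate Rubin's three
# missing-data mechanisms on a toy two-variable data set and inspect the
# resulting missingness pattern. Variable names and probabilities are illustrative.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000
age = rng.normal(45, 12, n)
income = 20 + 0.8 * age + rng.normal(0, 5, n)
df = pd.DataFrame({"age": age, "income": income})

# MCAR: each income value is dropped with the same probability,
# independent of any observed or unobserved quantity.
mcar = df.copy()
mcar.loc[rng.random(n) < 0.2, "income"] = np.nan

# MAR: the probability of a missing income depends only on the fully
# observed variable age (older respondents answer less often).
mar = df.copy()
p_mar = 1 / (1 + np.exp(-(age - 55) / 5))
mar.loc[rng.random(n) < p_mar, "income"] = np.nan

# MNAR: missingness depends on the (unobserved) income value itself,
# e.g. high earners refuse to report their income.
mnar = df.copy()
p_mnar = 1 / (1 + np.exp(-(income - income.mean()) / 5))
mnar.loc[rng.random(n) < p_mnar, "income"] = np.nan

# A simple description of the missing-data structure, analogous in spirit
# to the exploratory summaries provided by R packages for missing data.
for name, d in [("MCAR", mcar), ("MAR", mar), ("MNAR", mnar)]:
    observed = d["income"].notna()
    print(name,
          "| share missing:", round(1 - observed.mean(), 2),
          "| mean age where income missing:", round(d.loc[~observed, "age"].mean(), 1))
```

Under MCAR the mean age of respondents with missing income stays close to the overall mean, whereas under MAR and MNAR it shifts, which is the kind of structure the visualization tools discussed in the article are meant to reveal.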


2017 ◽  
Author(s):  
Natalia Sizochenko ◽  
Alicja Mikolajczyk ◽  
Karolina Jagiello ◽  
Tomasz Puzyn ◽  
Jerzy Leszczynski ◽  
...  

Predictive modeling approaches can help solve the problem of missing data. Many studies investigate the effects of missing values on qualitative or quantitative modeling, but only a few publications have discussed them in the context of nanotechnology-related data. The current project aimed at developing a multi-nano-read-across modeling technique that helps predict toxicity to different species: bacteria, algae, protozoa, and mammalian cell lines. In this study, the experimental toxicity of 184 metal and silica oxide nanoparticles (30 unique chemical types) from 15 experimental datasets was analyzed. A hybrid quantitative multi-nano-read-across approach that combines interspecies correlation analysis and self-organizing map analysis was developed. In the first step, hidden patterns of toxicity among the nanoparticles were identified using a combination of methods. The developed model, based on the categorization of metal oxide nanoparticles’ toxicity outcomes, was then evaluated by means of a combination of supervised and unsupervised machine learning techniques to find the underlying factors responsible for toxicity.
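The general two-stage idea, unsupervised pattern discovery followed by a supervised search for the descriptors that drive the discovered categories, can be sketched as follows. This is only a hedged illustration: KMeans stands in for the self-organizing map, a random forest for the supervised step, and the descriptor matrix is synthetic rather than the 15 experimental datasets analyzed in the study.

```python
# Sketch of the read-across idea: discover toxicity patterns with an
# unsupervised step, then learn which descriptors drive them with a
# supervised step. KMeans is a stand-in for the self-organizing map;
# the descriptors below are synthetic, not the authors' data.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
n_particles, n_descriptors = 184, 6   # sizes echo the abstract; values are synthetic
X = rng.normal(size=(n_particles, n_descriptors))

# Unsupervised step: group nanoparticles by similarity of their descriptors
# (a stand-in for mapping them onto a self-organizing map grid).
clusters = KMeans(n_clusters=4, n_init=10, random_state=1).fit_predict(X)

# Supervised step: treat the discovered categories as labels and ask which
# descriptors separate them, as a proxy for the "underlying factors".
forest = RandomForestClassifier(n_estimators=200, random_state=1).fit(X, clusters)
print("cross-validated accuracy:",
      cross_val_score(RandomForestClassifier(random_state=1), X, clusters, cv=5).mean().round(2))
print("descriptor importances:", forest.feature_importances_.round(2))
```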


2020 ◽  
Vol 21 ◽  
Author(s):  
Sukanya Panja ◽  
Sarra Rahem ◽  
Cassandra J. Chu ◽  
Antonina Mitrofanova

Background: In recent years, the availability of high-throughput technologies, the establishment of large molecular patient data repositories, and advances in computing power and storage have allowed the elucidation of complex mechanisms implicated in therapeutic response in cancer patients. The breadth and depth of such data, alongside experimental noise and missing values, require a sophisticated human-machine interaction that allows effective learning from complex data and accurate forecasting of future outcomes, ideally embedded in the core of the machine learning design. Objective: In this review, we will discuss machine learning techniques utilized for modeling treatment response in cancer, including random forests, support vector machines, neural networks, and linear and logistic regression. We will overview their mathematical foundations and discuss their limitations and alternative approaches, all in light of their application to therapeutic response modeling in cancer. Conclusion: We hypothesize that the increase in the number of patient profiles and the potential temporal monitoring of patient data will establish even more complex techniques, such as deep learning and causal analysis, as central players in therapeutic response modeling.
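A purely illustrative comparison of some of the model families reviewed here might look like the sketch below; it runs on synthetic data rather than patient profiles, assumes scikit-learn, and does not reproduce any of the reviewed studies.

```python
# Compare several reviewed model families on the same synthetic binary
# "responder vs. non-responder" task under cross-validation.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic "molecular profiles": 300 patients, 50 features, binary response.
X, y = make_classification(n_samples=300, n_features=50, n_informative=10, random_state=0)

models = {
    "logistic regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "support vector machine": make_pipeline(StandardScaler(), SVC()),
    "random forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "neural network": make_pipeline(StandardScaler(), MLPClassifier(max_iter=2000, random_state=0)),
}

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
    print(f"{name:>24}: AUC = {scores.mean():.2f} +/- {scores.std():.2f}")
```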


Author(s):  
B. Mathura Bai ◽  
N. Mangathayaru ◽  
B. Padmaja Rani ◽  
Shadi Aljawarneh

Missing attribute values are one of the most common problems faced when mining medical datasets. Estimating missing values is a major challenge in dataset pre-processing, and any wrong estimate of a missing attribute value can lead to inefficient and improper classification and thus to lower classifier accuracies. Similarity measures play a key role during the imputation process; an appropriate similarity measure can help achieve better imputation and improved classification accuracies. This paper proposes a novel imputation measure for finding the similarity between missing and non-missing instances in medical datasets. Experiments are carried out by applying both the proposed imputation technique and popular existing benchmark imputation techniques. Classification is carried out using the KNN, J48, SMO, and RBFN classifiers. The experimental analysis shows that, after imputing medical records with the proposed technique, the classification accuracies reported by the KNN, J48, and SMO classifiers improve compared with the existing benchmark imputation techniques.
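The generic imputation-then-classification pipeline evaluated in the paper can be sketched as follows. This is not the proposed similarity measure: scikit-learn's Euclidean-distance KNNImputer stands in for it, and the bundled breast-cancer benchmark stands in for the authors' medical datasets.

```python
# Generic template: impute missing entries from similar records, then classify.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.impute import KNNImputer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X, y = load_breast_cancer(return_X_y=True)

# Introduce ~10% missing values completely at random to mimic incomplete records.
mask = rng.random(X.shape) < 0.10
X_missing = X.copy()
X_missing[mask] = np.nan

# Impute each missing entry from its most similar complete neighbours,
# then classify the completed records with k-nearest neighbours.
pipeline = make_pipeline(KNNImputer(n_neighbors=5),
                         StandardScaler(),
                         KNeighborsClassifier(n_neighbors=5))
scores = cross_val_score(pipeline, X_missing, y, cv=5)
print("accuracy after imputation:", scores.mean().round(3))
```

Swapping in a different similarity measure at the imputation step, as the paper proposes, leaves the rest of this pipeline unchanged, which is how the benchmark comparisons with KNN, J48, SMO, and RBFN can be set up.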


2020 ◽  
Vol 14 (16) ◽  
pp. 3288-3300
Author(s):  
Hang Liu ◽  
Youyuan Wang ◽  
WeiGen Chen

BMJ Open ◽  
2020 ◽  
Vol 10 (2) ◽  
pp. e032864
Author(s):  
Geraldine Rauch ◽  
Lorena Hafermann ◽  
Ulrich Mansmann ◽  
Iris Pigeot

Objectives: To assess the biostatistical quality of study protocols submitted to German medical ethics committees according to the personal appraisal of their statistical members.
Design: We conducted a web-based survey among biostatisticians who have been active as members in German medical ethics committees during the past 3 years.
Setting: The study population was identified by a comprehensive web search on the websites of German medical ethics committees.
Participants: The final list comprised 86 eligible persons. In total, 57 (66%) completed the survey.
Questionnaire: The first item checked whether the inclusion criterion was met. The last item assessed satisfaction with the survey. Four items aimed to characterise the medical ethics committee in terms of type and location, and one item asked about the urgency of biostatistical training for the medical investigators. The main 2×12 items reported an individual assessment of the quality of biostatistical aspects in the submitted study protocols, distinguishing studies regulated by the German Medicines Act (AMG)/German Act on Medical Devices (MPG) from studies not regulated by these laws.
Primary and secondary outcome measures: The individual assessment of the quality of biostatistical aspects corresponds to the primary objective. Participants were asked to complete the sentence ‘In x% of the submitted study protocols, the following problem occurs’, where 12 different statistical problems were formulated. All other items assess secondary endpoints.
Results: For all biostatistical aspects, 45 of 49 (91.8%) participants judged the quality of AMG/MPG study protocols much better than that of ‘non-regulated’ studies. The latter are in median affected 20%–60% more often by statistical problems. The highest need for training was reported for sample size calculation, missing values and multiple comparison procedures.
Conclusions: Biostatisticians active in German medical ethics committees classify the biostatistical quality of study protocols as low for ‘non-regulated’ studies, whereas quality is much better for AMG/MPG studies.

