Taking Advantage of Highly-Correlated Attributes in Similarity Queries with Missing Values

Author(s):  
Lucas Santiago Rodrigues ◽  
Mirela Teixeira Cazzolato ◽  
Agma Juci Machado Traina ◽  
Caetano Traina
2020 ◽  
Author(s):  
Sara Bahrami

Respondent burden due to long questionnaires in surveys can negatively affect the response rate as well as the quality of responses. A solution to this problem is to use a split questionnaire design (SQD). In an SQD, the items of the long questionnaire are divided into subsets, and only a fraction of the item subsets is assigned to each random subsample of individuals. This leads to several shorter questionnaires, each administered to a random subsample of individuals. The completed sub-questionnaires are then combined, and the values missing by design are imputed by means of multiple imputation. Identification problems can be avoided in advance by ensuring that the variables in the analysis model of interest are jointly observed on at least one subsample of individuals. Furthermore, including an appropriate combination of items in each sub-questionnaire is the most important concern in designing the SQD to reduce the information loss, i.e., highly correlated items that explain each other well should not be jointly missing. For this reason, training data must be available from previous surveys or a pilot study to exploit the association between the variables. In this thesis, two SQDs are proposed. In the first study, a potential design for NEPS data is introduced. The data consist of items that can be divided and allocated into blocks according to their context, with the objective that the within-block correlations are higher than the between-block correlations. According to the design, the target sample is divided into subsamples. In addition to all items of one whole block, which is assigned to each subsample, a fraction of the items of the remaining blocks is randomly drawn and assigned to each subsample, where items that belong to blocks with relatively higher correlations are drawn with lower probability. The design is evaluated by means of several ex-post investigations. The design is imposed on complete data, and several models are estimated for both the complete data and the data deleted by design. The design is also compared with a random multiple matrix sampling (MMS) design, which assigns a random subset of items to each sample individual. In the second study, a genetic algorithm is used to search among a vast number of SQDs to find the optimal design. The algorithm evaluates the designs by the fraction of missing information (FMI) induced by the design. The optimal design is the one with the smallest FMI. The optimal design is evaluated by means of several simulation studies and is compared with a random MMS design.
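The block-based assignment described in this abstract can be illustrated with a minimal Python sketch. It is not the thesis's actual procedure: the function name `assign_split_questionnaire`, the round-robin choice of the full block, and the per-block draw probabilities are all assumptions made for illustration. Each subsample receives one whole block, and every item of the remaining blocks is drawn independently with a block-specific probability, which would be set lower for blocks whose items are more highly correlated.

```python
import random


def assign_split_questionnaire(blocks, draw_prob, n_subsamples, seed=0):
    """Build one item set (sub-questionnaire) per subsample.

    blocks       -- dict mapping a block name to its list of item names
    draw_prob    -- dict mapping a block name to the probability that one of
                    its items is drawn when the block is NOT the full block
                    (hypothetical parameter; lower for highly correlated blocks)
    n_subsamples -- number of subsamples, i.e. sub-questionnaires to build
    """
    rng = random.Random(seed)
    names = list(blocks)
    forms = []
    for s in range(n_subsamples):
        # Assumption: the full block rotates round-robin over the subsamples.
        core = names[s % len(names)]
        items = set(blocks[core])
        for name in names:
            if name == core:
                continue
            # Draw each remaining item independently with its block's probability.
            for item in blocks[name]:
                if rng.random() < draw_prob[name]:
                    items.add(item)
        forms.append(items)
    return forms


if __name__ == "__main__":
    blocks = {"A": ["a1", "a2", "a3"], "B": ["b1", "b2"], "C": ["c1", "c2"]}
    # Block A is assumed to have the highest within-block correlation,
    # so its items are drawn with the lowest probability.
    probs = {"A": 0.2, "B": 0.5, "C": 0.5}
    for i, form in enumerate(assign_split_questionnaire(blocks, probs, 3)):
        print(i, sorted(form))
```

A search procedure such as the genetic algorithm mentioned in the second study would then score candidate assignments of this kind, for example by the FMI they induce; that scoring step is not sketched here.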


Author(s):  
S. Nickolas ◽  
K. Shobha

Data pre-processing plays a vital role in the data mining life cycle for achieving quality outcomes. This paper shows experimentally the importance of data pre-processing for achieving highly accurate classifier outcomes: missing values are imputed using a novel imputation method, CLUSTPRO; highly correlated features are selected using Correlation-based Variable Selection (CVS); and imbalanced data are handled using the Synthetic Minority Over-sampling Technique (SMOTE). The proposed CLUSTPRO method makes use of Random Forest (RF) and Expectation Maximization (EM) algorithms to impute missing values. The imputed results are evaluated using standard evaluation metrics. The CLUSTPRO imputation method outperforms existing, state-of-the-art imputation methods. The combined approach of imputation, feature selection, and imbalanced data handling has contributed significantly to attaining an improved classification accuracy (AUC) of 40%–50% in comparison with results obtained without any pre-processing.


2001 ◽  
Vol 60 (2) ◽  
pp. 99-107 ◽  
Author(s):  
Holger Schmid

Cannabis use does not show homogeneous patterns in a country. In particular, urbanization appears to influence prevalence rates, with higher rates in urban areas. A hierarchical linear model (HLM) was employed to analyze these structural influences on individuals in Switzerland. Data for this analysis were taken from the Swiss survey of the Health Behavior in School-Aged Children (HBSC) Study, the most recent survey to assess drug use in a nationally representative sample of 3473 15-year-olds. A total of 1487 male and 1620 female students indicated their cannabis use and their attributions of drug use to friends. As second-level variables we included address density in the 26 Swiss Cantons as an indicator of urbanization and officially recorded offences of cannabis use in the Cantons as an indicator of repressive policy. Attribution of drug use to friends is highly correlated with cannabis use. The correlation is even more pronounced in urban Cantons. However, no association between recorded offences and cannabis use was found. The results suggest that structural variables influence individuals. Living in an urban area affects the attribution of drug use to friends. On the other hand, repressive policy does not affect individual use.


2013 ◽  
Vol 72 (1) ◽  
pp. 5-11 ◽  
Author(s):  
Elise S. Dan-Glauser ◽  
Klaus R. Scherer

Successful emotion regulation is a key aspect of efficient social functioning and personal well-being. Difficulties in emotion regulation lead to relationship impairments and are presumed to be involved in the onset and maintenance of some psychopathological disorders as well as inappropriate behaviors. Gratz and Roemer (2004) developed the Difficulties in Emotion Regulation Scale (DERS), a comprehensive instrument measuring emotion regulation problems that encompasses several dimensions on which difficulties can occur. The aim of the present work was to develop a French translation of this scale and to provide an initial validation of this instrument. The French version was created using translation and backtranslation procedures and was tested on 455 healthy students. Congruence between the original and the translated scales was .98 (Tucker’s phi), and internal consistency of the translation reached .92 (Cronbach’s α). Moreover, test-retest scores were highly correlated. Altogether, the initial validation of the French version of the DERS (DERS-F) offers satisfactory results and permits the use of this instrument to map difficulties in emotion regulation in both clinical and research contexts.
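The two statistics reported in this abstract have standard textbook definitions, which the following Python sketch illustrates. It is not the authors' analysis code; the function names and the toy data are assumptions. Cronbach's α is computed as k/(k-1) · (1 - Σ item variances / variance of the total score), and Tucker's phi as the uncentered cosine between two vectors of factor loadings.

```python
from math import sqrt
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha for a list of respondents, each a list of k item scores."""
    k = len(scores[0])
    # Sample variance of each item across respondents.
    item_vars = [variance([row[i] for row in scores]) for i in range(k)]
    # Sample variance of the respondents' total scores.
    total_var = variance([sum(row) for row in scores])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


def tucker_phi(x, y):
    """Tucker's congruence coefficient between two loading vectors."""
    num = sum(a * b for a, b in zip(x, y))
    return num / sqrt(sum(a * a for a in x) * sum(b * b for b in y))


if __name__ == "__main__":
    # Toy data (hypothetical): four respondents answering three items.
    answers = [[2, 3, 3], [4, 4, 5], [1, 2, 1], [5, 4, 5]]
    print(cronbach_alpha(answers))
    # Hypothetical loadings of the same items in two language versions.
    print(tucker_phi([0.7, 0.6, 0.8], [0.68, 0.62, 0.75]))
```

Both functions use only the standard library; `statistics.variance` is the sample variance, which is the usual choice for α since the same denominator appears in the item and total-score variances.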


1999 ◽  
Vol 82 (11) ◽  
pp. 1412-1416 ◽  
Author(s):  
Wojciech Zareba ◽  
John Horan ◽  
Arthur Moss ◽  
Joel Kanouse ◽  
◽  
...  

Summary: Our previous prospective study of post-infarction patients described a strong and significant association of increased plasma D-dimer concentrations in those who experienced a subsequent coronary death or non-fatal myocardial infarction. In the present study, we compare results on stored plasma obtained two months after the index myocardial infarction from 1,038 patients of this trial, using a simple automated latex agglutination (LA) assay in parallel with the standard ELISA test. Results show a somewhat higher mean value for the LA assay (702 ± 1092 vs. 638 ± 986 ng/ml, p = 0.0002), a strong linear correlation of the two assays (r = 0.86) and 88% agreement for values below 500 ng/ml by the ELISA test. D-dimer concentrations determined by each assay were highly correlated in patients with subsequent coronary artery events (p = 0.93) and quartile values for both the LA and ELISA were equally predictive of such events (p = 0.003 and p = 0.001, respectively). This is the first demonstration that a latex agglutination assay for D-dimer can be used to assess the prognostic risk of recurrent coronary thrombotic disease after myocardial infarction.


1988 ◽  
Vol 59 (02) ◽  
pp. 273-276 ◽  
Author(s):  
J Dawes ◽  
D A Pratt ◽  
M S Dewar ◽  
F E Preston

Summary: Thrombospondin, a trimeric glycoprotein contained in the platelet α-granules, has been proposed as a marker of in vivo platelet activation. However, it is also synthesised by a range of other cells. The extraplatelet contribution to plasma levels of thrombospondin was therefore estimated by investigating the relationship between plasma thrombospondin levels and platelet count in samples from profoundly thrombocytopenic patients with marrow hypoplasia, using the platelet-specific α-granule protein β-thromboglobulin as control. Serum concentrations of both proteins were highly correlated with platelet count, but while plasma β-thromboglobulin levels and platelet count also correlated, there was no relationship between the number of platelets and thrombospondin concentrations in plasma. Serial sampling of patients recovering from bone marrow depression indicated that the plasma thrombospondin contributed by platelets is superimposed on a background concentration of at least 50 ng/ml probably derived from a non-platelet source, and plasma thrombospondin levels do not simply reflect platelet release.


1996 ◽  
Vol 75 (05) ◽  
pp. 772-777 ◽  
Author(s):  
Sybille Albrecht ◽  
Matthias Kotzsch ◽  
Gabriele Siegert ◽  
Thomas Luther ◽  
Heinz Großmann ◽  
...  

Summary: The plasma tissue factor (TF) concentration was correlated to factor VII concentration (FVIIag) and factor VII activity (FVIIc) in 498 healthy volunteers ranging in age from 17 to 64 years. Immunoassays using monoclonal antibodies (mAbs) were developed for the determination of TF and FVIIag in plasma. The mAbs and the test systems were characterized. The mean value of the TF concentration was 172 ± 135 pg/ml. TF showed no age- and gender-related differences. For the total population, FVIIc, determined by a clotting test, was 110 ± 15% and FVIIag was 0.77 ± 0.19 μg/ml. FVII activity was significantly increased with age, whereas the concentration demonstrated no correlation to age in this population. FVII concentration is highly correlated with the activity as measured by clotting assay using rabbit thromboplastin. The ratio between FVIIc and FVIIag was not age-dependent, but demonstrated a significant difference between men and women. Between TF and FVII we could not detect a correlation.


Marketing ZFP ◽  
2019 ◽  
Vol 41 (4) ◽  
pp. 21-32
Author(s):  
Dirk Temme ◽  
Sarah Jensen

Missing values are ubiquitous in empirical marketing research. If missing data are not dealt with properly, this can lead to a loss of statistical power and distorted parameter estimates. While traditional approaches for handling missing data (e.g., listwise deletion) are still widely used, researchers can nowadays choose among various advanced techniques such as multiple imputation analysis or full-information maximum likelihood estimation. Due to the available software, using these modern missing data methods does not pose a major obstacle. Still, their application requires a sound understanding of the prerequisites and limitations of these methods as well as a deeper understanding of the processes that have led to missing values in an empirical study. This article is Part 1 and first introduces Rubin’s classical definition of missing data mechanisms and an alternative, variable-based taxonomy, which provides a graphical representation. Secondly, a selection of visualization tools available in different R packages for the description and exploration of missing data structures is presented.
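Rubin's three missing data mechanisms mentioned in this abstract can be made concrete with a small simulation. The Python sketch below is illustrative only and is not taken from the article (which works with R packages); the function name `make_missing`, the median-based rule, and the `rate` parameter are assumptions. Under MCAR the probability that `y` is missing is constant; under MAR it depends only on the always-observed `x`; under MNAR it depends on the value of `y` itself.

```python
import random


def make_missing(rows, mechanism, rate=0.3, seed=0):
    """Return a copy of rows (x, y) in which some y values are set to None.

    mechanism -- "MCAR": missingness is independent of x and y
                 "MAR":  missingness depends only on the observed x
                 "MNAR": missingness depends on the (unobserved) y itself
    rate      -- baseline missingness probability (hypothetical parameter)
    """
    rng = random.Random(seed)
    x_med = sorted(x for x, _ in rows)[len(rows) // 2]
    y_med = sorted(y for _, y in rows)[len(rows) // 2]
    high = min(1.0, 2 * rate)  # probability used above the median
    out = []
    for x, y in rows:
        if mechanism == "MCAR":
            p = rate
        elif mechanism == "MAR":
            p = high if x > x_med else 0.0
        elif mechanism == "MNAR":
            p = high if y > y_med else 0.0
        else:
            raise ValueError(f"unknown mechanism: {mechanism}")
        out.append((x, None if rng.random() < p else y))
    return out


if __name__ == "__main__":
    data = [(i, 2 * i + 1) for i in range(10)]
    for mech in ("MCAR", "MAR", "MNAR"):
        print(mech, make_missing(data, mech))
```

In this toy data `y` increases with `x`, so the MAR and MNAR patterns look alike in the output; the distinction is in what drives the missingness, which is exactly why the mechanism cannot be read off the observed data alone.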

