Taxonomy-based data representation for data mining: an example of the magnitude of risk associated with H. pylori infection

2021 · Vol 14 (1)
Author(s): Inese Polaka, Danute Razuka-Ebela, Jin Young Park, Marcis Leja

Background: The amount of available and potentially significant data describing study subjects is ever growing with the introduction and integration of different registries and data banks. The individual specific attributes of these data are not always necessary; more often, membership in a specific group (e.g. diet, social ‘bubble’, living area) is enough to build a successful machine learning or data mining model without overfitting it. Therefore, in this article we propose an approach to building taxonomies using clustering to replace detailed data from large heterogeneous data sets from different sources, while improving interpretability. We used the GISTAR study database, which holds exhaustive self-assessment questionnaire data, to demonstrate this approach in the task of differentiating between H. pylori positive and negative study participants and assessing their potential risk factors. We compared the results of taxonomy-based classification to the results of classification using raw data.
Results: Evaluation of our approach was carried out using 6 classification algorithms that induce rule-based or tree-based classifiers. The taxonomy-based classification results show no significant loss of information, with similar and up to 2.5% better classification accuracy. Information held by 10 or more attributes can be replaced by one attribute denoting membership in a cluster in a hierarchy at a specific cut. The clusters created this way can be easily interpreted by researchers (doctors, epidemiologists) and describe the co-occurring features in the group, which is significant for the specific task.
Conclusions: While there are always features and measurements that must be used in data analysis as they are, describing study subjects with taxonomies in parallel allows using membership in specific naturally occurring groups and their impact on an outcome. This can decrease the risk of overfitting (picking attributes and values specific to the training set without explaining the underlying conditions), improve the accuracy of the models, and improve privacy protection of study participants by decreasing the amount of specific information used to identify the individual.
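
As a rough illustration of the idea, the sketch below (Python, using a purely synthetic stand-in for a block of related questionnaire items; the column names and the five-cluster cut are hypothetical and not taken from the GISTAR data) replaces ten raw attributes with a single cluster-membership attribute obtained from a hierarchical clustering cut.

```python
# Minimal sketch: replace a block of raw questionnaire attributes with one
# cluster-membership attribute taken from a hierarchical clustering cut.
# Column names and data are hypothetical.
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# Hypothetical stand-in for ten related questionnaire items per participant.
diet_items = pd.DataFrame(
    rng.integers(0, 5, size=(200, 10)),
    columns=[f"diet_q{i}" for i in range(10)],
)

# Build the hierarchy over participants and cut it at a fixed number of
# clusters; the cut level controls the granularity of the taxonomy.
Z = linkage(diet_items.to_numpy(), method="ward")
diet_group = fcluster(Z, t=5, criterion="maxclust")

# The single derived attribute replaces the ten raw columns in the
# downstream classification data set.
features = pd.DataFrame({"diet_group": diet_group})
print(features["diet_group"].value_counts())
```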

2019 · Vol 16 (2) · pp. 445-452
Author(s): Kishore S. Verma, A. Rajesh, Adeline J. S. Johnsana

K-anonymization is one of the most widely used approaches to protect individual records from privacy leakage attacks in the Privacy Preserving Data Mining (PPDM) arena. An anonymized dataset typically impacts the effectiveness of data mining results, so PPDM researchers currently direct their efforts at finding the optimum trade-off between privacy and utility. This work aims to bring out the optimum classifier from a set of strong data mining classifiers that are capable of generating valuable classification results on utility-aware k-anonymized data sets. We performed the analysis on data sets anonymized with respect to anonymity utility factors such as null value count and transformation pattern loss. The experimentation is done with three widely used classifiers, HNB, PART and J48, and these classifiers are evaluated with accuracy, F-measure and ROC-AUC, which are well-established measures of classification performance. Our experimental analysis reveals the best classifiers on the utility-aware anonymized data sets produced by Cell-oriented Anonymization (CoA), Attribute-oriented Anonymization (AoA) and Record-oriented Anonymization (RoA).
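
A hedged sketch of the evaluation protocol is given below: candidate classifiers are scored on an already-anonymized data set with accuracy, F-measure and ROC-AUC via cross-validation. The Weka classifiers HNB, PART and J48 used in the paper are approximated with scikit-learn stand-ins, and the data set here is synthetic.

```python
# Sketch of the evaluation protocol on an (already anonymized) data set.
# HNB, PART and J48 are Weka classifiers; scikit-learn stand-ins are used.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X_anon, y = make_classification(n_samples=500, n_features=12, random_state=42)

candidates = {
    "naive Bayes (stand-in for HNB)": GaussianNB(),
    "C4.5-style tree (stand-in for J48/PART)": DecisionTreeClassifier(max_depth=5, random_state=42),
}
for name, clf in candidates.items():
    scores = cross_validate(clf, X_anon, y, cv=10,
                            scoring=["accuracy", "f1", "roc_auc"])
    print(name,
          round(scores["test_accuracy"].mean(), 3),
          round(scores["test_f1"].mean(), 3),
          round(scores["test_roc_auc"].mean(), 3))
```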


Author(s): J. Anuradha, B. K. Tripathy

The data used in real-world applications are uncertain and vague, and several models to handle such data efficiently have been put forth so far. It has been found that the individual models have some strong points and certain weak points. Efforts have been made to combine these models so that the hybrid models capitalize on the strong points of the constituent models. Dubois and Prade in 1990 combined rough sets and fuzzy sets to develop two models, of which the rough fuzzy model is a popular one and is used in many fields to handle uncertainty-based data sets very well. Particle Swarm Optimization (PSO), further combined with the rough fuzzy model, is expected to produce optimized solutions. Similarly, multi-label classification in the context of data mining deals with situations where an object or a set of objects can be assigned to multiple classes. In this chapter, the authors present a rough fuzzy PSO algorithm that performs classification of multi-label data sets, and through experimental analysis its efficiency and superiority have been established.
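
To illustrate only the optimization component of the hybrid, a minimal generic PSO loop is sketched below; the rough-fuzzy multi-label objective is replaced by a placeholder fitness function, so this is not the authors' algorithm, just the canonical PSO update.

```python
# Minimal generic PSO loop; the rough-fuzzy objective is replaced by a
# placeholder (sphere) fitness function for brevity.
import numpy as np

def fitness(position):
    # Placeholder objective; in the chapter's setting this would score a
    # candidate rough-fuzzy multi-label classifier.
    return np.sum(position ** 2)

rng = np.random.default_rng(1)
n_particles, dim, iters = 20, 5, 100
w, c1, c2 = 0.7, 1.5, 1.5          # inertia and acceleration coefficients

pos = rng.uniform(-1, 1, (n_particles, dim))
vel = np.zeros((n_particles, dim))
pbest = pos.copy()
pbest_val = np.array([fitness(p) for p in pos])
gbest = pbest[pbest_val.argmin()].copy()

for _ in range(iters):
    r1, r2 = rng.random((2, n_particles, dim))
    vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
    pos = pos + vel
    vals = np.array([fitness(p) for p in pos])
    improved = vals < pbest_val
    pbest[improved], pbest_val[improved] = pos[improved], vals[improved]
    gbest = pbest[pbest_val.argmin()].copy()

print("best fitness:", fitness(gbest))
```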


Author(s): Protima Banerjee

Over the past few decades, data mining has emerged as a field of research critical to understanding and assimilating the large stores of data accumulated by corporations, government agencies, and laboratories. Early on, mining algorithms and techniques were limited to relational data sets coming directly from On-Line Transaction Processing (OLTP) systems, or from a consolidated enterprise data warehouse. However, recent work has begun to extend the limits of data mining strategies to include “semi-structured data such as HTML and XML texts, symbolic sequences, ordered trees and relations represented by advanced logics” (Washio and Motoda, 2003). The goal of any data mining endeavor is to detect and extract patterns in the data sets being examined. Semantic data mining is a novel approach that makes use of graph topology, one of the most fundamental and generic mathematical constructs, together with semantic meaning, to scan semi-structured data for patterns. This technique has the potential to be especially powerful because graph data representation can capture so many types of semantic relationships. Current research efforts in this field are focused on utilizing graph-structured semantic information to derive complex and meaningful relationships in a wide variety of application areas, with national security and web mining foremost among these. In this article, we review significant segments of recent data mining research that feed into semantic data mining and describe some promising application areas.
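
As a small, hypothetical illustration of pattern search over labelled graph data, the sketch below uses networkx subgraph matching; the graph, its node and edge labels, and the pattern are invented for the example and are not from the article.

```python
# Toy illustration of pattern search over a labelled "semantic" graph.
import networkx as nx
from networkx.algorithms import isomorphism

# Nodes carry type labels, edges carry relation labels (all hypothetical).
G = nx.DiGraph()
G.add_node("a", label="Person")
G.add_node("b", label="Organization")
G.add_node("c", label="Person")
G.add_edge("a", "b", rel="works_for")
G.add_edge("c", "b", rel="works_for")

# Pattern: a Person with a works_for edge to an Organization.
pattern = nx.DiGraph()
pattern.add_node("p", label="Person")
pattern.add_node("o", label="Organization")
pattern.add_edge("p", "o", rel="works_for")

matcher = isomorphism.DiGraphMatcher(
    G, pattern,
    node_match=isomorphism.categorical_node_match("label", None),
    edge_match=isomorphism.categorical_edge_match("rel", None),
)
for mapping in matcher.subgraph_isomorphisms_iter():
    print(mapping)   # each mapping is one occurrence of the pattern in G
```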


Author(s): Riyaz Sikora, O'la Al-Laymoun

Distributed data mining and ensemble learning are two methods that aim to address the issue of data scaling, which is required to process the large amounts of data collected these days. Distributed data mining looks at how distributed data can be effectively mined without having to collect it at one central location. Ensemble learning techniques aim to create a meta-classifier by combining several classifiers created on the same data, improving on their individual performance. In this chapter, the authors use concepts from both of these fields to create a modified and improved version of the standard stacking ensemble learning technique by using a Genetic Algorithm (GA) for creating the meta-classifier. They test the GA-based stacking algorithm on ten data sets from the UCI Data Repository and show the improvement in performance over the individual learning algorithms as well as over the standard stacking algorithm.
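
For reference, a minimal sketch of the standard stacking baseline that the chapter improves on is shown below (scikit-learn, with illustrative base learners and data set); the GA-driven construction of the meta-classifier described in the chapter is not reproduced here.

```python
# Standard stacking baseline: base learners' predictions feed a meta-classifier.
# Base learners and data set are illustrative choices, not the chapter's setup.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

base_learners = [
    ("tree", DecisionTreeClassifier(random_state=0)),
    ("nb", GaussianNB()),
    ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
]
stack = StackingClassifier(estimators=base_learners,
                           final_estimator=LogisticRegression(max_iter=1000))

print("stacking CV accuracy:",
      cross_val_score(stack, X, y, cv=5).mean().round(3))
```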


Author(s): S. K. Saravanan, G. N. K. Suresh Babu

These days, most secure data transfer takes place over the internet, and the risks involved in such transfers grow along with it. With the rise and steady progress of e-commerce, the use of credit cards (CC) for online transactions has also been increasing dramatically, and safe credit card use has become a requirement of the times. Credit card fraud detection is therefore highly significant, as fraudsters are increasing every day. The intention of this survey is to examine the issues associated with credit card fraud behaviour using data mining methodologies. Data mining is a well-defined procedure that takes data as input and produces output in the form of models or patterns. This investigation is beneficial both for credit card providers choosing a suitable solution for their problem and for researchers seeking a comprehensive assessment of the literature in this field.


Author(s): Göran Friman

Objective: To describe the distribution of risk, diagnosis and pharmacological treatment of diabetes and hypertension after seven years among patients provided with opportunistic medical screening in a dental setting.
Material and Methods: The 170 participants in the initial screening were asked to take part in a seven-year follow-up study. Data were collected through self-reported information in a written health declaration. Outcome measures were:
• Number of study participants who had passed away
• Prescription of antidiabetics or antihypertensives
• Changes in weight and height to calculate body mass index (BMI)
Results: The follow-up study consisted of 151 participants. Twenty had passed away. The risk of needing antihypertensive medication after seven years, among those not receiving pharmacological treatment at the initial screening, was 3.7 times greater (p=0.025, CI 1.2-11.3) for participants with a diastolic blood pressure (BP) ≥ 90 mm Hg (85 for diabetics) than for the others. The risk was 3.9 times greater (p=0.020, CI 1.2-12.6) for those with a systolic BP of 140-159 mm Hg and 54.2 times greater (p<0.0001, CI 9.8-300.3) for those with a systolic BP ≥ 160 mm Hg than for those with a systolic BP below 140 mm Hg. There were no changes in BMI.
Conclusion: At least one in ten cases of incorrect medication or undiagnosed hypertension may be identifiable through opportunistic medical screening.


2021 · Vol 115 (3) · pp. 204-214
Author(s): Michele C. McDonnall, Zhen S. McKnight

Introduction: The purpose of this study was to investigate the effect of visual impairment and correctable visual impairment (i.e., uncorrected refractive errors) on being out of the labor force and on unemployment. The effect of health on labor force status was also investigated. Method: National Health and Nutrition Examination Survey (NHANES) data from 1999 to 2008 (N = 15,650) were used for this study. Participants were classified into three vision status groups: normal, correctable visual impairment, and visual impairment. The statistical analyses utilized were chi-square tests and logistic regression. Results: Having a visual impairment was significantly associated with being out of the labor force, while having a correctable visual impairment was not. Conversely, having a correctable visual impairment was associated with unemployment, while having a visual impairment was not. Being out of the labor force was not significantly associated with health for those with a visual impairment, although it was for those with correctable visual impairments and normal vision. Discussion: Given previous research, it was surprising to find that health was not associated with being out of the labor force for those with visual impairments. Perhaps other disadvantages identified in this study for people with visual impairments contributed to their higher out-of-the-labor-force rates regardless of health. Implications for practitioners: Researchers utilizing national data sets that rely on self-reports to identify visual impairments should realize that some of those who self-identify as visually impaired may actually have correctable visual impairments. Further research is needed to understand why a majority of people with visual impairments are not seeking employment and have removed themselves from the labor force.


2021 · Vol 11 (1)
Author(s): Jonas Albers, Angelika Svetlove, Justus Alves, Alexander Kraupner, Francesca di Lillo, ...

Although X-ray based 3D virtual histology is an emerging tool for the analysis of biological tissue, it falls short in terms of specificity when compared to conventional histology. The aim was therefore to establish a novel approach that combines the 3D information provided by microCT with the high specificity that only (immuno-)histochemistry can offer. For this purpose, we developed a software frontend which utilises an elastic transformation technique to accurately co-register various histological and immunohistochemical stainings with free-propagation phase-contrast synchrotron radiation microCT. We demonstrate that the precision of the overlay of both imaging modalities is significantly improved by performing our elastic registration workflow, as evidenced by calculation of the displacement index. To illustrate the need for an elastic co-registration approach, we examined specimens from a mouse model of breast cancer with injected metal-based nanoparticles. Using the elastic transformation pipeline, we were able to co-localise the nanoparticles with specifically stained cells or tissue structures within their three-dimensional anatomical context. Additionally, we performed a semi-automated tissue structure and cell classification. This workflow provides new insights for histopathological analysis by combining CT-specific three-dimensional information with the cell- and tissue-specific information provided by classical histology.
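
A hedged sketch of one possible elastic (B-spline) co-registration step is shown below using SimpleITK, standing in for the authors' registration frontend; the file names are placeholders, the images are assumed to be grayscale 2D slices, and the parameters are illustrative rather than the authors' settings.

```python
# Sketch of an elastic (B-spline) 2D co-registration of a histology slice
# onto a microCT slice with SimpleITK. File names and parameters are
# placeholders; images are assumed grayscale.
import SimpleITK as sitk

fixed = sitk.ReadImage("microct_slice.tif", sitk.sitkFloat32)     # CT slice
moving = sitk.ReadImage("histology_slice.tif", sitk.sitkFloat32)  # stained section

# Coarse B-spline control-point grid; a denser grid allows more deformation.
init_tx = sitk.BSplineTransformInitializer(fixed, transformDomainMeshSize=[8, 8])

reg = sitk.ImageRegistrationMethod()
reg.SetMetricAsMattesMutualInformation(numberOfHistogramBins=50)  # multi-modal metric
reg.SetOptimizerAsLBFGSB(gradientConvergenceTolerance=1e-5, numberOfIterations=100)
reg.SetInterpolator(sitk.sitkLinear)
reg.SetInitialTransform(init_tx, inPlace=True)

out_tx = reg.Execute(fixed, moving)

# Resample the histology slice into the CT slice's coordinate frame.
warped = sitk.Resample(moving, fixed, out_tx, sitk.sitkLinear, 0.0)
sitk.WriteImage(warped, "histology_registered.tif")
```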


2021 · pp. 1-13
Author(s): Yikai Zhang, Yong Peng, Hongyu Bian, Yuan Ge, Feiwei Qin, ...

Concept factorization (CF) is an effective matrix factorization model which has been widely used in many applications. In CF, the linear combination of data points serves as the dictionary, based on which CF can be performed in both the original feature space and the reproducing kernel Hilbert space (RKHS). Conventional CF treats each dimension of the feature vector equally during the data reconstruction process, which might violate the common sense that different features have different discriminative abilities and therefore contribute differently to pattern recognition. In this paper, we introduce an auto-weighting variable into the conventional CF objective function to adaptively learn the corresponding contributions of different features and propose a new model termed Auto-Weighted Concept Factorization (AWCF). In AWCF, on the one hand, feature importance can be quantitatively measured by the auto-weighting variable, with features that have better discriminative abilities assigned larger weights; on the other hand, we can obtain a more efficient data representation to depict its semantic information. The detailed optimization procedure for the AWCF objective function is derived, and its complexity and convergence are also analyzed. Experiments are conducted on both synthetic and representative benchmark data sets, and the clustering results demonstrate the effectiveness of AWCF in comparison with related models.
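
A plausible form of such an auto-weighted objective, written in our own notation rather than necessarily the paper's, weights the per-feature reconstruction error with a learned simplex-constrained vector:

```latex
% Notation is ours, not necessarily the paper's: X is the d-by-n data matrix
% with samples as columns, W and V are the nonnegative CF factors, theta
% collects the per-feature weights, and gamma > 1 smooths the weights.
\min_{W \ge 0,\; V \ge 0,\; \theta}\;
  \sum_{j=1}^{d} \theta_j^{\gamma}\,
  \bigl\lVert X_{j\cdot} - X_{j\cdot} W V^{\top} \bigr\rVert_2^2
\qquad \text{s.t.} \quad \sum_{j=1}^{d} \theta_j = 1,\; \theta_j \ge 0 .
```

Under this kind of formulation, features whose rows are reconstructed poorly receive smaller weights, while the exponent γ keeps the weight vector from collapsing onto a single feature; the exact constraint and exponent used by AWCF may differ.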

