categorical attribute
Recently Published Documents



Author(s):  
Elena Battaglia ◽  
Simone Celano ◽  
Ruggero G. Pensa

Most privacy-preserving machine learning methods are designed around continuous or numeric data, but categorical attributes are common in many application scenarios, including clinical and health records, census and survey data. Distance-based methods, in particular, have limited applicability to categorical data, since they do not capture the complexity of the relationships among different values of a categorical attribute. Although distance learning algorithms exist for categorical data, they may disclose private information about individual records if applied to a secret dataset. To address this problem, we introduce a differentially private family of algorithms for learning distances between any pair of values of a categorical attribute according to the way they are co-distributed with the values of other categorical attributes forming the so-called context. We define different variants of our algorithm and show empirically that our approach consumes little privacy budget while providing accurate distances, making it suitable for distance-based applications such as clustering and classification.
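The general idea of learning distances from a noisy co-distribution can be illustrated with a minimal sketch. It is not the authors' algorithm: it assumes a single context attribute, the standard Laplace mechanism on the contingency table, and Euclidean distance between conditional profiles. The names `laplace_noise` and `noisy_context_distances` are hypothetical.

```python
import math
import random
from collections import Counter


def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_context_distances(records, target, context, epsilon, seed=None):
    """Hypothetical helper: distances between values of `target`, computed
    from a Laplace-perturbed contingency table with one context attribute."""
    rng = random.Random(seed)
    # Assumption: the value domains are public; they are read from the
    # data here only to keep the sketch short.
    target_values = sorted({r[target] for r in records})
    context_values = sorted({r[context] for r in records})
    counts = Counter((r[target], r[context]) for r in records)

    # Each record falls in exactly one cell, so adding or removing a record
    # changes the table by at most 1 in L1 norm: noise scale is 1 / epsilon.
    noisy = {
        (t, c): max(counts[(t, c)] + laplace_noise(1.0 / epsilon, rng), 0.0)
        for t in target_values
        for c in context_values
    }

    # Row-normalise into noisy estimates of P(context value | target value).
    profiles = {}
    for t in target_values:
        total = sum(noisy[(t, c)] for c in context_values)
        profiles[t] = [
            noisy[(t, c)] / total if total > 0 else 1.0 / len(context_values)
            for c in context_values
        ]

    # Euclidean distance between the conditional profiles of each value pair.
    return {
        (a, b): math.dist(profiles[a], profiles[b])
        for a in target_values
        for b in target_values
    }
```

Clipping and normalising are post-processing steps, so they do not consume additional privacy budget. With several context attributes, one option under sequential composition would be to split epsilon evenly across the contingency tables.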


Author(s):  
J. Avanijaa et al.

House prices fluctuate every year because of changes in land value and in the infrastructure in and around an area. A centralised system that predicts house prices in relation to neighbourhood and infrastructure would help customers estimate the price of a house and decide where and when to buy. Several factors are taken into consideration when predicting the worth of a house, such as location, neighbourhood and amenities such as garage space. Developing a model starts with pre-processing the data to remove discrepancies and to fill null values or remove outliers, so that the data is ready to be processed. Categorical attributes can be converted into the required attributes using one-hot encoding. The house price is then predicted using the XGBoost regression technique.
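As an illustration of the one-hot encoding step, here is a minimal standard-library sketch. The function name `one_hot_encode` and the column names in the usage example are hypothetical, and the regression step is not shown: the encoded rows would be the numeric input to a gradient-boosted regressor such as XGBoost.

```python
def one_hot_encode(rows, column):
    """Replace a categorical column with one 0/1 indicator column per value."""
    categories = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        # Copy every other column unchanged.
        new_row = {key: value for key, value in row.items() if key != column}
        # Add one indicator per category seen in the data.
        for category in categories:
            new_row[f"{column}_{category}"] = 1 if row[column] == category else 0
        encoded.append(new_row)
    return encoded


houses = [
    {"area": 80, "neighbourhood": "north"},
    {"area": 120, "neighbourhood": "south"},
]
encoded_houses = one_hot_encode(houses, "neighbourhood")
```

In practice the category list should be fixed on the training data and reused at prediction time, so that both sets end up with the same columns.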


Author(s):  
Shihua Liu ◽  
Hao Zhang ◽  
Xianghua Liu

A two-stage clustering framework and a clustering algorithm for mixed-attribute data based on density peaks and Goodall distance are proposed. First, the subset of numerical attributes of the dataset is clustered, and the result is mapped into a one-dimensional categorical attribute and added to the subset of categorical attribute data. Finally, the new dataset is clustered by the density peaks clustering algorithm to obtain the final result. Experiments on three commonly used UCI datasets show that this algorithm can effectively realize mixed-attribute clustering and produces better clustering results than the traditional K-prototypes algorithm. The clustering accuracy on the Acute, Heart and Credit datasets is on average 17%, 24%, and 21% higher, respectively, than that of K-prototypes.
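The mapping step and a Goodall-style distance can be sketched as follows. The abstract does not say which Goodall variant is used, so this sketch assumes one common form in which a match on a rare value counts for more than a match on a frequent value, and a mismatch counts for nothing. The helper names are hypothetical, the numeric cluster labels are taken as given, and the density peaks stage is not shown.

```python
from collections import Counter


def append_cluster_label(categorical_rows, numeric_labels):
    """Stage 1 output: add the numeric-cluster label as one more categorical attribute."""
    return [
        tuple(row) + (f"c{label}",)
        for row, label in zip(categorical_rows, numeric_labels)
    ]


def make_goodall_distance(rows):
    """Build a Goodall-style distance from value frequencies in `rows`."""
    n = len(rows)
    d = len(rows[0])
    freqs = [Counter(row[k] for row in rows) for k in range(d)]

    def p2(k, value):
        # Estimated probability that two distinct records both take `value`.
        f = freqs[k][value]
        return f * (f - 1) / (n * (n - 1))

    def distance(x, y):
        # Matching attributes contribute 1 - p2; mismatches contribute 0.
        similarity = sum(1.0 - p2(k, x[k]) for k in range(d) if x[k] == y[k]) / d
        return 1.0 - similarity

    return distance


categorical = [("m", "yes"), ("m", "no"), ("f", "yes"), ("f", "yes")]
merged = append_cluster_label(categorical, [0, 0, 1, 1])
goodall = make_goodall_distance(merged)
```

Under this measure two identical records can still have a non-zero distance, because matches on frequent values are discounted.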


Algorithms ◽  
2020 ◽  
Vol 13 (11) ◽  
pp. 302
Author(s):  
Rosane Minghim ◽  
Liz Huancapaza ◽  
Erasmo Artur ◽  
Guilherme P. Telles ◽  
Ivar V. Belizario

Feature analysis has become a critical task in data analysis and visualization. Graph structures are very flexible in terms of representation and may encode important information on features, but it is challenging to produce layouts that are adequate for analysis tasks. In this study, we propose and develop similarity-based graph layouts with the purpose of locating relevant patterns in sets of features, thus supporting feature analysis and selection. We apply a tree layout in the first step of the strategy, to accomplish node placement and overview based on feature similarity. By drawing the remainder of the graph edges on demand, further grouping and relationships among features are revealed. We evaluate those groups and relationships in terms of their effectiveness in exploring feature sets for data analysis. Correlation of features with a target categorical attribute and feature ranking are added to support the task. Multidimensional projections are employed to plot the dataset based on selected attributes to reveal the effectiveness of the feature set. Our results have shown that the tree-graph layout framework allows for a number of observations that are very important in user-centric feature selection, and not easy to observe by any other available tool. They provide a way of finding relevant and irrelevant features, spurious sets of noisy features, groups of similar features, and opposite features, all of which are essential tasks in different scenarios of data analysis. Case studies in application areas centered on documents, images and sound data demonstrate the ability of the framework to quickly reach a satisfactory compact representation from a larger feature set.


2020 ◽  
Vol 9 (1) ◽  
pp. 1607-1612

A new technique is proposed for splitting categorical data during decision tree learning. The technique is based on class probability representations and manipulations of the class labels corresponding to the distinct values of categorical attributes. For each categorical attribute, an aggregate similarity in terms of class probabilities is computed. The attribute with the highest aggregate similarity is selected as the best attribute, and the data in the current node of the decision tree is divided into as many subsets as the best categorical split attribute has distinct values. Experiments with the proposed method show that it performs better than many competing methods in terms of efficiency, ease of use, understanding, and output results, and that it will be useful in many modern applications.
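Two building blocks of this description can be sketched in a few lines: the class probability vector for each distinct value of a categorical attribute, and the multiway split that creates one subset per value. The aggregate similarity measure itself is not specified in the abstract, so it is deliberately left out. The function names and the toy data are hypothetical.

```python
from collections import Counter, defaultdict


def class_probabilities(rows, attribute, label):
    """Map each distinct attribute value to its class probability vector."""
    counts = defaultdict(Counter)
    for row in rows:
        counts[row[attribute]][row[label]] += 1
    classes = sorted({row[label] for row in rows})
    return {
        value: {c: counter[c] / sum(counter.values()) for c in classes}
        for value, counter in counts.items()
    }


def multiway_split(rows, attribute):
    """Divide the rows of a node into one subset per distinct attribute value."""
    subsets = defaultdict(list)
    for row in rows:
        subsets[row[attribute]].append(row)
    return dict(subsets)


node = [
    {"colour": "red", "y": "pos"},
    {"colour": "red", "y": "neg"},
    {"colour": "blue", "y": "pos"},
]
probabilities = class_probabilities(node, "colour", "y")
children = multiway_split(node, "colour")
```

A scoring function over these probability vectors could then be plugged in to rank candidate attributes before calling `multiway_split` on the winner.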


Understanding customer sentiment is very important in advertising. To appeal to current and potential customers, a company must understand the market's interests. Companies can segment their customers by using surveys and telemetry data to learn the customers' interests. One way of segmenting customers is to group or cluster them according to their interests and behaviors. In this study, the k-prototypes clustering algorithm, which combines and improves on the k-means and k-modes algorithms, is used to cluster behavioral data containing both numerical and categorical attributes, obtained from a survey conducted on teenagers, into 4, 5, and 6 clusters. Each cluster contains teenagers whose behavior differs from that of the other clusters. By analyzing the results, advertisers will be able to define a profile that indicates the teenagers' interests regarding the internet, social media and text messaging, effectively revealing the kind of ad that would be relatable to them.
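The core of k-prototypes is a mixed dissimilarity: squared Euclidean distance on the numerical attributes plus a weighted count of mismatches on the categorical ones. The sketch below shows one assignment step and one prototype update under that definition. The weight `gamma`, the function names and the toy records are assumptions for illustration, not values from the study.

```python
from collections import Counter
from statistics import mean


def mixed_dissimilarity(x_num, x_cat, proto_num, proto_cat, gamma=1.0):
    """Squared Euclidean distance on numbers plus gamma times categorical mismatches."""
    numeric_part = sum((a - b) ** 2 for a, b in zip(x_num, proto_num))
    categorical_part = sum(a != b for a, b in zip(x_cat, proto_cat))
    return numeric_part + gamma * categorical_part


def assign_to_prototypes(records, prototypes, gamma=1.0):
    """Return, for each (numeric, categorical) record, the index of its nearest prototype."""
    return [
        min(
            range(len(prototypes)),
            key=lambda j: mixed_dissimilarity(num, cat, *prototypes[j], gamma),
        )
        for num, cat in records
    ]


def update_prototype(members):
    """New prototype: per-attribute mean for numbers, mode for categories."""
    numeric_columns = zip(*(num for num, _ in members))
    categorical_columns = zip(*(cat for _, cat in members))
    new_num = tuple(mean(column) for column in numeric_columns)
    new_cat = tuple(Counter(column).most_common(1)[0][0] for column in categorical_columns)
    return new_num, new_cat


# Toy records: (hours online per day,), (preferred channel,)
records = [((1.0,), ("sms",)), ((1.2,), ("sms",)), ((5.0,), ("social",))]
prototypes = [((1.0,), ("sms",)), ((5.0,), ("social",))]
labels = assign_to_prototypes(records, prototypes)
```

Repeating the assignment and update steps until the labels stop changing gives the usual iterative procedure. The choice of `gamma` controls how much the categorical attributes influence the clusters relative to the numerical ones.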


Symmetry ◽  
2018 ◽  
Vol 10 (8) ◽  
pp. 333
Author(s):  
Jinyan Wang ◽  
Guoqing Cai ◽  
Chen Liu ◽  
Jingli Wu ◽  
Xianxian Li

Nowadays, more and more applications depend on the storage and management of semi-structured information. For scientific research and knowledge-based decision-making, such data often needs to be published; e.g., medical data is released to implement a computer-assisted clinical decision support system. Since this data contains individuals' private information, it must be appropriately anonymized before being released. However, the existing anonymization method based on l-diversity for hierarchical data may be vulnerable to a serious similarity attack and cannot protect data privacy very well. In this paper, we utilize fuzzy sets to divide levels for sensitive numerical and categorical attribute values uniformly (a categorical attribute value can be converted into a numerical attribute value according to its frequency of occurrence), and then transform the value levels to sensitivity levels. The privacy model $(\alpha_{levh}, k)$-anonymity for hierarchical data with multi-level sensitivity is proposed. Furthermore, we design a privacy-preserving approach to achieve this privacy model. Experimental results demonstrate that our approach is clearly superior to the existing anonymization approach for hierarchical data in terms of utility and security.
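The conversion of a categorical value into a numerical one by its frequency of occurrence can be sketched as below. The abstract does not say whether raw counts or relative frequencies are used, so relative frequency is an assumption here, and the fuzzy-set level division is not shown. The function name and the example values are hypothetical.

```python
from collections import Counter


def frequency_encode(values):
    """Replace each categorical value with its relative frequency in the column."""
    counts = Counter(values)
    total = len(values)
    return [counts[value] / total for value in values]


diagnoses = ["flu", "flu", "hiv", "flu"]
encoded = frequency_encode(diagnoses)
```

Once encoded this way, categorical and numerical sensitive values live on a common numeric scale and can be divided into levels by the same procedure.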

