Dimension Reduction for Objects Composed of Vector Sets

Abstract: Dimension reduction and feature selection are fundamental tools for machine learning and data mining. Most existing methods, however, assume that objects are represented by a single vectorial descriptor. In reality, some description methods assign unordered sets or graphs of vectors to a single object, where each vector is assumed to have the same number of dimensions but is drawn from a different probability distribution. Moreover, some applications (such as pose estimation) may require the recognition of individual vectors (nodes) of an object. In such cases it is essential that the nodes within a single object remain distinguishable after dimension reduction. In this paper we propose new discriminant analysis methods that satisfy two criteria at the same time: separating classes and separating the nodes of an object instance. We analyze and evaluate our methods on several synthetic and real-world datasets.

Zero-Shot Feature Selection via Transferring Supervised Knowledge

International Journal of Data Warehousing and Mining ◽

10.4018/ijdwm.2021040101 ◽

2021 ◽

Vol 17 (2) ◽

pp. 1-20

Author(s):

Zheng Wang ◽

Qiao Wang ◽

Tingzhang Zhao ◽

Chaokun Wang ◽

Xiaojun Ye

Keyword(s):

Machine Learning ◽

Feature Selection ◽

Dimensionality Reduction ◽

Real World ◽

Rapid Growth ◽

Learning Systems ◽

Training Data ◽

Effective Technique ◽

Supervised Methods ◽

Real World Datasets

Feature selection, an effective technique for dimensionality reduction, plays an important role in many machine learning systems. Supervised knowledge can significantly improve the performance. However, faced with the rapid growth of newly emerging concepts, existing supervised methods might easily suffer from the scarcity and validity of labeled data for training. In this paper, the authors study the problem of zero-shot feature selection (i.e., building a feature selection model that generalizes well to “unseen” concepts with limited training data of “seen” concepts). Specifically, they adopt class-semantic descriptions (i.e., attributes) as supervision for feature selection, so as to utilize the supervised knowledge transferred from the seen concepts. For more reliable discriminative features, they further propose the center-characteristic loss which encourages the selected features to capture the central characteristics of seen concepts. Extensive experiments conducted on various real-world datasets demonstrate the effectiveness of the method.

OFCOD: On the Fly Clustering Based Outlier Detection Framework

Data ◽

10.3390/data6010001 ◽

2020 ◽

Vol 6 (1) ◽

pp. 1

Author(s):

Ahmed Elmogy ◽

Hamada Rizk ◽

Amany M. Sarhan

Keyword(s):

Data Mining ◽

Image Processing ◽

Intrusion Detection ◽

Real Time ◽

Outlier Detection ◽

Real World ◽

Medical Data ◽

Experimental Results ◽

Real Time Applications ◽

Real World Datasets

In data mining, outlier detection is a major challenge because it plays an important role in many applications such as medical data analysis, image processing, fraud detection, and intrusion detection. A wide variety of clustering-based approaches have been developed to detect outliers. However, they are by nature time consuming, which restricts their use in real-time applications. Furthermore, outlier detection requests are handled one at a time, which means that each request is initiated individually with a particular set of parameters. In this paper, the first on-the-fly clustering-based outlier detection framework, On the Fly Clustering Based Outlier Detection (OFCOD), is presented. OFCOD enables analysts to find outliers effectively, in time and on request, even within huge datasets. The proposed framework has been tested and evaluated using two real-world datasets with different features and applications: one with 699 records and another with five million records. The experimental results show that the proposed framework outperforms other existing approaches across several evaluation metrics.

RON-Gauss: Enhancing Utility in Non-Interactive Private Data Release

Proceedings on Privacy Enhancing Technologies ◽

10.2478/popets-2019-0003 ◽

2019 ◽

Vol 2019 (1) ◽

pp. 26-46 ◽

Cited By ~ 2

Author(s):

Thee Chanyaswad ◽

Changchang Liu ◽

Prateek Mittal

Keyword(s):

Machine Learning ◽

Real World ◽

Differential Privacy ◽

Real Data ◽

The Novel ◽

Private Data ◽

Data Release ◽

Machine Learning Applications ◽

Order Of Magnitude ◽

Real World Datasets

Abstract: A key challenge facing the design of differential privacy in the non-interactive setting is to maintain the utility of the released data. To overcome this challenge, we utilize the Diaconis-Freedman-Meckes (DFM) effect, which states that most projections of high-dimensional data are nearly Gaussian. Hence, we propose the RON-Gauss model that leverages the novel combination of dimensionality reduction via random orthonormal (RON) projection and the Gaussian generative model for synthesizing differentially-private data. We analyze how RON-Gauss benefits from the DFM effect, and present multiple algorithms for a range of machine learning applications, including both unsupervised and supervised learning. Furthermore, we rigorously prove that (a) our algorithms satisfy the strong ɛ-differential privacy guarantee, and (b) RON projection can lower the level of perturbation required for differential privacy. Finally, we illustrate the effectiveness of RON-Gauss under three common machine learning applications – clustering, classification, and regression – on three large real-world datasets. Our empirical results show that (a) RON-Gauss outperforms previous approaches by up to an order of magnitude, and (b) loss in utility compared to the non-private real data is small. Thus, RON-Gauss can serve as a key enabler for real-world deployment of privacy-preserving data release.
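
As a rough illustration of the random orthonormal (RON) projection step mentioned in the abstract, the following is a minimal sketch that builds orthonormal projection rows by Gram-Schmidt on Gaussian draws. The function names, the seed, and the sizes p=3 and d=10 are assumptions made for this example. The sketch covers only the projection: it adds no noise and fits no Gaussian model, so it provides no differential privacy and is not the authors' algorithm.

```python
import random


def random_orthonormal_rows(p, d, seed=0):
    """Return p orthonormal vectors of length d via Gram-Schmidt on Gaussian draws."""
    if p > d:
        raise ValueError("cannot build more than d orthonormal vectors in d dimensions")
    rng = random.Random(seed)
    basis = []
    while len(basis) < p:
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        for b in basis:
            # Remove the component of v that lies along an earlier basis vector.
            dot = sum(vi * bi for vi, bi in zip(v, b))
            v = [vi - dot * bi for vi, bi in zip(v, b)]
        norm = sum(vi * vi for vi in v) ** 0.5
        if norm > 1e-12:  # skip the (very unlikely) degenerate draw
            basis.append([vi / norm for vi in v])
    return basis


def project(rows, x):
    """Project a length-d sample x onto the orthonormal rows."""
    return [sum(r_i * x_i for r_i, x_i in zip(r, x)) for r in rows]


# Hypothetical sizes: reduce a 10-dimensional sample to 3 dimensions.
W = random_orthonormal_rows(p=3, d=10)
z = project(W, [1.0] * 10)
```

Because the rows are orthonormal, the projection never increases the Euclidean norm of a sample, which is one intuition for why a projection of this kind can help bound the perturbation needed downstream.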

An Optimal Categorization of Feature Selection Methods for Knowledge Discovery

Data Mining ◽

10.4018/978-1-4666-2455-9.ch005 ◽

2013 ◽

pp. 92-106

Author(s):

Harleen Kaur ◽

Ritu Chauhan ◽

M. Alam

Keyword(s):

Data Mining ◽

Feature Selection ◽

Discriminant Analysis ◽

Medical Data ◽

Stepwise Discriminant Analysis ◽

Selection Methods ◽

Medical Databases ◽

Active Research ◽

Potential Improvement ◽

Large Effort

The continuous availability of massive experimental medical data has given impetus to a large effort in developing mathematical, statistical, and computationally intelligent techniques to infer models from medical databases. Feature selection has been an active research area in the pattern recognition, statistics, and data mining communities. However, there have been relatively few studies on preprocessing the data used as input for data mining systems in the medical domain. In this chapter, the authors focus on several feature selection methods and their effectiveness in preprocessing input medical data. They evaluate several feature selection algorithms, such as Mutual Information Feature Selection (MIFS), Fast Correlation-Based Filter (FCBF), and Stepwise Discriminant Analysis (STEPDISC), with the naive Bayes machine learning algorithm and Linear Discriminant Analysis techniques. The experimental analysis of feature selection techniques in medical databases has enabled the authors to find a small number of informative features, leading to potential improvement in medical diagnosis by reducing the size of the data set, eliminating irrelevant features, and decreasing the processing time.

A NOVEL FEATURE SELECTION ALGORITHM WITH SUPERVISED MUTUAL INFORMATION FOR CLASSIFICATION

International Journal of Artificial Intelligence Tools ◽

10.1142/s0218213013500279 ◽

2013 ◽

Vol 22 (04) ◽

pp. 1350027

Author(s):

JAGANATHAN PALANICHAMY ◽

KUPPUCHAMY RAMASAMY

Keyword(s):

Machine Learning ◽

Data Mining ◽

Feature Selection ◽

Mutual Information ◽

Selection Algorithm ◽

Feature Selection Algorithm ◽

Class A ◽

Selection Algorithms ◽

The Relationship ◽

Class Variable

Feature selection is essential in data mining and pattern recognition, especially for database classification. In recent years, several feature selection algorithms have been proposed to measure the relevance of various features to each class. A suitable feature selection algorithm normally maximizes the relevancy and minimizes the redundancy of the selected features. The mutual information measure can successfully estimate the dependency of features on the entire sampling space, but it cannot exactly represent the redundancies among features. In this paper, a novel feature selection algorithm is proposed based on the maximum relevance and minimum redundancy criterion. Mutual information is used to measure the relevancy of each feature to the class variable, and the redundancy is calculated from the relationship between candidate features, selected features, and class variables. The effectiveness is tested with ten benchmark datasets available in the UCI Machine Learning Repository. The experimental results show better performance when compared with some existing algorithms.
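
To make the maximum relevance and minimum redundancy idea concrete, here is a minimal sketch of the generic greedy criterion on discrete features. It is not the paper's supervised variant (which also involves the class variable when computing redundancy); the function names and the toy data are assumptions made for this example.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Empirical mutual information (in bits) between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in pxy.items()
    )


def mrmr_select(features, labels, k):
    """Greedy max-relevance, min-redundancy selection.

    features: dict mapping a feature name to its list of discrete values.
    Returns the k selected feature names in the order they were picked.
    """
    relevance = {name: mutual_information(col, labels) for name, col in features.items()}
    selected = []
    while len(selected) < k:
        def score(name):
            # Relevance to the class minus mean redundancy with chosen features.
            if not selected:
                return relevance[name]
            redundancy = sum(
                mutual_information(features[name], features[s]) for s in selected
            ) / len(selected)
            return relevance[name] - redundancy

        candidates = [name for name in features if name not in selected]
        selected.append(max(candidates, key=score))
    return selected


# Toy data (made up): the label is "a OR b"; "a_copy" duplicates "a".
labels = [0, 1, 1, 1, 0, 1, 1, 1]
features = {
    "a": [0, 0, 1, 1, 0, 0, 1, 1],
    "a_copy": [0, 0, 1, 1, 0, 0, 1, 1],
    "b": [0, 1, 0, 1, 0, 1, 0, 1],
}
picked = mrmr_select(features, labels, 2)
```

On this toy data all three features are equally relevant to the label, so relevance alone cannot separate them; the redundancy term is what keeps the duplicate column out of the selected pair.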

Data Mining based Prediction of Demand in Indian Market for Refurbished Electronics

Journal of Soft Computing Paradigm - September 2019 ◽

10.36548/jscp.2020.3.002 ◽

2020 ◽

Vol 2 (3) ◽

pp. 153-159

Author(s):

Dr. V. Suma

Keyword(s):

Data Mining ◽

Real World ◽

Research Work ◽

Analysis Data ◽

Business Environment ◽

Customer Behavior ◽

The Real ◽

Market Factors ◽

Real World Datasets ◽

The Impact

There has been an increasing demand in the e-commerce market for refurbished products across India during the last decade. Despite this demand, very little research has been done in this domain. The real-world business environment, market factors, and varying customer behavior of the online market are often ignored in the conventional statistical models evaluated by existing research work. In this paper, we carry out an extensive analysis of the Indian e-commerce market using a data-mining approach to predict the demand for refurbished electronics. The impact of real-world factors on the demand and on the variables is also analyzed. Real-world datasets from three random e-commerce websites are considered for analysis. Data accumulation, processing, and validation are carried out by means of efficient algorithms. Based on the results of this analysis, it is evident that highly accurate predictions can be made with the proposed approach despite the impact of varying customer behavior and market factors. The results of the analysis are represented graphically and can be used for further analysis of the market and for the launch of new products.

Dictionary learning allows model-free pseudotime estimation of transcriptomic data

BMC Genomics ◽

10.1186/s12864-021-08276-9 ◽

2022 ◽

Vol 23 (1) ◽

Author(s):

Mona Rams ◽

Tim O.F. Conrad

Keyword(s):

Dimension Reduction ◽

Dictionary Learning ◽

Real World ◽

Estimation Methods ◽

Dynamic Processes ◽

Dimensional Representation ◽

Transcriptomic Data ◽

Model Free ◽

Real World Datasets ◽

Low Dimensional

Abstract

Background: Pseudotime estimation from dynamic single-cell transcriptomic data enables characterisation and understanding of the underlying processes, for example developmental processes. Various pseudotime estimation methods have been proposed in recent years. Typically, these methods start with a dimension reduction step because the low-dimensional representation is usually easier to analyse. Approaches such as PCA, ICA or t-SNE are among the most widely used methods for dimension reduction in pseudotime estimation. However, these methods usually make assumptions about the derived dimensions, which can result in important dataset properties being missed. In this paper, we suggest a new dictionary learning based approach, dynDLT, for dimension reduction and pseudotime estimation of dynamic transcriptomic data. Dictionary learning is a matrix factorisation approach that does not restrict the dependence of the derived dimensions. To evaluate the performance, we conduct a large simulation study and analyse 8 real-world datasets.

Results: The simulation studies reveal, firstly, that dynDLT preserves the simulated patterns in low dimensions and that the pseudotimes can be derived from the low-dimensional representation. Secondly, the results show that dynDLT is suitable for the detection of genes exhibiting the simulated dynamic patterns, thereby facilitating the interpretation of the compressed representation and thus of the dynamic processes. For the real-world data analysis, we select datasets with samples that are taken at different time points throughout an experiment. The pseudotimes found by dynDLT have high correlations with the experimental times. We compare the results to other approaches used in pseudotime estimation, or those that are methodologically closely connected to dictionary learning: ICA, NMF, PCA, t-SNE, and UMAP. DynDLT has the best overall performance for the simulated and real-world datasets.

Conclusions: We introduce dynDLT, a method that is suitable for pseudotime estimation. Its main advantages are: (1) it presents a model-free approach, meaning that it does not restrict the dependence of the derived dimensions; (2) genes that are relevant in the detected dynamic processes can be identified from the dictionary matrix; (3) by restricting the dictionary entries to positive values, the dictionary atoms are highly interpretable.
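
For readers unfamiliar with dictionary learning as a matrix factorisation, a commonly used generic formulation is sketched below. The abstract does not state dynDLT's exact objective, so every symbol here is an assumption: Y is the data matrix, D the dictionary whose columns d_j are the atoms, X the sparse coefficient matrix, and λ a sparsity weight.

```latex
\min_{D,\,X}\; \tfrac{1}{2}\,\lVert Y - D X \rVert_F^2 \;+\; \lambda\,\lVert X \rVert_1
\quad \text{subject to} \quad \lVert d_j \rVert_2 \le 1 \ \text{ for every column } d_j \text{ of } D
```

Under this reading, the restriction of dictionary entries to positive values mentioned in advantage (3) would correspond to an additional elementwise sign constraint on D.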

An application of machine learning based on real-world data: Mining features of fibrinogen in clinical stages of lung cancer between sexes

Annals of Translational Medicine ◽

10.21037/atm-20-4704 ◽

2021 ◽

Vol 9 (8) ◽

pp. 623-623

Author(s):

Fangtao Yin ◽

Hongyu Zhu ◽

Songlin Hong ◽

Chen Sun ◽

Jie Wang ◽

...

Keyword(s):

Machine Learning ◽

Lung Cancer ◽

Data Mining ◽

Real World ◽

Real World Data ◽

World Data ◽

Clinical Stages

Latest Tools for Data Mining and Machine Learning

International Journal of Innovative Technology and Exploring Engineering - Special Issue ◽

10.35940/ijitee.i1003.0789s19 ◽

2019 ◽

Vol 8 (9S) ◽

pp. 18-23 ◽

Cited By ~ 2

Keyword(s):

Machine Learning ◽

Data Mining ◽

Decision Making ◽

Feature Selection ◽

Open Source ◽

Predictive Analysis ◽

Learning Tools ◽

Pros And Cons ◽

Selection For ◽

Extract Information

Data Mining is now widely used to extract information from data and, in turn, to acquire knowledge for decision making. Data Mining analyzes patterns, which are used to extract information and knowledge for making decisions. Many open-source and licensed tools, such as Weka, RapidMiner, KNIME, and Orange, are available for Data Mining and predictive analysis. This paper discusses the different tools available for Data Mining and Machine Learning, followed by a description of each tool and its pros and cons. The article provides details of the algorithms these tools offer, such as classification, regression, characterization, discretization, clustering, visualization, and feature selection. It is intended to support efficient decision making and suggests which tool is suitable for a given requirement.

Multimodal Linear Discriminant Analysis via Structural Sparsity

Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence ◽

10.24963/ijcai.2017/482 ◽

2017 ◽

Author(s):

Yu Zhang ◽

Yuan Jiang

Keyword(s):

Discriminant Analysis ◽

Linear Discriminant Analysis ◽

Real World ◽

Specific Class ◽

Linear Discriminant ◽

Class Separability ◽

Dimensionality Reduction Technique ◽

Real World Applications ◽

Data Points ◽

Real World Datasets

Linear discriminant analysis (LDA) is a widely used supervised dimensionality reduction technique. Even though the LDA method has many real-world applications, it has some limitations, such as the single-modal problem: the assumption that each class follows a normal distribution. To solve this problem, we propose a method called multimodal linear discriminant analysis (MLDA). By generalizing the between-class and within-class scatter matrices, the MLDA model allows each data point to have its own class mean, which is called the instance-specific class mean. Then, in each class, data points that share the same or similar instance-specific class means are considered to form one cluster or mode. In order to learn the instance-specific class means, we use as a criterion the ratio of the proposed generalized between-class scatter measure to the proposed generalized within-class scatter measure, which encourages class separability. The observation that each class will have a limited number of clusters inspires us to use a structural sparse regularizer to control the number of unique instance-specific class means in each class. Experiments on both synthetic and real-world datasets demonstrate the effectiveness of the proposed MLDA method.
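
As background for the scatter-matrix ratio that the abstract generalizes, the following is a minimal sketch of classical two-class LDA in two dimensions, where the discriminant direction is proportional to the inverse within-class scatter applied to the difference of class means. It is the standard baseline, not the proposed MLDA; the function names and the toy points are assumptions made for this example.

```python
def mean(points):
    """Mean of a list of 2-D points."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(2)]


def within_scatter(groups):
    """Sum over classes of (x - class mean)(x - class mean)^T for 2-D points."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for pts in groups:
        m = mean(pts)
        for p in pts:
            d = [p[0] - m[0], p[1] - m[1]]
            for i in range(2):
                for j in range(2):
                    s[i][j] += d[i] * d[j]
    return s


def fisher_direction(class0, class1):
    """Two-class LDA direction w proportional to Sw^{-1} (m1 - m0)."""
    sw = within_scatter([class0, class1])
    m0, m1 = mean(class0), mean(class1)
    diff = [m1[0] - m0[0], m1[1] - m0[1]]
    det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]
    if abs(det) < 1e-12:
        raise ValueError("within-class scatter is singular")
    # Closed-form inverse of the 2x2 within-class scatter matrix.
    inv = [[sw[1][1] / det, -sw[0][1] / det], [-sw[1][0] / det, sw[0][0] / det]]
    return [inv[0][0] * diff[0] + inv[0][1] * diff[1],
            inv[1][0] * diff[0] + inv[1][1] * diff[1]]


# Toy points (made up): two classes separated mainly along the first axis.
class0 = [(0.0, 0.0), (1.0, 0.2), (0.0, 1.0), (1.0, 1.2)]
class1 = [(3.0, 0.1), (4.0, 0.0), (3.0, 1.1), (4.0, 1.0)]
w = fisher_direction(class0, class1)
proj0 = [w[0] * x + w[1] * y for x, y in class0]
proj1 = [w[0] * x + w[1] * y for x, y in class1]
```

Projecting onto w reduces each point to a single number; on this toy data the two classes do not overlap after projection. The single class mean per class used here is exactly the assumption that the instance-specific class means of MLDA relax.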
