Research of Clustering Algorithms using Enhanced Feature Selection

At present, a huge quantity of data is recorded in a variety of forms such as text, image, video, and audio, and this volume is expected to grow in the future. The major text-related tasks, such as entity extraction, information extraction, entity relation modeling, and document summarization, are performed using text mining. This paper focuses on document clustering, a subtask of text mining, and measures the performance of different clustering techniques. We use an enhanced feature selection method for clustering text documents to show that it produces better results than traditional feature selection.
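The idea of filtering terms before clustering can be sketched with a simple document-frequency threshold. The paper's specific "enhanced" selection criterion is not reproduced here, so the thresholds and the tiny corpus below are purely illustrative:

```python
from collections import Counter

def select_features(docs, min_df=0.25, max_df=0.75):
    # Keep terms whose document frequency falls in (min_df, max_df]:
    # very rare terms are mostly noise, near-ubiquitous terms carry
    # little information for separating clusters.
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc.split()))
    return {term for term, count in df.items() if min_df < count / n <= max_df}

docs = [
    "text mining clusters documents",
    "clustering documents by text features",
    "audio and video data grow fast",
    "feature selection improves clustering",
]
features = select_features(docs)
print(sorted(features))  # terms kept for the clustering stage
```

Terms appearing in exactly half of the documents survive the filter here; singletons and (in larger corpora) near-universal terms are dropped before the clustering stage.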

Author(s):  
Harsha Patil ◽  
R. S. Thakur

As we know, use of the Internet flourishes at full velocity and in all dimensions. The enormous availability of text documents in digital form (email, web pages, blog posts, news articles, e-books, and other text files) on the Internet challenges technology to retrieve the appropriate documents in response to any search query. As a result, there has been an eruption of interest in mining these vast resources and classifying them properly, which invigorates researchers and developers to work on numerous approaches to document clustering. The aim of this chapter is to summarise the different document clustering algorithms used by researchers.


2017 ◽  
Vol 2017 ◽  
pp. 1-12 ◽  
Author(s):  
Muhammad Shafiq ◽  
Xiangzhan Yu ◽  
Asif Ali Laghari ◽  
Dawei Wang

Recently, machine learning (ML) algorithms have been widely applied to Internet traffic classification. However, with inappropriate feature selection, ML-based classifiers are prone to misclassifying minority flows as the majority class, since that traffic occupies most of the flows. To address this problem, a novel feature selection metric named weighted mutual information (WMI) is proposed. We develop a hybrid feature selection algorithm named WMI_ACC, which filters most of the features with the WMI metric and then uses a wrapper method to select features for ML classifiers with an accuracy (ACC) metric. We evaluate our approach using five ML classifiers on two traces captured from different network environments. Furthermore, we apply the Wilcoxon pairwise statistical test to the results of our proposed algorithm to identify the robust features among the selected set. Experimental results show that our algorithm gives promising results in terms of classification accuracy, recall, and precision, achieving up to 99% flow accuracy.
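A rough illustration of the idea behind a weighted mutual information metric follows. The paper's exact WMI formula is not given here; the minority-class weighting below is an assumption made only to show how a weighting can keep features about rare flow classes from being drowned out:

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    # Plain MI (in bits) between two discrete sequences of equal length.
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * math.log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

def weighted_mi(feature_values, labels):
    # Hypothetical weighting: scale MI by the minority-class share.
    # (Illustrative only; not the formula from the paper.)
    minority_share = min(Counter(labels).values()) / len(labels)
    return mutual_information(feature_values, labels) * minority_share

# Toy flows: a discretized packet-size feature vs. the traffic class.
sizes = [0, 0, 1, 1]
classes = ["web", "web", "p2p", "p2p"]
score = weighted_mi(sizes, classes)  # MI = 1 bit, minority share = 0.5
```

In the filter stage, features would be ranked by such a score and the lowest-ranked ones discarded before the wrapper search.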



2015 ◽  
Vol 2015 ◽  
pp. 1-10 ◽  
Author(s):  
Heyong Wang ◽  
Ming Hong

With the rapid development of web applications such as social networks, a large amount of electronic text data has accumulated and become available on the Internet, which has caused increasing interest in text mining. Text classification is one of the most important subfields of text mining. In fact, text documents are often represented as a high-dimensional sparse document term matrix (DTM) before classification. Feature selection is essential and vital for text classification due to the high dimensionality and sparsity of the DTM. An efficient feature selection method is capable of both reducing the dimensions of the DTM and selecting discriminative features for text classification. Laplacian Score (LS) is an unsupervised feature selection method that has been successfully used in areas such as face recognition. However, LS is unable to select discriminative features for text classification or to effectively reduce the sparsity of the DTM. To improve on it, this paper proposes an unsupervised feature selection method named Distance Variance Score (DVS). DVS uses feature distance contribution (a ratio) to rank the importance of features for text documents so as to select discriminative features. Experimental results indicate that DVS is able to select discriminative features and reduce the sparsity of the DTM. Thus, it is much more efficient than LS.
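The "distance contribution" ratio can be sketched as each term's share of the total squared pairwise distance between documents. This is only an illustration of the idea; the exact DVS formula in the paper may differ:

```python
def distance_contribution(dtm):
    # Score each term by its share of the total squared pairwise distance
    # between documents; constant (non-discriminative) terms score zero.
    n, d = len(dtm), len(dtm[0])
    per_term = [
        sum((dtm[i][j] - dtm[k][j]) ** 2
            for i in range(n) for k in range(i + 1, n))
        for j in range(d)
    ]
    total = sum(per_term)
    return [s / total for s in per_term]

# Tiny DTM: 3 documents x 3 terms; the third term is constant everywhere.
dtm = [[3, 0, 1],
       [0, 2, 1],
       [4, 0, 1]]
scores = distance_contribution(dtm)
```

The constant third term receives a score of zero and would be dropped, which is how a distance-based ratio can shrink a sparse DTM while keeping discriminative terms.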


2019 ◽  
Vol 22 (3) ◽  
Author(s):  
Samuel Bruno da Silva Sousa ◽  
Ronaldo de Castro Del-Fiaco ◽  
Lilian Berton

Homicide is recognized as one of the most violent types of crime. In some countries, it is a hard problem to tackle because of its high occurrence and the lack of research on it. In Brazil, this problem is even harder, since this country is responsible for about 10% of the homicides in the world. Some Brazilian states suffer from rising homicide rates, like the state of Goiás, whose homicide rate increased from 24.5 per 100,000 in 2002 to 42.6 per 100,000 in 2014, becoming one of the five most violent states of Brazil despite having a relatively small population. This paper aims at applying clustering algorithms and feature selection models to criminal data concerning homicides and socio-economic variables in the state of Goiás. We employed three clustering algorithms: K-means, Density-based, and Hierarchical; as well as two feature selection models: Univariate Selection and Feature Importance. Our results indicate that homicide rates are more recurrent in large urban centers, although these cities have the best socio-economic indicators. Population and the educational level of the adult population were the variables which most influenced the results. K-means clustering produced the best outcomes, and Univariate Selection better selected attributes of the database.
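The clustering step can be sketched with a minimal k-means over two of the variables the study highlights (population and adult education level). The municipal values below are made up for illustration, and the study itself used standard library implementations rather than this toy:

```python
def kmeans(points, k, iters=20):
    # Minimal Lloyd's k-means with deterministic initialization
    # (the first k points), so the toy run is reproducible.
    centers = list(points[:k])

    def nearest(p):
        return min(range(k),
                   key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))

    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p)].append(p)
        centers = [tuple(sum(col) / len(cl) for col in zip(*cl)) if cl else centers[i]
                   for i, cl in enumerate(clusters)]
    return [nearest(p) for p in points]

# (population in thousands, adult education index) -- hypothetical values.
towns = [(12.4, 6.1), (980.5, 7.9), (9.8, 5.7),
         (1020.0, 8.1), (15.2, 6.0), (870.3, 7.7)]
labels = kmeans(towns, 2)  # large urban centers separate from small towns
```

With well-separated groups like these, the large urban centers fall into one cluster and the small towns into the other, mirroring the paper's finding that population strongly drives the partition.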


2016 ◽  
Author(s):  
Tallulah S. Andrews ◽  
Martin Hemberg

Abstract: Feature selection is a key step in many single-cell RNASeq (scRNASeq) analyses. Feature selection is intended to preserve biologically relevant information while removing genes only subject to technical noise. As it is frequently performed prior to dimensionality reduction, clustering, and pseudotime analyses, feature selection can have a major impact on the results. Several different approaches have been proposed for unsupervised feature selection from unprocessed single-cell expression matrices, most based upon identifying highly variable genes in the dataset. We present two methods which take advantage of the prevalence of zeros (dropouts) in scRNASeq data to identify features. We show that dropout-based feature selection outperforms variance-based feature selection for multiple applications of single-cell RNASeq.
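The core quantity, a gene's dropout rate, is just the fraction of cells with a zero count. The toy ranking below only sketches the idea; the paper's methods additionally compare each gene's observed dropout rate to the rate expected from its mean expression, a modelling step omitted here, and the counts and gene names are made up:

```python
def dropout_rate(counts):
    # Fraction of cells in which the gene was not detected (zero counts).
    return sum(1 for c in counts if c == 0) / len(counts)

def select_by_dropout(expr, genes, n_top=1):
    # Rank genes by raw dropout rate and keep the top n_top.
    rates = {g: dropout_rate(row) for g, row in zip(genes, expr)}
    return sorted(genes, key=lambda g: rates[g], reverse=True)[:n_top]

# Genes x cells count matrix (illustrative numbers and gene names).
expr = [[5, 4, 6, 5],   # GeneA: detected everywhere, moderate counts
        [0, 9, 0, 8],   # GeneB: high dropout, bimodal expression
        [1, 1, 2, 1]]   # GeneC: low but stable expression
genes = ["GeneA", "GeneB", "GeneC"]
top = select_by_dropout(expr, genes)
```

GeneB, undetected in half the cells despite high counts when detected, ranks first, which is the kind of gene a variance-only criterion can conflate with merely noisy high-expression genes.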

