Sentiment Analysis on Microblogging with K-Means Clustering and Artificial Bee Colony
Microblogging is a type of blog used by people to express their opinions, attitudes, and feelings toward entities with a short message and this message is easily shared through the network of connected people. Knowing their sentiments would be beneficial for decision-making, planning, visualization, and so on. Grouping similar microblogging messages can convey some meaningful sentiments toward an entity. This task can be accomplished by using a simple and fast clustering algorithm, [Formula: see text]-means. As the microblogging messages are short and noisy they cause high sparseness and high-dimensional dataset. To overcome this problem, term frequency–inverse document frequency (tf–idf) technique is employed for selecting the relevant features, and singular value decomposition (SVD) technique is employed for reducing the high-dimensional dataset while still retaining the most relevant features. These two techniques adjust dataset to improve the [Formula: see text]-means efficiently. Another problem comes from [Formula: see text]-means itself. [Formula: see text]-means result relies on the initial state of centroids, the random initial state of centroids usually causes convergence to a local optimum. To find a global optimum, artificial bee colony (ABC), a novel swarm intelligence algorithm, is employed to find the best initial state of centroids. Silhouette analysis technique is also used to find optimal [Formula: see text]. After clustering into [Formula: see text] groups, each group will be scored by SentiWordNet and we analyzed the sentiment polarities of each group. Our approach shows that combining various techniques (i.e., tf–idf, SVD, and ABC) can significantly improve [Formula: see text]-means result (41% from normal [Formula: see text]-means).