A Systematic Comparative Analysis of Clustering Techniques

Abstract: Clustering has become an important tool for managing data in many areas, such as pattern recognition, machine learning and information retrieval. Databases grow day by day, so data must be maintained in a way that lets useful information be extracted easily and used accordingly. Clustering plays an important role in this process because it groups data on the basis of similarity. More than a hundred clustering methods and algorithms can be used for mining data, but not all of these algorithms provide models for their clusters, which makes them difficult to categorise. This paper describes the most commonly used clustering techniques and compares them on the basis of their merits, demerits and time complexity.

Measurement of clustering effectiveness for document collections

Information Retrieval ◽

10.1007/s10791-021-09401-8 ◽

2022 ◽

Author(s):

Meng Yuan ◽

Justin Zobel ◽

Pauline Lin

Keyword(s):

Information Retrieval ◽

Measurement Techniques ◽

High Dimensionality ◽

Clustering Methods ◽

Clustering Method ◽

Similar Material ◽

Document Collections ◽

Clustering Techniques

Abstract: Clustering of the contents of a document corpus is used to create sub-corpora, with the intention that each consists of documents that are related to each other. However, while clustering is used in a variety of ways in document applications such as information retrieval, and a range of methods have been applied to the task, there has been relatively little exploration of how well it works in practice. Indeed, given the high dimensionality of the data, it is possible that clustering may not always produce meaningful outcomes. In this paper we use a well-known clustering method to explore a variety of techniques, existing and novel, to measure clustering effectiveness. Results with our new, extrinsic techniques based on relevance judgements or retrieved documents demonstrate that retrieval-based information can be used to assess the quality of clustering, and also show that clustering can succeed to some extent at gathering together similar material. Further, they show that intrinsic clustering techniques that have been shown to be informative in other domains do not work for information retrieval. Whether clustering is sufficiently effective to have a significant impact on practical retrieval is unclear but, as the results show, our measurement techniques can effectively distinguish between clustering methods.
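The idea of an extrinsic measure built on relevance judgements can be illustrated with a small sketch. The score below is an assumption made for illustration, not the measure defined in the paper: for each query, it reports the largest share of that query's judged-relevant documents that ends up in a single cluster. The function name, document identifiers and cluster assignments are all hypothetical.

```python
from collections import Counter


def relevant_concentration(cluster_of, relevant_docs):
    """Largest share of a query's relevant documents found in one cluster.

    Illustrative score only (an assumption, not the paper's measure).
    cluster_of maps a document id to its cluster id; relevant_docs lists
    the documents judged relevant to one query.
    """
    counts = Counter(cluster_of[d] for d in relevant_docs if d in cluster_of)
    total = sum(counts.values())
    if total == 0:
        return 0.0  # no judged-relevant document was clustered
    return max(counts.values()) / total


# Hypothetical toy data: five documents in three clusters, two queries.
cluster_of = {"d1": 0, "d2": 0, "d3": 1, "d4": 1, "d5": 2}
relevant = {"q1": ["d1", "d2", "d3"], "q2": ["d4", "d5"]}

scores = {q: relevant_concentration(cluster_of, docs) for q, docs in relevant.items()}
for query, score in scores.items():
    print(query, round(score, 3))
```

A score near 1 means the clustering gathered a query's relevant documents together; a score near 1/k for k clusters means they were scattered.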

A Brief Survey of Machine Learning Methods in Identification of Mitochondria Proteins in Malaria Parasite

Current Pharmaceutical Design ◽

10.2174/1381612826666200310122324 ◽

2020 ◽

Vol 26 (26) ◽

pp. 3049-3058

Author(s):

Ting Liu ◽

Hua Tang

Keyword(s):

Machine Learning ◽

Computational Methods ◽

Future Development ◽

Malaria Parasite ◽

Mitochondrial Proteins ◽

Learning Methods ◽

Machine Learning Methods ◽

Effective Drugs ◽

Construction Strategies ◽

Day By Day

The number of human deaths caused by malaria is increasing day by day. The mitochondrial proteins of the malaria parasite play vital roles in the organism. To develop effective drugs and vaccines against infection, it is necessary to accurately identify the mitochondrial proteins of the malaria parasite. Although biochemical experiments can provide precise details about mitochondrial proteins, they are expensive and time-consuming. In this review, we summarized the machine learning-based methods for mitochondrial protein identification in the malaria parasite and compared the construction strategies of these computational methods. Finally, we also discussed the future development of mitochondrial protein recognition with algorithms.

Artificial neural network models for coronary artery disease

Current Bioinformatics ◽

10.2174/1574893615666200214102837 ◽

2020 ◽

Vol 15 ◽

Author(s):

Elham Shamsara ◽

Sara Saffar Soflaei ◽

Mohammad Tajfard ◽

Ivan Yamshchikov ◽

Habibollah Esmaili ◽

...

Keyword(s):

Neural Network ◽

Machine Learning ◽

Coronary Artery Disease ◽

Pattern Recognition ◽

Artificial Neural Network ◽

Coronary Artery ◽

Diagnostic Model ◽

Early Prediction ◽

Artificial Neural ◽

Artery Disease

Background: Coronary artery disease (CAD) is an important cause of mortality and morbidity globally. Objective: The early prediction of CAD would be valuable in identifying individuals at risk and in focusing resources on its prevention. In this paper, we aimed to establish a diagnostic model to predict CAD by using three ANN approaches (pattern recognition-ANN, LVQ-ANN, and competitive ANN). Methods: One promising method for early prediction of disease based on risk factors is machine learning. Among different machine learning algorithms, artificial neural network (ANN) algorithms have been applied widely in medicine and in a variety of real-world classification tasks. An ANN is a non-linear computational model, inspired by the human brain, that analyzes and processes complex datasets. Results: The ANN methods investigated in this paper indicate that, in both the pattern recognition-ANN and LVQ-ANN methods, predictions of the Angiography+ class have high accuracy. Moreover, in CNN the correlation between the individuals in cluster “c” and the Angiography+ class is very high. This accuracy indicates a significant difference in some of the input features between the Angiography+ class and the other two output classes. A comparison of the weights chosen by these three methods in separating the control class from Angiography+ shows that hs-CRP, FSG, and WBC are the most substantial excitatory weights in recognizing Angiography+ individuals, whereas HDL-C and MCH are identified as inhibitory weights. Furthermore, the effect on the accuracy of the diagnostic model of decomposing the multi-class problem into a set of binary classes, and of random sampling, is investigated. Conclusion: This study confirms that pattern recognition-ANN was the most accurate of the ANN methods. This is due to the back-propagation procedure, in which the network classifies input variables based on labeled classes. The results of binarization show that decomposing the multi-class set into binary sets could achieve higher accuracy.
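The decomposition of a multi-class problem into binary problems can be sketched as follows. The abstract does not say which scheme the authors used; the sketch assumes one-vs-rest, a common choice, in which each class gets its own binary label vector. The function name and the third class label ("other") are placeholders for illustration.

```python
def one_vs_rest_labels(labels):
    """Build one binary label vector per class (1 = this class, 0 = any other).

    Assumes a one-vs-rest decomposition; the abstract does not specify
    the scheme that was actually used.
    """
    classes = sorted(set(labels))
    return {c: [1 if y == c else 0 for y in labels] for c in classes}


# Hypothetical labels for five individuals; "other" is a placeholder name
# for the third output class, which the abstract does not name.
labels = ["control", "angiography+", "other", "angiography+", "control"]

binary = one_vs_rest_labels(labels)
for cls, vector in binary.items():
    print(cls, vector)
```

Each binary vector can then be used to train a separate two-class model, and the per-class outputs are combined to recover a multi-class prediction.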

Comparative Analysis of Machine Learning Techniques Using Predictive Modeling

Recent Advances in Computer Science and Communications ◽

10.2174/2666255813999200904164539 ◽

2020 ◽

Vol 13 ◽

Author(s):

Ritu Khandelwal ◽

Hemlata Goyal ◽

Rajveer Singh Shekhawat

Keyword(s):

Machine Learning ◽

Comparative Analysis ◽

Data Science ◽

Training Data ◽

Machine Learning Techniques ◽

Future Trends ◽

Data Set ◽

Learning Stage ◽

Learning Techniques ◽

Different Types

Introduction: Machine learning is an intelligent technology that works as a bridge between businesses and data science. With the involvement of data science, the business goal is to gain valuable insights from the available data. Bollywood, a multi-million dollar industry, makes up a large part of Indian cinema. This paper attempts to predict whether an upcoming Bollywood movie will be a Blockbuster, Superhit, Hit, Average or Flop. For this, machine learning techniques (classification and prediction) will be applied. The first step in building a classifier or prediction model is the learning stage, in which a training data set is used to train the model with some technique or algorithm; the rules generated in this stage then help to build a model and predict future trends in different types of organizations. Methods: Classification and prediction techniques such as Support Vector Machine (SVM), Random Forest, Decision Tree, Naïve Bayes, Logistic Regression, AdaBoost, and KNN will be applied to find efficient and effective results. All of these can be applied with GUI-based workflows, which are organized into categories such as Data, Visualize, Model, and Evaluate. Result: The first step in building a classifier or prediction model is the learning stage, in which a training data set is used to train the model with some technique or algorithm; the rules generated in this stage then help to build a model and predict future trends in different types of organizations. Conclusion: This paper focuses on a comparative analysis based on parameters such as accuracy and the confusion matrix to identify the best possible model for predicting movie success. Using advertisement propaganda, production houses can plan the best time to release a movie according to the predicted success rate to gain higher benefits.
Discussion: Data mining is the process of discovering patterns in large data sets; the relationships discovered along the way help to solve business problems and to predict forthcoming trends. This prediction can help production houses with advertisement propaganda and with planning their costs; by securing these factors, they can make the movie more profitable.
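The two comparison parameters named in the conclusion, accuracy and the confusion matrix, can be computed as in the sketch below. It uses the five outcome classes from the abstract, but the actual and predicted labels are invented toy data, and the helper names are hypothetical; nothing here reproduces the paper's results.

```python
CLASSES = ["Blockbuster", "Superhit", "Hit", "Average", "Flop"]


def confusion_matrix(actual, predicted, classes):
    """Count outcomes: rows are actual classes, columns are predicted classes."""
    index = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix


def accuracy(matrix):
    """Share of predictions on the diagonal, i.e. predicted class == actual class."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total if total else 0.0


# Invented toy labels for six movies (not data from the paper).
actual = ["Hit", "Flop", "Average", "Hit", "Blockbuster", "Flop"]
predicted = ["Hit", "Flop", "Hit", "Hit", "Superhit", "Average"]

cm = confusion_matrix(actual, predicted, CLASSES)
acc = accuracy(cm)
for cls, row in zip(CLASSES, cm):
    print(f"{cls:12s}", row)
print("accuracy:", acc)
```

Computing the same matrix for each candidate model on the same held-out movies gives a like-for-like basis for comparison, and the off-diagonal cells show which outcome classes a model confuses.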
