The comparative study of text documents clustering algorithms

Clustering is one of the most significant research areas in the field of data mining and is considered an important tool in the fast-developing era of information explosion. Clustering systems are increasingly used in text mining, especially for analyzing texts and extracting the knowledge they contain. Data are grouped into clusters in such a way that the data in the same group are similar and those in other groups are dissimilar. Clustering therefore aims to maximize intra-class similarity and inter-class dissimilarity. It is useful for obtaining interesting patterns and structures from a large set of data, and it can be applied in many areas, such as DNA analysis, marketing studies, web documents, and classification. This paper studies and compares three text document clustering algorithms, namely k-means, k-medoids, and self-organizing maps (SOM), using the F-measure.
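
To make the evaluation criterion concrete, the sketch below computes one commonly used clustering F-measure: each reference class is matched to the cluster that gives it the best F-score, and the per-class scores are averaged with class-size weights. The abstract does not state which variant the paper uses, so this definition, the function name `clustering_f_measure`, and the toy labels are all assumptions made for illustration.

```python
from collections import Counter


def clustering_f_measure(true_labels, cluster_ids):
    """Class-weighted best-match F-measure of a clustering against reference labels."""
    if len(true_labels) != len(cluster_ids):
        raise ValueError("label and cluster sequences must have the same length")
    n = len(true_labels)
    class_sizes = Counter(true_labels)                # n_i: documents in class i
    cluster_sizes = Counter(cluster_ids)              # n_j: documents in cluster j
    overlap = Counter(zip(true_labels, cluster_ids))  # n_ij: class i inside cluster j

    total = 0.0
    for cls, n_i in class_sizes.items():
        best = 0.0
        for clu, n_j in cluster_sizes.items():
            n_ij = overlap[(cls, clu)]
            if n_ij == 0:
                continue
            precision = n_ij / n_j
            recall = n_ij / n_i
            best = max(best, 2 * precision * recall / (precision + recall))
        total += (n_i / n) * best  # weight each class by its share of the documents
    return total


# Hypothetical toy data: six documents, two reference classes, two clusters.
labels = ["sport", "sport", "sport", "politics", "politics", "politics"]
clusters = [0, 0, 1, 1, 1, 1]
score = clustering_f_measure(labels, clusters)
print(round(score, 4))
```

A score of 1.0 means every class is reproduced exactly by some cluster; running the same function on the output of k-means, k-medoids, and SOM over one labeled corpus gives directly comparable numbers.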

A Weighted Frequent Item-Set Mining using WD-FIM Algorithm

International Journal of Innovative Technology and Exploring Engineering - Special Issue ◽

10.35940/ijitee.l3683.1081219 ◽

2019 ◽

Vol 8 (12) ◽

pp. 4792-4796

Keyword(s):

Data Mining ◽

Decision Making ◽

Research Area ◽

Data Sets ◽

Weight Factor ◽

Smart Systems ◽

Frequent Item ◽

Significant Research ◽

Downward Closure ◽

The One

Smart systems are one of the most significant inventions of our times. These systems rely on powerful information mining techniques to achieve intelligence in decision making. Frequent item set mining (FIM) has become one of the most significant research areas of data mining. The information present in databases is, in general, ambiguous and uncertain. In such databases, one should consider weighted FIM to discover item sets that are significant from the end user's perspective. However, once a weight factor is introduced, weighted frequent item sets may no longer satisfy the downward closure property. Consequently, the search space of frequent item sets cannot be narrowed using the downward closure property, which leads to poor time efficiency. In this paper, we introduce two properties for FIM: the first is the weight judgment downward closure property (WD-FIM) for weighted FIM, and the second is the existence property of its subsets. Based on these two properties, the WD-FIM algorithm is proposed to narrow the search space of weighted frequent item sets and improve time efficiency. In addition, the completeness and time efficiency of the WD-FIM algorithm are analyzed theoretically. Finally, the performance of the proposed WD-FIM algorithm is verified on both synthetic and real data sets.
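
The loss of downward closure under weighting can be seen in a few lines. The sketch below assumes one simple weighting scheme (the mean item weight multiplied by plain support); the abstract does not give the paper's exact definition, so the scheme, the function names, and the toy transactions are illustrative assumptions rather than the WD-FIM method itself.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item of the itemset."""
    itemset = frozenset(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def weighted_support(itemset, transactions, weights):
    """Assumed scheme: mean weight of the items times the plain support."""
    mean_weight = sum(weights[i] for i in itemset) / len(itemset)
    return mean_weight * support(itemset, transactions)


# Hypothetical toy database and item weights.
transactions = [frozenset("ab"), frozenset("ab"), frozenset("a"), frozenset("ab")]
weights = {"a": 0.2, "b": 0.9}
min_wsup = 0.3

wsup_a = weighted_support({"a"}, transactions, weights)        # subset
wsup_ab = weighted_support({"a", "b"}, transactions, weights)  # superset

# The superset passes the threshold while its subset does not, so pruning
# every superset of an infrequent item set would discard a valid result.
print(wsup_a >= min_wsup, wsup_ab >= min_wsup)
```

With unweighted support the superset can never score higher than its subset, which is exactly the guarantee that weighting removes in this example.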

A Survey for Association Rule Mining in Data Mining

International Journal of Advanced Research in Computer Science and Software Engineering ◽

10.23956/ijarcsse.v7i8.58 ◽

2017 ◽

Vol 7 (8) ◽

pp. 245

Author(s):

Shivangee Agrawal ◽

Nivedita Bairagi

Keyword(s):

Data Mining ◽

Association Rules ◽

Association Rule ◽

Association Rule Mining ◽

Research Area ◽

Knowledge Discovery In Databases ◽

Rule Mining ◽

Base Level ◽

Significant Research ◽

Total Usage

Data mining, also known as knowledge discovery in databases, has established its place as an important and significant research area. The objective of data mining (DM) is to extract higher-level, previously unknown information from a large quantity of raw data. DM has been used in a variety of data domains. It can be considered an algorithmic method that takes data as input and yields patterns, such as classification rules, itemsets, association rules, or summaries, as output. The 'classical' association rule problem deals with the generation of association rules by specifying a minimum level of confidence and support that the produced rules should meet. The most standard and classical algorithm used for association rule mining (ARM) is the Apriori algorithm, which generates the frequent itemsets of a database. The essential idea behind this algorithm is that multiple passes are made over the database. The overall use of association rule techniques strengthens the knowledge management process and enables marketing personnel to know their customers well in order to provide better quality services. In this paper, the Genetic algorithm and FP-Growth are described in detail, along with applications of association rule mining.
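
The multi-pass idea behind Apriori can be sketched briefly. The code below is a minimal, unoptimized illustration written for this listing, not the surveyed authors' implementation; the function names and the toy shopping baskets are hypothetical.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support} for every itemset with support >= min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def frequent_among(candidates):
        # One pass over the database: count each candidate, keep the frequent ones.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        return {c: k / n for c, k in counts.items() if k / n >= min_support}

    items = {item for t in transactions for item in t}
    level = frequent_among({frozenset([i]) for i in items})
    result = dict(level)
    k = 2
    while level:
        prev = set(level)
        # Join step: unite pairs of frequent (k-1)-itemsets into k-itemsets.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev for s in combinations(c, k - 1))}
        level = frequent_among(candidates)
        result.update(level)
        k += 1
    return result


def confidence(frequent, antecedent, consequent):
    """Confidence of antecedent -> consequent, computed from stored supports."""
    a = frozenset(antecedent)
    return frequent[a | frozenset(consequent)] / frequent[a]


# Hypothetical toy baskets.
baskets = [{"bread", "milk"}, {"bread", "butter"},
           {"bread", "milk", "butter"}, {"milk"}]
frequent = apriori(baskets, min_support=0.5)
rule_conf = confidence(frequent, {"bread"}, {"milk"})
print(len(frequent), round(rule_conf, 3))
```

Each iteration of the `while` loop corresponds to one pass over the database, and the prune step is where the minimum support threshold keeps the candidate set small.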

Ontology Based Data Mining Approach on Web Documents

Computer and Information Science ◽

10.5539/cis.v7n4p123 ◽

2014 ◽

Vol 7 (4) ◽

pp. 123

Author(s):

Hamideh Hajiabadi

Keyword(s):

Data Mining ◽

Data Sources ◽

Text Documents ◽

Web Documents ◽

Valuable Data ◽

Data Mining Techniques ◽

Data Mining Approach ◽

Huge Data ◽

Data Source ◽

Extract Information

The Internet, which contains a vast number of huge data sources, is growing rapidly in all domains. It can be considered a valuable data source if its data can be processed into information. Data mining techniques are widely used on web documents in order to extract information. In this paper, an ontology-based data mining approach is proposed to classify web documents in order to facilitate applications that rely on classified text documents, such as search engines. The proposed approach is implemented and applied to several web documents. The experimental results show considerable progress.

Early Detection of Red Palm Weevil, Rhynchophorus ferrugineus (Olivier), Infestation Using Data Mining

Plants ◽

10.3390/plants10010095 ◽

2021 ◽

Vol 10 (1) ◽

pp. 95

Author(s):

Heba Kurdi ◽

Amal Al-Aldawsari ◽

Isra Al-Turaiki ◽

Abdulrahman S. Aldawood

Keyword(s):

Data Mining ◽

Plant Size ◽

Support Vector ◽

Classification Algorithms ◽

Palm Tree ◽

Rhynchophorus Ferrugineus ◽

Red Palm Weevil ◽

Palm Weevil ◽

Using Data ◽

F Measure

In the past 30 years, the red palm weevil (RPW), Rhynchophorus ferrugineus (Olivier), a pest that is highly destructive to all types of palms, has rapidly spread worldwide. However, detecting infestation with the RPW is highly challenging because symptoms are not visible until the death of the palm tree is inevitable. In addition, the use of automated RPW identification tools to predict infestation is complicated by a lack of RPW datasets. In this study, we assessed the capability of 10 state-of-the-art data mining classification algorithms, namely Naive Bayes (NB), KSTAR, AdaBoost, bagging, PART, J48 decision tree, multilayer perceptron (MLP), support vector machine (SVM), random forest, and logistic regression, to use plant-size and temperature measurements collected from individual trees to predict RPW infestation in its early stages, before significant damage is caused to the tree. The performance of the classification algorithms was evaluated in terms of accuracy, precision, recall, and F-measure using a real RPW dataset. The experimental results showed that infestations with RPW can be predicted with an accuracy of up to 93%, precision above 87%, recall equal to 100%, and F-measure greater than 93% using data mining. Additionally, we found that temperature and circumference are the most important features for predicting RPW infestation. However, we strongly call for collecting and aggregating more RPW datasets to run more experiments to validate these results and provide more conclusive findings.
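
The four evaluation measures named in the abstract follow directly from the confusion-matrix counts. The sketch below shows the standard binary definitions; the function name `binary_metrics` and the eight example predictions are invented for illustration and are not data from the study.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall, and F-measure for a binary classifier."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives
    tn = len(pairs) - tp - fp - fn                               # true negatives

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f_measure": f_measure}


# Hypothetical labels: 1 = infested tree, 0 = healthy tree.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 1, 0, 0, 0, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

In this toy case every infested tree is caught, so recall is 1.0, while one healthy tree is flagged by mistake, which lowers precision and accuracy; the same pattern of perfect recall with lower precision appears in the figures reported above.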

Life-cycle Asset Management in Residential Developments Building on Transport System Critical Attributes via a Data-mining Algorithm

Buildings ◽

10.3390/buildings9010001 ◽

2018 ◽

Vol 9 (1) ◽

pp. 1 ◽

Cited By ~ 3

Author(s):

Umair Hasan ◽

Andrew Whyte ◽

Hamad Al Jassmi

Keyword(s):

Data Mining ◽

Life Cycle ◽

Built Environment ◽

Transport System ◽

Asset Management ◽

Public Transport ◽

Categorical Variables ◽

Data Mining Algorithm ◽

Large Set ◽

Mining Algorithm

Public transport can discourage individual car usage as a life-cycle asset management strategy towards carbon neutrality. An effective public transport system contributes greatly to the wider goal of a sustainable built environment, provided the critical transit system attributes are measured and addressed to (continue to) improve commuter uptake of public systems by residents living and working in local communities. Travel data from intra-city travellers can advise discrete policy recommendations based on a residential area or development’s public transport demand. Commuter segments related to travelling frequency, satisfaction from service level, and its value for money are evaluated to extract econometric models/association rules. A data mining algorithm with minimum confidence, support, interest, syntactic constraints and meaningfulness measure as inputs is designed to exploit a large set of 31 variables collected for 1,520 respondents, generating 72 models. This methodology presents an alternative to multivariate analyses to find correlations in bigger databases of categorical variables. Results here augment literature by highlighting traveller perceptions related to frequency of buses, journey time, and capacity, as a net positive effect of frequent buses operating on rapid transit routes. Policymakers can address public transport uptake through service frequency variation during peak-hours with resultant reduced car dependence apt to reduce induced life-cycle environmental burdens of buildings by altering residents’ mode choices, and a potential design change of buildings towards a public transit-based, compact, and shared space urban built environment.

Latent Topic Model for Indexing Arabic Documents

International Journal of Information Retrieval Research ◽

10.4018/ijirr.2014010102 ◽

2014 ◽

Vol 4 (1) ◽

pp. 29-45 ◽

Cited By ~ 3

Author(s):

Rami Ayadi ◽

Mohsen Maraoui ◽

Mounir Zrigui

Keyword(s):

Topic Model ◽

Inflectional Morphology ◽

Arabic Text ◽

Text Representation ◽

Text Documents ◽

Latent Topic ◽

Latent Topics ◽

F Measure

In this paper, the authors present a latent topic model to index and represent Arabic text documents in a way that reflects more of their semantics. Text representation in a language with rich inflectional morphology, such as Arabic, is not a trivial task and requires special treatment. The authors describe their approach for analyzing and preprocessing Arabic text and then describe the stemming process. Finally, the latent Dirichlet allocation (LDA) model is adapted to extract Arabic latent topics: the authors extract the significant topics of all texts, each topic is described by a particular distribution of descriptors, and each text is then represented as a vector over these topics. The classification experiment is conducted on an in-house corpus; latent topics are learned with LDA for different numbers of topics K (25, 50, 75, and 100), and the authors then compare the results with classification in the full word space. The results show that classification in the reduced topic space outperforms, in terms of precision, recall, and F-measure, both classification in the full word space and classification using LSI reduction.

Biomedical Document Clustering Based on Accelerated Symbiotic Organisms Search Algorithm

International Journal of Swarm Intelligence Research ◽

10.4018/ijsir.2021100109 ◽

2021 ◽

Vol 12 (4) ◽

pp. 169-185

Author(s):

Saida Ishak Boushaki ◽

Omar Bendjeghaba ◽

Nadjet Kamel

Keyword(s):

Clustering Algorithm ◽

Search Algorithm ◽

Clustering Algorithms ◽

Document Clustering ◽

Latent Semantic Indexing ◽

Research Area ◽

Semantic Indexing ◽

Local Optima ◽

Symbiotic Organisms Search ◽

Symbiotic Organisms

Clustering is an important unsupervised analysis technique for big data mining. It is applied in several domains, including the biomedical documents of the MEDLINE database. Document clustering based on metaheuristics is an active research area. However, these algorithms tend to get trapped in local optima, require many parameters to be tuned, and need the documents to be indexed by a high-dimensional matrix under the traditional vector space model. To overcome these limitations, this paper proposes a new parameter-free document clustering algorithm (ASOS-LSI). It is based on the recent symbiotic organisms search (SOS) metaheuristic and enhanced by an acceleration technique. Furthermore, the documents are represented by semantic indexing based on the well-known latent semantic indexing (LSI). Experiments on well-known biomedical document datasets show the significant superiority of ASOS-LSI over five well-known algorithms in terms of compactness, F-measure, purity, misclassified documents, entropy, and runtime.

Synthesis, Characterization and Development of Energy Harvesting Techniques Incorporated with Antennas: A Review Study

Sensors ◽

10.3390/s20102772 ◽

2020 ◽

Vol 20 (10) ◽

pp. 2772 ◽

Cited By ~ 3

Author(s):

Husam Hamid Ibrahim ◽

Mandeep S. J. Singh ◽

Samir Salem Al-Bawri ◽

Mohammad Tariqul Islam

Keyword(s):

Energy Harvesting ◽

Research Area ◽

Voltage Regulation ◽

Input Power ◽

Energy Sources ◽

Promising Alternative ◽

Significant Research ◽

RF Energy ◽

Self Powered ◽

The Military

The investigation into new sources of energy with the highest efficiency which are derived from existing energy sources is a significant research area and is attracting a great deal of interest. Radio frequency (RF) energy harvesting is a promising alternative for obtaining energy for wireless devices directly from RF energy sources in the environment. An overview of the energy harvesting concept will be discussed in detail in this paper. Energy harvesting is a very promising method for the development of self-powered electronics. Many applications, such as the Internet of Things (IoT), smart environments, the military or agricultural monitoring depend on the use of sensor networks which require a large variety of small and scattered devices. The low-power operation of such distributed devices requires wireless energy to be obtained from their surroundings in order to achieve safe, self-sufficient and maintenance-free systems. The energy harvesting circuit is known to be an interface between piezoelectric and electro-strictive loads. A modern view of circuitry for energy harvesting is based on power conditioning principles that also involve AC-to-DC conversion and voltage regulation. Throughout the field of energy conversion, energy harvesting circuits often impose electric boundaries for devices, which are important for maximizing the energy that is harvested. The power conversion efficiency (PCE) is described as the ratio between the rectifier’s output DC power and the antenna-based RF-input power (before its passage through the corresponding network).
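
The power conversion efficiency defined in the last sentence is a simple ratio, sketched below. The helper names and the input and output power values are hypothetical; the dBm-to-milliwatt conversion is the standard one.

```python
def dbm_to_mw(p_dbm):
    """Convert a power level in dBm to milliwatts."""
    return 10 ** (p_dbm / 10)


def power_conversion_efficiency(p_dc_out_mw, p_rf_in_mw):
    """PCE = rectifier DC output power / RF input power at the antenna."""
    if p_rf_in_mw <= 0:
        raise ValueError("RF input power must be positive")
    return p_dc_out_mw / p_rf_in_mw


# Hypothetical operating point: -10 dBm of RF input, 0.03 mW of DC output.
p_in_mw = dbm_to_mw(-10.0)
pce = power_conversion_efficiency(0.03, p_in_mw)
print(f"PCE = {pce:.0%}")
```

Both powers must be expressed in the same unit before taking the ratio, which is why the RF input is converted from dBm first.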

A qualitative survey on frequent subgraph mining

Open Computer Science ◽

10.1515/comp-2018-0018 ◽

2018 ◽

Vol 8 (1) ◽

pp. 194-209 ◽

Cited By ~ 1

Author(s):

Büsra Güvenoglu ◽

Belgin Ergenç Bostanoglu

Keyword(s):

Data Mining ◽

Graph Mining ◽

Research Area ◽

Heterogeneous Data ◽

Graph Representation ◽

Frequent Subgraph Mining ◽

Subgraph Mining ◽

Frequent Subgraph ◽

Input Type ◽

Frequent Subgraphs

Data mining is a popular research area that has been studied by many researchers and focuses on finding unforeseen and important information in large databases. Graphs are one of the popular data structures used to represent large heterogeneous data in the field of data mining, so graph mining is one of the most popular subdivisions of data mining. Subgraphs that occur in a database more frequently than a user-defined threshold are called frequent subgraphs. Frequent subgraphs can give important information about the database; using this information, data can be classified, clustered, and indexed. The purpose of this survey is (i) to examine frequent subgraph mining algorithms in terms of the phases of the frequent subgraph discovery process, such as candidate generation and frequency calculation, (ii) to categorize the algorithms according to their general attributes, such as input type, dynamicity of graphs, result type, the algorithmic approach they are based on, algorithmic design, and graph representation, and (iii) to discuss the performance of the algorithms in comparison to each other and the challenges they have faced recently.

An Efficient Clustering Approach for Automatic Detection of Calcification in Low Dose Chest CT

International Journal of Scientific Research in Computer Science Engineering and Information Technology ◽

10.32628/cseit195231 ◽

2019 ◽

pp. 163-168

Author(s):

P. Tamijiselvy ◽

N. Kavitha ◽

K. M. Keerthana ◽

D. Menakha

Keyword(s):

Data Mining ◽

Low Dose ◽

Early Stage ◽

Clustering Algorithms ◽

Automatic Detection ◽

Chest CT ◽

Data Mining Algorithm ◽

Fuzzy C Means ◽

Data Mining Algorithms ◽

Using Data

The degree of aortic calcification has been shown to be a risk indicator for vascular events, including cardiovascular events. The developed method is a fully automated data mining algorithm to segment and measure calcification using low-dose chest CT in smokers aged 50 to 70. Subjects with increased cardiovascular risk can be identified using data mining algorithms. This paper presents a method for the automatic detection of coronary artery calcifications in low-dose chest CT scans using effective clustering algorithms in three phases: preprocessing, segmentation, and clustering. The Fuzzy C-Means algorithm achieves an accuracy of 80.23%, demonstrating that it detects cardiovascular disease at an early stage.
