Data mining techniques

2012 ◽

pp. 296-308

Author(s):

Ondrej Habala ◽

Martin Šeleng ◽

Viet Tran ◽

Branislav Šimo ◽

Ladislav Hluchý

Keyword(s):

Data Mining ◽

Environmental Data ◽

Environmental Applications ◽

Data Sets ◽

Distributed Data ◽

New Methods ◽

Prospective Application ◽

Using Data ◽

Computer Power

The project Advanced Data Mining and Integration Research for Europe (ADMIRE) is designing new methods and tools for comfortable mining and integration of large, distributed data sets. One prospective application domain for such methods and tools is environmental applications, which often use various data sets from different vendors and where data mining is becoming increasingly popular as more computer power becomes available. The authors present a set of experimental environmental scenarios and the application of ADMIRE technology in these scenarios. The scenarios use data mining of distributed data sets from several providers in Slovakia to try to predict meteorological and hydrological phenomena that currently cannot be, or are not, predicted. The scenarios have been designed by environmental experts, and apart from serving as testing grounds for the ADMIRE technology, their results are of particular interest to the experts who designed them.

An Initial Point Selection Algorithm for K-Means Clustering

Advanced Materials Research ◽

10.4028/www.scientific.net/amr.791-793.1289 ◽

2013 ◽

Vol 791-793 ◽

pp. 1289-1292

Author(s):

Le Qiang Bai ◽

Yan Yao Zhou ◽

Shi Hong Zhang

Keyword(s):

Initial Point ◽

Data Sets ◽

Similar Data ◽

Selection Algorithm ◽

Point Selection ◽

Random Data ◽

Data Object ◽

Clustering Center ◽

Data Objects ◽

Standard Sets

To address the sensitivity of the K-Means algorithm to the choice of initial clustering centers, this paper proposes an initial point selection algorithm for K-Means. The algorithm processes the properties of the data objects, determines the density of each data object by counting the number of similar data objects, and selects the cluster centers according to that density. With the number of clusters given, clustering results on UCI standard data sets and on random data sets demonstrate that the proposed algorithm has good stability and accuracy.
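The density idea in this abstract can be sketched as follows. This is a minimal illustration and not the authors' algorithm: the similarity radius `radius`, the rule of skipping candidates that lie within `radius` of an already chosen center, and the function name `density_initial_centers` are all assumptions introduced here, since the abstract does not give these details.

```python
import math

def density_initial_centers(points, k, radius):
    """Pick k initial centers, preferring points with many close neighbours.

    Assumption: two points count as "similar" when their Euclidean
    distance is at most `radius`; the abstract does not define similarity.
    """
    # Density of a point = number of other points within `radius` of it.
    density = [
        sum(1 for j, q in enumerate(points) if j != i and math.dist(p, q) <= radius)
        for i, p in enumerate(points)
    ]
    # Visit points from densest to sparsest; ties keep the input order.
    order = sorted(range(len(points)), key=lambda i: -density[i])
    centers = []
    for i in order:
        # Assumption: skip candidates close to a center already chosen,
        # so that the centers land in different dense regions.
        if all(math.dist(points[i], c) > radius for c in centers):
            centers.append(points[i])
        if len(centers) == k:
            break
    return centers

# Hypothetical toy data: two dense groups and one isolated point.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0), (9.0, 0.0)]
centers = density_initial_centers(points, k=2, radius=0.5)
```

Because nothing in this selection is random, repeated runs on the same data return the same centers, which is the kind of stability the abstract refers to.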

Pengelompokan Komentar Dataset Sentipol dengan Modified K-Means Clustering [Clustering Comments of the Sentipol Dataset with Modified K-Means Clustering]

Jurnal Teknik Informatika dan Sistem Informasi ◽

10.28932/jutisi.v6i3.3006 ◽

2020 ◽

Vol 6 (3) ◽

Author(s):

Ruddy Cahyanto ◽

Antonius Rachmat Chrismanto ◽

Danny Sebastian

Keyword(s):

Data Mining ◽

Test Data ◽

Data Sets ◽

Similar Data ◽

Random Factor ◽

Number Of Clusters ◽

Cluster Result ◽

Data Clusters ◽

Kmeans Algorithm

Clustering is a data mining technique that groups data sets into clusters of similar data. One of the algorithms commonly used for clustering is K-Means. However, the K-Means algorithm has several weaknesses, one of which is the random factor in initial centroid selection, so the cluster result is inconsistent even when it is tested with exactly the same data. The Modified K-Means algorithm focuses on selecting the initial centroids to overcome the inconsistency of cluster results in the K-Means method. The test was conducted using the Sentipol dataset and focused only on comment data. The number of clusters was set to 3, based on the number of existing comment labels (positive, negative, and neutral). The test results show that the Modified K-Means algorithm produces a better purity value than the K-Means algorithm: an average purity of 0.42, compared with 0.391 for K-Means. In tests of the random factor, conducted 5 times with the same attributes and test data, the cluster results of the Modified K-Means algorithm did not change, so the resulting purity value was also the same. With the K-Means algorithm, by contrast, the cluster results always changed in each test, so the purity value is also likely to change.
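The purity measure reported in this abstract can be illustrated with the commonly used definition: each cluster is credited with its most frequent true label, and purity is the share of items covered that way. The sketch below assumes that definition (the paper may compute it differently), and the cluster assignments and labels are hypothetical toy values.

```python
from collections import Counter, defaultdict

def purity(cluster_ids, labels):
    """Fraction of items that carry the majority label of their cluster."""
    by_cluster = defaultdict(Counter)
    for cid, label in zip(cluster_ids, labels):
        by_cluster[cid][label] += 1
    # Sum, over clusters, the count of the most frequent label.
    majority_total = sum(max(counts.values()) for counts in by_cluster.values())
    return majority_total / len(labels)

# Hypothetical toy data: 8 comments in 3 clusters with 3 label classes.
cluster_ids = [0, 0, 0, 1, 1, 1, 2, 2]
labels = ["positive", "positive", "negative",
          "negative", "negative", "neutral",
          "neutral", "positive"]
score = purity(cluster_ids, labels)  # (2 + 2 + 1) / 8
```

A purity of 1.0 would mean every cluster contains a single label class; values near 1/3 are close to what an uninformative 3-way split would give on balanced labels.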

Role of Pre-processing Phase in Document Clustering Technique for Gurmukhi Script

International Journal of Innovative Technology and Exploring Engineering - Special Issue ◽

10.35940/ijitee.c9105.019320 ◽

2020 ◽

Vol 9 (3) ◽

pp. 3216-3220

Keyword(s):

Significant Role ◽

Document Clustering ◽

Large Data ◽

Data Sets ◽

Similar Data ◽

Clustering Technique ◽

Two Phases ◽

Data Objects ◽

Gurmukhi Script

Document clustering plays a central role in knowledge discovery and data mining by organizing large data sets into a certain number of groups of data objects called clusters. Each cluster consists of similar data objects, such that data objects in the same cluster are similar to one another and dissimilar to the data objects of other clusters. The document clustering technique for Gurmukhi script consists of two phases: 1) a pre-processing phase and 2) a processing phase. This paper concentrates on the pre-processing phase, whose purpose is to convert unstructured text into a structured text format. The sub-phases of the pre-processing phase are segmentation, tokenization, removal of stop words, stemming, and normalization. The purpose of this paper is to present the significant role of the pre-processing phase in the overall performance of the document clustering technique for Gurmukhi script. The experimental results show the significant role of the pre-processing phase in the assignment of data objects to the relevant clusters as well as in the creation of a meaningful cluster title list.
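The five sub-phases named in this abstract can be chained as in the sketch below. It only illustrates the order of the steps: the stop-word list, the suffix list, the use of Unicode NFC for normalization, and the English sample text are placeholders assumed here, not the Gurmukhi resources or rules used in the paper.

```python
import re
import unicodedata

STOP_WORDS = {"the", "is", "of", "a", "and"}  # hypothetical placeholder list
SUFFIXES = ("ing", "ed", "s")                 # hypothetical placeholder suffixes

def segment(text):
    # Split into sentences on ., !, ? and the danda (U+0964),
    # a sentence-ending mark in several Indic scripts.
    return [s.strip() for s in re.split(r"[.!?\u0964]+", text) if s.strip()]

def tokenize(sentence):
    # Whitespace tokenization, trimming surrounding punctuation.
    tokens = (t.strip(",;:\"'()") for t in sentence.split())
    return [t for t in tokens if t]

def stem(token):
    # Naive suffix stripping; keeps at least three characters of the stem.
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def normalize(token):
    # Assumption: normalization means a canonical Unicode form plus lowercasing.
    return unicodedata.normalize("NFC", token).lower()

def preprocess(text):
    result = []
    for sentence in segment(text):
        tokens = tokenize(sentence)
        tokens = [t for t in tokens if t not in STOP_WORDS]
        tokens = [normalize(stem(t)) for t in tokens]
        result.append(tokens)
    return result

structured = preprocess("the farmer is ploughing fields. rains started!")
```

The output is a list of token lists, one per sentence, which is the kind of structured form a later clustering phase could turn into term vectors.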

Neural networks in data mining

Agricultural Economics (Zemědělská ekonomika) ◽

10.17221/5427-agricecon ◽

2012 ◽

Vol 49 (No. 9) ◽

pp. 427-431 ◽

Cited By ~ 3

Author(s):

A. Veselý

Keyword(s):

Data Mining ◽

Neural Networks ◽

Knowledge Engineering ◽

Data Sets ◽

Data Mining Algorithm ◽

Self Organizing Map ◽

New Methods ◽

Area Of Interest ◽

Database Theory ◽

Self Organizing

Possessing relevant information is a necessary condition for success in modern business. Information can be divided into data and knowledge. How to gather, store and retrieve data is studied in database theory. Knowledge engineering focuses on knowledge and studies methods for formalizing and acquiring it. Knowledge can be obtained from experts, that is, specialists in the area of interest, or by induction from sets of data. Automatic induction of knowledge from data sets, usually stored in large databases, is called data mining. The classical methods of obtaining knowledge from data sets are statistical. In data mining, new methods are used alongside statistical ones. These new methods originate in artificial intelligence. They look for unknown and unexpected relations, which can be uncovered by exploring the data in a database. The article describes the use of modern data mining methods, with particular attention to methods based on neural network theory. The advantages and drawbacks of applying multilayer feedforward neural networks and Kohonen’s self-organizing maps are discussed. Kohonen’s self-organizing map is the most promising neural data-mining algorithm with regard to its capability to visualize high-dimensional data.

Cluster Analysis for Outlier Detection

Encyclopedia of Data Warehousing and Mining, Second Edition ◽

10.4018/978-1-60566-010-3.ch035 ◽

2011 ◽

pp. 214-218 ◽

Cited By ~ 1

Author(s):

Frank Klawonn ◽

Frank Rehm

Keyword(s):

Data Mining ◽

Outlier Detection ◽

Clustering Algorithm ◽

Mean Value ◽

Knowledge Discovery In Databases ◽

Data Sets ◽

Data Set ◽

Data Density ◽

Noise Clustering ◽

Data Objects

For many applications in knowledge discovery in databases, finding outliers (rare events) is important. Outliers are observations that deviate significantly from the rest of the data, so that they seem to be generated by another process (Hawkins, 1980). Such outlier objects often contain information about untypical behavior of the system. However, outliers bias the results of many data mining methods, such as the mean value, the standard deviation or the positions of the prototypes of k-means clustering (Estivill-Castro, 2004; Keller, 2000). Therefore, identifying outliers is a crucial step before further analysis or processing of the data is carried out with more sophisticated data mining techniques. Usually, data objects are considered outliers when they occur in a region of extremely low data density. Many clustering techniques that deal with noisy data and can identify outliers, such as possibilistic clustering (PCM) (Krishnapuram & Keller, 1993; Krishnapuram & Keller, 1996) or noise clustering (NC) (Dave, 1991; Dave & Krishnapuram, 1997), need good initializations or suffer from a lack of adaptability to different cluster sizes (Rehm, Klawonn & Kruse, 2007). Distance-based approaches (Knorr, 1998; Knorr, Ng & Tucakov, 2000) have a global view of the data set. These algorithms can hardly treat data sets containing regions with different data density (Breuning, Kriegel, Ng & Sander, 2000). In this work we present an approach that combines a fuzzy clustering algorithm (Höppner, Klawonn, Kruse & Runkler, 1999) (or any other prototype-based clustering algorithm) with statistical distribution-based outlier detection.
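One simple way to combine prototype-based clustering with a statistical test is sketched below. It is not the authors' method: it assumes the prototypes are already given, uses hard nearest-prototype assignment instead of fuzzy memberships, and flags a point when its distance to its prototype exceeds the cluster's mean distance plus `z` standard deviations. The threshold `z`, the function name `flag_outliers`, and the data are assumptions made for illustration.

```python
import math
from statistics import mean, pstdev

def flag_outliers(points, prototypes, z=2.0):
    """Return the sorted indices of points unusually far from their prototype."""
    # Hard assignment: each point belongs to its nearest prototype.
    nearest = [
        min(range(len(prototypes)), key=lambda c: math.dist(p, prototypes[c]))
        for p in points
    ]
    dists = [math.dist(p, prototypes[c]) for p, c in zip(points, nearest)]
    outliers = []
    for c in range(len(prototypes)):
        members = [i for i, n in enumerate(nearest) if n == c]
        if len(members) < 2:
            continue  # too few points to estimate a spread
        d = [dists[i] for i in members]
        mu, sigma = mean(d), pstdev(d)
        if sigma == 0:
            continue  # all members equally far away: nothing stands out
        outliers.extend(i for i in members if dists[i] > mu + z * sigma)
    return sorted(outliers)

# Hypothetical toy data: nine points at distance 1 from (0, 0), one stray
# point at distance 10, and a small second cluster around (50, 50).
points = [(1, 0), (-1, 0), (0, 1), (0, -1), (1, 0), (-1, 0), (0, 1), (0, -1),
          (1, 0), (10, 0), (50, 51), (51, 50), (50, 49)]
prototypes = [(0, 0), (50, 50)]
flagged = flag_outliers(points, prototypes)
```

Note that the stray point itself inflates the mean and standard deviation used to judge it, so this plain rule can miss outliers in small clusters; a robust estimate of spread would reduce that effect.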

Mining Environmental Data in the ADMIRE Project Using New Advanced Methods and Tools

International Journal of Distributed Systems and Technologies ◽

10.4018/jdst.2010100101 ◽

2010 ◽

Vol 1 (4) ◽

pp. 1-13

Author(s):

Ondrej Habala ◽

Martin Šeleng ◽

Viet Tran ◽

Branislav Šimo ◽

Ladislav Hluchý

Keyword(s):

Data Mining ◽

Environmental Data ◽

Environmental Applications ◽

Data Sets ◽

Distributed Data ◽

New Methods ◽

Prospective Application ◽

Using Data ◽

Computer Power

The project Advanced Data Mining and Integration Research for Europe (ADMIRE) is designing new methods and tools for comfortable mining and integration of large, distributed data sets. One prospective application domain for such methods and tools is environmental applications, which often use various data sets from different vendors and where data mining is becoming increasingly popular as more computer power becomes available. The authors present a set of experimental environmental scenarios and the application of ADMIRE technology in these scenarios. The scenarios use data mining of distributed data sets from several providers in Slovakia to try to predict meteorological and hydrological phenomena that currently cannot be, or are not, predicted. The scenarios have been designed by environmental experts, and apart from serving as testing grounds for the ADMIRE technology, their results are of particular interest to the experts who designed them.

Improved Density Based Spatial Clustering of Applications of Noise Clustering Algorithm for Knowledge Discovery in Spatial Data

Mathematical Problems in Engineering ◽

10.1155/2016/1564516 ◽

2016 ◽

Vol 2016 ◽

pp. 1-9 ◽

Cited By ~ 1

Author(s):

Arvind Sharma ◽

R. K. Gupta ◽

Akhilesh Tiwari

Keyword(s):

Data Mining ◽

Information System ◽

Geographic Information System ◽

Spatial Data ◽

Spatial Clustering ◽

Spatial Data Mining ◽

Geographic Information ◽

Data Sets ◽

Clustering Methods ◽

Data Objects

Many techniques are available in data mining, and its subfield, spatial data mining, aims to understand relationships between data objects. Collections of data objects associated with spatial features are called spatial databases. These relationships can be used for prediction and trend detection between spatial and nonspatial objects for social and scientific purposes. A huge data set may be collected from different sources such as satellite images, X-rays, medical images, traffic cameras, and GIS. The primary purpose of this paper is to handle this large amount of data and to establish relationships within it in a definite manner with definite results. The paper describes a complete process for understanding how spatial data differs from other kinds of data sets and how it is refined to obtain useful results and identify trends for prediction in geographic information systems and the spatial data mining process. A new, improved clustering algorithm is designed, because clustering is indispensable in the spatial data mining process. Clustering methods are useful in various fields such as GIS (Geographic Information System), GPS (Global Positioning System), weather forecasting, air traffic control, water treatment, area selection, cost estimation, planning of rural and urban areas, remote sensing, and VLSI design. The paper presents a study of various clustering methods and algorithms and an improved version of DBSCAN, called IDBSCAN (Improved Density Based Spatial Clustering of Application of Noise). The algorithm is designed by adding some important attributes that are responsible for generating better clusters from existing data sets in comparison with other methods.
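For orientation, the baseline that IDBSCAN builds on can be sketched in a few lines. The code below is plain DBSCAN, not the improved algorithm from this paper, whose added attributes the abstract does not specify; the parameter names `eps` and `min_pts` and the sample coordinates are chosen here for illustration.

```python
import math

def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters, -1 for noise."""
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(points)

    def neighbours(i):
        # All points within eps of point i (including i itself).
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins, but is not expanded
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            nbrs = neighbours(j)
            if len(nbrs) >= min_pts:
                queue.extend(nbrs)  # j is a core point, so keep expanding
    return labels

# Hypothetical toy coordinates: two compact groups and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (50, 50)]
labels = dbscan(points, eps=1.5, min_pts=3)
```

The neighbour search here scans every point, which is quadratic overall; implementations intended for large spatial data sets typically use a spatial index for that step.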

Ultrastructure of the basal body apparatus in epidermal cells of a sponge larva (Aplysilla SP: Demospongiae)

Proceedings, annual meeting, Electron Microscopy Society of America ◽

10.1017/s042482010012504x ◽

1992 ◽

Vol 50 (1) ◽

pp. 928-929

Author(s):

R.L. Pinto ◽

R.M. Woollacott

Keyword(s):

Sodium Bicarbonate ◽

Epoxy Resin ◽

Basal Body ◽

Phosphate Buffer ◽

Similar Data ◽

Basal Apparatus ◽

Striated Rootlet ◽

The Common ◽

Basal Foot ◽

Sexually Mature

The basal body and its associated rootlet are the organelles responsible for anchoring the flagellum or cilium in the cytoplasm. Structurally, the common denominators of the basal apparatus are the basal body, a basal foot from which microtubules or microfilaments emanate, and a striated rootlet. A study of the basal apparatus from cells of the epidermis of a sponge larva was initiated to provide a comparison with similar data on adult sponges. Sexually mature colonies of Aplysilla sp. were collected from Keehi Lagoon Marina, Honolulu, Hawaii. Larvae were fixed in 2.5% glutaraldehyde and 0.14 M NaCl in 0.2 M Millonig’s phosphate buffer (pH 7.4). Specimens were postfixed in 1% OsO4 in 1.25% sodium bicarbonate (pH 7.2) and embedded in epoxy resin. The larva of Aplysilla sp. was previously described (as Dendrilla cactus) based on live observations and SEM by Woollacott and Hadfield.

A Survey on Preparing Data Sets for Data Mining Analysis using Horizontal Aggregations in SQL

International Journal of Advanced Research in Computer Science and Software Engineering ◽

10.23956/ijarcsse/v7i4/0199 ◽

2017 ◽

Vol 7 (5) ◽

pp. 172-176

Author(s):

Prashant B. Rajole ◽

Keyword(s):

Data Mining ◽

Data Sets ◽

Data Mining Analysis
