Constrained Data Mining

Author(s):  
Brad Morantz

Mining a large data set can be time consuming, and without constraints the process can generate sets or rules that are invalid or redundant. Some methods, such as clustering, are effective but can be extremely time consuming for large data sets; as the set grows in size, the processing time grows exponentially. In other situations, without guidance via constraints, the data mining process might find morsels that have no relevance to the topic or are trivial and hence worthless. The knowledge extracted must be comprehensible to experts in the field (Pazzani, 1997). With time-ordered data, finding things that are in reverse chronological order might produce an impossible rule: certain actions always precede others, some things happen together while others are mutually exclusive, and sometimes there are maximum or minimum values that cannot be violated. Must an observation fit all of the requirements or just most? And how many is “most”? Constraints attenuate the amount of output (Hipp & Guntzer, 2002). By performing a first-stage constrained mining, that is, going through the data and finding records that fulfill certain requirements before the next processing stage, time can be saved and the quality of the results improved. The second stage might also contain constraints to further refine the output. Constraints help to focus the search or mining process and reduce the computational time, and they have been empirically shown to improve cluster purity (Wagstaff & Cardie, 2000; Hipp & Guntzer, 2002). The theory behind these results is that the constraints help guide the clustering, showing which points to connect and which to avoid. Applying user-provided knowledge, in the form of constraints, reduces the hypothesis space, which can reduce the processing time and improve the learning quality.
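As a concrete illustration of constraint-guided clustering in the spirit of Wagstaff & Cardie's instance-level constraints, the sketch below assigns each point to its nearest centroid that does not violate hypothetical must-link or cannot-link pairs. The function names, the dictionary-based assignment, and the data are all assumptions of this illustration, not the cited papers' exact algorithm.

```python
import math

def violates(point_idx, cluster, assignments, must_link, cannot_link):
    # A point may not join a cluster that separates a must-link pair
    # or merges a cannot-link pair.
    for a, b in must_link:
        other = b if a == point_idx else a if b == point_idx else None
        if other is not None and assignments.get(other) is not None \
                and assignments[other] != cluster:
            return True
    for a, b in cannot_link:
        other = b if a == point_idx else a if b == point_idx else None
        if other is not None and assignments.get(other) == cluster:
            return True
    return False

def constrained_assign(points, centroids, must_link, cannot_link):
    # Assign each point to the nearest centroid that violates no
    # constraint; points with no feasible cluster stay unassigned
    # in this sketch.
    assignments = {}
    for i, p in enumerate(points):
        order = sorted(range(len(centroids)),
                       key=lambda c: math.dist(p, centroids[c]))
        for c in order:
            if not violates(i, c, assignments, must_link, cannot_link):
                assignments[i] = c
                break
    return assignments
```

Note how a cannot-link pair forces two nearby points into different clusters even when distance alone would put them together: the constraint prunes the hypothesis space, exactly the effect the text describes.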

Author(s):  
Md. Zakir Hossain ◽  
Md.Nasim Akhtar ◽  
R.B. Ahmad ◽  
Mostafijur Rahman

Data mining is the process of finding structure in large data sets. With this process, decision makers can make particular decisions for the further development of real-world problems. Several data clustering techniques are used in data mining for finding specific patterns in data. The K-means method is one of the familiar clustering techniques for clustering large data sets. The K-means clustering method partitions the data set based on the assumption that the number of clusters is fixed. The main problem of this method is that if the number of clusters is chosen to be small, there is a higher probability of adding dissimilar items to the same group; on the other hand, if the number of clusters is chosen to be high, there is a higher chance of adding similar items to different groups. In this paper, we address this issue by proposing a new K-means clustering algorithm that performs data clustering dynamically. The proposed method initially calculates a threshold value as a centroid of K-means, and based on this value the number of clusters is formed. At each iteration of K-means, if the Euclidean distance between two points is less than or equal to the threshold value, then these two data points will be in the same group; otherwise, the proposed method will create a new cluster with the dissimilar data point. The results show that the proposed method outperforms the original K-means method.
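A minimal sketch of the threshold-driven grouping the abstract describes: a point joins an existing cluster when its Euclidean distance to that cluster's centroid is within the threshold, and otherwise seeds a new cluster. The function name and the running-mean centroid update are assumptions of this illustration, not the authors' exact algorithm.

```python
import math

def dynamic_cluster(points, threshold):
    # Clusters grow dynamically: the number of clusters is not fixed
    # in advance, it emerges from the distance threshold.
    centroids, members = [], []
    for p in points:
        placed = False
        for i, c in enumerate(centroids):
            if math.dist(p, c) <= threshold:
                members[i].append(p)
                # Update the centroid as the mean of its members.
                n = len(members[i])
                centroids[i] = tuple(sum(q[d] for q in members[i]) / n
                                     for d in range(len(p)))
                placed = True
                break
        if not placed:
            # Dissimilar point: start a new cluster, as in the paper.
            centroids.append(p)
            members.append([p])
    return centroids, members
```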


2018 ◽  
Vol 7 (2) ◽  
pp. 100-105
Author(s):  
Simranjit Kaur ◽  
Seema Baghla

Online shopping is the purchase of various items through an online medium. Data mining is defined as a process used to extract usable data from a larger set of raw data. Here, the data set was extracted from demographic profiles and questionnaires, and the gathered data were investigated using association analysis. The method of shopping changed completely with the advent of internet technology. Association rule mining, one of the important problems of data mining, is used here. The goal of association rule mining is to detect relationships or associations between specific values of categorical variables in large data sets.
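The support and confidence computations at the heart of association rule mining can be sketched as follows; the helper names, thresholds, and toy market-basket data are illustrative assumptions, not the study's actual questionnaire data.

```python
from itertools import combinations

def frequent_pairs(transactions, min_support):
    # Count co-occurring item pairs and keep those whose support
    # (fraction of transactions containing the pair) meets the threshold.
    counts = {}
    for t in transactions:
        for pair in combinations(sorted(set(t)), 2):
            counts[pair] = counts.get(pair, 0) + 1
    n = len(transactions)
    return {p: c / n for p, c in counts.items() if c / n >= min_support}

def confidence(transactions, antecedent, consequent):
    # confidence(A -> B) = support(A and B) / support(A)
    has_a = [t for t in transactions if antecedent in t]
    both = [t for t in has_a if consequent in t]
    return len(both) / len(has_a) if has_a else 0.0
```

A frequent pair with high confidence becomes a rule such as "buyers of bread also buy milk", which is the kind of association the abstract aims to detect among categorical variables.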


2019 ◽  
Vol 8 (05) ◽  
pp. 24655-24660
Author(s):  
Kranthi K Lammatha

Data Mining on 5G Technology IoT

Currently, data mining is regarded as one of the essential factors for the next generation of mobile networks. Through research and data analysis, there are expectations that the complexity of these networks will be overcome and that it will be possible to carry out dynamic management and operation activities. In order to fully comprehend the particulars of a 5G network, certain kinds of information should be gathered by network components so that they can be analyzed by a data mining scheme. Recent years have seen a tremendous effort put into designing the 5th generation of mobile networks (5G). The innovation of 5G mobile networks has been aimed at providing tailor-cut solutions for different kinds of industries, particularly the telecommunication sector, intelligent transportation, the health sector, and even smart factories. On the other hand, the scientific community has realized that big data solutions can significantly enhance the operation and management of both current and future mobile networks. Usually, data mining is employed to discover patterns and relationships between different variables, particularly in large data sets; statistical analysis, machine learning, and artificial intelligence are applied to the data set in the course of extracting the necessary knowledge from the examined data. Data mining is integral to 5G technology because it eases the decision-making process, mitigating common challenges through a dynamic and proactive mechanism.


2020 ◽  
Author(s):  
Isha Sood ◽  
Varsha Sharma

Essentially, data mining concerns the computation of data and the identification of patterns and trends in the information so that we might decide or judge. Data mining concepts have been in use for years, but with the emergence of big data they are even more common. In particular, the scalable mining of such large data sets is a difficult issue that has attracted several recent efforts. A few of these recent works use the MapReduce methodology to construct data mining models across the data set. In this article, we examine current approaches to large-scale data mining and compare their output to the MapReduce model. Based on our research, a system for data mining that combines MapReduce and sampling is implemented and discussed.
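The MapReduce-plus-sampling combination can be sketched on a single machine as follows. This is a toy model of the pattern, not the authors' system: the mapper emits (key, value) pairs, the shuffle groups them by key, the reducer folds each group, and sampling mines a random subset to cut processing time.

```python
from collections import defaultdict
from functools import reduce
import random

def map_reduce(records, mapper, reducer):
    # Map phase: each record yields (key, value) pairs.
    groups = defaultdict(list)
    for r in records:
        for key, value in mapper(r):
            groups[key].append(value)
    # Shuffle happened implicitly in the grouping; reduce phase folds
    # each key's values with the reducer.
    return {k: reduce(reducer, vs) for k, vs in groups.items()}

def sample(records, fraction, seed=0):
    # Sampling stage: process only a random fraction of the records.
    rng = random.Random(seed)
    return [r for r in records if rng.random() < fraction]
```

Running a word count through `map_reduce` shows the pattern end to end; in a real deployment the map and reduce phases run on many machines, but the dataflow is the same.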


2017 ◽  
Vol 7 (1) ◽  
pp. 36-40 ◽  
Author(s):  
Joana Pereira ◽  
Hugo Peixoto ◽  
José Machado ◽  
António Abelha

Abstract The large amounts of data generated by healthcare transactions are too complex and voluminous to be processed and analysed by traditional methods. Data mining can improve decision-making by discovering patterns and trends in large amounts of complex data. In the healthcare industry specifically, data mining can be used to decrease costs by increasing efficiency, improve patients' quality of life, and, perhaps most importantly, save the lives of more patients. The main goal of this project is to apply data mining techniques to make possible the prediction of the degree of disability that patients will present when they leave hospitalization. The clinical data that compose the data set were obtained from a single hospital and contain information about patients who were hospitalized in the Cardiovascular Disease (CVD) unit in 2016 for having suffered a cardiovascular accident. The project will use the Waikato Environment for Knowledge Analysis (WEKA) machine learning workbench, since it allows users to quickly try out and compare different machine learning methods on new data sets.


Author(s):  
Malcolm J. Beynon

The essence of data mining is to investigate data (often large data sets) for pertinent information it may contain. The immeasurably large amount of data present in the world, due to the increasing capacity of storage media, raises the issue of the presence of missing values (Olinsky et al., 2003; Brown and Kros, 2003). This encyclopaedia article considers the general issue of the presence of missing values when data mining, and demonstrates, through the utilisation of a data mining technique, the effect of managing (or not managing) their presence. The issue of missing values was first exposited over forty years ago in Afifi and Elashoff (1966). Since then it has continually been the focus of study and explanation (El-Masri and Fox-Wasylyshyn, 2005), covering issues such as the nature of their presence and management (Allison, 2000). With this in mind, one consistent aspect of the missing value debate is the limited number of general strategies available for their management, the main two being either the simple deletion of cases with missing data or some form of imputation of the missing values (see Elliott and Hawthorne, 2005). Examples of specific investigations of missing data (and data quality) include data warehousing (Ma et al., 2000) and customer relationship management (Berry and Linoff, 2000). An alternative strategy considered here is the retention of the missing values, and their subsequent ‘ignorance’ contribution in any data mining undertaken on the associated original incomplete data set. A consequence of this retention is that full interpretability can be placed on the results found from the original incomplete data set. This strategy can be followed when using the nascent CaRBS technique for object classification (Beynon, 2005a, 2005b). CaRBS analyses are presented here to illustrate that data mining can manage the presence of missing values in a much more effective manner than the more inhibitory traditional strategies.
An example data set is considered, with a noticeable level of missing values present in the original data set. A critical increase in the number of missing values present in the data set further illustrates the benefit from ‘intelligent’ data mining (in this case using CaRBS).
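The two traditional strategies named above, case deletion and mean imputation, can be sketched as follows (a toy illustration with `None` standing for a missing value; the CaRBS retention strategy itself is not reproduced here):

```python
def listwise_delete(rows):
    # Strategy 1: drop any case that contains a missing value.
    return [r for r in rows if None not in r]

def mean_impute(rows):
    # Strategy 2: replace each missing value with its column mean,
    # computed over the observed values of that column.
    cols = list(zip(*rows))
    means = [sum(v for v in col if v is not None) /
             max(1, sum(v is not None for v in col)) for col in cols]
    return [[means[j] if v is None else v for j, v in enumerate(r)]
            for r in rows]
```

Deletion discards whole cases (and with them real information), while imputation invents values; both distort the original incomplete data set, which is exactly the inhibition the retention strategy above aims to avoid.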


2018 ◽  
Vol 1 (2) ◽  
pp. 83-91
Author(s):  
M. Hasyim Siregar

In today's competitive business world, we are required to continually develop the business to survive the competition. To achieve this, a few things can be done: improve the quality of the products, add product types, and reduce the company's operational costs by analysing the company's data. Data mining is a technology that automates the process of finding interesting and meaningful patterns in large data sets, enabling human understanding of the discovered patterns and scalable techniques. The Adi Bangunan store sells building materials and household goods and operates like a supermarket, in that buyers themselves pick up the goods they intend to purchase. Its sales and purchase data are not well ordered, so the data serve only as an archive for the store and cannot be used for the development of a marketing strategy. In this research, data mining is applied using the K-Means process model, which provides a standard process for the use of data mining in various areas of classification, because the results of this method can be easily understood and interpreted.
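For reference, the standard K-Means (Lloyd's) iteration that this kind of study builds on can be sketched as below. This is illustrative only; the store's actual data, the number of clusters, and the initialization scheme are not taken from the paper.

```python
import math
import random

def kmeans(points, k, iters=20, seed=0):
    # Lloyd's algorithm: repeatedly assign each point to its nearest
    # centroid, then move each centroid to the mean of its points.
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        buckets = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            buckets[nearest].append(p)
        centroids = [
            tuple(sum(q[d] for q in b) / len(b) for d in range(len(b[0])))
            if b else centroids[i]  # keep an empty cluster's centroid
            for i, b in enumerate(buckets)
        ]
    return centroids, buckets
```

Applied to transaction features (e.g. quantity and frequency per product), the resulting groups are the easily interpreted clusters the abstract refers to.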


Data mining is a technique used to retrieve information for the analysis and discovery of hidden trends in large data sets. Data mining extends to numerous areas such as education, banking, marketing, retail, communications and agriculture. Agriculture is the backbone of the country's economy and an important source of livelihood; it depends primarily on the weather, geology, soil and biology. Agricultural mining is a technology that can contribute information for the growth of agriculture. The current study presents the various techniques of data mining and their role in soil fertility and nutrient analysis. The decision tree is a well-known data mining classification approach; C4.5 and Iterative Dichotomiser 3 (ID3) are two widely used decision tree algorithms for classification. The C4.5, ID3 and the proposed classifier have been trained using the soil sample data set, taking into account the optimal soil parameters pH (potential of hydrogen), EC (electrical conductivity) and ESP (exchangeable sodium percentage). The model is evaluated using a collection of soil sample test results. Classification of soil is the division of soil into classes or groups, each having similar characteristics and likely similar behaviour. Soil classification allows the farmer to know the type of soil and to plant crops suited to that soil type.
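One ID3/C4.5-style step, choosing the threshold on a single soil attribute (such as pH) that maximizes information gain, can be sketched as below. The sample values and labels are invented for illustration and are not from the study's soil data set.

```python
import math
from collections import Counter

def entropy(labels):
    # Shannon entropy of a label distribution, in bits.
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_split(values, labels):
    # Try each candidate threshold on one attribute and keep the one
    # with the largest information gain (base entropy minus the
    # size-weighted entropy of the two resulting partitions).
    base, best = entropy(labels), (None, 0.0)
    for t in sorted(set(values)):
        left = [l for v, l in zip(values, labels) if v <= t]
        right = [l for v, l in zip(values, labels) if v > t]
        if not left or not right:
            continue
        rem = (len(left) * entropy(left) +
               len(right) * entropy(right)) / len(labels)
        if base - rem > best[1]:
            best = (t, base - rem)
    return best
```

A full decision tree applies this step recursively over all attributes (pH, EC, ESP), which is the structure C4.5 and ID3 build.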


2019 ◽  
Vol 8 (2S11) ◽  
pp. 3687-3693

Clustering is a type of mining process in which the data set is categorized into various subclasses. Clustering is essential in classification, grouping, exploratory pattern analysis, image segmentation and decision making. Big data refers to very large data sets that are examined computationally to reveal patterns and associations, particularly ones related to human behaviour and interactions. Big data is very important for several organisations, but in some cases it is complex to store and time consuming to process. One way to overcome these issues is to develop clustering methods, though these suffer from large complexity. Data mining is a technique for extracting useful information, but conventional data mining models cannot be applied to big data because of its inherent complexity. The main scope here is to introduce an overview of data clustering divisions for big data and to explain some of the related work. This survey concentrates on research into several clustering algorithms that work with the elements of big data, and gives a short overview of clustering algorithms grouped under partitioning, hierarchical, grid-based and model-based approaches. Clustering is a major data mining task used for analysing big data; we discuss the problems of applying clustering patterns to big data and the new issues that come up with big data.


Author(s):  
Lior Shamir

Abstract Several recent observations using large data sets of galaxies showed non-random distribution of the spin directions of spiral galaxies, even when the galaxies are too far from each other to have gravitational interaction. Here, a data set of $\sim8.7\cdot10^3$ spiral galaxies imaged by Hubble Space Telescope (HST) is used to test and profile a possible asymmetry between galaxy spin directions. The asymmetry between galaxies with opposite spin directions is compared to the asymmetry of galaxies from the Sloan Digital Sky Survey. The two data sets contain different galaxies at different redshift ranges, and each data set was annotated using a different annotation method. The results show that both data sets show a similar asymmetry in the COSMOS field, which is covered by both telescopes. Fitting the asymmetry of the galaxies to cosine dependence shows a dipole axis with probabilities of $\sim2.8\sigma$ and $\sim7.38\sigma$ in HST and SDSS, respectively. The most likely dipole axis identified in the HST galaxies is at $(\alpha=78^{\rm o},\delta=47^{\rm o})$ and is well within the $1\sigma$ error range compared to the location of the most likely dipole axis in the SDSS galaxies with $z>0.15$, identified at $(\alpha=71^{\rm o},\delta=61^{\rm o})$.
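The cosine-dependence fit can be sketched as a least-squares fit of spin values ($\pm1$) against the cosine of the angular distance to a candidate dipole axis; this is a simplified illustration of the idea, not the paper's exact fitting or error-estimation procedure.

```python
import math

def angular_distance(ra1, dec1, ra2, dec2):
    # Great-circle angle (radians) between two sky positions given in
    # degrees of right ascension and declination.
    r1, d1, r2, d2 = map(math.radians, (ra1, dec1, ra2, dec2))
    return math.acos(max(-1.0, min(1.0,
        math.sin(d1) * math.sin(d2) +
        math.cos(d1) * math.cos(d2) * math.cos(r1 - r2))))

def dipole_fit(galaxies, axis_ra, axis_dec):
    # Fit spin = A * cos(angle to axis) by least squares and return the
    # best-fit amplitude A; galaxies are (ra, dec, spin) with spin +1/-1.
    x = [math.cos(angular_distance(ra, dec, axis_ra, axis_dec))
         for ra, dec, _ in galaxies]
    y = [spin for _, _, spin in galaxies]
    sxx = sum(v * v for v in x)
    return sum(a * b for a, b in zip(x, y)) / sxx if sxx else 0.0
```

Scanning candidate axes over the sky and keeping the one with the strongest (most significant) amplitude is what identifies the most likely dipole axis.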

