Data Mining Algorithms: An Overview

Research on data mining has yielded numerous tools, algorithms, methods, and approaches for handling large amounts of data for various purposes and for problem solving. Data mining has become an integral part of many application domains such as data warehousing, predictive analytics, business intelligence, bioinformatics, and decision support systems. The prime objective of data mining is to handle large-scale data effectively, extract actionable patterns, and gain insightful knowledge. Data mining is part and parcel of the knowledge discovery in databases (KDD) process. Success and improved decision making normally depend on how quickly one can discover insights from data. These insights can drive better actions in operational processes and even predict future behaviour. This paper presents an overview of various algorithms necessary for handling large data sets. These algorithms define various structures and methods implemented to handle big data. The review also discusses the general strengths and limitations of these algorithms. This paper can serve as a quick guide or an eye-opener for data mining researchers on which algorithm(s) to select and apply in solving the problems they are investigating.

A comparative study of Distributed Large Scale Data Mining Algorithms

BSSS Journal of Computer ◽

10.51767/jc1102 ◽

2020 ◽

Author(s):

Isha Sood ◽

Varsha Sharma

Keyword(s):

Data Mining ◽

Large Scale ◽

Large Data ◽

Data Sets ◽

Data Set ◽

Data Mining Algorithms ◽

Large Scale Data ◽

Mapreduce Model ◽

Mining Algorithms ◽

Scale Data

Essentially, data mining concerns the processing of data and the identification of patterns and trends in that information so that we can make decisions or judgments. Data mining concepts have been in use for years, but with the emergence of big data they have become even more common. In particular, the scalable mining of such large data sets is a difficult problem that has attracted several recent studies. A few of these recent works use the MapReduce methodology to construct data mining models across the data set. In this article, we examine current approaches to large-scale data mining and compare their output to the MapReduce model. Based on our research, a system for data mining that combines MapReduce and sampling is implemented and discussed.
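
As a rough illustration of how a MapReduce-style job and sampling can be combined, the sketch below counts item occurrences with a map step per partition and a reduce step that merges partial counts, then repeats the job on a uniform random sample and scales the result. It is a single-process toy under stated assumptions, not the system described in the abstract: the function names, the partitioning scheme, and the toy records are all hypothetical.

```python
import random
from collections import Counter
from functools import reduce


def map_phase(partition):
    """Map step: count item occurrences within one partition."""
    return Counter(item for record in partition for item in record)


def reduce_phase(left, right):
    """Reduce step: merge two partial counts."""
    return left + right


def mapreduce_count(records, n_partitions=3):
    """Split records into partitions, map each one, then reduce the partial counts."""
    partitions = [records[i::n_partitions] for i in range(n_partitions)]
    return reduce(reduce_phase, map(map_phase, partitions), Counter())


def sampled_count(records, fraction, seed=0):
    """Run the same job on a uniform random sample and scale counts to estimate totals."""
    rng = random.Random(seed)
    k = max(1, int(len(records) * fraction))
    sample = rng.sample(records, k)
    scale = len(records) / k
    return {item: count * scale for item, count in mapreduce_count(sample).items()}


# Hypothetical toy records; each record is a list of items.
records = [["a", "b"], ["a"], ["b", "c"], ["a", "c"], ["a", "b"], ["c"]]
exact = mapreduce_count(records)
estimate = sampled_count(records, fraction=0.5)
```

Here `exact` holds the true counts, while `estimate` trades accuracy for less work; with `fraction=1.0` the two agree.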

Data mining fool’s gold

Journal of Information Technology ◽

10.1177/0268396220915600 ◽

2020 ◽

Vol 35 (3) ◽

pp. 182-194

Author(s):

Gary Smith

Keyword(s):

Data Mining ◽

Scientific Method ◽

Large Data ◽

Learning Systems ◽

Large Data Sets ◽

Data Sets ◽

Computer Algorithms ◽

Data Mining Algorithms ◽

Rigorous Testing ◽

Mining Algorithms

The scientific method is based on the rigorous testing of falsifiable conjectures. Data mining, in contrast, puts data before theory by searching for statistical patterns without being constrained by prespecified hypotheses. Artificial intelligence and machine learning systems, for example, often rely on data-mining algorithms to construct models with little or no human guidance. However, a plethora of patterns are inevitable in large data sets, and computer algorithms have no effective way of assessing whether the patterns they unearth are truly useful or meaningless coincidences. While data mining sometimes discovers useful relationships, the data deluge has caused the number of possible patterns that can be discovered relative to the number that are genuinely useful to grow exponentially—which makes it increasingly likely that what data mining unearths is likely to be fool’s gold.
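
One common way to make this argument concrete is the arithmetic of multiple testing. The sketch below assumes, purely for illustration, that each candidate pattern is checked with an independent significance test at level `alpha` and that none of the patterns is real; neither assumption comes from the abstract, and the function names are hypothetical.

```python
def expected_false_positives(n_tests, alpha):
    """Expected number of spurious 'discoveries' when every null hypothesis is true."""
    return n_tests * alpha


def prob_at_least_one_false_positive(n_tests, alpha):
    """Chance of at least one spurious discovery, assuming independent tests."""
    return 1.0 - (1.0 - alpha) ** n_tests


# With 1,000 candidate patterns tested at the 5% level, dozens of
# coincidental "patterns" are expected even in pure noise.
spurious = expected_false_positives(1000, 0.05)
chance = prob_at_least_one_false_positive(100, 0.05)
```

Under these assumptions the number of coincidental patterns grows linearly with the number of patterns examined, which is the mechanism behind the "fool's gold" warning.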

Application of Data Mining Method Using Association Rules Apriori To Shopping Cart Analysis On Sale Transactions (Case Study Alfamidi Burnt Stone)

Journal Of Computer Networks, Architecture and High Performance Computing ◽

10.47709/cnapc.v2i2.425 ◽

2020 ◽

Vol 2 (2) ◽

pp. 222-226

Author(s):

Vanessa Siregar ◽

Paska Marto Hasugian

Keyword(s):

Data Mining ◽

Knowledge Discovery In Databases ◽

Data Sets ◽

Cart Analysis ◽

Data Mining Algorithms ◽

Use Of Data ◽

Large Size ◽

Using Data ◽

Mining Algorithms

Data mining is often called knowledge discovery in databases (KDD), i.e., activities that include collecting and using historical data to find regularities, patterns, or relationships in large data sets. A company may be interested in knowing whether some groups of items are consistently purchased together. This study analyzes sales transaction data for skin care and hair care products at Alfamidi Burnt Stone using the Apriori data mining algorithm; the highest support value is 8% and the highest confidence value is 5%.
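
The support and confidence values reported here are the two standard measures for an association rule. A minimal sketch of how they are computed follows; the transactions and item names are invented for illustration and are not the study's data.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Support of (antecedent and consequent) divided by support of antecedent."""
    joint = support(set(antecedent) | set(consequent), transactions)
    base = support(antecedent, transactions)
    return joint / base if base > 0 else 0.0


# Hypothetical shopping baskets.
transactions = [
    {"shampoo", "conditioner"},
    {"shampoo", "soap"},
    {"shampoo", "conditioner", "lotion"},
    {"lotion"},
]

rule_support = support({"shampoo", "conditioner"}, transactions)
rule_confidence = confidence({"shampoo"}, {"conditioner"}, transactions)
```

In this toy data the rule "shampoo implies conditioner" appears in half of all baskets (support) and in two of the three baskets that contain shampoo (confidence).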

Cloud for Distributed Data Analysis Based on the Actor Model

Scientific Programming ◽

10.1155/2016/1050293 ◽

2016 ◽

Vol 2016 ◽

pp. 1-11 ◽

Cited By ~ 2

Author(s):

Ivan Kholod ◽

Ilya Petukhov ◽

Andrey Shorov

Keyword(s):

Data Mining ◽

Data Analysis ◽

Data Sets ◽

Distributed Data ◽

Actor Model ◽

Confidential Information ◽

Data Mining Algorithms ◽

Functional Blocks ◽

Mining Algorithms ◽

Distributed Data Analysis

This paper describes the construction of a Cloud for Distributed Data Analysis (CDDA) based on the actor model. The design maps data mining algorithms onto decomposed functional blocks, which are assigned to actors. Using actors allows users to move the computation closer to the stored data. The process does not require loading data sets into the cloud and allows users to analyze confidential information locally. Experimental results show that the proposed approach is more efficient than established solutions.

Introduction to Fuzzy Data Mining Methods

Handbook of Research on Fuzzy Information Processing in Databases ◽

10.4018/978-1-59904-853-6.ch003 ◽

2011 ◽

pp. 55-95 ◽

Cited By ~ 12

Author(s):

Balazs Feil ◽

Janos Abonyi

Keyword(s):

Data Mining ◽

Model Performance ◽

Fuzzy Rule ◽

Classification Problem ◽

Data Sets ◽

Rule Based ◽

Data Mining Algorithms ◽

Rule Bases ◽

Mining Algorithms ◽

Extraction Model

This chapter aims to give a comprehensive view of the links between fuzzy logic and data mining. It will be shown that knowledge extracted from simple data sets or huge databases can be represented by fuzzy rule-based expert systems. It is highlighted that both model performance and interpretability of the mined fuzzy models are of major importance, and effort is required to keep the resulting rule bases small and comprehensible. Therefore, in recent years, soft computing based data mining algorithms have been developed for feature selection, feature extraction, model optimization, and model reduction (rule base simplification). Application of these techniques is illustrated using the wine data classification problem. The results illustrate that fuzzy tools can be applied in a synergistic manner through the nine steps of knowledge discovery.

Automated Phenotyping Tool for Identifying Developmental Language Disorder Cases in Health Systems Data (APT-DLD): A New Research Algorithm for Deployment in Large-Scale Electronic Health Record Systems

Journal of Speech Language and Hearing Research ◽

10.1044/2020_jslhr-19-00397 ◽

2020 ◽

Vol 63 (9) ◽

pp. 3019-3035

Author(s):

Courtney E. Walters ◽

Rachana Nitin ◽

Katherine Margulis ◽

Olivia Boorom ◽

Daniel E. Gustavson ◽

...

Keyword(s):

Data Mining ◽

Large Scale ◽

Language Disorder ◽

Developmental Language Disorder ◽

Replication Sample ◽

Data Mining Algorithms ◽

Discovery Sample ◽

Automated Phenotyping ◽

Mining Algorithms

Purpose: Data mining algorithms using electronic health records (EHRs) are useful in large-scale population-wide studies to classify etiology and comorbidities (Casey et al., 2016). Here, we apply this approach to developmental language disorder (DLD), a prevalent communication disorder whose risk factors and epidemiology remain largely undiscovered. Method: We first created a reliable system for manually identifying DLD in EHRs based on speech-language pathologist (SLP) diagnostic expertise. We then developed and validated an automated algorithmic procedure, called Automated Phenotyping Tool for identifying DLD cases in health systems data (APT-DLD), that classifies a DLD status for patients within EHRs on the basis of ICD (International Statistical Classification of Diseases and Related Health Problems) codes. APT-DLD was validated in a discovery sample (N = 973) using expert SLP manual phenotype coding as a gold-standard comparison and then applied and further validated in a replication sample of N = 13,652 EHRs. Results: In the discovery sample, the APT-DLD algorithm correctly classified 98% (concordance) of DLD cases in concordance with manually coded records in the training set, indicating that APT-DLD successfully mimics a comprehensive chart review. The output of APT-DLD was also validated in relation to independently conducted SLP clinician coding in a subset of records, with a positive predictive value of 95% of cases correctly classified as DLD. We also applied APT-DLD to the replication sample, where it achieved a positive predictive value of 90% in relation to SLP clinician classification of DLD. Conclusions: APT-DLD is a reliable, valid, and scalable tool for identifying DLD cohorts in EHRs. This new method has promising public health implications for future large-scale epidemiological investigations of DLD and may inform EHR data mining algorithms for other communication disorders. Supplemental Material: https://doi.org/10.23641/asha.12753578
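
For readers unfamiliar with the metric, positive predictive value is the share of algorithm-flagged cases that the reference standard confirms. The sketch below shows the calculation; the counts are hypothetical and are not taken from the study, and the function name is illustrative.

```python
def positive_predictive_value(true_positives, false_positives):
    """Share of flagged cases that the reference standard confirms."""
    flagged = true_positives + false_positives
    if flagged == 0:
        raise ValueError("no cases were flagged, so PPV is undefined")
    return true_positives / flagged


# Hypothetical counts: 95 flagged records confirmed by clinician review,
# 5 flagged records not confirmed.
ppv = positive_predictive_value(true_positives=95, false_positives=5)
```

Note that PPV says nothing about cases the algorithm missed; that would require a sensitivity estimate as well.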

A Comparison Study of Data Mining Algorithms for blood Cancer Prediction

passer ◽

10.24271/psr.29 ◽

2019 ◽

Vol 3 (1) ◽

pp. 174-179

Author(s):

Noor Bahjat ◽

Snwr Jamak

Keyword(s):

Data Mining ◽

Machine Learning Algorithms ◽

Common Disease ◽

Data Sets ◽

Blood Cancer ◽

Cancer Data ◽

Data Mining Algorithms ◽

Detection And Diagnosis ◽

Mining Algorithms ◽

Early Detection And Diagnosis

Cancer is a common disease that threatens the lives of one in every three people. This dangerous disease urgently requires early detection and diagnosis. Recent progress in data mining methods, such as classification, has demonstrated the need to apply machine learning algorithms to large datasets. This paper mainly aims to utilise data mining techniques to classify cancer data sets into blood cancer and non-blood cancer based on pre-defined information and post-defined information obtained after blood tests and CT scan tests. This research was conducted using the WEKA data mining tool with 10-fold cross-validation to evaluate and compare different classification algorithms, extract meaningful information from the dataset, and accurately identify the most suitable and predictive model. This paper shows that the most suitable classifier, with the best ability to predict on the cancer dataset, is the Multilayer Perceptron, with an accuracy of 99.3967%.
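
The 10-fold cross-validation procedure mentioned above can be sketched as follows. This is a plain, unshuffled split written for illustration; it is not WEKA's implementation, and `k_fold_indices`, `cross_validate`, and the `evaluate` callback are hypothetical names.

```python
def k_fold_indices(n_samples, k=10):
    """Yield (train_indices, test_indices) for each of k contiguous folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    # Spread the remainder so fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


def cross_validate(n_samples, evaluate, k=10):
    """Average the score returned by evaluate(train, test) over all k folds."""
    scores = [evaluate(train, test) for train, test in k_fold_indices(n_samples, k)]
    return sum(scores) / len(scores)


folds = list(k_fold_indices(25, k=10))
```

Each sample is used for testing exactly once and for training in the other nine folds; in practice the data would usually be shuffled (and often stratified by class) before splitting.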

Migrating From Data Mining to Big Data Mining

International Journal of Engineering & Technology ◽

10.14419/ijet.v7i3.4.14667 ◽

2018 ◽

Vol 7 (3.4) ◽

pp. 13

Author(s):

Gourav Bathla ◽

Himanshu Aggarwal ◽

Rinkle Rani

Keyword(s):

Data Mining ◽

Big Data ◽

Response Time ◽

Large Scale ◽

Naive Bayes ◽

Naïve Bayes ◽

Data Mining Algorithm ◽

Big Data Mining ◽

Data Mining Algorithms ◽

Mining Algorithms

Data mining is one of the most researched fields in computer science. Many studies have been carried out to extract and analyse important information from raw data. Traditional data mining algorithms such as classification, clustering, and statistical analysis can process small-scale data with great efficiency and accuracy. Social networking interactions, business transactions, and other communications result in Big data: data at a scale beyond the capability of traditional data mining techniques. It is observed that traditional data mining algorithms are not capable of storing and processing large-scale data, and where some algorithms are capable, the response time is very high. Big data contain hidden information which, if analysed in an intelligent manner, can be highly beneficial for business organizations. In this paper, we analyse the advancement from traditional data mining algorithms to Big data mining algorithms. Applications of traditional data mining algorithms can be incorporated straightforwardly into Big data mining algorithms. Several studies have compared traditional data mining with Big data mining, but very few have analysed the most important algorithms within one research work, which is the core motive of our paper. Readers can easily observe the differences between these algorithms, with their pros and cons. Mathematical concepts are applied in data mining algorithms: means and Euclidean distance calculation in K-means, vectors and margins in SVM, and Bayes' theorem and conditional probability in the Naïve Bayes algorithm are real examples. Classification and clustering are the most important applications of data mining. In this paper, the K-means, SVM, and Naïve Bayes algorithms are analysed in detail to observe accuracy and response time from both conceptual and empirical perspectives. Big data technologies such as Hadoop and MapReduce are used for implementing Big data mining algorithms. Performance evaluation metrics such as speedup, scaleup, and response time are used to compare traditional mining with Big data mining.
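
To show where the means and Euclidean distances enter K-means, here is a minimal single-machine sketch of the standard assign-then-update loop. It is not the paper's Hadoop or MapReduce implementation; the function names, the fixed initial centroids, and the toy points are assumptions made for illustration.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def assign(points, centroids):
    """Label each point with the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda c: euclidean(p, centroids[c]))
        for p in points
    ]


def update(points, labels, centroids):
    """Move each centroid to the mean of its members; keep it if it has none."""
    new_centroids = []
    for c, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == c]
        if not members:
            new_centroids.append(old)
            continue
        new_centroids.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
    return new_centroids


def kmeans(points, centroids, max_iter=100):
    """Alternate assignment and update until the centroids stop moving."""
    centroids = [tuple(c) for c in centroids]
    for _ in range(max_iter):
        labels = assign(points, centroids)
        updated = update(points, labels, centroids)
        if updated == centroids:
            break
        centroids = updated
    return centroids, labels


# Hypothetical toy data: two well-separated groups of points.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
centroids, labels = kmeans(points, [(0.0, 0.0), (10.0, 0.0)])
```

The assignment step is independent for every point, which is why it is typically the part that gets distributed when K-means is ported to a MapReduce-style framework.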

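Use of Association Rule Data Mining to Determine Study Duration Patterns of F-MIPA UNSRAT Students (Penggunaan Association Rule Data Mining Untuk Menentukan Pola Lama Studi Mahasiswa F-MIPA UNSRAT)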

d'CARTESIAN ◽

10.35799/dc.3.1.2014.3777 ◽

2014 ◽

Vol 3 (1) ◽

pp. 1

Author(s):

M. Zainal Mahmudin ◽

Altien Rindengan ◽

Winsy Weku

Keyword(s):

Data Mining ◽

Association Rule ◽

Large Data ◽

Apriori Algorithm ◽

Data Mining Algorithms ◽

Adequate Information ◽

Processing Data ◽

Rule Method ◽

Mining Algorithms ◽

Confidence Value

Abstract: The need for information is very high, but it is sometimes not matched by the provision of adequate information, so that information must be mined again from large data. Using the association rule technique, we can obtain information from large data such as university data. The purpose of this research is to determine study duration patterns of students in F-MIPA UNSRAT using the association rule method of data mining, and to compare the Apriori algorithm with a hash-based algorithm. The data used are the student master data of F-MIPA UNSRAT, processed by the association rule method with the Apriori algorithm and a hash-based algorithm using a minimum support of 1% and a minimum confidence of 1%. Processing the data with the Apriori algorithm gave the same result as processing with the hash-based algorithm: 49 combinations of 2-itemsets. One of the patterns formed is that 7.5% of graduates from the mathematics major studied for more than 5 years, with a confidence value of 38.5%. Keywords: Apriori algorithm, hash-based algorithm, association rule, data mining.
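
The 2-itemset mining step can be sketched with the Apriori pruning idea: only items that are frequent on their own can appear in a frequent pair. The code below is an illustrative sketch, not the authors' implementation, and it does not cover the hash-based variant; the attribute strings and student records are invented.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(transactions, min_support):
    """Return {pair: support} for every 2-itemset meeting min_support."""
    n = len(transactions)
    # Pass 1: count single items and keep only the frequent ones (Apriori pruning).
    item_counts = Counter(item for t in transactions for item in set(t))
    frequent_items = {i for i, c in item_counts.items() if c / n >= min_support}
    # Pass 2: count candidate pairs built only from frequent items.
    pair_counts = Counter()
    for t in transactions:
        kept = sorted(set(t) & frequent_items)
        pair_counts.update(combinations(kept, 2))
    return {pair: c / n for pair, c in pair_counts.items() if c / n >= min_support}


# Hypothetical student records encoded as attribute=value items.
students = [
    {"major=math", "duration>5y"},
    {"major=math", "duration<=5y"},
    {"major=physics", "duration<=5y"},
    {"major=math", "duration>5y"},
]

pairs = frequent_pairs(students, min_support=0.5)
```

With a minimum support of 50%, only the pair combining the mathematics major with a study duration over five years survives in this toy data.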

A Survey on Evolutionary Instance Selection and Generation

International Journal of Applied Metaheuristic Computing ◽

10.4018/jamc.2010102604 ◽

2010 ◽

Vol 1 (1) ◽

pp. 60-92 ◽

Cited By ~ 38

Author(s):

Joaquín Derrac ◽

Salvador García ◽

Francisco Herrera

Keyword(s):

Data Mining ◽

Evolutionary Algorithms ◽

Nearest Neighbor ◽

Instance Selection ◽

Data Sets ◽

Generation Process ◽

Data Mining Algorithms ◽

Instance Generation ◽

Nearest Neighbor Rule ◽

Mining Algorithms

The use of Evolutionary Algorithms to perform data reduction tasks has become an effective approach to improve the performance of data mining algorithms. Many proposals in the literature have shown that Evolutionary Algorithms obtain excellent results in their application as Instance Selection and Instance Generation procedures. The purpose of this paper is to present a survey on the application of Evolutionary Algorithms to the Instance Selection and Generation processes. It will cover approaches applied to the enhancement of the nearest neighbor rule, as well as other approaches focused on the improvement of the models extracted by some well-known data mining algorithms. Furthermore, some proposals developed to tackle two emerging problems in data mining, Scaling Up and Imbalanced Data Sets, are also reviewed.
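
To make the instance selection idea concrete, the sketch below implements the nearest neighbor rule and an accuracy function that scores a candidate subset of training instances. An evolutionary algorithm would search over such subsets using a score of this kind as part of its fitness; the search itself is not shown, and the data, the subset, and the function names are hypothetical.

```python
import math


def nearest_neighbor_predict(query, instances):
    """Return the label of the stored instance closest to query (1-NN rule)."""
    _, label = min(instances, key=lambda inst: math.dist(query, inst[0]))
    return label


def accuracy(train, test):
    """Share of test points that 1-NN over train labels correctly."""
    correct = sum(
        1 for point, label in test if nearest_neighbor_predict(point, train) == label
    )
    return correct / len(test)


# Hypothetical data: (point, label) pairs in two well-separated classes.
full_set = [
    ((0, 0), "a"), ((0, 1), "a"), ((1, 0), "a"),
    ((5, 5), "b"), ((5, 6), "b"), ((6, 5), "b"),
]
selected = [((0, 0), "a"), ((5, 5), "b")]  # a hand-picked reduced subset
test_set = [((1, 1), "a"), ((4, 4), "b")]

full_accuracy = accuracy(full_set, test_set)
reduced_accuracy = accuracy(selected, test_set)
```

In this toy case two retained instances classify the test points as well as all six, which is the kind of reduction instance selection aims for on real data.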
