Quantization based Sequence Generation and Subsequence Pruning for Data Mining Applications

Data Mining ◽  
2013 ◽  
pp. 734-750
Author(s):  
T. Ravindra Babu ◽  
M. Narasimha Murty ◽  
S. V. Subrahmanya

Data mining deals with efficient algorithms for handling large data. When such algorithms are combined with data compaction, they lead to superior performance. Approaches for dealing with large data include working with representatives of the data instead of the entire data set; these representatives should preferably be generated with a minimal number of data scans. In the current chapter we discuss methods of lossy and lossless (non-lossy) data compression combined with clustering and classification of large datasets, and we demonstrate the working of such schemes on two large data sets.
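As a rough illustration of the idea of replacing a large data set with compact representatives, the sketch below quantizes each class with k-means and classifies new samples against the resulting prototypes. This is a minimal sketch under assumed placeholder data and prototype counts, not the chapter's actual quantization-and-pruning scheme.

```python
# Minimal sketch (not the chapter's algorithm): generate class-wise
# prototypes by vector quantization, then classify against them.
import numpy as np
from sklearn.cluster import KMeans

def build_prototypes(X, y, per_class=16, seed=0):
    """Quantize each class into a small set of representative vectors."""
    protos, labels = [], []
    for c in np.unique(y):
        km = KMeans(n_clusters=per_class, random_state=seed, n_init=10)
        km.fit(X[y == c])
        protos.append(km.cluster_centers_)
        labels.append(np.full(per_class, c))
    return np.vstack(protos), np.concatenate(labels)

def classify(X_new, protos, proto_labels):
    """Nearest-prototype (1-NN against representatives) classification."""
    d = ((X_new[:, None, :] - protos[None, :, :]) ** 2).sum(-1)
    return proto_labels[d.argmin(axis=1)]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(2000, 16))          # placeholder data
    y = (X[:, 0] + X[:, 1] > 0).astype(int)  # placeholder labels
    protos, plabels = build_prototypes(X, y)
    preds = classify(X[:200], protos, plabels)
    print("accuracy on a sample:", (preds == y[:200]).mean())
```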


2020 ◽  
Vol 3 (2) ◽  
Author(s):  
Yoga Religia ◽  
Gatot Tri Pranoto ◽  
Egar Dika Santosa

Most of a bank's wealth is obtained from providing credit loans, so a bank must be able to reduce the risk of non-performing loans. This risk can be minimized by studying patterns in existing lending data, and one technique for doing so is data mining. Data mining makes it possible to find hidden information in large data sets by way of classification. The Random Forest (RF) algorithm is a classification algorithm that can also handle data-imbalance problems. The purpose of this study is to discuss the use of the RF algorithm for the classification of South German Credit data; this research is needed because no previous study has applied the RF algorithm specifically to this data set. Based on the tests performed, the RF classification algorithm performs best on the South German Credit data with a split of 85% training data and 15% testing data, achieving an accuracy of 78.33%.
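A minimal sketch of the setup described above, using scikit-learn's RandomForestClassifier with an 85/15 train/test split. The file name, separator-free CSV layout, and the "kredit" target column are assumptions about the South German Credit data, not details taken from the paper.

```python
# Sketch only: Random Forest on the South German Credit data with an
# 85/15 train/test split, as in the study. File and column names are assumed.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("SouthGermanCredit.csv")   # hypothetical file name
X = df.drop(columns=["kredit"])             # 'kredit' assumed to be the class label
y = df["kredit"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.15, stratify=y, random_state=42)

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, rf.predict(X_test)))
```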


2018 ◽  
Vol 1 (2) ◽  
pp. 83-91
Author(s):  
M. Hasyim Siregar

In today's competitive business environment, companies must continually develop their business in order to survive. This can be achieved by improving product quality, adding product types, and reducing operational costs through analysis of company data. Data mining is a technology that automates the process of finding interesting and meaningful patterns in large data sets, and it supports both human understanding of the discovered patterns and scalable analysis techniques. The Adi Bangunan store sells building materials and household goods and operates like a supermarket, where buyers themselves pick up the goods they intend to purchase. Its data on sales, purchases, and returns are not well organized, so they serve only as an archive for the store and cannot be used to develop a marketing strategy. In this research, data mining is applied using the K-Means process model, which provides a standard procedure for applying data mining to classification tasks in various areas, because the results of this method can be easily understood and interpreted.
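A minimal sketch of how store transaction records could be clustered with K-Means to support a marketing strategy. The CSV name and the two numeric features (quantity sold and total value per item) are assumptions for illustration, not the attributes used in the paper.

```python
# Sketch only: K-Means clustering of store sales records.
# The CSV name and the two numeric features are assumed for illustration.
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

sales = pd.read_csv("sales.csv")                    # hypothetical export of sales data
features = sales[["quantity_sold", "total_value"]]  # assumed numeric attributes
X = StandardScaler().fit_transform(features)

kmeans = KMeans(n_clusters=3, random_state=0, n_init=10)
sales["cluster"] = kmeans.fit_predict(X)

# Inspect cluster centres to distinguish, e.g., fast-moving from slow-moving items.
print(kmeans.cluster_centers_)
print(sales.groupby("cluster").size())
```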


Author(s):  
Adam Kiersztyn ◽  
Paweł Karczmarek ◽  
Krystyna Kiersztyn ◽  
Witold Pedrycz

2021 ◽  
Vol 251 ◽  
pp. 02054
Author(s):  
Olga Sunneborn Gudnadottir ◽  
Daniel Gedon ◽  
Colin Desmarais ◽  
Karl Bengtsson Bernander ◽  
Raazesh Sainudiin ◽  
...  

In recent years, machine-learning methods have become increasingly important for the experiments at the Large Hadron Collider (LHC). They are utilised in everything from trigger systems to reconstruction and data analysis. The recent UCluster method is a general model providing unsupervised clustering of particle physics data that can be easily modified to provide solutions for a variety of different decision problems. In the current paper, we improve on the UCluster method by adding the option of training the model in a scalable and distributed fashion, thereby extending its utility to learn from arbitrarily large data sets. UCluster combines a graph-based neural network called ABCnet with a clustering step, using a combined loss function in the training phase. The original code is publicly available in TensorFlow v1.14 and has previously been trained on a single GPU; it shows a clustering accuracy of 81% when applied to the problem of multi-class classification of simulated jet events. Our implementation adds the distributed training functionality by utilising the Horovod distributed training framework, which necessitated a migration of the code to TensorFlow v2. Together with using parquet files to split the data between different compute nodes, the distributed training makes the model scalable to any amount of input data, something that will be essential for use with real LHC data sets. We find that the model is well suited for distributed training, with the training time decreasing in direct relation to the number of GPUs used. However, a more exhaustive, and possibly distributed, hyper-parameter search is required in order to achieve the reported accuracy of the original UCluster method.
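The distributed-training pattern described above can be sketched with Horovod's Keras API in TensorFlow 2. The model below is a trivial placeholder rather than the actual ABCnet/UCluster architecture, and the parquet shard path, feature layout, and hyperparameters are assumptions; only the Horovod pattern itself (rank-pinned GPUs, scaled learning rate, wrapped optimizer, variable broadcast) follows the framework's standard usage.

```python
# Sketch only: Horovod + TensorFlow 2 distributed training loop.
# The model is a toy placeholder, not ABCnet/UCluster; paths and shapes are assumed.
import tensorflow as tf
import horovod.tensorflow.keras as hvd
import pandas as pd

hvd.init()

# Pin each worker to one GPU, if GPUs are present.
gpus = tf.config.experimental.list_physical_devices("GPU")
if gpus:
    tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], "GPU")

# Each worker reads its own parquet shard (file layout assumed).
df = pd.read_parquet(f"jets_shard_{hvd.rank()}.parquet")
X = df.drop(columns=["label"]).to_numpy()
y = df["label"].to_numpy()

model = tf.keras.Sequential([  # placeholder network
    tf.keras.layers.Dense(64, activation="relu", input_shape=(X.shape[1],)),
    tf.keras.layers.Dense(3, activation="softmax"),
])

# Scale the learning rate with the number of workers and wrap the optimizer.
opt = hvd.DistributedOptimizer(tf.keras.optimizers.Adam(1e-3 * hvd.size()))
model.compile(optimizer=opt, loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Broadcast initial weights from rank 0 so all workers start identically.
callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]
model.fit(X, y, epochs=5, batch_size=256,
          callbacks=callbacks, verbose=1 if hvd.rank() == 0 else 0)
```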


2020 ◽  
Vol 8 (6) ◽  
pp. 4617-4622

Destination image branding is the domain of the tourism industry in which facts and information are collected and evaluated to assess the credibility of a target tourist destination. Collecting and accurately processing this information manually is a complicated and time-consuming task; therefore, in the presented work a data mining model is proposed that collects and evaluates the destination image accurately and, based on this evaluation, makes recommendations about tourist visits. To perform this task, data mining techniques are applied to a text data source. First, the data are extracted from the Google search engine and preprocessed to remove impurities. Next, the data are labeled based on the positive and negative words found in the collected facts. Finally, clustering and classification of the text are performed: the FCM (fuzzy c-means) algorithm is used for clustering and a Bayesian classifier for classification. Based on the final classification of the text data, a decision is made about destination visits.
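The labeling and Bayesian classification steps described above can be sketched with word-count labeling, a TF-IDF representation, and a multinomial naive Bayes classifier. The sentiment word lists and sample texts are invented placeholders, and the fuzzy c-means clustering step is omitted for brevity.

```python
# Sketch only: label scraped destination texts by positive/negative word
# counts, then train a Bayesian (multinomial naive Bayes) text classifier.
# The word lists and sample texts are invented placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

POSITIVE = {"beautiful", "clean", "friendly", "safe"}
NEGATIVE = {"dirty", "crowded", "unsafe", "expensive"}

def label(text):
    words = text.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return "positive" if score >= 0 else "negative"

texts = [
    "beautiful beaches and friendly locals",
    "crowded streets and expensive hotels",
    "clean and safe city centre",
    "dirty and unsafe at night",
]
labels = [label(t) for t in texts]

clf = make_pipeline(TfidfVectorizer(), MultinomialNB())
clf.fit(texts, labels)
print(clf.predict(["friendly and clean resort"]))
```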


2021 ◽  
pp. 1826-1839
Author(s):  
Sandeep Adhikari ◽  
Sunita Chaudhary

The exponential growth in the use of computers over networks, as well as the proliferation of applications that operate on different platforms, has drawn attention to network security. Intrusions exploit security flaws in operating systems that are both technically difficult and costly to fix, and as a result they threaten the credibility, availability, and confidentiality of computer resources worldwide. The Intrusion Detection System (IDS) is critical in detecting network anomalies and attacks. In this paper, data mining principles are combined with an IDS to efficiently and quickly identify important, confidential data of interest to the user. The proposed algorithm addresses four issues: data classification, high levels of human interaction, lack of labeled data, and the effectiveness of distributed denial-of-service attacks. We also develop a decision tree classifier with a variety of parameters. The previous algorithm classified intrusions with up to 90% accuracy and was not appropriate for large data sets; our proposed algorithm is designed to classify large data sets accurately. In addition, we quantify several further decision tree classifier parameters.
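A minimal sketch of a parameterized decision tree classifier over labeled network-traffic records. The file name, label column, and hyperparameter values are assumptions for illustration, not the paper's tuned settings or data set.

```python
# Sketch only: decision tree classification of labeled network-traffic data.
# File name, columns, and hyperparameter values are assumed for illustration.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report

traffic = pd.read_csv("network_traffic.csv")   # hypothetical labeled capture
X = traffic.drop(columns=["label"])            # 'label' = normal / attack type
y = traffic["label"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=1)

tree = DecisionTreeClassifier(
    criterion="gini",        # split quality measure
    max_depth=12,            # limit depth to keep the tree readable
    min_samples_leaf=20,     # guard against overfitting on large data
    random_state=1,
)
tree.fit(X_train, y_train)
print(classification_report(y_test, tree.predict(X_test)))
```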

