Detection and Classification of Anomalies in Large Data Sets on the Basis of Information Granules

Author(s):
Adam Kiersztyn, Paweł Karczmarek, Krystyna Kiersztyn, Witold Pedrycz
2021, Vol 251, pp. 02054
Author(s):
Olga Sunneborn Gudnadottir, Daniel Gedon, Colin Desmarais, Karl Bengtsson Bernander, Raazesh Sainudiin, ...

In recent years, machine-learning methods have become increasingly important for the experiments at the Large Hadron Collider (LHC). They are utilised in everything from trigger systems to reconstruction and data analysis. The recent UCluster method is a general model providing unsupervised clustering of particle-physics data, which can be easily modified to solve a variety of decision problems. In the current paper, we improve on the UCluster method by adding the option of training the model in a scalable and distributed fashion, thereby extending its utility to learning from arbitrarily large data sets. UCluster combines a graph-based neural network called ABCNet with a clustering step, using a combined loss function in the training phase. The original code is publicly available in TensorFlow v1.14 and has previously been trained on a single GPU. It shows a clustering accuracy of 81% when applied to the problem of multi-class classification of simulated jet events. Our implementation adds distributed-training functionality by utilising the Horovod distributed training framework, which necessitated a migration of the code to TensorFlow v2. Together with the use of Parquet files to split the data between compute nodes, distributed training makes the model scalable to any amount of input data, something that will be essential for use with real LHC data sets. We find that the model is well suited for distributed training, with the training time decreasing in direct relation to the number of GPUs used. However, a more exhaustive, and possibly distributed, hyper-parameter search is required in order to achieve the reported accuracy of the original UCluster method.
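The Parquet-based data splitting described in the abstract can be illustrated with a simple round-robin shard: each worker reads only its own subset of the data files, so no single node needs to hold the full data set. This is a minimal sketch of the idea only; the file names and worker counts are invented, and the actual implementation derives the rank and worker count from Horovod rather than passing them by hand.

```python
# Sketch of file-level data sharding for distributed training.
# Each worker (identified by its rank) reads every num_workers-th file,
# so the full data set is partitioned across the cluster.

def shard_files(files, rank, num_workers):
    """Round-robin assignment of data files to one worker."""
    return [f for i, f in enumerate(files) if i % num_workers == rank]

# Hypothetical Parquet shards of a jet data set.
files = [f"jets_part{i:03d}.parquet" for i in range(10)]

# Worker with rank 1 out of 4 reads files 1, 5 and 9.
print(shard_files(files, rank=1, num_workers=4))
```

In a Horovod setting, `rank` and `num_workers` would come from the framework's rank/size queries, and each worker would feed only its shard into the TensorFlow input pipeline.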


2018, Vol 25 (3), pp. 655-670
Author(s):
Tsung-Wei Ke, Aaron S. Brewster, Stella X. Yu, Daniela Ushizima, Chao Yang, ...

A new tool is introduced for screening macromolecular X-ray crystallography diffraction images produced at an X-ray free-electron laser light source. Based on a data-driven deep-learning approach, the proposed tool runs a convolutional neural network to detect Bragg spots. The automatic image-processing algorithms described here enable the classification of large data sets acquired under realistic conditions, i.e. noisy data with experimental artifacts. Outcomes are compared across different data regimes, including samples from multiple instruments and differing amounts of training data for the neural-network optimization.
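To make the Bragg-spot detection task concrete, here is a classical toy baseline rather than the paper's CNN: flag pixels that exceed an intensity threshold and are strict local maxima in their 3x3 neighbourhood. The image and threshold are invented for illustration, and a real pipeline would operate on far larger, noisier detector frames.

```python
# Toy Bragg-spot finder: a pixel is a "spot" if it is at least `threshold`
# bright and strictly brighter than all of its 3x3 neighbours.
# This is a hand-rolled stand-in for the paper's CNN, not the authors' method.

def find_spots(image, threshold):
    h, w = len(image), len(image[0])
    spots = []
    for y in range(h):
        for x in range(w):
            v = image[y][x]
            if v < threshold:
                continue
            neighbours = [image[j][i]
                          for j in range(max(0, y - 1), min(h, y + 2))
                          for i in range(max(0, x - 1), min(w, x + 2))
                          if (i, j) != (x, y)]
            if all(v > n for n in neighbours):
                spots.append((x, y))
    return spots

# Tiny synthetic frame: one isolated peak and two touching bright pixels
# (the touching pair tie and are rejected by the strict-maximum rule).
image = [[0, 1, 0, 0],
         [1, 9, 1, 0],
         [0, 1, 0, 7],
         [0, 0, 7, 0]]
print(find_spots(image, threshold=5))  # -> [(1, 1)]
```

A learned detector earns its keep precisely where this baseline fails: overlapping spots, streaky artifacts, and background levels that vary across the detector.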


2014, Vol 556-562, pp. 3935-3940
Author(s):
Hao Yan Guo, Yuan Zhi Cheng, Da Zheng Wang, Jia Cheng Xu

In this paper, a complete and improved mathematical framework for the geometric approach based on the scaled convex hull (SCH) is developed to solve SCH-based SVM classification of large data sets. The framework includes two parts: SCH-based geometric algorithms for SVM, and a fast, novel geometric method for model selection in SCH-based SVM. On the basis of this framework, these geometric algorithms are better suited to the classification of large data sets. Results of numerical experiments show that the proposed geometric algorithms reduce kernel calculations and perform well on large medical data sets, such as computer-assisted screening for lung cancer.
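The scaled convex hull construction itself can be sketched briefly: each point of a class is contracted toward the class centroid by a factor lam in (0, 1], shrinking the hull so that overlapping classes can become separable. This illustrates only the geometric construction underlying the approach, not the paper's SVM algorithms or model-selection method; the example points are invented.

```python
# Scaled convex hull (SCH) sketch: contract each point toward the class
# centroid by a factor lam. With lam = 1 the hull is unchanged; smaller
# lam shrinks it, which is what makes non-separable classes separable.

def scaled_points(points, lam):
    d = len(points[0])
    centroid = [sum(p[k] for p in points) / len(points) for k in range(d)]
    return [tuple(centroid[k] + lam * (p[k] - centroid[k]) for k in range(d))
            for p in points]

# A unit-style square with centroid (1, 1), shrunk by half.
square = [(0.0, 0.0), (2.0, 0.0), (2.0, 2.0), (0.0, 2.0)]
print(scaled_points(square, lam=0.5))
```

In the SCH-SVM setting, the nearest points between the two classes' scaled hulls determine the separating hyperplane, so the scaling factor acts much like a regularization parameter.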


2020, Vol 13 (4), pp. 588-594
Author(s):
Saravana Kumar Coimbatore Shanmugam, Santhosh Rajendran, Amudhavalli Padmanabhan, Kalaiarasan Chellan

Background: The growth of data on the internet has raised the priority of accurate data extraction. Accuracy here means the match between what the user requested and what is actually retrieved; the large data sets that must be analyzed make retrieving the required information a challenging task.
Objective: To propose a new algorithm that classifies the category or group to which each training sentence belongs in an improved way compared with traditional methods.
Method: The category of an input sentence is identified by analyzing the noun and verb of each training sentence. NLP is applied to each training sentence, and the group or category classification is performed with the proposed GENI algorithm, so that the classifier is trained efficiently to extract the information the user requested.
Results: The input sentences are transformed into a data table by applying the GENI algorithm for group categorization. Plotting the results in the R tool, the accuracy of the groups extracted by the classifier using the GENI approach is higher than that of Naive Bayes and decision trees.
Conclusion: Extracting user-requested data remains challenging when the user query is complex. Existing techniques rely heavily on fixed attributes, and beyond those fixed attributes it becomes too complex, or impossible, to determine the common group from the base sentence. Existing techniques are better suited to smaller data sets, whereas the proposed GENI algorithm places no such restrictions on the group categorization of larger data sets.
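The noun-based categorization step described in the Method section can be illustrated very roughly. The GENI algorithm itself is not detailed in the abstract, so this toy uses a hand-built noun-to-category lexicon invented for the example; a real system would use an NLP part-of-speech tagger and a learned mapping rather than a fixed dictionary.

```python
# Illustrative sketch only: classify a sentence into a group by looking up
# its nouns in a category lexicon. Both the lexicon and the categories are
# hypothetical; this is not the GENI algorithm from the paper.

NOUN_CATEGORIES = {
    "flight": "travel", "ticket": "travel",
    "invoice": "finance", "payment": "finance",
}

def categorize(sentence):
    """Return the first category triggered by a known noun, else 'unknown'."""
    for word in sentence.lower().split():
        if word in NOUN_CATEGORIES:
            return NOUN_CATEGORIES[word]
    return "unknown"

print(categorize("Book a flight to Paris"))  # -> travel
print(categorize("Send the invoice today"))  # -> finance
```

The abstract's point is that fixed-attribute schemes like this lexicon break down on complex queries and large data sets, which is the gap the proposed algorithm targets.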

