An Improved K-Means Algorithm and its Application in Customer Classification of Network Enterprises

Applied Mechanics and Materials ◽

10.4028/www.scientific.net/amm.543-547.2124 ◽

2014 ◽

Vol 543-547 ◽

pp. 2124-2127

Author(s):

Feng Lan Luo

Keyword(s):

High Efficiency ◽

Large Data ◽

Research Field ◽

Data Sets ◽

Seed Selection ◽

Speed Up ◽

Customer Classification ◽

The Stability ◽

Improved Model

The K-means algorithm clusters large data sets efficiently, which makes it attractive for data mining, but its calculation instability limits its application, so intelligent optimization of K-means has become an active research field. First, the calculation instability of the original K-means algorithm is analyzed in detail. Second, the cluster seed selection method and the calculation flow of K-means are redesigned to speed up the calculation and enhance the stability of the improved model. Third, the improved algorithm is implemented and evaluated in a customer classification task; the results show that it achieves better classification accuracy and calculation stability and can be used in practice for customer classification in network trade enterprises.
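The abstract does not say which seed selection method the paper uses, so the following is only a minimal sketch of the general idea: replacing random seeds with a deterministic rule removes one source of run-to-run instability. The farthest-point rule and the function names `farthest_point_seeds` and `kmeans` are assumptions made for illustration, not the author's method.

```python
import math


def farthest_point_seeds(points, k):
    """Pick k seeds deterministically (an assumed stand-in rule).

    Start from the point closest to the data mean, then repeatedly add
    the point whose distance to its nearest chosen seed is largest.
    """
    dim = len(points[0])
    mean = [sum(p[d] for p in points) / len(points) for d in range(dim)]
    seeds = [min(points, key=lambda p: math.dist(p, mean))]
    while len(seeds) < k:
        seeds.append(
            max(points, key=lambda p: min(math.dist(p, s) for s in seeds))
        )
    return seeds


def kmeans(points, k, max_iter=100):
    """Standard Lloyd iterations started from the deterministic seeds."""
    centers = [list(s) for s in farthest_point_seeds(points, k)]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centers[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the run has converged
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                dim = len(members[0])
                centers[j] = [
                    sum(m[d] for m in members) / len(members)
                    for d in range(dim)
                ]
    return centers, labels
```

Because nothing in the sketch is random, two runs on the same data return the same clusters, which is the kind of stability the abstract refers to.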


Detection and Classification of Anomalies in Large Data Sets on the Basis of Information Granules

IEEE Transactions on Fuzzy Systems ◽

10.1109/tfuzz.2021.3076265 ◽

2021 ◽

pp. 1-1

Author(s):

Adam Kiersztyn ◽

Paweł Karczmarek ◽

Krystyna Kiersztyn ◽

Witold Pedrycz

Keyword(s):

Large Data ◽

Large Data Sets ◽

Data Sets ◽

Information Granules


An Application and Research of Gray-Relation for Color Classification in Dyeing Textile

Advanced Materials Research ◽

10.4028/www.scientific.net/amr.694-697.2881 ◽

2013 ◽

Vol 694-697 ◽

pp. 2881-2885

Author(s):

Hai Yan Wang ◽

Jian Xin Zhang

Keyword(s):

System Analysis ◽

High Efficiency ◽

Information Management System ◽

Space Distribution ◽

Color Classification ◽

Uniform Sampling ◽

Forecast Precision ◽

Internal Instability ◽

The Stability

The information management system for dyed textiles is the basis for accurate color classification, and machine learning methods have become a popular area of research for color classification. Traditional classification methods are highly efficient and very simple, but they depend on the distribution of the sample space. If the sample data properties are not independent, forecast precision is badly affected and internal instability appears. An application of Gray-Relation to dyeing textile color classification has been designed, which offsets the shortcomings of mathematical statistics methods in system analysis. It is applicable regardless of variation in sample size, and its quantitative results agree with qualitative analysis. On the basis of theoretical analysis, dyeing textile color classification was conducted under random sampling, uniform sampling, and stratified sampling. The experimental results show that, with Gray-Relation, dyeing textile color classification does not depend on the sample space distribution, and the stability of classification increases.
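The abstract gives no formulas, so the following is a minimal sketch of how a gray relational classifier is commonly built, using the standard gray relational coefficient with distinguishing coefficient `rho = 0.5`. The three-component color features, the class prototypes, and the names `grey_relational_grades` and `classify_color` are hypothetical and are not taken from the paper.

```python
def grey_relational_grades(reference, candidates, rho=0.5):
    """Gray relational grade of each candidate sequence against a reference.

    Uses the standard coefficient
        (d_min + rho * d_max) / (d + rho * d_max)
    averaged over all components, where d is the absolute difference and
    d_min, d_max are taken over all candidates and all components.
    """
    deltas = [
        [abs(r - c) for r, c in zip(reference, cand)] for cand in candidates
    ]
    flat = [d for row in deltas for d in row]
    d_min, d_max = min(flat), max(flat)
    if d_max == 0:
        # Every candidate equals the reference exactly.
        return [1.0] * len(candidates)
    return [
        sum((d_min + rho * d_max) / (d + rho * d_max) for d in row) / len(row)
        for row in deltas
    ]


def classify_color(sample, class_references):
    """Assign the sample to the class prototype with the highest grade."""
    names = list(class_references)
    grades = grey_relational_grades(
        sample, [class_references[n] for n in names]
    )
    best = max(range(len(names)), key=lambda i: grades[i])
    return names[best]
```

The grade depends only on differences between the sample and the prototypes, not on an assumed probability distribution of the samples, which is the property the abstract emphasises.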


The classification of imbalanced large data sets based on MapReduce and ensemble of ELM classifiers

International Journal of Machine Learning and Cybernetics ◽

10.1007/s13042-015-0478-7 ◽

2015 ◽

Vol 8 (3) ◽

pp. 1009-1017 ◽

Cited By ~ 54

Author(s):

Junhai Zhai ◽

Sufang Zhang ◽

Chenxi Wang

Keyword(s):

Large Data ◽

Large Data Sets ◽

Data Sets


Distributed training and scalability for the particle clustering method UCluster

EPJ Web of Conferences ◽

10.1051/epjconf/202125102054 ◽

2021 ◽

Vol 251 ◽

pp. 02054

Author(s):

Olga Sunneborn Gudnadottir ◽

Daniel Gedon ◽

Colin Desmarais ◽

Karl Bengtsson Bernander ◽

Raazesh Sainudiin ◽

...

Keyword(s):

Particle Physics ◽

Hadron Collider ◽

Large Data ◽

Large Data Sets ◽

Data Sets ◽

Training Time ◽

Distributed Training ◽

Machine Learning Methods ◽

Multi Class Classification

In recent years, machine-learning methods have become increasingly important for the experiments at the Large Hadron Collider (LHC). They are utilised in everything from trigger systems to reconstruction and data analysis. The recent UCluster method is a general model providing unsupervised clustering of particle physics data, which can be easily modified to provide solutions for a variety of different decision problems. In the current paper, we improve on the UCluster method by adding the option of training the model in a scalable and distributed fashion, thereby extending its utility to learn from arbitrarily large data sets. UCluster combines a graph-based neural network called ABCnet with a clustering step, using a combined loss function in the training phase. The original code is publicly available in TensorFlow v1.14 and has previously been trained on a single GPU. It shows a clustering accuracy of 81% when applied to the problem of multi-class classification of simulated jet events. Our implementation adds the distributed training functionality by utilising the Horovod distributed training framework, which necessitated a migration of the code to TensorFlow v2. Together with using parquet files for splitting data up between different compute nodes, the distributed training makes the model scalable to any amount of input data, something that will be essential for use with real LHC data sets. We find that the model is well suited for distributed training, with the training time decreasing in direct relation to the number of GPUs used. However, further improvements through a more exhaustive and possibly distributed hyper-parameter search are required in order to achieve the reported accuracy of the original UCluster method.


Flexible MapReduce Workflows for Cloud Data Analytics

International Journal of Grid and High Performance Computing ◽

10.4018/ijghpc.2013100104 ◽

2013 ◽

Vol 5 (4) ◽

pp. 48-64 ◽

Cited By ~ 1

Author(s):

Carlos Goncalves ◽

Luis Assuncao ◽

Jose C. Cunha

Keyword(s):

Text Mining ◽

Data Analytics ◽

Large Data ◽

Large Data Sets ◽

Data Sets ◽

Tuple Space ◽

Cloud Data ◽

Intermediate Data ◽

Speed Up ◽

Mapreduce Model

Data analytics applications handle large data sets subject to multiple processing phases, some of which can execute in parallel on clusters, grids or clouds. Such applications can benefit from using the MapReduce model, which only requires the end-user to define the application algorithms for input data processing and the map and reduce functions, but this poses a need to install and configure specific frameworks such as Apache Hadoop or Elastic MapReduce in Amazon Cloud. In order to provide more flexibility in defining and adjusting the application configurations, as well as in the specification of the composition of the application phases and their orchestration, the authors describe an approach for supporting MapReduce stages as sub-workflows in the AWARD framework (Autonomic Workflow Activities Reconfigurable and Dynamic). The authors discuss how a text mining application is represented as a complex workflow with multiple phases, where individual workflow nodes support MapReduce computations. Access to intermediate data produced during the MapReduce computations is supported by a data sharing abstraction. The authors describe two implementations of this abstraction, one based on a shared tuple space and another based on an in-memory distributed key/value store. The authors describe the implementation of the framework, a set of developed tools, and their experimentation with the execution of the text mining algorithm over multiple Amazon EC2 (Elastic Compute Cloud) instances, and report on the speed-up and size-up results obtained with up to 20 EC2 instances and for different corpus sizes, up to 97 million words.
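As a rough illustration of the map and reduce functions the end-user defines, here is a single-process word count sketch. It uses none of the Hadoop, Elastic MapReduce, or AWARD APIs; the function names are hypothetical, and the in-memory dictionary built by `shuffle` merely stands in for the intermediate data that the paper shares through a tuple space or key/value store.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group intermediate values by key (in-memory stand-in for shared storage)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: sum the counts collected for one word."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle, and reduce in sequence over all documents."""
    intermediate = (pair for doc in documents for pair in map_phase(doc))
    grouped = shuffle(intermediate)
    return dict(reduce_phase(key, values) for key, values in grouped.items())
```

In a real deployment the map calls and the reduce calls would each run in parallel on separate nodes; only the grouping step needs access to all intermediate pairs.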


Fetal cardiotocography monitoring using Legendre neural networks

Biomedical Engineering / Biomedizinische Technik ◽

10.1515/bmt-2018-0074 ◽

2019 ◽

Vol 64 (6) ◽

pp. 669-675 ◽

Cited By ~ 1

Author(s):

Abdulaziz Alsayyari

Keyword(s):

Neural Network ◽

Neural Networks ◽

Uterine Contraction ◽

High Efficiency ◽

Fetal Monitoring ◽

Data Sets ◽

Legendre Series ◽

Electronic Fetal Monitoring ◽

A New Technique

A new technique for electronic fetal monitoring (EFM) using an efficient structure of neural networks based on the Legendre series is presented in this paper. Such a structure is achieved by training a Legendre series-based neural network (LNN) to classify the different fetal states based on recorded cardiotocographic (CTG) data sets given by others. These data sets consist of measurements of fetal heart rate (FHR) and uterine contraction (UC). The applied LNN utilizes a Legendre series expansion for the input vectors and, hence, has the capability to produce explicit equations describing multi-input multi-output systems. Simulations of the proposed technique in EFM demonstrate its high efficiency. Training the LNN requires only a few iterations (5–10 epochs). The applied technique makes the classification of the fetal state available through equations combining the trained LNN weights and the current measured CTG record. A comparison of performance between the proposed LNN and other popular neural network techniques such as the Volterra neural network (VNN) in EFM is provided. The comparison shows that the LNN outperforms the VNN, with lower computational requirements and faster convergence to a lower mean square error.
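The abstract does not specify the network layout, so the following is a minimal sketch of one plausible reading: each input is expanded with Legendre polynomials, and the output is a weighted sum of the expanded terms, which is what makes the trained model an explicit equation. The function names, the single shared bias term, and the assumption that inputs are scaled to [-1, 1] are all illustrative choices, not details from the paper.

```python
def legendre_features(x, degree):
    """Return [P_0(x), ..., P_degree(x)] via Bonnet's recurrence:
    (n + 1) * P_{n+1}(x) = (2n + 1) * x * P_n(x) - n * P_{n-1}(x).
    """
    values = [1.0, x]
    for n in range(1, degree):
        values.append(
            ((2 * n + 1) * x * values[n] - n * values[n - 1]) / (n + 1)
        )
    return values[: degree + 1]


def expand_inputs(inputs, degree):
    """Expand each input (assumed scaled to [-1, 1]) into Legendre terms.

    P_0 is constant, so it is kept once as a shared bias feature.
    """
    features = [1.0]
    for x in inputs:
        features.extend(legendre_features(x, degree)[1:])
    return features


def lnn_output(inputs, weights, degree):
    """Explicit model equation: a weighted sum of the expanded features."""
    features = expand_inputs(inputs, degree)
    if len(weights) != len(features):
        raise ValueError("expected one weight per expanded feature")
    return sum(w * f for w, f in zip(weights, features))
```

With two inputs and `degree=3`, the model has seven weights (one bias plus three Legendre terms per input), so a trained set of weights can be written out directly as a formula.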


Toward Psychoinformatics: Computer Science Meets Psychology

Computational and Mathematical Methods in Medicine ◽

10.1155/2016/2983685 ◽

2016 ◽

Vol 2016 ◽

pp. 1-10 ◽

Cited By ~ 40

Author(s):

Christian Montag ◽

Éilish Duke ◽

Alexander Markowetz

Keyword(s):

Computer Science ◽

Online Social Network ◽

Large Data ◽

Research Field ◽

Large Data Sets ◽

Future Research ◽

Data Sets ◽

Psychological Traits ◽

Scientific Methods ◽

Insight Into

The present paper provides insight into an emerging research discipline called Psychoinformatics. In the context of Psychoinformatics, we emphasize the cooperation between the disciplines of psychology and computer science in handling large data sets derived from heavily used devices, such as smartphones or online social network sites, in order to shed light on a large number of psychological traits, including personality and mood. New challenges await psychologists in light of the resulting “Big Data” sets, because classic psychological methods will only in part be able to analyze this data derived from ubiquitous mobile devices, as well as other everyday technologies. As a consequence, psychologists must enrich their scientific methods through the inclusion of methods from informatics. The paper provides a brief review of one area of this research field, dealing mainly with social networks and smartphones. Moreover, we highlight how data derived from Psychoinformatics can be combined in a meaningful way with data from human neuroscience. We close the paper with some observations of areas for future research and problems that require consideration within this new discipline.


Validity of International Classification of Diseases Codes for Identifying Neuro-Ophthalmic Disease in Large Data Sets: A Systematic Review

Journal of Neuro-Ophthalmology ◽

10.1097/wno.0000000000000971 ◽

2020 ◽

Vol 40 (4) ◽

pp. 514-519

Author(s):

Ali G. Hamedani ◽

Lindsey B. De Lott ◽

Tatiana Deveney ◽

Heather E. Moss

Keyword(s):

Systematic Review ◽

Large Data ◽

International Classification Of Diseases ◽

International Classification ◽

Large Data Sets ◽

Data Sets ◽

Classification Of Diseases ◽

Ophthalmic Disease


Quantization based Sequence Generation and Subsequence Pruning for Data Mining Applications

Pattern Discovery Using Sequence Data Mining ◽

10.4018/978-1-61350-056-9.ch006 ◽

2012 ◽

pp. 94-110 ◽

Cited By ~ 1

Author(s):

T. Ravindra Babu ◽

M. Narasimha Murty ◽

S. V. Subrahmanya

Keyword(s):

Data Mining ◽

Large Data ◽

Large Datasets ◽

Superior Performance ◽

Data Sets ◽

Sequence Generation ◽

Data Compaction ◽

Clustering And Classification ◽

Minimal Data

Data mining deals with efficient algorithms for handling large data. When such algorithms are combined with data compaction, they lead to superior performance. Approaches to dealing with large data include working with representatives of the data instead of the entire data. The representatives should preferably be generated with minimal data scans. In the current chapter we discuss lossy and non-lossy data compression methods combined with clustering and classification of large datasets. We demonstrate the working of such schemes on two large data sets.
