Adaptive Online Sequential ELM for Concept Drift Tackling

2016 ◽  
Vol 2016 ◽  
pp. 1-17 ◽  
Author(s):  
Arif Budiman ◽  
Mohamad Ivan Fanany ◽  
Chan Basaruddin

A machine learning method needs to adapt to changes in the environment over time. Such changes are known as concept drift. In this paper, we propose a concept drift tackling method as an enhancement of the Online Sequential Extreme Learning Machine (OS-ELM) and the Constructive Enhancement OS-ELM (CEOS-ELM) by adding adaptive capability for classification and regression problems. The scheme, named adaptive OS-ELM (AOS-ELM), is a single-classifier scheme that handles real drift, virtual drift, and hybrid drift well. AOS-ELM also works well for sudden drift and recurrent context change types. The scheme is a simple unified method implemented in a few simple lines of code. We evaluated AOS-ELM on regression and classification problems using public concept drift data sets (SEA and STAGGER) and other public data sets such as MNIST, USPS, and IDS. Experiments show that our method gives a higher kappa value than a multiclassifier ELM ensemble. Even though AOS-ELM in practice does not need an increase in hidden nodes, we address some issues related to increasing the number of hidden nodes, such as error conditions and rank values. We propose taking the rank of the pseudoinverse matrix as an indicator parameter to detect an “underfitting” condition.
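As an illustration of the base learner being extended, the sketch below implements the standard OS-ELM update (initial batch plus recursive least-squares updates) together with a rank check on the pseudoinverse matrix in the spirit of the proposed underfitting indicator. The AOS-ELM adaptation logic itself is not reproduced here; class and variable names are illustrative only.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class OSELM:
    """Minimal OS-ELM sketch (standard algorithm); the AOS-ELM drift-handling
    extensions described in the paper are not reproduced here."""

    def __init__(self, n_inputs, n_hidden, n_outputs, rng=None):
        rng = np.random.default_rng(rng)
        self.W = rng.standard_normal((n_inputs, n_hidden))  # random input weights
        self.b = rng.standard_normal(n_hidden)               # random biases
        self.beta = np.zeros((n_hidden, n_outputs))          # output weights
        self.P = None                                         # inverse correlation matrix

    def _hidden(self, X):
        return sigmoid(X @ self.W + self.b)

    def init_fit(self, X0, T0):
        H0 = self._hidden(X0)
        self.P = np.linalg.pinv(H0.T @ H0)
        self.beta = self.P @ H0.T @ T0
        # Hypothetical underfitting check inspired by the paper: a rank-deficient
        # pseudoinverse suggests too few (or redundant) hidden nodes.
        return np.linalg.matrix_rank(self.P)

    def partial_fit(self, X, T):
        H = self._hidden(X)
        K = np.linalg.inv(np.eye(H.shape[0]) + H @ self.P @ H.T)
        self.P = self.P - self.P @ H.T @ K @ H @ self.P
        self.beta = self.beta + self.P @ H.T @ (T - H @ self.beta)

    def predict(self, X):
        return self._hidden(X) @ self.beta
```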

2021 ◽  
Vol 8 (1) ◽  
Author(s):  
Yahya Albalawi ◽  
Jim Buckley ◽  
Nikola S. Nikolov

Abstract: This paper presents a comprehensive evaluation of data pre-processing and word embedding techniques in the context of Arabic document classification in the domain of health-related communication on social media. We evaluate 26 text pre-processing techniques applied to Arabic tweets within the process of training a classifier to identify health-related tweets. For this task we use the (traditional) machine learning classifiers KNN, SVM, Multinomial NB and Logistic Regression. Furthermore, we report experimental results with the deep learning architectures BLSTM and CNN for the same text classification problem. Since word embeddings are more typically used as the input layer in deep networks, in the deep learning experiments we evaluate several state-of-the-art pre-trained word embeddings with the same text pre-processing applied. To achieve these goals, we use two data sets: one for both training and testing, and another for testing the generality of our models only. Our results point to the conclusion that only four out of the 26 pre-processings improve the classification accuracy significantly. For the first data set of Arabic tweets, we found that Mazajak CBOW pre-trained word embeddings as the input to a BLSTM deep network led to the most accurate classifier, with an F1 score of 89.7%. For the second data set, Mazajak Skip-Gram pre-trained word embeddings as the input to a BLSTM led to the most accurate model, with an F1 score of 75.2% and accuracy of 90.7%, compared to an F1 score of 90.8% achieved by Mazajak CBOW for the same architecture but with a lower accuracy of 70.89%. Our results also show that the performance of the best of the traditional classifiers we trained is comparable to that of the deep learning methods on the first data set, but significantly worse on the second data set.
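For readers who want a concrete starting point, a minimal Keras sketch of the BLSTM-over-pretrained-embeddings setup described above follows. The exact hyperparameters, vocabulary handling and the loading of the Mazajak vectors are not given in the abstract, so the layer sizes and the `embedding_matrix` argument here are assumptions.

```python
import tensorflow as tf

def build_blstm_classifier(vocab_size, embed_dim, embedding_matrix):
    """Binary classifier (health-related vs. not) over tokenised tweets.
    embedding_matrix holds pre-trained (e.g. Mazajak CBOW/Skip-Gram) vectors,
    one row per vocabulary entry; hyperparameters below are placeholders."""
    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(
            vocab_size, embed_dim,
            embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix),
            trainable=False),                                  # keep pre-trained vectors fixed
        tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(128)),
        tf.keras.layers.Dropout(0.5),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model
```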


2021 ◽  
Vol ahead-of-print (ahead-of-print) ◽  
Author(s):  
Jiawei Lian ◽  
Junhong He ◽  
Yun Niu ◽  
Tianze Wang

Purpose: Current popular image processing technologies based on convolutional neural networks involve heavy computation, high storage cost and low accuracy for tiny defect detection, which conflicts with the high real-time performance, high accuracy, and limited computing and storage resources required by industrial applications. Therefore, an improved YOLOv4, named YOLOv4-Defect, is proposed to solve the above problems. Design/methodology/approach: On the one hand, this study performs multi-dimensional compression processing on the feature extraction network of YOLOv4 to simplify the model and improves the feature extraction ability of the model through knowledge distillation. On the other hand, a prediction scale with a more detailed receptive field is added to optimize the model structure, which can improve the detection performance for tiny defects. Findings: The effectiveness of the method is verified on the public data sets NEU-CLS and DAGM 2007, and on a steel ingot data set collected in an actual industrial setting. The experimental results demonstrate that the proposed YOLOv4-Defect method can greatly improve recognition efficiency and accuracy and reduce the size and computation cost of the model. Originality/value: This paper proposes an improved YOLOv4, named YOLOv4-Defect, for surface defect detection, which is conducive to application in various industrial scenarios with limited storage and computing resources, and meets the requirements of high real-time performance and precision.
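The abstract mentions knowledge distillation for the compressed backbone but gives no formula; a generic, Hinton-style distillation loss such as the PyTorch sketch below is one common way this step is realised. The temperature and weighting values are assumptions, not the paper's settings.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, targets, T=4.0, alpha=0.5):
    """Generic knowledge-distillation loss: soft targets from the uncompressed
    teacher plus the usual hard-label loss. The paper's exact distillation
    scheme for the compressed YOLOv4 backbone is not specified in the abstract."""
    # Soft-target term, softened by temperature T and rescaled by T^2.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean") * (T * T)
    # Hard-label term on the ground truth.
    hard = F.cross_entropy(student_logits, targets)
    return alpha * soft + (1.0 - alpha) * hard
```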


2011 ◽  
pp. 24-32 ◽  
Author(s):  
Nicoleta Rogovschi ◽  
Mustapha Lebbah ◽  
Younès Bennani

Most traditional clustering algorithms are limited to handling data sets that contain either continuous or categorical variables. However, data sets with mixed types of variables are common in the data mining field. In this paper we introduce a weighted self-organizing map for clustering, analysis and visualization of mixed (continuous/binary) data. The weights and prototypes are learned simultaneously, ensuring an optimized clustering of the data. The higher the weight of a variable, the more the clustering algorithm takes into account the information carried by that variable. The learning of these topological maps is combined with a weighting process over the different variables, computing weights that influence the quality of the clustering. We illustrate the power of this method with data sets taken from a public data set repository: a handwritten digit data set, the Zoo data set and three other mixed data sets. The results show a good quality of the topological ordering and homogeneous clustering.
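As a schematic reading of the per-variable weighting idea on mixed data, the sketch below computes a weighted distance of a sample to a map prototype, combining a weighted squared Euclidean term for the continuous part with a weighted Hamming term for the binary part, and selects the best matching unit. It is not the paper's exact objective function; names and the additive combination are assumptions.

```python
import numpy as np

def weighted_mixed_distance(x_cont, x_bin, proto_cont, proto_bin, w_cont, w_bin):
    """Distance of one mixed (continuous/binary) sample to one map prototype,
    with per-variable weights. A schematic sketch, not the paper's formulation."""
    d_cont = np.sum(w_cont * (x_cont - proto_cont) ** 2)   # weighted Euclidean part
    d_bin = np.sum(w_bin * (x_bin != proto_bin))            # weighted Hamming part
    return d_cont + d_bin

def best_matching_unit(x_cont, x_bin, protos_cont, protos_bin, w_cont, w_bin):
    """Index of the map cell whose prototype is closest to the sample."""
    dists = [weighted_mixed_distance(x_cont, x_bin, pc, pb, w_cont, w_bin)
             for pc, pb in zip(protos_cont, protos_bin)]
    return int(np.argmin(dists))
```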


2019 ◽  
Vol 8 (2S11) ◽  
pp. 3523-3526

This paper describes an efficient algorithm for classification on large data sets. While many classification algorithms exist, they are not suitable for very large or diverse data sets. Various ELM algorithms are available in the literature for working with large data sets. However, the existing algorithms use a fixed activation function, which may lead to deficiencies when working with large data. In this paper, we propose a novel ELM that employs a sigmoid activation function. The experimental evaluations demonstrate that our ELM-S algorithm performs better than ELM, SVM and other state-of-the-art algorithms on large data sets.
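Since the abstract gives no algorithmic detail of ELM-S beyond the use of a sigmoid activation, the sketch below shows the textbook batch ELM with a sigmoid hidden layer: random input weights, a hidden-layer output matrix, and output weights obtained by a pseudoinverse. Sizes and weight ranges are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_elm(X, T, n_hidden, rng=None):
    """Minimal batch ELM with a sigmoid activation; a textbook sketch, not the
    paper's ELM-S implementation."""
    rng = np.random.default_rng(rng)
    W = rng.uniform(-1.0, 1.0, size=(X.shape[1], n_hidden))  # random input weights
    b = rng.uniform(-1.0, 1.0, size=n_hidden)                 # random biases
    H = sigmoid(X @ W + b)                                     # hidden-layer output
    beta = np.linalg.pinv(H) @ T                               # least-squares output weights
    return W, b, beta

def predict_elm(X, W, b, beta):
    return sigmoid(X @ W + b) @ beta
```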


2019 ◽  
Vol 18 ◽  
pp. 117693511989029
Author(s):  
James LT Dalgleish ◽  
Yonghong Wang ◽  
Jack Zhu ◽  
Paul S Meltzer

Motivation: DNA copy number (CN) data are a fast-growing source of information used in basic and translational cancer research. Most CN segmentation data are presented without regard to the relationship between chromosomal regions. We offer both a toolkit to help scientists without programming experience visually explore the CN interactome and a package that constructs CN interactomes from publicly available data sets. Results: The CNVScope visualization, based on a publicly available neuroblastoma CN data set, clearly displays a distinct CN interaction in the region of MYCN, a canonical frequent amplicon target in this cancer. Exploration of the data rapidly identified cis and trans events, including a strong anticorrelation between 11q loss and 17q gain, with the region of 11q loss bounded by the cell cycle regulator CCND1. Availability: The Shiny application is readily available for use at http://cnvscope.nci.nih.gov/ , and the package can be downloaded from CRAN ( https://cran.r-project.org/package=CNVScope ), where help pages and vignettes are located. A newer version is available on the GitHub site ( https://github.com/jamesdalg/CNVScope/ ), which features an animated tutorial. The CNVScope package can be locally installed using instructions on the GitHub site for Windows and Macintosh systems. The CN analysis package also runs on a Linux high-performance computing cluster, with options for multinode and multiprocessor analysis of CN variant data. The Shiny application can be started using a single command (which will automatically install the public data package).
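CNVScope itself is an R package; purely as a conceptual illustration of a copy-number "interactome", the Python sketch below correlates CN values between genomic bins across samples, which is the kind of pairwise relationship (such as the 11q/17q anticorrelation) the visualization exposes. It is not CNVScope's actual construction.

```python
import pandas as pd

def cn_interaction_matrix(segments: pd.DataFrame) -> pd.DataFrame:
    """Toy CN 'interactome': pairwise Pearson correlation between genomic bins.
    Rows of `segments` are samples, columns are genomic bins; corr() then
    correlates every pair of bins across samples."""
    return segments.corr(method="pearson")
```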


Author(s):  
MUSTAPHA LEBBAH ◽  
YOUNÈS BENNANI ◽  
NICOLETA ROGOVSCHI

This paper introduces a probabilistic self-organizing map for topographic clustering, analysis and visualization of multivariate binary data, or of categorical data using binary coding. We propose a probabilistic formalism dedicated to binary data in which each cell is represented by a Bernoulli distribution. Each cell is characterized by a prototype with the same binary coding as used in the data space and by the probability of being different from this prototype. The learning algorithm we propose, the Bernoulli self-organizing map, is an application of the standard EM algorithm. We illustrate the power of this method with six data sets taken from a public data set repository. The results show a good quality of the topological ordering and homogeneous clustering.
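The cell model described above (a binary prototype plus a probability of differing from it) corresponds to a simple Bernoulli likelihood; a minimal sketch of that per-cell log-likelihood, the quantity an EM E-step would use, is given below. The full EM/SOM training loop is not reproduced, and the parameterisation is my reading of the abstract rather than the paper's code.

```python
import numpy as np

def cell_log_likelihood(x, prototype, eps):
    """Log-likelihood of a binary vector x under one map cell, parameterised by
    a binary prototype and the probability eps (assumed 0 < eps < 0.5) that any
    bit differs from it."""
    diff = np.abs(x - prototype)                 # 1 where x differs from the prototype
    return float(np.sum(diff * np.log(eps) + (1.0 - diff) * np.log(1.0 - eps)))
```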


2020 ◽  
Author(s):  
Robert L. Peach ◽  
Alexis Arnaudon ◽  
Julia A. Schmidt ◽  
Henry A. Palasciano ◽  
Nathan R. Bernier ◽  
...  

Abstract: Networks are widely used as mathematical models of complex systems across many scientific disciplines, not only in biology and medicine but also in the social sciences, physics, computing and engineering. Decades of work have produced a vast corpus of research characterising the topological, combinatorial, statistical and spectral properties of graphs. Each graph property can be thought of as a feature that captures important (and sometimes overlapping) characteristics of a network. In the analysis of real-world graphs, it is crucial to systematically integrate a large number of diverse graph features in order to characterise and classify networks, as well as to aid network-based scientific discovery. In this paper, we introduce HCGA, a framework for highly comparative analysis of graph data sets that computes several thousand graph features from any given network. HCGA also offers a suite of statistical learning and data analysis tools for automated identification and selection of important and interpretable features underpinning the characterisation of graph data sets. We show that HCGA outperforms other methodologies on supervised classification tasks on benchmark data sets whilst retaining the interpretability of network features. We also illustrate how HCGA can be used for network-based discovery through two examples where data is naturally represented as graphs: the clustering of a data set of images of neuronal morphologies, and a regression problem to predict charge transfer in organic semiconductors based on their structure. HCGA is an open platform that can be expanded to include further graph properties and statistical learning tools, allowing researchers to leverage the wide breadth of graph-theoretical research to quantitatively analyse and draw insights from network data.
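HCGA computes thousands of features; purely to illustrate the "graph as feature vector" idea, the sketch below extracts a handful of standard NetworkX statistics that could feed any downstream classifier or regressor. The feature choice is illustrative and is not HCGA's feature set.

```python
import networkx as nx
import numpy as np

def basic_graph_features(G: nx.Graph) -> np.ndarray:
    """A tiny, hand-picked feature vector in the spirit of highly comparative
    graph analysis; HCGA computes thousands of such features per network."""
    degrees = [d for _, d in G.degree()]
    feats = [
        G.number_of_nodes(),
        G.number_of_edges(),
        float(np.mean(degrees)) if degrees else 0.0,
        nx.density(G),
        nx.transitivity(G),
        nx.number_connected_components(G),
    ]
    return np.asarray(feats, dtype=float)

# A feature matrix over a set of graphs can then feed any standard
# classifier or regressor (e.g. from scikit-learn), as HCGA does at scale.
```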


Author(s):  
Yun-Young Hwang et al.

In order to make public data more useful, it is necessary to provide relevant data sets that meet the needs of users. We introduce a method of linkage between datasets, deriving linkages between the fields of structured datasets provided by public data portals. We define a dataset and the connectivity between datasets. The connectivity between them is based on the metadata of the dataset and on the linkage between the actual data field names and values. We constructed standard field names and, based on this standard, established the relationships between the datasets. This paper covers 31,692 structured datasets (as of May 31, 2020) among the public data portal datasets, from which we extracted 1,185,846 field names. Analysing these field names, those related to spatial information were the most common, at 35%. We verified the method of deriving relations between data sets, focusing on the field names classified as spatial information, and for this purpose defined spatial standard field names. To derive similar field names, we extracted field names related to space, such as locations, coordinates, addresses and zip codes used in public datasets. The standard field names for spatial information were designed and achieved a linkage rate of 43% across the 31,692 datasets. In the future, we plan to apply additional similar field names to improve the linkage rate for the spatial information standard.
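A deliberately simple sketch of the field-name linkage idea follows: raw field names are mapped to the closest entry in a list of standard spatial field names by string similarity. The standard names listed are hypothetical English stand-ins; the paper's actual standard field names and its derivation procedure are not reproduced in the abstract.

```python
from difflib import SequenceMatcher

# Hypothetical examples of "standard" spatial field names; the actual standard
# defined in the paper is not given in the abstract.
STANDARD_SPATIAL_FIELDS = ["latitude", "longitude", "address", "zip_code"]

def match_to_standard(field_name, standards=STANDARD_SPATIAL_FIELDS, threshold=0.8):
    """Map a raw dataset field name to the closest standard spatial field name,
    or None when nothing is similar enough. A simple string-similarity stand-in
    for the paper's linkage derivation."""
    best, best_score = None, 0.0
    for std in standards:
        score = SequenceMatcher(None, field_name.lower(), std).ratio()
        if score > best_score:
            best, best_score = std, score
    return best if best_score >= threshold else None
```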


2020 ◽  
Vol 498 (3) ◽  
pp. 3440-3451
Author(s):  
Alan F Heavens ◽  
Elena Sellentin ◽  
Andrew H Jaffe

Abstract: Bringing a high-dimensional data set into science-ready shape is a formidable challenge that often necessitates data compression. Compression has accordingly become a key consideration for contemporary cosmology, affecting public data releases and reanalyses searching for new physics. However, data compression optimized for a particular model can suppress signs of new physics, or even remove them altogether. We therefore provide a solution for exploring new physics during data compression. In particular, we store additional agnostic compressed data points, selected to enable precise constraints of non-standard physics at a later date. Our procedure is based on the maximal compression of the MOPED algorithm, which optimally filters the data with respect to a baseline model. We select additional filters, based on a generalized principal component analysis, which are carefully constructed to scout for new physics at high precision and speed. We refer to the augmented set of filters as MOPED-PC. They enable an analytic computation of Bayesian evidence that may indicate the presence of new physics, and fast analytic estimates of best-fitting parameters when adopting a specific non-standard theory, without further expensive MCMC analysis. As there may be large numbers of non-standard theories, the speed of the method becomes essential. Should no new physics be found, then our approach preserves the precision of the standard parameters. As a result, we achieve very rapid and maximally precise constraints of standard and non-standard physics, with a technique that scales well to large dimensional data sets.
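For orientation, the sketch below builds MOPED-style compression filters from the derivatives of the mean data vector with respect to the model parameters and the data covariance, Gram-Schmidt orthonormalised in the covariance metric, and applies them to a data vector. The additional generalized-PCA filters that make up MOPED-PC are not included; array shapes and names are assumptions.

```python
import numpy as np

def moped_filters(dmu_dtheta, cov):
    """MOPED-style compression filters: one filter per model parameter, built
    from the derivatives of the mean data vector and the data covariance, and
    orthonormalised in the covariance metric. A sketch, not the paper's code."""
    cinv = np.linalg.inv(cov)
    filters = []
    for dmu in dmu_dtheta:                      # one derivative vector per parameter
        b = cinv @ dmu
        for prev in filters:                    # orthogonalise against earlier filters
            b = b - (prev @ cov @ b) * prev
        b = b / np.sqrt(b @ cov @ b)            # normalise so that b^T C b = 1
        filters.append(b)
    return np.array(filters)

def compress(data, filters):
    """One compressed number per baseline-model parameter."""
    return filters @ data
```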


2014 ◽  
Vol 19 (4) ◽  
pp. 37-55 ◽  
Author(s):  
Sayan Mandal ◽  
Samit Biswas ◽  
Amit Kumar Das ◽  
Bhabatosh Chanda

Abstract: Research on document image analysis has been actively pursued over the last few decades, and services like OCR, vectorization of drawings/graphics and various types of form processing are now very common. Handwritten documents, old historical documents and documents captured by camera are now also subjects of active research. However, research on another very important type of paper document, namely the map document, suffers because of the inherent complexity of maps and the non-availability of benchmark public data sets. This paper presents a new data set, the Land Map Image Database (LMIDb), that consists of a variety of land map images (446 images at present and growing; scanned at 200/300 dpi in TIF format) and the corresponding ground truth. Using semi-automatic tools, the non-text parts of the images are removed and the text-only ground truth is also kept in the database. The paper also presents a classification strategy for map images, using which the maps in the database are automatically classified into Political (Po), Physical (Ph), Resource (R) and Topographic (T) maps. The automatic classification of maps helps in indexing the images in LMIDb for archival and for easy retrieval of the right maps to obtain the appropriate geographical information. Classification accuracy is also tested on the proposed data set and the result is encouraging.
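The classification strategy itself is not detailed in the abstract; as a purely illustrative baseline for four-way map-type classification on scanned TIF images, the sketch below pairs a crude colour-histogram feature with an SVM. The feature and model choices are assumptions, not the paper's method.

```python
import numpy as np
from PIL import Image
from sklearn.svm import SVC

CLASSES = ["Political", "Physical", "Resource", "Topographic"]

def colour_histogram(path, bins=16):
    """Crude colour-histogram feature for one scanned map image."""
    img = np.asarray(Image.open(path).convert("RGB"))
    hist = [np.histogram(img[..., c], bins=bins, range=(0, 255))[0] for c in range(3)]
    h = np.concatenate(hist).astype(float)
    return h / h.sum()

def train_map_classifier(paths, labels):
    """Fit an RBF-kernel SVM on colour-histogram features; labels should be
    drawn from CLASSES (or any label set of your choosing)."""
    X = np.stack([colour_histogram(p) for p in paths])
    return SVC(kernel="rbf").fit(X, labels)
```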

