Minimum Data Base Determination using Machine Learning

2016 ◽  
Vol 13 (4) ◽  
pp. 1-18
Author(s):  
Angel Fernando Kuri-Morales

The exploitation of large databases frequently implies the investment of large and usually expensive resources, both in terms of storage and of the processing time required. It is possible to obtain equivalent reduced data sets in which the statistical information of the original data is preserved while redundant constituents are dispensed with, so that the physical embodiment of the relevant features of the database is more economical. The author proposes a method for obtaining an optimal transformed representation of the original data which is, in general, considerably more compact than the original without impairing its informational content. To certify the equivalence of the original data set (FD) and the reduced one (RD), the author applies an algorithm which relies on a Genetic Algorithm (GA) and a multivariate regression algorithm (AA). Through the combined application of GA and AA, the equivalent behavior of both FD and RD may be guaranteed with a high degree of statistical certainty.
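
A minimal sketch of the general idea, assuming nothing about the authors' actual GA or regression implementation: a toy genetic algorithm searches for a reduced subset (RD) of a synthetic full data set (FD) such that a multivariate linear regression fitted on RD reproduces, on the full data, roughly the same fit quality as a regression fitted on FD. The data, parameters, and fitness function are all illustrative.

```python
# Toy GA + regression sketch (not the authors' pipeline): search for a reduced
# row subset whose regression behaves like the regression on the full data.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 5))
y = X @ np.array([1.5, -2.0, 0.3, 0.0, 0.7]) + rng.normal(scale=0.1, size=2000)

baseline = LinearRegression().fit(X, y).score(X, y)   # R^2 of the regression on FD

def fitness(mask):
    """Penalize subsets whose regression deviates from the FD baseline,
    with a small extra reward for smaller subsets."""
    if mask.sum() < 10:
        return -np.inf
    m = LinearRegression().fit(X[mask], y[mask])
    loss = abs(baseline - m.score(X, y))              # equivalence judged on the *full* data
    return -loss - 1e-4 * mask.mean()

# Minimal GA over boolean row masks: selection, one-point crossover, mutation.
pop = rng.random((30, len(X))) < 0.2
for generation in range(40):
    scores = np.array([fitness(ind) for ind in pop])
    parents = pop[np.argsort(scores)[::-1][:10]]
    children = []
    for _ in range(len(pop) - len(parents)):
        a, b = parents[rng.integers(10, size=2)]
        cut = rng.integers(len(X))
        child = np.concatenate([a[:cut], b[cut:]])
        flip = rng.random(len(X)) < 0.01              # mutation
        children.append(np.where(flip, ~child, child))
    pop = np.vstack([parents, children])

best = pop[np.argmax([fitness(ind) for ind in pop])]
gap = abs(baseline - LinearRegression().fit(X[best], y[best]).score(X, y))
print(f"kept {best.sum()} of {len(X)} rows, R^2 gap = {gap:.4f}")
```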

2020 ◽  
Vol 492 (1) ◽  
pp. 1421-1431 ◽  
Author(s):  
Zhicheng Yang ◽  
Ce Yu ◽  
Jian Xiao ◽  
Bo Zhang

Radio frequency interference (RFI) detection and excision are key steps in the data-processing pipeline of the Five-hundred-meter Aperture Spherical radio Telescope (FAST). Because of its high sensitivity and large data rate, FAST requires more accurate and efficient RFI flagging methods than its counterparts. In recent decades, approaches based upon artificial intelligence (AI), such as codes using convolutional neural networks (CNNs), have been proposed to identify RFI more reliably and efficiently. However, RFI flagging of FAST data with such methods has often proved to be erroneous, with further manual inspection required. In addition, network construction as well as preparation of training data sets for effective RFI flagging has imposed significant additional workloads. Therefore, rapid deployment and adjustment of AI approaches for different observations is impractical with existing algorithms. To overcome these problems, we propose a model called RFI-Net. Given raw data without any processing as input, RFI-Net detects RFI automatically, producing corresponding masks without any alteration of the original data. Experiments with RFI-Net using simulated astronomical data show that our model outperforms existing methods in terms of both precision and recall. Moreover, compared with other models, our method can obtain the same relative accuracy with less training data, thus reducing the effort and time required to prepare the training data set. Further, the training process of RFI-Net can be accelerated, with overfitting minimized, compared with other CNN codes. The performance of RFI-Net has also been evaluated with observational data obtained by FAST and the Bleien Observatory. Our results demonstrate the ability of RFI-Net to accurately identify RFI with fine-grained, high-precision masks that require no further modification.
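
The published RFI-Net architecture is not reproduced here; the following PyTorch sketch only illustrates the input/output contract described in the abstract (a raw time-frequency array in, a same-sized per-pixel RFI mask out). Layer sizes, the loss, and the dummy data are assumptions.

```python
# Minimal mask-producing CNN sketch; NOT the published RFI-Net architecture.
import torch
import torch.nn as nn

class TinyRFIMasker(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 1, kernel_size=1),            # per-pixel RFI logit
        )

    def forward(self, x):                               # x: (batch, 1, time, freq)
        return self.net(x)

model = TinyRFIMasker()
spectrogram = torch.randn(4, 1, 256, 256)               # fake raw dynamic spectra
logits = model(spectrogram)
loss = nn.BCEWithLogitsLoss()(logits, torch.zeros_like(logits))  # dummy all-clean target
mask = torch.sigmoid(logits) > 0.5                      # binary mask, same shape as input
print(mask.shape, float(loss))
```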


Author(s):  
Angel Fernando Kuri-Morales

The exploitation of large databases implies the investment of expensive resources, both in terms of storage and of processing time. Correct assessment of the data requires that pre-processing steps be taken before its analysis, in particular the transformation of categorical data by adequately encoding every instance of the categorical variables. The encoding must preserve the actual patterns in the data while avoiding the introduction of non-existing ones. The authors discuss CESAMO, an algorithm which allows the pattern-preserving codes to be identified statistically. The resulting database is more economical, and the approach may encompass mixed databases. Thus, they obtain an optimal transformed representation that is considerably more compact without impairing its informational content. To certify the equivalence of the original data set (FD) and the reduced one (RD), they apply an algorithm that relies on a multivariate regression algorithm (AA). Through the combined application of CESAMO and AA, the equivalent behavior of both FD and RD may be guaranteed with a high degree of statistical certainty.
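
A heavily hedged sketch of the underlying idea, not the published CESAMO algorithm: candidate numeric codes for a categorical variable are sampled at random (a Monte Carlo search), and the assignment whose encoded column relates most strongly to the rest of the data is retained as the pattern-preserving encoding. The data, the correlation-based score, and the number of samples are all illustrative.

```python
# Monte Carlo search for pattern-preserving categorical codes (illustrative only).
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
df = pd.DataFrame({
    "color": rng.choice(["red", "green", "blue"], size=500),
    "x": rng.normal(size=500),
})
df["y"] = df["x"] + df["color"].map({"red": 0.0, "green": 1.0, "blue": 2.0})

levels = df["color"].unique()
best_codes, best_score = None, -np.inf
for _ in range(2000):                                  # sample candidate code assignments
    codes = dict(zip(levels, rng.random(len(levels))))
    encoded = df["color"].map(codes)
    score = abs(np.corrcoef(encoded, df["y"])[0, 1])   # how well the codes preserve the pattern
    if score > best_score:
        best_codes, best_score = codes, score

print(best_codes, round(best_score, 3))
```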


2014 ◽  
pp. 215-223
Author(s):  
Dipak V. Patil ◽  
Rajankumar S. Bichkar

The advances in, and use of, technology in all walks of life result in tremendous growth of the data available for data mining, and the large amount of knowledge it contains can be used to improve decision-making processes. However, such data contains noise or outliers to some extent, which hampers the classification performance of a classifier built on that training data, and the learning process on a large data set becomes very slow when it must be carried out serially. It has been shown that random data reduction techniques can be used to build optimal decision trees. We can therefore integrate data cleaning and data sampling techniques to overcome the problems of handling large data sets. In the proposed technique, outlier data is first filtered out to obtain clean data of improved quality, and a random sampling technique is then applied to this clean data set to obtain a reduced data set, which is used to construct an optimal decision tree. Experiments performed on several data sets show that the proposed technique builds decision trees with enhanced classification accuracy compared with classification performance on the complete data set. The classification filter improves the quality of the data, and sampling reduces the size of the data set. The proposed method thus constructs more accurate and optimally sized decision trees while avoiding problems such as overloading memory and processor with large data sets. In addition, the time required to build a model on the clean data is significantly reduced, providing a significant speedup.
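
A hedged sketch of the pipeline described above using scikit-learn, with illustrative choices throughout: a cross-validated classification filter drops instances it misclassifies (treated as noise/outliers), random sampling then shrinks the cleaned set, and a decision tree trained on the reduced set is compared against one trained on the full training data.

```python
# Classification filter -> random sampling -> decision tree (illustrative pipeline).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=5000, n_features=20, flip_y=0.05, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# 1. Classification filter: keep only instances the filter predicts correctly.
filt_pred = cross_val_predict(DecisionTreeClassifier(random_state=0), X_tr, y_tr, cv=5)
clean = filt_pred == y_tr
X_clean, y_clean = X_tr[clean], y_tr[clean]

# 2. Random sampling of the cleaned data (50% here, purely illustrative).
rng = np.random.default_rng(0)
idx = rng.choice(len(X_clean), size=len(X_clean) // 2, replace=False)

# 3. Decision tree on the reduced, cleaned set vs. on the full training set.
reduced_tree = DecisionTreeClassifier(random_state=0).fit(X_clean[idx], y_clean[idx])
full_tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
print(reduced_tree.score(X_te, y_te), full_tree.score(X_te, y_te))
```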


2019 ◽  
Vol 1 (3) ◽  
pp. 42-48
Author(s):  
Mohammed Z. Al-Faiz ◽  
Ali A. Ibrahim ◽  
Sarmad M. Hadi

The speed of learning in a neural network environment is considered one of the most important parameters, especially for large data sets. This paper attempts to minimize the time required for the neural network to fully learn the data by standardizing the input data. The paper shows that Z-score standardization of the input data significantly decreases the number of epochs required for the network to learn. It also shows that a binary data set is a serious limitation for the convergence of a neural network, so standardization is a must in such cases, where zero-valued inputs effectively disable the corresponding connections in the network. The data set used in this paper consists of features extracted from gel electrophoresis images, which opens the door to using artificial intelligence in such areas.
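
A minimal sketch of the preprocessing step discussed above, assuming a synthetic data set rather than the gel-electrophoresis features used in the paper: the same small feed-forward network is trained on raw and on Z-score-standardized inputs, and the number of epochs to convergence is compared.

```python
# Z-score standardization before training a small feed-forward network.
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(loc=100.0, scale=25.0, size=(1000, 10))   # raw, unscaled features
y = (X[:, 0] + X[:, 1] > 200).astype(int)

X_std = StandardScaler().fit_transform(X)                # z = (x - mean) / std

raw_net = MLPClassifier(max_iter=2000, random_state=0).fit(X, y)
std_net = MLPClassifier(max_iter=2000, random_state=0).fit(X_std, y)
print("epochs (raw):", raw_net.n_iter_, " epochs (z-score):", std_net.n_iter_)
```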


Author(s):  
MITHUN PRASAD ◽  
ARCOT SOWMYA ◽  
INGE KOCH

Isolating relevant information and reducing the dimensionality of the original data set are key areas of interest in pattern recognition and machine learning. In this paper, a novel approach to reducing dimensionality of the feature space by employing independent component analysis (ICA) is introduced. While ICA is primarily a feature extraction technique, it is used here as a feature selection/construction technique in a generic way. The new technique, called feature selection based on independent component analysis (FS_ICA), efficiently builds a reduced set of features without loss in accuracy and also has a fast incremental version. When used as a first step in supervised learning, FS_ICA outperforms comparable methods in efficiency without loss of classification accuracy. For large data sets as in medical image segmentation of high-resolution computer tomography images, FS_ICA reduces dimensionality of the data set substantially and results in efficient and accurate classification.
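
A hedged sketch in the spirit of the approach described above, using plain FastICA from scikit-learn rather than the paper's FS_ICA selection procedure: the feature space is reduced to a handful of independent components before supervised classification. The data set, component count, and classifier are assumptions.

```python
# ICA-based dimensionality reduction ahead of supervised learning (illustrative).
from sklearn.datasets import load_digits
from sklearn.decomposition import FastICA
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

X, y = load_digits(return_X_y=True)                     # 64 raw features per image
reduced = make_pipeline(FastICA(n_components=15, random_state=0, max_iter=1000),
                        KNeighborsClassifier())
print(cross_val_score(reduced, X, y, cv=5).mean())      # accuracy on 15 ICA components
```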


2017 ◽  
Vol 26 (2) ◽  
pp. 335-358 ◽  
Author(s):  
Piyabute Fuangkhon

Instance selection endeavors to decide which instances from the data set should be maintained for further use during the learning process. It can result in increased generalization of the learning model, shorter time of the learning process, or scaling up to large data sources. This paper presents a parallel distance-based instance selection approach for a feed-forward neural network (FFNN), which can utilize all available processing power to reduce the data set while obtaining similar levels of classification accuracy as when the original data set is used. The algorithm identifies the instances at the decision boundary between consecutive classes of data, which are essential for placing hyperplane decision surfaces, and retains these instances in the reduced data set (subset). Each identified instance, called a prototype, is one of the representatives of the decision boundary of its class that constitutes the shape or distribution model of the data set. No feature or dimension is sacrificed in the reduction process. Regarding reduction capability, the algorithm obtains approximately 85% reduction power on non-overlapping two-class synthetic data sets, 70% reduction power on highly overlapping two-class synthetic data sets, and 77% reduction power on multiclass real-world data sets. Regarding generalization, the reduced data sets obtain similar levels of classification accuracy as when the original data set is used on both FFNN and support vector machine. Regarding execution time requirement, the speedup of the parallel algorithm over the serial algorithm is proportional to the number of threads the processor can run concurrently.
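
A hedged sketch of distance-based boundary selection, not the paper's parallel algorithm: an instance is retained as a boundary prototype when its nearest neighbour from another class is nearly as close as its nearest neighbour from its own class. The ratio threshold and data are illustrative.

```python
# Keep only instances lying near the decision boundary between classes.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neighbors import NearestNeighbors

X, y = make_classification(n_samples=3000, n_features=10, random_state=0)

keep = np.zeros(len(X), dtype=bool)
for cls in np.unique(y):
    same, other = X[y == cls], X[y != cls]
    d_same, _ = NearestNeighbors(n_neighbors=2).fit(same).kneighbors(same)
    d_other, _ = NearestNeighbors(n_neighbors=1).fit(other).kneighbors(same)
    # d_same[:, 1] skips the zero distance of each point to itself.
    keep[y == cls] = d_other[:, 0] < 1.5 * d_same[:, 1]

print(f"retained {keep.sum()} of {len(X)} instances as boundary prototypes")
```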


Author(s):  
Malcolm J. Beynon

The essence of data mining is to investigate for pertinent information that may exist in data (often large data sets). The immeasurably large amount of data present in the world, due to the increasing capacity of storage media, raises the issue of the presence of missing values (Olinsky et al., 2003; Brown and Kros, 2003). This encyclopaedia article considers the general issue of the presence of missing values when data mining, and demonstrates the effect of managing, or not managing, their presence through the utilisation of a data mining technique. The issue of missing values was first exposited over forty years ago in Afifi and Elashoff (1966). Since then it has continually been the focus of study and explanation (El-Masri and Fox-Wasylyshyn, 2005), covering issues such as the nature of their presence and their management (Allison, 2000). With this in mind, one consistent aspect of the missing-value debate is the limited number of general strategies available for their management, the main two being either the simple deletion of cases with missing data or some form of imputation of the missing values (see Elliott and Hawthorne, 2005). Examples of the specific investigation of missing data (and data quality) include data warehousing (Ma et al., 2000) and customer relationship management (Berry and Linoff, 2000). An alternative strategy considered here is the retention of the missing values, and their subsequent ‘ignorance’ contribution in any data mining undertaken on the associated original incomplete data set. A consequence of this retention is that full interpretability can be placed on the results found from the original incomplete data set. This strategy can be followed when using the nascent CaRBS technique for object classification (Beynon, 2005a, 2005b). CaRBS analyses are presented here to illustrate that data mining can manage the presence of missing values in a much more effective manner than the more inhibitory traditional strategies. An example data set is considered, with a noticeable level of missing values present in the original data set. A critical increase in the number of missing values present in the data set further illustrates the benefit from ‘intelligent’ data mining (in this case using CaRBS).
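
For concreteness, a short illustration of the two traditional strategies named above on toy data: listwise deletion of incomplete cases versus simple mean imputation (the CaRBS retention-of-ignorance approach itself is not reproduced here).

```python
# Deletion vs. imputation of missing values on a toy table.
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [23, np.nan, 41, 35, np.nan],
                   "income": [48.0, 52.0, np.nan, 61.0, 44.0]})

deleted = df.dropna()                      # deletion: only complete cases survive
imputed = df.fillna(df.mean())             # imputation: fill gaps with column means
print(len(df), "cases ->", len(deleted), "after deletion;",
      imputed.isna().sum().sum(), "missing after imputation")
```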


Author(s):  
MICHEL BRUYNOOGHE

The clustering of large data sets is of great interest in fields such as pattern recognition, numerical taxonomy, and image or speech processing. The traditional ascendant hierarchical clustering algorithm (AHC) cannot be run on sets of more than a few thousand elements. The reducible neighborhoods clustering algorithm presented in this paper overcomes the limits of the traditional hierarchical clustering algorithm by generating an exact hierarchy on a large data set. The theoretical justification of this algorithm is the so-called Bruynooghe reducibility principle, which lays down the condition under which the exact hierarchy may be constructed locally, by carrying out aggregations in restricted regions of the representation space. As with the Day and Edelsbrunner algorithm, the maximum theoretical time complexity of the reducible neighborhoods clustering algorithm is O(n² log n), regardless of the chosen clustering strategy. However, the reducible neighborhoods clustering algorithm works on the original data table, and its practical performance is far better than that of the Day and Edelsbrunner algorithm, thus allowing the hierarchical clustering of large data sets, i.e. sets of more than 10 000 objects.
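
For context only, a minimal SciPy sketch of the standard ascendant hierarchical clustering that the reducible neighborhoods algorithm improves upon; the paper's algorithm itself, which restricts aggregations to local neighbourhoods, is not implemented here, and the data are synthetic.

```python
# Standard agglomerative (ascendant) hierarchical clustering for comparison.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(0)
points = rng.normal(size=(2000, 2))             # a few thousand points is the practical limit

Z = linkage(points, method="ward")              # exact hierarchy; pairwise distances dominate the cost
labels = fcluster(Z, t=5, criterion="maxclust")
print(np.bincount(labels)[1:])                  # sizes of the 5 recovered clusters
```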


2017 ◽  
Vol 73 (1) ◽  
pp. 22-31 ◽  
Author(s):  
Jacob Pearson Keller

Structure determination of conformationally variable proteins can prove challenging even when many possible molecular-replacement (MR) search models of high sequence similarity are available. Calmodulin (CaM) is perhaps the best-studied archetype of these flexible proteins: while there are currently ∼450 structures of significant sequence similarity available in the Protein Data Bank (PDB), novel conformations of CaM and complexes thereof continue to be reported. Here, the details of the solution of a novel peptide–CaM complex structure by MR are presented, in which only one MR solution of marginal quality was found despite the use of 120 different search models, a difficulty exacerbated by the presence of a high degree of hemihedral twinning (overall refined twin fraction = 0.43). Ambiguities in the initial MR electron-density maps were overcome by using MR-SAD: phases from the MR partial model were used to identify weak anomalous scatterers (calcium, sulfur and chloride), which were in turn used to improve the phases, automatically rebuild the structure and resolve sequence ambiguities. Retrospective analysis of consecutive wedges of the original data sets showed twin fractions ranging from 0.32 to 0.55, suggesting that the data sets were variably twinned. Despite these idiosyncrasies and obstacles, the data themselves and the final model were of high quality and indeed showed a novel, nearly right-angled conformation of the bound peptide.


Author(s):  
Lior Shamir

Several recent observations using large data sets of galaxies showed non-random distribution of the spin directions of spiral galaxies, even when the galaxies are too far from each other to have gravitational interaction. Here, a data set of ~8.7×10³ spiral galaxies imaged by the Hubble Space Telescope (HST) is used to test and profile a possible asymmetry between galaxy spin directions. The asymmetry between galaxies with opposite spin directions is compared to the asymmetry of galaxies from the Sloan Digital Sky Survey (SDSS). The two data sets contain different galaxies at different redshift ranges, and each data set was annotated using a different annotation method. The results show that both data sets exhibit a similar asymmetry in the COSMOS field, which is covered by both telescopes. Fitting the asymmetry of the galaxies to a cosine dependence shows a dipole axis with probabilities of ~2.8σ and ~7.38σ in HST and SDSS, respectively. The most likely dipole axis identified in the HST galaxies is at (α = 78°, δ = 47°), well within the 1σ error range of the location of the most likely dipole axis in the SDSS galaxies with z > 0.15, identified at (α = 71°, δ = 61°).
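
A hedged sketch of the kind of dipole fit described above, with random spin labels so that no genuine asymmetry should appear: for each candidate axis on a coarse (RA, Dec) grid, the spin labels are fitted against the cosine of the angular distance to that axis, and the axis with the strongest fitted amplitude (in units of its standard error) is reported. The grid resolution and error estimate are assumptions.

```python
# Grid search for the dipole axis that best fits the spin-direction asymmetry.
import numpy as np

rng = np.random.default_rng(0)
n = 20000
ra = rng.uniform(0.0, 2.0 * np.pi, n)                  # right ascension (radians)
dec = np.arcsin(rng.uniform(-1.0, 1.0, n))             # declination (radians)
spin = rng.choice([-1.0, 1.0], n)                      # clockwise / counterclockwise label

def unit(ra_, dec_):
    """Unit vector(s) on the celestial sphere."""
    return np.stack([np.cos(dec_) * np.cos(ra_),
                     np.cos(dec_) * np.sin(ra_),
                     np.sin(dec_)], axis=-1)

gal = unit(ra, dec)                                    # shape (n, 3)
best = None
for a in np.radians(np.arange(0, 360, 10)):            # candidate dipole axes
    for d in np.radians(np.arange(-80, 81, 10)):
        cos_theta = gal @ unit(a, d)                   # cos(angular distance to axis)
        amp = np.sum(spin * cos_theta) / np.sum(cos_theta ** 2)   # least-squares dipole amplitude
        sigma = 1.0 / np.sqrt(np.sum(cos_theta ** 2))  # its standard error for unit-variance spins
        if best is None or abs(amp) / sigma > best[0]:
            best = (abs(amp) / sigma, np.degrees(a), np.degrees(d))

print(f"most likely axis: RA = {best[1]:.0f} deg, Dec = {best[2]:.0f} deg ({best[0]:.2f} sigma)")
```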

