An Empirical Study on Initializing Centroid in K-Means Clustering for Feature Selection

Author(s):  
Amit Saxena ◽  
John Wang ◽  
Wutiphol Sintunavarat

One of the main problems in K-means clustering is the setting of initial centroids: a poor choice can cause patterns to be misclustered and thus lowers clustering accuracy. Recently, a density- and distance-based technique for determining initial centroids was shown to yield faster convergence of clusters. Motivated by this idea, the authors study the impact of initial centroids on clustering accuracy for unsupervised feature selection. Three metrics are used to rank the features of a data set. The centroids of the clusters, to be used in K-means clustering, are initialized both randomly and by the density- and distance-based approach. Extensive experiments are performed on 15 data sets. The main significance of the paper is that K-means clustering yields higher accuracies on the majority of these data sets when the proposed density- and distance-based approach is used. As a practical impact, good clustering accuracy can be achieved with fewer features, which is useful for mining data sets with thousands of features.
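
To make the initialization idea concrete, here is a minimal Python sketch of one density- and distance-based seeding rule: take the densest point first, then repeatedly take the point with the largest product of local density and distance to its nearest chosen centroid. The radius heuristic and the scoring rule are illustrative assumptions, not necessarily the paper's exact method.

```python
import numpy as np

def density_distance_init(X, k, radius=None):
    """Choose k initial centroids: densest point first, then points that
    are both dense and far from the centroids chosen so far. The radius
    heuristic and the density*distance score are illustrative choices."""
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    if radius is None:
        radius = float(np.median(dist))        # crude neighbourhood cutoff
    density = (dist < radius).sum(axis=1)      # neighbours within radius
    chosen = [int(np.argmax(density))]         # start at the densest point
    for _ in range(k - 1):
        d_near = dist[:, chosen].min(axis=1)   # gap to nearest chosen centroid
        score = density * d_near               # dense AND well separated
        score[chosen] = -np.inf                # never re-pick a centroid
        chosen.append(int(np.argmax(score)))
    return X[chosen]
```

The resulting array can seed scikit-learn directly, e.g. KMeans(n_clusters=k, init=density_distance_init(X, k), n_init=1), which is one way to compare this seeding against purely random initialization.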

Stats ◽  
2021 ◽  
Vol 4 (2) ◽  
pp. 359-384
Author(s):  
Manabu Ichino ◽  
Kadri Umbleja ◽  
Hiroyuki Yaguchi

This paper presents an unsupervised feature selection method for multi-dimensional histogram-valued data. We define a multi-role measure, called the compactness, based on the concept size of given objects and/or clusters described by a fixed number of equal-probability bin-rectangles. In each step of clustering, we agglomerate objects and/or clusters so as to minimize the compactness of the generated cluster; the compactness thus plays the role of a similarity measure between the objects and/or clusters to be merged. Minimizing the compactness is equivalent to maximizing the dissimilarity of the generated cluster, i.e., concept, against the whole concept in each step; in this sense, the compactness also plays the role of a cluster-quality measure. We further show that the average compactness of each feature with respect to objects and/or clusters over several clustering steps is useful as a feature-effectiveness criterion. Features having small average compactness are mutually covariate and are able to detect a geometrically thin structure embedded in the given multi-dimensional histogram-valued data. We obtain a thorough understanding of the given data via visualization using dendrograms and scatter diagrams with respect to the selected informative features. We illustrate the effectiveness of the proposed method using an artificial data set and real histogram-valued data sets.
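
To make the agglomeration step concrete, here is a heavily hedged Python sketch. It assumes each object is described, per feature, by m equal-probability bins, that a cluster's concept along a feature is the bin-wise interval join, and that compactness is the mean normalised bin width averaged over features; the paper's exact definition of concept size may differ.

```python
import numpy as np

def quantile_bins(samples, m=4):
    """Describe one feature of one object by m equal-probability bins,
    returned as an (m, 2) array of [lower, upper] interval bounds."""
    qs = np.quantile(samples, np.linspace(0.0, 1.0, m + 1))
    return np.column_stack([qs[:-1], qs[1:]])

def compactness(objects, feature_ranges):
    """Normalised concept size of a set of objects/clusters. `objects`
    is a list whose entries hold one (m, 2) bin array per feature."""
    sizes = []
    for i, rng in enumerate(feature_ranges):
        bins = np.stack([obj[i] for obj in objects])  # (n_obj, m, 2)
        lo = bins[..., 0].min(axis=0)                 # bin-wise interval join
        hi = bins[..., 1].max(axis=0)
        sizes.append(np.mean(hi - lo) / rng)          # mean width, normalised
    return float(np.mean(sizes))
```

At each agglomeration step one would then merge the pair of objects/clusters whose union minimises this value; the per-feature terms, averaged over steps, give the average-compactness feature ranking described above.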



Author(s):  
Paul D. Scott

This chapter addresses how to decide how large a sample is necessary in order to apply a particular data mining procedure to a given data set. A brief review of the main results of basic sampling theory is followed by a detailed consideration and comparison of the impact of simple random sample size on two well-known data mining procedures: naïve Bayes classifiers and decision tree induction. It is shown that both the learning procedure and the data set have a major impact on the size of sample required, but that the size of the data set itself has little effect. The next section introduces a more sophisticated form of sampling, disproportionate stratification, and shows how it may be used to make much more effective use of limited processing resources. This section also includes a discussion of dynamic and static sampling. An examination of the impact of target function complexity concludes that neither target function complexity nor the size of the attribute tuple space need be considered explicitly in determining sample size. The chapter concludes with a summary of the major results, a consideration of their relevance for small data sets, and some brief remarks on the role of sampling for other data mining procedures.
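
As an illustration of disproportionate stratification, the sketch below over-samples rare strata by drawing an equal number of rows from each stratum and attaches inverse-probability weights so that population-level estimates remain unbiased. The equal-allocation rule is an assumption for illustration, not the chapter's prescription.

```python
import numpy as np
import pandas as pd

def disproportionate_sample(df, stratum_col, n_total, seed=0):
    """Equal-allocation disproportionate stratified sample with
    inverse-probability weights (illustrative allocation rule)."""
    rng = np.random.default_rng(seed)
    groups = df.groupby(stratum_col)
    n_per = max(1, n_total // groups.ngroups)   # same n from every stratum
    parts = []
    for _, group in groups:
        take = min(n_per, len(group))
        rows = df.loc[rng.choice(group.index, size=take, replace=False)].copy()
        rows["weight"] = len(group) / take      # inverse sampling fraction
        parts.append(rows)
    return pd.concat(parts)
```

Rare strata thus contribute far more rows than their population share, which is exactly what makes the approach efficient when processing resources, not data, are the bottleneck.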


2021 ◽  
Vol 8 (1) ◽  
Author(s):  
Yahya Albalawi ◽  
Jim Buckley ◽  
Nikola S. Nikolov

Abstract. This paper presents a comprehensive evaluation of data pre-processing and word-embedding techniques in the context of Arabic document classification in the domain of health-related communication on social media. We evaluate 26 text pre-processing techniques applied to Arabic tweets within the process of training a classifier to identify health-related tweets. For this task we use the (traditional) machine learning classifiers KNN, SVM, Multinomial NB, and Logistic Regression. Furthermore, we report experimental results with the deep learning architectures BLSTM and CNN for the same text classification problem. Since word embeddings are more typically used as the input layer in deep networks, in the deep learning experiments we evaluate several state-of-the-art pre-trained word embeddings with the same text pre-processing applied. To achieve these goals, we use two data sets: one for both training and testing, and another for testing the generality of our models only. Our results point to the conclusion that only four of the 26 pre-processing techniques significantly improve classification accuracy. For the first data set of Arabic tweets, we found that Mazajak CBOW pre-trained word embeddings as the input to a BLSTM deep network led to the most accurate classifier, with an F1 score of 89.7%. For the second data set, Mazajak Skip-Gram pre-trained word embeddings as the input to a BLSTM led to the most accurate model, with an F1 score of 75.2% and an accuracy of 90.7%, compared to an F1 score of 90.8% achieved by Mazajak CBOW for the same architecture but with a lower accuracy of 70.89%. Our results also show that the performance of the best of the traditional classifiers we trained is comparable to that of the deep learning methods on the first data set, but significantly worse on the second data set.
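
A minimal sketch of the deep-learning setup described above: a BLSTM over frozen pre-trained embeddings (e.g. a Mazajak CBOW or Skip-Gram matrix loaded as a NumPy array). Layer sizes, dropout, and the optimizer are placeholder choices, not the paper's tuned configuration, and `mazajak_matrix`/`token_ids` in the usage comment are hypothetical names.

```python
import numpy as np
import tensorflow as tf

def build_blstm(embedding_matrix, num_classes=2):
    """BLSTM classifier over frozen pre-trained word embeddings.
    Sizes and optimizer are illustrative, not the paper's tuned values."""
    vocab_size, dim = embedding_matrix.shape
    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(
            vocab_size, dim,
            embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix),
            trainable=False),  # keep the pre-trained vectors fixed
        tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(128)),
        tf.keras.layers.Dropout(0.5),
        tf.keras.layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# e.g.: model = build_blstm(mazajak_matrix); model.fit(token_ids, labels)
```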


2015 ◽  
Vol 8 (1) ◽  
pp. 421-434 ◽  
Author(s):  
M. P. Jensen ◽  
T. Toto ◽  
D. Troyan ◽  
P. E. Ciesielski ◽  
D. Holdridge ◽  
...  

Abstract. The Midlatitude Continental Convective Clouds Experiment (MC3E) took place during the spring of 2011, centered in north-central Oklahoma, USA. The main goal of this field campaign was to capture the dynamical and microphysical characteristics of precipitating convective systems in the US Central Plains. A major component of the campaign was a six-site radiosonde array designed to capture the large-scale variability of the atmospheric state with the intent of deriving model forcing data sets. Over the course of the 46-day MC3E campaign, a total of 1362 radiosondes were launched from the enhanced sonde network. This manuscript provides details on the instrumentation used as part of the sounding array, the data processing activities including quality checks and humidity bias corrections, and an analysis of the impacts of bias correction and algorithm assumptions on the determination of convective levels and indices. It is found that corrections for known radiosonde humidity biases, and assumptions regarding the characteristics of the surface convective parcel, result in significant differences in the derived values of convective levels and indices in many soundings. In addition, the impact of including the humidity corrections and quality controls on the thermodynamic profiles used in the derivation of a large-scale model forcing data set is investigated. The results show a significant impact on the derived large-scale vertical velocity field, illustrating the importance of addressing these humidity biases.
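
The sensitivity to the surface-parcel assumption can be illustrated with MetPy: the same (bias-corrected) profile yields different CAPE/CIN depending on whether a surface-based or a mixed-layer parcel is lifted. The toy profile below is invented for illustration; the paper's own processing algorithms are not reproduced here.

```python
import numpy as np
from metpy.calc import mixed_layer_cape_cin, surface_based_cape_cin
from metpy.units import units

# Invented toy sounding (decreasing pressure); real use would read the
# bias-corrected radiosonde profiles.
p  = np.array([1000., 925., 850., 700., 500., 300.]) * units.hPa
T  = np.array([30., 25., 20., 10., -10., -35.]) * units.degC
Td = np.array([22., 20., 15., 2., -20., -50.]) * units.degC

# Surface-based parcel: lift the lowest observation as-is.
sb_cape, sb_cin = surface_based_cape_cin(p, T, Td)

# Mixed-layer parcel: average the lowest 100 hPa before lifting.
ml_cape, ml_cin = mixed_layer_cape_cin(p, T, Td, depth=100 * units.hPa)

print(sb_cape, sb_cin)   # the two parcel assumptions can differ markedly,
print(ml_cape, ml_cin)   # which is the sensitivity examined in the paper
```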


2021 ◽  
Author(s):  
David Cotton ◽  

Introduction

HYDROCOASTAL is a two-year project funded by ESA with the objective of maximising the exploitation of SAR and SARin altimeter measurements in the coastal zone and inland waters, by evaluating and implementing new approaches to processing SAR and SARin data from CryoSat-2 and SAR altimeter data from Sentinel-3A and Sentinel-3B. Optical data from the Sentinel-2 MSI and Sentinel-3 OLCI instruments will also be used in generating river discharge products.

New SAR and SARin processing algorithms for the coastal zone and inland waters will be developed, implemented, and evaluated through an initial test data set for selected regions. Based on the results of this evaluation, a processing scheme will be implemented to generate global coastal zone and river discharge data sets. A series of case studies will assess these products in terms of their scientific impact. All of the produced data sets will be available on request to external researchers, and full descriptions of the processing algorithms will be provided.

Objectives

The scientific objectives of HYDROCOASTAL are to enhance our understanding of the interactions between inland waters and the coastal zone, between the coastal zone and the open ocean, and of the small-scale processes that govern these interactions. The project also aims to improve our capability to characterise the variation, at different time scales, of inland water storage, its exchanges with the ocean, and the impact on regional sea-level changes. The technical objectives are to develop and evaluate new SAR and SARin altimetry processing techniques in support of the scientific objectives, including stack processing, filtering, and retracking. An improved wet troposphere correction will also be developed and evaluated.

Project Outline

The project comprises four tasks:
- Scientific Review and Requirements Consolidation: review the current state of the art in SAR and SARin altimeter data processing as applied to the coastal zone and to inland waters.
- Implementation and Validation: new processing algorithms will be implemented to generate a test data set, which will be validated against models, in-situ data, and other satellite data sets. Selected algorithms will then be used to generate global coastal zone and river discharge data sets.
- Impacts Assessment: the impact of these global products will be assessed in a series of case studies.
- Outreach and Roadmap: outreach material will be prepared and distributed to engage the wider scientific community, and recommendations will be provided for the development of future missions and future research.

Presentation

The presentation will provide an overview of the project, present the different SAR altimeter processing algorithms being evaluated in the first phase, and show early results from the evaluation of the initial test data set.


Author(s):  
Danlei Xu ◽  
Lan Du ◽  
Hongwei Liu ◽  
Penghui Wang

A Bayesian classifier for sparsity-promoting feature selection is developed in this paper, in which a set of nonlinear mappings of the original data is applied as a pre-processing step. The linear classification model with such mappings from the original input space to a nonlinear transformation space can not only construct a nonlinear classification boundary but also realize feature selection for the original data. A zero-mean Gaussian prior with Gamma precision and a finite approximation of the Beta process prior are used to promote sparsity in the utilization of features and nonlinear mappings, respectively. We derive the Variational Bayesian (VB) inference algorithm for the proposed linear classifier. Experimental results on a synthetic data set, a measured radar data set, a high-dimensional gene expression data set, and several benchmark data sets demonstrate the aggressive and robust feature selection capability of our method and its classification accuracy, which is comparable to that of other existing classifiers.
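
One plausible rendering of the hierarchical priors described above, in our own notation (the paper's parameterisation may differ): the weights w carry the zero-mean Gaussian/Gamma pair over features, while binary indicators z over the M nonlinear mappings carry the finite Beta-Bernoulli approximation of the Beta process.

```latex
% Sketch of the sparsity priors (notation ours, not necessarily the paper's):
\begin{align*}
  p(y_n = 1 \mid \mathbf{x}_n)
    &= \sigma\!\big(\mathbf{w}^{\top}(\mathbf{z}\circ\boldsymbol{\phi}(\mathbf{x}_n))\big), \\
  w_d \mid \alpha_d &\sim \mathcal{N}\!\big(0,\ \alpha_d^{-1}\big),
    \qquad \alpha_d \sim \mathrm{Gamma}(a_0, b_0), \\
  z_m \mid \pi_m &\sim \mathrm{Bernoulli}(\pi_m),
    \qquad \pi_m \sim \mathrm{Beta}\!\big(\tfrac{c_0}{M},\ d_0\tfrac{M-1}{M}\big).
\end{align*}
```

The Gaussian/Gamma pair drives irrelevant weights toward zero (feature sparsity), the Beta-Bernoulli layer switches whole mappings off, and VB inference then updates the factorised posteriors of w, alpha, z, and pi in closed form.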


2009 ◽  
Vol 2 (1) ◽  
pp. 87-98 ◽  
Author(s):  
C. Lerot ◽  
M. Van Roozendael ◽  
J. van Geffen ◽  
J. van Gent ◽  
C. Fayt ◽  
...  

Abstract. Total O3 columns have been retrieved from six years of SCIAMACHY nadir UV radiance measurements using SDOAS, an adaptation of the GDOAS algorithm previously developed at BIRA-IASB for the GOME instrument. GDOAS and SDOAS have been implemented by the German Aerospace Center (DLR) in version 4 of the GOME Data Processor (GDP) and in version 3 of the SCIAMACHY Ground Processor (SGP), respectively. The processors are run at the DLR processing centre on behalf of the European Space Agency (ESA). We first focus on the description of the SDOAS algorithm, with particular attention to the impact of uncertainties in the reference O3 absorption cross-sections. Second, the resulting SCIAMACHY total ozone data set is globally evaluated through large-scale comparisons with results from GOME and OMI as well as with ground-based correlative measurements. The various total ozone data sets are found to agree within 2% on average. However, a negative trend of 0.2–0.4% per year has been identified in the SCIAMACHY O3 columns; this probably originates from instrumental degradation effects that have not yet been fully characterized.
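
For reference, the generic DOAS fit underlying both GDOAS and SDOAS, in schematic form (the operational algorithms add instrument-specific terms): the measured optical depth is decomposed into trace-gas cross-sections scaled by slant column densities plus a low-order closure polynomial, which is why uncertainties in the reference O3 cross-sections propagate directly into the retrieved columns.

```latex
% Schematic DOAS fit: optical depth = absorber cross-sections times
% slant columns + broadband closure polynomial; vertical column via AMF.
\ln\frac{I_0(\lambda)}{I(\lambda)}
  = \sum_{j}\sigma_j(\lambda)\,\mathrm{SCD}_j
  + \sum_{p=0}^{P} a_p\,\lambda^{p},
\qquad
\mathrm{VCD}_{\mathrm{O}_3} = \frac{\mathrm{SCD}_{\mathrm{O}_3}}{\mathrm{AMF}}
```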


2021 ◽  
Author(s):  
Ahmed Attia ◽  
Matthew Lawrence

Abstract. Distributed Fiber Optics (DFO) technology has become the new face of unconventional well diagnostics. The technology focuses on Distributed Acoustic Sensing (DAS) and Distributed Temperature Sensing (DTS) measurements to give an in-depth understanding of well productivity pre- and post-stimulation. Many different completion design strategies, both at surface and downhole, are used to obtain the best fracture network outcome; however, with complex geological features, different fracture designs, and fracture-driven interactions (FDIs) affecting nearby wells, it is difficult to gain a full understanding of completion design performance for each well. Validating completion designs and building on the learnings from each data set should be the foundation for developing each field. Capturing a data set with strong evidence of what works and what does not can help the operator make better engineering decisions, drill more efficient wells, and gauge the spacing between wells. The focus of this paper is a few case studies in the Bakken which vividly show how infill wells greatly interfered with production output. A DFO system was deployed on a 0.6" OD, 23,000-foot-long carbon fiber rod to acquire DAS and DTS for post-frac flow, completion, and interference evaluation. This paper examines the DFO measurements taken post-frac to explain the effects of interference from infill wells on completion designs; the learnings from the post-frac DFO were applied to deepen the understanding of how infill wells will perform on future pad sites. A showcase of three separate data sets from the Bakken illustrates how effective DFO technology can be in evaluating and making informed decisions on future frac completions. We also show and discuss how DFO can measure FDI events in real time and what measures can be taken to lessen the impact of negative interference caused by infill wells.


2021 ◽  
Author(s):  
Gunta Kalvāne ◽  
Andis Kalvāns ◽  
Agrita Briede ◽  
Ilmārs Krampis ◽  
Dārta Kaupe ◽  
...  

According to the Köppen climate classification, almost the entire area of Latvia belongs to the same climate type, Dfb, characterized by a humid continental climate with warm (sometimes hot) summers and cold winters. In recent decades, however, weather conditions on the western coast of Latvia have increasingly resembled a temperate maritime climate, and in this area a transition to climate type Cfb has taken place (and is still ongoing).

Temporal and spatial changes in the temperature and precipitation regime have been examined across the whole territory to identify the breaking point of the climate type shift. We used two types of climatological data sets: gridded daily temperatures from the E-OBS data set, version 21.0e (Cornes et al., 2018), and direct observations from meteorological stations (data source: Latvian Environment, Geology and Meteorology Centre). The temperature and precipitation regime has changed significantly over the last century, and seasonal and regional differences can be observed across the territory of Latvia.

We have digitized and analysed more than 47 thousand phenological records made by volunteers in the period 1970-2018. The study shows that significant seasonal changes have taken place across the Latvian landscape due to climate change (Kalvāne and Kalvāns, 2021). The largest changes have been recorded for the leaf unfolding (BBCH11) and flowering (BBCH61) phases of plants: almost 90% of the records in the database show a negative trend. The winter of 1988/1989 may be considered the breaking point; since then many phases, particularly in spring, have tended to begin earlier, while abiotic autumn phases have been characterized by later onset.

The study gives an overview of the impacts of climate change (including the climate type shift) on ecosystems in Latvia, particularly forests and semi-natural grasslands, and of temporal and spatial changes in vegetation structure and distribution areas.

This study was carried out within the framework of the ERDF project "Impact of Climate Change on Phytophenological Phases and Related Risks in the Baltic Region" (No. 1.1.1.2/VIAA/2/18/265) and the institutional research grant "Climate change and sustainable use of natural resources" of the University of Latvia (No. AAP2016/B041//ZD2016/AZ03).

Cornes, R. C., van der Schrier, G., van den Besselaar, E. J. M., and Jones, P. D.: An Ensemble Version of the E-OBS Temperature and Precipitation Data Sets, J. Geophys. Res. Atmos., 123(17), 9391–9409, doi:10.1029/2017JD028200, 2018.

Kalvāne, G. and Kalvāns, A.: Phenological trends of multi-taxonomic groups in Latvia, 1970-2018, Int. J. Biometeorol., doi:10.1007/s00484-020-02068-8, 2021.
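
The Dfb-to-Cfb transition described above hinges on a single Köppen threshold: whether the coldest month stays above -3 °C (0 °C in some variants of the scheme). A minimal sketch of that test, ignoring the precipitation ('f', no dry season) criterion:

```python
def koppen_cfb_or_dfb(monthly_temps_c):
    """Classify 12 monthly mean temperatures (deg C) as Cfb or Dfb.
    Assumes the 'f' (no dry season) criterion already holds; uses the
    -3 deg C cold-month threshold (some variants use 0 deg C)."""
    coldest, warmest = min(monthly_temps_c), max(monthly_temps_c)
    warm_months = sum(t >= 10 for t in monthly_temps_c)
    if warmest >= 22 or warm_months < 4:
        return "not *b (different summer type)"
    return "Cfb" if coldest > -3 else "Dfb"   # C: temperate, D: continental
```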

