Hierarchical Sparse Subspace Clustering (HESSC): An Automatic Approach for Hyperspectral Image Analysis

Kasra Rafiezadeh Shahi; Mahdi Khodadadzadeh; Laura Tusa; Pedram Ghamisi; Raimon Tolosana-Delgado; Richard Gloaguen

doi:10.3390/rs12152421

Remote Sensing ◽

10.3390/rs12152421 ◽

2020 ◽

Vol 12 (15) ◽

pp. 2421

Author(s):

Kasra Rafiezadeh Shahi ◽

Mahdi Khodadadzadeh ◽

Laura Tusa ◽

Pedram Ghamisi ◽

Raimon Tolosana-Delgado ◽

...

Keyword(s):

Machine Learning ◽

Clustering Algorithm ◽

Hyperspectral Image ◽

Imaging Techniques ◽

Clustering Algorithms ◽

Machine Learning Techniques ◽

Mixed Data ◽

Number Of Clusters ◽

Hyperspectral Image Analysis ◽

Learning Techniques

Hyperspectral imaging techniques are becoming one of the most important tools for remotely acquiring fine spectral information on different objects. However, hyperspectral images (HSIs) require dedicated processing for most applications. Therefore, several machine learning techniques have been proposed in recent decades. Among these, unsupervised learning techniques have become popular because they do not need any prior knowledge. Specifically, sparse subspace-based clustering algorithms have drawn special attention for clustering HSIs into meaningful groups, since such algorithms can handle high-dimensional and highly mixed data, as is the case in real-world applications. Nonetheless, sparse subspace-based clustering algorithms tend to demand high computational power and can be time-consuming. In addition, the number of clusters is usually predefined. In this paper, we propose a new hierarchical sparse subspace-based clustering algorithm (HESSC), which handles the aforementioned problems in a robust and fast manner and estimates the number of clusters automatically. In the experiments, HESSC is applied to three real drill-core samples and one well-known rural benchmark HSI dataset (Trento). To evaluate HESSC, its performance is compared quantitatively and qualitatively with state-of-the-art sparse subspace-based algorithms. In addition, for comparison with conventional clustering algorithms, HESSC’s performance is compared with K-means and FCM. The obtained clustering results demonstrate that HESSC performs well when clustering HSIs compared to the other applied clustering algorithms.

Exploring multi-modalities in weather prediction using a univariate graph based on machine learning techniques

10.5194/egusphere-egu21-11747 ◽

2021 ◽

Author(s):

Natacha Galmiche ◽

Nello Blaser ◽

Morten Brun ◽

Helwig Hauser ◽

Thomas Spengler ◽

...

Keyword(s):

Machine Learning ◽

Standard Deviation ◽

Probability Distributions ◽

Weather Prediction ◽

A Priori ◽

Clustering Algorithms ◽

Quantitative Information ◽

Machine Learning Techniques ◽

Topological Data Analysis ◽

Learning Techniques

Probability distributions based on ensemble forecasts are commonly used to assess uncertainty in weather prediction. However, interpreting these distributions is not trivial, especially in the case of multimodality with distinct likely outcomes. The conventional summary employs the mean and standard deviation across ensemble members, which works well for unimodal, Gaussian-like distributions. In the case of multimodality, this summary is misleading and discards crucial information. We aim at combining previously developed clustering algorithms in machine learning and topological data analysis to extract useful information such as the number of clusters in an ensemble. Given the chaotic behaviour of the atmosphere, machine learning techniques can provide relevant results even if no, or very little, a priori information about the data is available. In addition, topological methods that analyse the shape of the data can make results explainable. Given an ensemble of univariate time series, a graph is generated whose edges and vertices represent clusters of members, including additional information for each cluster such as the members belonging to them, their uncertainty, and their relevance according to the graph. In the case of multimodality, this approach provides relevant and quantitative information beyond the commonly used mean and standard deviation approach that helps to further characterise the predictability.
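
The point about multimodality can be illustrated with a small sketch. The ensemble values below are invented for illustration, and the largest-gap split is only a stand-in for the clustering and topological methods the abstract refers to, not the authors' approach.

```python
from statistics import mean, pstdev

# Hypothetical ensemble forecast for one lead time (e.g. temperature in °C).
# Half of the members predict a cold outcome, half a warm one.
ensemble = [1.0, 1.5, 2.0, 1.2, 1.8, 9.0, 9.5, 10.0, 9.2, 9.8]

def split_at_largest_gap(values):
    """Split sorted values into two groups at the widest gap between neighbours."""
    ordered = sorted(values)
    gaps = [b - a for a, b in zip(ordered, ordered[1:])]
    cut = gaps.index(max(gaps)) + 1
    return ordered[:cut], ordered[cut:]

ens_mean = mean(ensemble)      # lies between the two modes
ens_std = pstdev(ensemble)     # large, but says nothing about the two groups
low, high = split_at_largest_gap(ensemble)

print(f"mean={ens_mean:.2f}, std={ens_std:.2f}")
print(f"group means: {mean(low):.2f} and {mean(high):.2f}")
```

In this toy case the ensemble mean falls in a region where no member actually lies, which is the kind of information loss the abstract describes.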

Hamming Distance based Clustering Algorithm

International Journal of Information Retrieval Research ◽

10.4018/ijirr.2012010102 ◽

2012 ◽

Vol 2 (1) ◽

pp. 11-20 ◽

Cited By ~ 3

Author(s):

Ritu Vijay ◽

Prerna Mahajan ◽

Rekha Kandwal

Keyword(s):

Machine Learning ◽

Clustering Algorithm ◽

Hamming Distance ◽

Promising Result ◽

Clustering Algorithms ◽

Distribution Patterns ◽

Mixed Data ◽

Binary Representation ◽

Data Sets ◽

Performance Study

Cluster analysis has been extensively used in machine learning and data mining to discover distribution patterns in data. Clustering algorithms are generally based on a distance metric that partitions the data into small groups such that data instances in the same group are more similar to each other than to instances in different groups. In this paper, the authors extend the concept of Hamming distance to categorical data. As a data preprocessing step, they transform the data into a binary representation. The authors then use the proposed algorithm to group data points into clusters. Experiments are carried out on data sets from the UCI machine learning repository to analyze performance. They conclude that the proposed algorithm shows promising results and can be extended to handle numeric as well as mixed data.
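
The pipeline described above (binary encoding, then Hamming distance, then grouping) might look roughly like the following sketch. The one-hot encoding, the greedy threshold grouping, the `max_dist` parameter, and the sample records are all assumptions made for illustration; the abstract does not specify the authors' encoding or clustering rule.

```python
def one_hot_encode(records):
    """Turn categorical records into binary vectors, one bit per (column, value) pair."""
    n_cols = len(records[0])
    # A sorted vocabulary per column keeps the bit layout deterministic.
    vocab = [sorted({r[c] for r in records}) for c in range(n_cols)]
    encoded = []
    for r in records:
        bits = []
        for c in range(n_cols):
            bits.extend(1 if r[c] == v else 0 for v in vocab[c])
        encoded.append(bits)
    return encoded

def hamming(a, b):
    """Number of positions at which two equal-length bit vectors differ."""
    return sum(x != y for x, y in zip(a, b))

def threshold_clusters(vectors, max_dist):
    """Greedy grouping (hypothetical rule): join the first cluster whose seed is within max_dist."""
    clusters = []  # each cluster is a list of indices; element 0 is the seed
    for i, v in enumerate(vectors):
        for cluster in clusters:
            if hamming(v, vectors[cluster[0]]) <= max_dist:
                cluster.append(i)
                break
        else:
            clusters.append([i])
    return clusters

records = [
    ("red", "small", "round"),
    ("red", "small", "oval"),
    ("blue", "large", "square"),
    ("blue", "large", "round"),
]
vectors = one_hot_encode(records)
clusters = threshold_clusters(vectors, max_dist=2)
print(clusters)
```

With one-hot encoding, two records that differ in a single attribute are at Hamming distance 2, so `max_dist=2` groups records that disagree on at most one attribute with the cluster seed.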

A Novel Semi-Supervised Fuzzy C-Means Clustering Algorithm Using Multiple Fuzzification Coefficients

Algorithms ◽

10.3390/a14090258 ◽

2021 ◽

Vol 14 (9) ◽

pp. 258

Author(s):

Tran Dinh Khang ◽

Manh-Kien Tran ◽

Michael Fowler

Keyword(s):

Machine Learning ◽

Clustering Algorithm ◽

Machine Learning Techniques ◽

Unsupervised Machine Learning ◽

Practical Applications ◽

Fuzzy C Means ◽

Learning Techniques ◽

Fuzzy C Means Clustering ◽

Data Points ◽

Data Elements

Clustering is an unsupervised machine learning method with many practical applications that has gathered extensive research interest. It is a technique for dividing data elements into clusters such that elements in the same cluster are similar. Clustering belongs to the group of unsupervised machine learning techniques, meaning that there is no information about the labels of the elements. However, when some information about the data points is known in advance, it is beneficial to use a semi-supervised algorithm. Among the many clustering techniques available, fuzzy C-means clustering (FCM) is a common one. To make the FCM algorithm a semi-supervised method, it was proposed in the literature to use an auxiliary matrix to adjust the membership grade of the elements to force them into certain clusters during the computation. In this study, instead of using the auxiliary matrix, we proposed to use multiple fuzzification coefficients to implement the semi-supervision component. After deriving the proposed semi-supervised fuzzy C-means clustering algorithm with multiple fuzzification coefficients (sSMC-FCM), we demonstrated the convergence of the algorithm and validated the efficiency of the method through a numerical example.
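
For orientation, here is a minimal sketch of the standard FCM iteration with a single fuzzification coefficient `m`, on 1-D data. It is not the sSMC-FCM algorithm from the paper, whose use of multiple coefficients is not detailed in the abstract; the data, initial centres, and iteration count are assumptions.

```python
def memberships(points, centers, m=2.0):
    """Standard FCM membership update for 1-D data with a single fuzzifier m > 1."""
    power = 2.0 / (m - 1.0)
    table = []
    for x in points:
        dists = [abs(x - c) for c in centers]
        if 0.0 in dists:
            # A point sitting exactly on a centre belongs fully to that centre.
            hit = dists.index(0.0)
            row = [1.0 if j == hit else 0.0 for j in range(len(centers))]
        else:
            row = [1.0 / sum((d_i / d_j) ** power for d_j in dists) for d_i in dists]
        table.append(row)
    return table

def update_centers(points, table, m=2.0):
    """Each centre is the mean of the points weighted by membership ** m."""
    centers = []
    for j in range(len(table[0])):
        weights = [row[j] ** m for row in table]
        centers.append(sum(w * x for w, x in zip(weights, points)) / sum(weights))
    return centers

def fcm(points, centers, m=2.0, iterations=50):
    """Alternate membership and centre updates for a fixed number of iterations."""
    for _ in range(iterations):
        table = memberships(points, centers, m)
        centers = update_centers(points, table, m)
    return centers, memberships(points, centers, m)

points = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
centers, table = fcm(points, centers=[0.0, 6.0])
print(centers)
```

A larger `m` makes memberships softer, which is why varying the coefficient per data point is a plausible way to encode how strongly a point should be tied to a cluster.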

Using Machine-Learning Techniques to Identify Responders vs. Non-responders in Randomized Clinical Trials.

10.1101/2020.11.21.20232041 ◽

2020 ◽

Author(s):

Vasiliki Nikolodimou ◽

Paul Agapow

Keyword(s):

Machine Learning ◽

Randomized Clinical Trials ◽

Clustering Algorithms ◽

Human Monoclonal Antibody ◽

Clinical History ◽

Differential Response ◽

Unsupervised Clustering ◽

Machine Learning Techniques ◽

Genetic Characteristics ◽

Learning Techniques

Despite the expectation of heterogeneity in therapy outcomes, especially for complex diseases like cancer, analyzing differential response to experimental therapies in a randomized clinical trial (RCT) setting is typically done by dividing patients into responders and non-responders, usually based on a single endpoint. Given the existence of biological and patho-physiological differences among metastatic colorectal cancer (mCRC) patients, we hypothesized that a data-driven analysis of an RCT population's outcomes can identify sub-types of patients based on differential response to Panitumumab - a fully human monoclonal antibody directed against the epidermal growth factor receptor. Outcome and response data of the RCT population were mined with heuristic, distance-based and model-based unsupervised clustering algorithms. The population sub-groups obtained by the best performing clustering approach were then examined in terms of molecular and clinical characteristics. The utility of this characterization was compared against that of the sub-groups obtained by the conventional responders' analysis and then contrasted with aetiological evidence around mCRC heterogeneity and biological functioning. The Partition around Medoids clustering method results in the identification of seven sub-types of patients, statistically distinct from each other in survival outcomes, prognostic biomarkers and genetic characteristics. Conventional responders' analysis proved inferior in uncovering relationships between physical, clinical history, genetic attributes and differential treatment resistance mechanisms. Combined with improved characterization of the molecular subtypes of CRC, applying machine learning techniques, like unsupervised clustering, to the wealth of data already collected by previous RCTs can support the design of further targeted, more efficient RCTs and better identification of patient groups who will respond to a given intervention.

Using Real-Time Data and Unsupervised Machine Learning Techniques to Study Large-Scale Spatio–Temporal Characteristics of Wastewater Discharges and their Influence on Surface Water Quality in the Yangtze River Basin

Water ◽

10.3390/w11061268 ◽

2019 ◽

Vol 11 (6) ◽

pp. 1268 ◽

Cited By ~ 3

Author(s):

Zhenzhen Di ◽

Miao Chang ◽

Peikun Guo ◽

Yang Li ◽

Yin Chang

Keyword(s):

Machine Learning ◽

Surface Water ◽

Real Time ◽

Yangtze River Basin ◽

Clustering Algorithms ◽

Machine Learning Techniques ◽

Unsupervised Machine Learning ◽

Learning Techniques ◽

The Yangtze River Basin ◽

Spatio Temporal

Most industrial wastewater worldwide, including in China, is still discharged directly to aquatic environments without adequate treatment. Because of a lack of data and few methods, the relationships between pollutants discharged in wastewater and those in surface water have not been fully revealed, and unsupervised machine learning techniques, such as clustering algorithms, have been neglected in related research fields. In this study, real-time monitoring data for chemical oxygen demand (COD), ammonia nitrogen (NH3-N), pH, and dissolved oxygen in the wastewater discharged from 2213 factories and in the surface water at 18 monitoring sections (sites) in 7 administrative regions in the Yangtze River Basin from 2016 to 2017 were collected and analyzed by the partitioning around medoids (PAM) and expectation–maximization (EM) clustering algorithms, Welch t-test, Wilcoxon test, and Spearman correlation. The results showed that compared with the spatial cluster comprising unpolluted sites, the spatial cluster comprising heavily polluted sites, where more wastewater was discharged, had relatively high COD (>100 mg L−1) and NH3-N (>6 mg L−1) concentrations and relatively low pH (<6) from 15 industrial classes that respected the different discharge limits outlined in the pollutant discharge standards. The results also showed that the economic activities generating wastewater and the geographical distribution of the heavily polluted wastewater changed from 2016 to 2017, such that the concentration ranges of pollutants in discharges widened and the contributions from some emerging enterprises became more important. The correlations between the quality of the wastewater and the surface water strengthened as the whole-year data sets were reduced to the heavily polluted periods by the EM clustering and water quality evaluation. This study demonstrates how unsupervised machine learning algorithms play an objective and effective role in data mining real-time monitoring information and highlighting spatio–temporal relationships between pollutants in wastewater discharges and surface water to support scientific water resource management.
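
As a rough illustration of the partitioning around medoids (PAM) idea mentioned above, the sketch below runs a simplified greedy swap search on 1-D values. The naive initialisation (first `k` points), the absence of PAM's BUILD phase, and the sample values are assumptions; this is not the study's implementation.

```python
def total_cost(points, medoids):
    """Sum of distances from every point to its nearest medoid."""
    return sum(min(abs(p - m) for m in medoids) for p in points)

def pam(points, k):
    """Greedy swap search: replace a medoid with a non-medoid whenever cost drops."""
    medoids = list(points[:k])  # naive initialisation (assumption)
    improved = True
    while improved:
        improved = False
        for i in range(k):
            for candidate in points:
                if candidate in medoids:
                    continue
                trial = medoids[:i] + [candidate] + medoids[i + 1:]
                if total_cost(points, trial) < total_cost(points, medoids):
                    medoids, improved = trial, True
    return sorted(medoids)

# Hypothetical 1-D measurements forming two groups.
values = [1, 2, 3, 10, 11, 12]
medoids = pam(values, k=2)
print(medoids, total_cost(values, medoids))
```

Unlike K-means centroids, medoids are always actual observations, which keeps each cluster representative interpretable as a real monitoring record.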

Analysis of Cattle Social Transitional Behaviour: Attraction and Repulsion

Sensors ◽

10.3390/s20185340 ◽

2020 ◽

Vol 20 (18) ◽

pp. 5340

Author(s):

Haocheng Xu ◽

Shenghong Li ◽

Caroline Lee ◽

Wei Ni ◽

David Abbott ◽

...

Keyword(s):

Machine Learning ◽

Social Interactions ◽

Management Practices ◽

Clustering Algorithms ◽

Physical Distance ◽

Machine Learning Techniques ◽

Learning Approaches ◽

Video Recordings ◽

Learning Techniques ◽

The Social

Understanding social interactions in livestock groups could improve management practices, but this can be difficult and time-consuming using traditional methods of live observations and video recordings. Sensor technologies and machine learning techniques could provide insight not previously possible. In this study, based on the animals’ location information acquired by a new cooperative wireless localisation system, unsupervised machine learning approaches were applied to identify the social structure of a small group of cattle yearlings (n=10) and the social behaviour of an individual. The paper first defined the affinity between an animal pair based on the ranks of their distance. Unsupervised clustering algorithms were then applied, including K-means clustering and agglomerative hierarchical clustering. In particular, K-means clustering was applied based on logical and physical distance. By comparing the clustering results based on logical distance and physical distance, the leader animals and the influence of an individual in a herd of cattle were identified, which provides valuable information for studying the behaviour of animal herds. Improvements in device robustness and replication of this work would confirm the practical application of this technology and analysis methodologies.
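
One way a rank-based affinity between animal pairs could be computed is sketched below. The abstract does not give the paper's exact definition, so the mutual-rank average used here, the function name `rank_affinity`, and the positions are hypothetical.

```python
from math import dist

def rank_affinity(positions):
    """Return {(a, b): score}; a lower score means the pair is mutually closer in rank."""
    names = sorted(positions)
    rank = {}
    for a in names:
        others = sorted((n for n in names if n != a),
                        key=lambda n: dist(positions[a], positions[n]))
        for r, b in enumerate(others, start=1):
            rank[(a, b)] = r  # b is the r-th nearest neighbour of a
    # Hypothetical affinity: average of the two directed ranks.
    return {(a, b): (rank[(a, b)] + rank[(b, a)]) / 2
            for i, a in enumerate(names) for b in names[i + 1:]}

# Invented 2-D positions (metres) for four animals at one time step.
positions = {"A": (0.0, 0.0), "B": (1.0, 0.0), "C": (10.0, 0.0), "D": (11.0, 1.0)}
affinity = rank_affinity(positions)
print(affinity)
```

Working with ranks rather than raw distances makes the score insensitive to how spread out the herd is at a given moment, which may be why a rank-based ("logical") distance is contrasted with physical distance in the study.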

Leaf Disease Detection Using AI

Data Preprocessing, Active Learning, and Cost Perceptive Approaches for Resolving Data Imbalance - Advances in Data Mining and Database Management ◽

10.4018/978-1-7998-7371-6.ch006 ◽

2021 ◽

pp. 110-136

Author(s):

Praveen Kumar Maduri ◽

Tushar Biswas ◽

Preeti Dhiman ◽

Apurva Soni ◽

Kushagra Singh

Keyword(s):

Machine Learning ◽

Clustering Algorithms ◽

Essential Elements ◽

Plant Diseases ◽

Machine Learning Techniques ◽

Learning Techniques ◽

Initial Stage ◽

Large Farm ◽

Time And Energy ◽

Financial Losses

Plants play a significant role in everyone's life. They provide us with essential elements like food, oxygen, and shelter, so plants must be supervised and nurtured properly. During cultivation, crops are prone to different kinds of diseases, which can severely damage the whole yield and lead to financial losses for farmers. In the last 10 years, researchers have used different machine learning techniques to detect disease on plants, but the methods were either not efficient enough to be implemented or not able to cover the wide area in which plant diseases can be detected. The author has therefore introduced a method that is efficient enough to easily detect plant disease and can be implemented in large fields. The method uses a combination of CNN and k-means clustering algorithms. With this method, crop disease is detected by analyzing the leaves, and users are notified so they can act in the initial stage. Thus, the proposed method prevents whole crops from being damaged and saves farmers time and energy, as disease will be identified well before a human eye can detect it on a large farm.

Abstract 2108: Application of random forest machine learning techniques on mixed data from breast cancer studies

10.1158/1538-7445.am2020-2108 ◽

2020 ◽

Author(s):

Jelmar Quist ◽

Lawson Taylor ◽

Johan Staaf ◽

Anita Grigoriadis

Keyword(s):

Breast Cancer ◽

Machine Learning ◽

Random Forest ◽

Machine Learning Techniques ◽

Mixed Data ◽

Learning Techniques ◽

Cancer Studies

An ensemble machine learning model based on multiple filtering and supervised attribute clustering algorithm for classifying cancer samples

PeerJ Computer Science ◽

10.7717/peerj-cs.671 ◽

2021 ◽

Vol 7 ◽

pp. e671

Author(s):

Shilpi Bose ◽

Chandra Das ◽

Abhik Banerjee ◽

Kuntal Ghosh ◽

Matangini Chattopadhyay ◽

...

Keyword(s):

Machine Learning ◽

Clustering Algorithm ◽

Class Imbalance ◽

Classification Model ◽

Machine Learning Techniques ◽

Class Imbalance Problem ◽

Imbalance Problem ◽

Ensemble Machine Learning ◽

Learning Techniques ◽

Attribute Clustering

Background: Machine learning is one kind of machine intelligence technique that learns from data and detects inherent patterns in large, complex datasets. Due to this capability, machine learning techniques are widely used in medical applications, especially where large-scale genomic and proteomic data are used. Cancer classification based on bio-molecular profiling data is a very important topic for medical applications since it improves the diagnostic accuracy of cancer and enables a successful culmination of cancer treatments. Hence, machine learning techniques are widely used in cancer detection and prognosis. Methods: In this article, a new ensemble machine learning classification model named Multiple Filtering and Supervised Attribute Clustering algorithm based Ensemble Classification model (MFSAC-EC) is proposed, which can handle the class imbalance problem and the high dimensionality of microarray datasets. This model first generates a number of bootstrapped datasets from the original training data, where an oversampling procedure is applied to handle the class imbalance problem. The proposed MFSAC method is then applied to each of these bootstrapped datasets to generate sub-datasets, each of which contains a subset of the most relevant/informative attributes of the original dataset. The MFSAC method is a feature selection technique combining multiple filters with a new supervised attribute clustering algorithm. Then, for every sub-dataset, a base classifier is constructed separately, and finally, the predictions of these base classifiers are combined using the majority voting technique, forming the MFSAC-based ensemble classifier. Also, a number of the most informative attributes are selected as important features based on their frequency of occurrence in these sub-datasets. Results: To assess the performance of the proposed MFSAC-EC model, it is applied to different high-dimensional microarray gene expression datasets for cancer sample classification. The proposed model is compared with well-known existing models to establish its effectiveness. From the experimental results, it has been found that the generalization performance/testing accuracy of the proposed classifier is significantly better than that of other well-known existing models. Apart from that, it has also been found that the proposed model can identify many important attributes/biomarker genes.
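
The bootstrap-then-vote structure described in the Methods can be sketched as follows. This covers only the ensemble skeleton: the MFSAC feature selection step is omitted, the per-class resampling is a simplification rather than the paper's oversampling procedure, and the nearest-centroid base classifier and the data are invented for illustration.

```python
import random
from collections import Counter

def stratified_bootstrap(data, rng):
    """Resample with replacement inside each class so no class disappears."""
    by_label = {}
    for x, label in data:
        by_label.setdefault(label, []).append((x, label))
    sample = []
    for items in by_label.values():
        sample.extend(rng.choice(items) for _ in items)
    return sample

def train_centroid_classifier(sample):
    """Toy base classifier: predict the class whose feature mean is nearest."""
    sums, counts = {}, {}
    for x, label in sample:
        sums[label] = sums.get(label, 0.0) + x
        counts[label] = counts.get(label, 0) + 1
    centroids = {label: sums[label] / counts[label] for label in sums}
    return lambda x: min(centroids, key=lambda label: abs(x - centroids[label]))

def majority_vote(classifiers, x):
    """Return the label predicted by the largest number of base classifiers."""
    votes = Counter(clf(x) for clf in classifiers)
    return votes.most_common(1)[0][0]

rng = random.Random(0)
# Hypothetical single-feature training data (e.g. one expression level).
train = [(0.9, "normal"), (1.1, "normal"), (1.0, "normal"),
         (3.0, "tumour"), (3.2, "tumour"), (2.9, "tumour")]
classifiers = [train_centroid_classifier(stratified_bootstrap(train, rng))
               for _ in range(11)]
print(majority_vote(classifiers, 1.0), majority_vote(classifiers, 3.1))
```

An odd number of base classifiers avoids ties in a two-class vote.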

Application of Visible/Infrared Spectroscopy and Hyperspectral Imaging With Machine Learning Techniques for Identifying Food Varieties and Geographical Origins

Frontiers in Nutrition ◽

10.3389/fnut.2021.680357 ◽

2021 ◽

Vol 8 ◽

Author(s):

Lei Feng ◽

Baohua Wu ◽

Susu Zhu ◽

Yong He ◽

Chu Zhang

Keyword(s):

Machine Learning ◽

Infrared Spectroscopy ◽

Hyperspectral Imaging ◽

Food Quality ◽

Near Infrared ◽

Imaging Techniques ◽

Machine Learning Techniques ◽

Research Progress ◽

Food Fraud ◽

Learning Techniques

Food quality and safety are strongly related to human health. Food quality varies with variety and geographical origin, and food fraud is becoming a threat to domestic and global markets. Visible/infrared spectroscopy and hyperspectral imaging techniques, as rapid and non-destructive analytical methods, have been widely utilized to trace food varieties and geographical origins. In this review, we outline recent research progress on identifying food varieties and geographical origins using visible/infrared spectroscopy and hyperspectral imaging with the help of machine learning techniques. The applications of visible, near-infrared, and mid-infrared spectroscopy as well as hyperspectral imaging techniques to crop food, beverages, fruits, nuts, meat, oil, and some other kinds of food are reviewed. Furthermore, existing challenges and prospects are discussed. In general, the existing machine learning techniques yield satisfactory classification results. Follow-up research on the traceability of food varieties and geographical origins and the development of real-time detection equipment are still needed.
