DPDRC, a Novel Machine Learning Method about the Decision Process for Dimensionality Reduction before Clustering

AI ◽  
2021 ◽  
Vol 3 (1) ◽  
pp. 1-22
Author(s):  
Jean-Sébastien Dessureault ◽  
Daniel Massicotte

This paper examines the critical decision process of reducing the dimensionality of a dataset before applying a clustering algorithm. Choosing between extracting and selecting features is always a challenge. Evaluating the importance of the features is not straightforward, since the most popular methods for doing so are usually intended for supervised learning. This paper proposes a novel method called “Decision Process for Dimensionality Reduction before Clustering” (DPDRC). It chooses the best dimensionality reduction method (selection or extraction) according to the data scientist’s parameters and the profile of the data, with the aim of applying a clustering process at the end. It uses a Feature Ranking Process Based on Silhouette Decomposition (FRSD) algorithm, a Principal Component Analysis (PCA) algorithm, and a K-means algorithm along with its metric, the Silhouette Index (SI). This paper presents five scenarios based on different parameters. This research also discusses the impacts, advantages, and disadvantages of each choice that can be made in this unsupervised learning process.
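
To make the role of the Silhouette Index (SI) concrete, here is a minimal pure-Python sketch. It is not the authors' DPDRC or FRSD implementation: the toy points, the fixed labels (standing in for a K-means result), and the helper name `silhouette_index` are assumptions made for this illustration. It compares the SI computed on both features with the SI computed after keeping only the first feature.

```python
import math

def silhouette_index(points, labels):
    """Mean silhouette over all points; needs at least two clusters."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("the silhouette needs at least two clusters")

    scores = []
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(math.dist(points[i], points[j]) for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        scores.append(0.0 if denom == 0 else (b - a) / denom)
    return sum(scores) / len(scores)

# Hypothetical toy data: feature 0 separates the two groups, feature 1 is noise.
points = [(0.0, 5.0), (0.2, 1.0), (0.1, 9.0), (5.0, 2.0), (5.2, 8.0), (5.1, 4.0)]
labels = [0, 0, 0, 1, 1, 1]  # assumed to come from a prior K-means run

si_all = silhouette_index(points, labels)
si_first = silhouette_index([(p[0],) for p in points], labels)
print(f"SI with both features:  {si_all:.3f}")
print(f"SI with feature 0 only: {si_first:.3f}")
```

On this toy data, dropping the noisy feature raises the SI. That is the kind of signal a feature selection step could use; how DPDRC actually weighs selection against extraction is described in the paper itself.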

Author(s):  
Hsein Kew

In this paper, we propose a method to generate an audio output based on spectroscopy data in order to discriminate two classes of data, based on the features of our spectral dataset. To do this, we first perform spectral pre-processing, and then extract features, followed by machine learning, for dimensionality reduction. The features are then mapped to the parameters of a sound synthesiser, as part of the audio processing, so as to generate audio samples in order to compute statistical results and identify important descriptors for the classification of the dataset. To optimise the process, we compare Amplitude Modulation (AM) and Frequency Modulation (FM) synthesis, as applied to two real-life datasets to evaluate the performance of sonification as a method for discriminating data. FM synthesis provides a higher subjective classification accuracy than AM synthesis. We then further compare the dimensionality reduction methods of Principal Component Analysis (PCA) and Linear Discriminant Analysis in order to optimise our sonification algorithm. The classification results using FM synthesis as the sound synthesiser and PCA as the dimensionality reduction method yield mean classification accuracies of 93.81% and 88.57% for the coffee dataset and the fruit puree dataset respectively, and indicate that this spectroscopic analysis model is able to provide relevant information on the spectral data, and most importantly, is able to discriminate accurately between the two spectra and thus provides a complementary tool to supplement current methods.


2021 ◽  
Vol 10 (4) ◽  
pp. 2170-2180
Author(s):  
Untari N. Wisesty ◽  
Tati Rajab Mengko

This paper analyzes SARS-CoV-2 genome variation by comparing the results of genome clustering using several clustering algorithms and the distribution of sequences in each cluster. The clustering algorithms used are K-means, Gaussian mixture models, agglomerative hierarchical clustering, mean-shift clustering, and DBSCAN. However, clustering algorithms are weak at grouping data with very high dimensions, such as genome data, so a dimensionality reduction process is needed. In this research, dimensionality reduction was carried out using principal component analysis (PCA) and an autoencoder method, with three models that produce 2, 10, and 50 features. The main contributions are the dimensionality reduction and clustering scheme for SARS-CoV-2 sequence data and the performance analysis of each experiment on each scheme and the hyperparameters of each method. Based on the results of the experiments conducted, PCA and the DBSCAN algorithm achieve the highest silhouette score of 0.8770 with three clusters when using two features. However, dimensionality reduction using the autoencoder needs more iterations to converge. In the testing process with Indonesian sequence data, more than half of the sequences fall into one cluster and the rest are distributed across the other two clusters.
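
The clustering step on two reduced features can be sketched with a small pure-Python DBSCAN. This is an illustration only and not the authors' pipeline: the 2-D points are hypothetical stand-ins for sequences already projected onto two features, and `eps` and `min_pts` are arbitrary values chosen for the toy data.

```python
import math
from collections import Counter

def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters, -1 for noise."""
    n = len(points)
    labels = [None] * n  # None means "not visited yet"

    def neighbours(i):
        # Indices of all points within eps of point i (including i itself).
        return [j for j in range(n) if math.dist(points[i], points[j]) <= eps]

    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # provisional noise; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reachable from a core point is a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbours(j)
            if len(reachable) >= min_pts:  # j is a core point, keep expanding
                queue.extend(reachable)
    return labels

# Hypothetical 2-feature embedding: three tight groups plus one outlier.
points = [
    (0.0, 0.0), (0.0, 0.1), (0.1, 0.0),
    (5.0, 5.0), (5.0, 5.1), (5.1, 5.0),
    (10.0, 0.0), (10.0, 0.1), (10.1, 0.0),
    (20.0, 20.0),
]
labels = dbscan(points, eps=0.5, min_pts=3)
sizes = Counter(labels)  # distribution of points per cluster (-1 is noise)
print(labels)
print(dict(sizes))
```

Counting the labels, as in `sizes`, gives the per-cluster distribution that the abstract reports for the Indonesian test sequences.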


2021 ◽  
Vol 15 ◽  
Author(s):  
Jiasong Wu ◽  
Xiang Qiu ◽  
Jing Zhang ◽  
Fuzhi Wu ◽  
Youyong Kong ◽  
...  

Generative adversarial networks (GANs) and variational autoencoders (VAEs) provide impressive image generation from Gaussian white noise, but both are difficult to train, since they need a generator (or encoder) and a discriminator (or decoder) to be trained simultaneously, which can easily lead to unstable training. To solve or alleviate these synchronous training problems of GANs and VAEs, researchers recently proposed generative scattering networks (GSNs), which use wavelet scattering networks (ScatNets) as the encoder to obtain features (or ScatNet embeddings) and convolutional neural networks (CNNs) as the decoder to generate an image. The advantage of GSNs is that the parameters of ScatNets do not need to be learned, while the disadvantage of GSNs is that their ability to obtain representations of ScatNets is slightly weaker than that of CNNs. In addition, the dimensionality reduction method of principal component analysis (PCA) can easily lead to overfitting in the training of GSNs and, therefore, affect the quality of generated images in the testing process. To further improve the quality of generated images while keeping the advantages of GSNs, this study proposes generative fractional scattering networks (GFRSNs), which use more expressive fractional wavelet scattering networks (FrScatNets) instead of ScatNets as the encoder to obtain features (or FrScatNet embeddings) and use CNNs similar to those of GSNs as the decoder to generate an image. Additionally, this study develops a new dimensionality reduction method named feature-map fusion (FMF) instead of performing PCA to better retain the information of FrScatNets; it also discusses the effect of image fusion on the quality of the generated image. The experimental results obtained on the CIFAR-10 and CelebA datasets show that the proposed GFRSNs can lead to better generated images than the original GSNs on testing datasets. The experimental results of the proposed GFRSNs with deep convolutional GAN (DCGAN), progressive GAN (PGAN), and CycleGAN are also given.


2021 ◽  
Author(s):  
John B. Lemos ◽  
Matheus R. S. Barbosa ◽  
Edric B. Troccoli ◽  
Alexsandro G. Cerqueira

This work aims to delimit the Direct Hydrocarbon Indicators (DHI) zones using the Gaussian Mixture Models (GMM) algorithm, an unsupervised machine learning method, over the FS8 seismic horizon in the seismic data of the Dutch F3 Field. The dataset used to perform the cluster analysis was extracted from the 3D seismic dataset. It comprises the following seismic attributes: Sweetness, Spectral Decomposition, Acoustic Impedance, Coherence, and Instantaneous Amplitude. The Principal Component Analysis (PCA) algorithm was applied to the original dataset for dimensionality reduction and noise filtering, and we chose the first three principal components as the input of the clustering algorithm. The cluster analysis using the Gaussian Mixture Models was performed by varying the number of groups from 2 to 20. The Elbow Method suggested a smaller number of groups than needed to isolate the DHI zones. Therefore, we observed that four is the optimal number of clusters to highlight this seismic feature. Furthermore, it was possible to interpret other clusters related to the lithology through geophysical well log data.
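
The step of keeping the first three principal components of a five-attribute dataset can be sketched as follows. This is a rough standard-library illustration using power iteration with deflation, not the authors' workflow: the function name `pca_top_k`, the synthetic rows, and the iteration count are all assumptions. The fixed start vector would fail if it happened to be orthogonal to a leading component, which is not the case for this toy data.

```python
import math

def pca_top_k(rows, k, iters=500):
    """Return (components, variances, scores, explained_ratio) for the top-k PCs."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix of the centered attributes.
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    total_variance = sum(cov[j][j] for j in range(d))

    components, variances = [], []
    for _component in range(k):
        v = [1.0 / math.sqrt(d)] * d  # fixed start vector (an assumption)
        for _step in range(iters):
            w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:
                break  # no variance left in any direction
            v = [x / norm for x in w]
        # Rayleigh quotient: variance captured along direction v.
        lam = sum(v[a] * sum(cov[a][b] * v[b] for b in range(d)) for a in range(d))
        components.append(v)
        variances.append(lam)
        # Deflation: remove the component just found from the covariance.
        cov = [[cov[a][b] - lam * v[a] * v[b] for b in range(d)] for a in range(d)]

    scores = [[sum(c[j] * comp[j] for j in range(d)) for comp in components]
              for c in centered]
    return components, variances, scores, sum(variances) / total_variance

# Synthetic stand-in for five seismic attributes (values are hypothetical).
scales = [3.0, 2.0, 1.0, 0.5, 0.25]
rows = []
for j, s in enumerate(scales):
    for sign in (1.0, -1.0):
        row = [0.0] * len(scales)
        row[j] = sign * s
        rows.append(row)

components, variances, scores, explained = pca_top_k(rows, k=3)
print("variance per component:", [round(x, 4) for x in variances])
print("share of variance kept:", round(explained, 4))
```

The rows of `scores` (three values per sample) are what would be passed to the clustering algorithm in place of the five original attributes.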


2018 ◽  
Vol 11 (2) ◽  
pp. 41-51 ◽  
Author(s):  
I. Ya. Lukasevich

The subject of the research is new tools for business financing using the initial coin offering (ICO) in the context of the development of cryptocurrencies and the blockchain technologies as their basis. The purpose of the work was to analyze the advantages and disadvantages of the ICO in comparison with traditional financial tools as well as prospects, limitations and problems of using digital financial tools. Conclusions are made in relation to possibilities, limitations and application areas of digital business financing tools, particularly in the real sector, taking into account the specifics of the Russian economy and legislation. It is shown that the main problems of using the digital financial tools are related to the economic sphere and caused by the lack of adequate approaches to evaluation of assets as well as the shortage of objective information. The problems and new tasks of corporate finance in the digital economy are defined.


2020 ◽  
Vol 26 ◽  
Author(s):  
Emir Muzurović ◽  
Zoja Stanković ◽  
Zlata Kovačević ◽  
Benida Šahmanović Škrijelj ◽  
Dimitri P Mikhailidis

Diabetes mellitus (DM) is a chronic and complex metabolic disorder, and also an important cause of cardiovascular (CV) diseases (CVDs). Subclinical inflammation, observed in patients with type 2 DM (T2DM), cannot be considered the sole or primary cause of T2DM in the absence of classical risk factors, but it represents an important mechanism that serves as a bridge between primary causes of T2DM and its manifestation. Progress has been made in the identification of effective strategies to prevent or delay the onset of T2DM. It is important to identify those at increased risk for DM by using specific biomarkers. Inflammatory markers correlate with insulin resistance (IR) and glycoregulation in patients with DM. Also, several inflammatory markers have been shown to be useful in assessing the risk of developing DM and its complications. However, the intertwining of pathophysiological processes and the limited specificity of inflammatory markers for certain clinical entities limit their practical use. In this review we consider the advantages and disadvantages of various inflammatory biomarkers of DM that have been investigated to date as well as possible future directions. Key features of such biomarkers should be high specificity, non-invasiveness and cost-effectiveness.


2020 ◽  
Vol 15 ◽  
Author(s):  
Shuwen Zhang ◽  
Qiang Su ◽  
Qin Chen

Major animal diseases pose a great threat to animal husbandry and human beings. With the deepening of globalization and the abundance of data resources, the prediction and analysis of animal diseases by using big data are becoming more and more important. The focus of machine learning is to make computers learn how to learn from data and use the learned experience to analyze and predict. Firstly, this paper introduces the animal epidemic situation and machine learning. Then it briefly introduces the application of machine learning in animal disease analysis and prediction. Machine learning is mainly divided into supervised learning and unsupervised learning. Supervised learning includes support vector machines, naive Bayes, decision trees, random forests, logistic regression, artificial neural networks, deep learning, and AdaBoost. Unsupervised learning includes the expectation-maximization algorithm, principal component analysis, hierarchical clustering, and MaxEnt. Through the discussion in this paper, readers can gain a clearer concept of machine learning and understand its application prospects in animal diseases.


2021 ◽  
Vol 13 (3) ◽  
pp. 526
Author(s):  
Shengliang Pu ◽  
Yuanfeng Wu ◽  
Xu Sun ◽  
Xiaotong Sun

The nascent field of graph representation learning has shown superiority for resolving graph data. Compared to conventional convolutional neural networks, graph-based deep learning has the advantages of illustrating class boundaries and modeling feature relationships. Faced with hyperspectral image (HSI) classification, the priority problem might be how to convert hyperspectral data into irregular domains from regular grids. In this regard, we present a novel method that performs the localized graph convolutional filtering on HSIs based on spectral graph theory. First, we conducted principal component analysis (PCA) preprocessing to create localized hyperspectral data cubes with unsupervised feature reduction. These feature cubes combined with localized adjacency matrices were fed into the popular graph convolution network in a standard supervised learning paradigm. Finally, we succeeded in analyzing diversified land covers by considering local graph structure with graph convolutional filtering. Experiments on real hyperspectral datasets demonstrated that the presented method offers promising classification performance compared with other popular competitors.


2021 ◽  
Vol 13 (9) ◽  
pp. 4648
Author(s):  
Rana Muhammad Adnan ◽  
Kulwinder Singh Parmar ◽  
Salim Heddam ◽  
Shamsuddin Shahid ◽  
Ozgur Kisi

The accurate estimation of suspended sediments (SSs) carries significance in determining the volume of dam storage, river carrying capacity, pollution susceptibility, soil erosion potential, aquatic ecological impacts, and the design and operation of hydraulic structures. The presented study proposes a new method for accurately estimating daily SSs using antecedent discharge and sediment information. The novel method is developed by hybridizing the multivariate adaptive regression spline (MARS) and the K-means clustering algorithm (MARS–KM). The proposed method’s efficacy is established by comparing its performance with the adaptive neuro-fuzzy system (ANFIS), MARS, and M5 tree (M5Tree) models in predicting SSs at two stations situated on the Yangtze River of China, according to three assessment measures: RMSE, MAE, and NSE. Two modeling scenarios are employed; data are divided 50–50% for model training and testing in the first scenario, and the training and test data sets are swapped in the second scenario. At Guangyuan Station, the MARS–KM showed a performance improvement over the ANFIS, MARS, and M5Tree methods in terms of RMSE of 39%, 30%, and 18% in the first scenario and of 24%, 22%, and 8% in the second scenario, respectively, while at Beibei Station the improvement in RMSE over ANFIS, MARS, and M5Tree was 34%, 26%, and 27% in the first scenario and 7%, 16%, and 6% in the second scenario, respectively. Additionally, the MARS–KM models provided much more satisfactory estimates using only discharge values as inputs.
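
The three assessment measures can be written down directly from their standard definitions. The sketch below is illustrative: the observed and predicted values are made up, and `rmse_improvement_pct` reflects one common way to express a percentage improvement in RMSE, which is an assumption and may differ from how the authors computed their figures.

```python
import math

def rmse(obs, pred):
    """Root mean square error."""
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(obs, pred)) / len(obs))

def mae(obs, pred):
    """Mean absolute error."""
    return sum(abs(o - p) for o, p in zip(obs, pred)) / len(obs)

def nse(obs, pred):
    """Nash-Sutcliffe efficiency: 1 is a perfect fit, 0 matches the observed mean."""
    mean_obs = sum(obs) / len(obs)
    residual = sum((o - p) ** 2 for o, p in zip(obs, pred))
    spread = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - residual / spread

def rmse_improvement_pct(baseline_rmse, candidate_rmse):
    """Relative RMSE reduction of a candidate model versus a baseline (assumed formula)."""
    return 100.0 * (baseline_rmse - candidate_rmse) / baseline_rmse

# Made-up observed and predicted sediment values, for illustration only.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.0, 2.0, 3.0, 5.0]

print("RMSE:", rmse(observed, predicted))
print("MAE: ", mae(observed, predicted))
print("NSE: ", nse(observed, predicted))
```

Lower RMSE and MAE and an NSE closer to 1 indicate a better fit, which is the sense in which the models are ranked in the abstract.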

