Pattern Recognition Using Clustering Analysis to Support Transportation System Management, Operations, and Modeling

2019 ◽  
Vol 2019 ◽  
pp. 1-12 ◽  
Author(s):  
Rajib Saha ◽  
Mosammat Tahnin Tariq ◽  
Mohammed Hadi ◽  
Yan Xiao

Interest has grown in recent years in using clustering analysis to identify traffic patterns that are representative of traffic conditions, in support of transportation system operations and management (TSMO); integrated corridor management; and analysis, modeling, and simulation (AMS). However, little information is available to help agencies select the most appropriate clustering technique(s), the associated parameters, and the optimal number of clusters, or to analyze clustering results and select observations that are representative of each cluster. This paper investigates and compares a number of existing clustering methods for traffic pattern identification with these considerations in mind. The methods include K-means, K-prototypes, K-medoids, four variations of the hierarchical method, and the combination of Principal Component Analysis for mixed data (PCAmix) with K-means. Among these methods, K-prototypes and K-means with principal components (PCs) produced the best results. The paper then provides recommendations on conducting clustering analysis and using its results.
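One of the tasks named in this abstract, selecting an observation that is representative of each cluster, can be illustrated with a minimal sketch. It assumes purely numeric features, Euclidean distance, and a "closest to the cluster centroid" rule; the abstract does not state which selection rule the paper uses, and the function name and sample data are hypothetical.

```python
import math
from collections import defaultdict


def representative_observations(points, labels):
    """Map each cluster label to the index of the observation nearest its centroid.

    Assumption: "representative" means closest to the centroid in Euclidean
    distance. The paper's own selection rule is not given in the abstract.
    """
    members = defaultdict(list)
    for idx, lab in enumerate(labels):
        members[lab].append(idx)

    reps = {}
    for lab, idxs in members.items():
        dim = len(points[idxs[0]])
        centroid = [sum(points[i][d] for i in idxs) / len(idxs) for d in range(dim)]
        reps[lab] = min(idxs, key=lambda i: math.dist(points[i], centroid))
    return reps


# Hypothetical daily summaries: (mean speed, hourly volume); not data from the paper.
days = [
    (62.0, 1500.0), (60.0, 1600.0), (58.0, 1700.0),   # cluster 0
    (30.0, 2100.0), (28.0, 2200.0), (29.0, 2140.0),   # cluster 1
]
labels = [0, 0, 0, 1, 1, 1]
reps = representative_observations(days, labels)
print(reps)
```

The features are left unscaled here for brevity; with variables on different scales, standardizing them before computing distances would usually be advisable.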

2020 ◽  
Vol 12 (14) ◽  
pp. 5809 ◽  
Author(s):  
Mojtaba Zeraatpisheh ◽  
Esmaeil Bakhshandeh ◽  
Mostafa Emadi ◽  
Tengfei Li ◽  
Ming Xu

Citrus spp. are among the most important commercial crops worldwide, including in Iran, and have global marketing potential. Delineating soil management zones (MZs) is an appropriate approach to achieving sustainable production while improving soil management and increasing economic benefits in the commercial citrus plantations of northern Iran. In this first such report, biological and terrain attributes, along with physicochemical properties (57 soil samples, 0–30 cm), were used for MZ delineation by integrating principal component analysis (PCA) with fuzzy c-means clustering. An economic analysis based on the MZ results was also performed to determine the changes in each MZ using a relative cost (RC) value. The high correlation between soil properties and terrain attributes and the considerable spatial variation of these factors in the study area call for site-specific nutrient management. The optimal number of MZs was six, and there was significant heterogeneity among the MZs. The MZs ranked MZ5 > MZ2 > MZ6 > MZ1 > MZ3 > MZ4 in terms of higher soil quality and lower costs per tree. MZ4, MZ3, MZ1, MZ6, and MZ2 required 34.4, 30.6, 29.4, 9.77, and 9.44% higher costs, respectively, than MZ5 (the reference MZ) to achieve similar productivity. This simple and cost-effective approach could therefore be an initial step toward site-specific fertilizer use in data-scarce areas and toward reducing soil property variability within the delineated MZs, which is fundamental for precision agriculture management.
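Fuzzy c-means assigns each sample a degree of membership in every cluster rather than a single hard label. The sketch below shows only the standard membership update for one sample given fixed cluster centers. The fuzzifier value m = 2, the function name, and the sample values are assumptions for illustration and are not taken from the study.

```python
import math


def fcm_memberships(point, centers, m=2.0):
    """Fuzzy c-means membership of one point in each cluster (standard update rule).

    u_i = 1 / sum_j (d_i / d_j) ** (2 / (m - 1)), where d_i is the distance
    from the point to center i and m > 1 is the fuzzifier.
    """
    dists = [math.dist(point, c) for c in centers]
    # A point that coincides with one or more centers belongs fully to them.
    if any(d == 0.0 for d in dists):
        hits = [1.0 if d == 0.0 else 0.0 for d in dists]
        return [h / sum(hits) for h in hits]
    power = 2.0 / (m - 1.0)
    return [1.0 / sum((d_i / d_j) ** power for d_j in dists) for d_i in dists]


# Hypothetical scores of one soil sample on two principal components, and two
# hypothetical management-zone centers in the same space.
sample = (1.0, 0.0)
centers = [(0.0, 0.0), (3.0, 0.0)]
u = fcm_memberships(sample, centers)
print(u)
```

The memberships always sum to 1, so a sample lying between two zones is shared between them in proportion to how close it is to each center.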


Atmosphere ◽  
2021 ◽  
Vol 12 (6) ◽  
pp. 698
Author(s):  
Likai Cui ◽  
Xiaoquan Song ◽  
Guoqiang Zhong

Using the Hybrid Single-Particle Lagrangian Integrated Trajectory (HYSPLIT) model to obtain backward trajectories and then clustering them is a common way to analyze the potential sources and transmission paths of atmospheric particulate pollutants. Taking Qingdao (N36 E120) as an example, Global Data Assimilation System (GDAS 1°) data for 2015 to 2018, provided by the National Centers for Environmental Prediction (NCEP), are processed with the HYSPLIT model to produce 72 h backward trajectories at 3 arrival heights (10 m, 100 m, 500 m) with a data interval of 6 h (UTC 0:00, 6:00, 12:00, and 18:00 each day). Three common clustering methods for trajectory data, i.e., K-means, Hierarchical clustering (Hier), and Self-organizing maps (SOM), are applied to the trajectory data, and the results are compared with those of the HYSPLIT model released by the National Oceanic and Atmospheric Administration (NOAA). Principal Component Analysis (PCA) is used to analyze the original trajectory data. Four internal evaluation indexes, the Davies–Bouldin Index (DBI), Silhouette Coefficient (SC), Calinski–Harabasz Index (CH), and I index, are used to quantitatively evaluate the three clustering algorithms. The results show that the height data carry little information, so only two-dimensional plane data are used for clustering. According to the clustering indexes, SOM and K-means give better clustering results than Hier and the HYSPLIT model. In addition, DBI and the I index are found to help in selecting the number of clusters, with DBI preferred for cluster analysis.
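The Davies–Bouldin Index mentioned above can be computed directly from cluster labels. The following is a minimal sketch of the usual definition (mean distance to the centroid as within-cluster scatter, Euclidean distance between centroids); the function name and sample points are hypothetical, and the sketch is not the evaluation code used in the study.

```python
import math
from collections import defaultdict


def davies_bouldin(points, labels):
    """Davies–Bouldin Index: lower values indicate compact, well-separated clusters.

    Requires at least two clusters with distinct centroids.
    """
    groups = defaultdict(list)
    for p, lab in zip(points, labels):
        groups[lab].append(p)

    centroids, scatter = {}, {}
    for lab, pts in groups.items():
        dim = len(pts[0])
        c = tuple(sum(p[d] for p in pts) / len(pts) for d in range(dim))
        centroids[lab] = c
        # Within-cluster scatter: mean distance of members to their centroid.
        scatter[lab] = sum(math.dist(p, c) for p in pts) / len(pts)

    labs = list(groups)
    total = 0.0
    for i in labs:
        # Worst-case similarity of cluster i to any other cluster.
        total += max(
            (scatter[i] + scatter[j]) / math.dist(centroids[i], centroids[j])
            for j in labs
            if j != i
        )
    return total / len(labs)


# Hypothetical 2-D features (for example, two principal components of a trajectory).
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
labels = [0, 0, 1, 1]
dbi = davies_bouldin(points, labels)
print(dbi)
```

To use the index for choosing the number of clusters, one would typically run the clustering for several candidate values of k and favor the k with the lowest DBI.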


2022 ◽  
Vol 17 (1) ◽  
pp. 1934578X2110692
Author(s):  
Che Puteh Osman ◽  
Noraini Kasim ◽  
Nur Syamimi Amirah Mohamed Salim ◽  
Nuralina Abdul Aziz

The volatile oils of several durian cultivars in Malaysia have been reported. However, there is limited information on the rapid discrimination of durian cultivars based on the composition of the total volatiles and of individual volatile compounds. The present work therefore aims to discriminate 11 Malaysian durian cultivars based on their volatile compositions using multivariate data analysis. Sulfur-containing volatiles are the major volatiles in the D175 (Udang Merah), D88 (Darling), D13 (Golden Bun), DXO (D24 Special), D17 (Green Bamboo), D2 (Dato Nina), and D168 (Hajah Hasmah) cultivars, while esters are predominant in the D99 (Kop Kecil), D24 (Bukit Merah), and D160 (Musang Queen) cultivars. The D197 (Musang King) cultivar has an almost equal composition of sulfur-containing volatiles and esters. In the ester-predominant durian volatile oils, ethyl 2-methylbutanoate and propyl 2-methylbutanoate are the major volatile compounds, while the cultivars with predominantly sulfur-containing volatiles mainly contain diethyl disulfide, diethyl trisulfide, and 3,5-dimethyl-1,2,4-trithiolane. Principal component analysis grouped the durian cultivars into 8 clusters: 3 clusters of 2 cultivars each, with the remaining cultivars clustered individually. The highly sought-after cultivars D160 and D197 were clustered together. Hierarchical clustering analysis identified the distinct compounds that discriminate each durian cultivar.


Transport ◽  
2014 ◽  
Vol 32 (2) ◽  
pp. 221-232 ◽  
Author(s):  
Rima Sahani ◽  
Prasanta Kumar Bhuyan

Level of Service (LOS) evaluation criteria for off-street pedestrian facilities are not well defined in the urban Indian context; hence, in-depth research is carried out in this regard. Defining Pedestrian Level of Service (PLOS) criteria is basically a classification problem; therefore, a comparative study is made using three clustering methods, i.e. Affinity Propagation (AP), Self-Organizing Map (SOM) in Artificial Neural Network (ANN), and Genetic Algorithm-Fuzzy (GA-Fuzzy) clustering. Cluster validation measures are applied to pedestrian data to obtain the optimal number of clusters for defining PLOS categories. To decide which algorithm is most suitable for defining PLOS criteria for urban off-street facilities in the Indian context, Wilks' Lambda is applied to the results of the three clustering methods. The analysis shows that GA-Fuzzy is the most suitable of the three methods. Using GA-Fuzzy clustering, the ranges of the four measuring parameters (average pedestrian space, flow rate, pedestrian speed, and volume-to-capacity ratio) are defined from data collected in two mid-sized cities in the state of Odisha, India. It is also observed that at >16.53 m²/ped average space, ≤0.061 ped/sec/m flow rate, >1.21 speed, and ≤0.34 v/c ratio, pedestrians can move along their desired path at LOS ‘A’ without changing movements, which is the best condition for off-street facilities. In a pedestrian facility with ≤4.48 m²/ped average space, >0.146 ped/sec/m flow rate, ≤0.62 average speed, and >1.00 v/c ratio, however, pedestrian movement is severely restricted and frequent collisions among users occur. The parameter ranges for the LOS categories found in this study for Indian cities differ from those in the HCM (Highway Capacity Manual 2010) because of differences in population density, traffic flow conditions, geometric structure, and other factors.


Author(s):  
Sherif S. Ishak ◽  
Haitham M. Al-Deek

Pattern recognition techniques such as artificial neural networks continue to offer potential solutions to many of the existing problems associated with freeway incident-detection algorithms. This study focuses on the application of Fuzzy ART neural networks to incident detection on freeways. Unlike back-propagation models, Fuzzy ART is capable of fast, stable learning of recognition categories. It is an incremental approach that has the potential for on-line implementation. Fuzzy ART is trained with traffic patterns that are represented by 30-s loop-detector data of occupancy, speed, or a combination of both. Traffic patterns observed at the incident time and location are mapped to a group of categories. Each incident category maps incidents with similar traffic pattern characteristics, which are affected by the type and severity of the incident and the prevailing traffic conditions. Detection rate and false alarm rate are used to measure the performance of the Fuzzy ART algorithm. To reduce the false alarm rate that results from occasional misclassification of traffic patterns, a persistence time period of 3 min was arbitrarily selected. The algorithm performance improves when the temporal size of traffic patterns increases from one to two 30-s periods for all traffic parameters. An interesting finding is that the speed patterns produced better results than did the occupancy patterns. However, when combined, occupancy–speed patterns produced the best results. When compared with California algorithms 7 and 8, the Fuzzy ART model produced better performance.
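The persistence period described in this abstract can be illustrated with a small sketch. It assumes that persistence means the incident classification must hold for every consecutive 30-s interval across the 3-min period before an alarm is raised; the abstract does not spell out the exact rule, and the function name and flag sequence are hypothetical.

```python
def alarms_with_persistence(flags, interval_s=30, persistence_s=180):
    """Return the interval indices at which an alarm is raised.

    Assumption: an alarm fires at the first interval where the incident
    classification has held without a break for `persistence_s` seconds.
    """
    needed = persistence_s // interval_s  # 180 s / 30 s = 6 consecutive intervals
    alarms, run = [], 0
    for i, flagged in enumerate(flags):
        run = run + 1 if flagged else 0
        if run == needed:
            alarms.append(i)
    return alarms


# Hypothetical per-interval classifier output (1 = pattern mapped to an incident category).
flags = [0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0]
alarms = alarms_with_persistence(flags)
print(alarms)
```

The short two-interval burst at the start is suppressed, which is how a persistence check trades a delay in detection for a lower false alarm rate.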


2018 ◽  
Vol 14 (1) ◽  
pp. 11-23 ◽  
Author(s):  
Lin Zhang ◽  
Yanling He ◽  
Huaizhi Wang ◽  
Hui Liu ◽  
Yufei Huang ◽  
...  

Background: The RNA methylome has been discovered to be an important layer of gene regulation and can be profiled directly with count-based measurements from high-throughput sequencing data. Although the detailed regulatory circuit of the epitranscriptome remains uncharted, a clustering effect in methylation status among different RNA methylation sites can be identified from transcriptome-wide RNA methylation profiles and may reflect epitranscriptomic regulation. Count-based RNA methylation sequencing data have unique features, such as low read coverage, which call for novel clustering approaches.
Objective: In addition to handling low read coverage, clustering analysis of count-based RNA methylation sequencing data needs to preserve the integer nature of the measurements.
Method: We proposed a nonparametric generative model together with its Gibbs sampling solution for clustering analysis. The proposed approach implements a beta-binomial mixture model to capture the clustering effect in methylation level with the original count-based measurements rather than an estimated continuous methylation level. In addition, it adopts a nonparametric Dirichlet process to automatically determine an optimal number of clusters, avoiding the common model selection problem in clustering analysis.
Results: When tested on a simulated system, the method demonstrated improved clustering performance over hierarchical clustering, K-means, MClust, NMF and EMclust. On a real dataset, it also revealed two novel RNA N6-methyladenosine (m6A) co-methylation patterns that may be induced directly by METTL14 and WTAP, two known regulatory components of the RNA m6A methyltransferase complex.
Conclusion: Our proposed DPBBM method not only properly handles the count-based measurements of RNA methylation data from sites of very low read coverage, but also learns an optimal number of clusters adaptively from the data analyzed.
Availability: The source code and documents of the DPBBM R package are freely available through the Comprehensive R Archive Network (CRAN): https://cran.r-project.org/web/packages/DPBBM/.
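The building block of a beta-binomial mixture is the beta-binomial likelihood of observing k methylated reads out of n total reads at a site. The sketch below evaluates that log-probability using the standard formula. It is not code from the DPBBM package, the function and parameter names are hypothetical, and it covers neither the Dirichlet process prior nor the Gibbs sampler.

```python
import math


def log_beta(a, b):
    """Logarithm of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def beta_binomial_logpmf(k, n, alpha, beta):
    """Log-probability of k methylated reads out of n under a beta-binomial model.

    P(k | n, alpha, beta) = C(n, k) * B(k + alpha, n - k + beta) / B(alpha, beta)
    """
    log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return log_choose + log_beta(k + alpha, n - k + beta) - log_beta(alpha, beta)


# Hypothetical site with low coverage: 2 methylated reads out of 4 in total.
# With alpha = beta = 1 the distribution is uniform over 0..n.
p = math.exp(beta_binomial_logpmf(2, 4, 1.0, 1.0))
print(p)
```

Working with the raw counts (k, n) rather than the ratio k / n is what lets a model of this kind distinguish 2 of 4 reads from 200 of 400 reads, which matters at sites with low coverage.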


2021 ◽  
Vol 13 (11) ◽  
pp. 2125
Author(s):  
Bardia Yousefi ◽  
Clemente Ibarra-Castanedo ◽  
Martin Chamberland ◽  
Xavier P. V. Maldague ◽  
Georges Beaudoin

Clustering methods have considerable influence on many recent algorithms and play an important role in hyperspectral data analysis. Here, we test clustering for mineral identification using two different strategies in hyperspectral long wave infrared (LWIR, 7.7–11.8 μm). To do so, we compare two algorithms that perform mineral identification on a unique dataset. The first algorithm applies spectral comparison techniques to all the pixel-spectra and creates RGB false color composites (FCC). A color-based clustering is then used to group the regions (called FCC-clustering). The second algorithm clusters all the pixel-spectra to group the spectra directly. The first rank of non-negative matrix factorization (NMF) then extracts the representative of each cluster, which is compared with the JPL/NASA spectral library. These techniques give the comparison values as features, which are converted into RGB-FCC as the results (called clustering-rank1-NMF). We applied K-means as the clustering approach, which can be replaced by any similar clustering approach. The results of the clustering-rank1-NMF algorithm indicate significant computational efficiency (more than 20 times faster than the previous approach) and promising performance for mineral identification, with average accuracies of up to 75.8% and 84.8% for the FCC-clustering and clustering-rank1-NMF algorithms (using spectral angle mapper (SAM)), respectively. Several other spectral comparison techniques are also used for both algorithms, such as adaptive matched subspace detector (AMSD), orthogonal subspace projection (OSP) algorithm, principal component analysis (PCA), local matched filter (PLMF), SAM, and normalized cross correlation (NCC), and most of them show a similar range of accuracy. However, SAM and NCC are preferred due to their computational simplicity. Our algorithms aim to identify eleven different mineral grains (biotite, diopside, epidote, goethite, kyanite, scheelite, smithsonite, tourmaline, pyrope, olivine, and quartz).
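The spectral angle mapper (SAM) named in this abstract compares two spectra by the angle between them when each is treated as a vector, which makes the measure insensitive to overall brightness. The following is a minimal sketch of that computation; the library entries and pixel values are hypothetical placeholders rather than real JPL/NASA spectra, and the function names are illustrative.

```python
import math


def spectral_angle(a, b):
    """Angle in radians between two spectra treated as vectors (smaller = more similar)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # Clamp to [-1, 1] to guard against floating-point overshoot before acos.
    cos_angle = max(-1.0, min(1.0, dot / (norm_a * norm_b)))
    return math.acos(cos_angle)


def best_match(spectrum, library):
    """Return the library key whose reference spectrum has the smallest angle."""
    return min(library, key=lambda name: spectral_angle(spectrum, library[name]))


# Hypothetical three-band reference spectra and one cluster representative.
library = {"mineral_a": (1.0, 2.0, 3.0), "mineral_b": (3.0, 2.0, 1.0)}
pixel = (2.0, 4.0, 6.0)  # same shape as mineral_a, twice as bright
match = best_match(pixel, library)
print(match, spectral_angle(pixel, library[match]))
```

Because only the angle is used, a cluster representative that is a scaled copy of a reference spectrum still matches it, which is one reason the measure is computationally simple and robust to illumination differences.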


2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Gregoire Preud’homme ◽  
Kevin Duarte ◽  
Kevin Dalleau ◽  
Claire Lacomblez ◽  
Emmanuel Bresso ◽  
...  

The choice of the most appropriate unsupervised machine-learning method for “heterogeneous” or “mixed” data, i.e. with both continuous and categorical variables, can be challenging. Our aim was to examine the performance of various clustering strategies for mixed data using both simulated and real-life data. We conducted a benchmark analysis of “ready-to-use” tools in R comparing 4 model-based (Kamila algorithm, Latent Class Analysis, Latent Class Model [LCM] and Clustering by Mixture Modeling) and 5 distance/dissimilarity-based (Gower distance or Unsupervised Extra Trees dissimilarity followed by hierarchical clustering or Partitioning Around Medoids, K-prototypes) clustering methods. Clustering performance was assessed by Adjusted Rand Index (ARI) on 1000 generated virtual populations consisting of mixed variables using 7 scenarios with varying population sizes, number of clusters, number of continuous and categorical variables, proportions of relevant (non-noisy) variables and degree of variable relevance (low, mild, high). Clustering methods were then applied to the EPHESUS randomized clinical trial data (a heart failure trial evaluating the effect of eplerenone) to illustrate the differences between clustering techniques. The simulations revealed the dominance of K-prototypes, Kamila and LCM models over all other methods. Overall, methods using dissimilarity matrices in classical algorithms such as Partitioning Around Medoids and Hierarchical Clustering had a lower ARI than model-based methods in all scenarios. When clustering methods were applied to a real-life clinical dataset, LCM showed promising results with regard to differences in (1) clinical profiles across clusters, (2) prognostic performance (highest C-index) and (3) identification of patient subgroups with substantial treatment benefit. The present findings suggest key differences in clustering performance between the tested algorithms (limited to tools readily available in R). In most of the tested scenarios, model-based methods (in particular the Kamila and LCM packages) and K-prototypes typically performed best in the setting of heterogeneous data.
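The Adjusted Rand Index used as the performance measure in this benchmark can be computed from two label vectors with the standard pair-counting formula. The sketch below is an illustration and not the authors' R code; the function name and the toy label vectors are hypothetical.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(truth, pred):
    """Adjusted Rand Index between two labelings of the same observations.

    1.0 means identical partitions (up to relabeling); values near 0 mean
    chance-level agreement. Requires at least two observations.
    """
    n = len(truth)
    sum_cells = sum(comb(c, 2) for c in Counter(zip(truth, pred)).values())
    sum_rows = sum(comb(c, 2) for c in Counter(truth).values())
    sum_cols = sum(comb(c, 2) for c in Counter(pred).values())
    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        # Degenerate case, for example both labelings put everything in one cluster.
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


# Hypothetical simulated population: true clusters versus two clustering outputs.
truth = [0, 0, 1, 1]
same_up_to_names = adjusted_rand_index(truth, [1, 1, 0, 0])
crossed = adjusted_rand_index(truth, [0, 1, 0, 1])
print(same_up_to_names, crossed)
```

The index ignores the cluster names themselves, so a method is not penalized for calling the first true cluster "1" instead of "0"; only the grouping of pairs of observations matters.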


Author(s):  
Lingling Chen ◽  
Yuanyuan Zhang ◽  
Min Zeng

Because traditional methods cannot perform clustering analysis on Internet financial credit reporting directly and effectively, a precise clustering analysis method for Internet financial credit reporting based on multidimensional attribute sparse large data is proposed. The overall distance between Internet financial credit reports is measured through the sparse large data with multidimensional attributes, and the multidimensional attribute sparse large data are used to perform clustering analysis on the overall distance matrix and on the component approximate distance matrix between the data, respectively. The correlation between the Internet financial credit reports under these two perspectives is then considered comprehensively. Multidimensional attribute sparse large data pairs are used to reflect the comprehensive relationship matrix of the original Internet financial credit reporting and so achieve relatively high-quality clustering. Numerical experiments show that, compared with traditional clustering methods, the proposed method not only reflects the overall data features effectively but also improves the clustering of the original Internet financial credit reporting data by analyzing the correlation between the important component attribute sequences.

