Accurate recapture identification for genetic mark–recapture studies with error-tolerant likelihood-based match calling and sample clustering

2016
Vol 3 (12)
pp. 160457
Author(s):
Suresh A. Sethi
Daniel Linden
John Wenburg
Cara Lewis
Patrick Lemons
...

Error-tolerant likelihood-based match calling presents a promising technique to accurately identify recapture events in genetic mark–recapture studies by combining probabilities of latent genotypes and probabilities of observed genotypes, which may contain genotyping errors. Combined with clustering algorithms to group samples into sets of recaptures based upon pairwise match calls, these tools can be used to reconstruct accurate capture histories for mark–recapture modelling. Here, we assess the performance of a recently introduced error-tolerant likelihood-based match-calling model and sample clustering algorithm for genetic mark–recapture studies. We assessed both biallelic (i.e. single nucleotide polymorphisms; SNP) and multiallelic (i.e. microsatellite; MSAT) markers using a combination of simulation analyses and case study data on Pacific walrus (Odobenus rosmarus divergens) and fishers (Pekania pennanti). A novel two-stage clustering approach is demonstrated for genetic mark–recapture applications. First, repeat captures within a sampling occasion are identified. Subsequently, recaptures across sampling occasions are identified. The likelihood-based matching protocol performed well in simulation trials, demonstrating utility for use in a wide range of genetic mark–recapture studies. Moderately sized SNP (64+) and MSAT (10–15) panels produced accurate match calls for recaptures and accurate non-match calls for samples from closely related individuals in the face of low to moderate genotyping error. Furthermore, matching performance remained stable or increased as the number of genetic markers increased, genotyping error notwithstanding.
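As a rough sketch of how error-tolerant match calling can work, the Python below computes a per-locus log-likelihood ratio of "same individual" versus "different individuals" for biallelic SNPs under Hardy-Weinberg genotype priors and a deliberately simplistic uniform-miscall error model, then groups putative recaptures by connected components. The error model, threshold, and function names are illustrative assumptions, not the authors' implementation.

```python
import numpy as np
from itertools import combinations
import networkx as nx

def locus_llr(o1, o2, p, e=0.02):
    """LLR (same vs. different individual) for one biallelic locus.
    o1, o2: observed genotypes coded 0/1/2 (minor-allele count);
    p: minor-allele frequency; e: genotyping error rate under a toy
    'miscall to either other genotype' model (an assumption here)."""
    prior = np.array([(1 - p) ** 2, 2 * p * (1 - p), p ** 2])  # HWE prior
    obs = lambda o, g: 1 - e if o == g else e / 2              # P(obs | true)
    p_same = sum(prior[g] * obs(o1, g) * obs(o2, g) for g in range(3))
    p_diff = (sum(prior[g] * obs(o1, g) for g in range(3)) *
              sum(prior[g] * obs(o2, g) for g in range(3)))
    return np.log(p_same / p_diff)

def match_llr(g1, g2, freqs, e=0.02):
    """Sum locus-wise LLRs across a SNP panel."""
    return sum(locus_llr(a, b, p, e) for a, b, p in zip(g1, g2, freqs))

def cluster_recaptures(genotypes, freqs, threshold=10.0, e=0.02):
    """Group samples whose pairwise LLR exceeds a match threshold, using
    connected components as a simple stand-in for the two-stage clustering."""
    G = nx.Graph()
    G.add_nodes_from(range(len(genotypes)))
    for i, j in combinations(range(len(genotypes)), 2):
        if match_llr(genotypes[i], genotypes[j], freqs, e) > threshold:
            G.add_edge(i, j)
    return list(nx.connected_components(G))
```

In practice the match threshold would be calibrated by simulation against the panel size and expected error rate rather than fixed a priori.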

Information
2019
Vol 10 (9)
pp. 287
Author(s):
Bogdan Antonescu
Miead Tehrani Moayyed
Stefano Basagni

Radio channel propagation models for the millimeter wave (mmWave) spectrum are extremely important for planning future 5G wireless communication systems. Transmitted radio signals are received as clusters of multipath rays. Identifying these clusters provides better spatial and temporal characterization of the mmWave channel. This paper deals with the clustering process and its validation across a wide range of frequencies in the mmWave spectrum below 100 GHz. By way of simulations, we show that in outdoor communication scenarios the clustering of received rays is influenced by the frequency of the transmitted signal. This demonstrates the sparse character of the mmWave spectrum (i.e., we obtain fewer rays at the receiver for the same urban scenario). We use the well-known k-means clustering algorithm to group arriving rays at the receiver. The accuracy of this partitioning is studied with both cluster validity indices (CVIs) and score fusion techniques. Finally, we analyze how the clustering solution changes with narrower-beam antennas, and we provide a comparison of cluster characteristics across different antenna types.
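A minimal version of the partitioning-plus-validation loop, assuming hypothetical ray features (azimuth of arrival and excess delay) and a single CVI in place of the paper's CVI and score-fusion battery:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Hypothetical ray table: azimuth of arrival (degrees) and excess delay (ns).
# Real inputs would come from a ray tracer and should be scaled to
# comparable ranges before clustering.
rays = np.column_stack([rng.uniform(-180, 180, 200),
                        rng.exponential(50.0, 200)])

best_k, best_score = 2, -1.0
for k in range(2, 9):                          # sweep candidate cluster counts
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(rays)
    score = silhouette_score(rays, labels)     # one of many possible CVIs
    if score > best_score:
        best_k, best_score = k, score
print(f"selected k = {best_k} (silhouette = {best_score:.2f})")
```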


2015
Vol 15 (2)
pp. 6534-6540
Author(s):  
Vithya Gopalakrishnan

Clustering is an important research topic in a wide range of unsupervised classification applications. Clustering is a technique that divides data into meaningful groups. The K-means algorithm is one of the most popular clustering algorithms. It belongs to the partition-based grouping techniques, which are based on the iterative relocation of data points between clusters. It does not guarantee a globally optimal clustering and has a time complexity of O(n²). The existing, conventional data clustering algorithms were not designed to handle huge amounts of data, so the Golay code clustering algorithm is selected to overcome these issues. A Golay code-based system is used to identify the sets of codewords that embody similar object behaviors. The time complexity associated with the Golay code clustering algorithm is O(n). In this work, the collected sales data are preprocessed by removing all null and empty attributes and then eliminating redundant and noisy data. To enhance sales promotion, the K-means and Golay code clustering algorithms are used to cluster the sales data in terms of place and item. The performance of these algorithms is analyzed in terms of accuracy and execution time. Our results show that the Golay code algorithm outperforms the K-means algorithm on all factors.
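To make the O(n) versus iterative-relocation contrast concrete, the sketch below times scikit-learn's k-means against a single-pass codeword-hashing pass on synthetic data. The hashing is a simplified stand-in for Golay code clustering (a real implementation would decode each 23-bit word to its nearest Golay codeword so that nearby words share a cluster); all data and parameters are illustrative.

```python
import time
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
X = rng.random((20000, 8))            # synthetic sales records (place/item features)

t0 = time.perf_counter()
KMeans(n_clusters=16, n_init=10, random_state=1).fit(X)
t_kmeans = time.perf_counter() - t0

# Single-pass codeword hashing: each record is quantized to a short binary
# word, and records sharing a word fall into the same cluster. Unlike the
# true Golay scheme, no error correction maps nearby words together.
t0 = time.perf_counter()
words = (X > X.mean(axis=0)).astype(np.uint8)      # binarize each feature
ids = {}
assign = np.empty(len(X), dtype=int)
for i, w in enumerate(map(bytes, words)):
    assign[i] = ids.setdefault(w, len(ids))        # O(n) overall
t_hash = time.perf_counter() - t0

print(f"k-means: {t_kmeans:.2f}s   codeword hashing: {t_hash:.2f}s")
```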


2021
Vol 2021
pp. 1-8
Author(s):  
Yi Gu
Kang Li

In the era of Industry 4.0, single-view clustering algorithms struggle in the face of complex, multiview data. In recent years, multiview clustering, an extension of traditional single-view clustering, has become increasingly popular. Although multiview clustering algorithms are more effective than single-view ones, almost all current multiview clustering algorithms share two weaknesses: (1) the collaborative multiview clustering strategy lacks theoretical support, and (2) the weight of each view is simply averaged. To address these problems, we use the Havrda-Charvat entropy and a fuzzy index to construct a new collaborative multiview fuzzy c-means clustering algorithm with fuzzy weighting, called Co-MVFCM. The results show that Co-MVFCM delivers the best clustering performance among all the compared clustering algorithms.
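A compact illustration of the general collaborative scheme, one shared membership matrix with per-view centroids and dispersion-driven view weights, is sketched below. It is not the paper's Havrda-Charvat-based Co-MVFCM objective; the update rules and names are illustrative assumptions.

```python
import numpy as np

def weighted_multiview_fcm(views, c=3, m=2.0, iters=50, seed=0):
    """Minimal weighted multiview fuzzy c-means sketch: one shared
    membership matrix, per-view centroids, and view weights driven by
    each view's within-cluster dispersion."""
    rng = np.random.default_rng(seed)
    n = views[0].shape[0]
    U = rng.dirichlet(np.ones(c), size=n)            # n x c fuzzy memberships
    w = np.full(len(views), 1.0 / len(views))        # view weights, sum to 1
    for _ in range(iters):
        Um = U ** m
        centers = [Um.T @ X / Um.sum(axis=0)[:, None] for X in views]
        # fuse squared point-to-centroid distances across views
        d2 = sum(wv * ((X[:, None, :] - V[None, :, :]) ** 2).sum(-1)
                 for wv, X, V in zip(w, views, centers))
        inv = np.maximum(d2, 1e-12) ** (-1.0 / (m - 1))
        U = inv / inv.sum(axis=1, keepdims=True)     # standard FCM update
        # views with tighter clusters receive larger weights
        disp = np.array([(Um * ((X[:, None, :] - V[None, :, :]) ** 2).sum(-1)).sum()
                         for X, V in zip(views, centers)])
        w = (1.0 / disp) / (1.0 / disp).sum()
    return U, centers, w
```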


Sensors
2021
Vol 21 (20)
pp. 6775
Author(s):  
Vishnu Manasa Devagiri
Veselka Boeva
Shahrooz Abghari
Farhad Basiri
Niklas Lavesson

In smart buildings, many different systems work in coordination to accomplish their tasks. In this process, the sensors associated with these systems collect large amounts of data generated in a streaming fashion, which is prone to concept drift. Such data are heterogeneous due to the wide range of sensors collecting information about different characteristics of the monitored systems. All of this makes the monitoring task very challenging. Traditional clustering algorithms are not well equipped to address these challenges. In this work, we study the use of the MV Multi-Instance Clustering algorithm for multi-view analysis and mining of smart building systems’ sensor data. We demonstrate how this algorithm can be used to perform contextual as well as integrated analysis of the systems. Various scenarios in which the algorithm can be used to analyze the data generated by the systems of a smart building are examined and discussed. We also show how the extracted knowledge can be visualized to detect trends in the systems’ behavior and how it can aid domain experts in the systems’ maintenance. In the experiments conducted, the proposed approach successfully detected deviating behaviors known to have previously occurred and also identified some new deviations during the monitored period. Based on these results, we conclude that the proposed algorithm can be used for monitoring, analyzing, and detecting deviating behaviors of the systems in a smart building domain.
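As a generic illustration of clustering-based deviation detection on building sensor data (not the MV Multi-Instance algorithm itself), one can cluster per-window feature vectors into normal operating profiles and flag windows that sit unusually far from every profile; all parameters below are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

def flag_deviations(window_features, n_profiles=4, z=3.0):
    """Cluster per-window sensor feature vectors into operating profiles
    and flag windows far from their nearest profile centroid."""
    X = StandardScaler().fit_transform(window_features)
    km = KMeans(n_clusters=n_profiles, n_init=10, random_state=0).fit(X)
    d = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)
    return d > d.mean() + z * d.std()    # boolean mask of deviating windows
```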


2015
Vol 2015
pp. 1-11
Author(s):
Yun Joo Yoo
Sun Ah Kim
Shelley B. Bull

Gene-based analysis of multiple single nucleotide polymorphisms (SNPs) in a gene region is an alternative to single SNP analysis. The multi-bin linear combination test (MLC) proposed in previous studies utilizes the correlation among SNPs within a gene to construct a gene-based global test. SNPs are partitioned into clusters of highly correlated SNPs, and the MLC test statistic quadratically combines linear combination statistics constructed for each cluster. The test has degrees of freedom equal to the number of clusters and can be more powerful than a fully quadratic or fully linear test statistic. In this study, we develop a new SNP clustering algorithm designed to find cliques, which are complete subnetworks of SNPs with all pairwise correlations above a threshold. We evaluate the performance of the MLC test using the clique-based CLQ algorithm versus using the tag-SNP-based LDSelect algorithm. In our numerical power calculations we observed that the two clustering algorithms produce identical clusters about 40-60% of the time, yielding similar power on average. However, because the CLQ algorithm tends to produce smaller clusters with stronger positive correlation, the MLC test is less likely to be affected by the occurrence of opposing signs in the individual SNP effect coefficients.
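A greedy sketch of clique-based SNP clustering, connecting SNP pairs whose correlation exceeds a threshold and repeatedly peeling off a largest maximal clique, is shown below. It conveys the spirit of CLQ rather than the published implementation; the threshold and peeling order are illustrative choices.

```python
import numpy as np
import networkx as nx

def clique_snp_clusters(genotypes, r_threshold=0.8):
    """Greedy clique cover of the SNP correlation graph. genotypes is an
    (individuals x SNPs) matrix of genotype codes."""
    r = np.corrcoef(genotypes.T)                   # SNP x SNP correlations
    G = nx.Graph()
    G.add_nodes_from(range(r.shape[0]))
    for i in range(r.shape[0]):
        for j in range(i + 1, r.shape[0]):
            if r[i, j] > r_threshold:              # positive-correlation edges
                G.add_edge(i, j)
    clusters = []
    while len(G):
        clique = max(nx.find_cliques(G), key=len)  # largest remaining clique
        clusters.append(sorted(clique))
        G.remove_nodes_from(clique)
    return clusters
```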


Author(s):  
Mohana Priya K
Pooja Ragavi S
Krishna Priya G

Clustering is the process of grouping objects into subsets that have meaning in the context of a particular problem. It does not rely on predefined classes and is therefore referred to as an unsupervised learning method, since no information is provided about the "right answer" for any of the objects. Many clustering algorithms have been proposed and are used in different applications. Sentence clustering is one of the best clustering techniques. A hierarchical clustering algorithm is applied over multiple levels for accuracy. A POS tagger and the Porter stemmer are used for tagging, and the WordNet dictionary is utilized to determine similarity by invoking the Jiang-Conrath and cosine similarity measures. Grouping is performed with respect to the highest similarity value against a mean threshold. This paper incorporates many parameters for finding the similarity between words. In order to identify disambiguated words, sense identification is performed for the adjectives and a comparison is carried out. The SemCor and machine-learning datasets are employed. Compared with previous results for WSD, our work improves substantially, reaching an accuracy of 91.2%.
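The WordNet-based similarity ingredient can be sketched with NLTK's Jiang-Conrath measure feeding an agglomerative clustering. This toy example clusters words rather than full sentences and omits the POS tagging, stemming, and WSD steps described above; the corpora must be fetched once via nltk.download('wordnet') and nltk.download('wordnet_ic').

```python
import numpy as np
from nltk.corpus import wordnet as wn, wordnet_ic
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

brown_ic = wordnet_ic.ic('ic-brown.dat')   # information content for Jiang-Conrath

def jcn(word1, word2):
    """Best Jiang-Conrath similarity over noun senses, capped at 1.0 to
    tame the unbounded self-similarity NLTK returns for identical senses."""
    best = 0.0
    for s1 in wn.synsets(word1, pos=wn.NOUN):
        for s2 in wn.synsets(word2, pos=wn.NOUN):
            best = max(best, s1.jcn_similarity(s2, brown_ic))
    return min(best, 1.0)

words = ["car", "automobile", "bicycle", "banana", "apple", "fruit"]
sim = np.array([[jcn(a, b) for b in words] for a in words])
dist = squareform(1.0 - sim, checks=False)           # similarity -> distance
labels = fcluster(linkage(dist, method="average"), t=2, criterion="maxclust")
print(dict(zip(words, labels)))
```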


2015
pp. 125-138
Author(s):  
I. V. Goncharenko

In this article we propose a new method of non-hierarchical cluster analysis using the k-nearest-neighbor graph and discuss it with respect to vegetation classification. The method of k-nearest-neighbor (k-NN) classification was originally developed in 1951 (Fix, Hodges, 1951). Later the term "k-NN graph" and several k-NN clustering algorithms appeared (Cover, Hart, 1967; Brito et al., 1997). In biology, k-NN is used in the analysis of protein structures and genome sequences. Most k-NN clustering algorithms first build an "excessive" graph, a so-called hypergraph, and then truncate it to subgraphs by partitioning and coarsening the hypergraph. We developed a different strategy, "upward" clustering, which forms (assembles sequentially) one cluster after another. Until now, graph-based cluster analysis has not been considered for the classification of vegetation datasets.
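For contrast with the hypergraph-truncation strategies mentioned above, a baseline graph-based clustering, building a k-NN graph and taking its connected components, can be written in a few lines. This is not the author's "upward" cluster-assembly algorithm, merely the common starting point; the choice of k controls cluster granularity.

```python
from sklearn.neighbors import kneighbors_graph
from scipy.sparse.csgraph import connected_components

def knn_graph_clusters(X, k=5):
    """Cluster rows of X (e.g., vegetation plot records) by building a
    k-NN graph and taking its connected components."""
    A = kneighbors_graph(X, n_neighbors=k, mode='connectivity')
    A = A.maximum(A.T)                # symmetrize one-way neighbor edges
    n_clusters, labels = connected_components(A, directed=False)
    return labels
```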


Author(s):  
Yuancheng Li
Yaqi Cui
Xiaolong Zhang

Background: Advanced Metering Infrastructure (AMI) for the smart grid is growing rapidly, resulting in exponential growth of the data collected and transmitted by devices. Clustering these data can give the electricity company a better understanding of the personalized and differentiated needs of its users. Objective: Existing clustering algorithms generally suffer from problems such as insufficient data utilization, high computational complexity, and low accuracy of behavior recognition. Methods: To improve clustering accuracy, this paper proposes a new clustering method based on users' electrical behavior. Starting from an analysis of user load characteristics, samples of user electricity data were constructed. Daily load characteristic curves were extracted with an improved extreme learning machine clustering algorithm and effective index criteria, and clustering analysis was carried out for users from industrial, commercial, and residential areas. The improved algorithm, called Unsupervised Extreme Learning Machine (US-ELM), is an extension and improvement of the original Extreme Learning Machine (ELM) that performs the unsupervised clustering task on the basis of the original ELM. Results: Four different data sets were used in experiments and compared against other commonly used clustering algorithms in MATLAB. The experimental results show that the US-ELM algorithm achieves higher accuracy in processing power data. Conclusion: The unsupervised ELM algorithm greatly reduces time consumption and improves the effectiveness of clustering.
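The US-ELM recipe (random hidden layer, Laplacian-regularized embedding, k-means on the embedding) can be sketched as follows. This compact Python version follows the generic published recipe with illustrative parameter choices; it omits the paper's improvements and index criteria.

```python
import numpy as np
from scipy.linalg import eigh
from scipy.sparse.csgraph import laplacian
from sklearn.cluster import KMeans
from sklearn.neighbors import kneighbors_graph

def us_elm_cluster(X, n_clusters=4, n_hidden=200, embed_dim=4, lam=0.1, seed=0):
    """US-ELM-style clustering sketch for load-curve data (rows of X)."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(X.shape[1], n_hidden))     # random input weights
    b = rng.normal(size=n_hidden)                   # random biases
    H = np.tanh(X @ W + b)                          # hidden-layer outputs
    A = kneighbors_graph(X, n_neighbors=10, mode='connectivity')
    L = laplacian(A.maximum(A.T)).toarray()         # graph Laplacian
    # embedding: smallest generalized eigenvectors of
    # (I + lam * H'LH) v = gamma * (H'H) v
    Aq = np.eye(n_hidden) + lam * H.T @ L @ H
    Bq = H.T @ H + 1e-6 * np.eye(n_hidden)
    vals, vecs = eigh(Aq, Bq)
    beta = vecs[:, 1:embed_dim + 1]                 # drop the trivial direction
    E = H @ beta                                    # low-dimensional embedding
    return KMeans(n_clusters=n_clusters, n_init=10,
                  random_state=seed).fit_predict(E)
```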


Author(s):  
M. Tanveer
Tarun Gupta
Miten Shah

Twin Support Vector Clustering (TWSVC) is a clustering algorithm inspired by the principles of the Twin Support Vector Machine (TWSVM). TWSVC has already outperformed other traditional plane-based clustering algorithms. However, TWSVC uses the hinge loss, which maximizes the shortest distance between clusters and hence suffers from noise sensitivity and low re-sampling stability. In this article, we propose Pinball loss Twin Support Vector Clustering (pinTSVC) as a clustering algorithm. The proposed pinTSVC model incorporates the pinball loss function in the plane clustering formulation. The pinball loss function introduces favorable properties such as noise insensitivity and re-sampling stability. The time complexity of the proposed pinTSVC remains equivalent to that of TWSVC. Extensive numerical experiments on noise-corrupted benchmark UCI and artificial datasets are provided. Results of the proposed pinTSVC model are compared with TWSVC, Twin Bounded Support Vector Clustering (TBSVC), and fuzzy c-means clustering (FCM). Detailed and exhaustive comparisons demonstrate the better performance and generalization of the proposed pinTSVC on noise-corrupted datasets. Further experiments and analysis on structural MRI (sMRI) images from the ADNI database, face clustering, and facial expression clustering demonstrate the effectiveness and feasibility of the proposed pinTSVC model.
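The two losses differ only in how they treat points on the "good" side of the margin, which is where the noise insensitivity comes from. A minimal comparison of the loss shapes (not the pinTSVC optimization itself):

```python
import numpy as np

def hinge(u):
    """Hinge loss: zero penalty on one side, so the solution is driven
    entirely by points near or beyond the margin boundary."""
    return np.maximum(0.0, u)

def pinball(u, tau=0.5):
    """Pinball loss: also charges a tau-scaled penalty on the other side,
    pulling the solution toward quantiles and damping the influence of
    noise around the cluster plane."""
    return np.where(u >= 0, u, -tau * u)

u = np.linspace(-2, 2, 5)
print(hinge(1 - u))          # hinge applied to the margin slack 1 - u
print(pinball(1 - u, 0.5))   # pinball keeps a gradient on both sides
```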


Author(s):  
Simon F Lashmar
Donagh P Berry
Rian Pierneef
Farai C Muchadeyi
Carina Visser

Abstract A major obstacle in applying genomic selection (GS) to uniquely adapted local breeds in less-developed countries has been the cost of genotyping at high densities of single nucleotide polymorphisms (SNP). Cost reduction can be achieved by imputing genotypes from lower to higher densities. Locally adapted breeds tend to be admixed and exhibit a high degree of genomic heterogeneity thus necessitating the optimization of SNP selection for downstream imputation. The aim of this study was to quantify the achievable imputation accuracy for a sample of 1,135 South African (SA) Drakensberger using several custom-derived lower-density panels varying in both SNP density and how the SNP were selected. From a pool of 120,608 genotyped SNP, subsets of SNP were chosen 1) at random, 2) with even genomic dispersion, 3) by maximizing the mean minor allele frequency (MAF), 4) using a combined score of MAF and linkage disequilibrium (LD), 5) using a partitioning-around-medoids (PAM) algorithm, and finally 6) using a hierarchical LD-based clustering algorithm. Imputation accuracy to higher density improved as SNP density increased; animal-wise imputation accuracy defined as the within-animal correlation between the imputed and actual alleles ranged from 0.625 to 0.990 when 2,500 randomly selected SNP were chosen versus a range of 0.918 to 0.999 when 50,000 randomly selected SNP were used. At a panel density of 10,000 SNP, the mean (standard deviation) animal-wise allele concordance rate was 0.976 (0.018) versus 0.982 (0.014) when the worst (i.e., random) as opposed to the best (i.e., combination of MAF and LD) SNP selection strategy was employed. A difference of 0.071 units was observed between the mean correlation-based accuracy of imputed SNP categorized as low (0.01<MAF≤0.1) versus high MAF (0.4<MAF≤0.5). Greater mean imputation accuracy was achieved for SNP located on autosomal extremes when these regions were populated with more SNP. The presented results suggested that genotype imputation can be a practical cost-saving strategy for indigenous breeds such as the South African Drakensberger. Based on the results, a genotyping panel consisting of approximately 10,000 SNP selected based on a combination of MAF and LD would suffice in achieving a less than 3% imputation error rate for a breed characterized by genomic admixture on the condition that these SNP are selected based on breed-specific selection criteria.
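A plausible stand-in for the combined MAF-and-LD selection step (the study's exact scoring formula is not given here) might rank SNPs as follows; the equal-weight score, window size, and function name are illustrative assumptions.

```python
import numpy as np

def select_snps(G, n_select=10_000, window=50):
    """Rank SNPs by a combined minor-allele-frequency and local-LD score
    and keep the top n_select. G is an (animals x SNPs) matrix of 0/1/2
    genotype codes, SNPs in map order."""
    p = G.mean(axis=0) / 2.0                     # allele frequency per SNP
    maf = np.minimum(p, 1.0 - p)                 # minor allele frequency
    n_snp = G.shape[1]
    ld = np.zeros(n_snp)
    for j in range(n_snp):                       # mean r^2 with nearby SNPs
        nbrs = [k for k in range(max(0, j - window),
                                 min(n_snp, j + window + 1)) if k != j]
        r = np.corrcoef(G[:, j], G[:, nbrs], rowvar=False)[0, 1:]
        ld[j] = np.nanmean(r ** 2)
    # favor informative (high-MAF) SNPs that also tag their neighborhood
    score = maf / maf.max() + np.nan_to_num(ld) / np.nanmax(ld)
    return np.argsort(score)[::-1][:n_select]
```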

