Approximate single linkage cluster analysis of large data sets in high-dimensional spaces

1996 ◽  
Vol 23 (1) ◽  
pp. 29-43 ◽  
Author(s):  
William F. Eddy ◽  
Audris Mockus ◽  
Shingo Oue
Author(s):  
Thomas W. Shattuck ◽  
James R. Anderson ◽  
Neil W. Tindale ◽  
Peter R. Buseck

Individual particle analysis involves the study of tens of thousands of particles using automated scanning electron microscopy and elemental analysis by energy-dispersive X-ray emission spectroscopy (EDS). EDS produces large data sets that must be analyzed using multivariate statistical techniques. A complete study uses cluster analysis, discriminant analysis, and factor or principal components analysis (PCA). The three techniques are used in the study of particles sampled during the FeLine cruise to the mid-Pacific Ocean in the summer of 1990. The mid-Pacific aerosol provides information on long-range particle transport, iron deposition, sea salt ageing, and halogen chemistry. Aerosol particle data sets present a number of difficulties for pattern recognition using cluster analysis. There is a great disparity in the number of observations per cluster and in the range of the variables in each cluster. The variables are not normally distributed, they are subject to considerable experimental error, and many values are zero because of finite detection limits. Many of the clusters show considerable overlap because of natural variability, agglomeration, and chemical reactivity.


Author(s):  
S.R. Singh ◽  
S. Rajan ◽  
Dinesh Kumar ◽  
V.K. Soni

Background: Dolichos bean occupies a unique position among the legume vegetables of Indian origin for its high nutritive value and wide climatic adaptability. Despite its wide genetic diversity, not much effort has been made towards genetic improvement of this vegetable crop. Knowledge of genetic variability is an essential prerequisite, as a hybrid between two diverse parental lines generates a broad spectrum of variability in the segregating population. The current study aims to assess the genetic diversity in dolichos genotypes to make an effective selection for yield improvement. Methods: Twenty genotypes collected from different regions were evaluated during 2016-17 and 2017-18. Data on twelve quantitative traits were analysed using principal component analysis and single linkage cluster analysis to estimate genetic diversity. Result: Principal component analysis revealed that the first five principal components had eigenvalues greater than 1 and cumulatively contributed more than 82.53% of total variability. The characters contributing positively towards PC-I to PC-V may be considered for the dolichos improvement programme, as they are the major traits involved in genetic variation of pod yield. All genotypes were grouped into three clusters, showing non-parallelism between geographic and genetic diversity. Cluster-I was best for earliness and number of clusters/plant; Cluster-II for vine length, per cent fruit set, pod length, pod width, pod weight and number of seeds/pod; and Cluster-III for number of pods/cluster and pod yield/plant. Selecting parent genotypes from divergent clusters and components that have more than one positive trait of interest for hybridization is likely to give better progenies for the development of high-yielding varieties in Dolichos bean.
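
For readers unfamiliar with the grouping step, the following is a minimal sketch of single linkage clustering, not the authors' implementation. It repeatedly merges the two clusters whose closest members are nearest to each other until a requested number of clusters remains. The function name `single_linkage`, the Euclidean distance, and the toy two-trait measurements are all illustrative assumptions; none of them come from the study.

```python
from itertools import combinations
from math import dist


def single_linkage(points, k):
    """Group points into k clusters by single linkage (hypothetical helper)."""
    n = len(points)
    parent = list(range(n))  # union-find forest: every point starts alone

    def find(i):
        # Follow parent links to the cluster root, shortening the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # All pairwise Euclidean distances, shortest first.
    edges = sorted(
        (dist(points[i], points[j]), i, j)
        for i, j in combinations(range(n), 2)
    )

    clusters = n
    for _, i, j in edges:
        if clusters <= k:
            break
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            # The shortest remaining link joins two different clusters: merge them.
            parent[root_i] = root_j
            clusters -= 1

    # Relabel cluster roots as 0, 1, 2, ... in order of first appearance.
    labels = {}
    return [labels.setdefault(find(i), len(labels)) for i in range(n)]


# Made-up measurements of two traits for seven genotypes (illustrative only).
toy_traits = [
    (1.0, 1.0), (1.2, 0.9), (0.9, 1.1),
    (5.0, 5.0), (5.1, 4.8),
    (9.0, 1.0), (9.2, 1.1),
]
toy_labels = single_linkage(toy_traits, 3)
print(toy_labels)
```

Processing pairwise distances in increasing order and merging only across clusters is the same procedure as cutting the longest edges of a minimum spanning tree, which is why a union-find structure is enough here. In practice, traits measured on different scales would usually be standardized before distances are computed.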


2020 ◽  
Vol 10 (1) ◽  
Author(s):  
Michele Allegra ◽  
Elena Facco ◽  
Francesco Denti ◽  
Alessandro Laio ◽  
Antonietta Mira

One of the founding paradigms of machine learning is that a small number of variables is often sufficient to describe high-dimensional data. The minimum number of variables required is called the intrinsic dimension (ID) of the data. Contrary to common intuition, there are cases where the ID varies within the same data set. This fact has been highlighted in technical discussions, but seldom exploited to analyze large data sets and obtain insight into their structure. Here we develop a robust approach to discriminate regions with different local IDs and segment the points accordingly. Our approach is computationally efficient and can be proficiently used even on large data sets. We find that many real-world data sets contain regions with widely heterogeneous dimensions. These regions host points differing in core properties: folded versus unfolded configurations in a protein molecular dynamics trajectory, active versus non-active regions in brain imaging data, and firms with different financial risk in company balance sheets. A simple topological feature, the local ID, is thus sufficient to achieve an unsupervised segmentation of high-dimensional data, complementary to the one given by clustering algorithms.
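
To make the notion of intrinsic dimension concrete, here is a small sketch of a global ID estimate based on the ratio of each point's second to first nearest-neighbour distance, with the dimension obtained by maximum likelihood under the assumption that these ratios follow a Pareto law whose exponent is the ID. This is a swapped-in, single-dimension estimator for illustration only; it is not the segmentation approach described in the abstract, and the function name `two_nn_id` and the synthetic data are assumptions.

```python
import heapq
import math
import random


def two_nn_id(points):
    """Estimate one global intrinsic dimension from nearest-neighbour ratios."""
    log_ratios = []
    for i, p in enumerate(points):
        # Two smallest distances from p to any other point.
        r1, r2 = heapq.nsmallest(
            2, (math.dist(p, q) for j, q in enumerate(points) if j != i)
        )
        if r1 > 0:  # skip exact duplicates, whose ratio is undefined
            log_ratios.append(math.log(r2 / r1))
    # Maximum-likelihood exponent of a Pareto law on the ratios r2 / r1.
    return len(log_ratios) / sum(log_ratios)


rng = random.Random(0)

# Synthetic points on a straight line embedded in 3-D (true ID = 1).
line = []
for _ in range(400):
    t = rng.random()
    line.append((t, 2.0 * t, -t))

# Synthetic points on a flat plane embedded in 3-D (true ID = 2).
plane = []
for _ in range(400):
    u, v = rng.random(), rng.random()
    plane.append((u, v, u + v))

id_line = two_nn_id(line)
id_plane = two_nn_id(plane)
print(round(id_line, 2), round(id_plane, 2))
```

Both sets live in three coordinates, yet the estimates should come out near 1 and 2 respectively, which is the sense in which the ID can be smaller than the number of recorded variables. The brute-force neighbour search is quadratic in the number of points and is only suitable for small examples.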


1990 ◽  
Vol 17 (6) ◽  
pp. 551 ◽  
Author(s):  
GW Arnold ◽  
DE Steven ◽  
A Grassia

Associations between different classes of animals, and between individuals, were analysed in a population of 150-170 western grey kangaroos living in a 300-ha remnant of wandoo woodland and adjacent farmland. The commonest group size was one, and 71% of groups were of three or fewer individuals. Females with juveniles at foot were seen in a significantly different distribution of group sizes from females without juveniles, or males. The associations between classes in groups of 2, 3 and 4 changed with the size of group. In groups of two, but not in groups of three and four, males were seen together more frequently than expected. Females without juveniles at foot associated with their peers more frequently than expected in groups of two and three, but those with juveniles at foot associated with their peers less frequently than expected. Other associations between classes were significantly different from expectation. About 70% of the sub-adult and adult animals were individually identifiable by numbered collars. The highest frequency of association of one individual with another was less than 40% of the times the two were seen on the same night. However, nearly all individuals had statistically significant associations with one or more individuals in each year, and dissociations with others. The associations did not persist from year to year. The overall group social structure, as shown by single-linkage cluster analysis, was for individuals to associate with others of the same sex, although sub-adults were more generally associated with adult females. The overall level of association was lower in males than in females and juveniles.

