Clustering Techniques
Recently Published Documents





2021 ◽  
Vol 11 (1) ◽  
Caspar Matzhold ◽  
Jana Lasser ◽  
Christa Egger-Danner ◽  
Birgit Fuerst-Waltl ◽  
Thomas Wittek ◽  

Abstract: In this study we present a systematic framework to analyse the impact of farm profiles, defined as combinations of environmental conditions and management practices, on common diseases in dairy cattle. The data used for this secondary analysis are observational data from 166 farms with a total of 5828 dairy cows. Each farm is characterised by features from five categories: husbandry, feeding, environmental conditions, housing, and milking systems. We combine dimension reduction with clustering techniques to identify groups of similar farm attributes, which we refer to as farm profiles. Using a multivariate regression model, we then study the associations between disease risk, a farm’s membership of a specific cluster, and the variables that characterise that cluster. The disease risks of five different farm profiles arise as the result of complex interactions between environmental conditions and farm management practices. We confirm previously documented relationships between diseases, feeding and husbandry. Furthermore, we find novel associations between housing and milking systems and specific disorders such as lameness and ketosis. Our approach contributes to paving the way towards a more holistic and data-driven understanding of bovine health and its risk factors.
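The grouping step described above can be illustrated with a minimal sketch. The abstract does not name the clustering algorithm or the features used, so plain k-means (Lloyd's algorithm) stands in here as an assumption, the two farm features are hypothetical, and the dimension-reduction step is omitted.

```python
from math import dist

def kmeans(points, centroids, iterations=10):
    """Plain Lloyd's k-means with caller-supplied starting centroids."""
    centroids = [tuple(c) for c in centroids]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [min(range(len(centroids)), key=lambda k: dist(p, centroids[k]))
                  for p in points]
        # Update step: move each centroid to the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # keep the old centroid if a cluster is empty
                centroids[k] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centroids

# Hypothetical, already-scaled farm features: (herd size, share of pasture feeding).
farms = [(0.1, 0.9), (0.2, 0.8), (0.15, 0.85), (0.9, 0.1), (0.8, 0.2), (0.85, 0.15)]
labels, centroids = kmeans(farms, centroids=[farms[0], farms[3]])
print(labels)  # [0, 0, 0, 1, 1, 1]
```

Each resulting label plays the role of a "farm profile"; in a real analysis the cluster membership would then enter the regression model as a categorical variable.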

2021 ◽  
Vol 0 (0) ◽  
pp. 1-30
Chunyan Lin ◽  
Jia Liu ◽  
Peide Liu

In this paper, we quantitatively analyse the relationship between the strategy deviation of listed firms and institutional investors’ recognition. Methodologically, financial complex networks and clustering techniques are employed to measure the degree of recognition by creating links based on the common stockholding behaviour of institutional investors. Quarterly panel data from 2006 to 2020 are then constructed to study the degree of institutional investors’ recognition of listed firms’ strategy deviation under different innovation fields, firm properties, and market style heterogeneity and asymmetry. A stability test is conducted by transforming the measures and methods, thereby effectively avoiding the “cluster fallacy”. We validate the mechanism by which differences in the strategic choices and propensities of listed firms affect capital market recognition, and enrich the microscopic research perspective and methodology on related issues.

Alicia Veninga ◽  
Constance C. F. M. J. Baaten ◽  
Ilaria De Simone ◽  
Bibian M. E. Tullemans ◽  
Marijke J. E. Kuijpers ◽  

Abstract: Platelets from healthy donors display heterogeneity in responsiveness to agonists. The response thresholds of platelets are controlled by multiple bioactive molecules, acting as negatively or positively priming substances. Higher circulating levels of the priming substances adenosine and succinate, as well as the occurrence of hypercoagulability, have been described for patients with ischaemic heart disease. Here, we present an improved methodology of flow cytometric analyses of platelet activation and the characterisation of platelet populations following activation and priming by automated clustering analysis. Platelets were treated with adenosine, succinate, or coagulated plasma before stimulation with CRP-XL, 2-MeSADP, or TRAP6 and labelled for activated integrin αIIbβ3 (PAC1), CD62P, TLT1, CD63, and GPIX. The Super-Enhanced Dmax subtraction algorithm and 2% marker (quadrant) setting were applied to identify populations, which were further defined by state-of-the-art clustering techniques (tSNE, FlowSOM). Following activation, five platelet populations were identified: resting, aggregating (PAC1+), secreting (α- and dense-granules; CD62P+, TLT1+, CD63+), aggregating plus α-granule secreting (PAC1+, CD62P+, TLT1+), and fully active platelet populations. The type of agonist determined the distribution of platelet populations. Adenosine suppressed the fraction of fully activated platelets in a dose-dependent way (TRAP6 > 2-MeSADP > CRP-XL), whereas succinate and coagulated plasma increased this fraction (CRP-XL > TRAP6 > 2-MeSADP). Interestingly, a subset of platelets showed a constant response (aggregating, secreting, or aggregating plus α-granule secreting), which was hardly affected by the stimulus strength or priming substances.

2021 ◽  
Vol 12 ◽  
Lucas Miranda ◽  
Riya Paul ◽  
Benno Pütz ◽  
Nikolaos Koutsouleris ◽  
Bertram Müller-Myhsok

Background: Psychiatric disorders have historically been classified using symptom information alone. Recently, there has been a dramatic increase in research interest not only in identifying the mechanisms underlying defined pathologies but also in redefining their etiology. This is particularly relevant for the field of personalized medicine, which searches for data-driven approaches to improve diagnosis, prognosis, and treatment selection for individual patients. Methods: This review aims to provide a high-level overview of the rapidly growing field of functional magnetic resonance imaging (fMRI) from the perspective of unsupervised machine learning applications for disease subtyping. Following the PRISMA guidelines for protocol reproducibility, we searched the PubMed database for articles describing functional MRI applications used to obtain, interpret, or validate psychiatric disease subtypes. We also employed the active learning framework ASReview to prioritize publications in a machine learning-guided way. Results: Of the 20 studies that met the inclusion criteria, five used functional MRI data to interpret symptom-derived disease clusters, four used it to interpret clusters derived from biomarker data other than fMRI itself, and 11 applied clustering techniques involving fMRI directly. Major depressive disorder and schizophrenia were the two most frequently studied pathologies (35% and 30% of the retrieved studies, respectively), followed by ADHD (15%), psychosis as a whole (10%), autism disorder (5%), and the consequences of early exposure to violence (5%). Conclusions: The increased interest in personalized medicine and data-driven disease subtyping also extends to psychiatric disorders. However, to date, this subfield is at an incipient exploratory stage, and the retrieved studies were mostly proofs of principle for which further validation and larger sample sizes are needed.
Although results for all explored diseases are inconsistent, we believe this reflects the need for concerted, multisite data collection efforts with a strong focus on measuring the generalizability of results. Finally, although functional MRI is the best way of measuring brain function available to date, its low signal-to-noise ratio and elevated monetary cost make it a poor clinical alternative. Even with technology progressing and costs decreasing, this might incentivize the search for more accessible, clinically ready functional proxies in the future.

2021 ◽  
Simon Berry ◽  
Zahid Khan ◽  
Diego Corbo ◽  
Tom Marsh ◽  
Alexandra Kidd ◽  

Abstract: Redevelopment of a mature field enables reassessment of the current field understanding to maximise its economic return. However, the redevelopment process is associated with several challenges: 1) analysis of large data sets is a time-consuming process, 2) extrapolation of the existing data to new areas is associated with significant uncertainties, 3) screening multiple potential scenarios can be tedious. Traditional workflows have not addressed these challenges efficiently. In this work, we adopt an integrated approach that combines static and dynamic uncertainties to streamline the evaluation of multiple possible scenarios, while quantifying the associated uncertainties to improve reservoir history matching and forecasting. A fully integrated automated workflow, which includes geological and fluid models, is used to perform Assisted History Matching (AHM), allowing different parameter combinations to be screened whilst also calibrating to the historical data. An ensemble of history matched models is then selected using dimensionality reduction and clustering techniques. The selected ensemble is used for reservoir predictions and represents a spread of possible solutions accounting for uncertainty. Finally, well location optimisation under uncertainty is performed to find the optimal well location for multiple equiprobable scenarios simultaneously. The suggested workflow was applied to the Northern Area Claymore (NAC) field. NAC is a structurally complex, Lower Cretaceous stacked turbidite, composed of three reservoirs, which have produced ~170 MMbbls of oil since 1978 from an estimated STOIIP of ~500 MMstb. The integrated workflow helps to streamline the redevelopment project by allowing geoscientists and engineers to work together, account for multiple scenarios and quantify the associated uncertainties.
Working with static and dynamic variables simultaneously gives better insight into how different properties and property combinations can help to achieve a history match. Powerful hardware, cloud computing and fully parallel software make it possible to evaluate a range of possible solutions and to work with an ensemble of equally probable matched models. As an ultimate outcome of the redevelopment project, several prediction profiles have been produced in a time-efficient manner, aiming to improve field recovery and accounting for the associated uncertainty. The current project shows the value of the integrated approach applied to a real case to overcome the shortcomings of the traditional approach. The collaboration of experts with different backgrounds in a common project permits the efficient assessment of multiple hypotheses and helps to build a deeper understanding of the reservoir. Finally, the project provides evidence that working with an ensemble of models makes it possible to evaluate a range of possible solutions and account for potential risks, providing more robust predictions for future field redevelopment.

Muhammed-Fatih Kaya ◽  
Mareike Schoop

Abstract: The systematic processing of unstructured communication data, and the recognition of patterns in order to determine communication groups in negotiations, pose many challenges in Machine Learning. In particular, the so-called curse of dimensionality makes the pattern recognition process demanding and requires further research in the negotiation environment. In this paper, selected renowned clustering approaches are evaluated with regard to their pattern recognition potential on high-dimensional negotiation communication data. A research approach is presented to evaluate the application potential of the selected methods via a holistic framework comprising three main evaluation milestones: the determination of the optimal number of clusters, the main clustering application, and the performance evaluation. Quantified Term Document Matrices are first pre-processed and then used as the underlying databases to investigate the pattern recognition potential of clustering techniques, by considering the information regarding the optimal number of clusters and by measuring the respective internal and external performances. The overall results show that, under this holistic evaluation approach, certain cluster separations are recommended by the internal and external performance measures, whereas three of the cluster separations are eliminated based on the evaluation results.
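As an illustration of what an external performance measure looks like, the sketch below computes cluster purity against reference labels. The abstract does not say which measures the authors used, so purity is an assumption chosen for its simplicity, and the cluster assignments and message classes are hypothetical.

```python
from collections import Counter

def purity(predicted, reference):
    """Fraction of items that belong to the majority reference class of their cluster."""
    if len(predicted) != len(reference):
        raise ValueError("label lists must have the same length")
    if not predicted:
        raise ValueError("label lists must not be empty")
    # Count, per predicted cluster, how often each reference class occurs.
    clusters = {}
    for cluster_id, true_class in zip(predicted, reference):
        clusters.setdefault(cluster_id, Counter())[true_class] += 1
    # Sum the size of the dominant class in every cluster.
    return sum(max(counts.values()) for counts in clusters.values()) / len(predicted)

# Hypothetical cluster assignments for eight negotiation messages.
predicted = [0, 0, 0, 1, 1, 1, 2, 2]
reference = ["offer", "offer", "reject", "reject", "reject", "reject", "offer", "offer"]
score = purity(predicted, reference)
print(score)  # 0.875
```

Purity rises trivially as the number of clusters grows, which is one reason an evaluation framework would pair external measures with internal ones and with a separate choice of the number of clusters.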

2021 ◽  
Herdiantri Sufriyana ◽  
Yu Wei Wu ◽  
Emily Chia-Yu Su

Abstract: We aimed to provide a framework that organizes the internal properties of a convolutional neural network (CNN) model using non-image data so that they are interpretable by humans. The interface was represented as an ontology map and a network, obtained by dimensional reduction and hierarchical clustering techniques, respectively. The framework can be applied to implement a prediction model that either classifies a categorical outcome or estimates a numerical one, including but not limited to models using data from electronic health records. This pipeline harnesses the invention of CNN algorithms for non-image data while improving the depth of interpretability through a data-driven ontology. However, beyond its predictive ability the DI-VNN is only for exploration, which requires further explanatory studies, and it needs a human user with specific competences in medicine, statistics, and machine learning to explore the DI-VNN with high confidence. The key stages consisted of data preprocessing, differential analysis, feature mapping, network architecture construction, model training and validation, and exploratory analysis.

2021 ◽  
Vol 17 (10) ◽  
pp. e1009459
Jason Bennett ◽  
Mikhail Pomaznoy ◽  
Akul Singhania ◽  
Bjoern Peters

Recent technological advances have made the gathering of comprehensive gene expression datasets a commodity. This has shifted the limiting step of transcriptomic studies from the accumulation of data to their analysis and interpretation. The main problem in analyzing transcriptomics data is that the number of independent samples is typically much lower (<100) than the number of genes whose expression is quantified (typically >14,000). To address this, it would be desirable to reduce the gathered data’s dimensionality without losing information. Clustering genes into discrete modules is one of the most commonly used tools to accomplish this task. While there are multiple clustering approaches, there is a lack of informative metrics available to evaluate the resultant clusters’ biological quality. Here we present a metric that incorporates known ground truth gene sets to quantify the biological quality of gene clusters derived from standard clustering techniques. The GECO (Ground truth Evaluation of Clustering Outcomes) metric demonstrates that quantitative and repeatable scoring of gene clusters is not only possible but computationally lightweight and robust. Unlike current methods, it allows direct comparison between gene clusters generated by different clustering techniques. It also reveals that current cluster analysis techniques often underestimate the number of clusters that should be formed from a dataset, which leads to fewer clusters of lower quality. As a test case, we applied GECO combined with k-means clustering to derive an optimal set of co-expressed gene modules from PBMC, which we show to be superior to modules previously generated from whole blood. Overall, GECO provides a rational metric to test and compare different clustering approaches to analyze high-dimensional transcriptomic data.
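The general idea of scoring clusters against known gene sets can be sketched as follows. This is not the GECO formula, which the abstract does not specify; it is a simple best-match Jaccard overlap used only as an assumed stand-in, and the gene identifiers and pathway names are placeholders.

```python
def jaccard(a, b):
    """Jaccard index of two collections, treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def best_match_scores(clusters, reference_sets):
    """For each cluster, the highest Jaccard overlap with any reference gene set."""
    return {
        name: max((jaccard(genes, ref) for ref in reference_sets.values()), default=0.0)
        for name, genes in clusters.items()
    }

# Placeholder gene identifiers; real inputs would be gene symbols or IDs.
clusters = {"c1": {"G1", "G2", "G3"}, "c2": {"G4", "G5"}}
reference_sets = {"pathway_a": {"G1", "G2", "G3", "G6"}, "pathway_b": {"G5", "G7"}}
scores = best_match_scores(clusters, reference_sets)
print(scores["c1"])  # 0.75
```

Because the score depends only on set overlap, it can be computed for the output of any clustering method, which is what makes this kind of metric usable for comparing methods.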

2021 ◽  
pp. 1-24
Mohamed Chebel ◽  
Chiraz Latiri ◽  
Eric Gaussier

Abstract: Bilingual corpora are an essential resource used to cross the language barrier in multilingual natural language processing tasks. Among bilingual corpora, comparable corpora have been the subject of many studies as they are both frequent and easily available. In this paper, we propose to make use of formal concept analysis to first construct concept vectors which can be used to enhance comparable corpora through clustering techniques. We then show how one can extract bilingual lexicons of improved quality from these enhanced corpora. We finally show that the bilingual lexicons obtained can complement existing bilingual dictionaries and improve cross-language information retrieval systems.

2021 ◽  
Vol 14 (4) ◽  
pp. 33-44
G. Chamundeswari ◽  
G. P. S. Varma ◽  
C. Satyanarayana

Clustering techniques are widely used in computer vision and pattern recognition, and they are found to be efficient when applied to the feature vector of the input image. The present paper therefore derives the feature vector using the Hough transformation, which maps points to line segments. The resulting line features are taken as the feature vector and passed to a neural network that performs the clustering; a self-organizing map (SOM) is used for this purpose. The proposed method is evaluated on various leaf images, and the measured performance shows the efficiency of the proposed method.
