Implicit consensus clustering from multiple graphs

Author(s):  
Rafika Boutalbi ◽  
Lazhar Labiod ◽  
Mohamed Nadif

Abstract: Dealing with relational learning generally relies on tools for modeling relational data. An undirected graph can represent these data, with vertices depicting entities and edges describing the relationships between the entities. These relationships can be well represented by multiple undirected graphs over the same set of vertices, with edges arising from different graphs capturing heterogeneous relations. The vertices of those networks are often structured in unknown clusters with varying connectivity properties. These multiple graphs can be structured as a three-way tensor, where each slice of the tensor depicts a graph represented by a count data matrix. To extract relevant clusters, we propose an appropriate model-based co-clustering approach capable of dealing with multiple graphs. The proposed model can be seen as a suitable tensor extension of mixture models of graphs, while the obtained co-clustering can be treated as a consensus clustering of nodes from multiple graphs. Applications on real datasets and comparisons with multi-view clustering and tensor decomposition methods demonstrate the value of our contribution.
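The tensor representation mentioned in this abstract can be illustrated with a minimal sketch. It assumes each graph is given as a list of undirected edges over the same vertex indices; the function name `graphs_to_tensor` and the nested-list layout are hypothetical choices for illustration and are not taken from the paper.

```python
def graphs_to_tensor(n_vertices, edge_lists):
    """Stack one symmetric count matrix per graph.

    Returns a nested list indexed as tensor[graph][i][j], where the entry
    counts how many edges graph `graph` has between vertices i and j.
    """
    tensor = []
    for edges in edge_lists:
        # One n x n slice per graph, initialised to zero counts.
        slice_ = [[0] * n_vertices for _ in range(n_vertices)]
        for i, j in edges:
            slice_[i][j] += 1
            if i != j:
                # Undirected graph: mirror the count across the diagonal.
                slice_[j][i] += 1
        tensor.append(slice_)
    return tensor


# Two graphs over the same three vertices; the second has a repeated edge.
tensor = graphs_to_tensor(3, [[(0, 1)], [(0, 1), (0, 1), (1, 2)]])
```

Each slice is symmetric because the graphs are undirected, and repeated edges accumulate into counts, which matches the abstract's description of a count data matrix per graph.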

Author(s):  
Moritz Berger ◽  
Gerhard Tutz

Abstract: A flexible semiparametric class of models is introduced that offers an alternative to classical regression models for count data, such as the Poisson and negative binomial models, as well as to more general models accounting for excess zeros that are also based on fixed distributional assumptions. The model lets the data themselves determine the distribution of the response variable but, in its basic form, uses a parametric term to specify the effect of explanatory variables. In addition, an extended version is considered in which the effects of covariates are specified nonparametrically. The proposed model and traditional models are compared in simulations and in several real data applications from health and social science.


2020 ◽  
Vol 5 (2) ◽  
pp. 13-32
Author(s):  
Hye-Kyung Yang ◽  
Hwan-Seung Yong

Abstract
Purpose: We propose InParTen2, a multi-aspect parallel factor analysis algorithm for three-dimensional tensor decomposition, based on the Apache Spark framework. The proposed method reduces re-decomposition cost and can handle large tensors.
Design/methodology/approach: Considering that tensor addition increases the size of a given tensor along all axes, the proposed method decomposes incoming tensors using existing decomposition results without generating sub-tensors. Additionally, InParTen2 avoids the calculation of Khatri–Rao products and minimizes shuffling by using the Apache Spark platform.
Findings: The performance of InParTen2 is evaluated by comparing its execution time and accuracy with those of existing distributed tensor decomposition methods on various datasets. The results confirm that InParTen2 can process large tensors and reduce the re-calculation cost of tensor decomposition. Consequently, the proposed method is faster than existing tensor decomposition algorithms and can significantly reduce re-decomposition cost.
Research limitations: There are several Hadoop-based distributed tensor decomposition algorithms as well as MATLAB-based decomposition methods. However, the former require longer iteration time, and therefore their execution time cannot be compared with that of Spark-based algorithms, whereas the latter run on a single machine, which limits their ability to handle large data.
Practical implications: The proposed algorithm can reduce re-decomposition cost when tensors are added to a given tensor by decomposing them based on existing decomposition results without re-decomposing the entire tensor.
Originality/value: The proposed method can handle large tensors and is fast within the limited-memory framework of Apache Spark. Moreover, InParTen2 can handle static as well as incremental tensor decomposition.
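For readers unfamiliar with the Khatri–Rao product that this abstract says InParTen2 avoids computing, the following is a minimal sketch of the standard definition (the column-wise Kronecker product of two factor matrices). The helper name `khatri_rao` is hypothetical, and the code is not part of InParTen2; it only shows why the intermediate result grows quickly.

```python
def khatri_rao(a, b):
    """Column-wise Kronecker product.

    `a` is an I x R matrix and `b` is a J x R matrix (lists of rows).
    The result has I * J rows and R columns; row (i, j) holds the
    element-wise product of row i of `a` and row j of `b`.
    """
    rank = len(a[0])
    if any(len(row) != rank for row in a + b):
        raise ValueError("both matrices need the same number of columns")
    return [
        [a_row[r] * b_row[r] for r in range(rank)]
        for a_row in a
        for b_row in b
    ]


product = khatri_rao([[1, 2], [3, 4]], [[5, 6], [7, 8]])
```

The output has as many rows as the product of the two input row counts, so for large factor matrices the intermediate matrix can dominate memory use, which is consistent with the abstract's motivation for avoiding it.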


Mathematics ◽  
2021 ◽  
Vol 9 (18) ◽  
pp. 2208
Author(s):  
Ekaterina Morozova ◽  
Vladimir Panov

This paper deals with extreme value analysis for the triangular arrays that appear when some parameters of a mixture model vary as the number of observations grows. When the mixing parameter is small, it is natural to associate one of the components with “an impurity” (in the case of a regularly varying distribution, a “heavy-tailed impurity”), which “pollutes” another component. We show that the set of possible limit distributions is much more diverse than in the classical Fisher–Tippett–Gnedenko theorem, and provide numerical examples showing the efficiency of the proposed model for studying the maximal values of stock returns.


1965 ◽  
Vol 17 ◽  
pp. 923-932 ◽  
Author(s):  
Laurence R. Alvarez

If (L, ≥) is a lattice or partial order we may think of its Hasse diagram as a directed graph, G, containing the single edge E(c, d) if and only if c covers d in (L, ≥). This graph we shall call the graph of (L, ≥). Strictly speaking it is the basis graph of (L, ≥) with the loops at each vertex removed; see (3, p. 170). We shall say that an undirected graph Gu can be realized as the graph of a (modular) (distributive) lattice if and only if there is some (modular) (distributive) lattice whose graph has Gu as its associated undirected graph.
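The covering relation that defines the edges E(c, d) can be made concrete with a short sketch. It assumes a finite partial order given as a list of elements and a comparison function; the name `cover_edges` and the divisibility example are illustrative assumptions, not taken from the paper.

```python
def cover_edges(elements, leq):
    """Return the directed edges (c, d) such that c covers d.

    `leq(x, y)` must return True when x is below or equal to y in the
    partial order. c covers d when d is strictly below c and no other
    element lies strictly between them.
    """
    edges = []
    for c in elements:
        for d in elements:
            if c == d or not leq(d, c):
                continue
            has_between = any(
                e != c and e != d and leq(d, e) and leq(e, c)
                for e in elements
            )
            if not has_between:
                edges.append((c, d))
    return edges


# Divisors of 12 ordered by divisibility form a distributive lattice.
divisors = [1, 2, 3, 4, 6, 12]
edges = cover_edges(divisors, lambda x, y: y % x == 0)
```

Dropping the direction of each returned pair gives the associated undirected graph Gu discussed in the abstract.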


2020 ◽  
Vol 39 (5) ◽  
pp. 6891-6901
Author(s):  
Godrick Oketch ◽  
Filiz Karaman

Count data models are based on definite counts of events as dependent variables. But there are practical situations in which these counts may fail to be specific and are seen as imprecise. In this paper, the assumption that heaped data points are fuzzy is used as a way of identifying counts that are not definite, since heaping can result from imprecisely reported counts. Because it is practically unlikely that all counts in an entire dataset are reported imprecisely, this paper proposes a likelihood function that not only considers both precisely and imprecisely reported counts but also incorporates α-cuts of fuzzy numbers with the aim of varying the imprecision of fuzzy reported counts. The proposed model is then illustrated using data from a smoking cessation study that attempts to identify factors associated with the number of cigarettes smoked in a month. Through the real data illustration and a simulation study, it is shown that the proposed model performs better in predicting the outcome counts, especially when the imprecision of the fuzzy points in a dataset is increased. The results also show that inclusion of α-cuts makes it possible to identify better models, a feature that was not previously possible.


2020 ◽  
Vol 10 ◽  
Author(s):  
Jiafeng Zheng ◽  
Tongqiang Zhang ◽  
Wei Guo ◽  
Caili Zhou ◽  
Xiaojian Cui ◽  
...  

Background: Acute myelogenous leukemia (AML) is a common pediatric malignancy in children younger than 15 years old. Although overall survival (OS) has improved in recent years, the mechanisms of AML remain largely unknown. Hence, the purpose of this study is to explore the differentially methylated genes and to investigate the underlying mechanism in AML initiation and progression based on bioinformatic analysis.
Methods: Methylation array data and gene expression data were obtained from the TARGET Data Matrix. The consensus clustering analysis was performed using the ConsensusClusterPlus R package. Global DNA methylation was analyzed using the methylationArrayAnalysis R package, and differentially methylated genes (DMGs) and differentially expressed genes (DEGs) were identified using the limma R package. The biological function was analyzed using the clusterProfiler R package. The correlation between DMGs and DEGs was determined using the psych R package. Moreover, the correlation between DMGs and AML was assessed using the varElect online tool. Overall survival and progression-free survival were analyzed using the survival R package.
Results: All AML samples in this study were divided into three clusters at k = 3. Based on consensus clustering, we identified 1,146 CpGs, including 40 hypermethylated and 1,106 hypomethylated CpGs in AML. In addition, a total of 529 DEGs were identified, including 270 upregulated and 259 downregulated DEGs in AML. The function analysis showed that DEGs were significantly enriched in AML-related biological processes. Moreover, the correlation between DMGs and DEGs indicated that seven DMGs directly interacted with AML. CD34, HOXA7, and CD96 showed the strongest correlation with AML. Further, we explored three CpG sites (cg03583857, cg26511321, and cg04039397) of CD34, HOXA7, and CD96, which acted as clinical prognostic biomarkers.
Conclusion: Our study identified three novel methylated genes in AML and also explored the mechanism of methylated genes in AML. Our findings may provide novel potential prognostic markers for AML.


2020 ◽  
Vol 10 (1) ◽  
Author(s):  
Kevin M. Kocot ◽  
Albert J. Poustka ◽  
Isabella Stöger ◽  
Kenneth M. Halanych ◽  
Michael Schrödl

Abstract: Relationships among the major lineages of Mollusca have long been debated. Morphological studies have considered the rarely collected Monoplacophora (Tryblidia) to have several plesiomorphic molluscan traits. The phylogenetic position of this group is contentious as morphologists have generally placed this clade as the sister taxon of the rest of Conchifera whereas earlier molecular studies supported a clade of Monoplacophora + Polyplacophora (Serialia) and phylogenomic studies have generally recovered a clade of Monoplacophora + Cephalopoda. Phylogenomic studies have also strongly supported a clade including Gastropoda, Bivalvia, and Scaphopoda, but relationships among these taxa have been inconsistent. In order to resolve conchiferan relationships and improve understanding of early molluscan evolution, we carefully curated a high-quality data matrix and conducted phylogenomic analyses with broad taxon sampling including newly sequenced genomic data from the monoplacophoran Laevipilina antarctica. Whereas a partitioned maximum likelihood (ML) analysis using site-homogeneous models recovered Monoplacophora sister to Cephalopoda with moderate support, both ML and Bayesian inference (BI) analyses using mixture models recovered Monoplacophora sister to all other conchiferans with strong support. A supertree approach also recovered Monoplacophora as the sister taxon of a clade composed of the rest of Conchifera. Gastropoda was recovered as the sister taxon of Scaphopoda in most analyses, which was strongly supported when mixture models were used. A molecular clock based on our BI topology dates diversification of Mollusca to ~546 MYA (+/− 6 MYA) and Conchifera to ~540 MYA (+/− 9 MYA), generally consistent with previous work employing nuclear housekeeping genes. These results provide important resolution of conchiferan mollusc phylogeny and offer new insights into ancestral character states of major mollusc clades.


1992 ◽  
Vol 29 (3) ◽  
pp. 745-749 ◽  
Author(s):  
F. Matúš

The dependence of coincidence of the global, local and pairwise Markov properties on the underlying undirected graph is examined. The pairs of these properties are found to be equivalent for graphs with some small excluded subgraphs. Probabilistic representations of the corresponding conditional independence structures are discussed.


2021 ◽  
Author(s):  
Joris Vanhoutven ◽  
Bart Cuypers ◽  
Pieter Meysman ◽  
Jef Hooyberghs ◽  
Kris Laukens ◽  
...  

Abstract: In high-throughput omics disciplines like transcriptomics, researchers face a need to assess the quality of an experiment prior to an in-depth statistical analysis. To efficiently analyze such voluminous collections of data, researchers need triage methods that are both quick and easy to use. Such a normalization method for relative quantitation, CONSTANd, was recently introduced for isobarically-labeled mass spectra in proteomics. It transforms the data matrix of abundances through an iterative, convergent process enforcing three constraints: (I) identical column sums; (II) each row sum is fixed (across matrices) and (III) identical to all other row sums. In this study, we investigate whether CONSTANd is suitable for count data from massively parallel sequencing, by qualitatively comparing its results to those of DESeq2. Further, we propose an adjustment of the method so that it may be applied to identically balanced but differently sized experiments for joint analysis. We find that CONSTANd can process large data sets with about 2 million count records in less than a second whilst removing unwanted systematic bias and thus quickly uncovering the underlying biological structure when combined with a PCA plot or hierarchical clustering. Moreover, it allows joint analysis of data sets obtained from different batches, with different protocols and from different labs but without exploiting information from the experimental setup other than the delineation of samples into identically processed sets (IPSs). CONSTANd’s simplicity and applicability to proteomics as well as transcriptomics data make it an interesting candidate for integration in multi-omics workflows.
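The kind of iterative process described here, alternately enforcing equal row sums and equal column sums, can be sketched generically. This is not the published CONSTANd procedure: the function name `balance_matrix`, the chosen targets (every row sums to 1, every column sums to rows/columns), and the fixed iteration count are all assumptions made for illustration, and the sketch assumes a strictly positive input matrix.

```python
def balance_matrix(matrix, n_iter=50):
    """Alternately rescale rows and columns of a strictly positive matrix.

    Assumed targets (illustrative only): each row sums to 1 and each
    column sums to n_rows / n_cols, so both sets of sums are identical
    within their own dimension.
    """
    rows = [[float(x) for x in row] for row in matrix]
    n_rows, n_cols = len(rows), len(rows[0])
    col_target = n_rows / n_cols
    for _ in range(n_iter):
        # Row step: divide each row by its sum so it sums to 1.
        scaled = []
        for row in rows:
            total = sum(row)
            scaled.append([x / total for x in row])
        # Column step: rescale each column to the shared column target.
        col_sums = [sum(row[j] for row in scaled) for j in range(n_cols)]
        rows = [
            [row[j] * col_target / col_sums[j] for j in range(n_cols)]
            for row in scaled
        ]
    return rows


balanced = balance_matrix([[1, 2, 3], [4, 5, 6]])
```

After enough iterations both constraints hold simultaneously up to numerical tolerance, which is the sense in which such alternating schemes are called convergent.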

