Implicit consensus clustering from multiple graphs

Abstract: Relational learning generally relies on tools for modeling relational data. An undirected graph can represent such data, with vertices depicting entities and edges describing the relationships between them. These relationships can be well represented by multiple undirected graphs over the same set of vertices, with the edges of different graphs capturing heterogeneous relations. The vertices of those networks are often structured in unknown clusters with varying connectivity properties. These multiple graphs can be arranged as a three-way tensor in which each slice is a graph represented by a count data matrix. To extract relevant clusters, we propose a model-based co-clustering capable of dealing with multiple graphs. The proposed model can be seen as a tensor extension of mixture models of graphs, while the obtained co-clustering can be treated as a consensus clustering of nodes from multiple graphs. Applications on real datasets and comparisons with multi-view clustering and tensor decomposition methods demonstrate the value of our contribution.
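
The tensor arrangement described in this abstract can be illustrated with a minimal sketch. It only shows how several count matrices over the same vertices stack into a three-way array; it is not the authors' co-clustering model, and the function name `stack_graphs` and the toy matrices are assumptions made for illustration.

```python
import numpy as np


def stack_graphs(adjacency_matrices):
    """Stack same-sized count matrices into an (n, n, V) tensor.

    Hypothetical helper: slice [:, :, v] holds the v-th graph.
    """
    mats = [np.asarray(a) for a in adjacency_matrices]
    n = mats[0].shape[0]
    for a in mats:
        if a.shape != (n, n):
            raise ValueError("all graphs must share the same vertex set")
    return np.stack(mats, axis=2)


# Two toy undirected graphs on the same three vertices (edge counts).
g1 = np.array([[0, 2, 0],
               [2, 0, 1],
               [0, 1, 0]])
g2 = np.array([[0, 0, 3],
               [0, 0, 1],
               [3, 1, 0]])

tensor = stack_graphs([g1, g2])
```

Each frontal slice stays symmetric because the graphs are undirected, so a clustering of the first mode is also a clustering of the second.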

Transition models for count data: a flexible alternative to fixed distribution models

Statistical Methods & Applications ◽

10.1007/s10260-021-00558-6 ◽

2021 ◽

Author(s):

Moritz Berger ◽

Gerhard Tutz

Keyword(s):

Count Data ◽

Regression Models ◽

Negative Binomial ◽

Real Data ◽

Distribution Models ◽

Explanatory Variables ◽

Excess Zeros ◽

Proposed Model ◽

Transition Models ◽

Fixed Distribution

Abstract: A flexible semiparametric class of models is introduced that offers an alternative to classical regression models for count data, such as the Poisson and Negative Binomial models, as well as to more general models accounting for excess zeros that are also based on fixed distributional assumptions. The model lets the data themselves determine the distribution of the response variable but, in its basic form, uses a parametric term that specifies the effect of explanatory variables. In addition, an extended version is considered in which the effects of covariates are specified nonparametrically. The proposed model and traditional models are compared in simulations and in several real data applications from the areas of health and social science.

Multi-Aspect Incremental Tensor Decomposition Based on Distributed In-Memory Big Data Systems

Journal of Data and Information Science ◽

10.2478/jdis-2020-0010 ◽

2020 ◽

Vol 5 (2) ◽

pp. 13-32

Author(s):

Hye-Kyung Yang ◽

Hwan-Seung Yong

Keyword(s):

Execution Time ◽

Three Dimensional ◽

Large Data ◽

Tensor Decomposition ◽

Decomposition Methods ◽

Apache Spark ◽

Decomposition Algorithms ◽

Data Systems ◽

Big Data Systems ◽

Spark Framework

Abstract

Purpose: We propose InParTen2, a multi-aspect parallel factor analysis three-dimensional tensor decomposition algorithm based on the Apache Spark framework. The proposed method reduces re-decomposition cost and can handle large tensors.

Design/methodology/approach: Considering that tensor addition increases the size of a given tensor along all axes, the proposed method decomposes incoming tensors using existing decomposition results without generating sub-tensors. Additionally, InParTen2 avoids the calculation of Khatri–Rao products and minimizes shuffling by using the Apache Spark platform.

Findings: The performance of InParTen2 is evaluated by comparing its execution time and accuracy with those of existing distributed tensor decomposition methods on various datasets. The results confirm that InParTen2 can process large tensors and reduce the re-calculation cost of tensor decomposition. Consequently, the proposed method is faster than existing tensor decomposition algorithms and can significantly reduce re-decomposition cost.

Research limitations: There are several Hadoop-based distributed tensor decomposition algorithms as well as MATLAB-based decomposition methods. However, the former require longer iteration time, and therefore their execution time cannot be compared with that of Spark-based algorithms, whereas the latter run on a single machine, which limits their ability to handle large data.

Practical implications: The proposed algorithm can reduce re-decomposition cost when tensors are added to a given tensor by decomposing them based on existing decomposition results without re-decomposing the entire tensor.

Originality/value: The proposed method can handle large tensors and is fast within the limited-memory framework of Apache Spark. Moreover, InParTen2 can handle static as well as incremental tensor decomposition.
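
For readers unfamiliar with the Khatri–Rao product mentioned in this abstract, the following is a minimal single-machine sketch of its definition (the column-wise Kronecker product). It is not InParTen2 and does not use Spark; the function name `khatri_rao` and the toy factor matrices are assumptions for illustration.

```python
import numpy as np


def khatri_rao(a, b):
    """Column-wise Kronecker product of a (I x R) and b (J x R).

    Returns an (I*J x R) matrix whose r-th column is kron(a[:, r], b[:, r]).
    """
    a = np.asarray(a)
    b = np.asarray(b)
    if a.shape[1] != b.shape[1]:
        raise ValueError("both matrices need the same number of columns")
    i, r = a.shape
    j, _ = b.shape
    # Broadcasting builds an (I, J, R) array that is flattened over (I, J).
    return (a[:, None, :] * b[None, :, :]).reshape(i * j, r)


a = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])   # I = 3, R = 2
b = np.array([[1.0, 0.5],
              [2.0, 1.5]])   # J = 2, R = 2

product = khatri_rao(a, b)   # shape (I*J, R) = (6, 2)
```

The result has I*J rows, so for large tensor modes the intermediate matrix grows quickly, which is one plausible reason a distributed method would try to avoid forming it explicitly.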

Extreme Value Analysis for Mixture Models with Heavy-Tailed Impurity

Mathematics ◽

10.3390/math9182208 ◽

2021 ◽

Vol 9 (18) ◽

pp. 2208

Author(s):

Ekaterina Morozova ◽

Vladimir Panov

Keyword(s):

Stock Returns ◽

Mixture Models ◽

Extreme Value ◽

Extreme Value Analysis ◽

Value Analysis ◽

Numerical Examples ◽

Proposed Model ◽

Mixing Parameter ◽

Heavy Tailed ◽

Regularly Varying Distribution

This paper deals with extreme value analysis for the triangular arrays that appear when some parameters of a mixture model vary as the number of observations grows. When the mixing parameter is small, it is natural to associate one of the components with “an impurity” (in the case of a regularly varying distribution, a “heavy-tailed impurity”), which “pollutes” the other component. We show that the set of possible limit distributions is much more diverse than in the classical Fisher–Tippett–Gnedenko theorem, and provide numerical examples showing the efficiency of the proposed model for studying the maximal values of stock returns.

Undirected Graphs Realizable as Graphs of Modular Lattices

Canadian Journal of Mathematics ◽

10.4153/cjm-1965-088-1 ◽

1965 ◽

Vol 17 ◽

pp. 923-932 ◽

Cited By ~ 15

Author(s):

Laurence R. Alvarez

Keyword(s):

Distributive Lattice ◽

Partial Order ◽

Directed Graph ◽

Undirected Graph ◽

Single Edge ◽

Undirected Graphs ◽

Modular Lattices

If (L, ≥) is a lattice or partial order we may think of its Hasse diagram as a directed graph, G, containing the single edge E(c, d) if and only if c covers d in (L, ≥). This graph we shall call the graph of (L, ≥). Strictly speaking it is the basis graph of (L, ≥) with the loops at each vertex removed; see (3, p. 170). We shall say that an undirected graph Gu can be realized as the graph of a (modular) (distributive) lattice if and only if there is some (modular) (distributive) lattice whose graph has Gu as its associated undirected graph.
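
The cover relation that defines the graph of (L, ≥) can be sketched for a small finite example. The code below is an illustrative brute-force construction, not taken from the paper; the function name `cover_edges` and the choice of the divisors of 12 ordered by divisibility are assumptions.

```python
def cover_edges(elements, leq):
    """Return directed edges (c, d) such that c covers d.

    c covers d when d < c and no element z lies strictly between them.
    `leq(x, y)` is the order relation x <= y on `elements`.
    """
    edges = []
    for c in elements:
        for d in elements:
            if c == d or not leq(d, c):
                continue
            has_between = any(
                z != c and z != d and leq(d, z) and leq(z, c)
                for z in elements
            )
            if not has_between:
                edges.append((c, d))
    return edges


# Divisors of 12 ordered by divisibility form a distributive lattice.
divisors = [1, 2, 3, 4, 6, 12]
edges = cover_edges(divisors, lambda x, y: y % x == 0)

# Forgetting orientation gives the associated undirected graph.
undirected = {frozenset(e) for e in edges}
```

Here the undirected graph with these seven edges is, by construction, realizable as the graph of a distributive lattice in the sense of the abstract.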

Comparison of Tensor Decomposition Methods for Simulation of Multilinear Time-Invariant Systems with the MTI Toolbox

(This work was partly supported by the project OBSERVE of the Federal Ministry for Economic Affairs and Energy Germany, Grant-No.: 03ET1225B.)

IFAC-PapersOnLine ◽

10.1016/j.ifacol.2017.08.1107 ◽

2017 ◽

Vol 50 (1) ◽

pp. 5610-5615 ◽

Cited By ~ 3

Author(s):

Kai Kruppa

Keyword(s):

Tensor Decomposition ◽

Decomposition Methods ◽

Time Invariant

Maximum likelihood function for fuzzy count data models (using heaped data as fuzzy)

Journal of Intelligent & Fuzzy Systems ◽

10.3233/jifs-192094 ◽

2020 ◽

Vol 39 (5) ◽

pp. 6891-6901

Author(s):

Godrick Oketch ◽

Filiz Karaman

Keyword(s):

Count Data ◽

Likelihood Function ◽

Real Data ◽

Study Data ◽

Data Models ◽

Count Data Models ◽

Proposed Model ◽

Data Points ◽

Heaped Data ◽

Entire Dataset

Count data models are based on definite counts of events as dependent variables. But there are practical situations in which these counts may fail to be specific and are seen as imprecise. In this paper, the assumption that heaped data points are fuzzy is used as a way of identifying counts that are not definite, since heaping can result from imprecisely reported counts. Because it is practically unlikely that all counts in an entire dataset are reported imprecisely, this paper proposes a likelihood function that not only considers both precisely and imprecisely reported counts but also incorporates α-cuts of fuzzy numbers with the aim of varying the impreciseness of fuzzy reported counts. The proposed model is then illustrated using data from a smoking cessation study that attempts to identify factors associated with the number of cigarettes smoked in a month. Through the real data illustration and a simulation study, it is shown that the proposed model performs better in predicting the outcome counts, especially when the imprecision of the fuzzy points in a dataset is increased. The results also show that the inclusion of α-cuts makes it possible to identify better models, a feature that was not previously possible.

Integrative Analysis of Multi-Omics Identified the Prognostic Biomarkers in Acute Myelogenous Leukemia

Frontiers in Oncology ◽

10.3389/fonc.2020.591937 ◽

2020 ◽

Vol 10 ◽

Author(s):

Jiafeng Zheng ◽

Tongqiang Zhang ◽

Wei Guo ◽

Caili Zhou ◽

Xiaojian Cui ◽

...

Keyword(s):

Overall Survival ◽

Acute Myelogenous Leukemia ◽

Function Analysis ◽

R Package ◽

Myelogenous Leukemia ◽

Prognostic Biomarkers ◽

Data Matrix ◽

Consensus Clustering ◽

Differentially Methylated Genes ◽

Acute Myelogenous

Background: Acute myelogenous leukemia (AML) is a common pediatric malignancy in children younger than 15 years old. Although the overall survival (OS) has improved in recent years, the mechanisms of AML remain largely unknown. Hence, the purpose of this study is to explore the differentially methylated genes and to investigate the underlying mechanism of AML initiation and progression based on bioinformatic analysis.

Methods: Methylation array data and gene expression data were obtained from the TARGET Data Matrix. The consensus clustering analysis was performed using the ConsensusClusterPlus R package. The global DNA methylation was analyzed using the methylationArrayAnalysis R package, and differentially methylated genes (DMGs) and differentially expressed genes (DEGs) were identified using the Limma R package. The biological function was analyzed using the clusterProfiler R package. The correlation between DMGs and DEGs was determined using the psych R package. Moreover, the correlation between DMGs and AML was assessed using the varElect online tool. The overall survival and progression-free survival were analyzed using the survival R package.

Results: All AML samples in this study were divided into three clusters at k = 3. Based on consensus clustering, we identified 1,146 CpGs, including 40 hypermethylated and 1,106 hypomethylated CpGs in AML. In addition, a total of 529 DEGs were identified, including 270 upregulated and 259 downregulated DEGs in AML. The function analysis showed that DEGs were significantly enriched in AML-related biological processes. Moreover, the correlation between DMGs and DEGs indicated that seven DMGs directly interacted with AML. CD34, HOXA7, and CD96 showed the strongest correlation with AML. Further, we explored three CpG sites, cg03583857, cg26511321, and cg04039397, of CD34, HOXA7, and CD96, which acted as clinical prognostic biomarkers.

Conclusion: Our study identified three novel methylated genes in AML and also explored the mechanism of methylated genes in AML. Our findings may provide novel potential prognostic markers for AML.

New data from Monoplacophora and a carefully-curated dataset resolve molluscan relationships

Scientific Reports ◽

10.1038/s41598-019-56728-w ◽

2020 ◽

Vol 10 (1) ◽

Cited By ~ 11

Author(s):

Kevin M. Kocot ◽

Albert J. Poustka ◽

Isabella Stöger ◽

Kenneth M. Halanych ◽

Michael Schrödl

Keyword(s):

Mixture Models ◽

Strong Support ◽

Housekeeping Genes ◽

Data Matrix ◽

Quality Data ◽

Sister Taxon ◽

Phylogenetic Position ◽

Morphological Studies ◽

Homogeneous Models ◽

Molluscan Evolution

Abstract: Relationships among the major lineages of Mollusca have long been debated. Morphological studies have considered the rarely collected Monoplacophora (Tryblidia) to have several plesiomorphic molluscan traits. The phylogenetic position of this group is contentious as morphologists have generally placed this clade as the sister taxon of the rest of Conchifera whereas earlier molecular studies supported a clade of Monoplacophora + Polyplacophora (Serialia) and phylogenomic studies have generally recovered a clade of Monoplacophora + Cephalopoda. Phylogenomic studies have also strongly supported a clade including Gastropoda, Bivalvia, and Scaphopoda, but relationships among these taxa have been inconsistent. In order to resolve conchiferan relationships and improve understanding of early molluscan evolution, we carefully curated a high-quality data matrix and conducted phylogenomic analyses with broad taxon sampling including newly sequenced genomic data from the monoplacophoran Laevipilina antarctica. Whereas a partitioned maximum likelihood (ML) analysis using site-homogeneous models recovered Monoplacophora sister to Cephalopoda with moderate support, both ML and Bayesian inference (BI) analyses using mixture models recovered Monoplacophora sister to all other conchiferans with strong support. A supertree approach also recovered Monoplacophora as the sister taxon of a clade composed of the rest of Conchifera. Gastropoda was recovered as the sister taxon of Scaphopoda in most analyses, which was strongly supported when mixture models were used. A molecular clock based on our BI topology dates diversification of Mollusca to ~546 MYA (+/− 6 MYA) and Conchifera to ~540 MYA (+/− 9 MYA), generally consistent with previous work employing nuclear housekeeping genes. These results provide important resolution of conchiferan mollusc phylogeny and offer new insights into ancestral character states of major mollusc clades.

On equivalence of Markov properties over undirected graphs

Journal of Applied Probability ◽

10.2307/3214910 ◽

1992 ◽

Vol 29 (3) ◽

pp. 745-749 ◽

Cited By ~ 8

Author(s):

F. Matúš

Keyword(s):

Conditional Independence ◽

Undirected Graph ◽

Undirected Graphs ◽

Markov Properties

The dependence of coincidence of the global, local and pairwise Markov properties on the underlying undirected graph is examined. The pairs of these properties are found to be equivalent for graphs with some small excluded subgraphs. Probabilistic representations of the corresponding conditional independence structures are discussed.

Constrained standardization of count data from massive parallel sequencing

10.1101/2021.03.04.433870 ◽

2021 ◽

Author(s):

Joris Vanhoutven ◽

Bart Cuypers ◽

Pieter Meysman ◽

Jef Hooyberghs ◽

Kris Laukens ◽

...

Keyword(s):

Count Data ◽

Massively Parallel Sequencing ◽

Large Data ◽

Data Matrix ◽

Joint Analysis ◽

Data Sets ◽

Systematic Bias ◽

Parallel Sequencing ◽

Interesting Candidate ◽

Transcriptomics Data

Abstract: In high-throughput omics disciplines like transcriptomics, researchers face a need to assess the quality of an experiment prior to an in-depth statistical analysis. To efficiently analyze such voluminous collections of data, researchers need triage methods that are both quick and easy to use. Such a normalization method for relative quantitation, CONSTANd, was recently introduced for isobarically-labeled mass spectra in proteomics. It transforms the data matrix of abundances through an iterative, convergent process enforcing three constraints: (I) identical column sums; (II) each row sum is fixed (across matrices) and (III) identical to all other row sums. In this study, we investigate whether CONSTANd is suitable for count data from massively parallel sequencing, by qualitatively comparing its results to those of DESeq2. Further, we propose an adjustment of the method so that it may be applied to identically balanced but differently sized experiments for joint analysis. We find that CONSTANd can process large data sets with about 2 million count records in less than a second whilst removing unwanted systematic bias and thus quickly uncovering the underlying biological structure when combined with a PCA plot or hierarchical clustering. Moreover, it allows joint analysis of data sets obtained from different batches, with different protocols and from different labs but without exploiting information from the experimental setup other than the delineation of samples into identically processed sets (IPSs). CONSTANd’s simplicity and applicability to proteomics as well as transcriptomics data make it an interesting candidate for integration in multi-omics workflows.
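
The kind of iterative, convergent process described in this abstract can be sketched with generic alternating row and column rescaling (iterative proportional fitting). This is not the CONSTANd implementation: the function name `balance_matrix`, the target sums (every row sums to the number of columns, every column to the number of rows), the stopping rule, and the toy counts are all assumptions chosen for illustration.

```python
import numpy as np


def balance_matrix(counts, max_iter=200, tol=1e-9):
    """Alternately rescale rows and columns of a strictly positive matrix.

    Assumed targets: each row sums to n_cols and each column to n_rows,
    so all row sums are identical and all column sums are identical.
    """
    x = np.asarray(counts, dtype=float)
    if np.any(x <= 0):
        raise ValueError("this sketch assumes strictly positive entries")
    m, n = x.shape
    for _ in range(max_iter):
        x = x * (n / x.sum(axis=1, keepdims=True))  # fix row sums
        x = x * (m / x.sum(axis=0, keepdims=True))  # fix column sums
        # Column sums are exact here; stop once row sums also agree.
        if np.max(np.abs(x.sum(axis=1) - n)) < tol:
            break
    return x


# Toy count matrix: 3 features (rows) by 4 samples (columns).
counts = np.array([[10, 200, 35, 4],
                   [7, 150, 30, 9],
                   [12, 180, 40, 6]])

balanced = balance_matrix(counts)
```

After balancing, differences in overall sample depth (column totals) and overall feature abundance (row totals) no longer dominate, which is the sort of systematic effect the abstract says the method removes before a PCA plot or hierarchical clustering.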
