minicore: Fast scRNA-seq clustering with various distances

2021 ◽  
Author(s):  
Daniel N. Baker ◽  
Nathan Dyjack ◽  
Vladimir Braverman ◽  
Stephanie C. Hicks ◽  
Ben Langmead

Abstract: Single-cell RNA-sequencing (scRNA-seq) analyses typically begin by clustering a gene-by-cell expression matrix to empirically define groups of cells with similar expression profiles. We describe new methods and a new open-source library, minicore, for efficient k-means++ center finding and k-means clustering of scRNA-seq data. Minicore works with sparse count data, as it emerges from typical scRNA-seq experiments, as well as with the dense data produced by dimensionality reduction. Minicore's novel vectorized weighted reservoir sampling algorithm allows it to find initial k-means++ centers for a 4-million-cell dataset in 1.5 minutes using 20 threads. Minicore can cluster using Euclidean distance, but also supports a wider class of measures, such as the Jensen-Shannon divergence, Kullback-Leibler divergence, and Bhattacharyya distance, which can be applied directly to count data and probability distributions. Further, minicore produces lower-cost centerings more efficiently than scikit-learn for scRNA-seq datasets with millions of cells. With careful handling of priors, minicore implements these distance measures with only minor (<2-fold) speed differences among all distances. We show that a minicore pipeline consisting of k-means++, localsearch++ and mini-batch k-means can cluster a 4-million-cell dataset in minutes, using less than 10 GiB of RAM. This memory efficiency enables atlas-scale clustering on laptops and other commodity hardware. Finally, we report findings on which distance measures give clusterings that are most consistent with known cell type labels. Availability: The open source library is at https://github.com/dnbaker/minicore. Code used for experiments is at https://github.com/dnbaker/minicore-experiments.
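Minicore's implementation is vectorized C++; as a rough illustration of the underlying idea of D²-weighted k-means++ seeding with a pluggable divergence, here is a minimal NumPy sketch (function and variable names are hypothetical and not minicore's API):

```python
import numpy as np

def jsd(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two probability vectors."""
    p = np.asarray(p, float) / np.sum(p)
    q = np.asarray(q, float) / np.sum(q)
    m = 0.5 * (p + q)
    kl = lambda a, b: float(np.sum(a * np.log((a + eps) / (b + eps))))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def kmeanspp(X, k, dist, rng):
    """k-means++ seeding: each new center is sampled with probability
    proportional to its distance from the nearest existing center."""
    n = X.shape[0]
    centers = [int(rng.integers(n))]
    d = np.array([dist(x, X[centers[0]]) for x in X])
    for _ in range(k - 1):
        c = int(rng.choice(n, p=d / d.sum()))
        centers.append(c)
        d = np.minimum(d, [dist(x, X[c]) for x in X])
    return centers

rng = np.random.default_rng(0)
X = rng.dirichlet(np.ones(5), size=60)  # 60 cells as 5-gene probability profiles
centers = kmeanspp(X, 3, jsd, rng)
```

Swapping `jsd` for any other divergence that accepts two nonnegative vectors changes the clustering geometry without touching the seeding logic, which is the flexibility the abstract describes.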

Mathematics ◽  
2021 ◽  
Vol 9 (5) ◽  
pp. 555
Author(s):  
Chénangnon Frédéric Tovissodé ◽  
Sèwanou Hermann Honfo ◽  
Jonas Têlé Doumatè ◽  
Romain Glèlè Kakaï

Most existing flexible count distributions allow only approximate inference when used in a regression context. This work proposes a new framework to provide an exact and flexible alternative for modeling and simulating count data with various types of dispersion (equi-, under-, and over-dispersion). The new method, referred to as “balanced discretization”, consists of discretizing continuous probability distributions while preserving expectations. It is easy to generate pseudo random variates from the resulting balanced discrete distribution since it has a simple stochastic representation (probabilistic rounding) in terms of the continuous distribution. For illustrative purposes, we develop the family of balanced discrete gamma distributions that can model equi-, under-, and over-dispersed count data. This family of count distributions is appropriate for building flexible count regression models because the expectation of the distribution has a simple expression in terms of the parameters of the distribution. Using the Jensen–Shannon divergence measure, we show that under the equidispersion restriction, the family of balanced discrete gamma distributions is similar to the Poisson distribution. Based on this, we conjecture that while covering all types of dispersions, a count regression model based on the balanced discrete gamma distribution will allow recovering a near Poisson distribution model fit when the data are Poisson distributed.
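The stochastic representation mentioned above (probabilistic rounding) is simple to state in code. Below is a minimal NumPy sketch under the paper's description; the function names are hypothetical, not the authors' software:

```python
import numpy as np

def balanced_discretize(x, rng):
    """Probabilistic rounding: floor(x) + Bernoulli(frac(x)).
    This preserves the expectation of the underlying continuous draw."""
    f = np.floor(x)
    return (f + rng.binomial(1, x - f)).astype(int)

def balanced_discrete_gamma(shape, scale, size, rng):
    """Sketch of a balanced discrete gamma draw: round continuous
    gamma variates probabilistically."""
    return balanced_discretize(rng.gamma(shape, scale, size), rng)

rng = np.random.default_rng(1)
y = balanced_discretize(np.full(100_000, 2.3), rng)   # E[y] stays 2.3
z = balanced_discrete_gamma(4.0, 0.5, 100_000, rng)   # E[z] = 4.0 * 0.5 = 2.0
```

The expectation is preserved because E[floor(x) + Bernoulli(x - floor(x))] = floor(x) + (x - floor(x)) = x, which is the "balance" property the paper exploits for exact regression inference.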


2018 ◽  
Author(s):  
Tomohiro Nishiyama

In the field of statistics, many kinds of divergence functions have been studied as quantities that measure the discrepancy between two probability distributions. In the differential-geometric approach to statistics (information geometry), dually flat spaces play a key role. In a dually flat space there exist dual affine coordinate systems and strictly convex functions called potentials, and a canonical divergence is naturally introduced as a function of the affine coordinates and potentials. The canonical divergence satisfies a relational expression called the triangular relation, which can be regarded as a generalization of the law of cosines in Euclidean space. In this paper, we newly introduce two kinds of divergences. The first divergence is a function of affine coordinates, and it is consistent with the Jeffreys divergence for exponential or mixture families. For this divergence, we show that further relational equations and theorems analogous to those of Euclidean space hold in addition to the law of cosines. The second divergences are functions of potentials; they are consistent with the Bhattacharyya distance for exponential families and with the Jensen-Shannon divergence for mixture families, respectively. We derive an inequality between the first and the second divergences and show that the inequality is a generalization of Lin's inequality.
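For reference, the textbook forms of the three divergences named above, for discrete distributions $p$ and $q$ (these are the standard definitions, not the paper's generalized affine-coordinate constructions):

```latex
\begin{align*}
D_J(p,q) &= D_{\mathrm{KL}}(p \,\|\, q) + D_{\mathrm{KL}}(q \,\|\, p)
  && \text{(Jeffreys)} \\
D_B(p,q) &= -\ln \sum_i \sqrt{p_i q_i}
  && \text{(Bhattacharyya)} \\
D_{\mathrm{JS}}(p,q) &= \tfrac{1}{2} D_{\mathrm{KL}}\!\Big(p \,\Big\|\, \tfrac{p+q}{2}\Big)
  + \tfrac{1}{2} D_{\mathrm{KL}}\!\Big(q \,\Big\|\, \tfrac{p+q}{2}\Big)
  && \text{(Jensen--Shannon)}
\end{align*}
```

The paper's claim is that its potential-based divergences reduce to $D_B$ on exponential families and to $D_{\mathrm{JS}}$ on mixture families.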


2018 ◽  
Author(s):  
Sebastian Proost ◽  
Marek Mutwil

Abstract: The recent accumulation of gene expression data in the form of RNA sequencing creates unprecedented opportunities to study gene regulation and function. Furthermore, comparative analysis of expression data from multiple species can elucidate which functional gene modules are conserved across species, allowing the study of the evolution of these modules. However, performing such comparative analyses on raw data is not feasible for many biologists. Here, we present CoNekT (Co-expression Network Toolkit), an open-source, user-friendly web server that contains tools and interactive visualizations for comparative analyses of gene expression data and co-expression networks. These tools allow analysis and cross-species comparison of (i) gene expression profiles; (ii) co-expression networks; (iii) co-expressed clusters involved in specific biological processes; (iv) tissue-specific gene expression; and (v) expression profiles of gene families. To demonstrate these features, we constructed CoNekT-Plants for a green alga, seed plants and flowering plants (Picea abies, Chlamydomonas reinhardtii, Vitis vinifera, Arabidopsis thaliana, Oryza sativa, Zea mays and Solanum lycopersicum), thus providing a web tool with the broadest available collection of plant phyla. CoNekT-Plants is freely available from http://conekt.plant.tools, while the CoNekT source code and documentation can be found at https://github.molgen.mpg.de/proost/CoNekT/.


Synthese ◽  
2021 ◽  
Author(s):  
Ilkka Niiniluoto

Abstract: In the general problem of verisimilitude, we try to define the distance of a statement from a target, which is an informative truth about some domain of investigation. For example, the target can be a state description, a structure description, or a constituent of a first-order language (Sect. 1). In the problem of legisimilitude, the target is a deterministic or universal law, which can be expressed by a nomic constituent or a quantitative function involving the operators of physical necessity and possibility (Sect. 2). The special case of legisimilitude, where the target is a probabilistic law (Sect. 3), has been discussed by Roger Rosenkrantz (Synthese, 1980) and Ilkka Niiniluoto (Truthlikeness, 1987, Ch. 11.5). Their basic proposal is to measure the distance between two probabilistic laws by the Kullback–Leibler notion of divergence, which is a semimetric on the space of probability measures. This idea can be applied to probabilistic laws of coexistence and laws of succession, and the examples may involve discrete or continuous state spaces (Sect. 3). In this paper, these earlier studies are elaborated in four directions (Sect. 4). First, even though deterministic laws are limiting cases of probabilistic laws, the target-sensitivity of truthlikeness measures implies that the legisimilitude of probabilistic laws is not easily reducible to the deterministic case. Secondly, the Jensen-Shannon divergence is applied to mixed probabilistic laws which entail some universal laws. Thirdly, a new class of distance measures between probability distributions is proposed, so that their horizontal differences are taken into account in addition to vertical ones (Sect. 5). Fourthly, a solution is given for the epistemic problem of estimating degrees of probabilistic legisimilitude on the basis of empirical evidence (Sect. 6).
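As a concrete illustration of the Kullback–Leibler proposal (a generic sketch over a discrete state space, not the measures developed in the paper; the laws below are invented for illustration):

```python
import numpy as np

def kl(p, q, eps=1e-12):
    """Kullback-Leibler divergence D(p || q) for discrete distributions."""
    p = np.asarray(p, float)
    q = np.asarray(q, float)
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

# A target probabilistic law over two states, and two candidate laws
target = [0.7, 0.3]
law_a = [0.6, 0.4]
law_b = [0.2, 0.8]

# law_a lies closer to the target than law_b in the KL sense
d_a, d_b = kl(target, law_a), kl(target, law_b)
```

Note that KL divergence is not symmetric: in general `kl(p, q) != kl(q, p)`, so the ordering of target and candidate matters when using it as a legisimilitude measure.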


Author(s):  
Chénangnon Frédéric Tovissodé ◽  
Romain Glèlè Kakaï ◽  
Sèwanou Hermann Honfo ◽  
Jonas Têlé Doumatè

Most existing flexible count regression models allow only approximate inference. This work proposes a new framework to provide an exact and flexible alternative for modeling and simulating count data with various types of dispersion (equi-, under- and overdispersion). The new method, referred to as “balanced discretization”, consists of discretizing continuous probability distributions while preserving expectations. It is easy to generate pseudo-random variates from the resulting balanced discrete distribution since it has a simple stochastic representation in terms of the continuous distribution. For illustrative purposes, we have developed the family of balanced discrete gamma distributions, which can model equi-, under- and overdispersed count data. This family of count distributions is appropriate for building flexible count regression models because the expectation of the distribution has a simple expression in terms of the parameters of the distribution. Using the Jensen–Shannon divergence measure, we have shown that under the equidispersion restriction, the family of balanced discrete gamma distributions is similar to the Poisson distribution. Based on this, we conjecture that while covering all types of dispersion, a count regression model based on the balanced discrete gamma distribution will allow recovering a near-Poisson model fit when the data are Poisson distributed.


Technologies ◽  
2021 ◽  
Vol 9 (2) ◽  
pp. 26
Author(s):  
Antonios Lionis ◽  
Konstantinos P. Peppas ◽  
Hector E. Nistazakis ◽  
Andreas Tsigopoulos

The performance of a free-space optical (FSO) communications link suffers from the deleterious effects of weather conditions and atmospheric turbulence. In order to better estimate the reliability and availability of an FSO link, a suitable distribution needs to be employed. The accuracy of this model depends strongly on the atmospheric turbulence strength, which causes the scintillation effect. To this end, a variety of probability density functions were utilized to model the optical channel according to the strength of the refractive index structure parameter. Although many theoretical models have shown satisfactory performance, in reality they can differ significantly. This work employs an information-theoretic method, the Jensen–Shannon divergence, a symmetrization of the Kullback–Leibler divergence, to measure the similarity between different probability distributions. In doing so, a large experimental dataset of received signal strength measurements from a real FSO link is utilized. Additionally, the Pearson family of continuous probability distributions is employed to determine the best fit according to the mean, standard deviation, skewness and kurtosis of the modeled data.
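As a self-contained sketch of this model-selection use of the Jensen–Shannon divergence (synthetic data stands in for the experimental FSO measurements, and the two candidate models are chosen for illustration, not the paper's Pearson-family analysis):

```python
import numpy as np
from math import erf, log, sqrt

def jsd(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two discrete distributions."""
    p = np.asarray(p, float) + eps
    q = np.asarray(q, float) + eps
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: float(np.sum(a * np.log(a / b)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def norm_cdf(x, mu, sigma):
    return 0.5 * (1.0 + erf((x - mu) / (sigma * sqrt(2.0))))

# Synthetic "received signal strength" samples; a real analysis would use
# the experimental measurements instead
rng = np.random.default_rng(42)
samples = rng.lognormal(mean=0.0, sigma=0.5, size=20_000)

edges = np.linspace(samples.min(), samples.max(), 31)
empirical, _ = np.histogram(samples, bins=edges)
empirical = empirical / empirical.sum()

# Binned probabilities of two candidate channel models via their CDFs
lognorm_bins = np.diff([norm_cdf(log(e), 0.0, 0.5) for e in edges])
normal_bins = np.diff([norm_cdf(e, samples.mean(), samples.std()) for e in edges])

d_lognorm = jsd(empirical, lognorm_bins)
d_normal = jsd(empirical, normal_bins)
```

The candidate with the smaller divergence from the empirical histogram is the better-matching channel model; here the lognormal candidate fits the skewed data, while the moment-matched normal does not.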


2020 ◽  
pp. 002202212098237
Author(s):  
Wolfgang Messner

The past few decades have seen an explosion in the interest in cultural differences and their impact on many aspects of business management. A noticeable feature of most academic studies and practitioner approaches is the predominant use of national boundaries and group-level averages as delimiters and proxies for culture. However, this largely ignores the significance that intra-country differences and cross-country similarities can have for identifying psychological phenomena. This article argues for the importance of considering intra-cultural variation for establishing connections between two different cultures. It uses empirical distributions of cultural values that occur naturally within a country, thereby making intracultural differences interpretable and actionable. For measuring cross-country differences, the Gini/Weitzman overlapping index and the Kullback-Leibler divergence coefficient are used as difference measures between two distributions. The properties of these measures in comparison to traditional group-level mean-based distance measures are analyzed, and implications for cross-cultural and international business research are discussed.
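The two difference measures named above can be sketched directly on binned within-country distributions. A minimal NumPy example (the distributions here are invented for illustration, not empirical cultural-value data):

```python
import numpy as np

def overlap_index(p, q):
    """Gini/Weitzman overlapping index: the shared mass of two
    discretized distributions (1.0 means the distributions coincide)."""
    p = np.asarray(p, float) / np.sum(p)
    q = np.asarray(q, float) / np.sum(q)
    return float(np.minimum(p, q).sum())

def kl(p, q, eps=1e-12):
    """Kullback-Leibler divergence coefficient D(p || q)."""
    p = np.asarray(p, float) / np.sum(p)
    q = np.asarray(q, float) / np.sum(q)
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

# Hypothetical binned distributions of a cultural value score in two countries
country_a = np.array([5.0, 20.0, 40.0, 25.0, 10.0])
country_b = np.array([10.0, 30.0, 35.0, 20.0, 5.0])

distance = 1.0 - overlap_index(country_a, country_b)  # dissimilarity in [0, 1]
```

Unlike a difference of group-level means, both measures respond to the full shape of the two distributions, so two countries with identical average scores but different spreads still register as culturally distinct.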


Author(s):  
Robin Lovelace

Abstract: Geographic analysis has long supported transport plans that are appropriate to local contexts. Many incumbent ‘tools of the trade’ are proprietary and were developed to support growth in motor traffic, limiting their utility for transport planners who have been tasked with twenty-first-century objectives such as enabling citizen participation, reducing pollution, and increasing levels of physical activity by getting more people walking and cycling. Geographic techniques, such as route analysis, network editing, localised impact assessment and interactive map visualisation, have great potential to support modern transport planning priorities. The aim of this paper is to explore emerging open source tools for geographic analysis in transport planning, with reference to the literature and a review of open source tools that are already being used. A key finding is that a growing number of options exist, challenging the current landscape of proprietary tools. These can be classified as command-line interface, graphical user interface or web-based user interface tools, and by the framework in which they were implemented, with numerous tools released as R, Python and JavaScript packages, and QGIS plugins. The review found a diverse and rapidly evolving ‘ecosystem’ of tools, with 25 tools designed for geographic analysis to support transport planning outlined in terms of their popularity and functionality based on online documentation. They ranged in size from single-purpose tools such as the QGIS plugin AwaP to sophisticated stand-alone multi-modal traffic simulation software such as MATSim, SUMO and Veins. By re-using the most effective components from other open source projects, developers of open source transport planning tools can avoid ‘reinventing the wheel’ and focus on innovation; the ‘gamified’ A/B Street (https://github.com/dabreegster/abstreet/#abstreet) simulation software, based on OpenStreetMap, is a case in point.
The paper, the source code of which can be found at https://github.com/robinlovelace/open-gat, concludes that, although many of the tools reviewed are still evolving and further research is needed to understand their relative strengths and barriers to uptake, open source tools for geographic analysis in transport planning already hold great potential to help generate the strategic visions of change and evidence that is needed by transport planners in the twenty-first century.


2017 ◽  
Author(s):  
Fangzheng Xie ◽  
Mingyuan Zhou ◽  
Yanxun Xu

Abstract: Tumors are heterogeneous: a tumor sample usually consists of a set of subclones with distinct transcriptional profiles and potentially different degrees of aggressiveness and responses to drugs. Understanding tumor heterogeneity is therefore critical for precise cancer prognosis and treatment. In this paper, we introduce BayCount, a Bayesian decomposition method to infer tumor heterogeneity from highly over-dispersed RNA sequencing count data. Using negative binomial factor analysis, BayCount takes into account both between-sample and gene-specific random effects on raw counts of sequencing reads mapped to each gene. For posterior inference, we develop an efficient compound-Poisson-based blocked Gibbs sampler. Simulation studies show that BayCount is able to accurately recover the subclonal structure, including the number of subclones, the proportions of these subclones in each tumor sample, and the gene expression profiles in each subclone. For real-world examples, we apply BayCount to The Cancer Genome Atlas lung cancer and kidney cancer RNA sequencing count data and obtain biologically interpretable results. Our method represents the first effort to characterize tumor heterogeneity using RNA sequencing count data in a way that simultaneously removes the need to normalize the counts, achieves statistical robustness, and yields biologically and clinically meaningful insights. The R package BayCount implementing our model and algorithm is available for download.
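BayCount's factor-analysis model and Gibbs sampler are more involved than can be shown here, but the ingredient that handles over-dispersion is the negative binomial, which arises as a gamma-mixed Poisson. A minimal NumPy sketch of that representation (one common parameterization; not BayCount's sampler):

```python
import numpy as np

def gamma_poisson(r, p, size, rng):
    """Negative-binomial counts via the gamma-Poisson mixture:
    rate ~ Gamma(shape=r, scale=p/(1-p)), count ~ Poisson(rate).
    The mean is r*p/(1-p); the variance r*p/(1-p)**2 exceeds the mean,
    which is the over-dispersion seen in RNA-seq counts."""
    rate = rng.gamma(r, p / (1.0 - p), size)
    return rng.poisson(rate)

rng = np.random.default_rng(7)
counts = gamma_poisson(5.0, 0.5, 200_000, rng)  # mean 5, variance 10
```

A plain Poisson model forces variance equal to the mean, so the extra gamma layer is what lets a count model absorb sample- and gene-level random effects without mis-stating the noise.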


2017 ◽  
Author(s):  
Mickael Silva ◽  
Miguel Machado ◽  
Diogo N. Silva ◽  
Mirko Rossi ◽  
Jacob Moran-Gilad ◽  
...  

Abstract: Gene-by-gene approaches are becoming increasingly popular in bacterial genomic epidemiology and outbreak detection. However, there is a lack of open-source, scalable software for schema definition and allele calling for these methodologies. The chewBBACA suite was designed to assist users in the creation and evaluation of novel whole-genome or core-genome gene-by-gene typing schemas and subsequent allele calling in bacterial strains of interest. The software can run on a laptop or on high-performance clusters, making it useful for both small laboratories and large reference centers. ChewBBACA is available at https://github.com/B-UMMI/chewBBACA or as a Docker image at https://hub.docker.com/r/ummidock/chewbbaca/. Data summary: Assembled genomes used for the tutorial were downloaded from NCBI in August 2016 by selecting those submitted as Streptococcus agalactiae taxon or sub-taxa. All the assemblies have been deposited as a zip file in FigShare (https://figshare.com/s/9cbe1d422805db54cd52), where a file with the original ftp link for each NCBI directory is also available. Code for the chewBBACA suite is available at https://github.com/B-UMMI/chewBBACA, while the tutorial example is found at https://github.com/B-UMMI/chewBBACA_tutorial. We confirm all supporting data, code and protocols have been provided within the article or through supplementary data files. Impact statement: The chewBBACA software offers a computational solution for the creation, evaluation and use of whole-genome (wg) and core-genome (cg) multilocus sequence typing (MLST) schemas. It allows researchers to develop wg/cgMLST schemes for any bacterial species from a set of genomes of interest. The alleles identified by chewBBACA correspond to potential coding sequences, possibly offering insights into the correspondence between the genetic variability identified and phenotypic variability.
The software performs allele calling in a matter of seconds to minutes per strain on a laptop but is easily scalable for the analysis of large datasets of hundreds of thousands of strains using multiprocessing options. The chewBBACA software thus provides an efficient and freely available open-source solution for gene-by-gene methods. Moreover, the ability to perform these tasks locally is desirable when the submission of raw data to a central repository or web services is hindered by data protection policies or ethical or legal concerns.

