Algorithmic Exploration of Axiom Spaces for Efficient Similarity Search at Large Scale

Author(s):  
Tomáš Skopal ◽  
Tomáš Bartoš
2021 ◽  
Author(s):  
Bengt Ljungquist ◽  
Masood A Akram ◽  
Giorgio A Ascoli

Most functions of the nervous system depend on neuronal and glial morphology. Continuous advances in microscopic imaging and tracing software have made 3D reconstructions of arborizing dendrites, axons, and processes increasingly abundant, allowing their detailed study. However, efficient, large-scale methods to rank neural morphologies by similarity to an archetype are still lacking. Using the NeuroMorpho.Org database, we present similarity search software that enables fast morphological comparison of hundreds of thousands of neural reconstructions from any species, brain region, cell type, and preparation protocol. We compared the performance of different morphological measurements: 1) summary morphometrics calculated by L-Measure, 2) persistence vectors, a vectorized descriptor of branching structure, and 3) the combination of the two. In all cases, we also investigated the impact of applying dimensionality reduction using principal component analysis (PCA). We assessed qualitative performance by gauging the ability to rank neurons in order of visual similarity. Moreover, we quantified information content by examining explained variance and benchmarked the ability to identify occasional duplicate reconstructions of the same specimen. The results indicate that combining summary morphometrics and persistence vectors with applied PCA provides an information-rich characterization that enables efficient and precise comparison of neural morphology. The execution time scaled linearly with data set size, allowing seamless live searching through the entire NeuroMorpho.Org content in fractions of a second. We have deployed the similarity search function as an open-source online software tool, both through a user-friendly graphical interface and as an API for programmatic access.
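The pipeline described above (concatenate two descriptor sets, reduce with PCA, rank by similarity to a query) can be sketched roughly as follows. This is an illustrative sketch only, not the NeuroMorpho.Org implementation: the random arrays stand in for L-Measure morphometrics and persistence vectors, the column standardization and the Euclidean distance are assumptions (the abstract does not name a scaling step or a distance measure), and the function names `pca_reduce` and `rank_by_similarity` are hypothetical.

```python
import numpy as np


def pca_reduce(features, n_components):
    """Standardize columns, then project onto the top principal components."""
    centered = features - features.mean(axis=0)
    std = centered.std(axis=0)
    std[std == 0] = 1.0  # avoid division by zero for constant columns
    scaled = centered / std
    # Rows of vt are principal directions, ordered by decreasing variance.
    _, singular_values, vt = np.linalg.svd(scaled, full_matrices=False)
    explained = singular_values**2 / np.sum(singular_values**2)
    return scaled @ vt[:n_components].T, explained[:n_components]


def rank_by_similarity(reduced, query_index):
    """Return row indices sorted by Euclidean distance to the query row."""
    distances = np.linalg.norm(reduced - reduced[query_index], axis=1)
    return np.argsort(distances, kind="stable")


# Placeholder data (assumption): 200 "neurons", 20 summary morphometrics
# and 30 persistence-vector entries each.
rng = np.random.default_rng(0)
morphometrics = rng.normal(size=(200, 20))
persistence = rng.normal(size=(200, 30))
combined = np.hstack([morphometrics, persistence])
combined[1] = combined[0]  # simulate a duplicate reconstruction of neuron 0

reduced, explained = pca_reduce(combined, n_components=10)
ranking = rank_by_similarity(reduced, query_index=0)
print("closest to neuron 0:", ranking[:5])
print("variance explained by 10 components:", explained.sum())
```

A brute-force scan like this costs one distance computation per stored neuron, which is consistent with the linear scaling reported in the abstract; the simulated duplicate should appear at the top of the ranking next to the query itself.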


2018 ◽  
Vol 5 (1) ◽  
pp. 24-34
Author(s):  
I. P. Bangov ◽  
M. Moskovkina ◽  
B. P. Stojanov

Abstract This study continues the attempt to apply statistical processing to large-scale analytical data. A group of 3898 white wines, each with 11 analytical laboratory benchmarks, was analyzed by fingerprint similarity search in order to group the wines into separate clusters. The quality of the wines in each cluster was then characterized according to the individual laboratory parameters.


2017 ◽  
Author(s):  
Sanjoy Dasgupta ◽  
Charles F. Stevens ◽  
Saket Navlakha

Similarity search, such as identifying similar images in a database or similar documents on the Web, is a fundamental computing problem faced by many large-scale information retrieval systems. We discovered that the fly’s olfactory circuit solves this problem using a novel variant of a traditional computer science algorithm (called locality-sensitive hashing). The fly’s circuit assigns similar neural activity patterns to similar input stimuli (odors), so that behaviors learned from one odor can be applied when a similar odor is experienced. The fly’s algorithm, however, uses three new computational ingredients that depart from traditional approaches. We show that these ingredients can be translated to improve the performance of similarity search compared to traditional algorithms when evaluated on several benchmark datasets. Overall, this perspective helps illuminate the logic supporting an important sensory function (olfaction), and it provides a conceptually new algorithm for solving a fundamental computational problem.
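For readers unfamiliar with the traditional algorithm the abstract refers to, here is a minimal sketch of conventional locality-sensitive hashing using random hyperplanes. It illustrates only the baseline idea that similar inputs receive similar hash signatures; it is not the fly-inspired variant, whose three ingredients the abstract does not spell out. The function names, the dimensions, and the choice of the random-hyperplane scheme are all assumptions made for illustration.

```python
import numpy as np


def make_hyperplanes(dim, n_bits, seed=0):
    """Draw n_bits random hyperplanes (one Gaussian normal vector per bit)."""
    rng = np.random.default_rng(seed)
    return rng.normal(size=(n_bits, dim))


def lsh_signature(vector, hyperplanes):
    """One bit per hyperplane: 1 if the vector lies on its non-negative side."""
    return tuple((hyperplanes @ vector >= 0).astype(int).tolist())


def hamming(sig_a, sig_b):
    """Number of bit positions in which two signatures differ."""
    return sum(a != b for a, b in zip(sig_a, sig_b))


# Hypothetical 50-dimensional "odor" vectors.
rng = np.random.default_rng(42)
odor = rng.normal(size=50)
similar_odor = odor + 0.01 * rng.normal(size=50)  # small perturbation
opposite_odor = -odor                             # points the other way

planes = make_hyperplanes(dim=50, n_bits=32)
sig = lsh_signature(odor, planes)
print("bits differing for similar odor: ", hamming(sig, lsh_signature(similar_odor, planes)))
print("bits differing for opposite odor:", hamming(sig, lsh_signature(opposite_odor, planes)))
```

Vectors separated by a small angle rarely fall on opposite sides of a random hyperplane, so their signatures differ in few bits, and candidates for a query can be retrieved by comparing short signatures instead of full vectors.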


Author(s):  
Timothy Reynolds ◽  
Jason A. Bubier ◽  
Michael A. Langston ◽  
Elissa J. Chesler ◽  
Erich J. Baker

Abstract Disease diagnosis and treatment is challenging in part due to the misalignment of diagnostic categories with the underlying biology of disease. The evaluation of large-scale genomic experimental datasets is a compelling approach to refining the classification of biological concepts, such as disease. Well-established approaches, some of which rely on information theory or network analysis, quantitatively assess relationships among biological entities using gene annotations, structured vocabularies, and curated data sources. However, the gene annotations used in these evaluations are often sparse, potentially biased due to uneven study and representation in the literature, and constrained to the single species from which they were derived. In order to overcome these deficiencies inherent in the structure and sparsity of these annotated datasets, we developed a novel Network Enhanced Similarity Search (NESS) tool which takes advantage of multi-species networks of heterogeneous data to bridge sparsely populated datasets. NESS employs a random walk with restart algorithm across harmonized multi-species data, effectively compensating for sparsely populated and noisy genomic studies. We further demonstrate that it is highly resistant to spurious or sparse datasets and generates significantly better recapitulation of ground truth biological pathways than other similarity metrics alone. Furthermore, since NESS has been deployed as an embedded tool in the GeneWeaver environment, it can rapidly take advantage of curated multi-species networks to provide informative assertions of relatedness of any pair of biological entities or concepts, e.g., gene-gene, gene-disease, or phenotype-disease associations. NESS ultimately enables multi-species analysis applications to leverage model organism data to overcome the challenge of data sparsity in the study of human disease.
Availability and Implementation: Implementation available at https://geneweaver.org/ness. Source code freely available at https://github.com/treynr/ness.
Author summary: Finding consensus among large-scale genomic datasets is an ongoing challenge in the biomedical sciences. Harmonizing and analyzing such data is important because it allows researchers to mitigate the idiosyncrasies of experimental systems, alleviate study biases, and augment sparse datasets. Additionally, it allows researchers to utilize animal model studies and cross-species experiments to better understand biological function in health and disease. Here we provide a tool for integrating and analyzing heterogeneous functional genomics data using a graph-based model. We show how this type of analysis can be used to identify similar relationships among biological entities such as genes, processes, and disease through shared genomic associations. Our results indicate this approach is effective at reducing biases caused by sparse and noisy datasets. We show how this type of analysis can be used to aid the classification of gene function and the prioritization of genes involved in substance use disorders. In addition, our analysis reveals genes and biological pathways with shared association to multiple, co-occurring substance use disorders.
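The random walk with restart that NESS is said to employ can be illustrated with a generic textbook formulation. The sketch below is not the NESS source code: the toy graph, the node labels, the restart probability of 0.15, and the function name are all assumptions chosen for illustration. It iterates p ← (1 − r)·W·p + r·e, where W is the column-normalized adjacency matrix, e is the indicator vector of the seed node, and r is the restart probability.

```python
import numpy as np


def random_walk_with_restart(adjacency, seed_index, restart_prob=0.15,
                             tol=1e-10, max_iter=1000):
    """Return steady-state visiting probabilities for a walk restarting at the seed."""
    adjacency = np.asarray(adjacency, dtype=float)
    col_sums = adjacency.sum(axis=0)
    col_sums[col_sums == 0] = 1.0       # leave isolated nodes as zero columns
    transition = adjacency / col_sums   # column-stochastic transition matrix

    restart = np.zeros(adjacency.shape[0])
    restart[seed_index] = 1.0
    scores = restart.copy()
    for _ in range(max_iter):
        updated = (1 - restart_prob) * (transition @ scores) + restart_prob * restart
        if np.abs(updated - scores).sum() < tol:
            return updated
        scores = updated
    return scores


# Hypothetical heterogeneous graph: gene_a - disease_x - gene_b - phenotype_y,
# plus gene_c, which has no annotations (isolated node).
nodes = ["gene_a", "disease_x", "gene_b", "phenotype_y", "gene_c"]
adjacency = np.array([
    [0, 1, 0, 0, 0],
    [1, 0, 1, 0, 0],
    [0, 1, 0, 1, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0],
])

scores = random_walk_with_restart(adjacency, seed_index=nodes.index("gene_a"))
for name, score in sorted(zip(nodes, scores), key=lambda item: -item[1]):
    print(f"{name:12s} {score:.4f}")
```

The resulting scores act as a relatedness measure to the seed: nodes reachable through short or multiple paths score higher, and nodes with no path to the seed score zero. This is the sense in which a walk over a connected multi-species graph can relate entities that share no direct annotation.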

