Addressing dereplication crisis: Taxonomy-free reduction of massive genome collections using embeddings of protein content

2019 ◽  
Author(s):  
A. Viehweger ◽  
M. Hoelzer ◽  
C. Brandt

Abstract Many recent microbial genome collections curate hundreds of thousands of genomes. This volume complicates many genomic analyses such as taxon assignment because the associated computational burden is substantial. However, the number of representatives of each species is highly skewed towards human pathogens and model organisms. Thus many genomes contain little additional information and could be removed. We created a frugal dereplication method that can reduce massive genome collections based on genome sequence alone, without the need for manual curation or taxonomic information. We recently created a genome representation for bacteria and archaea called “nanotext”. This method embeds each genome in a low-dimensional vector of numbers. Extending nanotext, our proposed algorithm called “thinspace” uses these vectors to group and dereplicate similar genomes. We dereplicated the Genome Taxonomy Database (GTDB) from about 150 thousand genomes to fewer than 22 thousand. The resulting collection increases the percentage of classified reads in a metagenomic dataset by a factor of 5 compared to NCBI RefSeq and performs on par with both a larger and a manually curated GTDB subset. With thinspace, massive genome collections can be dereplicated on regular hardware, without affecting downstream results. It is released under a BSD-3 license (github.com/phiweger/thinspace).
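
The idea of grouping genomes by the similarity of their embedding vectors can be illustrated with a minimal sketch. This is not the thinspace algorithm itself, whose details the abstract does not give; it is a generic greedy filter using cosine similarity. The function names, the toy vectors and the 0.95 threshold are all assumptions made for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def greedy_dereplicate(embeddings, threshold=0.95):
    """Keep a genome only if it is not too similar to any genome already kept.

    embeddings: dict mapping genome name -> embedding vector (list of floats)
    threshold:  hypothetical similarity cutoff, not a value from the paper
    Returns the retained genome names in input order.
    """
    representatives = []
    for name, vector in embeddings.items():
        is_novel = all(
            cosine_similarity(vector, embeddings[rep]) < threshold
            for rep in representatives
        )
        if is_novel:
            representatives.append(name)
    return representatives


# Toy embeddings (made up): genome_b is nearly identical to genome_a.
toy = {
    "genome_a": [1.0, 0.0, 0.1],
    "genome_b": [0.99, 0.01, 0.1],
    "genome_c": [0.0, 1.0, 0.0],
}
kept = greedy_dereplicate(toy, threshold=0.95)
print(kept)
```

In this toy case the near-duplicate is dropped and the two dissimilar genomes remain. A greedy pass like this depends on input order; a real method would likely use a more careful clustering step.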

Genetics ◽  
2002 ◽  
Vol 162 (4) ◽  
pp. 1863-1873 ◽  
Author(s):  
J Slate ◽  
P M Visscher ◽  
S MacGregor ◽  
D Stevens ◽  
M L Tate ◽  
...  

Abstract Recent empirical evidence indicates that although fitness and fitness components tend to have low heritability in natural populations, they may nonetheless have relatively large components of additive genetic variance. The molecular basis of additive genetic variation has been investigated in model organisms but never in the wild. In this article we describe an attempt to map quantitative trait loci (QTL) for birth weight (a trait positively associated with overall fitness) in an unmanipulated, wild population of red deer (Cervus elaphus). Two approaches were used: interval mapping by linear regression within half-sib families and a variance components analysis of a six-generation pedigree of >350 animals. Evidence for segregating QTL was found on three linkage groups, one of which was significant at the genome-wide suggestive linkage threshold. To our knowledge this is the first time that a QTL for any trait has been mapped in a wild mammal population. It is hoped that this study will stimulate further investigations of the genetic architecture of fitness traits in the wild.


Sensors ◽  
2019 ◽  
Vol 19 (20) ◽  
pp. 4454 ◽  
Author(s):  
Marek Piorecky ◽  
Vlastimil Koudelka ◽  
Jan Strobl ◽  
Martin Brunovsky ◽  
Vladimir Krajca

Simultaneous recordings of electroencephalogram (EEG) and functional magnetic resonance imaging (fMRI) are at the forefront of technologies of interest to physicians and scientists because they combine the benefits of both modalities: better time resolution (hdEEG) and space resolution (fMRI). However, EEG measurements in the scanner contain an electromagnetic field that is induced in leads as a result of gradient switching, slight head movements and vibrations, and it is corrupted by changes in the measured potential because of the Hall phenomenon. The aim of this study is to design and test a methodology for inspecting hidden EEG structures with respect to artifacts. We propose a top-down strategy to obtain additional information that is not visible in a single recording. The time-domain independent component analysis algorithm was employed to obtain independent components and spatial weights. A nonlinear dimension reduction technique, t-distributed stochastic neighbor embedding, was used to create a low-dimensional space, which was then partitioned using density-based spatial clustering of applications with noise (DBSCAN). The relationships between the found data structure and the used criteria were investigated. As a result, we were able to extract information from the data structure regarding electrooculographic, electrocardiographic, electromyographic and gradient artifacts. This new methodology could facilitate the identification of artifacts and their residues from simultaneous EEG in fMRI.


2002 ◽  
Vol 06 (24) ◽  
pp. 958-965
Author(s):  
Jun Yu ◽  
Jian Wang ◽  
Huanming Yang

A coordinated international effort to sequence agricultural and livestock genomes has come to its time. While human genome and genomes of many model organisms (related to human health and basic biological interests) have been sequenced or plugged in the sequencing pipelines, agronomically important crop and livestock genomes have not been given high enough priority. Although we are facing many challenges in policy-making, grant funding, regional task emphasis, research community consensus and technology innovations, many initiatives are being announced and formulated based on the cost-effective and large-scale sequencing procedure, known as whole genome shotgun (WGS) sequencing that produces draft sequences covering a genome from 95 percent to 99 percent. Identified genes from such draft sequences, coupled with other resources, such as molecular markers, large-insert clones and cDNA sequences, provide ample information and tools to further our knowledge in agricultural and environmental biology in the genome era that just comes to its accelerated period. If the campaign succeeds, molecular biologists, geneticists and field biologists from all countries, rich or poor, would be brought to the same starting point and expect another astronomical increase of basic genomic information, ready to convert effectively into knowledge that will ultimately change our lives and environment into a greater and better future. We call upon national and international governmental agencies and organizations as well as research foundations to support this unprecedented movement.


2013 ◽  
Vol 203-204 ◽  
pp. 42-47
Author(s):  
Albert Prodan ◽  
Herman J.P. van Midden ◽  
Erik Zupanič ◽  
Rok Žitko

Charge density wave (CDW) ordering in NbSe3 and the structurally related quasi one-dimensional compounds is reconsidered. Since the modulated ground state is characterized by unstable nano-domains, the structural information obtained from diffraction experiments is to be supplemented by some additional information from a method able to reveal details on a unit cell level. Low-temperature (LT) scanning tunneling microscopy (STM) can resolve both the local atomic structure and the superimposed charge density modulation. It is shown that the established model for NbSe3 with two incommensurate (IC) modes, q1 = (0,0.241,0) and q2 = (0.5,0.260,0.5), locked in at T1=144K and T2=59K and separately confined to two of the three available types of bi-capped trigonal prismatic (BCTP) columns, must be modified. The alternative explanation is based on the existence of modulated layered nano-domains and is in good accord with the available LT STM results. These confirm i.a. the presence of both IC modes above the lower CDW transition temperature. Two BCTP columns, belonging to a symmetry-related pair, are as a rule alternatively modulated by the two modes. Such pairs of columns are ordered into unstable layered nano-domains, whose q1 and q2 sub-layers are easily interchanged. The mutually interchangeable sections of the two unstable IC modes keep a temperature dependent long-range ordering. Both modes can formally be replaced by a single highly inharmonic long-period commensurate CDW.


2019 ◽  
Vol 20 (1) ◽  
Author(s):  
Brian D. Ondov ◽  
Gabriel J. Starrett ◽  
Anna Sappington ◽  
Aleksandra Kostic ◽  
Sergey Koren ◽  
...  

Abstract The MinHash algorithm has proven effective for rapidly estimating the resemblance of two genomes or metagenomes. However, this method cannot reliably estimate the containment of a genome within a metagenome. Here, we describe an online algorithm capable of measuring the containment of genomes and proteomes within either assembled or unassembled sequencing read sets. We describe several use cases, including contamination screening and retrospective analysis of metagenomes for novel genome discovery. Using this tool, we provide containment estimates for every NCBI RefSeq genome within every SRA metagenome and demonstrate the identification of a novel polyomavirus species from a public metagenome.
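
The difference between resemblance and containment can be shown with a small sketch on exact k-mer sets. This is not the online algorithm the abstract describes and it does no MinHash sketching; it only illustrates why a genome fully present in a much larger metagenome has low resemblance but high containment. The sequences, the value k = 4 and all function names are made up for illustration.

```python
def kmers(sequence, k):
    """Return the set of all length-k substrings of a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def resemblance(a, b):
    """Jaccard index: shared k-mers divided by all k-mers in either set."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def containment(query, reference):
    """Fraction of the query's k-mers that also occur in the reference."""
    return len(query & reference) / len(query) if query else 0.0


# Toy data (made up): the "genome" is embedded in a longer "metagenome".
genome = "ACGTACGTGA"
metagenome = "TTTT" + genome + "GGGGCCCCAAAATTTTCCCC"

genome_kmers = kmers(genome, 4)
metagenome_kmers = kmers(metagenome, 4)

print("resemblance:", resemblance(genome_kmers, metagenome_kmers))
print("containment:", containment(genome_kmers, metagenome_kmers))
```

Because the union in the Jaccard index grows with the size of the metagenome, resemblance shrinks even when every k-mer of the genome is present, while containment is normalized by the genome alone.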


2013 ◽  
Vol 79 (18) ◽  
pp. 5728-5734 ◽  
Author(s):  
Gerardo U. Lopez ◽  
Charles P. Gerba ◽  
Akrum H. Tamimi ◽  
Masaaki Kitajima ◽  
Sheri L. Maxwell ◽  
...  

ABSTRACT Fomites can serve as routes of transmission for both enteric and respiratory pathogens. The present study examined the effect of low and high relative humidity on fomite-to-finger transfer efficiency of five model organisms from several common inanimate surfaces (fomites). Nine fomites representing porous and nonporous surfaces of different compositions were studied. Escherichia coli, Staphylococcus aureus, Bacillus thuringiensis, MS2 coliphage, and poliovirus 1 were placed on fomites in 10-μl drops and allowed to dry for 30 min under low (15% to 32%) or high (40% to 65%) relative humidity. Fomite-to-finger transfers were performed using 1.0 kg/cm2 of pressure for 10 s. Transfer efficiencies were greater under high relative humidity for both porous and nonporous surfaces. Most organisms on average had greater transfer efficiencies under high relative humidity than under low relative humidity. Nonporous surfaces had a greater transfer efficiency (up to 57%) than porous surfaces (<6.8%) under low relative humidity, as well as under high relative humidity (nonporous, up to 79.5%; porous, <13.4%). Transfer efficiency also varied with fomite material and organism type. The data generated can be used in quantitative microbial risk assessment models to assess the risk of infection from fomite-transmitted human pathogens and the relative levels of exposure to different types of fomites and microorganisms.


2001 ◽  
Vol 4 ◽  
pp. 22-63 ◽  
Author(s):  
Gerhard Hiss ◽  
Gunter Malle

Abstract The authors determine all the absolutely irreducible representations of degree up to 250 of quasi-simple finite groups, excluding groups that are of Lie type in their defining characteristic. Additional information is also given on the Frobenius-Schur indicators and the Brauer character fields of the representations.


2018 ◽  
Author(s):  
Jelle Slager ◽  
Rieza Aprianto ◽  
Jan-Willem Veening

ABSTRACT A precise understanding of the genomic organization into transcriptional units and their regulation is essential for our comprehension of opportunistic human pathogens and how they cause disease. Using single-molecule real-time (PacBio) sequencing we unambiguously determined the genome sequence of Streptococcus pneumoniae strain D39 and revealed several inversions previously undetected by short-read sequencing. Significantly, a chromosomal inversion results in antigenic variation of PhtD, an important surface-exposed virulence factor. We generated a new genome annotation using automated tools, followed by manual curation, reflecting the current knowledge in the field. By combining sequence-driven terminator prediction, deep paired-end transcriptome sequencing and enrichment of primary transcripts by Cappable-Seq, we mapped 1,015 transcriptional start sites and 748 termination sites. Using this new genomic map, we identified several new small RNAs (sRNAs), riboswitches (including twelve previously misidentified as sRNAs), and antisense RNAs. In total, we annotated 92 new protein-encoding genes, 39 sRNAs and 165 pseudogenes, bringing the S. pneumoniae D39 repertoire to 2,151 genetic elements. We report operon structures and observed that 9% of operons lack a 5’-UTR. The genome data is accessible in an online resource called PneumoBrowse (https://veeninglab.com/pneumobrowse) providing one of the most complete inventories of a bacterial genome to date. PneumoBrowse will accelerate pneumococcal research and the development of new prevention and treatment strategies.


2021 ◽  
Author(s):  
Thomas Beder ◽  
Olufemi Aromolaran ◽  
Juergen Doenitz ◽  
Sofia Tapanelli ◽  
Eunice Oluwatobiloba Adedeji ◽  
...  

Identifying essential genes on a genome scale is resource intensive and has been performed for only a few eukaryotes. For less studied organisms essentiality might be predicted by gene homology. However, this approach cannot be applied to non-conserved genes. Additionally, divergent essentiality information is obtained from studying single cells or whole, multi-cellular organisms, and particularly when derived from human cell line screens and human population studies. We employed machine learning across six model eukaryotes and 60,381 genes, using 41,635 features derived from sequence, gene functions and network topology. Within a leave-one-organism-out cross-validation, the classifiers showed a high generalizability with an average accuracy close to 80% in the left-out species. As a case study, we applied the method to Tribolium castaneum and validated predictions experimentally yielding similar performance. Finally, using the classifier based on the studied model organisms enabled linking the essentiality information of human cell line screens and population studies.
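
The leave-one-organism-out scheme mentioned above can be sketched as a simple split generator: each organism in turn is held out as the test set while the rest form the training set. This is only an illustration of the validation design, not the authors' pipeline. The organism labels, feature vectors and function name are hypothetical placeholders.

```python
def leave_one_organism_out(samples):
    """Yield (held_out, train, test) splits, one per organism.

    samples: list of (organism, features, label) tuples.
    All genes of the held-out organism go to the test set; genes from
    every other organism go to the training set.
    """
    organisms = sorted({organism for organism, _, _ in samples})
    for held_out in organisms:
        train = [s for s in samples if s[0] != held_out]
        test = [s for s in samples if s[0] == held_out]
        yield held_out, train, test


# Toy data (made up): (organism, feature vector, essential? label).
toy_samples = [
    ("organism_a", [0.1, 0.9], 1),
    ("organism_a", [0.8, 0.2], 0),
    ("organism_b", [0.2, 0.7], 1),
    ("organism_c", [0.9, 0.1], 0),
    ("organism_c", [0.3, 0.6], 1),
]

splits = list(leave_one_organism_out(toy_samples))
for held_out, train, test in splits:
    print(held_out, "train:", len(train), "test:", len(test))
```

Splitting by organism rather than by gene means that accuracy on the held-out set reflects how well a classifier transfers to a species it has never seen.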


2020 ◽  
Author(s):  
Matthew R Whiteway ◽  
Bruno Averbeck ◽  
Daniel A Butts

Abstract Decoding is a powerful approach for measuring the information contained in the activity of neural populations. As a result, decoding analyses are now used across a wide range of model organisms and experimental paradigms. However, typical analyses employ general purpose decoding algorithms that do not explicitly take advantage of the structure of neural variability, which is often low-dimensional and can thus be effectively characterized using latent variables. Here we propose a new decoding framework that exploits the low-dimensional structure of neural population variability by removing correlated variability that is unrelated to the decoded variable, then decoding the resulting denoised activity. We demonstrate the efficacy of this framework using simulated data, where the true upper bounds for decoding performance are known. A linear version of our decoder provides an estimator for the decoded variable that can be more efficient than other commonly used linear estimators such as linear discriminant analysis. In addition, our proposed decoding framework admits a simple extension to nonlinear decoding that compares favorably to standard feed-forward neural networks. By explicitly modeling shared population variability, the success of the resulting linear and nonlinear decoders also offers a new perspective on the relationship between shared variability and information contained in large neural populations.

