unseen species
Recently Published Documents


PLoS ONE ◽  
2021 ◽  
Vol 16 (7) ◽  
pp. e0253461
Author(s):  
Anna Tovo ◽  
Samuele Stivanello ◽  
Amos Maritan ◽  
Samir Suweis ◽  
Stefano Favaro ◽  
...  

Big data require new techniques to handle the information they come with. Here we consider four datasets (email communication, Twitter posts, Wikipedia articles and Gutenberg books) and propose a novel statistical framework to predict global statistics from random samples. More precisely, from a small sample of sent emails per sender, posts per hashtag and word occurrences, we infer the number of senders, hashtags and words in the whole dataset and how their abundances (e.g. the popularity of a hashtag) change across scales. Our approach is grounded in statistical ecology, as we map the inference of human activities onto the unseen species problem in biodiversity. Our findings may have applications to resource management in emails, collective attention monitoring on Twitter and language learning processes in word databases.


2021 ◽  
Vol 12 ◽  
Author(s):  
Sandra Wiegand ◽  
Hang T. Dam ◽  
Julian Riba ◽  
John Vollmers ◽  
Anne-Kristin Kaster

As of today, the majority of environmental microorganisms remain uncultured. They are therefore referred to as “microbial dark matter.” In the recent past, cultivation-independent methods like single-cell genomics (SCG) enabled the discovery of many previously unknown microorganisms, among them the Patescibacteria/Candidate Phyla Radiation (CPR). This approach was shown to be complementary to metagenomics; however, the development of additional and refined sorting techniques beyond the most commonly used fluorescence-activated cell sorting (FACS) is still desirable to enable additional downstream applications. Adding image information on the number and morphology of sorted cells would be beneficial, as would minimizing cell stress caused by sorting conditions such as staining or pressure. Recently, a novel cell sorting technique has been developed, a microfluidic single-cell dispenser, which assesses the number and morphology of the cell in each droplet by automated light microscopic processing. Here, we report for the first time the successful application of the newly developed single-cell dispensing system for label-free isolation of individual bacteria from a complex sample retrieved from a wastewater treatment plant, demonstrating the potential of this technique for single-cell genomics and other alternative downstream applications. The genome recovery success rate was above 80% with this technique: out of 880 sorted cells, 717 were successfully amplified. For 50.1% of these, analysis of the 16S rRNA gene was feasible and led to the sequencing of 50 sorted cells identified as Patescibacteria/CPR members. Subsequently, 27 single amplified genomes (SAGs) of 15 novel and distinct Patescibacteria/CPR members, representing yet unseen species, genera and families, could be captured and reconstructed. This phylogenetic distinctness of the recovered SAGs from available metagenome-assembled genomes (MAGs) is accompanied by the finding that these lineages, in whole or in part, have not been accessed by genome-resolved metagenomics of the same sample, thereby emphasizing the importance and opportunities of SCG.


2021 ◽  
Author(s):  
Aparajita Dutta ◽  
Kusum Kumari Singh ◽  
Ashish Anand

Deep learning models like convolutional neural networks (CNN) and recurrent neural networks (RNN) have been frequently used to identify splice sites from genome sequences. Most deep learning applications identify splice sites from a single species. Furthermore, the models generally identify and interpret only the canonical splice sites. However, a model capable of identifying both canonical and non-canonical splice sites from multiple species with comparable accuracy is more generalizable and robust. We choose some state-of-the-art CNN and RNN models and compare their performance in identifying novel canonical and non-canonical splice sites in Homo sapiens, Mus musculus, and Drosophila melanogaster. The RNN-based model named SpliceViNCI outperforms its counterparts in identifying splice sites from multiple species as well as on unseen species. SpliceViNCI maintains its performance when trained with imbalanced data, making it more robust. We observe that all the models perform better when trained with more than one species. SpliceViNCI outperforms its counterparts when trained with such an augmented dataset. We further extract and compare the features learned by SpliceViNCI when trained with single and multiple species. We validate the extracted features with knowledge from the literature.


2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Johan Gustafsson ◽  
Jonathan Robinson ◽  
Jens Nielsen ◽  
Lior Pachter

The incorporation of unique molecular identifiers (UMIs) in single-cell RNA-seq assays makes possible the identification of duplicated molecules, thereby facilitating the counting of distinct molecules from sequenced reads. However, we show that the naïve removal of duplicates can lead to a bias due to a “pooled amplification paradox,” and we propose an improved quantification method based on unseen species modeling. Our correction, called BUTTERFLY, uses a zero-truncated negative binomial estimator implemented in the kallisto bustools workflow. We demonstrate its efficacy across cell types and genes and show that in some cases it can invert the relative abundance of genes.


2021 ◽  
Author(s):  
Yuwei Bao ◽  
Jack Wadden ◽  
John R. Erb-Downward ◽  
Piyush Ranjan ◽  
Robert P. Dickson ◽  
...  

Single-molecule sequencers made by Oxford Nanopore provide results in real time as DNA passes through a nanopore and can eject a molecule after it has been partly sequenced. However, the computational challenge of deciding whether to keep or reject a molecule in real time has limited the application of this capability. We present SquiggleNet, the first deep learning model that can classify nanopore reads directly from their electrical signals. SquiggleNet operates faster than the DNA passes through the pore, allowing real-time classification and read ejection. When given the amount of sequencing data generated in one second, the classifier achieves significantly higher accuracy than base calling followed by sequence alignment. Our approach is also faster and requires an order of magnitude less memory than approaches based on alignment. SquiggleNet distinguished human from bacterial DNA with over 90% accuracy across test datasets from different flowcells and sample preparations, generalized to unseen species, and identified bacterial species in a human respiratory metagenome sample.


Author(s):  
Nived Rajaraman ◽  
Prafulla Chandra ◽  
Andrew Thangaraj ◽  
Ananda Theertha Suresh

2017 ◽  
Author(s):  
Haotian Teng ◽  
Minh Duc Cao ◽  
Michael B. Hall ◽  
Tania Duarte ◽  
Sheng Wang ◽  
...  

Sequencing by translocating DNA fragments through an array of nanopores is a rapidly maturing technology which offers faster and cheaper sequencing than other approaches. However, accurately deciphering the DNA sequence from the noisy and complex electrical signal is challenging. Here, we report Chiron, the first deep learning model to achieve end-to-end basecalling: directly translating the raw signal to DNA sequence without the error-prone segmentation step. We show that our model, trained with only a small set of 4000 reads, provides state-of-the-art basecalling accuracy even on previously unseen species. Chiron achieves basecalling speeds of over 2000 bases per second using desktop computer graphics processing units.


2016 ◽  
Vol 113 (47) ◽  
pp. 13283-13288 ◽  
Author(s):  
Alon Orlitsky ◽  
Ananda Theertha Suresh ◽  
Yihong Wu

Estimating the number of unseen species is an important problem in many scientific endeavors. Its most popular formulation, introduced by Fisher et al. [Fisher RA, Corbet AS, Williams CB (1943) J Animal Ecol 12(1):42–58], uses n samples to predict the number U of hitherto unseen species that would be observed if t⋅n new samples were collected. Of considerable interest is the largest ratio t between the number of new and existing samples for which U can be accurately predicted. In seminal works, Good and Toulmin [Good I, Toulmin G (1956) Biometrika 43(1–2):45–63] constructed an intriguing estimator that predicts U for all t ≤ 1. Subsequently, Efron and Thisted [Efron B, Thisted R (1976) Biometrika 63(3):435–447] proposed a modification that empirically predicts U even for some t > 1, but without provable guarantees. We derive a class of estimators that provably predict U all the way up to t ∝ log n. We also show that this range is the best possible and that the estimator’s mean-square error is near optimal for any t. Our approach yields a provable guarantee for the Efron–Thisted estimator and, in addition, a variant with stronger theoretical and experimental performance than existing methodologies on a variety of synthetic and real datasets. The estimators are simple, linear, computationally efficient, and scalable to massive datasets. Their performance guarantees hold uniformly for all distributions, and apply to all four standard sampling models commonly used across various scientific disciplines: multinomial, Poisson, hypergeometric, and Bernoulli product.
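
To make the formulation above concrete, here is a minimal sketch of the classical Good–Toulmin estimator, which the abstract cites as the baseline valid for t ≤ 1. It is not the new estimator class derived in the paper. The function name `good_toulmin` and the toy sample are assumptions made for illustration. The estimate is U = −Σ_{i≥1} (−t)^i Φ_i, where Φ_i is the number of species observed exactly i times.

```python
from collections import Counter

def good_toulmin(samples, t):
    """Classical Good-Toulmin estimate of the number of new species
    expected in t * n further samples, for 0 <= t <= 1.

    `samples` is any iterable of hashable species labels
    (hypothetical input format chosen for this sketch).
    """
    if not 0 <= t <= 1:
        raise ValueError("the plain Good-Toulmin series is only reliable for 0 <= t <= 1")
    species_counts = Counter(samples)               # species -> times observed
    prevalences = Counter(species_counts.values())  # i -> Phi_i (species seen exactly i times)
    return -sum((-t) ** i * phi for i, phi in prevalences.items())

# Toy data: Phi_1 = 2 (b, d), Phi_2 = 1 (a), Phi_3 = 1 (c)
toy = ["a", "a", "b", "c", "c", "c", "d"]
print(good_toulmin(toy, 1))    # 2
print(good_toulmin(toy, 0.5))  # 0.875
```

For t > 1 the alternating series grows geometrically in t, which is why the sketch rejects such inputs rather than returning an unstable value.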


2015 ◽  
Vol 370 (1675) ◽  
pp. 20140291 ◽  
Author(s):  
Daniel J. Laydon ◽  
Charles R. M. Bangham ◽  
Becca Asquith

A highly diverse T-cell receptor (TCR) repertoire is a fundamental property of an effective immune system, and is associated with efficient control of viral infections and other pathogens. However, direct measurement of total TCR diversity is impossible. The diversity is high and the frequency distribution of individual TCRs is heavily skewed; the diversity therefore cannot be captured in a blood sample. Consequently, estimators of the total number of TCR clonotypes that are present in the individual, in addition to those observed, are essential. This is analogous to the ‘unseen species problem’ in ecology. We review the diversity (species richness) estimators that have been applied to T-cell repertoires and the methods used to validate these estimators. We show that existing approaches have significant shortcomings, and frequently underestimate true TCR diversity. We highlight our recently developed estimator, DivE, which can accurately estimate diversity across a range of immunological and biological systems.
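
As an illustration of what a species richness estimator computes, the sketch below implements the bias-corrected Chao1 estimator, a widely used richness estimator from ecology. It is chosen here only as a familiar example: it is not DivE, and the abstract does not say which estimators the review covers. The function name `chao1` and the toy clonotype labels are assumptions.

```python
from collections import Counter

def chao1(samples):
    """Bias-corrected Chao1 richness estimate:
    S_obs + f1 * (f1 - 1) / (2 * (f2 + 1)),
    where f1 and f2 are the numbers of species (here: clonotypes)
    observed exactly once and exactly twice.
    """
    counts = Counter(samples)                 # clonotype -> times observed
    s_obs = len(counts)                       # observed richness
    f1 = sum(1 for c in counts.values() if c == 1)  # singletons
    f2 = sum(1 for c in counts.values() if c == 2)  # doubletons
    return s_obs + f1 * (f1 - 1) / (2 * (f2 + 1))

# Toy sample: 4 observed clonotypes, 2 singletons, 1 doubleton
toy = ["tcr1", "tcr1", "tcr2", "tcr3", "tcr3", "tcr3", "tcr4"]
print(chao1(toy))  # 4.5
```

Chao1 is generally interpreted as a lower bound on true richness, which is in line with the abstract's remark that existing approaches frequently underestimate TCR diversity.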


2009 ◽  
Vol 4 (4) ◽  
pp. 763-792 ◽  
Author(s):  
Hongmei Zhang ◽  
Hal Stern
