MetaGraph: Indexing and Analysing Nucleotide Archives at Petabase-scale

Author(s):  
Mikhail Karasikov ◽  
Harun Mustafa ◽  
Daniel Danciu ◽  
Marc Zimmermann ◽  
Christopher Barber ◽  
...  

Abstract The amount of biological sequencing data available in public repositories is growing exponentially, forming an invaluable biomedical research resource. Yet, making all this sequencing data searchable and easily accessible to life science and data science researchers is an unsolved problem. We present MetaGraph, a versatile framework for the scalable analysis of extensive sequence repositories. MetaGraph efficiently indexes vast collections of sequences to enable fast search and comprehensive analysis. A wide range of underlying data structures offer different practically relevant trade-offs between the space taken by the index and its query performance. MetaGraph provides a flexible methodological framework allowing index construction to be scaled from consumer laptops to distribution onto a cloud compute cluster for processing terabases to petabases of input data. Achieving compression ratios of up to 1,000-fold over the already compressed raw input data, MetaGraph can represent the content of large sequencing archives in the working memory of a single compute server. We demonstrate our framework’s scalability by indexing over 1.4 million whole genome sequencing (WGS) records from NCBI’s Sequence Read Archive, representing a total input of more than three petabases.

Besides demonstrating the utility of MetaGraph indexes on key applications, such as experiment discovery, sequence alignment, error correction, and differential assembly, we make a wide range of indexes available as a community resource, including indexes over 450,000 microbial WGS records, more than 110,000 fungal WGS records, and more than 20,000 whole metagenome sequencing records. A subset of these indexes is made available online for interactive queries. All indexes created from public data, comprising in total more than 1 million records, are available for download or usage in the cloud.

As an example of our indexes’ integrative analysis capabilities, we introduce the concept of differential assembly, which allows for the extraction of sequences present in a foreground set of samples but absent in a given background set. We apply this technique to differentially assemble contigs and identify pathogenic agents transmitted via human kidney transplants. In a second example, we indexed more than 20,000 human RNA-Seq records from the TCGA and GTEx cohorts and use them to extract transcriptome features that are hard to characterize using a classical linear reference. We discovered over 200 trans-splicing events in GTEx and found broad evidence for tissue-specific non-A-to-I RNA editing in GTEx and TCGA.
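The differential-assembly idea admits a compact sketch. The following Python example is a conceptual illustration only, not MetaGraph's API or its succinct graph representation: each sample is modeled as a plain k-mer set, and maximal stretches of a foreground sequence whose k-mers are absent from the background are stitched back into contigs.

```python
from typing import Iterable, Set

def kmers(seq: str, k: int) -> Iterable[str]:
    """Yield all overlapping k-mers of a sequence."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]

def differential_contigs(foreground: str, background: Set[str], k: int = 31):
    """Return maximal stretches of `foreground` whose k-mers are all
    absent from the `background` k-mer set (a toy stand-in for
    differential assembly; real indexes use succinct de Bruijn graphs)."""
    contig, out = [], []
    for kmer in kmers(foreground, k):
        if kmer in background:
            if contig:  # close the current background-free stretch
                out.append(contig[0] + "".join(s[-1] for s in contig[1:]))
                contig = []
        else:
            contig.append(kmer)
    if contig:
        out.append(contig[0] + "".join(s[-1] for s in contig[1:]))
    return out

# Toy usage with k=5: only the middle of the read is foreground-specific.
bg = set(kmers("ACGTACGTACGTACGT", 5))
fg = "ACGTACGTTTTTTTTTACGTACGT"
print(differential_contigs(fg, bg, k=5))   # -> ['ACGTTTTTTTTTACG']
```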

2019 ◽  
Vol 6 (1) ◽  
Author(s):  
Luca Pappalardo ◽  
Paolo Cintia ◽  
Alessio Rossi ◽  
Emanuele Massucco ◽  
Paolo Ferragina ◽  
...  

Abstract Soccer analytics is attracting increasing interest in academia and industry, thanks to the availability of sensing technologies that provide high-fidelity data streams for every match. Unfortunately, these detailed data are owned by specialized companies and hence are rarely publicly available for scientific research. To fill this gap, this paper describes the largest open collection of soccer-logs ever released, containing all the spatio-temporal events (passes, shots, fouls, etc.) that occurred during each match for an entire season of seven prominent soccer competitions. Each match event contains information about its position, time, outcome, player, and characteristics. The nature of team sports like soccer, halfway between the abstraction of a game and the reality of complex social systems, combined with the unique size and composition of this dataset, provides an ideal ground for tackling a wide range of data science problems, including the measurement and evaluation of performance, both at the individual and the collective level, and the determinants of success and failure.
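As a minimal illustration of the analyses such event logs enable, the sketch below loads a per-competition JSON file and counts passes per player. The file name and field names ("events_England.json", "eventName", "playerId", "positions", "eventSec") are assumptions inferred from the schema described above, not a verified specification of the released dataset.

```python
import json
from collections import Counter

# Load one competition's event log (file name is an assumption).
with open("events_England.json") as f:
    events = json.load(f)

# Count passes per player as a minimal example of event-level analysis.
passes = Counter(
    e["playerId"] for e in events if e.get("eventName") == "Pass"
)
print(passes.most_common(5))

# Each event also carries pitch coordinates and a timestamp
# (field names again assumed from the description above).
first = events[0]
x, y = first["positions"][0]["x"], first["positions"][0]["y"]
print(f"first event at ({x}, {y}) after {first['eventSec']:.1f} s")
```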


Author(s):  
Kuan-Hao Chao ◽  
Kirston Barton ◽  
Sarah Palmer ◽  
Robert Lanfear

Abstract sangeranalyseR is a feature-rich, free, and open-source R package for processing Sanger sequencing data. It allows users to go from loading reads to saving aligned contigs in a few lines of R code by using sensible defaults for most actions. It also provides complete flexibility in determining how individual reads and contigs are processed, both at the command line in R and via interactive Shiny applications. sangeranalyseR provides a wide range of options for all steps in Sanger processing pipelines, including trimming reads, detecting secondary peaks, viewing chromatograms, detecting indels and stop codons, aligning contigs, estimating phylogenetic trees, and more. Input data can be in either ABIF or FASTA format. sangeranalyseR comes with extensive online documentation and outputs aligned and unaligned reads and contigs in FASTA format, along with detailed interactive HTML reports. sangeranalyseR supports the use of colourblind-friendly palettes for viewing alignments and chromatograms. It is released under an MIT licence and available for all platforms on Bioconductor (https://bioconductor.org/packages/sangeranalyseR) and on GitHub (https://github.com/roblanf/sangeranalyseR).
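The quality-trimming step at the start of such pipelines is a useful illustration. The snippet below implements the modified Mott algorithm, a standard approach for trimming Sanger reads by quality; it is a conceptual Python sketch of the technique, not sangeranalyseR's R interface.

```python
def mott_trim(quals, cutoff=0.05):
    """Modified Mott trimming: convert Phred scores to error
    probabilities, accumulate (cutoff - p_err), and keep the segment
    with the maximal running sum (resetting the sum at zero)."""
    best_sum = 0.0
    best_start = best_end = 0
    run_sum, run_start = 0.0, 0
    for i, q in enumerate(quals):
        run_sum += cutoff - 10 ** (-q / 10)   # reward high-quality bases
        if run_sum <= 0:                      # restart after a bad stretch
            run_sum, run_start = 0.0, i + 1
        elif run_sum > best_sum:
            best_sum, best_start, best_end = run_sum, run_start, i + 1
    return best_start, best_end               # half-open trimmed range

# Example: low-quality tails are removed, the solid middle is kept.
quals = [8, 10, 35, 40, 42, 41, 38, 12, 9]
start, end = mott_trim(quals)
print(quals[start:end])   # -> [35, 40, 42, 41, 38]
```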


2021 ◽  
Vol 99 (2) ◽  
Author(s):  
Yuhua Fu ◽  
Pengyu Fan ◽  
Lu Wang ◽  
Ziqiang Shu ◽  
Shilin Zhu ◽  
...  

Abstract Despite the broad variety of available microRNA (miRNA) research tools and methods, their application to the identification, annotation, and target prediction of miRNAs in nonmodel organisms is still limited. In this study, we collected nearly all public sRNA-seq data to improve the annotation of known miRNAs and to identify novel miRNAs that have not yet been annotated in pigs (Sus scrofa). We newly annotated 210 mature sequences in known miRNAs and found that 43 of the known miRNA precursors were problematic owing to redundant or missing annotations or incorrect sequences. We also predicted 811 novel miRNAs with high confidence, twice the current number of known miRNAs for pigs in miRBase. In addition, we proposed a correlation-based strategy to predict target genes for miRNAs using a large amount of sRNA-seq and RNA-seq data. We found that the correlation-based strategy provided additional evidence of expression compared with traditional target prediction methods. It also identified regulatory pairs controlled by non-binding sites with a particular pattern, providing a valuable complement for studying the mechanisms by which miRNAs regulate gene expression. In summary, our study improved the annotation of known miRNAs, identified a large number of novel miRNAs, and predicted target genes for all pig miRNAs using massive public data. This large-data-based strategy is also applicable to other nonmodel organisms with incomplete annotation information.
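The correlation-based strategy lends itself to a small sketch: given matched miRNA and mRNA expression profiles across many samples, candidate targets are genes whose expression correlates strongly and negatively with the miRNA. The Python example below illustrates this idea on simulated data; the cutoffs are placeholders, not the study's actual pipeline.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_samples = 200

# Toy matched expression profiles across the same samples.
mirna = rng.normal(size=n_samples)                         # one miRNA
genes = rng.normal(size=(50, n_samples))                   # 50 candidate genes
genes[7] = -0.8 * mirna + rng.normal(0.0, 0.5, n_samples)  # a true target

# Correlation-based candidates: strong negative Pearson r with a small
# p-value (both cutoffs are illustrative placeholders).
candidates = []
for g, profile in enumerate(genes):
    r, p = stats.pearsonr(mirna, profile)
    if r < -0.5 and p < 1e-6:
        candidates.append((g, round(r, 2), p))

print(candidates)   # gene 7 should be recovered
```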


2021 ◽  
Vol 7 (1) ◽  
Author(s):  
George Gillard ◽  
Ian M. Griffiths ◽  
Gautham Ragunathan ◽  
Ata Ulhaq ◽  
Callum McEwan ◽  
...  

Abstract Combining external control with long spin lifetimes and coherence is a key challenge for solid-state spin qubits. Tunnel coupling with an electron Fermi reservoir provides robust charge-state control in semiconductor quantum dots, but results in undesired relaxation of electron and nuclear spins through mechanisms that are not fully understood. Here, we unravel the contributions of tunnelling-assisted and phonon-assisted spin relaxation mechanisms by systematically adjusting the tunnel coupling over a wide range, including the limit of an isolated quantum dot. These experiments reveal fundamental limits and trade-offs of quantum dot spin dynamics: while reduced tunnelling can be used to achieve electron spin qubit lifetimes exceeding 1 s, the optical spin initialisation fidelity drops below 80%, limited by Auger recombination. The comprehensive understanding of electron-nuclear spin relaxation attained here provides a roadmap for the design of optimal operating conditions in quantum dot spin qubits.


2021 ◽  
Vol 11 (13) ◽  
pp. 5859
Author(s):  
Fernando N. Santos-Navarro ◽  
Yadira Boada ◽  
Alejandro Vignoni ◽  
Jesús Picó

Optimal gene expression is central to the development of both bacterial expression systems for heterologous protein production and microbial cell factories for industrial metabolite production. Our goal is to fulfill industry-level overproduction demands optimally, as measured by the key performance metrics titer, productivity rate, and yield (TRY). Here we use a multiscale model incorporating the dynamics of (i) the cell population in the bioreactor, (ii) substrate uptake, and (iii) the interaction between the cell host and the expression of the protein of interest. Our model predicts the cell growth rate and the distribution of cell mass between enzymes of interest and host enzymes as a function of substrate uptake and of the main lab-accessible gene expression characteristics: promoter strength, gene copy number, and ribosome binding site strength. We evaluated the differential roles of gene transcription and translation in shaping TRY trade-offs for a wide range of expression levels, as well as the sensitivity of the TRY space to variations in substrate availability. Our results show that, at low expression levels, gene transcription mainly defines TRY and gene translation has a limited effect, whereas at high expression levels TRY depends on the product of both, in agreement with experiments in the literature.
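A stripped-down version of such a TRY calculation can be written as a three-state batch bioreactor ODE from which titer, productivity rate, and yield are read off at harvest. The Monod-type model and parameter values below are generic placeholders, not the authors' host-aware multiscale model.

```python
from scipy.integrate import solve_ivp

# Generic batch bioreactor sketch: biomass X, substrate S, product P.
# All parameters are illustrative placeholders, not fitted values.
mu_max, Ks = 0.6, 0.5   # 1/h, g/L (Monod growth kinetics)
Yxs = 0.4               # g biomass per g substrate
qp = 0.05               # g product per g biomass per h

def batch(t, y):
    X, S, P = y
    mu = mu_max * S / (Ks + S)
    return [mu * X,           # dX/dt: growth
            -mu * X / Yxs,    # dS/dt: substrate uptake
            qp * X]           # dP/dt: product formation

S0, t_end = 20.0, 24.0
sol = solve_ivp(batch, (0.0, t_end), [0.1, S0, 0.0])
X, S, P = sol.y[:, -1]

titer = P                        # g/L at harvest
productivity = P / t_end         # g/L/h
yield_ = P / (S0 - S)            # g product per g substrate consumed
print(f"T={titer:.2f} g/L  R={productivity:.3f} g/L/h  Y={yield_:.3f} g/g")
```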


Microbiome ◽  
2021 ◽  
Vol 9 (1) ◽  
Author(s):  
David Pellow ◽  
Alvah Zorea ◽  
Maraike Probst ◽  
Ori Furman ◽  
Arik Segal ◽  
...  

Abstract
Background: Metagenomic sequencing has led to the identification and assembly of many new bacterial genome sequences. These bacteria often contain plasmids: usually small, circular, double-stranded DNA molecules that may transfer across bacterial species and confer antibiotic resistance. These plasmids are generally less studied and understood than their bacterial hosts. Part of the reason for this is the lack of computational tools enabling the analysis of plasmids in metagenomic samples.
Results: We developed SCAPP (Sequence Contents-Aware Plasmid Peeler), an algorithm and tool to assemble plasmid sequences from metagenomic sequencing. SCAPP builds on some key ideas from the Recycler algorithm while improving plasmid assemblies by integrating biological knowledge about plasmids. We compared the performance of SCAPP to Recycler and metaplasmidSPAdes on simulated metagenomes, real human gut microbiome samples, and a human gut plasmidome dataset that we generated. We also created plasmidome and metagenome data from the same cow rumen sample and used the parallel sequencing data to create a novel assessment procedure. Overall, SCAPP outperformed Recycler and metaplasmidSPAdes across this wide range of datasets.
Conclusions: SCAPP is an easy-to-use Python package that enables the assembly of full plasmid sequences from metagenomic samples. It outperformed existing metagenomic plasmid assemblers in most cases and assembled novel and clinically relevant plasmids in samples we generated, such as a human gut plasmidome. SCAPP is open-source software available from https://github.com/Shamir-Lab/SCAPP.
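The cycle-peeling idea that SCAPP inherits from Recycler can be sketched compactly: enumerate circular paths in the assembly graph and prefer those with uniform read coverage. The snippet below is a conceptual illustration using plain networkx, not SCAPP's actual algorithm, which additionally integrates plasmid-specific biological signals.

```python
import statistics
import networkx as nx

# Toy assembly graph: nodes are contigs annotated with mean read coverage.
g = nx.DiGraph()
g.add_edges_from(
    [("a", "b"), ("b", "c"), ("c", "a"),   # uniform-coverage cycle
     ("c", "d"), ("d", "c")]               # uneven side cycle
)
coverage = {"a": 30.0, "b": 32.0, "c": 29.0, "d": 5.0}

def cycle_cv(cycle):
    """Coefficient of variation of coverage along a cycle; a low CV
    suggests a self-consistent (plasmid-like) circular path."""
    covs = [coverage[n] for n in cycle]
    return statistics.pstdev(covs) / statistics.mean(covs)

# Enumerate simple cycles and peel the most coverage-uniform one.
cycles = list(nx.simple_cycles(g))
best = min(cycles, key=cycle_cv)
print(best, round(cycle_cv(best), 3))   # the a-b-c cycle wins over c-d
```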


Urolithiasis ◽  
2017 ◽  
Vol 46 (4) ◽  
pp. 333-341 ◽  
Author(s):  
Léa Huguet ◽  
Marine Le Dudal ◽  
Marine Livrozet ◽  
Dominique Bazin ◽  
Vincent Frochot ◽  
...  

2020 ◽  
Vol 8 ◽  
Author(s):  
Devasis Bassu ◽  
Peter W. Jones ◽  
Linda Ness ◽  
David Shallcross

Abstract In this paper, we present a theoretical foundation for representing a data set as a measure in a very large, hierarchically parametrized family of positive measures, whose parameters can be computed explicitly (rather than estimated by optimization), and illustrate its applicability to a wide range of data types. The preprocessing step then consists of representing data sets as simple measures. The theoretical foundation consists of a dyadic product formula representation lemma and a visualization theorem. We also define an additive multiscale noise model that can be used to sample from dyadic measures, and a more general multiplicative multiscale noise model that can be used to perturb continuous functions, Borel measures, and dyadic measures. The first two results are based on theorems in [15, 3, 1]. The representation rests on the very simple concept of a dyadic tree defined on the universe of a data set and hence is widely applicable, easily understood, and easily computed. Since a data sample is represented as a measure, subsequent analysis can exploit statistical and measure-theoretic concepts and theories. Because the parameters are simply and explicitly computable, easily interpretable, and easily visualizable, we hope that this approach will be broadly useful to mathematicians, statisticians, and computer scientists who are intrigued by or involved in data science, including its mathematical foundations.
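The dyadic product formula underlying the representation lemma has a classical form that makes the explicit parametrization visible. The rendering below is a sketch of that standard identity for a positive measure on the unit interval, not a verbatim restatement of the paper's lemma.

```latex
% Dyadic product formula (sketch). D denotes the dyadic subintervals of
% [0,1); h_I is the L^\infty-normalized Haar function on I (+1 on the
% left child I_l, -1 on the right child I_r).
\[
  \mu \;=\; \mu\bigl([0,1)\bigr) \prod_{I \in D} \bigl( 1 + a_I\, h_I \bigr)\, dx,
  \qquad
  a_I \;=\; \frac{\mu(I_l) - \mu(I_r)}{\mu(I)} \;\in\; [-1, 1].
\]
% Each a_I is computed directly from the two child masses, so the
% parameters of a data set (viewed as an empirical measure) follow from
% counting points in dyadic bins; no optimization is required.
```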


2021 ◽  
Vol 36 (Supplement_1) ◽  
Author(s):  
L Girardi ◽  
M Serdaroğulları ◽  
C Patassini ◽  
S Caroselli ◽  
M Costa ◽  
...  

Abstract
Study question: What is the effect of varying diagnostic thresholds on the accuracy of Next Generation Sequencing (NGS)-based preimplantation genetic testing for aneuploidies (PGT-A)?
Summary answer: When single trophectoderm biopsies are tested, the employment of an 80% upper threshold increases mosaic calls and false-negative aneuploidy results compared with more stringent thresholds.
What is known already: Trophectoderm (TE) biopsy coupled with NGS-based PGT-A technologies is able to accurately predict the Inner Cell Mass’ (ICM) constitution when uniform whole-chromosome aneuploidies are considered. However, minor technical and biological inconsistencies in NGS procedures and biopsy specimens can result in subtle variability in analytical results. In this context, the stringency of the thresholds employed for diagnostic calls can lead to incorrect classification of uniformly aneuploid embryos into the mosaic category, ultimately affecting PGT-A accuracy. In this study, we evaluated the diagnostic predictivity of different aneuploidy classification criteria by blinded analysis of chromosome copy number values (CNV) in multifocal blastocyst biopsies.
Study design, size, duration: The accuracy of different aneuploidy diagnostic cut-offs was assessed by comparing chromosomal CNV in intra-blastocyst multifocal biopsies. Enrolled embryos were donated for research between June and September 2020. The Institutional Review Board at the Near East University approved the study (project: YDU/2019/70–849). Embryos diagnosed with uniform chromosomal alterations (single or multiple) in their clinical TE biopsy (n = 27) were disaggregated into 5 portions: the ICM and 4 TE biopsies. Overall, 135 specimens were collected and analysed.
Participants/materials, setting, methods: Twenty-seven donated blastocysts were warmed and disaggregated into TE biopsies and ICM (n = 135 biopsies). PGT-A analysis was performed using the Ion ReproSeq PGS kit and Ion S5 sequencer (ThermoFisher). Sequencing data were blindly analysed with Ion Reporter software. Intra-blastocyst comparison of raw NGS data was performed employing different thresholds commonly used for aneuploidy classification. CNV for each chromosome were reported as aneuploid according to 70% or 80% thresholds. Categorical variables were compared using Fisher’s exact test.
Main results and the role of chance: In this study, a total of 50 aneuploid patterns in 27 disaggregated embryos were explored. Single TE biopsy results were considered true positive when they displayed the same alteration detected in the ICM at levels above the 70% or 80% thresholds. Alternatively, alterations detected in the euploid or mosaic range were considered false-negative aneuploidy results. When the 70% threshold was applied, aneuploidy findings were confirmed in 94.5% of the TE biopsies analysed (n = 189/200; 95% CI = 90.37–97.22), while 5.5% showed a mosaic profile (50–70%) despite a uniformly abnormal ICM. The positive (PPV) and negative predictive values (NPV) per chromosome were 100.0% (n = 189/189; 95% CI = 98.07–100.00) and 99.5% (n = 2192/2203; 95% CI = 99.11–99.75), respectively. When the upper cut-off was experimentally placed at 80% of abnormal cells, a significant decrease (p-value = 0.0097) in the percentage of confirmed aneuploid calls was observed (86.5%; n = 173/200; 95% CI = 80.97–90.91), resulting in mosaicism overcalling, especially in the high range (50–80%). This less stringent aneuploidy cut-off yielded an extremely high PPV (100.0%; n = 173/173; 95% CI = 97.89–100.00), while the NPV decreased to 98.8% (n = 2192/2219; 95% CI = 98.30–99.23). Furthermore, no additional true mosaic patterns were identified with the use of wide-range thresholds for aneuploidy classification.
Limitations, reasons for caution: This approach involved the analysis of aneuploidy CNV thresholds at the embryo level and lacked genotyping-based confirmation analysis. Moreover, aneuploid embryos with known meiotic partial deletions/duplications were not included.
Wider implications of the findings: The use of wide thresholds for detecting intermediate chromosomal CNV up to 80% does not improve the ability of PGT-A to discriminate true mosaic from uniformly aneuploid embryos and lowers overall diagnostic accuracy. Hence, a proportion of the embryos diagnosed as mosaic using wide calling thresholds may actually be uniformly aneuploid and inadvertently transferred.
Trial registration number: N/A
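The headline comparison can be checked arithmetically from the counts quoted above. The sketch below recomputes the confirmation rates, Fisher's exact test, and the predictive values; it uses only numbers reported in the abstract.

```python
from scipy import stats

# Confirmed aneuploid TE calls out of 200, as reported above.
conf_70, n = 189, 200   # 70% threshold: 94.5% confirmed
conf_80 = 173           # 80% threshold: 86.5% confirmed

# Fisher's exact test on the 2x2 confirmation table; the two-sided
# p-value should land near the reported 0.0097.
table = [[conf_70, n - conf_70], [conf_80, n - conf_80]]
odds, p = stats.fisher_exact(table)
print(f"confirmation {conf_70/n:.1%} vs {conf_80/n:.1%}, p = {p:.4f}")

# Predictive values from the per-chromosome counts (70% threshold).
ppv = 189 / 189          # true positives / all positive calls
npv = 2192 / 2203        # true negatives / all negative calls
print(f"PPV = {ppv:.1%}, NPV = {npv:.2%}")
```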


2021 ◽  
pp. 1-36
Author(s):  
Benjamin Knisely ◽  
Monifa Vaughn-Cooke

Abstract Human beings are physically and cognitively variable, leading to a wide array of potential system use cases. To design safe and effective systems for highly heterogeneous populations, engineers must cater to this variability to minimize the chance of error and system failure. This can be a challenge because of the increasing costs associated with providing additional product variety. Most guidance for navigating these trade-offs is intended for late-stage design, when significant resources have already been expended, risking expensive redesign or the exclusion of users when new human concerns become apparent. Despite the critical need to evaluate accommodation-cost trade-offs in the early stages of design, structured guidance is currently lacking. In this work, an approach to function modeling is proposed that allows the simultaneous consideration of human and machine functionality. This modeling approach facilitates the allocation of system functions to humans and machines and serves as an accessible baseline for concept development. Further, a multi-objective optimization model was developed to allocate functions using metrics for accommodation and cost. The model was demonstrated in a design case study in which sixteen senior mechanical engineering students were recruited and tasked with performing the allocation manually; their results were compared with the output of the optimization model. The results indicated that participants were unable to produce concepts with the same accommodation-cost efficiency as the optimization model. Further, the optimization model successfully produced a wide range of potential product concepts, demonstrating its utility as a decision aid.
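A minimal version of such a function-allocation optimization is an exhaustive search over human/machine assignments followed by a Pareto filter on the two objectives. The per-function scores below are invented placeholders; the sketch shows the structure of the accommodation-cost trade-off, not the authors' model.

```python
from itertools import product

# Hypothetical per-function scores: (accommodation, cost) when a
# function is assigned to a human vs. to the machine.
functions = {
    "monitor": {"human": (0.9, 2.0), "machine": (0.6, 5.0)},
    "decide":  {"human": (0.8, 1.0), "machine": (0.5, 4.0)},
    "actuate": {"human": (0.4, 1.5), "machine": (0.95, 6.0)},
}

def evaluate(assignment):
    """Total accommodation (to maximize) and total cost (to minimize)."""
    acc = sum(functions[f][a][0] for f, a in assignment.items())
    cost = sum(functions[f][a][1] for f, a in assignment.items())
    return acc, cost

# Enumerate all 2^n allocations, then keep the Pareto-optimal ones.
names = list(functions)
candidates = []
for choice in product(("human", "machine"), repeat=len(names)):
    assignment = dict(zip(names, choice))
    candidates.append((assignment, *evaluate(assignment)))

pareto = [
    (a, acc, cost) for a, acc, cost in candidates
    if not any(acc2 >= acc and cost2 <= cost and (acc2, cost2) != (acc, cost)
               for _, acc2, cost2 in candidates)
]
for a, acc, cost in sorted(pareto, key=lambda t: t[2]):
    print(a, f"accommodation={acc:.2f}", f"cost={cost:.1f}")
```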

