Accurate detection of de novo and transmitted INDELs within exome-capture data using micro-assembly

Abstract: The assembly of DNA sequences de novo is fundamental to genomics research. It is the first of many steps towards elucidating and characterizing whole genomes. Downstream applications, including analysis of genomic variation between species and between or within individuals, critically depend on robustly assembled sequences. In the span of a single decade, the sequence throughput of leading DNA sequencing instruments has increased drastically. Coupled with established and planned large-scale, personalized medicine initiatives to sequence genomes in the thousands and even millions, this makes the development of efficient, scalable and accurate bioinformatics tools for producing high-quality reference draft genomes timely. With ABySS 1.0, we originally showed that assembling the human genome using short 50 bp sequencing reads was possible by aggregating the half terabyte of compute memory needed over several computers using a standardized message-passing system (MPI). We present here its re-design, which departs from MPI and instead implements algorithms that employ a Bloom filter, a probabilistic data structure, to represent a de Bruijn graph and reduce memory requirements. We present assembly benchmarks of human Genome in a Bottle 250 bp Illumina paired-end and 6 kbp mate-pair libraries from a single individual, yielding an NG50 (NGA50) scaffold contiguity of 3.5 (3.0) Mbp using less than 35 GB of RAM, a modest memory requirement by today’s standards that is often available on a single computer. We also investigate the use of BioNano Genomics and 10x Genomics’ Chromium data to further improve the scaffold contiguity of this assembly to 42 (15) Mbp.
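The idea of representing a de Bruijn graph with a Bloom filter, as described in the abstract above, can be illustrated with a minimal Python sketch. This is not the ABySS implementation: the class `KmerBloomFilter`, the helpers `kmers` and `successors`, and the choice of salted BLAKE2b hashes are all assumptions made for illustration. The graph is implicit: only k-mers are stored, and edges are recovered by querying the four possible one-base extensions of a k-mer.

```python
import hashlib


class KmerBloomFilter:
    """Toy Bloom filter over k-mers (illustrative; not ABySS's data structure)."""

    def __init__(self, num_bits: int, num_hashes: int) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, kmer: str):
        # Derive several bit positions by salting one hash function with an index.
        for i in range(self.num_hashes):
            digest = hashlib.blake2b(
                kmer.encode(), digest_size=8, salt=i.to_bytes(8, "little")
            ).digest()
            yield int.from_bytes(digest, "little") % self.num_bits

    def add(self, kmer: str) -> None:
        for pos in self._positions(kmer):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, kmer: str) -> bool:
        # May return a false positive, never a false negative.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(kmer)
        )


def kmers(seq: str, k: int):
    """Yield every k-mer of seq, left to right."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))


def successors(bf: KmerBloomFilter, kmer: str) -> list:
    """Out-neighbours of kmer in the implicit graph: extensions found in the filter."""
    return [kmer[1:] + base for base in "ACGT" if kmer[1:] + base in bf]


bf = KmerBloomFilter(num_bits=1 << 16, num_hashes=3)
for km in kmers("ACGTACGA", 4):
    bf.add(km)
```

Because a Bloom filter can report k-mers that were never inserted, `successors` may occasionally return spurious neighbours; a real assembler needs additional logic to cope with such false positives.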


NxRepair: Error correction in de novo sequence assembly using Nextera mate pairs

10.7287/peerj.preprints.747v1 ◽

2014 ◽

Author(s):

Rebecca R Murphy ◽

Jared M O'Connell ◽

Anthony J Cox ◽

Ole B Schulz-Trieglaff

Keyword(s):

Error Correction ◽

Large Scale ◽

De Novo ◽

Reference Sequence ◽

De Bruijn Graph ◽

Sequencing Data ◽

Additional Information ◽

Mate Pair ◽

De Bruijn ◽

De Novo Sequence Assembly

Scaffolding errors and incorrect traversals of the de Bruijn graph during de novo assembly can result in large-scale misassemblies in draft genomes. Nextera mate pair sequencing data provide additional information to resolve assembly ambiguities during scaffolding. Here, we introduce NxRepair, an open-source toolkit for error correction in de novo assemblies that uses Nextera mate pair libraries to identify and correct large-scale errors. We show that NxRepair can identify and correct large scaffolding errors without the use of a reference sequence, resulting in quantitative improvements in assembly quality. NxRepair can be downloaded from GitHub; a tutorial and user documentation are also available.


NxRepair: Error correction in de novo sequence assembly using Nextera mate pairs

10.7287/peerj.preprints.747 ◽

2014 ◽

Author(s):

Rebecca R Murphy ◽

Jared M O'Connell ◽

Anthony J Cox ◽

Ole B Schulz-Trieglaff

Keyword(s):

Error Correction ◽

Large Scale ◽

De Novo ◽

Reference Sequence ◽

De Bruijn Graph ◽

Sequencing Data ◽

Additional Information ◽

Mate Pair ◽

De Bruijn ◽

De Novo Sequence Assembly

Scaffolding errors and incorrect traversals of the de Bruijn graph during de novo assembly can result in large-scale misassemblies in draft genomes. Nextera mate pair sequencing data provide additional information to resolve assembly ambiguities during scaffolding. Here, we introduce NxRepair, an open-source toolkit for error correction in de novo assemblies that uses Nextera mate pair libraries to identify and correct large-scale errors. We show that NxRepair can identify and correct large scaffolding errors without the use of a reference sequence, resulting in quantitative improvements in assembly quality. NxRepair can be downloaded from GitHub; a tutorial and user documentation are also available.


deGSM: memory scalable construction of large scale de Bruijn Graph

10.1101/388454 ◽

2018 ◽

Cited By ~ 2

Author(s):

Hongzhe Guo ◽

Yilei Fu ◽

Yan Gao ◽

Junyi Li ◽

Yadong Wang ◽

...

Keyword(s):

Genome Sequence ◽

Large Scale ◽

High Throughput Sequencing ◽

De Novo ◽

Rapid Development ◽

Main Idea ◽

Supplementary Information ◽

De Bruijn Graph ◽

External Sorting ◽

De Bruijn

Abstract. Motivation: The de Bruijn graph, a fundamental data structure for representing and organizing genome sequence, plays an important role in many sequence analysis tasks, such as de novo assembly, high-throughput sequencing (HTS) read alignment, pan-genome analysis, metagenomics analysis and HTS read correction. With the rapid growth of HTS data and the ever-increasing number of assembled genomes, there is high demand for constructing de Bruijn graphs for sequences up to the tera-base-pair level. This is non-trivial, since such a graph can be very large, consisting of hundreds of billions of vertices and edges. Existing approaches may have unaffordable memory footprints for such large graphs. Moreover, the construction approach must handle very large datasets efficiently, even within a relatively small amount of RAM. Results: We propose a lightweight parallel de Bruijn graph construction approach, de Bruijn Graph Constructor in Scalable Memory (deGSM). The main idea of deGSM is to efficiently construct the Burrows-Wheeler Transformation (BWT) of the unipaths of the de Bruijn graph in constant RAM space and to transform the BWT into the original unitigs. It is mainly implemented by a fast parallel external sorting of k-mers which, through a novel organization of the k-mers, requires only part of the k-mers to be kept in RAM. The experimental results demonstrate that, on a commonly used machine, deGSM is able to handle very large genome sequences, e.g., the contigs (305 Gbp) and scaffolds (1.1 Tbp) recorded in the GenBank database and the Picea abies HTS dataset (9.7 Tbp). Moreover, deGSM has faster or comparable construction speed compared with state-of-the-art approaches. With its high scalability and efficiency, deGSM has enormous potential in many large-scale genomics studies. Availability: https://github.com/hitbc/[email protected] (YW) and [email protected] (BL). Supplementary information: Supplementary data are available online.
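The external sorting of k-mers mentioned in the abstract above can be illustrated with a generic external merge sort in Python. This is a minimal sketch, not deGSM's algorithm: the function names and the `max_in_ram` parameter are hypothetical, the sort is single-threaded, and k-mers are stored as plain text rather than in any compact organization. It shows only the general principle that sorted runs are spilled to disk and then merged, so that only a bounded number of k-mers is held in RAM while each run is built.

```python
import heapq
import os
import tempfile
from itertools import groupby


def iter_kmers(seq: str, k: int):
    """Yield every k-mer of seq, left to right."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def _spill(sorted_chunk, tmpdir: str) -> str:
    """Write one sorted run to a temporary file and return its path."""
    fd, path = tempfile.mkstemp(dir=tmpdir, text=True)
    with os.fdopen(fd, "w") as fh:
        fh.writelines(kmer + "\n" for kmer in sorted_chunk)
    return path


def _read_run(path: str):
    """Stream the k-mers of one sorted run back from disk."""
    with open(path) as fh:
        for line in fh:
            yield line.rstrip("\n")


def external_sorted_kmer_counts(seqs, k: int, max_in_ram: int):
    """Yield (kmer, count) in sorted order, buffering at most max_in_ram k-mers per run."""
    with tempfile.TemporaryDirectory() as tmpdir:
        runs, chunk = [], []
        for seq in seqs:
            for kmer in iter_kmers(seq, k):
                chunk.append(kmer)
                if len(chunk) >= max_in_ram:
                    runs.append(_spill(sorted(chunk), tmpdir))
                    chunk = []
        if chunk:
            runs.append(_spill(sorted(chunk), tmpdir))
        # k-way merge of the sorted runs; equal k-mers become adjacent.
        merged = heapq.merge(*(_read_run(p) for p in runs))
        for kmer, group in groupby(merged):
            yield kmer, sum(1 for _ in group)


counts = list(external_sorted_kmer_counts(["ACGTACGT"], k=4, max_in_ram=2))
```

With `max_in_ram=2`, the five 4-mers of the example sequence are split into three sorted runs before merging, which exercises the spill-and-merge path even on a tiny input.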


Large-scale exome sequencing study implicates both developmental and functional changes in the neurobiology of autism

10.1101/484113 ◽

2018 ◽

Cited By ~ 24

Author(s):

F. Kyle Satterstrom ◽

Jack A. Kosmicki ◽

Jiebiao Wang ◽

Michael S. Breen ◽

Silvia De Rubeis ◽

...

Keyword(s):

Gene Expression ◽

Exome Sequencing ◽

Large Scale ◽

De Novo ◽

Regulation Of Gene Expression ◽

Copy Number Variants ◽

Autism Spectrum ◽

Risk Genes ◽

Functional Changes ◽

Multiple Paths

Summary: We present the largest exome sequencing study of autism spectrum disorder (ASD) to date (n=35,584 total samples, 11,986 with ASD). Using an enhanced Bayesian framework to integrate de novo and case-control rare variation, we identify 102 risk genes at a false discovery rate ≤ 0.1. Of these genes, 49 show higher frequencies of disruptive de novo variants in individuals ascertained for severe neurodevelopmental delay, while 53 show higher frequencies in individuals ascertained for ASD; comparing ASD cases with mutations in these groups reveals phenotypic differences. Expressed early in brain development, most of the risk genes have roles in regulation of gene expression or neuronal communication (i.e., mutations effect neurodevelopmental and neurophysiological changes), and 13 fall within loci recurrently hit by copy number variants. In human cortex single-cell gene expression data, expression of risk genes is enriched in both excitatory and inhibitory neuronal lineages, consistent with multiple paths to an excitatory/inhibitory imbalance underlying ASD.


Search for mutations of the interferon-induced transmembrane protein 5 (IFITM5) gene in patients with osteogenesis imperfecta

Nauchno-prakticheskii zhurnal «Medicinskaia genetika» ◽

10.25557/2073-7998.2019.10.21-29 ◽

2019 ◽

pp. 21-29

Author(s):

А.Р. Зарипова ◽

Л.Р. Нургалиева ◽

А.В. Тюрин ◽

И.Р. Минниахметов ◽

Р.И. Хусаинова

Keyword(s):

Osteogenesis Imperfecta ◽

De Novo ◽

Transmembrane Protein ◽

Clinical Signs ◽

Clinical Manifestations ◽

Heterozygous Mutation ◽

Type I ◽

New Genes ◽

Wide Range ◽

Type V

The interferon-induced transmembrane protein 5 gene (IFITM5) was studied in 99 patients with osteogenesis imperfecta (OI) from 86 unrelated families, together with a search for pathogenic gene variants involved in the formation of the disease phenotype. OI is a clinically and genetically heterogeneous hereditary disease of the connective tissue, the main clinical manifestation of which is multiple fractures, starting from the neonatal period of life and often leading to disability from childhood. The main clinical signs of OI include blue sclerae, hearing loss, dentin anomalies, increased bone fragility, and impaired growth and posture, with the development of characteristic disabling bone deformities and associated problems, including respiratory, neurological, cardiac and renal disorders. OI occurs in both men and women. The degree of genetic heterogeneity of the disease has not yet been determined. To date, 20 genes are known to be involved in the pathogenesis of OI, and researchers from different countries continue to search for new genes. In the last decade, it has become known that OI is caused by autosomal recessive, autosomal dominant and X-linked mutations in a wide range of genes encoding proteins that are involved in the synthesis of type I collagen, its processing, secretion and post-translational modification, as well as proteins that regulate the differentiation and activity of bone-forming cells. Mutations in the IFITM5 gene, also called BRIL (bone-restricted IFITM-like protein), which is involved in the formation of osteoblasts, lead to the development of OI type V. Up to 5% of patients have OI type V, which is characterized by the formation of a hyperplastic callus after fractures, calcification of the interosseous membrane of the forearm, and a mesh-like lamellar pattern observed on histological examination of the bone. In 2012, a heterozygous mutation (c.-14C>T) in the 5’-untranslated region (UTR) of the IFITM5 gene was identified as the main cause of OI type V. In the present work, the IFITM5 gene was analyzed and a de novo c.-14C>T mutation was identified in one patient with OI, who was subsequently diagnosed with type V of the disease. Three known polymorphic variants were also identified: rs57285449; c.80G>C (p.Gly27Ala) and rs2293745; c.187-45C>T and rs755971385 c.279G>A (p.Thr93=), as well as one previously undescribed variant: c.128G>A (p.Ser43Asn) AGC>AAC (S/D); none of these are pathogenic. The article focuses on the features of the clinical manifestations of OI type V and recommends testing for the c.-14C>T mutation in the IFITM5 gene if this form of the disease is suspected.


Target-Templated de novo Design of Macrocyclic D-/L-Peptides: Inhibitors of the PD-1/PD-L1 Interaction

10.26434/chemrxiv.11663337.v3 ◽

2020 ◽

Author(s):

Salvador Guardiola ◽

Monica Varese ◽

Xavier Roig ◽

Jesús Garcia ◽

Ernest Giralt

Keyword(s):

Protein Interactions ◽

Cyclic Peptides ◽

General Framework ◽

Large Scale ◽

De Novo ◽

Inhibitory Effect ◽

Original Text ◽

Protein Protein Interactions ◽

Retraction Notice ◽

Pharmaceutical Properties

NOTE: This preprint has been retracted by consensus from all authors. See the retraction notice in place above; the original text can be found under "Version 1", accessible from the version selector above.

Peptides, together with antibodies, are among the most potent biochemical tools to modulate challenging protein-protein interactions. However, current structure-based methods are largely limited to natural peptides and are not suitable for designing target-specific binders with improved pharmaceutical properties, such as macrocyclic peptides. Here we report a general framework that leverages the computational power of Rosetta for large-scale backbone sampling and energy scoring, followed by side-chain composition, to design heterochiral cyclic peptides that bind to a protein surface of interest. To showcase the applicability of our approach, we identified two peptides (PD-i3 and PD-i6) that target PD-1, a key immune checkpoint, and work as protein ligand decoys. A comprehensive biophysical evaluation confirmed their binding mechanism to PD-1 and their inhibitory effect on the PD-1/PD-L1 interaction. Finally, elucidation of their solution structures by NMR served as validation of our de novo design approach. We anticipate that our results will provide a general framework for designing target-specific drug-like peptides.


A Target-Based Method for Designing Heterochiral Cyclic Peptide Binders: De Novo Inhibitors of the PD-1/PD-L1 Interaction

10.26434/chemrxiv.11663337.v2 ◽

2020 ◽

Author(s):

Salvador Guardiola ◽

Monica Varese ◽

Xavier Roig ◽

Jesús Garcia ◽

Ernest Giralt

Keyword(s):

Protein Interactions ◽

Cyclic Peptides ◽

General Framework ◽

Large Scale ◽

Cyclic Peptide ◽

De Novo ◽

Inhibitory Effect ◽

Original Text ◽

Retraction Notice ◽

Pharmaceutical Properties

NOTE: This preprint has been retracted by consensus from all authors. See the retraction notice in place above; the original text can be found under "Version 1", accessible from the version selector above.

Peptides, together with antibodies, are among the most potent biochemical tools to modulate challenging protein-protein interactions. However, current structure-based methods are largely limited to natural peptides and are not suitable for designing target-specific binders with improved pharmaceutical properties, such as macrocyclic peptides. Here we report a general framework that leverages the computational power of Rosetta for large-scale backbone sampling and energy scoring, followed by side-chain composition, to design heterochiral cyclic peptides that bind to a protein surface of interest. To showcase the applicability of our approach, we identified two peptides (PD-i3 and PD-i6) that target PD-1, a key immune checkpoint, and work as protein ligand decoys. A comprehensive biophysical evaluation confirmed their binding mechanism to PD-1 and their inhibitory effect on the PD-1/PD-L1 interaction. Finally, elucidation of their solution structures by NMR served as validation of our de novo design approach. We anticipate that our results will provide a general framework for designing target-specific drug-like peptides.


Target-Templated de novo Design of Macrocyclic D-/L-Peptides: Inhibitors of the PD-1/PD-L1 Interaction

10.26434/chemrxiv.11663337 ◽

2020 ◽

Author(s):

Salvador Guardiola ◽

Monica Varese ◽

Xavier Roig ◽

Jesús Garcia ◽

Ernest Giralt

Keyword(s):

Protein Interactions ◽

Cyclic Peptides ◽

General Framework ◽

Large Scale ◽

De Novo ◽

Inhibitory Effect ◽

Original Text ◽

Protein Protein Interactions ◽

Retraction Notice ◽

Pharmaceutical Properties

NOTE: This preprint has been retracted by consensus from all authors. See the retraction notice in place above; the original text can be found under "Version 1", accessible from the version selector above.

Peptides, together with antibodies, are among the most potent biochemical tools to modulate challenging protein-protein interactions. However, current structure-based methods are largely limited to natural peptides and are not suitable for designing target-specific binders with improved pharmaceutical properties, such as macrocyclic peptides. Here we report a general framework that leverages the computational power of Rosetta for large-scale backbone sampling and energy scoring, followed by side-chain composition, to design heterochiral cyclic peptides that bind to a protein surface of interest. To showcase the applicability of our approach, we identified two peptides (PD-i3 and PD-i6) that target PD-1, a key immune checkpoint, and work as protein ligand decoys. A comprehensive biophysical evaluation confirmed their binding mechanism to PD-1 and their inhibitory effect on the PD-1/PD-L1 interaction. Finally, elucidation of their solution structures by NMR served as validation of our de novo design approach. We anticipate that our results will provide a general framework for designing target-specific drug-like peptides.


Analysis and design of suitable model structures for activated sludge tanks with circulating flow

Water Science & Technology ◽

10.2166/wst.1999.0189 ◽

1999 ◽

Vol 39 (4) ◽

pp. 55-60 ◽

Cited By ~ 4

Author(s):

J. Alex ◽

R. Tschepetzki ◽

U. Jumar ◽

F. Obenaus ◽

K.-H. Rosenwinkel

Keyword(s):

Activated Sludge ◽

Large Scale ◽

Wastewater Treatment Plants ◽

Treatment Plant ◽

High Sensitivity ◽

Suitable Model ◽

Model Structure ◽

Analysis And Design ◽

Hydraulic Behaviour ◽

Model Structures

Activated sludge models are widely used for the planning and optimisation of wastewater treatment plants, and online applications are under development to support the operation of complex treatment plants. A proper model is crucial for all of these applications. Parameter calibration is the focus of several papers and applications. An essential precondition for this task is an appropriately defined model structure, which is often given much less attention. This paper discusses different model structures for a large-scale treatment plant with circulating flow. A more systematic method for deriving a suitable model structure is applied to this case, using the results of a numerical hydraulic model. The importance of these efforts is demonstrated by the high sensitivity of the simulation results to the selection of the model structure and the hydraulic conditions. Finally, it is shown that model calibration was possible solely by adjusting the model to the hydraulic behaviour, without any changes to the biological parameters.
