next generation sequencing data
Recently Published Documents





2022 ◽  
Vol 23 (1) ◽  
Ludwig Mann ◽  
Kathrin M. Seibt ◽  
Beatrice Weber ◽  
Tony Heitkam

Background: Extrachromosomal circular DNAs (eccDNAs) are ring-like DNA structures physically separated from the chromosomes, ranging in size from 100 bp to several megabase pairs. Apart from carrying tandemly repeated DNA, eccDNAs may also harbor extra copies of genes or recently activated transposable elements. As eccDNAs occur in all eukaryotes investigated so far and likely play roles in stress, cancer, and aging, they have been prime targets in recent research, although their investigation has been limited by the scarcity of computational tools. Results: Here, we present the ECCsplorer, a bioinformatics pipeline to detect eccDNAs in any kind of organism or tissue using next-generation sequencing techniques. Following Illumina sequencing of amplified circular DNA (circSeq), the ECCsplorer enables an easy and automated discovery of eccDNA candidates. The data analysis encompasses two major procedures: first, read mapping to the reference genome allows the detection of informative read distributions including high coverage, discordant mapping, and split reads. Second, reference-free comparison of read clusters from amplified eccDNA against control sample data reveals specifically enriched DNA circles. Both software parts can be run separately or jointly, depending on the individual aim or data availability. To illustrate the wide applicability of our approach, we analyzed semi-artificial and published circSeq data from the model organisms Homo sapiens and Arabidopsis thaliana, and generated circSeq reads from the non-model crop plant Beta vulgaris. We clearly identified eccDNA candidates from all datasets, with and without reference genomes. The ECCsplorer pipeline specifically detected mitochondrial mini-circles and retrotransposon activation, showcasing the ECCsplorer’s sensitivity and specificity. Conclusion: The ECCsplorer (available online) is a bioinformatics pipeline to detect eccDNAs in any kind of organism or tissue using next-generation sequencing data. The derived eccDNA targets are valuable for a wide range of downstream investigations, from the analysis of cancer-related eccDNAs and organelle genomics to the identification of active transposable elements.
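
The mapping-based signals named in the abstract (high coverage and discordant mapping) can be illustrated with a minimal Python sketch. This is not the ECCsplorer implementation: the function names, the 5-fold threshold, and the window-based coverage input are assumptions made for illustration only. The outward-facing test reflects one commonly used discordant pattern for read pairs that span a circle junction.

```python
from statistics import median


def enriched_windows(coverage, fold=5.0):
    """Return indices of windows whose coverage is at least `fold` times the median.

    `coverage` is a list of per-window read counts; `fold` is an assumed threshold.
    """
    baseline = median(coverage)
    if baseline == 0:
        # With a zero baseline, any covered window stands out.
        return [i for i, depth in enumerate(coverage) if depth > 0]
    return [i for i, depth in enumerate(coverage) if depth >= fold * baseline]


def is_outward_facing(forward_mate_start, reverse_mate_start):
    """True when the reverse-strand mate maps upstream of the forward-strand mate.

    On a linear reference, a pair spanning a circle junction tends to show this
    orientation (hypothetical helper, simplified to start coordinates only).
    """
    return reverse_mate_start < forward_mate_start


window_coverage = [10, 12, 11, 95, 102, 9, 10]
candidates = enriched_windows(window_coverage)  # windows 3 and 4 exceed 5x the median
```

In practice, a candidate region would be expected to combine several of these signals rather than rely on coverage alone.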

2022 ◽  
Simon Cabello ◽  
Julie A Vendrell ◽  
Charles Van Goethem ◽  
Mehdi Brousse ◽  
Catherine Gozé ◽  

Copy number variations (CNVs) are an essential component of genetic variation distributed across large parts of the human genome. Both CNV detection from next-generation sequencing data and artificial intelligence algorithms have progressed in recent years. However, only a few tools have taken advantage of machine learning algorithms for CNV detection. The most developed approach is to compare the samples of interest against a reference dataset, and it is well known that selecting appropriate normal samples is a challenging task that dramatically influences the precision of results in all CNV-detecting tools. With careful consideration of these issues, we propose here ifCNV, a new software tool based on isolation forests that creates its own reference, available in R and Python with customisable parameters. ifCNV combines artificial intelligence using two isolation forests and a comprehensive scoring method to faithfully detect CNVs among various samples. It was validated using datasets from diverse origins, and it exhibits high sensitivity, specificity and accuracy. ifCNV is publicly available open-source software that allows the detection of CNVs in many clinical situations.
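
The isolation-forest idea behind this kind of outlier detection can be sketched in a few lines of standard-library Python. This is a simplified one-dimensional illustration, not ifCNV's code: the function names, the depth cap, the number of trees, and the example depth ratios are all assumptions, and the subsampling and path-length normalisation of the standard algorithm are omitted. The principle is that an outlying value is separated from the rest by fewer random splits, so its average path length is shorter.

```python
import random


def path_length(value, sample, rng, depth=0, max_depth=8):
    """Count random splits needed to isolate `value` within `sample`."""
    if len(sample) <= 1 or depth >= max_depth:
        return depth
    low, high = min(sample), max(sample)
    if low == high:
        # All remaining values are identical; no further split is possible.
        return depth
    split = rng.uniform(low, high)
    if value < split:
        side = [x for x in sample if x < split]
    else:
        side = [x for x in sample if x >= split]
    return path_length(value, side, rng, depth + 1, max_depth)


def mean_path_length(value, data, n_trees=200, seed=0):
    """Average the isolation path length of `value` over `n_trees` random trees."""
    rng = random.Random(seed)
    total = sum(path_length(value, data, rng) for _ in range(n_trees))
    return total / n_trees


# Hypothetical normalised read-depth ratios; the last value mimics an amplified region.
depth_ratios = [1.0, 0.98, 1.02, 1.01, 0.99, 1.03, 0.97, 1.0, 2.1]
outlier_score = mean_path_length(2.1, depth_ratios)
inlier_score = mean_path_length(1.0, depth_ratios)
```

A shorter mean path for the amplified value than for a typical value is what marks it as an anomaly; a production tool would work on many features and samples at once.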

2021 ◽  
Vol 2021 ◽  
pp. 1-14
Bernardo Bonilauri ◽  
Amanda C. Camillo-Andrade ◽  
Marlon D. M. Santos ◽  
Juliana de S. da G. Fischer ◽  
Paulo C. Carvalho ◽  

Background. Obesity is characterized as a disease that directly affects the whole-body metabolism and is associated with excess fat mass and several related comorbidities. Dynamics of adipocyte hypertrophy and hyperplasia play an important role in health and disease, especially in obesity. Human adipose-derived stem cells (hASC) represent an important source for understanding the entire adipogenic differentiation process. However, little is known about the triggering step of adipogenesis in hASC. Here, we used a proteogenomic approach to understand the protein abundance alterations during the initiation of the adipogenic differentiation process. Methods. hASC were isolated from adipose tissue of three donors and were then characterized and expanded. Cells were cultured for 24 hours in adipogenic differentiation medium followed by protein extraction. We used shotgun proteomics to compare the proteomic profile of 24 h-adipogenic, differentiated, and undifferentiated hASC. We also used our previous next-generation sequencing data (RNA-seq) of the total and polysomal mRNA fractions of hASC to study posttranscriptional regulation during the initial steps of adipogenesis. Results. We identified 3420 proteins out of 48,336 peptides, of which 92 proteins were exclusively identified in undifferentiated hASC and 53 proteins were exclusively found in 24 h-differentiated cells. Using a stringent criterion, we identified 33 differentially abundant proteins when comparing 24 h-differentiated and undifferentiated hASC (14 upregulated and 19 downregulated). Among the upregulated proteins, we shortlisted several adipogenesis-related proteins. A combined analysis of the proteome and the transcriptome allowed the identification of positive correlation coefficients between proteins and mRNAs. Conclusions. These results demonstrate a specific proteome profile related to adipogenesis at the beginning (24 hours) of the differentiation process in hASC, which advances the understanding of human adipogenesis and obesity. Adipogenic differentiation is finely regulated at the transcriptional, posttranscriptional, and posttranslational levels.

2021 ◽  
Vol 12 ◽  
Penghui Chen ◽  
Longhao Wang ◽  
Yongchuan Chai ◽  
Hao Wu ◽  
Tao Yang

Splice site mutations account for a significant portion of the genetic causes of Mendelian disorders, including deafness. By next-generation sequencing of 4 multiplex, autosomal dominant families and 2 simplex, autosomal recessive families with hereditary deafness, we identified a variety of candidate pathogenic variants in noncanonical splice sites of known deafness genes, which include c.1616+3A>T and c.580G>A in EYA4, c.322-57_322-8del in PAX3, c.991-15_991-13del in DFNA5, c.6087-3T>G in PTPRQ and c.164+5G>A in USH1G. All six variants were predicted to affect RNA splicing by at least one of the computational tools Human Splicing Finder, NNSPLICE and NetGene2. Phenotypic segregation of the variants was confirmed in all families and is consistent with previously reported genotype-phenotype correlations of the corresponding genes. Minigene analysis showed that these splice site variants likely have various negative impacts, including exon skipping (c.1616+3A>T and c.580G>A in EYA4, c.991-15_991-13del in DFNA5), intron retention (c.322-57_322-8del in PAX3), exon skipping and intron retention (c.6087-3T>G in PTPRQ) and exon shortening (c.164+5G>A in USH1G). Our study showed that cryptic, noncanonical splice site mutations may play an important role in the molecular etiology of hereditary deafness, whose diagnosis can be facilitated by modified filtering criteria for the next-generation sequencing data, functional verification, as well as segregation, bioinformatics, and genotype-phenotype correlation analysis.
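
One way to picture a modified filtering criterion is a rule that keeps intronic variants lying beyond the canonical ±1/±2 positions instead of discarding them. The Python sketch below is an illustration under stated assumptions, not the authors' filter: the function names and the 60-nucleotide window are hypothetical, and only the first coordinate of a deletion range is inspected. Exonic variants such as c.580G>A carry no intronic offset and would need a separate rule.

```python
import re

# Matches the coding position and the signed intronic offset, e.g. "c.1616+3".
HGVS_INTRONIC = re.compile(r"^c\.(\d+)([+-]\d+)")


def intron_offset(variant):
    """Return the signed intronic offset of the first coordinate, or None if exonic."""
    match = HGVS_INTRONIC.match(variant)
    return int(match.group(2)) if match else None


def is_noncanonical_splice_candidate(variant, max_distance=60):
    """Keep intronic variants beyond positions 1 and 2 but within `max_distance`.

    `max_distance` is an assumed window, not a value taken from the study.
    """
    offset = intron_offset(variant)
    if offset is None:
        return False
    return 2 < abs(offset) <= max_distance


variants = ["c.1616+3A>T", "c.322-57_322-8del", "c.6087-3T>G", "c.100+1G>A", "c.580G>A"]
kept = [v for v in variants if is_noncanonical_splice_candidate(v)]
```

Here the hypothetical canonical variant c.100+1G>A and the exonic c.580G>A are not retained by this particular rule, while the three intronic noncanonical variants are.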

Hyungtaek Jung ◽  
Brendan Jeon ◽  
Daniel Ortiz-Barrientos

Storing and manipulating Next Generation Sequencing (NGS) file formats for understanding biological phenomena is an essential but difficult task in the life sciences. Yet, most methods for analysing NGS data require complex command-line tools in high-performance computing (HPC) or web-based servers and have not yet been implemented in comprehensive, easy-to-use software. Here we present easyfm (easy file manipulation), a free standalone Graphical User Interface (GUI) software with Python support that can be used to facilitate the rapid discovery of target sequences (or other sequences of interest to the user) in NGS datasets, making such analyses more accessible to novice users and biologists. It enables them to perform end-to-end reproducible data analyses using a desktop application (Windows, Mac and Linux). Unlike existing tools, the GUI-based easyfm is not dependent on any HPC system and can be operated without an internet connection. For user-friendliness and convenience, easyfm was developed with four work modules and a secondary GUI window, covering different aspects of NGS data analysis, including post-processing, filtering, format conversion, generating results, real-time log, and help. In combination with the executable tools (BLAST+ and BLAT) and Python, easyfm allows the user to set analysis parameters, select/extract regions of interest, examine the input and output results, and convert to a wide range of file formats. To help augment the functionality of existing web-based and command-line tools, easyfm, a self-contained program, comes with extensive documentation. This specific benefit allows easyfm to seamlessly integrate visual and interactive representations of NGS files, supporting a wider scope of bioinformatics applications in the life sciences.
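
As an illustration of the kind of format conversion mentioned above, the following Python sketch turns FASTQ text into FASTA. It is not taken from easyfm: the function name is hypothetical, and the sketch assumes the simple four-line FASTQ layout (header, sequence, separator, qualities) without wrapped sequences.

```python
def fastq_to_fasta(fastq_text):
    """Convert four-line FASTQ records to FASTA, dropping the quality strings."""
    lines = fastq_text.strip().splitlines()
    if len(lines) % 4 != 0:
        raise ValueError("expected four lines per FASTQ record")
    records = []
    for i in range(0, len(lines), 4):
        header, sequence = lines[i], lines[i + 1]
        if not header.startswith("@"):
            raise ValueError(f"record header does not start with '@': {header!r}")
        # The FASTA header reuses the read name without the leading '@'.
        records.append(f">{header[1:]}\n{sequence}")
    return "\n".join(records) + "\n"


example_fastq = "@read1\nACGT\n+\nIIII\n@read2\nGGCA\n+\nIIII\n"
example_fasta = fastq_to_fasta(example_fastq)
```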

2021 ◽  
Vol 22 (1) ◽  
Michael M. Khayat ◽  
Sayed Mohammad Ebrahim Sahraeian ◽  
Samantha Zarate ◽  
Andrew Carroll ◽  
Huixiao Hong ◽  

Background: Genomic structural variations (SV) are important determinants of genotypic and phenotypic changes in many organisms. However, the detection of SV from next-generation sequencing data remains challenging. Results: In this study, DNA from a Chinese family quartet is sequenced at three different sequencing centers in triplicate. A total of 288 derivative data sets are generated utilizing different analysis pipelines and compared to identify sources of analytical variability. Mapping methods provide the major contribution to variability, followed by sequencing centers and replicates. Interestingly, SV supported by only one center or replicate often represent true positives with 47.02% and 45.44% overlapping the long-read SV call set, respectively. This is consistent with an overall higher false negative rate for SV calling in centers and replicates compared to mappers (15.72%). Finally, we observe that the SV calling variability also persists in a genotyping approach, indicating the impact of the underlying sequencing and preparation approaches. Conclusions: This study provides the first detailed insights into the sources of variability in SV identification from next-generation sequencing and highlights remaining challenges in SV calling for large cohorts. We further give recommendations on how to reduce SV calling variability and the choice of alignment methodology.
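
The comparison of a short-read call set against a long-read call set can be pictured with a small Python sketch. The study's actual matching criteria are not given in the abstract, so the 50% reciprocal-overlap rule, the function names, and the example coordinates below are assumptions for illustration only.

```python
def reciprocal_overlap(a, b):
    """Smaller of the two overlap fractions for (chrom, start, end) half-open intervals."""
    if a[0] != b[0]:
        return 0.0
    shared = min(a[2], b[2]) - max(a[1], b[1])
    if shared <= 0:
        return 0.0
    return min(shared / (a[2] - a[1]), shared / (b[2] - b[1]))


def fraction_supported(calls, truth, min_overlap=0.5):
    """Fraction of `calls` matching at least one `truth` interval.

    `min_overlap` is an assumed threshold, not the criterion used in the study.
    """
    if not calls:
        return 0.0
    supported = sum(
        1
        for call in calls
        if any(reciprocal_overlap(call, ref) >= min_overlap for ref in truth)
    )
    return supported / len(calls)


# Hypothetical coordinates: only the first call overlaps a long-read SV sufficiently.
short_read_calls = [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 100, 200)]
long_read_calls = [("chr1", 120, 210), ("chr1", 590, 900)]
supported_fraction = fraction_supported(short_read_calls, long_read_calls)
```

Percentages such as those reported for single-center or single-replicate calls correspond to this kind of supported fraction, computed with whatever matching rule the analysis defines.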

2021 ◽  
Vol 11 (1) ◽  
Erick C. Castelli ◽  
Bibiana S. de Almeida ◽  
Yara C. N. Muniz ◽  
Nayane S. B. Silva ◽  
Marília R. S. Passos ◽  

HLA-G is a promiscuous immune checkpoint molecule. The HLA-G gene presents substantial nucleotide variability in its regulatory regions. However, it encodes a limited number of proteins compared to classical HLA class I genes. We characterized the HLA-G genetic variability in 4640 individuals from 88 different population samples across the globe by using a state-of-the-art method to characterize polymorphisms and haplotypes from high-coverage next-generation sequencing data. We also provide insights regarding the HLA-G genetic diversity and a resource for future studies evaluating HLA-G polymorphisms in different populations and association studies. Despite the great haplotype variability, we demonstrated that: (1) most of the HLA-G polymorphisms are in introns and regulatory sequences, and these are the sites with evidence of balancing selection, (2) linkage disequilibrium is high throughout the gene, extending up to HLA-A, (3) there are few proteins frequently observed in worldwide populations, with lack of variation in residues associated with major HLA-G biological properties (dimer formation, interaction with leukocyte receptors). These observations corroborate the role of HLA-G as an immune checkpoint molecule rather than as an antigen-presenting molecule. Understanding HLA-G variability across populations is relevant for disease association and functional studies.

2021 ◽  
Preeti P ◽  
Robin Sinha ◽  
Kamal Rawal

Background: Mobile genetic elements (MGEs) comprise a major portion of the human genome and are essential for genetic diversity. These elements are known to have the capability to induce mutations in the human genome. To date, several MGE insertions have been reported to be associated with cancer. We aim to use genome next-generation sequencing data and appropriate bioinformatics tools to accurately identify the insertion sites of MGEs in the human genome. Results: Herein, we introduce the MeX pipeline for the localization and annotation of MGEs in paired-end sequencing data. It requires the reference genome sequence, MGE sequences and paired-end sequencing reads. We evaluated MeX on high-depth (>75×) Illumina HiSeq data produced at the Broad Institute (NA12878) against human genome build 38 (restricted to chromosomes 1, 2 and 3) and Alu elements. We could identify 78 reference and 1 non-reference Alu insertions in the NA12878 sample. Upon annotation, it was found that the non-reference Alu element was in the 3' UTR region of the RNF2 gene. Out of 78 reference insertions, 42 were in the intronic region, 7 in the upstream region, 5 in the downstream region, 1 in the 3' UTR region and the rest were not associated with any gene. MeX showed high performance for the identification and annotation of MGEs in genome samples. Conclusion: This study showed that MeX is a robust and powerful tool for the identification and annotation of MGE insertions. It may also serve as a valuable tool to study the phenotypic changes resulting from transpositional events in cancer genomics.
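
The annotation step, which assigns each insertion to a category such as intronic, upstream or downstream, can be sketched as follows. This is not the MeX implementation: the function name, the 5000 bp flank, and the coordinates are hypothetical, and UTRs are not modelled separately. Of the 78 reference insertions reported above, 42 + 7 + 5 + 1 = 55 fall in gene-associated categories, which leaves 23 not associated with any gene.

```python
def annotate_insertion(position, gene_start, gene_end, exons, strand="+", flank=5000):
    """Classify an insertion position relative to one gene.

    `exons` is a list of inclusive (start, end) pairs; `flank` is an assumed
    window for calling an insertion upstream or downstream of the gene.
    """
    if gene_start <= position <= gene_end:
        in_exon = any(start <= position <= end for start, end in exons)
        return "exonic" if in_exon else "intronic"
    if gene_start - flank <= position < gene_start:
        # Lower coordinates are upstream only for genes on the plus strand.
        return "upstream" if strand == "+" else "downstream"
    if gene_end < position <= gene_end + flank:
        return "downstream" if strand == "+" else "upstream"
    return "intergenic"


# Hypothetical gene model with two exons.
example_exons = [(1000, 1200), (1800, 2000)]
labels = [
    annotate_insertion(pos, 1000, 2000, example_exons)
    for pos in (1500, 1100, 900, 2100, 10000)
]
unassigned_reference_insertions = 78 - (42 + 7 + 5 + 1)
```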
