gc bias
Recently Published Documents


TOTAL DOCUMENTS: 43 (FIVE YEARS 17)

H-INDEX: 10 (FIVE YEARS 2)

PLoS ONE ◽  
2022 ◽  
Vol 17 (1) ◽  
pp. e0261748
Author(s):  
John E. Bowers ◽  
Haibao Tang ◽  
John M. Burke ◽  
Andrew H. Paterson

The frequency of G and C nucleotides in genomes varies from species to species, and sometimes even between different genes in the same genome. The monocot grasses have a bimodal distribution of genic GC content absent in dicots. We categorized plant genes from 5 dicots and 4 monocot grasses by synteny to related species and determined that syntenic genes have significantly higher GC content than non-syntenic genes at their 5′ end in the third position within codons for all 9 species. Lower GC content is correlated with gene duplication, as lack of synteny to distantly related genomes is associated with past interspersed gene duplications. Two mutation types can account for biased GC content: mutation of methylated C to T, and gene conversion from A to G. Gene conversion involves non-reciprocal exchanges between homologous alleles and is not detectable when the alleles are identical or heterozygous for presence-absence variation, both likely situations for genes duplicated to new loci. Gene duplication can cause production of siRNA, which can induce targeted methylation, elevating mC→T mutations. Recently duplicated plant genes are more frequently methylated and less likely to undergo gene conversion, and these factors act synergistically to create a mutational environment favoring AT nucleotides. The syntenic genes with high GC content in the grasses form a subset that has undergone few duplications, or for which duplicate copies were purged by selection. We propose a “biased gene duplication / biased mutation” (BDBM) model that may explain the origin and trajectory of the observed link between duplication and genic GC bias. The BDBM model is supported by empirical data based on joint analyses of 9 angiosperm species with their genes categorized by duplication status, GC content, methylation levels and functional classes.
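
The quantity this abstract compares, GC content at the third codon position (often called GC3), can be illustrated with a minimal Python sketch. The function name `gc3_content` and the `max_codons` parameter (used here to restrict the count to the 5′ end of a coding sequence) are hypothetical and are not taken from the paper, which does not state how many codons its 5′ window covers.

```python
from typing import Optional


def gc3_content(cds: str, max_codons: Optional[int] = None) -> float:
    """Return the fraction of G or C bases at third codon positions.

    `cds` is assumed to be an in-frame coding sequence starting at codon 1.
    `max_codons` (hypothetical parameter) limits the count to the first
    N codons, i.e. the 5' end of the gene.
    """
    cds = cds.upper()
    # Ignore a trailing partial codon, if any.
    usable = len(cds) - len(cds) % 3
    # Third positions are at 0-based indices 2, 5, 8, ...
    third = cds[2:usable:3]
    if max_codons is not None:
        third = third[:max_codons]
    if not third:
        raise ValueError("sequence has no complete codon to score")
    return sum(base in "GC" for base in third) / len(third)


if __name__ == "__main__":
    # Third positions of ATG AAA TTT are G, A, T.
    print(gc3_content("ATGAAATTT"))
    # Only the first codon (5' end) is scored here.
    print(gc3_content("ATGAAATTT", max_codons=1))
```

Comparing syntenic and non-syntenic gene sets would then amount to applying such a function to each gene and testing the two distributions against each other; that statistical step is not shown here.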


PLoS ONE ◽  
2021 ◽  
Vol 16 (10) ◽  
pp. e0257521
Author(s):  
Clara Delahaye ◽  
Jacques Nicolas

Oxford Nanopore Technologies’ (ONT) long read sequencers offer access to longer DNA fragments than previous sequencer generations, at the cost of a higher error rate. While many papers have studied read correction methods, few have addressed the detailed characterization of observed errors, a task complicated by frequent changes in chemistry and software in ONT technology. The MinION sequencer is now more stable and this paper proposes an up-to-date view of its error landscape, using the most mature flowcell and basecaller. We studied Nanopore sequencing error biases on both bacterial and human DNA reads. We found that, although Nanopore sequencing is expected not to suffer from GC bias, it is a crucial parameter with respect to errors. In particular, low-GC reads have fewer errors than high-GC reads (about 6% and 8% respectively). The error profile for homopolymeric regions or regions with short repeats, the source of about half of all sequencing errors, also depends on the GC rate and mainly shows deletions, although there are some reads with long insertions. Another interesting finding is that the quality measure, although over-estimated, offers valuable information to predict the error rate as well as the abundance of reads. We supplemented this study with an analysis of a rapeseed RNA read set and showed a higher level of errors with a higher level of deletion in these data. Finally, we have implemented an open source pipeline for long-term monitoring of the error profile, which enables users to easily compute the various analyses presented in this work, including for future developments of the sequencing device. Overall, we hope this work will provide a basis for the design of better error-correction methods.


2021 ◽  
Author(s):  
Dyfed Lloyd Evans

Much of the work on the normalization of RNA-seq data has been performed on human, notably cancer tissue. Little work has been done in plants, particularly polyploids and those species with incomplete or no genomes. We present a novel implementation of GeTMM (Gene Length Corrected TMM) that accounts for GC bias and works at the transcript level. The algorithm also employs transcript length as a factor, allowing for incomplete transcripts and alternate transcripts. This significantly improves overall normalization. The GCGeTMM methodology also allows for simultaneous determination of differentially expressed transcripts (and by extension genes) and stably expressed genes to act as references for qRT-PCR and microarray analyses.


2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Yue Wang ◽  
Paul M. Harrison

Homopeptides (runs of one amino-acid type) are evolutionarily important since they are prone to expand/contract during DNA replication, recombination and repair. To gain insight into the genomic/proteomic traits driving their variation, we analyzed how homopeptides and homocodons (which are pure codon repeats) vary across 405 Dikarya, and probed their linkage to genome GC/AT bias and other factors. We find that amino-acid homopeptide frequencies vary diversely between clades, with the AT-rich Saccharomycotina trending distinctly. As organisms evolve, homocodon and homopeptide numbers are largely coupled to GC/AT bias, exhibiting a bifurcated correlation with degree of AT or GC bias. Mid-GC/AT genomes tend to have markedly fewer simply because they are mid-GC/AT. Despite these trends, homopeptides tend to be GC-biased relative to other parts of coding sequences, even in AT-rich organisms, indicating they absorb AT bias less or are inherently more GC-rich. The most frequent and most variable homopeptide amino acids favour intrinsic disorder, and there are an opposing correlation and anti-correlation versus homopeptide levels for intrinsic disorder and structured-domain content respectively. Specific homopeptides show unique behaviours that we suggest are linked to inherent slippage probabilities during DNA replication and recombination, such as poly-glutamine, which is an evolutionarily very variable homopeptide with a codon repertoire unbiased for GC/AT, and poly-lysine, whose homocodons are overwhelmingly made from the codon AAG.
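
The two objects defined in this abstract, homopeptides (runs of one amino acid) and homocodons (runs built from a single repeated codon), can be sketched in Python as follows. The function names and the minimum run length of 5 residues are assumptions made for illustration; the abstract does not state the threshold the authors used.

```python
import re
from typing import Iterator, Tuple


def homopeptides(protein: str, min_len: int = 5) -> Iterator[Tuple[int, str, int]]:
    """Yield (start, residue, length) for each run of one amino acid.

    `min_len` is an assumed threshold (must be >= 1), not a value
    taken from the paper.
    """
    # (.) captures one residue; \1{n,} requires at least n more copies.
    pattern = re.compile(r"(.)\1{%d,}" % (min_len - 1))
    for match in pattern.finditer(protein):
        yield match.start(), match.group(1), len(match.group(0))


def is_homocodon(cds: str, start: int, length: int) -> bool:
    """Return True if the run is encoded by one codon repeated throughout.

    `start` and `length` are in residues; `cds` is assumed to be the
    in-frame coding sequence of the same protein.
    """
    codons = {cds[3 * i:3 * i + 3].upper() for i in range(start, start + length)}
    return len(codons) == 1


if __name__ == "__main__":
    protein = "MKKKKK"
    cds = "ATG" + "AAG" * 5          # poly-lysine built only from AAG
    for start, residue, length in homopeptides(protein):
        print(start, residue, length, is_homocodon(cds, start, length))
```

Counting such runs per genome and relating the counts to genome-wide GC fraction would be the next step; that cross-species analysis is outside the scope of this sketch.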


2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Angana Chakraborty ◽  
Burkhard Morgenstern ◽  
Sanghamitra Bandyopadhyay

Background: The advancement of SMRT technology has opened new opportunities for genome analysis with its longer read length and low GC bias. Alignment of the reads to their appropriate positions in the respective reference genome is the first but costliest step of any analysis pipeline based on SMRT sequencing. However, the state-of-the-art aligners often fail to identify distant homologies due to lack of conserved regions, caused by frequent genetic duplication and recombination. Therefore, we developed a novel alignment-free method of sequence mapping that is fast and accurate. Results: We present a new mapper called S-conLSH that uses Spaced context based Locality Sensitive Hashing. With multiple spaced patterns, S-conLSH facilitates a gapped mapping of noisy long reads to the corresponding target locations of a reference genome. We have examined the performance of the proposed method on 5 different real and simulated datasets. S-conLSH is at least 2 times faster than the recently developed method lordFAST. It achieves a sensitivity of 99%, without using any traditional base-to-base alignment, on human simulated sequence data. By default, S-conLSH provides an alignment-free mapping in PAF format. However, it has an option of generating aligned output as a SAM file, if it is required for any downstream processing. Conclusions: S-conLSH is one of the first alignment-free reference genome mapping tools achieving a high level of sensitivity. The spaced context is especially suitable for extracting distant similarities. The variable-length spaced seeds or patterns add flexibility to the proposed algorithm by introducing gapped mapping of the noisy long reads. Therefore, S-conLSH may be considered a prominent direction towards alignment-free sequence analysis.
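
The general idea behind spaced patterns can be shown with a short Python sketch: a binary pattern selects which positions of a window are kept, so a mismatch at a masked ("0") position does not change the resulting word. This is a simplified illustration and not the S-conLSH implementation: it uses a single pattern and a plain dictionary index instead of the tool's multiple patterns and locality-sensitive hash tables, and all function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterator, List, Tuple


def spaced_words(seq: str, pattern: str) -> Iterator[Tuple[int, str]]:
    """Yield (position, word) where word keeps only the '1' positions."""
    keep = [i for i, flag in enumerate(pattern) if flag == "1"]
    span = len(pattern)
    for pos in range(len(seq) - span + 1):
        window = seq[pos:pos + span]
        yield pos, "".join(window[i] for i in keep)


def build_index(reference: str, pattern: str) -> Dict[str, List[int]]:
    """Map each spaced word to the reference positions where it occurs."""
    index: Dict[str, List[int]] = defaultdict(list)
    for pos, word in spaced_words(reference, pattern):
        index[word].append(pos)
    return index


def candidate_positions(read: str, index: Dict[str, List[int]],
                        pattern: str) -> Dict[int, int]:
    """Vote for reference start positions implied by shared spaced words."""
    votes: Dict[int, int] = defaultdict(int)
    for offset, word in spaced_words(read, pattern):
        for ref_pos in index.get(word, ()):
            votes[ref_pos - offset] += 1
    return dict(votes)


if __name__ == "__main__":
    reference = "ACGTTGCAAGCTTAGC"
    pattern = "1101"                      # third position is masked
    index = build_index(reference, pattern)
    read = "TGTAAGCT"                     # reference[4:12] with one mismatch
    votes = candidate_positions(read, index, pattern)
    print(max(votes, key=votes.get))      # best-supported start position
```

In this toy example the windows whose mismatch falls on the masked position still hit the index, which is the property that makes spaced patterns tolerant of noisy long reads.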


2020 ◽  
Author(s):  
Taylor L. Mighell ◽  
Andrew Nishida ◽  
Brendan L. O’Connell ◽  
Caitlin V. Miller ◽  
Sally Grindstaff ◽  
...  

Targeted sequencing remains a valuable technique for clinical and research applications. However, many existing technologies suffer from pervasive GC sequence content bias, high input DNA requirements, and high cost for custom panels. We have developed Cas12a-Capture, a low-cost and highly scalable method for targeted sequencing. The method utilizes preprogrammed guide RNAs to direct CRISPR-Cas12a cleavage of double-stranded DNA in vitro and then takes advantage of the resulting four to five nucleotide overhangs for selective ligation with a custom sequencing adapter. Addition of a second sequencing adapter and enrichment for ligation products generates a targeted sequence library. We first performed a pilot experiment with 7,176 guides targeting 3.5 megabases of DNA. Using these data, we modeled the sequence determinants of Cas12a-Capture efficiency, then designed an optimized set of 11,438 guides targeting 3.0 megabases. The optimized guide set achieves an average 64-fold enrichment of targeted regions with minimal GC bias. Cas12a-Capture variant calls had strong concordance with Illumina Platinum Genome calls, especially for SNVs, which could be improved by applying basic variant quality heuristics. We believe Cas12a-Capture has a wide variety of potential clinical and research applications and is amenable to selective enrichment of any double-stranded DNA template or genome.


2020 ◽  
Author(s):  
Chris M. Cohen ◽  
Katherine Noble ◽  
T. Jeffrey Cole ◽  
Michael S. Brewer

Robber flies or assassin flies (Diptera: Asilidae) are a diverse family of venomous predators. The most recent classification organizes Asilidae into 14 subfamilies based on a comprehensive morphological phylogeny, but many of these have not been supported in a subsequent molecular study using traditional molecular markers. To address questions of monophyly in Asilidae, we leveraged the recently developed Diptera-wide UCE baitset to compile seven datasets comprising 151 robber flies and 146 to 2,508 loci, varying in the extent of missing data. We also studied the behavior of different nodal support metrics, as the non-parametric bootstrap is known to perform poorly with large genomic datasets. Our ML phylogeny was fully resolved and well-supported, but partially incongruent with the coalescent phylogeny. Further examination of the datasets suggested the possibility that GC bias had influenced gene tree inference and subsequent species tree analysis. The subfamilies Brachyrhopalinae, Dasypogoninae, Dioctriinae, Stenopogoninae, Tillobromatinae, Trigonomiminae, and Willistonininae were not recovered as monophyletic in either analysis, consistent with a previous molecular study. The inter-subfamily relationships are summarized as follows: Laphriinae and Dioctriinae (in part) are successively sister to the remaining subfamilies, which form two clades; the first consists of a grade of Stenopogoninae (in part), Willistonininae (in part), Bathypogoninae+Phellinae, Stichopogoninae, Leptogastrinae, Ommatiinae, and Asilinae; the second clade consists of a thoroughly paraphyletic assemblage of genera from Dioctriinae (in part), Trigonomiminae, Stenopogoninae (in part), Tillobromatinae, Brachyrhopalinae, and Dasypogoninae. We find that nodal support does not significantly vary with missing data. Furthermore, the bootstrap appears to overestimate nodal support, as has been reported from many recent studies. Gene concordance and site concordance factors seem to perform better, but may actually underestimate support. We instead recommend quartet concordance as a more appropriate estimator of nodal support. Our comprehensive phylogeny demonstrates that the higher classification of Asilidae is far from settled, and it will provide a much-needed foundation for a thorough revision of the subfamily classification.


Author(s):  
Aziz Khan ◽  
Rafael Riudavets Puig ◽  
Paul Boddie ◽  
Anthony Mathelier

Motivation: Accurate motif enrichment analyses depend on the choice of background DNA sequences used, which should ideally match the sequence composition of the foreground sequences. It is important to avoid false positive enrichment due to sequence biases in the genome, such as GC bias. Therefore, relying on an appropriate set of background sequences is crucial for enrichment analysis. Results: We developed BiasAway, a command line tool with a dedicated easy-to-use web server, to generate synthetic sequences matching any k-mer nucleotide composition or select genomic DNA sequences matching the mononucleotide composition of the foreground sequences through four different models. For genomic sequences, we provide precomputed partitions of genomes from nine species with five different bin sizes to generate appropriate genomic background sequences. Availability and implementation: BiasAway source code is freely available from Bitbucket (https://bitbucket.org/CBGR/biasaway) and can be easily installed using bioconda or pip. The web server is available at https://biasaway.uio.no and detailed documentation is available at https://biasaway.readthedocs.io. Supplementary information: Supplementary data are available at Bioinformatics online.
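
The principle of composition-matched backgrounds described above can be sketched in a few lines of Python: bin sequences by GC fraction and, for each foreground sequence, draw one candidate from the same bin. This is an independent illustration and does not reproduce BiasAway's code, models, or command line interface; the function names, the choice of 10 bins, and the fixed random seed are all assumptions.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence


def gc_fraction(seq: str) -> float:
    """Return the fraction of G or C bases in a non-empty sequence."""
    if not seq:
        raise ValueError("empty sequence")
    return sum(base in "GC" for base in seq.upper()) / len(seq)


def gc_bin(seq: str, n_bins: int = 10) -> int:
    """Assign a sequence to one of n_bins equal-width GC bins."""
    # min() keeps a GC fraction of exactly 1.0 inside the last bin.
    return min(int(gc_fraction(seq) * n_bins), n_bins - 1)


def match_background(foreground: Sequence[str], candidates: Sequence[str],
                     n_bins: int = 10, seed: int = 0) -> List[str]:
    """Pick one candidate per foreground sequence from the same GC bin.

    Candidates are used at most once; foreground sequences whose bin
    has no remaining candidate are skipped.
    """
    rng = random.Random(seed)
    pool: Dict[int, List[str]] = defaultdict(list)
    for seq in candidates:
        pool[gc_bin(seq, n_bins)].append(seq)
    for seqs in pool.values():
        rng.shuffle(seqs)
    background = []
    for seq in foreground:
        bucket = pool.get(gc_bin(seq, n_bins), [])
        if bucket:
            background.append(bucket.pop())
    return background


if __name__ == "__main__":
    fg = ["GGCC", "ATAT"]
    cands = ["GCGC", "CCGG", "TTAA", "ACGT"]
    print(match_background(fg, cands))
```

A background chosen this way has roughly the same GC distribution as the foreground, so a motif that is enriched merely because it is GC-rich should no longer stand out.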

