Comprehensive understanding of Tn5 insertion preference improves transcription regulatory element identification

2021 ◽  
Vol 3 (4) ◽  
Author(s):  
Houyu Zhang ◽  
Ting Lu ◽  
Shan Liu ◽  
Jianyu Yang ◽  
Guohuan Sun ◽  
...  

Tn5 transposase, which can efficiently tagment the genome, has been widely adopted as a molecular tool in next-generation sequencing, from short-read sequencing to more complex methods such as assay for transposase-accessible chromatin using sequencing (ATAC-seq). Here, we systematically map Tn5 insertion characteristics across several model organisms, finding critical parameters that affect its insertion. On naked genomic DNA, we found that Tn5 insertion is not uniformly distributed or random. To uncover drivers of these biases, we used a machine learning framework, which revealed that DNA shape cooperatively works with DNA motif to affect Tn5 insertion preference. These intrinsic insertion preferences can be modeled using nucleotide dependence information from DNA sequences, and we developed a computational pipeline to correct for these biases in ATAC-seq data. Using our pipeline, we show that bias correction improves the overall performance of ATAC-seq peak detection, recovering many potential false-negative peaks. Furthermore, we found that these peaks are bound by transcription factors, underscoring the biological relevance of capturing this additional information. These findings highlight the benefits of an improved understanding and precise correction of Tn5 insertion preference.

2019 ◽  
Author(s):  
Andre Macedo ◽  
Alisson M. Gontijo

The human body is made up of hundreds, perhaps thousands, of cell types and states, most of which are currently inaccessible genetically. Genetic accessibility carries significant diagnostic and therapeutic potential by allowing the selective delivery of genetic messages or cures to cells. Research in model organisms has shown that single regulatory element (RE) activities are seldom cell type specific, limiting their usage in genetic systems designed to restrict gene expression after their delivery to cells. Intersectional genetic approaches can increase the number of genetically accessible cells. A typical intersectional method acts like an AND logic gate by converting the input of two or more active REs into a single synthetic output, which becomes unique for that cell. Here, we systematically assessed the intersectional genetics landscape of humans using a curated subset of cells from a large RE usage atlas obtained by Cap Analysis of Gene Expression Sequencing (CAGE-Seq) of thousands of primary and cancer cells (the FANTOM5 consortium atlas). We developed the heuristics and algorithms to retrieve and quality-rank AND gate intersections intra- and inter-individually. We find that >90% of the 154 primary cell types surveyed can be distinguished from each other with as few as 3 to 4 active REs, with quantifiable safety and robustness. We call these minimal intersections of active REs with cell-type diagnostic potential “Versatile Entry Codes” (VEnCodes). We show that VEnCodes could be found for 100% of the 158 cancer cell types surveyed, and that most of these are highly robust to intra- and interindividual variation. Our tools for generating and quality-ranking VEnCodes can be adapted to other RE usage databases and to other intersectional methods using alternative Boolean logic operations. Our work demonstrates the potential of intersectional approaches for future gene delivery technologies in humans.
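The AND-gate idea in this abstract can be illustrated with a small sketch. It is not the authors' tool: the function name `find_and_gate`, the toy activity table, and the brute-force search over combinations are all assumptions made for illustration, whereas the abstract only says that heuristics and algorithms were developed. The sketch looks for the smallest set of REs that are all active in a target cell type and never all active together in any other cell type.

```python
from itertools import combinations


def find_and_gate(activity, target, max_size=4):
    """Return the smallest RE combination unique to `target`, or None.

    `activity` maps cell type -> {RE name: bool}. This is a hypothetical
    data layout chosen for the example, not the FANTOM5 format.
    """
    # Only REs that are active in the target can feed an AND gate for it.
    active = sorted(re for re, on in activity[target].items() if on)
    others = [ct for ct in activity if ct != target]
    for k in range(1, max_size + 1):
        for combo in combinations(active, k):
            # The gate fires in another cell type only if every RE is on there.
            fires_elsewhere = any(
                all(activity[ct].get(re, False) for re in combo) for ct in others
            )
            if not fires_elsewhere:
                return combo
    return None


# Toy example: no single RE is specific to cell type "A", but a pair is.
toy_activity = {
    "A": {"re1": True, "re2": True, "re3": False},
    "B": {"re1": True, "re2": False, "re3": True},
    "C": {"re1": False, "re2": True, "re3": True},
}
code_for_a = find_and_gate(toy_activity, "A")
```

In the toy table neither `re1` nor `re2` alone distinguishes cell type A, but the pair does, which mirrors the abstract's point that single REs are seldom cell type specific while small intersections can be.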


1988 ◽  
Vol 8 (8) ◽  
pp. 3008-3016 ◽  
Author(s):  
L A Bobek ◽  
D M Rekosh ◽  
P T LoVerde

We have isolated six independent genomic clones encoding schistosome chorion or eggshell proteins from a Schistosoma mansoni genomic library. A linkage map of five of the clones spanning 35 kilobase pairs (kbp) of the S. mansoni genome was constructed. The region contained two eggshell protein genes closely linked, separated by 7.5 kbp of intergenic DNA. The two genes of the cluster were arranged in the same orientation, that is, they were transcribed from the same strand. The sixth clone probably represents a third copy of the eggshell gene that is not contained within the 35-kbp region. The 5' end of the mRNA transcribed from these genes was defined by primer extension directly off the RNA. The ATCAT cap site sequence was homologous to a silkmoth chorion PuTCATT cap site sequence, where Pu indicates any purine. DNA sequence analysis showed that there were no introns in these genes. The DNA sequences of the three genes were very homologous to each other and to a cDNA clone, pSMf61-46, differing only in three or four nucleotides. A multiple TATA box was located at positions -23 to -31, and a CAAAT sequence was located at -52 upstream of the eggshell transcription unit. Comparison of sequences in regions further upstream with silkmoth and Drosophila sequences revealed several very short elements that were shared. One such element, TCACGT, recently shown to be an essential cis-regulatory element for silkmoth chorion gene promoter function, was found at a similar position in all three organisms.


2018 ◽  
Vol 15 (138) ◽  
pp. 20170667 ◽  
Author(s):  
Sophia S. Liu ◽  
Adam J. Hockenberry ◽  
Michael C. Jewett ◽  
Luís A. N. Amaral

The unequal utilization of synonymous codons affects numerous cellular processes including translation rates, protein folding and mRNA degradation. In order to understand the biological impact of variable codon usage bias (CUB) between genes and genomes, it is crucial to be able to accurately measure CUB for a given sequence. A large number of metrics have been developed for this purpose, but there is currently no way of systematically testing the accuracy of individual metrics or knowing whether metrics provide consistent results. This lack of standardization can result in false-positive and false-negative findings if underpowered or inaccurate metrics are applied as tools for discovery. Here, we show that the choice of CUB metric impacts both the significance and measured effect sizes in numerous empirical datasets, raising questions about the generality of findings in published research. To bring about standardization, we developed a novel method to create synthetic protein-coding DNA sequences according to different models of codon usage. We use these benchmark sequences to identify the most accurate and robust metrics with regard to sequence length, GC content and amino acid heterogeneity. Finally, we show how our benchmark can aid the development of new metrics by providing feedback on their performance compared to the state of the art.
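To make "measuring CUB for a given sequence" concrete, the sketch below computes relative synonymous codon usage (RSCU), one widely used codon-usage measure. The abstract does not say which metrics were benchmarked, so the choice of RSCU, the function name `rscu`, and the example sequence are assumptions for illustration only. A value of 1.0 means a codon is used exactly as often as expected under uniform usage among its synonyms.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def rscu(seq):
    """Return {codon: RSCU} for amino acids that occur in `seq`.

    RSCU = observed codon count / (amino-acid count / number of synonyms).
    Stop codons and amino acids absent from the sequence are skipped.
    """
    seq = seq.upper()
    usable = len(seq) - len(seq) % 3  # ignore a trailing partial codon
    codons = [seq[i:i + 3] for i in range(0, usable, 3)]
    counts = Counter(c for c in codons if c in CODON_TABLE)

    # Group codons into synonymous families by encoded amino acid.
    families = {}
    for codon, aa in CODON_TABLE.items():
        families.setdefault(aa, []).append(codon)

    values = {}
    for aa, family in families.items():
        total = sum(counts[c] for c in family)
        if aa == "*" or total == 0:
            continue
        expected = total / len(family)
        for c in family:
            values[c] = counts[c] / expected
    return values


# Four alanine codons: GCT three times, GCC once.
example = rscu("GCTGCTGCTGCC")
```

For this toy sequence, GCT is used three times as often as uniform usage would predict, while GCA and GCG are unused. With only four codons the estimate is very noisy, which is the kind of sequence-length sensitivity the abstract says a benchmark should expose.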


2012 ◽  
Vol 109 (38) ◽  
pp. 15229-15234 ◽  
Author(s):  
Bethany A. Buck-Koehntop ◽  
Robyn L. Stanfield ◽  
Damian C. Ekiert ◽  
Maria A. Martinez-Yamout ◽  
H. Jane Dyson ◽  
...  

Methylation of CpG dinucleotides in DNA is a common epigenetic modification in eukaryotes that plays a central role in maintenance of genome stability, gene silencing, genomic imprinting, development, and disease. Kaiso, a bifunctional Cys2His2 zinc finger protein implicated in tumor-cell proliferation, binds to both methylated CpG (mCpG) sites and a specific nonmethylated DNA motif (TCCTGCNA) and represses transcription by recruiting chromatin remodeling corepression machinery to target genes. Here we report structures of the Kaiso zinc finger DNA-binding domain in complex with its nonmethylated, sequence-specific DNA target (KBS) and with a symmetrically methylated DNA sequence derived from the promoter region of E-cadherin. Recognition of specific bases in the major groove of the core KBS and mCpG sites is accomplished through both classical and methyl CH···O hydrogen-bonding interactions with residues in the first two zinc fingers, whereas residues in the C-terminal extension following the third zinc finger bind in the opposing minor groove and are required for high-affinity binding. The C-terminal region is disordered in the free protein and adopts an ordered structure upon binding to DNA. The structures of these Kaiso complexes provide insights into the mechanism by which a zinc finger protein can recognize mCpG sites as well as a specific, nonmethylated regulatory DNA sequence.


1989 ◽  
Vol 9 (12) ◽  
pp. 5305-5314 ◽  
Author(s):  
J Drouin ◽  
M A Trifiro ◽  
R K Plante ◽  
M Nemer ◽  
P Eriksson ◽  
...  

Glucocorticoids rapidly and specifically inhibit transcription of the pro-opiomelanocortin (POMC) gene in the anterior pituitary, thus offering a model for studying negative control of transcription in mammals. We have defined an element within the rat POMC gene 5'-flanking region that is required for glucocorticoid inhibition of POMC gene transcription in POMC-expressing pituitary tumor cells (AtT-20). This element contains an in vitro binding site for purified glucocorticoid receptor. Site-directed mutagenesis revealed that binding of the receptor to this site located at position base pair -63 is essential for glucocorticoid repression of transcription. Although related to the well-defined glucocorticoid response element (GRE) found in glucocorticoid-inducible genes, the DNA sequence of the POMC negative glucocorticoid response element (nGRE) differs significantly from the GRE consensus; this sequence divergence may result in different receptor-DNA interactions and may account at least in part for the opposite transcriptional properties of these elements. Hormone-dependent repression of POMC gene transcription may be due to binding of the receptor over a positive regulatory element of the promoter. Thus, repression may result from mutually exclusive binding of two DNA-binding proteins to overlapping DNA sequences.


Author(s):  
Essam Al Daoud

In this study, a new genetic algorithm was developed to discover the best motifs in a set of DNA sequences. The main steps were: finding the potential positions in each sequence using a few voters (1–5 sequences), constructing the chromosomes from the potential positions, evaluating the fitness of each gene (position) and of each chromosome, calculating the new random distribution, and using the new distribution to generate the next generation. To verify the effectiveness of the proposed algorithm, several real and artificial datasets were used; the results were compared with those of the standard genetic algorithm and of the Gibbs, MEME, and consensus algorithms. Although all the algorithms have low correlation with the correct motifs, the new algorithm exhibits higher accuracy, without sacrificing implementation time.
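The chromosome encoding described above, one candidate start position per sequence, can be sketched together with a simple fitness function. The abstract does not give the fitness definition, so the consensus score used here (the sum over motif columns of the most common base count) and the name `chromosome_fitness` are assumptions for illustration, not the paper's method.

```python
from collections import Counter


def chromosome_fitness(sequences, positions, motif_len):
    """Score a chromosome: one start position (gene) per sequence.

    The score is the sum, over motif columns, of the count of the most
    frequent base. This is an assumed, commonly used consensus score.
    """
    if len(sequences) != len(positions):
        raise ValueError("need exactly one position per sequence")
    sites = []
    for seq, pos in zip(sequences, positions):
        if pos < 0 or pos + motif_len > len(seq):
            raise ValueError("motif window falls outside the sequence")
        sites.append(seq[pos:pos + motif_len])
    # zip(*sites) walks the aligned sites column by column.
    return sum(Counter(column).most_common(1)[0][1] for column in zip(*sites))


toy_sequences = ["AACGTA", "TACGTT", "GGACGT"]
aligned_score = chromosome_fitness(toy_sequences, [1, 1, 2], 4)     # all sites are ACGT
misaligned_score = chromosome_fitness(toy_sequences, [0, 0, 0], 4)  # AACG, TACG, GGAC
```

A chromosome whose positions line up on the shared ACGT site reaches the maximum score of 12 (3 sequences × 4 columns), while a misaligned one scores lower, so selection pressure favours position sets that agree on a motif.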


1985 ◽  
Vol 5 (12) ◽  
pp. 3417-3428 ◽  
Author(s):  
R T Nagao ◽  
E Czarnecka ◽  
W B Gurley ◽  
F Schöffl ◽  
J L Key

Soybeans, Glycine max, synthesize a family of low-molecular-weight heat shock (HS) proteins in response to HS. The DNA sequences of two genes encoding 17.5- and 17.6-kilodalton HS proteins were determined. Nuclease S1 mapping of the corresponding mRNA indicated multiple start termini at the 5' end and multiple stop termini at the 3' end. These two genes were compared with two other soybean HS genes of similar size. A comparison among the 5' flanking regions encompassing the presumptive HS promoter of the soybean HS-protein genes demonstrated this region to be extremely homologous. Comparison of the DNA sequences in the 5' flanking regions of the soybean genes with the corresponding regions of Drosophila melanogaster HS-protein genes revealed striking similarity between plants and animals in the presumptive promoter structure of thermoinducible genes. Sequences related to the Drosophila HS consensus regulatory element were found 57 to 62 base pairs 5' to the start of transcription in addition to secondary HS consensus elements located further upstream. Comparative analysis of the deduced amino acid sequences of four soybean HS proteins illustrated that these proteins were greater than 90% homologous. Comparison of the amino acid sequences of soybean HS proteins with those of other organisms showed much lower homology (less than 20%). Hydropathy profiles for Drosophila, Xenopus, Caenorhabditis elegans, and G. max HS proteins showed a similarity of major hydrophilic and hydrophobic regions, which suggests conservation of functional domains for these proteins among widely dispersed organisms.


2007 ◽  
Vol 25 (18_suppl) ◽  
pp. 7567-7567
Author(s):  
W. A. Franklin ◽  
T. Byers ◽  
H. J. Wolf ◽  
S. Braudich ◽  
F. R. Hirsch ◽  
...  

7567 Background: Lung CT imaging shows promise for early lung cancer detection, but may be insensitive for central lesions and will create a need for complementary biomarkers to guide clinical decisions. Methods: We have been conducting a prospective study to assess the performance of various biomarkers in sputum to detect lung cancer. Within a cohort of 3,269 people at high risk for lung cancer (over 30 pack-years of cigarette use and chronic obstructive lung disease) we have conducted a nested case-control study to assay stored samples from 85 incident lung cancer cases and 73 controls for cytologic morphology, DNA methylation, and multi-targeted FISH assays (LAVysion, Vysis/Abbott Molecular). The FISH assay targeted DNA sequences from centromere 6, 5p15.2, 7p12 (EGFR), and 8q24 (cMYC). Cytology was classified as abnormal if moderate or greater atypia was observed, methylation was abnormal if three or more of 8 selected genes were methylated, and FISH was abnormal if two or more scored cells were observed to contain chromosomal aneuploidy indicated by signal gain for at least two targets or signal gain for one and signal loss for two or more targets. Results: Among all subjects, regardless of the time between sputum collection and diagnosis, FISH was abnormal in 49/87 cases (56%) and 5/73 controls (7%) (odds ratio (OR) = 17.5, 95% CI = 6.4 to 47.8). Considering only the cases for which sputum had been collected within 18 months before the lung cancer diagnosis, FISH was abnormal in 37/49 cases (75%) and in 3/38 controls (8%) (OR = 36.0, 95% CI = 9.4 to 138). For this same time period, cytological atypia was abnormal in 37% of cases and 22% of controls (OR = 2.1, 95% CI = 0.8 to 5.3) and methylation was abnormal in 64% of cases and 36% of controls (OR = 4.5, 95% CI = 1.0 to 19.8). 
Conclusions: The LAVysion FISH assay is more sensitive than other currently available sputum tests for the prediction of invasive carcinoma in subjects and is highly specific with a low frequency of false positive and false negative results. The results suggest that chromosomal aneuploidy and by extrapolation, missegregation, are features of advanced premalignancy and should identify patients who are proximate to clinically overt lung cancer. [Table: see text]
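The odds ratios in the Results can be checked from the reported counts. The abstract does not state how the confidence intervals were computed; the sketch below assumes the standard Wald interval on the log odds ratio, and the function name `odds_ratio_ci` is hypothetical. Under that assumption, the counts 49/87 and 5/73 give values that agree with the reported OR = 17.5 and 95% CI = 6.4 to 47.8.

```python
import math


def odds_ratio_ci(case_pos, case_total, ctrl_pos, ctrl_total, z=1.96):
    """Odds ratio with a Wald confidence interval (assumed method).

    Builds the 2x2 table from positives and totals in each group:
        a = abnormal cases,    b = normal cases,
        c = abnormal controls, d = normal controls.
    """
    a, b = case_pos, case_total - case_pos
    c, d = ctrl_pos, ctrl_total - ctrl_pos
    odds_ratio = (a * d) / (b * c)
    # Standard error of log(OR) for a 2x2 table.
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se)
    upper = math.exp(math.log(odds_ratio) + z * se)
    return odds_ratio, lower, upper


# All subjects: FISH abnormal in 49/87 cases and 5/73 controls.
or_all, lo_all, hi_all = odds_ratio_ci(49, 87, 5, 73)

# Sputum collected within 18 months of diagnosis: 37/49 cases, 3/38 controls.
or_18m, lo_18m, hi_18m = odds_ratio_ci(37, 49, 3, 38)
```

The width of the second interval comes mostly from the small number of abnormal controls (3), whose reciprocal dominates the standard error.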


2015 ◽  
Vol 2015 ◽  
pp. 1-10 ◽  
Author(s):  
Chunxiao Sun ◽  
Hongwei Huo ◽  
Qiang Yu ◽  
Haitao Guo ◽  
Zhigang Sun

The planted (l, d) motif search (PMS) is one of the fundamental problems in bioinformatics and plays an important role in locating transcription factor binding sites (TFBSs) in DNA sequences. Identifying weak motifs and reducing the effect of local optima remain important but challenging tasks for motif discovery. To address these tasks, we propose a new algorithm, APMotif, which first applies Affinity Propagation (AP) clustering to DNA sequences to produce informative candidate motifs and then employs Expectation Maximization (EM) refinement to obtain the optimal motifs from the candidates. Experimental results on both simulated and real biological data sets show that APMotif usually outperforms four other widely used algorithms in terms of prediction accuracy.
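The planted (l, d) motif problem itself is easy to state in code: find every string of length l that occurs in each input sequence with at most d mismatches. The sketch below is an exhaustive baseline that is only practical for small l. It is not APMotif, which uses AP clustering and EM refinement, and the function names and toy sequences are assumptions for illustration.

```python
from itertools import product


def hamming(a, b):
    """Number of mismatching positions between equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def occurs_within(motif, seq, d):
    """True if some window of `seq` is within d mismatches of `motif`."""
    l = len(motif)
    return any(hamming(motif, seq[i:i + l]) <= d for i in range(len(seq) - l + 1))


def planted_motifs(seqs, l, d):
    """Exhaustively test all 4**l candidates; feasible only for small l."""
    candidates = ("".join(p) for p in product("ACGT", repeat=l))
    return [m for m in candidates if all(occurs_within(m, s, d) for s in seqs)]


# ACGT is planted exactly in the first sequence and with one mismatch
# in the other two (ACCT and AGGT).
toy_seqs = ["TTACGTT", "GGACCTG", "CAAGGTC"]
one_mismatch = planted_motifs(toy_seqs, 4, 1)
exact_only = planted_motifs(toy_seqs, 4, 0)
```

Allowing one mismatch recovers ACGT, while requiring exact matches finds nothing. The 4**l candidate space is what makes exhaustive search impractical for longer motifs and motivates heuristic methods such as the one proposed in the abstract.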


Author(s):  
Philippe Grandcolas

Biology has already experienced great divides that decreased its global coherence and its ability to answer important scientific and societal concerns. For example, in the 20th century, the so-called “Life Sciences” developed remarkably in comparison to Natural History sciences. As a result, approaches based on model organisms dominated or prevented other approaches from being carried out on more diverse organisms, which may have given a misleading feeling of generality for the results obtained. Another great divide is at risk of developing now with the rise of what could be called “Digital Biology,” which separates from other “material-based” approaches in its tendency to consider digital data only. Some biologists adopt a somewhat essentialist view of species and DNA, considering that enough knowledge is now accumulated, and that species records can be kept and saved as digital data only (Grandcolas 2017). Examples of this include occurrence records without specimens or auxiliary documents, taxonomic descriptions based on photographs, DNA sequences without vouchers, and, lastly, DNA sequences without taxonomic names. This tendency puts at risk the sustainability, growth, and coherence of biological knowledge that is organized in a system wherein all data and notions are connected via specimens, with names and sequences being a retrieval means (Troudet et al. 2018). This tendency also ignores the robust foundation of biology, the data of which are linked to collections, vouchers, and stocks. The foundation of physical specimens exists for data concerning any living beings, be they rare wild species or selected lines of model organisms. There are now many calls for open and FAIR science, with results, methods, tools, and data not only findable, accessible, and interoperable but also re-usable. More than FAIR and digitally re-usable, data need to be sustainable. Their meaning and significance must remain open to re-analysis and re-interpretation by going back as far as possible to material vouchers. We therefore urge scientists to consider this question by providing all necessary material elements to make open and FAIR data sustainable as well.

