A Clustering Approach for Motif Discovery in ChIP-Seq Dataset

Chromatin immunoprecipitation combined with next-generation sequencing (ChIP-Seq) technology has enabled the identification of transcription factor binding sites (TFBSs) on a genome-wide scale. To effectively and efficiently discover TFBSs in the thousand or more DNA sequences of a ChIP-Seq data set, we propose a new algorithm named AP-ChIP. First, we set two thresholds based on probabilistic analysis to construct and further filter the cluster subsets. Then, we use Affinity Propagation (AP) clustering on the candidate cluster subsets to find the potential motifs. Experimental results on simulated data show that the AP-ChIP algorithm predicts TFBSs almost accurately in a reasonable time. The validity of the AP-ChIP algorithm is also tested on a real ChIP-Seq data set.

A genome-wide scan for a simulated data set using two newly developed methods

Genetic Epidemiology ◽

10.1002/gepi.13701707101 ◽

1999 ◽

Vol 17 (S1) ◽

pp. S621-S626

Author(s):

Li Hsu ◽

Corinne Aragaki ◽

Filemon Quiaoit ◽

Xiangjing Wang ◽

Xiubin Xu ◽

...

Keyword(s):

Simulated Data ◽

Data Set ◽

Genome Wide ◽

A Genome ◽

Genome Wide Scan

PairMotifChIP: A Fast Algorithm for Discovery of Patterns Conserved in Large ChIP-seq Data Sets

BioMed Research International ◽

10.1155/2016/4986707 ◽

2016 ◽

Vol 2016 ◽

pp. 1-10 ◽

Cited By ~ 3

Author(s):

Qiang Yu ◽

Hongwei Huo ◽

Dazheng Feng

Keyword(s):

DNA Sequences ◽

Motif Discovery ◽

High Throughput Sequencing ◽

Hamming Distance ◽

Simulated Data ◽

Real Data ◽

Identification Accuracy ◽

Data Sets ◽

Sequencing Data ◽

Data Set

Identifying conserved patterns in DNA sequences, namely, motif discovery, is an important and challenging computational task. High-throughput sequencing data sets, which contain hundreds of sequences or more, help improve the identification accuracy of motif discovery but demand even higher computing performance. To efficiently identify motifs in large DNA data sets, a new algorithm called PairMotifChIP is proposed, which extracts and combines pairs of l-mers in the input that have a relatively small Hamming distance. In particular, a method for rapidly extracting pairs of l-mers is designed, which can be used not only for PairMotifChIP but also for other DNA data mining tasks with the same demand. Experimental results on simulated data show that the proposed algorithm finds motifs successfully and runs faster than state-of-the-art motif discovery algorithms. Furthermore, the validity of the proposed algorithm has been verified on real data.
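
To make the idea of "pairs of l-mers with relatively small Hamming distance" concrete, here is a minimal brute-force sketch in Python. It is not the rapid extraction method described in the abstract, whose details are not given here; the function names (`hamming`, `lmers`, `close_pairs`) and the `max_dist` threshold are assumptions introduced for illustration only.

```python
from itertools import combinations


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(x != y for x, y in zip(a, b))


def lmers(seq: str, l: int) -> list:
    """All substrings of length l in seq, in left-to-right order."""
    return [seq[i:i + l] for i in range(len(seq) - l + 1)]


def close_pairs(sequences, l, max_dist):
    """Naive scan: compare every l-mer of one sequence with every l-mer of
    another and keep pairs whose Hamming distance is at most max_dist.
    Hypothetical helper; quadratic in the input size, for illustration only.
    """
    pairs = []
    for (i, s), (j, t) in combinations(enumerate(sequences), 2):
        for a in lmers(s, l):
            for b in lmers(t, l):
                d = hamming(a, b)
                if d <= max_dist:
                    pairs.append((i, j, a, b, d))
    return pairs


if __name__ == "__main__":
    for pair in close_pairs(["ACGTAC", "TTACGA"], l=4, max_dist=1):
        print(pair)
```

The quadratic scan is only practical for small inputs, which is presumably why a faster extraction step matters for data sets with hundreds of sequences or more.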

Detecting epistasis via Markov bases

Journal of Algebraic Statistics ◽

10.18409/jas.v2i1.27 ◽

2011 ◽

Vol 2 (1) ◽

Cited By ~ 3

Author(s):

Anna-Sapfo Malaspinas ◽

Caroline Uhler

Keyword(s):

Simulated Data ◽

Research Progress ◽

Nucleotide Polymorphisms ◽

Two Stage ◽

Data Set ◽

Exact Test ◽

Multiple Loci ◽

Genome Wide ◽

A Genome ◽

Logistic Regression Method

Rapid research progress in genotyping techniques has allowed large genome-wide association studies. Existing methods often focus on determining associations between single loci and a specific phenotype. However, a particular phenotype is usually the result of complex relationships between multiple loci and the environment. In this paper, we describe a two-stage method for detecting epistasis by combining the traditionally used single-locus search with a search for multiway interactions. Our method is based on an extended version of Fisher's exact test. To perform this test, a Markov chain is constructed on the space of multidimensional contingency tables using the elements of a Markov basis as moves. We test our method on simulated data and compare it to a two-stage logistic regression method and to a fully Bayesian method, showing that we are able to detect the interacting loci when other methods fail to do so. Finally, we apply our method to a genome-wide data set consisting of 685 dogs and identify epistasis associated with canine hair length for four pairs of single nucleotide polymorphisms (SNPs).
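
For orientation, the classical Fisher's exact test on a 2x2 contingency table can be sketched as follows. This is not the extended, Markov-basis version the abstract describes: a 2x2 table with fixed margins has few enough configurations to enumerate directly, whereas multidimensional tables are what motivate sampling with a Markov chain. The function names and the two-sided rule used here (summing all tables no more probable than the observed one) are assumptions chosen for illustration.

```python
from math import comb


def hypergeom_pmf(a: int, row1: int, row2: int, col1: int) -> float:
    """Probability of seeing `a` in the top-left cell of a 2x2 table
    whose row totals (row1, row2) and first column total (col1) are fixed."""
    return comb(row1, a) * comb(row2, col1 - a) / comb(row1 + row2, col1)


def fisher_exact_2x2(table) -> float:
    """Two-sided p-value: total probability of all tables with the same
    margins that are no more probable than the observed table."""
    (a, b), (c, d) = table
    row1, row2, col1 = a + b, c + d, a + c
    lowest = max(0, col1 - row2)   # smallest feasible top-left count
    highest = min(row1, col1)      # largest feasible top-left count
    p_observed = hypergeom_pmf(a, row1, row2, col1)
    total = 0.0
    for k in range(lowest, highest + 1):
        p = hypergeom_pmf(k, row1, row2, col1)
        # Small relative tolerance guards against floating-point ties.
        if p <= p_observed * (1 + 1e-9):
            total += p
    return total


if __name__ == "__main__":
    print(fisher_exact_2x2([[3, 1], [1, 3]]))
```

In a genotype-by-phenotype setting, the rows might be case and control status and the columns the presence or absence of an allele; that mapping is an illustrative assumption, not the paper's table layout.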

Eyes of Africa: The Genetics of Blindness: Study Design and Methodology

BMC Ophthalmology ◽

10.1186/s12886-021-02029-8 ◽

2021 ◽

Vol 21 (1) ◽

Author(s):

Olusola Olawoye ◽

Chimdi Chuka-Okosa ◽

Onoja Akpa ◽

Tony Realini ◽

Michael Hauser ◽

...

Keyword(s):

Genome Wide Association Study ◽

Case Control Study ◽

African Ancestry ◽

Collaborative Study ◽

Data Set ◽

Human Heredity ◽

Genome Wide ◽

A Genome ◽

Multiplex Families ◽

Control Study

Background: This report describes the design and methodology of “Eyes of Africa: The Genetics of Blindness,” a collaborative study funded through the Human Heredity and Health in Africa (H3Africa) program of the National Institutes of Health. Methods: This is a case-control study that is collecting a large, well-phenotyped data set among glaucoma patients and controls for a genome-wide association study (GWAS). Multiplex families segregating Mendelian forms of early-onset glaucoma will also be collected for exome sequencing. Discussion: A total of 4500 cases/controls have been recruited into the study as of the end of the third funded year. All of these participants have been appropriately phenotyped, and blood samples have been received from them. A recent GWAS of POAG in African individuals demonstrated a genome-wide significant association with the APBB2 locus, an association that is unique to individuals of African ancestry. This study will add to the existing knowledge and understanding of POAG in the African population.

F-Box Genes in the Wheat Genome and Expression Profiling in Wheat at Different Developmental Stages

Genes ◽

10.3390/genes11101154 ◽

2020 ◽

Vol 11 (10) ◽

pp. 1154

Author(s):

Min Jeong Hong ◽

Jin-Baek Kim ◽

Yong Weon Seo ◽

Dae Yeon Kim

Keyword(s):

Developmental Stages ◽

Brachypodium Distachyon ◽

Triticum Aestivum L ◽

Wheat Genome ◽

Post Translational Modification ◽

Genome Wide ◽

A Genome ◽

High Sequence Homology ◽

Wide Scale

Genes of the F-box family play specific roles in protein degradation by post-translational modification in several biological processes, including flowering, the regulation of circadian rhythms, photomorphogenesis, seed development, leaf senescence, and hormone signaling. F-box genes have not been previously investigated on a genome-wide scale; however, the establishment of the wheat (Triticum aestivum L.) reference genome sequence enabled a genome-based examination of the F-box genes to be conducted in the present study. In total, 1796 F-box genes were detected in the wheat genome and classified into various subgroups based on their functional C-terminal domain. The F-box genes were distributed among 21 chromosomes and most showed high sequence homology with F-box genes located on the homoeologous chromosomes because of allohexaploidy in the wheat genome. Additionally, a synteny analysis of wheat F-box genes was conducted in rice and Brachypodium distachyon. Transcriptome analysis during various wheat developmental stages and expression analysis by quantitative real-time PCR revealed that some F-box genes were specifically expressed in the vegetative and/or seed developmental stages. A genome-based examination and classification of F-box genes provide an opportunity to elucidate the biological functions of F-box genes in wheat.

Genome-wide mapping of unexplored essential regions in the Saccharomyces cerevisiae genome: evidence for hidden synthetic lethal combinations in a genetic interaction network

Nucleic Acids Research ◽

10.1093/nar/gku576 ◽

2014 ◽

Vol 42 (15) ◽

pp. 9838-9853 ◽

Cited By ~ 6

Author(s):

Saeed Kaboli ◽

Takuya Yamakawa ◽

Keisuke Sunada ◽

Tao Takagaki ◽

Yu Sasano ◽

...

Keyword(s):

Saccharomyces Cerevisiae ◽

Genetic Interaction ◽

Interaction Network ◽

Essential Genes ◽

Genetic Interactions ◽

Lethal Gene ◽

Synthetic Lethal ◽

Genome Wide ◽

A Genome ◽

Wide Scale

Despite systematic approaches to mapping networks of genetic interactions in Saccharomyces cerevisiae, exploration of genetic interactions on a genome-wide scale has been limited. The S. cerevisiae haploid genome has 110 regions that are longer than 10 kb but harbor only non-essential genes. Here, we attempted to delete these regions by PCR-mediated chromosomal deletion technology (PCD), which enables chromosomal segments to be deleted by a one-step transformation. Thirty-three of the 110 regions could be deleted, but the remaining 77 regions could not. To determine whether the 77 undeletable regions are essential, we successfully converted 67 of them to mini-chromosomes marked with URA3 using PCR-mediated chromosome splitting technology and conducted a mitotic loss assay of the mini-chromosomes. Fifty-six of the 67 regions were found to be essential for cell growth, and 49 of these carried co-lethal gene pair(s) that had not previously been detected by synthetic genetic array analysis. This result implies that regions harboring only non-essential genes contain unidentified synthetic lethal combinations at an unexpectedly high frequency, revealing a novel landscape of genetic interactions in the S. cerevisiae genome. Furthermore, this study indicates that segmental deletion might be exploited not only for revealing genome function but also for breeding stress-tolerant strains.

Genome-wide characterization of copy number variations in the host genome in genetic resistance to Marek's disease using next generation sequencing

10.21203/rs.2.12741/v2 ◽

2020 ◽

Author(s):

Hao Bai ◽

Yanghua He ◽

Yi Ding ◽

Huanmin Zhang ◽

Jilan Chen ◽

...

Keyword(s):

Next Generation Sequencing ◽

Copy Number ◽

Marek's Disease ◽

Neoplastic Disease ◽

Next Generation ◽

qRT-PCR ◽

Genome Wide ◽

A Genome ◽

Generation Sequencing

Background: Marek’s disease (MD) is a highly neoplastic disease primarily affecting chickens, and remains a chronic infectious disease that threatens the poultry industry. Copy number variation (CNV) has been examined in many species and is recognized as a major source of genetic variation that directly contributes to phenotypic variation such as resistance to infectious diseases. Two highly inbred chicken lines 63 (MD-resistant) and 72 (MD-susceptible), as well as their F1 generation and six recombinant congenic strains (RCSs) with varied susceptibility to MD, are considered ideal models for identifying the complex mechanisms of genetic and molecular resistance to MD. Results: In the present study, to unravel the potential genetic mechanisms underlying resistance to MD, we performed genome-wide CNV detection using next generation sequencing on the inbred chicken lines with the assistance of CNVnator. As a result, a total of 1,649 CNV regions (CNVRs) were successfully identified after merging all nine datasets, of which 90 CNVRs overlapped across all the chicken lines. Within these shared regions, 1,360 harbored genes were identified. In addition, 55 and 44 CNVRs with 62 and 57 harbored genes were specifically identified in line 63 and 72, respectively. Bioinformatics analysis showed that the nearby genes were significantly enriched in 36 GO terms and 6 KEGG pathways, including the JAK/STAT signaling pathway. Ten CNVRs (nine deletions and one duplication) involved in 10 disease-related genes were selected for validation by qRT-PCR, all of which were successfully confirmed. Finally, qRT-PCR was also used to validate two deletion events in line 72 that were definitely normal in line 63. One high-confidence gene, IRF2, was identified as the most promising candidate gene underlying resistance and susceptibility to MD in view of its function and its overlap with data from a previous study. Conclusions: Our findings provide valuable insights for understanding the genetic mechanism of resistance to MD, and the identified gene and pathway could be considered as the subject of further functional characterization.

Machine-learning annotation of human splicing branchpoints

10.1101/094003 ◽

2016 ◽

Cited By ~ 3

Author(s):

Bethany Signal ◽

Brian S Gloss ◽

Marcel E Dinger ◽

Timothy R Mercer

Keyword(s):

Machine Learning ◽

Learning Algorithm ◽

Gene Splicing ◽

Genetic Encoding ◽

Genome Wide ◽

Common Genetic Variants ◽

A Genome ◽

Wide Scale ◽

The Impact ◽

Splicing Patterns

Background: The branchpoint element is required for the first lariat-forming reaction in splicing. However, due to difficulty in experimentally mapping at a genome-wide scale, current catalogues are incomplete. Results: We have developed a machine-learning algorithm trained with empirical human branchpoint annotations to identify branchpoint elements from primary genome sequence alone. Using this approach, we can accurately locate branchpoint elements in 85% of introns in current gene annotations. Consistent with branchpoints as basal genetic elements, we find our annotation is unbiased towards gene type and expression levels. A major fraction of introns was found to encode multiple branchpoints, raising the prospect that mutational redundancy is encoded in key genes. We also confirmed all deleterious branchpoint mutations annotated in clinical variant databases, and further identified thousands of clinical and common genetic variants with similar predicted effects. Conclusions: We propose that the broad annotation of branchpoints constitutes a valuable resource for further investigations into the genetic encoding of splicing patterns and for interpreting the impact of common and disease-causing human genetic variation on gene splicing.

Emerging Technologies for Genome-Wide Profiling of DNA Breakage

Frontiers in Genetics ◽

10.3389/fgene.2020.610386 ◽

2021 ◽

Vol 11 ◽

Author(s):

Matthew J. Rybin ◽

Melina Ramic ◽

Natalie R. Ricciardi ◽

Philipp Kapranov ◽

Claes Wahlestedt ◽

...

Keyword(s):

Genome Instability ◽

DNA Double Strand Breaks ◽

Single Nucleotide ◽

Strand Breaks ◽

Single Strand Breaks ◽

Genome Wide ◽

A Genome ◽

Wide Scale ◽

Nucleotide Resolution ◽

Genomic Regions

Genome instability is associated with myriad human diseases and is a well-known feature of both cancer and neurodegenerative disease. Until recently, the ability to assess DNA damage—the principal driver of genome instability—was limited to relatively imprecise methods or restricted to studying predefined genomic regions. Recently, new techniques for detecting DNA double strand breaks (DSBs) and single strand breaks (SSBs) with next-generation sequencing on a genome-wide scale with single nucleotide resolution have emerged. With these new tools, efforts are underway to define the “breakome” in normal aging and disease. Here, we compare the relative strengths and weaknesses of these technologies and their potential application to studying neurodegenerative diseases.

An ancestral recombination graph of human, Neanderthal, and Denisovan genomes

Science Advances ◽

10.1126/sciadv.abc0776 ◽

2021 ◽

Vol 7 (29) ◽

pp. eabc0776

Author(s):

Nathan K. Schaefer ◽

Beth Shapiro ◽

Richard E. Green

Keyword(s):

Incomplete Lineage Sorting ◽

Simulated Data ◽

Modern Human ◽

Ancestral Recombination Graph ◽

Lineage Sorting ◽

Human Genomes ◽

Genome Wide ◽

A Genome ◽

Graph Inference ◽

And Function

Many humans carry genes from Neanderthals, a legacy of past admixture. Existing methods detect this archaic hominin ancestry within human genomes using patterns of linkage disequilibrium or direct comparison to Neanderthal genomes. Each of these methods is limited in sensitivity and scalability. We describe a new ancestral recombination graph inference algorithm that scales to large genome-wide datasets and demonstrate its accuracy on real and simulated data. We then generate a genome-wide ancestral recombination graph including human and archaic hominin genomes. From this, we generate a map within human genomes of archaic ancestry and of genomic regions not shared with archaic hominins either by admixture or incomplete lineage sorting. We find that only 1.5 to 7% of the modern human genome is uniquely human. We also find evidence of multiple bursts of adaptive changes specific to modern humans within the past 600,000 years involving genes related to brain development and function.
