Improved peak-calling with MACS2

2018 ◽  
Author(s):  
John M. Gaspar

The computational analyses of genome-enrichment assays, such as ChIP-seq and ATAC-seq, are typically concluded with a peak-calling program that identifies genomic regions that are significantly enriched. The most popular peak-caller, MACS2, assumes that the input alignment files are for single-end sequence reads by default, yet those with paired-end Illumina sequence data frequently use this default setting. This leads to erroneous coverage values and suboptimal peak identification. However, using the correct paired-end mode can introduce another set of artifacts. After thoroughly reviewing the MACS2 source code, we have modified it to limit these and other problems. Our updated version is freely available (https://github.com/jsh58/MACS).
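To make the coverage problem concrete, here is a toy sketch comparing depth computed from true fragment spans (as in a paired-end mode) with depth computed by extending each read start by one fixed length (as in a single-end model). It is not MACS2 code; the fragment coordinates, the genome length, and the `ASSUMED_LEN` value are hypothetical, chosen only for illustration.

```python
def coverage(intervals, genome_length):
    """Per-base depth for half-open [start, end) intervals."""
    depth = [0] * genome_length
    for start, end in intervals:
        for pos in range(max(start, 0), min(end, genome_length)):
            depth[pos] += 1
    return depth


# Hypothetical sequenced fragments as (start, end) on a 100 bp toy genome.
fragments = [(10, 40), (15, 95)]

# Paired-end view: both fragment ends are known, so cover the real span.
paired = coverage(fragments, 100)

# Single-end view: only the 5' start is used, and every read is extended
# by one assumed fragment length (a made-up value here).
ASSUMED_LEN = 50
single = coverage([(start, start + ASSUMED_LEN) for start, _ in fragments], 100)

# Positions where the two models disagree on depth.
mismatches = [pos for pos in range(100) if paired[pos] != single[pos]]
```

In this toy case the short fragment is over-extended and the long fragment is truncated, so the two depth profiles disagree at many positions even though they come from the same fragments.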

2021 ◽  
Vol 99 (Supplement_3) ◽  
pp. 25-25
Author(s):  
Muhammad Yasir Nawaz ◽  
Rodrigo Pelicioni Savegnago ◽  
Cedric Gondro

Abstract. In this study, we detected genome-wide footprints of selection in Hanwoo and Angus beef cattle using different allele frequency and haplotype-based methods based on imputed whole genome sequence data. Our dataset included 13,202 Angus and 10,437 Hanwoo animals with 10,057,633 and 13,241,550 imputed SNPs, respectively. A subset of data with 6,873,624 common SNPs between the two populations was used to estimate signatures of selection parameters, both within (runs of homozygosity and extended haplotype homozygosity) and between (allele fixation index, extended haplotype homozygosity) the breeds in order to infer evidence of selection. We observed that correlations between various measures of selection ranged from 0.01 to 0.42. Assuming these parameters were complementary to each other, we combined them into a composite selection signal to identify regions under selection in both beef breeds. The composite signal was based on the average of fractional ranks of individual selection measures for every SNP. We identified some selection signatures that were common between the breeds while others were independent. We also observed that more genomic regions were selected in Angus as compared to Hanwoo. Candidate genes within significant genomic regions may help explain mechanisms of adaptation, domestication history and loci for important traits in Angus and Hanwoo cattle. In the future, we will use the top SNPs under selection for genomic prediction of carcass traits in both breeds.
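The composite signal described above (the average of fractional ranks across selection measures, per SNP) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the abstract does not say how ranks are scaled or how ties are handled, so the sketch assumes rank / (n + 1) with tied values sharing their average rank, and the measure names and scores are hypothetical.

```python
def fractional_ranks(values):
    """Return rank / (n + 1) per value; tied values share their average rank."""
    n = len(values)
    order = sorted(range(n), key=lambda idx: values[idx])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    return [rank / (n + 1) for rank in ranks]


def composite_signal(measures):
    """Average the fractional ranks of every measure, SNP by SNP.

    `measures` maps a measure name to a list of per-SNP scores, where a
    higher score is assumed to mean stronger evidence of selection.
    """
    ranked = [fractional_ranks(scores) for scores in measures.values()]
    n_snps = len(ranked[0])
    return [sum(col[s] for col in ranked) / len(ranked) for s in range(n_snps)]


# Hypothetical scores for three SNPs under two measures.
example = {"fst": [0.1, 0.5, 0.3], "ihs": [2.0, 1.0, 3.0]}
signal = composite_signal(example)
```

Because each measure is reduced to ranks before averaging, measures on very different scales contribute equally to the composite value.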


2016 ◽  
Vol 1 ◽  
pp. 4 ◽  
Author(s):  
Sarah Auburn ◽  
Ulrike Böhme ◽  
Sascha Steinbiss ◽  
Hidayat Trimarsanto ◽  
Jessica Hostetler ◽  
...  

Plasmodium vivax is now the predominant cause of malaria in the Asia-Pacific, South America and Horn of Africa. Laboratory studies of this species are constrained by the inability to maintain the parasite in continuous ex vivo culture, but genomic approaches provide an alternative and complementary avenue to investigate the parasite’s biology and epidemiology. To date, molecular studies of P. vivax have relied on the Salvador-I reference genome sequence, derived from a monkey-adapted strain from South America. However, the Salvador-I reference remains highly fragmented with over 2500 unassembled scaffolds. Using high-depth Illumina sequence data, we assembled and annotated a new reference sequence, PvP01, sourced directly from a patient from Papua, Indonesia. Draft assemblies of isolates from China (PvC01) and Thailand (PvT01) were also prepared for comparative purposes. The quality of the PvP01 assembly is improved greatly over Salvador-I, with fragmentation reduced to 226 scaffolds. Detailed manual curation has ensured highly comprehensive annotation, with functions attributed to 58% of core genes in PvP01 versus 38% in Salvador-I. The assemblies of PvP01, PvC01 and PvT01 are larger than that of Salvador-I (28-30 versus 27 Mb), owing to improved assembly of the subtelomeres. An extensive repertoire of over 1200 Plasmodium interspersed repeat (pir) genes was identified in PvP01 compared to 346 in Salvador-I, suggesting a vital role in parasite survival or development. The manually curated PvP01 reference and PvC01 and PvT01 draft assemblies are important new resources to study vivax malaria. PvP01 is maintained at GeneDB and ongoing curation will ensure continual improvements in assembly and annotation quality.


2019 ◽  
Vol 35 (21) ◽  
pp. 4430-4432 ◽  
Author(s):  
René L Warren ◽  
Lauren Coombe ◽  
Hamid Mohamadi ◽  
Jessica Zhang ◽  
Barry Jaquish ◽  
...  

Abstract. Motivation: In the modern genomics era, genome sequence assemblies are routine practice. However, depending on the methodology, resulting drafts may contain considerable base errors. Although utilities exist for genome base polishing, they work best with high read coverage and do not scale well. We developed ntEdit, a Bloom filter-based genome sequence editing utility that scales to large mammalian and conifer genomes. Results: We first tested ntEdit and the state-of-the-art assembly improvement tools GATK, Pilon and Racon on controlled Escherichia coli and Caenorhabditis elegans sequence data. Generally, ntEdit performs well at low sequence depths (<20×), fixing the majority (>97%) of base substitutions and indels, and its performance is largely constant with increased coverage. In all experiments conducted using a single CPU, the ntEdit pipeline executed in <14 s and <3 m, on average, on E. coli and C. elegans, respectively. We performed similar benchmarks on a sub-20× coverage human genome sequence dataset, inspecting accuracy and resource usage in editing chromosomes 1 and 21, and whole genome. ntEdit scaled linearly, executing in 30–40 m on those sequences. We show how ntEdit ran in <2 h 20 m to improve upon long and linked read human genome assemblies of NA12878, using high-coverage (54×) Illumina sequence data from the same individual, fixing frame shifts in coding sequences. We also generated 17-fold coverage spruce sequence data from haploid sequence sources (seed megagametophyte), and used it to edit our pseudo haploid assemblies of the 20 Gb interior and white spruce genomes in <4 and <5 h, respectively, making roughly 50M edits at a (substitution+indel) rate of 0.0024. Availability and implementation: https://github.com/bcgsc/ntedit Supplementary information: Supplementary data are available at Bioinformatics online.
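The general idea behind Bloom filter-based editing can be sketched as below: read k-mers are stored in a Bloom filter, and draft-assembly k-mers that are absent from it are flagged as candidate errors. This is a minimal sketch of that idea only, not ntEdit's implementation; the class, the function names, the filter size, the hash scheme, and the sequences are all assumptions made for illustration.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits=1 << 16, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        # Derive several bit positions by salting one hash with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def kmers(seq, k):
    """All overlapping substrings of length k."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def unsupported_positions(draft, reads, k):
    """Start positions of draft k-mers that the read k-mer filter lacks."""
    read_filter = BloomFilter()
    for read in reads:
        for kmer in kmers(read, k):
            read_filter.add(kmer)
    return [i for i, kmer in enumerate(kmers(draft, k)) if kmer not in read_filter]


# Hypothetical read and a draft carrying one substitution (G -> T at index 6).
reads = ["ACGTACGGTCA"]
flagged = unsupported_positions("ACGTACTGTCA", reads, k=4)
```

A Bloom filter keeps memory use fixed regardless of how many k-mers are inserted, at the cost of a small false-positive rate, so a true error can occasionally go unflagged but a supported k-mer is never flagged.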


2004 ◽  
Vol 85 (1) ◽  
pp. 45-48 ◽  
Author(s):  
Linda M. Kohn

Abstract. Phylogenetic or genealogical interpretation of DNA sequence data from multiple genomic regions has become the gold standard for species delimitation and population genetics. Precise species concepts can inform quarantine decisions but are likely to reflect evolutionary events too far in the past to impact disease management. On the other hand, multilocus approaches at the population level can identify patterns of endemism or migration directly associated with episodes of disease, including host shifts and associated changes in determinants of pathogenicity and avirulence. We used the genome database of Magnaporthe grisea to frame a comparative, multilocus genomics approach from which we demonstrate a single origin for rice-infecting genotypes with concomitant loss of sex in pandemic clonal lineages, and patterns of gain and loss of avirulence genes. In the Sclerotinia sclerotiorum pathosystem, we identified significant associations of multilocus haplotypes with specific pathogen populations in North America. Following the introduction of a new crop, endemic pathogen genotypes and newly evolved migrant genotypes caused novel, early-season symptoms.


2014 ◽  
Vol 30 (9) ◽  
pp. 1302-1304 ◽  
Author(s):  
Michael T. McCarthy ◽  
Christopher A. O’Callaghan

2015 ◽  
Vol 59 (7) ◽  
pp. 4139-4147 ◽  
Author(s):  
Hannah M. Adams ◽  
Xiang Li ◽  
Carmela Mascio ◽  
Laurent Chesnel ◽  
Kelli L. Palmer

Abstract. Clostridium difficile infection (CDI) is an urgent public health concern causing considerable clinical and economic burdens. CDI can be treated with antibiotics, but recurrence of the disease following successful treatment of the initial episode often occurs. Surotomycin is a rapidly bactericidal cyclic lipopeptide antibiotic that is in clinical trials for CDI treatment and that has demonstrated superiority over vancomycin in preventing CDI relapse. Surotomycin is a structural analogue of the membrane-active antibiotic daptomycin. Previously, we utilized in vitro serial passage experiments to derive C. difficile strains with reduced surotomycin susceptibilities. The parent strains used included ATCC 700057 and clinical isolates from the restriction endonuclease analysis (REA) groups BI and K. Serial passage experiments were also performed with vancomycin-resistant and vancomycin-susceptible Enterococcus faecium and Enterococcus faecalis. The goal of this study is to identify mutations associated with reduced surotomycin susceptibility in C. difficile and enterococci. Illumina sequence data generated for the parent strains and serial passage isolates were compared. We identified nonsynonymous mutations in genes coding for cardiolipin synthase in C. difficile ATCC 700057, enoyl-(acyl carrier protein) reductase II (FabK) and cell division protein FtsH2 in C. difficile REA type BI, and a PadR family transcriptional regulator in C. difficile REA type K. Among the 4 enterococcal strain pairs, 20 mutations were identified, and those mutations overlap those associated with daptomycin resistance. These data give insight into the mechanism of action of surotomycin against C. difficile, possible mechanisms for resistance emergence during clinical use, and the potential impacts of surotomycin therapy on intestinal enterococci.


2019 ◽  
Author(s):  
Alexis Criscuolo ◽  
Sylvie Issenhuth-Jeanjean ◽  
Xavier Didelot ◽  
Kaisa Thorell ◽  
James Hale ◽  
...  

Abstract. Bacteria and archaea make up most of natural diversity but the mechanisms that underlie the origin and maintenance of prokaryotic species are poorly understood. We investigated the speciation history of the genus Salmonella, an ecologically diverse bacterial lineage, within which S. enterica subsp. enterica is responsible for important human food-borne infections. We performed a survey of diversity across a large reference collection using multilocus sequence typing, followed by genome sequencing of distinct lineages. We identified eleven distinct phylogroups, three of which were previously undescribed. Strains assigned to S. enterica subsp. salamae are polyphyletic, with two distinct lineages that we designate Salamae A and Salamae B. Strains of subspecies houtenae are subdivided into two groups, Houtenae A and B and are both related to Selander’s group VII. A phylogroup we designate VIII was previously unknown. A simple binary fission model of speciation cannot explain observed patterns of sequence diversity. In the recent past, there have been large scale hybridization events involving an unsampled ancestral lineage and three distantly related lineages of the genus that have given rise to Houtenae A, Houtenae B and VII. We found no evidence for ongoing hybridization in the other eight lineages but detected more subtle signals of ancient recombination events. We are unable to fully resolve the speciation history of the genus, which might have involved additional speciation-by-hybridization or multi-way speciation events. Our results imply that traditional models of speciation by binary fission and divergence may not apply in Salmonella.
Data summary: Illumina sequence data were submitted to the European Nucleotide Archive under project number PRJEB2099 and are available from INSDC (NCBI/ENA/DDBJ) under accession numbers ERS011101 to ERS011146. The MLST sequence and profile data generated in this study have been publicly available on the Salmonella MLST web site between 2010 and the migration of the Salmonella MLST website to EnteroBase (https://enterobase.warwick.ac.uk/), and subsequently from there.


2018 ◽  
Author(s):  
Jun Zheng ◽  
Erliang Zeng ◽  
Yicong Du ◽  
Cheng He ◽  
Ying Hu ◽  
...  

Abstract. Small RNAs (sRNAs) are short noncoding RNAs that play roles in many biological processes, including drought responses in plants. However, how the expression of sRNAs dynamically changes with the gradual imposition of drought stress in plants is largely unknown. We generated time-series sRNA sequence data from maize seedlings under drought stress and under well-watered conditions at the same time points. Analyses of length, functional annotation, and abundance of 736,372 non-redundant sRNAs from both drought and well-watered data, as well as genome copy number and chromatin modifications at the corresponding genomic regions, revealed distinct patterns of abundance, genome organization, and chromatin modifications for different classes of sRNAs. The analysis identified 6,646 sRNAs whose regulation was altered in response to drought stress. Among drought-responsive sRNAs, 1,325 showed transient down-regulation by the seventh day, coinciding with visible symptoms of drought stress. The profiles revealed drought-responsive microRNAs, as well as other sRNAs that originated from ribosomal RNAs (rRNAs), splicing small nuclear RNAs, and small nucleolar RNAs (snoRNAs). Expression profiles of their sRNA derivers indicated that snoRNAs might play a regulatory role through regulating stability of rRNAs and splicing small nuclear RNAs under drought conditions.


2019 ◽  
Author(s):  
Aseel Awdeh ◽  
Marcel Turcotte ◽  
Theodore J. Perkins

Abstract. Motivation: Chromatin immunoprecipitation followed by high throughput sequencing (ChIP-seq), initially introduced more than a decade ago, is widely used by the scientific community to detect protein/DNA binding and histone modifications across the genome. Every experiment is prone to noise and bias, and ChIP-seq experiments are no exception. To alleviate bias, the incorporation of control datasets in ChIP-seq analysis is an essential step. The controls are used to account for the background signal, while the remainder of the ChIP-seq signal captures true binding or histone modification. However, a recurrent issue is different types of bias in different ChIP-seq experiments. Depending on which controls are used, different aspects of ChIP-seq bias are better or worse accounted for, and peak calling can produce different results for the same ChIP-seq experiment. Consequently, generating “smart” controls, which model the non-signal effect for a specific ChIP-seq experiment, could enhance contrast and increase the reliability and reproducibility of the results. Results: We propose a peak calling algorithm, Weighted Analysis of ChIP-seq (WACS), which is an extension of the well-known peak caller MACS2. There are two main steps in WACS: First, weights are estimated for each control using non-negative least squares regression. The goal is to customize controls to model the noise distribution for each ChIP-seq experiment. This is then followed by peak calling. We demonstrate that WACS significantly outperforms MACS2 and AIControl, another recent algorithm for generating smart controls, in the detection of enriched regions along the genome, in terms of motif enrichment and reproducibility analyses. Conclusion: This ultimately improves our understanding of ChIP-seq controls and their biases, and shows that WACS results in a better approximation of the noise distribution in controls.
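The weight-estimation step can be illustrated with a small non-negative least squares fit: find weights w >= 0 that minimize the squared difference between the ChIP signal and a weighted sum of control signals. The sketch below is not WACS code. It solves the problem with plain projected gradient descent rather than a dedicated NNLS solver, and the function name, the binned counts, and the iteration count are assumptions made for illustration.

```python
def fit_control_weights(controls, chip, steps=500):
    """Non-negative least squares via projected gradient descent.

    controls: list of k control tracks, each a list of n per-bin counts.
    chip:     list of n per-bin counts for the ChIP-seq experiment.
    Returns k weights, all >= 0, minimizing ||sum_i w_i * control_i - chip||^2.
    """
    k = len(controls)
    # Gram matrix (control_i . control_j) and linear term (control_i . chip).
    gram = [[sum(a * b for a, b in zip(controls[i], controls[j]))
             for j in range(k)] for i in range(k)]
    linear = [sum(a * b for a, b in zip(controls[i], chip)) for i in range(k)]
    # The trace bounds the largest eigenvalue, so 1 / trace is a safe step.
    step = 1.0 / sum(gram[i][i] for i in range(k))
    weights = [0.0] * k
    for _ in range(steps):
        grad = [sum(gram[i][j] * weights[j] for j in range(k)) - linear[i]
                for i in range(k)]
        # Gradient step, then project back onto the non-negative orthant.
        weights = [max(0.0, weights[i] - step * grad[i]) for i in range(k)]
    return weights


def weighted_background(controls, weights):
    """Combine the control tracks into one background track."""
    n = len(controls[0])
    return [sum(w * track[b] for w, track in zip(weights, controls))
            for b in range(n)]


# Hypothetical binned counts for two controls and one ChIP experiment.
controls = [[1, 1, 2, 0], [0, 1, 0, 2]]
chip = [2.0, 2.5, 4.0, 1.0]  # equals 2 * control 1 + 0.5 * control 2
weights = fit_control_weights(controls, chip)
background = weighted_background(controls, weights)
```

The non-negativity constraint matters when a control is uninformative for a given experiment: its weight is driven to zero rather than to a negative value, which would have no sensible interpretation as a background contribution.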

