PHARP: A pig haplotype reference panel for genotype imputation

2021 ◽  
Author(s):  
Zhen Wang ◽  
Zhenyang Zhang ◽  
Zitao Chen ◽  
Jiabao Sun ◽  
Caiyun Cao ◽  
...  

Pigs are not only a major meat source worldwide but are also commonly used as an animal model for studying complex human traits. Large haplotype reference panels have been used to facilitate efficient phasing and imputation of relatively sparse genome-wide microarray chips and low-coverage sequencing data. Using the imputed genotypes in downstream analyses, such as GWASs, TWASs, eQTL mapping and genomic prediction (GS), is beneficial for obtaining novel findings. However, there is still a lack of publicly available, high-quality pig reference panels with large sample sizes and high diversity, which greatly limits the application of genotype imputation in pigs. In response, we built the Pig Haplotype Reference Panel (PHARP) database. PHARP provides a reference panel of 2,012 pig haplotypes at 34 million SNPs constructed using whole-genome sequence data from more than 49 studies of 71 pig breeds. It also provides Web-based analytical tools that allow researchers to carry out phasing and imputation consistently and efficiently. PHARP is freely accessible at http://alphaindex.zju.edu.cn/PHARP/index.php. We demonstrate its applicability to commercial pig 50K SNP arrays by accurately imputing 2.6 billion genotypes at a concordance rate of 0.971 in 81 Large White pigs (~17x sequencing coverage). We also applied our reference panel to impute low-density SNP chip data to high density for three GWASs and found novel significantly associated SNPs that might be causal variants.
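The concordance rate quoted above is, in its usual sense, the fraction of imputed genotype calls that agree with calls from high-coverage sequencing of the same animals. The abstract does not give the exact computation, so the following is only a minimal sketch under assumptions: genotypes are coded as alternate-allele counts (0, 1, 2), missing calls are `None`, and the function name `concordance_rate` and the toy data are hypothetical.

```python
def concordance_rate(imputed, truth):
    """Fraction of comparable genotype calls where imputed equals truth.

    Genotypes are assumed to be alternate-allele counts (0, 1, 2);
    pairs with a missing call (None) on either side are skipped.
    """
    compared = 0
    matched = 0
    for imp, tru in zip(imputed, truth):
        if imp is None or tru is None:
            continue
        compared += 1
        if imp == tru:
            matched += 1
    if compared == 0:
        raise ValueError("no comparable genotype pairs")
    return matched / compared


# Toy data (made up for illustration): 7 comparable calls, 6 of them agree.
imputed = [0, 1, 2, 1, 0, None, 2, 1]
truth = [0, 1, 2, 2, 0, 1, 2, 1]
rate = concordance_rate(imputed, truth)
print(f"concordance rate: {rate:.3f}")
```

In practice the comparison would be run per site or per sample over calls read from VCF files; the flat lists here only stand in for that.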

2015 ◽  
Author(s):  
Shane McCarthy ◽  
Sayantan Das ◽  
Warren Kretzschmar ◽  
Olivier Delaneau ◽  
Andrew R. Wood ◽  
...  

We describe a reference panel of 64,976 human haplotypes at 39,235,157 SNPs constructed using whole genome sequence data from 20 studies of predominantly European ancestry. Using this resource leads to accurate genotype imputation at minor allele frequencies as low as 0.1%, a large increase in the number of SNPs tested in association studies and can help to discover and refine causal loci. We describe remote server resources that allow researchers to carry out imputation and phasing consistently and efficiently.


2021 ◽  
Author(s):  
Changheng Zhao ◽  
Jun Teng ◽  
Xinhao Zhang ◽  
Dan Wang ◽  
Xinyi Zhang ◽  
...  

Background: Low-coverage whole-genome sequencing is a low-cost genotyping technology. Combined with genotype imputation approaches, it is likely to become a critical component of cost-efficient genomic selection programs in agricultural livestock. Here, we used the low-coverage sequence data of 617 Dezhou donkeys to investigate the performance of genotype imputation for low-coverage whole-genome sequence data and of genomic selection based on the imputed genotype data. The specific aims were: (i) to measure the accuracy of genotype imputation under different sequencing depths, sample sizes, MAFs, and imputation pipelines; and (ii) to assess the accuracy of genomic selection under different marker densities derived from the imputed sequence data, different strategies for constructing the genomic relationship matrices, and single- vs multi-trait models. Results: We found that a high imputation accuracy (> 0.95) can be achieved for sequence data with a sequencing depth as low as 1x and 400 sequenced individuals. For genomic selection, the best performance was obtained by using a marker density of 410K and a G matrix constructed using marker dosage information. Multi-trait GBLUP performed better than single-trait GBLUP. Conclusions: Our study demonstrates that low-coverage whole-genome sequencing would be a cost-effective method for genomic selection in Dezhou donkeys.
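The abstract does not say which formulation was used to build the G matrix from marker dosages. As an illustration only, the sketch below assumes VanRaden's first method, G = ZZ' / (2 Σ p(1 − p)), with allele frequencies p estimated from the sample itself and Z holding dosages centred by 2p. The function name `vanraden_g` and the dosage values are hypothetical.

```python
def vanraden_g(dosages):
    """Genomic relationship matrix from a list of per-individual dosage rows.

    Assumes VanRaden's first method: centre each marker by twice its allele
    frequency, then scale Z Z' by 2 * sum(p * (1 - p)).
    """
    n = len(dosages)
    m = len(dosages[0])
    # Allele frequency per marker, estimated as mean dosage / 2.
    freqs = [sum(row[j] for row in dosages) / (2 * n) for j in range(m)]
    scale = 2 * sum(p * (1 - p) for p in freqs)
    if scale == 0:
        raise ValueError("all markers are monomorphic")
    # Centred dosage matrix Z.
    z = [[row[j] - 2 * freqs[j] for j in range(m)] for row in dosages]
    return [
        [sum(z[a][j] * z[b][j] for j in range(m)) / scale for b in range(n)]
        for a in range(n)
    ]


# Toy imputed dosages (made up): 3 individuals x 3 markers, values in [0, 2].
dosages = [
    [0.1, 1.9, 1.0],
    [1.0, 1.2, 0.3],
    [1.8, 0.4, 2.0],
]
g = vanraden_g(dosages)
for row in g:
    print([round(v, 3) for v in row])
```

Using dosages rather than hard-called genotypes keeps the imputation uncertainty in the relationship estimates, which is one plausible reading of why the dosage-based matrix performed best.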


2021 ◽  
Author(s):  
Yiheng Hu ◽  
Laszlo Irinyi ◽  
Minh Thuy Vi Hoang ◽  
Tavish Eenjes ◽  
Abigail Graetz ◽  
...  

Background: The kingdom Fungi is crucial for life on earth and is highly diverse. Yet fungi are challenging to characterize. They can be difficult to culture and may be morphologically indistinct in culture. They can have complex genomes of over 1 Gb in size and are still underrepresented in whole genome sequence databases. Overall, their description and analysis lag far behind those of other microbes such as bacteria. At the same time, classification of species via high throughput sequencing without prior purification is increasingly becoming the norm for pathogen detection, microbiome studies, and environmental monitoring. However, standardized procedures for characterizing unknown fungi from complex sequencing data have not yet been established. Results: We compared different metagenomics sequencing and analysis strategies for the identification of fungal species. Using two fungal mock communities of 44 phylogenetically diverse species, we compared species classification and community composition analysis pipelines using shotgun metagenomics and amplicon sequencing data generated from both short and long read sequencing technologies. We show that regardless of the sequencing methodology used, the highest accuracy of species identification was achieved by sequence alignment against a fungi-specific database. During the assessment of classification algorithms, we found that applying cut-offs to the query coverage of each read or contig significantly improved the classification accuracy and community composition analysis without significant data loss. Conclusion: Overall, our study expands the toolkit for identifying fungi by improving sequence-based fungal classification, and provides a practical guide for the design of metagenomics analyses.
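A query-coverage cut-off of the kind described can be sketched as follows. This is an assumption-laden illustration, not the authors' pipeline: query coverage is taken to be the aligned portion of the query divided by the query length, and the `Hit` record, the 0.8 threshold and the example hits are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    query_id: str
    query_length: int    # length of the read or contig
    aligned_length: int  # number of query bases covered by the alignment
    species: str         # label of the best-matching database entry


def query_coverage(hit):
    """Fraction of the query sequence covered by the alignment."""
    return hit.aligned_length / hit.query_length


def filter_by_query_coverage(hits, min_coverage):
    """Keep only hits whose query coverage reaches the cut-off."""
    return [h for h in hits if query_coverage(h) >= min_coverage]


# Made-up hits; the 0.8 cut-off is an arbitrary example value.
hits = [
    Hit("read1", 1000, 950, "Candida albicans"),
    Hit("read2", 1200, 300, "Aspergillus niger"),
    Hit("read3", 800, 700, "Cryptococcus neoformans"),
]
kept = filter_by_query_coverage(hits, min_coverage=0.8)
print([h.query_id for h in kept])
```

The short local match of `read2` is discarded, which is the intended effect: a read that aligns over only a small fraction of its length is weak evidence for a species call.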


2018 ◽  
Author(s):  
Alfredo Iacoangeli ◽  
Ahmad Al Khleifat ◽  
William Sproviero ◽  
Aleksey Shatunov ◽  
Ashley R Jones ◽  
...  

Amyotrophic lateral sclerosis (ALS, MND) is a neurodegenerative disease of upper and lower motor neurons resulting in death from neuromuscular respiratory failure, typically within two years of first symptoms. Genetic factors are an important cause of ALS, with variants in more than 25 genes having strong evidence, and weaker evidence available for variants in more than 120 genes. With the increasing availability of next-generation sequencing data, non-specialists, including health care professionals and patients, are obtaining their genomic information without a corresponding ability to analyse and interpret it. Furthermore, the relevance of novel or existing variants in ALS genes is not always apparent. Here we present ALSgeneScanner, a tool that is easy to install and use and that provides an automatic, detailed, annotated report on a list of ALS genes from whole genome sequence data in a few hours, and from whole exome sequence data in about one hour, on a readily available mid-range computer. This will be of value to non-specialists and aid in the interpretation of the relevance of novel and existing variants identified in DNA sequencing data.


Author(s):  
Ming Cao ◽  
Qinke Peng ◽  
Ze-Gang Wei ◽  
Fei Liu ◽  
Yi-Fan Hou

The development of high-throughput technologies has produced increasing amounts of sequence data and an increasing need for efficient clustering algorithms that can process massive volumes of sequencing data for downstream analysis. Heuristic clustering methods are widely applied for sequence clustering because of their low computational complexity. Although numerous heuristic clustering methods have been developed, they suffer from two limitations: overestimation of inferred clusters and low clustering sensitivity. To address these issues, we present a new sequence clustering method (edClust) based on Edlib, a C/C++ library for fast, exact semi-global sequence alignment, to group similar sequences. edClust was tested on three large-scale sequence databases and compared with several classic heuristic clustering methods, such as UCLUST, CD-HIT, and VSEARCH. Evaluations based on the metrics of cluster number and seed sensitivity (SS) demonstrate that edClust produces fewer clusters than the other methods and that its SS is higher. The source code of edClust is available from https://github.com/zhang134/EdClust.git under the GNU GPL license.
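To make the idea of heuristic, seed-based clustering concrete, here is a generic greedy sketch. It is not edClust's algorithm and does not use Edlib: as a stand-in it computes plain global (Levenshtein) edit distance in pure Python rather than semi-global alignment, and the function names, the distance threshold and the sequences are hypothetical.

```python
def edit_distance(a, b):
    """Global Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion from a
                            curr[j - 1] + 1,      # insertion into a
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def greedy_cluster(seqs, max_distance):
    """Assign each sequence to the first seed within max_distance.

    Sequences are processed longest first; a sequence that matches no
    existing seed becomes the seed of a new cluster.
    """
    clusters = []  # list of (seed, members) pairs
    for s in sorted(seqs, key=len, reverse=True):
        for seed, members in clusters:
            if edit_distance(s, seed) <= max_distance:
                members.append(s)
                break
        else:
            clusters.append((s, [s]))
    return clusters


# Made-up sequences; max_distance=1 is an arbitrary example threshold.
seqs = ["ACGTACGT", "ACGTACGA", "TTTTGGGG", "TTTTGGGC", "ACGTACG"]
clusters = greedy_cluster(seqs, max_distance=1)
for seed, members in clusters:
    print(seed, members)
```

Because each sequence is compared only against seeds, the outcome depends on processing order and on the threshold, which is one way the overestimation of cluster numbers mentioned above can arise in greedy schemes.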


2019 ◽  
Vol 96 (2) ◽  
pp. 106-109
Author(s):  
Jayshree Dave ◽  
John Paul ◽  
Thomas Joshua Pasvol ◽  
Andy Williams ◽  
Fiona Warburton ◽  
...  

Objective: We aimed to characterise gonorrhoea transmission patterns in a diverse urban population by linking genomic, epidemiological and antimicrobial susceptibility data. Methods: Neisseria gonorrhoeae isolates from patients attending sexual health clinics at Barts Health NHS Trust, London, UK, during an 11-month period underwent whole-genome sequencing and antimicrobial susceptibility testing. We combined laboratory and patient data to investigate the transmission network structure. Results: One hundred and fifty-eight isolates from 158 patients were available with associated descriptive data. One hundred and twenty-nine (82%) patients identified as male and 25 (16%) as female; four (3%) records lacked gender information. Self-described ethnicities were: 51 (32%) English/Welsh/Scottish; 33 (21%) white, other; 23 (15%) black British/black African/black, other; 12 (8%) Caribbean; 9 (6%) South Asian; 6 (4%) mixed ethnicity; and 10 (6%) other; data were missing for 14 (9%). Self-reported sexual orientations were 82 (52%) men who have sex with men (MSM); 49 (31%) heterosexual; 2 (1%) bisexual; data were missing for 25 individuals. Twenty-two (14%) patients were HIV positive. Whole-genome sequence data were generated for 151 isolates, which linked 75 (50%) patients to at least one other case. Using sequencing data, we found no evidence of transmission networks related to specific ethnic groups (p=0.64) or of HIV serosorting (p=0.35). Of 82 MSM/bisexual patients with sequencing data, 45 (55%) belonged to clusters of ≥2 cases, compared with 16/44 (36%) heterosexuals with sequencing data (p=0.06). Conclusion: We demonstrate links between 50% of patients in transmission networks using a relatively small sample in a large cosmopolitan city. We found no evidence of HIV serosorting. Our results do not support assortative selectivity as an explanation for differences in gonorrhoea incidence between ethnic groups.


PeerJ ◽  
2018 ◽  
Vol 6 ◽  
pp. e5895 ◽  
Author(s):  
Thomas Andreas Kohl ◽  
Christian Utpatel ◽  
Viola Schleusener ◽  
Maria Rosaria De Filippo ◽  
Patrick Beckert ◽  
...  

Analyzing whole-genome sequencing data of Mycobacterium tuberculosis complex (MTBC) isolates in a standardized workflow enables both comprehensive antibiotic resistance profiling and outbreak surveillance at the highest resolution, up to the identification of recent transmission chains. Here, we present MTBseq, a bioinformatics pipeline for next-generation genome sequence data analysis of MTBC isolates. Employing a reference-mapping-based workflow, MTBseq reports detected variant positions annotated with known associations to antibiotic resistance and performs a lineage classification based on phylogenetic single nucleotide polymorphisms (SNPs). When comparing multiple datasets, MTBseq provides a joint list of variants and a FASTA alignment of SNP positions for use in phylogenomic analysis, and identifies groups of related isolates. The pipeline is customizable, expandable and can be used on a desktop computer or laptop without any internet connection, ensuring mobile usage and data security. MTBseq and accompanying documentation are available from https://github.com/ngs-fzb/MTBseq_source.


2021 ◽  
Author(s):  
Giada Ferrari ◽  
Lane M Atmore ◽  
Sissel Jentoft ◽  
Kjetill S Jakobsen ◽  
Daniel Makowiecki ◽  
...  

Genomic assignment tests can provide important diagnostic biological characteristics, such as population of origin or ecotype. In ancient DNA research, such characters can provide further information on population continuity, evolution, climate change, species migration, or trade, depending on archaeological context. Yet, assignment tests often rely on moderate- to high-coverage sequence data, which can be difficult to obtain for many ancient specimens and in ecological studies, which often use sequencing techniques such as ddRAD to bypass the need for costly whole-genome sequencing. We have developed a novel approach that efficiently assigns biologically relevant information (such as population identity or structural variants) in extremely low-coverage sequence data. First, we generate databases from existing reference data using a subset of diagnostic Single Nucleotide Polymorphisms (SNPs) associated with a biological characteristic. Low-coverage alignment files from ancient specimens are subsequently compared to these databases to ascertain allelic state, yielding a joint probability for each association. To assess the efficacy of this approach, we assigned inversion haplotypes and population identity in several species including Heliconius butterflies, Atlantic herring, and Atlantic cod. We used both modern and ancient specimens, including the first whole-genome sequence data recovered from ancient herring bones. The method accurately assigns biological characteristics, including population membership, using extremely low-coverage data (e.g. 0.0001x) based on genome-wide SNPs. This approach will therefore increase the number of ancient samples in ecological and bioarchaeological research for which relevant biological information can be obtained.
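The abstract does not spell out how the joint probability is computed. One simple way to picture the idea, offered purely as an assumption, is to multiply reference allele frequencies over the diagnostic SNPs that happen to be covered by a read and then normalise across candidate populations. The sketch below assumes independent SNPs, one observed allele per covered site, equal prior weight per population and a small floor for zero frequencies; the names `assign_population`, `north`, `south` and all frequencies are hypothetical.

```python
import math


def assign_population(observed_alleles, allele_freqs, floor=1e-6):
    """Normalised joint probability of the observed alleles per population.

    observed_alleles: dict mapping SNP id -> allele seen in a read
    allele_freqs: dict mapping population -> SNP id -> allele -> frequency
    floor: lower bound on a frequency, to avoid log(0) (an assumption)
    """
    log_lik = {}
    for pop, freqs in allele_freqs.items():
        total = 0.0
        for snp, allele in observed_alleles.items():
            p = freqs[snp].get(allele, 0.0)
            total += math.log(max(p, floor))
        log_lik[pop] = total
    # Normalise in log space for numerical stability.
    peak = max(log_lik.values())
    weights = {pop: math.exp(v - peak) for pop, v in log_lik.items()}
    norm = sum(weights.values())
    return {pop: w / norm for pop, w in weights.items()}


# Made-up reference database of two diagnostic SNPs in two populations.
allele_freqs = {
    "north": {"snp1": {"A": 0.9, "G": 0.1}, "snp2": {"C": 0.8, "T": 0.2}},
    "south": {"snp1": {"A": 0.2, "G": 0.8}, "snp2": {"C": 0.3, "T": 0.7}},
}
observed = {"snp1": "A", "snp2": "C"}  # alleles seen in the sparse reads
probs = assign_population(observed, allele_freqs)
print({pop: round(p, 3) for pop, p in probs.items()})
```

With only two covered sites the toy specimen already leans strongly towards one population, which hints at why a handful of diagnostic SNPs can be informative even at very low coverage.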


Genetics ◽  
2019 ◽  
Vol 212 (3) ◽  
pp. 577-586 ◽  
Author(s):  
V. Kartik Chundru ◽  
Riccardo E. Marioni ◽  
James G. D. Prendergast ◽  
Costanza L. Vallerga ◽  
Tian Lin ◽  
...  

Genetic variants disrupting DNA methylation at CpG dinucleotides (CpG-SNP) provide a set of known causal variants to serve as models to test fine-mapping methodology. We use 1716 CpG-SNPs to test three fine-mapping approaches (Bayesian imputation-based association mapping, Bayesian sparse linear mixed model, and the J-test), assessing the impact of imputation errors and the choice of reference panel by using both whole-genome sequence (WGS) and genotype array data on the same individuals (n = 1166). The choice of imputation reference panel had a strong effect on imputation accuracy, with the 1000 Genomes Project Phase 3 (1000G) reference panel (n = 2504 from 26 populations) giving a mean nonreference discordance rate between imputed and sequenced genotypes of 3.2% compared to 1.6% when using the Haplotype Reference Consortium (HRC) reference panel (n = 32,470 Europeans). These imputation errors had an impact on whether the CpG-SNP was included in the 95% credible set, with a difference of ∼23% and ∼7% between the WGS and the 1000G and HRC imputed datasets, respectively. All of the fine-mapping methods failed to reach the expected 95% coverage of the CpG-SNP. This is attributed to secondary cis genetic effects that cannot be statistically separated from the CpG-SNP, and to a masking mechanism where the effect of the methylation-disrupting allele at the CpG-SNP is hidden by the effect of a nearby SNP that is in strong linkage disequilibrium with the CpG-SNP. The reduced accuracy in fine-mapping a known causal variant in a low-level biological trait with imputed genetic data has implications for the study of higher-order complex traits and disease.
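The nonreference discordance rate used above differs from plain discordance in that matching homozygous-reference calls are left out of the denominator, so that common reference genotypes do not inflate the apparent accuracy. A minimal sketch under the commonly used definition follows; the abstract does not give the formula, so the coding of genotypes as alternate-allele counts (0, 1, 2), the function name and the toy data are assumptions.

```python
def nonreference_discordance(imputed, sequenced):
    """Discordant calls / (discordant calls + concordant non-reference calls).

    Genotypes are assumed to be alternate-allele counts (0, 1, 2).
    Pairs where both calls are homozygous reference (0, 0) are ignored,
    as are pairs with a missing call (None).
    """
    discordant = 0
    concordant_nonref = 0
    for imp, seq in zip(imputed, sequenced):
        if imp is None or seq is None:
            continue
        if imp != seq:
            discordant += 1
        elif imp != 0:
            concordant_nonref += 1
    denominator = discordant + concordant_nonref
    if denominator == 0:
        raise ValueError("no non-reference or discordant calls to compare")
    return discordant / denominator


# Made-up calls: 2 discordant, 3 concordant non-reference, 3 matching (0, 0).
imputed = [0, 0, 1, 2, 1, 0, 2, 0]
sequenced = [0, 0, 1, 2, 2, 0, 2, 1]
nrd = nonreference_discordance(imputed, sequenced)
print(f"nonreference discordance: {nrd:.2f}")
```

On the same toy data the plain discordance would be 2/8, noticeably lower than the nonreference figure of 2/5, which illustrates why the stricter measure is preferred when most genotypes are homozygous reference.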

