A beginner's guide to low-coverage whole genome sequencing for population genomics

Author(s):  
Runyang Nicolas Lou ◽  
Arne Jacobs ◽  
Aryn Wilder ◽  
Nina Overgaard Therkildsen

Low-coverage whole genome sequencing (lcWGS) has emerged as a powerful and cost-effective approach for population genomic studies in both model and non-model species. However, with read depths too low to confidently call individual genotypes, lcWGS requires specialized analysis tools that explicitly account for genotype uncertainty. A growing number of such tools have become available, but it can be difficult to get an overview of what types of analyses can be performed reliably with lcWGS data, and how the distribution of sequencing effort between the number of samples analyzed and per-sample sequencing depths affects inference accuracy. In this introductory guide to lcWGS, we first illustrate how the per-sample cost for lcWGS is now comparable to RAD-seq and Pool-seq in many systems. We then provide an overview of software packages that explicitly account for genotype uncertainty in different types of population genomic inference. Next, we use both simulated and empirical data to assess the accuracy of allele frequency and genetic diversity estimation, detection of population structure, and selection scans under different sequencing strategies. Our results show that spreading a given amount of sequencing effort across more samples with lower depth per sample consistently improves the accuracy of most types of inference, with a few notable exceptions. Finally, we assess the potential for using imputation to bolster inference from lcWGS data in non-model species, and discuss current limitations and future perspectives for lcWGS-based population genomics research. With this overview, we hope to make lcWGS more approachable and stimulate its broader adoption.
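The core idea behind the genotype-uncertainty-aware tools surveyed here can be illustrated with a minimal sketch: estimating a site's allele frequency by maximum likelihood directly from per-individual genotype likelihoods via an EM algorithm, without ever calling hard genotypes. This is a toy illustration under a Hardy-Weinberg prior, not the authors' pipeline; dedicated tools such as ANGSD implement production-grade versions of this approach.

```python
import numpy as np

def estimate_allele_freq(gl, n_iter=100, tol=1e-8):
    """ML allele frequency from genotype likelihoods via EM.

    gl: (N, 3) array of likelihoods P(reads | genotype g) for g = 0, 1, 2
        copies of the alternate allele (linear scale; need not be normalized).
    """
    f = 0.2  # arbitrary interior starting frequency
    for _ in range(n_iter):
        # E-step: genotype posteriors under a Hardy-Weinberg prior at frequency f
        prior = np.array([(1 - f) ** 2, 2 * f * (1 - f), f ** 2])
        post = gl * prior
        post /= post.sum(axis=1, keepdims=True)
        # M-step: update f from the expected alternate-allele count
        f_new = (post @ np.array([0.0, 1.0, 2.0])).sum() / (2 * len(gl))
        if abs(f_new - f) < tol:
            return f_new
        f = f_new
    return f
```

Because each individual contributes its full genotype posterior rather than a single hard call, information from low-depth samples is weighted by how confidently their reads distinguish the three genotypes.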

Author(s):  
Runyang Nicolas Lou ◽  
Nina Overgaard Therkildsen

Over the past few decades, the rapid democratization of high-throughput sequencing and the growing emphasis on open science practices have resulted in an explosion in the amount of publicly available sequencing data. This opens new opportunities for combining datasets to achieve unprecedented sample sizes, spatial coverage, or temporal replication in population genomic studies. However, a common concern is that non-biological differences between datasets may generate batch effects that can confound real biological patterns. Despite general awareness about the risk of batch effects, few studies have examined empirically how they manifest in real datasets, and it remains unclear what factors cause batch effects and how to best detect and mitigate their impact bioinformatically. In this paper, we compare two batches of low-coverage whole genome sequencing (lcWGS) data generated from the same populations of Atlantic cod (Gadus morhua). First, we show that with a “batch-effect-naive” bioinformatic pipeline, batch effects severely biased our genetic diversity estimates, population structure inference, and selection scan. We then demonstrate that these batch effects resulted from multiple technical differences between our datasets, including the sequencing instrument model/chemistry, read type, read length, DNA degradation level, and sequencing depth, but their impact can be detected and substantially mitigated with simple bioinformatic approaches. We conclude that combining datasets remains a powerful approach as long as batch effects are explicitly accounted for. We focus on lcWGS data in this paper, which may be particularly vulnerable to certain causes of batch effects, but many of our conclusions also apply to other sequencing strategies.
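A simple first screen for batch effects of the kind described above, assuming per-sample genotype dosages are available, is to check whether leading principal components align with batch membership rather than biology. The function name and interface below are illustrative, not from the paper's pipeline:

```python
import numpy as np

def pc_batch_correlation(dosages, batch):
    """Correlate the top two PCs of a genotype matrix with batch labels.

    dosages: (N, L) matrix of genotype dosages (e.g. posterior mean genotypes).
    batch:   length-N array of 0/1 batch labels.
    For samples drawn from the same populations, a strong correlation between
    a leading PC and batch flags a technical rather than biological axis.
    """
    X = dosages - dosages.mean(axis=0)      # center each site
    u, s, vt = np.linalg.svd(X, full_matrices=False)
    pcs = u[:, :2] * s[:2]                  # sample coordinates on PC1, PC2
    b = batch - batch.mean()
    r = [np.dot(pc - pc.mean(), b)
         / (np.linalg.norm(pc - pc.mean()) * np.linalg.norm(b))
         for pc in pcs.T]
    return r                                # correlations of PC1, PC2 with batch
```

A PC that separates batches of biologically equivalent samples is a cue to revisit the bioinformatic pipeline (read trimming, depth downsampling, site filtering) before interpreting population structure.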


2020 ◽  
Vol 36 (12) ◽  
pp. 3888-3889
Author(s):  
Alexandre Eeckhoutte ◽  
Alexandre Houy ◽  
Elodie Manié ◽  
Manon Reverdy ◽  
Ivan Bièche ◽  
...  

We introduce shallowHRD, a software tool to evaluate tumor homologous recombination deficiency (HRD) from whole genome sequencing (WGS) at low coverage (shallow WGS or sWGS; ∼1X coverage). The tool, based on mining copy number alteration profiles, implements a fast and straightforward procedure that shows 87.5% sensitivity and 90.5% specificity for HRD detection. shallowHRD could be instrumental in predicting response to poly(ADP-ribose) polymerase inhibitors, to which HRD tumors are selectively sensitive. shallowHRD displays efficiency comparable to most state-of-the-art approaches, is cost-effective, generates low-storage outputs and is also suitable for formalin-fixed paraffin-embedded tissues.

Availability and implementation: The shallowHRD R script and documentation are available at https://github.com/aeeckhou/shallowHRD.

Supplementary information: Supplementary data are available at Bioinformatics online.
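The copy-number mining that shallowHRD builds on starts from a read-count profile: at ~1X coverage, per-base depth is uninformative, but binning reads into large fixed windows yields counts high enough to reveal large copy number alterations. The sketch below shows only this first binning step under simplified assumptions (single chromosome, no GC or mappability correction); it is not the tool's implementation:

```python
import numpy as np

def bin_log_ratios(read_starts, chrom_len, bin_size=500_000):
    """Bin mapped read starts into fixed windows and return log2 ratios
    of each bin's count against the median bin count.

    read_starts: 1-D integer array of mapped read start positions
                 on one chromosome.
    """
    n_bins = int(np.ceil(chrom_len / bin_size))
    counts = np.bincount(read_starts // bin_size,
                         minlength=n_bins).astype(float)
    counts += 0.5                       # pseudocount to avoid log of zero
    return np.log2(counts / np.median(counts))
```

Downstream, a tool like shallowHRD would segment such a profile and count large-scale genomic alterations as a signature of HRD; the binned log-ratios are merely the raw material for that step.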

