A spatial genomic approach identifies time lags and historic barriers to gene flow in a rapidly fragmenting Appalachian landscape

AbstractThe resolution offered by genomic data sets coupled with recently developed spatially informed analyses are allowing researchers to quantify population structure at increasingly fine temporal and spatial scales. However, uncertainties regarding data set size and quality thresholds and the time scale at which barriers to gene flow become detectable have limited both empirical research and conservation measures. Here, we used restriction site associated DNA sequencing to generate a large SNP data set for the copperhead snake (Agkistrodon contortrix) and address the population genomic impacts of recent and widespread landscape modification across an approximately 1000 km2 region of eastern Kentucky. Nonspatial population-based assignment and clustering methods supported little to no population structure. However, using individual-based spatial autocorrelation approaches we found evidence for genetic structuring which closely follows the path of a historic highway which experienced high traffic volumes from ca. 1920 to 1970. We found no similar spatial genomic signatures associated with more recently constructed highways or surface mining activity, though a time lag effect may be responsible for the lack of any emergent spatial genetic patterns. Subsampling of our SNP data set suggested that similar results could be obtained with as few as 250 SNPs, and thresholds for missing data exhibited limited impacts on the spatial patterns we detected outside of very strict or permissive extremes. Our findings highlight the importance of temporal factors in landscape genetics approaches, and suggest the potential advantages of large genomic data sets and fine-scale, spatially-informed approaches for quantifying subtle genetic patterns in temporally complex landscapes.

Download Full-text

Bayesian Inference of Species Trees using Diffusion Models

Systematic Biology ◽

10.1093/sysbio/syaa051 ◽

2020 ◽

Vol 70 (1) ◽

pp. 145-161 ◽

Cited By ~ 1

Author(s):

Marnus Stoltz ◽

Boris Baeumer ◽

Remco Bouckaert ◽

Colin Fox ◽

Gordon Hiscott ◽

...

Keyword(s):

Bayesian Inference ◽

Numerical Algorithms ◽

Diffusion Models ◽

Model Parameters ◽

Data Sets ◽

Species Trees ◽

Computationally Efficient ◽

Data Set ◽

Snp Data ◽

Binary Markers

Abstract We describe a new and computationally efficient Bayesian methodology for inferring species trees and demographics from unlinked binary markers. Likelihood calculations are carried out using diffusion models of allele frequency dynamics combined with novel numerical algorithms. The diffusion approach allows for analysis of data sets containing hundreds or thousands of individuals. The method, which we call Snapper, has been implemented as part of the BEAST2 package. We conducted simulation experiments to assess numerical error, computational requirements, and accuracy recovering known model parameters. A reanalysis of soybean SNP data demonstrates that the models implemented in Snapp and Snapper can be difficult to distinguish in practice, a characteristic which we tested with further simulations. We demonstrate the scale of analysis possible using a SNP data set sampled from 399 fresh water turtles in 41 populations. [Bayesian inference; diffusion models; multi-species coalescent; SNP data; species trees; spectral methods.]

Download Full-text

Multi-scale supervised clustering-based feature selection for tumor classification and identification of biomarkers and targets on genomic data

BMC Genomics ◽

10.1186/s12864-020-07038-3 ◽

2020 ◽

Vol 21 (1) ◽

Author(s):

Da Xu ◽

Jialin Zhang ◽

Hanxiao Xu ◽

Yusen Zhang ◽

Wei Chen ◽

...

Keyword(s):

Gene Expression ◽

Feature Selection ◽

Therapeutic Targets ◽

Genomic Data ◽

Data Sets ◽

Selection Algorithm ◽

Feature Selection Algorithm ◽

Model Learning ◽

Data Set ◽

Multi Scale

Abstract Background The small number of samples and the curse of dimensionality hamper the better application of deep learning techniques for disease classification. Additionally, the performance of clustering-based feature selection algorithms is still far from being satisfactory due to their limitation in using unsupervised learning methods. To enhance interpretability and overcome this problem, we developed a novel feature selection algorithm. In the meantime, complex genomic data brought great challenges for the identification of biomarkers and therapeutic targets. The current some feature selection methods have the problem of low sensitivity and specificity in this field. Results In this article, we designed a multi-scale clustering-based feature selection algorithm named MCBFS which simultaneously performs feature selection and model learning for genomic data analysis. The experimental results demonstrated that MCBFS is robust and effective by comparing it with seven benchmark and six state-of-the-art supervised methods on eight data sets. The visualization results and the statistical test showed that MCBFS can capture the informative genes and improve the interpretability and visualization of tumor gene expression and single-cell sequencing data. Additionally, we developed a general framework named McbfsNW using gene expression data and protein interaction data to identify robust biomarkers and therapeutic targets for diagnosis and therapy of diseases. The framework incorporates the MCBFS algorithm, network recognition ensemble algorithm and feature selection wrapper. McbfsNW has been applied to the lung adenocarcinoma (LUAD) data sets. The preliminary results demonstrated that higher prediction results can be attained by identified biomarkers on the independent LUAD data set, and we also structured a drug-target network which may be good for LUAD therapy. Conclusions The proposed novel feature selection method is robust and effective for gene selection, classification, and visualization. The framework McbfsNW is practical and helpful for the identification of biomarkers and targets on genomic data. It is believed that the same methods and principles are extensible and applicable to other different kinds of data sets.

Download Full-text

Chemical biology-whole genome engineering datasets predict new antibacterial combinations

Microbial Genomics ◽

10.1099/mgen.0.000718 ◽

2021 ◽

Vol 7 (12) ◽

Author(s):

Arthur K. Turner ◽

Muhammad Yasir ◽

Sarah Bastkowski ◽

Andrea Telatin ◽

Andrew Page ◽

...

Keyword(s):

Chemical Biology ◽

Genome Engineering ◽

Genomic Data ◽

Inducible Promoter ◽

Data Sets ◽

Whole Genome ◽

Data Set ◽

Content Type ◽

Chemical Genomic ◽

Increased Susceptibility

Trimethoprim and sulfamethoxazole are used commonly together as cotrimoxazole for the treatment of urinary tract and other infections. The evolution of resistance to these and other antibacterials threatens therapeutic options for clinicians. We generated and analysed a chemical-biology-whole-genome data set to predict new targets for antibacterial combinations with trimethoprim and sulfamethoxazole. For this we used a large transposon mutant library in Escherichia coli BW25113 where an outward-transcribing inducible promoter was engineered into one end of the transposon. This approach allows regulated expression of adjacent genes in addition to gene inactivation at transposon insertion sites, a methodology that has been called TraDIS-Xpress. These chemical genomic data sets identified mechanisms for both reduced and increased susceptibility to trimethoprim and sulfamethoxazole. The data identified that over-expression of FolA reduced trimethoprim susceptibility, a known mechanism for reduced susceptibility. In addition, transposon insertions into the genes tdk, deoR, ybbC, hha, ldcA, wbbK and waaS increased susceptibility to trimethoprim and likewise for rsmH, fadR, ddlB, nlpI and prc with sulfamethoxazole, while insertions in ispD, uspC, minC, minD, yebK, truD and umpG increased susceptibility to both these antibiotics. Two of these genes’ products, Tdk and IspD, are inhibited by AZT and fosmidomycin respectively, antibiotics that are known to synergise with trimethoprim. Thus, the data identified two known targets and several new target candidates for the development of co-drugs that synergise with trimethoprim, sulfamethoxazole or cotrimoxazole. We demonstrate that the TraDIS-Xpress technology can be used to generate information-rich chemical-genomic data sets that can be used for antibacterial development.

Download Full-text

Scaling probabilistic models of genetic variation to millions of humans

10.1101/013227 ◽

2014 ◽

Cited By ~ 1

Author(s):

Prem Gopalan ◽

Wei Hao ◽

David M. Blei ◽

John D. Storey

Keyword(s):

Population Genetics ◽

Population Structure ◽

Genetic Variation ◽

Probabilistic Models ◽

Simulated Data ◽

Genomic Data ◽

Human Populations ◽

Data Sets ◽

Complex Population ◽

Simulated Data Sets

One of the major goals of population genetics is to quantitatively understand variation of genetic polymorphisms among individuals. To this end, researchers have developed sophisticated statistical methods to capture the complex population structure that underlies observed genotypes in humans, and such methods have been effective for analyzing modestly sized genomic data sets. However, the number of genotyped humans has grown significantly in recent years, and it is accelerating. In aggregate about 1M individuals have been genotyped to date. Analyzing these data will bring us closer to a nearly complete picture of human genetic variation; but existing methods for population genetics analysis do not scale to data of this size. To solve this problem we developed TeraStructure. TeraStructure is a new algorithm to fit Bayesian models of genetic variation in human populations on tera-sample-sized data sets (1012observed genotypes, e.g., 1M individuals at 1M SNPs). It is a principled approach to Bayesian inference that iterates between subsampling locations of the genome and updating an estimate of the latent population structure of the individuals. On data sets of up to 2K individuals, TeraStructure matches the existing state of the art in terms of both speed and accuracy. On simulated data sets of up to 10K individuals, TeraStructure is twice as fast as existing methods and has higher accuracy in recovering the latent population structure. On genomic data simulated at the tera-sample-size scales, TeraStructure continues to be accurate and is the only method that can complete its analysis.

Download Full-text

CartograTree: Enabling Landscape Genomics for Forest Trees

10.7287/peerj.preprints.2345v1 ◽

2016 ◽

Author(s):

Nic Herndon ◽

Emily S Grau ◽

Iman Batra ◽

Steven A Demurjian Jr. ◽

Hans A Vasquez-Gross ◽

...

Keyword(s):

Population Structure ◽

Economic Value ◽

Genomic Data ◽

Environmental Data ◽

Geospatial Data ◽

Data Sets ◽

Forest Trees ◽

Analytic Framework ◽

Landscape Genomics ◽

The World

Forest trees cover just over 30% of the earth's surface and are studied by researchers around the world for both their conservation and economic value. With the onset of high throughput technologies, tremendous phenotypic and genomic data sets have been generated for hundreds of species. These long-lived and immobile individuals serve as ideal models to assess population structure and adaptation to environment. Despite the availability of comprehensive data, researchers are challenged to integrate genotype, phenotype, and environment in one place. Towards this goal, CartograTree was designed and implemented as a repository and analytic framework for genomic, phenotypic, and environmental data for forest trees. One of key components, the integration of geospatial data, allows the display of environmental layers and acquisition of environmental metrics relative to the positions of georeferenced individuals.

Download Full-text

Fully automated near-surface analysis by surface-consistent refraction method

Geophysics ◽

10.1190/geo2016-0018.1 ◽

2016 ◽

Vol 81 (4) ◽

pp. U39-U49 ◽

Cited By ~ 18

Author(s):

Daniele Colombo ◽

Federico Miorelli ◽

Ernesto Sandoval ◽

Kevin Erickson

Keyword(s):

Surface Analysis ◽

Seismic Data ◽

Linear Equations ◽

Time Shift ◽

Data Sets ◽

Time Lags ◽

Data Set ◽

Near Surface ◽

Seismic Acquisition ◽

Key Aspects

Industry practices for near-surface analysis indicate difficulties in coping with the increased number of channels in seismic acquisition systems, and new approaches are needed to fully exploit the resolution embedded in modern seismic data sets. To achieve this goal, we have developed a novel surface-consistent refraction analysis method for low-relief geology to automatically derive near-surface corrections for seismic data processing. The method uses concepts from surface-consistent analysis applied to refracted arrivals. The key aspects of the method consist of the use of common midpoint (CMP)-offset-azimuth binning, evaluation of mean traveltime and standard deviation for each bin, rejection of anomalous first-break (FB) picks, derivation of CMP-based traveltime-offset functions, conversion to velocity-depth functions, evaluation of long-wavelength statics, and calculation of surface-consistent residual statics through waveform crosscorrelation. Residual time lags are evaluated in multiple CMP-offset-azimuth bins by crosscorrelating a pilot trace with all the other traces in the gather in which the correlation window is centered at the refracted arrival. The residuals are then used to build a system of linear equations that is simultaneously inverted for surface-consistent shot and receiver time shift corrections plus a possible subsurface residual term. All the steps are completely automated and require a fraction of the time needed for conventional near-surface analysis. The developed methodology was successfully performed on a complex 3D land data set from Central Saudi Arabia where it was benchmarked against a conventional tomographic work flow. The results indicate that the new surface-consistent refraction statics method enhances seismic imaging especially in portions of the survey dominated by noise.

Download Full-text

Minor allele frequency thresholds strongly affect population structure inference with genomic data sets

Molecular Ecology Resources ◽

10.1111/1755-0998.12995 ◽

2019 ◽

Vol 19 (3) ◽

pp. 639-647 ◽

Cited By ~ 77

Author(s):

Ethan Linck ◽

C. J. Battey

Keyword(s):

Population Structure ◽

Allele Frequency ◽

Minor Allele Frequency ◽

Genomic Data ◽

Minor Allele ◽

Data Sets

Download Full-text

CartograTree: Enabling landscape genomics for forest trees

10.7287/peerj.preprints.2345v2 ◽

2016 ◽

Author(s):

Nic Herndon ◽

Emily S Grau ◽

Iman Batra ◽

Steven A Demurjian Jr. ◽

Hans A Vasquez-Gross ◽

...

Keyword(s):

Population Structure ◽

Economic Value ◽

Genomic Data ◽

Environmental Data ◽

Geospatial Data ◽

Data Sets ◽

Forest Trees ◽

Analytic Framework ◽

Landscape Genomics ◽

The World

Download Full-text

Population Structure in a Comprehensive Genomic Data Set on Human Microsatellite Variation

G3 Genes|Genome|Genetics ◽

10.1534/g3.113.005728 ◽

2013 ◽

Vol 3 (5) ◽

pp. 891-907 ◽

Cited By ~ 65

Author(s):

Trevor J. Pemberton ◽

Michael DeGiorgio ◽

Noah A. Rosenberg

Keyword(s):

Population Structure ◽

Genomic Data ◽

Microsatellite Variation ◽

Data Set

Download Full-text

Joint analysis of remotely sensed soil moisture and water storage variations from satellite gravimetry

10.5194/egusphere-egu21-2188 ◽

2021 ◽

Author(s):

Daniel Blank ◽

Annette Eicker ◽

Laura Jensen ◽

Andreas Güntner

Keyword(s):

Soil Moisture ◽

Water Storage ◽

Microwave Remote Sensing ◽

Time Shift ◽

Joint Analysis ◽

Data Sets ◽

Time Lags ◽

Data Set ◽

Satellite Gravimetry ◽

Storage Change

Information on water storage changes in the soil can be obtained on a global scale from different types of satellite observations. While active or passive microwave remote sensing is limited to investigating the upper few centimeters of the soil, satellite gravimetry is sensitive to variations in the full column of terrestrial water storage (TWS) but cannot distinguish between storage variations occurring in different soil depths. Jointly analyzing both data types promises interesting insights into the underlying hydrological dynamics and may enable a better process understanding of water storage change in the subsurface.In this study, we aim at investigating the global relationship of (1) several satellite soil moisture (SM) products and (2) non-standard daily TWS data from the GRACE and GRACE-FO satellite gravimetry missions on different time scales. We decompose the data sets into different temporal frequencies from seasonal to sub-monthly signals and carry out the comparison with respect to spatial patterns and temporal variability. Level-3 (Surface SM up to 5 cm depth) and Level-4 (Root-Zone SM up to 1 m depth) data sets of the SMOS and SMAP missions as well as the ESA CCI data set are used in this investigation. Since a direct comparison of the absolute values is not possible due to the different integration depths of the two data sets (SM and TWS), we will analyze their relationship using Pearson&#8217;s pairwise correlation coefficient. Furthermore, a time-shift analysis is carried out by means of cross-correlation to identify time lags between SM and TWS data sets that indicate differences in the temporal dynamics of SM storage change in varying depth layers.

Download Full-text