Unsupervised cluster analysis of SARS-CoV-2 genomes reflects its geographic progression and identifies distinct genetic subgroups of SARS-CoV-2 virus

Author(s):  
Georg Hahn ◽  
Sanghun Lee ◽  
Scott T. Weiss ◽  
Christoph Lange

Abstract Over 10,000 viral genome sequences of the SARS-CoV-2 virus have been made readily available during the ongoing coronavirus pandemic since the initial genome sequence of the virus was released on the open access Virological website (http://virological.org/) early on January 11. We utilize the published data on the single stranded RNAs of 11,132 SARS-CoV-2 patients in the GISAID (Elbe and Buckland-Merrett, 2017; Shu and McCauley, 2017) database, which contains fully or partially sequenced SARS-CoV-2 samples from laboratories around the world. Among many important research questions which are currently being investigated, one aspect pertains to the genetic characterization/classification of the virus. We analyze data on the nucleotide sequencing of the virus and geographic information of a subset of 7,640 SARS-CoV-2 patients without missing entries that are available in the GISAID database. Instead of modelling the mutation rate, applying phylogenetic tree approaches, etc., we here utilize a model-free clustering approach that compares the viruses at a genome-wide level. We apply principal component analysis to a similarity matrix that compares all pairs of these SARS-CoV-2 nucleotide sequences at all loci simultaneously, using the Jaccard index (Jaccard, 1901; Tan et al., 2005; Prokopenko et al., 2016; Schlauch et al., 2017). Our analysis of the SARS-CoV-2 genome data illustrates the geographic and chronological progression of the virus, starting from the first cases that were observed in China to the current wave of cases in Europe and North America. This is in line with a phylogenetic analysis which we use to contrast our results. We also observe that, based on their sequence data, the SARS-CoV-2 viruses cluster in distinct genetic subgroups. It is the subject of ongoing research to examine whether the genetic subgroup could be related to disease outcome and its potential implications for vaccine development.
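The clustering idea in this abstract (a Jaccard similarity matrix over all pairs of sequences, followed by principal component analysis) can be illustrated with a minimal sketch. The abstract does not say how sequences are encoded, so the binary samples-by-loci matrix used here (1 where a sample differs from a reference) is an assumption, as are the function names `jaccard_matrix` and `top_components` and the double-centering step. This is not the authors' implementation.

```python
import numpy as np

def jaccard_matrix(X):
    """Pairwise Jaccard similarity between the rows of a binary matrix.

    X is samples x loci; a 1 marks a locus where the sample carries a
    variant (hypothetical encoding, not taken from the paper).
    """
    X = np.asarray(X, dtype=bool).astype(int)
    inter = X @ X.T                                   # |A ∩ B| for each pair
    sizes = X.sum(axis=1)
    union = sizes[:, None] + sizes[None, :] - inter   # |A ∪ B| for each pair
    # Assumed convention: two rows with no variants at all count as identical.
    with np.errstate(divide="ignore", invalid="ignore"):
        return np.where(union > 0, inter / union, 1.0)

def top_components(S, k=2):
    """Leading k eigenpairs of a symmetric similarity matrix after double centering."""
    n = S.shape[0]
    C = np.eye(n) - np.ones((n, n)) / n               # centering matrix
    vals, vecs = np.linalg.eigh(C @ S @ C)            # eigh returns ascending order
    order = np.argsort(vals)[::-1][:k]
    return vals[order], vecs[:, order]
```

Plotting the first two columns of the returned eigenvectors against each other gives the kind of scatter plot in which samples with similar variant profiles fall close together.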

GigaScience ◽  
2021 ◽  
Vol 10 (1) ◽  
Author(s):  
Taras K Oleksyk ◽  
Walter W Wolfsberger ◽  
Alexandra M Weber ◽  
Khrystyna Shchubelka ◽  
Olga T Oleksyk ◽  
...  

Abstract Background The main goal of this collaborative effort is to provide genome-wide data for the previously underrepresented population in Eastern Europe, and to provide cross-validation of the data from genome sequences and genotypes of the same individuals acquired by different technologies. We collected 97 genome-grade DNA samples from individuals representing major regions of Ukraine who consented to public data release. BGISEQ-500 sequence data and genotypes by an Illumina GWAS chip were cross-validated on multiple samples and additionally referenced to 1 sample that has been resequenced by Illumina NovaSeq6000 S4 at high coverage. Results The genome data have been searched for genomic variation represented in this population, and a number of variants have been reported: large structural variants, indels, copy number variations, single-nucleotide polymorphisms, and microsatellites. To our knowledge, this study provides the largest to-date survey of genetic variation in Ukraine, creating a public reference resource aiming to provide data for medical research in a large understudied population. Conclusions Our results indicate that the genetic diversity of the Ukrainian population is uniquely shaped by evolutionary and demographic forces and cannot be ignored in future genetic and biomedical studies. These data will contribute a wealth of new information, bringing forth novel, endemic and medically related alleles.


Parasitology ◽  
2009 ◽  
Vol 136 (5) ◽  
pp. 469-485 ◽  
Author(s):  
A. S. TAFT ◽  
J. J. VERMEIRE ◽  
J. BERNIER ◽  
S. R. BIRKELAND ◽  
M. J. CIPRIANO ◽  
...  

SUMMARY Infection of the snail, Biomphalaria glabrata, by the free-swimming miracidial stage of the human blood fluke, Schistosoma mansoni, and its subsequent development to the parasitic sporocyst stage is critical to establishment of viable infections and continued human transmission. We performed a genome-wide expression analysis of the S. mansoni miracidia and developing sporocyst using Long Serial Analysis of Gene Expression (LongSAGE). Five cDNA libraries were constructed from miracidia and in vitro cultured 6- and 20-day-old sporocysts maintained in sporocyst medium (SM) or in SM conditioned by previous cultivation with cells of the B. glabrata embryonic (Bge) cell line. We generated 21 440 SAGE tags and mapped 13 381 to the S. mansoni gene predictions (v4.0e) either by estimating theoretical 3′ UTR lengths or using existing 3′ EST sequence data. Overall, 432 transcripts were found to be differentially expressed amongst all 5 libraries. In total, 172 tags were differentially expressed between miracidia and 6-day conditioned sporocysts and 152 were differentially expressed between miracidia and 6-day unconditioned sporocysts. In addition, 53 and 45 tags, respectively, were differentially expressed in 6-day and 20-day cultured sporocysts, due to the effects of exposure to Bge cell-conditioned medium.


2019 ◽  
Vol 116 (42) ◽  
pp. 21262-21267 ◽  
Author(s):  
Kenji Yano ◽  
Yoichi Morinaka ◽  
Fanmiao Wang ◽  
Peng Huang ◽  
Sayaka Takehara ◽  
...  

Elucidation of the genetic control of rice architecture is crucial due to the global demand for high crop yields. Rice architecture is a complex trait affected by plant height, tillering, and panicle morphology. In this study, principal component analysis (PCA) on 8 typical traits related to plant architecture revealed that the first principal component (PC), PC1, provided the most information on traits that determine rice architecture. A genome-wide association study (GWAS) using PC1 as a dependent variable was used to isolate a gene encoding rice SPINDLY (OsSPY), which activates the gibberellin (GA) signal suppression protein SLR1. The effect of GA signaling on the regulation of rice architecture was confirmed in 9 types of isogenic plant having different levels of GA responsiveness. Further population genetics analysis demonstrated that the functional allele of OsSPY associated with semidwarfism and small panicles was selected in the process of rice breeding. In summary, the use of PCA in GWAS will aid in uncovering genes involved in traits with complex characteristics.
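To make the "PC1 as a dependent variable" step concrete, the sketch below derives principal component scores from a samples-by-traits matrix; the first column of the scores could then serve as the phenotype in an association scan. Standardizing each trait before the decomposition and the function name `principal_components` are assumptions; the abstract does not describe the preprocessing used.

```python
import numpy as np

def principal_components(traits):
    """PCA of a samples x traits matrix via SVD.

    Returns (scores, explained, loadings):
      scores    - sample coordinates on each PC (column 0 is PC1)
      explained - fraction of total variance carried by each PC
      loadings  - rows are the PC directions in trait space
    """
    X = np.asarray(traits, dtype=float)
    # Assumed preprocessing: center and scale each trait to unit variance.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    scores = U * s
    explained = s**2 / np.sum(s**2)
    return scores, explained, Vt
```

Using `scores[:, 0]` as a single composite phenotype replaces several correlated trait-by-trait scans with one, which is the design choice the abstract argues for.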


Forests ◽  
2018 ◽  
Vol 9 (12) ◽  
pp. 779 ◽  
Author(s):  
Paulina Ballesta ◽  
Nicolle Serra ◽  
Fernando Guerra ◽  
Rodrigo Hasbún ◽  
Freddy Mora

The present study was undertaken to examine the ability of different genomic selection (GS) models to predict growth traits (diameter at breast height, tree height and wood volume), stem straightness and branching quality of Eucalyptus globulus Labill. trees using a genome-wide Single Nucleotide Polymorphism (SNP) chip (60 K), in one of the southernmost progeny trials of the species, close to its southern distribution limit in Chile. The GS methods examined were Ridge Regression-BLUP (RRBLUP), Bayes-A, Bayes-B, Bayesian least absolute shrinkage and selection operator (BLASSO), principal component regression (PCR), supervised PCR and a variant of the RRBLUP method that involves the previous selection of predictor variables (RRBLUP-B). RRBLUP-B and supervised PCR models presented the greatest predictive ability (PA), followed by the PCR method, for most of the traits studied. The highest PA was obtained for the branching quality (~0.7). For the growth traits, the maximum values of PA varied from 0.43 to 0.54, while for stem straightness, the maximum value of PA reached 0.62 (supervised PCR). The study population presented a more extended linkage disequilibrium (LD) than other populations of E. globulus previously studied. The genome-wide LD decayed rapidly within 0.76 Mbp (threshold value of r² = 0.1). The average LD on all chromosomes was r² = 0.09. In addition, 0.15% of all pairs of linked SNPs were in complete LD (r² = 1), and 3% had an r² value >0.5. Genomic prediction, which is based on the reduction in dimensionality and variable selection, may be a promising method, considering the early growth of the trees and the low-to-moderate values of heritability found in the traits evaluated. These findings provide new understanding of how to develop novel breeding strategies for tree improvement of E. globulus at its southernmost range limit in Chile, which could represent new opportunities for forest planting that can benefit the local economy.
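As a rough illustration of the ridge-regression family of genomic selection models named above, the sketch below estimates marker effects with a fixed shrinkage parameter and scores predictions by correlation. Several things are assumptions: the genotype matrix and phenotype are taken to be centered (no intercept is fitted), the penalty `lam` is supplied by hand rather than estimated from variance components as in RR-BLUP, and predictive ability is taken to mean the Pearson correlation between observed and predicted values, which the abstract does not spell out.

```python
import numpy as np

def ridge_marker_effects(X, y, lam):
    """Solve (X'X + lam*I) beta = X'y for the marker effects beta.

    X   - samples x markers genotype matrix (assumed centered)
    y   - phenotype vector (assumed centered)
    lam - shrinkage parameter, chosen by the user in this sketch
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def predictive_ability(y_obs, y_pred):
    """Pearson correlation between observed and predicted phenotypes."""
    return float(np.corrcoef(y_obs, y_pred)[0, 1])
```

In practice the effects would be estimated on a training subset and `predictive_ability` evaluated on held-out trees, for example within a cross-validation loop.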


Blood ◽  
2007 ◽  
Vol 110 (9) ◽  
pp. 3326-3333 ◽  
Author(s):  
Gabrielle S. Sellick ◽  
Lynn R. Goldin ◽  
Ruth W. Wild ◽  
Susan L. Slager ◽  
Laura Ressenti ◽  
...  

Abstract Chronic lymphocytic leukemia (CLL) and other B-cell lymphoproliferative disorders display familial aggregation. To identify a susceptibility gene for CLL, we assembled families from the major European (ICLLC) and American (GEC) consortia to conduct a genome-wide linkage analysis of 101 new CLL pedigrees using a high-density single nucleotide polymorphism (SNP) array and combined the results with data from our previously reported analysis of 105 families. Here, we report on the combined analysis of the 206 families. Multipoint linkage analyses were undertaken using both nonparametric (model-free) and parametric (model-based) methods. After the removal of high linkage disequilibrium SNPs, we obtained a maximum nonparametric linkage (NPL) score of 3.02 (P = .001) on chromosome 2q21.2. The same genomic position also yielded the highest multipoint heterogeneity LOD (HLOD) score under a common recessive model of disease susceptibility (HLOD = 3.11; P = 7.7 × 10⁻⁵), which was significant at the genome-wide level. In addition, 2 other chromosomal positions, 6p22.1 (corresponding to the major histocompatibility locus) and 18q21.1, displayed HLOD scores higher than 2.1 (P < .002). None of the regions coincided with areas of common chromosomal abnormalities frequently observed in CLL. These findings provide direct evidence for Mendelian predisposition to CLL and evidence for the location of disease loci.


AGROFOR ◽  
2020 ◽  
Vol 5 (2) ◽  
Author(s):  
Barbora OLŠANSKÁ ◽  
Radovan KASARDA ◽  
Kristína LEHOCKÁ ◽  
Nina MORAVČÍKOVÁ

The presented study provides a genome-wide scan of selection signals in cattle by principal component analysis (PCA). The aim was to identify SNPs affected by intensive selection using the PCAdapt package implemented in R. This analysis provided insight into the association between the SNP frequencies related to population differentiation. Four cattle populations were included in the analysis (Slovak Spotted cattle, Ayrshire, Swiss Simmental and Holstein), with a total of 272 genotyped individuals. After applying quality control, the final dataset consisted of 35 675 SNPs, with an overall length of 2496.14 Mb and an average spacing between adjacent SNPs of 70.03 ± 76.1 kb. PCA revealed the uniqueness of the breeds. On the other hand, a close genetic relationship was found, and eleven SNPs affected by selection were identified, located close to 162 genes involved in various biological processes. The majority of genes were involved in the positive regulation of adenylate cyclase activity, embryo development and somatic diversification of immune receptors via somatic mutation. Several candidate genes for genetic control of the immune system (DNAJB9), muscle development (SEPT7, TRIM32, ROCK1, NRAP, PZDZ8, HSPA12A and FGFR2), milk production (SOCS5, CD46), reproduction (LHCGR, EEPD1, FSHR) and coat colour (KIT) were identified. Our results provide insights into the regions of the genome affected by the intensive selection of analysed cattle populations.


2019 ◽  
Author(s):  
Tika B. Adhikari ◽  
Brian J. Knaus ◽  
Niklaus J. Grünwald ◽  
Dennis Halterman ◽  
Frank J. Louws

ABSTRACT Genotyping by sequencing (GBS) is considered a powerful tool to discover single nucleotide polymorphisms (SNPs), which are useful to characterize closely related genomes of plant species and plant pathogens. We applied GBS to determine genome-wide variations in a panel of 187 isolates of three closely related Alternaria spp. that cause diseases on tomato and potato in North Carolina (NC) and Wisconsin (WI). To compare genetic variations, reads were mapped to both A. alternata and A. solani draft reference genomes, and dramatic differences in SNPs were detected among them. Comparison of A. linariae and A. solani populations by principal component analysis revealed the first (83.8% of variation) and second (8.0% of variation) components contained A. linariae from tomato in NC and A. solani from potato in WI, respectively, providing evidence of population structure. A. linariae populations from Haywood, Macon, and Madison counties in NC showed little or no genetic differentiation (Hedrick’s G’ST 0.0 - 0.2). However, the A. linariae population from Swain county appeared to be highly differentiated (G’ST > 0.8). To measure the strength of the linkage disequilibrium (LD), we also calculated the allelic association between pairs of loci. Lewontin’s D (which measures the fraction of allelic variations) and physical distances provided evidence of linkage throughout the entire genome, consistent with the hypothesis of non-random association of alleles among loci. Our findings provide new insights into the understanding of clonal populations on a genome-wide scale and microevolutionary factors that might play an important role in population structure. Although we found limited genetic diversity, the three Alternaria spp. studied here are genetically distinct and each species is preferentially associated with one host.
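The pairwise linkage disequilibrium measures mentioned above can be written down compactly for two biallelic loci. The sketch below computes the raw coefficient D, Lewontin's normalized D', and r² from 0/1 allele calls; it assumes haploid or phased data with one allele per isolate per locus, and the function name `ld_stats` is hypothetical. How the authors actually computed LD from their GBS data is not described in the abstract.

```python
import numpy as np

def ld_stats(h1, h2):
    """Return (D, D_prime, r2) for two biallelic loci.

    h1, h2 - equal-length sequences of 0/1 allele calls, one entry per
             haplotype (assumes haploid or phased data).
    """
    a = np.asarray(h1, dtype=float)
    b = np.asarray(h2, dtype=float)
    pA, pB = a.mean(), b.mean()          # frequencies of allele 1 at each locus
    pAB = np.mean(a * b)                 # frequency of the 1-1 haplotype
    D = pAB - pA * pB                    # raw disequilibrium coefficient
    # Largest magnitude D can reach given the allele frequencies.
    if D >= 0:
        d_max = min(pA * (1 - pB), (1 - pA) * pB)
    else:
        d_max = min(pA * pB, (1 - pA) * (1 - pB))
    d_prime = D / d_max if d_max > 0 else 0.0
    denom = pA * (1 - pA) * pB * (1 - pB)
    r2 = D**2 / denom if denom > 0 else 0.0
    return D, d_prime, r2
```

Here D' keeps the sign of D; many reports quote its absolute value. Plotting D' or r² against the physical distance between each pair of loci gives the usual LD decay curve.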


2020 ◽  
Vol 7 (11) ◽  
Author(s):  
Guillaume Butler-Laporte ◽  
Devin Kreuzer ◽  
Tomoko Nakanishi ◽  
Adil Harroud ◽  
Vincenzo Forgetta ◽  
...  

Abstract Background Infectious diseases are causally related to a large array of noncommunicable diseases (NCDs). Identifying genetic determinants of infections and antibody-mediated immune responses may shed light on this relationship and provide therapeutic targets for drug and vaccine development. Methods We used the UK biobank cohort of up to 10 000 serological measurements of infectious diseases and genome-wide genotyping. We used data on 13 pathogens to define 46 phenotypes: 15 seropositivity case–control phenotypes and 31 quantitative antibody measurement phenotypes. For each of these, we performed genome-wide association studies (GWAS) using the fastGWA linear mixed model package and human leukocyte antigen (HLA) classical allele and amino acid residue associations analyses using Lasso regression for variable selection. Results We included a total of 8735 individuals for case–control phenotypes, and an average (range) of 4286 (276–8555) samples per quantitative analysis. Fourteen of the GWAS yielded a genome-wide significant (P < 5 × 10⁻⁸) locus at the major histocompatibility complex (MHC) on chromosome 6. Outside the MHC, we found a total of 60 loci, multiple associated with Epstein-Barr virus (EBV)–related NCDs (eg, RASA3, MED12L, and IRF4). FUT2 was also identified as an important gene for polyomaviridae. HLA analysis highlighted the importance of DRB1*09:01, DQB1*02:01, DQA1*01:02, and DQA1*03:01 in EBV serologies and of DRB1*15:01 in polyomaviridae. Conclusions We have identified multiple genetic variants associated with antibody immune response to 13 infections, many of which are biologically plausible therapeutic or vaccine targets. This may help prioritize future research and drug development.


2020 ◽  
Vol 10 (1) ◽  
Author(s):  
Georgina Samaha ◽  
Claire M. Wade ◽  
Julia Beatty ◽  
Leslie A. Lyons ◽  
Linda M. Fleeman ◽  
...  

Abstract Diabetes mellitus, a common endocrinopathy affecting domestic cats, shares many clinical and pathologic features with type 2 diabetes in humans. In Australia and Europe, diabetes mellitus is almost four times more common among Burmese cats than in other breeds. As a genetically isolated population, the diabetic Australian Burmese cat provides a spontaneous genetic model for studying diabetes mellitus in humans. Studying complex diseases in pedigreed breeds facilitates tighter control of confounding factors including population stratification, allelic frequencies and environmental heterogeneity. We used the feline SNV array and whole genome sequence data to undertake a genome-wide association study and runs of homozygosity analysis of a case–control cohort of Australian and European Burmese cats. Our results identified diabetes-associated haplotypes across chromosomes A3, B1 and E1 and selective sweeps across the Burmese breed on chromosomes B1, B3, D1 and D4. The locus on chromosome B1, common to both analyses, revealed coding and splice region variants in candidate genes, ANK1, EPHX2 and LOX2, implicated in diabetes mellitus and lipid dysregulation. Mapping this condition in Burmese cats has revealed a polygenic spectrum, implicating loci linked to pancreatic beta cell dysfunction, lipid dysregulation and insulin resistance in the pathogenesis of diabetes mellitus in the Burmese cat.

