genome alignment
Recently Published Documents

2021 ◽  
Vol 119 (1) ◽  
pp. e2113075119
Baoxing Song ◽  
Santiago Marco-Sola ◽  
Miquel Moreto ◽  
Lynn Johnson ◽  
Edward S. Buckler ◽  

Millions of species are currently being sequenced, and their genomes are being compared. Many of them have more complex genomes than model systems and raise novel challenges for genome alignment. Widely used local alignment strategies often produce limited or incongruous results when applied to genomes with dispersed repeats, long indels, and highly diverse sequences. Moreover, alignment using many-to-many or reciprocal best hit approaches conflicts with well-studied patterns between species with different rounds of whole-genome duplication. Here, we introduce Anchored Wavefront alignment (AnchorWave), which performs whole-genome duplication–informed collinear anchor identification between genomes and performs base pair–resolved global alignment for collinear blocks using a two-piece affine gap cost strategy. This strategy enables AnchorWave to precisely identify multikilobase indels generated by transposable element (TE) presence/absence variants (PAVs). When aligning two maize genomes, AnchorWave successfully recalled 87% of previously reported TE PAVs. By contrast, other genome alignment tools showed low power for TE PAV recall. AnchorWave precisely aligns up to three times more of the genome as position matches or indels than the closest competitive approach when comparing diverse genomes. Moreover, AnchorWave recalls transcription factor–binding sites at a rate of 1.05- to 74.85-fold higher than other tools with significantly lower false-positive alignments. AnchorWave complements available genome alignment tools by showing obvious improvement when applied to genomes with dispersed repeats, active TEs, high sequence diversity, and whole-genome duplication variation.
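
The two-piece affine gap cost mentioned above can be illustrated with a minimal sketch. It assumes the common convention in which a gap of a given length is charged the cheaper of two affine functions, one tuned for short gaps and one for long gaps. The function name and all four penalty values are hypothetical and chosen only for illustration; they are not AnchorWave's defaults.

```python
def two_piece_affine_gap_cost(length, open1=4, extend1=2, open2=80, extend2=1):
    """Cost of a gap of `length` bases under a two-piece affine model.

    All penalty values are illustrative assumptions, not taken from any tool.
    Piece 1 (cheap to open, expensive to extend) wins for short gaps;
    piece 2 (expensive to open, cheap to extend) wins for long gaps.
    """
    if length <= 0:
        return 0
    short_piece = open1 + extend1 * length
    long_piece = open2 + extend2 * length
    return min(short_piece, long_piece)


if __name__ == "__main__":
    for gap_length in (1, 10, 76, 100, 5000):
        print(gap_length, two_piece_affine_gap_cost(gap_length))
```

With these assumed values the two pieces cross at a gap length of 76 bases. Beyond that point each extra base costs 1 instead of 2, which is why a model of this shape tolerates multikilobase indels better than a single affine function.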

2021 ◽  
Vol 3 (12) ◽  
Paul Benedic U. Salvador ◽  
Leslie Michelle M. Dalmacio ◽  
Sang Hoon Kim ◽  
Dae-Kyung Kang ◽  
Marilen P. Balolong

Probiotic strains from different origins have shown promise in recent decades for their health benefits, for example in promoting and regulating the immune system. The immunomodulatory potential of four Lactobacillus strains from animal and plant origins was evaluated in this paper based on their genomic information. Comparative genomic analysis was performed through genome alignment, average nucleotide identity (ANI) analysis and gene mining for putative immunomodulatory genes. The genomes of the four Lactobacillus strains show relative similarities in multiple regions, as observed in the genome alignment. However, ANI analysis showed that L. mucosae LM1 and L. fermentum SK152 are the most similar when considering their nucleotide sequences alone. Gene mining of putative immunomodulatory genes studied from L. plantarum WCFS1 yielded multiple results in the four potential probiotic strains, with L. plantarum SK151 showing the largest number of genes at around 74 hits, followed by L. johnsonii PF01 at 41 genes when adjusted for matches with at least 30% identity. Looking at the immunomodulatory genes in each strain, L. plantarum SK151 and L. johnsonii PF01 may have wider activity, covering both immune activation and immune suppression, as compared to L. mucosae LM1 and L. fermentum SK152, which could be more effective in activating immune cells and the pro-inflammatory cascade rather than suppressing it. The similarities and differences between the four Lactobacillus species showed that there is no definitive trend based on the origin of isolation alone. Moreover, higher percentage identities between genomes do not directly correlate with higher similarities in potential activity, such as in immunomodulation. The immunomodulatory function of each of the four Lactobacillus strains should be observed and verified experimentally in the future, since the activity of some genes may be strain-specific, which would not be identified through comparative genomics alone.

2021 ◽  
Baoxing Song ◽  
Santiago Marco-Sola ◽  
Miquel Moreto ◽  
Lynn Johnson ◽  
Edward S. Buckler ◽  

Millions of species are currently being sequenced and their genomes are being compared. Many of them have more complex genomes than model systems and raise novel challenges for genome alignment. Widely used local alignment strategies often produce limited or incongruous results when applied to genomes with dispersed repeats, long indels, and highly diverse sequences. Moreover, alignment using many-to-many or reciprocal best hit approaches conflicts with well-studied patterns between species with different rounds of whole-genome duplication or polyploidy levels. Here we introduce AnchorWave, which performs whole-genome duplication informed collinear anchor identification between genomes and performs base-pair resolution global alignments for collinear blocks using the wavefront algorithm and a 2-piece affine gap cost strategy. This strategy enables AnchorWave to precisely identify multi-kilobase indels generated by transposable element (TE) presence/absence variants (PAVs). When aligning two maize genomes, AnchorWave successfully recalled 87% of previously reported TE PAVs. By contrast, other genome alignment tools showed almost zero power for TE PAV recall. AnchorWave precisely aligns up to three times more of the genome than the closest competitive approach when comparing diverse genomes. Moreover, AnchorWave recalls transcription factor binding sites (TFBSs) at a rate 1.05- to 74.85-fold higher than other tools, with significantly fewer false-positive alignments. AnchorWave shows obvious improvement when applied to genomes with dispersed repeats, active transposable elements, high sequence diversity and whole-genome duplication variation.

2021 ◽  
Vol 21 (1) ◽  
Zhao Chen ◽  
Jing Li ◽  
Ning Hou ◽  
Yanling Zhang ◽  
Yanjiang Qiao

The traditional Chinese medicine (TCM) genome project aims to reveal the genetic information and regulatory network of herbal medicines, and to clarify their molecular mechanisms in the prevention and treatment of human diseases. Moreover, the TCM genome could provide the basis for the discovery of the functional genes of active ingredients in TCM, and for the breeding and improvement of TCM. The Traditional Chinese Medicine Basic Local Alignment Search Tool (TCM-Blast) is a web interface for TCM protein and DNA sequence similarity searches. It contains approximately 40G of genome data on TCMs, including protein and DNA sequences for 36 TCMs with high medical value. The publicly accessible TCM genome alignment database, hosted on the TCM-Blast website, has been expanded to query multiple sequence databases to obtain TCM genome data and to provide user-friendly output for easy analysis and browsing of BLAST results. The genome sequencing of TCMs helps to elucidate the biosynthetic pathways of important secondary metabolites and provides an essential resource for gene discovery studies and molecular breeding. The TCM genomes provide a valuable resource for the investigation of novel bioactive compounds and drugs from these TCMs under the guidance of TCM clinical practice. Our database could be expanded to other TCMs after the determination of their genome data.

Roman Kotłowski ◽  
Alicja Nowak-Zaleska ◽  
Grzegorz Węgrzyn

An optimized method for bacterial strain differentiation, based on a combination of Repeated Sequences and Whole Genome Alignment Differential Analysis (RS&WGADA), is presented in this report. In this analysis, 51 multidrug-resistant Acinetobacter baumannii strains from one hospital environment and patients from 14 hospital wards were classified on the basis of polymorphisms of repeated sequences located in the CRISPR region, variation in the gene encoding the EmrA-homologue of E. coli, and antibiotic resistance patterns, in combination with three newly identified polymorphic regions in the genomes of A. baumannii clinical isolates. Differential analysis of two similarity matrices between different genotypes and resistance patterns allowed us to distinguish three significant correlations (p < 0.05) involving a 172 bp DNA insertion combined with resistance to chloramphenicol and gentamycin. Interestingly, 45 and 55 bp DNA insertions within the CRISPR region were identified, and combined during analyses with resistance/susceptibility to trimethoprim/sulfamethoxazole. Moreover, 184 or 1374 bp DNA length polymorphisms in the genomic region located upstream of the GTP cyclohydrolase I gene, associated mainly with imipenem susceptibility, were identified. In addition, considerable nucleotide polymorphism of the gene encoding the gamma/tau subunit of DNA polymerase III, an enzyme crucial for bacterial DNA replication, was discovered. The differentiation analysis performed using the approach described above allowed us to monitor the distribution of A. baumannii isolates in different wards of the hospital over a time frame of several years, indicating that the optimized method may be useful in hospital epidemiological studies, particularly in identifying the source of primary infections.

2021 ◽  
Yaoyao Wu ◽  
Lynn Johnson ◽  
Baoxing Song ◽  
Cinta Romay ◽  
Michelle Stitzer ◽  

Alignments of multiple genomes are a cornerstone of comparative genomics, but generating these alignments remains technically challenging and often impractical. We developed the msa_pipeline workflow, based on the LAST aligner, to allow practical and sensitive multiple alignment of diverged plant genomes with minimal user inputs. Our workflow only requires a set of genomes in FASTA format as input. The workflow outputs multiple alignments in MAF format, and includes utilities to help calculate genome-wide conservation scores. As high repeat content and genomic divergence are substantial challenges in plant genome alignment, we also explored the impact of different masking approaches and alignment parameters using genome assemblies of 33 grass species. Compared to conventional masking with RepeatMasker, a k-mer masking approach increased the alignment rate of CDS and non-coding functional regions by 25% and 14%, respectively. We further found that default alignment parameters generally perform well, but parameter tuning can increase the alignment rate for non-coding functional regions by over 52% compared to default LAST settings. Finally, by increasing alignment sensitivity from the default baseline, parameter tuning can increase the number of non-coding sites that can be scored for conservation by over 76%.

2021 ◽  
Bruno Contreras-Moreira ◽  
Carla V Filippi ◽  
Guy Naamati ◽  
Carlos García Girón ◽  
James E Allen ◽  

The annotation of repetitive sequences within plant genomes can help in the interpretation of observed phenotypes. Moreover, repeat masking is required for tasks such as whole-genome alignment, promoter analysis or pangenome exploration. While homology-based annotation methods are computationally expensive, k-mer strategies for masking are orders of magnitude faster. Here we benchmark a two-step approach, where repeats are first called by k-mer counting and then annotated by comparison to curated libraries. This hybrid protocol was tested on 20 plant genomes from Ensembl, using the k-mer-based Repeat Detector (Red) and two repeat libraries (REdat and nrTEplants, curated for this work). We obtained repeated genome fractions that match those reported in the literature, but with shorter repeated elements than those produced with conventional annotators. Inspection of masked regions overlapping genes revealed no preference for specific protein domains. Half of Red masked sequences can be successfully classified with nrTEplants, with the complete protocol taking less than 2 h on a desktop Linux box. The repeat library and the scripts to mask and annotate plant genomes can be obtained at
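
The first step of the protocol above, calling repeats by k-mer counting, can be sketched in a few lines. This is a deliberately naive illustration and not the algorithm implemented by Red: it counts every k-mer in a sequence and soft-masks (lowercases) each base covered by a k-mer seen at least `min_count` times. The function name, `k`, and `min_count` are hypothetical choices for the example.

```python
from collections import Counter


def kmer_soft_mask(seq, k=4, min_count=3):
    """Lowercase every base covered by a k-mer occurring >= min_count times.

    Naive illustration only; k and min_count are assumed values, and real
    tools use much longer k-mers and more careful statistics.
    """
    seq = seq.upper()
    n_kmers = len(seq) - k + 1
    # Count all overlapping k-mers (empty if the sequence is shorter than k).
    counts = Counter(seq[i:i + k] for i in range(n_kmers))
    masked = [False] * len(seq)
    for i in range(n_kmers):
        if counts[seq[i:i + k]] >= min_count:
            for j in range(i, i + k):
                masked[j] = True
    return "".join(base.lower() if m else base for base, m in zip(seq, masked))


if __name__ == "__main__":
    print(kmer_soft_mask("ACGTACGTACGTTTGCA"))
```

Soft-masking keeps the original bases in place, so a downstream aligner can choose to skip lowercase regions when seeding while still extending alignments through them.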

eLife ◽  
2021 ◽  
Vol 10 ◽  
Thomas Sakoparnig ◽  
Chris Field ◽  
Erik van Nimwegen

Although recombination is accepted to be common in bacteria, for many species robust phylogenies with well-resolved branches can be reconstructed from whole genome alignments of strains, and these are generally interpreted to reflect clonal relationships. Using new methods based on the statistics of single-nucleotide polymorphism (SNP) splits, we show that this interpretation is incorrect. For many species, each locus has recombined many times along its line of descent, and instead of many loci supporting a common phylogeny, the phylogeny changes many thousands of times along the genome alignment. Analysis of the patterns of allele sharing among strains shows that bacterial populations cannot be approximated as either clonal or freely recombining, but are structured such that recombination rates between lineages vary over several orders of magnitude, with a unique pattern of rates for each lineage. Thus, rather than reflecting clonal ancestry, whole genome phylogenies reflect distributions of recombination rates.
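
To make the idea of conflicting SNP splits concrete, the sketch below applies the classical four-gamete test to pairs of biallelic alignment columns. It is not the authors' method, only a standard textbook check: under a single tree without recurrent mutation, two biallelic sites can show at most three of the four possible allele combinations, so observing all four indicates that the two sites cannot share one phylogeny. The function names and the toy columns are hypothetical.

```python
def is_biallelic(column):
    """True if the alignment column contains exactly two distinct alleles."""
    return len(set(column)) == 2


def splits_compatible(col_a, col_b):
    """Four-gamete test for two biallelic columns over the same strains.

    Returns True if fewer than four allele combinations are observed,
    i.e. both SNP splits can be placed on a single tree.
    """
    if len(col_a) != len(col_b):
        raise ValueError("columns must cover the same strains")
    if not (is_biallelic(col_a) and is_biallelic(col_b)):
        raise ValueError("both columns must be biallelic")
    gametes = set(zip(col_a, col_b))
    return len(gametes) < 4


if __name__ == "__main__":
    # Toy columns: one character per strain (four strains).
    print(splits_compatible("AAGG", "AAAG"))  # nested splits
    print(splits_compatible("AAGG", "AGAG"))  # crossing splits
```

Scanning consecutive SNP pairs along an alignment with a check like this gives a rough sense of how often the locally supported phylogeny must change.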
