An optimized FM-index library for nucleotide and amino acid search

2021 ◽  
Vol 16 (1) ◽  
Author(s):  
Tim Anderson ◽  
Travis J. Wheeler

Abstract. Background: Pattern matching is a key step in a variety of biological sequence analysis pipelines. The FM-index is a compressed data structure for pattern matching, with search run time that is independent of the length of the database text. Implementation of the FM-index is reasonably complicated, so increased adoption will be aided by the availability of a fast and flexible FM-index library. Results: We present AvxWindowedFMindex (AWFM-index), a lightweight, open-source, thread-parallel FM-index library written in C that is optimized for indexing nucleotide and amino acid sequences. AWFM-index introduces a new approach to storing FM-index data in a strided bit-vector format that enables extremely efficient computation of the FM-index occurrence function via AVX2 bitwise instructions, and combines this with optional on-disk storage of the index’s suffix array and a cache-efficient lookup table for partial k-mer searches. The AWFM-index performs exact match count and locate queries faster than SeqAn3’s FM-index implementation across a range of comparable memory footprints. When optimized for speed, AWFM-index is ~2–4x faster than SeqAn3 for nucleotide search, and ~2–6x faster for amino acid search; it is also ~4x faster with similar memory footprint when storing the suffix array in on-disk SSD storage. Conclusions: AWFM-index is easy to incorporate into bioinformatics software, offers run-time performance parameterization, and provides clients with FM-index functionality at both a high level (count or locate all instances of a query string) and a low level (step-wise control of the FM-index backward-search process). The open-source library is available for download at https://github.com/TravisWheelerLab/AvxWindowFmIndex.
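
To make the count and locate queries and the step-wise backward search mentioned in this abstract concrete, here is a minimal textbook sketch in Python. It is not the AWFM-index API: the function names (`build_fm_index`, `backward_search`, `count`, `locate`) are hypothetical, and the index stores a full suffix array and an uncompressed occurrence table, which a real library would sample or pack.

```python
def build_fm_index(text):
    """Build a toy FM-index: suffix array, C table and full Occ table."""
    text += "$"  # sentinel, assumed to sort before every other symbol
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in sa)  # i == 0 wraps to the sentinel
    alphabet = sorted(set(text))
    # c_table[ch]: number of symbols in text strictly smaller than ch
    c_table, total = {}, 0
    for ch in alphabet:
        c_table[ch] = total
        total += text.count(ch)
    # occ[ch][i]: number of occurrences of ch in bwt[:i]
    occ = {ch: [0] * (len(bwt) + 1) for ch in alphabet}
    for i, b in enumerate(bwt):
        for ch in alphabet:
            occ[ch][i + 1] = occ[ch][i] + (1 if b == ch else 0)
    return sa, c_table, occ


def backward_search(pattern, index):
    """Return the half-open suffix-array range [lo, hi) matching pattern."""
    sa, c_table, occ = index
    lo, hi = 0, len(sa)
    # One step per pattern symbol, so the loop does not depend on text length.
    for ch in reversed(pattern):
        if ch not in c_table:
            return 0, 0
        lo = c_table[ch] + occ[ch][lo]
        hi = c_table[ch] + occ[ch][hi]
        if lo >= hi:
            return 0, 0
    return lo, hi


def count(pattern, index):
    """Number of exact occurrences of pattern in the indexed text."""
    lo, hi = backward_search(pattern, index)
    return hi - lo


def locate(pattern, index):
    """Sorted start positions of all exact occurrences of pattern."""
    lo, hi = backward_search(pattern, index)
    return sorted(index[0][lo:hi])
```

For example, after `index = build_fm_index("ACGTACGTAC")`, calling `count("ACG", index)` reports two matches and `locate("AC", index)` returns their start positions. The loop in `backward_search` is the step a low-level interface would expose one symbol at a time.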

2021 ◽  
Author(s):  
Tim Anderson ◽  
Travis J Wheeler

Abstract. Pattern matching is a key step in a variety of biological sequence analysis pipelines. The FM-index is a compressed data structure for pattern matching, with search run time that is independent of the length of the database text. We present AvxWindowedFMindex (AWFM-index), an open-source, thread-parallel FM-index library written in C that is highly optimized for indexing nucleotide and amino acid sequences. AWFM-index is easy to incorporate into bioinformatics software and is able to perform exact match count and locate queries approximately 4x faster than SeqAn3’s FM-index implementation for nucleotide search, and approximately 8x faster for amino acid search in a single-threaded context. This performance is due to (i) a new approach to storing FM-index data in a strided bit-vector format that enables extremely efficient computation of the FM-index occurrence function via AVX2 bitwise instructions, and (ii) inclusion of a cache-efficient lookup table for partial k-mer searches. AWFM-index also trivially parallelizes to multiple threads, and scales well in multithreaded contexts. The open-source library is available for download at https://github.com/TravisWheelerLab/AvxWindowFmIndex. Author summary. AvxWindowedFMIndex is a fast, open-source library implementation of the FM-index algorithm. This library takes advantage of powerful ‘single-instruction, multiple data’ (SIMD) CPU instructions to quickly perform the most difficult part of the algorithm: counting the number of occurrences of a given letter in a block of text. Algorithms like the FM-index are widely used in many places in bioinformatics, such as biosequence database searching, taxonomic classification, and sequencing error correction. Using the AvxWindowedFMIndex library will ease the burden of including the FM-index in bioinformatics software, thus enabling faster pattern matching and overall faster software in practice.
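
The idea of counting a letter's occurrences in a block with bitwise operations can be illustrated with a scalar Python analogue. This is only a sketch under stated assumptions: the 2-bit nucleotide code, the two-plane layout and the names `pack_block` and `occ_in_block` are hypothetical, and the abstract does not specify the library's actual strided layout or AVX2 instruction sequence.

```python
# Assumed 2-bit code for nucleotides (illustrative, not the library's encoding).
ENCODING = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}


def pack_block(block):
    """Pack a block of nucleotides into two bit-planes, one int per code bit."""
    planes = [0, 0]
    for pos, ch in enumerate(block):
        code = ENCODING[ch]
        for bit in range(2):
            if (code >> bit) & 1:
                planes[bit] |= 1 << pos
    return planes


def occ_in_block(planes, letter, end):
    """Count occurrences of letter in positions [0, end) of a packed block."""
    code = ENCODING[letter]
    match = (1 << end) - 1  # mask keeping only the first `end` positions
    for bit in range(2):
        # Keep positions whose bit in this plane agrees with the letter's code.
        plane = planes[bit] if (code >> bit) & 1 else ~planes[bit]
        match &= plane
    return bin(match).count("1")  # population count of the matching positions
```

With `planes = pack_block("ACGTACGTAC")`, the call `occ_in_block(planes, "A", 10)` counts the three occurrences of `A` using one AND per bit-plane followed by a population count. A SIMD version would apply the same AND and popcount pattern to wide registers.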


1993 ◽  
Vol 295 (2) ◽  
pp. 347-350 ◽  
Author(s):  
B J Nichols ◽  
L Hall ◽  
A C F Perry ◽  
R M Denton

A 600 bp cDNA fragment encoding part of the gamma-subunit of pig heart NAD(+)-isocitrate dehydrogenase (ICDH gamma) was amplified by PCR using redundant oligonucleotide primers based on partial peptide sequence data [Huang and Colman (1990) Biochemistry 29, 8266-8273]. This PCR fragment was then used as a probe to isolate clones encoding the complete mature forms of the gamma-subunit from rat epididymis and monkey testis cDNA libraries. Comparison of the deduced amino acid sequences of the rat and monkey subunits and the partial sequence of the pig heart enzyme revealed a remarkably high level of sequence identity. The relationship between the deduced amino acid sequences of the NAD(+)-ICDH gamma-subunits and those of nonmammalian NAD(+)- and NADP(+)-ICDH subunits is discussed.


1999 ◽  
Vol 43 (4) ◽  
pp. 969-971 ◽  
Author(s):  
Guillaume Arlet ◽  
Sylvie Goussard ◽  
Patrice Courvalin ◽  
Alain Philippon

ABSTRACT The sequences of the bla TEM genes encoding TEM-20, TEM-21, TEM-22, and TEM-29 extended-spectrum β-lactamases were determined. Analysis of the deduced amino acid sequences indicated that TEM-20 and TEM-29 were derived from TEM-1 and that TEM-21 and TEM-22 were derived from TEM-2. The substitutions involved were Ser-238 and Thr-182 for TEM-20; His-164 for TEM-29; Lys-104, Arg-153, and Ser-238 for TEM-21; and Lys-104, Gly-237, and Ser-238 for TEM-22. The promoter region of the bla TEM-22 gene was identical to that of bla TEM-3. High-level production of TEM-20 could result from a 135-bp deletion which combined the −35 region of the Pa promoter with the −10 region of the P3 promoter and a G→T transition in the latter motif.


1992 ◽  
Vol 284 (1) ◽  
pp. 87-93 ◽  
Author(s):  
U Murdiyatmo ◽  
W Asmara ◽  
J S H Tsang ◽  
A J Baines ◽  
A T Bull ◽  
...  

The structural gene (hdl IVa) for the Pseudomonas cepacia MBA4 2-haloacid halidohydrolase IVa (Hdl IVa) was isolated on a 1.6 kb fragment of Ps. cepacia MBA4 chromosomal DNA. The recombinant halidohydrolase was expressed in Escherichia coli and Pseudomonas putida and the structural gene was subcloned on to the tac expression vector pBTac1. High-level expression from the tac promoter was seen to be temperature-dependent, a consequence of the nucleotide sequence adjacent to the fragment encoding the halidohydrolase. The nucleotide sequence of the fragment encoding the Hdl IVa was determined and analysed. Three ATG codons were identified in one of the open reading frames and the one corresponding to the start of the hdl IVa structural gene was determined by comparison of the predicted amino acid sequences with the experimentally determined N-terminal sequences of halidohydrolase IVa. The hdl IVa gene encoded a 231-amino acid-residue protein of M(r) 25,900. The sequence and predicted structural data are discussed and comparison is made with sequence data for other halidohydrolases.


2009 ◽  
Vol 75 (6) ◽  
pp. 1552-1558 ◽  
Author(s):  
Naruhiko Sawa ◽  
Takeshi Zendo ◽  
Junko Kiyofuji ◽  
Koji Fujita ◽  
Kohei Himeno ◽  
...  

ABSTRACT Lactococcus sp. strain QU 12, which was isolated from cheese, produced a novel cyclic bacteriocin termed lactocyclicin Q. By using cation-exchange chromatography, hydrophobic interaction chromatography, and reverse-phase high-performance liquid chromatography, lactocyclicin Q was purified from culture supernatant, and its molecular mass was determined to be 6,062.8 Da by mass spectrometry. Lactocyclicin Q has been characterized by its unique antimicrobial spectrum, high level of protease resistance, and heat stability compared to other reported bacteriocins of lactic acid bacteria. The amino acid sequence of lactocyclicin Q was determined chemically, and this compound is composed of 61 amino acid residues that have a cyclic structure with linkage between the N and C termini by a peptide bond. It showed no homology to any other antimicrobial peptide, including cyclic bacteriocins. On the basis of the amino acid sequences obtained, the sequence of the gene encoding the prepeptide lactocyclicin Q was obtained. This is the first report of a cyclic bacteriocin purified from a strain belonging to the genus Lactococcus.


2017 ◽  
Vol 11 (4) ◽  
pp. 1835-1850 ◽  
Author(s):  
Stefan Muckenhuber ◽  
Stein Sandven

Abstract. An open-source sea ice drift algorithm for Sentinel-1 SAR imagery is introduced based on the combination of feature tracking and pattern matching. Feature tracking produces an initial drift estimate and limits the search area for the consecutive pattern matching, which provides small- to medium-scale drift adjustments and normalised cross-correlation values. The algorithm is designed to combine the two approaches in order to benefit from the respective advantages. The considered feature-tracking method allows for an efficient computation of the drift field and the resulting vectors show a high degree of independence in terms of position, length, direction and rotation. The considered pattern-matching method, on the other hand, allows better control over vector positioning and resolution. The preprocessing of the Sentinel-1 data has been adjusted to retrieve a feature distribution that depends less on SAR backscatter peak values. Applying the algorithm with the recommended parameter setting, sea ice drift retrieval with a vector spacing of 4 km on Sentinel-1 images covering 400 km  ×  400 km, takes about 4 min on a standard 2.7 GHz processor with 8 GB memory. The corresponding recommended patch size for the pattern-matching step that defines the final resolution of each drift vector is 34  ×  34 pixels (2.7  ×  2.7 km). To assess the potential performance after finding suitable search restrictions, calculated drift results from 246 Sentinel-1 image pairs have been compared to buoy GPS data, collected in 2015 between 15 January and 22 April and covering an area from 80.5 to 83.5° N and 12 to 27° E. We found a logarithmic normal distribution of the displacement difference with a median at 352.9 m using HV polarisation and 535.7 m using HH polarisation. All software requirements necessary for applying the presented sea ice drift algorithm are open-source to ensure free implementation and easy distribution.
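
The pattern-matching step described in this abstract, refining an initial drift estimate by maximising normalised cross-correlation inside a restricted search area, can be sketched as follows. This is a generic illustration, not the authors' implementation: the function names, the list-of-rows image format and the `guess` and `radius` parameters are assumptions, and rotation handling and the feature-tracking stage are omitted.

```python
import math


def ncc(a, b):
    """Normalised cross-correlation of two equally sized patches (lists of rows)."""
    xs = [v for row in a for v in row]
    ys = [v for row in b for v in row]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = math.sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))
    return num / den if den > 0 else 0.0  # flat patches carry no information


def crop(image, row, col, size):
    """Extract a size x size patch whose top-left corner is (row, col)."""
    return [r[col:col + size] for r in image[row:row + size]]


def match_patch(img1, img2, row, col, size, guess, radius):
    """Refine a displacement guess (d_row, d_col) within +/- radius pixels.

    Returns (best_score, (d_row, d_col)) for the patch of img1 at (row, col).
    """
    template = crop(img1, row, col, size)
    best = (-2.0, guess)  # any real NCC value lies in [-1, 1]
    for dr in range(guess[0] - radius, guess[0] + radius + 1):
        for dc in range(guess[1] - radius, guess[1] + radius + 1):
            r, c = row + dr, col + dc
            if r < 0 or c < 0 or r + size > len(img2) or c + size > len(img2[0]):
                continue  # candidate patch falls outside the second image
            score = ncc(template, crop(img2, r, c, size))
            if score > best[0]:
                best = (score, (dr, dc))
    return best
```

Restricting the loops to a small window around `guess` is what keeps this step cheap: the initial estimate limits the search area, and the returned score plays the role of the cross-correlation value attached to each drift vector.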


2000 ◽  
Vol 44 (5) ◽  
pp. 1309-1314 ◽  
Author(s):  
Jeanette W. P. Teo ◽  
Antonius Suwanto ◽  
Chit Laa Poh

ABSTRACT Two ampicillin-resistant (Ampr) isolates of Vibrio harveyi, W3B and HB3, were obtained from the coastal waters of the Indonesian island of Java. Strain W3B was isolated from marine water near a shrimp farm in North Java while HB3 was from pristine seawater in South Java. In this study, novel β-lactamase genes from W3B (bla VHW-1) and HB3 (bla VHH-1) were cloned and their nucleotide sequences were determined. An open reading frame (ORF) of 870 bp encoding a deduced protein of 290 amino acids (VHW-1) was revealed for the bla gene of strain W3B while an ORF of 849 bp encoding a 283-amino-acid protein (VHH-1) was deduced for bla VHH-1. At the DNA level, genes for VHW-1 and VHH-1 have a 97% homology, while at the protein level they have a 91% homology of amino acid sequences. Neither gene sequence showed homology to any other β-lactamases in the databases. The deduced proteins were found to be class A β-lactamases bearing low levels of homology (<50%) to other β-lactamases of the same class. The highest level of identity was obtained with β-lactamases from Pseudomonas aeruginosa, i.e., PSE-1, PSE-4, and CARB-3, and Vibrio cholerae CARB-6. Our study showed that both strains W3B and HB3 possess an endogenous plasmid of approximately 60 kb in size. However, Southern hybridization analysis employing bla VHW-1 as a gene probe demonstrated that the bla gene was not located in the plasmid. A total of nine ampicillin-resistant V. harveyi strains, including W3B and HB3, were examined by pulsed-field gel electrophoresis of NotI-digested genomic DNA. Despite a high level of intrastrain genetic diversity, the bla VHW-1 probe hybridized only to an 80- or 160-kb NotI genomic fragment in different isolates.


1995 ◽  
Vol 305 (2) ◽  
pp. 651-658 ◽  
Author(s):  
A Vaughan ◽  
M Rodriguez ◽  
J Hemingway

Resistance to organophosphates in Culex mosquitoes is typically associated with increased activity of non-specific esterases. The commonest phenotype involves two elevated esterases, A2 and B2, while some strains have elevation of esterase B1 alone. Overexpression of the two B esterase electromorphs is due to gene amplification. Full-length cDNAs coding for amplified esterase B genes from a resistant Cuban strain (MRES, with amplified B1 esterase) and a Sri Lankan strain (PelRR, with amplified B2 esterase) of C. quinquefasciatus have been sequenced. In addition, a partial-length cDNA coding for a B esterase from an insecticide-susceptible Sri Lankan strain (PelSS) has been sequenced. All the nucleotide sequences and the inferred amino acid sequences show a high level of identity (>95% at the nucleotide and amino acid level), confirming that they are an allelic series. The two B1 esterase nucleotide sequences (MRES and the previously published TEM-R [Mouches, Pauplin, Agarwal, Lemieux, Herzog, Abadon, Beyssat-Arnaouty, Hyrien, De Saint Vincent, Georghiou and Pasteur (1990) Proc. Natl. Acad. Sci. U.S.A. 87, 2574-2578]) showed the lowest identity, and restriction-fragment-length-polymorphism analysis of the two strains was different. On the basis of these data we suggest that the two electrophoretically identical B1 esterase isoenzymes from California and Cuba have been amplified independently. Alternatively, if amplification has occurred only once, the original amplification has not occurred recently.


1993 ◽  
Vol 69 (04) ◽  
pp. 351-360 ◽  
Author(s):  
Masahiro Murakawa ◽  
Takashi Okamura ◽  
Takumi Kamura ◽  
Tsunefumi Shibuya ◽  
Mine Harada ◽  
...  

Summary. The partial amino acid sequences of fibrinogen Aα-chains from five mammalian species have been inferred by means of the polymerase chain reaction (PCR). From the genomic DNA of the rhesus monkey, pig, dog, mouse and Syrian hamster, the DNA fragments coding for α-C domains in the Aα-chains were amplified and sequenced. In all species examined, four cysteine residues were always conserved at the homologous positions. The carboxy- and amino-terminal portions of the α-C domains showed a considerable homology among the species. However, the sizes of the middle portions, which corresponded to the internal repeat structures, showed an apparent variability because of several insertions and/or deletions. In the rhesus monkey, pig, mouse and Syrian hamster, 13 amino acid tandem repeats fundamentally similar to those in humans and the rat were identified. In the dog, however, tandem repeats were found to consist of 18 amino acids, suggesting an independent multiplication of the canine repeats. The sites of the α-chain cross-linking acceptor and α2-plasmin inhibitor cross-linking donor were not always evolutionally conserved. The arginyl-glycyl-aspartic acid (RGD) sequence was not found in the amplified region of either the rhesus monkey or the pig. In the canine α-C domain, two RGD sequences were identified at the positions homologous to both rat and human RGDS. In the Syrian hamster, a single RGD sequence was found at the same position as that of the rat. Triplication of the RGD sequences was seen in the murine fibrinogen α-C domain around the site homologous to the rat RGDS sequence. These findings are of some interest from the point of view of structure-function and evolutionary relationships in the mammalian fibrinogen Aα-chains.

