An optimized FM-index library for nucleotide and amino acid search

2021 ◽  
Author(s):  
Tim Anderson ◽  
Travis J Wheeler

Abstract: Pattern matching is a key step in a variety of biological sequence analysis pipelines. The FM-index is a compressed data structure for pattern matching, with search run time that is independent of the length of the database text. We present AvxWindowedFMindex (AWFM-index), an open-source, thread-parallel FM-index library written in C that is highly optimized for indexing nucleotide and amino acid sequences. AWFM-index is easy to incorporate into bioinformatics software and is able to perform exact match count and locate queries approximately 4x faster than SeqAn3’s FM-index implementation for nucleotide search, and approximately 8x faster for amino acid search in a single-threaded context. This performance is due to (i) a new approach to storing FM-index data in a strided bit-vector format that enables extremely efficient computation of the FM-index occurrence function via AVX2 bitwise instructions, and (ii) inclusion of a cache-efficient lookup table for partial k-mer searches. AWFM-index also trivially parallelizes to multiple threads, and scales well in multithreaded contexts. The open-source library is available for download at https://github.com/TravisWheelerLab/AvxWindowFmIndex. Author summary: AvxWindowedFMIndex is a fast, open-source library implementation of the FM-index algorithm. This library takes advantage of powerful ‘single-instruction, multiple data’ (SIMD) CPU instructions to quickly perform the most difficult part of the algorithm: counting the number of occurrences of a given letter in a block of text. Algorithms like the FM-index are widely used in many areas of bioinformatics, such as biosequence database searching, taxonomic classification, and sequencing error correction. Using the AvxWindowedFMIndex library will ease the burden of incorporating the FM-index into bioinformatics software, thus enabling faster pattern matching and overall faster software in practice.
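The count and locate queries mentioned in this abstract rest on the FM-index backward search. The following is a minimal, unoptimized Python sketch of that general technique, not AWFM-index's C implementation or API; the function names `build_fm_index` and `backward_search`, the full (unsampled) occurrence table, and the naive suffix sort are all illustrative assumptions.

```python
def build_fm_index(text):
    """Build a naive FM-index (suffix array, C table, Occ table) for text."""
    text += "$"  # sentinel; assumed to sort before every other symbol
    # Naive suffix sort; fine for a sketch, far too slow for real databases.
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    # BWT symbol = character preceding each sorted suffix (wraps to "$").
    bwt = "".join(text[i - 1] for i in suffix_array)
    alphabet = sorted(set(text))
    # c_table[c]: number of symbols in the text strictly smaller than c.
    c_table, total = {}, 0
    for c in alphabet:
        c_table[c] = total
        total += text.count(c)
    # occ[c][i]: occurrences of c in bwt[:i] (stored in full, no sampling).
    occ = {c: [0] * (len(bwt) + 1) for c in alphabet}
    for i, symbol in enumerate(bwt):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (symbol == c)
    return suffix_array, c_table, occ


def backward_search(pattern, c_table, occ, n):
    """Return the half-open suffix-array range [lo, hi) matching pattern."""
    lo, hi = 0, n
    for symbol in reversed(pattern):  # one step per pattern symbol
        if symbol not in c_table:
            return 0, 0
        lo = c_table[symbol] + occ[symbol][lo]
        hi = c_table[symbol] + occ[symbol][hi]
        if lo >= hi:
            return 0, 0
    return lo, hi


suffix_array, c_table, occ = build_fm_index("ACGTACGTAC")
lo, hi = backward_search("AC", c_table, occ, len(suffix_array))
count = hi - lo                       # count query
positions = sorted(suffix_array[lo:hi])  # locate query
```

The loop runs once per pattern symbol, which is why search time does not depend on the length of the indexed text; each step needs two occurrence-function lookups, the part the abstract says is accelerated.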

2021 ◽  
Vol 16 (1) ◽  
Author(s):  
Tim Anderson ◽  
Travis J. Wheeler

Abstract. Background: Pattern matching is a key step in a variety of biological sequence analysis pipelines. The FM-index is a compressed data structure for pattern matching, with search run time that is independent of the length of the database text. Implementation of the FM-index is reasonably complicated, so increased adoption will be aided by the availability of a fast and flexible FM-index library. Results: We present AvxWindowedFMindex (AWFM-index), a lightweight, open-source, thread-parallel FM-index library written in C that is optimized for indexing nucleotide and amino acid sequences. AWFM-index introduces a new approach to storing FM-index data in a strided bit-vector format that enables extremely efficient computation of the FM-index occurrence function via AVX2 bitwise instructions, and combines this with optional on-disk storage of the index’s suffix array and a cache-efficient lookup table for partial k-mer searches. The AWFM-index performs exact match count and locate queries faster than SeqAn3’s FM-index implementation across a range of comparable memory footprints. When optimized for speed, AWFM-index is ∼2–4x faster than SeqAn3 for nucleotide search, and ∼2–6x faster for amino acid search; it is also ∼4x faster with a similar memory footprint when storing the suffix array in on-disk SSD storage. Conclusions: AWFM-index is easy to incorporate into bioinformatics software, offers run-time performance parameterization, and provides clients with FM-index functionality at both a high level (count or locate all instances of a query string) and a low level (step-wise control of the FM-index backward-search process). The open-source library is available for download at https://github.com/TravisWheelerLab/AvxWindowFmIndex.
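To illustrate how an occurrence function can be computed with bitwise operations over bit-vectors, here is a rough Python sketch. It is an assumption-laden illustration of the general bit-plane idea only: the 2-bit encoding in `CODES`, the 256-position `WINDOW`, the function names, and the omission of the sentinel symbol are all hypothetical and do not reproduce AWFM-index's actual memory layout or AVX2 code.

```python
WINDOW = 256  # positions per block (hypothetical; mirrors a 256-bit register)
CODES = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}  # hypothetical encoding


def build_blocks(bwt):
    """Split bwt into windows; store a checkpoint count and two bit-planes each."""
    blocks = []
    running = {symbol: 0 for symbol in CODES}
    for start in range(0, len(bwt), WINDOW):
        window = bwt[start:start + WINDOW]
        planes = [0, 0]  # planes[b] has bit i set if bit b of symbol i's code is 1
        for offset, symbol in enumerate(window):
            for bit in range(2):
                if (CODES[symbol] >> bit) & 1:
                    planes[bit] |= 1 << offset
        blocks.append((dict(running), planes))  # counts before this window
        for symbol in window:
            running[symbol] += 1
    blocks.append((dict(running), [0, 0]))  # trailing block for position == len(bwt)
    return blocks


def occ(blocks, symbol, position):
    """Count occurrences of symbol in bwt[:position], for 0 <= position <= len(bwt)."""
    block_index, offset = divmod(position, WINDOW)
    checkpoint, planes = blocks[block_index]
    full = (1 << WINDOW) - 1
    match = (1 << offset) - 1  # keep only positions before the query offset
    for bit in range(2):
        plane = planes[bit]
        # AND with the plane or its complement, depending on the code bit.
        match &= plane if (CODES[symbol] >> bit) & 1 else ~plane & full
    return checkpoint[symbol] + bin(match).count("1")  # popcount of the match mask


bwt = "ACGTTGCAAC" * 60
blocks = build_blocks(bwt)
example = occ(blocks, "G", 300)
```

In a SIMD implementation the per-plane AND/NOT operations and the population count would each cover a whole window at once, which is the kind of work the abstract attributes to AVX2 bitwise instructions.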


2017 ◽  
Vol 11 (4) ◽  
pp. 1835-1850 ◽  
Author(s):  
Stefan Muckenhuber ◽  
Stein Sandven

Abstract. An open-source sea ice drift algorithm for Sentinel-1 SAR imagery is introduced based on the combination of feature tracking and pattern matching. Feature tracking produces an initial drift estimate and limits the search area for the subsequent pattern matching, which provides small- to medium-scale drift adjustments and normalised cross-correlation values. The algorithm is designed to combine the two approaches in order to benefit from their respective advantages. The considered feature-tracking method allows for an efficient computation of the drift field, and the resulting vectors show a high degree of independence in terms of position, length, direction and rotation. The considered pattern-matching method, on the other hand, allows better control over vector positioning and resolution. The preprocessing of the Sentinel-1 data has been adjusted to retrieve a feature distribution that depends less on SAR backscatter peak values. Applying the algorithm with the recommended parameter setting, sea ice drift retrieval with a vector spacing of 4 km on Sentinel-1 images covering 400 km × 400 km takes about 4 min on a standard 2.7 GHz processor with 8 GB memory. The corresponding recommended patch size for the pattern-matching step that defines the final resolution of each drift vector is 34 × 34 pixels (2.7 × 2.7 km). To assess the potential performance after finding suitable search restrictions, calculated drift results from 246 Sentinel-1 image pairs have been compared to buoy GPS data, collected in 2015 between 15 January and 22 April and covering an area from 80.5 to 83.5° N and 12 to 27° E. We found a log-normal distribution of the displacement difference with a median at 352.9 m using HV polarisation and 535.7 m using HH polarisation. All software requirements necessary for applying the presented sea ice drift algorithm are open-source to ensure free implementation and easy distribution.
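The pattern-matching step described above (a restricted search around a first-guess position, scored by normalised cross-correlation) can be sketched as follows. This is a plain-Python illustration under stated assumptions, not the authors' implementation: the function names, the list-of-lists image representation, the square search radius, and the toy data are all hypothetical, and rotation handling is omitted.

```python
import math


def ncc(patch_a, patch_b):
    """Normalised cross-correlation of two equally sized 2-D patches."""
    a = [v for row in patch_a for v in row]
    b = [v for row in patch_b for v in row]
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    numerator = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    denominator = math.sqrt(
        sum((x - mean_a) ** 2 for x in a) * sum((y - mean_b) ** 2 for y in b)
    )
    return numerator / denominator if denominator else 0.0  # flat patch: no signal


def refine_match(template, image, first_guess, radius):
    """Search within radius of first_guess (top-left row, col) for the best NCC."""
    height, width = len(template), len(template[0])
    row0, col0 = first_guess
    best_score, best_position = -1.0, first_guess
    last_row = min(len(image) - height, row0 + radius)
    last_col = min(len(image[0]) - width, col0 + radius)
    for row in range(max(0, row0 - radius), last_row + 1):
        for col in range(max(0, col0 - radius), last_col + 1):
            window = [r[col:col + width] for r in image[row:row + height]]
            score = ncc(template, window)
            if score > best_score:
                best_score, best_position = score, (row, col)
    return best_position, best_score


# Toy data: a 10 x 10 image of zeros with one distinctive 3 x 3 feature.
template = [[1, 2, 3], [4, 9, 6], [7, 8, 5]]
image = [[0] * 10 for _ in range(10)]
for dr in range(3):
    for dc in range(3):
        image[3 + dr][2 + dc] = template[dr][dc]
position, score = refine_match(template, image, first_guess=(2, 3), radius=2)
```

Restricting the loop to a small radius around the first guess is what keeps the pattern-matching cost low; the returned score plays the role of the normalised cross-correlation value attached to each drift vector.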


2005 ◽  
Vol 03 (03) ◽  
pp. 697-716 ◽  
Author(s):  
YONGHUA HAN ◽  
BIN MA ◽  
KAIZHONG ZHANG

For the identification of novel proteins using MS/MS, de novo sequencing software computes one or several possible amino acid sequences (called sequence tags) for each MS/MS spectrum. Those tags are then matched against the sequences in a protein database, accounting for amino acid mutations. If the de novo sequencing gives correct tags, the homologs of the proteins can be identified by this approach, and software such as MS-BLAST is available for the matching. However, de novo sequencing very often gives only partially correct tags. The most common error is that a segment of amino acids is replaced by another segment with approximately the same mass. We developed a new efficient algorithm to match sequence tags with errors to database sequences for the purpose of protein and peptide identification. A software package, SPIDER, was developed and made available on the Internet for free public use. This paper describes the algorithms and features of the SPIDER software.
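The error model described above, in which a segment is replaced by another segment of approximately the same mass, can be illustrated with a small check. This sketch is not SPIDER's algorithm: the function names and the 0.01 Da default tolerance are hypothetical, the comparison handles only a single differing segment, and the table holds standard monoisotopic residue masses.

```python
# Monoisotopic residue masses in daltons (standard values).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}


def segment_mass(segment):
    """Sum of residue masses for a string of one-letter amino acid codes."""
    return sum(RESIDUE_MASS[residue] for residue in segment)


def mass_equivalent(segment_a, segment_b, tolerance=0.01):
    """True if the two segments have the same total mass within tolerance."""
    return abs(segment_mass(segment_a) - segment_mass(segment_b)) <= tolerance


def differs_by_mass_equivalent_segment(tag, reference, tolerance=0.01):
    """True if tag and reference agree except for one same-mass segment."""
    limit = min(len(tag), len(reference))
    prefix = 0
    while prefix < limit and tag[prefix] == reference[prefix]:
        prefix += 1
    suffix = 0
    while (suffix < limit - prefix
           and tag[len(tag) - 1 - suffix] == reference[len(reference) - 1 - suffix]):
        suffix += 1
    middle_tag = tag[prefix:len(tag) - suffix]
    middle_ref = reference[prefix:len(reference) - suffix]
    return mass_equivalent(middle_tag, middle_ref, tolerance)


# "GG" and "N" have nearly identical masses, so this pair is accepted.
accepted = differs_by_mass_equivalent_segment("LGGK", "LNK")
```

A practical matcher would also need to allow several such segments per tag and combine them with mutation scoring, which this sketch deliberately leaves out.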


1993 ◽  
Vol 69 (04) ◽  
pp. 351-360 ◽  
Author(s):  
Masahiro Murakawa ◽  
Takashi Okamura ◽  
Takumi Kamura ◽  
Tsunefumi Shibuya ◽  
Mine Harada ◽  
...  

Summary: The partial amino acid sequences of fibrinogen Aα-chains from five mammalian species have been inferred by means of the polymerase chain reaction (PCR). From the genomic DNA of the rhesus monkey, pig, dog, mouse and Syrian hamster, the DNA fragments coding for α-C domains in the Aα-chains were amplified and sequenced. In all species examined, four cysteine residues were always conserved at the homologous positions. The carboxy- and amino-terminal portions of the α-C domains showed considerable homology among the species. However, the sizes of the middle portions, which corresponded to the internal repeat structures, showed an apparent variability because of several insertions and/or deletions. In the rhesus monkey, pig, mouse and Syrian hamster, 13 amino acid tandem repeats fundamentally similar to those in humans and the rat were identified. In the dog, however, tandem repeats were found to consist of 18 amino acids, suggesting an independent multiplication of the canine repeats. The sites of the α-chain cross-linking acceptor and α2-plasmin inhibitor cross-linking donor were not always evolutionarily conserved. The arginyl-glycyl-aspartic acid (RGD) sequence was not found in the amplified region of either the rhesus monkey or the pig. In the canine α-C domain, two RGD sequences were identified at positions homologous to both the rat and human RGDS. In the Syrian hamster, a single RGD sequence was found at the same position as that of the rat. Triplication of the RGD sequences was seen in the murine fibrinogen α-C domain around the site homologous to the rat RGDS sequence. These findings are of some interest from the point of view of structure-function and evolutionary relationships in the mammalian fibrinogen Aα-chains.


1979 ◽  
Author(s):  
Takashi Morita ◽  
Craig Jackson

Bovine Factor X is eluted in two forms (X1 and X2) from anion exchange chromatographic columns. These two forms have indistinguishable amino acid compositions, molecular weights and specific activities. The amino acid sequences containing the γ-carboxyglutamic acid residues have been shown to be identical in X1 and X2 (H. Morris, personal communication). An activation peptide is released from the N-terminal region of the heavy chain of Factor X by an activator from Russell’s viper venom. This peptide can be isolated after activation by gel filtration on Sephadex G-100 under nondenaturing conditions. The activation peptides from a mixture of Factors X1 and X2 were separated into two forms by anion-exchange chromatography. The activation peptide (AP1) which eluted first was shown to be derived from Factor X1, while the activation peptide (AP2) which eluted second was shown to be derived from X2, on the basis of chromatographic separations carried out on Factors X1 and X2 separately. Factor Xa was eluted as a symmetrical single peak. On the basis of these and other data characterizing these products, we conclude that the differences between X1 and X2 are properties of the structures of the activation peptides. (Supported by grant HL 12820 from the National Heart, Lung and Blood Institute. C.M.J. is an Established Investigator of the American Heart Association.)


2020 ◽  
Vol 44 (3) ◽  
pp. 177-189
Author(s):  
Momir Dunjic ◽  
Stefano Turini ◽  
Dejan Krstic ◽  
Katarina Dunjic ◽  
Marija Dunjic ◽  
...  

Radiofrequency therapy is an unconventional method, already applied for some time, with results reported across numerous clinical pictures. Our group has developed software, later called SONGENPROT-SOLARIS, capable of directly converting nucleotide sequences (DNA and/or RNA) and amino acid sequences (polypeptides and proteins) into musical sequences, based on mathematical matrices designed by the French physicist and musician Joel Sternheimer, which allow a musical note to be associated with a nucleotide or an amino acid. The innovation in our software is that its algorithm directly implements a variant that allows the reproduction of sounds phase-shifted by 30 Hz between one ear and the other, reproducing the phenomenon of binaural tones, capable of inducing a specific brain activity and also the release of particles called solitons. Thanks to this software we have developed a technique called MMT (Molecular Music Therapy), and we are currently applying the technique to a cohort of 91 patients with a wide spectrum of clinical pictures, examining them with the Bi-Digital O-Ring Test (BDORT) before and after treatment with MMT. The aim of the project is to stimulate the expression of a specific gene (the same genetic sequence that the patient listens to, translated into music) only through the use of sound sequences. We have concentrated our attention on three main molecules: Sirtuin-1, telomeres and TP-53. The results obtained with BDORT after treatment with MMT showed a significant increase in the values of the three molecules in all the examined patients, demonstrating the operative efficacy of the technique and its applicability to numerous diseases. In order to confirm the data obtained by BDORT, we propose, with the help of an accredited laboratory, to perform epigenetic tests on the three parameters listed above, paving the way to understanding how frequencies can influence gene expression.


2019 ◽  
Vol 26 (7) ◽  
pp. 542-549 ◽  
Author(s):  
Shan Shan Hao ◽  
Man Man Zong ◽  
Ze Zhang ◽  
Jia Xi Cai ◽  
Yang Zheng ◽  
...  

Background: The bursa of Fabricius is the acknowledged central humoral immune organ. Bursal-derived peptides play important roles in immature B cell development and antibody production. Objective: Here we explored the functions of a newly isolated bursal hexapeptide and pentapeptide on the humoral and cellular immune responses and antigen presentation to Avian Influenza Virus (AIV) vaccine in mouse immunization. Methods: The bursa extract samples were purified by the RP-HPLC method and analyzed with MS/MS to identify the amino acid sequences. Mice were twice subcutaneously injected with AIV inactivated vaccine plus each of the two newly isolated bursal peptides at three dosages. Two weeks after the second immunization, serum samples were collected from the immunized mice to measure AIV-specific IgG antibody levels and HI antibody titers. Also, on the 7th day after the second immunization, lymphocytes were isolated from the immunized mice to detect T cell subtypes and lymphocyte viabilities, and the expression of co-stimulatory molecules on dendritic cells. Results: A new bursal hexapeptide and a new bursal pentapeptide, with amino acid sequences KGNRVY and MPPTH, respectively, were isolated. Our investigation demonstrated the strong regulatory roles of the bursal hexapeptide on AIV-specific IgG levels, HI antibody titers, and lymphocyte viabilities, as well as significantly increased T cell subpopulations and expression of the MHCII molecule on dendritic cells in the immunized mice. Moreover, our findings verified significantly enhanced AIV-specific IgG antibody and HI titers, and strongly increased T cell subpopulations and expression of the CD40 molecule on dendritic cells, in the mice immunized with AIV vaccine and the bursal pentapeptide.
Conclusion: We isolated and identified a new hexapeptide and a new pentapeptide from the bursa, and showed that these two bursal peptides effectively induced AIV-specific antibody, T cell and antigen presentation immune responses, which provides an experimental basis for the further clinical application of bursal-derived active peptides in vaccine improvement.


2019 ◽  
Vol 20 (4) ◽  
pp. 309-316 ◽  
Author(s):  
Pritam Chattopadhyay ◽  
Goutam Banerjee

Background: Several strains of Klebsiella pneumoniae are responsible for causing pneumonia in the lung and thereby causing death in immune-suppressed patients. In recent years, a few investigations have reported an increase in the K. pneumoniae population in patients using corticosteroid-containing inhalers. Objectives: The biological mechanism(s) behind this increased incidence have not been elucidated. Therefore, the objective of this investigation was to explore the relation between Klebsiella pneumoniae and increment in carbapenemase-producing Enterobacteriaceae score (ICS). Methods: The available genomes of K. pneumoniae and the amino acid sequences of steroid catabolism pathway enzymes were taken from the NCBI database and from KEGG pathways tagged with the UniProt database, respectively. We used different BLAST algorithms (tBLASTn, BLASTp, psiBLAST, and delBLAST) to identify enzymes (by their amino acid sequence) involved in steroid catabolism. Results: A total of 13 enzymes (taken from different bacterial candidates) responsible for corticosteroid degradation have been identified in the genome of K. pneumoniae. Finally, 8 enzymes (K. pneumoniae specific) were detected in four clinical strains of K. pneumoniae. This investigation suggests that this ability to catabolize corticosteroids could potentially be one mechanism behind the increased pneumonia incidence. Conclusion: The presence of corticosteroid catabolism enzymes in K. pneumoniae enhances its ability to utilize corticosteroids as a nutrient source. This is the first report to demonstrate the corticosteroid degradation pathway in clinical strains of K. pneumoniae.


2020 ◽  
Vol 17 (1) ◽  
pp. 59-77
Author(s):  
Anand Kumar Nelapati ◽  
JagadeeshBabu PonnanEttiyappan

Background: Hyperuricemia and gout are conditions that result from the accumulation of uric acid in the blood and urine. Uric acid is the product of the purine metabolic pathway in humans. Uricase is a therapeutic enzyme that can enzymatically reduce the concentration of uric acid in serum and urine by converting it into the more soluble allantoin. Uricases are widely available from several sources such as bacteria, fungi, yeast, plants and animals. Objective: The present study is aimed at elucidating the structure and physicochemical properties of uricase by in silico analysis. Methods: A total of sixty amino acid sequences of uricase belonging to different sources were obtained from NCBI, and different analyses such as Multiple Sequence Alignment (MSA), homology search, phylogenetic relation, motif search, domain architecture and physicochemical properties including pI, EC, Ai and Ii were performed. Results: Multiple sequence alignment of all the selected protein sequences exhibited distinct differences between bacterial, fungal, plant and animal sources based on the position-specific existence of conserved amino acid residues. The maximum homology of all the selected protein sequences is between 51-388. Within individual categories, homology is between 16-337 for bacterial uricase, 14-339 for fungal uricase, 12-317 for plant uricase, and 37-361 for animal uricase. The phylogenetic tree constructed based on the amino acid sequences disclosed clusters indicating that uricase is from different sources. The physicochemical features revealed that the uricase sequences are between 300 and 338 amino acid residues long, with molecular weights of 33-39 kDa and theoretical pI ranging from 4.95 to 8.88.
The amino acid composition results showed that valine has a high average frequency of 8.79 percent compared with the other amino acids in all analyzed species. Conclusion: In the field of bioinformatics, this work might be informative and a stepping-stone for other researchers to get an idea about the physicochemical features, evolutionary history and structural motifs of uricase that can be widely used in the biotechnological and pharmaceutical industries. Therefore, the proposed in silico analysis can be considered for protein engineering work, as well as for gout therapy.
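An average amino acid frequency such as the valine figure reported above can be computed along the following lines. This is a minimal sketch assuming that "average frequency" means the mean of per-sequence percentages; the abstract does not state the exact procedure, and the function names and toy sequences are hypothetical, not data from the study.

```python
from collections import Counter


def amino_acid_composition(sequence):
    """Percentage of each residue in one non-empty protein sequence."""
    counts = Counter(sequence.upper())
    total = sum(counts.values())
    return {residue: 100.0 * n / total for residue, n in counts.items()}


def average_composition(sequences):
    """Mean per-sequence percentage of each residue across all sequences."""
    compositions = [amino_acid_composition(s) for s in sequences]
    residues = set().union(*compositions)
    return {
        residue: sum(c.get(residue, 0.0) for c in compositions) / len(compositions)
        for residue in residues
    }


# Toy sequences for illustration only.
averages = average_composition(["VVAG", "VAAA"])
valine_percent = averages["V"]
```

Averaging per-sequence percentages weights every sequence equally; pooling all residues before dividing would instead weight longer sequences more heavily.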

