Pattern matching for high level languages

Abstract Background Pattern matching is a key step in a variety of biological sequence analysis pipelines. The FM-index is a compressed data structure for pattern matching, with search run time that is independent of the length of the database text. Implementation of the FM-index is reasonably complicated, so that increased adoption will be aided by the availability of a fast and flexible FM-index library. Results We present AvxWindowedFMindex (AWFM-index), a lightweight, open-source, thread-parallel FM-index library written in C that is optimized for indexing nucleotide and amino acid sequences. AWFM-index introduces a new approach to storing FM-index data in a strided bit-vector format that enables extremely efficient computation of the FM-index occurrence function via AVX2 bitwise instructions, and combines this with optional on-disk storage of the index’s suffix array and a cache-efficient lookup table for partial k-mer searches. The AWFM-index performs exact match count and locate queries faster than SeqAn3’s FM-index implementation across a range of comparable memory footprints. When optimized for speed, AWFM-index is $$\sim $$ ∼ 2–4x faster than SeqAn3 for nucleotide search, and $$\sim $$ ∼ 2–6x faster for amino acid search; it is also $$\sim $$ ∼ 4x faster with similar memory footprint when storing the suffix array in on-disk SSD storage. Conclusions AWFM-index is easy to incorporate into bioinformatics software, offers run-time performance parameterization, and provides clients with FM-index functionality at both a high-level (count or locate all instances of a query string) and low-level (step-wise control of the FM-index backward-search process). The open-source library is available for download at https://github.com/TravisWheelerLab/AvxWindowFmIndex.

Download Full-text

The view from the left

Journal of Functional Programming ◽

10.1017/s0956796803004829 ◽

2004 ◽

Vol 14 (1) ◽

pp. 69-111 ◽

Cited By ~ 128

Author(s):

CONOR MCBRIDE ◽

JAMES MCKINNA

Keyword(s):

Pattern Matching ◽

Type Theory ◽

Dependent Type Theory ◽

Induction Principle ◽

The Rich ◽

The Right ◽

High Level ◽

General User ◽

Dependent Type ◽

Do So

Pattern matching has proved an extremely powerful and durable notion in functional programming. This paper contributes a new programming notation for type theory which elaborates the notion in various ways. First, as is by now quite well-known in the type theory community, definition by pattern matching becomes a more discriminating tool in the presence of dependent types, since it refines the explanation of types as well as values. This becomes all the more true in the presence of the rich class of datatypes known as inductive families (Dybjer, 1991). Secondly, as proposed by Peyton Jones (1997) for Haskell, and independently rediscovered by us, subsidiary case analyses on the results of intermediate computations, which commonly take place on the right-hand side of definitions by pattern matching, should rather be handled on the left. In simply-typed languages, this subsumes the trivial case of Boolean guards; in our setting it becomes yet more powerful. Thirdly, elementary pattern matching decompositions have a well-defined interface given by a dependent type; they correspond to the statement of an induction principle for the datatype. More general, user-definable decompositions may be defined which also have types of the same general form. Elementary pattern matching may therefore be recast in abstract form, with a semantics given by translation. Such abstract decompositions of data generalize Wadler's (1987) notion of ‘view’. The programmer wishing to introduce a new view of a type $\mathit{T}$, and exploit it directly in pattern matching, may do so via a standard programming idiom. The type theorist, looking through the Curry–Howard lens, may see this as proving a theorem, one which establishes the validity of a new induction principle for $\mathit{T}$. We develop enough syntax and semantics to account for this high-level style of programming in dependent type theory. We close with the development of a typechecker for the simply-typed lambda calculus, which furnishes a view of raw terms as either being well-typed, or containing an error. The implementation of this view is ipso facto a proof that typechecking is decidable.

Download Full-text

Verification of Program Transformations with Inductive Refinement Types

ACM Transactions on Software Engineering and Methodology ◽

10.1145/3409805 ◽

2021 ◽

Vol 30 (1) ◽

pp. 1-33

Author(s):

Ahmad Salim Al-Sibahi ◽

Thomas P. Jensen ◽

Aleksandar S. Dimovski ◽

Andrzej Wąsowski

Keyword(s):

Pattern Matching ◽

Abstract Interpretation ◽

Operational Semantics ◽

Type Inference ◽

Program Transformations ◽

Inductive Type ◽

Refinement Types ◽

Code Generators ◽

High Level ◽

Abstract Syntax Trees

High-level transformation languages like Rascal include expressive features for manipulating large abstract syntax trees: first-class traversals, expressive pattern matching, backtracking, and generalized iterators. We present the design and implementation of an abstract interpretation tool, Rabit, for verifying inductive type and shape properties for transformations written in such languages. We describe how to perform abstract interpretation based on operational semantics, specifically focusing on the challenges arising when analyzing the expressive traversals and pattern matching. Finally, we evaluate Rabit on a series of transformations (normalization, desugaring, refactoring, code generators, type inference, etc.) showing that we can effectively verify stated properties.

Download Full-text

ASCII based Sequential Multiple Pattern Matching Algorithm for High Level Cloning

International Journal of Advanced Computer Science and Applications ◽

10.14569/ijacsa.2017.080635 ◽

2017 ◽

Vol 8 (6) ◽

Author(s):

Manu Singh ◽

Vidushi Sharma

Keyword(s):

Pattern Matching ◽

Matching Algorithm ◽

Pattern Matching Algorithm ◽

High Level ◽

Multiple Pattern Matching

Download Full-text

Pattern Matching and Discourse Processing in Information Extraction from Japanese Text

Journal of Artificial Intelligence Research ◽

10.1613/jair.53 ◽

1994 ◽

Vol 2 ◽

pp. 89-110 ◽

Cited By ~ 2

Author(s):

T. Kitani ◽

Y. Eriguchi ◽

M. Hara

Keyword(s):

Information Extraction ◽

Pattern Matching ◽

Human Performance ◽

Discourse Processing ◽

Second Step ◽

Sentence Level ◽

Key Word ◽

Word Search ◽

High Level ◽

Information Extraction System

Information extraction is the task of automaticallypicking up information of interest from an unconstrained text. Informationof interest is usually extracted in two steps. First, sentence level processing locates relevant pieces of information scatteredthroughout the text; second, discourse processing merges coreferential information to generate the output. In the first step, pieces of information are locally identified without recognizing any relationships among them. A key word search or simple patternsearch can achieve this purpose. The second step requires deeperknowledge in order to understand relationships among separately identified pieces of information. Previous information extraction systems focused on the first step, partly because they were not required to link up each piece of information with other pieces. To link the extracted pieces of information and map them onto a structuredoutput format, complex discourse processing is essential. This paperreports on a Japanese information extraction system that merges information using a pattern matcher and discourse processor. Evaluationresults show a high level of system performance which approaches human performance.

Download Full-text

Automating the generation of lexical patterns for processing free text in clinical documents

Journal of the American Medical Informatics Association ◽

10.1093/jamia/ocv012 ◽

2015 ◽

Vol 22 (5) ◽

pp. 980-986 ◽

Cited By ~ 2

Author(s):

Frank Meng ◽

Craig Morioka

Keyword(s):

Pattern Matching ◽

Language Processing ◽

Free Text ◽

Syntactic Parsing ◽

Language Usage ◽

Multiple Sequence ◽

Manual Intervention ◽

Data Elements ◽

High Level ◽

Matching Techniques

Abstract Objective Many tasks in natural language processing utilize lexical pattern-matching techniques, including information extraction (IE), negation identification, and syntactic parsing. However, it is generally difficult to derive patterns that achieve acceptable levels of recall while also remaining highly precise. Materials and Methods We present a multiple sequence alignment (MSA)-based technique that automatically generates patterns, thereby leveraging language usage to determine the context of words that influence a given target. MSAs capture the commonalities among word sequences and are able to reveal areas of linguistic stability and variation. In this way, MSAs provide a systemic approach to generating lexical patterns that are generalizable, which will both increase recall levels and maintain high levels of precision. Results The MSA-generated patterns exhibited consistent F1-, F.5-, and F2- scores compared to two baseline techniques for IE across four different tasks. Both baseline techniques performed well for some tasks and less well for others, but MSA was found to consistently perform at a high level for all four tasks. Discussion The performance of MSA on the four extraction tasks indicates the method’s versatility. The results show that the MSA-based patterns are able to handle the extraction of individual data elements as well as relations between two concepts without the need for large amounts of manual intervention. Conclusion We presented an MSA-based framework for generating lexical patterns that showed consistently high levels of both performance and recall over four different extraction tasks when compared to baseline methods.

Download Full-text

Apakah Kecukupan Pengungkapan telah Dipertimbangkan oleh Badan Pemeriksa Keuangan dalam Perumusan Opini?

Jurnal Akuntansi dan Akuntabilitas Publik ◽

10.22146/jaap.35329 ◽

2018 ◽

Vol 1 (1) ◽

pp. 17

Author(s):

Muhammad Iqbal ◽

Gudono Gudono ◽

Irwan Taufiq Ritonga

Keyword(s):

Pattern Matching ◽

Testing Procedure ◽

The Other ◽

Financial Statement ◽

Mandatory Disclosure ◽

Political Pressure ◽

Matching Technique ◽

High Level ◽

Main Factor

The objective of this study is to assess the disclosure level of Local Government Financial Statement (LGFS) of 2013 and 2014 that achieved unqualified opinion. This study also aims to identify the factors that caused ignorance of disclosure adequacy criteria by SAI auditor in formulating their opinion. A mandatory disclosure scoring technique was applied based on the criteria of the newest Government Compliance Index (GCI) to assess the LGFS disclosure level. This study also implemented pattern matching technique and plausible rival explanation strategy to identify the main factor that caused the case study problem. The result of scoring proves that the average of LGFS mandatory disclosure level are still low, i.e. 53.79% and 56.14% for 2013 and 2014. The pattern matching analysis reveals that SAI auditors have ignored LGFS disclosure inadequacy and decided not to modify their opinion. Meanwhile, the result of examining plausible rival explanation shows the other factors which contributed to cause problem studied, consist of insufficient implementation of disclosure testing procedure, the tolerance of high level auditors on the finding of LGFS disclosure inadequacy, and the existence of external political pressure.

Download Full-text

Transcription factor/DNA interactions visualized by electron spectroscopic imaging

Proceedings, annual meeting, Electron Microscopy Society of America ◽

10.1017/s0424820100122654 ◽

1992 ◽

Vol 50 (1) ◽

pp. 450-451

Author(s):

David P. Bazett-Jones ◽

Mark L. Brown

Keyword(s):

Transcription Factor ◽

Dna Sequences ◽

Transcription Initiation ◽

Molecular Mechanisms ◽

Phosphorus Content ◽

Spectroscopic Imaging ◽

Electron Spectroscopic Imaging ◽

Dna Backbone ◽

Two Parameters ◽

High Level

A multisubunit RNA polymerase enzyme is ultimately responsible for transcription initiation and elongation of RNA, but recognition of the proper start site by the enzyme is regulated by general, temporal and gene-specific trans-factors interacting at promoter and enhancer DNA sequences. To understand the molecular mechanisms which precisely regulate the transcription initiation event, it is crucial to elucidate the structure of the transcription factor/DNA complexes involved. Electron spectroscopic imaging (ESI) provides the opportunity to visualize individual DNA molecules. Enhancement of DNA contrast with ESI is accomplished by imaging with electrons that have interacted with inner shell electrons of phosphorus in the DNA backbone. Phosphorus detection at this intermediately high level of resolution (≈lnm) permits selective imaging of the DNA, to determine whether the protein factors compact, bend or wrap the DNA. Simultaneously, mass analysis and phosphorus content can be measured quantitatively, using adjacent DNA or tobacco mosaic virus (TMV) as mass and phosphorus standards. These two parameters provide stoichiometric information relating the ratios of protein:DNA content.

Download Full-text

Advances in Stem Instrumentation for Biology

Proceedings, annual meeting, Electron Microscopy Society of America ◽

10.1017/s0424820100180562 ◽

1990 ◽

Vol 48 (1) ◽

pp. 362-363

Author(s):

J. S. Wall

Keyword(s):

Scanning Transmission Electron Microscope ◽

Carbon Films ◽

Specimen Preparation ◽

Material Deposition ◽

Active User ◽

Percent Improvement ◽

Scanning Transmission ◽

The Moment ◽

Film Substrate ◽

High Level

The forte of the Scanning transmission Electron Microscope (STEM) is high resolution imaging with high contrast on thin specimens, as demonstrated by visualization of single heavy atoms. of equal importance for biology is the efficient utilization of all available signals, permitting low dose imaging of unstained single molecules such as DNA.Our work at Brookhaven has concentrated on: 1) design and construction of instruments optimized for a narrow range of biological applications and 2) use of such instruments in a very active user/collaborator program. Therefore our program is highly interactive with a strong emphasis on producing results which are interpretable with a high level of confidence.The major challenge we face at the moment is specimen preparation. The resolution of the STEM is better than 2.5 A, but measurements of resolution vs. dose level off at a resolution of 20 A at a dose of 10 el/A2 on a well-behaved biological specimen such as TMV (tobacco mosaic virus). To track down this problem we are examining all aspects of specimen preparation: purification of biological material, deposition on the thin film substrate, washing, fast freezing and freeze drying. As we attempt to improve our equipment/technique, we use image analysis of TMV internal controls included in all STEM samples as a monitor sensitive enough to detect even a few percent improvement. For delicate specimens, carbon films can be very harsh-leading to disruption of the sample. Therefore we are developing conducting polymer films as alternative substrates, as described elsewhere in these Proceedings. For specimen preparation studies, we have identified (from our user/collaborator program ) a variety of “canary” specimens, each uniquely sensitive to one particular aspect of sample preparation, so we can attempt to separate the variables involved.

Download Full-text