CICERO: A Domain-Specific Architecture for Efficient Regular Expression Matching

2021
Vol 20 (5s)
pp. 1-24
Author(s):
Daniele Parravicini
Davide Conficconi
Emanuele Del Sozzo
Christian Pilato
Marco D. Santambrogio

Regular Expression (RE) matching is a computational kernel used in several applications. Since RE complexity and data volumes are steadily increasing, hardware acceleration is gaining attention for this problem as well. Existing approaches have limited flexibility, as they require a different implementation for each RE. On the other hand, it is complex to map efficient RE representations, such as non-deterministic finite-state automata (NFAs), onto software-programmable engines or parallel architectures. In this work, we present CICERO, an end-to-end framework composed of a domain-specific architecture and a companion compilation framework for RE matching. Our solution is suitable for many applications, such as genomics/proteomics and natural language processing. CICERO exploits the intrinsic parallelism of non-deterministic representations of REs. Thanks to its programmable architecture and compilation framework, CICERO can trade off accelerator efficiency against processor flexibility. We implemented CICERO prototypes on embedded FPGAs, achieving up to 28.6× and 20.8× better energy efficiency than embedded and mainstream processors, respectively. Since CICERO is a programmable architecture, it can also be implemented as a custom ASIC that is orders of magnitude more energy-efficient than mainstream processors.
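
As a software-level illustration of the non-deterministic matching that architectures like CICERO accelerate, the following Python sketch simulates an NFA by advancing all active states in lockstep on each input character; in hardware, these per-state updates can happen concurrently. The NFA encoding used here is assumed for illustration and does not reflect CICERO's actual instruction set or compiler output.

```python
# NFA encoding assumed for illustration: state -> list of
# (symbol, next_state) edges; symbol None marks an epsilon edge.

def epsilon_closure(states, nfa):
    """Expand a state set with everything reachable via epsilon edges."""
    stack, closure = list(states), set(states)
    while stack:
        s = stack.pop()
        for symbol, nxt in nfa.get(s, []):
            if symbol is None and nxt not in closure:
                closure.add(nxt)
                stack.append(nxt)
    return closure

def nfa_match(nfa, start, accepting, text):
    """Simulate every NFA path at once, one input character at a time."""
    active = epsilon_closure({start}, nfa)
    for ch in text:
        # All active states take their matching edges "simultaneously";
        # this per-character, all-states update is what a hardware
        # engine can perform in parallel.
        active = epsilon_closure(
            {nxt for s in active
                 for symbol, nxt in nfa.get(s, []) if symbol == ch},
            nfa)
        if not active:
            return False
    return bool(active & accepting)

# NFA for the RE "ab*c": 0 -a-> 1, 1 -b-> 1, 1 -c-> 2
nfa = {0: [("a", 1)], 1: [("b", 1), ("c", 2)]}
assert nfa_match(nfa, 0, {2}, "abbbc")
assert not nfa_match(nfa, 0, {2}, "abd")
```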

2014
Vol 556-562
pp. 1730-1736
Author(s):
Cheng Qing Guo
Jun Feng Xu

With the rapid growth of network bandwidth, the matching performance of regular expressions has become crucially important for network security. There are many hardware acceleration designs for regular expression matching based on NFAs and DFAs; NFA designs require more logic-circuit resources, while DFA designs require more memory resources. Moreover, because a DFA can have very many states and transition edges, its performance can fall well behind that of an NFA. In this paper, we design a DFA-based regular expression matching algorithm implemented entirely in FPGA logic. The algorithm exploits the observation that many transitions out of a DFA state share the same next-state pointer, so assigning each state a default transition reduces the amount of logic and simplifies the circuit. To evaluate performance, the algorithm was mapped onto an Altera Cyclone FPGA and tested on the L7-filter rule set. The DFA algorithm achieved performance comparable to the NFA algorithm. Experimental results show that, compared with the NFA algorithm, 10% of the rules achieved higher throughput under the improved DFA design (up to 60% higher in the best case), while 62% of the rules used fewer logic resources (saving up to 87% in the best case).
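
The default-transition idea can be pictured in a few lines of Python: for each state, the most common next state becomes that state's default, and only the exceptional transitions are stored explicitly. This is a minimal software sketch of the compression principle, not the paper's FPGA circuit; the DFA table format is assumed for illustration.

```python
from collections import Counter

def compress(dfa):
    """dfa: {state: {symbol: next_state}} -> (defaults, exceptions).

    The most common next state of each row becomes that state's
    default; only transitions that differ are kept explicitly.
    """
    defaults, exceptions = {}, {}
    for state, row in dfa.items():
        default = Counter(row.values()).most_common(1)[0][0]
        defaults[state] = default
        exceptions[state] = {sym: nxt for sym, nxt in row.items() if nxt != default}
    return defaults, exceptions

def step(defaults, exceptions, state, symbol):
    """One DFA transition on the compressed table."""
    return exceptions[state].get(symbol, defaults[state])

# Toy DFA over {a, b, c}; state 2 is accepting.
dfa = {0: {"a": 1, "b": 0, "c": 0},
       1: {"a": 1, "b": 2, "c": 0},
       2: {"a": 1, "b": 0, "c": 0}}
defaults, exceptions = compress(dfa)
state = 0
for ch in "aab":
    state = step(defaults, exceptions, state, ch)
assert state == 2  # "aab" reaches the accepting state
```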


2021
Vol 11 (2)
pp. 283-302
Author(s):
Paul Meurer

I describe several new efficient algorithms for querying large annotated corpora. The search algorithms implemented in several popular corpus search engines are suboptimal in two respects: regular expression string matching in the lexicon is done in time linear in the size of the lexicon, and regular expressions over corpus positions are evaluated starting from those corpus positions that match the constraints of the initial edges of the corresponding network. To address these shortcomings, I have developed an algorithm for regular expression matching on suffix arrays that allows fast lexicon lookup, and a technique for running finite state automata starting from the edges with the lowest corpus counts. Implementing the lexicon as a suffix array also lends itself to an elegant and efficient treatment of multi-valued and set-valued attributes. The described techniques have been implemented in a fully functional corpus management system and are also used in a treebank query system.
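
The core benefit of a suffix-array lexicon is that a literal prefix (and, by extension, the literal parts of a regular expression) can be located by binary search rather than by a linear scan. The Python sketch below shows prefix lookup on a suffix array; it is a simplified stand-in for the paper's full regex-on-suffix-array algorithm, and the materialized suffix list is for clarity only.

```python
import bisect

def build_suffix_array(text):
    """Start positions of all suffixes, sorted lexicographically."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def find_prefix(text, sa, prefix):
    """Return all positions where `prefix` occurs, via binary search."""
    # Materializing the suffixes keeps the sketch short; a real
    # implementation compares against `text` in place.
    suffixes = [text[i:] for i in sa]
    lo = bisect.bisect_left(suffixes, prefix)
    # "\uffff" as an upper sentinel assumes it never occurs in `text`.
    hi = bisect.bisect_right(suffixes, prefix + "\uffff")
    return sorted(sa[lo:hi])

text = "banana"
sa = build_suffix_array(text)
assert find_prefix(text, sa, "ana") == [1, 3]  # "ana" occurs at positions 1 and 3
```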


2013
Vol 7 (1)
pp. 46-50
Author(s):
Linhai Cui
Yusen Qin
Fanyang Kong
Kaihong Yu

This paper presents an efficient method for Regular Expression Matching (REM) that reuses Intellectual Property (IP) cores in a new Network-on-Chip (NoC) architecture. The method is to design a reusable IP core consisting of many engine cells for REM and to implement each engine cell on a Field Programmable Gate Array (FPGA) as a prototype. To simplify the Finite State Machine (FSM), a new approach for partitioning a regular expression into several smaller parts is proposed. During matching, each part of a regular expression is matched by an engine cell, and engine cells communicate with one another through routers in the NoC topology. The proposed NoC architecture is a general-purpose design suitable for different rule libraries in deep packet inspection (DPI). It also addresses the problem of missed matches caused by character self-depletion. A way to use both the logic cells and the RAM available on FPGA devices is described, which makes it easier to update the regular expression rule library stored in RAM. Finally, an implementation of the NoC architecture as an application-specific integrated circuit (ASIC) is discussed.
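
The partitioning step can be pictured as cutting the pattern at top-level concatenation boundaries so that each fragment can be assigned to its own engine cell. The Python sketch below is an illustrative stand-in for the paper's partitioning approach, not its actual algorithm; it ignores escapes, counted repetitions, and top-level alternation.

```python
import re

def partition_regex(pattern, max_parts=4):
    """Split `pattern` into up to `max_parts` concatenated fragments.

    Illustrative only: assumes no top-level alternation, escapes,
    or counted repetitions; cuts are made only at nesting depth 0
    so groups and character classes stay intact.
    """
    boundaries, depth = [], 0
    for i, ch in enumerate(pattern):
        if ch in "([":
            depth += 1
        elif ch in ")]":
            depth -= 1
        elif depth == 0 and i > 0 and pattern[i - 1] not in "(|" and ch not in "*+?|":
            boundaries.append(i)  # a safe top-level cut point
    parts, prev = [], 0
    step = max(1, len(boundaries) // max_parts)
    for cut in boundaries[step - 1::step][:max_parts - 1]:
        parts.append(pattern[prev:cut])
        prev = cut
    parts.append(pattern[prev:])
    return parts

parts = partition_regex(r"(ab)+c[0-9]xyz", max_parts=3)
assert "".join(parts) == r"(ab)+c[0-9]xyz"  # fragments re-concatenate losslessly
assert all(re.compile(p) for p in parts)    # each fragment is a valid RE on its own
```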


2017
Vol 9 (1)
pp. 19-24
Author(s):
David Domarco
Ni Made Satvika Iswari

Technology development has affected many areas of life, especially the entertainment field. One of the fastest-growing entertainment industries is anime. Anime has evolved into a trend and a hobby, especially in Asia. The number of anime fans grows every year, and they try to dig up as much information as possible about their favorite anime. Therefore, a chatbot application was developed in this study as an anime information retrieval medium using regular expression pattern matching. The application is intended to make it easier for anime fans to search for information about the anime they like. With this application, users gain convenient, interactive anime data retrieval that cannot be obtained through general-purpose search engines. The chatbot met the standards of an information retrieval engine with very good results: 72% precision and 100% recall, for a harmonic mean (F-measure) of 83.7%. As a hedonic application, the chatbot scored 83% on Behavioral Intention to Use and 82% on Immersion.
Index Terms: anime, chatbot, information retrieval, Natural Language Processing (NLP), Regular Expression Pattern Matching
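
As an illustration of the regular expression pattern matching style of chatbot the abstract describes, here is a minimal rule-based sketch in Python. The patterns, responses, and the tiny anime "database" are hypothetical placeholders, not the study's actual rules or data.

```python
import re

# Hypothetical rules and data, for illustration only.
ANIME_DB = {"one piece": "Action/adventure series, airing since 1999."}

RULES = [
    (re.compile(r"\b(?:who|what)\s+is\s+(?P<title>.+?)\s*\??$", re.I),
     lambda m: ANIME_DB.get(m.group("title").lower(),
                            "Sorry, I don't know that one.")),
    (re.compile(r"\b(?:hi|hello)\b", re.I),
     lambda m: "Hello! Ask me about an anime title."),
]

def reply(message):
    """Answer with the response of the first rule whose pattern matches."""
    for pattern, responder in RULES:
        match = pattern.search(message)
        if match:
            return responder(match)
    return "I didn't understand that. Try: what is <title>?"

print(reply("What is One Piece?"))  # Action/adventure series, airing since 1999.
print(reply("hello"))               # Hello! Ask me about an anime title.
```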


2021
Vol 21 (1)
Author(s):
Pilar López-Úbeda
Alexandra Pomares-Quimbaya
Manuel Carlos Díaz-Galiano
Stefan Schulz

Background: Controlled vocabularies are fundamental resources for information extraction from clinical texts using natural language processing (NLP). Standard language resources available in the healthcare domain, such as the UMLS Metathesaurus or SNOMED CT, are widely used for this purpose, but they suffer from limitations such as the lexical ambiguity of clinical terms. However, most such terms are unambiguous within text limited to a given clinical specialty. This is one rationale, among others, for classifying clinical texts by the clinical specialty to which they belong.

Results: This paper addresses this limitation by proposing and applying a method that automatically extracts Spanish medical terms classified and weighted per sub-domain, using Spanish MEDLINE titles and abstracts as input. The hypothesis is that biomedical NLP tasks benefit from collections of domain terms that are specific to clinical sub-domains. We use PubMed queries to generate sub-domain-specific corpora from Spanish titles and abstracts, from which token n-grams are collected and metrics of relevance, discriminatory power, and broadness per sub-domain are computed. The resulting term set, called the Spanish core vocabulary about clinical specialties (SCOVACLIS), was made available to the scientific community and used in a text classification problem, obtaining an improvement of 6 percentage points in F-measure over the baseline using a Multilayer Perceptron, thus supporting the hypothesis that a specialized term set improves NLP tasks.

Conclusion: The creation and validation of SCOVACLIS support the hypothesis that specific term sets reduce the level of ambiguity compared to a specialty-independent, broad-scope vocabulary.
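
The extraction pipeline can be pictured as: build one corpus per sub-domain, collect token n-grams, and score each n-gram by how strongly it discriminates its sub-domain. The Python sketch below uses a simple relative-frequency ratio as an illustrative stand-in for the paper's relevance, discriminatory power, and broadness metrics; the toy corpora are hypothetical.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-token sequences, joined with spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def domain_term_scores(corpora, n=2):
    """corpora: {subdomain: [document strings]} -> per-domain n-gram scores.

    Score = relative frequency within the sub-domain divided by the
    relative frequency in the whole collection (an illustrative
    stand-in for the paper's metrics).
    """
    per_domain = {d: Counter() for d in corpora}
    overall = Counter()
    for domain, docs in corpora.items():
        for doc in docs:
            grams = ngrams(doc.lower().split(), n)
            per_domain[domain].update(grams)
            overall.update(grams)
    grand_total = sum(overall.values())
    scores = {}
    for domain, counts in per_domain.items():
        total = sum(counts.values())
        scores[domain] = {g: (c / total) / (overall[g] / grand_total)
                          for g, c in counts.items()}
    return scores

corpora = {"cardiology": ["acute myocardial infarction treated with stents"],
           "neurology": ["acute ischemic stroke treated with thrombolysis"]}
scores = domain_term_scores(corpora)
# A sub-domain-specific bigram outscores one shared across sub-domains.
assert scores["cardiology"]["myocardial infarction"] > scores["cardiology"]["treated with"]
```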


2021
Author(s):
Nan Jiang
Ping Lin
Yulong He
Zhuozhi Tan
Jin Hu
