CICERO: A Domain-Specific Architecture for Efficient Regular Expression Matching

2021
Vol 20 (5s)
pp. 1-24
Author(s):
Daniele Parravicini
Davide Conficconi
Emanuele Del Sozzo
Christian Pilato
Marco D. Santambrogio

Regular Expression (RE) matching is a computational kernel used in several applications. Since RE complexity and data volumes are steadily increasing, hardware acceleration is gaining attention for this problem as well. Existing approaches have limited flexibility, as they require a different implementation for each RE. On the other hand, it is complex to map efficient RE representations, such as non-deterministic finite-state automata (NFAs), onto software-programmable engines or parallel architectures. In this work, we present CICERO, an end-to-end framework composed of a domain-specific architecture and a companion compilation framework for RE matching. Our solution is suitable for many applications, such as genomics/proteomics and natural language processing. CICERO exploits the intrinsic parallelism of non-deterministic representations of REs. Thanks to its programmable architecture and compilation framework, CICERO can trade off accelerator efficiency against processor flexibility. We implemented CICERO prototypes on embedded FPGAs, achieving up to 28.6× and 20.8× better energy efficiency than embedded and mainstream processors, respectively. Since CICERO is a programmable architecture, it can also be implemented as a custom ASIC that is orders of magnitude more energy-efficient than mainstream processors.
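
As a software-level illustration of the non-deterministic matching that architectures like CICERO accelerate, the following Python sketch simulates an NFA by advancing all active states in lockstep on each input character; in hardware, these per-state updates can happen concurrently. The NFA encoding used here is assumed for illustration and does not reflect CICERO's actual instruction set or compiler output.

```python
# NFA encoding assumed for illustration: state -> list of
# (symbol, next_state) edges; symbol None marks an epsilon edge.

def epsilon_closure(states, nfa):
    """Expand a state set with everything reachable via epsilon edges."""
    stack, closure = list(states), set(states)
    while stack:
        s = stack.pop()
        for symbol, nxt in nfa.get(s, []):
            if symbol is None and nxt not in closure:
                closure.add(nxt)
                stack.append(nxt)
    return closure

def nfa_match(nfa, start, accepting, text):
    """Simulate every NFA path at once, one input character at a time."""
    active = epsilon_closure({start}, nfa)
    for ch in text:
        # All active states take their matching edges "simultaneously";
        # this per-character, all-states update is what a hardware
        # engine can perform in parallel.
        active = epsilon_closure(
            {nxt for s in active
                 for symbol, nxt in nfa.get(s, []) if symbol == ch},
            nfa)
        if not active:
            return False
    return bool(active & accepting)

# NFA for the RE "ab*c": 0 -a-> 1, 1 -b-> 1, 1 -c-> 2
nfa = {0: [("a", 1)], 1: [("b", 1), ("c", 2)]}
assert nfa_match(nfa, 0, {2}, "abbbc")
assert not nfa_match(nfa, 0, {2}, "abd")
```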

2014
Vol 556-562
pp. 1730-1736
Author(s):
Cheng Qing Guo
Jun Feng Xu

With the rapid growth of network bandwidth, the matching performance of regular expressions has become crucially important for network security. There are many hardware acceleration designs for regular expression matching based on NFAs and DFAs; NFA designs require more logic-circuit resources, while DFA designs require more memory resources. Moreover, because a DFA can have very many states and transition edges, its performance can fall well behind that of an NFA. In this paper, we design a DFA-based regular expression matching algorithm implemented entirely in FPGA logic. The algorithm exploits the observation that many transitions out of a DFA state share the same next-state pointer, so assigning each state a default transition reduces the amount of logic and simplifies the circuit. To evaluate performance, the algorithm was mapped onto an Altera Cyclone FPGA and tested on the L7-filter rule set. The DFA algorithm achieved performance comparable to the NFA algorithm. Experimental results show that, compared with the NFA algorithm, 10% of the rules achieved higher throughput under the improved DFA design (up to 60% higher in the best case), while 62% of the rules used fewer logic resources (saving up to 87% in the best case).
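
The default-transition idea can be pictured in a few lines of Python: for each state, the most common next state becomes that state's default, and only the exceptional transitions are stored explicitly. This is a minimal software sketch of the compression principle, not the paper's FPGA circuit; the DFA table format is assumed for illustration.

```python
from collections import Counter

def compress(dfa):
    """dfa: {state: {symbol: next_state}} -> (defaults, exceptions).

    The most common next state of each row becomes that state's
    default; only transitions that differ are kept explicitly.
    """
    defaults, exceptions = {}, {}
    for state, row in dfa.items():
        default = Counter(row.values()).most_common(1)[0][0]
        defaults[state] = default
        exceptions[state] = {sym: nxt for sym, nxt in row.items() if nxt != default}
    return defaults, exceptions

def step(defaults, exceptions, state, symbol):
    """One DFA transition on the compressed table."""
    return exceptions[state].get(symbol, defaults[state])

# Toy DFA over {a, b, c}; state 2 is accepting.
dfa = {0: {"a": 1, "b": 0, "c": 0},
       1: {"a": 1, "b": 2, "c": 0},
       2: {"a": 1, "b": 0, "c": 0}}
defaults, exceptions = compress(dfa)
state = 0
for ch in "aab":
    state = step(defaults, exceptions, state, ch)
assert state == 2  # "aab" reaches the accepting state
```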


2021
Vol 11 (2)
pp. 283-302
Author(s):
Paul Meurer

I describe several new efficient algorithms for querying large annotated corpora. The search algorithms implemented in several popular corpus search engines are suboptimal in two respects: regular expression string matching in the lexicon is done in time linear in the size of the lexicon, and regular expressions over corpus positions are evaluated starting from those corpus positions that match the constraints of the initial edges of the corresponding network. To address these shortcomings, I have developed an algorithm for regular expression matching on suffix arrays that allows fast lexicon lookup, and a technique for running finite state automata starting from the edges with the lowest corpus counts. Implementing the lexicon as a suffix array also lends itself to an elegant and efficient treatment of multi-valued and set-valued attributes. The described techniques have been implemented in a fully functional corpus management system and are also used in a treebank query system.
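
The core benefit of a suffix-array lexicon is that a literal prefix (and, by extension, the literal parts of a regular expression) can be located by binary search rather than by a linear scan. The Python sketch below shows prefix lookup on a suffix array; it is a simplified stand-in for the paper's full regex-on-suffix-array algorithm, and the materialized suffix list is for clarity only.

```python
import bisect

def build_suffix_array(text):
    """Start positions of all suffixes, sorted lexicographically."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def find_prefix(text, sa, prefix):
    """Return all positions where `prefix` occurs, via binary search."""
    # Materializing the suffixes keeps the sketch short; a real
    # implementation compares against `text` in place.
    suffixes = [text[i:] for i in sa]
    lo = bisect.bisect_left(suffixes, prefix)
    # "\uffff" as an upper sentinel assumes it never occurs in `text`.
    hi = bisect.bisect_right(suffixes, prefix + "\uffff")
    return sorted(sa[lo:hi])

text = "banana"
sa = build_suffix_array(text)
assert find_prefix(text, sa, "ana") == [1, 3]  # "ana" occurs at positions 1 and 3
```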


2013
Vol 7 (1)
pp. 46-50
Author(s):
Linhai Cui
Yusen Qin
Fanyang Kong
Kaihong Yu

This paper presents an efficient method for Regular Expression Matching (REM) that reuses Intellectual Property (IP) cores in a new Network-on-Chip (NoC) architecture. The method is to design a reusable IP core consisting of many engine cells for REM and to implement each engine cell on a Field Programmable Gate Array (FPGA) as a prototype. To simplify the Finite State Machine (FSM), a new approach for partitioning a regular expression into several smaller parts is proposed. During matching, each part of a regular expression is matched by an engine cell, and engine cells communicate with one another through routers in the NoC topology. The proposed NoC architecture is a general-purpose design suitable for different rule libraries in deep packet inspection (DPI). It also addresses the problem of missed matches caused by character self-depletion. A way to use both the logic cells and the RAM available on FPGA devices is described, which makes it easier to update the regular expression rule library stored in RAM. Finally, an implementation of the NoC architecture as an application-specific integrated circuit (ASIC) is discussed.
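
The partitioning step can be pictured as cutting the pattern at top-level concatenation boundaries so that each fragment can be assigned to its own engine cell. The Python sketch below is an illustrative stand-in for the paper's partitioning approach, not its actual algorithm; it ignores escapes, counted repetitions, and top-level alternation.

```python
import re

def partition_regex(pattern, max_parts=4):
    """Split `pattern` into up to `max_parts` concatenated fragments.

    Illustrative only: assumes no top-level alternation, escapes,
    or counted repetitions; cuts are made only at nesting depth 0
    so groups and character classes stay intact.
    """
    boundaries, depth = [], 0
    for i, ch in enumerate(pattern):
        if ch in "([":
            depth += 1
        elif ch in ")]":
            depth -= 1
        elif depth == 0 and i > 0 and pattern[i - 1] not in "(|" and ch not in "*+?|":
            boundaries.append(i)  # a safe top-level cut point
    parts, prev = [], 0
    step = max(1, len(boundaries) // max_parts)
    for cut in boundaries[step - 1::step][:max_parts - 1]:
        parts.append(pattern[prev:cut])
        prev = cut
    parts.append(pattern[prev:])
    return parts

parts = partition_regex(r"(ab)+c[0-9]xyz", max_parts=3)
assert "".join(parts) == r"(ab)+c[0-9]xyz"  # fragments re-concatenate losslessly
assert all(re.compile(p) for p in parts)    # each fragment is a valid RE on its own
```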


2017
Vol 9 (1)
pp. 19-24
Author(s):
David Domarco
Ni Made Satvika Iswari

Technology development has affected many areas of life, especially the entertainment field. One of the fastest-growing entertainment industries is anime. Anime has evolved into a trend and a hobby, especially in Asia. The number of anime fans grows every year, and they try to dig up as much information as possible about their favorite anime. Therefore, a chatbot application was developed in this study as an anime information retrieval medium using regular expression pattern matching. The application is intended to make it easier for anime fans to search for information about the anime they like. With this application, users gain convenient, interactive anime data retrieval that cannot be obtained through general-purpose search engines. The chatbot met the standards of an information retrieval engine with very good results: 72% precision and 100% recall, for a harmonic mean (F-measure) of 83.7%. As a hedonic application, the chatbot scored 83% on Behavioral Intention to Use and 82% on Immersion.
Index Terms: anime, chatbot, information retrieval, Natural Language Processing (NLP), Regular Expression Pattern Matching
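
As an illustration of the regular expression pattern matching style of chatbot the abstract describes, here is a minimal rule-based sketch in Python. The patterns, responses, and the tiny anime "database" are hypothetical placeholders, not the study's actual rules or data.

```python
import re

# Hypothetical rules and data, for illustration only.
ANIME_DB = {"one piece": "Action/adventure series, airing since 1999."}

RULES = [
    (re.compile(r"\b(?:who|what)\s+is\s+(?P<title>.+?)\s*\??$", re.I),
     lambda m: ANIME_DB.get(m.group("title").lower(),
                            "Sorry, I don't know that one.")),
    (re.compile(r"\b(?:hi|hello)\b", re.I),
     lambda m: "Hello! Ask me about an anime title."),
]

def reply(message):
    """Answer with the response of the first rule whose pattern matches."""
    for pattern, responder in RULES:
        match = pattern.search(message)
        if match:
            return responder(match)
    return "I didn't understand that. Try: what is <title>?"

print(reply("What is One Piece?"))  # Action/adventure series, airing since 1999.
print(reply("hello"))               # Hello! Ask me about an anime title.
```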


2021
Vol 21 (1)
Author(s):
Pilar López-Úbeda
Alexandra Pomares-Quimbaya
Manuel Carlos Díaz-Galiano
Stefan Schulz

Background: Controlled vocabularies are fundamental resources for information extraction from clinical texts using natural language processing (NLP). Standard language resources available in the healthcare domain, such as the UMLS Metathesaurus or SNOMED CT, are widely used for this purpose, but they suffer from limitations such as the lexical ambiguity of clinical terms. However, most such terms are unambiguous within text limited to a given clinical specialty. This is one rationale, among others, for classifying clinical texts by the clinical specialty to which they belong.

Results: This paper addresses this limitation by proposing and applying a method that automatically extracts Spanish medical terms classified and weighted per sub-domain, using Spanish MEDLINE titles and abstracts as input. The hypothesis is that biomedical NLP tasks benefit from collections of domain terms that are specific to clinical sub-domains. We use PubMed queries to generate sub-domain-specific corpora from Spanish titles and abstracts, from which token n-grams are collected and metrics of relevance, discriminatory power, and broadness per sub-domain are computed. The resulting term set, called the Spanish core vocabulary about clinical specialties (SCOVACLIS), was made available to the scientific community and used in a text classification problem, obtaining an improvement of 6 percentage points in F-measure over the baseline using a Multilayer Perceptron, thus supporting the hypothesis that a specialized term set improves NLP tasks.

Conclusion: The creation and validation of SCOVACLIS support the hypothesis that specific term sets reduce the level of ambiguity compared to a specialty-independent, broad-scope vocabulary.
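
The extraction pipeline can be pictured as: build one corpus per sub-domain, collect token n-grams, and score each n-gram by how strongly it discriminates its sub-domain. The Python sketch below uses a simple relative-frequency ratio as an illustrative stand-in for the paper's relevance, discriminatory power, and broadness metrics; the toy corpora are hypothetical.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-token sequences, joined with spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def domain_term_scores(corpora, n=2):
    """corpora: {subdomain: [document strings]} -> per-domain n-gram scores.

    Score = relative frequency within the sub-domain divided by the
    relative frequency in the whole collection (an illustrative
    stand-in for the paper's metrics).
    """
    per_domain = {d: Counter() for d in corpora}
    overall = Counter()
    for domain, docs in corpora.items():
        for doc in docs:
            grams = ngrams(doc.lower().split(), n)
            per_domain[domain].update(grams)
            overall.update(grams)
    grand_total = sum(overall.values())
    scores = {}
    for domain, counts in per_domain.items():
        total = sum(counts.values())
        scores[domain] = {g: (c / total) / (overall[g] / grand_total)
                          for g, c in counts.items()}
    return scores

corpora = {"cardiology": ["acute myocardial infarction treated with stents"],
           "neurology": ["acute ischemic stroke treated with thrombolysis"]}
scores = domain_term_scores(corpora)
# A sub-domain-specific bigram outscores one shared across sub-domains.
assert scores["cardiology"]["myocardial infarction"] > scores["cardiology"]["treated with"]
```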


2021
Author(s):
Nan Jiang
Ping Lin
Yulong He
Zhuozhi Tan
Jin Hu
