Rule-based pattern extractor and named entity recognition: A hybrid approach

Protecting sensitive data is becoming increasingly important, especially as a result of the directives and laws imposed by the European Union. Efforts to build automatic systems are ongoing, but in most cases the underlying processes are still manual or semi-automatic. In this work, we have developed a component that can extract and classify sensitive data from unstructured text in European Portuguese. The objective was to create a system that allows organizations to understand their data and comply with legal and security requirements. We studied a hybrid approach to the problem of Named Entity Recognition for the Portuguese language. This approach combines several techniques: rule-based and lexical-based models, machine learning algorithms, and neural networks. The rule-based and lexical-based approaches were used only for a specific set of classes. For the remaining entity classes, two statistical models were tested, Conditional Random Fields and Random Forest, and finally a Bidirectional LSTM (Bi-LSTM) approach was tried. Of the statistical models, Conditional Random Fields obtained the best results, with an F1-score of 65.50%. With the Bi-LSTM approach, we achieved a result of 83.01%. The corpora used for training and testing were the HAREM Golden Collection, the SIGARRA News Corpus, and the DataSense NER Corpus.
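The division of labour described in this abstract, rules for some classes and a learned model for the rest, can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the abstract does not say which classes were rule-based, so the `EMAIL` and `DATE` patterns, the `hybrid_ner` function, and the `toy_model` stand-in for a trained CRF or Bi-LSTM tagger are all assumptions.

```python
import re
from typing import Callable, List, Tuple

Span = Tuple[int, int, str]  # (start offset, end offset, label)

# Hypothetical rule-based classes; the abstract does not list the real ones.
RULES = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "DATE": re.compile(r"\b\d{2}/\d{2}/\d{4}\b"),
}


def rule_spans(text: str) -> List[Span]:
    """Collect every match of every rule pattern as a labelled span."""
    spans = []
    for label, pattern in RULES.items():
        for match in pattern.finditer(text):
            spans.append((match.start(), match.end(), label))
    return sorted(spans)


def overlaps(a: Span, b: Span) -> bool:
    """True when the two character ranges share at least one position."""
    return a[0] < b[1] and b[0] < a[1]


def hybrid_ner(text: str, model: Callable[[str], List[Span]]) -> List[Span]:
    """Rule spans take priority; model spans fill in whatever is left."""
    fixed = rule_spans(text)
    learned = [s for s in model(text) if not any(overlaps(s, r) for r in fixed)]
    return sorted(fixed + learned)


def toy_model(text: str) -> List[Span]:
    """Stand-in for a trained tagger: it only knows one hard-coded name."""
    name = "Maria Silva"
    start = text.find(name)
    return [(start, start + len(name), "PERSON")] if start >= 0 else []


text = "Maria Silva escreveu para maria.silva@example.pt em 03/05/2021."
for start, end, label in hybrid_ner(text, toy_model):
    print(label, text[start:end])
```

Giving the rules priority over the model is one possible design choice; a real system could equally let the model override rules or resolve conflicts by confidence.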


NERA 2.0: Improving coverage and performance of rule-based named entity recognition for Arabic

Natural Language Engineering ◽

10.1017/s1351324916000097 ◽

2016 ◽

Vol 23 (3) ◽

pp. 441-472 ◽

Cited By ~ 8

Author(s):

MAI OUDAH ◽

KHALED SHAALAN

Keyword(s):

Language Processing ◽

Hybrid Approach ◽

Named Entity Recognition ◽

Entity Recognition ◽

Rule Base ◽

Named Entities ◽

Rule Based ◽

Named Entity ◽

Linguistic Rules ◽

Person Location

Named Entity Recognition (NER) is an essential task for many natural language processing systems, which makes use of various linguistic resources. NER becomes more complicated when the language in use is morphologically rich and structurally complex, such as Arabic. This language has a set of characteristics that makes it particularly challenging to handle. In a previous work, we proposed an Arabic NER system that follows the hybrid approach, i.e. integrates both rule-based and machine learning-based NER approaches. Our hybrid NER system is the state of the art in Arabic NER according to its performance on standard evaluation datasets. In this article, we discuss a novel methodology for overcoming the coverage drawback of rule-based NER systems in order to improve their performance and allow for automated rule update. The presented mechanism utilizes the recognition decisions made by the hybrid NER system to identify the weaknesses of the rule-based component and derive new linguistic rules that enhance the rule base, which will help in achieving more reliable and accurate results. We used the ACE 2004 Newswire standard dataset as a resource for extracting and analyzing new linguistic rules for person, location and organization name recognition. We formulate each new rule based on two distinctive feature groups, i.e. gazetteers for each type of named entity and Part-of-Speech tags, in particular noun and proper noun. Fourteen new patterns are derived, formulated as grammar rules, and evaluated in terms of coverage. The conducted experiments exploit a POS-tagged version of the ACE 2004 NW dataset. The empirical results show that the enhanced rule-based system, i.e. NERA 2.0, improves the coverage of the previously misclassified person, location and organization named entity types by 69.93 per cent, 57.09 per cent and 54.28 per cent, respectively.
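To make the idea of a rule built from a gazetteer plus Part-of-Speech tags concrete, here is a small sketch. It is not one of the fourteen patterns from the article, which the abstract does not spell out. The trigger-word gazetteer, the `NNP` tag name, the English example tokens, and the function `match_person_rule` are all illustrative assumptions.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (word, POS tag)

# Hypothetical gazetteer of trigger words that tend to precede person names.
PERSON_TRIGGERS = {"president", "minister", "dr."}


def match_person_rule(tokens: List[Token]) -> List[Tuple[int, int]]:
    """Apply the rule: TRIGGER followed by one or more proper nouns (NNP).

    Returns (start, end) token index ranges covering the proper nouns only;
    the end index is exclusive.
    """
    spans = []
    i = 0
    while i < len(tokens):
        word, _ = tokens[i]
        if word.lower() in PERSON_TRIGGERS:
            j = i + 1
            # Extend the candidate name over consecutive proper nouns.
            while j < len(tokens) and tokens[j][1] == "NNP":
                j += 1
            if j > i + 1:
                spans.append((i + 1, j))
                i = j
                continue
        i += 1
    return spans


sentence = [
    ("Minister", "NN"),
    ("Omar", "NNP"),
    ("Hassan", "NNP"),
    ("visited", "VBD"),
    ("Cairo", "NNP"),
]
for start, end in match_person_rule(sentence):
    print("PERSON:", " ".join(word for word, _ in sentence[start:end]))
```

Note that "Cairo" is a proper noun but is not tagged, because no person trigger precedes it; that is the sense in which the gazetteer and the POS tag act as two separate feature groups in a single rule.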


ChemTok: A New Rule Based Tokenizer for Chemical Named Entity Recognition

BioMed Research International ◽

10.1155/2016/4248026 ◽

2016 ◽

Vol 2016 ◽

pp. 1-9 ◽

Cited By ~ 5

Author(s):

Abbas Akkasi ◽

Ekrem Varoğlu ◽

Nazife Dimililer

Keyword(s):

Conditional Random Fields ◽

Named Entity Recognition ◽

Classification Performance ◽

Entity Recognition ◽

Support Vector ◽

Learning Approaches ◽

Data Set ◽

Rule Based ◽

Named Entity ◽

Vector Machines

Named Entity Recognition (NER) from text constitutes the first step in many text mining applications. The most important preliminary step for NER systems using machine learning approaches is tokenization, where raw text is segmented into tokens. This study proposes an enhanced rule-based tokenizer, ChemTok, which utilizes rules extracted mainly from the training data set. The main novelty of ChemTok is the use of the extracted rules to merge the tokens split in the previous steps, thus producing longer and more discriminative tokens. ChemTok is compared to the tokenization methods utilized by ChemSpot and tmChem. Support Vector Machines and Conditional Random Fields are employed as the learning algorithms. The experimental results show that the classifiers trained on the output of ChemTok outperform all classifiers trained on the output of the other two tokenizers in terms of classification performance and the number of incorrectly segmented entities.
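The split-then-merge idea can be illustrated with a short sketch. This is not ChemTok itself: the abstract says its rules are extracted mainly from the training data, whereas the single merge rule below is written by hand for illustration. The names `fine_tokenize`, `merge_tokens`, and `SEPARATORS` are assumptions.

```python
import re
from typing import List, Tuple

Tok = Tuple[str, int, int]  # (text, start offset, end offset)

# Hypothetical separators that often sit inside chemical names.
SEPARATORS = {"-", ","}


def fine_tokenize(text: str) -> List[Tok]:
    """First pass: alphanumeric runs and single punctuation characters."""
    pattern = re.compile(r"[A-Za-z0-9]+|[^\sA-Za-z0-9]")
    return [(m.group(), m.start(), m.end()) for m in pattern.finditer(text)]


def merge_tokens(tokens: List[Tok]) -> List[Tok]:
    """Second pass: rejoin X SEP Y when no whitespace separates the pieces."""
    merged = []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        while (
            i + 2 < len(tokens)
            and tokens[i + 1][0] in SEPARATORS
            and tok[2] == tokens[i + 1][1]  # X touches SEP
            and tokens[i + 1][2] == tokens[i + 2][1]  # SEP touches Y
        ):
            joined = tok[0] + tokens[i + 1][0] + tokens[i + 2][0]
            tok = (joined, tok[1], tokens[i + 2][2])
            i += 2
        merged.append(tok)
        i += 1
    return merged


text = "Add 1,2-dichloroethane slowly."
print([t[0] for t in fine_tokenize(text)])
print([t[0] for t in merge_tokens(fine_tokenize(text))])
```

Requiring the separator to touch both neighbours keeps an ordinary comma followed by a space as its own token, while a name such as "1,2-dichloroethane" is restored as a single, longer token.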


A Hybrid Approach of Pattern Extraction and Semi-supervised Learning for Vietnamese Named Entity Recognition

Computational Collective Intelligence. Technologies and Applications - Lecture Notes in Computer Science ◽

10.1007/978-3-642-34630-9_9 ◽

2012 ◽

pp. 83-93 ◽

Cited By ~ 1

Author(s):

Duc-Thuan Vo ◽

Cheol-Young Ock

Keyword(s):

Supervised Learning ◽

Hybrid Approach ◽

Named Entity Recognition ◽

Entity Recognition ◽

Pattern Extraction ◽

Named Entity


A hybrid approach to Arabic named entity recognition

Journal of Information Science ◽

10.1177/0165551513502417 ◽

2013 ◽

Vol 40 (1) ◽

pp. 67-87 ◽

Cited By ~ 27

Author(s):

Khaled Shaalan ◽

Mai Oudah

Keyword(s):

Hybrid Approach ◽

Named Entity Recognition ◽

Entity Recognition ◽

Named Entity


Integrating Rule-Based System with Classification for Arabic Named Entity Recognition

Computational Linguistics and Intelligent Text Processing - Lecture Notes in Computer Science ◽

10.1007/978-3-642-28604-9_26 ◽

2012 ◽

pp. 311-322 ◽

Cited By ~ 26

Author(s):

Sherief Abdallah ◽

Khaled Shaalan ◽

Muhammad Shoaib

Keyword(s):

Named Entity Recognition ◽

Entity Recognition ◽

Rule Based ◽

Rule Based System ◽

Named Entity


Using machine learning to maintain rule-based named-entity recognition and classification systems

10.3115/1073012.1073067 ◽

2001 ◽

Cited By ~ 17

Author(s):

Georgios Petasis ◽

Frantz Vichot ◽

Francis Wolinski ◽

Georgios Paliouras ◽

Vangelis Karkaletsis ◽

...

Keyword(s):

Machine Learning ◽

Named Entity Recognition ◽

Classification Systems ◽

Entity Recognition ◽

Rule Based ◽

Named Entity


A Hybrid Approach for Persian Named Entity Recognition

Iranian Journal of Science and Technology Transactions A Science ◽

10.1007/s40995-017-0209-x ◽

2017 ◽

Vol 41 (1) ◽

pp. 215-222

Author(s):

Hamed Moradi ◽

Farid Ahmadi ◽

Mohammad-Reza Feizi-Derakhshi

Keyword(s):

Hybrid Approach ◽

Named Entity Recognition ◽

Entity Recognition ◽

Named Entity


Named Entity Recognition in Telugu language using Language Dependent Features and Rule based Approach

International Journal of Computer Applications ◽

10.5120/2602-3628 ◽

2011 ◽

Vol 22 (8) ◽

pp. 30-34 ◽

Cited By ~ 2

Author(s):

B. Sasidhar ◽

P. M. Yohan ◽

A. Vinaya Babu ◽

A. Govardhan

Keyword(s):

Named Entity Recognition ◽

Entity Recognition ◽

Rule Based ◽

Named Entity ◽

Rule Based Approach


Named Entity Recognition for a Low Resource Language

International Journal of Recent Technology and Engineering - 2 ◽

10.35940/ijrte.b2085.098319 ◽

2019 ◽

Vol 8 (3) ◽

pp. 587-590

Keyword(s):

Machine Learning ◽

Named Entity Recognition ◽

Training Data ◽

Entity Recognition ◽

Linguistic Knowledge ◽

Rule Based ◽

Low Resource ◽

Named Entity ◽

The North ◽

Rule Based Approach

This paper studies Kokborok named entity recognition using a rule-based approach. Named entity recognition is one of the applications of natural language processing and is considered a subtask of information extraction. It is the task of identifying named entities, such as the names of persons, organizations and locations, for some specific purpose. We have studied a named entity recognition system for the Kokborok language. Kokborok is the official language of the state of Tripura, situated in the north-eastern part of India. It is also widely spoken in other parts of north-eastern India and in adjoining areas of Bangladesh. Named entity recognition is studied using the machine learning approach, the rule-based approach, or a hybrid approach combining the two. Rule-based named entity recognition is influenced by linguistic knowledge of the language. The machine learning approach requires a large amount of training data, and Kokborok, being a low-resource language, has very limited training data. The rule-based approach requires linguistic rules, and its results do not depend on the size of the available data. We have framed heuristic rules for identifying named entities based on linguistic knowledge of the language. An encouraging result was obtained when we tested our data with the rule-based approach. We also study and frame rules for the counting system in Kokborok. The rule-based approach to named entity recognition is found suitable for low-resource languages with limited digital work and no named-entity-tagged data. We have framed a suitable algorithm that uses these rules to solve the named entity recognition task.
