Mining the URLs: An Approach to Measure the Similarities between Named-Entities

Criminal investigation collects and analyzes the facts related to a crime, from which investigators can deduce evidence to be used in court. It is a multidisciplinary applied science that includes interviews, interrogations, evidence collection, preservation of the chain of custody, and other methods and techniques of investigation. These techniques produce both digital and paper documents that have to be carefully analyzed to identify correlations and interactions among suspects, places, license plates, and other entities mentioned in the investigation. Computerized processing of these documents supports the criminal investigation, as it allows the automatic identification of entities and their relations, some of which are difficult to identify manually. A wide set of dedicated tools exists, but they share a major limitation: they are unable to process criminal reports in the Portuguese language, as an annotated corpus for that purpose does not exist. This paper presents an annotated corpus, composed of a collection of anonymized crime-related documents, which were extracted from official and open sources. The dataset was produced as the result of an exploratory initiative to collect crime-related data from websites and conditioned-access police reports. The dataset was evaluated, and a mean precision of 0.808, recall of 0.722, and F1-score of 0.733 were obtained for the classification of the annotated named-entities present in the crime-related documents. This corpus can be employed to benchmark Machine Learning (ML) and Natural Language Processing (NLP) methods and tools that detect and correlate entities in the documents. Example tasks include sentence detection, named-entity recognition, and identification of terms related to the criminal domain.
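
The abstract reports mean precision, recall, and F1-score but does not say how the means were taken. The sketch below assumes macro-averaging over entity types, which is one common convention and not necessarily the authors' procedure. The entity labels and the (true positive, false positive, false negative) counts are hypothetical and chosen only for illustration. Under this kind of averaging, the mean F1 need not equal the harmonic mean of the mean precision and mean recall. That would be consistent with the reported figures, since the harmonic mean of 0.808 and 0.722 is about 0.763 rather than 0.733.

```python
from typing import Dict, Tuple


def prf(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Precision, recall and F1 for one entity type from raw counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def macro_average(counts: Dict[str, Tuple[int, int, int]]) -> Tuple[float, float, float]:
    """Unweighted mean of per-type precision, recall and F1."""
    scores = [prf(*c) for c in counts.values()]
    n = len(scores)
    return (
        sum(s[0] for s in scores) / n,
        sum(s[1] for s in scores) / n,
        sum(s[2] for s in scores) / n,
    )


# Hypothetical (tp, fp, fn) counts per entity type; not taken from the paper.
counts = {
    "PERSON": (90, 10, 30),
    "LOCATION": (40, 20, 10),
    "LICENSE_PLATE": (8, 2, 12),
}

mean_p, mean_r, mean_f1 = macro_average(counts)
# Harmonic mean of the averaged precision and recall, for comparison.
f1_of_means = 2 * mean_p * mean_r / (mean_p + mean_r)

print(f"mean P={mean_p:.3f}  mean R={mean_r:.3f}  mean F1={mean_f1:.3f}")
print(f"F1 computed from the mean P and mean R: {f1_of_means:.3f}")
```

The two F1 values differ because the first averages per-type F1 scores while the second combines already-averaged precision and recall.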

Lightly supervised acquisition of named entities and linguistic patterns for multilingual text mining

Knowledge and Information Systems ◽

10.1007/s10115-012-0502-0 ◽

2012 ◽

Vol 35 (1) ◽

pp. 87-109 ◽

Cited By ~ 5

Author(s):

César de Pablo-Sánchez ◽

Isabel Segura-Bedmar ◽

Paloma Martínez ◽

Ana Iglesias-Maqueda

Keyword(s):

Text Mining ◽

Named Entities ◽

Linguistic Patterns ◽

Multilingual Text

Inaugural Speech Classification with Named Entities and Key Phrases

2021 IEEE International Conference on Big Data and Smart Computing (BigComp) ◽

10.1109/bigcomp51126.2021.00029 ◽

2021 ◽

Author(s):

Hyoil Han ◽

SeungJin Lim

Keyword(s):

Named Entities ◽

Inaugural Speech ◽

Speech Classification ◽

Key Phrases

Alignment of bilingual named entities in parallel corpora using statistical models and multiple knowledge sources

ACM Transactions on Asian Language Information Processing ◽

10.1145/1165255.1165257 ◽

2006 ◽

Vol 5 (2) ◽

pp. 121-145 ◽

Cited By ~ 19

Author(s):

Chun-Jen Lee ◽

Jason S. Chang ◽

Jyh-Shing R. Jang

Keyword(s):

Statistical Models ◽

Knowledge Sources ◽

Named Entities ◽

Parallel Corpora

Linking named entities in Tweets with knowledge base via user interest modeling

Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining - KDD '13 ◽

10.1145/2487575.2487686 ◽

2013 ◽

Cited By ~ 66

Author(s):

Wei Shen ◽

Jianyong Wang ◽

Ping Luo ◽

Min Wang

Keyword(s):

Knowledge Base ◽

User Interest ◽

Named Entities

TEI-friendly annotation scheme for medieval named entities: a case on a Spanish medieval corpus

Language Resources and Evaluation ◽

10.1007/s10579-020-09516-2 ◽

2021 ◽

Cited By ~ 1

Author(s):

Elena Álvarez-Mellado ◽

María Luisa Díez-Platas ◽

Pablo Ruiz-Fabo ◽

Helena Bermúdez ◽

Salvador Ros ◽

...

Keyword(s):

Historical Data ◽

Named Entity Recognition ◽

Rich Source ◽

Entity Recognition ◽

Historical Evidence ◽

Annotation Scheme ◽

Named Entities ◽

General Domain ◽

Named Entity ◽

Entity Annotation

Medieval documents are a rich source of historical data. Performing named-entity recognition (NER) on this genre of texts can provide us with valuable historical evidence. However, traditional NER categories and schemes are usually designed with modern documents in mind (i.e. journalistic text) and the general-domain NER annotation schemes fail to capture the nature of medieval entities. In this paper we explore the challenges of performing named-entity annotation on a corpus of Spanish medieval documents: we discuss the mismatches that arise when applying traditional NER categories to a corpus of Spanish medieval documents and we propose a novel humanist-friendly TEI-compliant annotation scheme and guidelines intended to capture the particular nature of medieval entities.

A Survey of Arabic Named Entity Recognition and Classification

Computational Linguistics ◽

10.1162/coli_a_00178 ◽

2014 ◽

Vol 40 (2) ◽

pp. 469-510 ◽

Cited By ~ 62

Author(s):

Khaled Shaalan

Keyword(s):

Language Processing ◽

Named Entity Recognition ◽

Relevant Information ◽

Arabic Language ◽

Entity Recognition ◽

Named Entities ◽

Linguistic Resources ◽

Named Entity

As more and more Arabic textual information becomes available through the Web in homes and businesses, via Internet and Intranet services, there is an urgent need for technologies and tools to process the relevant information. Named Entity Recognition (NER) is an Information Extraction task that has become an integral part of many other Natural Language Processing (NLP) tasks, such as Machine Translation and Information Retrieval. Arabic NER has begun to receive attention in recent years. The characteristics and peculiarities of Arabic, a member of the Semitic language family, make dealing with NER a challenge. The performance of an Arabic NER component affects the overall performance of the NLP system in a positive manner. This article attempts to describe and detail the recent increase in interest and progress made in Arabic NER research. The importance of the NER task is demonstrated, the main characteristics of the Arabic language are highlighted, and the aspects of standardization in annotating named entities are illustrated. Moreover, the different Arabic linguistic resources are presented and the approaches used in the Arabic NER field are explained. The features of common tools used in Arabic NER are described, and standard evaluation metrics are illustrated. In addition, the state of the art of Arabic NER research is reviewed. Finally, we present our conclusions. Throughout the presentation, illustrative examples are used for clarification.
