Multi-Layout Invoice Document Dataset (MIDD): A Dataset for Named Entity Recognition

Data ◽  
2021 ◽  
Vol 6 (7) ◽  
pp. 78
Author(s):  
Dipali Baviskar ◽  
Swati Ahirrao ◽  
Ketan Kotecha

The day-to-day working of an organization produces a massive volume of unstructured data in the form of invoices, legal contracts, mortgage processing forms, and many more. Organizations can use the insights concealed in such unstructured documents for their operational benefit. However, analyzing and extracting insights from such numerous and complex unstructured documents is a tedious task. Hence, research in this area is encouraging the development of novel frameworks and tools that can automate key information extraction from unstructured documents. However, the limited availability of standard, high-quality, annotated unstructured document datasets is a serious challenge to this goal. This work expedites the researcher’s task by providing a high-quality, highly diverse, multi-layout, annotated invoice document dataset for extracting key information from unstructured documents. Researchers can use the proposed dataset for layout-independent unstructured invoice document processing and to develop an artificial intelligence (AI)-based tool to identify and extract named entities in invoice documents. Our dataset includes 630 invoice document PDFs with four different layouts collected from diverse suppliers. As far as we know, our invoice dataset is the only openly available dataset comprising high-quality, highly diverse, multi-layout, and annotated invoice documents.

2019 ◽  
Vol 34 (4) ◽  
pp. 283-294 ◽  
Author(s):  
Huyen T M Nguyen ◽  
Quyen T Ngo ◽  
Luong X Vu ◽  
Vu M Tran ◽  
Hien T T Nguyen

Named entities (NE) are phrases that contain the names of persons, organizations, locations, times and quantities, monetary values, percentages, etc. Named Entity Recognition (NER) is the task of recognizing named entities in documents. NER is an important subtask of Information Extraction, which has attracted researchers all over the world since the 1990s. For the Vietnamese language, although some research projects and publications on the NER task existed before 2016, no systematic comparison of the performance of NER systems had been done. In 2016, the organizing committee of the VLSP workshop decided to launch the first NER shared task, in order to get an objective evaluation of Vietnamese NER systems and to promote the development of high-quality systems. As a result, the first dataset with morpho-syntactic and NE annotations was released for benchmarking NER systems. At VLSP 2018, the NER shared task was organized for the second time, providing a bigger dataset containing texts from various domains, but without morpho-syntactic annotation. These resources are available for research purposes via the VLSP website vlsp.org.vn/resources. In this paper, we describe the datasets as well as the evaluation results obtained from these two campaigns.


2018 ◽  
Vol 7 (1) ◽  
pp. 23 ◽  
Author(s):  
Pavlina Fragkou

In this paper we examine the benefit of performing named entity recognition (NER) and co-reference resolution on a Greek corpus used for text segmentation. The aim is to examine whether the combination of text segmentation and information extraction is beneficial for identifying the various topics that appear in a document. NER was performed on the Greek corpus using an existing tool. The annotations produced were manually corrected and enriched to cover four types of named entities. Co-reference resolution was subsequently performed manually. The evaluation, using four text segmentation algorithms, leads to the conclusion that information extraction techniques appear to be a promising solution for capturing semantic information for segmentation purposes.


2019 ◽  
Vol 58 (02/03) ◽  
pp. 094-106 ◽  
Author(s):  
Zhe Xie ◽  
Yuanyuan Yang ◽  
Mingqing Wang ◽  
Ming Li ◽  
Haozhe Huang ◽  
...  

Abstract

Background: Radiology reports are a permanent record of a patient's health information often used in clinical practice and research. Reading radiology reports is common for clinicians and radiologists. However, it is laborious and time-consuming when the number of reports to be read is large. Assisting clinicians to locate and assimilate the key information in reports is of great significance for improving the efficiency of reading reports. There are few studies on information extraction from Chinese medical texts and its application in radiology information systems (RIS) for efficiency improvement.

Objectives: The purpose of this study was to explore methods for extracting, grouping, ranking, delivering, and displaying medical-named entities in radiology reports which can yield efficiency improvement in RISs.

Methods: A total of 5,000 reports were obtained from two medical institutions for this study. We proposed a neural network model called Multi-Embedding-BGRU-CRF (bidirectional gated recurrent unit-conditional random field) for medical-named entity recognition and rule-based methods for entity grouping and ranking. Furthermore, a methodology for delivering and displaying entities in RISs was presented.

Results: The proposed neural named entity recognition model achieved an F1 score of 95.88%. Entity ranking achieved an accuracy of 99.23%. The weakness of the system is the entity grouping approach, which yields an accuracy of 91.03%. The effectiveness of the overall solution was demonstrated by an evaluation task performed by two clinicians based on the setup of actual clinical practice.

Conclusions: The neural model shows great potential in extracting medical-named entities from radiology reports, especially for languages that lack lexicons and natural language processing tools. The pipeline of extracting, grouping, ranking, delivering, and displaying medical-named entities could be a feasible solution to enhance RIS functionality by information extraction. The integration of information extraction and RIS has been demonstrated to be effective in improving the efficiency of reading radiology reports.


Data ◽  
2021 ◽  
Vol 6 (7) ◽  
pp. 71
Author(s):  
Gonçalo Carnaz ◽  
Mário Antunes ◽  
Vitor Beires Nogueira

Criminal investigations collect and analyze the facts related to a crime, from which the investigators can deduce evidence to be used in court. Criminal investigation is a multidisciplinary and applied science, which includes interviews, interrogations, evidence collection, preservation of the chain of custody, and other methods and techniques of investigation. These techniques produce both digital and paper documents that have to be carefully analyzed to identify correlations and interactions among suspects, places, license plates, and other entities that are mentioned in the investigation. The computerized processing of these documents is a helping hand to the criminal investigation, as it allows the automatic identification of entities and their relations, some of which are difficult to identify manually. There exists a wide set of dedicated tools, but they have a major limitation: they are unable to process criminal reports in the Portuguese language, as an annotated corpus for that purpose does not exist. This paper presents an annotated corpus, composed of a collection of anonymized crime-related documents, which were extracted from official and open sources. The dataset was produced as the result of an exploratory initiative to collect crime-related data from websites and conditioned-access police reports. The dataset was evaluated, and a mean precision of 0.808, recall of 0.722, and F1-score of 0.733 were obtained for the classification of the annotated named entities present in the crime-related documents. This corpus can be employed to benchmark Machine Learning (ML) and Natural Language Processing (NLP) methods and tools to detect and correlate entities in the documents. Some examples are sentence detection, named-entity recognition, and identification of terms related to the criminal domain.
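The abstract above reports mean precision, recall, and F1 without saying how the means were taken. The reported mean F1 (0.733) is lower than the harmonic mean of the reported mean precision and recall (about 0.763), which would be consistent with averaging F1 per entity class, although the abstract does not confirm this. The sketch below illustrates that effect with macro-averaging. The entity classes and counts are hypothetical and are not taken from the corpus.

```python
def prf(tp, fp, fn):
    """Precision, recall, and F1 from true-positive, false-positive, false-negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def macro_average(counts):
    """Average precision, recall, and F1 over classes, giving each class equal weight."""
    scores = [prf(*c) for c in counts.values()]
    n = len(scores)
    return tuple(sum(s[i] for s in scores) / n for i in range(3))


# Hypothetical (tp, fp, fn) counts per entity class, for illustration only.
counts = {
    "PERSON": (90, 10, 10),
    "LOCATION": (30, 10, 30),
}

macro_p, macro_r, macro_f1 = macro_average(counts)

# Harmonic mean of the averaged precision and recall, for comparison.
harmonic = 2 * macro_p * macro_r / (macro_p + macro_r)

print(f"macro P={macro_p:.3f} R={macro_r:.3f} F1={macro_f1:.3f}")
print(f"harmonic mean of macro P and macro R = {harmonic:.3f}")
```

With these hypothetical counts the macro-averaged F1 is below the harmonic mean of the averaged precision and recall, the same direction as the gap in the reported figures.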


Processes ◽  
2021 ◽  
Vol 9 (7) ◽  
pp. 1178
Author(s):  
Zhenhua Wang ◽  
Beike Zhang ◽  
Dong Gao

In the field of chemical safety, a named entity recognition (NER) model based on deep learning can mine valuable information from hazard and operability analysis (HAZOP) text, which can guide experts in carrying out a new round of HAZOP analysis, help practitioners address the hidden dangers in the system, and be of great significance for improving the safety of the whole chemical system. However, due to the standardization and professionalism of chemical safety analysis text, it is difficult to improve the performance of traditional models. To solve this problem, this study proposes an improved method based on active learning and designs three novel sampling algorithms, Variation of Token Entropy (VTE), HAZOP Confusion Entropy (HCE), and Amplification of Least Confidence (ALC), which improve the ability of the model to understand HAZOP text. In this method, part of the data is used to establish the initial model. The sampling algorithm is then used to select high-quality samples from the data set. Finally, these high-quality samples are used to retrain the whole model to obtain the final model. The experimental results show that the performance of the VTE, HCE, and ALC algorithms is better than that of random sampling algorithms. In addition, compared with other methods, the performance of the traditional model is improved effectively by the method proposed in this paper, which indicates that the method is reliable and advanced.
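The sample-selection step described in the abstract above can be sketched with two standard uncertainty measures from the active learning literature: mean token entropy and least confidence. These are generic baselines, not the paper's VTE, HCE, or ALC algorithms, whose definitions the abstract does not give. The pool of sentences, the label probabilities, and all function names are hypothetical.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one token's predicted label distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def mean_token_entropy(sentence_probs):
    """Average per-token entropy over a sentence; higher means more uncertain."""
    return sum(token_entropy(p) for p in sentence_probs) / len(sentence_probs)


def least_confidence(sentence_probs):
    """1 minus the product of each token's top label probability.

    Assumes tokens are labeled independently, which is a simplification.
    """
    confidence = 1.0
    for p in sentence_probs:
        confidence *= max(p)
    return 1.0 - confidence


def select_samples(pool, score_fn, k):
    """Return the ids of the k sentences the model is least sure about."""
    ranked = sorted(pool, key=lambda item: score_fn(item[1]), reverse=True)
    return [sentence_id for sentence_id, _ in ranked[:k]]


# Hypothetical unlabeled pool: (sentence id, per-token label probabilities).
pool = [
    ("s1", [[0.98, 0.01, 0.01], [0.97, 0.02, 0.01]]),
    ("s2", [[0.40, 0.35, 0.25], [0.50, 0.30, 0.20]]),
    ("s3", [[0.90, 0.05, 0.05], [0.60, 0.30, 0.10]]),
]

by_entropy = select_samples(pool, mean_token_entropy, k=2)
by_confidence = select_samples(pool, least_confidence, k=2)
print(by_entropy, by_confidence)
```

In a full loop, the selected sentences would be annotated, added to the training set, and the model retrained, as the abstract describes.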


Author(s):  
Elena Álvarez-Mellado ◽  
María Luisa Díez-Platas ◽  
Pablo Ruiz-Fabo ◽  
Helena Bermúdez ◽  
Salvador Ros ◽  
...  

Medieval documents are a rich source of historical data. Performing named-entity recognition (NER) on this genre of texts can provide us with valuable historical evidence. However, traditional NER categories and schemes are usually designed with modern documents in mind (i.e. journalistic text), and general-domain NER annotation schemes fail to capture the nature of medieval entities. In this paper we explore the challenges of performing named-entity annotation on a corpus of Spanish medieval documents: we discuss the mismatches that arise when applying traditional NER categories to such a corpus, and we propose a novel humanist-friendly TEI-compliant annotation scheme and guidelines intended to capture the particular nature of medieval entities.


2014 ◽  
Vol 40 (2) ◽  
pp. 469-510 ◽  
Author(s):  
Khaled Shaalan

As more and more Arabic textual information becomes available through the Web in homes and businesses, via Internet and Intranet services, there is an urgent need for technologies and tools to process the relevant information. Named Entity Recognition (NER) is an Information Extraction task that has become an integral part of many other Natural Language Processing (NLP) tasks, such as Machine Translation and Information Retrieval. Arabic NER has begun to receive attention in recent years. The characteristics and peculiarities of Arabic, a member of the Semitic language family, make dealing with NER a challenge. The performance of an Arabic NER component affects the overall performance of the NLP system in a positive manner. This article attempts to describe and detail the recent increase in interest and progress made in Arabic NER research. The importance of the NER task is demonstrated, the main characteristics of the Arabic language are highlighted, and the aspects of standardization in annotating named entities are illustrated. Moreover, the different Arabic linguistic resources are presented and the approaches used in the Arabic NER field are explained. The features of common tools used in Arabic NER are described, and standard evaluation metrics are illustrated. In addition, the state of the art of Arabic NER research is reviewed. Finally, we present our conclusions. Throughout the presentation, illustrative examples are used for clarification.


2021 ◽  
Vol 7 (1) ◽  
Author(s):  
Kanix Wang ◽  
Robert Stevens ◽  
Halima Alachram ◽  
Yu Li ◽  
Larisa Soldatova ◽  
...  

Machine reading (MR) is essential for unlocking valuable knowledge contained in millions of existing biomedical documents. Over the last two decades [1,2], the most dramatic advances in MR have followed in the wake of critical corpus development [3]. Large, well-annotated corpora have been associated with punctuated advances in MR methodology and automated knowledge extraction systems in the same way that ImageNet [4] was fundamental for developing machine vision techniques. This study contributes six components to an advanced, named entity analysis tool for biomedicine: (a) a new Named Entity Recognition Ontology (NERO) developed specifically for describing textual entities in biomedical texts, which accounts for diverse levels of ambiguity, bridging the scientific sublanguages of molecular biology, genetics, biochemistry, and medicine; (b) detailed guidelines for human experts annotating hundreds of named entity classes; (c) pictographs for all named entities, to simplify the burden of annotation for curators; (d) an original, annotated corpus comprising 35,865 sentences, which encapsulate 190,679 named entities and 43,438 events connecting two or more entities; (e) validated, off-the-shelf, named entity recognition (NER) automated extraction; and (f) embedding models that demonstrate the promise of biomedical associations embedded within this corpus.

