Emerging trends: Deep nets for poets

2021 ◽  
Vol 27 (5) ◽  
pp. 631-645
Author(s):  
Kenneth Ward Church ◽  
Xiaopeng Yuan ◽  
Sheng Guo ◽  
Zewu Wu ◽  
Yehua Yang ◽  
...  

Abstract: Deep nets have done well with early adopters, but the future will soon depend on crossing the chasm. The goal of this paper is to make deep nets more accessible to a broader audience, including people with little or no programming skills and people with little interest in training new models. A GitHub repository is provided with simple implementations of image classification, optical character recognition, sentiment analysis, named entity recognition, question answering (QA/SQuAD), machine translation, text to speech (TTS), and speech to text (STT). The emphasis is on instant gratification. Non-programmers should be able to install these programs and use them in 15 minutes or less (per program). Programs are short (10–100 lines each) and readable by users with modest programming skills. Much of the complexity is hidden behind abstractions such as pipelines and auto classes, and behind pretrained models and datasets provided by hubs: PaddleHub, PaddleNLP, HuggingFaceHub, and Fairseq. Hubs have different priorities than research. Research is about training models from corpora and fine-tuning them for tasks. Users are already overwhelmed with an embarrassment of riches (13k models and 1k datasets). Do they want more? We believe the broader market is more interested in inference (how to run pretrained models on novel inputs) and less interested in training (how to create even more models).
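The pipeline abstraction the abstract mentions can be illustrated with a minimal sketch using the Hugging Face transformers library (one of the hubs named above). This is not code from the paper's repository; it assumes `transformers` is installed and downloads a default pretrained model on first run.

```python
# A minimal "instant gratification" example in the spirit of the paper:
# the pipeline abstraction hides model loading, tokenization, and
# inference behind a single call.
from transformers import pipeline

classifier = pipeline("sentiment-analysis")  # downloads a default model
result = classifier("Deep nets are finally accessible to poets.")[0]
print(result["label"], round(result["score"], 3))
```

Three lines of user-facing code cover what would otherwise require explicit tokenizer, model, and decoding steps, which is the accessibility argument the paper makes.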

2021 ◽  
Vol 27 (6) ◽  
pp. 763-778
Author(s):  
Kenneth Ward Church ◽  
Zeyu Chen ◽  
Yanjun Ma

Abstract: The previous Emerging Trends article (Church et al., 2021, Natural Language Engineering 27(5), 631–645) introduced deep nets to poets. Poets is an imperfect metaphor, intended as a gesture toward inclusion. The future for deep nets will benefit from reaching out to a broad audience of potential users, including people with little or no programming skills and little interest in training models. That paper focused on inference: the use of pretrained models, as is, without fine-tuning. The goal of this paper is to make fine-tuning more accessible to a broader audience. Since fine-tuning is more challenging than inference, the examples in this paper will require modest programming skills, as well as access to a GPU. Fine-tuning starts with a general-purpose base (foundation) model and uses a small training set of labeled data to produce a model for a specific downstream application. There are many examples of fine-tuning in natural language processing (question answering (SQuAD) and the GLUE benchmark), as well as in vision and speech.
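The fine-tuning recipe described here can be sketched in plain PyTorch. This is not the paper's code: the "base" model below is a randomly initialized stand-in for a real pretrained encoder, frozen so that only a small task head is trained on a tiny labeled set.

```python
# Sketch of fine-tuning mechanics: freeze a general-purpose base,
# train a small task-specific head on a handful of labeled examples.
import torch
import torch.nn as nn

torch.manual_seed(0)

base = nn.Sequential(nn.Linear(16, 32), nn.ReLU())  # stand-in "pretrained" encoder
for p in base.parameters():
    p.requires_grad = False                         # freeze the base

head = nn.Linear(32, 2)                             # downstream task head
opt = torch.optim.Adam(head.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()

# A tiny labeled "downstream" training set: 8 examples, 2 classes.
x = torch.randn(8, 16)
y = torch.tensor([0, 1, 0, 1, 0, 1, 0, 1])

for _ in range(50):                                 # a few fine-tuning steps
    opt.zero_grad()
    loss = loss_fn(head(base(x)), y)
    loss.backward()
    opt.step()

final_loss = loss_fn(head(base(x)), y).item()
print(f"loss after fine-tuning: {final_loss:.3f}")
```

In a real setting the base would be a foundation model loaded from a hub, and the frozen-versus-trainable split is one of several fine-tuning strategies (full fine-tuning updates the base as well).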


2021 ◽  
pp. 016555152110006
Author(s):  
Houssem Menhour ◽  
Hasan Basri Şahin ◽  
Ramazan Nejdet Sarıkaya ◽  
Medine Aktaş ◽  
Rümeysa Sağlam ◽  
...  

The newspaper emerged as a distinct cultural form in early 17th-century Europe and is bound up with the early modern period of history. Historical newspapers are of utmost importance to nations and their people, and researchers from different disciplines rely on these papers to improve our understanding of the past. To serve this need, the Istanbul University Head Office of Library and Documentation provides access to a large database of scanned historical newspapers. To make the documents still more accessible, optical character recognition (OCR) and named entity recognition (NER) must be run on the whole database and the results indexed to support a full-text search mechanism. We design and implement a system encompassing the whole pipeline, from scraping the dataset from the original website to providing a graphical user interface for search queries, and it does so successfully. The proposed system supports searching for people-, culture-, and security-related keywords and visualising the results.
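The indexing stage of such a pipeline can be sketched with a simple inverted index. The OCR and NER stages are stubbed out here: the document texts below are hypothetical placeholders for what a real system would obtain from an OCR engine and an NER model.

```python
# Sketch of the "index the results for full-text search" step:
# an inverted index mapping tokens to the documents containing them.
from collections import defaultdict

# Hypothetical OCR output for two scanned newspaper issues.
documents = {
    "issue_1903_04_12": "The governor visited Istanbul and met the press.",
    "issue_1911_09_30": "New schools opened in Istanbul this autumn.",
}

# Build the inverted index: token -> set of document ids.
index = defaultdict(set)
for doc_id, text in documents.items():
    for token in text.lower().split():
        index[token.strip(".,")].add(doc_id)

def search(query):
    """Return ids of documents containing every query token."""
    tokens = [t.lower() for t in query.split()]
    hits = [index.get(t, set()) for t in tokens]
    return sorted(set.intersection(*hits)) if hits else []

print(search("Istanbul"))           # both issues mention Istanbul
print(search("governor Istanbul"))  # only the 1903 issue matches both terms
```

A production system would use a search engine library instead, but the data flow (OCR text in, token index out, conjunctive query at search time) is the same.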


2020 ◽  
Vol 8 (4) ◽  
pp. 263-269
Author(s):  
Ahmad Syarif Rosidy ◽  
Tubagus Mohammad Akhriza ◽  
Mochammad Husni

Event organizers in Indonesia often use websites to disseminate information about their events through digital posters. However, manually transferring information from posters to websites is time-consuming, and the number of uploaded posters keeps growing. Moreover, information extraction methods such as Named Entity Recognition (NER) for Indonesian posters are still rarely discussed in the literature. Applying NER to Indonesian text is also challenging because Indonesian is a low-resource language, so reference corpora are scarce. This study proposes a solution to improve the time efficiency of extracting information from digital posters: a combination of NER with Optical Character Recognition (OCR) to recognize text on posters, supported by a relevant training corpus to improve accuracy. The experimental results show that the system increases time efficiency by 94%, with 82–92% accuracy for several extracted entities across 50 test digital posters.
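Per-entity accuracy figures like the 82–92% reported here come from comparing extracted fields against gold annotations across the test posters. A minimal sketch of such an evaluation harness follows; the entity names and poster values are illustrative, not taken from the paper.

```python
# Compare extracted entities against gold annotations, per entity type.
# Hypothetical gold and predicted extractions for two test posters.
gold = [
    {"event": "Seminar AI", "date": "2020-03-12", "place": "Malang"},
    {"event": "Workshop Web", "date": "2020-04-01", "place": "Jakarta"},
]
predicted = [
    {"event": "Seminar AI", "date": "2020-03-12", "place": "Malang"},
    {"event": "Workshop Web", "date": "2020-04-01", "place": "Bandung"},
]

def per_entity_accuracy(gold, predicted):
    """Fraction of posters where each entity was extracted exactly right."""
    scores = {}
    for entity in gold[0]:
        correct = sum(g[entity] == p[entity] for g, p in zip(gold, predicted))
        scores[entity] = correct / len(gold)
    return scores

scores = per_entity_accuracy(gold, predicted)
print(scores)  # event and date correct on both posters; place on one
```

Exact-match scoring per field is the strictest choice; partial-credit variants (e.g. token overlap) are also common for OCR-derived text.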


2020 ◽  
Vol 2020 ◽  
pp. 1-8
Author(s):  
Lejun Gong ◽  
Zhifei Zhang ◽  
Shiqi Chen

Background. Clinical named entity recognition is the basic task in mining electronic medical record text. Chinese electronic medical records pose particular challenges: many compound entities, frequently missing sentence components, and unclear entity boundaries. Moreover, corpora of Chinese electronic medical records are difficult to obtain. Methods. To address these characteristics, this study proposes a Chinese clinical entity recognition model based on deep-learning pretraining. The model uses word embeddings trained on a domain corpus and fine-tunes an entity recognition model pretrained on a related corpus. BiLSTM and Transformer encoders are then used, respectively, as feature extractors to identify four types of clinical entities (diseases, symptoms, drugs, and operations) in Chinese electronic medical record text. Results. On the test dataset, the model achieved 75.06% macro-precision, 76.40% macro-recall, and 75.72% macro-F1. Conclusions. These experiments show that the proposed Chinese clinical entity recognition model based on deep-learning pretraining can effectively improve recognition performance.
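The BiLSTM variant of such a tagger can be sketched in PyTorch: embeddings feed a bidirectional LSTM, and a linear layer emits per-token scores over entity labels. The sizes and the 9-label scheme below (B/I tags for each of the four clinical entity types, plus O) are illustrative assumptions, not the paper's configuration.

```python
# Sketch of a BiLSTM sequence tagger for clinical NER.
import torch
import torch.nn as nn

class BiLSTMTagger(nn.Module):
    def __init__(self, vocab_size=1000, embed_dim=64, hidden=128, num_labels=9):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden, batch_first=True,
                            bidirectional=True)
        # Bidirectional LSTM concatenates forward/backward states: 2 * hidden.
        self.classifier = nn.Linear(2 * hidden, num_labels)

    def forward(self, token_ids):
        states, _ = self.lstm(self.embed(token_ids))
        return self.classifier(states)  # one label-score vector per token

tagger = BiLSTMTagger()
batch = torch.randint(0, 1000, (2, 20))  # 2 sentences, 20 tokens each
logits = tagger(batch)
print(logits.shape)                      # torch.Size([2, 20, 9])
labels = logits.argmax(dim=-1)           # predicted label index per token
```

The Transformer variant swaps the LSTM for self-attention layers; in either case the per-token scores are typically decoded with argmax or a CRF layer into entity spans.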

