Data set entity recognition based on distant supervision

2021 ◽  
Vol ahead-of-print (ahead-of-print) ◽  
Author(s):  
Pengcheng Li ◽  
Qikai Liu ◽  
Qikai Cheng ◽  
Wei Lu

Purpose This paper aims to identify data set entities in scientific literature. To address poor recognition caused by a lack of training corpora in existing studies, a distant supervised learning-based approach is proposed to identify data set entities automatically from large-scale scientific literature in an open domain. Design/methodology/approach Firstly, the authors use a dictionary combined with a bootstrapping strategy to create a labelled corpus to apply supervised learning. Secondly, a bidirectional encoder representation from transformers (BERT)-based neural model was applied to identify data set entities in the scientific literature automatically. Finally, two data augmentation techniques, entity replacement and entity masking, were introduced to enhance the model generalisability and improve the recognition of data set entities. Findings In the absence of training data, the proposed method can effectively identify data set entities in large-scale scientific papers. The BERT-based vectorised representation and data augmentation techniques enable significant improvements in the generality and robustness of named entity recognition models, especially in long-tailed data set entity recognition. Originality/value This paper provides a practical research method for automatically recognising data set entities in scientific literature. To the best of the authors’ knowledge, this is the first attempt to apply distant learning to the study of data set entity recognition. The authors introduce a robust vectorised representation and two data augmentation strategies (entity replacement and entity masking) to address the problem inherent in distant supervised learning methods, which the existing research has mostly ignored. The experimental results demonstrate that our approach effectively improves the recognition of data set entities, especially long-tailed data set entities.
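The labelling and augmentation steps named in this abstract can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the abstract does not give the details, so the longest-match dictionary lookup, the BIO tag names, the `[MASK]` token and all function names below are assumptions, and the bootstrapping step is omitted.

```python
import random

def label_with_dictionary(tokens, dictionary):
    """Distant supervision: tag tokens in BIO format by longest-match dictionary lookup."""
    tags = ["O"] * len(tokens)
    max_len = max((len(entry) for entry in dictionary), default=0)
    i = 0
    while i < len(tokens):
        matched = False
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            if tuple(tokens[i:i + n]) in dictionary:
                tags[i] = "B-DATASET"
                for j in range(i + 1, i + n):
                    tags[j] = "I-DATASET"
                i += n
                matched = True
                break
        if not matched:
            i += 1
    return tags

def entity_spans(tags):
    """Return (start, end) index pairs, end exclusive, for each tagged entity."""
    spans, start = [], None
    for i, tag in enumerate(tags):
        if tag == "B-DATASET":
            if start is not None:
                spans.append((start, i))
            start = i
        elif tag == "O" and start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(tags)))
    return spans

def replace_entities(tokens, tags, dictionary, rng):
    """Entity replacement: swap each entity for a randomly chosen dictionary entry."""
    entries = sorted(dictionary)
    new_tokens, new_tags, prev = [], [], 0
    for start, end in entity_spans(tags):
        new_tokens += tokens[prev:start]
        new_tags += tags[prev:start]
        substitute = rng.choice(entries)
        new_tokens += list(substitute)
        new_tags += ["B-DATASET"] + ["I-DATASET"] * (len(substitute) - 1)
        prev = end
    new_tokens += tokens[prev:]
    new_tags += tags[prev:]
    return new_tokens, new_tags

def mask_entities(tokens, tags, mask_token="[MASK]"):
    """Entity masking: hide entity tokens so a model has to rely on the context."""
    masked = [mask_token if tag != "O" else tok for tok, tag in zip(tokens, tags)]
    return masked, list(tags)
```

Replacement exposes the model to entity names it would rarely see in the original sentences, while masking pushes it to use the surrounding context rather than memorised names; both plausibly help with long-tailed entities.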

2021 ◽  
Vol 8 (1) ◽  
Author(s):  
Huu-Thanh Duong ◽  
Tram-Anh Nguyen-Thi

Abstract In the literature, machine learning-based studies of sentiment analysis usually rely on supervised learning, which requires sufficiently large pre-labeled datasets in specific domains. Such datasets are tedious, expensive and time-consuming to build, and it remains hard to handle unseen data. This paper takes a semi-supervised learning approach to Vietnamese sentiment analysis, for which datasets are limited. We summarize the preprocessing techniques used to clean and normalize the data, together with negation handling and intensification handling, to improve performance. Moreover, data augmentation techniques, which generate new data from the original data to enrich the training data without user intervention, are also presented. In experiments, we evaluate various aspects of the approach and obtain competitive results that may motivate further work.


2018 ◽  
Vol 74 (5) ◽  
pp. 936-950
Author(s):  
Anne Chardonnens ◽  
Ettore Rizza ◽  
Mathias Coeckelbergs ◽  
Seth van Hooland

Purpose Advanced usage of web analytics tools makes it possible to capture the content of user queries. Despite the relevance of these queries, manually analysing large volumes of them is problematic. The purpose of this paper is to address the problem of named entity recognition in digital library user queries. Design/methodology/approach The paper presents a large-scale case study conducted at the Royal Library of Belgium in its online historical newspapers platform BelgicaPress. The object of the study is a data set of 83,854 queries resulting from 29,812 visits over a 12-month period. By making use of information extraction methods, knowledge bases (KBs) and various authority files, this paper presents the possibilities and limits of identifying what percentage of end users are looking for person and place names. Findings Based on a quantitative assessment, the method can successfully identify the majority of person and place names from user queries. Due to the specific character of user queries and the nature of the KBs used, a limited number of queries remained too ambiguous to be treated in an automated manner. Originality/value This paper demonstrates in an empirical manner how user queries can be extracted from a web analytics tool and how named entities can then be mapped with KBs and authority files, in order to facilitate automated analysis of their content. Methods and tools used are generalisable and can be reused by other collection holders.


2020 ◽  
Vol 2020 ◽  
pp. 1-9
Author(s):  
Yu Wang ◽  
Yining Sun ◽  
Zuchang Ma ◽  
Lisheng Gao ◽  
Yang Xu

The medical literature contains valuable knowledge, such as the clinical symptoms, diagnosis, and treatments of a particular disease. Named Entity Recognition (NER) is the initial step in extracting this knowledge from unstructured text and presenting it as a Knowledge Graph (KG). However, previous approaches to NER have often suffered from small-scale human-labelled training data. Furthermore, extracting knowledge from Chinese medical literature is a more complex task because there is no segmentation between Chinese characters. Recently, pretraining models, which obtain representations with prior semantic knowledge from large-scale unlabelled corpora, have achieved state-of-the-art results for a wide variety of Natural Language Processing (NLP) tasks. However, the capabilities of pretraining models have not been fully exploited, and applications of pretraining models other than BERT in specific domains, such as NER in Chinese medical literature, are also of interest. In this paper, we enhance the performance of NER in Chinese medical literature using pretraining models. First, we propose a method of data augmentation by replacing the words in the training set with synonyms through the Mask Language Model (MLM), which is a pretraining task. Then, we consider NER as the downstream task of the pretraining model and transfer the prior semantic knowledge obtained during pretraining to it. Finally, we conduct experiments to compare the performances of six pretraining models (BERT, BERT-WWM, BERT-WWM-EXT, ERNIE, ERNIE-tiny, and RoBERTa) in recognizing named entities from Chinese medical literature. The effects of feature extraction and fine-tuning, as well as different downstream model structures, are also explored. Experimental results demonstrate that the proposed data augmentation method yields meaningful improvements in recognition performance. In addition, RoBERTa-CRF achieves the highest F1-score compared with the previous methods and other pretraining models.


2020 ◽  
Vol 47 (3) ◽  
pp. 547-560 ◽  
Author(s):  
Darush Yazdanfar ◽  
Peter Öhman

Purpose The purpose of this study is to empirically investigate determinants of financial distress among small and medium-sized enterprises (SMEs) during the global financial crisis and post-crisis periods. Design/methodology/approach Several statistical methods, including multiple binary logistic regression, were used to analyse a longitudinal cross-sectional panel data set of 3,865 Swedish SMEs operating in five industries over the 2008–2015 period. Findings The results suggest that financial distress is influenced by macroeconomic conditions (i.e. the global financial crisis) and, in particular, by various firm-specific characteristics (i.e. performance, financial leverage and financial distress in previous year). However, firm size and industry affiliation have no significant relationship with financial distress. Research limitations Due to data availability, this study is limited to a sample of Swedish SMEs in five industries covering eight years. Further research could examine the generalizability of these findings by investigating other firms operating in other industries and other countries. Originality/value This study is the first to examine determinants of financial distress among SMEs operating in Sweden using data from a large-scale longitudinal cross-sectional database.


2019 ◽  
Vol 9 (6) ◽  
pp. 1128 ◽  
Author(s):  
Yundong Li ◽  
Wei Hu ◽  
Han Dong ◽  
Xueyan Zhang

Aerial cameras, satellite remote sensing or unmanned aerial vehicles (UAVs) equipped with cameras can facilitate search and rescue tasks after disasters. The traditional manual interpretation of huge aerial images is inefficient and could be replaced by machine learning-based methods combined with image processing techniques. Given the development of machine learning, researchers find that convolutional neural networks can effectively extract features from images. Some target detection methods based on deep learning, such as the single-shot multibox detector (SSD) algorithm, can achieve better results than traditional methods. However, the impressive performance of machine learning-based methods depends on numerous labeled samples. Given the complexity of post-disaster scenarios, obtaining many samples in the aftermath of disasters is difficult. To address this issue, a damaged building assessment method using SSD with pretraining and data augmentation is proposed in the current study, with the following highlights. (1) Objects can be detected and classified into undamaged buildings, damaged buildings, and ruins. (2) A convolution auto-encoder (CAE) that consists of VGG16 is constructed and trained using unlabeled post-disaster images. As a transfer learning strategy, the weights of the SSD model are initialized using the weights of the CAE counterpart. (3) Data augmentation strategies, such as image mirroring, rotation, Gaussian blur, and Gaussian noise processing, are utilized to augment the training data set. As a case study, aerial images of Hurricane Sandy in 2012 were used to validate the proposed method's effectiveness. Experiments show that the pretraining strategy yields an improvement of 10% in overall accuracy compared with the SSD trained from scratch. These experiments also demonstrate that using data augmentation strategies can improve mAP and mF1 by 72% and 20%, respectively. Finally, the results are further verified on another dataset from Hurricane Irma, and it is concluded that the proposed method is feasible.
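Three of the augmentation strategies listed in this abstract (mirroring, rotation and Gaussian noise) are simple enough to sketch without an imaging library. The sketch below is illustrative only and is not the paper's pipeline: it assumes a grayscale image stored as a row-major list of lists with pixel values in 0..255, fixes rotation at 90 degrees, omits Gaussian blur, and ignores the bounding-box labels that a detector such as SSD would also need transformed. All function names are hypothetical.

```python
import random

def mirror(image):
    """Flip a row-major grayscale image left to right."""
    return [row[::-1] for row in image]

def rotate90(image):
    """Rotate the image 90 degrees clockwise."""
    return [list(col) for col in zip(*image[::-1])]

def add_gaussian_noise(image, sigma, rng):
    """Add zero-mean Gaussian noise to every pixel and clamp to 0..255."""
    return [
        [min(255, max(0, round(p + rng.gauss(0, sigma)))) for p in row]
        for row in image
    ]

def augment(image, rng, sigma=5.0):
    """Return one mirrored, one rotated and one noisy copy of the image."""
    return [mirror(image), rotate90(image), add_gaussian_noise(image, sigma, rng)]
```

Each source image then contributes several training samples, which is the point of augmentation when labelled post-disaster imagery is scarce.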


2021 ◽  
Vol 189 ◽  
pp. 292-299
Author(s):  
Caroline Sabty ◽  
Islam Omar ◽  
Fady Wasfalla ◽  
Mohamed Islam ◽  
Slim Abdennadher

Symmetry ◽  
2021 ◽  
Vol 13 (5) ◽  
pp. 845
Author(s):  
Dongheun Han ◽  
Chulwoo Lee ◽  
Hyeongyeop Kang

The neural-network-based human activity recognition (HAR) technique is being increasingly used for activity recognition of virtual reality (VR) users. The major issue with such a technique is the collection of large-scale training datasets, which are key to deriving a robust recognition model. However, collecting large-scale data is a costly and time-consuming process. Furthermore, increasing the number of activities to be classified requires a much larger number of training datasets. Since training the model with a sparse dataset can only provide limited features to recognition models, it can cause problems such as overfitting and suboptimal results. In this paper, we present a data augmentation technique named gravity control-based augmentation (GCDA) to alleviate the sparse data problem by generating new training data based on the existing data. The benefit of the symmetrical structure of the data is that it increases the amount of data while preserving its properties. The core concept of GCDA is two-fold: (1) decomposing the acceleration data obtained from the inertial measurement unit (IMU) into zero-gravity acceleration and gravitational acceleration, and augmenting them separately, and (2) exploiting gravity as a directional feature and controlling it to augment training datasets. Through comparative evaluations, we validated that applying GCDA to training datasets yielded higher classification accuracy (96.39%) than applying typical data augmentation methods (92.29%) or applying no augmentation (85.21%).
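One plausible reading of the two-step idea in this abstract is sketched below. It is not the authors' GCDA implementation: the abstract does not say how the decomposition or the gravity control is done, so the exponential low-pass filter used to estimate gravity, the rotation about the x axis, the `alpha` parameter and all function names are assumptions made for illustration.

```python
import math

def split_gravity(samples, alpha=0.9):
    """Estimate gravity with an exponential low-pass filter.

    Returns (gravity, linear), where linear is the zero-gravity remainder.
    """
    gravity, linear = [], []
    g = samples[0]
    for s in samples:
        g = tuple(alpha * gi + (1 - alpha) * si for gi, si in zip(g, s))
        gravity.append(g)
        linear.append(tuple(si - gi for si, gi in zip(s, g)))
    return gravity, linear

def rotate_about_x(v, angle):
    """Rotate a 3D vector about the x axis by angle (radians)."""
    x, y, z = v
    c, s = math.cos(angle), math.sin(angle)
    return (x, c * y - s * z, s * y + c * z)

def augment_with_gravity(samples, angle, alpha=0.9):
    """Rotate only the gravity component, then add the linear part back."""
    gravity, linear = split_gravity(samples, alpha)
    return [
        tuple(li + ri for li, ri in zip(lin, rotate_about_x(g, angle)))
        for g, lin in zip(gravity, linear)
    ]
```

Calling `augment_with_gravity` with several angles produces variants of one recording in which the sensor appears to be oriented differently relative to gravity while the motion component is left unchanged.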


Author(s):  
Chao Feng ◽  
Jie Xiong ◽  
Liqiong Chang ◽  
Fuwei Wang ◽  
Ju Wang ◽  
...  

Person identification plays a critical role in a large range of applications. Recently, RF-based person identification has become a hot research topic due to the contact-free nature of RF sensing, which is particularly appealing during the current COVID-19 pandemic. However, existing systems still have multiple limitations: i) they rely heavily on the gait patterns of users for identification; ii) they require a large amount of data to train the model and extensive retraining for new users; and iii) they require a large frequency bandwidth, which is not available on most commodity RF devices, for static person identification. This paper proposes RF-Identity, an RFID-based identification system that addresses the above limitations, and its contribution is threefold. First, by integrating walking pattern features with unique body shape features (e.g., height), RF-Identity achieves high accuracy in person identification. Second, RF-Identity develops a data augmentation scheme to expand the size of the training data set, thus reducing the human effort in data collection. Third, RF-Identity utilizes the tag diversity in the spatial domain to identify static users without the need for a large frequency bandwidth. Extensive experiments show an identification accuracy of 94.2% and 95.9% for 50 dynamic and static users, respectively.


2022 ◽  
Vol 18 (1) ◽  
pp. 1-24
Author(s):  
Yi Zhang ◽  
Yue Zheng ◽  
Guidong Zhang ◽  
Kun Qian ◽  
Chen Qian ◽  
...  

Gait, the walking manner of a person, has been perceived as a physical and behavioral trait for human identification. Compared with cameras and wearable sensors, Wi-Fi-based gait recognition is more attractive because Wi-Fi infrastructure is available almost everywhere and is able to sense passively without the requirement of on-body devices. However, existing Wi-Fi sensing approaches impose strong assumptions of fixed user walking trajectories, sufficient training data, and identification of already known users. In this article, we present GaitSense, a Wi-Fi-based human identification system, to overcome the above unrealistic assumptions. To deal with various walking trajectories and speeds, GaitSense first extracts target-specific features that best characterize gait patterns and applies novel normalization algorithms to eliminate gait-irrelevant perturbation in signals. On this basis, GaitSense reduces the training efforts in new deployment scenarios by transfer learning and data augmentation techniques. GaitSense also enables a distinct feature of illegal user identification by anomaly detection, making the system readily available for real-world deployment. Our implementation and evaluation with commodity Wi-Fi devices demonstrate a consistent identification accuracy across various deployment scenarios with few training samples, pushing the limit of gait recognition with Wi-Fi signals.


2021 ◽  
Vol ahead-of-print (ahead-of-print) ◽  
Author(s):  
Lam Hoang Viet Le ◽  
Toan Luu Duc Huynh ◽  
Bryan S. Weber ◽  
Bao Khac Quoc Nguyen

Purpose This paper aims to identify the disproportionate impacts of the COVID-19 pandemic on labor markets. Design/methodology/approach The authors conduct a large-scale survey on 16,000 firms from 82 industries in Ho Chi Minh City, Vietnam, and analyze the data set by using different machine-learning methods. Findings First, job loss and reduction in state-owned enterprises have been significantly larger than in other types of organizations. Second, employees of foreign direct investment enterprises suffer a significantly lower labor income than those of other groups. Third, the adverse effects of the COVID-19 pandemic on the labor market are heterogeneous across industries and geographies. Finally, firms with high revenue in 2019 are more likely to adopt preventive measures, including the reduction of labor forces. The authors also find a significant correlation between firms' revenue and labor reduction, as both traditional econometrics and machine-learning techniques suggest. Originality/value This study has two main policy implications. First, although government support through taxes has been provided, the authors highlight evidence that there may be some additional benefit from targeting firms that have characteristics associated with layoffs or other negative labor responses. Second, the authors provide information that shows which firm characteristics are associated with particular labor market responses such as layoffs, which may help target stimulus packages. Although the COVID-19 pandemic affects most industries and occupations, heterogeneous firm responses suggest that there could be several varieties of targeted policies, targeting firms that are likely to reduce labor forces or firms likely to face reduced revenue. In this paper, the authors outline several industries and firm characteristics which appear to be more directly reducing employee counts or having negative labor responses, which may lead to more cost-effective stimulus.

