Chinese Word Segmentation
Recently Published Documents


TOTAL DOCUMENTS

341
(FIVE YEARS 77)

H-INDEX

18
(FIVE YEARS 2)

2021 ◽  
pp. 1-13
Author(s):  
Jiawen Shi ◽  
Hong Li ◽  
Chiyu Wang ◽  
Zhicheng Pang ◽  
Jiale Zhou

Short text matching is one of the fundamental technologies in natural language processing. In previous studies, most text matching networks were originally designed for English text; the common approach to applying them to Chinese is to segment each sentence into words and then take those words as input. However, this method often results in word segmentation errors. Chinese short text matching faces the challenges of constructing effective features and understanding the semantic relationship between two sentences. In this work, we propose a novel lexicon-based pseudo-siamese model (CL2N) that can fully mine the information expressed in Chinese text. Instead of relying on a character sequence or a single word sequence, CL2N augments the text representation with multi-granularity information from characters and lexicons. Additionally, it integrates sentence-level features through single-sentence features as well as interactive features. Experiments on two Chinese text matching datasets show that our model performs better than state-of-the-art short text matching models, and that the proposed method can solve the error propagation problem of Chinese word segmentation. In particular, the incorporation of single-sentence and interactive features allows the network to capture contextual semantics and co-attentive lexical information, which contributes to our best result.
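
As a rough illustration of the idea, the PyTorch sketch below builds a pseudo-siamese matcher that fuses character-level and lexicon (word)-level views of each sentence and combines single-sentence features with simple interactive features. The layer types, sizes, and fusion scheme are illustrative assumptions, not the actual CL2N architecture.

```python
# Minimal sketch of a multi-granularity pseudo-siamese matcher.
# All hyper-parameters and the fusion scheme are illustrative assumptions.
import torch
import torch.nn as nn

class MultiGranularityEncoder(nn.Module):
    """Encodes a sentence from both its character and word (lexicon) views."""
    def __init__(self, char_vocab, word_vocab, dim=128):
        super().__init__()
        self.char_emb = nn.Embedding(char_vocab, dim, padding_idx=0)
        self.word_emb = nn.Embedding(word_vocab, dim, padding_idx=0)
        self.char_rnn = nn.LSTM(dim, dim, batch_first=True, bidirectional=True)
        self.word_rnn = nn.LSTM(dim, dim, batch_first=True, bidirectional=True)

    def forward(self, chars, words):
        c, _ = self.char_rnn(self.char_emb(chars))   # (B, Lc, 2*dim)
        w, _ = self.word_rnn(self.word_emb(words))   # (B, Lw, 2*dim)
        # Max-pool each granularity and concatenate them.
        return torch.cat([c.max(dim=1).values, w.max(dim=1).values], dim=-1)

class PseudoSiameseMatcher(nn.Module):
    """Two encoders with un-tied weights (hence 'pseudo'-siamese)."""
    def __init__(self, char_vocab, word_vocab, dim=128):
        super().__init__()
        self.enc_a = MultiGranularityEncoder(char_vocab, word_vocab, dim)
        self.enc_b = MultiGranularityEncoder(char_vocab, word_vocab, dim)
        # Single-sentence features plus simple interactive features
        # (absolute difference and element-wise product) feed the classifier.
        self.cls = nn.Sequential(nn.Linear(16 * dim, dim), nn.ReLU(),
                                 nn.Linear(dim, 2))

    def forward(self, chars_a, words_a, chars_b, words_b):
        a = self.enc_a(chars_a, words_a)             # (B, 4*dim)
        b = self.enc_b(chars_b, words_b)             # (B, 4*dim)
        feats = torch.cat([a, b, torch.abs(a - b), a * b], dim=-1)
        return self.cls(feats)                       # match / no-match logits
```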


Symmetry ◽  
2021 ◽  
Vol 13 (9) ◽  
pp. 1742
Author(s):  
Yiwei Lu ◽  
Ruopeng Yang ◽  
Xuping Jiang ◽  
Dan Zhou ◽  
Changshen Yin ◽  
...  

A great deal of operational information exists in the form of text, so extracting operational information from unstructured military text is of great significance for assisting command decision making and operations. Military relation extraction is one of the main tasks of military information extraction; it aims at identifying the relation between two named entities in unstructured military texts. However, traditional methods of extracting military relations struggle with problems such as inadequate manual features and inaccurate Chinese word segmentation in the military domain, and they fail to make full use of symmetrical entity relations in military texts. We present a Chinese military relation extraction method based on a pre-trained language model that combines a bidirectional gated recurrent unit (BiGRU) with a multi-head attention mechanism (MHATT). More specifically, we construct an embedding layer that combines word embeddings with position embeddings from the pre-trained language model; the output vectors of the BiGRU network are symmetrically spliced to learn contextual semantic features, and the multi-head attention mechanism is fused in to strengthen the expression of semantic information. We conduct extensive experiments on a military text corpus that we built and demonstrate the superiority of our method over the traditional non-attention model, attention model, and improved attention model; the comprehensive evaluation metric, F1-score, improves by about 4%.
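
As a rough illustration of this pipeline, the sketch below wires a BiGRU and multi-head attention into a sentence-level relation classifier in PyTorch. Plain word and position embeddings stand in for the pre-trained language model, and all layer sizes and the pooling step are assumptions rather than the paper's exact configuration.

```python
# Minimal sketch of a BiGRU + multi-head attention relation classifier.
# Embeddings here replace the pre-trained language model for brevity.
import torch
import torch.nn as nn

class BiGRUMHATTRelationClassifier(nn.Module):
    def __init__(self, vocab_size, num_relations, emb_dim=128, hidden=128,
                 max_len=512, heads=8):
        super().__init__()
        # Word embedding plus learned position embedding, summed per token.
        self.word_emb = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.pos_emb = nn.Embedding(max_len, emb_dim)
        self.bigru = nn.GRU(emb_dim, hidden, batch_first=True,
                            bidirectional=True)
        # Multi-head self-attention over the BiGRU outputs.
        self.mhatt = nn.MultiheadAttention(2 * hidden, heads,
                                           batch_first=True)
        self.out = nn.Linear(2 * hidden, num_relations)

    def forward(self, token_ids):
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        x = self.word_emb(token_ids) + self.pos_emb(positions)
        h, _ = self.bigru(x)                    # (B, L, 2*hidden)
        attn, _ = self.mhatt(h, h, h)           # contextual re-weighting
        sent = attn.mean(dim=1)                 # simple mean pooling
        return self.out(sent)                   # relation logits
```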


2021 ◽  
Vol 11 (18) ◽  
pp. 8682
Author(s):  
Ching-Sheng Lin ◽  
Jung-Sing Jwo ◽  
Cheng-Hsiung Lee

Clinical Named Entity Recognition (CNER) focuses on locating named entities in electronic medical records (EMRs), and the results play an important role in the development of intelligent biomedical systems. In addition to research on alphabetic languages, the study of non-alphabetic languages has attracted considerable attention as well. In this paper, a neural model is proposed to extract entities from EMRs written in Chinese. To avoid the noise caused by erroneous Chinese word segmentation, we employ character embeddings as the only feature, without extra resources. In our model, concatenated n-gram character embeddings are used to represent context semantics. A self-attention mechanism is then applied to model long-range dependencies among the embeddings. The concatenation of the new representations produced by the attention module is taken as the input to a bidirectional long short-term memory (BiLSTM) network, followed by a conditional random field (CRF) layer to extract entities. An empirical study on the CCKS-2017 Shared Task 2 dataset shows that our model outperforms other approaches.
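
As a rough sketch of the representation pipeline described above, the PyTorch snippet below approximates the concatenated n-gram character features with parallel width-1/2/3 convolutions over character embeddings, applies self-attention, and feeds a BiLSTM that emits per-character tag scores; a CRF layer (e.g. from the pytorch-crf package) would be stacked on top of the emissions. All sizes and the convolutional approximation are assumptions, not the paper's exact design.

```python
# Minimal sketch: n-gram-style character context + self-attention + BiLSTM
# producing per-character tag emissions for sequence labelling.
import torch
import torch.nn as nn

class CharNgramAttnBiLSTM(nn.Module):
    def __init__(self, vocab_size, num_tags, dim=100, hidden=128, heads=4):
        super().__init__()
        self.char_emb = nn.Embedding(vocab_size, dim, padding_idx=0)
        # Parallel convolutions stand in for concatenated 1/2/3-gram features.
        self.convs = nn.ModuleList(
            nn.Conv1d(dim, dim, k, padding=k // 2) for k in (1, 2, 3))
        self.attn = nn.MultiheadAttention(3 * dim, heads, batch_first=True)
        self.bilstm = nn.LSTM(6 * dim, hidden, batch_first=True,
                              bidirectional=True)
        self.emit = nn.Linear(2 * hidden, num_tags)

    def forward(self, char_ids):
        x = self.char_emb(char_ids)                     # (B, L, dim)
        L = x.size(1)
        xc = x.transpose(1, 2)                          # (B, dim, L)
        grams = [conv(xc)[:, :, :L] for conv in self.convs]
        g = torch.cat(grams, dim=1).transpose(1, 2)     # (B, L, 3*dim)
        attn, _ = self.attn(g, g, g)                    # long-range context
        h, _ = self.bilstm(torch.cat([g, attn], dim=-1))
        return self.emit(h)                             # (B, L, num_tags)
```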


CONVERTER ◽  
2021 ◽  
pp. 100-112
Author(s):  
Ziyu Liu ◽  
Mengying Yao

In blended learning, it is difficult for college students to find learning resources related to the courses they are taking quickly and accurately. To solve this problem, this paper proposes an online course resource recommendation method based on text similarity. First, course resource data are collected from the online learning platform with web crawler technology. Second, the data are preprocessed, which includes deleting noisy data and performing Chinese word segmentation; course similarity is then calculated with cosine similarity, and course recommendations are generated according to the similarity ranking. Third, the recommendation results are evaluated and the similarity calculation method is optimized according to the evaluation results. Finally, curriculum resources are recommended to learners according to the similarity ranking results. Based on the courses learned on the Superstar platform, the experiment recommends similar course resources from the XueYin Online platform. The results show that the text-similarity-based recommendation method can recommend relevant online course resources for learners quickly and accurately, and it has reference significance and application value for online course resource recommendation.
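
The core ranking step can be illustrated in a few lines of Python: segment course texts with a Chinese word segmenter, build TF-IDF vectors, and rank candidate resources by cosine similarity. The use of jieba and scikit-learn, the function names, and the interface are illustrative assumptions, not the system described in the paper.

```python
# Minimal sketch of text-similarity ranking for course resources,
# assuming jieba for segmentation and scikit-learn for TF-IDF / cosine.
import jieba
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def segment(text):
    # Chinese word segmentation; tokens are joined with spaces so the
    # vectorizer can split on whitespace.
    return " ".join(jieba.lcut(text))

def rank_resources(query_course, candidate_courses):
    docs = [segment(query_course)] + [segment(c) for c in candidate_courses]
    # Keep single-character tokens, which the default token pattern drops.
    tfidf = TfidfVectorizer(token_pattern=r"(?u)\S+").fit_transform(docs)
    sims = cosine_similarity(tfidf[0], tfidf[1:]).ravel()
    order = sims.argsort()[::-1]                 # most similar first
    return [(candidate_courses[i], float(sims[i])) for i in order]
```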


2021 ◽  
Author(s):  
Fei Shen ◽  
Wenting Yu ◽  
Chen Min ◽  
Qianying Ye ◽  
Chuanli Xia ◽  
...  

Text mining has been a dominant approach to extracting useful information from massive amounts of unstructured data online. However, existing tools for Chinese word segmentation are not ideal for processing Cantonese social media text. This project developed CyberCan (https://github.com/shenfei1010/CyberCan), a lexicon of contemporary Cantonese built from more than 100 million pieces of internet text. We compared CyberCan with existing Mandarin and Cantonese lexicons in terms of word segmentation performance. Findings suggest that CyberCan outperforms all existing lexicons by a considerable margin.
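
In practice, a lexicon like CyberCan is typically plugged into a dictionary-based segmenter. The sketch below shows one way to do that with jieba, assuming the lexicon is available as a file in jieba's user-dictionary format; the file name and sample sentence are illustrative, not taken from the CyberCan repository.

```python
# Minimal sketch: load a custom Cantonese lexicon into jieba and segment
# a Cantonese sentence. "CyberCan.txt" is a hypothetical file name; each
# line of a jieba user dictionary is "word [frequency] [POS-tag]".
import jieba

jieba.load_userdict("CyberCan.txt")          # hypothetical lexicon file

print(jieba.lcut("今日得閒，不如一齊睇戲啦"))  # sample Cantonese sentence
```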

