A Large-Scale Chinese Short-Text Conversation Dataset

Author(s):  
Yida Wang ◽  
Pei Ke ◽  
Yinhe Zheng ◽  
Kaili Huang ◽  
Yong Jiang ◽  
...  
Symmetry ◽  
2020 ◽  
Vol 12 (2) ◽  
pp. 186 ◽  
Author(s):  
Huiming Zhu ◽  
Chunhui He ◽  
Yang Fang ◽  
Bin Ge ◽  
Meng Xing ◽  
...  

With the rapid growth of patent applications, classifying accepted patent application documents automatically, accurately, and quickly has become an urgent problem. Most previous studies on automatic patent classification are based on feature engineering and traditional machine learning methods such as SVM, and some even rely on the knowledge of domain experts; hence they suffer from low accuracy and poor generalization ability. In this paper, we propose an automatic patent classification method based on a symmetric hierarchical convolutional neural network (CNN), named PAC-HCNN. We use the title and abstract of the patent as the input data, and then apply word embedding techniques to segment and vectorize the input. We then design a symmetric hierarchical CNN framework to classify the patents based on the word embeddings, which is much more efficient than traditional RNN models for text while still keeping the history and future information of the input sequence. We also add gated linear units (GLUs) and residual connections to help realize the deep CNN. Additionally, we equip our model with a self-attention mechanism to address the long-term dependency problem. Experiments are performed on large-scale datasets for Chinese short-text patent classification. Experimental results demonstrate the effectiveness of the proposed model, which significantly and consistently outperforms other state-of-the-art models on both fine-grained and coarse-grained classification.
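The combination of a symmetric convolution window, a gated linear unit, and a residual connection mentioned in this abstract can be illustrated with a minimal pure-Python sketch. This is not the PAC-HCNN implementation: the function names (`glu`, `conv_glu_residual`), the weight layout, and the zero padding are assumptions chosen for illustration only. The window is centred on each position, so every output sees both earlier and later tokens.

```python
import math
from typing import List, Sequence


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def glu(pre: Sequence[float]) -> List[float]:
    """Gated linear unit: the first half of `pre` is the content,
    the second half is passed through a sigmoid and used as a gate."""
    half = len(pre) // 2
    return [a * sigmoid(b) for a, b in zip(pre[:half], pre[half:])]


def conv_glu_residual(
    seq: List[List[float]],
    weights: List[List[List[float]]],
    bias: List[float],
) -> List[List[float]]:
    """One convolution + GLU + residual layer (hypothetical layout).

    seq:     T token vectors of size d
    weights: k matrices (k odd), each with 2*d rows of d values;
             weights[j][o] maps the token at window offset j to output channel o
    bias:    2*d values
    Returns T vectors of size d: x_t + GLU(conv(x)_t).
    """
    d = len(seq[0])
    k = len(weights)
    pad = k // 2  # symmetric window: pad positions on each side
    zero = [0.0] * d
    out = []
    for t in range(len(seq)):
        pre = list(bias)
        for j in range(k):
            idx = t + j - pad
            x = seq[idx] if 0 <= idx < len(seq) else zero  # zero padding
            for o in range(2 * d):
                pre[o] += sum(w * xi for w, xi in zip(weights[j][o], x))
        gated = glu(pre)
        # Residual connection: add the layer input back to the gated output.
        out.append([xi + gi for xi, gi in zip(seq[t], gated)])
    return out
```

With all weights and biases set to zero the gated term vanishes and the layer returns its input unchanged, which is the property that makes stacking many such layers easier to train.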


Author(s):  
Weidong Liu ◽  
Xiangfeng Luo ◽  
Jun Shu ◽  
Dandan Jiang

As various social media emerge on the web, linking large numbers of unordered short texts with semantic coherence is becoming a practical problem, since these short texts have vast decentralized topics, weak associative relations, abundant noise, and large redundancy. The challenging issues in solving this problem include what knowledge foundation supports the sentence linking process and how to link these unordered short texts to achieve good coherence. Herein, the authors develop a bridging-inference-based sentence linking model that simulates the human discourse bridging process and narrows the semantic coherence gaps between short texts. The model supports the linking process with implicit and explicit knowledge and proposes different bridging inference schemas to guide it. Under different schemas, the linking process generates different kinds of semantic coherence, including central semantics, concise semantics, and layered semantics. To validate the model, the authors conduct experiments. The results confirm that the proposed bridging-inference-based sentence linking process increases semantic coherence. In future work, the model can be used in short-text origination, e-learning, e-science, web semantic search, and online question-answering systems.


2021 ◽  
Author(s):  
Khanh Quoc Tran ◽  
Phap Ngoc Trinh ◽  
Khoa Nguyen-Anh Tran ◽  
An Tran-Hoai Le ◽  
Luan Van Ha ◽  
...  

In this paper, we build a new dataset, UIT-ViON (Vietnamese Online Newspaper), collected from well-known Vietnamese online newspapers. We collect, process, and create the dataset, then experiment with different machine learning models. In particular, we propose an open-domain, large-scale, and high-quality dataset consisting of 260,000 textual data points annotated with multiple labels for evaluating Vietnamese short text classification. In addition, we present an approach using transformer-based learning (PhoBERT) for Vietnamese short text classification on the dataset, which outperforms traditional machine learning (Naive Bayes and Logistic Regression) and deep learning (Text-CNN and LSTM). The proposed approach achieves an F1-score of 80.62%. This is a positive result and a premise for developing an automatic news classification system. The study aims to save significant time, cost, and human resources and to make it easier for readers to find news on topics that interest them. In the future, we will propose solutions to improve the quality of the dataset and the performance of the classification models.


Author(s):  
Yan Chu ◽  
Zhengkui Wang ◽  
Man Chen ◽  
Linlin Xia ◽  
Fengmei Wei ◽  
...  

2019 ◽  
Vol 35 (20) ◽  
pp. 4129-4139 ◽  
Author(s):  
Zan-Xia Jin ◽  
Bo-Wen Zhang ◽  
Fan Fang ◽  
Le-Le Zhang ◽  
Xu-Cheng Yin

Motivation: With abundant medical resources, especially literature, available online, people can understand their own health status and relevant problems autonomously. However, obtaining the most appropriate answer from an increasingly large-scale database remains a great challenge. Here, we present a biomedical question answering framework and implement a system, Health Assistant, to enable the search process.

Methods: In Health Assistant, a search engine is first designed to rank biomedical documents based on their contents. Then various query processing and search techniques are utilized to find the relevant documents. Afterwards, the titles and abstracts of the top-N documents are extracted to generate candidate snippets. Finally, our own query processing and retrieval approaches for short text are applied to locate the relevant snippets that answer the questions.

Results: Our system is evaluated on the BioASQ benchmark datasets, and experimental results demonstrate its effectiveness and robustness compared to BioASQ participant systems and some state-of-the-art methods on both document retrieval and snippet retrieval tasks.

Availability and implementation: A demo of our system is available at https://github.com/jinzanxia/biomedical-QA.
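The two-stage flow described in this abstract (rank documents, take the top-N, split their titles and abstracts into candidate snippets, then rank the snippets) can be sketched as follows. This is only an illustration and not the Health Assistant system: the scoring here is plain term overlap, swapped in for the authors' unspecified query processing and retrieval techniques, and every name (`tokenize`, `overlap_score`, `answer_snippets`, the `title`/`abstract` keys) is hypothetical.

```python
import re
from collections import Counter
from typing import Dict, List


def tokenize(text: str) -> List[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


def overlap_score(query_tokens: List[str], text: str) -> int:
    """Toy relevance score (an assumption): how often the distinct
    query terms occur in the text."""
    counts = Counter(tokenize(text))
    return sum(counts[t] for t in set(query_tokens))


def answer_snippets(
    query: str, docs: List[Dict[str, str]], top_n: int = 2, top_k: int = 2
) -> List[str]:
    q = tokenize(query)
    # Stage 1: rank whole documents and keep the top-N.
    ranked = sorted(
        docs,
        key=lambda d: overlap_score(q, d["title"] + " " + d["abstract"]),
        reverse=True,
    )[:top_n]
    # Stage 2: titles and abstract sentences become candidate snippets.
    candidates: List[str] = []
    for d in ranked:
        candidates.append(d["title"])
        for sentence in re.split(r"(?<=[.!?])\s+", d["abstract"]):
            if sentence.strip():
                candidates.append(sentence.strip())
    # Stage 3: rank snippets against the query, dropping non-matching ones.
    scored = sorted(candidates, key=lambda s: overlap_score(q, s), reverse=True)
    return [s for s in scored[:top_k] if overlap_score(q, s) > 0]


docs = [
    {"title": "Aspirin and headache",
     "abstract": "Aspirin relieves headache pain. It may irritate the stomach."},
    {"title": "Soil chemistry", "abstract": "Soil pH affects crops."},
]
result = answer_snippets("Does aspirin relieve headache?", docs, top_n=1, top_k=2)
```

Splitting the work into a coarse document stage and a fine snippet stage keeps the more expensive snippet scoring restricted to a small candidate set.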

