A Large-Scale Chinese Short-Text Conversation Dataset

Detecting Near-Duplicates in Large-Scale Short Text Databases

Advances in Knowledge Discovery and Data Mining - Lecture Notes in Computer Science ◽

10.1007/978-3-540-68125-0_87 ◽

2008 ◽

pp. 877-883 ◽

Cited By ~ 11

Author(s):

Caichun Gong ◽

Yulan Huang ◽

Xueqi Cheng ◽

Shuo Bai

Keyword(s):

Large Scale ◽

Short Text ◽

Text Databases

Patent Automatic Classification Based on Symmetric Hierarchical Convolution Neural Network

Symmetry ◽

10.3390/sym12020186 ◽

2020 ◽

Vol 12 (2) ◽

pp. 186 ◽

Cited By ~ 1

Author(s):

Huiming Zhu ◽

Chunhui He ◽

Yang Fang ◽

Bin Ge ◽

Meng Xing ◽

...

Keyword(s):

Neural Network ◽

Input Data ◽

Large Scale ◽

Patent Application ◽

Input Sequence ◽

Automatic Classification ◽

Convolution Neural Network ◽

Coarse Grained ◽

Short Text ◽

Patent Classification

With the rapid growth of patent applications, automatically classifying accepted patent application documents accurately and quickly has become an urgent problem. Most previous studies on automatic patent classification rely on feature engineering and traditional machine learning methods such as SVM, and some even depend on the knowledge of domain experts; as a result, they suffer from low accuracy and generalize poorly. In this paper, we propose an automatic patent classification method based on a symmetric hierarchical convolution neural network (CNN), named PAC-HCNN. We use the title and abstract of each patent as the input data, then segment the text and vectorize it with word embeddings. We then design a symmetric hierarchical CNN framework that classifies patents from the word embeddings; it is much more efficient than traditional RNN models for text while still retaining the history and future information of the input sequence. We also add gated linear units (GLUs) and residual connections to help realize the deep CNN. Additionally, we equip the model with a self-attention mechanism to address the long-term dependency problem. Experiments are performed on large-scale datasets for Chinese short-text patent classification. The experimental results demonstrate the model's effectiveness: it significantly and consistently outperforms other state-of-the-art models on both fine-grained and coarse-grained classification.
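The abstract mentions gated linear units and residual connections but gives no layer details. As a rough illustration only, a single gated convolution block with a skip connection might look like the sketch below. It is not the PAC-HCNN architecture: the function names, the kernel width, and the zero padding are all assumptions made for this example. The kernel is centred on each position, so every output sees both earlier and later tokens.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def conv1d_same(x, weight, bias):
    """1-D convolution over a sequence with zero padding.

    x:      T rows, each a list of d_in features (e.g. word embeddings)
    weight: weight[k][i][o], kernel of odd width K mapping d_in -> d_out
    bias:   list of d_out values
    The kernel is centred on position t, so it covers past and future tokens.
    """
    T, K = len(x), len(weight)
    d_in, d_out = len(weight[0]), len(bias)
    half = K // 2
    out = []
    for t in range(T):
        row = list(bias)
        for k in range(K):
            s = t + k - half          # source position covered by kernel tap k
            if 0 <= s < T:            # positions outside the sequence count as zero
                for i in range(d_in):
                    for o in range(d_out):
                        row[o] += x[s][i] * weight[k][i][o]
        out.append(row)
    return out

def glu_residual_block(x, w_a, b_a, w_b, b_b):
    """y = x + conv_a(x) * sigmoid(conv_b(x)): a GLU with a residual connection."""
    a = conv1d_same(x, w_a, b_a)      # candidate values
    g = conv1d_same(x, w_b, b_b)      # gate logits
    return [[xi + ai * sigmoid(gi) for xi, ai, gi in zip(xr, ar, gr)]
            for xr, ar, gr in zip(x, a, g)]

def zero_kernel(K, d):
    """Hypothetical helper: an all-zero kernel of width K for d -> d features."""
    return [[[0.0] * d for _ in range(d)] for _ in range(K)]

# Toy input: 3 tokens with 2-dimensional embeddings.
x = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
y = glu_residual_block(x, zero_kernel(3, 2), [1.0, 1.0], zero_kernel(3, 2), [0.0, 0.0])
```

With zero kernels the value path is the bias 1.0 and the gate is sigmoid(0) = 0.5, so each output is the input plus 0.5. The skip connection keeps the input signal intact even when the gate is partly closed, which is the usual motivation for residual connections in deep stacks.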

Bridging Inference Based Sentence Linking Model for Semantic Coherence

International Journal of Cognitive Informatics and Natural Intelligence ◽

10.4018/ijcini.2016010103 ◽

2016 ◽

Vol 10 (1) ◽

pp. 32-54

Author(s):

Weidong Liu ◽

Xiangfeng Luo ◽

Jun Shu ◽

Dandan Jiang

Keyword(s):

Large Scale ◽

Question Answering ◽

Explicit Knowledge ◽

Human Beings ◽

Short Text ◽

Weak Associate ◽

Semantic Coherence ◽

E Learning ◽

Implicit And Explicit ◽

Bridging Inference

As social media platforms emerge on the web, linking large volumes of unordered short texts with semantic coherence is becoming a practical problem, since these short texts have vast decentralized topics, weak associate relations, abundant noise and large redundancy. The challenges in solving this problem include what knowledge foundation supports the sentence linking process and how to link these unordered short texts to achieve good coherence. Herein, the authors develop a bridging-inference-based sentence linking model by simulating human beings' discourse bridging process, which narrows the semantic coherence gaps between short texts. The model supports the linking process with implicit and explicit knowledge and proposes different bridging inference schemas to guide it. Under different schemas, the linking process generates different kinds of semantic coherence, including central semantics, concise semantics and layered semantics. To validate the model, the authors conduct experiments. The experimental results confirm that the proposed linking process increases semantic coherence. In future work, the model could be used in short-text origination, e-learning, e-science, web semantic search, and online question-answering systems.

An Empirical Investigation of Online News Classification on an Open-Domain, Large-Scale and High-Quality Dataset in Vietnamese

10.3233/faia210036 ◽

2021 ◽

Author(s):

Khanh Quoc Tran ◽

Phap Ngoc Trinh ◽

Khoa Nguyen-Anh Tran ◽

An Tran-Hoai Le ◽

Luan Van Ha ◽

...

Keyword(s):

Machine Learning ◽

Text Classification ◽

Large Scale ◽

Online News ◽

Open Domain ◽

High Quality ◽

Short Text ◽

Online Newspapers ◽

Online Newspaper ◽

Data Points

In this paper, we build a new dataset, UIT-ViON (Vietnamese Online Newspaper), collected from well-known Vietnamese online newspapers. We collect, process, and create the dataset, then experiment with different machine learning models. In particular, we propose an open-domain, large-scale, and high-quality dataset consisting of 260,000 textual data points annotated with multiple labels for evaluating Vietnamese short text classification. In addition, we present an approach using transformer-based learning (PhoBERT) for Vietnamese short text classification on the dataset, which outperforms traditional machine learning (Naive Bayes and Logistic Regression) and deep learning (Text-CNN and LSTM) models. The proposed approach achieves an F1-score of 80.62%. This is a positive result and a premise for developing an automatic news classification system. The study aims to save significant time, cost, and human resources and to make it easier for readers to find news on topics that interest them. In the future, we will propose solutions to improve the quality of the dataset and the performance of the classification models.
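For readers unfamiliar with the metric, the sketch below shows how an F1-score is computed for a classifier over several news categories. The abstract does not say which averaging scheme produced the reported figure, so the macro-averaging used here is an assumption, and the labels and predictions are invented toy values rather than data from UIT-ViON.

```python
def f1_for_label(y_true, y_pred, label):
    """F1 for one class: harmonic mean of precision and recall."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
    if tp == 0:
        return 0.0                    # no correct predictions for this class
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 (assumed averaging scheme)."""
    labels = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_label(y_true, y_pred, l) for l in labels) / len(labels)

# Toy example with two hypothetical categories.
y_true = ["sports", "sports", "tech", "tech"]
y_pred = ["sports", "tech", "tech", "tech"]
score = macro_f1(y_true, y_pred)
```

Here "sports" has precision 1 and recall 1/2 (F1 = 2/3), while "tech" has precision 2/3 and recall 1 (F1 = 4/5), so the macro average is about 0.733.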

Transfer Learning in Large-Scale Short Text Analysis

Knowledge Science, Engineering and Management - Lecture Notes in Computer Science ◽

10.1007/978-3-319-25159-2_45 ◽

2015 ◽

pp. 499-511 ◽

Cited By ~ 2

Author(s):

Yan Chu ◽

Zhengkui Wang ◽

Man Chen ◽

Linlin Xia ◽

Fengmei Wei ◽

...

Keyword(s):

Transfer Learning ◽

Text Analysis ◽

Large Scale ◽

Short Text

Sentiment Analysis by Exploring Large Scale Web-based Chinese Short Text

DEStech Transactions on Computer Science and Engineering ◽

10.12783/dtcse/csae2017/17572 ◽

2018 ◽

Author(s):

Ziyu Liu ◽

Yonggang Qi ◽

Zhanyu Ma ◽

Jie Yang

Keyword(s):

Sentiment Analysis ◽

Large Scale ◽

Web Based ◽

Short Text

LCSTS: A Large Scale Chinese Short Text Summarization Dataset

10.18653/v1/d15-1229 ◽

2015 ◽

Cited By ~ 48

Author(s):

Baotian Hu ◽

Qingcai Chen ◽

Fangze Zhu

Keyword(s):

Large Scale ◽

Text Summarization ◽

Short Text

Health assistant: answering your questions anytime from biomedical literature

Bioinformatics ◽

10.1093/bioinformatics/btz195 ◽

2019 ◽

Vol 35 (20) ◽

pp. 4129-4139 ◽

Cited By ~ 2

Author(s):

Zan-Xia Jin ◽

Bo-Wen Zhang ◽

Fan Fang ◽

Le-Le Zhang ◽

Xu-Cheng Yin

Keyword(s):

Query Processing ◽

Large Scale ◽

Question Answering ◽

State Of The Art ◽

Document Retrieval ◽

Biomedical Literature ◽

Search Process ◽

Short Text ◽

Benchmark Datasets ◽

Health Assistant

Motivation: With abundant medical resources available online, especially literature, people can understand their own health status and related problems autonomously. However, obtaining the most appropriate answer from an increasingly large-scale database remains a great challenge. Here, we present a biomedical question answering framework and implement a system, Health Assistant, to enable the search process.

Methods: In Health Assistant, a search engine is first designed to rank biomedical documents based on their contents. Various query processing and search techniques are then used to find the relevant documents. Afterwards, the titles and abstracts of the top-N documents are extracted to generate candidate snippets. Finally, our own query processing and retrieval approaches for short text are applied to locate the relevant snippets that answer the questions.

Results: Our system is evaluated on the BioASQ benchmark datasets, and the experimental results demonstrate its effectiveness and robustness compared to BioASQ participant systems and some state-of-the-art methods on both document retrieval and snippet retrieval tasks.

Availability and implementation: A demo of our system is available at https://github.com/jinzanxia/biomedical-QA.
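The two-stage flow described in this abstract (rank documents, keep the top N, cut their titles and abstracts into candidate snippets, then rank the snippets) can be sketched in a few lines. This is only an illustration of the flow, not the Health Assistant implementation: the term-overlap score stands in for the system's unspecified ranking methods, and the function names, document fields, and toy data are all hypothetical.

```python
import re

def tokens(text):
    """Lowercased set of alphanumeric terms."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def overlap(query, text):
    """Fraction of query terms found in text (a stand-in relevance score)."""
    q = tokens(query)
    return len(q & tokens(text)) / len(q) if q else 0.0

def answer_snippets(query, docs, top_n=2, top_k=2):
    # Stage 1: rank whole documents and keep the top N.
    ranked = sorted(docs,
                    key=lambda d: overlap(query, d["title"] + " " + d["abstract"]),
                    reverse=True)
    # Stage 2: titles and abstract sentences of those documents become candidates.
    candidates = []
    for d in ranked[:top_n]:
        candidates.append(d["title"])
        for sentence in re.split(r"(?<=[.!?])\s+", d["abstract"]):
            if sentence.strip():
                candidates.append(sentence.strip())
    # Stage 3: rank the short candidate snippets against the question.
    return sorted(candidates, key=lambda s: overlap(query, s), reverse=True)[:top_k]

# Invented toy documents for demonstration.
docs = [
    {"title": "Aspirin and headache",
     "abstract": "Aspirin relieves mild headache. Dosage varies by age."},
    {"title": "Protein folding",
     "abstract": "Chaperones assist folding."},
]
best = answer_snippets("aspirin relieves headache", docs, top_n=1, top_k=1)
```

Restricting snippet extraction to the top-N documents keeps the second, finer-grained ranking step cheap, which is the usual reason for splitting retrieval into a document stage and a snippet stage.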
