LCSTS: A Large Scale Chinese Short Text Summarization Dataset

Author(s):  
Baotian Hu ◽  
Qingcai Chen ◽  
Fangze Zhu

2002 ◽  
Vol 8 (2-3) ◽  
pp. 209-233 ◽  
Author(s):  
OLIVIER FERRET ◽  
BRIGITTE GRAU

Topic analysis is important for many applications dealing with texts, such as text summarization or information extraction. However, it can be done with great precision only if it relies on structured knowledge, which is difficult to produce on a large scale. In this paper, we propose using bootstrapping to solve this problem: a first topic analysis based on a weakly structured source of knowledge, a collocation network, is used for learning explicit topic representations that then support a more precise and reliable topic analysis.
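A minimal sketch of the bootstrapping idea described above, not the authors' implementation: a collocation network built from co-occurrence counts drives a first cohesion-based topic pass, and words from cohesive segments are aggregated into an explicit topic representation that supports a second, more reliable pass. The toy corpus, window size, and thresholds are illustrative assumptions.

```python
from collections import defaultdict
from itertools import combinations

def build_collocation_network(docs, window=5):
    """Weight word pairs by how often they co-occur within a small window."""
    weights = defaultdict(int)
    for tokens in docs:
        for i, w in enumerate(tokens):
            for v in tokens[i + 1:i + window]:
                if w != v:
                    weights[frozenset((w, v))] += 1
    return weights

def segment_cohesion(segment, network):
    """First-pass topic score: mean collocation weight inside a segment."""
    pairs = list(combinations(set(segment), 2))
    if not pairs:
        return 0.0
    return sum(network.get(frozenset(p), 0) for p in pairs) / len(pairs)

def learn_topic_representation(segments, network, threshold=0.5):
    """Bootstrap step: keep words from cohesive segments as an explicit topic."""
    topic = defaultdict(float)
    for seg in segments:
        score = segment_cohesion(seg, network)
        if score >= threshold:
            for w in set(seg):
                topic[w] += score
    return topic

def second_pass_score(segment, topic):
    """Second pass: score a segment against the learned topic signature."""
    return sum(topic.get(w, 0.0) for w in set(segment)) / max(len(set(segment)), 1)

if __name__ == "__main__":
    docs = [
        "the court ruled on the appeal and the judge sentenced the defendant".split(),
        "the judge heard the appeal before the court issued a ruling".split(),
        "the team scored a late goal and won the match".split(),
    ]
    network = build_collocation_network(docs)
    topic = learn_topic_representation(docs, network)
    for seg in docs:
        print(round(second_pass_score(seg, topic), 2), " ".join(seg[:6]), "...")
```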


Author(s):  
Yida Wang ◽  
Pei Ke ◽  
Yinhe Zheng ◽  
Kaili Huang ◽  
Yong Jiang ◽  
...  
Keyword(s):  

Symmetry ◽  
2020 ◽  
Vol 12 (2) ◽  
pp. 186 ◽  
Author(s):  
Huiming Zhu ◽  
Chunhui He ◽  
Yang Fang ◽  
Bin Ge ◽  
Meng Xing ◽  
...  

With the rapid growth of patent applications, automatically classifying accepted patent application documents accurately and quickly has become an urgent problem. Most previous studies on automatic patent classification are based on feature engineering and traditional machine learning methods such as SVM, and some even rely on the knowledge of domain experts; as a result, they suffer from low accuracy and poor generalization ability. In this paper, we propose an automatic patent classification method based on a symmetric hierarchical convolutional neural network (CNN), named PAC-HCNN. We use the title and abstract of each patent as input, and apply word segmentation and word embedding techniques to vectorize the input data. We then design a symmetric hierarchical CNN framework that classifies patents from the word embeddings; it is much more efficient than traditional RNN models on text while still capturing the past and future context of the input sequence. We also add gated linear units (GLUs) and residual connections to enable a deep CNN, and equip the model with a self-attention mechanism to address the long-term dependency problem. Experiments are performed on large-scale datasets for Chinese short-text patent classification. The experimental results demonstrate the effectiveness of the proposed model, which significantly and consistently outperforms other state-of-the-art models on both fine-grained and coarse-grained classification.
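A minimal PyTorch sketch of the ingredients named in this abstract (GLU convolutions with residual connections plus self-attention pooling over word embeddings). It is not the authors' PAC-HCNN: the two-field title/abstract fusion stands in for the hierarchy, and all sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GLUConvBlock(nn.Module):
    """1-D convolution with a gated linear unit and a residual connection."""
    def __init__(self, dim, kernel_size=3):
        super().__init__()
        # 2*dim output channels: half are values, half are gates for the GLU.
        self.conv = nn.Conv1d(dim, 2 * dim, kernel_size, padding=kernel_size // 2)

    def forward(self, x):            # x: (batch, dim, seq_len)
        return x + F.glu(self.conv(x), dim=1)

class SelfAttentionPool(nn.Module):
    """Additive self-attention that pools a sequence into one vector."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, x):            # x: (batch, seq_len, dim)
        alpha = torch.softmax(self.score(x), dim=1)
        return (alpha * x).sum(dim=1)

class GLUCNNClassifier(nn.Module):
    """Word-level GLU-CNN per field (title, abstract), fused for classification."""
    def __init__(self, vocab_size, dim=128, num_classes=10, depth=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim, padding_idx=0)
        self.blocks = nn.Sequential(*[GLUConvBlock(dim) for _ in range(depth)])
        self.pool = SelfAttentionPool(dim)
        self.classifier = nn.Linear(2 * dim, num_classes)

    def encode(self, ids):           # ids: (batch, seq_len)
        h = self.embed(ids).transpose(1, 2)     # -> (batch, dim, seq_len)
        h = self.blocks(h).transpose(1, 2)      # -> (batch, seq_len, dim)
        return self.pool(h)

    def forward(self, title_ids, abstract_ids):
        fused = torch.cat([self.encode(title_ids), self.encode(abstract_ids)], dim=-1)
        return self.classifier(fused)

if __name__ == "__main__":
    model = GLUCNNClassifier(vocab_size=5000)
    title = torch.randint(1, 5000, (4, 20))
    abstract = torch.randint(1, 5000, (4, 200))
    print(model(title, abstract).shape)         # torch.Size([4, 10])
```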


Author(s):  
Jianwei Niu ◽  
Qingjuan Zhao ◽  
Lei Wang ◽  
Huan Chen ◽  
Mohammed Atiquzzaman ◽  
...  

Text summarization is an area of research whose goal is to produce short text from large text documents. Extractive text summarization methods have been studied extensively by many researchers. Multi-document inputs vary widely, ranging across formats, domains, and topic-specific collections. With the application of neural networks to text generation, interest in abstractive text summarization has increased significantly. This article applies the approach to English and Telugu. Recurrent neural networks are a subtype of recursive neural networks that predict the next element of a sequence from the current state together with information carried over from previous states. Using neural networks also allows summaries to be generated for long input texts. The work implements semantics-based filtering using a similarity matrix while keeping all stop words. Similarity is computed over semantic concepts using the Jiang similarity measure, and a recurrent neural network (RNN) with an attention mechanism is used to generate the summary. ROUGE scores are used to measure the performance of the method on Telugu and English.
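A compact PyTorch sketch of an attention-based RNN encoder-decoder of the kind this abstract applies to abstractive summarization: a generic GRU seq2seq with additive attention, not the authors' exact model. The semantic-filtering and Jiang-similarity step is omitted, and all sizes are assumptions.

```python
import torch
import torch.nn as nn

class AttnSeq2Seq(nn.Module):
    def __init__(self, vocab_size, dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim, padding_idx=0)
        self.encoder = nn.GRU(dim, dim, batch_first=True, bidirectional=True)
        self.decoder = nn.GRUCell(dim + 2 * dim, dim)
        self.attn = nn.Linear(2 * dim + dim, 1)   # additive attention score
        self.bridge = nn.Linear(2 * dim, dim)     # initialise decoder state
        self.out = nn.Linear(dim, vocab_size)

    def forward(self, src_ids, tgt_ids):
        enc, _ = self.encoder(self.embed(src_ids))          # (B, S, 2*dim)
        state = torch.tanh(self.bridge(enc.mean(dim=1)))    # (B, dim)
        logits = []
        for t in range(tgt_ids.size(1)):                    # teacher forcing
            # Attention: score each source position against the decoder state.
            scores = self.attn(torch.cat(
                [enc, state.unsqueeze(1).expand(-1, enc.size(1), -1)], dim=-1))
            context = (torch.softmax(scores, dim=1) * enc).sum(dim=1)  # (B, 2*dim)
            step_in = torch.cat([self.embed(tgt_ids[:, t]), context], dim=-1)
            state = self.decoder(step_in, state)
            logits.append(self.out(state))
        return torch.stack(logits, dim=1)                   # (B, T, vocab)

if __name__ == "__main__":
    model = AttnSeq2Seq(vocab_size=8000)
    src = torch.randint(1, 8000, (2, 50))
    tgt = torch.randint(1, 8000, (2, 12))
    print(model(src, tgt).shape)       # torch.Size([2, 12, 8000])
```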


2020 ◽  
Vol 34 (05) ◽  
pp. 7651-7658 ◽  
Author(s):  
Yang Deng ◽  
Wai Lam ◽  
Yuexiang Xie ◽  
Daoyuan Chen ◽  
Yaliang Li ◽  
...  

Community question answering (CQA) has gained increasing popularity in both academia and industry in recent years. However, the redundancy and length of crowdsourced answers limit the performance of answer selection and lead to reading difficulties and misunderstandings for community users. To solve these problems, we tackle the tasks of answer selection and answer summary generation in CQA with a novel joint learning model. Specifically, we design a question-driven pointer-generator network that exploits the correlation between question-answer pairs to help attend to the essential information when generating answer summaries. Meanwhile, we leverage the answer summaries to alleviate noise in the original lengthy answers when ranking the relevancy of question-answer pairs. In addition, we construct a new large-scale CQA corpus, WikiHowQA, which contains long answers for answer selection as well as reference summaries for answer summarization. The experimental results show that the joint learning method effectively addresses the answer redundancy issue in CQA and achieves state-of-the-art results on both answer selection and text summarization. Furthermore, the proposed model shows strong transferability and applicability to resource-poor CQA tasks that lack reference answer summaries.
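A hedged PyTorch sketch of the joint-learning idea in this abstract: one shared encoder, an answer-selection head scoring question-answer pairs, and a decoder head for answer summaries, trained with a combined loss. It is a generic simplification, not the paper's question-driven pointer-generator; the copy mechanism and target shifting are omitted, and all sizes are made up.

```python
import torch
import torch.nn as nn

class JointSelectSummarize(nn.Module):
    def __init__(self, vocab_size, dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim, padding_idx=0)
        self.encoder = nn.GRU(dim, dim, batch_first=True)
        self.select_head = nn.Linear(2 * dim, 1)          # relevancy of (q, a)
        self.decoder = nn.GRU(dim, dim, batch_first=True)
        self.generate_head = nn.Linear(dim, vocab_size)   # summary token logits

    def encode(self, ids):
        _, h = self.encoder(self.embed(ids))
        return h.squeeze(0)                               # (B, dim)

    def forward(self, q_ids, a_ids, summary_ids):
        q, a = self.encode(q_ids), self.encode(a_ids)
        select_logit = self.select_head(torch.cat([q, a], dim=-1)).squeeze(-1)
        # Question-conditioned decoding: initialise the decoder with q + a.
        dec_out, _ = self.decoder(self.embed(summary_ids), (q + a).unsqueeze(0))
        return select_logit, self.generate_head(dec_out)

def joint_loss(select_logit, gen_logits, label, summary_ids, alpha=0.5):
    """Combined objective: answer selection + summary generation."""
    sel = nn.functional.binary_cross_entropy_with_logits(select_logit, label)
    gen = nn.functional.cross_entropy(
        gen_logits.flatten(0, 1), summary_ids.flatten(), ignore_index=0)
    return alpha * sel + (1 - alpha) * gen

if __name__ == "__main__":
    model = JointSelectSummarize(vocab_size=8000)
    q = torch.randint(1, 8000, (4, 15))
    a = torch.randint(1, 8000, (4, 120))
    s = torch.randint(1, 8000, (4, 30))
    label = torch.tensor([1.0, 0.0, 1.0, 0.0])
    print(joint_loss(*model(q, a, s), label, s).item())
```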


Author(s):  
Weidong Liu ◽  
Xiangfeng Luo ◽  
Jun Shu ◽  
Dandan Jiang

As social media platforms proliferate on the web, linking large numbers of unordered short texts into semantically coherent discourse has become a practical problem: such texts cover widely scattered topics and carry weak associative relations, abundant noise, and heavy redundancy. The challenge is twofold: what knowledge foundation should support the sentence linking process, and how should these unordered short texts be linked to achieve good coherence? The authors develop a bridging-inference-based sentence linking model that simulates the discourse bridging process of human readers and narrows the semantic coherence gaps between short texts. The model supports the linking process with both implicit and explicit knowledge and proposes different bridging inference schemas to guide it. Under different schemas, the linking process yields different kinds of semantic coherence, including central, concise, and layered semantics. To validate the model, the authors conduct several experiments, and the results confirm that the proposed bridging-inference-based sentence linking process increases semantic coherence. The model can be applied to short-text organization, e-learning, e-science, semantic web search, and online question-answering systems in future work.


2021 ◽  
Author(s):  
Khanh Quoc Tran ◽  
Phap Ngoc Trinh ◽  
Khoa Nguyen-Anh Tran ◽  
An Tran-Hoai Le ◽  
Luan Van Ha ◽  
...  

In this paper, we build UIT-ViON (Vietnamese Online Newspaper), a new dataset collected from well-known Vietnamese online newspapers. We collect, process, and create the dataset, and then experiment with different machine learning models. In particular, we propose an open-domain, large-scale, and high-quality dataset consisting of 260,000 textual data points annotated with multiple labels for evaluating Vietnamese short text classification. In addition, we present an approach using transformer-based learning (PhoBERT) for Vietnamese short text classification on this dataset, which outperforms traditional machine learning models (Naive Bayes and Logistic Regression) and deep learning models (Text-CNN and LSTM). The proposed approach achieves an F1-score of 80.62%. This is a positive result and a premise for developing an automatic news classification system. The study should significantly save time, cost, and human effort, and make it easier for readers to find news related to topics of interest. In the future, we will propose solutions to improve the quality of the dataset and the performance of the classification models.
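A minimal fine-tuning sketch of the PhoBERT-based approach described above, using the Hugging Face transformers API. The number of labels and the example headlines are placeholder assumptions, the UIT-ViON data itself is not loaded here, and PhoBERT normally expects word-segmented Vietnamese input (e.g. produced with VnCoreNLP) rather than raw sentences.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

NUM_CLASSES = 13                                  # assumption: one label per news category

tokenizer = AutoTokenizer.from_pretrained("vinai/phobert-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "vinai/phobert-base", num_labels=NUM_CLASSES)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

# Placeholder batch standing in for UIT-ViON headlines and their labels.
texts = ["Giá vàng tăng mạnh trong phiên giao dịch sáng nay",
         "Đội tuyển quốc gia giành chiến thắng đậm trên sân khách"]
labels = torch.tensor([0, 1])

batch = tokenizer(texts, padding=True, truncation=True,
                  max_length=64, return_tensors="pt")

model.train()
outputs = model(**batch, labels=labels)           # cross-entropy loss is computed internally
outputs.loss.backward()
optimizer.step()
print(float(outputs.loss))
```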

