LCSTS: A Large Scale Chinese Short Text Summarization Dataset

Author(s):  
Baotian Hu ◽  
Qingcai Chen ◽  
Fangze Zhu

2002 ◽  
Vol 8 (2-3) ◽  
pp. 209-233 ◽  
Author(s):  
OLIVIER FERRET ◽  
BRIGITTE GRAU

Topic analysis is important for many applications dealing with texts, such as text summarization or information extraction. However, it can be done with great precision only if it relies on structured knowledge, which is difficult to produce on a large scale. In this paper, we propose using bootstrapping to solve this problem: a first topic analysis based on a weakly structured source of knowledge, a collocation network, is used for learning explicit topic representations that then support a more precise and reliable topic analysis.
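A minimal sketch of the bootstrapping idea described above, not the authors' implementation: a collocation network built from co-occurrence counts drives a first cohesion-based topic pass, and words from cohesive segments are aggregated into an explicit topic representation that supports a second, more reliable pass. The toy corpus, window size, and thresholds are illustrative assumptions.

```python
from collections import defaultdict
from itertools import combinations

def build_collocation_network(docs, window=5):
    """Weight word pairs by how often they co-occur within a small window."""
    weights = defaultdict(int)
    for tokens in docs:
        for i, w in enumerate(tokens):
            for v in tokens[i + 1:i + window]:
                if w != v:
                    weights[frozenset((w, v))] += 1
    return weights

def segment_cohesion(segment, network):
    """First-pass topic score: mean collocation weight inside a segment."""
    pairs = list(combinations(set(segment), 2))
    if not pairs:
        return 0.0
    return sum(network.get(frozenset(p), 0) for p in pairs) / len(pairs)

def learn_topic_representation(segments, network, threshold=0.5):
    """Bootstrap step: keep words from cohesive segments as an explicit topic."""
    topic = defaultdict(float)
    for seg in segments:
        score = segment_cohesion(seg, network)
        if score >= threshold:
            for w in set(seg):
                topic[w] += score
    return topic

def second_pass_score(segment, topic):
    """Second pass: score a segment against the learned topic signature."""
    return sum(topic.get(w, 0.0) for w in set(segment)) / max(len(set(segment)), 1)

if __name__ == "__main__":
    docs = [
        "the court ruled on the appeal and the judge sentenced the defendant".split(),
        "the judge heard the appeal before the court issued a ruling".split(),
        "the team scored a late goal and won the match".split(),
    ]
    network = build_collocation_network(docs)
    topic = learn_topic_representation(docs, network)
    for seg in docs:
        print(round(second_pass_score(seg, topic), 2), " ".join(seg[:6]), "...")
```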


Author(s):  
Yida Wang ◽  
Pei Ke ◽  
Yinhe Zheng ◽  
Kaili Huang ◽  
Yong Jiang ◽  
...  
Keyword(s):  

Symmetry ◽  
2020 ◽  
Vol 12 (2) ◽  
pp. 186 ◽  
Author(s):  
Huiming Zhu ◽  
Chunhui He ◽  
Yang Fang ◽  
Bin Ge ◽  
Meng Xing ◽  
...  

With the rapid growth of patent applications, automatically classifying accepted patent application documents accurately and quickly has become an urgent problem. Most previous studies on automatic patent classification are based on feature engineering and traditional machine learning methods such as SVM, and some even rely on the knowledge of domain experts; as a result, they suffer from low accuracy and poor generalization ability. In this paper, we propose an automatic patent classification method based on a symmetric hierarchical convolutional neural network (CNN), named PAC-HCNN. We use the title and abstract of each patent as input, and apply word segmentation and word embedding techniques to vectorize the input data. We then design a symmetric hierarchical CNN framework that classifies patents from the word embeddings; it is much more efficient than traditional RNN models on text while still capturing the past and future context of the input sequence. We also add gated linear units (GLUs) and residual connections to enable a deep CNN, and equip the model with a self-attention mechanism to address the long-term dependency problem. Experiments are performed on large-scale datasets for Chinese short-text patent classification. The experimental results demonstrate the effectiveness of the proposed model, which significantly and consistently outperforms other state-of-the-art models on both fine-grained and coarse-grained classification.
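A minimal PyTorch sketch of the ingredients named in this abstract (GLU convolutions with residual connections plus self-attention pooling over word embeddings). It is not the authors' PAC-HCNN: the two-field title/abstract fusion stands in for the hierarchy, and all sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GLUConvBlock(nn.Module):
    """1-D convolution with a gated linear unit and a residual connection."""
    def __init__(self, dim, kernel_size=3):
        super().__init__()
        # 2*dim output channels: half are values, half are gates for the GLU.
        self.conv = nn.Conv1d(dim, 2 * dim, kernel_size, padding=kernel_size // 2)

    def forward(self, x):            # x: (batch, dim, seq_len)
        return x + F.glu(self.conv(x), dim=1)

class SelfAttentionPool(nn.Module):
    """Additive self-attention that pools a sequence into one vector."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, x):            # x: (batch, seq_len, dim)
        alpha = torch.softmax(self.score(x), dim=1)
        return (alpha * x).sum(dim=1)

class GLUCNNClassifier(nn.Module):
    """Word-level GLU-CNN per field (title, abstract), fused for classification."""
    def __init__(self, vocab_size, dim=128, num_classes=10, depth=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim, padding_idx=0)
        self.blocks = nn.Sequential(*[GLUConvBlock(dim) for _ in range(depth)])
        self.pool = SelfAttentionPool(dim)
        self.classifier = nn.Linear(2 * dim, num_classes)

    def encode(self, ids):           # ids: (batch, seq_len)
        h = self.embed(ids).transpose(1, 2)     # -> (batch, dim, seq_len)
        h = self.blocks(h).transpose(1, 2)      # -> (batch, seq_len, dim)
        return self.pool(h)

    def forward(self, title_ids, abstract_ids):
        fused = torch.cat([self.encode(title_ids), self.encode(abstract_ids)], dim=-1)
        return self.classifier(fused)

if __name__ == "__main__":
    model = GLUCNNClassifier(vocab_size=5000)
    title = torch.randint(1, 5000, (4, 20))
    abstract = torch.randint(1, 5000, (4, 200))
    print(model(title, abstract).shape)         # torch.Size([4, 10])
```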


Author(s):  
Jianwei Niu ◽  
Qingjuan Zhao ◽  
Lei Wang ◽  
Huan Chen ◽  
Mohammed Atiquzzaman ◽  
...  

Text summarization is an area of research whose goal is to produce short text from large text documents. Extractive text summarization methods have been studied extensively by many researchers. Multi-document inputs vary widely, ranging across formats, domains, and topic-specific collections. With the application of neural networks to text generation, interest in abstractive text summarization has increased significantly. This article applies the approach to English and Telugu. Recurrent neural networks are a subtype of recursive neural networks that predict the next element of a sequence from the current state together with information carried over from previous states. Using neural networks also allows summaries to be generated for long input texts. The work implements semantics-based filtering using a similarity matrix while keeping all stop words. Similarity is computed over semantic concepts using the Jiang similarity measure, and a recurrent neural network (RNN) with an attention mechanism is used to generate the summary. ROUGE scores are used to measure the performance of the method on Telugu and English.
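A compact PyTorch sketch of an attention-based RNN encoder-decoder of the kind this abstract applies to abstractive summarization: a generic GRU seq2seq with additive attention, not the authors' exact model. The semantic-filtering and Jiang-similarity step is omitted, and all sizes are assumptions.

```python
import torch
import torch.nn as nn

class AttnSeq2Seq(nn.Module):
    def __init__(self, vocab_size, dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim, padding_idx=0)
        self.encoder = nn.GRU(dim, dim, batch_first=True, bidirectional=True)
        self.decoder = nn.GRUCell(dim + 2 * dim, dim)
        self.attn = nn.Linear(2 * dim + dim, 1)   # additive attention score
        self.bridge = nn.Linear(2 * dim, dim)     # initialise decoder state
        self.out = nn.Linear(dim, vocab_size)

    def forward(self, src_ids, tgt_ids):
        enc, _ = self.encoder(self.embed(src_ids))          # (B, S, 2*dim)
        state = torch.tanh(self.bridge(enc.mean(dim=1)))    # (B, dim)
        logits = []
        for t in range(tgt_ids.size(1)):                    # teacher forcing
            # Attention: score each source position against the decoder state.
            scores = self.attn(torch.cat(
                [enc, state.unsqueeze(1).expand(-1, enc.size(1), -1)], dim=-1))
            context = (torch.softmax(scores, dim=1) * enc).sum(dim=1)  # (B, 2*dim)
            step_in = torch.cat([self.embed(tgt_ids[:, t]), context], dim=-1)
            state = self.decoder(step_in, state)
            logits.append(self.out(state))
        return torch.stack(logits, dim=1)                   # (B, T, vocab)

if __name__ == "__main__":
    model = AttnSeq2Seq(vocab_size=8000)
    src = torch.randint(1, 8000, (2, 50))
    tgt = torch.randint(1, 8000, (2, 12))
    print(model(src, tgt).shape)       # torch.Size([2, 12, 8000])
```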


2020 ◽  
Vol 34 (05) ◽  
pp. 7651-7658 ◽  
Author(s):  
Yang Deng ◽  
Wai Lam ◽  
Yuexiang Xie ◽  
Daoyuan Chen ◽  
Yaliang Li ◽  
...  

Community question answering (CQA) has gained increasing popularity in both academia and industry in recent years. However, the redundancy and length of crowdsourced answers limit the performance of answer selection and lead to reading difficulties and misunderstandings for community users. To solve these problems, we tackle the tasks of answer selection and answer summary generation in CQA with a novel joint learning model. Specifically, we design a question-driven pointer-generator network that exploits the correlation between question-answer pairs to help attend to the essential information when generating answer summaries. Meanwhile, we leverage the answer summaries to alleviate noise in the original lengthy answers when ranking the relevancy of question-answer pairs. In addition, we construct a new large-scale CQA corpus, WikiHowQA, which contains long answers for answer selection as well as reference summaries for answer summarization. The experimental results show that the joint learning method effectively addresses the answer redundancy issue in CQA and achieves state-of-the-art results on both answer selection and text summarization. Furthermore, the proposed model shows strong transferability and applicability to resource-poor CQA tasks that lack reference answer summaries.
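A hedged PyTorch sketch of the joint-learning idea in this abstract: one shared encoder, an answer-selection head scoring question-answer pairs, and a decoder head for answer summaries, trained with a combined loss. It is a generic simplification, not the paper's question-driven pointer-generator; the copy mechanism and target shifting are omitted, and all sizes are made up.

```python
import torch
import torch.nn as nn

class JointSelectSummarize(nn.Module):
    def __init__(self, vocab_size, dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim, padding_idx=0)
        self.encoder = nn.GRU(dim, dim, batch_first=True)
        self.select_head = nn.Linear(2 * dim, 1)          # relevancy of (q, a)
        self.decoder = nn.GRU(dim, dim, batch_first=True)
        self.generate_head = nn.Linear(dim, vocab_size)   # summary token logits

    def encode(self, ids):
        _, h = self.encoder(self.embed(ids))
        return h.squeeze(0)                               # (B, dim)

    def forward(self, q_ids, a_ids, summary_ids):
        q, a = self.encode(q_ids), self.encode(a_ids)
        select_logit = self.select_head(torch.cat([q, a], dim=-1)).squeeze(-1)
        # Question-conditioned decoding: initialise the decoder with q + a.
        dec_out, _ = self.decoder(self.embed(summary_ids), (q + a).unsqueeze(0))
        return select_logit, self.generate_head(dec_out)

def joint_loss(select_logit, gen_logits, label, summary_ids, alpha=0.5):
    """Combined objective: answer selection + summary generation."""
    sel = nn.functional.binary_cross_entropy_with_logits(select_logit, label)
    gen = nn.functional.cross_entropy(
        gen_logits.flatten(0, 1), summary_ids.flatten(), ignore_index=0)
    return alpha * sel + (1 - alpha) * gen

if __name__ == "__main__":
    model = JointSelectSummarize(vocab_size=8000)
    q = torch.randint(1, 8000, (4, 15))
    a = torch.randint(1, 8000, (4, 120))
    s = torch.randint(1, 8000, (4, 30))
    label = torch.tensor([1.0, 0.0, 1.0, 0.0])
    print(joint_loss(*model(q, a, s), label, s).item())
```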


Author(s):  
Weidong Liu ◽  
Xiangfeng Luo ◽  
Jun Shu ◽  
Dandan Jiang

As social media platforms proliferate on the web, linking large numbers of unordered short texts into semantically coherent discourse has become a practical problem: such texts cover widely scattered topics and carry weak associative relations, abundant noise, and heavy redundancy. The challenge is twofold: what knowledge foundation should support the sentence linking process, and how should these unordered short texts be linked to achieve good coherence? The authors develop a bridging-inference-based sentence linking model that simulates the discourse bridging process of human readers and narrows the semantic coherence gaps between short texts. The model supports the linking process with both implicit and explicit knowledge and proposes different bridging inference schemas to guide it. Under different schemas, the linking process yields different kinds of semantic coherence, including central, concise, and layered semantics. To validate the model, the authors conduct several experiments, and the results confirm that the proposed bridging-inference-based sentence linking process increases semantic coherence. The model can be applied to short-text organization, e-learning, e-science, semantic web search, and online question-answering systems in future work.


2021 ◽  
Author(s):  
Khanh Quoc Tran ◽  
Phap Ngoc Trinh ◽  
Khoa Nguyen-Anh Tran ◽  
An Tran-Hoai Le ◽  
Luan Van Ha ◽  
...  

In this paper, we build UIT-ViON (Vietnamese Online Newspaper), a new dataset collected from well-known Vietnamese online newspapers. We collect, process, and create the dataset, and then experiment with different machine learning models. In particular, we propose an open-domain, large-scale, and high-quality dataset consisting of 260,000 textual data points annotated with multiple labels for evaluating Vietnamese short text classification. In addition, we present an approach using transformer-based learning (PhoBERT) for Vietnamese short text classification on this dataset, which outperforms traditional machine learning models (Naive Bayes and Logistic Regression) and deep learning models (Text-CNN and LSTM). The proposed approach achieves an F1-score of 80.62%. This is a positive result and a premise for developing an automatic news classification system. The study should significantly save time, cost, and human effort, and make it easier for readers to find news related to topics of interest. In the future, we will propose solutions to improve the quality of the dataset and the performance of the classification models.
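A minimal fine-tuning sketch of the PhoBERT-based approach described above, using the Hugging Face transformers API. The number of labels and the example headlines are placeholder assumptions, the UIT-ViON data itself is not loaded here, and PhoBERT normally expects word-segmented Vietnamese input (e.g. produced with VnCoreNLP) rather than raw sentences.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

NUM_CLASSES = 13                                  # assumption: one label per news category

tokenizer = AutoTokenizer.from_pretrained("vinai/phobert-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "vinai/phobert-base", num_labels=NUM_CLASSES)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

# Placeholder batch standing in for UIT-ViON headlines and their labels.
texts = ["Giá vàng tăng mạnh trong phiên giao dịch sáng nay",
         "Đội tuyển quốc gia giành chiến thắng đậm trên sân khách"]
labels = torch.tensor([0, 1])

batch = tokenizer(texts, padding=True, truncation=True,
                  max_length=64, return_tensors="pt")

model.train()
outputs = model(**batch, labels=labels)           # cross-entropy loss is computed internally
outputs.loss.backward()
optimizer.step()
print(float(outputs.loss))
```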

