Improving the Performance of the Extractive Text Summarization by a Novel Topic Modeling and Sentence Embedding Technique using SBERT

Text summarization identifies the important information present in text documents. This computational technique aims to generate a shorter version of the source text that includes only its relevant and salient information. In this paper, we propose a novel method that summarizes a text document by clustering its contents according to latent topics produced by topic modeling techniques and generating an extractive summary for each identified text cluster. The extractive sub-summaries are then combined into a summary of the source document. We use the less commonly used and challenging WikiHow dataset, which differs from the news datasets commonly available for text summarization. The well-known news datasets present their most important information in the first few lines of the source text, which makes summarizing them less challenging than summarizing the WikiHow dataset. In contrast, the documents in the WikiHow dataset are written in a more general style and have lower abstractedness and a higher compression ratio, thus posing a greater challenge for summary generation. Many current state-of-the-art text summarization techniques tend to drop important information from source documents in favor of brevity. Our proposed technique aims to capture all the varied information present in source documents. Although the dataset proved challenging, extensive tests within our experimental setup show that our model produces encouraging ROUGE results and summaries compared with other published extractive and abstractive text summarization models.
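The cluster-then-extract idea described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: plain bag-of-words counts stand in for SBERT sentence embeddings, and the hand-written `topic_keywords` dictionary is a hypothetical stand-in for the output of a topic model. All function names are assumptions made for illustration.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase a string and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def summarize_by_topic(sentences, topic_keywords):
    """Group sentences by best-matching topic, then keep one sentence per group.

    topic_keywords is a hypothetical stand-in for topic-model output:
    a dict mapping a topic label to a list of characteristic words.
    """
    topics = {label: Counter(words) for label, words in topic_keywords.items()}
    clusters = defaultdict(list)
    for index, sentence in enumerate(sentences):
        vector = Counter(tokenize(sentence))
        # Assign the sentence to the topic whose keyword vector is closest.
        best = max(topics, key=lambda label: cosine(vector, topics[label]))
        clusters[best].append((index, vector))

    chosen = []
    for members in clusters.values():
        # Centroid: summed word counts of every sentence in the cluster.
        centroid = Counter()
        for _, vector in members:
            centroid.update(vector)
        # The sentence closest to the centroid serves as the sub-summary.
        index, _ = max(members, key=lambda m: cosine(m[1], centroid))
        chosen.append(index)

    # Combine the sub-summaries in original document order.
    return [sentences[i] for i in sorted(chosen)]


sentences = [
    "Preheat the oven before you mix the dough.",
    "Mix the dough with flour and water.",
    "Wash the pan with soap after baking.",
    "Dry the pan with a towel.",
]
topic_keywords = {
    "baking": ["oven", "dough", "flour"],
    "cleaning": ["wash", "pan", "soap", "towel"],
}
summary = summarize_by_topic(sentences, topic_keywords)
```

With these toy inputs the summary contains one sentence from the baking cluster and one from the cleaning cluster, so both topics of the source survive in the output.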

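Eliminasi Non-Topic Menggunakan Pemodelan Topik untuk Peringkasan Otomatis Data Tweet dengan Konteks Covid-19 (Non-Topic Elimination Using Topic Modeling for Automatic Summarization of Tweet Data in the COVID-19 Context)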

Jurnal Teknologi Informasi dan Ilmu Komputer ◽ 10.25126/jtiik.0814324 ◽ 2021 ◽ Vol 8 (1) ◽ pp. 199

Author(s): Putri Damayanti ◽ Diana Purwitasari ◽ Nanik Suciati

Keyword(s): Topic Modeling ◽ Modeling Method ◽ Text Summarization ◽ Word Embedding ◽ Test Results ◽ Automatic Summarization ◽ The Public ◽ Twitter Data ◽ Processing Data ◽ Embedding Methods

Twitter accounts such as Suara Surabaya can help spread information about COVID-19, although they also cover other topics such as accidents and traffic jams. Text summarization can be applied to Twitter data because of the large number of tweets available, making it easier to obtain the latest important information about COVID-19. The wide variety of topics in tweet text leads to poor summaries. Therefore, tweets unrelated to the context need to be eliminated before summarization is carried out. The contribution of this research is the use of a topic modeling method as one stage in a series of data elimination steps. Topic modeling as a data elimination technique can be used in various cases, but this research focuses on COVID-19. The aim is to make it easier for the public to obtain current information in concise form. The stages are pre-processing, data elimination using topic modeling, and automatic summarization. The study compares combinations of several word embedding, topic modeling, and automatic summarization methods. The summaries from each combination are evaluated with ROUGE to find the best combination. The test results show that the combination of Word2Vec, LSI, and TextRank has the best ROUGE value, 0.67, while the combination of TFIDF, LDA, and Okapi BM25 has the lowest, 0.35.
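To make the ROUGE values quoted above more concrete, here is a minimal sketch of ROUGE-1, which scores a candidate summary by its unigram overlap with a reference summary. The abstract does not say which ROUGE variant was used, so ROUGE-1 is an assumption here, and the function name `rouge_1` is chosen only for illustration.

```python
import re
from collections import Counter


def rouge_1(candidate, reference):
    """Return (precision, recall, f1) based on clipped unigram overlap."""
    cand = Counter(re.findall(r"\w+", candidate.lower()))
    ref = Counter(re.findall(r"\w+", reference.lower()))
    # Counter intersection keeps the minimum count of each shared word.
    overlap = sum((cand & ref).values())
    precision = overlap / sum(cand.values()) if cand else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1


precision, recall, f1 = rouge_1(
    "the cat sat on the mat",
    "the cat lay on the mat",
)
```

In this toy pair, five of the six candidate words also occur in the reference, so precision, recall, and F1 all equal 5/6. A score of 0.67 versus 0.35 therefore indicates a substantially larger overlap with the reference summaries.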

Two-Level Text Summarization Using Topic Modeling

Advances in Intelligent Systems and Computing - Intelligent System Design ◽ 10.1007/978-981-15-5400-1_16 ◽ 2020 ◽ pp. 153-167

Author(s): Dhannuri Saikumar ◽ P. Subathra

Keyword(s): Topic Modeling ◽ Text Summarization

A new graph-based extractive text summarization using keywords or topic modeling

Journal of Ambient Intelligence and Humanized Computing ◽ 10.1007/s12652-020-02591-x ◽ 2020

Author(s): Ramesh Chandra Belwal ◽ Sawan Rai ◽ Atul Gupta

Keyword(s): Topic Modeling ◽ Text Summarization

An extractive text summarization approach using tagged-LDA based topic modeling

Multimedia Tools and Applications ◽ 10.1007/s11042-020-09549-3 ◽ 2020

Author(s): Ruby Rani ◽ D. K. Lobiyal

Keyword(s): Topic Modeling ◽ Text Summarization

Self-Tuned Descriptive Document Clustering using a Predictive Network

International Journal of Scientific Research in Science Engineering and Technology ◽ 10.32628/ijsrset21841135 ◽ 2019 ◽ pp. 320-331

Author(s): K. Syed Kousar Niasi ◽ P. Sidheshwari

Keyword(s): Subject Matter ◽ Search Engines ◽ Topic Modeling ◽ Semantic Analysis ◽ Document Clustering ◽ Text Summarization ◽ Document Management ◽ Superior Performance ◽ Medical Database ◽ Query Result

A document network is a collection of documents connected by links. Document clustering has become ubiquitous because of the widespread use of online databases such as academic search engines. Topic modeling has become a widely used tool for document management because of its superior performance. However, few topic models differentiate the importance of documents on different topics. In this survey, we apply text ranking algorithms to documents to improve topic modeling and propose incorporating link-based ranking into topic modeling. Text summarization plays an important role in information retrieval; the snippets that web search engines generate for each query result are one application of it. In existing text summarization techniques, indexing is based on the words present in the document, and the index consists of an array of posting lists. Document features such as term frequency and text length are used to assign indexing weights to words. Specifically, topical rank is used to compute the topic-level ranking of documents, which indicates the importance of documents on different topics. By treating the topical rank of a document as the probability that the document is involved in the corresponding topic, a generalized relation is built between ranking and topic modeling. In this thesis, we implement a topic discovery model for a large medical database. The datasets are used for training, and key terms are extracted using text mining and fuzzy latent semantic analysis (FLSA), a novel topic modeling approach based on a fuzzy perspective. FLSA can handle the redundancy problem in health and medical corpora and provides a new method for estimating the number of topics.
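The word-based indexing mentioned in this abstract can be sketched as a small inverted index. The abstract does not give the weighting formula, so dividing term frequency by text length is an assumption chosen as one simple way to combine the two features, and `build_index` is a hypothetical name.

```python
import re
from collections import Counter, defaultdict


def build_index(documents):
    """Map each term to a posting list of (doc_id, weight) pairs.

    The weight is term frequency divided by document length, an assumed
    formula that combines the two document features named in the abstract.
    """
    index = defaultdict(list)
    for doc_id, text in enumerate(documents):
        tokens = re.findall(r"\w+", text.lower())
        counts = Counter(tokens)
        for term, count in counts.items():
            # Posting lists stay sorted by doc_id because documents are
            # processed in order.
            index[term].append((doc_id, count / len(tokens)))
    return index


docs = [
    "topic models rank documents",
    "documents link to documents",
]
index = build_index(docs)
postings = index["documents"]
```

Here `postings` holds one entry per document containing the word "documents", and the second document receives the larger weight because the term makes up half of its tokens.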
