An Extractive Summarization Technique for Text Documents

To read and search information quickly, documents must be reduced in size without altering their content. Automatic text summarization addresses this problem: it condenses large input documents into smaller ones while preserving their meaning and their relevance to the original. Text summarization is the shortening of text into accurate, meaningful sentences. This paper presents an implementation that summarizes a document by scoring its sentences using a term frequency-inverse document frequency (TF-IDF) matrix. The entire record is compressed so that only the relevant sentences are retained. The technique is applicable to tasks such as automating text documents and enabling quicker understanding of documents through summarization.
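A minimal sketch of this kind of TF-IDF sentence scoring (the function name and the sentence-splitting rule are illustrative assumptions; the paper's exact matrix construction is not specified here):

```python
# Sketch of TF-IDF sentence scoring for extractive summarization.
# Each sentence is treated as a "document" when computing IDF (one common
# convention, assumed here; the paper's matrix construction may differ).
import math
import re
from collections import Counter

def summarize(text, n_sentences=2):
    sentences = [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]
    tokenized = [re.findall(r'\w+', s.lower()) for s in sentences]
    n = len(sentences)
    # Document frequency: in how many sentences each term appears.
    df = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = sum((tf[t] / len(toks)) * math.log(n / df[t]) for t in tf) if toks else 0.0
        scores.append(score)
    # Keep the top-scoring sentences, emitted in their original order.
    top = sorted(sorted(range(n), key=lambda i: -scores[i])[:n_sentences])
    return ' '.join(sentences[i] for i in top)
```

Sentences dominated by terms that occur in many other sentences (low IDF) score poorly and are dropped first, which is what compresses the record down to its distinctive content.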

2021, Vol 7 (2), pp. 153
Author(s): Yunita Maulidia Sari, Nenden Siti Fatonah

Rapid technological development makes it easier for us to find the information we need. Problems arise when that information is very abundant: the more information a module contains, the longer its text becomes, and it takes considerable time to grasp the module's core information. One solution for quickly extracting the essence of an entire module is to read its summary, and a fast way to obtain a summary of a document is automatic text summarization. Automatic text summarization produces a text from one or more documents that conveys the important information of the original source and is automatically no longer than half of the original document. This study aims to produce automatic summaries of Indonesian-language learning modules and to measure the summarization accuracy of the Cross Latent Semantic Analysis (CLSA) method. The data consist of 10 learning-module files written by lecturers at Universitas Mercu Buana: 5 files in .docx format and 5 in .pdf format. The study applies Term Frequency-Inverse Document Frequency (TF-IDF) for word weighting and Cross Latent Semantic Analysis (CLSA) for text summarization. Summarization accuracy was tested by comparing manual human summaries against the system's summaries. The highest average f-measure, precision, and recall were obtained at a compression rate of 20%, with values of 0.3853, 0.432, and 0.3715, respectively.
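The reported precision, recall, and f-measure could be computed as the overlap between system and manual summary sentences; a sketch under that assumption (the paper's exact matching protocol may differ):

```python
# Sketch of summary evaluation against a human reference, assuming the
# common sentence-overlap convention:
#   precision = |system ∩ manual| / |system|
#   recall    = |system ∩ manual| / |manual|
def evaluate(system_sents, manual_sents):
    system, manual = set(system_sents), set(manual_sents)
    overlap = len(system & manual)
    precision = overlap / len(system) if system else 0.0
    recall = overlap / len(manual) if manual else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return precision, recall, f_measure
```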


Author(s): Hans Christian, Mikhael Pramodana Agus, Derwin Suhartono

The increasing availability of online information has triggered intensive research on automatic text summarization within Natural Language Processing (NLP). Text summarization reduces text by removing the less useful information, which helps the reader find the required information quickly. Many kinds of algorithms can be used to summarize text; one of them is TF-IDF (Term Frequency-Inverse Document Frequency). This research aimed to produce an automatic text summarizer implemented with the TF-IDF algorithm and to compare it with various online automatic text summarizers. To evaluate the summary produced by each summarizer, the F-measure was used as the standard comparison value. The summarizer achieved 67% accuracy on three data samples, higher than the other online summarizers.


2021, Vol 11 (2), pp. 303-312
Author(s): Nnaemeka M Oparauwah, Juliet N Odii, Ikechukwu I Ayogu, Vitalis C Iwuchukwu

The need to extract and manage vital information contained in copious volumes of text documents has given rise to several automatic text summarization (ATS) approaches. ATS has found application in academic research, medical health-record analysis, content creation and search engine optimization, finance, and media. This study presents a boundary-based tokenization method for extractive text summarization. The proposed method performs word tokenization by defining word boundaries rather than relying on specific delimiters. An extractive summarization algorithm was then developed based on the proposed boundary-based tokenization method, with word-length consideration used to control redundancy in the summary output. Experimental results showed that the proposed approach improved word tokenization by improving the selection of appropriate keywords from text documents for summarization.
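The abstract does not give the exact boundary rules, but the idea of recognizing words by their boundaries instead of splitting on fixed delimiters can be sketched with regex word boundaries (an illustrative assumption, not the paper's definition):

```python
# Sketch of boundary-based word tokenization: tokens are found wherever
# word/non-word transitions (\b) enclose word characters, so punctuation
# never sticks to a token the way it does with plain delimiter splitting.
import re

def boundary_tokenize(text):
    return re.findall(r'\b\w+\b', text)
```

By contrast, delimiter-based splitting such as `"Hello, world!".split()` leaves the punctuation attached (`['Hello,', 'world!']`), which pollutes the keyword counts a summarizer relies on.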


Author(s): Mahsa Afsharizadeh, Hossein Ebrahimpour-Komleh, Ayoub Bagheri

Purpose: The COVID-19 pandemic has created an emergency for the medical community. Researchers must study the scientific literature extensively in order to discover drugs and vaccines. In a situation where every minute is valuable for saving hundreds of lives, quick understanding of scientific articles helps the medical community, and automatic text summarization makes this possible. Materials and Methods: This study proposes a recurrent neural network-based extractive summarization method. The extractive approach identifies the informative parts of the text; recurrent neural networks are very powerful for analyzing sequences such as text. The proposed method has three phases: sentence encoding, sentence ranking, and summary generation. To improve the performance of the summarization system, a coreference resolution procedure is used. Coreference resolution identifies the mentions in the text that refer to the same real-world entity, which aids the summarization process by discovering the central subject of the text. Results: The proposed method is evaluated on COVID-19 research articles extracted from the CORD-19 dataset. The results show that combining a recurrent neural network with coreference resolution embedding vectors improves the performance of the summarization system. The proposed method achieves a ROUGE-1 recall of 0.53, demonstrating the benefit of coreference resolution embedding vectors in the RNN-based summarization system. Conclusion: In this study, coreference information is stored in the form of coreference embedding vectors. The joint use of a recurrent neural network and coreference resolution yields an efficient summarization system.
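The ROUGE-1 recall reported above measures how many of the reference summary's unigrams the system summary reproduces; a minimal sketch of the metric (whitespace tokenization assumed):

```python
# Sketch of ROUGE-1 recall: the fraction of the reference summary's
# unigrams that also occur in the system summary, with per-word counts
# clipped as in the standard ROUGE definition.
from collections import Counter

def rouge1_recall(system, reference):
    sys_counts = Counter(system.lower().split())
    ref_counts = Counter(reference.lower().split())
    overlap = sum(min(c, sys_counts[w]) for w, c in ref_counts.items())
    return overlap / sum(ref_counts.values()) if ref_counts else 0.0
```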


2020, Vol 17 (9), pp. 4368-4374
Author(s): Perpetua F. Noronha, Madhu Bhan

Huge amounts of digital data are being generated persistently at an unparalleled, exponential rate. In this digital era, where the internet is the prime source of information, it is vital to develop better means of mining the available information rapidly and capably. Manually extracting salient information from large input text documents is time-consuming and inefficient; in a fast-moving world it is difficult to read all the text content and derive insights from it, so automatic methods are required. Probing for relevant documents among the large number of available sources and consuming apt information from them is a challenging and pressing task. Automatic text summarization can generate relevant, high-quality information in less time. Text summarization condenses the source text into a brief summary while maintaining its salient information and readability. Generating summaries automatically is in great demand given the growing amount of text data available online, in order to mark out the significant information and consume it faster. Text summarization is becoming extremely popular with advances in Natural Language Processing (NLP) and deep learning methods; its most important gain is that it reduces the analysis time. In this paper we survey the key approaches to automatic text summarization and discuss their efficiency and limitations.


2021, Vol 11 (22), pp. 10511
Author(s): Muhammad Mohsin, Shazad Latif, Muhammad Haneef, Usman Tariq, Muhammad Attique Khan, et al.

Automatic Text Summarization (ATS) is gaining attention because a large volume of data is being generated at an exponential rate. With easy internet availability worldwide, large amounts of data come from social networking websites, news websites, and blogs. Manual summarization is time-consuming, and it is difficult to read and summarize such volumes of content; automatic text summarization is the solution to this problem. This study proposes two automatic text summarization models: Genetic Algorithm with Hierarchical Clustering (GA-HC) and Particle Swarm Optimization with Hierarchical Clustering (PSO-HC). The proposed models use a word embedding model with a hierarchical clustering algorithm to group sentences that convey almost the same meaning. Modified GA- and adaptive PSO-based sentence ranking models are proposed for summarizing news text documents. Simulations are conducted and compared against the other studied algorithms to evaluate the proposed methodology, and the results validate its superior performance.
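A simplified stand-in for the sentence-grouping step: single-linkage agglomerative clustering over cosine similarity of sentence vectors. The vectors, the linkage choice, and the threshold below are illustrative assumptions, not the paper's exact GA-HC/PSO-HC setup:

```python
# Sketch: group sentences whose embedding vectors are nearly parallel, so a
# summarizer can keep one representative per cluster and avoid redundancy.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def cluster(vectors, threshold=0.9):
    clusters = [[i] for i in range(len(vectors))]
    merged = True
    while merged:
        merged = False
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: merge if ANY cross-cluster pair is similar.
                if any(cosine(vectors[a], vectors[b]) >= threshold
                       for a in clusters[i] for b in clusters[j]):
                    clusters[i] += clusters.pop(j)
                    merged = True
                    break
            if merged:
                break
    return clusters
```

In practice the vectors would come from a trained word embedding model (e.g. averaged word vectors per sentence), and the GA/PSO ranking step would then pick sentences across clusters.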


2021, Vol 50 (3), pp. 458-469
Author(s): Gang Sun, Zhongxin Wang, Jia Zhao

In the era of big data, information overload problems are becoming increasingly prominent. It is challenging for machines to understand, compress and filter massive text information through the use of artificial intelligence technology. The emergence of automatic text summarization mainly aims at solving the problem of information overload, and it can be divided into two types: extractive and abstractive. The former finds some key sentences or phrases from the original text and combines them into a summarization; the latter needs a computer to understand the content of the original text and then uses readable language for the human to summarize the key information of the original text. This paper presents a two-stage optimization method for automatic text summarization that combines abstractive summarization and extractive summarization. First, a sequence-to-sequence model with the attention mechanism is trained as a baseline model to generate initial summarization. Second, it is updated and optimized directly on the ROUGE metric by using deep reinforcement learning (DRL). Experimental results show that compared with the baseline model, ROUGE-1, ROUGE-2, and ROUGE-L have been increased on the LCSTS dataset and the CNN/DailyMail dataset.
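The ROUGE-L score used here as a direct optimization target is based on the longest common subsequence between system and reference tokens; a minimal sketch in its recall form (the paper may report F-scores):

```python
# Sketch of ROUGE-L recall: length of the longest common subsequence (LCS)
# of the two token sequences, divided by the reference length. Computed
# with the standard O(m*n) dynamic-programming table.
def lcs_len(a, b):
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[-1][-1]

def rouge_l_recall(system, reference):
    sys_toks, ref_toks = system.split(), reference.split()
    return lcs_len(sys_toks, ref_toks) / len(ref_toks) if ref_toks else 0.0
```

Because LCS respects token order without requiring contiguity, rewarding it with reinforcement learning pushes the generator toward summaries whose wording tracks the reference more closely than unigram overlap alone would.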


Author(s): Manju Lata Joshi, Nisheeth Joshi, Namita Mittal

Creating a coherent summary of a text is a challenging task in the field of Natural Language Processing (NLP). Various automatic text summarization techniques have been developed for both abstractive and extractive summarization. This study focuses on extractive summarization, a process that selects delineative paragraphs or sentences from the original text and combines them into a form smaller than the source document(s) to generate a summary. Methods used for extractive summarization include graph-theoretic approaches, machine learning, Latent Semantic Analysis (LSA), neural networks, clustering, and fuzzy logic. In this paper, a semantic graph-based approach, SGATS (Semantic Graph-based approach for Automatic Text Summarization), is proposed to generate an extractive summary. The proposed approach constructs a semantic graph of the original Hindi text document by establishing semantic relationships between the sentences of the document, using Hindi Wordnet ontology as a background knowledge source. Once the semantic graph is constructed, fourteen different graph-theoretical measures are applied to rank the document sentences by their semantic scores. The approach is applied to two datasets from the Tourism and Health domains, and its performance is compared with the state-of-the-art TextRank algorithm and a human-annotated summary, evaluated using the widely accepted ROUGE measures. The outcomes show that the proposed system produces better results than TextRank on the health-domain corpus and comparable results on the tourism corpus. Further, correlation coefficient methods are applied to find correlations among eight different graphical measures, and most of the graphical measures are observed to be highly correlated.
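The TextRank baseline the paper compares against can be sketched as power iteration of PageRank over a sentence-similarity graph. The similarity matrix below is a toy input; SGATS instead derives edges from Hindi Wordnet semantic relations:

```python
# Sketch of PageRank over a weighted sentence-similarity graph, as used by
# TextRank-style extractive summarizers: each sentence's score is the
# damped sum of its neighbors' scores, weighted by normalized edge weight.
def pagerank(sim, d=0.85, iters=50):
    n = len(sim)
    scores = [1.0 / n] * n
    for _ in range(iters):
        new = []
        for i in range(n):
            rank = sum(scores[j] * sim[j][i] / (sum(sim[j]) or 1)
                       for j in range(n) if j != i)
            new.append((1 - d) / n + d * rank)
        scores = new
    return scores
```

Sentences with high scores are the ones best connected to the rest of the document and are selected for the summary; the fourteen measures used by SGATS play the same ranking role that these PageRank scores play here.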

