A Framework for Generating Extractive Summary from Multiple Malayalam Documents

Automatic extractive text summarization retrieves a subset of data that represents most notable sentences in the entire document. In the era of digital explosion, which is mostly unstructured textual data, there is a demand for users to understand the huge amount of text in a short time; this demands the need for an automatic text summarizer. From summaries, the users get the idea of the entire content of the document and can decide whether to read the entire document or not. This work mainly focuses on generating a summary from multiple news documents. In this case, the summary helps to reduce the redundant news from the different newspapers. A multi-document summary is more challenging than a single-document summary since it has to solve the problem of overlapping information among sentences from different documents. Extractive text summarization yields the sensitive part of the document by neglecting the irrelevant and redundant sentences. In this paper, we propose a framework for extracting a summary from multiple documents in the Malayalam Language. Also, since the multi-document summarization data set is sparse, methods based on deep learning are difficult to apply. The proposed work discusses the performance of existing standard algorithms in multi-document summarization of the Malayalam Language. We propose a sentence extraction algorithm that selects the top ranked sentences with maximum diversity. The system is found to perform well in terms of precision, recall, and F-measure on multiple input documents.

Download Full-text

A Quantum-Inspired Genetic Algorithm for Extractive Text Summarization

International Journal of Natural Computing Research ◽

10.4018/ijncr.2021040103 ◽

2021 ◽

Vol 10 (2) ◽

pp. 42-60

Author(s):

Khadidja Chettah ◽

Amer Draa

Keyword(s):

Genetic Algorithm ◽

State Of The Art ◽

Text Summarization ◽

Automated System ◽

Evaluation Metrics ◽

Document Summarization ◽

Automatic Text Summarization ◽

Reference Methods ◽

Textual Data ◽

Automatic Text

Automatic text summarization has recently become a key instrument for reducing the huge quantity of textual data. In this paper, the authors propose a quantum-inspired genetic algorithm (QGA) for extractive single-document summarization. The QGA is used inside a totally automated system as an optimizer to search for the best combination of sentences to be put in the final summary. The presented approach is compared with 11 reference methods including supervised and unsupervised summarization techniques. They have evaluated the performances of the proposed approach on the DUC 2001 and DUC 2002 datasets using the ROUGE-1 and ROUGE-2 evaluation metrics. The obtained results show that the proposal can compete with other state-of-the-art methods. It is ranked first out of 12, outperforming all other algorithms.

Download Full-text

MHLM Majority Voting Based Hybrid Learning Model for Multi-Document Summarization

International Journal of Artificial Intelligence and Machine Learning ◽

10.4018/ijaiml.2019010104 ◽

2019 ◽

Vol 9 (1) ◽

pp. 67-81

Author(s):

Suneetha S. ◽

Venugopal Reddy A.

Keyword(s):

Numerical Data ◽

Hybrid Learning ◽

Learning Model ◽

Text Summarization ◽

Majority Voting ◽

Sentence Length ◽

Support Vector ◽

Data Set ◽

Document Summarization ◽

Multiple Documents

Text summarization from multiple documents is an active research area in the current scenario as the data in the World Wide Web (WWW) is found in abundance. The text summarization process is time-consuming and hectic for the users to retrieve the relevant contents from this mass collection of the data. Numerous techniques have been proposed to provide the relevant information to the users in the form of the summary. Accordingly, this article presents the majority voting based hybrid learning model (MHLM) for multi-document summarization. First, the multiple documents are subjected to pre-processing, and the features, such as title-based, sentence length, numerical data and TF-IDF features are extracted for all the individual sentences of the document. Then, the feature set is sent to the proposed MHLM classifier, which includes the Support Vector Machine (SVM), K-Nearest Neighbors (KNN), and Neural Network (NN) classifiers for evaluating the significance of the sentences present in the document. These classifiers provide the significance scores based on four features extracted from the sentences in the document. Then, the majority voting model decides the significant texts based on the significance scores and develops the summary for the user and thereby, reduces the redundancy, increasing the quality of the summary similar to the original document. The experiment performed with the DUC 2002 data set is used to analyze the effectiveness of the proposed MHLM that attains the precision and recall at a rate of 0.94, f-measure at a rate of 0.93, and ROUGE-1 at a rate of 0.6324.

Download Full-text

Automatic Text Summarization Using Latent Drichlet Allocation (LDA) for Document Clustering

International Journal of Advances in Intelligent Informatics ◽

10.26555/ijain.v1i3.43 ◽

2015 ◽

Vol 1 (3) ◽

pp. 132 ◽

Cited By ~ 5

Author(s):

Erwin Yudi Hidayat ◽

Fahri Firdausillah ◽

Khafiizh Hastuti ◽

Ika Novita Dewi ◽

Azhari Azhari

Keyword(s):

Clustering Algorithm ◽

Document Clustering ◽

Text Summarization ◽

Data Set ◽

Document Summarization ◽

Automatic Text Summarization ◽

Improve Accuracy ◽

Automatic Document Summarization ◽

Document Compression ◽

Automatic Text

In this paper, we present Latent Drichlet Allocation in automatic text summarization to improve accuracy in document clustering. The experiments involving 398 data set from public blog article obtained by using python scrapy crawler and scraper. Several steps of clustering in this research are preprocessing, automatic document compression using feature method, automatic document compression using LDA, word weighting and clustering algorithm The results show that automatic document summarization with LDA reaches 72% in LDA 40%, compared to traditional k-means method which only reaches 66%.

Download Full-text

Information Extraction Tasks based on BERT and SpaCy on Tourism Domain

ECTI Transactions on Computer and Information Technology (ECTI-CIT) ◽

10.37936/ecti-cit.2021151.228621 ◽

2021 ◽

Vol 15 (1) ◽

pp. 108-122

Author(s):

Chantana Chantrapornchai ◽

Aphisit Tunsakul

Keyword(s):

Name Entity Recognition ◽

Text Summarization ◽

Training Data ◽

Entity Recognition ◽

Entity Extraction ◽

Data Set ◽

Name Entity ◽

Sentence Extraction ◽

Relation Type ◽

Proper Training

In this paper, we present two methodologies to extract particular information based on the full text returned from the search engine to facilitate the users. The approaches are based three tasks: name entity recognition (NER), text classiﬁcation and text summarization. The ﬁrst step is the building training data and data cleansing. We consider tourism domain such as restaurant, hotels, shopping and tourism data set crawling from the websites. First, the tourism data are gathered and the vocabularies are built. Several minor steps include sentence extraction, relation and name entity extraction for tagging purpose. These steps are needed for creating proper training data. Then, the recognition model of a given entity type can be built. From the experiments, given review texts, we demonstrate to build the model to extract the desired entity,i.e, name, location, facility as well as relation type, classify the reviews or summarize the reviews. Two tools, SpaCy and BERT, are used to compare the performance of these tasks.

Download Full-text

An Automatic Text Summarization Method with the Concern of Covering Complete Formation

Recent Advances in Computer Science and Communications ◽

10.2174/2213275912666190716105347 ◽

2020 ◽

Vol 13 (5) ◽

pp. 977-986

Author(s):

Srinivasa Rao Kongara ◽

Dasika Sree Rama Chandra Murthy ◽

Gangadhara Rao Kancherla

Keyword(s):

Research Method ◽

Research Work ◽

Fuzzy Rule ◽

Text Summarization ◽

Document Summarization ◽

Summarization Method ◽

Overall Evaluation ◽

Multiple Documents ◽

Rule System ◽

Value Decomposition

Background: Text summarization is the process of generating a short description of the entire document which is more difficult to read. This method provides a convenient way of extracting the most useful information and a short summary of the documents. In the existing research work, this is focused by introducing the Fuzzy Rule-based Automated Summarization Method (FRASM). Existing work tends to have various limitations which might limit its applicability to the various real-world applications. The existing method is only suitable for the single document summarization where various applications such as research industries tend to summarize information from multiple documents. Methods: This paper proposed Multi-document Automated Summarization Method (MDASM) to introduce the summarization framework which would result in the accurate summarized outcome from the multiple documents. In this work, multi-document summarization is performed whereas in the existing system only single document summarization was performed. Initially document clustering is performed using modified k means cluster algorithm to group the similar kind of documents that provides the same meaning. This is identified by measuring the frequent term measurement. After clustering, pre-processing is performed by introducing the Hybrid TF-IDF and Singular value decomposition technique which would eliminate the irrelevant content and would result in the required content. Then sentence measurement is one by introducing the additional metrics namely Title measurement in addition to the existing work metrics to accurately retrieve the sentences with more similarity. Finally, a fuzzy rule system is applied to perform text summarization. Results: The overall evaluation of the research work is conducted in the MatLab simulation environment from which it is proved that the proposed research method ensures the optimal outcome than the existing research method in terms of accurate summarization. MDASM produces 89.28% increased accuracy, 89.28% increased precision, 89.36% increased recall value and 70% increased the f-measure value which performs better than FRASM. Conclusion: The summarization processes carried out in this work provides the accurate summarized outcome.

Download Full-text

Origins and Destinations – Social Security Claimant Dynamics

Journal of Social Policy ◽

10.1017/s0047279498005327 ◽

1998 ◽

Vol 27 (3) ◽

pp. 351-369 ◽

Cited By ~ 5

Author(s):

MICHAEL NOBLE ◽

SIN YI CHEUNG ◽

GEORGE SMITH

Keyword(s):

The United States ◽

Data Sets ◽

Income Support ◽

Data Set ◽

Family Structures ◽

Highly Educated ◽

Lone Parents ◽

Lone Mothers ◽

Time Period ◽

Short Time

This article briefly reviews American and British literature on welfare dynamics and examines the concepts of welfare dependency and ‘dependency culture’ with particular reference to lone parents. Using UK benefit data sets, the welfare dynamics of lone mothers are examined to explore the extent to which they inform the debates. Evidence from Housing Benefits data show that even over a relatively short time period, there is significant turnover in the benefits-dependent lone parent population with movement in and out of income support as well as movement into other family structures. Younger lone parents and owner-occupiers tend to leave the data set while older lone parents and council tenants are most likely to stay. Some owner-occupier lone parents may be relatively well off and on income support for a relatively short time between separation and a financial settlement being reached. They may also represent a more highly educated and highly skilled group with easier access to the labour market than renters. Any policy moves paralleling those in the United States to time limit benefit will disproportionately affect older lone parents.

Download Full-text

Text Summarization

10.1093/oxfordhb/9780199276349.013.0032 ◽

2012 ◽

Cited By ~ 11

Author(s):

Eduard Hovy

Keyword(s):

Research And Development ◽

Evaluation Studies ◽

Text Summarization ◽

Single Measurement ◽

Document Summarization ◽

Topic Identification ◽

Evaluation Strategies ◽

And Performance

This article describes research and development on the automated creation of summaries of one or more texts. It defines the concept of summary and presents an overview of the principal approaches in summarization. It describes the design, implementation, and performance of various summarization systems. The stages of automated text summarization are topic identification, interpretation, and summary generation, each having its sub stages. Due to the challenges involved, multi-document summarization is much less developed than single-document summarization. This article reviews particular techniques used in several summarization systems. Finally, this article assesses the methods of evaluating summaries. This article reviews evaluation strategies, from previous evaluation studies, to the two-basic measures method. Summaries are so task and genre specific; therefore, no single measurement covers all cases of evaluation

Download Full-text

Texts of “internet confessions” as a source for training data set for the research on the sentiment-analysis field

Vestnik NSU Series Linguistics and Intercultural Communication ◽

10.25205/1818-7935-2019-17-3-71-82 ◽

2019 ◽

Vol 17 (3) ◽

pp. 71-82

Author(s):

Anastasia V. Kolmogorova

Keyword(s):

Sentiment Analysis ◽

Narrative Structure ◽

Training Data ◽

Data Set ◽

Financial Reports ◽

Technological Basis ◽

Self Image ◽

Textual Data ◽

Primary Advantage ◽

Multiclass Classifier

The article aims to analyze the validity of Internet confession texts used as a source of training data set for designing computer classifier of Internet texts in Russian according to their emotional tonality. Thus, the classifier, backed by Lövheim’s emotional cube model, is expected to detect eight classes of emotions represented in the text or to assign the text to the emotionally neutral class. The first and one of the most important stages of the classifier creation is the training data set selection. The training data set in Machine Learning is the actual dataset used to train the model for performing various actions. The internet text genres that are traditionally used in sentiment analysis to train two or three tonalities classifiers are twits, films and market reviews, blogs and financial reports. The novelty of our project consists in designing multiclass classifier that requires a new non-trivial training data. As such, we have chosen the texts from public group Overheard in Russian social network VKontakte. As all texts show similarities, we united them under the genre name “Internet confession”. To feature the genre, we applied the method of narrative semiotics describing six positions forming the deep narrative structure of “Internet confession”: Addresser – a person aware of her/his separateness from the society; Addressee – society / public opinion; Subject – a narrator describing his / her emotional state; Object – the person’s self-image; Helper – the person’s frankness; Adversary – the person’s shame. The above mentioned genre features determine its primary advantage – a qualitative one – to be especially focused on the emotionality while more traditional sources of textual data are based on such categories as expressivity (twits) or axiological estimations (all sorts of reviews). The structural analysis of texts under discussion has also demonstrated several advantages due to the technological basis of the Overheard project: the text hashtagging prevents the researcher from submitting the whole collection to the crowdsourcing assessment; its size is optimal for assessment by experts; despite their hyperbolized emotionality, the texts of Internet confession genre share the stylistic features typical of different types of personal internet discourse. However, the narrative character of all Internet confession texts implies some restrictions in their use within sentiment analysis project.

Download Full-text

Text analysis on health product reviews using r approach

Indonesian Journal of Electrical Engineering and Computer Science ◽

10.11591/ijeecs.v18.i3.pp1303-1310 ◽

2020 ◽

Vol 18 (3) ◽

pp. 1303

Author(s):

Nasibah Husna Mohd Kadir ◽

Sharifah Aliman

Keyword(s):

Language Processing ◽

Text Summarization ◽

Unstructured Data ◽

Product Reviews ◽

Health Product ◽

Text Analytics ◽

Data Set ◽

Web Based ◽

The Social ◽

Key Techniques

In the social media, product reviews contain of text, emoticon, numbers and symbols that hard to identify the text summarization. Text analytics is one of the key techniques in exploring the unstructured data. The purpose of this study is solving the unstructured data by sort and summarizes the review data through a Web-Based Text Analytics using R approach. According to the comparative table between studies in Natural Language Processing (NLP) features, it was observed that Web-Based Text Analytics using R approach can analyze the unstructured data by using the data processing package in R. It combines all the NLP features in the menu part of the text analytics process in steps and it is labeled to make it easier for users to view all the text summarization. This study uses health product review from Shaklee as the data set. The proposed approach shows the acceptable performance in terms of system features execution compared with the baseline model system.

Download Full-text

A rapid deployment instrument network for temporarily monitoring volcanic SO<sub>2</sub> emissions – a study case from Telica volcano

Geoscientific Instrumentation Methods and Data Systems Discussions ◽

10.5194/gid-4-191-2014 ◽

2014 ◽

Vol 4 (1) ◽

pp. 191-211

Author(s):

V. Conde ◽

D. Nilsson ◽

B. Galle ◽

R. Cartagena ◽

A. Muñoz

Keyword(s):

Measurement Techniques ◽

Active Volcano ◽

Optical Absorption Spectroscopy ◽

So2 Emissions ◽

Volcanic Gas ◽

Data Set ◽

So2 Flux ◽

Geophysical Processes ◽

Rapid Deployment ◽

Short Time

Abstract. Volcanic gas emissions play a crucial role in describing geophysical processes; hence measurements of magmatic gases such as SO2 can be used as tracers prior and during volcanic crises. Different measurement techniques based on optical spectroscopy have provided valuable information when assessing volcanic crises. This paper describes the design and implementation of a network of spectroscopic instruments based on Differential Optical Absorption Spectroscopy (DOAS) for remote sensing of volcanic SO2 emissions, which is robust, portable and can be deployed in relative short time. The setup allows the processing of raw data in situ even in remote areas with limited accessibility, and delivers pre-processed data to end-users in near real time even during periods of volcanic crisis, via a satellite link. In addition, the hardware can be used to conduct short term studies of volcanic plumes in remotes areas. The network was tested at Telica, an active volcano located in western Nicaragua, producing what is so far the largest data set of continuous SO2 flux measurements at this volcano.

Download Full-text