text document
Recently Published Documents


TOTAL DOCUMENTS

528
(FIVE YEARS 189)

H-INDEX

26
(FIVE YEARS 5)

2022 ◽  
pp. 57-90
Author(s):  
Surabhi Verma ◽  
Ankit Kumar Jain

People regularly use social media to express their opinions about a wide variety of topics, goods, and services, which makes it a rich source for text mining and sentiment analysis. Sentiment analysis is a form of text analysis that determines the polarity (positive, negative, or neutral) of a text, document, paragraph, or clause. This chapter offers an overview of the subject by examining the algorithms proposed for sentiment analysis on Twitter and explaining them briefly. In addition, the authors address related fields such as monitoring sentiments over time, regional analysis of opinions, neutral tweet analysis, and sarcasm detection, along with other tasks in this area that have recently drawn researchers' attention. All the services used are briefly summarized within the chapter. The key contribution of this survey is a taxonomy of the proposed methods and a discussion of recent research developments in the theme and its related fields.
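
As a concrete illustration of polarity determination, the sketch below implements the simplest lexicon-based approach covered in such surveys. It is a minimal, generic example: the word lists and scoring rule are illustrative assumptions, not the chapter's method.

```python
# Minimal lexicon-based polarity scorer: counts positive and negative
# words and maps the difference to a polarity label. Word lists are
# illustrative placeholders only.

POSITIVE = {"good", "great", "love", "excellent", "happy"}
NEGATIVE = {"bad", "terrible", "hate", "awful", "sad"}

def polarity(text: str) -> str:
    tokens = text.lower().split()
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(polarity("I love this great phone"))   # positive
print(polarity("The service was terrible"))  # negative
```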


Author(s):  
Awatif Karim ◽  
Chakir Loqman ◽  
Youssef Hami ◽  
Jaouad Boumhidi

In this paper, we propose a new approach to document clustering using the K-Means algorithm. The latter is sensitive to the random selection of the k cluster centroids in the initialization phase. To improve the quality of K-Means clustering, we propose to model the text document clustering problem as the maximum stable set problem (MSSP) and to solve the MSSP with a continuous Hopfield network in order to obtain the initial centroids. The idea is inspired by the fact that MSSP and clustering share the same principle: MSSP consists in finding the largest set of mutually disconnected nodes in a graph, while in clustering all objects are divided into disjoint clusters. Simulation results demonstrate that the proposed K-Means improved by MSSP (KM_MSSP) is efficient on large data sets, is considerably faster, and provides better clustering quality than other methods.
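
A minimal sketch of the initialization idea follows. For brevity it replaces the continuous Hopfield network with a simple greedy independent-set heuristic (an assumption, not the paper's solver): mutually dissimilar documents are selected as the initial K-Means centroids.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "stocks fell as markets reacted to inflation data",
    "the central bank raised interest rates again",
    "the home team won the championship final",
    "the striker scored twice in the second half",
    "new telescope images reveal distant galaxies",
    "astronomers detect water on a distant exoplanet",
]
k = 3

X = TfidfVectorizer().fit_transform(docs)
sim = cosine_similarity(X)

# Build a graph: connect documents whose similarity exceeds a threshold,
# then greedily pick mutually disconnected (dissimilar) documents.
# This greedy independent set stands in for the Hopfield-network MSSP solver.
threshold = 0.1
adjacent = sim > threshold
np.fill_diagonal(adjacent, False)

chosen, candidates = [], list(range(len(docs)))
while candidates and len(chosen) < k:
    node = min(candidates, key=lambda i: adjacent[i].sum())  # most isolated
    chosen.append(node)
    candidates = [i for i in candidates if i != node and not adjacent[node, i]]
if len(chosen) < k:  # fallback if the graph is too dense
    chosen += [i for i in range(len(docs)) if i not in chosen][:k - len(chosen)]

init = X[chosen].toarray()  # selected documents become initial centroids
km = KMeans(n_clusters=k, init=init, n_init=1).fit(X)
print(km.labels_)
```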


2022 ◽  
Vol 19 (1) ◽  
pp. 1719
Author(s):  
Saravanan Arumugam ◽  
Sathya Bama Subramani

With the increase in the amount of data and documents on the web, text summarization has become one of the significant fields that cannot be avoided in today's digital era. Automatic text summarization provides the user with a quick summary of the information presented in text documents. This paper presents automated single-document summarization by constructing similitude graphs from the extracted text segments. After the text segments are extracted, feature values are computed for each segment by comparing it with the title and the entire document and by computing segment significance using the information gain ratio. Based on the computed features, the similarity between the segments is evaluated to construct a graph in which the vertices are the segments and the edges specify the similarity between them. The segments are ranked for inclusion in the extractive summary by computing the graph score and the sentence segment score. The experimental analysis has been performed using ROUGE metrics, and the results are analyzed for the proposed model. The proposed model has been compared with various existing models on 4 different datasets, where it ranked in the top 2 positions on the average rank computed over metrics such as precision, recall, and F-score.
HIGHLIGHTS
The paper presents automated single-document summarization by constructing similitude graphs from the extracted text segments.
It utilizes the information gain ratio, graph construction, and the computation of graph scores and sentence segment scores.
Result analysis has been performed using ROUGE metrics with 4 popular datasets in the document summarization domain.
The model ranked in the top 2 positions on the average rank computed over metrics such as precision, recall, and F-score.
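
The core graph-ranking step can be sketched as follows. Weighted degree stands in here for the paper's combined graph score and sentence segment score, and the title-similarity and information-gain-ratio features are omitted; all names and thresholds are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

document = (
    "Automatic summarization selects the most informative sentences. "
    "Graph based methods model sentences as vertices. "
    "Edges carry the similarity between sentence pairs. "
    "Highly connected sentences are assumed to be central. "
    "The top ranked sentences form the extractive summary."
)
segments = [s.strip() for s in document.split(". ") if s]

# Vertices are segments; edge weights are pairwise cosine similarities.
X = TfidfVectorizer().fit_transform(segments)
sim = cosine_similarity(X)
np.fill_diagonal(sim, 0.0)

# Graph score: sum of edge weights incident to each vertex (weighted degree).
graph_score = sim.sum(axis=1)

summary_size = 2
top = np.argsort(graph_score)[::-1][:summary_size]
for i in sorted(top):  # keep the original sentence order
    print(segments[i])
```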


2021 ◽  
pp. 1-12
Author(s):  
Kushagri Tandon ◽  
Niladri Chatterjee

Multi-label text classification aims at assigning more than one class to a given text document, which makes the task both more ambiguous and more challenging. The ambiguity comes from the fact that several labels in the prescribed label set are often semantically close to each other, making a clear demarcation between them difficult. As a consequence, any machine learning approach to multi-label classification needs to define its feature space with features beyond linguistic or semi-linguistic ones, so that the semantic closeness between the labels is also taken into account. The present work describes a feature extraction scheme in which the training document set and the prescribed label set are intertwined in a novel way to capture this ambiguity meaningfully. In particular, experiments were conducted using topic modeling and Fuzzy C-Means clustering, which measure the underlying uncertainty using probability-based and membership-based measures, respectively. Several nonparametric hypothesis tests establish the effectiveness of the features obtained through Fuzzy C-Means clustering in multi-label classification. A new algorithm has been proposed for training the system for multi-label classification using the above set of features.
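
A minimal sketch of the membership-based features follows, using a generic Fuzzy C-Means implementation (not the authors' exact algorithm): the soft cluster memberships are appended to the document vectors as extra features, capturing the kind of overlap between semantically close labels described above.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def fuzzy_cmeans(X, c, m=2.0, iters=100, seed=0):
    """Generic Fuzzy C-Means: returns an (n_samples, c) membership matrix."""
    rng = np.random.default_rng(seed)
    U = rng.random((X.shape[0], c))
    U /= U.sum(axis=1, keepdims=True)
    for _ in range(iters):
        Um = U ** m
        centroids = (Um.T @ X) / Um.sum(axis=0)[:, None]
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2) + 1e-9
        inv = d ** (-2.0 / (m - 1.0))
        U = inv / inv.sum(axis=1, keepdims=True)
    return U

docs = ["text about sports and games",
        "article on politics and sports funding",
        "politics and the economy dominate the news",
        "new games console released this year"]

X = TfidfVectorizer().fit_transform(docs).toarray()
U = fuzzy_cmeans(X, c=2)

# Memberships are soft: a document can belong to several clusters at once,
# which is exactly the ambiguity the feature set is meant to capture.
features = np.hstack([X, U])  # original features augmented with memberships
print(U.round(2))
```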


2021 ◽  
Vol 21 (3) ◽  
pp. 3-10
Author(s):  
Petr ŠALOUN ◽  
◽  
Barbora CIGÁNKOVÁ ◽  
David ANDREŠIČ ◽  
Lenka KRHUTOVÁ ◽  
...  

For a long time, both professionals and the lay public showed little interest in informal carers, yet these people deal with multiple common issues in their everyday lives. As the population ages, this attitude is changing, and thanks to advances in computer science we can offer informal carers effective assistance and support by providing the necessary information and connecting them with both the professional and lay communities. In this work we describe a project called “Research and development of support networks and information systems for informal carers for persons after stroke”, which produces an information system visible to the public as a web portal. The portal does not provide just a simple set of information: using artificial intelligence, text document classification, and crowdsourcing that further improves classification accuracy, it also provides effective visualization of, and navigation over, content created mostly by the community itself and personalized to the informal carer's phase of the care-taking timeline. It can be beneficial for informal carers, as it allows them to find content specific to their current situation. This work describes our approach to the classification of text documents and its improvement through crowdsourcing. Its goal is to test a text document classifier based on document similarity measured with the N-gram method, and to design an evaluation and crowdsourcing-based mechanism for improving the classification. The crowdsourcing interface was created using the CMS WordPress. In addition to data collection, the purpose of the interface is to evaluate classification accuracy, which leads to an extension of the classifier's data set and thus to more successful classification.
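
A minimal sketch of a similarity-based classifier in the spirit of the N-gram method described above: character trigram profiles are compared by cosine similarity and the label of the most similar labelled document wins. The profiles, scoring, and the tiny labelled set are illustrative assumptions, not the project's actual implementation.

```python
from collections import Counter

def ngram_profile(text: str, n: int = 3) -> Counter:
    """Character n-gram frequency profile of a document."""
    text = " ".join(text.lower().split())
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two n-gram profiles."""
    dot = sum(a[g] * b[g] for g in set(a) & set(b))
    norm = (sum(v * v for v in a.values()) ** 0.5) * \
           (sum(v * v for v in b.values()) ** 0.5)
    return dot / norm if norm else 0.0

# Hypothetical labelled documents standing in for the portal's content.
labelled = {
    "rehabilitation exercises after a stroke": "health",
    "applying for a carer's allowance": "finance",
}

def classify(text: str) -> str:
    p = ngram_profile(text)
    best = max(labelled, key=lambda d: similarity(p, ngram_profile(d)))
    return labelled[best]

print(classify("exercise plan for stroke recovery"))  # health
```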


2021 ◽  
pp. 894-911
Author(s):  
Bhavesh Kataria ◽  
Harikrishna B. Jethva

India's constitution recognizes 22 languages, written in 17 different scripts. The physical materials that carry these texts have a limited lifespan: as generations pass they deteriorate, and vital knowledge is lost. This work uses digital texts to convey that information to future generations. Optical Character Recognition (OCR) helps extract information from scanned manuscripts (printed text). This paper proposes a simple and effective solution for optical character recognition of Sanskrit characters from text document images using long short-term memory (LSTM) neural networks. Existing methods focus only on single touching characters. Our main focus is to design a robust method using a Bidirectional Long Short-Term Memory (BLSTM) architecture for overlapping lines, characters touching in the middle and upper zones, and half characters, which would increase the accuracy of the present OCR system for the recognition of poorly maintained Sanskrit literature.
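
A minimal sketch of a BLSTM line recognizer trained with CTC, the standard way such architectures avoid explicit character segmentation (and hence cope with touching characters). The feature dimensions, class count, and dummy data below are placeholders, not the paper's configuration; PyTorch is assumed.

```python
import torch
import torch.nn as nn

class BLSTMRecognizer(nn.Module):
    """Bidirectional LSTM over feature frames of a text-line image.
    With CTC training, no per-character segmentation is required,
    which is why touching and overlapping characters can be handled."""
    def __init__(self, n_features=48, hidden=128, n_classes=100):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, num_layers=2,
                            bidirectional=True, batch_first=True)
        self.fc = nn.Linear(2 * hidden, n_classes + 1)  # +1 for the CTC blank

    def forward(self, x):                  # x: (batch, time, n_features)
        out, _ = self.lstm(x)
        return self.fc(out).log_softmax(-1)

model = BLSTMRecognizer()
frames = torch.randn(2, 120, 48)             # 2 line images, 120 frame columns
log_probs = model(frames).permute(1, 0, 2)   # CTC expects (time, batch, classes)

targets = torch.randint(1, 100, (2, 15))     # dummy Sanskrit character ids
loss = nn.CTCLoss(blank=100)(log_probs, targets,
                             torch.tensor([120, 120]), torch.tensor([15, 15]))
loss.backward()
```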


Automatic text summarization is a technique for generating a short and accurate summary of a longer text document. Text summarization can be classified by the number of input documents (single-document and multi-document summarization) and by the characteristics of the summary generated (extractive and abstractive summarization). Multi-document summarization is the automatic process of creating a relevant, informative, and concise summary from a cluster of related documents. This paper presents a detailed survey of the existing literature on the various approaches to text summarization. A few of the most popular approaches, such as graph-based, cluster-based, and deep-learning-based summarization techniques, are discussed here along with the evaluation metrics, which can provide insight for future researchers.
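
Since ROUGE-style metrics recur throughout this literature, a minimal sketch of ROUGE-1 (unigram overlap) is given below. Real evaluations typically use a reference implementation with stemming and multiple reference summaries; this version is an illustrative simplification.

```python
from collections import Counter

def rouge_1(candidate: str, reference: str) -> dict:
    """ROUGE-1: unigram overlap between a candidate summary and a reference."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())          # clipped unigram matches
    recall = overlap / max(sum(ref.values()), 1)
    precision = overlap / max(sum(cand.values()), 1)
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f": f}

print(rouge_1("the summary covers the main points",
              "the reference mentions the main points of the article"))
```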


Author(s):  
Manju Lata Joshi ◽  
Nisheeth Joshi ◽  
Namita Mittal

Creating a coherent summary of a text is a challenging task in the field of Natural Language Processing (NLP). Various automatic text summarization techniques have been developed for abstractive as well as extractive summarization. This study focuses on extractive summarization, a process that selects representative paragraphs or sentences from the original text and combines them into a form shorter than the source document(s) to generate a summary. The methods that have been used for extractive summarization are based on graph-theoretic approaches, machine learning, Latent Semantic Analysis (LSA), neural networks, clustering, and fuzzy logic. In this paper, a semantic graph-based approach, SGATS (Semantic Graph-based approach for Automatic Text Summarization), is proposed to generate an extractive summary. The proposed approach constructs a semantic graph of the original Hindi text document by establishing semantic relationships between the sentences of the document, using the Hindi WordNet ontology as a background knowledge source. Once the semantic graph is constructed, fourteen different graph-theoretical measures are applied to rank the document sentences by their semantic scores. The proposed approach is applied to two data sets from the different domains of tourism and health. Its performance is compared with the state-of-the-art TextRank algorithm and a human-annotated summary, and is evaluated using the widely accepted ROUGE measures. The results show that the proposed system produces better results than TextRank on the health domain corpus and comparable results on the tourism corpus. Further, correlation coefficient methods are applied to find the correlation between eight different graphical measures, and it is observed that most of the graphical measures are highly correlated.
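
A minimal sketch of the graph-measure ranking step: plain TF-IDF cosine similarity stands in for the Hindi WordNet semantic relations, and only three of the fourteen graph-theoretical measures are shown; networkx is assumed, and the sentences and threshold are illustrative.

```python
import networkx as nx
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

sentences = [
    "The temple attracts thousands of visitors every year.",
    "Visitors praise the temple's ancient architecture.",
    "Local guides offer tours of the architecture and murals.",
    "The region is also known for its tea plantations.",
]

X = TfidfVectorizer().fit_transform(sentences)
sim = cosine_similarity(X)

# Build the sentence graph: an edge wherever similarity is non-trivial.
G = nx.Graph()
G.add_nodes_from(range(len(sentences)))
for i in range(len(sentences)):
    for j in range(i + 1, len(sentences)):
        if sim[i, j] > 0.05:
            G.add_edge(i, j, weight=sim[i, j])

# A few of the graph-theoretical measures usable for ranking sentences.
scores = {
    "degree": nx.degree_centrality(G),
    "pagerank": nx.pagerank(G, weight="weight"),
    "closeness": nx.closeness_centrality(G),
}
for name, s in scores.items():
    ranking = sorted(s, key=s.get, reverse=True)
    print(name, ranking)
```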


Author(s):  
Halyna Lukash ◽  
Olga Anisimova

The purpose of the article is to shed light on the history of stable phrases of reference in the context of documentary linguistics. The methodology is based on a combination of structural-typological and cognitive-discursive approaches, which made it possible to identify and generalize the main features and differences of language clichés, stamps, and documentary formulas, and to delve into the semantic, structural, linguistic, and stylistic spheres of documentation in terms of the feasibility of introducing clichés and stamps into the text. The scientific novelty of the work lies in outlining the main features of language clichés, building a typology of clichéd units on various grounds, highlighting the factors that affect the functioning of language stereotypes, and identifying trends in the development of stable combinations in documents at the stage of their formation. Conclusions. The most productive language constructions in documentary text are analyzed and described. It is noted that the formation of the language composition of the documentary text of a certain era was driven by extralinguistic factors. Cliché semantics encompasses the whole complex of extralinguistic meanings acquired as a result of the collective experience of mankind; it connects the semantic characteristics of the verbal sign with the system of traditions of the people and is objectified by communication and the stereotypical linguistic situation. Among the results, it is established that clichés and document formulas are an organic element of the text of documents. It is proved that modern functional studies in documentary linguistics require a certain reorientation of the analysis of language units towards the integration of the conceptual, linguistic, and communicative aspects of their function.
Keywords: language cliché, language stamp, document formulas, stereotypical units, established expressions, language constructions, document text, document.

