text data
Recently Published Documents





2022 ◽  
Vol 13 (1) ◽  
pp. 1-14
Shuteng Niu ◽  
Yushan Jiang ◽  
Bowen Chen ◽  
Jian Wang ◽  
Yongxin Liu ◽  

In the past decades, the volume of information drawn from all kinds of data has increased rapidly. With state-of-the-art performance, machine learning algorithms have been beneficial for information management. However, insufficient supervised training data is still an obstacle in many real-world applications. Therefore, transfer learning (TL) was proposed to address this issue. This article studies an under-investigated but important TL problem termed cross-modality transfer learning (CMTL). This topic is closely related to distant domain transfer learning (DDTL) and negative transfer. In general, conventional TL approaches assume that the source domain and the target domain are in the same modality. DDTL aims to make efficient transfers even when the domains or the tasks are entirely different. As an extension of DDTL, CMTL aims to make efficient transfers between two different data modalities, such as from image to text. As the main focus of this study, we aim to improve the performance of image classification by transferring knowledge from text data. Previously, a few CMTL algorithms were proposed to deal with image classification problems. However, most existing algorithms are very task specific, and their convergence is unstable. There are four main contributions in this study. First, we propose a novel heterogeneous CMTL algorithm, which requires only a tiny set of unlabeled target data and labeled source data with associated text tags. Second, we introduce a latent semantic information extraction method to connect the information learned from the image data and the text data. Third, the proposed method can effectively handle the information transfer across different modalities (text-image). Fourth, we evaluated our algorithm on a public dataset, Office-31. It achieved up to 5% higher classification accuracy than “non-transfer” algorithms and up to 9% higher than existing CMTL algorithms.

2022 ◽  
Vol 3 (1) ◽  
pp. 1-16
Haoran Ding ◽  
Xiao Luo

Searching, reading, and finding information in massive medical text collections is challenging. A typical biomedical search engine cannot feasibly navigate each article to find critical information or keyphrases. Moreover, few tools provide a visualization of the phrases relevant to the query. There is therefore a need to extract the keyphrases from each document for indexing and efficient search. Transformer-based neural networks such as BERT have been used for various natural language processing tasks. The built-in self-attention mechanism can capture the associations between words and phrases in a sentence. This research investigates whether self-attention can be used to extract keyphrases from a document in an unsupervised manner and to identify the relevancy between phrases, in order to construct a query relevancy phrase graph that visualizes the phrases of the search corpus by relevancy and importance. The comparison with six baseline methods shows that the self-attention-based unsupervised keyphrase extraction works well on a medical literature dataset. This unsupervised keyphrase extraction model can also be applied to other text data. The query relevancy graph model is applied to the COVID-19 literature dataset to demonstrate that the attention-based phrase graph can successfully identify the medical phrases relevant to the query terms.

2022 ◽  
Vol 10 (4) ◽  
pp. 544-553
Ratna Kurniasari ◽  
Rukun Santoso ◽  
Alan Prahutama

Effective communication between the government and society is essential to achieve good governance. The government provides a channel for public complaints through an online aspiration and complaint service called “LaporGub..!”. To make incoming reports easier to group, the topic of each report is identified by clustering. Text mining is used to convert text data into numeric data so that it can be processed further. Clustering is classified as soft (fuzzy) clustering and hard clustering. Hard clustering divides data into clusters strictly, without any overlapping membership between clusters. Soft clustering can assign data to several clusters, each with a certain degree of membership. Differing membership values give fuzzy grouping more natural results than hard clustering, because objects at the boundary between several classes are not forced to fit fully into one class; instead, each object is assigned a degree of membership. Fuzzy c-means has the advantage of placing cluster centers more precisely than other clustering methods, by iteratively refining the cluster centers. The best number of clusters is selected based on the maximum silhouette coefficient. A word cloud, a form of text data visualization, is used to determine the dominant topic in each cluster. The results show that the silhouette coefficient for fuzzy c-means clustering is highest with three clusters. The first cluster produces a word cloud about road conditions with 449 reports, the second cluster produces a word cloud about COVID assistance with 964 reports, and the third cluster produces a word cloud about fertilizer for farmers with 176 reports. COVID assistance is the topic of the cluster with the most members.
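The alternating update behind fuzzy c-means can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the study: the function name `fuzzy_c_means`, the fuzzifier `m=2.0`, the Euclidean distance, and the toy 2-D points are all hypothetical, and the silhouette-based choice of the number of clusters is not shown.

```python
import random


def fuzzy_c_means(points, c, m=2.0, iters=100, seed=0):
    """Minimal fuzzy c-means on a list of equal-length numeric tuples.

    Returns (centers, u), where u[i][j] is the degree of membership
    of point i in cluster j. All parameter defaults are assumptions.
    """
    rng = random.Random(seed)
    n, dim = len(points), len(points[0])

    # Random initial memberships; each row is normalised to sum to 1.
    u = []
    for _ in range(n):
        row = [rng.random() + 1e-9 for _ in range(c)]
        total = sum(row)
        u.append([v / total for v in row])

    centers = []
    for _ in range(iters):
        # Centre update: mean of the points weighted by membership ** m.
        centers = []
        for j in range(c):
            weights = [u[i][j] ** m for i in range(n)]
            wsum = sum(weights)
            centers.append(tuple(
                sum(w * p[d] for w, p in zip(weights, points)) / wsum
                for d in range(dim)
            ))

        # Membership update from the distances to every centre.
        for i, p in enumerate(points):
            dists = [
                max(1e-12, sum((a - b) ** 2 for a, b in zip(p, ctr)) ** 0.5)
                for ctr in centers
            ]
            u[i] = [
                1.0 / sum((dists[j] / dk) ** (2.0 / (m - 1.0)) for dk in dists)
                for j in range(c)
            ]
    return centers, u


# Toy data: two well-separated groups of 2-D points (hypothetical).
toy_points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
              (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
toy_centers, toy_u = fuzzy_c_means(toy_points, c=2)
```

Each row of `toy_u` sums to 1, which is what lets a report on the boundary between two topics belong partly to both clusters instead of being forced into one.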

2022 ◽  
Vol ahead-of-print (ahead-of-print) ◽  
Nuria Recuero-Virto ◽  
Cristina Valilla-Arróspide

Purpose: In a sector that needs to satisfy a fast-increasing population, advancements like cultivated meat and the bio-circular economy are essential to sustain the industry and society. As innovations are key for economic and social progress, it is crucial to understand consumers' position on this matter. Design/methodology/approach: Based on text data mining, 7,030 tweets were collected and organised into 14 different food-related topics. Of these categories, 6 were positive, 5 were negative and 3 were neutral. Findings: In total, 6 categories related to food technologies were positively perceived by Twitter users, such as innovative solutions and sustainable agriculture, while 5, such as the virtual dimensions of the industry or crisis-related scenarios, were negatively perceived. It is notable that 3 categories had a neutral sentiment, which leaves room for improvement before consumers form a negative opinion, after which it would be harder to change their minds. Originality/value: Technological innovations are becoming predominant in the food industry. The SARS-CoV-2 pandemic has made the sector improve even faster. Traditional methods needed to be substituted, and technologies such as robots, artificial intelligence, blockchain and genetics are here to stay.

2022 ◽  
Matej Gjurković ◽  
Iva Vukojević ◽  
Jan Šnajder

Automated text-based personality assessment (ATBPA) methods can analyze large amounts of text data and identify nuanced linguistic personality cues. However, current approaches lack the interpretability, explainability, and validity offered by standard questionnaire instruments. To address these weaknesses, we propose an approach that combines questionnaire-based and text-based approaches to personality assessment. Our Statement-to-Item Matching Personality Assessment (SIMPA) framework uses natural language processing methods to detect self-referencing descriptions of personality in a target’s text and utilizes these descriptions for personality assessment. The core of the framework is the notion of a trait-constrained semantic similarity between the target’s freely expressed statements and questionnaire items. The conceptual basis is provided by the realistic accuracy model (RAM), which describes the process of accurate personality judgments and which we extend with a feedback loop mechanism to improve the accuracy of judgments. We present a simple proof-of-concept implementation of SIMPA for ATBPA on the social media site Reddit. We show how the framework can be used directly for unsupervised estimation of a target’s Big 5 scores and indirectly to produce features for a supervised ATBPA model, demonstrating state-of-the-art results for the personality prediction task on Reddit.

Sobhan Sarkar ◽  
Sammangi Vinay ◽  
Chawki Djeddi ◽  
J. Maiti

Classifying or predicting occupational incidents using both structured and unstructured (text) data is an unexplored area of research. Unstructured texts, i.e., incident narratives, are often unutilized or underutilized. Besides the explicit information, a large amount of hidden information is present in a dataset, which cannot be explored by traditional machine learning (ML) algorithms. There is a scarcity of studies on the use of deep neural networks (DNNs) in the domain of incident prediction and on their parameter optimization for achieving better prediction power. To address these issues, key terms are first extracted from the unstructured texts using LDA-based topic modeling. Then, these key terms are combined with the predictor categories to form the feature vector, which is further processed for noise reduction and fed to the adaptive moment estimation (ADAM)-based DNN (i.e., ADNN) for classification, as ADAM is superior to GD, SGD, and RMSProp. To evaluate the effectiveness of our proposed method, a comparative study has been conducted against several state-of-the-art methods on five benchmark datasets. Moreover, a case study of an integrated steel plant in India has been presented to validate the proposed model. Experimental results reveal that ADNN outperforms the other methods in terms of accuracy. Therefore, the present study offers a robust methodological guide that enables us to handle the issues of unstructured data and hidden information when developing a predictive model.
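For readers unfamiliar with adaptive moment estimation, the update rule can be sketched on a toy problem. This is a generic illustration of the standard ADAM step, not the ADNN model from the study: the function name `adam_minimize`, the hyperparameter defaults, and the quadratic objective are assumptions chosen only to make the example runnable.

```python
import math


def adam_minimize(grad, x0, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=1000):
    """Minimise a function with the standard ADAM update, given its gradient.

    grad: callable mapping a list of floats to a list of partial derivatives.
    """
    x = list(x0)
    m = [0.0] * len(x)  # first-moment (mean) estimate of the gradient
    v = [0.0] * len(x)  # second-moment (uncentred variance) estimate
    for t in range(1, steps + 1):
        g = grad(x)
        for i in range(len(x)):
            m[i] = beta1 * m[i] + (1 - beta1) * g[i]
            v[i] = beta2 * v[i] + (1 - beta2) * g[i] ** 2
            # Bias correction compensates for the zero initialisation.
            m_hat = m[i] / (1 - beta1 ** t)
            v_hat = v[i] / (1 - beta2 ** t)
            x[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return x


def toy_gradient(x):
    """Gradient of the toy objective (x0 - 3)^2 + (x1 + 1)^2 (hypothetical)."""
    return [2.0 * (x[0] - 3.0), 2.0 * (x[1] + 1.0)]


solution = adam_minimize(toy_gradient, [0.0, 0.0])
```

The per-coordinate scaling by the second-moment estimate is what distinguishes ADAM from plain gradient descent, which uses a single fixed step size for every coordinate.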

2022 ◽  
Vol 9 ◽  
Suleman Khan ◽  
Saqib Hakak ◽  
N. Deepa ◽  
B. Prabadevi ◽  
Kapal Dev ◽  

Since the emergence of COVID-19 in December 2019, there have been numerous posts and news items about the pandemic in social media, traditional print, and electronic media. These sources carry information from both trusted and non-trusted medical sources. Furthermore, news from these media spreads rapidly. Spreading a piece of deceptive information may lead to anxiety, unwanted exposure to medical remedies, and digital marketing tricks, and may even have deadly consequences. Therefore, a model for detecting fake news in the news pool is essential. In this work, a dataset that fuses COVID-19-related news from several social media and news sources is used for classification. In the first step, preprocessing is performed on the dataset to remove unwanted text, then tokenization is carried out to extract the tokens from the raw text data collected from the various sources. Later, feature selection is performed to avoid the computational overhead incurred in processing all the features in the dataset. The linguistic and sentiment features are extracted for further processing. Finally, several state-of-the-art machine learning algorithms are trained to classify the COVID-19-related dataset. These algorithms are then evaluated using various metrics. The results show that the random forest classifier outperforms the other classifiers with an accuracy of 88.50%.
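The first three steps of such a pipeline (preprocessing, tokenization, feature selection) can be sketched as below. This is a minimal sketch under assumptions, not the pipeline from the study: treating URLs as the "unwanted text", selecting features by document frequency, and all function names and sample headlines are hypothetical. The linguistic and sentiment features and the classifiers are not shown.

```python
import re
from collections import Counter

URL_RE = re.compile(r"https?://\S+")
# Keeps hyphenated tokens such as "covid-19" together.
TOKEN_RE = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")


def preprocess(text):
    """Lowercase the text and strip URLs (assumed to be the unwanted text)."""
    return URL_RE.sub(" ", text.lower())


def tokenize(text):
    """Split preprocessed text into word tokens."""
    return TOKEN_RE.findall(preprocess(text))


def top_k_features(docs, k):
    """Select the k tokens that occur in the most documents.

    Ties are broken alphabetically so the result is deterministic.
    """
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(tokenize(doc)))
    ranked = sorted(doc_freq.items(), key=lambda kv: (-kv[1], kv[0]))
    return [token for token, _ in ranked[:k]]


def vectorize(doc, vocab):
    """Count how often each selected token occurs in one document."""
    counts = Counter(tokenize(doc))
    return [counts[token] for token in vocab]


# Hypothetical headlines, for illustration only.
sample_docs = [
    "COVID-19 cure found! https://example.com/x",
    "New COVID-19 vaccine trial",
    "Vaccine trial results",
]
sample_vocab = top_k_features(sample_docs, 3)
sample_vectors = [vectorize(doc, sample_vocab) for doc in sample_docs]
```

Restricting the vocabulary to `k` tokens is the point of the feature selection step: every later stage then works on short, fixed-length vectors instead of the full token set.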

Runumi Devi ◽  
Deepti Mehrotra ◽  
Sana Ben Abdallah Ben Lamine

Electronic Health Record (EHR) systems in healthcare organisations are primarily maintained in isolation from each other, which makes interoperability of the unstructured (text) data stored in these systems challenging. Different applications may describe similar information using different terminologies, a problem that can be avoided by transforming the content into the Resource Description Framework (RDF) model, which is interoperable amongst organisations. RDF requires a document’s contents to be translated into a repository of triples (subject, predicate, object) known as RDF statements. Natural Language Processing (NLP) techniques can help extract actionable insights from these text data and create triples for RDF model generation. This paper discusses two NLP-based approaches to generating RDF models from unstructured patient documents, one based on a dependency structure parser and one based on a constituent (phrase) structure parser. Models generated by both approaches are evaluated in two aspects: exhaustiveness of the represented knowledge and model generation time. The precision measure is used to compute the models’ exhaustiveness in terms of the number of facts that are transformed into RDF representations.
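The triple representation and the precision measure can be illustrated with a small sketch. It assumes the triples have already been produced by some parser, which is not shown; the `Triple` type, the `http://example.org/ehr/` namespace, the choice to serialise objects as plain literals, and the sample facts are all hypothetical.

```python
from typing import NamedTuple, Set


class Triple(NamedTuple):
    """One RDF statement: subject, predicate, object."""
    subject: str
    predicate: str
    obj: str


EX = "http://example.org/ehr/"  # hypothetical namespace for illustration


def to_ntriples(t: Triple) -> str:
    """Serialise a triple as one N-Triples line with a plain literal object.

    Subject and predicate names are assumed to be IRI-safe identifiers.
    """
    literal = t.obj.replace("\\", "\\\\").replace('"', '\\"')
    return f'<{EX}{t.subject}> <{EX}{t.predicate}> "{literal}" .'


def precision(extracted: Set[Triple], reference: Set[Triple]) -> float:
    """Fraction of extracted triples that also appear in the reference set."""
    if not extracted:
        return 0.0
    return len(extracted & reference) / len(extracted)


# Hypothetical extraction result and hand-made reference facts.
extracted_facts = {
    Triple("patient42", "hasDiagnosis", "hypertension"),
    Triple("patient42", "takesMedication", "aspirin"),
    Triple("patient42", "hasAge", "unknown"),
}
reference_facts = {
    Triple("patient42", "hasDiagnosis", "hypertension"),
    Triple("patient42", "takesMedication", "aspirin"),
    Triple("patient42", "hasAllergy", "penicillin"),
}
score = precision(extracted_facts, reference_facts)
```

Here two of the three extracted statements match the reference, so `score` is 2/3; how the study defines its reference facts is not stated in the abstract.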

2022 ◽  
Vol 13 (1) ◽  
pp. 161-171
Asriani Abbas ◽  
Kaharuddin ◽  
Muhammad Hasyim

The Makassarese language belongs to the Austronesian language family and is currently spoken as a mother tongue by a group of people in South Sulawesi province, eastern Indonesia. This research focuses on the organization of personal pronouns in Makassarese sentence construction. The form, position, and function of personal pronouns in Makassarese sentences are explained. Data were collected with the ‘simak’ (to-observe) method, in the form of a conversational involved-observation technique that includes recording and note-taking. The data sources were oral data and text data. The oral data were taken from five informants selected purposively. The text data were taken from the folklore script of South Sulawesi written in the Makassarese language. The data were presented descriptively and analyzed using the distributional method. The findings show two forms of personal pronouns used dominantly in constructing sentences: the free personal pronoun and the bound personal pronoun (clitic). The free personal pronoun occurs at the beginning, in the middle, or at the end of a sentence. The clitic attaches at the front or at the end of the verb. In addition, a clitic can attach to the end of a noun, where it serves as a possessive. A sentence starting with a free personal pronoun follows the pattern SV (subject-verb) or SVO (subject-verb-object), and a sentence starting with a clitic-attached verb follows the pattern VS (verb-subject) or VSO (verb-subject-object). The basic structure of the Makassarese sentence is VS or VSO. The derivative structure is SV or SVO with other varieties.

2022 ◽  
Vol 2161 (1) ◽  
pp. 012034
Pratyaksh Jain ◽  
Karthik Ram Srinivas ◽  
Abhishek Vichare

Depression is a common type of mental illness that can impair performance and lead to suicidal ideation or attempts. Traditional techniques used by mental health experts can assist in determining an individual’s type of depression. Machine learning and NLP were used to predict which posts indicate depression and to measure how accurately this can be done. For this work, we used a dataset from Reddit. Reddit is well suited as a supplement to the traditional public health system because of its immediacy in exchanging ideas, its versatility in expressing emotions, and its compatibility with medical terminology. We examined comments and posts about suicidal ideation. We used NLP to gain a better understanding of the interdisciplinary fields related to suicide. We identified two help groups for depression and suicidal thoughts: r/depression and r/SuicideWatch. The well-known “SuicideWatch” subreddit is commonly used by people who have thoughts of suicide and gives significant signals of suicidal behavior. A brief scan through the articles shows that the subreddits are legitimate online places to seek assistance and that they provide honest text data about people’s mental state. We used multiple ML algorithms, such as Naïve Bayes and SVM. To address the research problem, we considered two subreddits that provided us with appropriate information to track people at risk. We achieved 77.29% accuracy and a 0.77 F1-score with Logistic Regression, 74.35% accuracy and a 0.74 F1-score with Naïve Bayes, 77.120% accuracy and a 0.77 F1-score with Support Vector Machine, and 77.298% accuracy and a 0.77 F1-score with Random Forest.
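The two reported metrics can be computed from true and predicted labels as sketched below. This is a minimal illustration assuming a binary task and the F1-score of the positive class; the abstract does not say which averaging the authors used, and the label vectors here are hypothetical, not data from the study.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels: 1 = post flagged as indicating depression, 0 = not.
true_labels = [1, 1, 1, 0, 0, 0, 1, 0]
predicted_labels = [1, 1, 0, 0, 0, 1, 1, 0]
acc = accuracy(true_labels, predicted_labels)
f1 = f1_score(true_labels, predicted_labels)
```

Accuracy and F1 coincide on this toy data only because false positives and false negatives are balanced; on imbalanced data the two can diverge substantially, which is why both are usually reported.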
