A Big Data Text Coverless Information Hiding Based on Topic Distribution and TF-IDF

2021 ◽  
Vol 13 (4) ◽  
pp. 40-56
Author(s):  
Jiaohua Qin ◽  
Zhuo Zhou ◽  
Yun Tan ◽  
Xuyu Xiang ◽  
Zhibin He

Coverless information hiding has become a hot topic in recent years. Existing steganalysis tools are rendered ineffective because coverless steganography makes no modification to the carrier. However, because coverless text steganography has relatively low hiding capacity, this paper proposes a big data text coverless information hiding method based on LDA (latent Dirichlet allocation) topic distribution and keyword TF-IDF (term frequency-inverse document frequency). First, the sender and receiver build a codebook, which involves word segmentation, word frequency and TF-IDF features, and LDA topic model clustering. The sender then shreds the secret information, converts it into keyword IDs through the keywords-index table, and searches for texts containing the secret information keywords. Second, each retrieved text is taken as an index tag according to its topic distribution and TF-IDF features. At the same time, random numbers are introduced to control the keyword order of the secret information.
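The keywords-index lookup and carrier search described above can be sketched as follows; the mini-corpus, the TF-IDF weighting, and the helper names are illustrative assumptions, not the paper's actual codebook format:

```python
import math
from collections import Counter

# Hypothetical mini-corpus standing in for the shared big-data text library.
corpus = [
    "the market opened higher on strong trade data",
    "heavy rain is expected across the northern coast",
    "researchers report progress on topic model training",
]

def tf_idf(corpus):
    """TF-IDF weight for every (doc, term) pair, as used to build the shared codebook."""
    docs = [doc.split() for doc in corpus]
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))
    weights = []
    for d in docs:
        tf = Counter(d)
        weights.append({t: (c / len(d)) * math.log(n / df[t]) for t, c in tf.items()})
    return weights

weights = tf_idf(corpus)

# Codebook: every distinct term gets a numeric keyword ID shared by sender and receiver.
vocab = sorted({t for doc in corpus for t in doc.split()})
keyword_id = {t: i for i, t in enumerate(vocab)}

def find_carriers(secret_keywords, corpus):
    """For each secret keyword, indices of texts containing it (candidate carriers)."""
    return {k: [i for i, doc in enumerate(corpus) if k in doc.split()] for k in secret_keywords}

ids = [keyword_id[k] for k in ["rain", "market"]]      # secret words -> keyword IDs
carriers = find_carriers(["rain", "market"], corpus)   # texts usable as index tags
```

In the full scheme the sender transmits only innocuous carrier texts, and the receiver recovers the keywords from the shared codebook; nothing in the carrier is modified.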

2018 ◽  
Author(s):  
Shatrunjai P. Singh ◽  
Swagata Karkare ◽  
Sudhir M. Baswan ◽  
Vijendra P. Singh

Abstract: Content summarization is an important area of research in traditional data mining. The volume of studies published on anti-epileptic drugs (AEDs) has increased exponentially over the last two decades, making it an important area for the application of text-mining-based summarization algorithms. In the current study, we use text analytics algorithms to mine and summarize 10,000 PubMed abstracts related to anti-epileptic drugs published within the last 10 years. Term frequency–inverse document frequency (TF-IDF) based filtering was applied to identify the drugs with the highest frequency of mentions within these abstracts. The US Food and Drug Administration database was scraped and linked to the results to quantify the most frequently mentioned modes of action and to identify the pharmaceutical entities marketing these drugs. A sentiment analysis model was created to score the abstracts for sentiment positivity or negativity. Finally, a modified latent Dirichlet allocation topic model was generated to extract key topics associated with the most frequently mentioned AEDs. The results of this study provide accurate, data-intensive insights into the progress of anti-epileptic drug research.
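The sentiment-scoring step can be illustrated with a minimal lexicon-based sketch; the word lists and scoring rule below are assumptions for illustration, as the abstract does not specify the study's actual model:

```python
# Minimal lexicon-based sentiment scorer: counts positive vs negative cue words
# in an abstract and returns a score in [-1, 1]. The tiny lexicons are
# hypothetical stand-ins for a real sentiment vocabulary.
POSITIVE = {"effective", "improved", "safe", "tolerated", "significant", "reduction"}
NEGATIVE = {"adverse", "toxicity", "failure", "worsened", "risk", "withdrawal"}

def sentiment_score(abstract: str) -> float:
    tokens = abstract.lower().split()
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    total = pos + neg
    return 0.0 if total == 0 else (pos - neg) / total
```

A trained classifier would replace the fixed lexicons, but the interface (abstract in, signed score out) is the same.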


2021 ◽  
Vol 5 (3) ◽  
pp. 421-428
Author(s):  
Diana Purwitasari ◽  
Aida Muflichah ◽  
Novrindah Alvi Hasanah ◽  
Agus Zainal Arifin

The undergraduate thesis, called Tugas Akhir in Indonesian, is the final project required of each undergraduate student before graduation, and successfully finishing it is one of the programme's learning outcomes. Determining a final project topic that matches a student's abilities is therefore important. One strategy for choosing a topic is reading the literature, but this is time-consuming. There is a need for a recommendation system that helps students determine a topic according to their abilities or subject understanding, based on their academic transcripts. This study focused on a system for final project topic recommendations based on evaluating competencies in the academic transcripts of previously graduated students. We collected data on previous final projects, namely titles and abstracts, weighted terms by TF-IDF (term frequency–inverse document frequency), and grouped the documents using K-Means clustering. From each resulting cluster, we prepared candidate recommended topics using latent Dirichlet allocation (LDA) with Gibbs sampling, focusing on the word distribution of each topic in the cluster. Evaluations were performed to select the optimal cluster number and topic number, followed by a more thorough exploration of the recommendation results. Our experiments showed that the proposed system could recommend final project topic ideas based on student competence as represented in their academic transcripts.
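The TF-IDF weighting plus K-Means grouping stage can be sketched with scikit-learn; the sample documents and the cluster count k=2 are illustrative assumptions:

```python
# Sketch of the clustering stage: TF-IDF weighting of titles/abstracts followed by
# K-Means grouping; each resulting cluster would then feed an LDA model.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "deep learning model for image classification of medical scans",
    "convolutional neural network image recognition system",
    "database design for inventory management information system",
    "relational database optimization for warehouse records",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)                       # TF-IDF document-term matrix
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = kmeans.labels_                                  # cluster id per document
```

In the paper's pipeline, LDA with Gibbs sampling is then run per cluster to surface candidate topics; here the two toy clusters separate image-related from database-related projects.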


Author(s):  
Aakanksha Sharaff ◽  
Jitesh Kumar Dewangan ◽  
Dilip Singh Sisodia

Enormous numbers of records and documents are gathered every day, and organizing this data is a challenging task. Topic modeling provides a way to categorize these documents, but the high dimensionality of the corpus affects the result of the topic model, making it important to apply feature selection or an information retrieval process for dimensionality reduction. Efficient topic modeling requires the removal of unrelated words, which might otherwise lead to spurious co-occurrences of unrelated words. This paper proposes an efficient framework for generating better topic coherence, in which term frequency-inverse document frequency (TF-IDF) and a parsimonious language model (PLM) are used for the information retrieval task. The PLM extracts the important information and expels general words from the corpus, whereas TF-IDF re-estimates the weight of each word in the corpus. The work carried out in this paper improved the topic coherence measure, providing a better correlation between the actual topics and the topics generated from the PLM.
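The PLM step can be sketched with the standard EM re-estimation for parsimonious language models (following Hiemstra et al.); the toy corpus and the mixing weight lam are illustrative assumptions:

```python
# Parsimonious language model (PLM) re-estimation: terms well explained by the
# background corpus model lose probability mass, so general words are pushed out
# and distinctive content words are kept.
from collections import Counter

def parsimonious_lm(doc_tokens, corpus_tokens, lam=0.1, iters=20):
    tf = Counter(doc_tokens)
    cf = Counter(corpus_tokens)
    p_bg = {t: cf[t] / len(corpus_tokens) for t in tf}   # background model p(t|C)
    p = {t: c / len(doc_tokens) for t, c in tf.items()}  # init p(t|D) by MLE
    for _ in range(iters):
        # E-step: expected counts, discounting mass explained by the background
        e = {t: tf[t] * (lam * p[t]) / ((1 - lam) * p_bg[t] + lam * p[t]) for t in tf}
        total = sum(e.values())
        # M-step: renormalize into a probability distribution
        p = {t: e[t] / total for t in tf}
    return p

corpus = "the of and topic model the of and data text the of and".split()
doc = "the topic model of text data".split()
p = parsimonious_lm(doc, corpus + doc)
```

After a few iterations, function words like "the" keep almost no mass while content words like "topic" dominate, which is exactly the "expels general words" behavior the framework relies on before topic modeling.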


Complexity ◽  
2018 ◽  
Vol 2018 ◽  
pp. 1-10 ◽  
Author(s):  
Lirong Qiu ◽  
Jia Yu

Against the present big data background, how to effectively extract useful information is a central problem. The purpose of this study is to construct a more effective method of mining the interest preferences of users in a particular field in today's big data context, using a large amount of user text data from microblogs. LDA is an effective method of text mining, but applying LDA directly to the large number of short texts on microblogs performs poorly. In current topic modeling practice, short texts need to be aggregated into long texts to avoid data sparsity. However, aggregated short texts are mixed with a lot of noise, reducing the accuracy of mining the user's interest preferences. In this paper, we propose Combining Latent Dirichlet Allocation (CLDA), a new topic model that can learn the potential topics of microblog short texts and long texts simultaneously. The data sparsity of short texts is avoided by aggregating them into long texts that assist the learning of the short texts; the short texts are then reused to filter the long texts, improving mining accuracy and effectively combining long and short texts. Experimental results on a real microblog data set show that CLDA outperforms many advanced models in mining user interest, and we also confirm that CLDA performs well in recommender systems.
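The aggregation step CLDA relies on can be sketched as simple pooling of a user's short posts into one long pseudo-document, so LDA has enough word co-occurrence signal; the user IDs and posts are illustrative:

```python
# Pooling short microblog posts per user into long pseudo-documents,
# the standard remedy for data sparsity before running LDA on short texts.
from collections import defaultdict

posts = [
    ("u1", "new phone camera review"),
    ("u2", "football match tonight"),
    ("u1", "phone battery life test"),
    ("u2", "great goal in the match"),
]

def pool_by_user(posts):
    """Concatenate each user's short texts into one long pseudo-document."""
    pooled = defaultdict(list)
    for user, text in posts:
        pooled[user].append(text)
    return {user: " ".join(texts) for user, texts in pooled.items()}

long_docs = pool_by_user(posts)
```

CLDA's contribution is the reverse pass as well: the short texts filter noise out of these pooled documents, rather than the pooling being a one-way preprocessing step.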


2021 ◽  
Vol 251 ◽  
pp. 01050
Author(s):  
Yang Wei

With publicly available data collected from mainstream information platforms, this study used the term frequency-inverse document frequency (TF-IDF) algorithm to detect 74 popular terms and phrases about employment, analyzed the changes in the ranking of these terms and phrases, and visualized the changing trend in attention to employment skills from 2017 to 2019. The results will facilitate the application of big data technology to teaching administration in colleges and provide a guide for college students in planning their study of vocational skills.
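The ranking-change analysis can be sketched as follows; the per-year TF-IDF scores are hypothetical numbers for illustration:

```python
# Tracking rank changes of employment-skill terms across years:
# rank each term by its TF-IDF score per year, then compare ranks.
def ranks(scores):
    """Map each term to its rank (1 = highest TF-IDF score) within one year."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {term: i + 1 for i, term in enumerate(ordered)}

tfidf_2017 = {"communication": 0.41, "python": 0.18, "teamwork": 0.32}
tfidf_2019 = {"communication": 0.30, "python": 0.45, "teamwork": 0.22}

r17, r19 = ranks(tfidf_2017), ranks(tfidf_2019)
# Positive delta = the term moved up the ranking between 2017 and 2019.
delta = {t: r17[t] - r19[t] for t in r17}
```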


Author(s):  
Hayoung Kim ◽  
Yoonsun Han ◽  
Juyoung Song ◽  
Tae Min Song

As the contemporary phenomenon of school bullying has become more widespread, diverse, and frequent among adolescents in Korea, social big data may offer a new methodological paradigm for understanding the trends of school bullying in the digital era. This study computed Term Frequency-Inverse Document Frequency (TF-IDF) scores and Future Signals for 177 school bullying forms to understand the current and future bullying experiences of adolescents, using 436,508 web documents collected between 1 January 2013 and 31 December 2017. In the social big data, sexual bullying increased rapidly, and physical and cyber bullying showed high frequency with a high rate of growth. School bullying forms such as "group assault" and "sexual harassment" appeared as Weak Signals, while "cyber bullying" was a Strong Signal. The findings, covering five school bullying forms (verbal, physical, relational, sexual, and cyber bullying), are valuable for developing insights into the burgeoning phenomenon of school bullying.
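The Weak/Strong Signal distinction can be sketched with the usual two-axis rule from Future Signals analysis (visibility = average frequency, diffusion = growth rate); the thresholds and yearly counts below are illustrative assumptions, not the study's data:

```python
# Classify a term as a Weak or Strong Signal: both need high growth,
# but a Weak Signal is still low-frequency while a Strong Signal is high-frequency.
def classify(freq_by_year, freq_median, growth_median):
    years = sorted(freq_by_year)
    avg_freq = sum(freq_by_year.values()) / len(years)
    growth = (freq_by_year[years[-1]] - freq_by_year[years[0]]) / freq_by_year[years[0]]
    if growth > growth_median:
        return "Strong Signal" if avg_freq > freq_median else "Weak Signal"
    return "Well-known/Latent Signal"

cyber = {2013: 120, 2017: 300}       # high frequency, high growth
group_assault = {2013: 5, 2017: 18}  # low frequency, high growth

label_cyber = classify(cyber, freq_median=50, growth_median=1.0)
label_group = classify(group_assault, freq_median=50, growth_median=1.0)
```

This matches the abstract's pattern: "cyber bullying" lands in the Strong Signal quadrant while the rarer but fast-growing "group assault" lands in the Weak Signal quadrant.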


2020 ◽  
Vol 10 (22) ◽  
pp. 8000
Author(s):  
Sukil Cha ◽  
Mun Y. Yi ◽  
Sekyoung Youm

As the number of researchers in South Korea has grown, dissatisfaction with the selection process for national research and development (R&D) projects has increased among unsuccessful applicants. In this study, we designed a system that recommends the best possible R&D evaluators using big data collected from related systems, then refined and analyzed. Our big data recommendation system compares keywords extracted from applications with keywords extracted from the full text of evaluator candidates' achievements. Keyword weights are scored using the term frequency-inverse document frequency algorithm. Using the keywords extracted from the evaluator candidates' achievements, a project comparison module searches, scores, and ranks these achievements by their similarity to the project applications. The similarity scoring module then calculates overall similarity scores for the candidates based on the project comparison module's scores. To assess the performance of the evaluator candidate recommendation system, evaluator candidates for 61 applications in three Review Board (RB) research fields (system fusion, organic biochemistry, and Korean literature) were recommended by the system in the same manner as the RB's recommendation. Our tests reveal that the evaluator candidates recommended by the Korean Review Board and those recommended by our system for the 61 applications in different areas were the same. However, our system performed the recommendation in less time, with no bias and fewer personnel. The system requires revisions to reflect qualitative indicators, such as journal reputation, before it can entirely replace the current evaluator recommendation process.
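The similarity-scoring step can be sketched as cosine similarity between sparse TF-IDF keyword vectors of an application and of each candidate's achievements; the keyword weights and candidate names are hypothetical:

```python
import math

def cosine(a, b):
    """Cosine similarity of two sparse keyword-weight dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return 0.0 if na == 0 or nb == 0 else dot / (na * nb)

application = {"graphene": 0.8, "sensor": 0.5, "battery": 0.2}
candidates = {
    "A": {"graphene": 0.7, "sensor": 0.6},   # overlapping research field
    "B": {"literature": 0.9, "poetry": 0.4}, # unrelated field
}
# Rank candidates by similarity to the application; higher = better-matched evaluator.
ranking = sorted(candidates, key=lambda c: cosine(application, candidates[c]), reverse=True)
```

The production system aggregates such per-achievement scores into the overall candidate score before ranking.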


2019 ◽  
Vol 5 (1) ◽  
pp. 3-28 ◽  
Author(s):  
Ronggui Huang

The Weibo platform is a social space for interaction and expression, which requires scholars to examine, simultaneously, both communication patterns and the communicated content among Weibo users. Based on theories of 'network and culture' and relational sociology, this article contends that network fields and the communicated cultural meanings are mutually constituted. A latent Dirichlet allocation (LDA) topic model and social network analysis techniques were used to examine 51,288 Weibo posts published by users concerned with workers' issues, revealing the relationship between community structures and communities' focal topics. Specifically, the LDA topic modeling shows that the focal topics regarding labor issues fall into four groups: workers' culture (art and entertainment) and welfare; predicaments and problems; strikes (rights-defending actions) and labor organizations; and institutions and labor rights. Analysis of interaction patterns among users identified five major online communities which, based on the primary topics communicated within them, were labeled the Labor Homeland Community, Labor Culture Community, Labor Rights Protection Community, Labor Interest Concerned Community, and Labor Institution Concerned Community. The results also showed two new trends in relation to labor issues: first, workers' culture and their integration into urban life have garnered increasing online attention with the growth of the new generation of workers; and second, the Weibo platform provides an interaction channel for labor researchers and labor non-governmental organizations, and such interaction enables the latter to critically reflect on the current conditions or plights of workers from an institutional/structural perspective. The article concludes with a discussion of the significance of utilizing big data analytics to study online culture and social mentality.
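The community-detection side of this analysis can be sketched with modularity-based clustering of a user-interaction graph (each detected community would then be labeled by its dominant LDA topic); the toy graph of two tightly knit groups joined by one tie is an illustrative assumption, and the specific algorithm below is a common choice, not necessarily the article's:

```python
# Modularity-based community detection on a small interaction graph.
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

G = nx.Graph()
G.add_edges_from([
    ("a", "b"), ("b", "c"), ("a", "c"),   # tightly knit group 1
    ("x", "y"), ("y", "z"), ("x", "z"),   # tightly knit group 2
    ("c", "x"),                           # single bridging interaction
])
communities = [set(c) for c in greedy_modularity_communities(G)]
```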


2021 ◽  
Vol 13 (4) ◽  
pp. 57-70
Author(s):  
Xintao Duan ◽  
Baoxia Li ◽  
Daidou Guo ◽  
Kai Jia ◽  
En Zhang ◽  
...  

Steganalysis technology judges whether secret information is present in a carrier by monitoring abnormalities in the carrier data, so traditional information hiding technology has reached a bottleneck. This paper therefore proposes coverless information hiding based on the improved training of Wasserstein GANs (WGAN-GP). The sender trains the WGAN-GP with a natural image and a secret image; the generated image and the secret image are visually identical, and the generator's parameters are saved to form the codebook. The sender uploads the natural image (the disguise image) to a cloud disk. The receiver downloads the disguise image from the cloud disk, obtains the corresponding generator parameters from the codebook, and inputs them to the generator. The generator then outputs an image identical to the secret image, achieving the same result as sending the secret image itself. Experimental results indicate that the scheme produces high image quality and good security.
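The codebook protocol around the trained generator can be sketched as a lookup from the disguise image to the saved generator parameters; the image bytes, digest choice, and parameter file name below are hypothetical stand-ins (the actual WGAN-GP training is omitted):

```python
# Sender and receiver share a codebook that maps a disguise image (here, its
# SHA-256 digest) to the generator checkpoint that reproduces the secret image.
import hashlib

def digest(image_bytes: bytes) -> str:
    return hashlib.sha256(image_bytes).hexdigest()

# Sender side: after training, record which generator checkpoint corresponds
# to this disguise image, then upload only the disguise image.
disguise = b"natural image bytes"
codebook = {digest(disguise): "generator_params_0042.bin"}

# Receiver side: download the disguise image, look up the generator parameters,
# and load them into the generator to reconstruct the secret image.
received = b"natural image bytes"
params_file = codebook[digest(received)]
```

Because only an unmodified natural image ever travels over the channel, there is no embedding artifact for steganalysis to detect.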


Author(s):  
Mengying Dong ◽  
Xiaojun Cao ◽  
Mingbiao Liang ◽  
Lijuan Li ◽  
Guangjian Liu ◽  
...  

Abstract. Background: Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) is a virus that causes severe respiratory illness in humans and is responsible for the current global outbreak of novel coronavirus disease (COVID-19). This study aimed to evaluate the characteristics of publications on coronaviruses, including COVID-19, by using topic modeling. Methods: We extracted all abstracts and retained the most informative words from the COVID-19 Open Research Dataset, which contains 35,092 pieces of coronavirus-related literature published up to March 20, 2020. Using latent Dirichlet allocation modeling, we trained a topic model on the corpus, analyzed the semantic relationships between topics, and compared the topic distribution between COVID-19 and other CoV infections. Results: Eight topics emerged: clinical characterization, pathogenesis research, therapeutics research, epidemiological study, virus transmission, vaccine research, virus diagnostics, and viral genomics. Current COVID-19 research puts more emphasis on clinical characterization, epidemiological study, and virus transmission. In contrast, topics covering diagnostics, therapeutics, vaccines, genomics, and pathogenesis each account for less than 10%, or even less than 4%, of all COVID-19 publications, much lower than their shares for other CoV infections. Conclusions: These results identify knowledge gaps in the area of COVID-19 and offer directions for future research.
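The topic-distribution comparison can be sketched as follows; given per-document dominant-topic assignments for two subsets of the corpus, compute each topic's share and flag under-represented ones. The counts are hypothetical, not the study's data:

```python
# Compare topic shares between two corpus subsets and flag topics whose
# share falls below a threshold (the knowledge-gap criterion).
from collections import Counter

def topic_shares(assignments):
    counts = Counter(assignments)
    total = len(assignments)
    return {topic: counts[topic] / total for topic in counts}

covid = (["clinical"] * 10 + ["epidemiology"] * 6 + ["transmission"] * 2
         + ["vaccines"] * 1 + ["therapeutics"] * 1)
other = (["clinical"] * 3 + ["vaccines"] * 3 + ["therapeutics"] * 2
         + ["epidemiology"] * 2)

covid_shares = topic_shares(covid)
other_shares = topic_shares(other)
gaps = [t for t, s in covid_shares.items() if s < 0.10]   # topics below a 10% share
```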

