Research on Classification of Chinese Text Data Based on SVM

Abstract: This article presents an approach to protecting printed text data by embedding a watermark during the printing process. Protection relies on a robust watermark that survives conversion of the text data into an image. The choice of a robust watermark is justified within the presented classification of digital watermarks, and the requirements for such a watermark are formulated. Based on these requirements and the existing restrictions, an approach to embedding a robust watermark into text data was developed using a steganographic algorithm that shifts line spacing. A block diagram and a description of the embedding algorithm are given. The embedding capacity and perceptual invisibility of the approach were estimated experimentally. An approach to extracting the embedded information from images containing the robust watermark was also developed. The retrieval limits, the extraction accuracy, and the robustness of the embedded data to various transformations were established experimentally.

Cluster-Preserving Dimension Reduction Methods for Efficient Classification of Text Data

Survey of Text Mining ◽

10.1007/978-1-4757-4305-0_1 ◽

2004 ◽

pp. 3-23 ◽

Cited By ~ 4

Author(s):

Peg Howland ◽

Haesun Park

Keyword(s):

Dimension Reduction ◽

Text Data ◽

Reduction Methods

Varying Naïve Bayes Models With Applications to Classification of Chinese Text Documents

Journal of Business and Economic Statistics ◽

10.1080/07350015.2014.903086 ◽

2014 ◽

Vol 32 (3) ◽

pp. 445-456 ◽

Cited By ~ 3

Author(s):

Guoyu Guan ◽

Jianhua Guo ◽

Hansheng Wang

Keyword(s):

Chinese Text ◽

Naïve Bayes ◽

Text Documents

A Data Mining Technique for Tourist Destination Brand Image Building

International Journal of Recent Technology and Engineering - 2 ◽

10.35940/ijrte.f8329.038620 ◽

2020 ◽

Vol 8 (6) ◽

pp. 4617-4622

Keyword(s):

Data Mining ◽

Clustering Algorithm ◽

Tourism Industry ◽

Destination Image ◽

Data Mining Technique ◽

Tourist Destination ◽

Text Data ◽

Clustering And Classification ◽

Google Search

Destination image branding is the area of the tourism industry in which facts and information are collected and evaluated to assess the credibility of a target tourist destination. Collecting and processing this information manually and accurately is complicated and time-consuming, so this work proposes a data mining model that collects and evaluates the destination image and, based on that evaluation, makes recommendations about tourist visits. To this end, data mining techniques are applied to a text data source. First, the data is extracted from the Google search engine and preprocessed to remove impurities. Next, the data is labeled according to the positive and negative words found in the collected facts. Finally, the text is clustered and classified: the FCM (fuzzy c-means) algorithm is used for clustering and a Bayesian classifier for classification. The decision about visiting the destination is based on the final classification of the text data.
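As a rough illustration of the fuzzy c-means step named in this abstract, the sketch below computes the standard FCM membership of one point in each cluster. The abstract gives no parameters, so the fuzzifier `m = 2`, the 2-D feature vectors, and the function name `fcm_memberships` are assumptions made only for this example.

```python
import math


def fcm_memberships(point, centers, m=2.0):
    """Fuzzy c-means membership of one point in each cluster (fuzzifier m > 1)."""
    distances = [math.dist(point, c) for c in centers]
    # A point lying exactly on a center belongs fully to that cluster.
    if 0.0 in distances:
        hits = [1.0 if d == 0.0 else 0.0 for d in distances]
        return [h / sum(hits) for h in hits]
    exponent = 2.0 / (m - 1.0)
    # u_j = 1 / sum_k (d_j / d_k)^(2 / (m - 1))
    return [
        1.0 / sum((d_j / d_k) ** exponent for d_k in distances)
        for d_j in distances
    ]


# Hypothetical 2-D document vectors and cluster centers (toy values).
centers = [(0.0, 0.0), (4.0, 0.0)]
near_first = fcm_memberships((1.0, 0.0), centers)
midpoint = fcm_memberships((2.0, 0.0), centers)
```

Unlike hard clustering, every document receives a degree of membership in every cluster, and the memberships of one document sum to 1.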

Incorporating natural language processing to improve classification of axial spondyloarthritis using electronic health records

Rheumatology ◽

10.1093/rheumatology/kez375 ◽

2019 ◽

Vol 59 (5) ◽

pp. 1059-1065 ◽

Cited By ~ 1

Author(s):

Sizheng Steven Zhao ◽

Chuan Hong ◽

Tianrun Cai ◽

Chang Xu ◽

Jie Huang ◽

...

Keyword(s):

Electronic Health Records ◽

Predictive Value ◽

Area Under The Curve ◽

Free Text ◽

Text Data ◽

Health Records ◽

Disease Concepts ◽

ICD Codes ◽

Electronic Health

Objectives: To develop classification algorithms that accurately identify axial SpA (axSpA) patients in electronic health records, and to compare the performance of algorithms incorporating free-text data against approaches using only International Classification of Diseases (ICD) codes. Methods: An enriched cohort of 7853 eligible patients was created from electronic health records of two large hospitals using automated searches (≥1 ICD codes combined with simple text searches). Key disease concepts from free-text data were extracted using NLP and combined with ICD codes to develop algorithms. We created both supervised regression-based algorithms (on a training set of 127 axSpA cases and 423 non-cases) and unsupervised algorithms to identify patients with high probability of having axSpA from the enriched cohort. Their performance was compared against classifications using ICD codes only. Results: NLP extracted four disease concepts of high predictive value: ankylosing spondylitis, sacroiliitis, HLA-B27 and spondylitis. The unsupervised algorithm, incorporating both the NLP concept and ICD code for AS, identified the greatest number of patients. By setting the probability threshold to attain 80% positive predictive value, it identified 1509 axSpA patients (mean age 53 years, 71% male). Sensitivity was 0.78, specificity 0.94 and area under the curve 0.93. The two supervised algorithms performed similarly but identified fewer patients. All three outperformed traditional approaches using ICD codes alone (area under the curve 0.80–0.87). Conclusion: Algorithms incorporating free-text data can accurately identify axSpA patients in electronic health records. Large cohorts identified using these novel methods offer exciting opportunities for future clinical research.
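The idea of setting a probability threshold to reach a target positive predictive value can be sketched as follows. This is a minimal illustration on invented scores and labels, not the study's data or code; the function names and the rule of taking the lowest qualifying threshold are assumptions.

```python
def confusion_at(scores, labels, threshold):
    """Count TP, FP, TN, FN when scores >= threshold are called positive."""
    tp = fp = tn = fn = 0
    for score, is_case in zip(scores, labels):
        predicted = score >= threshold
        if predicted and is_case:
            tp += 1
        elif predicted and not is_case:
            fp += 1
        elif not predicted and is_case:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def lowest_threshold_for_ppv(scores, labels, target_ppv):
    """Lowest observed score whose PPV meets the target, or None if none does."""
    for threshold in sorted(set(scores)):
        tp, fp, _, _ = confusion_at(scores, labels, threshold)
        if tp + fp and tp / (tp + fp) >= target_ppv:
            return threshold
    return None


# Invented predicted probabilities and gold-standard labels (1 = case).
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10, 0.05]
labels = [1, 1, 1, 0, 1, 0, 1, 0, 0, 0]

threshold = lowest_threshold_for_ppv(scores, labels, 0.80)
tp, fp, tn, fn = confusion_at(scores, labels, threshold)
sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
```

Because PPV does not necessarily rise monotonically with the threshold, the sketch scans every observed score and keeps the lowest one that qualifies, which retains as many identified patients as possible at the target PPV.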

On the unsupervised analysis of domain-specific Chinese texts

Proceedings of the National Academy of Sciences ◽

10.1073/pnas.1516510113 ◽

2016 ◽

Vol 113 (22) ◽

pp. 6154-6159 ◽

Cited By ~ 5

Author(s):

Ke Deng ◽

Peter K. Bol ◽

Kate J. Li ◽

Jun S. Liu

Keyword(s):

Chinese Text ◽

Context Analysis ◽

Text Data ◽

Training Corpus ◽

Domain Specific ◽

Association Pattern ◽

Supervised Segmentation ◽

Chinese Texts ◽

Chinese Text Mining ◽

Better Than

With the growing availability of digitized text data both publicly and privately, there is a great need for effective computational tools to automatically extract information from texts. Because the Chinese language differs most significantly from alphabet-based languages in not specifying word boundaries, most existing Chinese text-mining methods require a prespecified vocabulary and/or a large relevant training corpus, which may not be available in some applications. We introduce an unsupervised method, top-down word discovery and segmentation (TopWORDS), for simultaneously discovering and segmenting words and phrases from large volumes of unstructured Chinese texts, and propose ways to order discovered words and conduct higher-level context analyses. TopWORDS is particularly useful for mining online and domain-specific texts where the underlying vocabulary is unknown or the texts of interest differ significantly from available training corpora. When outputs from TopWORDS are fed into context analysis tools such as topic modeling, word embedding, and association pattern finding, the results are as good as or better than those obtained using outputs of a supervised segmentation method.

Research on SQL injection detection technology based on SVM

MATEC Web of Conferences ◽

10.1051/matecconf/201817301004 ◽

2018 ◽

Vol 173 ◽

pp. 01004 ◽

Cited By ~ 2

Author(s):

Zhuang Chen ◽

Min Guo ◽

Lin Zhou

Keyword(s):

Hot Spot ◽

Web Security ◽

Classification Model ◽

Matching Method ◽

Text Data ◽

SQL Injection ◽

Detection Technology ◽

SVM Algorithm ◽

Fast Variation

SQL injection, which is highly damaging and mutates quickly, has consistently ranked at the top of the OWASP Top 10 and remains a hot spot in web security research. Because the existing rule matching method has difficulty detecting unknown attacks, a method of SQL injection detection based on machine learning is proposed. The authors analyze methods of SQL injection feature extraction and select the word2vec method to process the text data of the HTTP request, which can effectively represent the SQL injection features contained in the attack payload. The processed samples are then trained and classified with the SVM algorithm. The experiment shows that this method effectively addresses the mutation of SQL injection attacks and the high miss rate of rule matching. Compared with classification based on statistical features, this SQL injection classification model achieves a higher detection rate.
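For reference, the textbook soft-margin SVM that such a classifier typically optimizes is written out below. The abstract does not state the kernel, the penalty parameter, or how word2vec outputs are combined into one vector per request, so the linear form, the constant $C$, and the symbols $x_i$ (feature vector of request $i$) and $y_i \in \{-1, +1\}$ (benign or injection) are assumptions for illustration.

```latex
\min_{w,\,b,\,\xi}\quad \frac{1}{2}\lVert w\rVert^{2} + C\sum_{i=1}^{n}\xi_i
\qquad\text{subject to}\qquad
y_i\,(w^{\top}x_i + b) \ge 1 - \xi_i,\quad \xi_i \ge 0,\quad i = 1,\dots,n

\hat{y}(x) = \operatorname{sign}\!\left(w^{\top}x + b\right)
```

A larger $C$ penalizes training errors more heavily, while a smaller $C$ favors a wider margin.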

Methods for Classification of Text Data: Can the Potential of Quantitative Analysis Be Applied to Qualitative Research?

Inter ◽

10.19181/inter.2021.13.2.5 ◽

2021 ◽

Vol 13 (2) ◽

pp. 81-96

Author(s):

Marina Aleksandrova

Keyword(s):

Social Sciences ◽

European Social Survey ◽

Research Field ◽

Item Nonresponse ◽

Text Data ◽

Social Survey ◽

The Social ◽

Textual Data ◽

Analysis Of Results

Text mining has developed rapidly in recent years. This article compares classification methods suitable for predicting item nonresponse. Based on this material, the author discusses how the analysis of textual data can be applied in a wider research field. The author considers several metrics adapted for textual analysis in the social sciences (accuracy, precision, recall, and F1-score) and gives examples that can help a sociologist decide which of them deserves attention depending on the task at hand: classifying text data with equal accuracy across classes, or describing one class of interest more fully. The article presents an analysis of results obtained from texts based on the materials of the European Social Survey (ESS).
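The four metrics named in this abstract can be computed directly from predicted and true labels. The sketch below uses invented toy labels and a hypothetical helper, `classification_metrics`; it is meant only to show how the metrics differ, not to reproduce the article's analysis.

```python
def classification_metrics(y_true, y_pred, positive):
    """Accuracy over all classes; precision, recall and F1 for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Invented labels: the class of interest is "nonresponse".
y_true = ["nonresponse", "response", "response",
          "nonresponse", "response", "response"]
y_pred = ["nonresponse", "response", "nonresponse",
          "response", "response", "response"]

metrics = classification_metrics(y_true, y_pred, positive="nonresponse")
```

On these toy labels accuracy is higher than precision and recall for the class of interest, which is the kind of gap that matters when one class is the focus of the research question.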

Comparing ELM with SVM in the Field of Sentiment Classification of Social Media Text Data

Proceedings in Adaptation, Learning and Optimization - Proceedings of ELM 2018 ◽

10.1007/978-3-030-23307-5_36 ◽

2019 ◽

pp. 336-344

Author(s):

Zhihuan Chen ◽

Zhaoxia Wang ◽

Zhiping Lin ◽

Ting Yang

Keyword(s):

Social Media ◽

Sentiment Classification ◽

Text Data ◽

Social Media Text

Ternary Classification of Product Based Reviews: Survey, Open Issues and New Approach for Sentiment Analysis

Indian Journal of Artificial Intelligence and Neural Networking ◽

10.35940/ijainn.b1008.041221 ◽

2021 ◽

Vol 1 (2) ◽

pp. 1-8

Author(s):

Sushila Sonare ◽

Megha Kamble

Keyword(s):

Social Media ◽

Sentiment Analysis ◽

Text Analysis ◽

Text Data ◽

New Approach ◽

Meaningful Information ◽

Multi Class Classification ◽

Almost All ◽

Open Issues

Nowadays, customers commonly share their thoughts about products and brands, and their experiences with them, on social media. Analysts collect and process these reviews to extract meaningful information about the product. Social media spans all domains, so analysts obtain reviews from different platforms for almost every kind of product. Sentiment analysis is applied to these reviews to predict outcomes, for example whether a movie will be a blockbuster or how a new launch will be rated. Such predictions help customers decide which goods to buy or which services to use in a competitive market. This paper focuses on e-commerce website reviews, which are normally in text form with some special characters and symbols (emojis). Each word in such text carries meaning in terms of context, emotion, and prior experience, and these characteristics contribute to the features of text data used for prediction. The objective of this paper is to compile existing research on text analysis and emotion-based analysis. The open issues and challenges of document-based sentiment analysis are also discussed. The paper concludes by proposing a new approach to multi-class classification: ternary classification into positive, negative, and neutral classes, suggested primarily for product-based text and emoji reviews on Twitter.
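To make the three-class setup concrete, here is a minimal lexicon-based baseline that assigns positive, negative, or neutral labels to reviews containing words and emojis. It is not the approach proposed in the paper, whose details the abstract does not give; the word lists, the punctuation handling, and the function name `ternary_label` are all assumptions.

```python
# Tiny hypothetical sentiment lexicons mixing words and emojis.
POSITIVE = {"good", "great", "love", "😊", "👍"}
NEGATIVE = {"bad", "poor", "hate", "😠", "👎"}


def ternary_label(review):
    """Label a review positive, negative or neutral by lexicon hit counts."""
    # Lowercase, split on whitespace, and trim common punctuation.
    tokens = [t.strip(".,!?") for t in review.lower().split()]
    score = sum(1 for t in tokens if t in POSITIVE)
    score -= sum(1 for t in tokens if t in NEGATIVE)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


labels = [
    ternary_label("Great phone, love it 👍"),
    ternary_label("Poor battery and bad screen 👎"),
    ternary_label("It arrived on Tuesday."),
    ternary_label("Good camera but bad battery"),
]
```

Reviews with no lexicon hits and reviews whose positive and negative hits cancel out both fall into the neutral class, which is one reason a learned classifier is usually preferred over a fixed lexicon.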
