Research on Classification of Chinese Text Data Based on SVM

Author(s):  
Yuan Lin ◽  
Hongzhi Yu ◽  
Fucheng Wan ◽  
Tao Xu
Keyword(s):  
2020 ◽  
Vol 7 (1) ◽  
pp. 3-13
Author(s):  
Alexander Kozachok ◽  
Sergey Kopylov

Abstract: This article presents an approach to protecting printed text data by embedding a watermark during printing. Protection relies on a robust watermark that survives conversion of the text data into an image. The choice of a robust watermark is justified within the presented classification of digital watermarks, and the requirements for the watermark are formulated. Given these requirements and existing restrictions, an approach to embedding a robust watermark into text data is developed, based on a steganographic algorithm that shifts line spacing. A block diagram and a description of the embedding algorithm are given. The embedding capacity and perceptual invisibility of the approach were estimated experimentally. An approach to extracting the embedded information from images containing the robust watermark has also been developed. The limits of retrieval, the extraction accuracy, and the robustness of the embedded data to various transformations have been established experimentally.
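The line-spacing idea in this abstract can be illustrated with a minimal sketch. It assumes each bit is encoded by widening or narrowing one inter-line gap around a nominal spacing. The names `embed_bits` and `extract_bits` and the constants `BASE_GAP` and `DELTA` are hypothetical and do not come from the article, which does not give its parameters here.

```python
from typing import List

BASE_GAP = 14.0  # assumed nominal line spacing in points (hypothetical)
DELTA = 0.5      # assumed shift that encodes one bit (hypothetical)


def embed_bits(bits: List[int], n_lines: int) -> List[float]:
    """Return the vertical position of each line; gap i carries bit i."""
    positions = [0.0]
    for i in range(n_lines - 1):
        if i < len(bits):
            gap = BASE_GAP + (DELTA if bits[i] else -DELTA)
        else:
            gap = BASE_GAP  # gaps without a bit keep the nominal spacing
        positions.append(positions[-1] + gap)
    return positions


def extract_bits(positions: List[float], n_bits: int) -> List[int]:
    """Recover bits by comparing each measured gap with the nominal spacing."""
    gaps = [b - a for a, b in zip(positions, positions[1:])]
    return [1 if gap > BASE_GAP else 0 for gap in gaps[:n_bits]]
```

In a real pipeline the positions would be measured from a scanned or photographed page, so the comparison against `BASE_GAP` would need a tolerance for noise.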


2020 ◽  
Vol 8 (6) ◽  
pp. 4617-4622

Destination image branding is the area of the tourism industry in which facts and information are collected and evaluated to establish the credibility of a target tourist destination. Collecting and processing this information manually and accurately is complicated and time-consuming. This work therefore proposes a data mining model that collects and evaluates the destination image and, based on that evaluation, makes recommendations about tourist visits. Data mining techniques are applied to a text data source. First, the data is extracted from the Google search engine and preprocessed to remove impurities. Next, the data is labeled according to the positive and negative words found in the collected facts. Finally, the text is clustered and classified: the FCM (fuzzy c-means) algorithm is used for clustering and a Bayesian classifier for classification. The decision about visiting a destination is based on the final classification of the text data.
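The labeling step described above might look like the following sketch, which counts positive and negative words in a snippet. The word lists and the `label_text` function are hypothetical, since the abstract does not publish its lexicon, and the "neutral" label for ties is an added assumption.

```python
import re

# Hypothetical word lists; the paper's actual lexicon is not given.
POSITIVE = {"beautiful", "clean", "friendly", "safe"}
NEGATIVE = {"dirty", "crowded", "unsafe", "expensive"}


def label_text(text: str) -> str:
    """Label a snippet by comparing counts of positive and negative words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return "neutral"  # assumption: ties and texts with no matches are neutral
```

The labeled snippets would then serve as input to the clustering and classification stages.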


Rheumatology ◽  
2019 ◽  
Vol 59 (5) ◽  
pp. 1059-1065 ◽  
Author(s):  
Sizheng Steven Zhao ◽  
Chuan Hong ◽  
Tianrun Cai ◽  
Chang Xu ◽  
Jie Huang ◽  
...  

Abstract. Objectives: To develop classification algorithms that accurately identify axial SpA (axSpA) patients in electronic health records, and compare the performance of algorithms incorporating free-text data against approaches using only International Classification of Diseases (ICD) codes. Methods: An enriched cohort of 7853 eligible patients was created from electronic health records of two large hospitals using automated searches (⩾1 ICD codes combined with simple text searches). Key disease concepts from free-text data were extracted using NLP and combined with ICD codes to develop algorithms. We created both supervised regression-based algorithms (on a training set of 127 axSpA cases and 423 non-cases) and unsupervised algorithms to identify patients with high probability of having axSpA from the enriched cohort. Their performance was compared against classifications using ICD codes only. Results: NLP extracted four disease concepts of high predictive value: ankylosing spondylitis, sacroiliitis, HLA-B27 and spondylitis. The unsupervised algorithm, incorporating both the NLP concept and ICD code for AS, identified the greatest number of patients. By setting the probability threshold to attain 80% positive predictive value, it identified 1509 axSpA patients (mean age 53 years, 71% male). Sensitivity was 0.78, specificity 0.94 and area under the curve 0.93. The two supervised algorithms performed similarly but identified fewer patients. All three outperformed traditional approaches using ICD codes alone (area under the curve 0.80–0.87). Conclusion: Algorithms incorporating free-text data can accurately identify axSpA patients in electronic health records. Large cohorts identified using these novel methods offer exciting opportunities for future clinical research.
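Setting a probability threshold to reach a target positive predictive value, then reporting sensitivity and specificity at that threshold, can be sketched as follows. This is an illustration on toy data, not the study's code; the function names and the assumption that labels are 0/1 with scores as predicted probabilities are mine.

```python
from typing import List, Optional, Tuple


def threshold_for_ppv(scores: List[float], labels: List[int],
                      target_ppv: float = 0.80) -> Optional[float]:
    """Return the lowest score threshold whose PPV meets the target, or None."""
    best = None
    for t in sorted(set(scores), reverse=True):
        tp = sum(s >= t and y == 1 for s, y in zip(scores, labels))
        fp = sum(s >= t and y == 0 for s, y in zip(scores, labels))
        if tp + fp and tp / (tp + fp) >= target_ppv:
            best = t  # thresholds are visited high to low, so this ends at the lowest
    return best


def sensitivity_specificity(scores: List[float], labels: List[int],
                            threshold: float) -> Tuple[float, float]:
    """Sensitivity and specificity when scores >= threshold count as positive."""
    tp = sum(s >= threshold and y == 1 for s, y in zip(scores, labels))
    fn = sum(s < threshold and y == 1 for s, y in zip(scores, labels))
    tn = sum(s < threshold and y == 0 for s, y in zip(scores, labels))
    fp = sum(s >= threshold and y == 0 for s, y in zip(scores, labels))
    return tp / (tp + fn), tn / (tn + fp)
```

Choosing the lowest qualifying threshold maximises the number of patients identified while keeping the PPV at or above the target.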


2016 ◽  
Vol 113 (22) ◽  
pp. 6154-6159 ◽  
Author(s):  
Ke Deng ◽  
Peter K. Bol ◽  
Kate J. Li ◽  
Jun S. Liu

With the growing availability of digitized text data both publicly and privately, there is a great need for effective computational tools to automatically extract information from texts. Because the Chinese language differs most significantly from alphabet-based languages in not specifying word boundaries, most existing Chinese text-mining methods require a prespecified vocabulary and/or a large relevant training corpus, which may not be available in some applications. We introduce an unsupervised method, top-down word discovery and segmentation (TopWORDS), for simultaneously discovering and segmenting words and phrases from large volumes of unstructured Chinese texts, and propose ways to order discovered words and conduct higher-level context analyses. TopWORDS is particularly useful for mining online and domain-specific texts where the underlying vocabulary is unknown or the texts of interest differ significantly from available training corpora. When outputs from TopWORDS are fed into context analysis tools such as topic modeling, word embedding, and association pattern finding, the results are as good as or better than those obtained from the outputs of a supervised segmentation method.
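To make the segmentation problem concrete, the sketch below finds the most probable split of an unspaced string under a fixed unigram word model, using dynamic programming. It is not the TopWORDS algorithm, which also discovers the vocabulary and estimates the word probabilities without supervision; here the `word_prob` table, the `segment` function and the `max_len` limit are hypothetical inputs chosen for illustration.

```python
import math
from typing import Dict, List


def segment(text: str, word_prob: Dict[str, float], max_len: int = 4) -> List[str]:
    """Most probable segmentation of text under a unigram word model."""
    n = len(text)
    best = [-math.inf] * (n + 1)  # best[i]: best log-probability of text[:i]
    back = [0] * (n + 1)          # back[i]: start index of the last word in text[:i]
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            word = text[start:end]
            if word in word_prob and best[start] > -math.inf:
                score = best[start] + math.log(word_prob[word])
                if score > best[end]:
                    best[end], back[end] = score, start
    if best[n] == -math.inf:
        raise ValueError("text cannot be segmented with this vocabulary")
    words = []
    i = n
    while i > 0:  # walk the back-pointers from the end of the text
        words.append(text[back[i]:i])
        i = back[i]
    return words[::-1]
```

For example, with a toy table that makes "研究" and "生命" more probable than "研究生" and "命", the string "研究生命起源" is split as 研究 / 生命 / 起源 rather than 研究生 / 命 / 起源.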


2018 ◽  
Vol 173 ◽  
pp. 01004 ◽  
Author(s):  
Zhuang Chen ◽  
Min Guo ◽  
Lin Zhou

SQL injection is highly damaging and mutates quickly; it has consistently ranked at the top of the OWASP Top 10 and remains a focus of web security research. Because existing rule-matching methods have difficulty detecting unknown attacks, a machine-learning method for SQL injection detection is proposed. The authors analyse methods of SQL injection feature extraction and select word2vec to process the text data of the HTTP request, which can effectively represent SQL injection features containing the attack payload. The processed samples are then trained and classified with an SVM. Experiments show that this method effectively addresses the weakness of rule matching against mutated SQL injection and its high rate of missed detections. Compared with classification based on statistical features, this SQL injection classification model achieves a higher detection rate.
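Before word2vec can be applied, the HTTP request text has to be turned into token sequences. The sketch below shows one plausible way to do this with the Python standard library. The `tokenize_payload` function, the token pattern and the replacement of numbers by a `<num>` placeholder are assumptions for illustration; the paper's actual preprocessing is not described in this abstract.

```python
import re
from typing import List
from urllib.parse import unquote_plus

# Words, digit runs, or any single other non-space character (quotes, operators).
TOKEN_RE = re.compile(r"[a-z_]+|\d+|[^\sa-z_\d]")


def tokenize_payload(raw: str) -> List[str]:
    """URL-decode a request parameter string and split it into tokens."""
    decoded = unquote_plus(raw).lower()
    tokens = TOKEN_RE.findall(decoded)
    # Assumption: concrete numbers carry little signal, so map them to one symbol.
    return ["<num>" if t.isdigit() else t for t in tokens]
```

Token lists of this kind could then be passed to a word2vec implementation, with the resulting vectors (for example, averaged per request) used as SVM input.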


Inter ◽  
2021 ◽  
Vol 13 (2) ◽  
pp. 81-96
Author(s):  
Marina Aleksandrova

Text mining has developed rapidly in recent years. This article compares classification methods suitable for predicting item nonresponse. Building on this material, the author discusses how the analysis of textual data can be applied in a wider research field. The author considers several metrics adapted for textual analysis in the social sciences (accuracy, precision, recall and F1-score) and gives examples that can help a sociologist decide which of them deserves attention for the task at hand: classifying text data with equal accuracy across classes, or describing one class of interest more fully. The article analyses results obtained from texts based on the materials of the European Social Survey (ESS).
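The four metrics named above can be computed directly from predicted and true labels. The sketch below uses their standard definitions; the `binary_metrics` function and the convention of returning 0.0 when a denominator is zero are choices made for this illustration, not taken from the article.

```python
from typing import Dict, Sequence


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int],
                   positive: int = 1) -> Dict[str, float]:
    """Accuracy, precision, recall and F1 for one class of interest."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = len(pairs) - tp - fp - fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }
```

With a rare class such as item nonresponse, accuracy can stay high while recall is low: if 2 of 10 answers are nonresponses and only one is caught, accuracy is 0.9 but recall is 0.5, which is why the choice of metric depends on the task.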


Author(s):  
Sushila Sonare ◽  
Megha Kamble

Customers now commonly share their thoughts about products and brands, and their experiences with them, on social media. Analysts collect and process these reviews to extract meaningful information about a product. Because social media spans every domain, analysts can obtain reviews from different platforms for almost any kind of item. Sentiment analysis is applied to such reviews to make useful predictions, for example which movie will be a blockbuster or how a new launch will be rated. Such predictions help customers decide which goods to buy or services to use in a competitive market. This paper focuses on e-commerce website reviews, which are normally in text form with some special characters and symbols (emojis). Each word in such a text carries meaning in terms of context, emotion and prior experience, and these characteristics contribute features of the text data for prediction. The objective of this paper is to compile existing research on text analysis and emotion-based analysis. Open issues and challenges of document-based sentiment analysis are also discussed. The paper concludes by proposing a new approach to multi-class classification: ternary classification into positive, negative and neutral classes, suggested primarily for product-based text and emoji reviews on Twitter.

