Solving the problem of calendar data preprocessing during the implementation of Data Mining technology

2020 ◽  
Vol 15 (90) ◽  
pp. 27-41
Author(s):  
Boris V. Okunev ◽  
Alexander S. Shurykin

At present, dirty (low-quality) data is becoming one of the main obstacles to solving Data Mining tasks effectively. Since source data is accumulated from a variety of sources, the probability of acquiring dirty data is very high. One of the most important tasks to be solved during the implementation of the Data Mining process is therefore the initial processing (cleaning) of data, i.e. preprocessing. It should be noted that preprocessing calendar data is a rather time-consuming procedure that can take up to half of the entire time of implementing the Data Mining technology. The time spent on data cleaning can be reduced by automating this process with specially designed tools (algorithms and programs). At the same time, it should be remembered that the use of such tools does not guarantee one hundred percent cleaning of "dirty" data and in some cases may even introduce additional errors into the source data. The authors developed a model for automated preprocessing of calendar data based on parsing and regular expressions. The proposed algorithm is characterized by flexible configuration of preprocessing parameters, fairly simple implementation and high interpretability of results, which in turn provides additional opportunities for analyzing unsuccessful applications of the Data Mining technology. Although the proposed algorithm cannot clean every type of dirty calendar data, it functions successfully in a significant share of real practical situations.
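The abstract does not reproduce the authors' algorithm, but the parsing-plus-regular-expressions approach it describes can be sketched roughly as follows. The pattern list, the ISO 8601 target format, and the rule that unparseable values are flagged rather than guessed are illustrative assumptions, not the paper's actual rule set:

```python
import re
from datetime import datetime

# Candidate regex patterns mapped to strptime formats (a hypothetical rule
# set; the paper's configurable preprocessing parameters would play this role).
PATTERNS = [
    (re.compile(r"^\d{4}-\d{2}-\d{2}$"), "%Y-%m-%d"),
    (re.compile(r"^\d{2}/\d{2}/\d{4}$"), "%d/%m/%Y"),
    (re.compile(r"^\d{2}\.\d{2}\.\d{4}$"), "%d.%m.%Y"),
    (re.compile(r"^[A-Za-z]+ \d{1,2}, \d{4}$"), "%B %d, %Y"),
]

def normalize_date(raw):
    """Return an ISO 8601 date string, or None if the value stays 'dirty'."""
    value = raw.strip()
    for pattern, fmt in PATTERNS:
        if pattern.match(value):
            try:
                return datetime.strptime(value, fmt).date().isoformat()
            except ValueError:
                # e.g. "31.02.2020" matches the regex but is not a real date
                return None
    return None
```

Returning `None` instead of a best guess reflects the abstract's caution that automated cleaning can otherwise introduce new errors into the source data; flagged values can then be inspected by hand.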

2021 ◽  
Vol 2021 ◽  
pp. 1-13
Author(s):  
Jie Wan ◽  
Xue Cao ◽  
Kun Yao ◽  
Donghui Yang ◽  
E. Peng ◽  
...  

False information on the Internet is increasingly recognized as a serious social harm. To recognize false textual information, this paper proposes an effective method for mining text features in the field of false drug advertisements. Firstly, data on false and real drug advertisements were collected from official websites to build a database of false and real drug advertisements. Secondly, by performing feature extraction on the advertisement texts, this work built a feature matrix from the effective features and assigned a positive or negative label to each feature vector according to whether the advertisement is fake or not. Thirdly, this study trained and tested several different classifiers, selected the classification model with the best performance in identifying false drug advertisements, and identified the key features that determine the classification. Finally, the best-performing model was used to predict new false drug advertisements collected from Sina Weibo. In identifying false drug advertisements, the support vector machine (SVM) classifier built on the feature set obtained after feature selection was the most effective. The findings of this study can provide an effective method for the government to identify and combat false advertisements, and demonstrate the use of text data mining technology to identify and detect information fraud.
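A minimal sketch of the kind of pipeline the abstract describes, using TF-IDF features and a linear SVM via scikit-learn. The tiny toy corpus and labels below are invented for illustration; the study used a real database of drug advertisements and performed feature selection that this sketch omits:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy corpus standing in for the advertisement database (invented examples).
ads = [
    "cures all diseases overnight guaranteed",
    "miracle pill melts fat instantly no diet",
    "approved pain reliever follow dosage instructions",
    "consult your doctor before use clinically tested",
]
labels = [1, 1, 0, 0]  # 1 = fake advertisement, 0 = genuine

# TF-IDF feature matrix + linear SVM, mirroring the abstract's setup.
model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(ads, labels)

print(model.predict(["guaranteed miracle cure overnight"])[0])
```

On a real corpus one would also apply the feature selection step the abstract mentions (e.g. chi-squared filtering of the TF-IDF vocabulary) before training, since the study found the SVM performed best on the selected feature set.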


2014 ◽  
Vol 644-650 ◽  
pp. 1976-1979
Author(s):  
Chao Wang ◽  
Ying Jie Lian

This paper introduces the basic theory of data mining technology and the genetic algorithm, analyzes the feasibility of applying the genetic algorithm in data mining, and uses customer satisfaction as an example to demonstrate the feasibility and validity of the model.
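The abstract gives no implementation details, but the core genetic-algorithm loop it relies on (selection, crossover, mutation) can be sketched as follows. The bit-string encoding, fitness function, and parameter values are assumptions for illustration; in a data mining setting the bit string might encode, say, which attributes a candidate customer-satisfaction rule uses:

```python
import random

random.seed(0)

# Toy objective: evolve a bit string toward a target pattern (illustrative).
TARGET = [1, 0, 1, 1, 0, 1, 0, 1]

def fitness(chrom):
    """Count positions matching the target; higher is better."""
    return sum(1 for g, t in zip(chrom, TARGET) if g == t)

def evolve(pop_size=20, generations=60, mutation_rate=0.05):
    pop = [[random.randint(0, 1) for _ in TARGET] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]              # truncation selection
        children = []
        while len(children) < pop_size:
            a, b = random.sample(parents, 2)
            cut = random.randrange(1, len(TARGET))  # one-point crossover
            child = a[:cut] + b[cut:]
            child = [1 - g if random.random() < mutation_rate else g
                     for g in child]                # bit-flip mutation
            children.append(child)
        pop = children
    return max(pop, key=fitness)

best = evolve()
print(fitness(best))
```

Swapping in a domain fitness function (e.g. predictive accuracy of the rule a chromosome encodes) turns this skeleton into the kind of data mining application the paper discusses.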


Author(s):  
Richard C. Kittler

Abstract Analysis of manufacturing data as a tool for failure analysts often meets with roadblocks due to the complex non-linear relationships between failure rates and explanatory variables drawn from process history. The current work describes how the use of a comprehensive engineering database and data mining technology overcomes some of these difficulties and enables new classes of problems to be solved. The characteristics of the database design necessary for adequate data coverage and unit traceability are discussed. Data mining technology is explained and contrasted with traditional statistical approaches as well as those of expert systems, neural nets, and signature analysis. Data mining is applied to a number of common problem scenarios. Finally, future trends in data mining technology relevant to failure analysis are discussed.


2021 ◽  
pp. 1-11
Author(s):  
Liu Narengerile ◽  
Li Di

At present, the college English testing system has become indispensable in many universities. However, existing English test systems are not very user-friendly due to problems such as an unreasonable framework structure. This paper applies data mining technology to build a college English testing framework. The test system software based on data mining automatically generates test papers, sets the test time, automatically judges the test takers' results, and reports scores on the spot. Test takers log in through the system software to complete the test. The system software replaces the traditional manual work of printing test papers, arranging invigilation classrooms and invigilating teachers, supervising the examination, collecting papers, and scoring and analyzing them. Finally, the system's performance is evaluated through experimental research. The results show that the system constructed in this paper has a definite practical effect.
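Two of the functions the abstract attributes to the system, randomly generating a test paper from a question bank and automatically scoring submitted answers, can be sketched as below. The question-bank structure and the one-point-per-question scoring rule are assumptions for illustration, not the paper's actual design:

```python
import random

random.seed(42)

# A toy question bank: each entry has an id, a prompt, and an A-D answer key
# (structure assumed for illustration).
BANK = [
    {"id": i, "question": f"Q{i}", "answer": chr(65 + i % 4)}
    for i in range(50)
]

def generate_paper(bank, n_questions=10):
    """Randomly sample a paper so each test taker gets a different set."""
    return random.sample(bank, n_questions)

def grade(paper, responses):
    """Score one point per correct answer and return the total."""
    return sum(1 for q in paper if responses.get(q["id"]) == q["answer"])

paper = generate_paper(BANK)
perfect = {q["id"]: q["answer"] for q in paper}
print(grade(paper, perfect))  # full marks when every answer matches the key
```

The immediate scoring shown here is what lets such a system "give out results on the spot" instead of waiting for manual marking.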


2020 ◽  
Vol 1684 ◽  
pp. 012024
Author(s):  
Yiqun Liu ◽  
Xiaogang Wang ◽  
Xiaoyuan Gong ◽  
Hua Mu
