Management of Losses Supported from Master Data Cleaning Process, for OSHEE Distribution Sector

Author(s):  
Anni Dasho ◽  
Diana Sharra ◽  
Genci Sharko ◽  
Indrit Baholli

2021 ◽  
Author(s):  
Siti Nur Fathin Najwa Binti Mustaffa ◽  
Jamaludin Bin Sallim ◽  
Rozlina Binti Mohamed

Author(s):  
Sonia Gantman ◽  
Lorrie Metzger

We present a data cleaning project that uses real vendor master data from a large public university in the United States. Our main objective in developing this case was to identify the areas where students need guidance in order to apply a problem-solving approach to the project: initial analysis of the data and the task at hand, planning the cleaning and testing activities, executing that plan, and communicating the results in a written report. We provide a data set with 29K records of vendor master data, and a subset of the same data with 800 records. The assignment has two parts, the planning and the actual cleaning, each with its own deliverable. It can be used in many different courses and completed with almost any data analytics software. We provide suggested solutions and detailed solution notes for Excel and for Alteryx Designer.
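The case is written for Excel and Alteryx Designer, but the kinds of checks a student cleaning plan might cover can be sketched in pandas. This is only an illustration under assumed column names (vendor_name, address, tax_id); the actual data set's schema is not given here.

```python
# Minimal sketch of typical vendor-master cleaning checks; the column names
# (vendor_name, address, tax_id) are hypothetical placeholders.
import pandas as pd

vendors = pd.read_csv("vendor_master.csv", dtype=str)

# Standardize text fields before looking for duplicates.
for col in ["vendor_name", "address"]:
    vendors[col] = (vendors[col]
                    .str.strip()
                    .str.upper()
                    .str.replace(r"\s+", " ", regex=True))

# Flag likely duplicate vendors (same normalized name and tax ID).
dupes = vendors[vendors.duplicated(subset=["vendor_name", "tax_id"], keep=False)]

# Flag records with missing critical fields for follow-up.
missing_tax = vendors[vendors["tax_id"].isna()]

print(f"{len(dupes)} possible duplicates, {len(missing_tax)} missing tax IDs")
```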


Author(s):  
Milka Bochere Gesicho ◽  
Martin Chieng Were ◽  
Ankica Babic

Abstract Background The District Health Information Software-2 (DHIS2) is widely used by countries for national-level aggregate reporting of health data. To best leverage DHIS2 data for decision-making, countries need to ensure that data within their systems are of the highest quality. Comprehensive, systematic, and transparent data cleaning approaches form a core component of preparing DHIS2 data for analyses. Unfortunately, there is a paucity of exhaustive and systematic descriptions of data cleaning processes employed on DHIS2-based data. The aim of this study was to report on the methods and results of a systematic and replicable data cleaning approach applied to HIV data gathered within DHIS2 from 2011 to 2018 in Kenya, for secondary analyses. Methods Six programmatic area reports containing HIV indicators were extracted from DHIS2 for all care facilities in all counties in Kenya from 2011 to 2018. Data variables extracted included reporting rate, reporting timeliness, and HIV indicator data elements per facility per year. In total, 93,179 facility-records from 11,446 health facilities were extracted for the years 2011 to 2018. Van den Broeck et al.'s framework, involving repeated cycles of a three-phase process (data screening, data diagnosis, and data treatment), was employed semi-automatically within a generic five-step data cleaning sequence, which was developed and applied in cleaning the extracted data. Various quality issues were identified, and Friedman analysis of variance was conducted to examine differences in the distribution of records with selected issues across the eight years. Results Facility-records with no data accounted for 50.23% of the extract and were removed. Of the remaining records, 0.03% had reporting rates above 100%. Of facility-records with reporting data, only 0.66% and 0.46% were retained for the voluntary medical male circumcision and blood safety programmatic area reports respectively, given that few facilities submitted data or offered these services. The distribution of facility-records with selected quality issues varied significantly by programmatic area (p < 0.001). The final clean dataset was suitable for subsequent secondary analyses. Conclusions Comprehensive, systematic, and transparent reporting of the cleaning process is important for the validity of research studies as well as for data utilization. The semi-automatic procedures used resulted in improved data quality for secondary analyses, which could not have been achieved by fully automated procedures alone.
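The screening, diagnosis, and treatment phases described above can be illustrated with a small pandas sketch. The column names (facility_id, county, year, reporting_rate) are assumptions for illustration, not the actual DHIS2 export schema used in the study.

```python
# Illustrative sketch of one screening/diagnosis/treatment cycle on an
# extracted facility-record table; the schema shown here is assumed.
import pandas as pd

records = pd.read_csv("dhis2_hiv_extract.csv")
indicator_cols = [c for c in records.columns
                  if c not in ("facility_id", "county", "year", "reporting_rate")]

# Screening: drop facility-records where every indicator value is missing.
has_data = records[indicator_cols].notna().any(axis=1)
records = records[has_data].copy()

# Diagnosis: flag implausible values such as reporting rates above 100%.
suspect = records["reporting_rate"] > 100

# Treatment: here the suspect rates are simply set to missing for later review.
records.loc[suspect, "reporting_rate"] = pd.NA

print(f"Removed {(~has_data).sum()} empty records, flagged {suspect.sum()} rates > 100%")
```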


Author(s):  
SUNITHA YEDDULA ◽  
K. LAKSHMAIAH

Record linkage is the process of matching records from several databases that refer to the same entities. When applied on a single database, this process is known as deduplication. Increasingly, matched data are becoming important in many application areas, because they can contain information that is not available otherwise, or that is too costly to acquire. Removing duplicate records in a single database is a crucial step in the data cleaning process, because duplicates can severely influence the outcomes of any subsequent data processing or data mining. With the increasing size of today’s databases, the complexity of the matching process becomes one of the major challenges for record linkage and deduplication. In recent years, various indexing techniques have been developed for record linkage and deduplication. They are aimed at reducing the number of record pairs to be compared in the matching process by removing obvious non-matching pairs, while at the same time maintaining high matching quality. This paper presents a survey of variations of six indexing techniques. Their complexity is analyzed, and their performance and scalability are evaluated within an experimental framework using both synthetic and real data sets. These experiments highlight that one of the most important factors for efficient and accurate indexing for record linkage and deduplication is the proper definition of blocking keys.
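To make the role of blocking keys concrete, here is a minimal sketch of standard blocking, one of the indexing ideas surveyed above. The records and the choice of key (first letter of the surname plus postcode) are invented for illustration.

```python
# Standard blocking: only records that share a blocking key are compared in
# detail, which shrinks the quadratic comparison space. Data are invented.
from collections import defaultdict
from itertools import combinations

records = [
    {"id": 1, "surname": "Smith", "zip": "02139"},
    {"id": 2, "surname": "Smyth", "zip": "02139"},
    {"id": 3, "surname": "Jones", "zip": "10001"},
]

def blocking_key(rec):
    # First letter of the surname plus the postcode.
    return rec["surname"][0].upper() + rec["zip"]

blocks = defaultdict(list)
for rec in records:
    blocks[blocking_key(rec)].append(rec)

# Candidate pairs are generated only within each block.
candidate_pairs = [pair
                   for block in blocks.values()
                   for pair in combinations(block, 2)]
print(candidate_pairs)   # Smith/Smyth are compared; Jones is never paired
```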


2018 ◽  
Author(s):  
Hyunki Woo ◽  
Kyunga Kim ◽  
KyeongMin Cha ◽  
Jin-Young Lee ◽  
Hansong Mun ◽  
...  

BACKGROUND Since medical research based on big data has become more common, the community’s interest in and effort to analyze large amounts of semistructured or unstructured text data, such as examination reports, have rapidly increased. However, these large-scale text data are often not readily applicable to analysis owing to typographical errors, inconsistencies, or data entry problems. Therefore, an efficient data cleaning process is required to ensure the veracity of such data. OBJECTIVE In this paper, we propose an efficient data cleaning process for large-scale medical text data, which employs text clustering methods and a value-converting technique, and evaluate its performance with medical examination text data. METHODS The proposed data cleaning process consists of text clustering and value-merging. In the text clustering step, we suggest the use of key collision and nearest neighbor methods in a complementary manner. Words (called values) in the same cluster are expected to comprise a correct value and its erroneous representations. In the value-converting step, wrong values in each identified cluster are converted into the correct value. We applied this data cleaning process to 574,266 stool examination reports produced for parasite analysis at Samsung Medical Center from 1995 to 2015. The performance of the proposed process was examined and compared with data cleaning processes based on a single clustering method. We used OpenRefine 2.7, an open source application that provides various text clustering methods and an efficient user interface for value-converting with common-value suggestion. RESULTS A total of 1,167,104 words in stool examination reports were surveyed. In the data cleaning process, we discovered 30 correct words and 45 patterns of typographical errors and duplicates. We observed high correction rates for words with typographical errors (98.61%) and typographical error patterns (97.78%). The resulting data accuracy was nearly 100% based on the number of total words. CONCLUSIONS Our data cleaning process, based on the combined use of key collision and nearest neighbor methods, provides efficient cleaning of large-scale text data and hence improves data accuracy.
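A rough sketch of the key-collision (fingerprint) clustering idea used here, and implemented in OpenRefine, is shown below: values whose normalized fingerprints collide are grouped, and each variant is then converted to a canonical member of its cluster. The example reports are invented, and the fingerprint function is a generic one, not the study's exact implementation.

```python
# Key-collision clustering sketch: normalize each value into a fingerprint,
# group values with identical fingerprints, then merge them to one value.
from collections import defaultdict
import re
import unicodedata

def fingerprint(value: str) -> str:
    # Lowercase, strip accents and punctuation, sort unique tokens.
    value = unicodedata.normalize("NFKD", value).encode("ascii", "ignore").decode()
    tokens = re.sub(r"[^\w\s]", "", value.lower()).split()
    return " ".join(sorted(set(tokens)))

reports = ["No parasite seen", "no parasite seen.", "No parasite  seen",
           "Ascaris lumbricoides", "ascaris lumbricoides"]

clusters = defaultdict(list)
for word in reports:
    clusters[fingerprint(word)].append(word)

# Value-converting: map every variant to the most frequent member of its cluster.
canonical = {variant: max(members, key=members.count)
             for members in clusters.values() for variant in members}
print(canonical)
```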


2021 ◽  
Vol 2021 ◽  
pp. 1-10
Author(s):  
Zhuoyang Lyu

The pedestrian detection model places high requirements on the quality of the dataset. To address this problem, this paper uses data cleaning technology to improve the quality of the dataset and thereby the performance of the pedestrian detection model. The dataset used in this paper was collected from subway stations in Beijing and Nanjing. The image quality is degraded by motion blur, uneven illumination, and other noise factors, so data cleaning is essential for this work. The data cleaning process is divided into two parts: detection and correction. First, the whole dataset goes through blur detection, and severely blurred images are filtered out as difficult samples. These images are then sent to DeblurGAN for deblurring. A 2D gamma function adaptive illumination correction algorithm is used to correct the subway pedestrian images. The processed data are then fed to the pedestrian detection model. Analysis of the detection results on datasets cleaned to different degrees shows that the data cleaning process significantly improves the detection model's performance.
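The abstract does not specify how blur detection is implemented, so the sketch below uses the common variance-of-Laplacian heuristic as a stand-in: frames with low Laplacian variance are treated as blurred and routed to deblurring instead of being used as-is. The threshold and file names are placeholders.

```python
# Variance-of-Laplacian blur check (an assumed stand-in for the paper's
# unspecified blur-detection step); low variance suggests a blurry frame.
import cv2

def is_blurry(image_path: str, threshold: float = 100.0) -> bool:
    image = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    if image is None:
        raise FileNotFoundError(image_path)
    # Variance of the Laplacian falls sharply for motion-blurred frames.
    score = cv2.Laplacian(image, cv2.CV_64F).var()
    return score < threshold

# Hypothetical usage: split the dataset into clean and "difficult" samples,
# where the difficult samples would be handed to DeblurGAN.
blurred = [p for p in ["frame_001.jpg", "frame_002.jpg"] if is_blurry(p)]
```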


2020 ◽  
Vol 17 (2) ◽  
pp. 17-24
Author(s):  
Kara Hunter ◽  
Cristina T. Alberti ◽  
Scott R. Boss ◽  
Jay C. Thibodeau

ABSTRACT This teaching case provides an approach for educators to impart to students, as part of the auditing curriculum, knowledge about both the data-cleaning process and the critical electronic spreadsheet functionalities used by auditors. The cleaning of large datasets has become a vital task that is routinely performed by the most junior audit professional on the team. In this case, students learn how to cleanse a dataset and verify its completeness and accuracy in accordance with relevant auditing standards. Importantly, all steps are completed on an author-created database within an electronic spreadsheet platform. The response from students has been strong: after completing the case assignment, a total of 81 auditing students at two private universities provided feedback. The results of the questionnaire reveal that students largely agree that the key learning objectives were achieved, validating the use of this case in the auditing curriculum.
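The case itself is completed in a spreadsheet, but the completeness and accuracy checks it teaches can be sketched equivalently in pandas. The file and column names (invoice_id, amount) are hypothetical; this is not the case's actual dataset.

```python
# Sketch of completeness/accuracy verification after cleaning: reconcile
# record counts and a control total against the raw extract. Names assumed.
import pandas as pd

raw = pd.read_csv("audit_extract_raw.csv")
clean = raw.drop_duplicates(subset="invoice_id").dropna(subset=["amount"])

# Completeness: how many records were removed, and why should be documented.
print("records removed:", len(raw) - len(clean))
# Accuracy: the control-total difference should equal the value of removed rows.
print("control total difference:", raw["amount"].sum() - clean["amount"].sum())
```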


2020 ◽  
Vol 10 (13) ◽  
pp. 4590 ◽  
Author(s):  
Hyun-Jin Kim ◽  
Ji-Won Baek ◽  
Kyungyong Chung

This study proposes an optimization method for the associative knowledge graph using TF-IDF based ranking scores. The proposed method calculates TF-IDF weights across all documents and generates a term ranking. Based on the terms with high TF-IDF ranking scores, optimized transactions are generated. News data are first collected through crawling and then converted into a corpus through preprocessing. Unnecessary data are removed during preprocessing, which includes lowercase conversion and removal of punctuation marks and stop words. Words are extracted from the document-term matrix and transactions are generated. In the data cleaning process, the Apriori algorithm is applied to generate association rules and build a knowledge graph. To optimize the generated knowledge graph, the proposed method uses the TF-IDF based ranking scores to remove terms with low scores and recreate the transactions. Based on the result, the association rule algorithm is applied to create an optimized knowledge model. Performance is evaluated in terms of rule generation speed and the usefulness of the association rules. The proposed method generates association rules about 22 seconds faster, and its lift value for usefulness is about 0.43 to 2.51 higher than those of the conventional association rule algorithms.
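The TF-IDF ranking and transaction-pruning step can be sketched compactly with scikit-learn. The corpus, the cut-off of five terms, and the scoring by summed TF-IDF weight are illustrative assumptions, not the paper's exact configuration.

```python
# Sketch: rank terms by summed TF-IDF weight, keep only top-ranked terms when
# rebuilding transactions for association rule mining. Corpus is invented.
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["stocks rally on strong earnings report",
          "earnings report beats expectations",
          "local weather warning issued"]

vectorizer = TfidfVectorizer(stop_words="english", lowercase=True)
tfidf = vectorizer.fit_transform(corpus)

# Rank terms by their summed TF-IDF weight over all documents.
scores = tfidf.sum(axis=0).A1
ranking = sorted(zip(vectorizer.get_feature_names_out(), scores),
                 key=lambda x: x[1], reverse=True)

top_terms = {term for term, _ in ranking[:5]}
# Optimized transactions keep only the high-scoring terms; these would then
# feed the Apriori step.
transactions = [[t for t in doc.split() if t in top_terms] for doc in corpus]
print(transactions)
```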


2019 ◽  
Vol 25 (2) ◽  
pp. 23-30
Author(s):  
Milosav Babić ◽  
Petar Čanak ◽  
Bojana Vujošević ◽  
Vojka Babić ◽  
Dušan Stanisavljević

Author(s):  
Kumar Rahul ◽  
Rohitash Kumar Banyal

Every business enterprise requires noise-free, clean data. Dirty data tends to accumulate as the data warehouse continuously loads and refreshes large quantities of data from various sources. Hence, to avoid wrong conclusions, data cleaning becomes a vital step in data-driven projects. This paper introduces a novel data cleaning technique for the effective removal of dirty data. The process involves two steps: (i) dirty data detection and (ii) dirty data cleaning. Dirty data detection comprises data normalization, hashing, clustering, and identification of suspected data. In the clustering step, the optimal selection of centroids is key and is carried out using an optimization algorithm. After dirty data detection is finished, the subsequent dirty data cleaning process begins. The cleaning process consists of leveling, Huffman coding, and cleaning of the suspected data, which is likewise performed using an optimization approach. To solve these optimization problems, a new hybrid algorithm is proposed, the Firefly Update Enabled Rider Optimization Algorithm (FU-ROA), which hybridizes the Rider Optimization Algorithm (ROA) and the Firefly (FF) algorithm. Finally, the performance of the implemented data cleaning method is compared with traditional methods such as Particle Swarm Optimization (PSO), FF, Grey Wolf Optimizer (GWO), and ROA in terms of their positive and negative measures. The results show that, at iteration 12, the proposed FU-ROA model performed 0.013%, 0.7%, 0.64%, and 0.29% better on test case 1 than the PSO, FF, GWO, and ROA models, respectively.
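The FU-ROA optimizer itself is not reproduced here; the sketch below only illustrates the detection pipeline it wraps (normalize, hash, cluster, flag suspects), with feature hashing of character n-grams and plain k-means standing in for the paper's optimized clustering. All data and parameters are assumptions.

```python
# Detection-pipeline sketch: normalize values, hash them into fixed-length
# feature vectors, cluster, and flag values far from their centroid as
# suspected dirty data. K-means replaces the paper's FU-ROA-optimized step.
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.cluster import KMeans

values = ["New York", "new york ", "NEW YORK", "Nwe York", "Boston", "Bostn"]
normalized = [" ".join(v.lower().split()) for v in values]

# Hashing: map each value to a fixed-length vector of character n-gram features.
hasher = HashingVectorizer(analyzer="char_wb", ngram_range=(2, 3), n_features=64)
X = hasher.transform(normalized).toarray()

# Clustering: ordinary k-means here; the paper optimizes the centroids instead.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Suspected dirty data: values unusually far from their cluster centroid.
dist = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)
suspects = [v for v, d in zip(values, dist) if d > dist.mean() + dist.std()]
print(suspects)
```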

