IntelliClean: A Teaching Case Designed to Integrate Data Cleaning and Spreadsheet Skills into the Audit Curriculum

2020 ◽  
Vol 17 (2) ◽  
pp. 17-24
Author(s):  
Kara Hunter ◽  
Cristina T. Alberti ◽  
Scott R. Boss ◽  
Jay C. Thibodeau

ABSTRACT This teaching case provides an approach for educators to impart to students, as part of the auditing curriculum, knowledge of both the data-cleaning process and the critical electronic spreadsheet functionalities used by auditors. The cleaning of large datasets has become a vital task that is routinely performed by the most junior audit professionals on the team. In this case, students learn how to cleanse a dataset and verify its completeness and accuracy in accordance with relevant auditing standards. Importantly, all steps are completed on an author-created database within an electronic spreadsheet platform. The response from students has been strong: after completing the case assignment, a total of 81 auditing students at two private universities provided feedback. The results of the questionnaire reveal that students largely agree that the key learning objectives were achieved, validating the use of this case in the auditing curriculum.

2021 ◽  
Author(s):  
Siti Nur Fathin Najwa Binti Mustaffa ◽  
Jamaludin Bin Sallim ◽  
Rozlina Binti Mohamed

Author(s):  
Nicola Voyle ◽  
Maximilian Kerz ◽  
Steven Kiddle ◽  
Richard Dobson

This chapter highlights the methodologies that are increasingly being applied to large datasets, or 'big data', with an emphasis on bioinformatics. The first stage of any analysis is to collect data from a well-designed study. The chapter begins by looking at the raw data that arise from epidemiological studies and highlighting the first stages in creating clean data that can be used to draw informative conclusions through analysis. The remainder of the chapter covers data formats, data exploration, data cleaning, missing data (i.e., the lack of data for a variable in an observation), reproducibility, classification versus regression, feature identification and selection, method selection (e.g., supervised versus unsupervised machine learning), training a classifier, and drawing conclusions from modelling.
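The stages the chapter covers map naturally onto a standard analysis pipeline. The following is a minimal sketch of that pipeline, assuming a hypothetical epidemiological dataset with numeric features and a binary outcome; the file name, column names, and parameter choices are illustrative, not taken from the chapter.

```python
# Sketch of the pipeline stages described above: missing-data handling,
# feature selection, and training a supervised classifier with scikit-learn.
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score

df = pd.read_csv("study_data.csv")        # hypothetical raw study data (numeric features)
X = df.drop(columns=["outcome"])          # features
y = df["outcome"]                         # binary label (classification)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

clf = Pipeline([
    ("impute", SimpleImputer(strategy="median")),   # handle missing data
    ("scale", StandardScaler()),                    # put features on a common scale
    ("select", SelectKBest(f_classif, k=10)),       # feature selection (assumes >= 10 features)
    ("model", LogisticRegression(max_iter=1000)),   # supervised classifier
])

print(cross_val_score(clf, X_train, y_train, cv=5).mean())  # cross-validated estimate
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))                            # held-out accuracy
```

Fixing the random seed and keeping the whole pipeline in a single object are simple ways to support the reproducibility the chapter emphasizes.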


Author(s):  
Milka Bochere Gesicho ◽  
Martin Chieng Were ◽  
Ankica Babic

Abstract Background The District Health Information Software-2 (DHIS2) is widely used by countries for national-level aggregate reporting of health data. To best leverage DHIS2 data for decision-making, countries need to ensure that data within their systems are of the highest quality. Comprehensive, systematic, and transparent data cleaning approaches form a core component of preparing DHIS2 data for analyses. Unfortunately, there is a paucity of exhaustive and systematic descriptions of the data cleaning processes employed on DHIS2-based data. The aim of this study was to report on the methods and results of a systematic and replicable data cleaning approach applied, for secondary analyses, to HIV data gathered within DHIS2 from 2011 to 2018 in Kenya. Methods Six programmatic area reports containing HIV indicators were extracted from DHIS2 for all care facilities in all counties in Kenya from 2011 to 2018. Data variables extracted included reporting rate, reporting timeliness, and HIV-indicator data elements per facility per year. In total, 93,179 facility-records from 11,446 health facilities were extracted for the years 2011 to 2018. Van den Broeck et al.'s framework, involving repeated cycles of a three-phase process (data screening, data diagnosis, and data treatment), was employed semi-automatically within a generic five-step data-cleaning sequence, which was developed and applied in cleaning the extracted data. Various quality issues were identified, and a Friedman analysis of variance was conducted to examine differences in the distribution of records with selected issues across the eight years. Results Facility-records with no data accounted for 50.23% of the extract and were removed. Of the remaining records, 0.03% had reporting rates over 100%. Of facility-records with reporting data, 0.66% and 0.46% were retained for the voluntary medical male circumcision and blood safety programmatic area reports, respectively, given that few facilities submitted data or offered these services. The distribution of facility-records with selected quality issues varied significantly by programmatic area (p < 0.001). The final clean dataset was suitable for subsequent secondary analyses. Conclusions Comprehensive, systematic, and transparent reporting of the cleaning process is important for the validity of research studies as well as for data utilization. The semi-automatic procedures used resulted in improved data quality for use in secondary analyses, which could not have been achieved by fully automated procedures alone.
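To make the screening / diagnosis / treatment cycle concrete, here is a minimal, hypothetical sketch applied to a facility-level extract. The column names ("reporting_rate", "county", "year", and the hiv_* indicator columns) and the file name are illustrative assumptions, not the authors' actual DHIS2 field names or five-step procedure.

```python
# Sketch of one screening / diagnosis / treatment cycle on an assumed extract,
# plus a Friedman test across years in the spirit of the study's analysis.
import pandas as pd
from scipy.stats import friedmanchisquare

df = pd.read_csv("dhis2_extract.csv")     # hypothetical extract: one row per facility-year

indicator_cols = [c for c in df.columns if c.startswith("hiv_")]

# Screening: drop facility-records that contain no indicator data at all.
df = df.dropna(subset=indicator_cols, how="all")

# Diagnosis: flag implausible values, e.g. reporting rates above 100%.
df["rate_issue"] = df["reporting_rate"] > 100

# Treatment: here the flagged records are simply removed; correcting or
# capping them would be alternative treatments depending on the diagnosis.
clean = df.loc[~df["rate_issue"]].copy()

# Compare how records with the selected issue are distributed across years
# (per-county issue counts, repeated over the eight years).
counts = (df.groupby(["county", "year"])["rate_issue"]
            .sum().unstack("year").dropna())
stat, p = friedmanchisquare(*(counts[y] for y in counts.columns))
print(f"Friedman chi-square = {stat:.2f}, p = {p:.4f}")
```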


Author(s):  
SUNITHA YEDDULA ◽  
K. LAKSHMAIAH

Record linkage is the process of matching records from several databases that refer to the same entities. When applied to a single database, this process is known as deduplication. Matched data are becoming increasingly important in many application areas, because they can contain information that is not available otherwise, or that is too costly to acquire. Removing duplicate records from a single database is a crucial step in the data cleaning process, because duplicates can severely influence the outcomes of any subsequent data processing or data mining. With the increasing size of today's databases, the complexity of the matching process has become one of the major challenges for record linkage and deduplication. In recent years, various indexing techniques have been developed for record linkage and deduplication. They aim to reduce the number of record pairs to be compared in the matching process by removing obvious non-matching pairs, while at the same time maintaining high matching quality. This paper presents a survey of variations of six indexing techniques. Their complexity is analyzed, and their performance and scalability are evaluated within an experimental framework using both synthetic and real data sets. These experiments highlight that one of the most important factors for efficient and accurate indexing for record linkage and deduplication is the proper definition of blocking keys.
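The simplest of these indexing techniques is standard blocking: records are grouped by a blocking key so that only pairs sharing a key are compared, rather than all pairs. The sketch below illustrates the idea on invented records; the fields and key definition are assumptions for illustration, not the survey's experimental setup.

```python
# Minimal sketch of standard blocking for deduplication: only record pairs
# that share a blocking key become candidate pairs for detailed comparison.
from collections import defaultdict
from itertools import combinations

records = [
    {"id": 1, "surname": "Smith", "zip": "02145"},
    {"id": 2, "surname": "Smyth", "zip": "02145"},
    {"id": 3, "surname": "Jones", "zip": "10001"},
    {"id": 4, "surname": "Smith", "zip": "02145"},
]

def blocking_key(rec):
    # e.g. first three letters of the surname plus the zip code
    return rec["surname"][:3].lower() + rec["zip"]

blocks = defaultdict(list)
for rec in records:
    blocks[blocking_key(rec)].append(rec)

candidate_pairs = [
    (a["id"], b["id"])
    for block in blocks.values()
    for a, b in combinations(block, 2)
]
print(candidate_pairs)   # only within-block pairs are compared in detail
```

Note that under this key "Smyth" never gets compared with "Smith", which illustrates the survey's central point: a poorly defined blocking key either misses true matches or produces oversized blocks, so key definition (for example using phonetic encodings) is critical for both accuracy and efficiency.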


2018 ◽  
Author(s):  
Hyunki Woo ◽  
Kyunga Kim ◽  
KyeongMin Cha ◽  
Jin-Young Lee ◽  
Hansong Mun ◽  
...  

BACKGROUND As medical research based on big data has become more common, the community's interest in and effort to analyze large amounts of semistructured or unstructured text data, such as examination reports, have rapidly increased. However, these large-scale text data are often not readily amenable to analysis owing to typographical errors, inconsistencies, or data entry problems. Therefore, an efficient data cleaning process is required to ensure the veracity of such data. OBJECTIVE In this paper, we propose an efficient data cleaning process for large-scale medical text data, which employs text clustering methods and a value-converting technique, and evaluate its performance on medical examination text data. METHODS The proposed data cleaning process consists of text clustering and value-merging. In the text clustering step, we suggest using key collision and nearest neighbor methods in a complementary manner. Words (called values) in the same cluster are expected to consist of a correct value and its erroneous representations. In the value-converting step, the wrong values in each identified cluster are converted into the correct value. We applied this data cleaning process to 574,266 stool examination reports produced for parasite analysis at Samsung Medical Center from 1995 to 2015. The performance of the proposed process was examined and compared with data cleaning processes based on a single clustering method. We used OpenRefine 2.7, an open source application that provides various text clustering methods and an efficient user interface for value-converting with common-value suggestions. RESULTS A total of 1,167,104 words in the stool examination reports were surveyed. In the data cleaning process, we discovered 30 correct words and 45 patterns of typographical errors and duplicates. We observed high correction rates for words with typographical errors (98.61%) and typographical error patterns (97.78%). The resulting data accuracy was nearly 100% based on the number of total words. CONCLUSIONS Our data cleaning process, based on the combinatorial use of key collision and nearest neighbor methods, provides efficient cleaning of large-scale text data and hence improves data accuracy.
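The key collision idea can be illustrated with a fingerprint-style key, in the spirit of OpenRefine's fingerprint clustering. The sketch below uses invented sample values; the real study combined this with a nearest neighbor method to catch typographical variants that a fingerprint key alone cannot group.

```python
# Minimal sketch of key-collision (fingerprint) clustering plus value merging:
# values with the same normalized key fall into one cluster, and all cluster
# members are converted to a single canonical representation.
import re
import unicodedata
from collections import Counter, defaultdict

def fingerprint(value: str) -> str:
    # lowercase, strip accents and punctuation, then sort the unique tokens
    v = unicodedata.normalize("NFKD", value).encode("ascii", "ignore").decode()
    tokens = re.sub(r"[^a-z0-9 ]", " ", v.lower()).split()
    return " ".join(sorted(set(tokens)))

values = ["Ascaris lumbricoides", "ascaris  Lumbricoides", "Lumbricoides, Ascaris",
          "Trichuris trichiura", "trichuris Trichiura"]   # invented examples

clusters = defaultdict(list)
for v in values:
    clusters[fingerprint(v)].append(v)

# Value-converting step: map every member of a cluster to its most frequent form.
mapping = {}
for members in clusters.values():
    canonical = Counter(members).most_common(1)[0][0]
    for m in members:
        mapping[m] = canonical

print(mapping)
```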


2021 ◽  
Vol 2021 ◽  
pp. 1-10
Author(s):  
Zhuoyang Lyu

Pedestrian detection models place high demands on dataset quality. To address this problem, this paper uses data cleaning technology to improve the quality of the dataset and thereby the performance of the pedestrian detection model. The dataset used in this paper was collected from subway stations in Beijing and Nanjing, and its image quality suffers from motion blur, uneven illumination, and other noise factors, so data cleaning is essential in this work. The data cleaning process is divided into two parts: detection and correction. First, the whole dataset goes through blur detection, and the severely blurred images are filtered out as difficult samples. These images are then sent to DeblurGAN for deblurring, and a 2D gamma function adaptive illumination correction algorithm is used to correct the subway pedestrian images. The processed data are then fed to the pedestrian detection model. Analysis of the detection results on datasets cleaned in different ways shows that the data cleaning process significantly improves the detection model's performance.
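As a sketch of the blur-detection filtering step, a common measure is the variance of the Laplacian, shown below with OpenCV. The threshold and folder path are assumptions; the paper's blur detector, as well as the DeblurGAN and 2D gamma illumination correction stages, are separate components not reproduced here.

```python
# Minimal blur-detection sketch: images whose Laplacian variance falls below a
# threshold are treated as blurred "difficult samples" and set aside for
# further processing (e.g. deblurring).
import glob
import cv2

BLUR_THRESHOLD = 100.0   # assumed cutoff; tune on a validation set

def is_blurred(path: str) -> bool:
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    return cv2.Laplacian(img, cv2.CV_64F).var() < BLUR_THRESHOLD

sharp, blurred = [], []
for path in glob.glob("subway_frames/*.jpg"):   # hypothetical dataset folder
    (blurred if is_blurred(path) else sharp).append(path)

print(f"{len(blurred)} images flagged as difficult samples for deblurring")
```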


2015 ◽  
Vol 32 (1) ◽  
pp. 95-112 ◽  
Author(s):  
Mahendra R. Gujarathi

ABSTRACT Diamond Foods is a comprehensive case that provides an opportunity for students to apply several financial auditing concepts and professional auditing standards to a real-world context. Diamond overstated its earnings for fiscal 2010 and 2011 by 38 percent and 47 percent, respectively, by delaying recognition of the cost of walnuts acquired into later accounting periods. The case requires students to determine whether Diamond's external auditor—Deloitte & Touche, LLP (Deloitte)—fulfilled its responsibility to obtain reasonable assurance that the financial statements were free of material misstatement. Students need to determine whether Deloitte's issuance of an unqualified opinion on Diamond's financial statements and internal controls violated professional auditing standards. In addition, students are required to evaluate whether Deloitte obtained a sufficient understanding of Diamond's business and industry, exercised the needed professional skepticism, and used appropriate analytical procedures to discharge its professional responsibilities. The case presents an opportunity to achieve several learning objectives, including the development of research, critical-thinking, communication, and problem-solving skills. The case is appropriate for use in a graduate or undergraduate course in financial auditing. It can also be used in a fraud examination course or in a capstone course in the accounting curriculum.


2020 ◽  
Vol 10 (13) ◽  
pp. 4590 ◽  
Author(s):  
Hyun-Jin Kim ◽  
Ji-Won Baek ◽  
Kyungyong Chung

This study proposes an optimization method for associative knowledge graphs using TF-IDF-based ranking scores. The proposed method calculates TF-IDF weights across all documents and generates a term ranking. Optimized transactions are then generated from the terms with high TF-IDF ranking scores. News data are first collected through crawling and then converted into a corpus through preprocessing, during which unnecessary data are removed via lowercase conversion and removal of punctuation marks and stop words. Words are extracted from the document-term matrix and transactions are generated. In the data cleaning process, the Apriori algorithm is applied to generate association rules and construct a knowledge graph. To optimize the generated knowledge graph, the proposed method uses the TF-IDF-based ranking scores to remove terms with low scores and recreate the transactions. Based on this result, the association rule algorithm is applied to create an optimized knowledge model. Performance is evaluated in terms of rule generation speed and the usefulness of the association rules. The association rule generation of the proposed method is about 22 seconds faster, and its lift value is about 0.43 to 2.51 higher than that of each of the conventional association rule algorithms.
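The overall flow can be sketched as follows: rank terms by TF-IDF, keep only the top-scoring terms when building transactions, then mine association rules with Apriori. This is a minimal illustration under stated assumptions, not the authors' implementation: the documents are invented, the cutoff k is arbitrary, and the mlxtend library is assumed as the Apriori implementation.

```python
# Sketch: TF-IDF-based term ranking used to prune transactions before Apriori.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

docs = ["stocks fall as rates rise", "central bank raises rates again",
        "tech stocks rally on earnings", "earnings season lifts tech shares"]

# TF-IDF ranking: keep the top-k terms by their maximum TF-IDF score.
vec = TfidfVectorizer(stop_words="english")
tfidf = vec.fit_transform(docs)
scores = pd.Series(tfidf.max(axis=0).toarray().ravel(),
                   index=vec.get_feature_names_out())
top_terms = set(scores.nlargest(8).index)       # assumed cutoff k = 8

# Rebuild transactions using only top-ranked terms, then mine rules.
analyze = vec.build_analyzer()
transactions = [[t for t in analyze(d) if t in top_terms] for d in docs]

te = TransactionEncoder()
onehot = te.fit_transform(transactions)
df_tx = pd.DataFrame(onehot, columns=te.columns_)

frequent = apriori(df_tx, min_support=0.25, use_colnames=True)
rules = association_rules(frequent, metric="lift", min_threshold=1.0)
print(rules[["antecedents", "consequents", "support", "lift"]])
```

Pruning low-ranked terms shrinks the transaction set, which is what speeds up rule generation and, by removing noisy terms, can raise the lift of the surviving rules.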

