IntelliClean: A Teaching Case Designed to Integrate Data Cleaning and Spreadsheet Skills into the Audit Curriculum

2020 ◽  
Vol 17 (2) ◽  
pp. 17-24
Author(s):  
Kara Hunter ◽  
Cristina T. Alberti ◽  
Scott R. Boss ◽  
Jay C. Thibodeau

ABSTRACT This teaching case provides an approach for educators to impart to students, as part of the auditing curriculum, knowledge of both the data-cleaning process and the critical electronic spreadsheet functionalities used by auditors. The cleaning of large datasets has become a vital task that is routinely performed by the most junior audit professionals on the team. In this case, students learn how to cleanse a dataset and verify its completeness and accuracy in accordance with relevant auditing standards. Importantly, all steps are completed on an author-created database within an electronic spreadsheet platform. The response from students has been strong: after completing the case assignment, a total of 81 auditing students at two private universities provided feedback. The results of the questionnaire reveal that students largely agree that the key learning objectives were achieved, validating the use of this case in the auditing curriculum.

2021 ◽  
Author(s):  
Siti Nur Fathin Najwa Binti Mustaffa ◽  
Jamaludin Bin Sallim ◽  
Rozlina Binti Mohamed

Author(s):  
Nicola Voyle ◽  
Maximilian Kerz ◽  
Steven Kiddle ◽  
Richard Dobson

This chapter highlights the methodologies that are increasingly being applied to large datasets, or 'big data', with an emphasis on bioinformatics. The first stage of any analysis is to collect data from a well-designed study. The chapter begins by looking at the raw data that arise from epidemiological studies and highlighting the first stages in creating clean data that can be used to draw informative conclusions through analysis. The remainder of the chapter covers data formats, data exploration, data cleaning, missing data (i.e., the lack of data for a variable in an observation), reproducibility, classification versus regression, feature identification and selection, method selection (e.g., supervised versus unsupervised machine learning), training a classifier, and drawing conclusions from modelling.
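The stages the chapter covers map naturally onto a standard analysis pipeline. The following is a minimal sketch of that pipeline, assuming a hypothetical epidemiological dataset with numeric features and a binary outcome; the file name, column names, and parameter choices are illustrative, not taken from the chapter.

```python
# Sketch of the pipeline stages described above: missing-data handling,
# feature selection, and training a supervised classifier with scikit-learn.
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score

df = pd.read_csv("study_data.csv")        # hypothetical raw study data (numeric features)
X = df.drop(columns=["outcome"])          # features
y = df["outcome"]                         # binary label (classification)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

clf = Pipeline([
    ("impute", SimpleImputer(strategy="median")),   # handle missing data
    ("scale", StandardScaler()),                    # put features on a common scale
    ("select", SelectKBest(f_classif, k=10)),       # feature selection (assumes >= 10 features)
    ("model", LogisticRegression(max_iter=1000)),   # supervised classifier
])

print(cross_val_score(clf, X_train, y_train, cv=5).mean())  # cross-validated estimate
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))                            # held-out accuracy
```

Fixing the random seed and keeping the whole pipeline in a single object are simple ways to support the reproducibility the chapter emphasizes.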


Author(s):  
Milka Bochere Gesicho ◽  
Martin Chieng Were ◽  
Ankica Babic

Abstract Background The District Health Information Software-2 (DHIS2) is widely used by countries for national-level aggregate reporting of health data. To best leverage DHIS2 data for decision-making, countries need to ensure that data within their systems are of the highest quality. Comprehensive, systematic, and transparent data cleaning approaches form a core component of preparing DHIS2 data for analyses. Unfortunately, there is a paucity of exhaustive and systematic descriptions of the data cleaning processes employed on DHIS2-based data. The aim of this study was to report on the methods and results of a systematic and replicable data cleaning approach applied, for secondary analyses, to HIV data gathered within DHIS2 from 2011 to 2018 in Kenya. Methods Six programmatic area reports containing HIV indicators were extracted from DHIS2 for all care facilities in all counties in Kenya from 2011 to 2018. Data variables extracted included reporting rate, reporting timeliness, and HIV-indicator data elements per facility per year. In total, 93,179 facility-records from 11,446 health facilities were extracted for the years 2011 to 2018. Van den Broeck et al.'s framework, involving repeated cycles of a three-phase process (data screening, data diagnosis, and data treatment), was employed semi-automatically within a generic five-step data-cleaning sequence, which was developed and applied in cleaning the extracted data. Various quality issues were identified, and a Friedman analysis of variance was conducted to examine differences in the distribution of records with selected issues across the eight years. Results Facility-records with no data accounted for 50.23% of the extract and were removed. Of the remaining records, 0.03% had reporting rates over 100%. Of facility-records with reporting data, 0.66% and 0.46% were retained for the voluntary medical male circumcision and blood safety programmatic area reports, respectively, given that few facilities submitted data or offered these services. The distribution of facility-records with selected quality issues varied significantly by programmatic area (p < 0.001). The final clean dataset was suitable for subsequent secondary analyses. Conclusions Comprehensive, systematic, and transparent reporting of the cleaning process is important for the validity of research studies as well as for data utilization. The semi-automatic procedures used resulted in improved data quality for use in secondary analyses, which could not have been achieved by fully automated procedures alone.
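To make the screening / diagnosis / treatment cycle concrete, here is a minimal, hypothetical sketch applied to a facility-level extract. The column names ("reporting_rate", "county", "year", and the hiv_* indicator columns) and the file name are illustrative assumptions, not the authors' actual DHIS2 field names or five-step procedure.

```python
# Sketch of one screening / diagnosis / treatment cycle on an assumed extract,
# plus a Friedman test across years in the spirit of the study's analysis.
import pandas as pd
from scipy.stats import friedmanchisquare

df = pd.read_csv("dhis2_extract.csv")     # hypothetical extract: one row per facility-year

indicator_cols = [c for c in df.columns if c.startswith("hiv_")]

# Screening: drop facility-records that contain no indicator data at all.
df = df.dropna(subset=indicator_cols, how="all")

# Diagnosis: flag implausible values, e.g. reporting rates above 100%.
df["rate_issue"] = df["reporting_rate"] > 100

# Treatment: here the flagged records are simply removed; correcting or
# capping them would be alternative treatments depending on the diagnosis.
clean = df.loc[~df["rate_issue"]].copy()

# Compare how records with the selected issue are distributed across years
# (per-county issue counts, repeated over the eight years).
counts = (df.groupby(["county", "year"])["rate_issue"]
            .sum().unstack("year").dropna())
stat, p = friedmanchisquare(*(counts[y] for y in counts.columns))
print(f"Friedman chi-square = {stat:.2f}, p = {p:.4f}")
```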


Author(s):  
SUNITHA YEDDULA ◽  
K. LAKSHMAIAH

Record linkage is the process of matching records from several databases that refer to the same entities. When applied to a single database, this process is known as deduplication. Matched data are becoming increasingly important in many application areas, because they can contain information that is not available otherwise, or that is too costly to acquire. Removing duplicate records from a single database is a crucial step in the data cleaning process, because duplicates can severely influence the outcomes of any subsequent data processing or data mining. With the increasing size of today's databases, the complexity of the matching process has become one of the major challenges for record linkage and deduplication. In recent years, various indexing techniques have been developed for record linkage and deduplication. They aim to reduce the number of record pairs to be compared in the matching process by removing obvious non-matching pairs, while at the same time maintaining high matching quality. This paper presents a survey of variations of six indexing techniques. Their complexity is analyzed, and their performance and scalability are evaluated within an experimental framework using both synthetic and real data sets. These experiments highlight that one of the most important factors for efficient and accurate indexing for record linkage and deduplication is the proper definition of blocking keys.
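The simplest of these indexing techniques is standard blocking: records are grouped by a blocking key so that only pairs sharing a key are compared, rather than all pairs. The sketch below illustrates the idea on invented records; the fields and key definition are assumptions for illustration, not the survey's experimental setup.

```python
# Minimal sketch of standard blocking for deduplication: only record pairs
# that share a blocking key become candidate pairs for detailed comparison.
from collections import defaultdict
from itertools import combinations

records = [
    {"id": 1, "surname": "Smith", "zip": "02145"},
    {"id": 2, "surname": "Smyth", "zip": "02145"},
    {"id": 3, "surname": "Jones", "zip": "10001"},
    {"id": 4, "surname": "Smith", "zip": "02145"},
]

def blocking_key(rec):
    # e.g. first three letters of the surname plus the zip code
    return rec["surname"][:3].lower() + rec["zip"]

blocks = defaultdict(list)
for rec in records:
    blocks[blocking_key(rec)].append(rec)

candidate_pairs = [
    (a["id"], b["id"])
    for block in blocks.values()
    for a, b in combinations(block, 2)
]
print(candidate_pairs)   # only within-block pairs are compared in detail
```

Note that under this key "Smyth" never gets compared with "Smith", which illustrates the survey's central point: a poorly defined blocking key either misses true matches or produces oversized blocks, so key definition (for example using phonetic encodings) is critical for both accuracy and efficiency.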


2018 ◽  
Author(s):  
Hyunki Woo ◽  
Kyunga Kim ◽  
KyeongMin Cha ◽  
Jin-Young Lee ◽  
Hansong Mun ◽  
...  

BACKGROUND As medical research based on big data has become more common, the community's interest in and effort to analyze large amounts of semistructured or unstructured text data, such as examination reports, have rapidly increased. However, these large-scale text data are often not readily amenable to analysis owing to typographical errors, inconsistencies, or data entry problems. Therefore, an efficient data cleaning process is required to ensure the veracity of such data. OBJECTIVE In this paper, we propose an efficient data cleaning process for large-scale medical text data, which employs text clustering methods and a value-converting technique, and evaluate its performance on medical examination text data. METHODS The proposed data cleaning process consists of text clustering and value-merging. In the text clustering step, we suggest using key collision and nearest neighbor methods in a complementary manner. Words (called values) in the same cluster are expected to consist of a correct value and its erroneous representations. In the value-converting step, the wrong values in each identified cluster are converted into the correct value. We applied this data cleaning process to 574,266 stool examination reports produced for parasite analysis at Samsung Medical Center from 1995 to 2015. The performance of the proposed process was examined and compared with data cleaning processes based on a single clustering method. We used OpenRefine 2.7, an open source application that provides various text clustering methods and an efficient user interface for value-converting with common-value suggestions. RESULTS A total of 1,167,104 words in the stool examination reports were surveyed. In the data cleaning process, we discovered 30 correct words and 45 patterns of typographical errors and duplicates. We observed high correction rates for words with typographical errors (98.61%) and typographical error patterns (97.78%). The resulting data accuracy was nearly 100% based on the number of total words. CONCLUSIONS Our data cleaning process, based on the combinatorial use of key collision and nearest neighbor methods, provides efficient cleaning of large-scale text data and hence improves data accuracy.
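The key collision idea can be illustrated with a fingerprint-style key, in the spirit of OpenRefine's fingerprint clustering. The sketch below uses invented sample values; the real study combined this with a nearest neighbor method to catch typographical variants that a fingerprint key alone cannot group.

```python
# Minimal sketch of key-collision (fingerprint) clustering plus value merging:
# values with the same normalized key fall into one cluster, and all cluster
# members are converted to a single canonical representation.
import re
import unicodedata
from collections import Counter, defaultdict

def fingerprint(value: str) -> str:
    # lowercase, strip accents and punctuation, then sort the unique tokens
    v = unicodedata.normalize("NFKD", value).encode("ascii", "ignore").decode()
    tokens = re.sub(r"[^a-z0-9 ]", " ", v.lower()).split()
    return " ".join(sorted(set(tokens)))

values = ["Ascaris lumbricoides", "ascaris  Lumbricoides", "Lumbricoides, Ascaris",
          "Trichuris trichiura", "trichuris Trichiura"]   # invented examples

clusters = defaultdict(list)
for v in values:
    clusters[fingerprint(v)].append(v)

# Value-converting step: map every member of a cluster to its most frequent form.
mapping = {}
for members in clusters.values():
    canonical = Counter(members).most_common(1)[0][0]
    for m in members:
        mapping[m] = canonical

print(mapping)
```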


2021 ◽  
Vol 2021 ◽  
pp. 1-10
Author(s):  
Zhuoyang Lyu

Pedestrian detection models place high demands on dataset quality. To address this problem, this paper uses data cleaning technology to improve the quality of the dataset and thereby the performance of the pedestrian detection model. The dataset used in this paper was collected from subway stations in Beijing and Nanjing, and its image quality suffers from motion blur, uneven illumination, and other noise factors, so data cleaning is essential in this work. The data cleaning process is divided into two parts: detection and correction. First, the whole dataset goes through blur detection, and the severely blurred images are filtered out as difficult samples. These images are then sent to DeblurGAN for deblurring, and a 2D gamma function adaptive illumination correction algorithm is used to correct the subway pedestrian images. The processed data are then fed to the pedestrian detection model. Analysis of the detection results on datasets cleaned in different ways shows that the data cleaning process significantly improves the detection model's performance.
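As a sketch of the blur-detection filtering step, a common measure is the variance of the Laplacian, shown below with OpenCV. The threshold and folder path are assumptions; the paper's blur detector, as well as the DeblurGAN and 2D gamma illumination correction stages, are separate components not reproduced here.

```python
# Minimal blur-detection sketch: images whose Laplacian variance falls below a
# threshold are treated as blurred "difficult samples" and set aside for
# further processing (e.g. deblurring).
import glob
import cv2

BLUR_THRESHOLD = 100.0   # assumed cutoff; tune on a validation set

def is_blurred(path: str) -> bool:
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    return cv2.Laplacian(img, cv2.CV_64F).var() < BLUR_THRESHOLD

sharp, blurred = [], []
for path in glob.glob("subway_frames/*.jpg"):   # hypothetical dataset folder
    (blurred if is_blurred(path) else sharp).append(path)

print(f"{len(blurred)} images flagged as difficult samples for deblurring")
```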


2015 ◽  
Vol 32 (1) ◽  
pp. 95-112 ◽  
Author(s):  
Mahendra R. Gujarathi

ABSTRACT Diamond Foods is a comprehensive case that provides an opportunity for students to apply several financial auditing concepts and professional auditing standards to a real-world context. Diamond overstated its earnings for fiscal 2010 and 2011 by 38 percent and 47 percent, respectively, by delaying recognition of the cost of walnuts acquired into later accounting periods. The case requires students to determine whether Diamond's external auditor—Deloitte & Touche, LLP (Deloitte)—fulfilled its responsibility to obtain reasonable assurance that the financial statements were free of material misstatement. Students need to determine whether Deloitte's issuance of an unqualified opinion on Diamond's financial statements and internal controls violated professional auditing standards. In addition, students are required to evaluate whether Deloitte obtained a sufficient understanding of Diamond's business and industry, exercised the needed professional skepticism, and used appropriate analytical procedures to discharge its professional responsibilities. The case presents an opportunity to achieve several learning objectives, including the development of research, critical-thinking, communication, and problem-solving skills. The case is appropriate for use in a graduate or undergraduate course in financial auditing. It can also be used in a fraud examination course or in a capstone course in the accounting curriculum.


2020 ◽  
Vol 10 (13) ◽  
pp. 4590 ◽  
Author(s):  
Hyun-Jin Kim ◽  
Ji-Won Baek ◽  
Kyungyong Chung

This study proposes an optimization method for associative knowledge graphs using TF-IDF-based ranking scores. The proposed method calculates TF-IDF weights across all documents and generates a term ranking. Optimized transactions are then generated from the terms with high TF-IDF ranking scores. News data are first collected through crawling and then converted into a corpus through preprocessing, during which unnecessary data are removed via lowercase conversion and removal of punctuation marks and stop words. Words are extracted from the document-term matrix and transactions are generated. In the data cleaning process, the Apriori algorithm is applied to generate association rules and construct a knowledge graph. To optimize the generated knowledge graph, the proposed method uses the TF-IDF-based ranking scores to remove terms with low scores and recreate the transactions. Based on this result, the association rule algorithm is applied to create an optimized knowledge model. Performance is evaluated in terms of rule generation speed and the usefulness of the association rules. The association rule generation of the proposed method is about 22 seconds faster, and its lift value is about 0.43 to 2.51 higher than that of each of the conventional association rule algorithms.
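The overall flow can be sketched as follows: rank terms by TF-IDF, keep only the top-scoring terms when building transactions, then mine association rules with Apriori. This is a minimal illustration under stated assumptions, not the authors' implementation: the documents are invented, the cutoff k is arbitrary, and the mlxtend library is assumed as the Apriori implementation.

```python
# Sketch: TF-IDF-based term ranking used to prune transactions before Apriori.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

docs = ["stocks fall as rates rise", "central bank raises rates again",
        "tech stocks rally on earnings", "earnings season lifts tech shares"]

# TF-IDF ranking: keep the top-k terms by their maximum TF-IDF score.
vec = TfidfVectorizer(stop_words="english")
tfidf = vec.fit_transform(docs)
scores = pd.Series(tfidf.max(axis=0).toarray().ravel(),
                   index=vec.get_feature_names_out())
top_terms = set(scores.nlargest(8).index)       # assumed cutoff k = 8

# Rebuild transactions using only top-ranked terms, then mine rules.
analyze = vec.build_analyzer()
transactions = [[t for t in analyze(d) if t in top_terms] for d in docs]

te = TransactionEncoder()
onehot = te.fit_transform(transactions)
df_tx = pd.DataFrame(onehot, columns=te.columns_)

frequent = apriori(df_tx, min_support=0.25, use_colnames=True)
rules = association_rules(frequent, metric="lift", min_threshold=1.0)
print(rules[["antecedents", "consequents", "support", "lift"]])
```

Pruning low-ranked terms shrinks the transaction set, which is what speeds up rule generation and, by removing noisy terms, can raise the lift of the surviving rules.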

