A Comparative Study of Data Cleaning Tools

2019 ◽  
Vol 15 (4) ◽  
pp. 48-65 ◽  
Author(s):  
Samson Oni ◽  
Zhiyuan Chen ◽  
Susan Hoban ◽  
Onimi Jademi

In the information era, data is crucial in decision making. Most data sets contain impurities that must be weeded out before any meaningful decision can be made from the data. Hence, data cleaning is essential and often consumes more than 80 percent of a data analyst's time and resources. Adequate tools and techniques must be used for data cleaning. Many data cleaning tools exist, but it is unclear how to choose among them in various situations. This research aims to help researchers and organizations choose the right tools for data cleaning. This article conducts a comparative study of four commonly used data cleaning tools on two real data sets and answers the research question of which tool is most useful in different scenarios.
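To make the notion of "impurities" concrete, the following is a minimal pandas sketch of typical cleaning steps that such tools automate; the column names, values, and thresholds are hypothetical and are not taken from the study.

```python
import pandas as pd

# Hypothetical raw data with typical impurities: duplicates, missing
# values, inconsistent casing/whitespace, and a numeric column stored as text.
df = pd.DataFrame({
    "name": ["Alice", "alice", "Bob", None],
    "city": ["Baltimore", "Baltimore", "  Austin ", "Austin"],
    "age":  ["34", "34", "29", "41"],
})

# Normalize text fields before deduplicating.
df["name"] = df["name"].str.strip().str.title()
df["city"] = df["city"].str.strip().str.title()

# Coerce types; invalid entries become NaN instead of raising.
df["age"] = pd.to_numeric(df["age"], errors="coerce")

# Drop exact duplicates and rows missing a key field.
cleaned = df.drop_duplicates().dropna(subset=["name"])
print(cleaned)
```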

2013 ◽  
Vol 1 (1) ◽  
pp. 19-25 ◽  
Author(s):  
Abdelkader Baaziz ◽  
Luc Quoniam

“Big Data is the oil of the new economy” has been the most famous slogan of the last three years; it was even adopted by the World Economic Forum in 2011. In fact, Big Data is like crude oil: it is valuable, but unrefined it cannot be used. It must be broken down and analyzed for it to have value. But what about Big Data generated by the petroleum industry, and particularly its upstream segment? Upstream is no stranger to Big Data. Understanding and leveraging data in the upstream segment enables firms to remain competitive throughout planning, exploration, delineation, and field development. Oil and gas companies conduct advanced geophysics modeling and simulation to support operations, where 2D, 3D and 4D seismic surveys generate significant data during exploration phases. They closely monitor the performance of their operational assets. To do this, they use tens of thousands of data-collecting sensors in subsurface wells and surface facilities to provide continuous and real-time monitoring of assets and environmental conditions. Unfortunately, this information comes in various and increasingly complex forms, making it a challenge to collect, interpret, and leverage the disparate data. As an example, Chevron’s internal IT traffic alone exceeds 1.5 terabytes a day. Big Data technologies integrate common and disparate data sets to deliver the right information at the appropriate time to the correct decision-maker. These capabilities help firms act on large volumes of data, transforming decision-making from reactive to proactive and optimizing all phases of exploration, development and production. Furthermore, Big Data offers multiple opportunities to ensure safer, more responsible operations. Another invaluable effect would be shared learning. The aim of this paper is to explain how to use Big Data technologies to optimize operations. How can Big Data help experts make decisions that lead to the desired outcomes?

Keywords: Big Data; Analytics; Upstream Petroleum Industry; Knowledge Management; KM; Business Intelligence; BI; Innovation; Decision-making under Uncertainty


Author(s):  
SUNITHA YEDDULA ◽  
K. LAKSHMAIAH

Record linkage is the process of matching records from several databases that refer to the same entities. When applied on a single database, this process is known as deduplication. Increasingly, matched data are becoming important in many application areas, because they can contain information that is not available otherwise, or that is too costly to acquire. Removing duplicate records in a single database is a crucial step in the data cleaning process, because duplicates can severely influence the outcomes of any subsequent data processing or data mining. With the increasing size of today’s databases, the complexity of the matching process becomes one of the major challenges for record linkage and deduplication. In recent years, various indexing techniques have been developed for record linkage and deduplication. They are aimed at reducing the number of record pairs to be compared in the matching process by removing obvious non-matching pairs, while at the same time maintaining high matching quality. This paper presents a survey of variations of six indexing techniques. Their complexity is analyzed, and their performance and scalability are evaluated within an experimental framework using both synthetic and real data sets. These experiments highlight that one of the most important factors for efficient and accurate indexing for record linkage and deduplication is the proper definition of blocking keys.
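The role of a blocking key can be sketched in a few lines of Python; the records and the key definition (surname prefix plus postcode prefix) are hypothetical illustrations, not one of the six indexing techniques surveyed.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical records; in practice these come from the databases being linked.
records = [
    {"id": 1, "surname": "Smith", "postcode": "21250"},
    {"id": 2, "surname": "Smyth", "postcode": "21250"},
    {"id": 3, "surname": "Jones", "postcode": "90210"},
]

def blocking_key(rec):
    # A simple composite blocking key: surname prefix plus postcode prefix.
    # Real systems often use phonetic encodings (e.g. Soundex) instead.
    return rec["surname"][:2].lower() + rec["postcode"][:3]

# Standard blocking: only records that share a key become candidate pairs,
# so obvious non-matches are never compared in the expensive matching step.
blocks = defaultdict(list)
for rec in records:
    blocks[blocking_key(rec)].append(rec)

candidate_pairs = [
    (a["id"], b["id"])
    for block in blocks.values()
    for a, b in combinations(block, 2)
]
print(candidate_pairs)  # [(1, 2)]: Smith and Smyth share the block 'sm212'
```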


2020 ◽  
Vol 2 (2) ◽  
Author(s):  
Fatima Ulubekova ◽  
Gamze Ozel

The aim of the study is to obtain the alpha power Kumaraswamy (APK) distribution. Some main statistical properties of the APK distribution are investigated, including the survival, hazard rate and quantile functions, skewness, kurtosis, and order statistics. The hazard rate function of the proposed distribution could be useful for modeling data sets with bathtub-shaped hazard rates. We provide a real data application and show that the APK distribution outperforms the other compared distributions for right-skewed data sets.
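For orientation, a plausible form of the APK construction follows from applying the standard alpha power transformation to the Kumaraswamy baseline; this is a sketch of the generic construction and may not match the exact parameterization used in the paper.

```latex
% Alpha power transformation of a baseline CDF F(x):
F_{\mathrm{APT}}(x) = \frac{\alpha^{F(x)} - 1}{\alpha - 1},
\qquad \alpha > 0,\ \alpha \neq 1.

% With the Kumaraswamy baseline F(x) = 1 - (1 - x^{a})^{b}, 0 < x < 1,
% the APK CDF becomes
F_{\mathrm{APK}}(x) = \frac{\alpha^{\,1 - (1 - x^{a})^{b}} - 1}{\alpha - 1},
\qquad 0 < x < 1,\ a, b > 0.
```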


Author(s):  
David McCullin

This study uses a critically appraised topic (CAT) to explore the potential of integrating evidence-based management (EBMgt) and military judgment and decision-making (MJDM). The study uses the five focus areas of the Commandant’s Planning Guidance: 38th Commandant of the Marine Corps to search scholarly databases such as the ABI/INFORM Collection from ProQuest and Business Source Premier from EBSCO. The search was conducted using a research question variable synthesis (RQVS), a search enhancement methodology introduced by the author specifically for this study. The RQVS was applied to each of the Commandant’s five priority focus areas in separate inquiries, producing five separate data sets. This methodology improves the linkage between the search process and the research question (RQ) and enhances the rigor and transparency of the overall study. The key finding is that there is sufficient scholarship to address problem areas in each of the Commandant’s five priority focus areas. This study demonstrates that an integration of EBMgt and MJDM is both feasible and pragmatic.


2017 ◽  
Vol 7 (1) ◽  
pp. 86
Author(s):  
M.E. Ghitany ◽  
Ramesh C. Gupta ◽  
Shaochen Wang

This paper deals with some characterization results based on truncated expectations in both the continuous and the discrete case. Both right and left truncation are considered and some general results are derived. Some of the known results dealing with truncated moments, residual moments and residual partial moments are obtained as special cases. These results are utilized to obtain certain characterization results for Lindley-type distributions. The characterization results provide new methods for estimating the unknown parameter of Lindley-type distributions and for assessing their goodness of fit. The results on the Lindley-type distributions are applied to some real data sets.
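As a concrete instance of the quantities involved, the right- and left-truncated expectations of a continuous random variable can be written as below; the one-parameter Lindley density is shown only for orientation, and the paper's general results cover a broader Lindley-type family.

```latex
% Right- and left-truncated expectations of X with pdf f and cdf F:
E[X \mid X \le t] = \frac{1}{F(t)} \int_{-\infty}^{t} x f(x)\, dx,
\qquad
E[X \mid X > t] = \frac{1}{1 - F(t)} \int_{t}^{\infty} x f(x)\, dx.

% One-parameter Lindley density, a member of the Lindley-type family:
f(x; \theta) = \frac{\theta^{2}}{1 + \theta}\,(1 + x)\, e^{-\theta x},
\qquad x > 0,\ \theta > 0.
```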


Author(s):  
Surajit Bag

Background: Big data and predictive analysis have been hailed as the fourth paradigm of science. Big data and analytics are critical to the future of business sustainability. The demand for data scientists is increasing with the dynamic nature of businesses, thus making it indispensable to manage big data, derive meaningful results and interpret management decisions.
Objectives: The purpose of this study was to provide a brief conceptual review of big data and analytics and further illustrate the use of a multi-criteria decision-making technique in selecting the right skilled candidate for big data and analytics in procurement management.
Method: It is important for firms to select and recruit the right data analyst, both in terms of skill sets and scope of analysis. This is a complex, multi-criteria decision-making problem that involves both qualitative and quantitative factors. In the current study, the Fuzzy VIsekriterijumska optimizacija i KOmpromisno Resenje (VIKOR) method was applied to solve the big data analyst selection problem.
Results: The study identified Technical knowledge (C1), Intellectual curiosity (C4) and Business acumen (C5) as the strongest influential criteria, which must be present in a candidate for the big data and analytics job.
Conclusion: Fuzzy VIKOR is an ideal technique for this kind of multiple-criteria decision-making problem. This study will assist human resource managers and procurement managers in selecting the right workforce for big data analytics.
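For readers unfamiliar with VIKOR, the steps of the underlying crisp ranking are sketched below; the paper applies the fuzzy extension, which replaces crisp scores with fuzzy numbers and a defuzzification step, and the decision matrix and weights here are purely hypothetical.

```python
import numpy as np

# Hypothetical decision matrix: rows = candidates, columns = criteria
# (e.g. technical knowledge, intellectual curiosity, business acumen).
# All criteria are treated as benefit criteria (larger is better).
X = np.array([
    [7.0, 8.0, 6.0],
    [9.0, 6.0, 7.0],
    [6.0, 7.0, 9.0],
])
w = np.array([0.4, 0.3, 0.3])  # hypothetical criterion weights
v = 0.5                        # weight of the "group utility" strategy

f_best = X.max(axis=0)   # best value per criterion
f_worst = X.min(axis=0)  # worst value per criterion

# Normalized, weighted distances from the best value.
d = w * (f_best - X) / (f_best - f_worst)

S = d.sum(axis=1)   # group utility measure
R = d.max(axis=1)   # individual regret measure

# Compromise index Q: smaller is better.
Q = (v * (S - S.min()) / (S.max() - S.min())
     + (1 - v) * (R - R.min()) / (R.max() - R.min()))

ranking = np.argsort(Q)  # candidate indices from best to worst
print("S:", S, "R:", R, "Q:", Q, "ranking:", ranking)
```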


Author(s):  
Onur Doğan ◽  
Hakan  Aşan ◽  
Ejder Ayç

In today’s competitive world, organizations need to make the right decisions to prolong their existence. Non-scientific methods and emotional decision making have given way to scientific methods in the decision-making process in this competitive arena. Within this scope, many decision support models are still being developed to assist decision makers and owners of organizations. It is easy for organizations to collect massive amounts of data, but the problem is generally in using this data to achieve economic advantage. There is a critical need for specialization and automation to transform the data in big data sets into knowledge. Data mining techniques are capable of providing description, estimation, prediction, classification, clustering, and association. Recently, many data mining techniques have been developed to find hidden patterns and relations in big data sets. It is important to obtain new correlations, patterns, and trends that are understandable and useful to the decision makers. Many studies and applications have focused on different data mining techniques and methodologies.

In this study, we aim to obtain understandable and applicable results from a large set of records belonging to a firm active in the meat processing industry, using data mining techniques. In the application part, data cleaning and data integration, the first steps of the data mining process, were first performed on the data in the database. With the aid of data cleaning and data integration, a data set suitable for data mining was obtained. Then, various association rule algorithms were applied to this data set. This analysis revealed that finding unexplored patterns in the data would be beneficial to the decision makers of the firm. Finally, many association rules were obtained that are useful for the decision makers of the local firm.
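The association rule step can be sketched with the open-source mlxtend library (the abstract does not name a specific implementation, so this is only an illustration); the transactions below are hypothetical.

```python
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

# Hypothetical sales transactions after cleaning and integration.
transactions = [
    ["sausage", "ground beef", "spices"],
    ["sausage", "spices"],
    ["ground beef", "spices"],
    ["sausage", "ground beef"],
]

# One-hot encode the transactions into a boolean item matrix.
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit_transform(transactions), columns=te.columns_)

# Mine frequent itemsets, then derive rules above a confidence threshold.
itemsets = apriori(onehot, min_support=0.5, use_colnames=True)
rules = association_rules(itemsets, metric="confidence", min_threshold=0.7)
print(rules[["antecedents", "consequents", "support", "confidence"]])
```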


2013 ◽  
Vol 4 (2) ◽  
pp. 31-50 ◽  
Author(s):  
Simon Andrews ◽  
Constantinos Orphanides

Formal Concept Analysis (FCA) has been successfully applied to data in a number of problem domains. However, its use has tended to be on an ad hoc, bespoke basis, relying on FCA experts working closely with domain experts and requiring the production of specialised FCA software for the data analysis. The availability of generalised tools and techniques that might allow FCA to be applied to data more widely is limited. Two important issues create barriers: raw data is not normally in a form suitable for FCA and must first be transformed into a suitable form, and even once converted, real data sets tend to produce a large number of results that can be difficult to manage and interpret. This article describes how some open-source tools and techniques have been developed and used to address these issues and make FCA more widely available and applicable. Three examples of real data sets, and real problems related to them, are used to illustrate the application of the tools and techniques and to demonstrate how FCA can be used as a semantic technology to discover knowledge. Furthermore, it is shown how these tools and techniques enable FCA to deliver a visual and intuitive means of mining large data sets for association and implication rules that complements the semantic analysis. In fact, it transpires that FCA reveals hidden meaning in data that can then be examined in more detail using an FCA approach to traditional data mining methods.
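To make the FCA machinery concrete, the two derivation operators and the resulting formal concepts can be computed naively for a tiny context; the context below is hypothetical, and the brute-force enumeration is exponential, which is exactly why the specialised tools discussed in the article are needed for real data sets.

```python
from itertools import chain, combinations

# Hypothetical formal context: objects mapped to their attributes.
context = {
    "frog": {"lives_in_water", "lives_on_land"},
    "fish": {"lives_in_water", "has_fins"},
    "dog":  {"lives_on_land", "has_legs"},
}
attributes = set().union(*context.values())

def common_attributes(objs):
    # Derivation operator A -> A': attributes shared by all objects in A.
    return set.intersection(*(context[o] for o in objs)) if objs else set(attributes)

def common_objects(attrs):
    # Derivation operator B -> B': objects possessing all attributes in B.
    return {o for o, a in context.items() if attrs <= a}

# A formal concept is a pair (extent, intent) with extent' = intent and intent' = extent.
# Naive enumeration over all object subsets (feasible only for tiny contexts).
concepts = set()
for objs in chain.from_iterable(combinations(context, r) for r in range(len(context) + 1)):
    intent = common_attributes(set(objs))
    extent = common_objects(intent)
    concepts.add((frozenset(extent), frozenset(intent)))

for extent, intent in sorted(concepts, key=lambda c: len(c[0])):
    print(sorted(extent), "<->", sorted(intent))
```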


2019 ◽  
Vol 8 (6) ◽  
pp. 51 ◽  
Author(s):  
Ahmad Alzaghal ◽  
Duha Hamed

In this paper, we propose new families of generalized Lomax distributions named T-Lomax{Y}. Using the methodology of the Transformed-Transformer, known as the T-X framework, the T-Lomax families introduced arise from the quantile functions of the exponential, Weibull, log-logistic, logistic, Cauchy and extreme value distributions. Various structural properties of the new families are derived, including moments, modes and Shannon entropies. Several new generalized Lomax distributions are studied. The shapes of these T-Lomax{Y} distributions are very flexible and can be symmetric, skewed to the right, skewed to the left, or bimodal. The method of maximum likelihood is proposed for estimating the distributions' parameters, and a simulation study is carried out to assess its performance. Four applications to real data sets are used to demonstrate the flexibility of the T-Lomax{Y} family of distributions in fitting unimodal and bimodal data sets from different disciplines.
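For reference, the generic T-R{Y} construction underlying such families can be written as below, with R the Lomax baseline; the notation (a, the lower support bound of T; λ and α, the Lomax scale and shape) follows common conventions and may differ from the paper's.

```latex
% T-R{Y} framework: T has pdf f_T and cdf F_T, Y has quantile function Q_Y,
% and R is the baseline (here Lomax) with cdf F_R:
F_X(x) = \int_{a}^{Q_Y\!\left(F_R(x)\right)} f_T(t)\, dt
       = F_T\!\left( Q_Y\!\left( F_R(x) \right) \right).

% Lomax baseline cdf:
F_R(x) = 1 - \left(1 + \frac{x}{\lambda}\right)^{-\alpha},
\qquad x \ge 0,\ \alpha, \lambda > 0.
```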


2020 ◽  
Vol 15 (5) ◽  
pp. 1158-1177
Author(s):  
Jenna A. Harder

When analyzing data, researchers may have multiple reasonable options for the many decisions they must make about the data—for example, how to code a variable or which participants to exclude. Therefore, there exists a multiverse of possible data sets. A classic multiverse analysis involves performing a given analysis on every potential data set in this multiverse to examine how each data decision affects the results. However, a limitation of the multiverse analysis is that it addresses only data cleaning and analytic decisions, yet researcher decisions that affect results also happen at the data-collection stage. I propose an adaptation of the multiverse method in which the multiverse of data sets is composed of real data sets from studies varying in data-collection methods of interest. I walk through an example analysis applying the approach to 19 studies on shooting decisions to demonstrate the usefulness of this approach and conclude with a further discussion of the limitations and applications of this method.
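The "classic" multiverse analysis described above can be sketched as an enumeration over data-processing decisions; the decisions, data, and analysis below are hypothetical and only illustrate the combinatorial structure, not the shooting-decision studies the author analyzes.

```python
from itertools import product

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical raw data: reaction times and a binary condition.
df = pd.DataFrame({
    "rt": rng.normal(500, 100, 200),
    "condition": rng.integers(0, 2, 200),
})

# Reasonable-but-arbitrary data decisions, each with several options.
exclusion_cutoffs = [None, 800, 1000]   # drop reaction times above the cutoff
transforms = ["raw", "log"]             # how to code the outcome variable

results = []
for cutoff, transform in product(exclusion_cutoffs, transforms):
    d = df if cutoff is None else df[df["rt"] <= cutoff]
    y = np.log(d["rt"]) if transform == "log" else d["rt"]
    # The same (hypothetical) analysis run on every data set in the multiverse:
    effect = y[d["condition"] == 1].mean() - y[d["condition"] == 0].mean()
    results.append({"cutoff": cutoff, "transform": transform, "effect": effect})

print(pd.DataFrame(results))
```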

