data cleaning Latest Research Papers

2022 ◽

Vol 6 (GROUP) ◽

pp. 1-23

Author(s):

Trine Rask Nielsen ◽

Naja Holten Møller

Keyword(s):

Decision Making ◽

International Institutions ◽

Data Cleaning ◽

Country Of Origin ◽

Decision Makers ◽

Data Driven ◽

Decision Making Process ◽

Nation States ◽

Context Data

In asylum decision-making, legal authorities rely on the criterion "credibility" as a measure for determining whether an individual has a legitimate asylum claim; that is, whether they have a well-founded fear of persecution upon returning to their country of origin. Nation states, international institutions, and NGOs increasingly seek to leverage data-driven technologies to support such decisions, deploying processes of data cleaning, contestation, and interpretation. We qualitatively analyzed 50 asylum cases to understand how the asylum decision-making process in Denmark leverages data to configure individuals as credible (or not). In this context, data can vary from the applicant's testimony to data acquired on the applicant from registers and alphanumerical data. Our findings suggest that legal authorities assess credibility through a largely discretionary practice, establishing certainty by ruling out divergence or contradiction between the different forms of data and documentation involved in an asylum case. As with other reclassification processes [following Bowker and Star 1999], credibility is an ambiguous prototypical concept for decision-makers to attempt certainty, especially important to consider in the design of data-driven technologies where stakeholders have differential power.

Download Full-text

The Relationship of Insufficient Effort Responding and Response Styles: An Online Experiment

Frontiers in Psychology ◽

10.3389/fpsyg.2021.784375 ◽

2022 ◽

Vol 12 ◽

Author(s):

Gene M. Alarcon ◽

Michael A. Lee

Keyword(s):

Data Cleaning ◽

Cognitive Effort ◽

Response Styles ◽

Self Report ◽

Future Research ◽

Wide Range ◽

Before And After ◽

The Difference ◽

Relationship Of ◽

The Relationship

While self-report data is a staple of modern psychological studies, they rely on participants accurately self-reporting. Two constructs that impede accurate results are insufficient effort responding (IER) and response styles. These constructs share conceptual underpinnings and both utilized to reduce cognitive effort when responding to self-report scales. Little research has extensively explored the relationship of the two constructs. The current study explored the relationship of the two constructs across even-point and odd-point scales, as well as before and after data cleaning procedures. We utilized IRTrees, a statistical method for modeling response styles, to examine the relationship between IER and response styles. To capture the wide range of IER metrics available, we employed several forms of IER assessment in our analyses and generated IER factors based on the type of IER being detected. Our results indicated an overall modest relationship between IER and response styles, which varied depending on the type of IER metric being considered or type of scale being evaluated. As expected, data cleaning also changed the relationships of some of the variables. We posit the difference between the constructs may be the degree of cognitive effort participants are willing to expend. Future research and applications are discussed.

Download Full-text

Partial Discharge Diagnostics: Data Cleaning and Feature Extraction

Energies ◽

10.3390/en15020508 ◽

2022 ◽

Vol 15 (2) ◽

pp. 508

Author(s):

Donny Soh ◽

Sivaneasan Bala Krishnan ◽

Jacob Abraham ◽

Lai Kai Xian ◽

Tseng King Jet ◽

...

Keyword(s):

Feature Extraction ◽

Data Cleaning ◽

Partial Discharge ◽

Measurement Tool ◽

Accuracy Score ◽

Machine Learning Classification ◽

Detection Algorithms ◽

Efficient Detection ◽

Novel Approach ◽

Operational Data

Detection of partial discharge (PD) in switchgears requires extensive data collection and time-consuming analyses. Data from real live operational environments pose great challenges in the development of robust and efficient detection algorithms due to overlapping PDs and the strong presence of random white noise. This paper presents a novel approach using clustering for data cleaning and feature extraction of phase-resolved partial discharge (PRPD) plots derived from live operational data. A total of 452 PRPD 2D plots collected from distribution substations over a six-month period were used to test the proposed technique. The output of the clustering technique is evaluated on different types of machine learning classification techniques and the accuracy is compared using balanced accuracy score. The proposed technique extends the measurement abilities of a portable PD measurement tool for diagnostics of switchgear condition, helping utilities to quickly detect potential PD activities with minimal human manual analysis and higher accuracy.

Download Full-text

Dataset of Philippine Presidents’ Speeches from 1935 to 2016

International Journal of Computing Sciences Research ◽

10.25147/ijcsr.2017.001.1.72 ◽

2022 ◽

Vol 6 ◽

pp. 781-791

Author(s):

John Paul Miranda ◽

Keyword(s):

Economic Development ◽

Sentiment Analysis ◽

Public Services ◽

Data Cleaning ◽

Data Preparation ◽

Is Development

Purpose–The dataset was collected to examine and identify possible key topicswithin these texts. Method–Data preparation such as data cleaning, transformation, tokenization, removal of stop wordsfrom both English and Filipino, and word stemmingwas employed in the datasetbefore feeding it to sentiment analysis and the LDA model.Results–The topmost occurring word within the dataset is "development" and there are three (3) likely topics from the speeches of Philippine presidents: economic development, enhancement of public services, and addressing challenges.Conclusion–The datasetwas ableto provide valuable insights contained among official documents. While the study showedthatpresidentshave used their annual address to express their visions for the country. It alsopresentedthat the presidents from 1935 to 2016 faced the same problems during their term.Recommendations–Future researchers may collect other speeches made by presidents during their term;combine them to the dataset used in this studyto furtherinvestigate these important textsby subjecting them to the same methodology used in this study.The dataset may be requested from the authors and it is recommended for further analysis. For example, determine how the speeches of the president reflect the preamble or foundations of the Philippine constitution.

Download Full-text

Application of machine learning to predict the thermal power plant process condition

Journal of Physics Conference Series ◽

10.1088/1742-6596/2150/1/012029 ◽

2022 ◽

Vol 2150 (1) ◽

pp. 012029

Author(s):

M M Sultanov ◽

I A Boldyrev ◽

K V Evseev

Keyword(s):

Machine Learning ◽

Power Plant ◽

Thermal Power Plant ◽

Input Data ◽

Data Cleaning ◽

Thermal Power ◽

Process Condition ◽

Process Variables ◽

Machine Learning Model ◽

Plant Process

Abstract This paper deals with the development of an algorithm for predicting thermal power plant process variables. The input data are described, and the data cleaning algorithm is presented along with the Python frameworks used. The employed machine learning model is discussed, and the results are presented.

Download Full-text

Medicinal Plant Identification Using Machine Learning Techniques

10.4018/978-1-7998-7685-4.ch008 ◽

2022 ◽

pp. 120-130

Author(s):

Udaya C. S. ◽

Usharani M.

Keyword(s):

Machine Learning ◽

Side Effects ◽

Medicinal Plants ◽

Data Cleaning ◽

Active Role ◽

Machine Learning Techniques ◽

Plant Identification ◽

The Public ◽

Learning Techniques ◽

Image Dataset

In this world there are thousands of plant species available, and plants have medicinal values. Medicinal plants play a very active role in healthcare traditions. Ayurveda is one of the oldest systems of medicinal science that is used even today. So proper identification of the medicinal plants has major benefits for not only manufacturing medicines but also for forest department peoples, life scientists, physicians, medication laboratories, government, and the public. The manual method is good for identifying plants easily, but is usually done by the skilled practitioners who have achieved expertise in this field. However, it is time consuming. There may be chances to misidentification, which leads to certain side effects and may lead to serious problems. This chapter focuses on creation of image dataset by using a mobile-based tool for image acquisition, which helps to capture the structured images, and reduces the effort of data cleaning. This chapter also suggests that by ANN, CNN, or PNN classifier, the classification can be done accurately.

Download Full-text

Zomato Data Analysis

International Journal for Research in Applied Science and Engineering Technology ◽

10.22214/ijraset.2021.39303 ◽

2021 ◽

Vol 9 (12) ◽

pp. 644-649

Author(s):

Arpit Saxena

Keyword(s):

Data Analysis ◽

Data Cleaning ◽

Cost Effective ◽

Food Delivery ◽

Time Data ◽

Target Variable ◽

Information Sets ◽

Exploratory Data ◽

Null Values ◽

True Time

Abstract: Whenever we would like to visit a brand new place in delhi -NCR, we often search for the most effective restaurant or the most cost effective restaurant, but of decent quality. For looking of our greatest restaurants we frequently goes for various websites and apps to induce an overall idea of restaurants service. the foremost important criteria for all this is often rating and reviews of the those that have already got experience in these restaurants. People see for rating and compare these restaurants with one another and choose for his or her best. We restrict our data only to Delhi-NCR. This Zomato dataset provides us with enough information in order that one can decide which restaurants is suitable at which place and what kind of food they must serve so as get maximum profit. it's 9552 rows and 22 columns during this dataset. We'd wish to find the most affordable restaurant in Delhi-NCR.We can discuss various relationships between various columns of information sets like between rating and cuisine type , locality and cuisine etc. Since it's a true time data we might start first with data cleaning like cleaning spaces , garbage texts etc , then data exploratory like handling the None values, null values, dropping duplicates and other Transformations then randomization of dataset so analysis. Our target variable is that the "Aggregate Rating" column. We explore the link of the opposite features within the dataset with relevancy Rates. we'll the visualize the relation of all the opposite depend features with relevance our target variable, and hence find the foremost correlated features which effects our target variable. Keywords: Online food delivery, Marketing mix strategies, Competitive analysis, Pre-processing, Data Cleaning, Data Mining, Exploratory data analysis , Classification , Pandas , MatPlotLib.

Download Full-text

Parallel Cleaning Algorithm for Similar Duplicate Chinese Data Based on BERT

Scientific Programming ◽

10.1155/2021/5916748 ◽

2021 ◽

Vol 2021 ◽

pp. 1-11

Author(s):

Biqiu Li ◽

Jiabin Wang ◽

Xueli Liu

Keyword(s):

Data Mining ◽

Large Scale ◽

Clustering Algorithm ◽

Parallel Implementation ◽

Data Cleaning ◽

Position Vector ◽

Data Sets ◽

Implementation Scheme ◽

Mining Work ◽

Context Features

Data is an important source of knowledge discovery, but the existence of similar duplicate data not only increases the redundancy of the database but also affects the subsequent data mining work. Cleaning similar duplicate data is helpful to improve work efficiency. Based on the complexity of the Chinese language and the bottleneck of the single machine system to large-scale data computing performance, this paper proposes a Chinese data cleaning method that combines the BERT model and a k-means clustering algorithm and gives a parallel implementation scheme of the algorithm. In the process of text to vector, the position vector is introduced to obtain the context features of words, and the vector is dynamically adjusted according to the semantics so that the polysemous words can obtain different vector representations in different contexts. At the same time, the parallel implementation of the process is designed based on Hadoop. After that, k-means clustering algorithm is used to cluster similar duplicate data to achieve the purpose of cleaning. Experimental results on a variety of data sets show that the parallel cleaning algorithm proposed in this paper not only has good speedup and scalability but also improves the precision and recall of similar duplicate data cleaning, which will be of great significance for subsequent data mining.

Download Full-text