The Pre Big Data Matching Redundancy Avoidance Algorithm with Mapreduce

Background: Carcinogen exposure data can potentially guide the work of health and safety (H and S) regulators. This project aims to use CAREX Canada data to estimate carcinogen exposures in New Zealand industries. This requires the creation of a cross-walk between the countries’ industry classifications.

Methods: Agile and big-data-science methodologies were used to construct two versions of an industry classification cross-walk from the 2006 Australian and New Zealand Standard Industrial Classification (ANZSIC06) to the Canadian version of the 2002 North American Industrial Classification (NAICS2002), used by CAREX Canada. Firstly, concordance files from government statistics bureaus cross-walked the path: ANZSIC06 -> International Standard Industrial Classification of All Economic Activities Rev4 -> NAICS2017 -> NAICS2012 -> NAICS2007 -> NAICS2002. The cross-walk accounted for ‘one-to-many-to-one’, non-machine formats, and missing/erroneous values. Secondly, a fuzzy data matching pipeline was designed. Data preparation removed redundant, stop, and common domain words, and lemmatised using morphological analysis (e.g. fishing to fish). Data matching used a hybrid algorithm combining ‘JaroWinkler-distance’ and a token-sort approach (i.e. ignoring the positional occurrence of words in a sentence) to match descriptions. A trial-and-error approach was used to assign weightings and concatenate the hierarchical industry classification levels to improve match accuracy. Python language was used for implementation. For each method, random samples of 50 matches were manually classified as either poor or sufficient by two people. Disagreements were discussed and consensus reached.

Results: The concordance cross-walk sample had 52% (95% C.I. 38%–66%) sufficient matches compared to 84% (95% C.I. 74%–94%) for the fuzzy data matching pipeline cross-walk sample.

Conclusions: Cross-walking countries’ industry classifications using a fuzzy data matching pipeline was more accurate than using a concordance cross-walk. The pipeline is modular enough to easily include more components. This work is part of a vision to design a semantic big-data lake, enabling integration of any data relevant to H and S.
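The fuzzy matching step described in this abstract (stop-word removal, lemmatisation, token sorting, then Jaro-Winkler similarity) can be sketched roughly as below. This is only an illustrative sketch, not the authors' pipeline: the stop-word set, the tiny lemma lookup table (a stand-in for real morphological analysis), and the function names `normalise` and `match_score` are all assumptions, and the weighting of hierarchical classification levels mentioned in the abstract is omitted.

```python
import re

# Hypothetical stop words and lemma table, for illustration only;
# the abstract describes morphological analysis, not a lookup table.
STOP_WORDS = {"and", "or", "of", "the", "except", "other"}
LEMMAS = {"fishing": "fish", "farming": "farm", "mining": "mine"}


def normalise(description: str) -> str:
    """Lowercase, drop stop words, lemmatise, and sort tokens (token-sort)."""
    tokens = re.findall(r"[a-z]+", description.lower())
    kept = [LEMMAS.get(t, t) for t in tokens if t not in STOP_WORDS]
    return " ".join(sorted(kept))


def jaro(s1: str, s2: str) -> float:
    """Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order / 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def jaro_winkler(s1: str, s2: str, prefix_scale: float = 0.1) -> float:
    """Jaro similarity boosted for a shared prefix of up to 4 characters."""
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return j + prefix * prefix_scale * (1 - j)


def match_score(desc_a: str, desc_b: str) -> float:
    """Token-sort Jaro-Winkler similarity between two industry descriptions."""
    return jaro_winkler(normalise(desc_a), normalise(desc_b))
```

Because tokens are sorted before comparison, `match_score("Fishing and aquaculture", "Aquaculture, fishing")` treats the two descriptions as identical even though the word order differs, which is the point of the token-sort step.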


Big Data Matching Using the Identity Correlation Approach

Proceedings of the 1st International Conference on Advanced Research Methods and Analytics ◽

10.4995/carma2016.2016.2991 ◽

2016 ◽

Author(s):

Mary Smyth ◽

Kevin McCormack

Keyword(s):

Big Data ◽

Public Service ◽

Central Government ◽

Statistical Technique ◽

Statistical Techniques ◽

Unique Identifier ◽

Individual Level ◽

Data Matching ◽

Business Survey ◽

Administrative Datasets

The Identity Correlation Approach (ICA) is a statistical technique developed for matching big data where a unique identifier does not exist. This technique was developed to match the Irish Census 2011 dataset to Central Government Administrative Datasets in order to attach a unique identifier to each individual person in the Census dataset (McCormack & Smyth, 2015). The unique identifier attached is the PPS No. (Personal Public Service No.). By attaching the PPS No. to the Census dataset, each individual can be linked to datasets held centrally by Public Sector Organisations. This expands the range of variables for statistical analysis at individual level. Statistical techniques developed here were undertaken for a major European Structure of Earnings Survey (SES) compiled by the CSO using administrative data only, thus eliminating the need for an expensive business survey to be conducted (NES, 2007). A description of how the Identity Correlation Approach was developed is given in this paper. Data matching results and conclusions are presented here in relation to the Structure of Earnings Survey (SES) results for 2011.


The Contemporary Ethical and Privacy Issues of Smart Medical Fields

Research Anthology on Privatizing and Securing Data ◽

10.4018/978-1-7998-8954-0.ch092 ◽

2021 ◽

pp. 1899-1908

Author(s):

Victor Chang ◽

Yujie Shi ◽

Yan Zhang

Keyword(s):

Big Data ◽

Medical Diagnosis ◽

Medical Information ◽

Health Management ◽

Sensor Technology ◽

Flow Process ◽

Hospital System ◽

Ethical Problems ◽

Data Matching ◽

Privacy Issues

With the development of big data technologies such as data mining and data matching, many industries, including the medical field, have started a revolution. Big data not only strengthens the accuracy of medical diagnosis but also enhances the efficiency of the entire medical system and relevant medical staff. Additionally, with the rethinking of innovation, the application of wearable intelligent devices, RFID technology, and sensor technology plays a positive role in promoting medical interaction between the hospital system and the wearer. Smart medical care provides effective methods for individual health management and promotes the progress of medical information. However, there are also some ethical problems, e.g., the leakage of private information, which cannot be entirely avoided. The authors offer suggestions to reduce the likelihood of ethical problems arising during the data flow process.


The Contemporary Ethical and Privacy Issues of Smart Medical Fields

International Journal of Strategic Engineering ◽

10.4018/ijose.2019070104 ◽

2019 ◽

Vol 2 (2) ◽

pp. 35-43

Author(s):

Victor Chang ◽

Yujie Shi ◽

Yan Zhang

Keyword(s):

Big Data ◽

Medical Information ◽

Health Management ◽

Sensor Technology ◽

Flow Process ◽

Hospital System ◽

Ethical Problems ◽

Data Matching ◽

Privacy Issues ◽

Big Data Technology

With the development of big data technologies such as data mining and data matching, many industries, including the medical field, have started a revolution. Big data not only strengthens the accuracy of medical diagnosis but also enhances the efficiency of the entire medical system and relevant medical staff. Additionally, with the rethinking of innovation, the application of wearable intelligent devices, RFID technology, and sensor technology plays a positive role in promoting medical interaction between the hospital system and the wearer. Smart medical care provides effective methods for individual health management and promotes the progress of medical information. However, there are also some ethical problems, e.g., the leakage of private information, which cannot be entirely avoided. The authors offer suggestions to reduce the likelihood of ethical problems arising during the data flow process.


Automated Matching of Multi-Scale Building Data Based on Relaxation Labelling and Pattern Combinations

ISPRS International Journal of Geo-Information ◽

10.3390/ijgi8010038 ◽

2019 ◽

Vol 8 (1) ◽

pp. 38

Author(s):

Yunfei Zhang ◽

Jincai Huang ◽

Min Deng ◽

Chi Chen ◽

Fangbin Zhou ◽

...

Keyword(s):

Big Data ◽

Contextual Information ◽

Crucial Issue ◽

Matching Method ◽

Multi Scale ◽

Data Matching ◽

Matching Pair ◽

Data Updating ◽

Different Sources ◽

Functional Solution

With the increasingly urgent demand for map conflation and timely data updating, data matching has become a crucial issue in big data and the GIS community. However, non-rigid deviation, shape homogenization, and uncertain scale differences occur in crowdsourced and official building data, causing challenges in conflating heterogeneous building datasets from different sources and scales. This paper thus proposes an automated building data matching method based on relaxation labelling and pattern combinations. The proposed method first detects all possible matching objects and pattern combinations to create a matching table, and calculates four geo-similarities for each candidate-matching pair to initialize a probabilistic matching matrix. After that, the contextual information of neighboring candidate-matching pairs is explored to heuristically amend the geo-similarity-based matching matrix for achieving a contextual matching consistency. Three case studies are conducted to illustrate that the proposed method obtains high matching accuracies and correctly identifies various 1:1, 1:M, and M:N matching. This indicates the pattern-level relaxation labelling matching method can efficiently overcome the problems of shape homogeneity and non-rigid deviation, and meanwhile has weak sensitivity to uncertain scale differences, providing a functional solution for conflating crowdsourced and official building data.
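The iterative step in this abstract, where a similarity-based probability matrix is amended using neighbouring candidate pairs, can be illustrated with a classic-style relaxation labelling update. The sketch below is an assumption-laden toy, not the paper's method: the function `relax`, the initial probabilities, the neighbour lists, and the `one_to_one` compatibility rule are all hypothetical, and the four geo-similarities and pattern combinations of the paper are not modelled.

```python
def relax(probs, neighbours, compat, iterations=20):
    """Relaxation labelling sketch.

    probs[i][l]   : probability that source object i matches candidate l
    neighbours[i] : indices of objects whose labels give context for i
    compat(l, m)  : compatibility in [-1, 1] of label l with a neighbour's label m
    """
    for _ in range(iterations):
        updated = []
        for i, p_i in enumerate(probs):
            scores = []
            for l in range(len(p_i)):
                # Average support from the neighbours for giving label l to i.
                q = sum(
                    compat(l, m) * probs[j][m]
                    for j in neighbours[i]
                    for m in range(len(probs[j]))
                ) / max(len(neighbours[i]), 1)
                scores.append(p_i[l] * (1.0 + q))
            total = sum(scores)
            # Keep the old row if every label was fully suppressed.
            updated.append([s / total for s in scores] if total > 0 else list(p_i))
        probs = updated
    return probs


# Hypothetical toy case: two source buildings, two candidate targets.
# Building 0 slightly prefers target 0; building 1 is undecided.
initial = [[0.55, 0.45], [0.50, 0.50]]
neighbours = {0: [1], 1: [0]}


def one_to_one(l, m):
    """Assumed rule: neighbouring buildings should not share the same target."""
    return -1.0 if l == m else 1.0


final = relax(initial, neighbours, one_to_one)
best = [max(range(len(p)), key=p.__getitem__) for p in final]
```

In this toy case the slight initial preference of building 0 propagates through the context, so the undecided building 1 settles on the other target. A real matcher would derive the initial matrix from measured similarities and use a compatibility function based on spatial relations rather than this fixed rule.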


“Ask for More Time”: Big Data Chronopolitics in the Australian Welfare Bureaucracy

Critical Sociology ◽

10.1177/0896920519866004 ◽

2019 ◽

Vol 46 (6) ◽

pp. 867-880 ◽

Cited By ~ 1

Author(s):

Andrew Whelan

Keyword(s):

Big Data ◽

National Income ◽

Welfare Recipients ◽

Combined Effects ◽

Income Support ◽

Data Matching ◽

Matching Design ◽

Compliance Intervention ◽

Government Rhetoric ◽

Reported Income

Since 2016, welfare recipients in Australia have been subject to the Online Compliance Intervention (OCI), implemented through the national income support agency, Centrelink. This is a big data initiative, matching reported income to tax records to recoup welfare overpayments. The OCI proved controversial, notably for a “reverse onus,” requiring that claimants disprove debts, and for data-matching design leading frequently to incorrect debts. As algorithmic governance, the OCI directs attention to the chronopolitics of contemporary welfare bureaucracies. It outsources labor previously conducted by Centrelink to clients, compelling them to submit documentation lest debts be raised against them. It imposes an active wait against a deadline on those issued debt notifications. Belying government rhetoric about the accessibility of the digital state, the OCI demonstrates how automation exacerbates punitive welfare agendas, through transfers of time, money, and labor whose combined effects are such as to occupy the time of people experiencing poverty.


Find Out About 'Big Data' to Track Outcomes

ASHA Leader ◽

10.1044/leader.an5.18022013.59 ◽

2013 ◽

Vol 18 (2) ◽

pp. 59-59

Keyword(s):

Big Data



U.S., Canada Collaborate on Big Data in ASD Research

ASHA Leader ◽

10.1044/leader.nib4.20122015.16 ◽

2015 ◽

Vol 20 (12) ◽

pp. 16-16

Keyword(s):

Big Data


Correlating Personality and Actual Phone Usage

Journal of Individual Differences ◽

10.1027/1614-0001/a000139 ◽

2014 ◽

Vol 35 (3) ◽

pp. 158-165 ◽

Cited By ~ 38

Author(s):

Christian Montag ◽

Konrad Błaszkiewicz ◽

Bernd Lachmann ◽

Ionut Andone ◽

Rayna Sariyska ◽

...

Keyword(s):

Big Data ◽

Social Network ◽

Mobile Phone ◽

Mobile Phones ◽

Self Report ◽

Psychological Variables ◽

New Approach ◽

Report Data ◽

Self Report Data ◽

Voice Calls

In the present study we link self-report-data on personality to behavior recorded on the mobile phone. This new approach from Psychoinformatics collects data from humans in everyday life. It demonstrates the fruitful collaboration between psychology and computer science, combining Big Data with psychological variables. Given the large number of variables, which can be tracked on a smartphone, the present study focuses on the traditional features of mobile phones – namely incoming and outgoing calls and SMS. We observed N = 49 participants with respect to the telephone/SMS usage via our custom developed mobile phone app for 5 weeks. Extraversion was positively associated with nearly all related telephone call variables. In particular, Extraverts directly reach out to their social network via voice calls.


“Big Data in Psychology: Methods and Applications”

Zeitschrift für Psychologie ◽

10.1027/2151-2604/a000295 ◽

2017 ◽

Vol 225 (3) ◽

pp. 287-288

Keyword(s):

Big Data

An associated conference will take place at ZPID – Leibniz Institute for Psychology Information in Trier, Germany, on June 7–9, 2018. For further details, see: http://bigdata2018.leibniz-psychology.org
