The Pre Big Data Matching Redundancy Avoidance Algorithm with Mapreduce

2015 ◽  
Vol 8 (33) ◽  
Author(s):  
G. Somasekhar ◽  
K. Karthikeyan
Keyword(s):  
Big Data ◽  
2019 ◽  
Vol 76 (Suppl 1) ◽  
pp. A39.1-A39
Author(s):  
Ganesh Selvaraj ◽  
Michael Butchard

Background: Carcinogen exposure data can potentially guide the work of health and safety (H and S) regulators. This project aims to use CAREX Canada data to estimate carcinogen exposures in New Zealand industries. This requires the creation of a cross-walk between the countries’ industry classifications.
Methods: Agile and big-data-science methodologies were used to construct two versions of an industry classification cross-walk from the 2006 Australian and New Zealand Standard Industrial Classification (ANZSIC06) to the Canadian version of the 2002 North American Industrial Classification (NAICS2002), used by CAREX Canada. Firstly, concordance files from government statistics bureaus cross-walked the path: ANZSIC06 -> International Standard Industrial Classification of All Economic Activities Rev4 -> NAICS2017 -> NAICS2012 -> NAICS2007 -> NAICS2002. The cross-walk accounted for ‘one-to-many-to-one’, non-machine formats, and missing/erroneous values. Secondly, a fuzzy data matching pipeline was designed. Data preparation removed redundant, stop, and common domain words, and lemmatised using morphological analysis (e.g. fishing to fish). Data matching used a hybrid algorithm combining ‘JaroWinkler-distance’ and a token-sort approach (i.e. ignoring the positional occurrence of words in a sentence) to match descriptions. A trial-and-error approach was used to assign weightings and concatenate the hierarchical industry classification levels to improve match accuracy. Python language was used for implementation. For each method, random samples of 50 matches were manually classified as either poor or sufficient by two people. Disagreements were discussed and consensus reached.
Results: The concordance cross-walk sample had 52% (95% C.I. 38%–66%) sufficient matches compared to 84% (95% C.I. 74%–94%) for the fuzzy data matching pipeline cross-walk sample.
Conclusions: Cross-walking countries’ industry classifications using a fuzzy data matching pipeline was more accurate than using a concordance cross-walk. The pipeline is modular enough to easily include more components. This work is part of a vision to design a semantic big-data lake, enabling integration of any data relevant to H and S.
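
The matching step described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes that "token-sort" means sorting the remaining words before comparison, it omits the lemmatisation and weighting steps, and the stop-word list and the function names (`token_sort_key`, `token_sort_similarity`, `best_match`) are hypothetical choices made only for illustration.

```python
# Hypothetical stop-word list; the abstract does not give the real one.
STOP_WORDS = {"and", "of", "the", "n.e.c."}


def jaro(s1: str, s2: str) -> float:
    """Standard Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1: str, s2: str, scaling: float = 0.1) -> float:
    """Jaro similarity boosted for a shared prefix of up to 4 characters."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return base + prefix * scaling * (1.0 - base)


def token_sort_key(description: str) -> str:
    """Lower-case, drop stop words, and sort tokens so word order is ignored."""
    tokens = [t for t in description.lower().split() if t not in STOP_WORDS]
    return " ".join(sorted(tokens))


def token_sort_similarity(a: str, b: str) -> float:
    """Jaro-Winkler similarity of the two token-sorted descriptions."""
    return jaro_winkler(token_sort_key(a), token_sort_key(b))


def best_match(description: str, candidates: list[str]) -> tuple[str, float]:
    """Return the candidate with the highest similarity and its score."""
    return max(((c, token_sort_similarity(description, c)) for c in candidates),
               key=lambda pair: pair[1])
```

For example, `best_match("Fishing and aquaculture", ["Aquaculture and fishing", "Coal mining"])` selects the first candidate, because both descriptions reduce to the same sorted token string once the stop word is removed.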


Author(s):  
Mary Smyth ◽  
Kevin McCormack

The Identity Correlation Approach (ICA) is a statistical technique developed for matching big data where a unique identifier does not exist. This technique was developed to match the Irish Census 2011 dataset to Central Government Administrative Datasets in order to attach a unique identifier to each individual person in the Census dataset (McCormack & Smyth, 2015). The unique identifier attached is the PPS No. (Personal Public Service No.). By attaching the PPS No. to the Census dataset, each individual can be linked to datasets held centrally by Public Sector Organisations. This expands the range of variables for statistical analysis at individual level. The statistical techniques developed here were applied to a major European Structure of Earnings Survey (SES) compiled by the CSO using administrative data only, thus eliminating the need for an expensive business survey (NES, 2007). A description of how the Identity Correlation Approach was developed is given in this paper. Data matching results and conclusions are presented here in relation to the Structure of Earnings Survey (SES) results for 2011.


2019 ◽  
Vol 2 (2) ◽  
pp. 35-43
Author(s):  
Victor Chang ◽  
Yujie Shi ◽  
Yan Zhang

With the development of big data technology, such as data mining and data matching, many industries have started a revolution, including the medical field. Big data not only strengthens the accuracy of medical diagnosis but also enhances the efficiency of the entire medical system and relevant medical staff. Additionally, with the rethinking of innovation, the application of wearable intelligent devices, RFID technology, and sensor technology plays a positive role in promoting medical interaction between the hospital system and the wearer. Smart medical technology provides effective methods for individual health management and promotes the progress of medical information. However, there are also some ethical problems, e.g., the leakage of private information, which cannot be entirely avoided. The authors offer suggestions to reduce the likelihood of ethical problems arising during the data flow process.


2019 ◽  
Vol 8 (1) ◽  
pp. 38
Author(s):  
Yunfei Zhang ◽  
Jincai Huang ◽  
Min Deng ◽  
Chi Chen ◽  
Fangbin Zhou ◽  
...  

With the increasingly urgent demand for map conflation and timely data updating, data matching has become a crucial issue in big data and the GIS community. However, non-rigid deviation, shape homogenization, and uncertain scale differences occur in crowdsourced and official building data, causing challenges in conflating heterogeneous building datasets from different sources and scales. This paper thus proposes an automated building data matching method based on relaxation labelling and pattern combinations. The proposed method first detects all possible matching objects and pattern combinations to create a matching table, and calculates four geo-similarities for each candidate-matching pair to initialize a probabilistic matching matrix. After that, the contextual information of neighboring candidate-matching pairs is explored to heuristically amend the geo-similarity-based matching matrix and achieve contextual matching consistency. Three case studies illustrate that the proposed method obtains high matching accuracies and correctly identifies various 1:1, 1:M, and M:N matches. This indicates that the pattern-level relaxation labelling matching method can efficiently overcome the problems of shape homogeneity and non-rigid deviation while having weak sensitivity to uncertain scale differences, providing a functional solution for conflating crowdsourced and official building data.
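
The general idea of relaxation labelling, in which an initial similarity-based matching matrix is amended using the support of neighbouring candidate pairs, can be sketched in Python as follows. This is a generic textbook-style update under simplifying assumptions, not the method of the paper: it handles only 1:1 candidates, uses a single similarity value instead of four geo-similarities, and the displacement-based compatibility rule, the tolerance `tol`, and all function names are hypothetical.

```python
import math


def init_probabilities(similarity):
    """Row-normalise similarities so each source object's candidates sum to 1."""
    probs = []
    for row in similarity:
        total = sum(row)
        if total > 0:
            probs.append([v / total for v in row])
        else:
            probs.append([1.0 / len(row)] * len(row))
    return probs


def make_displacement_compat(src, cand, tol=2.0):
    """Hypothetical compatibility: two pairings support each other (+1) when
    they imply nearly the same shift from source to candidate, else -1."""
    def disp(i, a):
        return (cand[a][0] - src[i][0], cand[a][1] - src[i][1])

    def compat(i, a, j, b):
        (dx1, dy1), (dx2, dy2) = disp(i, a), disp(j, b)
        return 1.0 if math.hypot(dx1 - dx2, dy1 - dy2) <= tol else -1.0

    return compat


def relax(probs, compat, iterations=10):
    """Iteratively re-weight each pairing by the average support it receives
    from the current pairings of all other source objects."""
    n = len(probs)
    for _ in range(iterations):
        updated = []
        for i in range(n):
            raw = []
            for a in range(len(probs[i])):
                support = 0.0
                for j in range(n):
                    if j == i:
                        continue
                    support += sum(compat(i, a, j, b) * probs[j][b]
                                   for b in range(len(probs[j])))
                support /= max(n - 1, 1)  # keeps support within [-1, 1]
                raw.append(probs[i][a] * (1.0 + support))
            total = sum(raw)
            # Keep the old row if every candidate lost all support.
            updated.append([v / total for v in raw] if total > 0 else probs[i])
        probs = updated
    return probs


# Two source buildings and two candidates, given as centroid coordinates.
src = [(0.0, 0.0), (10.0, 0.0)]
cand = [(1.0, 0.0), (11.0, 0.0)]
# Building 0 is ambiguous on similarity alone; building 1 is not.
similarity = [[0.5, 0.5], [0.1, 0.9]]

probs = relax(init_probabilities(similarity),
              make_displacement_compat(src, cand))
matches = [row.index(max(row)) for row in probs]
```

In this toy case the confident pairing of the second building resolves the ambiguity of the first, since only the pairing of building 0 with candidate 0 implies the same shift.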


2019 ◽  
Vol 46 (6) ◽  
pp. 867-880 ◽  
Author(s):  
Andrew Whelan

Since 2016, welfare recipients in Australia have been subject to the Online Compliance Intervention (OCI), implemented through the national income support agency, Centrelink. This is a big data initiative, matching reported income to tax records to recoup welfare overpayments. The OCI proved controversial, notably for a “reverse onus,” requiring that claimants disprove debts, and for data-matching design leading frequently to incorrect debts. As algorithmic governance, the OCI directs attention to the chronopolitics of contemporary welfare bureaucracies. It outsources labor previously conducted by Centrelink to clients, compelling them to submit documentation lest debts be raised against them. It imposes an active wait against a deadline on those issued debt notifications. Belying government rhetoric about the accessibility of the digital state, the OCI demonstrates how automation exacerbates punitive welfare agendas, through transfers of time, money, and labor whose combined effects are such as to occupy the time of people experiencing poverty.


ASHA Leader ◽  
2013 ◽  
Vol 18 (2) ◽  
pp. 59-59
Keyword(s):  

Find Out About 'Big Data' to Track Outcomes


2014 ◽  
Vol 35 (3) ◽  
pp. 158-165 ◽  
Author(s):  
Christian Montag ◽  
Konrad Błaszkiewicz ◽  
Bernd Lachmann ◽  
Ionut Andone ◽  
Rayna Sariyska ◽  
...  

In the present study we link self-report data on personality to behavior recorded on the mobile phone. This new approach from Psychoinformatics collects data from humans in everyday life. It demonstrates the fruitful collaboration between psychology and computer science, combining Big Data with psychological variables. Given the large number of variables that can be tracked on a smartphone, the present study focuses on the traditional features of mobile phones, namely incoming and outgoing calls and SMS. We observed the telephone/SMS usage of N = 49 participants via our custom-developed mobile phone app for 5 weeks. Extraversion was positively associated with nearly all related telephone call variables. In particular, Extraverts directly reach out to their social network via voice calls.


2017 ◽  
Vol 225 (3) ◽  
pp. 287-288
Keyword(s):  

An associated conference will take place at ZPID – Leibniz Institute for Psychology Information in Trier, Germany, on June 7–9, 2018. For further details, see: http://bigdata2018.leibniz-psychology.org

