Discovering related data at scale

2021 ◽  
Vol 14 (8) ◽  
pp. 1392-1400
Author(s):  
Sagar Bharadwaj ◽  
Praveen Gupta ◽  
Ranjita Bhagwan ◽  
Saikat Guha

Analysts frequently require data from multiple sources for their tasks, but finding these sources is challenging in exabyte-scale data lakes. In this paper, we address this problem for our enterprise's data lake by using machine learning to identify related data sources. Leveraging queries made to the data lake over a month, we build a relevance model that determines whether two columns across two data streams are related. We then use the model to find relations across tens of millions of column pairs and construct a data relationship graph in a scalable fashion, processing a data lake that holds 4.5 petabytes of data in approximately 80 minutes. Using manually labeled datasets as ground truth, we show that our techniques yield improvements of at least 23% over state-of-the-art methods.
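
As a rough illustration of what such a relevance model might look like, the sketch below trains a binary classifier over hypothetical column-pair features (join co-occurrence in query logs, name similarity, value overlap); the feature set, the synthetic data, and the gradient-boosting model are assumptions, not the paper's implementation.

```python
# Minimal sketch (not the paper's system) of a column-pair relevance model:
# each row describes one candidate pair of columns from two data streams, with
# features that could plausibly be mined from a month of query logs.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_recall_fscore_support

rng = np.random.default_rng(0)
n_pairs = 5000

X = np.column_stack([
    rng.poisson(2, n_pairs),      # times the two columns were joined in queries
    rng.random(n_pairs),          # column-name string similarity
    rng.random(n_pairs),          # overlap of sampled column values
    rng.poisson(10, n_pairs),     # co-occurrence of the parent streams in scripts
])
# Synthetic "related / not related" labels for the sake of a runnable example.
y = (0.4 * X[:, 0] + 3 * X[:, 1] + 3 * X[:, 2] + rng.normal(0, 1, n_pairs) > 3).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
model = GradientBoostingClassifier().fit(X_tr, y_tr)

p, r, f1, _ = precision_recall_fscore_support(y_te, model.predict(X_te), average="binary")
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```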

2021 ◽  
Author(s):  
Jason Meil

The data preparation process generally consumes up to 80% of a data scientist's time, with 60% of that attributed to cleaning and labeling data [1]. Our solution is to use automated pipelines to prepare, annotate, and catalog data. The first step upon ingestion, especially for real-world, unstructured, and unlabeled datasets, is to leverage Snorkel, a tool specifically designed to rapidly create, manage, and model training data. Configured properly, Snorkel can temper this labeling bottleneck through a process called weak supervision. Weak supervision uses programmatic labeling functions (heuristics, distant supervision, SME input, or knowledge bases) scripted in Python to generate "noisy labels". Each function traverses the entire dataset and feeds its labels into a generative (conditionally probabilistic) model. The role of this model is to output the distribution of each response variable and predict the conditional probability based on a joint probability distribution. It does so by comparing the labeling functions and the degree to which their outputs are congruent with each other. A labeling function that has a high degree of congruence with the other labeling functions will have high learned accuracy, that is, the fraction of predictions that the model got right. Conversely, labeling functions that have a low degree of congruence with the others will have low learned accuracy. The predictions are then combined by estimated weighted accuracy, whereby the predictions of functions with higher learned accuracy are counted multiple times. The result is a transformation from a binary classification of 0 or 1 to a fuzzy label between 0 and 1: there is probability "x" that, based on heuristic "n", the response variable is "y". As data is added to this generative model, multi-class inference is made on the response variables (positive, negative, or abstain), assigning probabilistic labels to potentially millions of data points. Thus, we have generated a discriminative ground truth for all further labeling efforts and have improved the scalability of our models. The labeling functions can then be applied to unlabeled data to further machine learning efforts.

Once our datasets are labeled and a ground truth is established, we persist the data into our delta lake, since it combines the most performant aspects of a warehouse with the low-cost storage of a data lake. In addition, the lake can accept unstructured, semi-structured, or structured data sources, and those sources can be further aggregated into raw ingestion, cleaned, and feature-engineered data layers. By sectioning the data sources into these "layers", the data engineering portion is abstracted away from the data scientist, who can access model-ready data at any time. Data can be ingested via batch or stream.

The design of the entire ecosystem is to eliminate as much technical debt in machine learning paradigms as possible in terms of configuration, data collection, verification, governance, extraction, analytics, process management, resource management, infrastructure, monitoring, and post-verification.
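
To make the labeling step described above concrete, here is a minimal Snorkel-style sketch (v0.9 API): two toy labeling functions produce noisy votes and the generative LabelModel combines them into fuzzy labels. The labeling functions, class names, and example data are hypothetical stand-ins for real domain heuristics.

```python
# Illustrative weak-supervision sketch with Snorkel; labeling functions and data
# are hypothetical, not the pipeline described in the abstract.
import pandas as pd
from snorkel.labeling import labeling_function, PandasLFApplier
from snorkel.labeling.model import LabelModel

ABSTAIN, NEGATIVE, POSITIVE = -1, 0, 1

@labeling_function()
def lf_keyword_refund(x):
    # heuristic: mentions of "refund" suggest a negative example
    return NEGATIVE if "refund" in x.text.lower() else ABSTAIN

@labeling_function()
def lf_keyword_great(x):
    # heuristic: mentions of "great" suggest a positive example
    return POSITIVE if "great" in x.text.lower() else ABSTAIN

df = pd.DataFrame({"text": [
    "Great product!", "I want a refund", "Arrived late", "Great value, no complaints",
]})

# Apply all labeling functions to every row, producing a label matrix of noisy votes.
L_train = PandasLFApplier(lfs=[lf_keyword_refund, lf_keyword_great]).apply(df)

# The generative label model weighs the (noisy, overlapping) labeling functions
# by their agreement and outputs probabilistic ("fuzzy") labels in [0, 1].
label_model = LabelModel(cardinality=2, verbose=False)
label_model.fit(L_train, n_epochs=500, seed=0)
print(label_model.predict_proba(L_train))
```

In practice the probabilistic labels would then be used to train a downstream discriminative model on features beyond the heuristics themselves.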


Author(s):  
Gil-sung Park ◽  
Jintae Bae ◽  
Jong Hun Lee ◽  
Byung Yeon Yun ◽  
Byunghwee Lee ◽  
...  

This study merges multiple COVID-19 data sources from news articles and social media to propose an integrated infodemic surveillance system (IISS) that implements infodemiology for a well-tailored epidemic management policy. IISS is an à-la-carte infodemic surveillance solution that enables users to gauge epidemic-related consensus; it compiles epidemic-related data from multiple sources and is equipped with various methodological toolkits – topic modeling, Word2Vec, and social network analysis. IISS can provide reliable empirical evidence for proper policymaking. We demonstrate the heuristic utility of IISS using empirical data from the first wave of COVID-19 in South Korea. Measuring discourse congruence allows us to gauge the distance between the discourse corpora from different sources, which can highlight consensus and conflict in epidemic discourse. Furthermore, IISS detects discrepancies between social concerns and main actors.
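
As one hedged sketch of how discourse congruence between two sources could be computed, the snippet below trains Word2Vec on toy corpora and compares the corpus centroids with cosine similarity; the corpora, the centroid aggregation, and the parameter choices are illustrative assumptions rather than the IISS implementation.

```python
# Illustrative discourse-congruence measure: embed tokens with Word2Vec (gensim)
# and compare the average vectors of two sources. Corpora are toy stand-ins.
import numpy as np
from gensim.models import Word2Vec

news = [["masks", "reduce", "transmission"],
        ["government", "extends", "distancing", "rules"]]
social = [["masks", "are", "uncomfortable"],
          ["distancing", "rules", "hurt", "small", "business"]]

model = Word2Vec(sentences=news + social, vector_size=50, window=3, min_count=1, seed=0)

def centroid(corpus):
    # mean embedding of all tokens in the corpus that appear in the vocabulary
    vecs = [model.wv[w] for doc in corpus for w in doc if w in model.wv]
    return np.mean(vecs, axis=0)

a, b = centroid(news), centroid(social)
congruence = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
print(f"cosine congruence between sources: {congruence:.3f}")
```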


PLoS ONE ◽  
2021 ◽  
Vol 16 (8) ◽  
pp. e0256858
Author(s):  
Giovanni De Toni ◽  
Cristian Consonni ◽  
Alberto Montresor

Influenza is an acute respiratory seasonal disease that affects millions of people worldwide and causes thousands of deaths in Europe alone. Estimating the impact of an illness on a given country quickly and reliably is essential to plan and organize effective countermeasures, which is now possible by leveraging unconventional data sources such as web searches and page visits. In this study, we show the feasibility of exploiting machine learning models and information about Wikipedia’s page views of a selected group of articles to obtain accurate estimates of influenza-like illness incidence in four European countries: Italy, Germany, Belgium, and the Netherlands. We propose a novel language-agnostic method, based on two algorithms, Personalized PageRank and CycleRank, to automatically select the most relevant Wikipedia pages to be monitored without the need for expert supervision. We then show that our model reaches state-of-the-art results by comparing it with previous solutions.
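
A minimal sketch of the page-selection idea, assuming a toy Wikipedia link graph and a single seed article, using networkx's Personalized PageRank; CycleRank and the incidence-estimation model itself are not reproduced here.

```python
# Personalized PageRank over a tiny, made-up Wikipedia link graph to rank
# candidate articles for page-view monitoring.
import networkx as nx

links = [
    ("Influenza", "Fever"), ("Influenza", "Cough"), ("Influenza", "Vaccine"),
    ("Fever", "Paracetamol"), ("Cough", "Common cold"), ("Vaccine", "Immunity"),
    ("Common cold", "Fever"), ("Football", "Stadium"),
]
G = nx.DiGraph(links)

# Bias the random walk toward the seed article "Influenza".
scores = nx.pagerank(G, alpha=0.85, personalization={"Influenza": 1.0})

# Keep the top-k pages as candidates whose page views feed the incidence model.
top_pages = sorted(scores, key=scores.get, reverse=True)[:5]
print(top_pages)
```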


2016 ◽  
Vol 27 (2) ◽  
pp. 146-166 ◽  
Author(s):  
Stella Androulaki ◽  
Haris Doukas ◽  
Vangelis Marinakis ◽  
Leandro Madrazo ◽  
Nikoletta-Zabbeta Legaki

Purpose – The purpose of this paper is to identify the most appropriate multidisciplinary data sources related to energy optimization decision support, as well as the methodologies, tools and techniques for capturing and processing the data from each of them.

Design/methodology/approach – A review is conducted on the state of play of decision support systems for energy optimization, focussing on the municipal sector, followed by an identification of the most appropriate multidisciplinary data sources for energy optimization decision support. An innovative methodology is outlined to integrate semantically modeled data from multiple sources to assist city authorities in energy management.

Findings – City authorities need to lead relevant actions toward energy-efficient neighborhoods. Although more and more energy and other related data are available at the city level, there are no established methods and tools for integrating and analyzing them in a smart way with the purpose of supporting the decision-making process on energy use optimization.

Originality/value – A novel multidimensional approach is proposed, using semantic technologies to integrate data from multiple sources, to assist city authorities in producing short-term energy plans in an integrated, transparent and comprehensive way.
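
As a loose illustration of the semantic-integration idea (not the paper's methodology or ontology), the sketch below models readings from two hypothetical municipal sources against a shared, made-up vocabulary and queries them together with SPARQL via rdflib.

```python
# Two municipal energy sources expressed against one shared (invented) vocabulary,
# then queried together; the namespace and properties are illustrative only.
from rdflib import Graph, Literal, Namespace, RDF
from rdflib.namespace import XSD

EX = Namespace("http://example.org/energy#")
g = Graph()

# Source 1: building-level electricity meter
g.add((EX.TownHall, RDF.type, EX.Building))
g.add((EX.TownHall, EX.electricityKWh, Literal(1250.0, datatype=XSD.double)))

# Source 2: district heating provider
g.add((EX.TownHall, EX.heatingKWh, Literal(800.0, datatype=XSD.double)))

# One query over both sources, thanks to the shared semantic model
q = """
SELECT ?b ?elec ?heat WHERE {
  ?b a <http://example.org/energy#Building> ;
     <http://example.org/energy#electricityKWh> ?elec ;
     <http://example.org/energy#heatingKWh> ?heat .
}
"""
for row in g.query(q):
    print(row.b, float(row.elec) + float(row.heat), "kWh total")
```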


Author(s):  
Svitlana Volkova ◽  
Dustin Arendt ◽  
Emily Saldanha ◽  
Maria Glenski ◽  
Ellyn Ayton ◽  
...  

The Ground Truth program was designed to evaluate social science modeling approaches using simulation test beds with ground truth intentionally and systematically embedded, to understand and model complex Human Domain systems and their dynamics (Lazer et al., Science 369:1060–1062, 2020). Our multidisciplinary team of data scientists, statisticians, and experts in Artificial Intelligence (AI) and visual analytics had a unique role on the program: to investigate the accuracy, reproducibility, generalizability, and robustness of state-of-the-art (SOTA) causal structure learning approaches applied to fully observed and sampled simulated data across virtual worlds. In addition, we analyzed the feasibility of using machine learning models to predict future social behavior with and without causal knowledge explicitly embedded. In this paper, we first present our causal modeling approach to discover the causal structure of four virtual worlds produced by the simulation teams—Urban Life, Financial Governance, Disaster, and Geopolitical Conflict. Our approach adapts state-of-the-art causal discovery (including ensemble models), machine learning, data analytics, and visualization techniques to allow a human-machine team to reverse-engineer the true causal relations from sampled and fully observed data. We next present our reproducibility analysis of two research methods teams' performance using a range of causal discovery models applied to both sampled and fully observed data, and analyze their effectiveness and limitations. We further investigate the generalizability and robustness to sampling of the SOTA causal discovery approaches on additional simulated datasets with known ground truth. Our results reveal the limitations of existing causal modeling approaches when applied to large-scale, noisy, high-dimensional data with unobserved variables and unknown relationships between them. We show that the SOTA causal models explored in our experiments are not designed to take advantage of vast amounts of data and have difficulty recovering ground truth when latent confounders are present; they do not generalize well across simulation scenarios and are not robust to sampling; and they are vulnerable to data and modeling assumptions, which makes the results hard to reproduce. Finally, we outline lessons learned and provide recommendations to improve models for causal discovery and prediction of human social behavior from observational data; in particular, we highlight the importance of learning data-to-knowledge representations or transformations to improve causal discovery, and describe the benefit of causal feature selection for predictive and prescriptive modeling.
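
For readers unfamiliar with constraint-based causal structure learning, the toy sketch below recovers a known three-variable skeleton from synthetic data using partial-correlation independence tests (a PC-style search limited to conditioning sets of size one); it is an assumption-laden illustration, not one of the SOTA models evaluated in the paper.

```python
# Toy constraint-based skeleton discovery on synthetic data with known ground
# truth: X -> Y -> Z. Edges are removed when a (partial) correlation test finds
# the pair independent marginally or given a single other variable.
import numpy as np
from itertools import combinations
from scipy import stats

rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=n)
Y = 2 * X + rng.normal(size=n)
Z = -1.5 * Y + rng.normal(size=n)
data = np.column_stack([X, Y, Z])
names = ["X", "Y", "Z"]

def independent(i, j, cond, alpha=0.01):
    """Fisher-z test of the (partial) correlation of variables i and j given cond."""
    C = np.corrcoef(data, rowvar=False)
    if cond is None:
        r = C[i, j]
    else:
        k = cond
        r = (C[i, j] - C[i, k] * C[j, k]) / np.sqrt((1 - C[i, k] ** 2) * (1 - C[j, k] ** 2))
    z = 0.5 * np.log((1 + r) / (1 - r)) * np.sqrt(n - (0 if cond is None else 1) - 3)
    p = 2 * (1 - stats.norm.cdf(abs(z)))
    return p > alpha

edges = set(combinations(range(3), 2))  # start fully connected
for i, j in list(edges):
    others = [k for k in range(3) if k not in (i, j)]
    if independent(i, j, None) or any(independent(i, j, k) for k in others):
        edges.discard((i, j))

print([(names[i], names[j]) for i, j in edges])  # expected skeleton: X-Y and Y-Z
```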


Author(s):  
Diego De Uña ◽  
Nataliia Rümmele ◽  
Graeme Gange ◽  
Peter Schachte ◽  
Peter J. Stuckey

The problem of integrating heterogeneous data sources into an ontology is highly relevant in the database field. Several techniques exist to approach the problem, but side constraints on the data cannot be easily implemented, so the results may be inconsistent. In this paper we improve on previous work by Taheriyan et al. [2016a] by using Machine Learning (ML) to take into account inconsistencies in the data (unmatchable attributes), and we encode the problem as a variation of the Steiner Tree problem, for which we use work by De Uña et al. [2016] in Constraint Programming (CP). Combining ML and CP achieves state-of-the-art precision, recall and speed, and provides a more flexible framework for variations of the problem.
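
To illustrate the Steiner-tree view of the integration problem, the sketch below runs networkx's standard 2-approximation over a made-up ontology graph with a source's inferred semantic types as terminals; the paper's CP formulation additionally handles side constraints such as unmatchable attributes, which this unconstrained relaxation ignores.

```python
# Unconstrained Steiner-tree relaxation of semantic-model construction: connect
# the terminals (semantic types of a source's columns) through the ontology
# graph at minimum total edge weight. Node and edge names are illustrative.
import networkx as nx
from networkx.algorithms.approximation import steiner_tree

G = nx.Graph()
G.add_weighted_edges_from([
    ("Person", "name", 1), ("Person", "worksFor", 2), ("worksFor", "Organization", 1),
    ("Organization", "orgName", 1), ("Person", "Organization", 5),
])

# Terminals: the semantic types the ML step might assign to the source's columns.
terminals = ["name", "orgName"]

T = steiner_tree(G, terminals, weight="weight")
print(sorted(T.edges(data="weight")))
```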


Author(s):  
Julian Hatwell ◽  
Mohamed Medhat Gaber ◽  
R. Muhammad Atif Azad

Background: Computer Aided Diagnostics (CAD) can support medical practitioners to make critical decisions about their patients’ disease conditions. Practitioners require access to the chain of reasoning behind CAD to build trust in the CAD advice and to supplement their own expertise. Yet, CAD systems might be based on black box machine learning models and high dimensional data sources such as electronic health records, magnetic resonance imaging scans, cardiotocograms, etc. These foundations make interpretation and explanation of the CAD advice very challenging. This challenge is recognised throughout the machine learning research community. eXplainable Artificial Intelligence (XAI) is emerging as one of the most important research areas of recent years because it addresses the interpretability and trust concerns of critical decision makers, including those in clinical and medical practice.

Methods: In this work, we focus on AdaBoost, a black box model that has been widely adopted in the CAD literature. We address the challenge – to explain AdaBoost classification – with a novel algorithm that extracts simple, logical rules from AdaBoost models. Our algorithm, Adaptive-Weighted High Importance Path Snippets (Ada-WHIPS), makes use of AdaBoost’s adaptive classifier weights. Using a novel formulation, Ada-WHIPS uniquely redistributes the weights among individual decision nodes of the internal decision trees of the AdaBoost model. Then, a simple heuristic search of the weighted nodes finds a single rule that dominated the model’s decision. We compare the explanations generated by our novel approach with the state of the art in an experimental study. We evaluate the derived explanations with simple statistical tests of well-known quality measures, precision and coverage, and a novel measure, stability, that is better suited to the XAI setting.

Results: Experiments on 9 CAD-related data sets showed that Ada-WHIPS explanations consistently generalise better (mean coverage 15%-68%) than the state of the art while remaining competitive for specificity (mean precision 80%-99%). A very small trade-off in specificity is shown to guard against over-fitting, which is a known problem in state of the art methods.

Conclusions: The experimental results demonstrate the benefits of using our novel algorithm for explaining CAD AdaBoost classifiers widely found in the literature. Our tightly coupled, AdaBoost-specific approach outperforms model-agnostic explanation methods and should be considered by practitioners looking for an XAI solution for this class of models.
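
A short scikit-learn sketch of the AdaBoost internals that such an approach operates on: the ensemble's adaptive classifier weights and the split nodes visited by one instance. The weight redistribution and rule search of Ada-WHIPS itself are not reproduced, and the dataset and model settings are illustrative.

```python
# Walk an AdaBoost ensemble's internal trees and collect, for one instance,
# the (boosting weight, split feature, threshold) of every decision node it
# passes through. These are the raw materials a path-based explainer works with.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier

X, y = load_breast_cancer(return_X_y=True)
ada = AdaBoostClassifier(n_estimators=50).fit(X, y)  # default shallow-tree base learners

x = X[:1]  # one instance to explain
path_nodes = []
for tree, alpha in zip(ada.estimators_, ada.estimator_weights_):
    node_ids = tree.decision_path(x).indices   # nodes visited by this instance
    t = tree.tree_
    for n in node_ids:
        if t.children_left[n] != -1:           # internal (split) node, not a leaf
            # alpha is the classifier's boosting weight (may be uniform
            # depending on the AdaBoost variant used)
            path_nodes.append((float(alpha), int(t.feature[n]), float(t.threshold[n])))

# Each tuple corresponds to a condition "feature <=/> threshold" that a
# rule-extraction method could score, rank, and assemble into an explanation.
print(len(path_nodes), path_nodes[:3])
```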


2020 ◽  
Author(s):  
Peter Kettig ◽  
Eduardo Sanchez-Diaz ◽  
Simon Baillarin ◽  
Olivier Hagolle ◽  
Jean-Marc Delvit ◽  
...  

Pixels covered by clouds in optical Earth Observation images are not usable for most applications. For this reason, only images delivered with reliable cloud masks are eligible for automated or massive analysis. Current state-of-the-art cloud detection algorithms, both physical models and machine learning models, are specific to a mission or a mission type, with limited transferability. A new model has to be developed every time a new mission is launched. Machine learning may overcome this problem and, in turn, obtain state-of-the-art or even better performance by training the same algorithm on datasets from different missions. However, simulating products for upcoming missions is not always possible, and actual products are not available in sufficient quantity to create a training dataset until well after the launch. Furthermore, labelling data is time consuming. Therefore, even by the time enough data is available, manually labelled data might not be available at all.

To solve this bottleneck, we propose a transfer-learning-based method using the available products of the current generation of satellites. These existing products are gathered in a database that is used to train a deep convolutional neural network (CNN) solely on those products. The trained model is applied to images from other, unseen sensors and the outputs are evaluated. We avoid manual labelling by automatically producing the ground data with existing algorithms. Only a few semi-manually labelled images are used to qualify the model, and even those samples need very few user inputs. This drastic reduction of user input limits subjectivity and reduces costs.

We provide an example of such a process by training a model to detect clouds in Sentinel-2 images, using as ground truth the masks of existing state-of-the-art processors. Then, we apply the trained network to detect clouds in previously unseen imagery of other sensors, such as the SPOT family or the High-Resolution (HR) Pleiades imaging system, which provide a different feature space.

The results demonstrate that the trained model is robust to variations within the individual bands resulting from different acquisition methods and spectral responses. Furthermore, the addition of geo-located auxiliary data that is independent of the platform, such as digital elevation models (DEMs), as well as simple synthetic bands such as the NDVI or NDSI, further improves the results.

In the future, this approach opens up the possibility of being used on new CNES missions, such as Microcarb or CO3D.
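
A minimal PyTorch sketch of the kind of training setup described, assuming an illustrative band count and architecture: a small fully convolutional network predicts a per-pixel cloud probability and is fitted against masks produced by an existing processor rather than manual labels. It is not the authors' network.

```python
# Tiny fully convolutional cloud-mask model trained on weak ground truth
# (masks from an existing processor). Shapes, band count and architecture are
# illustrative placeholders.
import torch
import torch.nn as nn

class TinyCloudNet(nn.Module):
    def __init__(self, in_bands: int = 6):  # e.g. 4 spectral bands + NDVI + DEM
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_bands, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 1, 1),  # per-pixel cloud logit
        )

    def forward(self, x):
        return self.net(x)

model = TinyCloudNet()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()

# Stand-in batch: 8 patches, 6 channels, 128x128 pixels, with cloud masks
# generated automatically by an existing state-of-the-art processor.
patches = torch.randn(8, 6, 128, 128)
weak_masks = (torch.rand(8, 1, 128, 128) > 0.7).float()

for step in range(3):  # a few steps, just to show the training loop
    opt.zero_grad()
    loss = loss_fn(model(patches), weak_masks)
    loss.backward()
    opt.step()
    print(f"step {step}: loss={loss.item():.4f}")
```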


Genes ◽  
2019 ◽  
Vol 10 (2) ◽  
pp. 87 ◽  
Author(s):  
Bilal Mirza ◽  
Wei Wang ◽  
Jie Wang ◽  
Howard Choi ◽  
Neo Christopher Chung ◽  
...  

Recent developments in high-throughput technologies have accelerated the accumulation of massive amounts of omics data from multiple sources: genome, epigenome, transcriptome, proteome, metabolome, etc. Traditionally, data from each source (e.g., genome) is analyzed in isolation using statistical and machine learning (ML) methods. Integrative analysis of multi-omics and clinical data is key to new biomedical discoveries and advancements in precision medicine. However, data integration poses new computational challenges as well as exacerbates the ones associated with single-omics studies. Specialized computational approaches are required to effectively and efficiently perform integrative analysis of biomedical data acquired from diverse modalities. In this review, we discuss state-of-the-art ML-based approaches for tackling five specific computational challenges associated with integrative analysis: curse of dimensionality, data heterogeneity, missing data, class imbalance and scalability issues.
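
As a small, hedged illustration of two of these challenges (missing data and class imbalance) in an early-integration setting, the sketch below concatenates two synthetic modality blocks and fits an imputation-plus-class-weighting pipeline; the data, block sizes, and model choice are assumptions, not recommendations from the review.

```python
# Early integration by concatenation of two toy "omics" blocks, with median
# imputation for missing values and class weighting for an imbalanced phenotype.
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 300
genomics = rng.normal(size=(n, 40))        # modality 1
proteomics = rng.normal(size=(n, 20))      # modality 2
X = np.hstack([genomics, proteomics])      # simple early integration
X[rng.random(X.shape) < 0.1] = np.nan      # 10% missing values
y = (rng.random(n) < 0.15).astype(int)     # rare (imbalanced) phenotype

clf = make_pipeline(
    SimpleImputer(strategy="median"),                              # missing data
    StandardScaler(),
    LogisticRegression(class_weight="balanced", max_iter=1000),   # class imbalance
)
print(cross_val_score(clf, X, y, cv=5, scoring="balanced_accuracy").mean())
```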


2020 ◽  
Author(s):  
Julian Hatwell ◽  
Mohamed Medhat Gaber ◽  
R.M. Atif Azad

Background: Computer Aided Diagnostics (CAD) can support medical practitioners to make critical decisions about their patients' disease conditions. Practitioners require access to the chain of reasoning behind CAD to build trust in the CAD advice and to supplement their own expertise. Yet, CAD systems might be based on black box machine learning (ML) models and high dimensional data sources (electronic health records, MRI scans, cardiotocograms, etc). These foundations make interpretation and explanation of the CAD advice very challenging. This challenge is recognised throughout the machine learning research community. eXplainable Artificial Intelligence (XAI) is emerging as one of the most important research areas of recent years because it addresses the interpretability and trust concerns of critical decision makers, including those in clinical and medical practice.

Methods: In this work, we focus on AdaBoost, a black box ML model that has been widely adopted in the CAD literature. We address the challenge – to explain AdaBoost classification – with a novel algorithm that extracts simple, logical rules from AdaBoost models. Our algorithm, Adaptive-Weighted High Importance Path Snippets (Ada-WHIPS), makes use of AdaBoost's adaptive classifier weights. Using a novel formulation, Ada-WHIPS uniquely redistributes the weights among individual decision nodes of the internal decision trees (DT) of the AdaBoost model. Then, a simple heuristic search of the weighted nodes finds a single rule that dominated the model's decision. We compare the explanations generated by our novel approach with the state of the art in an experimental study. We evaluate the derived explanations with simple statistical tests of well-known quality measures, precision and coverage, and a novel measure, stability, that is better suited to the XAI setting.

Results: Experiments on 9 CAD-related data sets showed that Ada-WHIPS explanations consistently generalise better (mean coverage 15%-68%) than the state of the art while remaining competitive for specificity (mean precision 80%-99%). A very small trade-off in specificity is shown to guard against over-fitting, which is a known problem in state of the art methods.

Conclusions: The experimental results demonstrate the benefits of using our novel algorithm for explaining CAD AdaBoost classifiers widely found in the literature. Our tightly coupled, AdaBoost-specific approach outperforms model-agnostic explanation methods and should be considered by practitioners looking for an XAI solution for this class of models.

