Causal Discovery from Nonstationary/Heterogeneous Data: Skeleton Estimation and Orientation Determination

Author(s):  
Kun Zhang ◽  
Biwei Huang ◽  
Jiji Zhang ◽  
Clark Glymour ◽  
Bernhard Schölkopf

It is commonplace to encounter nonstationary or heterogeneous data, in which the underlying generating process changes over time or across data sets (the data sets may have different experimental or data collection conditions). Such distribution shifts present both challenges and opportunities for causal discovery. In this paper we develop a principled framework for causal discovery from such data, called Constraint-based causal Discovery from Nonstationary/heterogeneous Data (CD-NOD), which addresses two important questions. First, we propose an enhanced constraint-based procedure to detect variables whose local mechanisms change and to recover the skeleton of the causal structure over the observed variables. Second, we present a way to determine causal orientations by making use of independence changes in the data distribution implied by the underlying causal model, benefiting from the information carried by changing distributions. Experimental results on various synthetic and real-world data sets are presented to demonstrate the efficacy of our methods.
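The role of a surrogate index in detecting changing mechanisms can be illustrated with a toy check (the variable names, coefficients, and the crude variance-based test below are illustrative assumptions; CD-NOD itself uses kernel-based conditional-independence tests):

```python
import math
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical two-domain data for X -> Y, where only Y's mechanism shifts.
n = 500
c = np.repeat([0, 1], n)                          # surrogate domain index
x = rng.normal(size=2 * n)                        # X: stable mechanism
y = np.where(c == 0, 1.0, 3.0) * x + rng.normal(size=2 * n)  # Y: coefficient shifts

def shift_pvalue(v, c):
    """Two-sided z-test on E[v^2] across the two domains: a crude proxy
    for testing dependence between a variable and the surrogate index."""
    a, b = v[c == 0] ** 2, v[c == 1] ** 2
    z = (b.mean() - a.mean()) / math.sqrt(a.var() / len(a) + b.var() / len(b))
    return math.erfc(abs(z) / math.sqrt(2))

p_x, p_y = shift_pvalue(x, c), shift_pvalue(y, c)
print(f"p(X) = {p_x:.3g}")   # typically large: X's mechanism is stable
print(f"p(Y) = {p_y:.3g}")   # tiny: Y's mechanism changes across domains
```

A variable that is dependent on the surrogate index is flagged as having a changing local mechanism, which is then exploited both for skeleton estimation and for orientation.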

2020 ◽  
Vol 34 (06) ◽  
pp. 10153-10161
Author(s):  
Biwei Huang ◽  
Kun Zhang ◽  
Mingming Gong ◽  
Clark Glymour

A number of approaches to causal discovery assume that there are no hidden confounders and are designed to learn a fixed causal model from a single data set. Over the last decade, with closer cooperation across laboratories, we have been able to accumulate more variables and data for analysis, while each lab may only measure a subset of them due to technical constraints or to save time and cost. This raises the question of how to handle causal discovery from multiple data sets with non-identical variable sets; at the same time, it is interesting to see how additional recorded variables can help mitigate the confounding problem. In this paper, we propose a principled method to uniquely identify causal relationships over the integrated set of variables from multiple data sets in linear, non-Gaussian cases. The proposed method also allows distribution shifts across data sets. Theoretically, we show that the causal structure over the integrated set of variables is identifiable under testable conditions. Furthermore, we present two types of approaches to parameter estimation: one is based on maximum likelihood, and the other is likelihood free and leverages generative adversarial nets to improve the scalability of the estimation procedure. Experimental results on various synthetic and real-world data sets are presented to demonstrate the efficacy of our methods.
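The role of non-Gaussianity in identifiability can be illustrated with a toy two-variable example (the coefficients, noise law, and squared-correlation dependence score are illustrative assumptions, not the paper's actual multi-data-set estimator): in a linear model with non-Gaussian noise, the regression residual is independent of the regressor only in the true causal direction.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000

# Toy linear, non-Gaussian pair: X1 -> X2 with uniform (non-Gaussian) noise.
e1, e2 = rng.uniform(-1, 1, n), rng.uniform(-1, 1, n)
x1 = e1
x2 = 0.8 * x1 + e2

def residual_dependence(cause, effect):
    """Regress `effect` on `cause` and score dependence between the
    regressor and the residual via correlation of squares (near zero
    in the true causal direction for an additive-noise model)."""
    b = np.cov(cause, effect)[0, 1] / cause.var()
    resid = effect - b * cause
    return abs(np.corrcoef(cause ** 2, resid ** 2)[0, 1])

score_fwd = residual_dependence(x1, x2)   # true direction: near zero
score_bwd = residual_dependence(x2, x1)   # wrong direction: clearly positive
print(score_fwd, score_bwd)
```

With Gaussian noise both scores would be indistinguishable from zero, which is why non-Gaussianity is what makes the direction, and hence the integrated structure, identifiable.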


2017 ◽  
Author(s):  
Jane Greenberg ◽  
Samantha Grabus ◽  
Florence Hudson ◽  
Tim Kraska ◽  
...  

Increasingly, both industry and academia, in fields ranging from biology and social sciences to computing and engineering, are driven by data (Provost & Fawcett, 2013; Wixom et al., 2014), and both commercial success and academic impact depend on having access to data. Many organizations collecting data lack the expertise required to process it (Hazen et al., 2014) and thus pursue data sharing with researchers who can extract more value from the data they own. For example, a biosciences company may benefit from a specific analysis technique a researcher has developed. At the same time, researchers are always in search of real-world data sets to demonstrate the effectiveness of their methods. Unfortunately, many data sharing attempts fail, for reasons ranging from legal restrictions on how data can be used to privacy policies, different cultural norms, and technological barriers. In fact, many data sharing partnerships that are vital to addressing pressing societal challenges in cities, health, energy, and the environment are not being pursued due to such obstacles. Addressing these data sharing challenges requires open, supportive dialogue across many sectors, including technology, policy, industry, and academia. Further, there is a crucial need for well-defined agreements that can be shared among key stakeholders, including researchers, technologists, legal representatives, and technology transfer officers. The Northeast Big Data Innovation Hub (NEBDIH) took an important step in this area with the recent "Enabling Seamless Data Sharing in Industry and Academia" workshop, held at Drexel University September 29-30, 2016. The workshop brought together representatives from these critical stakeholder communities to launch a national dialogue on challenges and opportunities in this complex space.


2017 ◽  
Author(s):  
Sofia Triantafillou ◽  
Vincenzo Lagani ◽  
Christina Heinze-Deml ◽  
Angelika Schmidt ◽  
Jesper Tegner ◽  
...  

Abstract. Learning the causal relationships that define a molecular system allows us to predict how the system will respond to different interventions. Distinguishing causality from mere association typically requires randomized experiments. Methods for automated causal discovery from limited experiments exist, but have so far rarely been tested in systems biology applications. In this work, we apply state-of-the-art causal discovery methods to a large collection of public mass cytometry data sets, measuring intra-cellular signaling proteins of the human immune system and their response to several perturbations. We show how different experimental conditions can be used to facilitate causal discovery, and apply two fundamental methods that produce context-specific causal predictions. Causal predictions were reproducible across independent data sets from two different studies, but often disagreed with the KEGG pathway database. Within this context, we discuss the caveats that need to be overcome for automated causal discovery to become part of routine data analysis in systems biology.


2021 ◽  
Author(s):  
Jarmo Mäkelä ◽  
Laila Melkas ◽  
Ivan Mammarella ◽  
Tuomo Nieminen ◽  
Suyog Chandramouli ◽  
...  

Abstract. This is a comment on "Estimating causal networks in biosphere–atmosphere interaction with the PCMCI approach" by Krich et al. (Biogeosciences, 17, 1033–1061, 2020), which gives a good introduction to causal discovery but confines its scope to the outcome of a single algorithm. In this comment, we argue that the outputs of causal discovery algorithms should usually be considered not as end results but as starting points and hypotheses for further study. We illustrate how not only different algorithms, but also different initial states and prior information about possible causal model structures, affect the outcome. We demonstrate how to incorporate expert domain knowledge into causal structure discovery and how to detect and account for overfitting and concept drift.


2020 ◽  
Vol 63 (12) ◽  
pp. 3991-3999
Author(s):  
Benjamin van der Woerd ◽  
Min Wu ◽  
Vijay Parsa ◽  
Philip C. Doyle ◽  
Kevin Fung

Objectives This study aimed to evaluate the fidelity and accuracy of a smartphone microphone and recording environment on acoustic measurements of voice. Method A prospective cohort proof-of-concept study. Two sets of prerecorded samples, (a) sustained vowels (/a/) and (b) the Rainbow Passage sentence, were played for recording via the internal iPhone microphone and the Blue Yeti USB microphone in two recording environments: a sound-treated booth and a quiet office setting. Recordings were presented using a calibrated mannequin speaker with a fixed signal intensity (69 dBA), at a fixed distance (15 in.). Each set of recordings (iPhone/audio booth, Blue Yeti/audio booth, iPhone/office, and Blue Yeti/office) was time-windowed to ensure the same signal was evaluated for each condition. Acoustic measures of voice, including fundamental frequency (fo), jitter, shimmer, harmonic-to-noise ratio (HNR), and cepstral peak prominence (CPP), were generated using a widely used analysis program (Praat Version 6.0.50). The data gathered were compared using a repeated measures analysis of variance. Two separate data sets were used. The set of vowel samples included both pathologic (n = 10) and normal (n = 10), male (n = 5) and female (n = 15) speakers. The set of sentence stimuli ranged in perceived voice quality from normal to severely disordered, with an equal number of male (n = 12) and female (n = 12) speakers evaluated. Results The vowel analyses indicated that jitter, shimmer, HNR, and CPP were significantly different based on microphone choice, and that shimmer, HNR, and CPP were significantly different based on the recording environment. Analysis of sentences revealed a statistically significant impact of recording environment and microphone type on HNR and CPP.
While statistically significant, the differences across the experimental conditions for a subset of the acoustic measures (viz., jitter and CPP) fell within their respective normative ranges. Conclusions Both microphone and recording setting resulted in significant differences across several acoustic measurements. However, a subset of the acoustic measures that were statistically significant across the recording conditions showed small overall differences that are unlikely to have clinical significance. For these acoustic measures, the present data suggest that, although a sound-treated setting is ideal for voice sample collection, a smartphone microphone can capture acceptable recordings for acoustic signal analysis.
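The cycle-to-cycle perturbation measures named above have simple definitions that can be sketched directly (a minimal sketch of local jitter and shimmer as commonly defined, e.g. in Praat's voice report; the example period values are invented):

```python
import numpy as np

def local_jitter(periods):
    """Mean absolute difference between consecutive glottal periods,
    divided by the mean period ('jitter (local)')."""
    periods = np.asarray(periods, dtype=float)
    return np.mean(np.abs(np.diff(periods))) / np.mean(periods)

def local_shimmer(amplitudes):
    """The analogous measure on peak amplitudes of consecutive cycles."""
    amplitudes = np.asarray(amplitudes, dtype=float)
    return np.mean(np.abs(np.diff(amplitudes))) / np.mean(amplitudes)

# A perfectly periodic voice has zero jitter.
print(local_jitter([0.005] * 10))                              # -> 0.0
# Slight cycle-to-cycle variation yields a small positive value.
print(round(local_jitter([0.0050, 0.0051, 0.0049, 0.0050]), 4))  # -> 0.0267
```

Because these measures are ratios of small period differences, even modest microphone or room noise can shift them, which is why the study's condition-by-condition comparison matters.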


Author(s):  
K Sobha Rani

Collaborative filtering suffers from the problems of data sparsity and cold start, which dramatically degrade recommendation performance. To help resolve these issues, we propose TrustSVD, a trust-based matrix factorization technique. By analyzing the social trust data from four real-world data sets, we conclude that not only the explicit but also the implicit influence of both ratings and trust should be taken into consideration in a recommendation model. Hence, we build on top of a state-of-the-art recommendation algorithm, SVD++, which inherently involves the explicit and implicit influence of rated items, by further incorporating both the explicit and implicit influence of trusted users on the prediction of items for an active user. To our knowledge, the work reported is the first to extend SVD++ with social trust information. Experimental results on the four data sets demonstrate that our approach TrustSVD achieves better accuracy than ten other counterparts, and better handles the data sparsity and cold-start issues.
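The described model can be pictured with a small prediction sketch (the names, shapes, and numbers below are illustrative assumptions, not the actual TrustSVD formulation or learned parameters): an SVD++-style prediction whose user vector also folds in the implicit influence of trusted users.

```python
import numpy as np

def predict(mu, b_u, b_i, p_u, q_i, y_rated, w_trusted):
    """Hedged sketch of a TrustSVD-style prediction. SVD++ folds the
    implicit influence of rated items (y_rated) into the user factor;
    the trust extension additionally folds in the implicit influence of
    trusted users (w_trusted), each normalized by 1/sqrt(set size)."""
    implicit_items = y_rated.sum(axis=0) / np.sqrt(len(y_rated))
    implicit_trust = w_trusted.sum(axis=0) / np.sqrt(len(w_trusted))
    user_vec = p_u + implicit_items + implicit_trust
    return mu + b_u + b_i + q_i @ user_vec

rng = np.random.default_rng(2)
k = 8                                    # latent dimensionality (assumed)
r_hat = predict(
    mu=3.5, b_u=0.1, b_i=-0.2,
    p_u=rng.normal(scale=0.1, size=k),
    q_i=rng.normal(scale=0.1, size=k),
    y_rated=rng.normal(scale=0.1, size=(5, k)),    # factors of 5 rated items
    w_trusted=rng.normal(scale=0.1, size=(3, k)),  # factors of 3 trusted users
)
print(round(r_hat, 3))
```

In the full method these factors are learned by minimizing a regularized rating-plus-trust reconstruction loss rather than being sampled as here.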


2021 ◽  
Vol 15 (8) ◽  
pp. 841-853
Author(s):  
Yuan Liu ◽  
Zhining Wen ◽  
Menglong Li

Background: The use of genetic data to investigate biological problems has recently become a vital approach. However, the heterogeneity of the original samples at the biological level is often ignored when genetic data are analyzed. Different cell constitutions of a sample can alter its expression profile and introduce considerable bias into downstream research. Matrix factorization (MF), which originated as a family of mathematical methods, has contributed greatly to deconvoluting genetic profiles in silico, especially at the expression level. Objective: With the development of artificial intelligence algorithms and machine learning, the number of computational methods for addressing sample heterogeneity is growing rapidly. However, a structured view of how MF is used to deconvolute genetic data is still lacking. This study was conducted to review the uses of MF methods for heterogeneity problems in expression-level genetic data. Methods: MF methods involved in deconvolution were reviewed according to their individual strengths. The review is organized into three sections: application scenarios, method categories, and a summary of tools. The application scenarios define the deconvolution problems and the settings in which they arise; the method categories summarize the MF algorithms that address each scenario; and the tool summary lists the functions and web servers developed over the last decade. Challenges and opportunities in related fields are also discussed. Results and Conclusion: Based on this investigation, the study presents a relatively global picture to help researchers gain quicker access to in silico deconvolution of genetic data and to select suitable MF methods for different scenarios.
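The core MF step behind expression deconvolution can be sketched with plain non-negative matrix factorization via Lee-Seung multiplicative updates (the gene signatures, mixing proportions, and dimensions below are synthetic assumptions, not data from any tool in the review):

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy bulk expression: 60 genes x 20 samples mixed from 3 cell types.
signatures = rng.gamma(2.0, size=(60, 3))             # genes x cell types
proportions = rng.dirichlet(np.ones(3), size=20).T    # cell types x samples
bulk = signatures @ proportions

def nmf(X, k, iters=300, seed=0):
    """Lee-Seung multiplicative updates: factor X ~ W @ H with W, H >= 0.
    In deconvolution, W plays the role of cell-type signatures and H of
    per-sample cell proportions."""
    rng = np.random.default_rng(seed)
    W = rng.random((X.shape[0], k)) + 0.1
    H = rng.random((k, X.shape[1])) + 0.1
    for _ in range(iters):
        H *= (W.T @ X) / (W.T @ W @ H + 1e-9)
        W *= (X @ H.T) / (W @ H @ H.T + 1e-9)
    return W, H

W, H = nmf(bulk, k=3)
rel_err = np.linalg.norm(bulk - W @ H) / np.linalg.norm(bulk)
print(f"relative reconstruction error: {rel_err:.3f}")
```

Real deconvolution tools add constraints (e.g. proportions summing to one, reference signatures) on top of this basic factorization, which is where the reviewed methods differ.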


2021 ◽  
Vol 7 (s2) ◽  
Author(s):  
Alexander Bergs

Abstract This paper focuses on the micro-analysis of historical data, which allows us to investigate language use across the lifetime of individual speakers. Certain concepts, such as social network analysis or communities of practice, put individual speakers and their social embeddedness and dynamicity at the center of attention. This means that intra-speaker variation can be described and analyzed in quite some detail in certain historical data sets. The paper presents some exemplary empirical analyses of the diachronic linguistic behavior of individual speakers/writers in fifteenth to seventeenth century England. It discusses the social factors that influence this behavior, with an emphasis on the methodological and theoretical challenges and opportunities when investigating intra-speaker variation and change.


Entropy ◽  
2021 ◽  
Vol 23 (5) ◽  
pp. 507
Author(s):  
Piotr Białczak ◽  
Wojciech Mazurczyk

Malicious software utilizes the HTTP protocol for communication, creating network traffic that is hard to identify as it blends into the traffic generated by benign applications. To counter this, fingerprinting tools have been developed to help track and identify such traffic by providing a short representation of malicious HTTP requests. However, currently existing tools do not analyze all information included in the HTTP message, or analyze it insufficiently. To address these issues, we propose Hfinger, a novel malware HTTP request fingerprinting tool. It extracts information from parts of the request such as the URI, protocol information, headers, and payload, providing a concise request representation that preserves the extracted information in a form interpretable by a human analyst. For the developed solution, we have performed an extensive experimental evaluation using real-world data sets, and we also compared Hfinger with the most related and popular existing tools, such as FATT, Mercury, and p0f. The conducted effectiveness analysis reveals that on average only 1.85% of requests fingerprinted by Hfinger collide between malware families, which is 8–34 times lower than for existing tools. Moreover, unlike these tools, in default mode Hfinger does not introduce collisions between malware and benign applications, while increasing the number of fingerprints by at most a factor of 3. As a result, Hfinger can effectively track and hunt malware by providing more unique fingerprints than other standard tools.
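The idea of a short, structure-preserving request representation can be illustrated with a toy fingerprint (the feature choices below are hypothetical simplifications, not Hfinger's actual feature set or encoding):

```python
import hashlib

def fingerprint(method, uri, headers, payload=b""):
    """Toy structural fingerprint of an HTTP request. It captures header
    order/names, URI path shape, and a short payload digest rather than raw
    values, so structurally similar requests from one malware family collide
    with each other but not with unrelated traffic."""
    # Replace each non-empty path segment with 'S', keeping the path shape.
    uri_shape = "/".join("S" if seg else "" for seg in uri.split("/"))
    header_names = "|".join(name.lower() for name, _ in headers)
    payload_tag = hashlib.sha256(payload).hexdigest()[:8] if payload else "-"
    return f"{method}|{uri_shape}|{header_names}|{payload_tag}"

fp = fingerprint(
    "GET", "/gate.php?id=123",
    [("Host", "example.com"), ("User-Agent", "Mozilla/5.0"), ("Accept", "*/*")],
)
print(fp)   # -> GET|/S|host|user-agent|accept|-
```

The design tension the paper measures is visible even here: coarser features (like the path shape) reduce fingerprint count but risk collisions between families, while finer features do the opposite.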

