Causal Discovery from Nonstationary/Heterogeneous Data: Skeleton Estimation and Orientation Determination

Author(s):  
Kun Zhang ◽  
Biwei Huang ◽  
Jiji Zhang ◽  
Clark Glymour ◽  
Bernhard Schölkopf

It is commonplace to encounter nonstationary or heterogeneous data, in which the underlying generating process changes over time or across data sets (the data sets may have different experimental or data collection conditions). Such distribution shifts present both challenges and opportunities for causal discovery. In this paper we develop a principled framework for causal discovery from such data, called Constraint-based causal Discovery from Nonstationary/heterogeneous Data (CD-NOD), which addresses two important questions. First, we propose an enhanced constraint-based procedure to detect variables whose local mechanisms change and to recover the skeleton of the causal structure over the observed variables. Second, we present a way to determine causal orientations by making use of independence changes in the data distribution implied by the underlying causal model, benefiting from the information carried by changing distributions. Experimental results on various synthetic and real-world data sets are presented to demonstrate the efficacy of our methods.
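The role of a surrogate index in detecting changing mechanisms can be illustrated with a toy check (the variable names, coefficients, and the crude variance-based test below are illustrative assumptions; CD-NOD itself uses kernel-based conditional-independence tests):

```python
import math
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical two-domain data for X -> Y, where only Y's mechanism shifts.
n = 500
c = np.repeat([0, 1], n)                          # surrogate domain index
x = rng.normal(size=2 * n)                        # X: stable mechanism
y = np.where(c == 0, 1.0, 3.0) * x + rng.normal(size=2 * n)  # Y: coefficient shifts

def shift_pvalue(v, c):
    """Two-sided z-test on E[v^2] across the two domains: a crude proxy
    for testing dependence between a variable and the surrogate index."""
    a, b = v[c == 0] ** 2, v[c == 1] ** 2
    z = (b.mean() - a.mean()) / math.sqrt(a.var() / len(a) + b.var() / len(b))
    return math.erfc(abs(z) / math.sqrt(2))

p_x, p_y = shift_pvalue(x, c), shift_pvalue(y, c)
print(f"p(X) = {p_x:.3g}")   # typically large: X's mechanism is stable
print(f"p(Y) = {p_y:.3g}")   # tiny: Y's mechanism changes across domains
```

A variable that is dependent on the surrogate index is flagged as having a changing local mechanism, which is then exploited both for skeleton estimation and for orientation.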

2020 ◽  
Vol 34 (06) ◽  
pp. 10153-10161
Author(s):  
Biwei Huang ◽  
Kun Zhang ◽  
Mingming Gong ◽  
Clark Glymour

A number of approaches to causal discovery assume that there are no hidden confounders and are designed to learn a fixed causal model from a single data set. Over the last decade, with closer cooperation across laboratories, we have been able to accumulate more variables and data for analysis, while each lab may only measure a subset of them due to technical constraints or to save time and cost. This raises the question of how to handle causal discovery from multiple data sets with non-identical variable sets; at the same time, it is interesting to see how additional recorded variables can help mitigate the confounding problem. In this paper, we propose a principled method to uniquely identify causal relationships over the integrated set of variables from multiple data sets in linear, non-Gaussian cases. The proposed method also allows distribution shifts across data sets. Theoretically, we show that the causal structure over the integrated set of variables is identifiable under testable conditions. Furthermore, we present two types of approaches to parameter estimation: one is based on maximum likelihood, and the other is likelihood free and leverages generative adversarial nets to improve the scalability of the estimation procedure. Experimental results on various synthetic and real-world data sets are presented to demonstrate the efficacy of our methods.
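The role of non-Gaussianity in identifiability can be illustrated with a toy two-variable example (the coefficients, noise law, and squared-correlation dependence score are illustrative assumptions, not the paper's actual multi-data-set estimator): in a linear model with non-Gaussian noise, the regression residual is independent of the regressor only in the true causal direction.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000

# Toy linear, non-Gaussian pair: X1 -> X2 with uniform (non-Gaussian) noise.
e1, e2 = rng.uniform(-1, 1, n), rng.uniform(-1, 1, n)
x1 = e1
x2 = 0.8 * x1 + e2

def residual_dependence(cause, effect):
    """Regress `effect` on `cause` and score dependence between the
    regressor and the residual via correlation of squares (near zero
    in the true causal direction for an additive-noise model)."""
    b = np.cov(cause, effect)[0, 1] / cause.var()
    resid = effect - b * cause
    return abs(np.corrcoef(cause ** 2, resid ** 2)[0, 1])

score_fwd = residual_dependence(x1, x2)   # true direction: near zero
score_bwd = residual_dependence(x2, x1)   # wrong direction: clearly positive
print(score_fwd, score_bwd)
```

With Gaussian noise both scores would be indistinguishable from zero, which is why non-Gaussianity is what makes the direction, and hence the integrated structure, identifiable.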


2017 ◽  
Author(s):  
Jane Greenberg ◽  
Samantha Grabus ◽  
Florence Hudson ◽  
Tim Kraska ◽  
...  

Increasingly, both industry and academia, in fields ranging from biology and social sciences to computing and engineering, are driven by data (Provost & Fawcett, 2013; Wixom et al., 2014), and both commercial success and academic impact depend on having access to data. Many organizations collecting data lack the expertise required to process it (Hazen et al., 2014) and thus pursue data sharing with researchers who can extract more value from the data they own. For example, a biosciences company may benefit from a specific analysis technique a researcher has developed. At the same time, researchers are always in search of real-world data sets to demonstrate the effectiveness of their methods. Unfortunately, many data sharing attempts fail, for reasons ranging from legal restrictions on how data can be used to privacy policies, different cultural norms, and technological barriers. In fact, many data sharing partnerships that are vital to addressing pressing societal challenges in cities, health, energy, and the environment are not being pursued due to such obstacles. Addressing these data sharing challenges requires open, supportive dialogue across many sectors, including technology, policy, industry, and academia. Further, there is a crucial need for well-defined agreements that can be shared among key stakeholders, including researchers, technologists, legal representatives, and technology transfer officers. The Northeast Big Data Innovation Hub (NEBDIH) took an important step in this area with the recent "Enabling Seamless Data Sharing in Industry and Academia" workshop, held at Drexel University September 29-30, 2016. The workshop brought together representatives from these critical stakeholder communities to launch a national dialogue on challenges and opportunities in this complex space.


2017 ◽  
Author(s):  
Sofia Triantafillou ◽  
Vincenzo Lagani ◽  
Christina Heinze-Deml ◽  
Angelika Schmidt ◽  
Jesper Tegner ◽  
...  

Abstract. Learning the causal relationships that define a molecular system allows us to predict how the system will respond to different interventions. Distinguishing causality from mere association typically requires randomized experiments. Methods for automated causal discovery from limited experiments exist, but have so far rarely been tested in systems biology applications. In this work, we apply state-of-the-art causal discovery methods to a large collection of public mass cytometry data sets, measuring intra-cellular signaling proteins of the human immune system and their response to several perturbations. We show how different experimental conditions can be used to facilitate causal discovery, and apply two fundamental methods that produce context-specific causal predictions. Causal predictions were reproducible across independent data sets from two different studies, but often disagreed with the KEGG pathway database. Within this context, we discuss the caveats that need to be overcome for automated causal discovery to become part of routine data analysis in systems biology.


2021 ◽  
Author(s):  
Jarmo Mäkelä ◽  
Laila Melkas ◽  
Ivan Mammarella ◽  
Tuomo Nieminen ◽  
Suyog Chandramouli ◽  
...  

Abstract. This is a comment on "Estimating causal networks in biosphere–atmosphere interaction with the PCMCI approach" by Krich et al. (Biogeosciences, 17, 1033–1061, 2020), which gives a good introduction to causal discovery but confines its scope to the outcome of a single algorithm. In this comment, we argue that the outputs of causal discovery algorithms should usually be considered not as end results but as starting points and hypotheses for further study. We illustrate how not only different algorithms, but also different initial states and prior information about possible causal model structures, affect the outcome. We demonstrate how to incorporate expert domain knowledge into causal structure discovery and how to detect and account for overfitting and concept drift.


2020 ◽  
Vol 63 (12) ◽  
pp. 3991-3999
Author(s):  
Benjamin van der Woerd ◽  
Min Wu ◽  
Vijay Parsa ◽  
Philip C. Doyle ◽  
Kevin Fung

Objectives This study aimed to evaluate the fidelity and accuracy of a smartphone microphone and recording environment on acoustic measurements of voice. Method A prospective cohort proof-of-concept study. Two sets of prerecorded samples, (a) sustained vowels (/a/) and (b) the Rainbow Passage sentence, were played for recording via the internal iPhone microphone and the Blue Yeti USB microphone in two recording environments: a sound-treated booth and a quiet office setting. Recordings were presented using a calibrated mannequin speaker with a fixed signal intensity (69 dBA), at a fixed distance (15 in.). Each set of recordings (iPhone/audio booth, Blue Yeti/audio booth, iPhone/office, and Blue Yeti/office) was time-windowed to ensure the same signal was evaluated for each condition. Acoustic measures of voice, including fundamental frequency (fo), jitter, shimmer, harmonic-to-noise ratio (HNR), and cepstral peak prominence (CPP), were generated using a widely used analysis program (Praat Version 6.0.50). The data gathered were compared using a repeated measures analysis of variance. Two separate data sets were used. The set of vowel samples included both pathologic (n = 10) and normal (n = 10), male (n = 5) and female (n = 15) speakers. The set of sentence stimuli ranged in perceived voice quality from normal to severely disordered, with an equal number of male (n = 12) and female (n = 12) speakers evaluated. Results The vowel analyses indicated that jitter, shimmer, HNR, and CPP were significantly different based on microphone choice, and that shimmer, HNR, and CPP were significantly different based on the recording environment. Analysis of sentences revealed a statistically significant impact of recording environment and microphone type on HNR and CPP.
While statistically significant, the differences across the experimental conditions for a subset of the acoustic measures (viz., jitter and CPP) fell within their respective normative ranges. Conclusions Both microphone and recording setting resulted in significant differences across several acoustic measurements. However, a subset of the acoustic measures that were statistically significant across the recording conditions showed small overall differences that are unlikely to have clinical significance. For these acoustic measures, the present data suggest that, although a sound-treated setting is ideal for voice sample collection, a smartphone microphone can capture acceptable recordings for acoustic signal analysis.
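The cycle-to-cycle perturbation measures named above have simple definitions that can be sketched directly (a minimal sketch of local jitter and shimmer as commonly defined, e.g. in Praat's voice report; the example period values are invented):

```python
import numpy as np

def local_jitter(periods):
    """Mean absolute difference between consecutive glottal periods,
    divided by the mean period ('jitter (local)')."""
    periods = np.asarray(periods, dtype=float)
    return np.mean(np.abs(np.diff(periods))) / np.mean(periods)

def local_shimmer(amplitudes):
    """The analogous measure on peak amplitudes of consecutive cycles."""
    amplitudes = np.asarray(amplitudes, dtype=float)
    return np.mean(np.abs(np.diff(amplitudes))) / np.mean(amplitudes)

# A perfectly periodic voice has zero jitter.
print(local_jitter([0.005] * 10))                              # -> 0.0
# Slight cycle-to-cycle variation yields a small positive value.
print(round(local_jitter([0.0050, 0.0051, 0.0049, 0.0050]), 4))  # -> 0.0267
```

Because these measures are ratios of small period differences, even modest microphone or room noise can shift them, which is why the study's condition-by-condition comparison matters.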


Author(s):  
K Sobha Rani

Collaborative filtering suffers from the problems of data sparsity and cold start, which dramatically degrade recommendation performance. To help resolve these issues, we propose TrustSVD, a trust-based matrix factorization technique. By analyzing the social trust data from four real-world data sets, we conclude that not only the explicit but also the implicit influence of both ratings and trust should be taken into consideration in a recommendation model. Hence, we build on top of a state-of-the-art recommendation algorithm, SVD++, which inherently involves the explicit and implicit influence of rated items, by further incorporating both the explicit and implicit influence of trusted users on the prediction of items for an active user. To our knowledge, the work reported is the first to extend SVD++ with social trust information. Experimental results on the four data sets demonstrate that our approach TrustSVD achieves better accuracy than ten other counterparts, and better handles the data sparsity and cold-start issues.
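The described model can be pictured with a small prediction sketch (the names, shapes, and numbers below are illustrative assumptions, not the actual TrustSVD formulation or learned parameters): an SVD++-style prediction whose user vector also folds in the implicit influence of trusted users.

```python
import numpy as np

def predict(mu, b_u, b_i, p_u, q_i, y_rated, w_trusted):
    """Hedged sketch of a TrustSVD-style prediction. SVD++ folds the
    implicit influence of rated items (y_rated) into the user factor;
    the trust extension additionally folds in the implicit influence of
    trusted users (w_trusted), each normalized by 1/sqrt(set size)."""
    implicit_items = y_rated.sum(axis=0) / np.sqrt(len(y_rated))
    implicit_trust = w_trusted.sum(axis=0) / np.sqrt(len(w_trusted))
    user_vec = p_u + implicit_items + implicit_trust
    return mu + b_u + b_i + q_i @ user_vec

rng = np.random.default_rng(2)
k = 8                                    # latent dimensionality (assumed)
r_hat = predict(
    mu=3.5, b_u=0.1, b_i=-0.2,
    p_u=rng.normal(scale=0.1, size=k),
    q_i=rng.normal(scale=0.1, size=k),
    y_rated=rng.normal(scale=0.1, size=(5, k)),    # factors of 5 rated items
    w_trusted=rng.normal(scale=0.1, size=(3, k)),  # factors of 3 trusted users
)
print(round(r_hat, 3))
```

In the full method these factors are learned by minimizing a regularized rating-plus-trust reconstruction loss rather than being sampled as here.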


2021 ◽  
Vol 15 (8) ◽  
pp. 841-853
Author(s):  
Yuan Liu ◽  
Zhining Wen ◽  
Menglong Li

Background: The use of genetic data to investigate biological problems has recently become a vital approach. However, the heterogeneity of the original samples at the biological level is often ignored when genetic data are analyzed. Different cell constitutions of a sample can alter its expression profile and introduce considerable bias into downstream research. Matrix factorization (MF), which originated as a family of mathematical methods, has contributed greatly to deconvoluting genetic profiles in silico, especially at the expression level. Objective: With the development of artificial intelligence algorithms and machine learning, the number of computational methods for addressing sample heterogeneity is growing rapidly. However, a structured view of how MF is used to deconvolute genetic data is still lacking. This study was conducted to review the uses of MF methods for heterogeneity problems in expression-level genetic data. Methods: MF methods involved in deconvolution were reviewed according to their individual strengths. The review is organized into three sections: application scenarios, method categories, and a summary of tools. The application scenarios define the deconvolution problems and the settings in which they arise; the method categories summarize the MF algorithms that address each scenario; and the tool summary lists the functions and web servers developed over the last decade. Challenges and opportunities in related fields are also discussed. Results and Conclusion: Based on this investigation, the study presents a relatively global picture to help researchers gain quicker access to in silico deconvolution of genetic data and to select suitable MF methods for different scenarios.
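The core MF step behind expression deconvolution can be sketched with plain non-negative matrix factorization via Lee-Seung multiplicative updates (the gene signatures, mixing proportions, and dimensions below are synthetic assumptions, not data from any tool in the review):

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy bulk expression: 60 genes x 20 samples mixed from 3 cell types.
signatures = rng.gamma(2.0, size=(60, 3))             # genes x cell types
proportions = rng.dirichlet(np.ones(3), size=20).T    # cell types x samples
bulk = signatures @ proportions

def nmf(X, k, iters=300, seed=0):
    """Lee-Seung multiplicative updates: factor X ~ W @ H with W, H >= 0.
    In deconvolution, W plays the role of cell-type signatures and H of
    per-sample cell proportions."""
    rng = np.random.default_rng(seed)
    W = rng.random((X.shape[0], k)) + 0.1
    H = rng.random((k, X.shape[1])) + 0.1
    for _ in range(iters):
        H *= (W.T @ X) / (W.T @ W @ H + 1e-9)
        W *= (X @ H.T) / (W @ H @ H.T + 1e-9)
    return W, H

W, H = nmf(bulk, k=3)
rel_err = np.linalg.norm(bulk - W @ H) / np.linalg.norm(bulk)
print(f"relative reconstruction error: {rel_err:.3f}")
```

Real deconvolution tools add constraints (e.g. proportions summing to one, reference signatures) on top of this basic factorization, which is where the reviewed methods differ.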


2021 ◽  
Vol 7 (s2) ◽  
Author(s):  
Alexander Bergs

Abstract This paper focuses on the micro-analysis of historical data, which allows us to investigate language use across the lifetime of individual speakers. Certain concepts, such as social network analysis or communities of practice, put individual speakers and their social embeddedness and dynamicity at the center of attention. This means that intra-speaker variation can be described and analyzed in quite some detail in certain historical data sets. The paper presents some exemplary empirical analyses of the diachronic linguistic behavior of individual speakers/writers in fifteenth to seventeenth century England. It discusses the social factors that influence this behavior, with an emphasis on the methodological and theoretical challenges and opportunities when investigating intra-speaker variation and change.


Entropy ◽  
2021 ◽  
Vol 23 (5) ◽  
pp. 507
Author(s):  
Piotr Białczak ◽  
Wojciech Mazurczyk

Malicious software utilizes the HTTP protocol for communication, creating network traffic that is hard to identify as it blends into the traffic generated by benign applications. To counter this, fingerprinting tools have been developed to help track and identify such traffic by providing a short representation of malicious HTTP requests. However, currently existing tools do not analyze all information included in the HTTP message, or analyze it insufficiently. To address these issues, we propose Hfinger, a novel malware HTTP request fingerprinting tool. It extracts information from parts of the request such as the URI, protocol information, headers, and payload, providing a concise request representation that preserves the extracted information in a form interpretable by a human analyst. For the developed solution, we have performed an extensive experimental evaluation using real-world data sets, and we also compared Hfinger with the most related and popular existing tools, such as FATT, Mercury, and p0f. The conducted effectiveness analysis reveals that on average only 1.85% of requests fingerprinted by Hfinger collide between malware families, which is 8–34 times lower than for existing tools. Moreover, unlike these tools, in default mode Hfinger does not introduce collisions between malware and benign applications, while increasing the number of fingerprints by at most a factor of 3. As a result, Hfinger can effectively track and hunt malware by providing more unique fingerprints than other standard tools.
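The idea of a short, structure-preserving request representation can be illustrated with a toy fingerprint (the feature choices below are hypothetical simplifications, not Hfinger's actual feature set or encoding):

```python
import hashlib

def fingerprint(method, uri, headers, payload=b""):
    """Toy structural fingerprint of an HTTP request. It captures header
    order/names, URI path shape, and a short payload digest rather than raw
    values, so structurally similar requests from one malware family collide
    with each other but not with unrelated traffic."""
    # Replace each non-empty path segment with 'S', keeping the path shape.
    uri_shape = "/".join("S" if seg else "" for seg in uri.split("/"))
    header_names = "|".join(name.lower() for name, _ in headers)
    payload_tag = hashlib.sha256(payload).hexdigest()[:8] if payload else "-"
    return f"{method}|{uri_shape}|{header_names}|{payload_tag}"

fp = fingerprint(
    "GET", "/gate.php?id=123",
    [("Host", "example.com"), ("User-Agent", "Mozilla/5.0"), ("Accept", "*/*")],
)
print(fp)   # -> GET|/S|host|user-agent|accept|-
```

The design tension the paper measures is visible even here: coarser features (like the path shape) reduce fingerprint count but risk collisions between families, while finer features do the opposite.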

