The limitations of data perturbation for ASR of learner data in under-resourced languages

Author(s):  
Jaco Badenhorst ◽  
Febe de Wet
2008 ◽  
pp. 1550-1561
Author(s):  
Rick L. Wilson ◽  
Peter A. Rosen

Data perturbation is a data security technique that adds ‘noise’ to databases to protect the confidentiality of individual records. The technique lets users obtain key summary information about the data that is not distorted and does not lead to a security breach. Four bias types have been proposed to assess the effectiveness of such techniques. However, these biases deal only with simple aggregate concepts (averages, etc.) found in the database. To compete in today’s business environment, it is critical that organizations use data mining approaches to discover additional knowledge about themselves ‘hidden’ in their databases. Thus, database administrators face competing objectives: protection of confidential data versus data disclosure for data mining applications. This paper empirically explores whether the data protection provided by perturbation techniques adds a so-called data mining bias to the database. The results provide initial support for the existence of this bias.
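
As a rough illustration of the idea described in this abstract (not the authors' method or data), the sketch below adds zero-mean Gaussian noise to a hypothetical numeric attribute. The function name `perturb`, the salary values, and the noise level are all assumptions made for this example. It shows that individual records change noticeably while the mean stays close to the original, and that the variance is inflated, which is one simple way an aggregate can be biased by perturbation.

```python
import random
import statistics

def perturb(values, noise_sd, seed=0):
    """Return a copy of `values` with zero-mean Gaussian noise added to each entry."""
    rng = random.Random(seed)
    return [v + rng.gauss(0.0, noise_sd) for v in values]

# Hypothetical confidential attribute (illustrative salaries, not data from the paper).
data_rng = random.Random(42)
salaries = [data_rng.uniform(30_000, 90_000) for _ in range(10_000)]

# Assumed noise level for the example.
masked = perturb(salaries, noise_sd=5_000)

# Aggregate view: the mean is nearly unchanged by zero-mean noise.
mean_shift = abs(statistics.mean(masked) - statistics.mean(salaries))

# Record view: each individual value moves by a few thousand on average.
record_shift = statistics.mean(abs(m - s) for m, s in zip(masked, salaries))

# Simple additive noise inflates the variance of the released attribute.
var_original = statistics.pvariance(salaries)
var_masked = statistics.pvariance(masked)

print(f"shift in mean:        {mean_shift:.1f}")
print(f"mean per-record shift: {record_shift:.1f}")
print(f"variance ratio:        {var_masked / var_original:.3f}")
```

Whether such noise also distorts the patterns found by data mining methods is the empirical question the abstract raises; this sketch only covers simple aggregates.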


2006 ◽  
Vol 2 ◽  
pp. 117693510600200 ◽  
Author(s):  
G. Alexe ◽  
G.S. Dalgin ◽  
R. Ramaswamy ◽  
C. Delisi ◽  
G. Bhanot

Molecular stratification of disease based on expression levels of sets of genes can help guide therapeutic decisions if such classifications can be shown to be stable against variations in sample source and data perturbation. Classifications inferred from one set of samples in one lab should be able to consistently stratify a different set of samples in another lab. We present a method for assessing such stability and apply it to the breast cancer (BCA) datasets of Sorlie et al. 2003 and Ma et al. 2003. We find that, within the now commonly accepted BCA categories identified by Sorlie et al., Luminal A and Basal are robust, but Luminal B and ERBB2+ are not. In particular, 36% of the samples identified as Luminal B and 55% of those identified as ERBB2+ cannot be assigned an accurate category because the classification is sensitive to data perturbation. We identify a “core cluster” of samples for each category, and from these we determine “patterns” of gene expression that distinguish the core clusters from each other. We find that the best markers for Luminal A and Basal are (ESR1, LIV1, GATA-3) and (CCNE1, LAD1, KRT5), respectively. Pathways enriched in the patterns regulate apoptosis, tissue remodeling and the immune response. We use a different dataset (Ma et al. 2003) to test the accuracy with which samples can be allocated to the four disease subtypes. We find, as expected, that the classification of samples identified as Luminal A and Basal is robust but classification into the other two subtypes is not.


2013 ◽  
Vol 2013 ◽  
pp. 1-7 ◽  
Author(s):  
Serdal Pamuk

We present a mathematical model for capillary formation in tumor angiogenesis and solve it by linearizing it using an initial data perturbation method. This method is highly effective for obtaining solutions of nonlinear coupled differential equations. We also provide a specific example showing that even a few terms of the obtained series solutions are enough to give an idea of the endothelial cell movement in a capillary. MATLAB-generated figures are provided, and the stability criteria are determined for the steady-state solution of the cell equation.


Author(s):  
Amanda M. Y. Chu ◽  
Benson S. Y. Lam ◽  
Agnes Tiwari ◽  
Mike K. P. So

Patient data or information collected from public health and health care surveys are of great research value. Usually, the data contain sensitive personal information. Doctors, nurses, or researchers in the public health and health care sector may not analyze the available datasets or survey data themselves and may instead outsource the tasks to third parties. Even when all identifiers such as names and ID card numbers are removed, an individual can sometimes still be re-identified via the demographic or other particular information provided in the datasets. Such data privacy issues can become an obstacle in health-related research. Statistical disclosure control (SDC) is a useful technique for resolving this problem by masking and designing released data based on the original data. While ensuring that the released data satisfy the needs of researchers for data analysis, SDC provides strong protection of the original data against disclosure. In this research, we discuss the statistical properties of two SDC methods: the General Additive Data Perturbation (GADP) method and the Gaussian Copula General Additive Data Perturbation (CGADP) method. An empirical study is provided to demonstrate how these two SDC methods can be applied in public health research.

