Evaluating the Effects of Missing Values and Mixed Data Types on Social Sequence Clustering Using t-SNE Visualization

2019 ◽  
Vol 11 (2) ◽  
pp. 1-22
Author(s):  
Alina Lazar ◽  
Ling Jin ◽  
C. Anna Spurlock ◽  
Kesheng Wu ◽  
Alex Sim ◽  
...  
GigaScience ◽  
2020 ◽  
Vol 9 (11) ◽  
Author(s):  
Sergey E Golovenkin ◽  
Jonathan Bac ◽  
Alexander Chervov ◽  
Evgeny M Mirkes ◽  
Yuliya V Orlova ◽  
...  

Abstract
Background: Large observational clinical datasets are becoming increasingly available for mining associations between various disease traits and administered therapy. These datasets can be considered as representations of the landscape of all possible disease conditions, in which a concrete disease state develops through stereotypical routes, characterized by “points of no return” and “final states” (such as lethal or recovery states). Extracting this information directly from the data remains challenging, especially in the case of synchronic (short-term follow-up) observations.
Results: Here we suggest a semi-supervised methodology for the analysis of large clinical datasets, characterized by mixed data types and missing values, through modeling the geometrical data structure as a bouquet of bifurcating clinical trajectories. The methodology is based on application of elastic principal graphs, which can simultaneously address the tasks of dimensionality reduction, data visualization, clustering, feature selection, and quantifying geodesic distances (pseudo-time) in partially ordered sequences of observations. The methodology allows a patient to be positioned on a particular clinical trajectory (pathological scenario) and the degree of progression along it to be characterized, with a qualitative estimate of the uncertainty of the prognosis. We developed the tool ClinTrajan for clinical trajectory analysis, implemented in the Python programming language. We test the methodology on 2 large publicly available datasets: myocardial infarction complications and readmission of diabetic patients data.
Conclusions: Our pseudo-time quantification-based approach makes it possible to apply methods developed for dynamical disease phenotyping and illness trajectory analysis (diachronic data analysis) to synchronic observational data.
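To make the pseudo-time idea concrete: once a bifurcating principal graph has been fitted, each observation is projected onto the graph and its pseudo-time is the geodesic distance from a chosen root state. A minimal sketch of that final step, using networkx on an invented toy trajectory graph (ClinTrajan itself fits elastic principal graphs to the data; the node names and edge weights here are hypothetical):

import networkx as nx

# Hypothetical bifurcating clinical trajectory: root -> branch point -> two outcomes.
G = nx.Graph()
G.add_weighted_edges_from([
    ("admission", "day2", 1.0),
    ("day2", "branch", 1.0),
    ("branch", "recovery", 2.0),      # one clinical scenario
    ("branch", "complication", 1.5),  # the alternative scenario
])

# Geodesic distance from the root node acts as pseudo-time: it orders
# observations along each trajectory even when each patient contributes
# only a single (synchronic) snapshot.
pseudo_time = nx.single_source_dijkstra_path_length(G, "admission")
for state, t in sorted(pseudo_time.items(), key=lambda kv: kv[1]):
    print(f"{state:>12}: pseudo-time {t:.1f}")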


Author(s):  
QingXiang Wu ◽  
Martin McGinnity ◽  
Girijesh Prasad ◽  
David Bell

Data mining and knowledge discovery aim at finding useful information in typically massive collections of data and then extracting useful knowledge from that information. To date, a large number of approaches have been proposed to find useful information and discover useful knowledge, for example decision trees, Bayesian belief networks, evidence theory, rough set theory, fuzzy set theory, the kNN (k-nearest-neighbor) classifier, neural networks, and support vector machines. However, each of these approaches is designed for a specific data type. In the real world, an intelligent system often encounters mixed data types, incomplete information (missing values), and imprecise information (fuzzy conditions). In the UCI (University of California, Irvine) Machine Learning Repository, there are many real-world data sets with missing values and mixed data types. Enabling machine learning or data mining approaches to deal with mixed data types is a challenge (Ching, 1995; Coppock, 2003) because it is difficult to define a measure of similarity between objects with mixed-type attributes. Handling mixed data types is thus a long-standing issue in data mining. The emerging techniques targeted at this issue can be classified into three classes: (1) symbolic data mining approaches combined with discretizers (e.g., Dougherty et al., 1995; Wu, 1996; Kurgan et al., 2004; Diday, 2004; Darmont et al., 2006; Wu et al., 2007) that transform continuous data into symbolic data; (2) numerical data mining approaches combined with a transformation from symbolic to numerical data (e.g., Kasabov, 2003; Darmont et al., 2006; Hadzic et al., 2007); and (3) hybrids of symbolic and numerical data mining approaches (e.g., Tung, 2002; Kasabov, 2003; Leng et al., 2005; Wu et al., 2006). Since hybrid approaches have the potential to exploit the advantages of both symbolic and numerical data mining, this chapter, after discussing the merits and shortcomings of current approaches, focuses on applying the Self-Organizing Computing Network Model to construct a hybrid system for knowledge discovery from databases with a diversity of data types. Future trends for data mining on mixed-type data are then discussed, and a conclusion is presented.
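One standard illustration of a mixed-type similarity measure, distinct from the hybrid network model this chapter develops, is Gower's coefficient: range-normalized absolute differences for numeric attributes, simple matching for categorical ones, with missing values dropped from the average. A minimal sketch in plain Python with NumPy (the records, column layout, and ranges are invented):

import numpy as np

def gower_distance(x, y, is_numeric, ranges):
    """Gower distance between two mixed-type records.
    x, y       : 1-D object arrays of attribute values (np.nan marks missing)
    is_numeric : boolean mask, True where the attribute is numeric
    ranges     : per-attribute value ranges, used to normalize numeric gaps
    """
    total, used = 0.0, 0
    for xi, yi, num, rng in zip(x, y, is_numeric, ranges):
        if (isinstance(xi, float) and np.isnan(xi)) or (isinstance(yi, float) and np.isnan(yi)):
            continue  # missing values simply drop out of the average
        if num:
            total += abs(xi - yi) / rng        # range-normalized numeric difference
        else:
            total += 0.0 if xi == yi else 1.0  # simple matching for categories
        used += 1
    return total / used if used else np.nan

# Two hypothetical records: (age, income, smoker, blood_type)
a = np.array([35, 52000.0, "yes", "A"], dtype=object)
b = np.array([41, np.nan, "no", "A"], dtype=object)
mask = np.array([True, True, False, False])
ranges = np.array([60.0, 90000.0, 1.0, 1.0])  # ranges matter only for numeric columns
print(gower_distance(a, b, mask, ranges))     # ~0.37: mixed types, one missing value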


The previous chapter gave an overview of big data, including its types, sources, analytic techniques, and applications. This chapter briefly discusses the architecture components that deal with the huge volume of data. The complexity of big data types motivates a logical architecture with layers and high-level components for a big data solution, relating data sources to atomic patterns. The dimensions of the approach include volume, variety, velocity, veracity, and governance. The layers of the architecture are the big data sources, the data massaging and store layer, the analysis layer, and the consumption layer. Big data sources are the data collected from various sources on which data scientists perform analytics; data can come from internal and external sources. Internal sources comprise transactional data, device sensors, business documents, internal files, etc.; external sources include social network profiles, geographical data, data stores, etc. Data massaging is the process of preprocessing the extracted data, for example by removing missing values, reducing dimensionality, and removing noise, into a useful format for storage. The analysis layer provides insight using the preferred analytics techniques and tools; the analytics methods, the issues to consider, the requirements, and the tools are discussed at length. The consumption layer delivers the resulting business insight to consumers such as retail marketing, the public sector, financial bodies, and the media. Finally, a case study of architectural drivers is applied to a retail industry application, and its challenges and use cases are discussed.
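As a hedged illustration of the data-massaging layer only (the chapter prescribes an architecture, not this particular code), a typical preprocessing chain of imputing missing values, normalizing, and reducing dimensionality can be sketched with scikit-learn on synthetic data; the column count and parameters are invented:

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Hypothetical raw sensor/transaction matrix with missing readings.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))
X[rng.random(X.shape) < 0.05] = np.nan  # ~5% missing values

# Data massaging: fill gaps, normalize, reduce noise/dimensionality.
massage = Pipeline([
    ("impute", SimpleImputer(strategy="median")),  # removal of missing values
    ("scale", StandardScaler()),                   # comparable units across features
    ("reduce", PCA(n_components=5)),               # dimensionality/noise reduction
])
X_clean = massage.fit_transform(X)
print(X_clean.shape)  # (1000, 5), ready for the analysis layer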


Biometrika ◽  
2020 ◽  
Vol 107 (3) ◽  
pp. 609-625 ◽  
Author(s):  
Grace Yoon ◽  
Raymond J Carroll ◽  
Irina Gaynanova

Summary
Canonical correlation analysis investigates linear relationships between two sets of variables, but it often works poorly on modern datasets because of high dimensionality and mixed data types such as continuous, binary and zero-inflated. To overcome these challenges, we propose a semiparametric approach to sparse canonical correlation analysis based on the Gaussian copula. The main result of this paper is a truncated latent Gaussian copula model for data with excess zeros, which allows us to derive a rank-based estimator of the latent correlation matrix for mixed variable types without estimation of marginal transformation functions. The resulting canonical correlation analysis method works well in high-dimensional settings, as demonstrated via numerical studies, and when applied to the analysis of association between gene expression and microRNA data from breast cancer patients.
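The rank-based idea is easiest to see in the continuous-continuous case: under a Gaussian copula, the latent Pearson correlation and Kendall's tau are linked by r = sin(pi*tau/2), so the latent correlation can be estimated without estimating any marginal transformation. The sketch below demonstrates only this simplest bridge on synthetic data; the paper's contribution is extending such bridge functions to binary and zero-inflated (truncated) variables:

import numpy as np
from scipy.stats import kendalltau

rng = np.random.default_rng(1)
z = rng.multivariate_normal([0, 0], [[1, 0.6], [0.6, 1]], size=2000)
# Monotone marginal transforms distort Pearson's r but leave
# Kendall's tau, and hence the copula, unchanged.
x, y = np.exp(z[:, 0]), z[:, 1] ** 3

tau, _ = kendalltau(x, y)
latent_r = np.sin(np.pi * tau / 2)  # bridge function for continuous pairs
print(f"tau = {tau:.3f}, estimated latent correlation = {latent_r:.3f}")  # ~0.6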


2006 ◽  
Vol 15 (04) ◽  
pp. 571-606
Author(s):  
SAMIR M. KORIEM

As real-time systems continue to grow, performance evaluation plays a critical role in their design, since the computation time, the service time, and the responsive actions must satisfy timing constraints. One such system is the real-time distributed multimedia-on-demand (MOD) service system. The MOD system usually fails when it misses a task deadline. Its main units communicate with each other and work concurrently under timing constraints. The MOD system is designed to store, retrieve, schedule, synchronize, and communicate objects comprising mixed data types, including images, text, video, and audio, in real time; in the MOD system, these data types together constitute movie files. Modeling this concurrency, communication, timing, and multimedia service (e.g., store, retrieve) is essential for evaluating the real-time MOD system. To illustrate how to model and analyze the important multimedia aspects of the MOD system, we use the Real-net (R-net) modeling technique. We choose the R-net, an extension of the Time Petri Net, for its ability to specify hard real-time process interaction, represent the synchronization of multimedia entities, describe concurrent multimedia activities, and illustrate the inter-process timing relationships required for multimedia presentation. Based on modular techniques, we build three R-net performance models describing the dynamic behavior of the MOD service system. The first model adopts the Earliest Deadline First (EDF) disk scheduling algorithm; the other models adopt the Scan-EDF algorithm. These algorithms illustrate how real-time user requests can be satisfied within their specified deadlines. Since R-nets are amenable to analysis, including Markov process modeling, the interesting performance measures of the MOD service system, such as the quality of service, the request response time, the disk scheduling algorithm time, and the actual retrieval time, can be easily computed. In the performance analysis of the MOD models, we use our R-NET package.
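The two scheduling policies adopted by the models can be stated compactly: EDF always serves the pending request with the earliest deadline, while Scan-EDF breaks deadline ties by proximity to the disk head to reduce seek time. A toy sketch (the request fields, head position, and data are invented; the paper models these policies with R-nets, not Python):

from dataclasses import dataclass

@dataclass
class Request:
    rid: int         # request id
    deadline: float  # absolute deadline (e.g., frame due time)
    track: int       # disk track holding the data

def edf_order(requests):
    # Earliest Deadline First: serve strictly by deadline.
    return sorted(requests, key=lambda r: r.deadline)

def scan_edf_order(requests, head=0):
    # Scan-EDF: order by deadline first, then by distance from the
    # disk head among requests sharing a deadline, cutting seek overhead.
    return sorted(requests, key=lambda r: (r.deadline, abs(r.track - head)))

pending = [Request(1, 40.0, 90), Request(2, 25.0, 10),
           Request(3, 25.0, 70), Request(4, 33.0, 30)]
print([r.rid for r in edf_order(pending)])             # [2, 3, 4, 1]: strictly by deadline
print([r.rid for r in scan_edf_order(pending, head=60)])  # [3, 2, 4, 1]: tie at 25.0 broken by head distance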


2013 ◽  
Vol 27 (2) ◽  
pp. 685-700 ◽  
Author(s):  
I. Sánchez-Borrego ◽  
J. D. Opsomer ◽  
M. Rueda ◽  
A. Arcos

2018 ◽  
Vol 2018 ◽  
pp. 1-9 ◽  
Author(s):  
Min-Wei Huang ◽  
Wei-Chao Lin ◽  
Chih-Fong Tsai

Many real-world medical datasets contain some proportion of missing (attribute) values. In general, missing value imputation can be performed to solve this problem: estimates for the missing values are produced by a reasoning process based on the (complete) observed data. However, if the observed data contain noisy information or outliers, the estimates of the missing values may not be reliable and may even differ substantially from the real values. The aim of this paper is to examine whether a combination of instance selection from the observed data and missing value imputation offers better performance than performing missing value imputation alone. In particular, three instance selection algorithms, DROP3, GA, and IB3, and three imputation algorithms, KNNI, MLP, and SVM, are combined in order to identify the best combination. The experimental results show that performing instance selection can have a positive impact on missing value imputation for numerical medical datasets, and that specific combinations of instance selection and imputation methods can improve the imputation results for mixed-type medical datasets. However, instance selection does not have a clearly positive impact on the imputation results for categorical medical datasets.
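A hedged sketch of the combination being evaluated, substituting simpler stand-ins for the paper's methods (IsolationForest in place of DROP3/GA/IB3 for instance selection, scikit-learn's KNNImputer in the spirit of KNNI, and synthetic numeric data):

import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.impute import KNNImputer

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 8))
X[:10] += 8.0                          # a few noisy/outlier instances
X[rng.random(X.shape) < 0.1] = np.nan  # ~10% missing entries

# Step 1 (instance selection): among the complete rows, drop suspected
# outliers so they cannot serve as misleading neighbors during imputation.
complete = ~np.isnan(X).any(axis=1)
detector = IsolationForest(random_state=0).fit(X[complete])
inliers = detector.predict(X[complete]) == 1
reference = X[complete][inliers]

# Step 2 (imputation): KNN imputation fitted on the selected instances
# only, then applied to every row; neighbors are found in the fit data.
imputer = KNNImputer(n_neighbors=5).fit(reference)
X_imputed = imputer.transform(X)
print(np.isnan(X_imputed).sum())  # 0: all missing entries filled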

