Alternative Clustering

Author(s):  
Avinash Navlani ◽  
V. B. Gupta

In the last couple of decades, clustering has become a crucial research problem in the data mining community. Clustering refers to partitioning data objects, such as records and documents, into groups or clusters of similar characteristics. Because clustering is unsupervised learning, there is no unique solution that fits all problems. Complex data sets often require explanation through multiple clusterings, yet traditional clustering approaches generate only a single clustering. A data set may contain more than one pattern, and each pattern can be interesting from a different perspective. Alternative clustering aims to find multiple distinct groupings of a data set such that each grouping has high quality and differs from the others. This chapter gives an overall view of alternative clustering: its various approaches, related work, a comparison with commonly confused related terms such as subspace, multi-view, and ensemble clustering, as well as applications, issues, and challenges.
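To make the idea concrete, the following minimal sketch (not taken from the chapter) produces two alternative clusterings of a toy data set: a first k-means solution, and a second one obtained after projecting out the subspace spanned by the first solution's centroids. All data and parameters are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))                      # toy data, no ground-truth labels

# First clustering on the original features
km1 = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Project out the subspace spanned by the (centered) centroids of the first
# solution, then cluster the residual: one simple way to encourage a grouping
# that differs from the first.
C = km1.cluster_centers_ - km1.cluster_centers_.mean(axis=0)
Q, _ = np.linalg.qr(C.T)                           # orthonormal basis of that subspace
X_alt = X - X @ Q @ Q.T

km2 = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_alt)

# Low similarity between labelings indicates a genuinely alternative clustering
print("ARI between the two clusterings:", adjusted_rand_score(km1.labels_, km2.labels_))
```

A low adjusted Rand index between the two labelings is one simple proxy for the "distinct from each other" requirement described above.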

2011 ◽  
Vol 16 (3) ◽  
pp. 338-347 ◽  
Author(s):  
Anne Kümmel ◽  
Paul Selzer ◽  
Martin Beibel ◽  
Hanspeter Gubler ◽  
Christian N. Parker ◽  
...  

High-content screening (HCS) is increasingly used in biomedical research, generating multivariate, single-cell data sets. Before scoring a treatment, the complex data sets are processed (e.g., normalized, reduced to a lower dimensionality) to help extract valuable information. However, there has been no published comparison of the performance of these methods. This study comparatively evaluates unbiased approaches to reducing dimensionality as well as to summarizing cell populations. To evaluate the different data-processing strategies, the prediction accuracies and the Z′ factors of control compounds of an HCS cell-cycle data set were monitored. As expected, dimension reduction led to a lower degree of discrimination between control samples. A high degree of classification accuracy was achieved when the cell population was summarized at the well level using percentile values. In conclusion, the generic data analysis pipeline described here enables a systematic review of alternative strategies for analyzing multiparametric results from biological systems.
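A hedged sketch of this kind of pipeline, assuming hypothetical column names and simulated single-cell measurements: per-well percentile summaries followed by the standard Z′-factor of positive versus negative control wells.

```python
import numpy as np
import pandas as pd

def z_prime(pos: np.ndarray, neg: np.ndarray) -> float:
    """Standard Z'-factor: 1 - 3*(sd_pos + sd_neg) / |mean_pos - mean_neg|."""
    return 1.0 - 3.0 * (pos.std(ddof=1) + neg.std(ddof=1)) / abs(pos.mean() - neg.mean())

# Simulated single-cell readout: 4 negative-control and 4 positive-control wells
rng = np.random.default_rng(1)
cells = pd.DataFrame({
    "well": np.repeat([f"neg_{i}" for i in range(4)] + [f"pos_{i}" for i in range(4)], 200),
    "dna_content": rng.normal(loc=np.repeat([1.0] * 4 + [2.0] * 4, 200), scale=0.2),
})

# Summarize each cell population at the well level with percentile features
summary = cells.groupby("well")["dna_content"].quantile([0.10, 0.50, 0.90]).unstack()
summary.columns = ["p10", "p50", "p90"]

# Z'-factor of the median (p50) feature for positive vs negative control wells
neg = summary.loc[summary.index.str.startswith("neg"), "p50"].to_numpy()
pos = summary.loc[summary.index.str.startswith("pos"), "p50"].to_numpy()
print(f"Z'-factor (p50): {z_prime(pos, neg):.2f}")
```

A Z′-factor close to 1 indicates well-separated controls; values below about 0.5 are usually taken to indicate a weak assay window.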


2015 ◽  
Vol 6 (2) ◽  
Author(s):  
Javier Arsuaga ◽  
Ido Heskia ◽  
Serkan Hosten ◽  
Tatiana Maskalevich

Exchange type chromosome aberrations (ETCAs) are rearrangements of the genome that occur when chromosomes break and the resulting fragments rejoin with fragments from other chromosomes or from other regions within the same chromosome. ETCAs are commonly observed in cancer cells and in cells exposed to radiation. Because the frequency of these chromosome rearrangements is correlated with their spatial proximity, it can be used to infer the three-dimensional organization of the genome. Extracting statistical significance of spatial proximity from cancer and radiation data has remained somewhat elusive because of the sparsity of the data. Here we propose a new approach to studying the three-dimensional organization of the genome using algebraic statistics. We test our method on a published data set of irradiated human blood lymphocyte cells. We provide a rigorous method for testing the overall organization of the genome, and in agreement with previous results we find a random relative positioning of chromosomes, with the exception of the chromosome pairs {1,22} and {13,14}, which have a significantly larger number of ETCAs than the rest of the chromosome pairs, suggesting their spatial proximity. We conclude that algebraic methods can successfully be used to analyze genetic data and have potential applications to larger and more complex data sets.
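For illustration only, the sketch below uses a naive chi-square goodness-of-fit test, not the algebraic-statistics exact test developed in the paper, to compare simulated ETCA counts per chromosome pair against a uniform "random relative positioning" null. The counts are made up for demonstration.

```python
from itertools import combinations
import numpy as np
from scipy.stats import chisquare

pairs = list(combinations(range(1, 23), 2))            # autosome pairs {i, j}, chromosomes 1-22
rng = np.random.default_rng(2)
observed = rng.poisson(5, size=len(pairs)).astype(float)   # simulated ETCA counts
observed[pairs.index((1, 22))] += 20                   # artificially inflate {1, 22}
observed[pairs.index((13, 14))] += 20                  # artificially inflate {13, 14}

# Null hypothesis: ETCAs are spread uniformly over chromosome pairs
expected = np.full(len(pairs), observed.sum() / len(pairs))
stat, p = chisquare(observed, expected)
print(f"chi-square = {stat:.1f}, p = {p:.3g}")

# Pairs whose observed count most exceeds the uniform expectation
excess = observed - expected
print("largest excess:", sorted(zip(pairs, excess), key=lambda t: t[1], reverse=True)[:2])
```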


PeerJ ◽  
2018 ◽  
Vol 6 ◽  
pp. e5199
Author(s):  
Wanli Zhang ◽  
Yanming Di

The accumulation of RNA sequencing (RNA-Seq) gene expression data in recent years has resulted in large and complex data sets of high dimensions. Exploratory analysis, including data mining and visualization, reveals hidden patterns and potential outliers in such data, but is often challenged by the high dimensional nature of the data. The scatterplot matrix is a commonly used tool for visualizing multivariate data, and allows us to view multiple bivariate relationships simultaneously. However, the scatterplot matrix becomes less effective for high dimensional data because the number of bivariate displays increases quadratically with data dimensionality. In this study, we introduce a selection criterion for each bivariate scatterplot and design and implement an algorithm that automatically scans and ranks all possible scatterplots, with the goal of identifying the plots in which separation between two pre-defined groups is maximized. By applying our method to a multi-experiment Arabidopsis RNA-Seq data set, we were able to successfully pinpoint the visualization angles where genes from two biological pathways are the most separated, as well as identify potential outliers.
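A simplified stand-in for the scan-and-rank idea described above (the paper's actual selection criterion is not reproduced here): score every pair of dimensions by how well two pre-defined groups separate in the corresponding 2-D scatterplot, then rank the pairs. The silhouette score used below is an assumption chosen purely for illustration.

```python
from itertools import combinations
import numpy as np
from sklearn.metrics import silhouette_score

def rank_scatterplots(X: np.ndarray, labels: np.ndarray, top: int = 5):
    """Return the top feature pairs whose 2-D projection best separates the groups."""
    scores = []
    for i, j in combinations(range(X.shape[1]), 2):
        scores.append(((i, j), silhouette_score(X[:, [i, j]], labels)))
    return sorted(scores, key=lambda t: t[1], reverse=True)[:top]

# Toy example: group separation is planted in dimensions 2 and 7
rng = np.random.default_rng(3)
labels = np.repeat([0, 1], 100)
X = rng.normal(size=(200, 10))
X[labels == 1, 2] += 3.0
X[labels == 1, 7] += 3.0
print(rank_scatterplots(X, labels))
```

The pair (2, 7) should rank first, since that scatterplot shows the strongest separation between the two planted groups.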


Author(s):  
Adam R. Richardson ◽  
Marvin J. Dainoff ◽  
Leonard S. Mark ◽  
James L. Smart ◽  
Niles Davis

This paper describes a research strategy in which complex data sets are represented as physical objects in a virtual 3-D environment. The advantage of such a representation is that it allows the observer to actively explore the virtual environment, so that potential ambiguities found in typical 3-D projections can be resolved by the transformation resulting from the change in viewing perspective. The study reported here constitutes an initial condition in which subjects compared the relative size of virtual cubes from two different viewpoints. These results can serve as a basis for the construction of cube-like objects representing the underlying conceptual structure of a data set.


Author(s):  
Abou_el_ela Abdou Hussein

Day by day, advanced web technologies have led to tremendous growth in the volume of data generated daily. This mountain of huge, dispersed data sets leads to the phenomenon called big data: a collection of massive, heterogeneous, unstructured, enormous, and complex data sets. The big data life cycle can be represented as collecting (capturing), storing, distributing, manipulating, interpreting, analyzing, investigating, and visualizing big data. Traditional techniques such as the Relational Database Management System (RDBMS) cannot handle big data because of their inherent limitations, so advances in computing architecture are required to handle both the data storage requirements and the heavy processing needed to analyze huge volumes and varieties of data economically. There are many technologies for manipulating big data; one of them is Hadoop. Hadoop can be understood as an open-source distributed data processing framework that is one of the prominent and well-known solutions to the problem of handling big data. Apache Hadoop was based on the Google File System and the MapReduce programming paradigm. In this paper we survey big data characteristics, starting from the first three V's, which have been extended over time by researchers to more than fifty-six V's, and compare researchers' definitions to reach the best representation and a precise clarification of all big data V's. We highlight the challenges that face big data processing and how to overcome them using Hadoop, and we discuss its use in processing big data sets as a solution for resolving various problems in a distributed cloud-based environment. The paper mainly focuses on different components of Hadoop such as Hive, Pig, and HBase. We also give a full description of Hadoop's pros and cons, and of improvements that address Hadoop's problems through a proposed cost-efficient scheduler algorithm for heterogeneous Hadoop systems.
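As an illustration of the MapReduce paradigm mentioned above, the following minimal word-count sketch is written for Hadoop Streaming, which lets any executable that reads stdin and writes "key<TAB>value" lines act as a mapper or reducer. The file name and the local test command are placeholders, not details from the paper.

```python
# Local test (simulating the shuffle with sort):
#   cat input.txt | python wordcount.py map | sort | python wordcount.py reduce
import sys

def mapper():
    # Emit one "word<TAB>1" record per token
    for line in sys.stdin:
        for word in line.strip().split():
            print(f"{word}\t1")

def reducer():
    # Records arrive sorted by key, so counts for each word are contiguous
    current, count = None, 0
    for line in sys.stdin:
        word, value = line.rstrip("\n").split("\t")
        if word != current:
            if current is not None:
                print(f"{current}\t{count}")
            current, count = word, 0
        count += int(value)
    if current is not None:
        print(f"{current}\t{count}")

if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()
```

On a cluster, the same two functions would be supplied to the Hadoop Streaming jar as the mapper and reducer commands, with HDFS input and output paths.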


Author(s):  
Phillip L. Manning ◽  
Peter L. Falkingham

Dinosaurs successfully conjure images of lost worlds and forgotten lives. Our understanding of these iconic, extinct animals now comes from many disciplines, not just the science of palaeontology. In recent years palaeontology has benefited from the application of new and existing techniques from physics, biology, chemistry, and engineering, but especially computational science. The application of computers in palaeontology is highlighted in this chapter as a key area of development in studying fossils. Advances in high-performance computing (HPC) have greatly aided and abetted multiple disciplines and technologies that now feed palaeontological research, especially when dealing with large and complex data sets. We also give examples of how such multidisciplinary research can be used to communicate not only specific discoveries in palaeontology, but also the methods and ideas from interrelated disciplines, to wider audiences. Dinosaurs represent a useful vehicle that can help enable wider public engagement, communicating complex science in digestible chunks.


2010 ◽  
pp. 1797-1803
Author(s):  
Lisa Friedland

In traditional data analysis, data points lie in a Cartesian space, and an analyst asks certain questions: (1) What distribution can I fit to the data? (2) Which points are outliers? (3) Are there distinct clusters or substructure? Today, data mining treats richer and richer types of data. Social networks encode information about people and their communities; relational data sets incorporate multiple types of entities and links; and temporal information describes the dynamics of these systems. With such semantically complex data sets, a greater variety of patterns can be described and richer views of the data can be constructed. This article describes a specific social structure that may be present in such data sources and presents a framework for detecting it. The goal is to identify tribes, or small groups of individuals that intentionally coordinate their behavior—individuals with enough in common that they are unlikely to be acting independently. While this task can only be conceived of in a domain of interacting entities, the solution techniques return to the traditional data analysis questions. In order to find hidden structure (3), we use an anomaly detection approach: develop a model to describe the data (1), then identify outliers (2).
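A generic sketch of the "model the data, then flag outliers" recipe described above, not the article's specific tribe-detection framework: fit a simple distribution to pairwise behavioral similarity and flag pairs whose similarity is implausible under independence. All data are simulated.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n_people, n_actions = 50, 100
behavior = rng.random((n_people, n_actions)) < 0.1    # sparse binary "who did what"
behavior[5, :12] = True                               # person 5 performs a distinctive set of actions
behavior[6] = behavior[7] = behavior[5]               # persons 6 and 7 coordinate with person 5

# Pairwise similarity: number of actions two individuals share
sims = [(i, j, int(np.sum(behavior[i] & behavior[j])))
        for i in range(n_people) for j in range(i + 1, n_people)]
counts = np.array([s for _, _, s in sims])

# (1) Fit a simple model of typical similarity; (2) flag outlying pairs
lam = counts.mean()                                   # Poisson model of shared-action counts
alpha = 0.01 / len(sims)                              # Bonferroni-corrected threshold
for i, j, s in sims:
    if stats.poisson.sf(s - 1, lam) < alpha:          # P(X >= s) under the fitted model
        print(f"pair ({i}, {j}) shares {s} actions: unlikely under independence")
```

The flagged pairs among persons 5, 6, and 7 play the role of a "tribe": their shared behavior is far beyond what the fitted model of independent individuals would predict.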


2022 ◽  
pp. 67-76
Author(s):  
Dineshkumar Bhagwandas Vaghela

The term big data has arisen from the rapid generation of data in various organizations. In big data, "big" is the buzzword: the data are so large and complex that traditional database applications are inadequate to process them. Big data are usually described by five Vs (volume, velocity, variety, variability, veracity) and can be structured, semi-structured, or unstructured. Big data analytics is the process of uncovering hidden patterns and unknown correlations and of predicting future values from large and complex data sets. This chapter covers the following topics in more detail: the history of big data and business analytics, big data analytics technologies and tools, and big data analytics uses and challenges.


2016 ◽  
Vol 39 (11) ◽  
pp. 1477-1501 ◽  
Author(s):  
Victoria Goode ◽  
Nancy Crego ◽  
Michael P. Cary ◽  
Deirdre Thornlow ◽  
Elizabeth Merwin

Researchers need to evaluate the strengths and weaknesses of data sets to choose a secondary data set to use for a health care study. This research method review informs the reader of the major issues investigators must consider when incorporating secondary data into their repertoire of potential research designs and shows the range of approaches investigators may take to answer nursing research questions in a variety of context areas. The researcher requires expertise in locating and judging data sets and in developing the complex data management skills needed to manage large numbers of records. Important considerations, such as firm knowledge of the research question supported by the conceptual framework and the selection of appropriate databases, guide the researcher in delineating the unit of analysis. Other, more complex issues for researchers to consider when conducting secondary data research include data access, management and security, and complex variable construction.


Author(s):  
Paul Rippon ◽  
Kerrie Mengersen

Learning algorithms are central to pattern recognition, artificial intelligence, machine learning, data mining, and statistical learning. The term often implies analysis of large and complex data sets with minimal human intervention. Bayesian learning has been variously described as a method of updating opinion based on new experience, updating parameters of a process model based on data, modelling and analysis of complex phenomena using multiple sources of information, posterior probabilistic expectation, and so on. In all of these guises, it has exploded in popularity over recent years.
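A small illustration of Bayesian updating in the "update opinion with new experience" sense described above: a Beta prior on a success probability combined with binomial data yields a Beta posterior in closed form. The prior parameters and data below are arbitrary.

```python
import numpy as np
from scipy import stats

alpha_prior, beta_prior = 2, 2          # mildly informative prior belief about p
successes, failures = 37, 13            # new experience: 50 observed trials

# Conjugacy: Beta(a, b) prior + binomial data -> Beta(a + successes, b + failures)
alpha_post = alpha_prior + successes
beta_post = beta_prior + failures

posterior = stats.beta(alpha_post, beta_post)
print("posterior mean:", posterior.mean())
print("95% credible interval:", posterior.interval(0.95))
```

The posterior mean sits between the prior mean (0.5) and the observed frequency (0.74), weighted by the relative amount of prior information and data; more data pulls the posterior further toward the observed frequency.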

