Characterising phase variations in MALDI-TOF data and correcting them by peak alignment

2005 ◽  
Vol 1 ◽  
pp. 117693510500100 ◽  
Author(s):  
Simon M Lin ◽  
Richard P Haney ◽  
Michael J Campa ◽  
Michael C Fitzgerald ◽  
Edward F Patz

The use of MALDI-TOF mass spectrometry as a means of analyzing the proteome has been evaluated extensively in recent years. One limitation of this technique that has impeded the development of robust data analysis algorithms is the variability in the location of protein ion signals along the x-axis. We studied technical variations of MALDI-TOF measurements in the context of proteomics profiling. By acquiring a benchmark data set with five replicates, we estimated that 76% to 85% of the total variance is due to phase variation. We devised a lobster plot, so named because of its resemblance to a lobster claw, to help detect the phase variation in replicates. We also investigated a peak alignment algorithm to remove the phase variation. This operation is analogous to the normalization step in microarray data analysis. Only after this critical step can features of biological interest be clearly revealed. With the help of principal component analysis, we demonstrated that after peak alignment, the differences among replicates are reduced. We compared this approach to peak alignment with a model-based calibration approach in which there was known information about peaks in common among all spectra. Finally, we examined the potential value at each point in an analysis pipeline of having a set of methods available that includes parametric, semiparametric and nonparametric methods; among such methods are those that benefit from the use of prior information.

2008 ◽  
Vol 06 (02) ◽  
pp. 261-282 ◽  
Author(s):  
AO YUAN ◽  
WENQING HE

Clustering is a major tool for microarray gene expression data analysis. The existing clustering methods fall mainly into two categories: parametric and nonparametric. The parametric methods generally assume a mixture of parametric subdistributions. When the mixture distribution approximately fits the true data-generating mechanism, the parametric methods perform well, but not when there is a nonnegligible deviation between them. On the other hand, the nonparametric methods, which usually do not make distributional assumptions, are robust but pay the price of efficiency loss. In an attempt to utilize the known mixture form to increase efficiency, and to drop assumptions about the unknown subdistributions to enhance robustness, we propose a semiparametric method for clustering. The proposed approach has the form of a parametric mixture, with no assumptions about the subdistributions. The subdistributions are estimated nonparametrically, with constraints imposed only on the modes. An expectation-maximization (EM) algorithm along with a classification step is invoked to cluster the data, and a modified Bayesian information criterion (BIC) is employed to guide the determination of the optimal number of clusters. Simulation studies are conducted to assess the performance and the robustness of the proposed method. The results show that the proposed method yields a reasonable partition of the data. As an illustration, the proposed method is applied to a real microarray data set to cluster genes.


2018 ◽  
Vol 8 (10) ◽  
pp. 1766 ◽  
Author(s):  
Arthur Leroy ◽  
Andy MARC ◽  
Olivier DUPAS ◽  
Jean Lionel REY ◽  
Servane Gey

Many data collected in sport science come from time-dependent phenomena. This article focuses on Functional Data Analysis (FDA), which studies longitudinal data by modelling them as continuous functions. After a brief review of several FDA methods, some useful practical tools such as Functional Principal Component Analysis (FPCA) or functional clustering algorithms are presented and compared on simulated data. Finally, the problem of detecting promising young swimmers is addressed through a curve clustering procedure on a real data set of performance progression curves. This study reveals that the fastest improvement of young swimmers generally appears before the age of 16. Moreover, several patterns of improvement are identified and the functional clustering procedure provides a useful detection tool.


2021 ◽  
Vol 28 (3) ◽  
Author(s):  
Christian Capezza ◽  
Fabio Centofanti ◽  
Antonio Lepore ◽  
Biagio Palumbo

Sensing networks nowadays provide massive amounts of data that in many applications describe curves or surfaces varying over a continuum, usually time, and thus can be suitably modelled as functional data. Their proper modelling by means of functional data analysis approaches naturally addresses new challenges that also arise in statistical process monitoring (SPM). Motivated by an industrial application, the objective of the present paper is to provide the reader with a very transparent set of steps for the SPM of functional data in real-world case studies: i) identifying a finite dimensional model for the functional data, based on functional principal component analysis; ii) estimating the unknown parameters; iii) designing control charts on the estimated parameters, in a nonparametric framework. The proposed SPM procedure is applied to a real-case study from the maritime field in monitoring CO2 emissions from real navigation data of a roll-on/roll-off passenger cruise ship, i.e., a ship designed to carry both passengers and wheeled vehicles that are driven on and off the ship on their own wheels. We show different scenarios highlighting clear and interpretable indications that can be extracted from the data set and support the detection of anomalous voyages.


2010 ◽  
Vol 9 (3) ◽  
pp. 217-226 ◽  
Author(s):  
Yolande V. Tra ◽  
Irene M. Evans

BIO2010 put forth the goal of improving the mathematical educational background of biology students. The analysis and interpretation of microarray high-dimensional data can be very challenging and is best done by a statistician and a biologist working and teaching in a collaborative manner. We set up such a collaboration and designed a course on microarray data analysis. We started using Genome Consortium for Active Teaching (GCAT) materials and Microarray Genome and Clustering Tool software and added R statistical software along with Bioconductor packages. In response to student feedback, one microarray data set was fully analyzed in class, starting from preprocessing to gene discovery to pathway analysis using the latter software. A class project was to conduct a similar analysis where students analyzed their own data or data from a published journal paper. This exercise showed the impact that filtering, preprocessing, and different normalization methods had on gene inclusion in the final data set. We conclude that this course achieved its goals to equip students with skills to analyze data from a microarray experiment. We offer our insight about collaborative teaching as well as how other faculty might design and implement a similar interdisciplinary course.


2021 ◽  
Author(s):  
Moritz Heusinger ◽  
Christoph Raab ◽  
Frank-Michael Schleif

In recent years social media became an important part of everyday life for many people. A big challenge of social media is to find posts that are interesting for the user. Many social networks like Twitter handle this problem with so-called hashtags. A user can label his own Tweet (post) with a hashtag, while other users can search for posts containing a specified hashtag. But what about finding posts which are not labeled by the creator? We provide a way of completing hashtags for unlabeled posts using classification on a novel real-world Twitter data stream. New posts are created every second, so this context is a natural fit for non-stationary data analysis. Our goal is to show how labels (hashtags) of social media posts can be predicted by stream classifiers. In particular, we employ random projection (RP) as a preprocessing step in calculating streaming models. Also, we provide a novel real-world data set for streaming analysis called NSDQ with a comprehensive data description. We show that this dataset is a real challenge for state-of-the-art stream classifiers. While RP has been widely used and evaluated in stationary data analysis scenarios, non-stationary environments are not well analyzed. In this paper, we provide a use case of RP on real-world streaming data, especially on the NSDQ dataset. We discuss why RP can be used in this scenario and how it can handle stream-specific situations like concept drift. We also provide experiments with RP on streaming data, using state-of-the-art stream classifiers like adaptive random forest and concept drift detectors. Additionally, we experimentally evaluate an online principal component analysis (PCA) approach in the same fashion as we do for RP. To obtain higher dimensional synthetic streams, we use random Fourier features (RFF) in an online manner, which allows us to increase the number of dimensions of low dimensional streams.


2007 ◽  
Vol 20 (3) ◽  
Author(s):  
J. Wim M. van Breukelen ◽  
Bart Zandbergen ◽  
Frank M.T.A. Busing

Structure and importance of work values: a comparison between three data analysis techniques. J.W.M. van Breukelen, B. Zandbergen & F.M.T.A. Busing, Gedrag & Organisatie, volume 20, September 2007, nr. 3, pp. 272-302. Work values refer to the importance people attribute to the various aspects of a job, such as work content, salary, and colleagues. Generally, studies on work values answer two questions: a) How important are the various aspects of a job for a certain sample of respondents as a whole or for certain subgroups, and b) What is the structure underlying the total set of work values under study? In this study, three data-analytic techniques are investigated by analysing the answers of 417 respondents (299 men and 118 women). The three methods were principal component analysis (PCA), multidimensional scaling (MDS), and multidimensional unfolding (MDU). Of these three, PCA and MDS give information about the structure of the work values, whereas MDU shows the importance of the various work values in a visual plot. We describe and discuss the pros and cons of these techniques using the data set mentioned above as an illustration. Our conclusion is that MDU is a welcome addition to both PCA and MDS in studies on work values. Firstly, MDU makes visible the importance of work values for the group of respondents as a whole and for subgroups, if needed. At the same time, MDU gives indications about the structure of the work values under study.


2011 ◽  
Vol 58 (4) ◽  
Author(s):  
Marcin T Schmidt ◽  
Luiza Handschuh ◽  
Joanna Zyprych ◽  
Alicja Szabelska ◽  
Agnieszka K Olejnik-Schmidt ◽  
...  

Two-color DNA microarrays are commonly used for the analysis of global gene expression. They provide information on the relative abundance of thousands of mRNAs. However, the generated data need to be normalized to minimize systematic variations so that biologically significant differences can be more easily identified. A large number of normalization procedures have been proposed and many software packages for microarray data analysis are available. Here, we have applied two normalization methods (median and loess) from two microarray data analysis software packages. They were examined using a sample data set. We found that the number of genes identified as differentially expressed varied significantly depending on the method applied. The obtained results, i.e. lists of differentially expressed genes, were consistent only when we used median normalization methods. Loess normalization implemented in the two software packages provided less coherent and for some probes even contradictory results. In general, our results provide an additional piece of evidence that the normalization method can profoundly influence final results of DNA microarray-based analysis. The impact of the normalization method depends greatly on the algorithm employed. Consequently, the normalization procedure must be carefully considered and optimized for each individual data set.


Author(s):  
Prince Nathan S

Abstract: Cryptocurrency has grown drastically in recent years, and Bitcoin (BTC) is a very popular cryptocurrency that is now used in most sectors for trading, transactions, bookings, etc. In this paper, we aim to predict the change in bitcoin prices by using machine learning techniques on data from Investing.com. We interpret the output and accuracy rate using various machine learning models. To decide whether to buy or sell bitcoin, we performed exploratory data analysis on a one-year data set and predicted the change over the next 5 days using machine learning models such as Logistic Regression, Logistic Regression with PCA (Principal Component Analysis), and a Neural Network. Keywords: Data Science, Machine Learning, Regression, PCA, Neural Network, Data Analysis


2021 ◽  
Vol 8 (4) ◽  
pp. 372-384
Author(s):  
Sarada Ghosh ◽  
Guruprasad Samanta ◽  
Manuel De la Sen ◽  
...  

DNA microarray technology applied to a biological data set can monitor the expression levels of thousands of genes simultaneously. Microarray data analysis is important in phenotype classification of diseases. In this work, the computational part predicts the tendency towards mortality using different classification techniques by identifying features from the high dimensional dataset. We have analyzed breast cancer transcriptional genomic data comprising 1554 transcripts captured from 272 samples. This work presents effective methods for gene classification using Logistic Regression (LR), Random Forest (RF) and Decision Tree (DT), and constructs a classifier with a higher accuracy than one using all features together. The performance of these methods is also compared with a dimension reduction method, namely Principal Component Analysis (PCA). The methods of feature reduction with RF, LR and DT provide better performance than PCA. It is observed that both LR and RF identify the TYMP, ERS1, C-MYB and TUBA1a genes. However, some features corresponding to genes such as ARID4B, DNMT3A, TOX3, RGS17 and PNLIP, which play a significant role in breast cancer, are identified only by the LR method. The simulation is based on R software.

