JACOBI4 software for multivariate analysis of biological data

2019 ◽  
Author(s):  
Denis Polunin ◽  
Irina Shtaiger ◽  
Vadim Efimov

Abstract Biologists increasingly have to deal with objects that have non-numeric descriptions: texts (e.g. genetic sequences or even whole genomes), graphs, images, etc. There may even be no variables or descriptions at all, when the variability of objects is defined by a similarity matrix. It is also possible to have too many variables (e.g. millions of variables are reachable in mass spectrometry or genome research). In this case it is necessary to switch to object similarity matrices, which drastically reduces the dimensionality to hundreds or thousands. It is the software developer’s responsibility to keep these use cases in mind and to provide means for working with such data instead of shifting the problem to the users. Software should be more convenient for them and allow solving a wider range of problems with a fairly simple mathematical apparatus. In particular, principal component analysis (PCA) is rather popular among biologists. But the necessity of variables is an illusion. It is enough to have a matrix of Euclidean distances between objects and apply the method of principal coordinates (PCo), or multidimensional scaling (MDS) for a dissimilarity matrix [1].

In the late 1970s B. Efron proposed generating a set of new samples from the empirical distribution function (EDF) of the source sample, used as a model of the population distribution, in order to obtain confidence estimates. He called it “bootstrap” [2]. For statistical software developers this primarily means that PCo, MDS, and bootstrap should be implemented. Furthermore, the use of bootstrap greatly increases the number of repetitions of the data analysis (from hundreds to millions of times), which is impossible to do in interactive mode. Therefore the part of the analysis that requires bootstrap should be written as a script in its entirety, and further user interaction should be eliminated. Obviously, this process can be done efficiently in parallel.

There is a multitude of tools for doing this, ranging from scripting languages like R or Python to specialized software packages like PAST, CANOCO, Chemostat, STATISTICA, and MATLAB. Researchers who are not versed in software development tend to use tools like PAST, even if these do not cover all their needs, including the automation of frequently performed tasks. However, automatic analysis is a key element for the upcoming era of bootstrap analysis.

We developed a simple and convenient package, JACOBI4, which allows researchers without programming experience to automate multidimensional statistical analysis. The package and the methods implemented in it can be useful in studies of both medical data (gene expression for various diseases) and biological data (regularities of molecular sequence variability). The use of JACOBI4 is in no way limited to these examples. The package can be used directly, by taking already developed scripts and editing them to fit one’s own needs. JACOBI4 is freely available at [w1]. There are also articles available in which JACOBI4 is used to process real-world data, as well as supplemental files containing JACOBI4 scripts and the data for them.
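The principal coordinates idea described in this abstract can be illustrated with a minimal sketch. This is not JACOBI4 code: the function names `double_center` and `leading_coordinate` and the three-object distance matrix are assumptions made for illustration, and power iteration is used here only as a simple stand-in for a full eigendecomposition. The sketch assumes the distances are Euclidean, so that the double-centred matrix has no negative eigenvalues.

```python
import math


def double_center(d):
    """Convert a distance matrix into the Gram matrix B = -1/2 * J * D^2 * J."""
    n = len(d)
    sq = [[d[i][j] ** 2 for j in range(n)] for i in range(n)]
    row_mean = [sum(r) / n for r in sq]
    grand_mean = sum(row_mean) / n
    return [
        [-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + grand_mean) for j in range(n)]
        for i in range(n)
    ]


def leading_coordinate(b, iterations=200):
    """Power iteration for the largest eigenpair of B; returns sqrt(lambda) * v."""
    n = len(b)
    v = [float(i + 1) for i in range(n)]  # arbitrary non-constant start vector
    eigenvalue = 0.0
    for _ in range(iterations):
        w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
        eigenvalue = norm  # converges to the largest eigenvalue when B is PSD
    return [math.sqrt(eigenvalue) * x for x in v]


# Hypothetical example: three objects lying on a line at positions 0, 1 and 3.
distances = [
    [0.0, 1.0, 3.0],
    [1.0, 0.0, 2.0],
    [3.0, 2.0, 0.0],
]
coords = leading_coordinate(double_center(distances))
print(coords)
```

Because the three example objects are collinear, a single coordinate axis reproduces every pairwise distance; no variables were needed, only the distance matrix.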

2019 ◽  
Vol 51 (2) ◽  
pp. 225-227
Author(s):  
Gianluca Nardi ◽  
Jiří Háva

The first infestation of a museum entomological collection in Italy by Anthrenus (Anthrenops) coloratus Reitter, 1881 is recorded; it was detected in Rome (Lazio Region) in 2014. General distribution and biological data on this pest are summarized.


2021 ◽  
Vol 80 (2) ◽  
Author(s):  
Mostafa Ebadi ◽  
Rosa Eftekharian

Senecio vulgaris L., an annual herb belonging to the Asteraceae, is widely distributed in different regions of the world. There is no information on the intraspecific variations of the morphological and molecular features of this species. In the present investigation, we studied the morphological and genetic diversity of 81 accessions of S. vulgaris collected from 10 geographical populations. Eleven inter simple sequence repeat (ISSR) primers were used for the examination of genetic variations among the populations. Analysis of molecular variance (AMOVA) and GST analyses revealed significant differences among the investigated populations. A significant correlation between genetic distance and geographical distance was revealed by the Mantel test. However, reticulation analysis indicated the occurrence of gene flow among most of the populations studied. Principal component analysis (PCA) plot showed that the number of capitula, length of the cauline leaf and plant height were the most variable morphological characters. Principal coordinates analysis (PCoA) plot revealed two groups of populations, according to molecular and morphological data. The results suggested the existence of possible intraspecific taxonomic ranks within this species.
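For readers unfamiliar with the Mantel test mentioned in this abstract, the following is a minimal sketch of a permutation version of the test. It is not the authors' analysis: the function names, the number of permutations, and the two small matrices are hypothetical, and the statistic used here is the Pearson correlation between the upper triangles of the two distance matrices.

```python
import math
import random


def upper_triangle(m):
    """Flatten the entries above the diagonal of a square matrix."""
    n = len(m)
    return [m[i][j] for i in range(n) for j in range(i + 1, n)]


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def mantel(a, b, permutations=999, seed=0):
    """Return (r, p): the matrix correlation and a one-sided permutation p-value."""
    rng = random.Random(seed)
    n = len(a)
    flat_a = upper_triangle(a)
    observed = pearson(flat_a, upper_triangle(b))
    hits = 0
    for _ in range(permutations):
        order = list(range(n))
        rng.shuffle(order)
        # Permute rows and columns of b together, keeping it a valid distance matrix.
        permuted = [[b[order[i]][order[j]] for j in range(n)] for i in range(n)]
        if pearson(flat_a, upper_triangle(permuted)) >= observed:
            hits += 1
    return observed, (hits + 1) / (permutations + 1)


# Hypothetical geographic and genetic distances for four populations.
geographic = [[0, 1, 2, 3], [1, 0, 1, 2], [2, 1, 0, 1], [3, 2, 1, 0]]
genetic = [[2 * value for value in row] for row in geographic]
r, p = mantel(geographic, genetic)
print(r, p)
```

Rows and columns are permuted together because the entries of a distance matrix are not independent observations, which is why an ordinary correlation test is not appropriate here.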


Author(s):  
Jasper John A. Obico ◽  
Julie F. Barcelona ◽  
Vincent Bonhomme ◽  
Marie Hale ◽  
Pieter B. Pelser

Tetrastigma loheri (Vitaceae) is a vine species native to Borneo and the Philippines. Because it is a commonly encountered forest species in the Philippines, T. loheri is potentially suitable for studying patterns of genetic diversity and connectivity among fragmented forest ecosystems in various parts of this country. However, previous research suggests that T. loheri is part of a species complex in the Philippines (i.e. the T. loheri s. l. complex) that potentially also contains Philippine plants identified as T. diepenhorstii, T. philippinense, T. stenophyllum, and T. trifoliolatum. This uncertainty about its taxonomic delimitation can make it challenging to draw conclusions that are relevant to conservation from genetic studies using this species. Here, we tested the hypothesis that T. loheri s. l. is composed of more than one species in the Philippines. For this, we used generalized mixed Yule coalescent (GMYC) and Poisson tree process (PTP) species delimitation models to identify clades within DNA sequence phylogenies of T. loheri s. l. that might constitute species within this complex. Although these methods identified several putative species, these are statistically poorly supported, and subsequent random forest analyses using a geometric morphometric leaf shape dataset and several other vegetative characters did not result in the identification of characters that can be used to discriminate these putative species morphologically. Furthermore, the results of principal component and principal coordinates analyses of these data suggest the absence of morphological discontinuities within the species complex. Under a unified species concept that uses phylogenetic and morphological distinction as operational criteria for species recognition, we therefore conclude that the currently available data do not support recognizing multiple species in the T. loheri s. l. complex. This implies that T. loheri is best considered as a single, morphologically variable species when used for studying patterns of genetic diversity and connectivity in the Philippines.


2018 ◽  
Vol 17 ◽  
pp. 117693511877108 ◽  
Author(s):  
Min Wang ◽  
Steven M Kornblau ◽  
Kevin R Coombes

Principal component analysis (PCA) is one of the most common techniques in the analysis of biological data sets, but applying PCA raises 2 challenges. First, one must determine the number of significant principal components (PCs). Second, because each PC is a linear combination of genes, it rarely has a biological interpretation. Existing methods to determine the number of PCs are either subjective or computationally extensive. We review several methods and describe a new R package, PCDimension, that implements additional methods, the most important being an algorithm that extends and automates a graphical Bayesian method. Using simulations, we compared the methods. Our newly automated procedure is competitive with the best methods when considering both accuracy and speed and is the most accurate when the number of objects is small compared with the number of attributes. We applied the method to a proteomics data set from patients with acute myeloid leukemia. Proteins in the apoptosis pathway could be explained using 6 PCs. By clustering the proteins in PC space, we were able to replace the PCs by 6 “biological components,” 3 of which could be immediately interpreted from the current literature. We expect this approach combining PCA with clustering to be widely applicable.
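To make the problem of choosing the number of components concrete, here is a minimal sketch of one widely used rule, the broken-stick model. It is illustrative only: the code is not taken from the PCDimension package, it does not implement the automated Bayesian procedure described in the abstract, and the function name and eigenvalues are hypothetical.

```python
def broken_stick_count(eigenvalues):
    """Count leading components whose variance share exceeds the broken-stick expectation."""
    p = len(eigenvalues)
    total = sum(eigenvalues)
    ordered = sorted(eigenvalues, reverse=True)
    retained = 0
    for k, value in enumerate(ordered, start=1):
        # Expected share of the k-th largest piece of a stick broken at random into p pieces.
        expected = sum(1.0 / i for i in range(k, p + 1)) / p
        if value / total <= expected:
            break
        retained += 1
    return retained


example = [5.0, 2.0, 1.0, 1.0, 1.0]  # hypothetical PCA eigenvalues
print(broken_stick_count(example))  # prints 1
```

In this example the first component explains 50% of the variance against an expected 45.7%, while the second explains 20% against an expected 25.7%, so only one component is retained.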


Plants ◽  
2019 ◽  
Vol 8 (5) ◽  
pp. 116 ◽  
Author(s):  
Fiore ◽  
Mercati ◽  
Spina ◽  
Blangiforti ◽  
Venora ◽  
...  

During the twentieth century, the widespread use of modern wheat cultivars drastically reduced the cultivation of ancient landraces, which nowadays are confined to niche cultivation areas. Several durum wheat landraces adapted to the extreme environments of the Mediterranean region are still being cultivated in Sicily, Italy. Detailed knowledge of the genetic diversity of this germplasm could lay the basis for its efficient management in breeding programs, for a wide range of traits. The aim of the present study was to characterize a collection of durum wheat landraces from Sicily, using single nucleotide polymorphism (SNP) markers, together with agro-morphological, phenological and quality-related traits. Two modern cultivars, Simeto and Claudio, and the hexaploid landrace Cuccitta were used as outgroups. Cluster analysis and Principal Coordinates Analysis (PCoA) allowed us to identify four main clusters across the analyzed germplasm, one of which included only historical and modern varieties. Likewise, structure analysis was able to distinguish the ancient varieties from the others, grouping the entries in seven cryptic genetic clusters. Furthermore, a Principal Component Analysis (PCA) was able to separate the modern testers from the ancient germplasm. This approach was useful to classify and evaluate Sicilian ancient wheat germplasm, supporting its safeguarding and providing a genetic fingerprint that is necessary for avoiding commercial fraud and for sustaining the economic profits of farmers who cultivate landraces.


1977 ◽  
Vol 34 (8) ◽  
pp. 1196-1206 ◽  
Author(s):  
Grant A. Gardner

Canonical correlation, cluster, multiple regression, factor, and principal component analyses were used to examine zooplankton and hydrographic data over the period of unusual fluctuations in the overwintering population sizes of Calanus plumchrus, C. marshallae, and C. pacificus in the Strait of Georgia. Additional hydrographic data were examined for relationships between physical and biological data 3 and 6 mo out of phase. Analysis indicates a recent subtle temperature and salinity shift of uncertain biological significance. Canonical correlation and principal component analyses suggest that 15% of the variance in the zooplankton is related to a temporal trend paralleling that seen in the physical characteristics of the environment. Based on the factor analysis, C. plumchrus, Pseudocalanus minutus, Acartia longiremis, Sagitta elegans, Euphausia pacifica, Limacina spp., and Oithona spinirostris are suggested as "key" species for future zooplankton monitoring programs. Statistically supported species selection can reduce the time and expense of sorting zooplankton samples without an equivalent reduction of information yield. Key words: zooplankton, populations, Calanus, statistical analysis, temporal trends


2019 ◽  
Author(s):  
Philippe Boileau ◽  
Nima S. Hejazi ◽  
Sandrine Dudoit

Abstract Motivation Statistical analyses of high-throughput sequencing data have re-shaped the biological sciences. In spite of myriad advances, recovering interpretable biological signal from data corrupted by technical noise remains a prevalent open problem. Several classes of procedures, among them classical dimensionality reduction techniques and others incorporating subject-matter knowledge, have provided effective advances; however, no procedure currently satisfies the dual objectives of recovering stable and relevant features simultaneously. Results Inspired by recent proposals for making use of control data in the removal of unwanted variation, we propose a variant of principal component analysis, sparse contrastive principal component analysis, that extracts sparse, stable, interpretable, and relevant biological signal. The new methodology is compared to competing dimensionality reduction approaches through a simulation study as well as via analyses of several publicly available protein expression, microarray gene expression, and single-cell transcriptome sequencing datasets. Availability A free and open-source software implementation of the methodology, the scPCA R package, is made available via the Bioconductor Project. Code for all analyses presented in the paper is also available via GitHub.


2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Rong Zhu ◽  
Yong Wang ◽  
Jin-Xing Liu ◽  
Ling-Yun Dai

Abstract Background Identifying lncRNA-disease associations not only helps to better comprehend the underlying mechanisms of various human diseases at the lncRNA level but also speeds up the identification of potential biomarkers for disease diagnoses, treatments, prognoses, and drug response predictions. However, as the amount of archived biological data continues to grow, it has become increasingly difficult to detect potential human lncRNA-disease associations from these enormous biological datasets using traditional biological experimental methods. Consequently, developing new and effective computational methods to predict potential human lncRNA diseases is essential. Results Using a combination of incremental principal component analysis (IPCA) and random forest (RF) algorithms and by integrating multiple similarity matrices, we propose a new algorithm (IPCARF) based on integrated machine learning technology for predicting lncRNA-disease associations. First, we used two different models to compute a semantic similarity matrix of diseases from a directed acyclic graph of diseases. Second, a characteristic vector for each lncRNA-disease pair is obtained by integrating disease similarity, lncRNA similarity, and Gaussian nuclear similarity. Then, the best feature subspace is obtained by applying IPCA to decrease the dimension of the original feature set. Finally, we train an RF model to predict potential lncRNA-disease associations. The experimental results show that the IPCARF algorithm effectively improves the AUC metric when predicting potential lncRNA-disease associations. Before the parameter optimization procedure, the AUC value predicted by the IPCARF algorithm under 10-fold cross-validation reached 0.8529; after selecting the optimal parameters using the grid search algorithm, the predicted AUC of the IPCARF algorithm reached 0.8611. 
Conclusions We compared IPCARF with the existing LRLSLDA, LRLSLDA-LNCSIM, TPGLDA, NPCMF, and ncPred prediction methods, which have shown excellent performance in predicting lncRNA-disease associations. The compared results of 10-fold cross-validation procedures show that the predictions of the IPCARF method are better than those of the other compared methods.
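The AUC values quoted in this abstract summarize how well predicted scores rank true associations above non-associations. As a minimal sketch of how such a value can be computed (not the IPCARF implementation; the function name, labels, and scores below are hypothetical), the AUC equals the fraction of positive/negative pairs in which the positive example receives the higher score, with ties counted as half:

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties contribute half a win
    return wins / (len(positives) * len(negatives))


# Hypothetical labels (1 = known association) and classifier scores.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(auc(labels, scores))  # prints 0.75
```

The pairwise form is quadratic in the number of examples, which is acceptable for a small illustration; rank-based formulas give the same value more efficiently on large datasets.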


Author(s):  
Hang Wei ◽  
Yong Xu ◽  
Bin Liu

Abstract Accumulated research has revealed that Piwi-interacting RNAs (piRNAs) regulate the development of germ and stem cells, and that they are closely associated with the progression of many diseases. As the number of detected piRNAs is increasing rapidly, it is important to computationally identify new piRNA-disease associations at low cost and to provide candidate piRNA targets for disease treatment. However, it is a challenging problem to learn effective association patterns from the positive piRNA-disease associations and the large number of unknown piRNA-disease pairs. In this study, we proposed a computational predictor called iPiDi-PUL to identify piRNA-disease associations. iPiDi-PUL extracted the features of piRNA-disease associations from three biological data sources: piRNA sequence information, disease semantic terms and the available piRNA-disease association network. Principal component analysis (PCA) was then performed on these features to extract the key features. The training datasets were constructed based on known positive associations and negative associations selected from the unknown pairs. Various random forest classifiers trained with these different training sets were merged to give the predictive results via an ensemble learning approach. Finally, the web server of iPiDi-PUL was established at http://bliulab.net/iPiDi-PUL to help researchers explore the associated diseases for newly discovered piRNAs.


2018 ◽  
Vol 76 (3) ◽  
pp. 91-98
Author(s):  
Josipa Ferri ◽  
Karmen Bartulin ◽  
Frane Škeljo

Abstract Sagittal otoliths of juveniles of eight species: Boops boops, Diplodus vulgaris, Diplodus puntazzo, Sarpa salpa (family Sparidae), Liza ramada, Liza aurata (family Mugilidae), Atherina boyeri, Atherina hepsetus (family Atherinidae) were analysed and compared using descriptive morphological characters and morphometric indices. The noticeable differences among the otoliths of the investigated species are in their overall shape, margins (i.e. irregular, sinuate or crenate) and anterior region. Otolith shape varied from elliptic to pentagonal in sparids, from elliptic to rectangular in mugilids, and was elliptic in the two atherinids. The aspect ratio (OW/OL), the ratio of the cauda length to the sulcus length (CL/SL) and the ratio of the ostium length to the sulcus length (OSL/SL) were calculated for all species. The otolith contour was described using wavelets. The Canonical Analysis of Principal Coordinates (CAP) gave an overview of the otolith shape differentiation among the juveniles of the eight species. Using the wavelet coefficients, the first canonical axis (CAP1) explained 58.1% of the variation among species and the second canonical axis (CAP2) 25.2%.

