SigTools: Exploratory Visualization for Genomic Signals

2021 ◽  
Author(s):  
Shohre Masoumi ◽  
Maxwell W. Libbrecht ◽  
Kay C. Wiese

Motivation: With the advancement of sequencing technologies, genomic data sets are constantly being expanded by high volumes of different data types. One recently introduced data type in genomic science is genomic signals, which are usually short-read coverage measurements over the genome. An example of genomic signals is epigenomic marks, which are utilized to locate functional and nonfunctional elements in genome annotation studies. To understand and evaluate the results of such studies, one needs to understand and analyze the characteristics of the input data. Results: SigTools is an R-based genomic signals visualization package developed with two objectives: 1) to facilitate genomic signals exploration, in order to uncover insights for later model training, refinement, and development, by including distribution and autocorrelation plots; 2) to enable genomic signals interpretation by including correlation and aggregation plots. Moreover, SigTools also provides text-based descriptive statistics of the given signals, which can be practical when developing and evaluating learning models. We also include results from two case studies. The first examines several previously studied genomic signals called histone modifications. This use case demonstrates how SigTools can help satisfy scientists' curiosity in exploring and establishing recognized data sets. The second use case examines a data set of novel chromatin state features, which are novel genomic signals generated by a learning model. This use case demonstrates how SigTools can assist in exploring the characteristics and behavior of novel signals towards their interpretation. In addition, our corresponding web application, SigTools-Shiny, extends the accessibility of these modules to people who are more comfortable working with graphical user interfaces than with command-line tools.
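The autocorrelation plots mentioned above summarize how similar a coverage track is to a shifted copy of itself. As a rough illustration of what a lag-k autocorrelation computes over a genomic signal — a plain-Python sketch, not SigTools' actual R implementation; the function name and toy track are ours:

```python
def autocorrelation(signal, max_lag):
    """Lag-k autocorrelation of a coverage vector: the (normalized)
    covariance of the signal with itself shifted by k positions."""
    n = len(signal)
    mean = sum(signal) / n
    var = sum((x - mean) ** 2 for x in signal)
    acf = []
    for k in range(1, max_lag + 1):
        cov = sum((signal[i] - mean) * (signal[i + k] - mean)
                  for i in range(n - k))
        acf.append(cov / var)
    return acf

# A toy coverage track that alternates every bin: strongly negative
# correlation at lag 1, strongly positive correlation at lag 2.
track = [1, 2, 1, 2, 1, 2, 1, 2]
lags = autocorrelation(track, 2)
```

A periodic signal like this shows up in an autocorrelation plot as oscillating positive and negative lags, which is the kind of structure the package is meant to expose before model training.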

2020 ◽  
Author(s):  
Annika Tjuka ◽  
Robert Forkel ◽  
Johann-Mattis List

Psychologists and linguists have collected a great diversity of data on word and concept properties. In psychology, many studies accumulate norms and ratings, such as word frequencies or age of acquisition, often for a large number of words. Linguistics, on the other hand, provides valuable insights into relations between word meanings. We present a collection of such data sets for norms, ratings, and relations that cover different languages: ‘NoRaRe.’ To enable a comparison between the diverse data types, we established workflows that facilitate the expansion of the database. A web application allows convenient access to the data (https://digling.org/norare/). Furthermore, a software API ensures consistent data curation by providing tests to validate the data sets. The NoRaRe collection is linked to the database curated by the Concepticon project (https://concepticon.clld.org), which offers a reference catalog of unified concept sets. The link between words in the data sets and the Concepticon concept sets makes cross-linguistic comparison possible. In three case studies, we test the validity of our approach, the accuracy of our workflow, and the applicability of our database. The results indicate that the NoRaRe database can be applied to the study of word properties across multiple languages. The data can be used by psychologists and linguists to benefit from the knowledge rooted in both research disciplines.
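The linking idea can be pictured as a join of two property tables on a shared concept set identifier. A minimal sketch in Python — the IDs, glosses, and values below are invented for illustration and are not real NoRaRe or Concepticon entries:

```python
# Two hypothetical data sets keyed by a shared Concepticon-style
# concept set ID (all numbers and glosses are made up).
ratings = {1212: {"gloss": "HAND", "aoa_rating": 2.1},
           906:  {"gloss": "MOON", "aoa_rating": 3.4}}
frequencies = {1212: {"gloss": "HAND", "freq_per_million": 431.0},
               700:  {"gloss": "STONE", "freq_per_million": 58.2}}

# Inner join on the concept set ID: only concepts present in both
# collections can be compared across data sets (and thus across languages).
linked = {cid: {**ratings[cid], **frequencies[cid]}
          for cid in ratings.keys() & frequencies.keys()}
```

Keying everything to one reference catalog is what makes data sets collected in different disciplines and languages comparable at all; the join itself is trivial once the identifiers are unified.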


2020 ◽  
pp. 958-971
Author(s):  
Marcel Ramos ◽  
Ludwig Geistlinger ◽  
Sehyun Oh ◽  
Lucas Schiffer ◽  
Rimsha Azhar ◽  
...  

PURPOSE Investigations of the molecular basis for the development, progression, and treatment of cancer increasingly use complementary genomic assays to gather multiomic data, but management and analysis of such data remain complex. The cBioPortal for cancer genomics currently provides multiomic data from > 260 public studies, including The Cancer Genome Atlas (TCGA) data sets, but integration of different data types remains challenging and error prone for computational methods and tools using these resources. Recent advances in data infrastructure within the Bioconductor project enable a novel and powerful approach to creating fully integrated representations of these multiomic, pan-cancer databases. METHODS We provide a set of R/Bioconductor packages for working with TCGA legacy data and cBioPortal data, with special considerations for loading time; efficient representations in and out of memory; analysis platform; and an integrative framework, such as MultiAssayExperiment. Large methylation data sets are provided through out-of-memory data representation to provide responsive loading times and analysis capabilities on machines with limited memory. RESULTS We developed the curatedTCGAData and cBioPortalData R/Bioconductor packages to provide integrated multiomic data sets from the TCGA legacy database and the cBioPortal web application programming interface using the MultiAssayExperiment data structure. This suite of tools provides coordination of diverse experimental assays with clinicopathological data with minimal data management burden, as demonstrated through several greatly simplified multiomic and pan-cancer analyses. CONCLUSION These integrated representations enable analysts and tool developers to apply general statistical and plotting methods to extensive multiomic data through user-friendly commands and documented examples.


Author(s):  
Evgeniy Meyke

Complex projects that collect, curate and analyse biodiversity data are often presented with the challenge of accommodating diverse data types, various curation and output workflows, and evolving project logistics that require rapid changes in the applications and data structures. At the same time, sustainability concerns and maintenance overheads pose a risk to the long-term viability of such projects. We advocate the use of flexible, multiplatform tools that adapt to operational, day-to-day challenges while providing a robust, cost-efficient, and maintainable framework that serves the needs of data collectors, managers and users. EarthCape is a highly versatile platform for managing biodiversity research and collections data, associated molecular laboratory data (Fig. 1), multimedia, structured ecological surveys and monitoring schemes, and more. The platform includes a fully functional Windows client as well as a web application. The data are stored in the cloud or on-premises and can be accessed by users with various access and editing rights. Ease of customisation (making changes to user interface and functionality) is critical for most environments that deal with operational research processes. For active researchers and curators, there is rarely time to wait for a cycle of development that follows a change or feature request. In EarthCape, most changes to the default setup can be implemented by the end users with minimum effort and require no programming skills. High flexibility and a range of customisation options are complemented by mapping to the Darwin Core standard and integration with the GBIF, Geolocate, GenBank, and Biodiversity Heritage Library APIs. The system is currently used daily for rapid data entry, digitization and sample tracking by such organisations as Imperial College, the University of Cambridge, the University of Helsinki, and the University of Oxford.
Being an operational data entry and retrieval tool, EarthCape sits at the bottom of the Virtual Research Environments ecosystem. It is not software for building data repositories, but rather a very focused tool falling under the "back office" software category. Routine label printing, laboratory notebook maintenance, rapid data entry set-up, and other relatively loaded user interfaces make use of an industry-standard relational database back end. This opens a wide scope for IT designers to implement desired integrations within their institutional infrastructure. APIs and developer access to core EarthCape libraries, for building one's own applications and modules, are under development. Basic data visualisation (charts, pivots, dashboards), mapping (a full-featured desktop GIS module), and data outputs (report and label designer) are tailored not only to research analyses but also to managing logistics and communication when working on (data) papers. The presentation will focus on the software platform, featuring the most prominent use cases from two areas: ecological research (managing a complex network data digitization project) and museum collections management (herbarium and insect collections).


2016 ◽  
Vol 12 (4) ◽  
pp. 1-19
Author(s):  
Anne Tchounikine ◽  
Maryvonne Miquel ◽  
Usman Ahmed

In this paper, the authors propose an approach and tools to evaluate the performance and assess the effectiveness of a model in the field of dynamic cubing. Experimental evaluation, on the one hand, allows observing the behavior and performance of the solution, while on the other hand it lets one compare the results with those of competing solutions. The authors' proposal includes an experimental workflow based on a set of configuration parameters to characterize the inputs (data sets, query sets and algorithm input parameters) and a set of metrics to analyze and qualify the output (performance and behavior metrics) of the solution. They have identified a number of useful tools necessary to develop an experimental evaluation strategy. These monitoring tools allow elaborating the execution scenarios, collecting output metrics, and storing and analyzing them online in real time as well as later in offline mode. Using a use-case model, the authors show that the framework and the proposed environment help carry out a rigorous experimental evaluation of a dynamic cubing solution.


2020 ◽  
Vol 5 (3) ◽  
pp. 180-188
Author(s):  
Malika Adigezalova

The article is devoted to the features of female types in the tragedies of Guseyn Javid, one of the most significant playwrights of the XX century. The article analyses and compares the characteristic features and behavior of the female figures in the author's literary works «Mother» (Selma, Ismet), «Maral» (Maral, Humay), «Afet» (Afet, Alagoz), and «Siyavush» (Farangiz, Sudaba). The article is grounded in the creative works of G. Javid, where special attention is drawn to several types of female characters, among which the types of a traditional eastern woman are most vividly represented.


2021 ◽  
Vol 29 ◽  
pp. 115-124
Author(s):  
Xinlu Wang ◽  
Ahmed A.F. Saif ◽  
Dayou Liu ◽  
Yungang Zhu ◽  
Jon Atli Benediktsson

BACKGROUND: DNA sequence alignment is one of the most fundamental and important operations for identifying which gene family may contain a given sequence; pattern matching for DNA sequences has been a fundamental issue in biomedical engineering, biotechnology and health informatics. OBJECTIVE: To solve this problem, this study proposes an optimal multi-pattern matching algorithm with wildcards for DNA sequences. METHODS: The proposed method packs the patterns and a sliding window of text, and the window slides along the given packed text, matching against the stored packed patterns. RESULTS: Three data sets are used to test the performance of the proposed algorithm, which was seen to be more efficient than its competitors because its operations are close to machine language. CONCLUSIONS: Theoretical analysis and experimental results both demonstrate that the proposed method outperforms the state-of-the-art methods and is especially effective for DNA sequences.
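The abstract does not spell out the authors' packing scheme, but the general bit-parallel idea — packing a pattern into a machine word so each text character costs only a few word-level operations — can be sketched with the classic Shift-And algorithm, extended here so a '?' wildcard matches any base. This is an illustrative stand-in under our own naming, not the paper's algorithm:

```python
def shift_and_search(text, pattern, wildcard="?"):
    """Bit-parallel Shift-And search; '?' in the pattern matches any symbol.
    Returns the start index of every occurrence of pattern in text."""
    m = len(pattern)
    # One bitmask per alphabet symbol: bit j is set if pattern[j] matches
    # that symbol (wildcard positions are set in every symbol's mask).
    masks = {c: 0 for c in set(text)}
    for j, p in enumerate(pattern):
        if p == wildcard:
            for c in masks:
                masks[c] |= 1 << j
        elif p in masks:
            masks[p] |= 1 << j
    state, hits = 0, []
    for i, c in enumerate(text):
        # Extend every partial match by one character in O(1) word ops.
        state = ((state << 1) | 1) & masks.get(c, 0)
        if state & (1 << (m - 1)):      # bit m-1 set: full pattern matched
            hits.append(i - m + 1)
    return hits

hits = shift_and_search("ACGTACGT", "A?G")
```

Because the whole match state lives in one integer, the inner loop is a shift, an OR, and an AND per character, which is why such methods run close to machine speed; Python integers are arbitrary precision, so patterns longer than a hardware word still work here, at some cost.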


2021 ◽  
Vol 13 (9) ◽  
pp. 4648
Author(s):  
Rana Muhammad Adnan ◽  
Kulwinder Singh Parmar ◽  
Salim Heddam ◽  
Shamsuddin Shahid ◽  
Ozgur Kisi

The accurate estimation of suspended sediments (SSs) carries significance in determining the volume of dam storage, river carrying capacity, pollution susceptibility, soil erosion potential, aquatic ecological impacts, and the design and operation of hydraulic structures. The presented study proposes a new method for accurately estimating daily SSs using antecedent discharge and sediment information. The novel method is developed by hybridizing the multivariate adaptive regression spline (MARS) and the K-means clustering algorithm (MARS–KM). The proposed method's efficacy is established by comparing its performance with the adaptive neuro-fuzzy system (ANFIS), MARS, and M5 tree (M5Tree) models in predicting SSs at two stations situated on the Yangtze River of China, according to three assessment measures: RMSE, MAE, and NSE. Two modeling scenarios are employed; data are divided 50–50% for model training and testing in the first scenario, and the training and test data sets are swapped in the second scenario. At Guangyuan Station, MARS–KM improved on the ANFIS, MARS, and M5Tree methods in terms of RMSE by 39%, 30%, and 18% in the first scenario and by 24%, 22%, and 8% in the second scenario, respectively, while at Beibei Station the improvement over ANFIS, MARS, and M5Tree was 34%, 26%, and 27% in the first scenario and 7%, 16%, and 6% in the second scenario, respectively. Additionally, the MARS–KM models provided much more satisfactory estimates using only discharge values as inputs.
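The three assessment measures named above are standard goodness-of-fit formulas. For reference, a plain-Python sketch (function and variable names are ours; the observed/simulated values are invented toy numbers, not data from the study):

```python
def rmse(obs, sim):
    """Root mean square error."""
    return (sum((o - s) ** 2 for o, s in zip(obs, sim)) / len(obs)) ** 0.5

def mae(obs, sim):
    """Mean absolute error."""
    return sum(abs(o - s) for o, s in zip(obs, sim)) / len(obs)

def nse(obs, sim):
    """Nash-Sutcliffe efficiency: 1 is a perfect fit; 0 means the model
    predicts no better than the mean of the observations."""
    mean_obs = sum(obs) / len(obs)
    sse = sum((o - s) ** 2 for o, s in zip(obs, sim))
    var = sum((o - mean_obs) ** 2 for o in obs)
    return 1 - sse / var

# Toy daily sediment loads: observed vs. model output.
observed = [10.0, 12.0, 9.0, 14.0]
simulated = [11.0, 11.0, 10.0, 13.0]
```

Lower RMSE and MAE and an NSE closer to 1 indicate a better fit, which is the sense in which the percentage improvements above are reported.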


2021 ◽  
pp. 016555152199863
Author(s):  
Ismael Vázquez ◽  
María Novo-Lourés ◽  
Reyes Pavón ◽  
Rosalía Laza ◽  
José Ramón Méndez ◽  
...  

Current research has evolved in such a way that scientists must not only adequately describe the algorithms they introduce and the results of their application, but also ensure the possibility of reproducing the results and comparing them with those obtained through other approaches. In this context, public data sets (sometimes shared through repositories) are one of the most important elements for the development of experimental protocols and test benches. This study analysed a significant number of CS/ML (Computer Science/Machine Learning) research data repositories and data sets and detected some limitations that hamper their utility. In particular, we identify and discuss the following demanding functionalities for repositories: (1) building customised data sets for specific research tasks, (2) facilitating the comparison of different techniques using dissimilar pre-processing methods, (3) ensuring the availability of software applications to reproduce the pre-processing steps without using the repository functionalities and (4) providing protection mechanisms for licencing issues and user rights. To show the introduced functionality, we created the STRep (Spam Text Repository) web application, which implements our recommendations adapted to the field of spam text repositories. In addition, we launched an instance of STRep at the URL https://rdata.4spam.group to facilitate understanding of this study.


2019 ◽  
Author(s):  
Patrick Monnahan ◽  
Yaniv Brandvain

Abstract
Searching for population genomic signals left behind by positive selection is a major focus of evolutionary biology, particularly as sequencing technologies develop and costs decline. The effect of the number of chromosome copies (i.e., ploidy) on the manifestation of these signals remains an outstanding question, despite a wide appreciation of ploidy being a fundamental parameter governing numerous biological processes. We clarify the principal forces governing the differential manifestation and persistence of the signal of selection by separating the effects of polyploidy on rates of fixation versus rates of diversity (i.e., mutation and recombination) with a set of coalescent simulations. We explore what the major consequences of polyploidy, such as a more localized signal, greater dependence on dominance, and longer persistence of the signal following fixation, mean for within- and across-ploidy inference on the strength and prevalence of selective sweeps. As genomic advances continue to open doors for interrogating natural systems, studies such as this aid our ability to anticipate, interpret, and compare data across ploidy levels.


2017 ◽  
Author(s):  
James Hadfield ◽  
Colin Megill ◽  
Sidney M. Bell ◽  
John Huddleston ◽  
Barney Potter ◽  
...  

Abstract
Summary
Understanding the spread and evolution of pathogens is important for effective public health measures and surveillance. Nextstrain consists of a database of viral genomes, a bioinformatics pipeline for phylodynamic analysis, and an interactive visualisation platform. Together these present a real-time view into the evolution and spread of a range of viral pathogens of high public health importance. The visualisation integrates sequence data with other data types such as geographic information, serology, or host species. Nextstrain compiles our current understanding into a single accessible location, publicly available for use by health professionals, epidemiologists, virologists and the public alike.
Availability and implementation
All code (predominantly JavaScript and Python) is freely available from github.com/nextstrain and the web application is available at nextstrain.org.

