Scalable Extraction of Big Macromolecular Data in Azure Data Lake Environment

Molecules ◽  
2019 ◽  
Vol 24 (1) ◽  
pp. 179 ◽  
Author(s):  
Dariusz Mrozek ◽  
Tomasz Dąbek ◽  
Bożena Małysiak-Mrozek

Calculation of structural features of proteins, nucleic acids, and nucleic acid-protein complexes on the basis of their geometries and studying various interactions within these macromolecules, for which high-resolution structures are stored in the Protein Data Bank (PDB), require parsing and extraction of suitable data stored in text files. To perform these operations on a large scale in the face of the growing amount of macromolecular data in public repositories, we propose to perform them in the distributed environment of Azure Data Lake and scale the calculations on the Cloud. In this paper, we present dedicated data extractors for PDB files that can be used in various types of calculations performed over protein and nucleic acid structures in the Azure Data Lake. Results of our tests show that the Cloud storage space occupied by the macromolecular data can be successfully reduced by using compression of PDB files without significant loss of data processing efficiency. Moreover, our experiments show that the performed calculations can be significantly accelerated when using large sequential files for storing macromolecular data and by parallelizing the calculations and data extractions that precede them. Finally, the paper shows how all the calculations can be performed in a declarative way in U-SQL scripts for Data Lake Analytics.
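
As a rough illustration of the kind of parsing such an extractor has to do, the following Python sketch reads atom coordinates from the fixed-column ATOM/HETATM records of a PDB text file, optionally gzip-compressed. It is not the authors' U-SQL extractor; the names `Atom`, `parse_atoms`, and `open_pdb` are hypothetical and introduced here for illustration only.

```python
import gzip
from typing import Iterable, Iterator, NamedTuple, TextIO


class Atom(NamedTuple):
    """One coordinate record (hypothetical container, not from the paper)."""
    serial: int
    name: str
    res_name: str
    chain_id: str
    res_seq: int
    x: float
    y: float
    z: float


def parse_atoms(lines: Iterable[str]) -> Iterator[Atom]:
    """Yield atoms from ATOM/HETATM lines using the PDB fixed-column layout."""
    for line in lines:
        # Record name occupies columns 1-6; skip everything else.
        if not line.startswith(("ATOM  ", "HETATM")):
            continue
        yield Atom(
            serial=int(line[6:11]),        # columns 7-11
            name=line[12:16].strip(),      # columns 13-16
            res_name=line[17:20].strip(),  # columns 18-20
            chain_id=line[21],             # column 22
            res_seq=int(line[22:26]),      # columns 23-26
            x=float(line[30:38]),          # columns 31-38
            y=float(line[38:46]),          # columns 39-46
            z=float(line[46:54]),          # columns 47-54
        )


def open_pdb(path: str) -> TextIO:
    """Open a PDB file as text, decompressing on the fly if it ends in .gz."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "rt")
```

Because `parse_atoms` consumes any iterable of lines, the same function works on a compressed file, an uncompressed file, or one large sequential file that concatenates many entries, for example `with open_pdb(path) as fh: atoms = list(parse_atoms(fh))`.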

Author(s):  
Jared M Sagendorf ◽  
Nicholas Markarian ◽  
Helen M Berman ◽  
Remo Rohs

DNAproDB (https://dnaprodb.usc.edu) is a web-based database and structural analysis tool that offers a combination of data visualization, data processing and search functionality that improves the speed and ease with which researchers can analyze, access and visualize structural data of DNA–protein complexes. In this paper, we report significant improvements made to DNAproDB since its initial release. DNAproDB now supports any DNA secondary structure from typical B-form DNA to single-stranded DNA to G-quadruplexes. We have updated the structure of our data files to support complex DNA conformations, multiple DNA–protein complexes within a DNAproDB entry and model indexing for analysis of ensemble data. Support for chemically modified residues and nucleotides has been significantly improved along with the addition of new structural features, improved structural moiety assignment and use of more sequence-based annotations. We have redesigned our report pages and search forms to support these enhancements, and the DNAproDB website has been improved to be more responsive and user-friendly. DNAproDB is now integrated with the Nucleic Acid Database, and we have increased our coverage of available Protein Data Bank entries. Our database now contains 95% of all available DNA–protein complexes, making our tools for analysis of these structures accessible to a broad community.


2020 ◽  
Author(s):  
Alexander S. Leonard ◽  
Sebastian E. Ahnert

Gene duplication, from single genes to whole genomes, has been observed in organisms across all taxa. Despite its prevalence, the evolutionary benefits of this mechanism are the subject of ongoing debate. Gene duplication can significantly alter the self-assembly of protein quaternary structures, impacting the dosage or interaction proclivity. Here we use a lattice model of self-assembly as a coarse-grained representation of protein complex assembly, and show that it can be used to examine potential evolutionary advantages of duplication. Duplication provides a unique mechanism for increasing the evolvability of protein complexes by enabling the transformation of symmetric homomeric interactions into heteromeric ones. This transformation is extensively observed in in silico evolutionary simulations of the lattice model, with duplication events significantly accelerating the rate at which structural complexity increases. These coarse-grained simulation results are corroborated with a large-scale analysis of complexes from the Protein Data Bank.


2012 ◽  
Vol 279 (1742) ◽  
pp. 3393-3400 ◽  
Author(s):  
Philine S. E. Zu Ermgassen ◽  
Mark D. Spalding ◽  
Brady Blake ◽  
Loren D. Coen ◽  
Brett Dumbauld ◽  
...  

Historic baselines are important in developing our understanding of ecosystems in the face of rapid global change. While a number of studies have sought to determine changes in extent of exploited habitats over historic timescales, few have quantified such changes prior to late twentieth century baselines. Here, we present, to our knowledge, the first ever large-scale quantitative assessment of the extent and biomass of marine habitat-forming species over a 100-year time frame. We examined records of wild native oyster abundance in the United States from a historic, yet already exploited, baseline between 1878 and 1935 (predominantly 1885–1915), and a current baseline between 1968 and 2010 (predominantly 2000–2010). We quantified the extent of oyster grounds in 39 estuaries historically and 51 estuaries from recent times. Data from 24 estuaries allowed comparison of historic to present extent and biomass. We found evidence for a 64 per cent decline in the spatial extent of oyster habitat and an 88 per cent decline in oyster biomass over time. The difference between these two numbers illustrates that current areal extent measures may be masking significant loss of habitat through degradation.


2019 ◽  
Author(s):  
Beatriz Seoane ◽  
Alessandra Carbone

The importance of unstructured biology has grown quickly over recent decades, accompanying the explosion in the number of experimentally resolved structures. The idea that structural disorder might be a novel mechanism of protein interaction is widespread in the literature, although the number of statistically significant structural studies supporting this idea is surprisingly low. In this work, through a large-scale analysis of all the crystallographic structures of the Protein Data Bank averaged over clusters of homologous sequences, we show clear evidence that both the (experimentally verified) interaction interfaces and the disordered regions involve roughly the same amino acids of the protein. Beyond this, disordered regions appear to carry information about the location of alternative interfaces when the protein lies within complexes, thus playing an important role in determining the order of assembly of protein complexes.


2019 ◽  
Author(s):  
Zachary VanAernum ◽  
Florian Busch ◽  
Benjamin J. Jones ◽  
Mengxuan Jia ◽  
Zibo Chen ◽  
...  

It is important to assess the identity and purity of proteins and protein complexes during and after protein purification to ensure that samples are of sufficient quality for further biochemical and structural characterization, as well as for use in consumer products, chemical processes, and therapeutics. Native mass spectrometry (nMS) has become an important tool in protein analysis due to its ability to retain non-covalent interactions during measurements, making it possible to obtain protein structural information with high sensitivity and at high speed. Interferences from the presence of non-volatiles are typically alleviated by offline buffer exchange, which is time-consuming and difficult to automate. We provide a protocol for rapid online buffer exchange (OBE) nMS to directly screen structural features of pre-purified proteins, protein complexes, or clarified cell lysates. Information obtained by OBE nMS can be used for fast (<5 min) quality control and can further guide protein expression and purification optimization.


2019 ◽  
Author(s):  
Mohammad Atif Faiz Afzal ◽  
Mojtaba Haghighatlari ◽  
Sai Prasad Ganesh ◽  
Chong Cheng ◽  
Johannes Hachmann

We present a high-throughput computational study to identify novel polyimides (PIs) with exceptional refractive index (RI) values for use as optic or optoelectronic materials. Our study utilizes an RI prediction protocol based on a combination of first-principles and data modeling developed in previous work, which we employ on a large-scale PI candidate library generated with the ChemLG code. We deploy the virtual screening software ChemHTPS to automate the assessment of this extensive pool of PI structures in order to determine the performance potential of each candidate. This rapid and efficient approach yields a number of highly promising lead compounds. Using the data mining and machine learning program package ChemML, we analyze the top candidates with respect to prevalent structural features and feature combinations that distinguish them from less promising ones. In particular, we explore the utility of various strategies that introduce highly polarizable moieties into the PI backbone to increase its RI yield. The derived insights provide a foundation for rational and targeted design that goes beyond traditional trial-and-error searches.


2020 ◽  
Vol 27 (37) ◽  
pp. 6306-6355 ◽  
Author(s):  
Marian Vincenzi ◽  
Flavia Anna Mercurio ◽  
Marilisa Leone

Background: Many pathways in healthy cells and/or linked to disease onset and progression depend on large assemblies including multi-protein complexes. Protein-protein interactions may occur through a vast array of modules known as protein interaction domains (PIDs). Objective: This review concerns PIDs that recognize post-translationally modified peptide sequences and intends to provide the scientific community with state-of-the-art knowledge on their 3D structures, binding topologies and potential applications in the drug discovery field. Method: Several databases, such as Pfam (Protein family), SMART (Simple Modular Architecture Research Tool) and the PDB (Protein Data Bank), were searched to look for different domain families and gain structural information on protein complexes in which particular PIDs are involved. Recent literature on PIDs and related drug discovery campaigns was retrieved through PubMed and analyzed. Results and Conclusion: PIDs are rather versatile in their binding preferences. Many of them specifically recognize only particular amino acid stretches with post-translational modifications; a few others are able to interact with several post-translationally modified sequences or with unmodified ones. Many PIDs can be linked to different diseases including cancer. The tremendous amount of available structural data has led to the structure-based design of several molecules targeting protein-protein interactions mediated by PIDs, including peptides, peptidomimetics and small compounds. More studies are needed to fully roll out, among different families, PIDs that can be considered reliable therapeutic targets; however, attacking PIDs rather than catalytic domains of a particular protein may represent a route to obtain selective inhibitors.

