Data Processing at Scale

Author(s):  
Raju Singh

<p>Data generation and collection have improved steadily over the past several years. As both have evolved, a new question has emerged: how to process and manage data at scale.</p><p> </p><p>The relational DBMS has long been the widely accepted approach to processing and managing data, but it has its own pros and cons: the constraints it places on data to prevent integrity violations are a trade-off between performance and manageability. With advances in storage, compute, and network technology, we have steadily advanced the state of relational database management, but the work is not yet done. Failure handling has been poor in traditional database architectures, which have a single point of failure. Distributed systems, however, only multiply the failure points. Failure is expected, and so solutions for availability are designed around these expected failures. Distributed computing adds properties such as performance, availability, and reliability.</p><p>But that is not all. We live in an era in which we communicate every now and then through different devices. Beyond this, we generate, collect, and manage data of varied types (mostly unstructured, multi-dimensional, and carrying a lot of noise and bias). NoSQL DBMSs, Apache Spark, and Hadoop come to the rescue.</p><p> </p><p>One area that exemplifies the use of big data is the transportation industry, which encompasses shipping, airlines, trucking, and, in the context we consider, cabs. NYC taxi data is available as an open dataset that stores, among other things, geospatial data collected from individual taxis as they navigate the streets of New York City. Processing geospatial data at this scale is very time-consuming and resource-intensive, as anyone who has used ArcGIS on a large dataset can attest. 
Distributed and parallel data processing offers an opportunity to process this type of data faster. The Apache Spark framework is well suited to this task because it is efficient and fast. Additionally, it has built-in libraries and APIs for processing SQL queries, and many users are likely to be familiar with SQL given its ubiquity.</p><p> </p><p>In the following report, we demonstrate our approaches to hot spot analysis on the NYC taxi data. Hot-zone analysis performs a range join between rectangles and points to identify the zones from which most pickups happen. Hot-cell analysis uses statistical parameters to identify hot zones while also considering time as an additional dimension.</p>
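The hot-zone step described above can be illustrated with a minimal sketch. It is written in plain Python rather than Spark so that it runs anywhere, and it is not the report's implementation: the `Rect` layout, the inclusive-boundary rule, and the names `contains` and `hot_zones` are all assumptions made for illustration. On Spark, the same logic would typically be expressed as a join between a rectangle table and a point table, filtered by a containment predicate.

```python
from collections import Counter
from typing import Iterable, List, NamedTuple, Tuple


class Rect(NamedTuple):
    """Axis-aligned rectangle given by two opposite corners (hypothetical layout)."""
    x1: float
    y1: float
    x2: float
    y2: float


def contains(rect: Rect, x: float, y: float) -> bool:
    """Return True if (x, y) lies inside rect; the boundary counts as inside (assumption)."""
    lo_x, hi_x = sorted((rect.x1, rect.x2))
    lo_y, hi_y = sorted((rect.y1, rect.y2))
    return lo_x <= x <= hi_x and lo_y <= y <= hi_y


def hot_zones(rects: Iterable[Rect],
              points: Iterable[Tuple[float, float]]) -> List[Tuple[Rect, int]]:
    """Range join: pair each rectangle with the points it contains, then count.

    Returns (rectangle, pickup count) pairs, hottest first.
    The nested loop is a stand-in for a distributed join.
    """
    rect_list = list(rects)
    point_list = list(points)
    counts = Counter({r: 0 for r in rect_list})
    for r in rect_list:
        for x, y in point_list:
            if contains(r, x, y):
                counts[r] += 1
    return sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
```

Calling `hot_zones(zones, pickups)` with a list of `Rect` values and a list of `(longitude, latitude)` pickup points returns the zones ranked by pickup count. The quadratic loop is only acceptable for small inputs; the point of distributing the join is to avoid exactly this cost at scale.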

2021 ◽  


2021 ◽  
Vol 11 (5) ◽  
pp. 401
Author(s):  
Catherine A. Hoover ◽  
Kendahl L. Ott ◽  
Heather R. Manring ◽  
Trevor Dew ◽  
Maegen A. Borzok ◽  
...  

Desmoplakin (DSP) is a large (~260 kDa) protein found in the desmosome, a subcellular complex that links the cytoskeleton of one cell to its neighbor. A mutation ‘hot-spot’ within the NH2-terminal third of the DSP protein (specifically, residues 299–515) is associated with both cardiomyopathies and skin defects. In select DSP variants, disease is linked specifically to the uncovering of a previously occluded calpain target site (residues 447–451). Here, we partially stabilize these calpain-sensitive DSP clinical variants through the addition of a secondary single point mutation, tyrosine for leucine at amino acid position 518 (L518Y). Molecular dynamics (MD) simulations and enzymatic assays reveal that this stabilizing mutation partially blocks access to the calpain target site, resulting in restored DSP protein levels. This ‘molecular band-aid’ provides a novel way to maintain DSP protein levels, which may lead to new strategies for treating this subset of DSP-related disorders.


Author(s):  
Vesna Jaksic ◽  
Vikram Pakrashi ◽  
Alan O’Connor

Damage detection and Structural Health Monitoring (SHM) for bridges employing bridge-vehicle interaction have attracted considerable interest in recent times. A significant amount of work exists on bridge-vehicle interaction models and on damage models. Surface roughness on bridges is typically used for detailing models, and analyses exist that relate surface roughness to the dynamic amplification of the response of the bridge or the vehicle, or to the ride quality. This paper presents the potential of using surface roughness for damage detection of bridge structures through bridge-vehicle interaction. The concept is introduced by considering a single point observation of the interaction of an Euler-Bernoulli beam with a breathing crack traversed by a point load. The breathing crack is treated as a nonlinear system with bilinear stiffness characteristics related to the opening and closing of the crack. A uniform degradation of flexural rigidity of an Euler-Bernoulli beam traversed by a point load is also considered. The surface roughness of the beam is essentially a spatial representation of some spectral definition and is treated as broadband white noise in this paper. The mean-removed residuals of the beam response are analyzed to estimate damage extent. Uniform velocity and acceleration conditions of the traversing load are investigated for their appropriateness of use. The detection and calibration of damage are investigated through cumulant-based statistical parameters computed on stochastic, normalized responses of the damaged beam due to passages of the load. Possibilities of damage detection and calibration under benchmarked and non-benchmarked cases are discussed. Practicalities behind implementing this concept are also considered.
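As a rough illustration of what "cumulant-based statistical parameters" on mean-removed, normalized responses could look like, the sketch below computes the second to fourth cumulants of a sampled signal. The abstract does not say which cumulants or which normalization the authors use, so the choice of standardizing by the population standard deviation, the function name `cumulants`, and the keys `k2`, `k3`, `k4` are assumptions.

```python
from statistics import fmean
from typing import Dict, Sequence


def cumulants(response: Sequence[float]) -> Dict[str, float]:
    """Second to fourth cumulants of a mean-removed, variance-normalized signal.

    For a standardized signal, k3 equals the skewness and k4 the excess kurtosis.
    """
    n = len(response)
    if n < 2:
        raise ValueError("need at least two samples")
    mean = fmean(response)
    residuals = [v - mean for v in response]       # mean-removed residuals
    var = sum(r * r for r in residuals) / n        # population variance
    if var == 0.0:
        raise ValueError("constant signal cannot be normalized")
    std = var ** 0.5
    z = [r / std for r in residuals]               # normalized response
    m2 = sum(v ** 2 for v in z) / n                # central moments of z
    m3 = sum(v ** 3 for v in z) / n
    m4 = sum(v ** 4 for v in z) / n
    return {"k2": m2, "k3": m3, "k4": m4 - 3.0 * m2 ** 2}
```

Under these assumptions, a damage indicator would be obtained by comparing `k3` or `k4` between responses of the intact and the damaged beam; how the paper calibrates such a comparison is not stated in the abstract.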


Author(s):  
J. Boehm ◽  
K. Liu ◽  
C. Alis

In the geospatial domain we have now reached the point where the data volumes we handle have clearly grown beyond the capacity of most desktop computers. This is particularly true in the area of point cloud processing. It is therefore natural and attractive to explore established big data frameworks for big geospatial data. The very first hurdle is the import of geospatial data into big data frameworks, commonly referred to as data ingestion. Geospatial data is typically encoded in specialised binary file formats, which are not natively supported by the existing big data frameworks. Instead, such file formats are supported by software libraries that are restricted to single-CPU execution. We present an approach that allows the use of existing point cloud file format libraries on the Apache Spark big data framework. We demonstrate the ingestion of large volumes of point cloud data into a compute cluster. The approach uses a map function to distribute the data ingestion across the nodes of a cluster. We test the capabilities of the proposed method to load billions of points into a commodity hardware compute cluster and we discuss the implications for scalability and performance. The performance is benchmarked against an existing native Apache Spark data import implementation.
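The map-based ingestion pattern described above can be sketched as follows. This is not the authors' code: a thread pool stands in for the cluster, and `read_points` stands in for a single-CPU point cloud library, reading a hypothetical format of consecutive little-endian `(x, y, z)` doubles. The names `RECORD`, `read_points`, and `ingest` are illustrative.

```python
import struct
from concurrent.futures import ThreadPoolExecutor
from itertools import chain
from pathlib import Path
from typing import List, Tuple

Point = Tuple[float, float, float]

# Hypothetical record layout: x, y, z as little-endian 64-bit floats.
RECORD = struct.Struct("<3d")


def read_points(path: str) -> List[Point]:
    """Stand-in for a single-CPU file-format library: parse one whole file."""
    data = Path(path).read_bytes()
    last_start = len(data) - RECORD.size
    # Step through the buffer one record at a time; trailing partial bytes are ignored.
    return [RECORD.unpack_from(data, off)
            for off in range(0, last_start + 1, RECORD.size)]


def ingest(paths: List[str], workers: int = 4) -> List[Point]:
    """Map the reader over the file paths in parallel and flatten the results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(chain.from_iterable(pool.map(read_points, paths)))
```

The key design point is that only file paths are distributed, while each worker runs the unmodified single-file reader. On Spark, the equivalent shape would be `sc.parallelize(paths).flatMap(read_points)`, assuming every node can reach the files.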


1960 ◽  
Vol 64 (592) ◽  
pp. 199-210 ◽  
Author(s):  
Robert L. Cummings

Summary: Turbine-powered helicopters mark the most important step taken thus far in the transition from development to business for that segment of the transportation industry which utilises the capability for vertical flight. Completing more than a year of detailed study and evaluation, New York Airways recently announced its commitment to purchase ten multi-turbine Vertol 107 aircraft designed to cruise in excess of 150 miles per hour with 25 passengers. They will be introduced into service in the spring of 1961. With these machines commercial revenues could, for the first time, offset all operating charges and produce a fair return on the capital investment, without government financial support. The availability of the Fairey Rotodyne in 1964 will place New York Airways in a position to offer the public a substantially enlarged and even more useful service operating on a business basis.


mSystems ◽  
2019 ◽  
Vol 4 (4) ◽  
Author(s):  
Susan M. Joseph ◽  
Thomas Battaglia ◽  
Julia M. Maritz ◽  
Jane M. Carlton ◽  
Martin J. Blaser

ABSTRACT Bacterial resistance to antibiotics is a pressing health issue around the world, not only in health care settings but also in the community and environment, particularly in crowded urban populations. The aim of our work was to characterize the microbial populations in sewage and the spread of antibiotic resistance within New York City (NYC). Here, we investigated the structure of the microbiome and the prevalence of antibiotic resistance genes in raw sewage samples collected from the fourteen NYC Department of Environmental Protection wastewater treatment plants, distributed across the five NYC boroughs. Sewage, a direct output of anthropogenic activity and a reservoir of microbes, provides an ecological niche to examine the spread of antibiotic resistance. Taxonomic diversity analysis revealed a largely similar and stable bacterial population structure across all the samples, which was found to be similar over three time points in an annual cycle, as well as in the five NYC boroughs. All samples were positive for the presence of the seven antibiotic resistance genes tested, based on real-time quantitative PCR assays, with higher levels observed for tetracycline resistance genes at all time points. For five of the seven genes, abundances were significantly higher in May than in February and August. This study provides characteristics of the NYC sewage resistome in the context of the overall bacterial populations.

IMPORTANCE Urban sewage or wastewater is a diverse source of bacterial growth, as well as a hot spot for the development of environmental antibiotic resistance, which can in turn influence the health of the residents of the city. As part of a larger study to characterize the urban New York City microbial metagenome, we collected raw sewage samples representing three seasonal time points spanning the five boroughs of NYC and went on to characterize the microbiome and the presence of a range of antibiotic resistance genes. Through this study, we have established a baseline microbial population and antibiotic resistance abundance in NYC sewage which can prove to be very useful in studying the load of antibiotic usage, as well as for developing effective measures in antibiotic stewardship.


2019 ◽  
Vol 158 (1) ◽  
pp. 37 ◽  
Author(s):  
Petar Zečević ◽  
Colin T. Slater ◽  
Mario Jurić ◽  
Andrew J. Connolly ◽  
Sven Lončarić ◽  
...  
