A MapReduce Clone Car Identification Model over Traffic Data Stream

2013 ◽  
Vol 346 ◽  
pp. 117-122
Author(s):  
Wen Chuan Yang ◽  
Guang Jie Lin ◽  
Jiang Yong Wang

With the widespread use of Intelligent Traffic systems in China, all traffic data streams into the Traffic Surveillance Center (TSC). Some metropolitan TSCs, such as Beijing's, receive up to 18 million records and 1 TB of image data every hour. Normally, the job of the TSC is to monitor and retain data. There is a tendency to put more capability into the TSC, such as ad-hoc queries for clone car identification and feedback of abnormal traffic information. Thus we definitely need to think about what can be kept in working storage and how to analyze it. Obviously, an ordinary database cannot handle such a massive dataset and complex ad-hoc queries. MapReduce is a popular and widely used fine-grained parallel runtime developed for high-performance processing of large-scale datasets. In this paper, we propose CarMR, a MapReduce clone car identification system based on the Hive/Hadoop frameworks. The distributed file system HDFS is used in CarMR for fast data sharing and querying. CarMR supports fast localization of clone cars and also optimizes routes for pursuing fugitives. Our results show that the model achieves high efficiency.
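The abstract does not give CarMR's algorithm, but the core clone-car check can be sketched on a single machine in a map/reduce shape: group sightings by plate (map), then flag any plate whose consecutive sightings imply a physically impossible speed (reduce). All data, camera names, and the speed bound below are hypothetical.

```python
from collections import defaultdict

# Hypothetical checkpoint records: (plate, camera, timestamp_min, km_position).
# A "clone car" shows the same plate at locations too far apart to reach in time.
records = [
    ("A123", "cam1", 0, 0.0),
    ("A123", "cam9", 10, 80.0),   # 80 km in 10 min -> 480 km/h, impossible
    ("B456", "cam1", 0, 0.0),
    ("B456", "cam2", 30, 40.0),   # 80 km/h, plausible
]

MAX_SPEED_KMH = 200.0  # assumed physical upper bound

# "Map" phase: key every record by plate.
grouped = defaultdict(list)
for plate, cam, t, pos in records:
    grouped[plate].append((t, pos))

# "Reduce" phase: flag plates whose implied speed exceeds the bound.
clones = set()
for plate, sightings in grouped.items():
    sightings.sort()
    for (t1, p1), (t2, p2) in zip(sightings, sightings[1:]):
        dt_h = (t2 - t1) / 60.0
        if dt_h > 0 and abs(p2 - p1) / dt_h > MAX_SPEED_KMH:
            clones.add(plate)

print(sorted(clones))  # → ['A123']
```

In the real system each phase would run as a distributed Hive/Hadoop job over HDFS rather than in-memory dictionaries.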

2014 ◽  
Vol 548-549 ◽  
pp. 1853-1856 ◽  
Author(s):  
Wen Chuan Yang ◽  
He Chen ◽  
Qing Yi Qu

Normally, the job of the Traffic Data Processing Center (TDPC) is to monitor and retain data. There is a tendency to put more capability into the TDPC, such as ad-hoc queries for speeding car identification and feedback of abnormal traffic information. Thus we definitely need to think about what can be kept in working storage and how to analyze it. Obviously, an ordinary database cannot handle such a massive dataset and complex ad-hoc queries. MapReduce is a popular and widely used fine-grained parallel runtime developed for high-performance processing of large-scale datasets. In this paper, we propose MRTP, a MapReduce traffic processing system based on the Hive/Hadoop frameworks. The distributed file system HDFS is used in MRTP for fast data sharing and querying. MRTP supports fast localization of speeding cars and also optimizes routes for pursuing fugitives. Our results show that the model achieves high efficiency.
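Average-speed enforcement between camera pairs, the likely core of the speeding query, can be sketched as below. The plates, positions, and speed limit are illustrative, not from the paper.

```python
# Hypothetical camera passages per plate: (timestamp_min, km_marker).
passages = {
    "C789": [(0, 100.0), (6, 115.0)],   # 15 km in 6 min -> 150 km/h
    "D012": [(0, 100.0), (12, 115.0)],  # 15 km in 12 min -> 75 km/h
}
SPEED_LIMIT_KMH = 120.0

speeding = []
for plate, points in passages.items():
    (t1, p1), (t2, p2) = points[0], points[-1]
    avg_kmh = abs(p2 - p1) / ((t2 - t1) / 60.0)  # distance / elapsed hours
    if avg_kmh > SPEED_LIMIT_KMH:
        speeding.append(plate)

print(speeding)  # → ['C789']
```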


2006 ◽  
Vol 07 (01) ◽  
pp. 63-73 ◽  
Author(s):  
WEIZHEN GU ◽  
D. FRANK HSU ◽  
XINGDE JIA

Live traffic flow information can help improve the efficiency of a communication network. There are many ways available to monitor the traffic flow of a network. In this paper, we propose a very efficient monitoring strategy. This strategy not only reduces the number of nodes to be monitored but also determines the complete traffic information of the entire network using the information from the monitored nodes. The strategy is optimal for monitoring a network because it reduces the number of monitored nodes to a minimum. Fast algorithms are also presented in this paper to compute the traffic information for the entire network based on the information collected from the monitored nodes. The monitoring scheme discussed in this paper can be applied to the internet, telecommunication networks, wireless ad hoc networks, large scale multiprocessor computing systems, and other information systems where the transmission of information needs to be monitored.
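The key idea, that flows at unmonitored nodes can be recovered from monitored ones, rests on flow conservation: at an interior node, inflow equals outflow. A toy illustration (the network and the simple one-unknown-at-a-time propagation below are my own, not the paper's algorithm):

```python
# Hypothetical directed network; edge flows partly unknown (None).
# At every interior node, conservation holds: inflow == outflow
# (no traffic originates or terminates there in this toy example).
edges = {
    ("s", "a"): 7.0,     # measured at monitored node s
    ("a", "b"): None,    # unmonitored link
    ("a", "c"): 3.0,     # measured
    ("b", "t"): None,
    ("c", "t"): None,
}
interior = {"a", "b", "c"}  # nodes where conservation applies

# Repeatedly resolve any node with exactly one unknown incident flow.
changed = True
while changed:
    changed = False
    for n in interior:
        ins = [e for e in edges if e[1] == n]
        outs = [e for e in edges if e[0] == n]
        unknown = [e for e in ins + outs if edges[e] is None]
        if len(unknown) == 1:
            e = unknown[0]
            bal = sum(edges[x] for x in ins if edges[x] is not None) \
                - sum(edges[x] for x in outs if edges[x] is not None)
            edges[e] = bal if e in outs else -bal
            changed = True

print(edges[("b", "t")], edges[("c", "t")])  # → 4.0 3.0
```

Monitoring only s and the a→c link thus determines every flow in this small network; the paper's contribution is choosing a minimum monitored set for which this recovery always succeeds.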


Diagnostics ◽  
2021 ◽  
Vol 11 (9) ◽  
pp. 1562
Author(s):  
Yongjie Li ◽  
Xiangyu Yan ◽  
Bo Zhang ◽  
Zekun Wang ◽  
Hexuan Su ◽  
...  

Drug use disorders caused by illicit drug use are significant contributors to the global burden of disease, and it is vital to conduct early detection of people with drug use disorders (PDUD). However, primary care clinics and emergency departments lack simple and effective tools for screening PDUD. This study proposes a novel method to detect PDUD using facial images. Various experiments are designed to obtain the convolutional neural network (CNN) model by transfer learning based on a large-scale dataset (9870 images from PDUD and 19,567 images from GP (the general population)). Our results show that the model achieved 84.68%, 87.93%, and 83.01% in accuracy, sensitivity, and specificity on this dataset, respectively. To verify its effectiveness, the model is evaluated on external datasets based on real scenarios, and we found it still achieved high performance (accuracy > 83.69%, specificity > 90.10%, sensitivity > 80.00%). Our results also show differences between PDUD and GP in different facial areas. Compared with GP, the facial features of PDUD were mainly concentrated in the left cheek, right cheek, and nose areas (p < 0.001), which also reveals the potential relationship between mechanisms of drug action and changes in facial tissues. This is the first study to apply a CNN model to screen PDUD in clinical practice and is also the first attempt to quantitatively analyze the facial features of PDUD. This model could be quickly integrated into the existing clinical workflow and medical care to provide screening capabilities.
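For readers unfamiliar with the three reported metrics, they all derive from the binary confusion matrix. A minimal sketch with illustrative counts (not the paper's actual data):

```python
# Illustrative confusion-matrix counts for a PDUD-vs-GP classifier
# (hypothetical numbers, not from the study):
tp, fn = 880, 120    # PDUD images: correctly / incorrectly classified
tn, fp = 1800, 200   # GP images: correctly / incorrectly classified

accuracy    = (tp + tn) / (tp + tn + fp + fn)  # overall fraction correct
sensitivity = tp / (tp + fn)                   # recall on the PDUD class
specificity = tn / (tn + fp)                   # recall on the GP class

print(round(accuracy, 4), sensitivity, specificity)  # → 0.8933 0.88 0.9
```

Reporting all three matters here because the classes are imbalanced (roughly 1:2 PDUD to GP), so accuracy alone could hide poor performance on the smaller PDUD class.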


2019 ◽  
Vol 1 (2-3) ◽  
pp. 161-173 ◽  
Author(s):  
Vilhelm Verendel ◽  
Sonia Yeh

Online real-time traffic data services could effectively deliver traffic information to people all over the world and provide large benefits to society and to research about cities. Yet, city-wide road network traffic data are often hard to come by on a large scale over a longer period of time. We collect, describe, and analyze traffic data for 45 cities from HERE, a major online real-time traffic information provider. We sampled the online platform for city traffic data every 5 min during 1 year, in total more than 5 million samples covering more than 300 thousand road segments. Our aim is to describe some of the practical issues surrounding the data that we experienced in working with this type of data source, as well as to explore the data patterns and see how this data source provides information to study traffic in cities. We focus on data availability, the share of road segments with real-time traffic information at a given time for a given city, to characterize how traffic information varies across cities. We describe the patterns of real-time data availability, and evaluate methods to fill in missing speed data for road segments when real-time information was not available. We conduct a validation case study based on Swedish traffic sensor data and point out challenges for future validation. Our findings include (i) a case study of validating the HERE data against ground truth available for roads and lanes in a Swedish city, showing that real-time traffic data tends to follow dips in travel speed but misses instantaneous higher speeds measured in some sensors, typically at times when there are fewer vehicles on the road; (ii) using time series clustering, we identify four clusters of cities with different types of measurement patterns; and (iii) a k-nearest neighbor-based method consistently outperforms other methods to fill in missing real-time traffic speeds.
We illustrate how to work with this kind of traffic data source that is increasingly available to researchers, travellers, and city planners. Future work is needed to broaden the scope of validation, and to apply these methods to use online data for improving our knowledge of traffic in cities.
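The best-performing imputation method the authors report, k-nearest neighbors, can be sketched as follows: fill a segment's missing speed with the mean speed of its k spatially nearest segments that do have readings. The segment coordinates, speeds, and choice of k below are hypothetical, and the paper's actual neighbor definition may differ (e.g. network distance rather than Euclidean).

```python
import math

# Hypothetical road segments: (x, y) midpoint and observed speed (None = missing).
segments = {
    "s1": ((0.0, 0.0), 50.0),
    "s2": ((1.0, 0.0), 54.0),
    "s3": ((0.5, 0.1), None),   # no real-time reading at this sample
    "s4": ((9.0, 9.0), 30.0),   # far away, should not influence s3
}
K = 2

def knn_fill(seg_id):
    """Mean speed of the K nearest segments with known readings."""
    (x, y), _ = segments[seg_id]
    known = [(math.hypot(x - kx, y - ky), v)
             for sid, ((kx, ky), v) in segments.items()
             if v is not None]
    known.sort(key=lambda dv: dv[0])
    neighbours = known[:K]
    return sum(v for _, v in neighbours) / len(neighbours)

filled = knn_fill("s3")
print(filled)  # → 52.0 (mean of s1 and s2)
```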


Gigabyte ◽  
2021 ◽  
Vol 2021 ◽  
pp. 1-10
Author(s):  
Ben Duggan ◽  
John Metzcar ◽  
Paul Macklin

Modern agent-based models (ABM) and other simulation models require evaluation and testing of many different parameters. Managing that testing for large scale parameter sweeps (grid searches), as well as storing simulation data, requires multiple, potentially customizable steps that may vary across simulations. Furthermore, parameter testing, processing, and analysis are slowed if simulation and processing jobs cannot be shared across teammates or computational resources. While high-performance computing (HPC) has become increasingly available, models can often be tested faster with the use of multiple computers and HPC resources. To address these issues, we created the Distributed Automated Parameter Testing (DAPT) Python package. By hosting parameters in an online (and often free) “database”, multiple individuals can run parameter sets simultaneously in a distributed fashion, enabling ad hoc crowdsourcing of computational power. Combining this with a flexible, scriptable tool set, teams can evaluate models and assess their underlying hypotheses quickly. Here, we describe DAPT and provide an example demonstrating its use.
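The core pattern, hosting parameter sets in a shared table from which distributed workers claim work, can be sketched with the standard library alone. The field names and claim protocol below are illustrative stand-ins, not DAPT's actual API, and threads stand in for separate machines.

```python
import threading

# Toy stand-in for an online parameter "database": a shared table where
# each parameter set is claimed, run, and marked finished.
params = [{"id": i, "rate": 0.1 * i, "status": "queued", "result": None}
          for i in range(6)]
lock = threading.Lock()

def claim_next():
    """Atomically claim the next queued parameter set, or None if done."""
    with lock:
        for p in params:
            if p["status"] == "queued":
                p["status"] = "running"
                return p
    return None

def worker():
    while (p := claim_next()) is not None:
        p["result"] = p["rate"] ** 2   # stand-in for running the simulation
        p["status"] = "finished"

# Three "machines" drain the queue concurrently.
threads = [threading.Thread(target=worker) for _ in range(3)]
for t in threads: t.start()
for t in threads: t.join()

print(all(p["status"] == "finished" for p in params))  # → True
```

In DAPT the table lives in a remote, often free, hosted service, so collaborators on different computers can join or leave the sweep at any time.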


Author(s):  
Samik Banerjee ◽  
Lucas Magee ◽  
Dingkang Wang ◽  
Xu Li ◽  
Bingxing Huo ◽  
...  

Understanding of neuronal circuitry at cellular resolution within the brain has relied on tract tracing methods, which involve careful observation and interpretation by experienced neuroscientists. With recent developments in imaging and digitization, this approach is no longer feasible for large-scale (terabyte-to-petabyte range) images. Machine learning techniques using deep networks provide an efficient alternative. However, these methods rely on very large volumes of annotated images for training and have error rates that are too high for scientific data analysis, and thus require a significant volume of human-in-the-loop proofreading. Here we introduce a hybrid architecture combining prior structure, in the form of topological data analysis methods based on discrete Morse theory, with best-in-class deep-net architectures for neuronal connectivity analysis. We show significant performance gains using our hybrid architecture on detection of topological structure (e.g., connectivity of neuronal processes and local intensity maxima on axons corresponding to synaptic swellings), with precision/recall close to 90% compared with human observers. We have adapted our architecture into a high-performance pipeline capable of semantic segmentation of light-microscopic whole-brain image data into a hierarchy of neuronal compartments. We expect that the hybrid architecture incorporating discrete Morse techniques into deep nets will generalize to other data domains.


Author(s):  
Zhi Shang

Simulations of environmental flood issues usually face the scalability problem of large-scale parallel computing. A plain parallel technique based on pure MPI has difficulty scaling well due to the large number of domain partitions. Therefore, hybrid programming using MPI and OpenMP is introduced to deal with the scalability issue. This parallel technique exploits the strengths of both MPI and OpenMP: OpenMP is employed for its efficient fine-grained parallel computing, while MPI performs the coarse-grained parallel domain partitioning and handles data communication. In our tests, hybrid MPI/OpenMP parallel programming was used to renovate the finite element solvers in the BIEF library of Telemac. We found that the hybrid programming helps Telemac deal with the scalability issue.
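The two-level structure can be illustrated by analogy in plain Python; this is not real MPI/OpenMP code. The outer pool plays the role of MPI ranks, each owning one subdomain of the mesh, and the loop inside each rank stands in for OpenMP's fine-grained loop parallelism.

```python
from concurrent.futures import ThreadPoolExecutor

# Toy 1-D "mesh" of element values; NUM_RANKS plays the role of MPI ranks.
mesh = list(range(16))
NUM_RANKS = 4

def rank_work(subdomain):
    # Fine-grained kernel over the rank's local elements
    # (the role OpenMP threads would play inside one MPI rank).
    return [v * v for v in subdomain]

# Coarse-grained domain partitioning (the role of MPI).
chunks = [mesh[i::NUM_RANKS] for i in range(NUM_RANKS)]
with ThreadPoolExecutor(max_workers=NUM_RANKS) as pool:
    local_results = list(pool.map(rank_work, chunks))

total = sum(sum(r) for r in local_results)  # stands in for an MPI reduce
print(total)  # → 1240 (sum of squares 0..15)
```

The payoff of the real hybrid scheme is that each node runs one MPI rank with many OpenMP threads, so the number of domain partitions (and the communication they entail) grows with nodes rather than with cores.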


2020 ◽  
Vol 245 ◽  
pp. 05042
Author(s):  
Miha Muškinja ◽  
Paolo Calafiura ◽  
Charles Leggett ◽  
Illya Shapoval ◽  
Vakho Tsulaia

The ATLAS experiment has successfully integrated High-Performance Computing resources (HPCs) in its production system. Unlike the current generation of HPC systems, and the LHC computing grid, the next generation of supercomputers is expected to be extremely heterogeneous in nature: different systems will have radically different architectures, and most of them will provide partitions optimized for different kinds of workloads. In this work we explore the applicability of concepts and tools realized in Ray (the high-performance distributed execution framework targeting large-scale machine learning applications) to ATLAS event throughput optimization on heterogeneous distributed resources, ranging from traditional grid clusters to Exascale computers. We present a prototype of Raythena, a Ray-based implementation of the ATLAS Event Service (AES), a fine-grained event processing workflow aimed at improving the efficiency of ATLAS workflows on opportunistic resources, specifically HPCs. The AES is implemented as an event processing task farm that distributes packets of events to several worker processes running on multiple nodes. Each worker in the task farm runs an event-processing application (Athena) as a daemon. The whole system is orchestrated by Ray, which assigns work in a distributed, possibly heterogeneous, environment. For all its flexibility, the AES implementation currently comprises multiple separate layers that communicate through ad-hoc command-line and file-based interfaces. The goal of Raythena is to integrate these layers through a feature-rich, efficient application framework. Besides increasing usability and robustness, a vertically integrated scheduler will enable us to explore advanced concepts such as dynamic shaping of workflows to exploit currently available resources, particularly on heterogeneous systems.
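The task-farm pattern the AES uses, a driver handing packets of events to long-lived workers, can be sketched with standard-library queues. This is an analogy only: it uses threads and `queue.Queue` where Raythena uses Ray actors across nodes, and the packet size and "processing" are invented.

```python
import queue
import threading

# Queue-based analogy of the AES task farm: a driver enqueues packets of
# event IDs; each worker "processes" its packets (like an Athena daemon
# would) until a None sentinel arrives.
EVENTS_PER_PACKET = 4
work_q, done_q = queue.Queue(), queue.Queue()

def worker():
    while (packet := work_q.get()) is not None:
        done_q.put([e * 2 for e in packet])  # stand-in for event processing

events = list(range(20))
packets = [events[i:i + EVENTS_PER_PACKET]
           for i in range(0, len(events), EVENTS_PER_PACKET)]

workers = [threading.Thread(target=worker) for _ in range(3)]
for t in workers: t.start()
for p in packets: work_q.put(p)
for _ in workers: work_q.put(None)   # one sentinel per worker
for t in workers: t.join()

# Collect and flatten the processed packets.
processed = sorted(e for _ in packets for e in done_q.get())
print(processed[:4])  # → [0, 2, 4, 6]
```

Packetizing events (rather than dispatching them one by one) amortizes scheduling overhead while still letting stragglers be rebalanced across heterogeneous workers.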


2021 ◽  
Vol 18 (3) ◽  
pp. 1-24
Author(s):  
Yashuai Lü ◽  
Hui Guo ◽  
Libo Huang ◽  
Qi Yu ◽  
Li Shen ◽  
...  

Due to massive thread-level parallelism, GPUs have become an attractive platform for accelerating large-scale data-parallel computations, such as graph processing. However, achieving high performance for graph processing with GPUs is non-trivial. Processing graphs on GPUs introduces several problems, such as load imbalance, low utilization of hardware units, and memory divergence. Although previous work has proposed several software strategies to optimize graph processing on GPUs, several issues are beyond the capability of software techniques to address. In this article, we present GraphPEG, a graph processing engine for efficient graph processing on GPUs. Inspired by the observation that many graph algorithms have a common pattern of graph traversal, GraphPEG improves the performance of graph processing by coupling automatic edge gathering with fine-grained work distribution. GraphPEG can also adapt to various input graph datasets and simplify the software design of graph processing with hardware-assisted graph traversal. Simulation results show that, in comparison with two representative, highly efficient GPU graph processing software frameworks, Gunrock and SEP-Graph, GraphPEG improves graph processing throughput by 2.8× and 2.5× on average, and up to 7.3× and 7.0×, for six graph algorithm benchmarks on six graph datasets, with marginal hardware cost.
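The load-imbalance problem and the edge-gathering remedy can be seen in one frontier expansion over a CSR (compressed sparse row) graph. In the sketch below (a toy graph of my own, not from the paper), assigning one work unit per frontier *vertex* gives threads wildly different amounts of work when degrees are skewed; gathering all outgoing edges first yields uniform one-edge work units, which is what GraphPEG's hardware support automates.

```python
# Toy CSR graph: vertex v's out-edges are col_idx[row_ptr[v]:row_ptr[v+1]].
row_ptr = [0, 3, 4, 4, 6]
col_idx = [1, 2, 3, 3, 0, 1]   # vertex 0 has degree 3; vertex 2 has degree 0

frontier = [0]

# Edge-centric expansion: gather (src, dst) pairs so each work unit is one
# edge, regardless of how skewed the vertex degrees are.
edge_list = [(v, col_idx[i])
             for v in frontier
             for i in range(row_ptr[v], row_ptr[v + 1])]
next_frontier = sorted({dst for _, dst in edge_list})

print(next_frontier)  # → [1, 2, 3]
```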


