Scalable community detection via parallel correlation clustering

2021 ◽  
Vol 14 (11) ◽  
pp. 2305-2313
Author(s):  
Jessica Shi ◽  
Laxman Dhulipala ◽  
David Eisenstat ◽  
Jakub Łącki ◽  
Vahab Mirrokni

Graph clustering and community detection are central problems in modern data mining. The increasing need to analyze billion-scale data calls for faster and more scalable algorithms for these problems. There are inherent trade-offs between the quality and speed of such clustering algorithms. In this paper, we design scalable algorithms that achieve high quality when evaluated against ground truth. We develop a generalized sequential and shared-memory parallel framework based on the LambdaCC objective (introduced by Veldt et al.), which encompasses modularity and correlation clustering. Our framework consists of highly optimized implementations that scale to large data sets of billions of edges and that obtain high-quality clusters relative to ground-truth data, on both unweighted and weighted graphs. Our empirical evaluation shows that this framework improves the state-of-the-art trade-offs between speed and quality of scalable community detection. For example, on a 30-core machine with two-way hyper-threading, our implementations achieve orders of magnitude speedups over other correlation clustering baselines, and up to 28.44× speedups over our own sequential baselines while maintaining or improving quality.
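As a rough illustration of the kind of objective involved (and not the paper's optimized implementation), the sketch below computes a LambdaCC-style disagreement cost for a given clustering, assuming the standard formulation in which a cut edge costs 1 − λ and a missing intra-cluster edge costs λ; the function and variable names are illustrative.

```python
# Minimal sketch of a LambdaCC-style disagreement objective (after Veldt et al.).
# Pairs of nodes in the same cluster without an edge cost `lam`; edges cut
# between clusters cost `1 - lam`. Names and data layout are illustrative.
from itertools import combinations

def lambda_cc_cost(nodes, edges, labels, lam=0.5):
    """Return the LambdaCC disagreement cost of a clustering.

    nodes  -- iterable of node ids
    edges  -- set of frozensets {u, v}, one per undirected edge
    labels -- dict mapping node id -> cluster id
    lam    -- resolution parameter in (0, 1)
    """
    cost = 0.0
    for u, v in combinations(nodes, 2):
        same_cluster = labels[u] == labels[v]
        has_edge = frozenset((u, v)) in edges
        if has_edge and not same_cluster:
            cost += 1.0 - lam          # cut edge
        elif not has_edge and same_cluster:
            cost += lam                # missing edge inside a cluster
    return cost

# Tiny example: a triangle {0, 1, 2} plus a pendant node 3.
nodes = [0, 1, 2, 3]
edges = {frozenset(e) for e in [(0, 1), (1, 2), (0, 2), (2, 3)]}
labels = {0: "a", 1: "a", 2: "a", 3: "b"}
print(lambda_cc_cost(nodes, edges, labels, lam=0.5))  # 0.5: one cut edge
```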

Agronomy ◽  
2021 ◽  
Vol 11 (10) ◽  
pp. 1951
Author(s):  
Brianna B. Posadas ◽  
Mamatha Hanumappa ◽  
Kim Niewolny ◽  
Juan E. Gilbert

Precision agriculture is highly dependent on the collection of high-quality ground truth data to validate the algorithms used in prescription maps. However, the process of collecting ground truth data is labor-intensive and costly. One way to increase the collection of ground truth data is to recruit citizen scientists through a crowdsourcing platform. In this study, a crowdsourcing platform application was built using a human-centered design process. The primary goals were to gauge users' perceptions of the platform, evaluate how well the system satisfies their needs, and observe whether the users' classification rate of lambsquarters would match that of an expert. Previous work demonstrated a need for ground truth data on lambsquarters in the D.C., Maryland, Virginia (DMV) area. Previous social interviews revealed users who want a citizen science platform that expands their skills and gives them access to educational resources. Using a human-centered design protocol, design iterations of a mobile application were created in Kinvey Studio. The application, Mission LQ, taught people how to classify certain characteristics of lambsquarters in the DMV and allowed them to submit ground truth data. The final design of Mission LQ received a median system usability scale (SUS) score of 80.13, which indicates a good design. The classification rate of lambsquarters was 72%, which is comparable to expert classification. This demonstrates that a crowdsourcing mobile application can be used to collect high-quality ground truth data for use in precision agriculture.


2021 ◽  
Vol 14 (11) ◽  
pp. 2369-2382
Author(s):  
Monica Chiosa ◽  
Thomas B. Preußer ◽  
Gustavo Alonso

Data analysts often need to characterize a data stream as a first step to its further processing. Some of the initial insights to be gained include the cardinality of the data set and its frequency distribution. Such information is typically extracted using sketch algorithms, now widely employed to process very large data sets in manageable space and in a single pass over the data. Often, analysts need more than one parameter to characterize the stream. However, computing multiple sketches becomes expensive even when using high-end CPUs. Exploiting the increasing adoption of hardware accelerators, this paper proposes SKT, an FPGA-based accelerator that can compute several sketches along with basic statistics (average, max, min, etc.) in a single pass over the data. SKT has been designed to characterize a data set by calculating its cardinality, its second frequency moment, and its frequency distribution. The design processes data streams coming either from PCIe or TCP/IP, and it is built to fit emerging cloud service architectures, such as Microsoft's Catapult or Amazon's AQUA. The paper explores the trade-offs of designing sketch algorithms on a spatial architecture and how to combine several sketch algorithms into a single design. The empirical evaluation shows how SKT on an FPGA offers a significant performance gain over high-end, server-class CPUs.
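To make the single-pass idea concrete, here is a small software sketch (in no way the paper's FPGA design) that updates an approximate distinct count, an AMS-style F2 estimator, and a Count-Min table together in one pass over a stream; all sizes, seeds, and helper names are illustrative choices.

```python
# Software sketch (not the paper's FPGA design) of computing several stream
# summaries in a single pass: a HyperLogLog-style distinct count, an AMS-style
# estimate of the second frequency moment F2, and a Count-Min table for
# per-item frequency queries. Sizes, seeds, and helper names are illustrative.
import hashlib
import random

def h64(item, seed):
    """Deterministic 64-bit hash of (seed, item)."""
    digest = hashlib.blake2b(f"{seed}:{item}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")

class OnePassSketches:
    def __init__(self, cm_width=1024, cm_depth=4, ams_reps=32, hll_bits=10):
        self.cm = [[0] * cm_width for _ in range(cm_depth)]
        self.cm_width, self.cm_depth = cm_width, cm_depth
        self.ams = [0] * ams_reps                  # running +/-1 dot products
        self.registers = [0] * (1 << hll_bits)     # HyperLogLog-style registers
        self.hll_bits = hll_bits

    def update(self, item):
        # Count-Min: increment one counter per row.
        for row in range(self.cm_depth):
            self.cm[row][h64(item, row) % self.cm_width] += 1
        # AMS: add a hash-derived +/-1 sign per repetition.
        for rep in range(len(self.ams)):
            self.ams[rep] += 1 if h64(item, 1000 + rep) & 1 else -1
        # Distinct count: track the longest leading-zero run per register.
        x = h64(item, 9999)
        reg = x >> (64 - self.hll_bits)
        rest = x & ((1 << (64 - self.hll_bits)) - 1)
        rank = (64 - self.hll_bits) - rest.bit_length() + 1
        self.registers[reg] = max(self.registers[reg], rank)

    def freq(self, item):
        return min(self.cm[r][h64(item, r) % self.cm_width]
                   for r in range(self.cm_depth))

    def f2(self):
        return sum(s * s for s in self.ams) / len(self.ams)   # mean of squares

    def cardinality(self):
        m = len(self.registers)
        harmonic = sum(2.0 ** -r for r in self.registers)
        return 0.7213 / (1 + 1.079 / m) * m * m / harmonic

# One pass over a synthetic stream; all three sketches are updated together.
sk = OnePassSketches()
stream = [random.randint(0, 4000) for _ in range(10000)]
for item in stream:
    sk.update(item)
print(sk.cardinality(), sk.f2(), sk.freq(stream[0]))
```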


2020 ◽  
Author(s):  
Vishrawas Gopalakrishnan ◽  
Sayali Pethe ◽  
Sarah Kefayati ◽  
Raman Srinivasan ◽  
Paul Hake ◽  
...  

Abstract Multiple efforts to model the epidemiology of SARS-CoV-2 have recently been launched in support of public health responses at the national, state, and county levels. While the pandemic is global, the dynamics of this infectious disease vary with geography, local policies, and local variations in demographics. An underlying assumption of most infectious disease compartment modeling is that of a well-mixed population at the resolution of the areas being modeled. The implicit need to model at fine spatial resolution is impeded by the quality of ground truth data for fine-scale administrative subdivisions. To understand the trade-offs and benefits of such modeling as a function of scale, we compare the predictive performance of SARS-CoV-2 models at the county, county-cluster, and state levels for the entire United States. Our results demonstrate that accurate prediction at the county level requires hyper-local modeling at county resolution. State-level modeling does not accurately predict community spread in smaller sub-regions because state populations are not well mixed, resulting in large prediction errors. As an important use case, leveraging high-resolution modeling with public health data and admissions data from Hillsborough County, Florida, we performed weekly forecasts of both hospital admission and ICU bed demand for the county. The repeated forecasts between March and August 2020 were used to develop accurate resource allocation plans for Tampa General Hospital.
2010 MSC: 92-D30, 91-C20
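For readers unfamiliar with compartment models, the following minimal SIR sketch (not the authors' model; all parameter values are placeholders, not fitted estimates) illustrates the well-mixed assumption: each administrative unit is simulated as one homogeneous population, which is why pooling heterogeneous counties into a single state-level run can misestimate local peaks.

```python
# Minimal discrete-time SIR sketch. Each modeled unit (e.g. one county) is
# treated as a single well-mixed population with its own contact rate.
def simulate_sir(population, beta, gamma, i0, days):
    """Susceptible/Infected/Recovered trajectory for one well-mixed unit."""
    s, i, r = population - i0, float(i0), 0.0
    history = []
    for _ in range(days):
        new_infections = beta * s * i / population
        new_recoveries = gamma * i
        s -= new_infections
        i += new_infections - new_recoveries
        r += new_recoveries
        history.append((s, i, r))
    return history

# Two hypothetical counties with different contact rates: the pooled run is
# not equivalent to the per-county runs, illustrating the scale trade-off.
counties = [{"population": 1.4e6, "beta": 0.30, "i0": 50},
            {"population": 2.0e5, "beta": 0.18, "i0": 5}]
county_peaks = [max(i for _, i, _ in
                    simulate_sir(c["population"], c["beta"], 0.1, c["i0"], 180))
                for c in counties]
pooled = simulate_sir(sum(c["population"] for c in counties), 0.27, 0.1, 55, 180)
print(county_peaks, max(i for _, i, _ in pooled))
```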


Author(s):  
Himansu Sekhar Pattanayak ◽  
Harsh K. Verma ◽  
Amrit Lal Sangal

Community detection is a pivotal part of network analysis and is classified as an NP-hard problem. In this paper, a novel community detection algorithm is proposed, which probabilistically predicts the diameters of communities using the local information of random seed nodes. A gravitation method is then applied to discover the communities surrounding the seed nodes. The individual communities are combined to obtain the community structure of the whole network. The proposed algorithm, named the Local Gravitational Community Detection Algorithm (LGCDA), can also handle overlapping communities. The LGCDA algorithm is evaluated based on quality metrics and ground-truth data, comparing it with some widely used community detection algorithms on synthetic and real-world networks.
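As an illustration only (this is not the LGCDA implementation), the sketch below grows communities around seed nodes using a gravity-like attraction that increases with node degree and decays with shortest-path distance; the scoring formula and the seed choice are assumptions.

```python
# Simplified seed-based "gravitational" assignment sketch. Nodes are attracted
# to seeds with a force that grows with degree and decays with distance.
import networkx as nx

def gravity_communities(G, seeds):
    """Assign every node to the seed exerting the strongest attraction."""
    degree = dict(G.degree())
    # Unweighted shortest-path distances from each seed (BFS).
    dist = {s: nx.single_source_shortest_path_length(G, s) for s in seeds}
    labels = {}
    for node in G.nodes():
        best_seed, best_force = None, -1.0
        for s in seeds:
            d = dist[s].get(node)
            if d is None:          # node unreachable from this seed
                continue
            force = degree[s] * degree[node] / (d + 1) ** 2  # gravity-like score
            if force > best_force:
                best_seed, best_force = s, force
        labels[node] = best_seed
    return labels

G = nx.karate_club_graph()
print(gravity_communities(G, seeds=[0, 33]))
```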


2016 ◽  
Vol 25 (05) ◽  
pp. 1640003 ◽  
Author(s):  
Yoav Liberman ◽  
Adi Perry

Visual tracking in low frame rate (LFR) videos has many inherent difficulties for achieving accurate target recovery, such as occlusions, abrupt motions, and rapid pose changes; thus, conventional tracking methods cannot be applied reliably. In this paper, we offer a new scheme for tracking objects in low frame rate videos. We present a method of integrating multiple metrics for template matching, as an extension of the particle filter. By inspecting a large data set of tracking videos, we show that our method not only outperforms other related benchmarks in the field but also achieves better results, both visually and quantitatively, when compared to actual ground truth data.
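A minimal sketch of the general idea, not the authors' method: each particle proposes a candidate window, several template-matching metrics score it against the template, and the scores are fused into a single particle weight. The two metrics and the fusion rule below are illustrative assumptions.

```python
# Fusing two template-matching metrics into particle-filter weights (sketch).
import numpy as np

def ncc(a, b):
    """Normalized cross-correlation between two equally sized patches."""
    a, b = a - a.mean(), b - b.mean()
    denom = np.linalg.norm(a) * np.linalg.norm(b) + 1e-9
    return float((a * b).sum() / denom)

def hist_similarity(a, b, bins=16):
    """Intersection of intensity histograms, roughly in [0, 1]."""
    ha, _ = np.histogram(a, bins=bins, range=(0, 1), density=True)
    hb, _ = np.histogram(b, bins=bins, range=(0, 1), density=True)
    return float(np.minimum(ha, hb).sum() / (ha.sum() + 1e-9))

def particle_weights(frame, template, particles, patch=16):
    """Combine NCC and histogram similarity into one weight per particle."""
    weights = []
    for (x, y) in particles:
        window = frame[y:y + patch, x:x + patch]
        if window.shape != template.shape:       # particle fell off the frame
            weights.append(1e-12)
            continue
        score = 0.5 * (ncc(window, template) + 1) / 2 \
                + 0.5 * hist_similarity(window, template)
        weights.append(max(score, 1e-12))
    w = np.array(weights)
    return w / w.sum()

rng = np.random.default_rng(0)
frame = rng.random((120, 160))
template = frame[40:56, 60:76].copy()
particles = [(60, 40), (61, 41), (10, 10), (100, 80)]
print(particle_weights(frame, template, particles))
```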


2021 ◽  
Vol 5 ◽  
Author(s):  
Annalyse Kehs ◽  
Peter McCloskey ◽  
John Chelal ◽  
Derek Morr ◽  
Stellah Amakove ◽  
...  

A major bottleneck to applying machine learning tools to satellite data of African farms is the lack of high-quality ground truth data. Here we describe a high-throughput approach, employing youth in Kenya, that yields high-quality data cost-effectively and in near real time. These data are presented to the global community as a public good and are linked to other data sources that will inform our understanding of crop stress, particularly in the context of climate change.


2018 ◽  
Author(s):  
Naihui Zhou ◽  
Zachary D Siegel ◽  
Scott Zarecor ◽  
Nigel Lee ◽  
Darwin A Campbell ◽  
...  

Abstract The accuracy of machine learning tasks critically depends on high-quality ground truth data. In many cases, producing good ground truth data involves trained professionals; however, this can be costly in time, effort, and money. Here we explore the use of crowdsourcing to generate a large volume of good-quality training data. We study an image analysis task involving the segmentation of corn tassels from images taken in a field setting. We investigate the accuracy, speed, and other quality metrics when this task is performed by students for academic credit, Amazon MTurk workers, and Master Amazon MTurk workers. We conclude that the Amazon MTurk and Master MTurk workers perform significantly better than the for-credit students, with no significant difference between the two MTurk worker types. Furthermore, the quality of the segmentation produced by Amazon MTurk workers rivals that of an expert worker. We provide best practices for assessing the quality of ground truth data and for comparing data quality produced by different sources. We conclude that properly managed crowdsourcing can be used to establish large volumes of viable ground truth data at low cost and high quality, especially in the context of high-throughput plant phenotyping. We also provide several metrics for assessing the quality of the generated datasets.

Author Summary Food security is a growing global concern. Farmers, plant breeders, and geneticists are hastening to address the challenges presented to agriculture by climate change, dwindling arable land, and population growth. Scientists in the field of plant phenomics are using satellite and drone images to understand how crops respond to a changing environment and to combine genetics and environmental measures to maximize crop growth efficiency. However, the terabytes of image data require new computational methods to extract useful information. Machine learning algorithms are effective in recognizing select parts of images, but they require high-quality data curated by people to train them, a process that can be laborious and costly. We examined how well crowdsourcing works in providing training data for plant phenomics, specifically, segmenting a corn tassel – the male flower of the corn plant – from the often-cluttered images of a cornfield. We provided images to students and to Amazon MTurkers, the latter being an on-demand workforce brokered by Amazon.com and paid on a task-by-task basis. We report on best practices in crowdsourcing image labeling for phenomics, and compare the different groups on measures such as fatigue and accuracy over time. We find that crowdsourcing is a good way of generating quality labeled data, rivaling that of experts.


2015 ◽  
Vol 15 (2) ◽  
pp. 63-74 ◽  
Author(s):  
Ran Jin ◽  
Chunhai Kou ◽  
Ruijuan Liu

Abstract Traditional community detection algorithms are easily affected by noise and outliers. We therefore propose to leverage a clustering fusion method to improve the results of community detection. There are typically two issues in clustering ensembles: how to generate diverse and efficient clustering members, and how to combine the results of all members. Specifically: (1) considering the time-evolving nature of real-world networks, we propose to generate clustering members from snapshots of the network, on which split-based clustering algorithms are performed; (2) considering the difference between the distribution of cluster centers in each clustering member and the actual distribution, we ensemble the results using a maximum-likelihood method. Moreover, we conduct experiments to show that our method can discover high-quality communities.
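The paper ensembles its members with a maximum-likelihood method; as a simpler stand-in that illustrates how several clustering members can be combined, the sketch below builds a co-association matrix from the members and extracts a consensus clustering from it.

```python
# Simplified consensus-clustering sketch (co-association approach, shown only
# to illustrate combining clustering members; not the paper's ML estimator).
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def consensus(memberships, n_clusters):
    """memberships: list of label arrays, one per clustering member."""
    n = len(memberships[0])
    co = np.zeros((n, n))
    for labels in memberships:
        labels = np.asarray(labels)
        co += (labels[:, None] == labels[None, :]).astype(float)
    co /= len(memberships)                 # co-association frequencies
    dist = 1.0 - co
    np.fill_diagonal(dist, 0.0)
    Z = linkage(squareform(dist, checks=False), method="average")
    return fcluster(Z, t=n_clusters, criterion="maxclust")

# Three noisy members over six nodes; the consensus recovers two groups.
members = [[0, 0, 0, 1, 1, 1],
           [0, 0, 1, 1, 1, 1],
           [0, 0, 0, 0, 1, 1]]
print(consensus(members, n_clusters=2))
```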


2019 ◽  
Author(s):  
Jochen M. Reichel ◽  
Thomas Vomhof ◽  
Jens Michaelis

Abstract We investigate the influence of different accuracy-detection rate trade-offs on image reconstruction in single-molecule localization microscopy (SMLM). Our main focus is the image artifacts experienced when using low localization accuracy, especially in the presence of sample drift and inhomogeneous background. In this context, we present newly developed SMLM software termed FIRESTORM, which is optimized for high-accuracy reconstruction. For our analysis we used in silico SMLM data and compared the reconstructed images to the ground truth data. We observe two distinguishable reconstruction populations, of which only one shows the desired localization behavior.

