Ht-index for empirical evaluation of the sampled graph-based Discrete Pulse Transform

2020 ◽  
Vol 32 (2) ◽  
Author(s):  
Mark De Lancey ◽  
Inger Fabris-Rotelli

The Discrete Pulse Transform (DPT) decomposes a signal into pulses, with the most recent and effective implementation being a graph-based algorithm called the Roadmaker’s Pavage. Although this implementation is efficient, its theoretical structure results in a slow, deterministic algorithm. This paper examines the use of the spectral domain of graphs and designs graph filter banks to downsample the signal within the algorithm, investigating the extent to which this speeds it up. Because converting graph signals to the spectral domain is costly, estimation for filter banks is examined, as well as the design of a reusable filter bank. The sampled version requires hyperparameters to reconstruct the same image textures as the original algorithm, preventing a large-scale study. Here we provide an objective and efficient way of assessing how closely the results of the original algorithm and our proposed Filtered Roadmaker’s Pavage agree. The method makes use of the Ht-index, which separates the distribution of information into scale intervals. Empirical evaluation on benchmark datasets shows that the proposed algorithm consistently runs faster and uses fewer computational resources, while achieving a positive SSIM with low variance. This provides an informative and faster approximation to the nonlinear DPT, a property not standardly achievable.
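As a rough illustration of the Ht-index referenced above, the sketch below follows the standard head/tail-breaks definition: a distribution is split at its mean repeatedly, for as long as the "head" (values above the mean) stays a minority, and the number of splits gives the index. The `head_fraction_limit` cutoff and the Pareto test data are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def ht_index(values, head_fraction_limit=0.5):
    """Head/tail-breaks Ht-index of a value distribution (illustrative sketch).

    Counts how many times the values can be split at their mean while the
    head (values above the mean) remains a minority of the data.
    """
    values = np.asarray(values, dtype=float)
    ht = 1
    while values.size > 1:
        head = values[values > values.mean()]
        if head.size == 0 or head.size / values.size >= head_fraction_limit:
            break
        ht += 1
        values = head
    return ht

# Pulse-size distributions from a DPT-style decomposition are typically
# heavy-tailed; a heavy-tailed sample therefore yields a larger Ht-index.
rng = np.random.default_rng(0)
print(ht_index(rng.pareto(1.5, 10_000) + 1))
```

Applied to the pulse sizes of the original and the filtered decompositions, comparable Ht-index values would indicate that information is distributed over scale intervals in a similar way.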

Author(s):  
Tomer Lange ◽  
Joseph (Seffi) Naor ◽  
Gala Yadgar

Flash-based solid state drives (SSDs) have gained a central role in the infrastructure of large-scale datacenters, as well as in commodity servers and personal devices. The main limitation of flash media is its inability to support update-in-place: after data has been written to a physical location, it has to be erased before new data can be written to it. Moreover, SSDs support read and write operations at the granularity of pages, while erasures are performed on entire blocks, which often contain hundreds of pages. When erasing a block, any valid data it stores must be rewritten to a clean location. Since an SSD eventually wears out as the number of erasures grows, the efficiency of the management algorithm has a significant impact on its endurance. In this paper we first formally define the SSD management problem. We then explore this problem from an algorithmic perspective, considering it in both offline and online settings. In the offline setting, we present a near-optimal algorithm that, given any input, performs a negligible number of rewrites (relative to the input length). We also discuss the hardness of the offline problem. In the online setting, we first consider algorithms that have no prior knowledge about the input. We prove that no deterministic algorithm outperforms the greedy algorithm in this setting, and discuss the possible benefit of randomization. We then augment our model, assuming that each request for a page arrives with a prediction of the next time the page is updated. We design an online algorithm that uses such predictions, and show that its performance improves as the prediction error decreases. We also show that the performance of our algorithm is never worse than that guaranteed by the greedy algorithm, even when the prediction error is large. We complement our theoretical findings with an empirical evaluation of our algorithms, comparing them with the state-of-the-art scheme. The results confirm that our algorithms exhibit improved performance for a wide range of input traces.
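For readers unfamiliar with the greedy baseline mentioned above, the toy simulation below captures its core rule: when no clean page remains, erase the block holding the fewest valid pages and rewrite that valid data elsewhere. The block geometry, over-provisioning ratio, and random workload are illustrative assumptions, not the authors' model or implementation.

```python
import random

PAGES_PER_BLOCK = 64
NUM_BLOCKS = 32                   # physical blocks
LOGICAL_PAGES = 1800              # < 32 * 64 physical pages, i.e. some over-provisioning

class GreedyFTL:
    """Toy greedy flash translation layer (illustrative sketch)."""

    def __init__(self):
        self.valid = [set() for _ in range(NUM_BLOCKS)]   # valid logical pages per block
        self.used = [0] * NUM_BLOCKS                      # written slots, valid or stale
        self.where = {}                                   # logical page -> block index
        self.erasures = 0

    def write(self, page):
        old = self.where.pop(page, None)
        if old is not None:
            self.valid[old].discard(page)                 # previous copy becomes stale
        blk = self._free_block()
        self.valid[blk].add(page)
        self.used[blk] += 1
        self.where[page] = blk

    def _free_block(self):
        for b in range(NUM_BLOCKS):
            if self.used[b] < PAGES_PER_BLOCK:
                return b
        # No clean page anywhere: greedily reclaim the block with the fewest
        # valid pages and rewrite the still-valid data it holds.
        victim = min(range(NUM_BLOCKS), key=lambda b: len(self.valid[b]))
        survivors = list(self.valid[victim])
        self.valid[victim].clear()
        self.used[victim] = 0
        self.erasures += 1
        for p in survivors:
            del self.where[p]
            self.write(p)
        return self._free_block()

random.seed(0)
ftl = GreedyFTL()
for _ in range(50_000):
    ftl.write(random.randrange(LOGICAL_PAGES))
print("block erasures:", ftl.erasures)
```

The erasure and rewrite counts produced by such a simulation are the kind of cost measures the paper's offline and online analyses reason about.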


2021 ◽  
Vol 14 (11) ◽  
pp. 2327-2340
Author(s):  
Side Li ◽  
Arun Kumar

Many applications that use large-scale machine learning (ML) increasingly prefer different models for subgroups (e.g., countries) to improve accuracy, fairness, or other desiderata. We call this emerging popular practice learning over groups, analogizing to GROUP BY in SQL, albeit for ML training instead of SQL aggregates. From the systems standpoint, this practice compounds the already data-intensive workload of ML model selection (e.g., hyperparameter tuning). Often, thousands of models may need to be trained, necessitating high-throughput parallel execution. Alas, most ML systems today focus on training one model at a time or, at best, parallelizing hyperparameter tuning. This status quo leads to resource wastage, low throughput, and high runtimes. In this work, we take the first step towards enabling and optimizing learning over groups from the data systems standpoint for three popular classes of ML: linear models, neural networks, and gradient-boosted decision trees. Analytically and empirically, we compare standard approaches to execute this workload today: task parallelism and data parallelism. We find neither is universally dominant. We put forth a novel hybrid approach we call grouped learning that avoids redundancy in communications and I/O using a novel form of parallel gradient descent we call Gradient Accumulation Parallelism (GAP). We prototype our ideas in a system we call Kingpin, built on top of existing ML tools and the flexible massively parallel runtime Ray. An extensive empirical evaluation on large ML benchmark datasets shows that Kingpin matches or is 4x to 14x faster than state-of-the-art ML systems, including Ray's native execution and PyTorch DDP.
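The workload itself, independent of the paper's grouped-learning and GAP optimizations, can be illustrated with the task-parallel baseline the authors compare against: partition rows by a group key and train one model per group as an independent task. The column names, the scikit-learn estimator, and the joblib process pool below are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from joblib import Parallel, delayed
from sklearn.linear_model import LogisticRegression

def fit_group(key, frame, feature_cols, label_col):
    # Each group's model is trained independently of all other groups.
    model = LogisticRegression(max_iter=1000)
    model.fit(frame[feature_cols].to_numpy(), frame[label_col].to_numpy())
    return key, model

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "country": rng.choice(["US", "IN", "BR"], size=3000),
    "x1": rng.normal(size=3000),
    "x2": rng.normal(size=3000),
})
df["y"] = (df["x1"] + 0.5 * rng.normal(size=3000) > 0).astype(int)

# Task-parallel execution: one training job per GROUP BY key.
models = dict(
    Parallel(n_jobs=3)(
        delayed(fit_group)(key, group, ["x1", "x2"], "y")
        for key, group in df.groupby("country")
    )
)
print(sorted(models))
```

In this baseline each task moves its own slice of the data independently; the paper's grouped learning targets exactly the kind of communication and I/O redundancy that such per-group tasks incur.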


2021 ◽  
pp. 095679762097751
Author(s):  
Li Zhao ◽  
Jiaxin Zheng ◽  
Haiying Mao ◽  
Xinyi Yu ◽  
Jiacheng Ye ◽  
...  

Morality-based interventions designed to promote academic integrity are being used by educational institutions around the world. Although many such approaches have a strong theoretical foundation and are supported by laboratory-based evidence, they often have not been subjected to rigorous empirical evaluation in real-world contexts. In a naturalistic field study (N = 296), we evaluated a recent research-inspired classroom innovation in which students are told, just prior to taking an unproctored exam, that they are trusted to act with integrity. Four university classes were assigned to a proctored exam or one of three types of unproctored exam. Students who took unproctored exams cheated significantly more, which suggests that it may be premature to implement this approach in college classrooms. These findings point to the importance of conducting ecologically valid and well-controlled field studies that translate psychological theory into practice when introducing large-scale educational reforms.


2021 ◽  
Author(s):  
Parsoa Khorsand ◽  
Fereydoun Hormozdiari

Large-scale catalogs of common genetic variants (including indels and structural variants) are being created using data from second- and third-generation whole-genome sequencing technologies. However, the genotyping of these variants in newly sequenced samples is a nontrivial task that requires extensive computational resources. Furthermore, current approaches are mostly limited to only specific types of variants and are generally prone to various errors and ambiguities when genotyping complex events. We propose an ultra-efficient approach for genotyping any type of structural variation that is not limited by the shortcomings and complexities of current mapping-based approaches. Our method, Nebula, utilizes changes in the counts of k-mers to predict the genotypes of structural variants. We show that Nebula is not only an order of magnitude faster than mapping-based approaches for genotyping structural variants, but also has accuracy comparable to state-of-the-art approaches. Furthermore, Nebula is a generic framework not limited to any specific type of event. Nebula is publicly available at https://github.com/Parsoa/Nebula.
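A toy sketch of the general idea behind k-mer-count genotyping is given below; it is not Nebula's actual statistical model, k-mer selection, or thresholds. Signature k-mers of the alternate allele are counted in the reads, normalized by sequencing depth, and the resulting allele-fraction estimate is mapped to a genotype call.

```python
from collections import Counter

K = 21

def kmers(seq, k=K):
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def genotype(alt_kmers, reads, depth):
    """Illustrative genotype call from counts of ALT-specific k-mers.

    alt_kmers: set of k-mers present only on the alternate allele
    reads:     iterable of read sequences
    depth:     expected per-base coverage of the sample
    """
    counts = Counter()
    for read in reads:
        for km in kmers(read):
            if km in alt_kmers:
                counts[km] += 1
    mean_count = sum(counts.values()) / max(len(alt_kmers), 1)
    fraction = mean_count / depth        # ~0 for 0/0, ~0.5 for 0/1, ~1 for 1/1
    if fraction < 0.25:
        return "0/0"
    if fraction < 0.75:
        return "0/1"
    return "1/1"
```

Because only k-mer counting is needed, no read mapping is performed, which is where the order-of-magnitude speed advantage over mapping-based genotypers comes from.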


2021 ◽  
Vol 28 ◽  
pp. 469-473
Author(s):  
Amir Miraki ◽  
Hamid Saeedi-Sourck ◽  
Nicola Marchetti ◽  
Arman Farhang

2019 ◽  
Vol 17 (06) ◽  
pp. 947-975 ◽  
Author(s):  
Lei Shi

We investigate distributed learning with a coefficient-based regularization scheme under the framework of kernel regression methods. Compared with classical kernel ridge regression (KRR), the algorithm under consideration does not require the kernel function to be positive semi-definite and hence provides a simple paradigm for designing indefinite kernel methods. The distributed learning approach partitions a massive data set into several disjoint data subsets, and then produces a global estimator by averaging the local estimators obtained on each data subset. The ease of constructing partitions, together with running the algorithm on each subset in parallel, leads to a substantial reduction in computation time compared with the standard approach of running the original algorithm on the entire sample. We establish the first minimax optimal rates of convergence for the distributed coefficient-based regularization scheme with indefinite kernels. We thus demonstrate that, compared with distributed KRR, the concerned algorithm is more flexible and effective in regression problems for large-scale data sets.
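A minimal numpy sketch of the scheme is given below: each disjoint subset is fitted with a coefficient-based (l2-penalized in the coefficients) kernel estimator, which remains well posed even when the kernel is indefinite, and the global estimator averages the local predictions. The sigmoid kernel, the regularization parameter, and the number of subsets are illustrative assumptions.

```python
import numpy as np

def indefinite_kernel(X, Z, sigma=1.0):
    # Sigmoid (tanh) kernel, a classical example that is not positive semi-definite.
    return np.tanh(X @ Z.T / sigma)

def fit_local(X, y, lam):
    # Coefficient-based regularization: min_a ||K a - y||^2 / n + lam * ||a||^2,
    # a ridge problem in the coefficients that does not require K to be PSD.
    K = indefinite_kernel(X, X)
    n = len(y)
    alpha = np.linalg.solve(K.T @ K / n + lam * np.eye(n), K.T @ y / n)
    return X, alpha

def predict(local_fits, Xnew):
    # Global estimator: average of the local estimators' predictions.
    preds = [indefinite_kernel(Xnew, Xtr) @ alpha for Xtr, alpha in local_fits]
    return np.mean(preds, axis=0)

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(3000, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=3000)
subsets = np.array_split(rng.permutation(3000), 10)     # disjoint partition of the data
fits = [fit_local(X[idx], y[idx], lam=1e-3) for idx in subsets]
print(predict(fits, np.linspace(-3, 3, 5).reshape(-1, 1)))
```

Each local solve involves only a subset-sized linear system rather than the full n-by-n problem, which is where the computational saving over the undistributed algorithm comes from.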


Smart Cities ◽  
2021 ◽  
Vol 4 (2) ◽  
pp. 662-685
Author(s):  
Stephan Olariu

Under present-day practices, the vehicles on our roadways and city streets are mere spectators that witness traffic-related events without being able to participate in mitigating their effects. This paper lays the theoretical foundations of a framework for harnessing the on-board computational resources in vehicles stuck in urban congestion in order to assist transportation agencies with preventing or dissipating congestion through large-scale signal re-timing. Our framework is called VACCS: Vehicular Crowdsourcing for Congestion Support in Smart Cities. What makes this framework unique is that we suggest that in such situations the vehicles have the potential to cooperate with various transportation authorities to solve problems that would otherwise either take an inordinate amount of time to solve or go unsolved for lack of adequate municipal resources. VACCS offers direct benefits to both the driving public and the Smart City. By developing timing plans that respond to current traffic conditions, overall traffic flow will improve, carbon emissions will be reduced, and the economic impacts of congestion on citizens and businesses will be lessened. It is expected that drivers will be willing to donate under-utilized on-board computing resources in their vehicles to develop improved signal timing plans in return for the direct benefits of time savings and reduced fuel consumption costs. VACCS allows the Smart City to dynamically respond to traffic conditions while simultaneously reducing investments in the computational resources that would be required for traditional adaptive traffic signal control systems.


Author(s):  
Siva Reddy ◽  
Mirella Lapata ◽  
Mark Steedman

In this paper we introduce a novel semantic parsing approach to query Freebase in natural language without requiring manual annotations or question-answer pairs. Our key insight is to represent natural language via semantic graphs whose topology shares many commonalities with Freebase. Given this representation, we conceptualize semantic parsing as a graph matching problem. Our model converts sentences to semantic graphs using CCG and subsequently grounds them to Freebase guided by denotations as a form of weak supervision. Evaluation experiments on a subset of the Free917 and WebQuestions benchmark datasets show our semantic parser improves over the state of the art.


Author(s):  
Ashoka Jayawardena ◽  
Paul Kwan

In this paper, we focus on the design of oversampled filter banks and the resulting framelets. The framelets obtained exhibit improved shift-invariance properties over the decimated wavelet transform. Shift invariance has applications in many areas, particularly denoising, coding, and compression. Our contribution here is on filter bank completion. In addition, we propose novel factorization methods to design wavelet filters from given scaling filters.
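The shift-invariance benefit of an oversampled (undecimated) bank can be seen with a trivial two-channel Haar pair: with no decimation, shifting the input shifts every subband by the same amount, and perfect reconstruction is immediate. The filters below are purely illustrative and are not the framelets designed in the paper.

```python
import numpy as np

def analysis(x):
    lo = 0.5 * (x + np.roll(x, 1))   # lowpass channel, no downsampling
    hi = 0.5 * (x - np.roll(x, 1))   # highpass channel, no downsampling
    return lo, hi

def synthesis(lo, hi):
    return lo + hi                   # perfect reconstruction for this pair

rng = np.random.default_rng(0)
x = rng.normal(size=64)
lo, hi = analysis(x)
assert np.allclose(synthesis(lo, hi), x)

# Shift the input by one sample: every subband shifts by exactly one sample,
# which a critically decimated wavelet transform does not guarantee.
lo_s, hi_s = analysis(np.roll(x, 1))
assert np.allclose(lo_s, np.roll(lo, 1)) and np.allclose(hi_s, np.roll(hi, 1))
```

The redundancy (here a factor of two) is the price paid for this invariance, which is why applications such as denoising and coding accept oversampled designs.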


2021 ◽  
Vol 15 (6) ◽  
pp. 1-20
Author(s):  
Zhe Chen ◽  
Aixin Sun ◽  
Xiaokui Xiao

Community detection on network data is a fundamental task, and has many applications in industry. Network data in industry can be very large, with incomplete and complex attributes, and, more importantly, growing. This calls for a community detection technique that is able to handle both attribute and topological information on large-scale networks, and that is also incremental. In this article, we propose inc-AGGMMR, an incremental community detection framework that is able to effectively address the challenges arising from scalability, mixed attributes, incomplete values, and the evolution of the network. Through the construction of an augmented graph, we map attributes into the network by introducing attribute centers and belongingness edges. The communities are then detected by modularity maximization. During this process, we adjust the weights of belongingness edges to balance the contributions of attribute and topological information to the detection of communities. The weight adjustment mechanism enables incremental updates of the community membership of all vertices. We evaluate inc-AGGMMR on five benchmark datasets against eight strong baselines. We also provide a case study that incrementally detects communities on a PayPal payment network containing users with transactions. The results demonstrate inc-AGGMMR’s effectiveness and practicability.
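The augmented-graph construction can be illustrated with a small networkx sketch; this is not the authors' inc-AGGMMR implementation, and the dataset, the fixed belongingness weight w, and the greedy modularity routine are illustrative assumptions. Each attribute value becomes an attribute-center node, vertices receive weighted belongingness edges to the centers of their attribute values, and communities are then found by modularity maximization on the augmented graph.

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

def augmented_graph(G, attributes, w=0.5):
    """Add one attribute-center node per attribute value and connect each
    vertex to the centers of its values with belongingness edges of weight w."""
    A = G.copy()
    for node, values in attributes.items():
        for value in values:
            center = ("attr", value)
            A.add_edge(node, center, weight=w)
    return A

G = nx.karate_club_graph()
nx.set_edge_attributes(G, 1.0, "weight")                 # topological edges
attrs = {n: [G.nodes[n]["club"]] for n in G}             # one categorical attribute per vertex
A = augmented_graph(G, attrs, w=0.5)

# Modularity maximization on the augmented graph; report only original vertices.
communities = greedy_modularity_communities(A, weight="weight")
print([sorted(n for n in c if not isinstance(n, tuple)) for c in communities])
```

Raising or lowering w shifts the balance between attribute agreement and topological structure in the detected communities, which is the role the weight adjustment mechanism plays in the framework described above.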

