There is more than a decade-long history of using static analysis to find bugs in systems such as Linux. Most of the existing static analyses developed for these systems are simple checkers that find bugs based on pattern matching. Despite the presence of many sophisticated interprocedural analyses, few of them have been employed to improve checkers for systems code due to their complex implementations and poor scalability.
In this article, we revisit the scalability problem of interprocedural static analysis from a “Big Data” perspective. That is, we turn sophisticated code analysis into
Big Data analytics
and leverage novel data processing techniques to solve this traditional programming language problem. We propose
Graspan
, a disk-based parallel graph system that uses an
edge-pair
centric computation model to compute
dynamic transitive closures
on very large program graphs. We develop two backends for Graspan, namely,
Graspan-C
running on CPUs and
Graspan-G
on GPUs, and present their designs in the article. Graspan-C can analyze large-scale systems code on any commodity PC, while, if GPUs are available, Graspan-G can be readily used to achieve orders of magnitude speedup by harnessing a GPU’s massive parallelism.
We have implemented
fully context-sensitive
pointer/alias and dataflow analyses on Graspan. An evaluation of these analyses on large codebases written in multiple languages such as Linux and Apache Hadoop demonstrates that their Graspan implementations are language-independent, scale to millions of lines of code, and are much simpler than their original implementations. Moreover, we show that these analyses can be used to uncover many real-world bugs in large-scale systems code.