ISMM 2008

ISMM is a forum for research in management of dynamically allocated memory. Areas of interest include but are not limited to: explicit storage allocation and deallocation; garbage collection algorithms and implementations; compiler analyses to aid memory management; interactions with languages, operating systems, and hardware, especially the memory system; memory management related tools; and empirical studies of allocation and referencing behavior in programs that make significant use of dynamic memory. ISMM solicits full-length papers on all areas of memory management. Survey papers that present an aspect of memory management with a new coherence are also welcome.

Registration

Registration is Open -- early registration deadline is May 14, 2008
Thanks to the generosity of our sponsors, ISMM is able to offer particularly attractive registration fees for students. We hope to encourage as wide a participation as possible

Hotel Reservations -- deadline for the conference rate is May 7, 2008

PROGRAMME

Saturday

9:00-9:15 Welcome

9:15-10:15 Keynote
David Bacon (IBM TJ Watson Research Center)

10:15-10:45 Break

10:45-12:15 Session 1: Garbage Collection & Resource Management Chair: Steve Blackburn
The CLOSER: Automating Resource Management in Java, Isil Dillig, Thomas Dillig, Eran Yahav and Satish Chandra
Parallel Generational-Copying Garbage Collection with a Block-Structured Heap, Simon Marlow, Tim Harris, Roshan James and Simon Peyton Jones
Limits of Parallel Marking Garbage Collection, Fridtjof Siebert

12:15-14:00 Lunch

14:00-15:30 Session 2: Domain-Specific Memory Management I Chair: David Detlefs
Efficient Dynamic Heap Allocation of Scratch-Pad Memory, Ross McIlroy, Peter Dickman and Joe Sventek
Supporting Superpage Allocation without Additional Hardware Support, Mel Gorman and Patrick Healy
Memory Management for Self-Adjusting Computation, Matthew Hammer and Umut Acar

15:30-16:00 Break

16:00-17:00 Session 3: Domain-Specific Memory Management II Chair: David Gay
Runtime Support for Region-Based Memory Management in Mercury, Quan Phan, Gerda Janssens and Zoltan Somogyi
A Reference Counting Garbage Collection Algorithm for Cyclical Functional Programming, Baltasar Trancon y Widemann

17:10-18:00 Student Lightning Talks Chair: Witawas Srisa-an
PhD students will give brief presentations (8 minutes + 2 minutes for questions) of their work in progress. This will provide an opportunity for students to get supportive feedback, to gain experience presenting to a major audience and exposure for themselves and their work, and will encourage interaction with fellow students and the community. A prize will be awarded for the best presentation. Students wishing to make a presentation should mail a brief abstract to Witawas Srisa-an by 30 May 2008

Sunday

8:30-10:00 Session 4: Locality, Performance and Optimization Chair: Richard Jones
Path Specialization: Reducing Phased Execution Overheads, Filip Pizlo, Erez Petrank and Bjarne Steensgaard
Sampling-based Program Locality Approximation, Yutao Zhong and Wentao Chang
Memory Pooling Assisted Data Splitting (MPADS), Stephen Curial, Peng Zhao, Jose Nelson Amaral, Yaoqing Gao, Shimin Cui, Raul Silvera and Roch Archambault

10:00-10:30 Break

10:30-11:30 Wild and crazy ideas Chair: Tony Hosking
This session is a fun and informal forum for discussion of interesting ideas in the area of memory management. Presenters are given just 4 minutes in which to present their idea and 1 minute to take questions. Prizes are awarded for the idea most crazy and the one most worthy of implementation! Please contribute your wild and crazy ideas.
Please contact the WACI chair, Tony Hosking, before the event.

11:30-12:00 ISMM: Future plans

12:00-13:30 Lunch

13:30-15:00 Session 5: Heap Measurement and Analysis I Chair: Bjarne Steensgaard
No Bit Left Behind: Limits of Heap Data Compression, Jennifer B. Sartor, Martin Hirzel and Kathryn S. McKinley
A Study of Java Object Demographics, Richard Jones and Chris Ryder
Practical Memory Leak Detector Based on Parameterized Procedural Summaries, Yungbum Jung and Kwangkeun Yi

15:00-15:30 Break

15:30-16:30 Session 6: Heap Measurement and Analysis II Chair: Kathrun McKinley
Parametric Prediction of Heap Memory Requirements, Víctor Braberman, Federico Fernández, Diego Garbervetsky and Sergio Yovine
Analysing Memory Resource Bounds for Bytecode Programs, Wei-Ngan Chin, Huu Hai Nguyen, Corneliu Popeea and Shengchao Qin

16:30-17:00 Wrap-up

Abstracts

The CLOSER: Automating Resource Management in Java, Isil Dillig, Thomas Dillig, Eran Yahav and Satish Chandra

While automatic garbage collection has relieved programmers from manual memory management in Java-like languages, managing resources remains a considerable burden and a source of performance problems. In this paper, we present a novel technique for automatic resource management based on static approximation of resource lifetimes. Our source-to-source transformation tool, Closer, automatically transforms program code to guarantee that resources are properly disposed and handles arbitrary resource usage patterns. Closer generates code for directly disposing any resource whose lifetime can be statically determined; when this is not possible, Closer inserts conditional disposal code based on interest-reference counts that identify when the resource can be safely disposed. The programmer is only required to identify which types should be treated as resources, and what method to invoke to dispose each such resource. We successfully applied Closer on a moderate-sized graphics application that requires complex reasoning for resource management.
[pdf]
Parallel Generational-Copying Garbage Collection with a Block-Structured Heap, Simon Marlow, Tim Harris, Roshan James and Simon Peyton Jones

We present a parallel generational-copying garbage collector implemented for the Glasgow Haskell Compiler. We use a block-structured memory allocator, which provides a natural granularity for dividing the work of GC between many threads, leading to a simple yet effective method for parallelising copying GC. The results are encouraging: we demonstrate wall-clock speedups of on average a factor of 2 in GC time on a commodity 4-core machine with no programmer intervention, compared to our best sequential GC.
[ppt]
Limits of Parallel Marking Garbage Collection, Fridtjof Siebert

More and more, parallel multicore systems will be used even in low-end devices such as embedded controllers that require realtime guarantees. When garbage collection is used in these systems, parallel or concurrent garbage collection brings important performance advantages. In the context of realtime systems, it has to be shown that a parallel garbage collector implementation not only performs well in most cases, but guarantees on its performance in the worst case are required.
This paper analyses the difficulties a parallel mark-and-sweep garbage collector faces during a parallel mark phase. The performance of such a garbage collector degrades when only some of the available processors can perform scanning work in the mark phase. Whenever the grey set contains fewer elements than the number of available processors, some processors may be stalled waiting for new objects to be added to the grey set. This paper gives an upper bound for the number of stalls that may occur as a function of simple properties of the memory graph.
This upper bound is then determined for the Java applications that are part of the SPECjvm98 benchmark suite and the theoretical worst-case scalability of a parallel mark phase is analysed. The presented approach is then applied to a Java virtual machine that has uniform mark steps, which at first results in poor worst-case scalability. A small change in the implementation is then proposed and analysed to achieve good scalability even in the worst case.
[pdf]
Efficient Dynamic Heap Allocation of Scratch-Pad Memory, Ross McIlroy, Peter Dickman and Joe Sventek

An increasing number of processor architectures support scratch-pad memory - software managed on-chip memory. Scratch-pad memory provides low latency data storage, like on-chip caches, but under explicit software control. The simple design and predictable nature of scratchpad memories has seen them incorporated into a number of embedded and real-time system processors. They are also employed by multi-core architectures to isolate processor core local data and act as low latency inter-core shared memory.
Managing scratch-pad memory by hand is time consuming, error prone and potentially wasteful; tools that automatically manage this memory are essential for its use by general purpose software. While there has been promising work in compile time allocation of scratch-pad memory, there will always be applications which require run-time allocation. Modern dynamic memory management techniques are too heavy-weight for scratch-pad management.
This paper presents the Scratch-Pad Memory Allocator, a light-weight memory management algorithm, specifically designed to manage small on-chip memories. This algorithm uses a variety of techniques to reduce its memory footprint while still remaining effective, including: representing memory both as fixed-sized blocks and variable-sized regions within these blocks; coding of memory state in bitmap structures; and exploiting the layout of adjacent regions to dispense with boundary tags for split and coalesce operations. We compare the performance of this allocator against Doug Lea's malloc implementation for the management of core-local and inter-core shared scratchpad memories under real world memory traces. This algorithm manages small memories efficiently and scales well under load when multiple competing cores access shared memory.
[ppt]
Supporting Superpage Allocation without Additional Hardware Support, Mel Gorman and Patrick Healy

Today, many modern processors support more than one page size. The larger pages, called superpages, have been identified as one means of reducing the time spent servicing translation lookaside buffer (TLB) misses in the early 1990s by increasing TLB reach. Widespread usage of superpages has been limited by the requirement that superpages consist of physically contiguous and naturally-aligned small pages. This makes external fragmentation a serious problem for an operating system, one that is almost non-existent when processes use only one page size. Hardware solutions to mitigate this limitation such as sub-blocking, shadow page-tables and a variety of hybrid solutions have not seen wide-spread adoption. This has curtailed automatic superpage support as it is known that superpage availability will decrease during the system's lifetime as external fragmentation grows.
This paper presents a placement policy for an operating system's physical page allocator to mitigate external fragmentation problems by grouping pages based on the system's ability to relocate the data. Secondly, the necessary changes to the page reclamation algorithm for it to be contiguity-aware are described while minimising impact to the reclamation algorithms' normal decisions. The performance impact on different machine types is illustrated and it is shown that the superpage allocation success rate is improved. These mechanisms are complementary to any of the hardware solutions proposed in the past.
Memory Management for Self-Adjusting Computation, Matthew Hammer and Umut Acar

The cost of reclaiming space with traversal-based garbage collection is inversely proportional to the amount of free memory, i.e., O(1/(1-f)), where f is the fraction of memory that is live. Consequently, the cost of garbage collection can be very high when the size of the live data remains large relative to the available free space. Intuitively, this is because allocating a small amount of memory space will require the garbage collector to traverse a significant fraction of the memory only to discover little garbage. This is unfortunate because in some application domains the size of the memory-resident data can be generally high. This can cause high GC overheads, especially when generational assumptions do not hold. One such application domain is self-adjusting computation, where computations use memory-resident execution traces in order to respond to changes to their state (e.g., inputs) efficiently.
This paper proposes memory-management techniques for self-adjusting computation that remain efficient even when the size of the live data is large. More precisely, the proposed techniques guarantee O(1) amortized cost for each reclaimed memory object. We propose a set of primitives for self-adjusting computation that support the proposed memory management techniques. The primitives provide an operation for allocating memory; we reclaim unused memory automatically.
We implement a library for supporting the primitives in the C language and perform an experimental evaluation. Our experiments show that the approach can be implemented with reasonably small constant-factor overheads and that the programs written using the library behave optimally. Compared to previous implementations, we measure up to an order of magnitude improvement in performance and up to a 75% reduction in space usage.
[pdf]
Runtime Support for Region-Based Memory Management in Mercury, Quan Phan, Gerda Janssens and Zoltan Somogyi

Applying region-based memory management (RBMM) to logic programming languages poses a special challenge: backtracking can require regions removed during forward execution to be ``resurrected'', and any memory allocated during a computation that has been backtracked over must be recovered promptly, without waiting for the regions involved to come to the end of their life.
In this paper, we describe how we implemented runtime support for RBMM in the logic programming language Mercury, whose specialized implementation of the language constructs involved in backtracking required equally specialized support.
Our benchmark Mercury programs run about 25% faster on average with RBMM than with the usual Boehm garbage collector, and for some programs, RBMM achieves optimal memory consumption.
[odp] [pdf]
A Reference Counting Garbage Collection Algorithm for Cyclical Functional Programming, Baltasar Trancon y Widemann

Reference-counting garbage collection is known to have problems with the collection of cyclically connected data. There are two historically significant styles of cycle-aware algorithms: The style of Brownbridge that maintains a subset of marked edges and the invariant that every cycle contains at least one marked edge, and the style of Martinez-Lins-Wachenchauzer (MLW) that involves local mark-and-scan procedures to detect cycles. The former is known to be difficult to design and implement correctly, and the latter to have pathological efficiency for a number of very typical situations. We present a novel algorithm that combines both approaches to obtain reasonably efficient local mark-and-scan phases with a marking invariant that is rather cheap to maintain. We demonstrate that the assumptions of this algorithm about mutator activity patterns make it well-suited, but not limited, to a functional programming technique for cyclic data. We evaluate the approach in comparison with simple and more sophisticated MLW algorithms using a simple benchmark based on that functional paradigm.
[pdf]
Path Specialization: Reducing Phased Execution Overheads, Filip Pizlo, Erez Petrank and Bjarne Steensgaard

As garbage collected languages become widely used, the quest for reducing collection overheads becomes essential. In this paper, we propose a compiler optimization called path specialization that shrinks the cost of memory barriers for a wide variety of garbage collectors including concurrent, incremental, and real-time collectors. Path specialization provides a non-trivial decrease in write-barrier overheads and a drastic reduction of read-barrier overheads. It is effective when used with collectors that go through various phases each employing a different barrier behavior, and is most effective for collectors that have an idle phase, in which no barrier activity is required. We have implemented path specialization in the Bartok compiler and runtime for C# and tested it with state-of-the-art concurrent and real-time collectors, demonstrating its efficacy.
[pdf] [ppt]
Sampling-based Program Locality Approximation, Yutao Zhong and Wentao Chang

Reuse signature, or reuse distance pattern, is an accurate model for program memory accessing behaviors. It has been studied and shown to be effective in program analysis and optimizations by many recent works. However, the high overhead associated with reuse distance measurement restricts the scope of its application. This paper explores applying sampling in reuse signature collection to reduce the time overhead. We compare different sampling strategies and show that an enhanced systematic sampling with a uniform coverage of all distance ranges can be used to extrapolate the reuse distance distribution. Based on that analysis, we present a novel sampling method with a measurement accuracy of more than 99%. Our average speedup of reuse signature collection is 7.5 while the best improvement observed is 34. This is the first attempt to utilize sampling in measuring reuse signatures. Experiments with varied programs and instrumentation tools show that sampling has great potential in promoting the practical uses of reuse signatures and enabling more optimization opportunities.
[ppt]
Memory Pooling Assisted Data Splitting (MPADS), Stephen Curial, Peng Zhao, Jose Nelson Amaral, Yaoqing Gao, Shimin Cui, Raul Silvera and Roch Archambault

This paper describes Memory-Pooling-Assisted Data Splitting (MPADS), a framework that combines data structure splitting with memory pooling --- Although it MPADS may call to mind memory padding, a distintion of this framework is that is does not insert padding. MPADS relies on pointer analysis to ensure that splitting is safe and applicable to type-unsafe language. MPADS makes no assumption about type safety. The analysis can identify cases in which the transformation could lead to incorrect code and thus MPADS abandons those cases.
To make data structure splitting efficient in a commercial compiler, MPADS is designed with great attention to reduce the number of instructions required to access the data after the data-structure splitting. Moreover the implementation of MPADS reveals that architecture details should be considered carefully when re-arranging data allocation. For instance one of the most significant gains from the introduction of data-structure splitting in code targetting the IBM POWER architecture is a dramatic decrease in the amount of data prefetched by the hardware prefetch engine without a noticeable decrease in the cache utilization. Triggering fewer hardware prefetch streams frees memory bandwidth and cache space. Fewer prefetching streams also reduce the interference between the data accessed by multiple cores in modern multicore processors.
[ppt]
No Bit Left Behind: Limits of Heap Data Compression, Jennifer B. Sartor, Martin Hirzel and Kathryn S. McKinley

On one hand, the high cost of memory continues to drive demand for memory efficiency on embedded and general purpose computers. On the other hand, programmers are increasingly turning to managed languages like Java for their functionality, programmability, and reliability. Managed languages, however, are not known for their memory efficiency, creating a tension between productivity and performance. This paper examines the sources and types of memory inefficiencies in a set of Java benchmarks. Although prior work has proposed specific heap data compression techniques, they are typically restricted to one model of inefficiency. This paper generalizes and quantitatively compares previously proposed memory-saving approaches and idealized heap compaction. It evaluates a variety of models based on strict and deep object equality, field value equality, removing bytes that are zero, and compressing fields and arrays with a limited number and range of values. The results show that substantial memory reductions are possible in the Java heap. For example, removing bytes that are zero from arrays is particularly effective, reducing the application's memory footprint by 44% on average. We are the first to combine multiple savings models on the heap, which very effectively reduces the application by up to 86%, on average 62%. These results demonstrate that future work should be able to combine a high productivity programming language with memory efficiency.
[ppt]
A Study of Java Object Demographics, Richard Jones and Chris Ryder

Researchers have long strived to exploit program behaviour in order to improve garbage collection efficiency. For example, by using a simple heuristic, generational GC manages short-lived objects well, although longer-lived objects will still be promoted to an older generation and may be processed repeatedly thereafter. In this paper, we provide a detailed study of Java object lifetimes which reveals a richer landscape than the generational view offers. Allocation site has been claimed to be a good predictor for object lifetime, but we show that object lifetime can be categorised more precisely than 'short-lived/long-lived/immortal'. We show that (i) sites allocate objects with lifetimes in only a small number of narrow ranges, and (ii) sites cluster strongly with respect to the lifetime distributions of the objects they allocate. Furthermore, (iii) these clusterings are robust against the size of the input given to the program and (iv) are likely to allocate objects that are live only in particular phases of the program's execution. Finally, we show that, in contrast to previous studies, (v) allocation site alone is not always sufficient as a predictor of object lifetime distribution but one further level of stack context suffices.
[ppt]
Practical Memory Leak Detector Based on Parameterized Procedural Summaries, Yungbum Jung and Kwangkeun Yi

We present a static analyzer that detects memory leaks in C programs. It achieves relatively high accuracy at a relatively low cost on SPEC2000 benchmarks and several open-source software packages, demonstrating its practicality and competitive edge against other reported analyzers: for a set of benchmarks totaling 1,777 KLOCs, it found 332 bugs with 47 additional false positives (a 12.4% false-positive ratio), and the average analysis speed was 720 LOC/sec.
We separately analyze each procedure's memory behavior into a summary that is used in analyzing its call sites. Each procedural summary is parameterized by the procedure's call context so that it can be instantiated at different call sites. What information to capture in each procedural summary has been carefully tuned so that the summary should not lose any common memory-leak-related behaviors in real-world C programs.
Because each procedure is summarized by conventional fixpoint iteration over the abstract semantics (a la abstract interpretation), the analyzer naturally handles arbitrary call cycles from direct or indirect recursive calls.
[pdf]
Parametric Prediction of Heap Memory Requirements, Víctor Braberman, Federico Fernández, Diego Garbervetsky and Sergio Yovine

This work presents a technique to compute symbolic polynomial approximations of the amount of dynamic memory required to safely execute a method without running out of memory, for Java-like imperative programs.
We consider object allocations and deallocations made by the method and the methods it transitively calls. More precisely, given an initial configuration of the stack and the heap, the peak memory consumption is the maximum space occupied by newly created objects in all states along a run from it.
We over-approximate the peak memory consumption using a scoped-memory management where objects are organized in regions associated with the lifetime of methods. We model the problem of computing the maximum memory occupied by any region configuration as a parametric polynomial optimization problem over a polyhedral domain and resort to Bernstein basis to solve it. We apply the developed tool to several benchmarks.
[pdf] [pptx]
Analysing Memory Resource Bounds for Bytecode Programs, Wei-Ngan Chin, Huu Hai Nguyen, Corneliu Popeea and Shengchao Qin

Embedded systems are becoming more widely used but these systems are often resource constrained. Programming models for these systems should take into formal consideration resources such as stack and heap. In this paper, we show how memory resource bounds can be inferred for assembly-level programs. Our inference process captures the memory needs of each method in terms of the symbolic values of its parameters. For better precision, we infer path-sensitive information through a novel guarded expression format. Our current proposal relies on a Presburger solver to capture memory requirements symbolically, and to perform fixpoint analysis for loops and recursion. Apart from safety in memory adequacy, our proposal can provide estimate on memory costs for embedded devices and improve performance via fewer runtime checks against memory bound.
[ppt]

Wild and Crazy presentations

This session was a fun and informal forum for discussion of interesting ideas in the area of memory management. Presenters were given just 4 minutes in which to present their idea and 1 minute to take questions.

Efficient Bump Pointer Allocation?, Dave Detlefs (1) [pptx]
Heap Fragmentation, Dave Detlefs (2) [pptx]
Seeing Is Doing, Richard Jones [ppt]
Fragment the Heap!, Fridtjof Siebert [pdf]
Answering the Hardware Architects, Bjarne Steensgaard [pdf]
Virtual compaction, Emery Berger [ppt]

Sponsors ISMM is grateful for the kind support of its sponsors
Sponsored by Microsoft Research Sponsored by Intel

The 2008 International Symposium on Memory Management

7-8 June, Tucson, Arizona

General Chair

Programme Chair

Programme Committee

Steering Committee

Review Committee