Rambles around computer science

Diverting trains of thought, wasting precious time

Sun, 17 Sep 2017

Project suggestions (for Part II and MPhil students)

I have several project ideas to suggest this year. Any of them I could be available to supervise, for the right student (and not too many!).

A caveat: most of these are researchy project ideas. Part II students should read the pink book carefully, and also read my advice about Part II projects from a student's perspective, including the potential dangers of getting too researchy. Nowadays as a researcher, I feel a strong incentive to suggest projects that verge on the over-challenging or overly research-led. We should discuss carefully any project idea before writing the proposal, to ensure it's within your interests and (projected) ability.

See also my rather cryptic and occasionally out-of-date list of other standing suggestions.

An observable OCaml

This is about an OCaml-to-C compiler that allows the generated code to profit from DWARF-based debugging support of C compilers, while still providing a coherent OCaml view of the debugged code. A very nice job of this project was done last year by Cheng Sun, but there is plenty of room for variation; please ask me.

A fuzz-tester over API traces

This is renamed from “DWARF-based”. I've done some more thinking about this project since the linked write-up a few years ago, so let me expand. Most fuzzers work by perturbing the data given to a particular piece of code. This project is about fuzzing library interfaces by perturbing the trace of calls. This may include fuzzing data, but is not limited to it.

A random test case generator for linkers

(This is probably more of an MPhil project than a Part II project, though it could be attempted as either.)

Tools like Csmith have found considerable use in testing C compilers. A similar tool could be useful for linkers, which continue to be buggily extended and reimplemented, e.g. in the LLVM project's LLD linker.. This project would be focused on ELF-based linking on Unix platforms.

Linking is rather simpler than compilation in some respects, but exhibits a large number of possible feature interactions. For example, how do symbol visibilities interact with weak definitions?

One challenge is to generate only link jobs that “should” work. For example, the job should usually (if linking an executable) contain non-weak undefined symbols that lack a matching defined symbol.

Another challenge is to generate code for which some test of behavioural correctness of the output program is not only possible, but also usefully exercises the linker's work in some way: we should obviously generate code that involves cross-module referencing, but it's not clear what else it should do.

A third challenge is to achieve a bias towards “interesting” intersections of features, to improve the probability of finding bugs.

Useful outcomes from a randomized test case would include crashing one or both linkers, or producing divergent behaviours between two existing linkers. Evaluation could be done in terms of coverage of the feature space, and also in bugs found (including known bugs, e.g. in older versions). A starting point would be to extract a corpus of bugs from the change history of LLD.

A testable specification of C and C++ ABI details

Compilers use documented ABI specifications, such as the System V ABI, its x86-64-specific adjunct, and the Itanium C++ ABI, to achieve compatibility at link time and run time. These fix standardised details for representing data, calling code, interfacing with the operating system and achieving link-time effects (such as uniquing of inlineable function instantiations, dynamically interposable bindings, etc..). Given a particular C or C++ source file, what linker symbols should we find in the output binary (.o file), with what attributes and satisfying what properties? What data layouts should be used for various data types? ABIs answer these questions, at least in theory.

Since ABIs are specified only informally and partially, and are not rigorously tested, divergence in these details currently causes compatibility problems between compilers, even if they are intended to implement the same ABIs. This project would involve writing a formal specification of (some of) these details, partly from ABI documentation and partly by observing the output of real compilers, then using differential testing to identify divergences and (perhaps) report bugs in real compilers. This includes such issues as how particular source language features map to particular linker-level features, across differnet code models and when built for static or dynamic linking. Some experimental work will be necessary particularly in the area of extended/non-standard C features (packed structures, bitfields, ...) and C++ (vtables, typeinfo, ...), with differential testing against gcc and LLVM toolchains.

A lightweight user-space virtualisation system using the dynamic linker

The dynamic linker (ld.so) is a critical component of commodity Unix platforms. It is used to interpret almost all binaries on the system, and contains the earliest-run user-space instructions. It is therefore a potential interface for virtualisation: a modified dynamic linker can provide unmodified application code with the illusion of an extended kernel interface and/or differing instruction semantics. As with hypervisors, this can be done with either trap-and-emulate or binary-rewriting techniques, or both. This project concerns building a usable dynamic linker which virtualises the user-space environment on a mainstream Linux or FreeBSD platform and adds one or more “added value” services or tooling features. The choice of added value is somewhat free, but a simple suggestion is a memory profiler, similar to Valgrind's massif but offering much lower run-time overhead. Another might be a filesystem extension something like flcow or fakeroot.

Either way, the project will involve implementing an extended version of some existing dynamic linker. To be useful, this should include (1) self-propagation support, so that child processes are executed under the new dynamic linker; (2) instrumentation of the necessary instructions or system calls to deliver the added value; and (3) an implementation of the added value itself. The starting point will include, should the student find it useful, a basic proof-of-concept tarball for extending the GNU C library's dynamic linker on Linux. This project will demonstrate understanding of operating systems, assembly-level programming, compilers, virtualisation, and instrumentation-based programming tools. There are lots of possible extensions, including instruction emulation, system call interposition (particularly challenging in multithreaded contexts), supporting statically linked binaries, supporting setuid binaries, adding simple instrumentation primitives (e.g. call relinking, GOT/PLT redirection), and integrating heavyweight instrumentation systems (by plumbing in a system like Valgrind, DynamoRIO or FastBT).

A testable formal specification for dynamic linking

(This one is probably more of an MPhil project.)

Dynamic linking exhibits subtle variations both across different Unix platforms (such as GNU/Linux, FreeBSD, OpenBSD etc.), and across different dynamic linkers available even within just the GNU/Linux platform (such as musl, uClibc and glibc).

This project will use a combination of documentation and experimental methods to produce a precise specification (yet “loose”, i.e. accommodating allowable variation) of what happens when a dynamic linker loads an executable and its constituent libraries.

Given an input binary, the output is a memory image in the address space containing the dynamic linker. A host of details matter to this transformation: symbol visibility, relocation semantics, program header semantics, and so on. We currently have a specification (in Lem) of some of these, which has been used as a testable specification for static linking. This project will most likely benefit by extending these, albeit in the context of a new executable specification simulating the dynamic linker rather than the (static) “batch” compile-time linker. Another useful component would be ptrace()-based tool for testing against the memory image actually produced by real dynamic linker.

Initially one would probably factor out the filesystem aspects, i.e. assume that the output of ldd is known and ignore features such as preloading and auditing that can affect this. These could then be added in later as extensions. One would probably also want to work with preexisting binaries; specifying and checking the (static) production of dynamically-linkable binaries could again be a useful optional extra, for which we have some initial work.

A machine-readable, testable system call specification usable by instrumentation-based tools

Many programming tools, including debuggers and tracers and also, especially, tools built with instrumentation frameworks such as Valgrind or DynamoRIO, require knowledge of the system call interface. Currently these use hand-maintained knowledge of syscalls which is often incomplete and/or incorrect. Pooling and automating these specifications would improve quality and completeness, reducing the costs of keeping tools up-to-date. Candidate tools/frameworks include qemu, DynamoRIO, Valgrind, gdb, strace, and our own ppcmem/rmem.

A system call's register footprint (arguments in/out, clobbers) is defined by the kernel ABI. Most system calls also have a memory footprint, which is much harder to describe. Some tooling for describing and testing Linux system call interface and a very partial specification of calls' read/written memory footprints can be supplied as a starting point.

Other useful information includes argument types (mostly done already), possible error codes and their necessary and/or sufficient preconditions, the latter perhaps initially limited to argument well-formedness but extending later to details of the existing memory map, file descriptor state, resource limits, signal dispositions, and so on. (Of course error conditions also depend on filesystem state, which is rather complex by itself, and subject to interesting race conditions; a thorough job of this is not feasible within one project, but some cases could be addressed.)

One extension of memory footprints concerns the well-definedness of the contents of memory written and/or read—for example Valgrind's Memcheck would ideally consume a specification of output (written) definedness as a function of input (read) definedness.

System calls are “downcalls”; the specification could also consider system “upcalls”, mostly meaning kernel-generated signals, the conditions on which they might occur and the properties expected to be true of the process state at those times.

This project would ideally pick a couple of tools and a couple of architectures (including AArch64) and produce a testable specification usable by both tools on both architectures as a replacement for current ad-hoc architectures, including specification of those aspects of the interface that matter to those tools. We assume the host tool/framework already has mechanisms for recognising when a system call is occurring, and what state the process memory/registers are in. We also assume the tool has a means to request system calls (typically emulated by issuing host system calls, perhaps doing a verbatim or modified version of the guest code's system call, e.g. to do guest/host address translation).

Maintaining testability of the overall spec is useful, e.g. using SystemTap or DTrace to check actual system state against what the specification says should happen. Again a proof-of-concept for memory footprints will be provided for extension.

[/research] permanent link contact

validate this page