Introduction

The aim of the CXXR project is gradually to refactor (reengineer) the interpreter of the R language, currently written for the most part in C, into C++, whilst as far as possible retaining full functionality. CXXR is being carried out independently of the main R development and maintenance effort.

Note: the CXXR documentation often uses the acronym CR to refer to the standard R interpreter, in contradistinction to CXXR.

Why do this?

It is hoped that by reorganising the code along object-oriented lines, by deploying the tighter code encapsulation that is possible in C++, and by improving the internal documentation, the project will make it easier for researchers to develop experimental versions of the R interpreter. An important subsidiary objective is to create a variant of R with built-in facilities for provenance tracking, so that for any R data object it will be possible to determine exactly which original data files it was derived from, and exactly which sequence of operations was used to produce it: if you remember the old S AUDIT facility, you will probably know how useful this can be.

Why C++?

C++, though perhaps somewhat unfashionable, is a strongly-typed language with a powerful range of facilities for object-oriented programming. In its design, constant attention has been paid to providing a smooth conversion pathway from C. Compilers, including free compilers, are readily available, and the language is well standardised. The current standard is ISO14882:2003, but the objective in CXXR is to require only that the compiler be able to cope with code conforming to the earlier standard, ISO14882:1998. And last but not least, it is a language that I have had years of experience with (though always learning more!).

Wouldn't it be better to use Java, Objective C, Concurrent Haskell, VB.NET... ?

Maybe you're right: if you have the time and the expertise, go right ahead!

What has been done so far?

  1. Memory allocation and garbage collection have now been decoupled from each other and from R-specific functionality, and encapsulated within C++ classes. Classes CellPool, MemoryBank and Allocator look after memory allocation; GCManager, GCNode, GCRoot and WeakRef look after garbage collection. (All CXXR classes are within the namespace CXXR.) Garbage collection is now based primarily on reference counting, with a (non-generational) mark-sweep algorithm as a backstop.
  2. The SEXPREC union of CR has been converted into an extensible hierarchy of classes rooted at a class RObject (which inherits from GCNode). The functionality of duplicate1() (in CR's file duplicate.c) has been reimplemented using class copy constructors and a virtual function RObject::clone(). Code associated with a particular R data type is progressively being shifted into the relevant class, and C++'s public/protected/private access controls are being used to defend class invariants.
  3. Any class in the RObject hierarchy can apply its own checks on how attributes are set, and override the default way in which attribute values are stored internally.
  4. All environments, including the base environment and the base namespace, are now implemented in essentially the same way, using the abstract C++ class Frame as the fundamental building block. Facilities such as those provided by the package RObjectTables can now be implemented more simply by inheriting from Frame. Hooks have been provided for monitoring the reading or writing of symbol bindings within environments.
  5. Refactoring of the evaluation logic into C++ is well advanced.
  6. The various functions associated with CR contexts (RCNTXT) have been separated and refactored using a variety of mechanisms. In particular, indirect flows of control are now much more in line with C++ idioms, notably in relying on object destructors to restore necessary state as the stack is unwound.
  7. An increasing amount of internal functionality is being refactored using C++ generics, and made available to C++ package code via the $(R_HOME)/include/CXXR API. For example, R's subscripting operations (subsetting and subassignment) are now carried out by algorithms implemented as C++ templates, so that they are applicable to generalised vectors of arbitrary element types, not just the R built-in vector types.
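
As a rough illustration of three of the idioms mentioned in the list above, here is a small C++ sketch of duplication via a copy constructor and a virtual clone() (item 2), a destructor that restores state as the stack is unwound (item 6), and a subsetting algorithm written once as a template (item 7). Every name in it (the namespace sketch, Node, IntVector, DepthGuard, evaluate_and_fail, subset and self_check) is invented for this example and is not taken from the CXXR sources; the real RObject::clone(), the context handling and the subscripting templates are considerably more involved. The sketch deliberately stays within ISO14882:1998.

```cpp
// Illustrative sketch only: none of these names come from the CXXR sources.
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

namespace sketch {

// Item 2: duplication through a copy constructor and a virtual clone().
class Node {
public:
    virtual ~Node() {}
    virtual Node* clone() const = 0;  // the caller owns the returned copy
};

class IntVector : public Node {
public:
    explicit IntVector(const std::vector<int>& data) : m_data(data) {}
    IntVector(const IntVector& other) : Node(other), m_data(other.m_data) {}
    // Covariant return type: cloning an IntVector yields an IntVector.
    virtual IntVector* clone() const { return new IntVector(*this); }
    const std::vector<int>& data() const { return m_data; }
private:
    std::vector<int> m_data;
};

// Item 6: a destructor restores saved state when the stack is unwound.
class DepthGuard {
public:
    explicit DepthGuard(int& depth) : m_depth(depth), m_saved(depth) { ++m_depth; }
    ~DepthGuard() { m_depth = m_saved; }
private:
    int& m_depth;
    int m_saved;
    DepthGuard(const DepthGuard&);             // not copyable (C++98 style)
    DepthGuard& operator=(const DepthGuard&);
};

// Enters one (hypothetical) evaluation level, then fails, so that the
// guard's destructor runs during exception unwinding.
inline void evaluate_and_fail(int& depth) {
    DepthGuard guard(depth);
    throw std::runtime_error("evaluation failed");
}

// Item 7: subsetting written once as a template, for any element type.
// Indices are zero-based here, unlike R's one-based subscripts.
template <typename T>
std::vector<T> subset(const std::vector<T>& source,
                      const std::vector<std::size_t>& indices) {
    std::vector<T> result;
    result.reserve(indices.size());
    for (std::size_t i = 0; i < indices.size(); ++i)
        result.push_back(source.at(indices[i]));  // at() throws on a bad index
    return result;
}

// Usage: exercises the three pieces above.
inline void self_check() {
    std::vector<int> values;
    values.push_back(10);
    values.push_back(20);
    values.push_back(30);

    IntVector original(values);
    IntVector* copy = original.clone();
    assert(copy != &original && copy->data() == original.data());
    delete copy;

    int depth = 0;
    try {
        evaluate_and_fail(depth);
    } catch (const std::runtime_error&) {
        // The guard's destructor has already run by the time we arrive here.
    }
    assert(depth == 0);

    std::vector<std::size_t> picks;
    picks.push_back(2);
    picks.push_back(0);
    std::vector<int> chosen = subset(values, picks);
    assert(chosen.size() == 2 && chosen[0] == 30 && chosen[1] == 10);
}

}  // namespace sketch
```

The point of the guard is that the caller needs no explicit clean-up code on the error path: whichever way control leaves evaluate_and_fail(), the saved depth is put back. Likewise subset() makes no reference to any particular element type, so the same algorithm serves a vector of doubles or of some user-defined class.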

See the refactoring history for more information.

Can I help?

Certainly, most readily by trying out CXXR and reporting any bugs you find. Beware, however, that if you come across program faults, CXXR is likely to abort gracelessly without saving your work! (Control-C will also abort the interpreter at present.) Testing in a non-English locale would be particularly welcome.

If you want to contribute to coding, experience specifically of C++ would be a definite advantage: unfortunately, good C programmers tend to make bad C++ programmers (and vice versa); Java likewise. I would particularly welcome help in porting CXXR to platforms other than Linux, especially Microsoft Windows (using mingw etc.).

My contact email is at the foot of this page.

Acknowledgements

CXXR would obviously not have been feasible without the work of the R core team in developing and maintaining R itself. The overwhelming majority of the code in CXXR is lifted directly from R (under the terms of the GNU General Public Licence). But equally important is the excellent test suite that the R team has developed, and to which I hope CXXR will in due course be able to contribute.

Particular thanks are owed to the following (in alphabetical order):