This page describes the phases so far completed within the CXXR project to refactor the R engine into C++. Each phase is placed within the Subversion tags directory, with a name of the form 0.00-2.5.0, where 0.00 indicates the phase, and 2.5.0 indicates the R release to which that phase is intended to correspond.

Phase 0: 0.00-2.5.0

In this phase all .c files within src/main are renamed to .cpp, with the following exceptions:

(Subsequently, RNG.c was also reverted to C, to respect Knuth's copyright statement.)

The result of this phase does not build correctly; however, it is useful as a baseline for seeing the subsequent changes.

Phase 1: 0.01-2.5.0

Make such changes to the result of Phase 0 as are needed to enable the .cpp files to compile without warning using -Wall with gcc-4.1.3, retaining C linkage conventions for everything defined in .h files. Ensure that the whole of R will build correctly and pass make check.

A desirable side effect of enforcing C linkage was that the linkage editor picked up several instances where the source file implementing a function failed to #include the appropriate header file, and consequently generated a function with C++ linkage: see below.
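The linkage mismatch described above can be sketched in a single file. The function demo_add and the header name demo.h are hypothetical, chosen only for illustration; they are not part of R or CXXR.

```cpp
#include <cassert>

// --- contents of a shared header (hypothetical "demo.h") ---
#ifdef __cplusplus
extern "C" {
#endif

int demo_add(int a, int b);  // hypothetical function, not part of R

#ifdef __cplusplus
}
#endif

// --- contents of the implementing .cpp file ---
// The declaration above is in scope, so this definition inherits C linkage
// and its symbol is emitted unmangled. Had the header not been included,
// the definition would silently acquire C++ linkage (a mangled name), and
// callers expecting the C symbol demo_add would fail at link time.
int demo_add(int a, int b) { return a + b; }

// Small self-check that can be called from main().
inline void demo_add_selfcheck() { assert(demo_add(2, 3) == 5); }
```

Because the failure shows up as an unresolved symbol rather than a compiler diagnostic, it is the linker that reports the missing #include.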

This needed to address the following issues:

Phase 2: 0.02-2.5.0

In subsequent phases (possibly starting in Phase 3) it is our objective to replace the SEXPREC union by a hierarchy of C++ classes. This phase prepares for that by reorganising the material in the header files in src/include. This involves creating a new subdirectory src/include/CXXR, and within that creating a new header file RObject.h (ultimately to include a base class RObject for the new hierarchy), and further header files RClosure.h, REnvironment.h, RInternalFunction.h, RPairList.h, RPromise.h, RSymbol.h and RVector.h, corresponding respectively to closxp_struct, envsxp_struct, primsxp_struct, listsxp_struct, promsxp_struct, symsxp_struct and vecsxp_struct, which will eventually be derived classes. The material in these new headers comes predominantly from Rinternals.h, but to some extent (in the case of RInternalFunction.h) from Defn.h. All of the new header files, with the exception of RInternalFunction.h, are also installed in $(rincludedir)/CXXR.

Function prototypes moved into the new header files are documented using doxygen. Where it was clearly consistent with the semantics, some of the argument types of the functions were changed, either by adding const, or by converting int into Rboolean (however, see the issues below regarding the latter).

The following are implementational details and issues that arose:

Phase 3: 0.03-2.5.0

The primary objective of this phase was to redefine R_NilValue as a null (i.e. zero) pointer of type SEXP. R_NilValue is widely used within CR as a stub, i.e. to signify that something that might be present is absent, in much the same way that a null pointer is used within C or C++. However, in CR it is actually implemented in effect as an element of a pairlist (i.e. struct listsxp), whose CAR, CDR, TAG and attributes all point to itself. This would cause difficulties in CXXR when we reimplement the SEXPREC union as a type hierarchy, because pairlist elements will need to be of a specific type within the hierarchy. If R_NilValue were given this type, it would preclude its use as a general-purpose stub. But zero is a possible value for a pointer of any type, so if we equate R_NilValue to zero this will sidestep the problem.
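The contrast between the two representations of "nothing here" can be sketched as follows. The Cell struct and the function names are hypothetical simplifications, not CR or CXXR code; only the idea of a self-referential sentinel versus a null pointer is taken from the text.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical minimal pairlist cell, for illustration only.
struct Cell {
    Cell* car;
    Cell* cdr;
};

// CR-style stub: a real cell whose links all point back to itself.
// Note that this creates a cyclic data structure.
inline Cell* make_self_referential_nil() {
    static Cell nil = { nullptr, nullptr };
    nil.car = &nil;
    nil.cdr = &nil;
    return &nil;
}

// List length when the list is terminated by the sentinel cell.
inline std::size_t length_sentinel(const Cell* list, const Cell* nil) {
    std::size_t n = 0;
    for (; list != nil; list = list->cdr) ++n;
    return n;
}

// List length when the list is terminated by a null pointer. A null
// pointer is a valid value for a pointer of any type, so the same stub
// would serve for every class in a future hierarchy.
inline std::size_t length_null(const Cell* list) {
    std::size_t n = 0;
    for (; list != nullptr; list = list->cdr) ++n;
    return n;
}
```

In the sentinel version the terminator has a concrete type (Cell), which is exactly what would tie R_NilValue to one class of the hierarchy.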

Another disadvantage of the CR definition of R_NilValue is that it needlessly introduces a cyclic data structure.

The following are implementational details and issues that arose in carrying out this change:

A secondary objective of this phase was to get rid of C-style casts within the C++ code, wherever the appropriate remedy was reasonably obvious and straightforward. The following kinds of C-style casts were left in place pending further work:

Addendum 2007/08/06: although make check works with this release, make check-devel doesn't.

Phase 4: 0.04-2.5.1

The primary objective of this phase was to update the program to parallel release 2.5.1 of R. This proved to be straightforward, except that it was necessary to install a later version of svn_load_dirs.pl to cope with filenames containing @ signs. (However, I was surprised to discover that svn merge doesn't track renames.)

Other changes were as follows:

Phase 5: 0.05-2.5.1

The aim of this phase was to create a branch entitled const, to explore to what extent the R code is amenable to 'constifying': i.e. converting pointers and C++ references, wherever possible, to pointers and references to const. Two preliminary steps, carried out in the trunk, were as follows:

Having established the const branch, constification was set in train by the brute force measure of redefining SEXP to mean const RObject* rather than simply RObject*; a new typedef mapped vSEXP onto plain RObject*. In the same spirit 'v' variants of many of the accessor functions were introduced: for example now CAR takes a SEXP argument and returns a SEXP, while vCAR takes and returns a vSEXP. (Since these accessor functions are required to be callable from C, we can't simply overload CAR.)
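A minimal sketch of the SEXP/vSEXP arrangement follows. The typedefs and the CAR/vCAR signatures are as described above; the one-field RObject layout is a hypothetical stand-in, since the real class is far richer.

```cpp
#include <cassert>

// Hypothetical stand-in for the real RObject: just enough for a CAR field.
struct RObject {
    RObject* car;
};

typedef const RObject* SEXP;   // const branch: SEXP is pointer-to-const
typedef RObject* vSEXP;        // 'v' variant: plain (mutable) pointer

extern "C" {

// C has no overloading, so the const and non-const accessors need
// distinct names rather than two overloads of CAR.
SEXP CAR(SEXP e) { return e->car; }
vSEXP vCAR(vSEXP e) { return e->car; }

}  // extern "C"

// A caller that needs to modify the result is forced onto the 'v' path,
// which in turn forces its own argument to be a vSEXP.
inline void replace_car_of_car(vSEXP e, vSEXP value) {
    vCAR(e)->car = value;
}
```

The last function shows why the 'v's spread: any code that ultimately mutates a node drags every pointer on the path to that node into the vSEXP world.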

I then attempted to recompile various files, inserting 'v's wherever the compiler demanded it. It quickly became apparent that these 'v's were highly contagious: for example, both NA_STRING and R_EmptyEnv had to be declared as vSEXPs rather than SEXPs. This led me to the conclusion that it was premature to attempt constification until I understood the evaluation process better.

At the time of tagging this release, the following files compile without warnings in the const branch: memory.cpp, envir.cpp and names.cpp. eval.cpp gives one compilation error, when do_function attempts a non-const operation on its op argument: fixing this would mean changing the signature of all the do_ functions.

Phase 6: 0.06-2.5.1

In CR, each SEXPREC has a node class in the range 0 to 7. Nodes of non-vector SEXPTYPE (i.e. not of types CHARSXP, LGLSXP, INTSXP, REALSXP, CPLXSXP, STRSXP, VECSXP, EXPRSXP, WEAKREFSXP or RAWSXP) are all in class 0, and are 28 bytes long. Class 7 is used for vector nodes whose vector data amount to more than 128 bytes; the remaining classes are used for smaller vectors, classified according to their size. Nodes of class 7 are allocated directly using malloc; nodes of the remaining classes are allocated from 'pages' about 2 kB in size, with each node class having its own pages. In CXXR it is intended to replace SEXPRECs with an extensible class hierarchy (rooted at RObject), so it will not be feasible to put a tight upper bound on the size of non-vector nodes.

Another feature of CR is that in vector nodes, a single block of memory contains the data of the vector preceded by a header comprising a SEXPREC and information about the length of the vector. This is quite incompatible with the design philosophy of C++, which is that the size of an object must be deducible from its (C++) type: in particular ::operator delete relies on this.
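One conventional way to reconcile variable-length data with a fixed object size is to keep the data in a separately allocated block. The class below is a hypothetical sketch of that general idea, not a description of what CXXR actually did in this phase.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical vector node: the object itself has a fixed size known from
// its type, and the variable-length payload lives in its own allocation.
class IntVectorSketch {
public:
    explicit IntVectorSketch(std::size_t n)
        : m_size(n), m_data(new int[n]()) {}   // elements zero-initialised
    ~IntVectorSketch() { delete[] m_data; }

    // Copying is disabled to keep the sketch short and ownership obvious.
    IntVectorSketch(const IntVectorSketch&) = delete;
    IntVectorSketch& operator=(const IntVectorSketch&) = delete;

    std::size_t size() const { return m_size; }

    int& operator[](std::size_t i) {
        assert(i < m_size);
        return m_data[i];
    }

private:
    std::size_t m_size;  // element count
    int* m_data;         // separately allocated payload
};
```

With this layout sizeof(IntVectorSketch) is the same for every instance, whatever the length of the vector it holds.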

The purpose of Phase 6 was to circumvent these problems, and at the same time to endeavour to decouple the code for allocating memory from the code managing garbage collection. This comprised the following changes:

Phase 7: 0.07-2.5.1

The purpose of this phase was to encapsulate all the garbage-collection logic within C++ classes. Five such classes were introduced, namely GCManager, GCNode, GCEdge, GCRoot and WeakRef, as now described.

Phase 8: 0.08-2.5.1

Phase 9: 0.09-2.6.1

The primary objective of this phase was to update the program to parallel release 2.6.1 of R.

Other changes were as follows:

Phase 10: 0.10-2.6.1

The primary objective of this phase was to reimplement all vector data types as C++ classes derived (directly or indirectly) from RObject, rather than using vecsxp_struct within the RObject::u union. vecsxp_struct has not yet been eliminated entirely, however, because of some straggling uses of truelength.

Other changes were as follows:

Phase 11: 0.11-2.6.2

The primary objective of this phase was to update the program to parallel release 2.6.2 of R. Errors and warnings given by make check-devel were also corrected.

Phase 12: 0.12-2.6.2

The primary objective of this phase was to eliminate the RObject::u union completely, replacing its remaining elements with classes derived from RObject. This entailed the creation of the following classes: BuiltInFunction, ByteCode, Closure, DottedArgs, Environment, Expression, ExternalPointer, PairList, Promise, SpecialSymbol and Symbol. Several loose ends remain to be tied up, however; in particular, the remaining data members of RObject ought all to be private.

Other changes were as follows:

Phase 13: 0.13-2.6.2

This phase was an attempt - less successful than was hoped! - to close the gap in speed between CR and CXXR. Principal changes were:

Phase 14: 0.14-2.7.1

The objective of this phase was to update CXXR to parallel release 2.7.1 of R. However, other changes are:

Phase 15: 0.15-2.7.1

The objective of this phase was to tidy up the class hierarchy rooted at RObject, and in particular to give RObject itself a more distinctive class identity, i.e. for it to be less of a ragbag for things that hadn't yet been accommodated elsewhere. Principal changes were:

Phase 16: 0.16-2.7.2

The objective of this phase was to update CXXR to parallel release 2.7.2 of R.

Phase 17: 0.17-2.7.2

The primary purpose of this phase was to reimplement the functionality of duplicate1() in duplicate.cpp using class copy constructors and a virtual function RObject::clone(), reimplemented as necessary in derived classes. The following changes were associated with this:
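As an illustration of the copy-constructor-plus-clone() pattern named above, here is a minimal sketch. The classes Node and IntVec and the free function duplicate_sketch are hypothetical; only the idea of a virtual clone() reimplemented in derived classes comes from the text.

```cpp
#include <cassert>
#include <vector>

// Hypothetical base class standing in for RObject.
class Node {
public:
    Node() {}
    Node(const Node&) {}
    virtual ~Node() {}

    // Each class reimplements clone() in terms of its own copy constructor.
    virtual Node* clone() const { return new Node(*this); }
};

// Hypothetical derived class holding some data.
class IntVec : public Node {
public:
    explicit IntVec(const std::vector<int>& values)
        : Node(), m_values(values) {}
    IntVec(const IntVec& other)
        : Node(other), m_values(other.m_values) {}

    // Covariant return type: callers holding an IntVec get an IntVec back.
    IntVec* clone() const override { return new IntVec(*this); }

    std::vector<int> m_values;
};

// A duplicate function no longer needs a switch on the object's type:
// virtual dispatch selects the right copy constructor.
inline Node* duplicate_sketch(const Node* obj) {
    return obj ? obj->clone() : nullptr;
}
```

The caller owns the returned object in this sketch; a garbage-collected setting such as CXXR's would manage its lifetime differently.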

Phase 18: 0.18-2.8.1

The objective of this phase was to update CXXR to parallel release 2.8.1 of R.

Phase 19: 0.19-2.8.1

The primary purpose of this phase was to refactor environments, to pave the way for introducing provenance-tracking features into R. The following changes were associated with this:

Phase 20: 0.20-2.8.1

The purpose of this phase was extensively to reengineer garbage collection. This was to pave the way to experimentation with reference-counting approaches to garbage collection; however, release 0.20-2.8.1 itself still uses generational mark-sweep. A major change has been in the way of implementing 'infant immunity', whereby nodes that are under construction are not liable to garbage collection; the following is a summary of the way in which this has evolved. The phrase 'infant nodes' means nodes that are either under construction, or whose construction is complete but which have not yet been exposed to garbage collection by calling GCNode::expose().

Other changes are as follows:

Phase 21: 0.21-2.8.1

This phase changes the approach used for garbage collection. Previous phases used a generational mark-sweep collector, like CR itself. As of Phase 21, the principal method of garbage collection is reference counting. The principal motivation for this is to make better use of the processor caches: with reference counting, the memory occupied by objects that become garbage is quickly recycled into productive use, very likely while this memory is still mapped in cache.

To implement reference counting, each GCNode object contains a one-byte reference count, which is automatically adjusted by the GCEdge<T>, GCRoot<T> and GCStackRoot<T> smart pointers, and by the traditional CR PROTECT/UNPROTECT mechanism. (If a node's reference count ever reaches 255, it sticks at that value, and that node can only be garbage-collected by the mark-sweep mechanism.) When a GCNode's reference count falls to zero, it is declared 'moribund'. When GCNode::operator new is called upon to allocate memory for a new GCNode object, it first looks through class GCNode's internal list of moribund nodes. Any nodes on the list which still have a reference count of zero are deleted; nodes whose reference count has risen back above zero - accounting for about one in four of the nodes on the moribund list - are returned to the 'live' list.
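The counting scheme just described can be sketched as follows. CountedNode and its members are hypothetical names; in particular, the sweep is exposed here as an explicit static function, whereas the text says the real sweep is triggered from GCNode::operator new.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

class CountedNode {
public:
    CountedNode() : m_refcount(0), m_on_moribund_list(false) {}

    // The one-byte count sticks at 255 once it gets there.
    void incRef() {
        if (m_refcount < 255) ++m_refcount;
    }

    void decRef() {
        if (m_refcount == 255) return;  // saturated: left to mark-sweep
        assert(m_refcount > 0);
        if (--m_refcount == 0 && !m_on_moribund_list) {
            m_on_moribund_list = true;
            s_moribund.push_back(this);
        }
    }

    unsigned refCount() const { return m_refcount; }

    // Delete moribund nodes whose count is still zero; nodes whose count
    // has risen again simply return to live status. Returns the number
    // of nodes deleted.
    static std::size_t sweepMoribund() {
        std::vector<CountedNode*> pending;
        pending.swap(s_moribund);
        std::size_t deleted = 0;
        for (CountedNode* node : pending) {
            node->m_on_moribund_list = false;
            if (node->m_refcount == 0) {
                delete node;
                ++deleted;
            }
        }
        return deleted;
    }

private:
    unsigned char m_refcount;
    bool m_on_moribund_list;  // guards against listing a node twice
    static std::vector<CountedNode*> s_moribund;
};

std::vector<CountedNode*> CountedNode::s_moribund;
```

Deferring deletion to the sweep is what allows a node whose count briefly touches zero to be rescued rather than destroyed. Nodes in this sketch must be allocated with new, since the sweep deletes them.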

To cope with cycles in the node graph (i.e. the directed graph whose nodes are GCNodes and whose edges are GCEdges), this reference counting scheme is backed up by a simple (i.e. non-generational) mark-sweep scheme. However, this runs much more rarely than CR's garbage collections, and uses a simpler logic to manipulate the threshold at which mark-sweep collection takes place. Not having node generations means that there is no longer a need to implement the 'write barrier'; this in turn means that the GCEdge<T> templated class can have a C++ assignment operator defined, which enables it to be more freely used in connection with the container types in the C++ standard library.
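To illustrate why an ordinary assignment operator matters for use with standard containers, here is a sketch of a counting smart pointer stored in a std::vector. EdgeSketch and Counted are hypothetical names and much simpler than the real GCEdge<T> and GCNode.

```cpp
#include <cassert>
#include <vector>

// Hypothetical target type with a plain reference count.
struct Counted {
    int refs = 0;
};

// Hypothetical edge: adjusts the target's count on construction, copy,
// assignment and destruction. With no write barrier to maintain, copy
// assignment is an ordinary operation, which is what std::vector needs.
template <typename T>
class EdgeSketch {
public:
    EdgeSketch(T* target = nullptr) : m_target(target) { retain(); }
    EdgeSketch(const EdgeSketch& other) : m_target(other.m_target) { retain(); }

    EdgeSketch& operator=(const EdgeSketch& other) {
        // Retain the new target before releasing the old one, so that
        // self-assignment never drives the count to zero in between.
        if (other.m_target) ++other.m_target->refs;
        release();
        m_target = other.m_target;
        return *this;
    }

    ~EdgeSketch() { release(); }

    T* get() const { return m_target; }

private:
    void retain() { if (m_target) ++m_target->refs; }
    void release() { if (m_target) --m_target->refs; }

    T* m_target;
};
```

Operations such as push_back, element assignment and reallocation then keep the counts consistent without any code specific to the container.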

Weak reference (WeakRef) objects need special handling during garbage collection, and consequently each WeakRef object now includes a pointer to itself, to stop it being deleted by the reference counting mechanism.

Phase 22: 0.22-2.9.1

The purpose of this phase was to update CXXR to parallel release 2.9.1 of CR. (Unfortunately, it was overtaken by release 2.9.2 of CR.)

Phase 23: 0.23-2.9.2

The purpose of this phase was to update CXXR to parallel release 2.9.2 of CR. This proved straightforward.

Phase 24: 0.24-2.9.2

This phase represented the first stage of refactoring the interpreter's evaluation logic into C++, and included the following principal changes:

Phase 25: 0.25-2.9.2

This phase continued with refactoring the interpreter's evaluation logic into C++, and comprised the following principal changes:

Phase 26: 0.26-2.10.1

The purpose of this phase was to update CXXR to parallel release 2.10.1 of CR.

Phase 27: 0.27-2.10.1

This phase comprised the following principal changes:

Phase 28: 0.28-2.10.1

This phase was concerned with refactoring contexts (CR's RCNTXT), and involved teasing apart the numerous distinct functions that this struct plays in CR:

Other changes in this phase were:

Phase 29: 0.29-2.10.1

The primary purpose of this release was to define the baseline for the results on add-on packages reported at useR! 2010. The changes are mainly bugfixes, but with the following more substantive changes:

Phase 30: 0.30-2.11.1

The primary purpose of this phase was to update CXXR to parallel release 2.11.1 of CR. This included the following corrections to significant preexisting bugs:

Phase 31: 0.31-2.11.1

This phase included extensive changes:

Phase 32: 0.32-2.11.1

This phase consisted of changes to improve the speed of CXXR. The principal changes were as follows:

Phase 33: 0.33-2.12.1

The purpose of this phase was to update CXXR to parallel release 2.12.1 of CR. In the course of this, the use of UncachedString objects was largely replaced by the use of CachedString objects, a change that has lagged behind the corresponding change in CR.

Phase 34: 0.34-2.12.1

This phase was marked by a wider use of C++ generic programming techniques, both to simplify the internal code, and to make this code available in a flexible form to add-on packages. In particular:

Phase 35: 0.35-2.12.1

This release is intended to clear the decks prior to an upgrade to R 2.13.1, and includes only small changes in the development trunk:

(The main activity in the period leading up to this release has been the introduction of the lazycopy branch, which is exploring methods for managing object duplication automatically via the RHandle smart pointer, and eliminating the need for NAMED() and SET_NAMED(). Verdict so far is mixed: it basically works, but has performance issues, and breaks somewhat more existing code than I'd like. A plus point is that it better achieves C++ 'const correctness' than the development trunk.)

Phase 36: 0.36-2.13.1

The purpose of this phase was to upgrade CXXR to parallel release 2.13.1 of CR. This includes making bytecode interpretation available in CXXR for the first time, though not yet in the 'threaded code' implementation (which is the CR default when using gcc).

The code also now builds correctly when configured with --enable-memory-profiling. (Thanks to Doug Bates for pointing out that previously it didn't.) However, the functionality of tracemem and kindred R functions (untracemem and retracemem) is currently unavailable in CXXR even when it is configured with memory profiling enabled.

Phase 37: 0.37-2.13.1

This release contains only minor changes:

Phase 38: 0.38-2.13.1

This release clears the decks prior to an upgrade of CXXR to R 2.14.1.

The principal change regards garbage collection. The reference-counted approach to garbage collection primarily used by CXXR can bring speed advantages when dealing with large datasets, but the housekeeping involved in diddling reference counts up and down as required is surprisingly time-consuming, and this is a major contributor to the speed penalty of CXXR compared with CR when dealing with small datasets, a penalty that has grown greater with the advent of the bytecode interpreter. This release incorporates the following changes:

A side effect of the above changes is that when AGGRESSIVE_GC is defined, CXXR's garbage collection is even more aggressive than it was in previous releases, and this has revealed a number of GC-protection gaps (e.g. in code inherited from CR) that had previously 'slipped through the net'.

Another significant change is that the CXXR distribution no longer holds the 'Recommended' packages in compressed tar form (.tar.gz), but instead contains the untarred package directories themselves. This will make it easier to carry forward any CXXR-specific tweaks to these packages from one R release to the next. (Such tweaks are rare, and often due to a latent GC-protection bug in the CR package code.)

Phase 39: 0.39-2.14.1

The purpose of this phase was to upgrade CXXR to parallel release 2.14.1 of CR. This entailed substantial changes to the bytecode interpreter, both to track changes in CR and to correct errors in the previous CXXR implementation. In the course of preparing this release, numerous GC-protection gaps were discovered in the CR code (including the Recommended packages) and corrected within CXXR.

CXXR's bytecode interpreter does not yet implement the cache of symbol bindings used in CR.

Phase 40: 0.40-2.15.1

The purpose of this phase was to upgrade CXXR to parallel release 2.15.1 of CR. In the course of this upgrade, the class UncachedString was abolished, and the functionality of class CachedString was merged into its parent class CXXR::String.

Phase 41: 0.41-2.15.1

In this phase, the experimental provenance-tracking facilities and the experimental XML-based serialization facilities, both formerly in the provenance branch, have been merged into the development trunk. Beware that the documentation, and in particular the testing, of these features is still not up to standard, and there are known gaps in the serialization capability. Moreover, the interfaces of both are likely to change. To enable provenance tracking it is necessary to define PROVENANCE_TRACKING within src/include/CXXR/config.hpp before building the program, as the documentation of this file explains.

Phase 42: 0.42-2.15.1

This phase saw various extensions and corrections to the XML-based serialization facilities, including the introduction of automated tests, but beware that these are still subject to change.  The release incorporates work by Chris Silles on adapting the autoconf-based configuration facilities to CXXR: this addresses particularly locating a suitable installation of Boost, and enabling or disabling provenance tracking.  Previously there were some difficulties in building CXXR otherwise than in its source directory: these have now, it is hoped, been removed.