Performance

How does CXXR compare in speed with CR?

As distributed for compilation with gcc, CXXR specifies no C++ optimisation flags by default, and includes extensive consistency-checking code. The resulting executable is considerably slower than CR, by a factor of as much as eight.

In particular, the default installation carries out thorough run-time type checking on the interface between code inherited from CR and 'native' CXXR code. For example, in a code fragment such as the following, where x is of type RObject* (i.e. SEXP):

double sum = 0.0;
unsigned int i;
for (i = 0; i < LENGTH(x); i++)
    sum += REAL(x)[i];

CXXR will check on each iteration that x actually points to a RealVector - a class derived from RObject - and cast it accordingly. (CR will apply an equivalent check, but only if REAL() is invoked from outside the main part of the interpreter.) In fact CXXR will make two type checks on each iteration, one for REAL() and one for LENGTH().

In newly written C++ code, the above fragment would preferably be rewritten along the following lines:

double sum = 0.0;
const RealVector& rv = *SEXP_downcast<RealVector*>(x);
for (unsigned int i = 0; i < rv.size(); ++i)
    sum += rv[i];

(or better still using std::accumulate()). The SEXP_downcast will normally report an error if x does not actually point to a RealVector, but this check is done only once, not every time around the loop.
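
The std::accumulate() variant might look like the following minimal sketch. The CXXR headers are not reproduced here, so the RealVector below is a hypothetical stand-in (a thin wrapper around std::vector<double>). That the real class offers begin() and end() in this form is an assumption, and the helper name sumElements is invented for the illustration.

```cpp
#include <cstddef>
#include <numeric>
#include <utility>
#include <vector>

// Hypothetical stand-in for CXXR's RealVector, so that the sketch compiles
// on its own. Only size(), operator[], begin() and end() are assumed.
class RealVector {
public:
    explicit RealVector(std::vector<double> values)
        : m_data(std::move(values)) {}
    std::size_t size() const { return m_data.size(); }
    double operator[](std::size_t i) const { return m_data[i]; }
    std::vector<double>::const_iterator begin() const { return m_data.begin(); }
    std::vector<double>::const_iterator end() const { return m_data.end(); }
private:
    std::vector<double> m_data;
};

// Sum the elements in a single pass. The initial value 0.0 (rather than 0)
// makes the accumulation take place in double arithmetic.
inline double sumElements(const RealVector& rv)
{
    return std::accumulate(rv.begin(), rv.end(), 0.0);
}
```

In real CXXR code the reference rv would be obtained once via SEXP_downcast, as in the fragment above, and then passed to a function of this kind.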

Unfortunately it would not be practical to go through all the code inherited from CR, replacing the first idiom with the second. Not only would this change be very time-consuming, it would also make it extremely difficult to update CXXR subsequently to reflect a new release of CR.

A cruder approach is to suppress altogether the run-time checks made by functions such as LENGTH() and REAL(). This can be accomplished by defining the preprocessor macro UNCHECKED_SEXP_DOWNCAST within src/include/CXXR/config.hpp when building CXXR.
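
The kind of effect such a switch has can be illustrated with a small sketch. This is not CXXR's actual SEXP_downcast: the helper sketchDowncast and the cut-down class hierarchy are hypothetical, and the sketch only assumes that the checked build verifies the dynamic type while the unchecked build trusts the caller.

```cpp
#include <stdexcept>

// Minimal stand-ins for part of the class hierarchy (illustrative only).
struct RObject {
    virtual ~RObject() {}
};
struct RealVector : RObject {};
struct IntVector : RObject {};

// Hypothetical downcast helper, not CXXR's SEXP_downcast. PtrOut is a
// pointer type such as RealVector*.
template <typename PtrOut>
PtrOut sketchDowncast(RObject* s)
{
#ifdef UNCHECKED_SEXP_DOWNCAST
    // Unchecked build: trust the caller and pay no run-time cost.
    return static_cast<PtrOut>(s);
#else
    // Checked build: verify the dynamic type on every call.
    PtrOut ans = dynamic_cast<PtrOut>(s);
    if (s && !ans)
        throw std::invalid_argument("object is not of the expected type");
    return ans;
#endif
}
```

With the macro undefined, sketchDowncast<IntVector*>() applied to a pointer to a RealVector throws; with it defined, the same call silently returns a wrongly typed pointer, which is why the unchecked build is best reserved for code that is already known to be correct.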

To build CXXR for maximum speed it is recommended that definitions of UNCHECKED_SEXP_DOWNCAST and NDEBUG be added to config.hpp, and that the definition of CHECK_EXPOSURE be removed. (See here for other build options.) The C++ optimisation level should be raised to -O2: this can be accomplished by setting CXXFLAGS to -O2 in the shell from which make is invoked. (On 32-bit Intel I have also found the compiler flag --param inline-unit-growth=100 helpful.)

When built in this way, the performance of CXXR depends very much on the R script being run. In particular, CXXR at present has a higher overhead than CR in setting up R function calls (whether to built-in functions or closures), and in the housekeeping related to garbage collection. Consequently CXXR fares particularly badly running scripts that involve many R manipulations on small datasets. Many of the R test scripts are of this kind, and on such scripts CXXR currently can run at as little as about three-quarters the speed of CR, and sometimes slower still. (But this is a considerable improvement on earlier releases.)

On the other hand, CXXR's more aggressive approach to garbage collection means that it has leaner memory requirements and makes more effective use of the processor caches. Consequently it tends to come into its own when working with larger datasets, and in many such cases CXXR runs somewhat faster than CR. Indeed, Jens Oehlschlägel has produced an example where CXXR runs up to three times as fast as CR.

Code Organisation

What's the difference between the .h and the .hpp files in the CXXR directory?

The .hpp header files are intended to be #included only into C++ source files, and will probably give compilation errors if #included into a C source file. The .h files may be #included into both C++ and C source files, though C++ files will normally see additional content.
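
A common way of arranging for a .h file to serve both languages, with extra content for C++, is to guard the C++-only parts with the standard __cplusplus macro. The following sketch shows the general pattern only. The names SketchVec and SketchVec_length are hypothetical and do not come from CXXR, and the function definition at the end is included merely to keep the example self-contained.

```cpp
/* Hypothetical header in the style of a .h file: safe to #include from
 * both C and C++ sources. None of these names come from CXXR. */
#ifndef SKETCH_VEC_H
#define SKETCH_VEC_H

#ifdef __cplusplus
/* Additional content, seen only by C++ translation units. */
class SketchVec {
public:
    explicit SketchVec(int length) : m_length(length) {}
    int length() const { return m_length; }
private:
    int m_length;
};

extern "C" {
#else
/* C sources see only an opaque type. */
typedef struct SketchVec SketchVec;
#endif

/* C interface to the class, visible to C and C++ alike. */
int SketchVec_length(const SketchVec* v);

#ifdef __cplusplus
}  /* extern "C" */
#endif

#endif /* SKETCH_VEC_H */

#ifdef __cplusplus
/* Would normally live in a separate .cpp file. */
int SketchVec_length(const SketchVec* v) { return v->length(); }
#endif
```

A header in the .hpp style would simply omit the guards, which is why it is likely to fail when #included into a C source file.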

Why are function prototypes often duplicated in different header files?

From the CXXR point of view, many of the functions in the R API can be considered to provide a C interface to the facilities of a particular C++ class. Consequently it makes sense to gather together the prototypes that relate to a particular class into the header file for that class. The documentation for these C interface functions then sits alongside the documentation for the class methods and for other class-related functions available only to C++ programs, all concentrated within the class's header file in the CXXR directory.

If that were the only consideration, Rinternals.h (which defines the R API) could simply be modified to #include all the relevant header files from the CXXR directory, and contain very little content of its own. However, that would cause problems when it came to updating CXXR to reflect a new release of CR, because changes within CR to the prototypes of API functions would not automatically be picked up. Consequently, the approach taken is to keep CXXR's Rinternals.h as close as possible to its CR version, but to have it also #include the relevant header files from the CXXR directory. Then, if the prototype of an API function changes from one release of CR to the next, the change will be picked up in CXXR's Rinternals.h using svn merge, and the compiler will immediately flag up the inconsistency between Rinternals.h and the relevant header file in the CXXR directory. Similar considerations apply to the other 'omnibus' header file, Defn.h.
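
The way a compiler flags up such an inconsistency can be sketched as follows. The function name sketchLength is made up for the illustration and is not part of the R API. The sketch assumes that both copies of the prototype are declared with C linkage, in which case a C++ compiler would be expected to reject two declarations that disagree.

```cpp
// Hypothetical illustration; sketchLength is not part of the R API.

// As it might appear in the class's own header in the CXXR directory:
extern "C" int sketchLength(const void* x);

// As it might appear in the merged 'omnibus' header. A redeclaration
// that matches the first one exactly is harmless:
extern "C" int sketchLength(const void* x);

// If a merge changed one copy only, for example to
//     extern "C" long sketchLength(const void* x);
// the two declarations would conflict and compilation would fail,
// drawing attention to the header that still needs updating.

// Definition, included only so that the sketch links.
extern "C" int sketchLength(const void* x) { return x ? 1 : 0; }
```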

Why isn't the implementation of a method of class Foo always to be found in Foo.cpp?

Obviously one reason may be that the method is inline, in which case its implementation is to be found in Foo.h or Foo.hpp.

Even for non-inline methods, however, the implementation may not be located in Foo.cpp, though in this case there will usually be a comment in Foo.cpp to say where it is to be found.

The most common reason for this is that the implementation of the method inherits a substantial amount of code from CR. In that case, it can make sense to leave that code in the same place in the same source file (subject to renaming from .c to .cpp) as its location within CR. Then there is a good chance that, when CXXR is updated to reflect a new release of CR, corrections and enhancements to that code will be automatically picked up during the code merge.

What are uncxxr.h and uncxxr.pl, and why the weird spacing?

Where a source file inherited from CR - foo.c, say - has been adapted for CXXR (and changed into a C++ file foo.cpp in the process), the script uncxxr.pl endeavours as far as possible to reverse systematic changes (e.g. the conversion of C-style casts into C++ casts, and the insertion of casts that C++ requires to be explicit but C does not) to generate a quasi-C file foo.bakc. (We say 'quasi-C' because the resulting file may not be syntactically correct C: it is intended for human eyes only.) Updating to a new release of R is facilitated by using a 3-way visual diff between the release of foo.c currently shadowed by CXXR, the new release of foo.c, and foo.bakc. This helps to highlight where the significant changes are in the new release of foo.c, and where they might conflict with changes made in CXXR. (A similar 3-way comparison using foo.cpp instead of foo.bakc throws up too much 'noise'.)

Some changes have been made to the program text of files such as foo.cpp, so that uncxxr.pl can make a better job of recovering the form of the file in CR. This includes inserting additional whitespace and redundant brackets, and the use of various macros defined in uncxxr.h.

Coding Practices

Why are some constructors of classes derived from GCNode declared explicit?

In C++, a constructor that is capable of being called with a single argument defines an implicit conversion from the type of that argument to the class being constructed. This default behaviour can be prevented by qualifying the constructor with the keyword explicit.
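
The difference can be seen in a minimal sketch. Both classes are hypothetical and unrelated to CXXR; they differ only in whether the single-argument constructor is declared explicit.

```cpp
#include <cstddef>
#include <type_traits>

// Hypothetical classes, for illustration only.
class ImplicitBuffer {
public:
    ImplicitBuffer(std::size_t n) : m_size(n) {}  // converting constructor
    std::size_t size() const { return m_size; }
private:
    std::size_t m_size;
};

class ExplicitBuffer {
public:
    explicit ExplicitBuffer(std::size_t n) : m_size(n) {}
    std::size_t size() const { return m_size; }
private:
    std::size_t m_size;
};

inline std::size_t implicitSize(const ImplicitBuffer& b) { return b.size(); }
inline std::size_t explicitSize(const ExplicitBuffer& b) { return b.size(); }

// The compiler will silently turn a size into an ImplicitBuffer ...
static_assert(std::is_convertible<std::size_t, ImplicitBuffer>::value,
              "implicit conversion is available");
// ... but not into an ExplicitBuffer.
static_assert(!std::is_convertible<std::size_t, ExplicitBuffer>::value,
              "implicit conversion is suppressed");
```

Here implicitSize(10) compiles, quietly constructing a temporary ImplicitBuffer, whereas explicitSize(10) is rejected and must be written explicitSize(ExplicitBuffer(10)).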

Perspicacious readers may have noticed that since GCNode and all classes derived from it have (or should have) private or protected destructors, it will be impossible for the compiler to create temporary objects of these classes, and hence no implicit conversions can be carried out anyway.
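
This effect can be demonstrated with a small sketch. The class Node below is a hypothetical stand-in and not CXXR's GCNode; in particular the destroy() member function is invented so that the example can release its object, and says nothing about how CXXR itself reclaims nodes.

```cpp
#include <type_traits>

// Hypothetical stand-in for a GCNode-like class; not CXXR's GCNode.
class Node {
public:
    Node(int value) : m_value(value) {}  // deliberately not explicit
    int value() const { return m_value; }

    // Outside code cannot reach the destructor, so objects must be
    // released through a member function.
    void destroy() { delete this; }
protected:
    ~Node() {}
private:
    int m_value;
};

inline int valueOf(const Node& n) { return n.value(); }

// Code outside the class cannot destroy a Node, so it cannot hold one
// as a temporary or as a local variable.
static_assert(!std::is_destructible<Node>::value,
              "Node's destructor is not publicly accessible");

// Consequently a call such as valueOf(42) does not compile, even though
// the constructor is not explicit: the temporary Node that the implicit
// conversion would create could not be destroyed.
```

Objects of such a class can still be created on the heap with new, which is how the sketch would be used: valueOf(*p) for a Node* p obtained from new Node(7), followed by p->destroy().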

However, we consider it good practice to declare constructors explicit in cases where - even if an implicit conversion were feasible - it would not be desired. (This in fact covers the majority of constructors callable with a single argument.) Moreover, following this practice may lead to clearer compiler error messages, because the compiler need not even consider using implicit conversions.