School of Computing

Rchive: Towards provenance tracking in R

Andrew R. Runnalls

Royal Statistical Society Conference RSS2008, Nottingham, UK, September 2008.

Abstract

There is increasing interest within information systems in keeping track of the provenance of data objects such as files and database records, i.e. in determining what source data a data object is derived from, and exactly what sequence of operations was applied to that source data to generate it. Within the literature on provenance-aware computing (as it is called), it is widely recognised that a pioneering paper was "Auditing of Data Analyses", published in 1988 by Becker and Chambers, in which they describe the S AUDIT facility. However, no comparable facility exists in R.

CXXR (http://www.cs.kent.ac.uk/projects/cxxr) is a project by the author to refactor the R interpreter into C++. A major motivation for this is to facilitate architectural changes in the interpreter that allow the provenance of R data objects to be tracked at various levels of granularity.

The purpose of the proposed paper is to stimulate discussion among statisticians about the sorts of provenance-tracking features they would like to see in R. It will start with an overview of the current state of play in provenance-aware computing, in particular identifying any emerging standards and technologies that developments in R need to take account of. The paper will describe some of the problems that need to be addressed and the technical choices that need to be made regarding R: for example, questions about serialisation and deserialisation, about interfacing with external provenance-tracking tools, and about the granularity at which data should be tracked (a data frame, a column of a data frame, or an individual element of a column?). Arising from this, the paper will propose that it is important to set up an open and flexible underlying architecture, to enable a variety of researchers to try out numerous ideas. Finally, the paper will summarise progress within CXXR towards such an open architecture.

The paper is intended to be accessible to statisticians with some familiarity with R or S-PLUS. Some knowledge of the basic concepts of object-oriented programming will be helpful, but no detailed knowledge of programming will be assumed.

Bibtex Record

@misc{3089,
author = {Andrew R. Runnalls},
title = {{R}chive: Towards Provenance Tracking in {R}},
month = {September},
year = {2008},
pages = {182-196},
keywords = {determinacy analysis, Craig interpolants},
note = {},
doi = {},
url = {http://www.cs.kent.ac.uk/pubs/2008/3089},
publication_type = {misc},
submission_id = {27263_1299492735},
howpublished = {Royal Statistical Society Conference RSS2008, Nottingham, UK.},
}

School of Computing, University of Kent, Canterbury, Kent, CT2 7NF