OEP

0156

Title

Union data types

Summary

A mechanism for data types that may contain several types of data

Owner

Peter Welch

Status

Proposed

Date-Proposed

2006-04-26

Keywords

language types unions enum

Introduction

A union data type allows a variable to hold values of different types (its subtypes). Such types exist in C and Pascal and have some common ground with the notion of subtyping in object oriented languages. This note assumes the need for them is made elsewhere.

occam3 (1992) proposed a UNION data type with strong usage rules that prevent type abuse (the storing of data in one subtype and its processing as another), which is possible in C and Pascal.

This note proposes a similar mechanism - actually three! Proposal A0 follows the occam2.1 syntactic structures for CASE protocols. Proposal A1 extends this in a tempting, though slightly redundant, way. Proposal A2 cuts back on A1.

The occam3 proposal is summarised as proposal B. This follows the occam2.1 syntactic structures for RECORD types.

All proposals include the same anti-abuse usage rules.


Proposal A0

A union (or variant or CASE) data type has values with two fields: a tag, which is a programmer defined name, and data, which holds a value from a subtype identified by the tag. This property is invariant - data cannot be assigned with one tag and interpreted using another (except, of course, through explicit type conversion).

Example:

DATA TYPE FOO
  CASE
    sugar, BOOL
    salt, REAL32
    pepper, [8]BYTE
:

where the prefixing tag names are programmer chosen and must be distinct (same rule as for CASE protocols). We use comma, rather than semi-colon, since no notion of sequence is involved (unlike CASE protocols). Maybe we should use semi-colons for syntactic consistency?

Implementation note: a CASE type can be held as a record comprising a one byte tag and a data field. The data variants occupy the same memory; space must be reserved to hold the largest variant.

Union literals have the syntax of a two-field record. For example:

[sugar, TRUE]
[salt, 99.99]
[pepper, "Krakatoa"]

In cases where the compiler cannot resolve the type, the standard occam syntax for type decoration may be used:

[sugar, TRUE] (FOO)
[salt, 99.99] (FOO)
[pepper, "Krakatoa"] (FOO)

Union types are first class - their values may be freely assigned, communicated, abbreviated, passed as parameters and returned from functions, subject to the normal rules for strong typing. Union types (or variables) may also be MOBILE.

As an option, a special syntax for CASE variable L-values is proposed:

<CASE-variable> [ <tag> ]

For example:

PROC p (CHAN REAL32 in?, CHAN FOO out!)
  FOO a:
  SEQ
    a[pepper] := "Krakatoa"   -- a[sugar] := "Krakatoa" ... would not compile
    out ! a
    in ? a[salt]              -- in ? a[sugar] .. would not compile
    out ! a
:

has the same semantics as:

PROC p (CHAN REAL32 in?, CHAN FOO out!)
  FOO a:
  SEQ
    a := [pepper, "Krakatoa"]
    out ! a
    REAL32 tmp:
    SEQ
      in ? tmp
      a := [salt, tmp]
    out ! a
:

Using the L-value syntax is clear and enables the compiler to deliver the data directly. The alternative (immediately above) is less clear when inputting and implies an extra copy both for assignment and input (although an optimising compiler might be able to remove that).

Note: the tag name is always needed (somewhere) when assigning/inputting union variables. The required field cannot be inferred from the subtype of the value being assigned/input. This is because different variants of a union may have the same subtype -- for example:

DATA TYPE THREE.INT
  CASE
    hot, INT
    warm, INT
    cold, INT
:

which may have some interesting uses.

Using the L-value syntax for R-values may be a useful option. However, its use is only safe if the current value of the variable really does have the quoted tag. That would be checked, preferably, at compile time. If a run-time check must be generated by the compiler, a run-time failure means STOP.

The normal mechanism for processing data within a union type is through a CASE process. For example:

CASE a               -- 'a' could be an expression yielding a FOO type
  sugar
    ...  'a' has type BOOL in here
  salt
    ...  'a' has type REAL32 in here
  pepper
    ...  'a' has type [8]BYTE in here

The changing type of a has precedent (e.g. the way a SHARED channel becomes unshared within a CLAIM block). Implementation should be fairly simple (just reuse the same code generation needed for input from a CASE protocol). Note that data values (of the associated subtype) may be set within a tagged block (unless the subject of the CASE is a VAL) but that the tag cannot be changed -- there is no way to reference it!

This is the simplest proposal.

Proposal A1

This extends the above to allow CASE type variants to have zero, one or more data fields. For example:

DATA TYPE BAR
  CASE
    sugar, BOOL, REAL32, [8]BYTE
    salt, BYTE, BYTE
    pepper
:

Example BAR literals:

[sugar, TRUE, 99.99, "Krakatoa"]
[salt, 42, 'A']
[pepper]

which may sometimes need type decoration:

[sugar, TRUE, 99.99, "Krakatoa"] (BAR)
[salt, 42, 'A'] (BAR)
[pepper] (BAR)

If a is a BAR variable, then the assignment:

a := [sugar, TRUE, 99.99, "Krakatoa"]

could be written using the L-value syntax and occam2.1 parallel assignment:

a[sugar] := TRUE, 99.99, "Krakatoa"

with the same benefits as in proposal A.

CASE processing requires user names for the component fields of each variant subtype:

CASE a
  sugar, b, x, s
    ...  'b', 'x' and 's' are local names for the respective components
  salt, m, n
    ...  'm' and 'n' are local names for the respective components
  pepper
    ...

Within tagged blocks, the name a becomes hidden (so changing the tag value is not possible). The types of the local names are inferred from the declared fields of the tag variant.

(Added 18/06/2011) Alternatively, as with CASE protocols, we could insist that the local names for the fields must be declared:

CASE a
  BOOL b:
  REAL32 x:
  [8]BYTE s:
  sugar, b, x, s
    ...  'b', 'x' and 's' are local names for the respective components
  BYTE m, n:
  salt, m, n
    ...  'm' and 'n' are local names for the respective components
  pepper
    ...

However, the case for multiple fields is not strong. They can be managed with proposal A using a RECORD type for the data field. So, Ockham's Razor would appear to rule them out.

Note on CASE protocols versus union types

(Added 18/06/2011) On the other hand, union types might replace CASE protocols at some stage. Then, if in is a channel carrying BAR messages, a common pattern would be to input a message and, then, inspect and process it:

SEQ
  in ? a
  CASE a
    ...  as above (with field names explicitly declared)

But this pattern could be abbreviated to:

in ? CASE
  ...  as above (with field names explicitly declared)

i.e. we could retain the syntax for CASE inputs, even though CASE protocols have been dropped from the language.

The output syntax would be slightly different. For a CASE protocol output, we might have:

out ! sugar; TRUE; 99.99; "Krakatoa"

For a union type, this would become:

out ! [sugar, TRUE, 99.99, "Krakatoa"]

Proposal A2

Allowing zero fields has potential. It gives us what other languages (like Pascal) call user-defined 'enumerated' or 'scalar' types. For instance:

DATA TYPE PRIMARY.COLOUR
  CASE
    red
    green
    blue
:

with literals:

[red]
[green]
[blue]

or:

[red] (PRIMARY.COLOUR)
[green] (PRIMARY.COLOUR)
[blue] (PRIMARY.COLOUR)

So, this proposal is the same as proposal A0, but allows tag-only variants. We keep the same CASE process as before (where the CASE variable, or VAL, changes type inside the tag-blocks). In the case of a tag-only variant, the CASE variable, or VAL, no longer has any value and becomes hidden.

For example, if 'colour' is of type PRIMARY.COLOUR:

CASE colour
  red
    ...  deal with red case (the name 'colour' is hidden)
  green
    ...  deal with green case (the name 'colour' is hidden)
  blue
    ...  deal with blue case (the name 'colour' is hidden)

Proposal B (occam3)

As mentioned in the introduction, this follows the syntactic structures for RECORD types, rather than the CASE protocol structures of the A proposals. It only allows single field subtypes in each variant, so we shall only present the examples used in proposal A above.

A UNION data type has values with two fields: a tag, which is a programmer defined name, and data, which holds a value from a subtype identified by the tag. This property is invariant - data cannot be assigned with one tag and interpreted using another (except, of course, through explicit type conversion).

Example:

DATA TYPE FOO
  UNION
    BOOL sugar:
    REAL32 salt:
    [8]BYTE pepper:
:

So far, this has 1-1 correspondence with proposal A. This remains true for union literals:

(sugar :- TRUE) (FOO)
(salt :- 99.99) (FOO)
(pepper :- "Krakatoa") (FOO)

The occam3 Draft Reference Manual (section 5.3) seems to indicate that type decoration in such literals is compulsory. The syntax is a little strange, but cute!

Referencing a subtype data field is always made via an array/record field selector. But the following might run into trouble:

PROC p (CHAN REAL32 in?, CHAN FOO out!)
  FOO a:
  SEQ
    a[pepper] := "Krakatoa"
    out ! a
    in ? a[salt]
    out ! a
:

This is exactly the same as our proposal A (with the L-value option) and, arguably, has a better justification for the notation.

However, the above a[pepper] and a[salt] references may force a check (compile-time or run-time) that the a variable has the quoted tag? In the first instance, a is undefined ... so that would be an error. In the second, a has currently a pepper variant ... so that also is in error.

The occam3 manual is very terse in this section. Maybe, when a variable is being totally assigned or input like this, the check can be omitted? Otherwise, for the above we would have to write:

PROC p (CHAN REAL32 in?, CHAN FOO out!)
  FOO a:
  SEQ
    a := (pepper :- "Krakatoa") (FOO)
    out ! a
    a := (salt :- 0.0) (FOO)
    in ? a[salt]
    out ! a
:

which implies the same double copying as the proposal A example (when not using the L-value syntax option), as well as clumsiness and a redundant assignment.

The occam3 proposal also has a CASE process to guarantee safe access to subtype fields, though a new keyword is invented:

CASETAG a
  sugar
    ...  'a[sugar]' is allowed in here
  salt
    ...  'a[salt]' is allowed in here
  pepper
    ...  'a[pepper]' is allowed in here

References to a[salt] in the sugar-block would not compile (but see below). Similar points can be made about the other blocks.

Reference to the CASETAG variable (or VAL) is not forbidden inside these blcoks by the occam3 manual. Indeed, it is always needed (together with the selected tag name) to access and process the subtype data. But this allows us to switch the subtype in one of these blocks:

CASETAG a
  sugar
    a[sugar] := NOT a[sugar]
  salt
    SEQ
      ...  use/process 'a[salt]' for a while
      a := (pepper :- "Krakatoa") (FOO)
      ...  can now reference 'a[pepper]' but not 'a[salt]'
  pepper
    out.string (a[pepper], 0, screen!)

where the salt-block doesn't really seem to be playing a safe game?


Previous discussion (1)

PHW's favourite is proposal A2 ... now (18/06/2011) warming towards A1 ...

Previous discussion (2)

PHW remembers a heated discussion about scalar types and the occam3 UNION proposal in the early 1990s. Were Jon Kerridge and Geoff Barrett involved?

Under the occam3 proposal, we could have:

DATA TYPE PRIMARY.COLOUR
  UNION
    NONE red, green, blue:
:

where NONE was the "empty record" type:

DATA TYPE NONE
  RECORD
:

NONE was in the occam3 manual (but it was not put into occam2.1, whose records must have at least one field). NONE only has one value, [] (NONE), which requires no bits for storage.

For PRIMARY.COLOUR, a scalar literal would look something like:

(red :- [] (NONE)) (PRIMARY.COLOUR)

But I remember some compromise on a special (and simpler) syntax being reached??

OEP/156 (last edited 2011-06-18 12:46:37 by phw)