OEP

179

Title

UTF-8 Source Code Encoding

Summary

Define the encoding of Occam source code

Owner

Rick Beton <rick.beton@gmail.com>

Status

Proposed

Date-Proposed

2009-11-07

Keywords

language encoding utf8

Rationale

The character encoding of a byte sequence must be defined if any text other than 7-bit ASCII is to be represented. Given that Occam is currently ASCII-based, UTF-8 is the only sensible step forward. All ISO-8859 encodings are limited in range. UTF-16 is not backward-compatible with ASCII, and other multi-byte encodings suffer from similar problems.
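The compatibility argument can be checked with a small Python sketch. It is an illustration only, not part of the proposal: the source line and the helper name `fits_latin1` are made up for the example, and ISO-8859-1 stands in for the ISO-8859 family.

```python
# Illustration only: how candidate encodings treat ASCII and non-ASCII text.
ascii_source = "PROC main ()"  # hypothetical ASCII-only source line

# UTF-8 leaves every 7-bit ASCII byte unchanged, so existing files stay valid.
utf8_bytes = ascii_source.encode("utf-8")
is_ascii_compatible = utf8_bytes == ascii_source.encode("ascii")

# UTF-16 spends two bytes on each ASCII character, so the bytes differ.
utf16_bytes = ascii_source.encode("utf-16-le")
utf16_differs = utf16_bytes != ascii_source.encode("ascii")


def fits_latin1(text):
    """Return True if text can be encoded in ISO-8859-1 (256 characters only)."""
    try:
        text.encode("iso-8859-1")
        return True
    except UnicodeEncodeError:
        return False
```

Here `fits_latin1("\u00e9")` succeeds for e-acute, while `fits_latin1("\u03bb")` fails for the Greek letter lambda, which is the range limitation mentioned above.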

The purpose of this change is to allow text strings to contain any character that can be represented in Unicode.

Limitations

This proposal enhances Occam slightly without breaking its backward compatibility in any serious way. Unfortunately, complete internationalization support would also require a new CHAR type in addition to BYTE. This is because a single character cannot always be represented in one BYTE; in UTF-8, every character outside 7-bit ASCII needs two, three, or four bytes. The single-quote byte literal syntax therefore does not properly support anything other than 7-bit ASCII.
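As a rough illustration of why one BYTE is not enough, the following Python sketch counts the UTF-8 bytes needed for a few sample characters. The helper `utf8_length` and the choice of samples are assumptions made for this example.

```python
# Illustration only: number of UTF-8 bytes needed for a single character.
def utf8_length(ch):
    """Return how many bytes one character occupies in UTF-8."""
    return len(ch.encode("utf-8"))


# Sample characters: U+0041 (A), U+00E9 (e-acute), U+20AC (euro sign),
# U+1D11E (musical G clef, outside the 16-bit range).
samples = ["\u0041", "\u00e9", "\u20ac", "\U0001D11E"]
lengths = [utf8_length(ch) for ch in samples]
```

Only the first sample fits in a single byte; the others would be truncated or rejected by a byte literal.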

To mitigate this, a new fundamental CHAR type could be added with UTF-16 semantics, like the char in Java. Even this has the same limitation, but the characters that need more than 16 bits are rarely used in practice, whereas the number of code points that cannot be represented in 8 bits is very large. This is a big change and would need further discussion, whereas defining UTF-8 as the standard encoding is a relatively minor change.
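The remaining limitation of a 16-bit CHAR can be seen in a short Python sketch. The helper `utf16_units` is hypothetical and only counts 16-bit code units; it says nothing about how an Occam CHAR type would actually be specified.

```python
# Illustration only: number of 16-bit code units a character needs in UTF-16.
def utf16_units(ch):
    """Return the count of 16-bit code units used to encode one character."""
    return len(ch.encode("utf-16-be")) // 2


bmp_char = "\u20ac"             # euro sign, fits in one 16-bit unit
supplementary_char = "\U0001D11E"  # musical G clef, needs a surrogate pair

bmp_units = utf16_units(bmp_char)
supplementary_units = utf16_units(supplementary_char)
```

A character such as the second one would not fit in a single 16-bit CHAR, just as the euro sign does not fit in a single BYTE.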

OEP/179 (last edited 2011-08-05 23:19:29 by frmb)