Character Encodings

Many of the external interfaces to QM support encoding to allow characters outside the 8-bit character set to be transmitted or stored as a sequence of 8-bit characters. QM has built-in support for encodings such as UTF-8 but also allows users to add their own, as described with the X (encoding) conversion code.

The QMBasic file operations that read or write directory files include an ENCODING clause to specify an encoding that will override any set by the VOC F-type entry for the file or when the file was opened.

An encoding name consists of two parts: a name related to the encoding style, and an optional series of modes, each represented by a single case-insensitive character, that determine how the encoding will handle certain situations. The modes are separated from the name by a period (e.g. "UTF8.B"). Some mode settings affect only input conversion or only output conversion. In such situations, the mode will be ignored by the conversion that does not use it.
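The structure of an encoding name can be illustrated with a short Python sketch. This is not QM code; the function name split_encoding_name is hypothetical and the sketch assumes only what is stated above: a name, an optional period, and a series of single-character, case-insensitive modes.

```python
def split_encoding_name(spec):
    """Split 'NAME.MODES' into the name and a set of mode letters.

    Hypothetical helper for illustration only. Mode letters are folded
    to upper case because the modes are case insensitive.
    """
    name, _, modes = spec.partition(".")
    return name, {mode.upper() for mode in modes}


if __name__ == "__main__":
    print(split_encoding_name("UTF8.B"))
    print(split_encoding_name("NULL"))
```

A name with no period simply yields an empty set of modes.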

Use of the "A" mode setting described below for UTF-8, UTF-16 and UCS-2 encoding allows an application to read data encoded using any of these methods by recognition of a leading byte order mark. This applies only to input conversion. If there is no leading byte order mark or when performing output conversion, the "A" mode is ignored, applying whatever encoding name is specified.
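Recognition of a leading byte order mark can be sketched in Python as follows. This is an illustration of the general technique rather than QM's implementation; the function name detect_by_bom is hypothetical, and the Python codec names stand in for the QM encoding names.

```python
import codecs


def detect_by_bom(data, default):
    """Return a Python codec name chosen from a leading byte order mark.

    Falls back to 'default' when no byte order mark is present, in the
    same way that the "A" mode falls back to the named encoding.
    """
    if data.startswith(codecs.BOM_UTF8):        # EF BB BF
        return "utf-8"
    if data.startswith(codecs.BOM_UTF16_BE):    # FE FF
        return "utf-16-be"
    if data.startswith(codecs.BOM_UTF16_LE):    # FF FE
        return "utf-16-le"
    return default


if __name__ == "__main__":
    print(detect_by_bom(b"\xef\xbb\xbfabc", "utf-16-be"))
    print(detect_by_bom(b"abc", "utf-8"))
```

The sketch only selects the encoding; it leaves the byte order mark in the data.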

Specifying the encoding name as "NULL" disables encoding. This is the default behaviour of QM in the absence of an encoding name and hence should only be required when overriding an encoding name set earlier. For example, an application might use the NULL encoding name in an OPEN where the VOC F-type record specifies an unwanted encoding.

Note that a statement that supplies a null string as the encoding name does not disable an encoding set via the VOC entry or in the OPEN statement. A null string as an encoding name in this context is equivalent to not having an encoding clause at all.

M	Exchange the five accented characters displaced from the positions occupied by the mark characters (U+00FB to U+00FF) with their alternative location (U+F8FB to U+F8FF).

The JS encoding is for JavaScript or JSON. Encoding data in this way replaces certain characters with escape sequences:

M	Exchange the five accented characters displaced from the positions occupied by the mark characters (U+00FB to U+00FF) with their alternative location (U+F8FB to U+F8FF).
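As a rough illustration of the kind of escaping used for JavaScript or JSON data, the Python sketch below uses Python's own json module. It shows Python's escaping rules, which may differ in detail from the set of characters that the QM JS encoding replaces, and json.dumps also adds surrounding quotes that are not part of the data.

```python
import json

# A string holding a double quote, a newline, a tab and a backslash.
text = 'He said "hi"\n\tdone \\ end'

# json.dumps replaces these characters with escape sequences.
encoded = json.dumps(text)

# json.loads reverses the escaping.
decoded = json.loads(encoded)

if __name__ == "__main__":
    print(encoded)
```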

UTF-8 is one of the most widely used encodings. Each character is transformed into an encoded form that is made up from one, two or three bytes.

The QM encoding name is UTF-8 (the hyphen may be omitted). The optional modes are:

A	Applies only to input conversion. Automatically selects UTF-8 or UCS-2 encoding based on a leading byte order mark character. Because UCS-2 in a 16-bit character environment is effectively a superset of UTF-16, UTF-16 encoded data will also be recognised by this mode setting.

B	Applies only to output conversion. Adds a leading byte order mark code (internal character U+FEFF, encoded to three bytes as hexadecimal EF BB BF) if not already present. Although byte ordering is irrelevant in UTF-8, the byte order mark is frequently inserted by software, perhaps simply as a way to recognise that the data is encoded in UTF-8 format.

C	Applies only to input conversion. Performs a "carry" from successive uses of the conversion such that an incomplete UTF-8 sequence at the end of one conversion can be continued in the next conversion. This is of use where data may arrive in incomplete fragments such as when reading from a socket.

D	Applies only to input conversion. Discards a leading byte order mark, if present.

M	Exchange the five accented characters displaced from the positions occupied by the mark characters (U+00FB to U+00FF) with their alternative location (U+F8FB to U+F8FF).

Preserves characters 251 to 255 in the input data (input conversion only). Although strictly UTF-8 data cannot contain these characters, some other multivalue products allow their use as unencoded mark characters. This mode, intended only for use when importing data into QM, copies these characters unchanged so that the dynamic array structure is maintained.

W	Applies only to input conversion. Replaces embedded byte order marks not at the start of the data with the word joiner character (U+2060).
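The "carry" behaviour described for the C mode resembles an incremental decoder. The Python sketch below is an analogy using Python's codecs module, not QM code, and the function name decode_fragments is hypothetical. It decodes data that arrives in fragments where a multi-byte UTF-8 sequence is split across two fragments.

```python
import codecs


def decode_fragments(fragments):
    """Decode UTF-8 fragments, carrying incomplete sequences forward."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    pieces = []
    for fragment in fragments:
        # Bytes of an unfinished character are held inside the decoder
        # and completed by the next call.
        pieces.append(decoder.decode(fragment))
    # final=True raises an error if a sequence is still incomplete.
    pieces.append(decoder.decode(b"", final=True))
    return "".join(pieces)


if __name__ == "__main__":
    # The two-byte and three-byte sequences are each split across fragments.
    print(decode_fragments([b"caf\xc3", b"\xa9 \xe2\x82", b"\xac"]))
```

Decoding each fragment independently would fail or lose the split characters, which is the situation that a carry avoids when reading from a socket.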

On an input conversion, characters that lie outside the supported character set (16-bit on ECS systems, 8-bit on non-ECS systems) are replaced with a substitute character. Unicode defines this as character U+FFFD. On non-ECS systems, invalid characters are replaced by a question mark.
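The substitution rule above can be sketched in Python as follows. The function name substitute_unsupported and its ecs parameter are hypothetical, and the sketch operates on already decoded text rather than reproducing QM's conversion.

```python
def substitute_unsupported(text, ecs=True):
    """Replace characters outside the supported character set.

    ecs=True models a 16-bit character set with U+FFFD as the
    substitute; ecs=False models an 8-bit set with a question mark.
    """
    limit = 0xFFFF if ecs else 0xFF
    substitute = "\ufffd" if ecs else "?"
    return "".join(ch if ord(ch) <= limit else substitute for ch in text)


if __name__ == "__main__":
    print(ascii(substitute_unsupported("a\U0001F600b")))
    print(substitute_unsupported("a\u20acb", ecs=False))
```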

UTF-16 encoding records Unicode data as byte pairs. The UTF-16 standard specifies that the default behaviour should be high byte first but QM allows either ordering. The byte order mark (codepoint U+FEFF) can be included as the first byte pair to allow programs reading the data to automatically detect the byte order.
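The two byte orderings, and the way in which the byte order mark reveals them, can be seen in a short Python sketch. This uses Python's own UTF-16 codecs purely to show the byte layout.

```python
# "A" (U+0041) followed by the euro sign (U+20AC).
text = "A\u20ac"

big = text.encode("utf-16-be")      # high byte first: 00 41 20 AC
little = text.encode("utf-16-le")   # low byte first:  41 00 AC 20

# The byte order mark U+FEFF in each ordering. A reader that sees
# FE FF knows the data is high byte first; FF FE means low byte first.
bom_big = "\ufeff".encode("utf-16-be")
bom_little = "\ufeff".encode("utf-16-le")

if __name__ == "__main__":
    print(big.hex(" "), "/", little.hex(" "))
    print(bom_big.hex(" "), "/", bom_little.hex(" "))
```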

UTF-16 cannot encode Unicode characters U+D800 to U+DFFF as these are reserved for use in "surrogate pairs" for characters outside the BMP. Data including these characters will result in an error status from the conversion but should never occur due to their reserved role.
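Python's UTF-16 codec shows the same restriction and can be used to illustrate it. The helper name try_utf16 is hypothetical; the sketch returns None where QM would report an error status.

```python
def try_utf16(text):
    """Encode as big endian UTF-16, or return None if this is not possible."""
    try:
        return text.encode("utf-16-be")
    except UnicodeEncodeError:
        # Raised for a lone character in the range U+D800 to U+DFFF.
        return None


# A character outside the BMP is written as a surrogate pair:
# U+1F600 becomes the byte pairs D8 3D and DE 00.
outside_bmp = "\U0001F600".encode("utf-16-be")

if __name__ == "__main__":
    print(try_utf16("A"))
    print(try_utf16("\ud800"))
    print(outside_bmp.hex(" "))
```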

The QM encoding name is UTF-16 (the hyphen may be omitted). The optional modes are:

A	Applies only to input conversion. Automatically selects UTF-8 or UCS-2 encoding based on a leading byte order mark character. Because UCS-2 in a 16-bit character environment is effectively a superset of UTF-16, UTF-16 encoded data will also be recognised by this mode setting.

B	Applies only to output conversion. Adds a leading byte order mark code if not already present.

C	Applies only to input conversion. Performs a "carry" from successive uses of the conversion such that an incomplete UTF-16 sequence at the end of one conversion can be continued in the next conversion. This is of use where data may arrive in incomplete fragments such as when reading from a socket.

L	Specifies that the encoded data is low byte first (little endian). On an input conversion, a leading byte order mark may override this setting.

M	Exchange the five accented characters displaced from the positions occupied by the mark characters (U+00FB to U+00FF) with their alternative location (U+F8FB to U+F8FF).
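The interaction between the L mode and a leading byte order mark on input can be sketched in Python. The function name decode_utf16 and its little_endian parameter are hypothetical. The sketch assumes that a byte order mark, when present, always takes priority, whereas the description above says only that it may override the setting.

```python
import codecs


def decode_utf16(data, little_endian=False):
    """Decode byte pairs, letting a leading byte order mark set the order.

    little_endian plays the role of the L mode. The byte order mark is
    left in the result as U+FEFF.
    """
    if data.startswith(codecs.BOM_UTF16_BE):     # FE FF
        little_endian = False
    elif data.startswith(codecs.BOM_UTF16_LE):   # FF FE
        little_endian = True
    return data.decode("utf-16-le" if little_endian else "utf-16-be")


if __name__ == "__main__":
    print(decode_utf16(b"A\x00", little_endian=True))
    print(ascii(decode_utf16(b"\xfe\xff\x00A", little_endian=True)))
```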

UCS-2 encoding records Unicode data as byte pairs in a similar manner to UTF-16 but does not reserve characters U+D800 to U+DFFF for special use. The UCS-2 standard specifies that the default behaviour should be high byte first but QM allows either ordering. The byte order mark (codepoint U+FEFF) can be included as the first byte pair to allow programs reading the data to automatically detect the byte order.

The QM encoding name is UCS-2 (the hyphen may be omitted). The optional modes are:

A	Applies only to input conversion. Automatically selects UTF-8 or UCS-2 encoding based on a leading byte order mark character. Because UCS-2 in a 16-bit character environment is effectively a superset of UTF-16, UTF-16 encoded data will also be recognised by this mode setting.

B	Applies only to output conversion. Adds a leading byte order mark code if not already present.

C	Applies only to input conversion. Performs a "carry" from successive uses of the conversion such that an incomplete UCS-2 sequence at the end of one conversion can be continued in the next conversion. This is of use where data may arrive in incomplete fragments such as when reading from a socket.

L	Specifies that the encoded data is low byte first (little endian). On an input conversion, a leading byte order mark may override this setting.

M	Exchange the five accented characters displaced from the positions occupied by the mark characters (U+00FB to U+00FF) with their alternative location (U+F8FB to U+F8FF).
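The exchange performed by the M mode can be sketched in Python as a character translation. The names EXCHANGE and exchange_marks are hypothetical, and the sketch assumes that "exchange" means a two-way swap between the two ranges.

```python
# Map U+00FB..U+00FF to U+F8FB..U+F8FF.
_PAIRS = {0x00FB + i: 0xF8FB + i for i in range(5)}

# Assumption: the exchange works in both directions, so add the
# reverse mapping as well.
EXCHANGE = {**_PAIRS, **{high: low for low, high in _PAIRS.items()}}


def exchange_marks(text):
    """Swap each character in one range with its partner in the other."""
    return text.translate(EXCHANGE)


if __name__ == "__main__":
    print(ascii(exchange_marks("\u00fb\u00ff")))
    print(ascii(exchange_marks("\uf8fe")))
```

Because the mapping is symmetric, applying it twice returns the original text.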