Character Encodings
Many of the external interfaces to QM support encoding so that characters outside the 8-bit character set can be transmitted or stored as a sequence of 8-bit characters. QM has built-in support for encodings such as UTF-8, but also allows users to add their own, as described under the X (encoding) conversion code.
The QMBasic file operations that read or write directory files include an ENCODING clause to specify an encoding that will override any set by the VOC F-type entry for the file or when the file was opened.
An encoding name consists of two parts: a name related to the encoding style, and an optional series of modes, each represented by a single case-insensitive character, that determine how the encoding handles certain situations. The modes are separated from the name by a period (e.g. "UTF8.B"). Some mode settings affect only input conversion or only output conversion. In such situations, the mode is ignored by the conversion that does not use it.
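The name and mode structure described above can be illustrated with a short Python sketch. This is not QM code and not how QM parses the name internally; the function split_encoding_name is a hypothetical helper that only shows the split at the period and the case folding of the mode characters.

```python
def split_encoding_name(spec):
    """Split a specification such as "UTF8.B" into (name, set of modes)."""
    # Everything before the first period is the encoding name;
    # everything after it is a run of single-character modes.
    name, _, modes = spec.partition(".")
    # Mode characters are case insensitive, so fold them to upper case.
    return name, {m.upper() for m in modes}


print(split_encoding_name("UTF8.B"))   # ('UTF8', {'B'})
```

A name with no period, such as "NULL", yields an empty set of modes.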
Use of the "A" mode setting described below for UTF-8, UTF-16 and UCS-2 encoding allows an application to read data encoded using any of these methods by recognition of a leading byte order mark. This applies only to input conversion. If there is no leading byte order mark, or when performing output conversion, the "A" mode is ignored and the encoding that was named is applied.
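The general idea of recognising a leading byte order mark can be sketched in Python as follows. This is an illustration of the technique only, not QM's implementation; detect_by_bom and its fallback argument are hypothetical names. Note that a two-byte mark identifies the byte order but cannot by itself distinguish UTF-16 from UCS-2, so the sketch reports both as UTF-16.

```python
import codecs


def detect_by_bom(data, fallback):
    """Return (codec_name, payload) chosen from a leading byte order mark."""
    if data.startswith(codecs.BOM_UTF8):        # EF BB BF
        return "utf-8", data[len(codecs.BOM_UTF8):]
    if data.startswith(codecs.BOM_UTF16_BE):    # FE FF, high byte first
        return "utf-16-be", data[2:]
    if data.startswith(codecs.BOM_UTF16_LE):    # FF FE, low byte first
        return "utf-16-le", data[2:]
    # No mark present: apply whatever encoding was named by the caller.
    return fallback, data


codec, payload = detect_by_bom(b"\xff\xfeA\x00", "utf-8")
print(codec, payload.decode(codec))   # utf-16-le A
```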
Null Encoding
Specifying the encoding name as "NULL" disables encoding. This is the default behaviour of QM in the absence of an encoding name and hence should only be required when overriding an encoding name set earlier. For example, an application might use the NULL encoding name in an OPEN where the VOC F-type record specifies an unwanted encoding. Note that a statement such as READ REC ENCODING "" FROM FVAR does not disable an encoding set via the VOC entry or in the OPEN statement. A null string as an encoding name in this context is equivalent to not having an encoding clause at all.
The NULL encoding supports just one mode qualifier:
JS Encoding
The JS encoding is for JavaScript or JSON. Encoding data in this way replaces certain characters with escape sequences:
The JS encoding supports just one mode qualifier:
UTF-8 Encoding
UTF-8 is one of the most widely used encodings. Each character is transformed into an encoded form that is made up from one, two or three bytes.
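The one, two and three byte forms can be seen with Python's standard UTF-8 codec. This is a small illustration of the encoding itself rather than of QM; utf8_length is a hypothetical helper.

```python
def utf8_length(ch):
    """Number of bytes needed to hold one character in UTF-8."""
    return len(ch.encode("utf-8"))


# U+0041 (A), U+00E9 (e with acute accent), U+20AC (euro sign)
for ch in ("\u0041", "\u00e9", "\u20ac"):
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X} -> {utf8_length(ch)} byte(s): {encoded.hex(' ')}")
# U+0041 -> 1 byte(s): 41
# U+00E9 -> 2 byte(s): c3 a9
# U+20AC -> 3 byte(s): e2 82 ac
```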
The QM encoding name is UTF-8 (the hyphen may be omitted). The optional modes are:
On an input conversion, characters that lie outside the supported character set (16-bit on ECS systems, 8-bit on non-ECS systems) are replaced with a substitute character. Unicode defines this as character U+FFFD. On non-ECS systems, invalid characters are replaced by a question mark.
UTF-16 Encoding
UTF-16 encoding records Unicode data as byte pairs. The UTF-16 standard specifies that the default behaviour should be high byte first but QM allows either ordering. The byte order mark (codepoint U+FEFF) can be included as the first byte pair to allow programs reading the data to automatically detect the byte order.
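The effect of byte ordering and of the byte order mark can be shown with Python's standard codecs. This sketch illustrates the byte layout only and makes no claim about QM's mode names; encode_utf16 and its parameters are hypothetical.

```python
def encode_utf16(text, big_endian=True, with_bom=False):
    """Encode text as UTF-16 byte pairs in the requested byte order."""
    codec = "utf-16-be" if big_endian else "utf-16-le"
    # The byte order mark is simply U+FEFF encoded in the chosen order,
    # so a reader seeing FE FF or FF FE can tell which order was used.
    prefix = "\ufeff" if with_bom else ""
    return (prefix + text).encode(codec)


print(encode_utf16("A").hex(" "))                                   # 00 41
print(encode_utf16("A", big_endian=False).hex(" "))                 # 41 00
print(encode_utf16("A", with_bom=True).hex(" "))                    # fe ff 00 41
print(encode_utf16("A", big_endian=False, with_bom=True).hex(" "))  # ff fe 41 00
```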
UTF-16 cannot encode Unicode characters U+D800 to U+DFFF as these are reserved for use in "surrogate pairs" for characters outside the Basic Multilingual Plane (BMP). Data that includes these characters results in an error status from the conversion, but such data should never occur because of the characters' reserved role.
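Python's UTF-16 codec applies the same restriction and can be used to illustrate it. Python raises an exception where QM returns an error status, so this is an analogy rather than a description of QM's behaviour; can_encode_utf16 is a hypothetical helper.

```python
def can_encode_utf16(text):
    """Return True if every character in text can be encoded as UTF-16."""
    try:
        text.encode("utf-16-be")
        return True
    except UnicodeEncodeError:
        # Raised for code points in the reserved range U+D800 to U+DFFF.
        return False


print(can_encode_utf16("\u00e9"))   # True
print(can_encode_utf16("\ud800"))   # False
```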
The QM encoding name is UTF-16 (the hyphen may be omitted). The optional modes are:
UCS-2 Encoding
UCS-2 encoding records Unicode data as byte pairs in a similar manner to UTF-16 but does not reserve characters U+D800 to U+DFFF for special use. The UCS-2 standard specifies that the default behaviour should be high byte first but QM allows either ordering. The byte order mark (codepoint U+FEFF) can be included as the first byte pair to allow programs reading the data to automatically detect the byte order.
The QM encoding name is UCS-2 (the hyphen may be omitted). The optional modes are: