std.encoding

Classes and functions for handling and transcoding between various encodings.

For cases where the encoding is known at compile-time, functions are provided for arbitrary encoding and decoding of characters, arbitrary transcoding between strings of different type, as well as validation and sanitization.

Encodings currently supported are UTF-8, UTF-16, UTF-32, ASCII, ISO-8859-1 (also known as LATIN-1), ISO-8859-2 (LATIN-2), WINDOWS-1250, WINDOWS-1251 and WINDOWS-1252.

For cases where the encoding is not known at compile-time, but is known at run-time, the abstract class EncodingScheme and its subclasses are provided. To construct a run-time encoder/decoder, one writes, for example:

----------------------------------------------------
auto e = EncodingScheme.create("utf-8");
----------------------------------------------------

This library supplies EncodingScheme subclasses for ASCII, ISO-8859-1 (also known as LATIN-1), ISO-8859-2 (LATIN-2), WINDOWS-1250, WINDOWS-1251, WINDOWS-1252, UTF-8, and (on little-endian architectures) UTF-16LE and UTF-32LE; or (on big-endian architectures) UTF-16BE and UTF-32BE.
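As a minimal sketch of the run-time path, the following looks up a scheme by name and round-trips one code point through it. The helper `encodeWith` is hypothetical (it is not part of std.encoding), and the 4-byte buffer size is an assumption that should hold for the built-in schemes listed above.

```d
import std.encoding;

// Hypothetical helper (not part of std.encoding): encode one code point
// using a scheme that is selected by name at run time.
ubyte[] encodeWith(string schemeName, dchar c)
{
    auto scheme = EncodingScheme.create(schemeName);
    // Assumption: 4 bytes are enough for any code point in the built-in schemes.
    auto buffer = new ubyte[](4);
    immutable written = scheme.encode(c, buffer);
    return buffer[0 .. written];
}

void main()
{
    auto bytes = encodeWith("utf-8", '\u00E9');
    assert(bytes == [0xC3, 0xA9]); // U+00E9 takes two bytes in UTF-8

    // decode() advances the slice past the code point it consumed.
    auto scheme = EncodingScheme.create("utf-8");
    const(ubyte)[] rest = bytes;
    dchar decoded = scheme.decode(rest);
    assert(decoded == '\u00E9');
    assert(rest.length == 0);

    // The same code point is a single byte in Latin-1.
    assert(encodeWith("latin1", '\u00E9') == [0xE9]);
}
```

Because the scheme is resolved from a string, the same helper can serve any registered encoding without recompiling.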

This library provides a mechanism whereby other modules may add EncodingScheme subclasses for any other encoding.

Copyright

License

Boost License 1.0.

Authors

Janice Caron

Source: std/encoding.d

var bomTable INVALID_SEQUENCE utfBOM

tmpl EncoderFunctions EncoderInstance EncoderInstance EncoderInstance EncoderInstance EncoderInstance EncoderInstance EncoderInstance EncoderInstance EncoderInstance EncoderInstance GenericEncoder

alias AsciiString BOMSeq Latin1String Latin2String Windows1250String Windows1251String Windows1252String

enum AsciiChar BOM Latin1Char Latin2Char Windows1250Char Windows1251Char Windows1252Char

struct CodePoints CodeUnits

class EncodingException EncodingScheme EncodingSchemeASCII EncodingSchemeLatin1 EncodingSchemeLatin2 EncodingSchemeUtf16Native EncodingSchemeUtf32Native EncodingSchemeUtf8 EncodingSchemeWindows1250 EncodingSchemeWindows1251 EncodingSchemeWindows1252 UnrecognizedEncodingException

fn canEncode codePoints codeUnits decode decodeReverse encode encodedLength encodingName firstSequence getBOM index isValid isValidCodePoint isValidCodeUnit lastSequence safeDecode sanitize transcode validLength

Types 28

struct CodePoints(E)

Fields

const(E)[] s

Methods

int opApply(scope int delegate(ref dchar) dg)

int opApply(scope int delegate(ref size_t, ref dchar) dg)

int opApplyReverse(scope int delegate(ref dchar) dg)

int opApplyReverse(scope int delegate(ref size_t, ref dchar) dg)

Constructors

this(const(E)[] s)

struct CodeUnits(E)

Fields

E[] s

Methods

int opApply(scope int delegate(ref E) dg)

int opApplyReverse(scope int delegate(ref E) dg)

Constructors

this(dchar d)

enum AsciiChar : ubyte

Defines an ASCII-encoded character.

_init

alias AsciiString = immutable(AsciiChar)[]

Defines an ASCII-encoded string (as an array of immutable(AsciiChar)).

enum Latin1Char : ubyte

Defines a Latin1-encoded character.

_init

alias Latin1String = immutable(Latin1Char)[]

Defines a Latin1-encoded string (as an array of immutable(Latin1Char)).

enum Latin2Char : ubyte

Defines a Latin2-encoded character.

_init

alias Latin2String = immutable(Latin2Char)[]

Defines a Latin2-encoded string (as an array of immutable(Latin2Char)).

enum Windows1250Char : ubyte

Defines a Windows1250-encoded character.

_init

alias Windows1250String = immutable(Windows1250Char)[]

Defines a Windows1250-encoded string (as an array of immutable(Windows1250Char)).

enum Windows1251Char : ubyte

Defines a Windows1251-encoded character.

_init

alias Windows1251String = immutable(Windows1251Char)[]

Defines a Windows1251-encoded string (as an array of immutable(Windows1251Char)).

enum Windows1252Char : ubyte

Defines a Windows1252-encoded character.

_init

alias Windows1252String = immutable(Windows1252Char)[]

Defines a Windows1252-encoded string (as an array of immutable(Windows1252Char)).

class EncodingException : Exception

The base class for exceptions thrown by this module.

Constructors

this(string msg)

class UnrecognizedEncodingException : EncodingException

Constructors

this(string msg)
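A hedged sketch of how these exception classes might be used: it assumes that EncodingScheme.create throws EncodingException (or a subclass of it) when given a name that has not been registered. The helper `isSupported` is hypothetical and not part of this module.

```d
import std.encoding;

// Hypothetical helper: reports whether a scheme name is known at run time.
// Assumption: create() signals an unknown name by throwing EncodingException
// or one of its subclasses.
bool isSupported(string name)
{
    try
    {
        auto scheme = EncodingScheme.create(name);
        return scheme !is null;
    }
    catch (EncodingException)
    {
        return false;
    }
}

void main()
{
    assert(isSupported("UTF-8"));
    assert(!isSupported("no-such-encoding"));
}
```

Catching the base class keeps the helper working whichever subclass is actually thrown.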

class EncodingScheme

Abstract base class of all encoding schemes.

Fields

EncodingScheme function()[string] supported

string[string] supportedFactories

Methods

void register(Klass: EncodingScheme)()

Registers a subclass of EncodingScheme.

void register(string className)

EncodingScheme create(string encodingName)

Obtains a subclass of EncodingScheme which is capable of encoding and decoding the named encoding scheme.

string toString()

Returns the standard name of the encoding scheme.

string[] names()

Returns an array of all known names for this encoding scheme.

bool canEncode(dchar c)

Returns true if the character c can be represented in this encoding scheme.

size_t encodedLength(dchar c)

Returns the number of ubytes required to encode this code point.

size_t encode(dchar c, ubyte[] buffer)

Encodes a single code point into a user-supplied, fixed-size buffer.

dchar decode(ref const(ubyte)[] s)

Decodes a single code point.

dchar safeDecode(ref const(ubyte)[] s)

Decodes a single code point. The input does not have to be valid.

immutable(ubyte)[] replacementSequence() @property

Returns the sequence of ubytes to be used to represent any character which cannot be represented in the encoding scheme.

bool isValid(const(ubyte)[] s)

Returns true if the array is encoded correctly.

size_t validLength()(const(ubyte)[] s)

Returns the length of the longest possible substring, starting from the first element, which is validly encoded.

immutable(ubyte)[] sanitize()(immutable(ubyte)[] s)

Sanitizes an array by replacing malformed ubyte sequences with valid ubyte sequences. The result is guaranteed to be valid for this encoding scheme.

size_t firstSequence()(const(ubyte)[] s)

Returns the length of the first encoded sequence.

size_t count()(const(ubyte)[] s)

Returns the total number of code points encoded in a ubyte array.

ptrdiff_t index()(const(ubyte)[] s, size_t n)

Returns the array index at which the (n+1)th code point begins.
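The validation and counting methods above can be combined as in the following sketch, which works on raw bytes through a scheme obtained at run time. The sample byte strings are illustrative assumptions; the invalid one contains 0xF0 0x80, which is not a well-formed UTF-8 sequence.

```d
import std.encoding;

void main()
{
    auto scheme = EncodingScheme.create("utf-8");

    // "caf" followed by U+00E9: four code points stored in five bytes.
    auto good = cast(immutable(ubyte)[]) "caf\u00E9";
    assert(scheme.isValid(good));
    assert(scheme.count(good) == 4);
    assert(scheme.index(good, 3) == 3); // the 4th code point starts at byte 3

    // 0xF0 0x80 is malformed, so only the prefix "hello " is valid.
    auto bad = cast(immutable(ubyte)[]) "hello \xF0\x80world";
    assert(!scheme.isValid(bad));
    assert(scheme.validLength(bad) == 6);

    // sanitize() substitutes the scheme's replacement sequence.
    auto fixed = scheme.sanitize(bad);
    assert(scheme.isValid(fixed));
}
```

Checking isValid before calling count matters here, since count expects correctly encoded input.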

class EncodingSchemeASCII : EncodingScheme

EncodingScheme to handle ASCII

This scheme recognises the following names: "ANSI_X3.4-1968", "ANSI_X3.4-1986", "ASCII", "IBM367", "ISO646-US", "ISO_646.irv:1991", "US-ASCII", "cp367", "csASCII", "iso-ir-6", "us"

Methods

string[] names() @safe pure nothrow

string toString() @safe pure nothrow @nogc

bool canEncode(dchar c) @safe pure nothrow @nogc

size_t encodedLength(dchar c) @safe pure nothrow @nogc

size_t encode(dchar c, ubyte[] buffer) @safe pure nothrow @nogc

dchar decode(ref const(ubyte)[] s) @safe pure nothrow @nogc

dchar safeDecode(ref const(ubyte)[] s) @safe pure nothrow @nogc

immutable(ubyte)[] replacementSequence() @property @safe pure nothrow @nogc

class EncodingSchemeLatin1 : EncodingScheme

EncodingScheme to handle Latin-1

This scheme recognises the following names: "CP819", "IBM819", "ISO-8859-1", "ISO_8859-1", "ISO_8859-1:1987", "csISOLatin1", "iso-ir-100", "l1", "latin1"

Methods

string[] names() @safe pure nothrow

string toString() @safe pure nothrow @nogc

bool canEncode(dchar c) @safe pure nothrow @nogc

size_t encodedLength(dchar c) @safe pure nothrow @nogc

size_t encode(dchar c, ubyte[] buffer) @safe pure nothrow @nogc

dchar decode(ref const(ubyte)[] s) @safe pure nothrow @nogc

dchar safeDecode(ref const(ubyte)[] s) @safe pure nothrow @nogc

immutable(ubyte)[] replacementSequence() @property @safe pure nothrow @nogc

class EncodingSchemeLatin2 : EncodingScheme

EncodingScheme to handle Latin-2

This scheme recognises the following names: "Latin 2", "ISO-8859-2", "ISO_8859-2", "ISO_8859-2:1999", "Windows-28592"

Methods

string[] names() @safe pure nothrow

string toString() @safe pure nothrow @nogc

bool canEncode(dchar c) @safe pure nothrow @nogc

size_t encodedLength(dchar c) @safe pure nothrow @nogc

size_t encode(dchar c, ubyte[] buffer) @safe pure nothrow @nogc

dchar decode(ref const(ubyte)[] s) @safe pure nothrow @nogc

dchar safeDecode(ref const(ubyte)[] s) @safe pure nothrow @nogc

immutable(ubyte)[] replacementSequence() @property @safe pure nothrow @nogc

class EncodingSchemeWindows1250 : EncodingScheme

EncodingScheme to handle Windows-1250

This scheme recognises the following names: "windows-1250"

Methods

string[] names() @safe pure nothrow

string toString() @safe pure nothrow @nogc

bool canEncode(dchar c) @safe pure nothrow @nogc

size_t encodedLength(dchar c) @safe pure nothrow @nogc

size_t encode(dchar c, ubyte[] buffer) @safe pure nothrow @nogc

dchar decode(ref const(ubyte)[] s) @safe pure nothrow @nogc

dchar safeDecode(ref const(ubyte)[] s) @safe pure nothrow @nogc

immutable(ubyte)[] replacementSequence() @property @safe pure nothrow @nogc

class EncodingSchemeWindows1251 : EncodingScheme

EncodingScheme to handle Windows-1251

This scheme recognises the following names: "windows-1251"

Methods

string[] names() @safe pure nothrow

string toString() @safe pure nothrow @nogc

bool canEncode(dchar c) @safe pure nothrow @nogc

size_t encodedLength(dchar c) @safe pure nothrow @nogc

size_t encode(dchar c, ubyte[] buffer) @safe pure nothrow @nogc

dchar decode(ref const(ubyte)[] s) @safe pure nothrow @nogc

dchar safeDecode(ref const(ubyte)[] s) @safe pure nothrow @nogc

immutable(ubyte)[] replacementSequence() @property @safe pure nothrow @nogc

class EncodingSchemeWindows1252 : EncodingScheme

EncodingScheme to handle Windows-1252

This scheme recognises the following names: "windows-1252"

Methods

string[] names() @safe pure nothrow

string toString() @safe pure nothrow @nogc

bool canEncode(dchar c) @safe pure nothrow @nogc

size_t encodedLength(dchar c) @safe pure nothrow @nogc

size_t encode(dchar c, ubyte[] buffer) @safe pure nothrow @nogc

dchar decode(ref const(ubyte)[] s) @safe pure nothrow @nogc

dchar safeDecode(ref const(ubyte)[] s) @safe pure nothrow @nogc

immutable(ubyte)[] replacementSequence() @property @safe pure nothrow @nogc

class EncodingSchemeUtf8 : EncodingScheme

EncodingScheme to handle UTF-8

This scheme recognises the following names: "UTF-8"

Methods

string[] names() @safe pure nothrow

string toString() @safe pure nothrow @nogc

bool canEncode(dchar c) @safe pure nothrow @nogc

size_t encodedLength(dchar c) @safe pure nothrow @nogc

size_t encode(dchar c, ubyte[] buffer) @safe pure nothrow @nogc

dchar decode(ref const(ubyte)[] s) @safe pure nothrow @nogc

dchar safeDecode(ref const(ubyte)[] s) @safe pure nothrow @nogc

immutable(ubyte)[] replacementSequence() @property @safe pure nothrow @nogc

class EncodingSchemeUtf16Native : EncodingScheme

EncodingScheme to handle UTF-16 in native byte order

This scheme recognises the following names: "UTF-16LE" (little-endian architecture only), "UTF-16BE" (big-endian architecture only)

Methods

string[] names() @safe pure nothrow

string toString() @safe pure nothrow @nogc

bool canEncode(dchar c) @safe pure nothrow @nogc

size_t encodedLength(dchar c) @safe pure nothrow @nogc

size_t encode(dchar c, ubyte[] buffer) @safe pure nothrow @nogc

dchar decode(ref const(ubyte)[] s) @safe pure nothrow @nogc

dchar safeDecode(ref const(ubyte)[] s) @safe pure nothrow @nogc

immutable(ubyte)[] replacementSequence() @property @safe pure nothrow @nogc

class EncodingSchemeUtf32Native : EncodingScheme

EncodingScheme to handle UTF-32 in native byte order

This scheme recognises the following names: "UTF-32LE" (little-endian architecture only), "UTF-32BE" (big-endian architecture only)

Methods

string[] names() @safe pure nothrow

string toString() @safe pure nothrow @nogc

bool canEncode(dchar c) @safe pure nothrow @nogc

size_t encodedLength(dchar c) @safe pure nothrow @nogc

size_t encode(dchar c, ubyte[] buffer) @safe pure nothrow @nogc

dchar decode(ref const(ubyte)[] s) @safe pure nothrow @nogc

dchar safeDecode(ref const(ubyte)[] s) @safe pure nothrow @nogc

immutable(ubyte)[] replacementSequence() @property @safe pure nothrow @nogc

enum BOM

Definitions of common Byte Order Marks. The elements of the enum can be used as indices into bomTable to get the matching BOMSeq.

none = 0: no BOM was found

utf32be = 1: [0x00, 0x00, 0xFE, 0xFF]

utf32le = 2: [0xFF, 0xFE, 0x00, 0x00]

utf7 = 3: [0x2B, 0x2F, 0x76, 0x38], [0x2B, 0x2F, 0x76, 0x39], [0x2B, 0x2F, 0x76, 0x2B], [0x2B, 0x2F, 0x76, 0x2F], [0x2B, 0x2F, 0x76, 0x38, 0x2D]

utf1 = 8: [0xF7, 0x64, 0x4C]

utfebcdic = 9: [0xDD, 0x73, 0x66, 0x73]

scsu = 10: [0x0E, 0xFE, 0xFF]

bocu1 = 11: [0xFB, 0xEE, 0x28]

gb18030 = 12: [0x84, 0x31, 0x95, 0x33]

utf8 = 13: [0xEF, 0xBB, 0xBF]

utf16be = 14: [0xFE, 0xFF]

utf16le = 15: [0xFF, 0xFE]

alias BOMSeq = Tuple!(BOM, "schema", ubyte[], "sequence")

The type stored inside bomTable.

Functions 23

fn bool isValidCodePoint(dchar c) @safe pure nothrow @nogc

Returns true if c is a valid code point.

fn string encodingName(T)() @property

Returns the name of an encoding.

fn bool canEncode(E)(dchar c)

Returns true if and only if it is possible to represent the specified code point in the encoding.

fn bool isValidCodeUnit(E)(E c)

Returns true if the code unit is legal. For example, the byte 0x80 would not be legal in ASCII, because ASCII code units must always be in the range 0x00 to 0x7F.
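A short sketch of these compile-time queries, where the code-unit type passed as the template argument selects the encoding. The specific characters tested are illustrative choices.

```d
import std.encoding;

void main()
{
    // The code-unit type selects the encoding at compile time.
    assert(encodingName!char == "UTF-8");
    assert(encodingName!Latin1Char == "ISO-8859-1");

    assert(canEncode!Latin1Char('\u00E9'));  // U+00E9 exists in Latin-1
    assert(!canEncode!AsciiChar('\u00E9'));  // but not in ASCII

    assert(isValidCodeUnit(cast(AsciiChar) 0x41));
    assert(!isValidCodeUnit(cast(AsciiChar) 0x80)); // outside 0x00 to 0x7F

    assert(isValidCodePoint('\u20AC'));
    assert(!isValidCodePoint(cast(dchar) 0xD800)); // surrogate values are rejected
}
```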

fn bool isValid(E)(const(E)[] s)

Returns true if the string is encoded correctly.

fn size_t validLength(E)(const(E)[] s)

Returns the length of the longest possible substring, starting from the first code unit, which is validly encoded.

fn immutable(E)[] sanitize(E)(immutable(E)[] s)

Sanitizes a string by replacing malformed code unit sequences with valid code unit sequences. The result is guaranteed to be valid for this encoding.
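The three validation functions fit together as in this sketch. The input strings are illustrative; the malformed one embeds the bytes 0xF0 0x80, and the expected sanitized form assumes U+FFFD is the UTF-8 replacement sequence.

```d
import std.encoding;

void main()
{
    string good = "\u20AC100";
    assert(isValid(good));
    assert(validLength(good) == good.length);

    // 0xF0 0x80 is not a well-formed UTF-8 sequence.
    string bad = "hello \xF0\x80world";
    assert(!isValid(bad));
    assert(validLength(bad) == 6); // only "hello " is valid

    // The malformed bytes are replaced, so the result validates.
    string fixed = sanitize(bad);
    assert(isValid(fixed));
    assert(fixed == "hello \uFFFDworld");
}
```

validLength is useful when only the well-formed prefix is wanted, while sanitize keeps the remainder of the text.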

fn size_t firstSequence(E)(const(E)[] s)

Returns the length of the first encoded sequence.

fn size_t lastSequence(E)(const(E)[] s)

Returns the length of the last encoded sequence.

fn ptrdiff_t index(E)(const(E)[] s, int n)

Returns the array index at which the (n+1)th code point begins.

fn dchar decode(S)(ref S s)

Decodes a single code point.

fn dchar decodeReverse(E)(ref const(E)[] s)

Decodes a single code point from the end of a string.

fn dchar safeDecode(S)(ref S s)

Decodes a single code point. The input does not have to be valid.
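The decoding functions take their argument by reference and shrink it as they consume input, which the following sketch illustrates. The sample strings are assumptions chosen for the example.

```d
import std.encoding;

void main()
{
    // decode() removes the decoded code point from the front of the string.
    string s = "\u20ACx";
    dchar first = decode(s);
    assert(first == '\u20AC');
    assert(s == "x");

    // decodeReverse() removes it from the back instead.
    const(char)[] t = "x\u20AC";
    dchar last = decodeReverse(t);
    assert(last == '\u20AC');
    assert(t == "x");

    // safeDecode() accepts malformed input and reports it with a sentinel.
    string bad = "\xF0\x80world";
    dchar result = safeDecode(bad);
    assert(result == INVALID_SEQUENCE);
}
```

Note that decode and decodeReverse expect valid input, so untrusted data is better routed through safeDecode or validated first.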

fn size_t encodedLength(E)(dchar c)

Returns the number of code units required to encode a single code point.

fn E[] encode(E)(dchar c)

Encodes a single code point.

fn size_t encode(E)(dchar c, E[] array)

Encodes a single code point into an array.

fn size_t encode(E, R)(dchar c, auto ref R range) if (isNativeOutputRange!(R, E))

fn void encode(E)(dchar c, void delegate(E) dg)

Encodes a single code point to a delegate.

fn size_t encode(Tgt, Src, R)(in Src[] s, R range)

Encodes the contents of `s` in units of type `Tgt`, writing the result to an output range.
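As a sketch of the single-code-point overloads, the example below encodes the euro sign both into a freshly allocated array and into a caller-supplied buffer. The 4-element buffer is an assumption that is large enough for any UTF-8 sequence.

```d
import std.encoding;

void main()
{
    // U+20AC needs three UTF-8 code units but only one UTF-16 code unit.
    assert(encodedLength!char('\u20AC') == 3);
    assert(encodedLength!wchar('\u20AC') == 1);

    // Allocating form: returns the code units as a new array.
    char[] units = encode!char('\u20AC');
    assert(units == "\u20AC");

    // Buffer form: writes into a caller-supplied array and returns the count.
    char[4] storage;
    char[] buffer = storage[];
    size_t written = encode('\u20AC', buffer);
    assert(written == 3);
    assert(storage[0 .. written] == "\u20AC");
}
```

The buffer form avoids an allocation per code point, which may matter in tight loops.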

fn CodePoints!(E) codePoints(E)(immutable(E)[] s)

Returns a foreachable struct which can bidirectionally iterate over all code points in a string.

fn CodeUnits!(E) codeUnits(E)(dchar c)

Returns a foreachable struct which can bidirectionally iterate over all code units in a code point.
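Both helpers are meant to be used with foreach, as in this small sketch. The sample text is an illustrative assumption.

```d
import std.encoding;

void main()
{
    // Iterate over code points rather than code units:
    // this string holds 3 code points in 5 UTF-8 code units.
    string s = "\u20AC10";
    dchar[] points;
    foreach (dchar c; codePoints(s))
        points ~= c;
    assert(points == "\u20AC10"d);

    // Iterate over the UTF-8 code units that make up one code point.
    char[] units;
    foreach (char u; codeUnits!char('\u20AC'))
        units ~= u;
    assert(units.length == 3);
    assert(units == "\u20AC");
}
```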

fn void transcode(Src, Dst)(Src[] s, out Dst[] r)

Convert a string from one encoding to another.
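A minimal sketch of transcoding between string types whose encodings are known at compile time. The sample word is an assumption, chosen because every character in it exists in ISO-8859-1.

```d
import std.encoding;

void main()
{
    string utf8 = "caf\u00E9";          // 5 UTF-8 code units

    Latin1String latin1;
    transcode(utf8, latin1);            // UTF-8 to ISO-8859-1
    assert(latin1.length == 4);
    assert(latin1[3] == cast(Latin1Char) 0xE9);

    wstring utf16;
    transcode(latin1, utf16);           // ISO-8859-1 to UTF-16
    assert(utf16 == "caf\u00E9"w);
}
```

The destination encoding is picked purely by the type of the output variable, so no encoding name is passed at run time.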

fn immutable(BOMSeq) getBOM(Range)(Range input) if (isForwardRange!Range && is(immutable ElementType!Range == immutable ubyte))

Returns a `BOMSeq` for a given `input`. If no `BOM` is present the `BOMSeq` for `BOM.none` is returned. The `BOM` sequence at the beginning of the range will not be consumed from the passed range. ...
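Since getBOM does not consume the mark, the caller has to skip it. The sketch below shows one way to do that; the helper `stripBOM` is hypothetical and not part of std.encoding, and the sample bytes are assumptions.

```d
import std.encoding;

// Hypothetical helper: drops a leading BOM, if any, and reports which one it was.
immutable(ubyte)[] stripBOM(immutable(ubyte)[] data, out BOM found)
{
    auto entry = getBOM(data);
    found = entry.schema;
    // For BOM.none the stored sequence is empty, so nothing is skipped.
    return data[entry.sequence.length .. $];
}

void main()
{
    // UTF-8 BOM followed by the ASCII bytes for "hi".
    immutable(ubyte)[] withBom = [0xEF, 0xBB, 0xBF, 0x68, 0x69];
    BOM kind;
    auto payload = stripBOM(withBom, kind);
    assert(kind == BOM.utf8);
    assert(cast(string) payload == "hi");

    // Without a BOM the input comes back unchanged.
    immutable(ubyte)[] plain = [0x68, 0x69];
    assert(stripBOM(plain, kind) == plain);
    assert(kind == BOM.none);
}
```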

Variables 3

enumvar INVALID_SEQUENCE = cast(dchar) 0xFFFFFFFF

Special value returned by safeDecode.

var bomTable = [
     BOMSeq(BOM.none, null),
     BOMSeq(BOM.utf32be, cast(ubyte[])([0x00, 0x00, 0xFE, 0xFF])),
     BOMSeq(BOM.utf32le, cast(ubyte[])([0xFF, 0xFE, 0x00, 0x00])),
     BOMSeq(BOM.utf7, cast(ubyte[])([0x2B, 0x2F, 0x76, 0x39])),
     BOMSeq(BOM.utf7, cast(ubyte[])([0x2B, 0x2F, 0x76, 0x2B])),
     BOMSeq(BOM.utf7, cast(ubyte[])([0x2B, 0x2F, 0x76, 0x2F])),
     BOMSeq(BOM.utf7, cast(ubyte[])([0x2B, 0x2F, 0x76, 0x38, 0x2D])),
     BOMSeq(BOM.utf7, cast(ubyte[])([0x2B, 0x2F, 0x76, 0x38])),
     BOMSeq(BOM.utf1, cast(ubyte[])([0xF7, 0x64, 0x4C])),
     BOMSeq(BOM.utfebcdic, cast(ubyte[])([0xDD, 0x73, 0x66, 0x73])),
     BOMSeq(BOM.scsu, cast(ubyte[])([0x0E, 0xFE, 0xFF])),
     BOMSeq(BOM.bocu1, cast(ubyte[])([0xFB, 0xEE, 0x28])),
     BOMSeq(BOM.gb18030, cast(ubyte[])([0x84, 0x31, 0x95, 0x33])),
     BOMSeq(BOM.utf8, cast(ubyte[])([0xEF, 0xBB, 0xBF])),
     BOMSeq(BOM.utf16be, cast(ubyte[])([0xFE, 0xFF])),
     BOMSeq(BOM.utf16le, cast(ubyte[])([0xFF, 0xFE]))
]

Mapping of a byte sequence to Byte Order Mark (BOM)

enumvar utfBOM = 0xfeff

Constant defining a fully decoded BOM.

Templates 12

tmpl EncoderFunctions()

Functions

E[] encode(dchar c)

void encode(dchar c, ref E[] array)

void encode(dchar c, void delegate(E) dg)

void skip(ref const(E)[] s)

dchar decode(S)(ref S s)

dchar safeDecode(S)(ref S s)

dchar decodeReverse(ref const(E)[] s)

tmpl EncoderInstance(E)

tmpl GenericEncoder()

Functions

bool canEncode(dchar c)

bool isValidCodeUnit(E c)

size_t encodedLength(dchar c)

void encodeViaWrite()(dchar c)

void skipViaRead()()

dchar decodeViaRead()()

dchar safeDecodeViaRead()()

dchar decodeReverseViaRead()()

EString replacementSequence()

tmpl EncoderInstance(CharType : AsciiChar)

Functions

string encodingName()

bool canEncode(dchar c)

bool isValidCodeUnit(AsciiChar c)

size_t encodedLength(dchar c)

void encodeX(Range)(dchar c, Range r)

void encodeViaWrite()(dchar c)

void skipViaRead()()

dchar decodeViaRead()()

dchar safeDecodeViaRead()()

dchar decodeReverseViaRead()()

EString replacementSequence()

tmpl EncoderInstance(CharType : Latin1Char)

Functions

string encodingName()

bool canEncode(dchar c)

bool isValidCodeUnit(Latin1Char c)

size_t encodedLength(dchar c)

void encodeViaWrite()(dchar c)

void skipViaRead()()

dchar decodeViaRead()()

dchar safeDecodeViaRead()()

dchar decodeReverseViaRead()()

EString replacementSequence()

tmpl EncoderInstance(CharType : Latin2Char)

Functions

string encodingName()

tmpl EncoderInstance(CharType : Windows1250Char)

Functions

string encodingName()

tmpl EncoderInstance(CharType : Windows1251Char)

Functions

string encodingName()

tmpl EncoderInstance(CharType : Windows1252Char)

Functions

string encodingName()

tmpl EncoderInstance(CharType : char)

Functions

string encodingName()

bool canEncode(dchar c)

bool isValidCodeUnit(char c)

int tails(char c)

size_t encodedLength(dchar c)

void encodeViaWrite()(dchar c)

void skipViaRead()()

dchar decodeViaRead()()

dchar safeDecodeViaRead()()

dchar decodeReverseViaRead()()

EString replacementSequence()

tmpl EncoderInstance(CharType : wchar)

Functions

string encodingName()

bool canEncode(dchar c)

bool isValidCodeUnit(wchar c)

size_t encodedLength(dchar c)

void encodeViaWrite()(dchar c)

void skipViaRead()()

dchar decodeViaRead()()

dchar safeDecodeViaRead()()

dchar decodeReverseViaRead()()

EString replacementSequence()

tmpl EncoderInstance(CharType : dchar)

Functions

string encodingName()

bool canEncode(dchar c)

bool isValidCodeUnit(dchar c)

size_t encodedLength(dchar c)

void encodeViaWrite()(dchar c)

void skipViaRead()()

dchar decodeViaRead()()

dchar safeDecodeViaRead()()

dchar decodeReverseViaRead()()

EString replacementSequence()