std.encoding
Classes and functions for handling and transcoding between various encodings.
For cases where the encoding is known at compile-time, functions are provided for arbitrary encoding and decoding of characters, arbitrary transcoding between strings of different type, as well as validation and sanitization.
Encodings currently supported are UTF-8, UTF-16, UTF-32, ASCII, ISO-8859-1 (also known as LATIN-1), ISO-8859-2 (LATIN-2), WINDOWS-1250, WINDOWS-1251 and WINDOWS-1252.
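As a rough illustration of the compile-time functions described above, the following sketch transcodes between string types and then validates and sanitizes a UTF-8 string. The helper `toLatin1` is a hypothetical name introduced for this example only; the other calls are functions of this module.

```d
import std.encoding;

// Hypothetical helper (not part of std.encoding): transcodes UTF-8 text
// to Latin-1 using the module's transcode function.
Latin1String toLatin1(string s)
{
    Latin1String result;
    transcode(s, result);
    return result;
}

void main()
{
    // Transcoding between strings of different type.
    wstring ws;
    transcode("hello world", ws);
    assert(ws == "hello world"w);
    assert(toLatin1("caf\u00E9").length == 4);

    // Querying what an encoding can represent, and how many code units it needs.
    assert(canEncode!(Latin1Char)('\u00A0'));
    assert(!canEncode!(AsciiChar)('\u00A0'));
    assert(encodedLength!(char)('\u20AC') == 3);
    assert(encodedLength!(wchar)('\u20AC') == 1);

    // Validation and sanitization of UTF-8 input.
    assert(isValid("\u20AC100"));
    string broken = "hello \xF0\x80world";
    assert(!isValid(broken));
    assert(validLength(broken) == 6);
    assert(sanitize(broken) == "hello \xEF\xBF\xBDworld");
}
```

The element type of the string (char, wchar, Latin1Char, and so on) selects the encoding, which is why no encoding name appears in these calls.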
For cases where the encoding is not known at compile-time, but is known at run-time, the abstract class EncodingScheme and its subclasses are provided. To construct a run-time encoder/decoder, write, for example:
----------------------------------------------------
auto e = EncodingScheme.create("utf-8");
----------------------------------------------------
This library supplies EncodingScheme subclasses for ASCII, ISO-8859-1 (also known as LATIN-1), ISO-8859-2 (LATIN-2), WINDOWS-1250, WINDOWS-1251, WINDOWS-1252, UTF-8, and (on little-endian architectures) UTF-16LE and UTF-32LE; or (on big-endian architectures) UTF-16BE and UTF-32BE.
This library provides a mechanism whereby other modules may add EncodingScheme subclasses for any other encoding.
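A minimal sketch of run-time use, assuming the encoding name arrives as a string (for example from a file header). The helpers `encodeCodePoint` and `decodeAll` are hypothetical names introduced here; they only wrap the EncodingScheme methods create, encodedLength, encode and decode.

```d
import std.encoding;

// Hypothetical helper: encodes one code point with the named scheme.
// Assumes the scheme can represent c.
ubyte[] encodeCodePoint(string encodingName, dchar c)
{
    EncodingScheme scheme = EncodingScheme.create(encodingName);
    auto buffer = new ubyte[scheme.encodedLength(c)];
    immutable written = scheme.encode(c, buffer);
    return buffer[0 .. written];
}

// Hypothetical helper: decodes every code point in a byte array.
// Assumes the input is valid for the named scheme.
dchar[] decodeAll(string encodingName, const(ubyte)[] bytes)
{
    EncodingScheme scheme = EncodingScheme.create(encodingName);
    dchar[] result;
    while (bytes.length > 0)
        result ~= scheme.decode(bytes); // decode advances the slice
    return result;
}

void main()
{
    immutable ubyte[] asUtf8 = [0xC3, 0xA9];
    immutable ubyte[] asLatin1 = [0xE9];

    // The same code point (U+00E9) under two schemes chosen by name.
    assert(encodeCodePoint("utf-8", '\u00E9') == asUtf8);
    assert(encodeCodePoint("iso-8859-1", '\u00E9') == asLatin1);

    // Round trip back to code points.
    assert(decodeAll("utf-8", asUtf8) == "\u00E9"d);
    assert(decodeAll("iso-8859-1", asLatin1) == "\u00E9"d);
}
```

Because decode takes its argument by reference and consumes the bytes it reads, the loop in `decodeAll` terminates once the slice is empty.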
Types 28
const(E)[] s
int opApply(scope int delegate(ref dchar) dg)
int opApply(scope int delegate(ref size_t, ref dchar) dg)
int opApplyReverse(scope int delegate(ref dchar) dg)
int opApplyReverse(scope int delegate(ref size_t, ref dchar) dg)
this(const(E)[] s)
Defines various character sets.
Ditto
Defines a Latin1-encoded character.
Defines a Latin1-encoded string (as an array of immutable(Latin1Char)).
Defines a Latin2-encoded character.
Defines a Latin2-encoded string (as an array of immutable(Latin2Char)).
Defines a Windows1250-encoded character.
Defines a Windows1250-encoded string (as an array of immutable(Windows1250Char)).
Defines a Windows1251-encoded character.
Defines a Windows1251-encoded string (as an array of immutable(Windows1251Char)).
Defines a Windows1252-encoded character.
Defines a Windows1252-encoded string (as an array of immutable(Windows1252Char)).
The base class for exceptions thrown by this module
this(string msg)
this(string msg)
Abstract base class of all encoding schemes
void register(Klass: EncodingScheme)()
Registers a subclass of EncodingScheme.
void register(string className)
EncodingScheme create(string encodingName)
Obtains a subclass of EncodingScheme which is capable of encoding and decoding the named encoding scheme.
string toString()
Returns the standard name of the encoding scheme
string[] names()
Returns an array of all known names for this encoding scheme
bool canEncode(dchar c)
Returns true if the character c can be represented in this encoding scheme.
size_t encodedLength(dchar c)
Returns the number of ubytes required to encode this code point.
size_t encode(dchar c, ubyte[] buffer)
Encodes a single code point into a user-supplied, fixed-size buffer.
dchar decode(ref const(ubyte)[] s)
Decodes a single code point.
dchar safeDecode(ref const(ubyte)[] s)
Decodes a single code point. The input does not have to be valid.
immutable(ubyte)[] replacementSequence() @property
Returns the sequence of ubytes to be used to represent any character which cannot be represented in the encoding scheme.
bool isValid(const(ubyte)[] s)
Returns true if the array is encoded correctly
size_t validLength()(const(ubyte)[] s)
Returns the length of the longest possible substring, starting from the first element, which is validly encoded.
immutable(ubyte)[] sanitize()(immutable(ubyte)[] s)
Sanitizes an array by replacing malformed ubyte sequences with valid ubyte sequences. The result is guaranteed to be valid for this encoding scheme.
size_t firstSequence()(const(ubyte)[] s)
Returns the length of the first encoded sequence.
size_t count()(const(ubyte)[] s)
Returns the total number of code points encoded in a ubyte array.
ptrdiff_t index()(const(ubyte)[] s, size_t n)
Returns the array index at which the (n+1)th code point begins.
EncodingScheme to handle ASCII
This scheme recognises the following names: "ANSI_X3.4-1968", "ANSI_X3.4-1986", "ASCII", "IBM367", "ISO646-US", "ISO_646.irv:1991", "US-ASCII", "cp367", "csASCII", "iso-ir-6", "us"
EncodingScheme to handle Latin-1
This scheme recognises the following names: "CP819", "IBM819", "ISO-8859-1", "ISO_8859-1", "ISO_8859-1:1987", "csISOLatin1", "iso-ir-100", "l1", "latin1"
EncodingScheme to handle Latin-2
This scheme recognises the following names: "Latin 2", "ISO-8859-2", "ISO_8859-2", "ISO_8859-2:1999", "Windows-28592"
EncodingScheme to handle Windows-1250
This scheme recognises the following names: "windows-1250"
EncodingScheme to handle Windows-1251
This scheme recognises the following names: "windows-1251"
EncodingScheme to handle Windows-1252
This scheme recognises the following names: "windows-1252"
EncodingScheme to handle UTF-8
This scheme recognises the following names: "UTF-8"
EncodingScheme to handle UTF-16 in native byte order
This scheme recognises the following names: "UTF-16LE" (little-endian architecture only), "UTF-16BE" (big-endian architecture only)
EncodingScheme to handle UTF-32 in native byte order
This scheme recognises the following names: "UTF-32LE" (little-endian architecture only), "UTF-32BE" (big-endian architecture only)
Definitions of common Byte Order Marks. The elements of the enum can be used as indices into bomTable to get the matching BOMSeq.
Functions 23
bool canEncode(E)(dchar c)
Returns true if and only if it is possible to represent the specified code point in the encoding.
bool isValidCodeUnit(E)(E c)
Returns true if the code unit is legal. For example, the byte 0x80 would not be legal in ASCII, because ASCII code units must always be in the range 0x00 to 0x7F.
size_t validLength(E)(const(E)[] s)
Returns the length of the longest possible substring, starting from the first code unit, which is validly encoded.
immutable(E)[] sanitize(E)(immutable(E)[] s)
Sanitizes a string by replacing malformed code unit sequences with valid code unit sequences. The result is guaranteed to be valid for this encoding.
ptrdiff_t index(E)(const(E)[] s, int n)
Returns the array index at which the (n+1)th code point begins.
size_t encodedLength(E)(dchar c)
Returns the number of code units required to encode a single code point.
size_t encode(Tgt, Src, R)(in Src[] s, R range)
Encodes the contents of `s` in units of type `Tgt`, writing the result to an output range.
CodePoints!(E) codePoints(E)(immutable(E)[] s)
Returns a foreachable struct which can bidirectionally iterate over all code points in a string.
CodeUnits!(E) codeUnits(E)(dchar c)
Returns a foreachable struct which can bidirectionally iterate over all code units in a code point.
immutable(BOMSeq) getBOM(Range)(Range input) if (isForwardRange!Range && is(immutable ElementType!Range == immutable ubyte))
Returns a `BOMSeq` for a given `input`. If no `BOM` is present the `BOMSeq` for `BOM.none` is returned. The `BOM` sequence at the beginning of the range will not be consumed from the passed range. ...
Variables 3
INVALID_SEQUENCE = cast(dchar) 0xFFFFFFFF
Special value returned by safeDecode
[
BOMSeq(BOM.none, null),
BOMSeq(BOM.utf32be, cast(ubyte[])([0x00, 0x00, 0xFE, 0xFF])),
BOMSeq(BOM.utf32le, cast(ubyte[])([0xFF, 0xFE, 0x00, 0x00])),
BOMSeq(BOM.utf7, cast(ubyte[])([0x2B, 0x2F, 0x76, 0x39])),
BOMSeq(BOM.utf7, cast(ubyte[])([0x2B, 0x2F, 0x76, 0x2B])),
BOMSeq(BOM.utf7, cast(ubyte[])([0x2B, 0x2F, 0x76, 0x2F])),
BOMSeq(BOM.utf7, cast(ubyte[])([0x2B, 0x2F, 0x76, 0x38, 0x2D])),
BOMSeq(BOM.utf7, cast(ubyte[])([0x2B, 0x2F, 0x76, 0x38])),
BOMSeq(BOM.utf1, cast(ubyte[])([0xF7, 0x64, 0x4C])),
BOMSeq(BOM.utfebcdic, cast(ubyte[])([0xDD, 0x73, 0x66, 0x73])),
BOMSeq(BOM.scsu, cast(ubyte[])([0x0E, 0xFE, 0xFF])),
BOMSeq(BOM.bocu1, cast(ubyte[])([0xFB, 0xEE, 0x28])),
BOMSeq(BOM.gb18030, cast(ubyte[])([0x84, 0x31, 0x95, 0x33])),
BOMSeq(BOM.utf8, cast(ubyte[])([0xEF, 0xBB, 0xBF])),
BOMSeq(BOM.utf16be, cast(ubyte[])([0xFE, 0xFF])),
BOMSeq(BOM.utf16le, cast(ubyte[])([0xFF, 0xFE]))
] bomTable
Mapping of a byte sequence to Byte Order Mark (BOM)
utfBOM = 0xfeff
Constant defining a fully decoded BOM
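The sketch below shows one way the BOM definitions might be used together: indexing bomTable with a BOM member, and detecting a mark with getBOM before skipping it. The helper `stripBOM` is a hypothetical name introduced for this example; it relies on the `schema` and `sequence` fields of BOMSeq.

```d
import std.encoding;

// Hypothetical helper (not part of std.encoding): returns the input with
// any leading byte order mark removed.
const(ubyte)[] stripBOM(const(ubyte)[] data)
{
    auto entry = getBOM(data);
    // For BOM.none the sequence is empty, so nothing is skipped.
    return data[entry.sequence.length .. $];
}

void main()
{
    immutable ubyte[] utf8Mark = [0xEF, 0xBB, 0xBF];
    immutable ubyte[] payload = [0x68, 0x69]; // "hi"

    // BOM members index directly into bomTable.
    assert(bomTable[BOM.utf8].schema == BOM.utf8);
    assert(bomTable[BOM.utf8].sequence == utf8Mark);

    // getBOM inspects the start of the input without consuming it.
    ubyte[] data = [0xEF, 0xBB, 0xBF, 0x68, 0x69];
    auto entry = getBOM(data);
    assert(entry.schema == BOM.utf8);
    assert(data.length == 5);

    // Skipping the mark is left to the caller.
    assert(stripBOM(data) == payload);
    assert(stripBOM(payload) == payload);
    assert(getBOM(payload).schema == BOM.none);
}
```

Since getBOM leaves the range untouched, slicing past `entry.sequence.length` is a simple way to reach the payload when the input is an array.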