std.uni

The std.uni module provides an implementation of fundamental Unicode algorithms and data structures. It does not include UTF encoding and decoding primitives; see decode and encode in std.utf for that functionality.

All primitives listed operate on Unicode characters and sets of characters. For functions which operate on ASCII characters and ignore Unicode characters, see std.ascii. For definitions of Unicode character, code point and other terms used throughout this module, see the terminology section below.

The focus of this module is the core needs of developing Unicode-aware applications. To that effect it provides the following optimized primitives:

It's recognized that an application may need further enhancements and extensions, such as less commonly known algorithms, or tailoring existing ones for region-specific needs. To help users build any extra functionality beyond the core primitives, the module provides:

  • CodepointSet, a type for easy manipulation of sets of characters. Besides the typical set algebra it provides an unusual feature: a D source code generator for detection of code points in this set. This is a boon for meta-programming parser frameworks, and is used internally to power classification in small sets like isWhite.

  • A way to construct optimal packed multi-stage tables, also known as a special case of Trie. The functions codepointTrie and codepointSetTrie construct custom tries that map dchar to value. The end result is a fast and predictable O(1) lookup that powers functions like isAlpha and combiningClass, but for user-defined data sets.

  • A useful technique for Unicode-aware parsers that perform character classification of encoded code points is to avoid unnecessary decoding at all costs. utfMatcher provides an improvement over the usual decode-classify-process workflow by combining the decoding and classification steps. By extracting the necessary bits directly from encoded code units, matchers achieve significant performance improvements. See MatcherConcept for the common interface of UTF matchers.

  • Generally useful building blocks for customized normalization: combiningClass for querying the combining class and allowedIn for testing the Quick_Check property of a given normalization form.

  • Access to a large selection of commonly used sets of code points. Supported sets include Script, Block and General Category. The exact contents of a set can be observed in the CLDR utility, on the property index page of the Unicode website. See unicode for easy and (optionally) compile-time checked set queries.

Synopsis

import std.uni;
void main()
{
    // initialize code point sets using script/block or property name
    // now 'set' contains code points from both scripts.
    auto set = unicode("Cyrillic") | unicode("Armenian");
    // same thing but simpler and checked at compile-time
    auto ascii = unicode.ASCII;
    auto currency = unicode.Currency_Symbol;

    // easy set ops
    auto a = set & ascii;
    assert(a.empty); // as it has no intersection with ascii
    a = set | ascii;
    auto b = currency - a; // subtract all ASCII, Cyrillic and Armenian

    // some properties of code point sets
    assert(b.length > 45); // 46 items in Unicode 6.1, even more in 6.2
    // testing presence of a code point in a set
    // is just fine, it is O(logN)
    assert(!b['$']);
    assert(!b['\u058F']); // Armenian dram sign
    assert(b['¥']);

    // building fast lookup tables, these guarantee O(1) complexity
    // 1-level Trie lookup table essentially a huge bit-set ~262Kb
    auto oneTrie = toTrie!1(b);
    // 2-level far more compact but typically slightly slower
    auto twoTrie = toTrie!2(b);
    // 3-level even smaller, and a bit slower yet
    auto threeTrie = toTrie!3(b);
    assert(oneTrie['£']);
    assert(twoTrie['£']);
    assert(threeTrie['£']);

    // build the trie with the most sensible trie level
    // and bind it as a functor
    auto cyrillicOrArmenian = toDelegate(set);
    auto balance = find!(cyrillicOrArmenian)("Hello ընկեր!");
    assert(balance == "ընկեր!");
    // compatible with bool delegate(dchar)
    bool delegate(dchar) bindIt = cyrillicOrArmenian;

    // Normalization
    string s = "Plain ascii (and not only), is always normalized!";
    assert(s is normalize(s));// is the same string

    string nonS = "A\u0308ffin"; // A ligature
    auto nS = normalize(nonS); // to NFC, the W3C endorsed standard
    assert(nS == "Äffin");
    assert(nS != nonS);
    string composed = "Äffin";

    assert(normalize!NFD(composed) == "A\u0308ffin");
    // to NFKD, compatibility decomposition useful for fuzzy matching/searching
    assert(normalize!NFKD("2¹⁰") == "210");
}

Terminology

The following is a list of important Unicode notions and definitions. Any conventions used specifically in this module alone are marked as such. The descriptions are based on the formal definition as found in chapter three of The Unicode Standard Core Specification.

Abstract character
A unit of information used for the organization, control, or representation of textual data. Note that:

  • When representing data, the nature of that data is generally symbolic as opposed to some other kind of data (for example, visual).

  • An abstract character has no concrete form and should not be confused with a glyph.

  • An abstract character does not necessarily correspond to what a user thinks of as a “character” and should not be confused with a Grapheme.

  • The abstract characters encoded (see Encoded character) are known as Unicode abstract characters.

  • Abstract characters not directly encoded by the Unicode Standard can often be represented by the use of combining character sequences.

Canonical decomposition

The decomposition of a character or character sequence that results from recursively applying the canonical mappings found in the Unicode Character Database and those described in Conjoining Jamo Behavior (section 12 of Unicode Conformance).

Canonical composition

The precise definition of canonical composition is the algorithm as specified in Unicode Conformance section 11. Informally, it's the process that does the reverse of the canonical decomposition with the addition of certain rules that e.g. prevent legacy characters from appearing in the composed result.

Canonical equivalent

Two character sequences are said to be canonical equivalents if their full canonical decompositions are identical.

Character
Typically differs by context. For the purpose of this documentation the term character implies encoded character, that is, a code point having an assigned abstract character (a symbolic meaning).

Code point
Any value in the Unicode codespace; that is, the range of integers from 0 to 10FFFF (hex). Not all code points are assigned to encoded characters.

Code unit
The minimal bit combination that can represent a unit of encoded text for processing or interchange. Depending on the encoding this could be: 8-bit code units in UTF-8 (char), 16-bit code units in UTF-16 (wchar), and 32-bit code units in UTF-32 (dchar). Note that in UTF-32, a code unit is a code point and is represented by the D dchar type.

Combining character
A character with the General Category of Combining Mark (M).

  • All characters with a non-zero canonical combining class are combining characters, but the reverse is not the case: there are combining characters with a zero combining class.

  • These characters are not normally used in isolation unless they are being described. They include such characters as accents, diacritics, Hebrew points, Arabic vowel signs, and Indic matras.

Combining class

A numerical value used by the Unicode Canonical Ordering Algorithm to determine which sequences of combining marks are to be considered canonically equivalent and which are not.
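As a small illustration of the building block mentioned earlier, combiningClass returns this numeric value directly; a minimal sketch (the specific class values below come from the Unicode Character Database):

```d
import std.uni : combiningClass;

void main()
{
    // U+0301 COMBINING ACUTE ACCENT has canonical combining class 230
    assert(combiningClass('\u0301') == 230);
    // starters, e.g. plain letters, have combining class 0
    assert(combiningClass('A') == 0);
}
```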

Compatibility decomposition

The decomposition of a character or character sequence that results from recursively applying both the compatibility mappings and the canonical mappings found in the Unicode Character Database, and those described in Conjoining Jamo Behavior, until no characters can be further decomposed.

Compatibility equivalent

Two character sequences are said to be compatibility equivalents if their full compatibility decompositions are identical.

Encoded character
An association (or mapping) between an abstract character and a code point.

Glyph
The actual, concrete image of a glyph representation having been rasterized or otherwise imaged onto some display surface.

Grapheme base
A character with the property Grapheme_Base, or any standard Korean syllable block.

Grapheme cluster
Defined as the text between grapheme boundaries as specified by Unicode Standard Annex #29, Unicode text segmentation.

Important general properties of a grapheme:

  • The grapheme cluster represents a horizontally segmentable unit of text, consisting of some grapheme base (which may consist of a Korean syllable) together with any number of nonspacing marks applied to it.

  • A grapheme cluster typically starts with a grapheme base and then extends across any subsequent sequence of nonspacing marks. A grapheme cluster is most directly relevant to text rendering and processes such as cursor placement and text selection in editing, but may also be relevant to comparison and searching.

  • For many processes, a grapheme cluster behaves as if it were a single character with the same properties as its grapheme base. Effectively, nonspacing marks apply graphically to the base, but do not change its properties.

This module defines a number of primitives that work with graphemes: Grapheme, decodeGrapheme and graphemeStride. All of them use extended grapheme boundaries as defined in the aforementioned standard annex.
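A brief sketch of these grapheme primitives; the string below pairs 'e' with U+0301 COMBINING ACUTE ACCENT, which together form a single grapheme cluster:

```d
import std.uni : decodeGrapheme, graphemeStride;

void main()
{
    string s = "noe\u0301l"; // 'e' + U+0301 form one grapheme cluster
    // the grapheme starting at code unit index 2 spans 3 UTF-8 code units:
    // 1 for 'e' plus 2 for U+0301
    assert(graphemeStride(s, 2) == 3);

    // decodeGrapheme takes its argument by ref and consumes one grapheme
    auto g = decodeGrapheme(s);
    assert(g.length == 1 && g[0] == 'n');
    assert(s == "oe\u0301l"); // s now starts at the next grapheme
}
```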

Nonspacing mark
A combining character with the General Category of Nonspacing Mark (Mn) or Enclosing Mark (Me).

Spacing mark
A combining character that is not a nonspacing mark.

Normalization

The concepts of canonical equivalent or compatibility equivalent characters in the Unicode Standard make it necessary to have a full, formal definition of equivalence for Unicode strings. String equivalence is determined by a process called normalization, whereby strings are converted into forms which are compared directly for identity. This is the primary goal of the normalization process; see the function normalize to convert into any of the four defined forms.

A very important attribute of the Unicode Normalization Forms is that they must remain stable between versions of the Unicode Standard. A Unicode string normalized to a particular Unicode Normalization Form in one version of the standard is guaranteed to remain in that Normalization Form for implementations of future versions of the standard.

The Unicode Standard specifies four normalization forms. Informally, two of these forms are defined by maximal decomposition of equivalent sequences, and two of these forms are defined by maximal composition of equivalent sequences.

The choice of the normalization form depends on the particular use case. NFC is the best form for general text, since it's more compatible with strings converted from legacy encodings. NFKC is the preferred form for identifiers, especially where there are security concerns. NFD and NFKD are the most useful for internal processing.

Construction of lookup tables

The Unicode standard describes a set of algorithms that depend on having the ability to quickly look up various properties of a code point. Given the codespace of about 1 million code points, it is not a trivial task to provide a space-efficient solution for the multitude of properties.

Common approaches such as hash tables or binary search over sorted code point intervals (as in InversionList) are insufficient. Hash tables have an enormous memory footprint and binary search over intervals is not fast enough for some heavy-duty algorithms.

The recommended solution (see Unicode Implementation Guidelines) is using multi-stage tables that are an implementation of the Trie data structure with integer keys and a fixed number of stages. For the remainder of the section this will be called a fixed trie. The following describes a particular implementation that is aimed at speed of access at the expense of ideal size savings.

Taking a 2-level Trie as an example, the principle of operation is as follows. Split the number of bits in a key (code point, 21 bits) into 2 components (e.g. 15 and 8). The first is the number of bits in the index of the trie and the other is the number of bits in each page of the trie. The layout of the trie is then an array of size 2^^bits-of-index followed by an array of memory chunks of size 2^^bits-of-page/bits-per-element.

The number of pages is variable (but not less than 1) unlike the number of entries in the index. The slots of the index all have to contain the number of a page that is present. The lookup is then just a couple of operations: slice the upper bits, look up an index for these, take a page at this index and use the lower bits as an offset within this page.

Assuming that pages are laid out consecutively in one array pages, the pseudo-code is:

auto elemsPerPage = (2 ^^ bits_per_page) / Value.sizeOfInBits;
pages[index[n >> bits_per_page]][n & (elemsPerPage - 1)];

Where, if elemsPerPage is a power of 2, the whole process is a handful of simple instructions and 2 array reads. Subsequent levels of the trie are introduced by recursing on this notion - the index array is treated as values. The number of bits in the index is then again split into 2 parts, with pages over the 'current-index' and the new 'upper-index'.
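The lookup above can be sketched with a toy 2-level table over 8-bit keys (a hypothetical split of 4 index bits and 4 page bits, one bool per element; real tries use 21-bit code point keys and bit-packed pages):

```d
// Toy fixed 2-level table over 8-bit keys: the upper 4 bits select a page
// via `index`, the lower 4 bits offset into that page. Identical pages are
// shared, which is where the size savings of a real trie come from.
enum pageBits = 4;
enum pageSize = 1 << pageBits; // 16 entries per page

bool lookup(const ubyte[] index, const bool[][] pages, ubyte key)
{
    auto page = pages[index[key >> pageBits]];
    return page[key & (pageSize - 1)];
}

void main()
{
    // mark keys 0x10 .. 0x2F as present; all other upper nibbles share page 0
    auto empty = new bool[pageSize];
    auto full = new bool[pageSize];
    full[] = true;
    auto index = new ubyte[16]; // all slots point at the empty page...
    index[1] = 1;               // ...except upper nibbles 1 and 2
    index[2] = 1;
    const bool[][] pages = [empty, full];

    assert(!lookup(index, pages, 0x0F));
    assert(lookup(index, pages, 0x10));
    assert(lookup(index, pages, 0x2F));
    assert(!lookup(index, pages, 0x30));
}
```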

For completeness, a level 1 trie is simply an array. The current implementation takes advantage of bit-packing values when the range is known to be limited in advance (such as bool). See also BitPacked for enforcing it manually. The major size advantage however comes from the fact that multiple identical pages on every level are merged by construction.

The process of constructing a trie is more involved and is hidden from the user in the form of the convenience functions codepointTrie, codepointSetTrie and the even more convenient toTrie. In general, a set or a built-in AA with dchar keys can be turned into a trie. The trie object in this module is read-only (immutable); it's effectively frozen after construction.

Unicode properties

This is a full list of Unicode properties accessible through unicode with specific helpers per category nested within. Consult the CLDR utility when in doubt about the contents of a particular set.

General category sets listed below are only accessible with the unicode shorthand accessor.

Sets for other commonly useful properties that are accessible with unicode:

Below is the table with block names accepted by unicode.block. Note that the shorthand version unicode requires "In" to be prepended to the names of blocks so as to disambiguate scripts and blocks.
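For example, a minimal sketch; the Cyrillic script spans several blocks (such as Cyrillic Supplement), so the script set and the same-named block set differ:

```d
import std.uni : unicode;

void main()
{
    auto cyrillicScript = unicode.Cyrillic;   // the Cyrillic script
    auto cyrillicBlock  = unicode.InCyrillic; // the Cyrillic block only
    // the script also covers e.g. the Cyrillic Supplement block,
    // so the two sets are not equal
    assert(cyrillicScript != cyrillicBlock);
    // both contain the basic Cyrillic letters
    assert(cyrillicScript['Б'] && cyrillicBlock['Б']);
}
```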

Below is the table with script names accepted by unicode.script and by the shorthand version unicode:

Below is the table of names accepted by unicode.hangulSyllableType.

References:

ASCII Table, Wikipedia, The Unicode Consortium, Unicode normalization forms, Unicode text segmentation, Unicode Implementation Guidelines, Unicode Conformance

Trademarks: Unicode(tm) is a trademark of Unicode, Inc.

Types 25

struct MultiArray(Types...)
Fields
Types.length dim
size_t[dim] offsets
size_t[dim] sz
size_t[] storage
Methods
@property auto slice(size_t n)() inout pure nothrow @nogc
@property auto ptr(size_t n)() inout pure nothrow @nogc
size_t bytes(size_t n = size_t.max)() @property const @safe
void store(OutRange)(scope OutRange sink) if (isOutputRange!(OutRange, char)) const
@property auto raw_ptr(size_t n)() inout pure nothrow @nogc
Constructors
this(size_t[] sizes...)
this(const(size_t)[] raw_offsets, const(size_t)[] raw_sizes, return scope const(size_t)[] data)
Nested Templates
length(size_t n)
struct PackedPtrImpl(T, size_t bits)
Fields
size_t.sizeof * 8 / bits factor
size_t.sizeof bytesPerWord
size_t * origin
Methods
private T simpleIndex(size_t n) inout
private void simpleWrite(TypeOfBitPacked!T val, size_t n)
Constructors
this(inout(size_t)* ptr) inout
struct PackedArrayViewImpl(T, size_t bits)
Fields
size_t.sizeof * 8 / bits factor
PackedPtr!(T) ptr
size_t ofs
Methods
bool zeros(size_t s, size_t e)
T opIndex(size_t idx) inout
void opIndexAssign(TypeOfBitPacked!T val, size_t idx)
void opSliceAssign(TypeOfBitPacked!T val, size_t start, size_t end)
auto opSlice(size_t from, size_t to) inout
auto opSlice()
bool opEquals(T)(auto ref T arr) const
size_t length() const @property
auto roundUp()(size_t val)
auto roundDown()(size_t val)
Constructors
this(inout(size_t)* origin, size_t offset, size_t items)
private struct SliceOverIndexed(T)
Fields
is(typeof(() { T.init[0] = Item.init; })) assignableIndex
is(typeof(() { T.init[0 .. 0] = Item.init; })) assignableSlice
size_t from
T * arr
Methods
auto opIndex(size_t idx) const
auto opSlice(size_t a, size_t b)
void opSliceAssign(T)(T val, size_t start, size_t end)
auto opSlice()
size_t length() const @property
bool empty() const @property
@property auto front() const
@property auto back() const
@property auto save() inout
void popFront()
void popBack()
bool opEquals(T)(auto ref T arr) const
alias sharLowerBound = sharMethod!uniformLowerBound
alias sharSwitchLowerBound = sharMethod!switchUniformLowerBound
private struct GcPolicy
Methods
T[] dup(T)(const T[] arr)
T[] alloc(T)(size_t size)
T[] realloc(T)(T[] arr, size_t sz)
void replaceImpl(T, Range)(ref T[] dest, size_t from, size_t to, Range stuff)
void append(T, V)(ref T[] arr, V value) if (!isInputRange!V)
void append(T, V)(ref T[] arr, V value) if (isInputRange!V)
void destroy(T)(ref T arr) if (isDynamicArray!T && is(Unqual!T == T)) pure
void destroy(T)(ref T arr) if (isDynamicArray!T && !is(Unqual!T == T)) pure
private bool pureInGCFinalizer() @safe pure nothrow;
bool accessIsSafe() @property @safe nothrow pure
Fields
true accessIsSafe
Methods
T[] dup(T)(const T[] arr)
T[] alloc(T)(size_t size) @trusted
T[] realloc(T)(return scope T[] arr, size_t size) @trusted
void replaceImpl(T, Range)(ref T[] dest, size_t from, size_t to, Range stuff)
void append(T, V)(ref T[] arr, V value) if (!isInputRange!V)
void append(T, V)(ref T[] arr, V value) if (isInputRange!V && hasLength!V)
void destroy(T)(scope ref T[] arr) @trusted
alias _RealArray = CowArray!ReallocPolicy
alias CodepointSet = InversionList!GcPolicy

The recommended default type for a set of code points. For details, see the current implementation: InversionList.

The recommended type of Tuple to represent [a, b) intervals of code points, as used in InversionList. Any interval type should pass the isIntegralPair trait.

Fields
uint[2] _tuple
Methods
bool opEquals(T)(T val) const
inout(uint) a() @property ref return inout
inout(uint) b() @property ref return inout
Constructors
this(uint low, uint high)

InversionList is a set of code points represented as an array of open-right [a, b) intervals (see CodepointInterval above). The name comes from the way the representation reads left to right. For instance a set of all values [10, 50), [80, 90), plus a singular value 60, looks like this:

10, 50, 60, 61, 80, 90

The way to read this is: start with negative, meaning that all numbers smaller than the next one are not present in this set (and positive - the contrary). Then switch positive/negative after each number passed from left to right.

This way negative spans until 10, then positive until 50, then negative until 60, then positive until 61, and so on. As seen, this provides space-efficient storage of highly redundant data that comes in long runs - a description which Unicode properties fit nicely. The technique itself could be seen as a variation on RLE encoding.
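Membership in this representation reduces to a binary search over the boundary array plus a parity check on the resulting index; a minimal sketch over the example above (illustrative only, not the actual InversionList implementation):

```d
import std.range : assumeSorted;

// Inversion-list membership: count boundary values <= val; an odd count
// means val falls inside one of the "positive" spans.
bool contains(const uint[] inv, uint val)
{
    auto countLessEq = inv.assumeSorted.lowerBound(val + 1).length;
    return (countLessEq & 1) == 1;
}

void main()
{
    // encodes [10, 50), [80, 90) plus the single value 60, as above
    const uint[] inv = [10, 50, 60, 61, 80, 90];
    assert(!contains(inv, 9));
    assert(contains(inv, 10) && contains(inv, 49));
    assert(!contains(inv, 50));
    assert(contains(inv, 60));
    assert(!contains(inv, 61));
    assert(contains(inv, 85));
    assert(!contains(inv, 90));
}
```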

Sets are value types (just like int is), thus they are never aliased.

Example:

auto a = CodepointSet('a', 'z'+1);
auto b = CodepointSet('A', 'Z'+1);
auto c = a;
a = a | b;
assert(a == CodepointSet('A', 'Z'+1, 'a', 'z'+1));
assert(a != c);

See also unicode for simpler construction of sets from predefined ones.

Memory usage is 8 bytes per each contiguous interval in a set.

The value semantics are achieved by using the COW technique, and thus it's not safe to cast this type to shared.

Note

It's not recommended to rely on the template parameters or the exact type of the current code point set in std.uni. The type and parameters may change when the standard allocators design is finalized. Use isCodepointSet with templates or just stick with the default alias CodepointSet throughout the whole code base.

Fields
CowArray!SP data
Methods
private static fromIntervals(Range)(Range intervals) pure
private static fromIntervals()(uint[] intervals...) pure
@property auto byInterval() scope - Get a range that spans all of the intervals in this InversionList.
const(CodepointInterval)[] intervals() @property const
bool opIndex(uint val) const - Tests the presence of code point `val` in this set.
auto scanFor()(dchar ch) const
size_t length() @property - Number of code points in this set.
This opBinary(string op, U)(U rhs) if (isCodepointSet!U || is(U : dchar)) - Sets support natural syntax for set algebra: & (intersection), | (union), - (difference), ~ (symmetric difference).
This opOpAssign(string op, U)(U rhs) if (isCodepointSet!U || is(U : dchar)) ref - The 'op=' versions of the above overloaded operators.
bool opBinaryRight(string op : "in", U)(U ch) if (is(U : dchar)) const - Tests the presence of code point `ch` in this set, the same as opIndex.
auto opUnary(string op : "!")() - Obtains a set that is the inversion of this set.
@property auto byCodepoint() - A range that spans each code point in this set.
void toString(Writer)(scope Writer sink, scope const ref FormatSpec!char fmt) - Obtain a textual representation of this InversionList in form of open-right intervals.
ref add()(uint a, uint b) - Add an interval [a, b) to this set.
ref intersect(U)(U rhs) if (isCodepointSet!U)
ref intersect()(dchar ch)
ref sub()(dchar ch)
ref sub(U)(U rhs) if (isCodepointSet!U)
ref add(U)(U rhs) if (isCodepointSet!U)
@property auto inverted() - Obtains a set that is the inversion of this set.
string toSourceCode(const(CodepointInterval)[] range, string funcName)
string toSourceCode(string funcName = "") - Generates a string with D source code of a unary function named `funcName` taking a single `dchar` argument. If `funcName` is empty the code is adjusted to be a lambda function.
bool empty() @property const - True if this set doesn't contain any code points.
void sanitize()
ref subChar(dchar ch)
Marker addInterval(int a, int b, Marker hint = Marker.init) scope
Marker dropUpTo(uint a, Marker pos = Marker.init)
Marker skipUpTo(uint a, Marker pos = Marker.init)
Constructors
this(Set set) - Construct from another code point set of any type.
this(Range intervals) - Construct a set from a forward range of code point intervals.
this(uint[] intervals...) - Construct a set from plain values of code point intervals.
Nested Templates
Intervals
struct CowArray(SP = GcPolicy)
Fields
uint[] data
Methods
static auto reuse(uint[] arr)
bool empty() @property const
size_t length() @property const
void length(size_t len) @property
uint opIndex()(size_t idx) const
void opIndexAssign(uint val, size_t idx)
auto opSlice(size_t from, size_t to)
auto opSlice(size_t from, size_t to) const
auto opSlice()
auto opSlice() const
void append(Range)(Range range) if (isInputRange!Range && hasLength!Range && is(ElementType!Range : uint))
void append()(uint[] val...)
bool opEquals()(auto ref const CowArray rhs) const
uint refCount() @property const
void refCount(uint cnt) @property
void dupThisReference(uint count)
Constructors
this(Range range)
this(Range range)
Destructors
private struct TrieBuilder(Value, Key, Args...) if (isBitPackableType!Value && isValidArgsForTrie!(Key, Args))
Fields
Prefix.length - 1 lastLevel
size_t[Prefix.length] indices
Value defValue
size_t curIndex
ConstructState[Prefix.length] state
MultiArray!(idxTypes!(Key, fullBitSize!(Prefix), Prefix[0..$]), V) table
"non-monotonic prefix function(s), an unsorted range or " ~ "duplicate key->value mapping" errMsg
Methods
static auto deduceMaxIndex(Preds...)()
@property ref idx(size_t level)()
void addValue(size_t level, T)(T val, size_t numVals)
void spillToNextPage(size_t level, Slice)(ref Slice ptr)
void spillToNextPageImpl(size_t level, Slice)(ref Slice ptr)
void putAt(size_t idx, Value v)
void putRangeAt(size_t idxA, size_t idxB, Value v)
void putRange(Key a, Key b, Value v) - Put a value `v` into the interval as mapped by keys from `a` to `b`. All slots prior to `a` are filled with the default filler.
void putValue(Key key, Value v) - Put a value `v` into the slot mapped by `key`. All slots prior to `key` are filled with the default filler.
auto build() - Finishes construction of the Trie, yielding an immutable Trie instance.
Constructors
this(Value filler) - Construct a builder, where `filler` is a value to indicate empty slots (or a "not found" condition).
Nested Templates
ConstructState
private struct Trie(Value, Key, Args...) if (isValidPrefixForTrie!(Key, Args) || (isValidPrefixForTrie!(Key, Args[1..$]) && is(typeof(Args[0]) : size_t)))

A generic Trie data-structure for a fixed number of stages. The design goal is optimal speed with smallest footprint size. It's intentionally read-only and doesn't provide constructors. To construct one use a special builder; see TrieBuilder and buildTrie.

Fields
MultiArray!(idxTypes!(Key, fullBitSize!(Prefix), Prefix[0..$]), Value) _table
Methods
TypeOfBitPacked!Value opIndex()(Key key) const - Lookup the `key` in this `Trie`.
size_t bytes(size_t n = size_t.max)() @property const
size_t pages(size_t n)() @property const
void store(OutRange)(scope OutRange sink) if (isOutputRange!(OutRange, char)) const
Constructors
this(typeof(_table) table)
this(const(size_t)[] offsets, const(size_t)[] sizes, const(size_t)[] data)
struct clamp(size_t bits)
Fields
bits bitSize
Methods
size_t opCall(T)(T arg)
struct clampIdx(size_t idx, size_t bits)
Fields
bits bitSize
Methods
size_t opCall(T)(T arg)

Conceptual type that outlines the common properties of all UTF Matchers.

Note

For illustration purposes only; every method call results in an assertion failure. Use utfMatcher to obtain a concrete matcher for UTF-8 or UTF-16 encodings.

Methods
bool match(Range)(ref Range inp) if (isRandomAccessRange!Range && is(ElementType!Range : char)) - Performs the semantic equivalent of 2 operations: decoding a code point at the front of `inp` and testing if it belongs to the set of code points of this matcher.
bool skip(Range)(ref Range inp) if (isRandomAccessRange!Range && is(ElementType!Range : char)) - ditto
bool test(Range)(ref Range inp) if (isRandomAccessRange!Range && is(ElementType!Range : char)) - ditto
@property auto subMatcher(Lengths...)() - Advanced feature - provides direct access to a subset of the matcher based on a set of known encoding lengths. Lengths are provided in code units. The sub-matcher then may do fewer operations per any `test`/...
enum Mode
alwaysSkip
neverSkip
skipOnMatch
struct BitPacked(T, size_t sz) if (isIntegral!T || is(T : dchar))

An opaque wrapper around unsigned built-in integer and code unit (char/wchar/dchar) types. Parameter sz indicates that the value is confined to the range of [0, 2^^sz). With this knowledge it can be packed more tightly when stored in certain data-structures like a trie.

Note

The BitPacked!(T, sz) is implicitly convertible to T but not vice versa. Users have to ensure the value fits in the required range and use the cast operator to perform the conversion.

Fields
sz bitSize
T _value
struct assumeSize(alias Fn, size_t bits)
Fields
bits bitSize
Methods
static auto ref opCall(T)(auto ref T arg)
struct sliceBits(size_t from, size_t to)
Fields
to - from bitSize
Methods
static auto opCall(T)(T x)
alias lo8 = assumeSize!(low_8, 8)
alias mlo8 = assumeSize!(midlow_8, 8)
struct DecompressedIntervals
Fields
const(ubyte)[] _stream
size_t _idx
Methods
void popFront()
bool empty() @property const
DecompressedIntervals save() @property return scope
Constructors
this(const(ubyte)[] stream)

Functions 37

void copyBackwards(T, U)(T[] src, U[] dest)
void copyForward(T, U)(T[] src, U[] dest)
auto force(T, F)(F from) if (isIntegral!T && !is(T == F))
auto force(T, F)(F from) if (isBitPacked!T && !is(T == F))
auto force(T, F)(F from) if (is(T == F))
size_t replicateBits(size_t times, size_t bits)(size_t val) @safe pure nothrow @nogc
size_t spaceFor(size_t _bits)(size_t new_len) @safe pure nothrow @nogc
SliceOverIndexed!(const(T)) sliceOverIndexed(T)(size_t a, size_t b, const(T)* x) if (is(Unqual!T == T))
SliceOverIndexed!T sliceOverIndexed(T)(size_t a, size_t b, T* x) if (is(Unqual!T == T))
private inout(PackedArrayView!T) packedArrayView(T)(inout(size_t)* ptr, size_t items)
string genUnrolledSwitchSearch(size_t size) @safe pure nothrow
bool isPow2OrZero(size_t sz) @safe pure nothrow @nogc
size_t uniformLowerBound(alias pred, Range, T)(Range range, T needle) if (is(T : ElementType!Range))
size_t switchUniformLowerBound(alias pred, Range, T)(Range range, T needle) if (is(T : ElementType!Range))
size_t genericReplace(Policy = void, T, Range)(ref T dest, size_t from, size_t to, Range stuff) @trusted
private uint safeRead24(scope const ubyte* ptr, size_t idx) @system pure nothrow @nogc
private void safeWrite24(scope ubyte* ptr, uint val, size_t idx) @system pure nothrow @nogc
private uint unalignedRead24(scope const ubyte* ptr, size_t idx) @system pure nothrow @nogc
private void unalignedWrite24(scope ubyte* ptr, uint val, size_t idx) @system pure nothrow @nogc
private uint read24(scope const ubyte* ptr, size_t idx) @system pure nothrow @nogc
private void write24(scope ubyte* ptr, uint val, size_t idx) @system pure nothrow @nogc
auto arrayRepr(T)(T x)
size_t sumOfIntegerTuple(ints...)() @property
private auto utf8Matcher(Set)(Set set)
private auto utf16Matcher(Set)(Set set)
auto utfMatcher(Char, Set)(Set set) if (isCodepointSet!Set) - Constructs a matcher object to classify code points from the `set` for an encoding that has `Char` as its code unit.
auto decoder(C)(C[] s, size_t offset = 0) if (is(C : wchar) || is(C : char))
auto toTrie(size_t level, Set)(Set set) if (isCodepointSet!Set) - Convenience function to construct optimal configurations for a packed Trie from any `set` of code points.
auto toDelegate(Set)(Set set) if (isCodepointSet!Set) - Builds a `Trie` with a typically optimal speed-size trade-off and wraps it into a delegate of the following type: bool delegate(dchar ch).
uint low_8(uint x) @safe pure nothrow @nogc
uint midlow_8(uint x) @safe pure nothrow @nogc
int comparePropertyName(Char1, Char2)(const(Char1)[] a, const(Char2)[] b) if (is(Char1 : dchar) && is(Char2 : dchar)) @safe pure
bool propertyNameLess(Char1, Char2)(const(Char1)[] a, const(Char2)[] b) if (is(Char1 : dchar) && is(Char2 : dchar)) @safe pure
void compressTo(uint val, ref scope ubyte[] arr) @safe pure nothrow
uint decompressFrom(scope const(ubyte)[] arr, ref size_t idx) @safe pure
ubyte[] compressIntervals(Range)(Range intervals) if (isInputRange!Range && isIntegralPair!(ElementType!Range))
auto decompressIntervals(const(ubyte)[] data) @safe pure

Variables 4

enum lineSep = '\u2028'
enum paraSep = '\u2029'
enum nelSep = '\u0085'
enum lastDchar = 0x10FFFF

Templates 27

template isBitPackableType(T)
template PackedArrayView(T) if ((is(T dummy == BitPacked!(U, sz), U, size_t sz) && isBitPackableType!U) || isBitPackableType!T)
template PackedPtr(T) if ((is(T dummy == BitPacked!(U, sz), U, size_t sz) && isBitPackableType!U) || isBitPackableType!T)
template sharMethod(alias uniLowerBound)
Functions
size_t sharMethod(alias _pred = "a<b", Range, T)(Range range, T needle) if (is(T : ElementType!Range))
template isCodepointSet(T)

Tests if T is some kind of set of code points. Intended for template constraints.

template isIntegralPair(T, V = uint)

Tests if T is a pair of integers that implicitly convert to V. The following code must compile for any pair T:

(T x){ V a = x[0]; V b = x[1];}

The following must not compile:

(T x){ V c = x[2];}
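For instance, the following sketch checks the constraint with std.typecons.Tuple (assuming `isIntegralPair` is accessible as listed above):

```d
import std.typecons : Tuple;
import std.uni : isIntegralPair;

// a pair of uints trivially converts element-wise to uint
static assert(isIntegralPair!(Tuple!(uint, uint)));
// strings do not implicitly convert to uint, so this pair is rejected
static assert(!isIntegralPair!(Tuple!(string, string)));
```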

template mapTrieIndex(Prefix...)

Maps Key to a suitable integer index within the range of size_t. The mapping is constructed by applying predicates from Prefix left to right and concatenating the resulting bits.

The first (leftmost) predicate defines the most significant bits of the resulting index.

Functions
size_t mapTrieIndex(Key)(Key key) if (isValidPrefixForTrie!(Key, Prefix))
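The bit concatenation can be illustrated with plain arithmetic. The split below (13 high bits and 8 low bits of a code point) is a hypothetical stand-in for two predicates, not the actual mapTrieIndex machinery:

```d
// Hypothetical two-predicate split of a 21-bit code point.
size_t highBits(dchar ch) { return ch >> 8; }   // leftmost predicate: most significant bits
size_t lowBits(dchar ch)  { return ch & 0xFF; } // rightmost predicate: least significant bits

size_t toIndex(dchar ch)
{
    // concatenate: the leftmost predicate's bits end up on top
    return (highBits(ch) << 8) | lowBits(ch);
}

void main()
{
    // this particular split is lossless, so the index equals the code point
    assert(toIndex('\u0041') == 0x41);
    assert(toIndex('\u4E00') == 0x4E00);
}
```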
template GetBitSlicing(size_t top, sizes...)
template callableWith(T)
template isValidPrefixForTrie(Key, Prefix...)
template isValidArgsForTrie(Key, Args...)
template codepointSetTrie(sizes...) if (sumOfIntegerTuple!sizes == 21)

A shorthand for creating a custom multi-level fixed Trie from a CodepointSet. sizes are numbers of bits per level, with the most significant bits used first.

Note

The sum of sizes must be equal to 21.

See Also

toTrie, which is even simpler.

Example:

{
    import std.stdio;
    auto set = unicode("Number");
    auto trie = codepointSetTrie!(8, 5, 8)(set);
    writeln("Input code points to test:");
    foreach (line; stdin.byLine)
    {
        int count = 0;
        foreach (dchar ch; line)
            if (trie[ch]) // is number
                count++;
        writefln("Contains %d number code points.", count);
    }
}

Functions
auto codepointSetTrie(Set)(Set set) if (isCodepointSet!Set)
template CodepointSetTrie(sizes...) if (sumOfIntegerTuple!sizes == 21)

Type of Trie generated by codepointSetTrie function.

template codepointTrie(T, sizes...) if (sumOfIntegerTuple!sizes == 21)

A slightly more general tool for building fixed Trie for the Unicode data.

Specifically, unlike codepointSetTrie it allows creating mappings of dchar to an arbitrary type T.

Note

Overloads taking CodepointSets will naturally convert only to bool mapping Tries.

CodepointTrie is the type of Trie as generated by codepointTrie function.

Functions
auto codepointTrie()(T[dchar] map, T defValue = T.init)
auto codepointTrie(R)(R range, T defValue = T.init) if (isInputRange!R && is(typeof(ElementType!R.init[0]) : T) && is(typeof(ElementType!R.init[1]) : dchar))
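As a sketch of mapping code points to a non-boolean payload via the associative-array overload above (the `width` table and its values are made up for illustration):

```d
import std.uni : codepointTrie;

void main()
{
    // hypothetical display-width table for a couple of code points
    ubyte[dchar] width = ['\u00E9' : 1, '\u4E00' : 2];
    // level sizes 8 + 5 + 8 == 21 bits, covering the whole code point range
    auto trie = codepointTrie!(ubyte, 8, 5, 8)(width, 0);
    assert(trie['\u4E00'] == 2);
    assert(trie['x'] == 0); // the default value for unmapped code points
}
```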
template CodepointTrie(T, sizes...) if (sumOfIntegerTuple!sizes == 21)

ditto

template cmpK0(alias Pred)
Functions
bool cmpK0(Value, Key)(Tuple!(Value, Key) a, Tuple!(Value, Key) b)
template buildTrie(Value, Key, Args...) if (isValidArgsForTrie!(Key, Args))

The most general utility for construction of Tries short of using TrieBuilder directly.

Provides a number of convenience overloads. Args is a tuple of the maximum key value followed by predicates that construct an index from a key.

Alternatively, if the first argument is not a value convertible to Key, then the whole tuple of Args is treated as predicates and the maximum Key is deduced from them.

Functions
auto buildTrie(Range)(Range range, Value filler = Value.init) if (isInputRange!Range && is(typeof(Range.init.front[0]) : Value) && is(typeof(Range.init.front[1]) : Key))
auto buildTrie(Range)(Range range, Value filler = Value.init) if (is(TypeOfBitPacked!Value == bool) && isInputRange!Range && is(typeof(Range.init.front[0]) : Key) && is(typeof(Range.init.front[1]) : Key))
auto buildTrie(Range)(Range range, Value filler, bool unsorted) if (isInputRange!Range && is(typeof(Range.init.front[0]) : Value) && is(typeof(Range.init.front[1]) : Key))
auto buildTrie(Range)(Range range, Value filler = Value.init) if (is(TypeOfBitPacked!Value == bool) && isInputRange!Range && is(typeof(Range.init.front) : Key))
auto buildTrie()(Value[] array, Value filler = Value.init) if (isUnsigned!Key)
auto buildTrie(Key, Value)(Value[Key] map, Value filler = Value.init)
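For illustration, a sketch of the pair-range overload, assuming `sliceBits` from std.uni as the index predicates and a key-sorted input; the key space and payload values are made up:

```d
import std.typecons : tuple;
import std.uni : buildTrie, sliceBits;

void main()
{
    // (value, key) pairs, sorted by key
    auto pairs = [tuple(cast(ubyte) 7, 10u), tuple(cast(ubyte) 9, 500u)];
    // maximum key value 1024; index each key by its bits [6, 10), then [0, 6);
    // the leftmost predicate supplies the most significant index bits
    auto trie = buildTrie!(ubyte, uint, 1024, sliceBits!(6, 10), sliceBits!(0, 6))(pairs);
    assert(trie[10] == 7);
    assert(trie[500] == 9);
    assert(trie[11] == 0); // filler value for unmapped keys
}
```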
template Utf8Matcher()
Functions
void badEncoding()
char truncate()(char ch)
static auto encode(size_t sz)(dchar ch) if (sz > 1)
auto build(Set)(Set set)
Types
struct Impl
struct CherryPick
template Utf16Matcher()
Functions
void badEncoding()
auto encode2(dchar ch)
auto build(Set)(Set set)
Types
struct Impl
struct CherryPick
template bitSizeOf(Args...) if (Args.length == 1)
template isBitPacked(T)

Tests if T is some instantiation of BitPacked!(U, x) and thus suitable for packing.

template TypeOfBitPacked(T)

Gives the type U from BitPacked!(U, x) or T itself for every other type.

template Sequence(size_t start, size_t end)
template useItemAt(size_t idx, T) if (isIntegral!T || is(T : dchar))
Functions
size_t impl(const scope T[] arr)
template useLastItem(T)
Functions
size_t impl(const scope T[] arr)
template fullBitSize(Prefix...)
template idxTypes(Key, size_t fullBits, Prefix...)