std.uni

The std.uni module provides an implementation of fundamental Unicode algorithms and data structures. It does not include UTF encoding and decoding primitives; see decode and encode in std.utf for that functionality.

All primitives listed operate on Unicode characters and sets of characters. For functions which operate on ASCII characters and ignore Unicode characters, see std.ascii. For definitions of Unicode character, code point and other terms used throughout this module, see the terminology section below.

The focus of this module is the core needs of developing Unicode-aware applications. To that effect it provides the following optimized primitives:

It's recognized that an application may need further enhancements and extensions, such as less commonly known algorithms, or tailoring existing ones for region-specific needs. To help users build any extra functionality beyond the core primitives, the module provides:

  • CodepointSet, a type for easy manipulation of sets of characters. Besides the typical set algebra it provides an unusual feature: a D source code generator for detection of code points in this set. This is a boon for meta-programming parser frameworks, and is used internally to power classification in small sets like isWhite.

  • A way to construct optimal packed multi-stage tables, also known as a special case of Trie. The functions codepointTrie and codepointSetTrie construct custom tries that map dchar to value. The end result is a fast and predictable O(1) lookup that powers functions like isAlpha and combiningClass, but for user-defined data sets.

  • A useful technique for Unicode-aware parsers that perform character classification of encoded code points is to avoid unnecessary decoding at all costs. utfMatcher provides an improvement over the usual decode-classify-process workflow by combining the decoding and classification steps. By extracting the necessary bits directly from encoded code units, matchers achieve significant performance improvements. See MatcherConcept for the common interface of UTF matchers.

  • Generally useful building blocks for customized normalization: combiningClass for querying the combining class and allowedIn for testing the Quick_Check property of a given normalization form.

  • Access to a large selection of commonly used sets of code points. Supported sets include Script, Block and General Category. The exact contents of a set can be observed in the CLDR utility, on the property index page of the Unicode website. See unicode for easy and (optionally) compile-time checked set queries.

Synopsis

import std.uni;
void main()
{
    // initialize code point sets using script/block or property name
    // now 'set' contains code points from both scripts.
    auto set = unicode("Cyrillic") | unicode("Armenian");
    // same thing but simpler and checked at compile-time
    auto ascii = unicode.ASCII;
    auto currency = unicode.Currency_Symbol;

    // easy set ops
    auto a = set & ascii;
    assert(a.empty); // as it has no intersection with ascii
    a = set | ascii;
    auto b = currency - a; // subtract all ASCII, Cyrillic and Armenian

    // some properties of code point sets
    assert(b.length > 45); // 46 items in Unicode 6.1, even more in 6.2
    // testing presence of a code point in a set
    // is just fine, it is O(logN)
    assert(!b['$']);
    assert(!b['\u058F']); // Armenian dram sign
    assert(b['¥']);

    // building fast lookup tables, these guarantee O(1) complexity
    // 1-level Trie lookup table essentially a huge bit-set ~262Kb
    auto oneTrie = toTrie!1(b);
    // 2-level far more compact but typically slightly slower
    auto twoTrie = toTrie!2(b);
    // 3-level even smaller, and a bit slower yet
    auto threeTrie = toTrie!3(b);
    assert(oneTrie['£']);
    assert(twoTrie['£']);
    assert(threeTrie['£']);

    // build the trie with the most sensible trie level
    // and bind it as a functor
    auto cyrillicOrArmenian = toDelegate(set);
    auto balance = find!(cyrillicOrArmenian)("Hello ընկեր!");
    assert(balance == "ընկեր!");
    // compatible with bool delegate(dchar)
    bool delegate(dchar) bindIt = cyrillicOrArmenian;

    // Normalization
    string s = "Plain ascii (and not only), is always normalized!";
    assert(s is normalize(s));// is the same string

    string nonS = "A\u0308ffin"; // A ligature
    auto nS = normalize(nonS); // to NFC, the W3C endorsed standard
    assert(nS == "Äffin");
    assert(nS != nonS);
    string composed = "Äffin";

    assert(normalize!NFD(composed) == "A\u0308ffin");
    // to NFKD, compatibility decomposition useful for fuzzy matching/searching
    assert(normalize!NFKD("2¹⁰") == "210");
}

Terminology

The following is a list of important Unicode notions and definitions. Any conventions used specifically in this module alone are marked as such. The descriptions are based on the formal definition as found in chapter three of The Unicode Standard Core Specification.

Abstract character
A unit of information used for the organization, control, or representation of textual data. Note that:

  • When representing data, the nature of that data is generally symbolic as opposed to some other kind of data (for example, visual).

  • An abstract character has no concrete form and should not be confused with a glyph.

  • An abstract character does not necessarily correspond to what a user thinks of as a “character” and should not be confused with a Grapheme.

  • The abstract characters encoded (see Encoded character) are known as Unicode abstract characters.

  • Abstract characters not directly encoded by the Unicode Standard can often be represented by the use of combining character sequences.

Canonical decomposition

The decomposition of a character or character sequence that results from recursively applying the canonical mappings found in the Unicode Character Database and those described in Conjoining Jamo Behavior (section 12 of Unicode Conformance).

Canonical composition

The precise definition of canonical composition is the algorithm as specified in Unicode Conformance section 11. Informally, it's the process that does the reverse of the canonical decomposition with the addition of certain rules that e.g. prevent legacy characters from appearing in the composed result.

Canonical equivalent

Two character sequences are said to be canonical equivalents if their full canonical decompositions are identical.

Character
Typically differs by context. For the purpose of this documentation the term character implies encoded character, that is, a code point having an assigned abstract character (a symbolic meaning).

Code point
Any value in the Unicode codespace; that is, the range of integers from 0 to 10FFFF (hex). Not all code points are assigned to encoded characters.

Code unit
The minimal bit combination that can represent a unit of encoded text for processing or interchange. Depending on the encoding this could be: 8-bit code units in UTF-8 (char), 16-bit code units in UTF-16 (wchar), and 32-bit code units in UTF-32 (dchar). Note that in UTF-32, a code unit is a code point and is represented by the D dchar type.

Combining character
A character with the General Category of Combining Mark (M).

  • All characters with a non-zero canonical combining class are combining characters, but the reverse is not the case: there are combining characters with a zero combining class.

  • These characters are not normally used in isolation unless they are being described. They include such characters as accents, diacritics, Hebrew points, Arabic vowel signs, and Indic matras.

Combining class

A numerical value used by the Unicode Canonical Ordering Algorithm to determine which sequences of combining marks are to be considered canonically equivalent and which are not.
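As a small illustration of the building block mentioned earlier, combiningClass returns this numeric value directly; a minimal sketch (the specific class values below come from the Unicode Character Database):

```d
import std.uni : combiningClass;

void main()
{
    // U+0301 COMBINING ACUTE ACCENT has canonical combining class 230
    assert(combiningClass('\u0301') == 230);
    // starters, e.g. plain letters, have combining class 0
    assert(combiningClass('A') == 0);
}
```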

Compatibility decomposition

The decomposition of a character or character sequence that results from recursively applying both the compatibility mappings and the canonical mappings found in the Unicode Character Database, and those described in Conjoining Jamo Behavior, until no characters can be further decomposed.

Compatibility equivalent

Two character sequences are said to be compatibility equivalents if their full compatibility decompositions are identical.

Encoded character
An association (or mapping) between an abstract character and a code point.

Glyph
The actual, concrete image of a glyph representation having been rasterized or otherwise imaged onto some display surface.

Grapheme base
A character with the property Grapheme_Base, or any standard Korean syllable block.

Grapheme cluster
Defined as the text between grapheme boundaries as specified by Unicode Standard Annex #29, Unicode text segmentation.

Important general properties of a grapheme:

  • The grapheme cluster represents a horizontally segmentable unit of text, consisting of some grapheme base (which may consist of a Korean syllable) together with any number of nonspacing marks applied to it.

  • A grapheme cluster typically starts with a grapheme base and then extends across any subsequent sequence of nonspacing marks. A grapheme cluster is most directly relevant to text rendering and processes such as cursor placement and text selection in editing, but may also be relevant to comparison and searching.

  • For many processes, a grapheme cluster behaves as if it were a single character with the same properties as its grapheme base. Effectively, nonspacing marks apply graphically to the base, but do not change its properties.

This module defines a number of primitives that work with graphemes: Grapheme, decodeGrapheme and graphemeStride. All of them use extended grapheme boundaries as defined in the aforementioned standard annex.
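A brief sketch of these grapheme primitives; the string below pairs 'e' with U+0301 COMBINING ACUTE ACCENT, which together form a single grapheme cluster:

```d
import std.uni : decodeGrapheme, graphemeStride;

void main()
{
    string s = "noe\u0301l"; // 'e' + U+0301 form one grapheme cluster
    // the grapheme starting at code unit index 2 spans 3 UTF-8 code units:
    // 1 for 'e' plus 2 for U+0301
    assert(graphemeStride(s, 2) == 3);

    // decodeGrapheme takes its argument by ref and consumes one grapheme
    auto g = decodeGrapheme(s);
    assert(g.length == 1 && g[0] == 'n');
    assert(s == "oe\u0301l"); // s now starts at the next grapheme
}
```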

Nonspacing mark
A combining character with the General Category of Nonspacing Mark (Mn) or Enclosing Mark (Me).

Spacing mark
A combining character that is not a nonspacing mark.

Normalization

The concepts of canonical equivalent or compatibility equivalent characters in the Unicode Standard make it necessary to have a full, formal definition of equivalence for Unicode strings. String equivalence is determined by a process called normalization, whereby strings are converted into forms which are compared directly for identity. This is the primary goal of the normalization process; see the function normalize to convert into any of the four defined forms.

A very important attribute of the Unicode Normalization Forms is that they must remain stable between versions of the Unicode Standard. A Unicode string normalized to a particular Unicode Normalization Form in one version of the standard is guaranteed to remain in that Normalization Form for implementations of future versions of the standard.

The Unicode Standard specifies four normalization forms. Informally, two of these forms are defined by maximal decomposition of equivalent sequences, and two of these forms are defined by maximal composition of equivalent sequences.

The choice of the normalization form depends on the particular use case. NFC is the best form for general text, since it's more compatible with strings converted from legacy encodings. NFKC is the preferred form for identifiers, especially where there are security concerns. NFD and NFKD are the most useful for internal processing.

Construction of lookup tables

The Unicode standard describes a set of algorithms that depend on having the ability to quickly look up various properties of a code point. Given the codespace of about 1 million code points, it is not a trivial task to provide a space-efficient solution for the multitude of properties.

Common approaches such as hash tables or binary search over sorted code point intervals (as in InversionList) are insufficient. Hash tables have an enormous memory footprint and binary search over intervals is not fast enough for some heavy-duty algorithms.

The recommended solution (see Unicode Implementation Guidelines) is using multi-stage tables that are an implementation of the Trie data structure with integer keys and a fixed number of stages. For the remainder of the section this will be called a fixed trie. The following describes a particular implementation that is aimed at speed of access at the expense of ideal size savings.

Taking a 2-level Trie as an example, the principle of operation is as follows. Split the number of bits in a key (code point, 21 bits) into 2 components (e.g. 15 and 8). The first is the number of bits in the index of the trie and the other is the number of bits in each page of the trie. The layout of the trie is then an array of size 2^^bits-of-index followed by an array of memory chunks of size 2^^bits-of-page/bits-per-element.

The number of pages is variable (but not less than 1) unlike the number of entries in the index. The slots of the index all have to contain the number of a page that is present. The lookup is then just a couple of operations: slice the upper bits, look up an index for these, take a page at this index and use the lower bits as an offset within this page.

Assuming that pages are laid out consecutively in one array pages, the pseudo-code is:

auto elemsPerPage = (2 ^^ bits_per_page) / Value.sizeOfInBits;
pages[index[n >> bits_per_page]][n & (elemsPerPage - 1)];

Where, if elemsPerPage is a power of 2, the whole process is a handful of simple instructions and 2 array reads. Subsequent levels of the trie are introduced by recursing on this notion - the index array is treated as values. The number of bits in the index is then again split into 2 parts, with pages over the 'current-index' and the new 'upper-index'.
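The lookup above can be sketched with a toy 2-level table over 8-bit keys (a hypothetical split of 4 index bits and 4 page bits, one bool per element; real tries use 21-bit code point keys and bit-packed pages):

```d
// Toy fixed 2-level table over 8-bit keys: the upper 4 bits select a page
// via `index`, the lower 4 bits offset into that page. Identical pages are
// shared, which is where the size savings of a real trie come from.
enum pageBits = 4;
enum pageSize = 1 << pageBits; // 16 entries per page

bool lookup(const ubyte[] index, const bool[][] pages, ubyte key)
{
    auto page = pages[index[key >> pageBits]];
    return page[key & (pageSize - 1)];
}

void main()
{
    // mark keys 0x10 .. 0x2F as present; all other upper nibbles share page 0
    auto empty = new bool[pageSize];
    auto full = new bool[pageSize];
    full[] = true;
    auto index = new ubyte[16]; // all slots point at the empty page...
    index[1] = 1;               // ...except upper nibbles 1 and 2
    index[2] = 1;
    const bool[][] pages = [empty, full];

    assert(!lookup(index, pages, 0x0F));
    assert(lookup(index, pages, 0x10));
    assert(lookup(index, pages, 0x2F));
    assert(!lookup(index, pages, 0x30));
}
```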

For completeness, a level 1 trie is simply an array. The current implementation takes advantage of bit-packing values when the range is known to be limited in advance (such as bool). See also BitPacked for enforcing it manually. The major size advantage however comes from the fact that multiple identical pages on every level are merged by construction.

The process of constructing a trie is more involved and is hidden from the user in the form of the convenience functions codepointTrie, codepointSetTrie and the even more convenient toTrie. In general, a set or a built-in AA with dchar keys can be turned into a trie. The trie object in this module is read-only (immutable); it's effectively frozen after construction.

Unicode properties

This is a full list of Unicode properties accessible through unicode with specific helpers per category nested within. Consult the CLDR utility when in doubt about the contents of a particular set.

General category sets listed below are only accessible with the unicode shorthand accessor.

Sets for other commonly useful properties that are accessible with unicode:

Below is the table with block names accepted by unicode.block. Note that the shorthand version unicode requires "In" to be prepended to the names of blocks so as to disambiguate scripts and blocks.
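For example, a minimal sketch; the Cyrillic script spans several blocks (such as Cyrillic Supplement), so the script set and the same-named block set differ:

```d
import std.uni : unicode;

void main()
{
    auto cyrillicScript = unicode.Cyrillic;   // the Cyrillic script
    auto cyrillicBlock  = unicode.InCyrillic; // the Cyrillic block only
    // the script also covers e.g. the Cyrillic Supplement block,
    // so the two sets are not equal
    assert(cyrillicScript != cyrillicBlock);
    // both contain the basic Cyrillic letters
    assert(cyrillicScript['Б'] && cyrillicBlock['Б']);
}
```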

Below is the table with script names accepted by unicode.script and by the shorthand version unicode:

Below is the table of names accepted by unicode.hangulSyllableType.

References:

ASCII Table, Wikipedia, The Unicode Consortium, Unicode normalization forms, Unicode text segmentation, Unicode Implementation Guidelines, Unicode Conformance

Trademarks: Unicode(tm) is a trademark of Unicode, Inc.

Types 25

struct MultiArray(Types...)
Fields
Types.length dim
size_t[dim] offsets
size_t[dim] sz
size_t[] storage
Methods
@property auto slice(size_t n)() inout pure nothrow @nogc
@property auto ptr(size_t n)() inout pure nothrow @nogc
size_t bytes(size_t n = size_t.max)() @property const @safe
void store(OutRange)(scope OutRange sink) if (isOutputRange!(OutRange, char)) const
@property auto raw_ptr(size_t n)() inout pure nothrow @nogc
Constructors
this(size_t[] sizes...)
this(const(size_t)[] raw_offsets, const(size_t)[] raw_sizes, return scope const(size_t)[] data)
Nested Templates
length(size_t n)
struct PackedPtrImpl(T, size_t bits)
Fields
size_t.sizeof * 8 / bits factor
size_t.sizeof bytesPerWord
size_t * origin
Methods
private T simpleIndex(size_t n) inout
private void simpleWrite(TypeOfBitPacked!T val, size_t n)
Constructors
this(inout(size_t)* ptr) inout
struct PackedArrayViewImpl(T, size_t bits)
Fields
size_t.sizeof * 8 / bits factor
PackedPtr!(T) ptr
size_t ofs
Methods
bool zeros(size_t s, size_t e)
T opIndex(size_t idx) inout
void opIndexAssign(TypeOfBitPacked!T val, size_t idx)
void opSliceAssign(TypeOfBitPacked!T val, size_t start, size_t end)
auto opSlice(size_t from, size_t to) inout
auto opSlice()
bool opEquals(T)(auto ref T arr) const
size_t length() const @property
auto roundUp()(size_t val)
auto roundDown()(size_t val)
Constructors
this(inout(size_t)* origin, size_t offset, size_t items)
private struct SliceOverIndexed(T)
Fields
is(typeof(() { T.init[0] = Item.init; })) assignableIndex
is(typeof(() { T.init[0 .. 0] = Item.init; })) assignableSlice
size_t from
T * arr
Methods
auto opIndex(size_t idx) const
auto opSlice(size_t a, size_t b)
void opSliceAssign(T)(T val, size_t start, size_t end)
auto opSlice()
size_t length() const @property
bool empty() const @property
@property auto front() const
@property auto back() const
@property auto save() inout
void popFront()
void popBack()
bool opEquals(T)(auto ref T arr) const
alias sharLowerBound = sharMethod!uniformLowerBound
alias sharSwitchLowerBound = sharMethod!switchUniformLowerBound
private struct GcPolicy
Methods
T[] dup(T)(const T[] arr)
T[] alloc(T)(size_t size)
T[] realloc(T)(T[] arr, size_t sz)
void replaceImpl(T, Range)(ref T[] dest, size_t from, size_t to, Range stuff)
void append(T, V)(ref T[] arr, V value) if (!isInputRange!V)
void append(T, V)(ref T[] arr, V value) if (isInputRange!V)
void destroy(T)(ref T arr) if (isDynamicArray!T && is(Unqual!T == T)) pure
void destroy(T)(ref T arr) if (isDynamicArray!T && !is(Unqual!T == T)) pure
private bool pureInGCFinalizer() @safe pure nothrow;
bool accessIsSafe() @property @safe nothrow pure
Fields
true accessIsSafe
Methods
T[] dup(T)(const T[] arr)
T[] alloc(T)(size_t size) @trusted
T[] realloc(T)(return scope T[] arr, size_t size) @trusted
void replaceImpl(T, Range)(ref T[] dest, size_t from, size_t to, Range stuff)
void append(T, V)(ref T[] arr, V value) if (!isInputRange!V)
void append(T, V)(ref T[] arr, V value) if (isInputRange!V && hasLength!V)
void destroy(T)(scope ref T[] arr) @trusted
alias _RealArray = CowArray!ReallocPolicy
alias CodepointSet = InversionList!GcPolicy

The recommended default type for a set of code points. For details, see the current implementation: InversionList.

The recommended type of Tuple to represent [a, b) intervals of code points, as used in InversionList. Any interval type should pass the isIntegralPair trait.

Fields
uint[2] _tuple
Methods
bool opEquals(T)(T val) const
inout(uint) a() @property ref return inout
inout(uint) b() @property ref return inout
Constructors
this(uint low, uint high)

InversionList is a set of code points represented as an array of open-right [a, b) intervals (see CodepointInterval above). The name comes from the way the representation reads left to right. For instance a set of all values [10, 50), [80, 90), plus a singular value 60, looks like this:

10, 50, 60, 61, 80, 90

The way to read this is: start with negative, meaning that all numbers smaller than the next one are not present in this set (and positive - the contrary). Then switch positive/negative after each number passed from left to right.

This way negative spans until 10, then positive until 50, then negative until 60, then positive until 61, and so on. As seen, this provides space-efficient storage of highly redundant data that comes in long runs - a description which Unicode properties fit nicely. The technique itself could be seen as a variation on RLE encoding.
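Membership in this representation reduces to a binary search over the boundary array plus a parity check on the resulting index; a minimal sketch over the example above (illustrative only, not the actual InversionList implementation):

```d
import std.range : assumeSorted;

// Inversion-list membership: count boundary values <= val; an odd count
// means val falls inside one of the "positive" spans.
bool contains(const uint[] inv, uint val)
{
    auto countLessEq = inv.assumeSorted.lowerBound(val + 1).length;
    return (countLessEq & 1) == 1;
}

void main()
{
    // encodes [10, 50), [80, 90) plus the single value 60, as above
    const uint[] inv = [10, 50, 60, 61, 80, 90];
    assert(!contains(inv, 9));
    assert(contains(inv, 10) && contains(inv, 49));
    assert(!contains(inv, 50));
    assert(contains(inv, 60));
    assert(!contains(inv, 61));
    assert(contains(inv, 85));
    assert(!contains(inv, 90));
}
```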

Sets are value types (just like int is), thus they are never aliased.

Example:

auto a = CodepointSet('a', 'z'+1);
auto b = CodepointSet('A', 'Z'+1);
auto c = a;
a = a | b;
assert(a == CodepointSet('A', 'Z'+1, 'a', 'z'+1));
assert(a != c);

See also unicode for simpler construction of sets from predefined ones.

Memory usage is 8 bytes per each contiguous interval in a set.

The value semantics are achieved by using the COW technique, and thus it's not safe to cast this type to shared.

Note

It's not recommended to rely on the template parameters or the exact type of the current code point set in std.uni. The type and parameters may change when the standard allocators design is finalized. Use isCodepointSet with templates or just stick with the default alias CodepointSet throughout the whole code base.

Fields
CowArray!SP data
Methods
private static fromIntervals(Range)(Range intervals) pure
private static fromIntervals()(uint[] intervals...) pure
@property auto byInterval() scope - Get a range that spans all of the intervals in this InversionList.
const(CodepointInterval)[] intervals() @property const
bool opIndex(uint val) const - Tests the presence of code point `val` in this set.
auto scanFor()(dchar ch) const
size_t length() @property - Number of code points in this set.
This opBinary(string op, U)(U rhs) if (isCodepointSet!U || is(U : dchar)) - Sets support natural syntax for set algebra: & (intersection), | (union), - (difference), ~ (symmetric difference).
This opOpAssign(string op, U)(U rhs) if (isCodepointSet!U || is(U : dchar)) ref - The 'op=' versions of the above overloaded operators.
bool opBinaryRight(string op : "in", U)(U ch) if (is(U : dchar)) const - Tests the presence of code point `ch` in this set, the same as opIndex.
auto opUnary(string op : "!")() - Obtains a set that is the inversion of this set.
@property auto byCodepoint() - A range that spans each code point in this set.
void toString(Writer)(scope Writer sink, scope const ref FormatSpec!char fmt) - Obtain a textual representation of this InversionList in form of open-right intervals.
ref add()(uint a, uint b) - Add an interval [a, b) to this set.
ref intersect(U)(U rhs) if (isCodepointSet!U)
ref intersect()(dchar ch)
ref sub()(dchar ch)
ref sub(U)(U rhs) if (isCodepointSet!U)
ref add(U)(U rhs) if (isCodepointSet!U)
@property auto inverted() - Obtains a set that is the inversion of this set.
string toSourceCode(const(CodepointInterval)[] range, string funcName)
string toSourceCode(string funcName = "") - Generates a string with D source code of a unary function named `funcName` taking a single `dchar` argument. If `funcName` is empty the code is adjusted to be a lambda function.
bool empty() @property const - True if this set doesn't contain any code points.
void sanitize()
ref subChar(dchar ch)
Marker addInterval(int a, int b, Marker hint = Marker.init) scope
Marker dropUpTo(uint a, Marker pos = Marker.init)
Marker skipUpTo(uint a, Marker pos = Marker.init)
Constructors
this(Set set) - Construct from another code point set of any type.
this(Range intervals) - Construct a set from a forward range of code point intervals.
this(uint[] intervals...) - Construct a set from plain values of code point intervals.
Nested Templates
Intervals
struct CowArray(SP = GcPolicy)
Fields
uint[] data
Methods
static auto reuse(uint[] arr)
bool empty() @property const
size_t length() @property const
void length(size_t len) @property
uint opIndex()(size_t idx) const
void opIndexAssign(uint val, size_t idx)
auto opSlice(size_t from, size_t to)
auto opSlice(size_t from, size_t to) const
auto opSlice()
auto opSlice() const
void append(Range)(Range range) if (isInputRange!Range && hasLength!Range && is(ElementType!Range : uint))
void append()(uint[] val...)
bool opEquals()(auto ref const CowArray rhs) const
uint refCount() @property const
void refCount(uint cnt) @property
void dupThisReference(uint count)
Constructors
this(Range range)
this(Range range)
Destructors
private struct TrieBuilder(Value, Key, Args...) if (isBitPackableType!Value && isValidArgsForTrie!(Key, Args))
Fields
Prefix.length - 1 lastLevel
size_t[Prefix.length] indices
Value defValue
size_t curIndex
ConstructState[Prefix.length] state
MultiArray!(idxTypes!(Key, fullBitSize!(Prefix), Prefix[0..$]), V) table
"non-monotonic prefix function(s), an unsorted range or " ~ "duplicate key->value mapping" errMsg
Methods
static auto deduceMaxIndex(Preds...)()
@property ref idx(size_t level)()
void addValue(size_t level, T)(T val, size_t numVals)
void spillToNextPage(size_t level, Slice)(ref Slice ptr)
void spillToNextPageImpl(size_t level, Slice)(ref Slice ptr)
void putAt(size_t idx, Value v)
void putRangeAt(size_t idxA, size_t idxB, Value v)
void putRange(Key a, Key b, Value v) - Put a value `v` into the interval as mapped by keys from `a` to `b`. All slots prior to `a` are filled with the default filler.
void putValue(Key key, Value v) - Put a value `v` into the slot mapped by `key`. All slots prior to `key` are filled with the default filler.
auto build() - Finishes construction of the Trie, yielding an immutable Trie instance.
Constructors
this(Value filler) - Construct a builder, where `filler` is a value to indicate empty slots (or a "not found" condition).
Nested Templates
ConstructState
private struct Trie(Value, Key, Args...) if (isValidPrefixForTrie!(Key, Args) || (isValidPrefixForTrie!(Key, Args[1..$]) && is(typeof(Args[0]) : size_t)))

A generic Trie data-structure for a fixed number of stages. The design goal is optimal speed with smallest footprint size. It's intentionally read-only and doesn't provide constructors. To construct one use a special builder; see TrieBuilder and buildTrie.

Fields
MultiArray!(idxTypes!(Key, fullBitSize!(Prefix), Prefix[0..$]), Value) _table
Methods
TypeOfBitPacked!Value opIndex()(Key key) const - Lookup the `key` in this `Trie`.
size_t bytes(size_t n = size_t.max)() @property const
size_t pages(size_t n)() @property const
void store(OutRange)(scope OutRange sink) if (isOutputRange!(OutRange, char)) const
Constructors
this(typeof(_table) table)
this(const(size_t)[] offsets, const(size_t)[] sizes, const(size_t)[] data)
struct clamp(size_t bits)
Fields
bits bitSize
Methods
size_t opCall(T)(T arg)
struct clampIdx(size_t idx, size_t bits)
Fields
bits bitSize
Methods
size_t opCall(T)(T arg)

Conceptual type that outlines the common properties of all UTF Matchers.

Note

For illustration purposes only; every method call results in an assertion failure. Use utfMatcher to obtain a concrete matcher for UTF-8 or UTF-16 encodings.

Methods
bool match(Range)(ref Range inp) if (isRandomAccessRange!Range && is(ElementType!Range : char)) - Performs the semantic equivalent of 2 operations: decoding a code point at the front of `inp` and testing if it belongs to the set of code points of this matcher.
bool skip(Range)(ref Range inp) if (isRandomAccessRange!Range && is(ElementType!Range : char)) - ditto
bool test(Range)(ref Range inp) if (isRandomAccessRange!Range && is(ElementType!Range : char)) - ditto
@property auto subMatcher(Lengths...)() - Advanced feature - provides direct access to a subset of the matcher based on a set of known encoding lengths. Lengths are provided in code units. The sub-matcher then may do fewer operations per any `test`/...
enum Mode
alwaysSkip
neverSkip
skipOnMatch
struct BitPacked(T, size_t sz) if (isIntegral!T || is(T : dchar))

An opaque wrapper around unsigned built-in integer and code unit (char/wchar/dchar) types. Parameter sz indicates that the value is confined to the range of [0, 2^^sz). With this knowledge it can be packed more tightly when stored in certain data-structures like a trie.

Note

The BitPacked!(T, sz) is implicitly convertible to T but not vice versa. Users have to ensure the value fits in the required range and use the cast operator to perform the conversion.

Fields
sz bitSize
T _value
struct assumeSize(alias Fn, size_t bits)
Fields
bits bitSize
Methods
static auto ref opCall(T)(auto ref T arg)
struct sliceBits(size_t from, size_t to)
Fields
to - from bitSize
Methods
static auto opCall(T)(T x)
alias lo8 = assumeSize!(low_8, 8)
alias mlo8 = assumeSize!(midlow_8, 8)
struct DecompressedIntervals
Fields
const(ubyte)[] _stream
size_t _idx
Methods
void popFront()
bool empty() @property const
DecompressedIntervals save() @property return scope
Constructors
this(const(ubyte)[] stream)

Functions 37

void copyBackwards(T, U)(T[] src, U[] dest)
void copyForward(T, U)(T[] src, U[] dest)
auto force(T, F)(F from) if (isIntegral!T && !is(T == F))
auto force(T, F)(F from) if (isBitPacked!T && !is(T == F))
auto force(T, F)(F from) if (is(T == F))
size_t replicateBits(size_t times, size_t bits)(size_t val) @safe pure nothrow @nogc
size_t spaceFor(size_t _bits)(size_t new_len) @safe pure nothrow @nogc
SliceOverIndexed!(const(T)) sliceOverIndexed(T)(size_t a, size_t b, const(T)* x) if (is(Unqual!T == T))
SliceOverIndexed!T sliceOverIndexed(T)(size_t a, size_t b, T* x) if (is(Unqual!T == T))
private inout(PackedArrayView!T) packedArrayView(T)(inout(size_t)* ptr, size_t items)
string genUnrolledSwitchSearch(size_t size) @safe pure nothrow
bool isPow2OrZero(size_t sz) @safe pure nothrow @nogc
size_t uniformLowerBound(alias pred, Range, T)(Range range, T needle) if (is(T : ElementType!Range))
size_t switchUniformLowerBound(alias pred, Range, T)(Range range, T needle) if (is(T : ElementType!Range))
size_t genericReplace(Policy = void, T, Range)(ref T dest, size_t from, size_t to, Range stuff) @trusted
private uint safeRead24(scope const ubyte* ptr, size_t idx) @system pure nothrow @nogc
private void safeWrite24(scope ubyte* ptr, uint val, size_t idx) @system pure nothrow @nogc
private uint unalignedRead24(scope const ubyte* ptr, size_t idx) @system pure nothrow @nogc
private void unalignedWrite24(scope ubyte* ptr, uint val, size_t idx) @system pure nothrow @nogc
private uint read24(scope const ubyte* ptr, size_t idx) @system pure nothrow @nogc
private void write24(scope ubyte* ptr, uint val, size_t idx) @system pure nothrow @nogc
auto arrayRepr(T)(T x)
size_t sumOfIntegerTuple(ints...)() @property
private auto utf8Matcher(Set)(Set set)
private auto utf16Matcher(Set)(Set set)
auto utfMatcher(Char, Set)(Set set) if (isCodepointSet!Set) - Constructs a matcher object to classify code points from the `set` for an encoding that has `Char` as its code unit.
auto decoder(C)(C[] s, size_t offset = 0) if (is(C : wchar) || is(C : char))
auto toTrie(size_t level, Set)(Set set) if (isCodepointSet!Set) - Convenience function to construct optimal configurations for a packed Trie from any `set` of code points.
auto toDelegate(Set)(Set set) if (isCodepointSet!Set) - Builds a `Trie` with a typically optimal speed-size trade-off and wraps it into a delegate of the following type: bool delegate(dchar ch).
uint low_8(uint x) @safe pure nothrow @nogc
uint midlow_8(uint x) @safe pure nothrow @nogc
int comparePropertyName(Char1, Char2)(const(Char1)[] a, const(Char2)[] b) if (is(Char1 : dchar) && is(Char2 : dchar)) @safe pure
bool propertyNameLess(Char1, Char2)(const(Char1)[] a, const(Char2)[] b) if (is(Char1 : dchar) && is(Char2 : dchar)) @safe pure
void compressTo(uint val, ref scope ubyte[] arr) @safe pure nothrow
uint decompressFrom(scope const(ubyte)[] arr, ref size_t idx) @safe pure
ubyte[] compressIntervals(Range)(Range intervals) if (isInputRange!Range && isIntegralPair!(ElementType!Range))
auto decompressIntervals(const(ubyte)[] data) @safe pure

Variables 4

enum lineSep = '\u2028'
enum paraSep = '\u2029'
enum nelSep = '\u0085'
enum lastDchar = 0x10FFFF

Templates 27

template isBitPackableType(T)
template PackedArrayView(T) if ((is(T dummy == BitPacked!(U, sz), U, size_t sz) && isBitPackableType!U) || isBitPackableType!T)
template PackedPtr(T) if ((is(T dummy == BitPacked!(U, sz), U, size_t sz) && isBitPackableType!U) || isBitPackableType!T)
template sharMethod(alias uniLowerBound)
Functions
size_t sharMethod(alias _pred = "a<b", Range, T)(Range range, T needle) if (is(T : ElementType!Range))
template isCodepointSet(T)

Tests if T is some kind of set of code points. Intended for template constraints.

template isIntegralPair(T, V = uint)

Tests if T is a pair of integers that implicitly convert to V. The following code must compile for any pair T:

(T x){ V a = x[0]; V b = x[1];}

The following must not compile:

(T x){ V c = x[2];}
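For instance, the following sketch checks the constraint with std.typecons.Tuple (assuming `isIntegralPair` is accessible as listed above):

```d
import std.typecons : Tuple;
import std.uni : isIntegralPair;

// a pair of uints trivially converts element-wise to uint
static assert(isIntegralPair!(Tuple!(uint, uint)));
// strings do not implicitly convert to uint, so this pair is rejected
static assert(!isIntegralPair!(Tuple!(string, string)));
```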

template mapTrieIndex(Prefix...)

Maps Key to a suitable integer index within the range of size_t. The mapping is constructed by applying predicates from Prefix left to right and concatenating the resulting bits.

The first (leftmost) predicate defines the most significant bits of the resulting index.

Functions
size_t mapTrieIndex(Key)(Key key) if (isValidPrefixForTrie!(Key, Prefix))
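The bit concatenation can be illustrated with plain arithmetic. The split below (13 high bits and 8 low bits of a code point) is a hypothetical stand-in for two predicates, not the actual mapTrieIndex machinery:

```d
// Hypothetical two-predicate split of a 21-bit code point.
size_t highBits(dchar ch) { return ch >> 8; }   // leftmost predicate: most significant bits
size_t lowBits(dchar ch)  { return ch & 0xFF; } // rightmost predicate: least significant bits

size_t toIndex(dchar ch)
{
    // concatenate: the leftmost predicate's bits end up on top
    return (highBits(ch) << 8) | lowBits(ch);
}

void main()
{
    // this particular split is lossless, so the index equals the code point
    assert(toIndex('\u0041') == 0x41);
    assert(toIndex('\u4E00') == 0x4E00);
}
```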
template GetBitSlicing(size_t top, sizes...)
template callableWith(T)
template isValidPrefixForTrie(Key, Prefix...)
template isValidArgsForTrie(Key, Args...)
template codepointSetTrie(sizes...) if (sumOfIntegerTuple!sizes == 21)

A shorthand for creating a custom multi-level fixed Trie from a CodepointSet. sizes are numbers of bits per level, with the most significant bits used first.

Note

The sum of sizes must be equal to 21.

See Also

toTrie, which is even simpler.

Example:

{
    import std.stdio;
    auto set = unicode("Number");
    auto trie = codepointSetTrie!(8, 5, 8)(set);
    writeln("Input code points to test:");
    foreach (line; stdin.byLine)
    {
        int count = 0;
        foreach (dchar ch; line)
            if (trie[ch]) // is number
                count++;
        writefln("Contains %d number code points.", count);
    }
}

Functions
auto codepointSetTrie(Set)(Set set) if (isCodepointSet!Set)
template CodepointSetTrie(sizes...) if (sumOfIntegerTuple!sizes == 21)

Type of Trie generated by codepointSetTrie function.

template codepointTrie(T, sizes...) if (sumOfIntegerTuple!sizes == 21)

A slightly more general tool for building fixed Trie for the Unicode data.

Specifically, unlike codepointSetTrie it allows creating mappings of dchar to an arbitrary type T.

Note

Overloads taking CodepointSets will naturally convert only to bool mapping Tries.

CodepointTrie is the type of Trie as generated by codepointTrie function.

Functions
auto codepointTrie()(T[dchar] map, T defValue = T.init)
auto codepointTrie(R)(R range, T defValue = T.init) if (isInputRange!R && is(typeof(ElementType!R.init[0]) : T) && is(typeof(ElementType!R.init[1]) : dchar))
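As a sketch of mapping code points to a non-boolean payload via the associative-array overload above (the `width` table and its values are made up for illustration):

```d
import std.uni : codepointTrie;

void main()
{
    // hypothetical display-width table for a couple of code points
    ubyte[dchar] width = ['\u00E9' : 1, '\u4E00' : 2];
    // level sizes 8 + 5 + 8 == 21 bits, covering the whole code point range
    auto trie = codepointTrie!(ubyte, 8, 5, 8)(width, 0);
    assert(trie['\u4E00'] == 2);
    assert(trie['x'] == 0); // the default value for unmapped code points
}
```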
template CodepointTrie(T, sizes...) if (sumOfIntegerTuple!sizes == 21)

ditto

template cmpK0(alias Pred)
Functions
bool cmpK0(Value, Key)(Tuple!(Value, Key) a, Tuple!(Value, Key) b)
template buildTrie(Value, Key, Args...) if (isValidArgsForTrie!(Key, Args))

The most general utility for construction of Tries short of using TrieBuilder directly.

Provides a number of convenience overloads. Args is a tuple of the maximum key value followed by predicates that construct an index from a key.

Alternatively, if the first argument is not a value convertible to Key, then the whole tuple of Args is treated as predicates and the maximum Key is deduced from them.

Functions
auto buildTrie(Range)(Range range, Value filler = Value.init) if (isInputRange!Range && is(typeof(Range.init.front[0]) : Value) && is(typeof(Range.init.front[1]) : Key))
auto buildTrie(Range)(Range range, Value filler = Value.init) if (is(TypeOfBitPacked!Value == bool) && isInputRange!Range && is(typeof(Range.init.front[0]) : Key) && is(typeof(Range.init.front[1]) : Key))
auto buildTrie(Range)(Range range, Value filler, bool unsorted) if (isInputRange!Range && is(typeof(Range.init.front[0]) : Value) && is(typeof(Range.init.front[1]) : Key))
auto buildTrie(Range)(Range range, Value filler = Value.init) if (is(TypeOfBitPacked!Value == bool) && isInputRange!Range && is(typeof(Range.init.front) : Key))
auto buildTrie()(Value[] array, Value filler = Value.init) if (isUnsigned!Key)
auto buildTrie(Key, Value)(Value[Key] map, Value filler = Value.init)
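For illustration, a sketch of the pair-range overload, assuming `sliceBits` from std.uni as the index predicates and a key-sorted input; the key space and payload values are made up:

```d
import std.typecons : tuple;
import std.uni : buildTrie, sliceBits;

void main()
{
    // (value, key) pairs, sorted by key
    auto pairs = [tuple(cast(ubyte) 7, 10u), tuple(cast(ubyte) 9, 500u)];
    // maximum key value 1024; index each key by its bits [6, 10), then [0, 6);
    // the leftmost predicate supplies the most significant index bits
    auto trie = buildTrie!(ubyte, uint, 1024, sliceBits!(6, 10), sliceBits!(0, 6))(pairs);
    assert(trie[10] == 7);
    assert(trie[500] == 9);
    assert(trie[11] == 0); // filler value for unmapped keys
}
```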
template Utf8Matcher()
Functions
void badEncoding()
char truncate()(char ch)
static auto encode(size_t sz)(dchar ch) if (sz > 1)
auto build(Set)(Set set)
Types
struct Impl
struct CherryPick
template Utf16Matcher()
Functions
void badEncoding()
auto encode2(dchar ch)
auto build(Set)(Set set)
Types
struct Impl
struct CherryPick
template bitSizeOf(Args...) if (Args.length == 1)
template isBitPacked(T)

Tests if T is some instantiation of BitPacked!(U, x) and thus suitable for packing.

template TypeOfBitPacked(T)

Gives the type U from BitPacked!(U, x) or T itself for every other type.

template Sequence(size_t start, size_t end)
template useItemAt(size_t idx, T) if (isIntegral!T || is(T : dchar))
Functions
size_t impl(const scope T[] arr)
template useLastItem(T)
Functions
size_t impl(const scope T[] arr)
template fullBitSize(Prefix...)
template idxTypes(Key, size_t fullBits, Prefix...)