std.experimental.lexer

Summary

This module contains a range-based compile-time lexer generator.

Overview

The lexer generator consists of a template mixin, Lexer, along with several helper templates for generating such things as token identifiers.

To write a lexer using this API:

  1. Create the string array constants for your language.
  2. Create aliases for the various token and token identifier types specific to your language.
  3. Create a struct that mixes in the Lexer template mixin and implements the necessary functions.
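The three steps can be sketched as a skeleton struct. This is a hedged sketch only: the exact parameter order of the Lexer mixin should be verified against the module source, and the names lexIdentifier, isSeparating, lexNumber, and lexWhitespace are illustrative assumptions matching the calculator example used throughout this document.

```d
// Step 1 and 2 (constants and aliases) are assumed to be defined as in the
// calculator example below. Step 3: a struct that mixes in Lexer.
struct CalculatorLexer
{
    // Assumed argument order: the token type, the default token function,
    // the token separating function, then the four string array constants.
    mixin Lexer!(Token, lexIdentifier, isSeparating, staticTokens,
        dynamicTokens, possibleDefaultTokens, tokenHandlers);

    // The functions named in tokenHandlers, plus the default token
    // function and the separator test, are implemented by this struct.
    Token lexNumber() { /* consume digits, return a numberLiteral token */ assert(0); }
    Token lexWhitespace() { /* consume whitespace, return a whitespace token */ assert(0); }
    Token lexIdentifier() { /* default token function */ assert(0); }
    bool isSeparating(size_t offset) { /* separator test */ assert(0); }
}
```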

Examples

  • A lexer for D is available here.
  • A lexer for Lua is available here.
  • A lexer for JSON is available here.
Template Parameters

defaultTokenFunction: A function that serves as the default token lexing function. For most languages this will be the identifier lexing function.

tokenSeparatingFunction: A function that is able to determine if an identifier/keyword has come to an end. This function must return bool and take a single size_t argument representing the number of bytes to skip over before looking for a separating character.

staticTokens: A listing of the tokens whose exact value never changes and which cannot possibly be a token handled by the default token lexing function. The most common example of this kind of token is an operator such as "*" or "-" in a programming language.

dynamicTokens: A listing of tokens whose value is variable, such as whitespace, identifiers, number literals, and string literals.

possibleDefaultTokens: A listing of tokens that could possibly be one of the tokens handled by the default token handling function. A common example of this is a keyword such as "for", which looks like the beginning of the identifier "fortunate". tokenSeparatingFunction is called to determine whether the character after the 'r' separates the identifier, indicating that the token is "for", or whether lexing should be turned over to the defaultTokenFunction.

tokenHandlers: A mapping of prefixes to custom token handling function names. The generated lexer searches for the even-indexed elements of this array and then calls the function whose name is the element immediately after the even-indexed element. This is used for lexing complex tokens whose prefix is fixed.

Here are some example constants for a simple calculator lexer:

// There are a near infinite number of valid number literals, so numbers are
// dynamic tokens.
enum string[] dynamicTokens = ["numberLiteral", "whitespace"];

// The operators are always the same, and cannot start a numberLiteral, so
// they are staticTokens
enum string[] staticTokens = ["-", "+", "*", "/"];

// In this simple example there are no keywords or other tokens that could
// look like dynamic tokens, so this is blank.
enum string[] possibleDefaultTokens = [];

// If any whitespace character or digit is encountered, pass lexing over to
// our custom handler functions. These will be demonstrated in an example
// later on.
enum string[] tokenHandlers = [
    "0", "lexNumber",
    "1", "lexNumber",
    "2", "lexNumber",
    "3", "lexNumber",
    "4", "lexNumber",
    "5", "lexNumber",
    "6", "lexNumber",
    "7", "lexNumber",
    "8", "lexNumber",
    "9", "lexNumber",
    " ", "lexWhitespace",
    "\n", "lexWhitespace",
    "\t", "lexWhitespace",
    "\r", "lexWhitespace"
];
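The handler functions named above can be sketched as follows. This is a hedged sketch, not the module's own example: it assumes the Token alias and tok template from the examples later in this document, and the `range` member (the input range) that the Lexer mixin provides; the constructor argument order matches the field list in TokenStructure below.

```d
// Hypothetical handler sketches for the calculator example. Handler
// functions take no arguments and return a Token.
Token lexNumber()
{
    auto mark = range.mark();                 // remember the token start
    while (!range.empty && range.front >= '0' && range.front <= '9')
        range.popFront();
    return Token(tok!"numberLiteral", cast(string) range.slice(mark),
        range.line, range.column, range.index);
}

Token lexWhitespace()
{
    auto mark = range.mark();
    while (!range.empty && (range.front == ' ' || range.front == '\t'
            || range.front == '\r' || range.front == '\n'))
    {
        immutable bool newline = range.front == '\n';
        range.popFront();
        if (newline)
            range.incrementLine();            // keep line/column tracking correct
    }
    return Token(tok!"whitespace", cast(string) range.slice(mark),
        range.line, range.column, range.index);
}
```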

Types

struct TokenStructure(IdType, string extraFields = "")

The token that is returned by the lexer.

Parameters

IdType: The D type of the token's "type" field.
extraFields: A string containing D code for any extra fields that should be included in the token structure body. This string is passed directly to a mixin statement.

Examples

// No extra struct fields are desired in this example, so leave it blank.
alias Token = TokenStructure!(IdType, "");
Token minusToken = Token(tok!"-");
Fields

string text: The text of the token for dynamic tokens. May be `null`, e.g. for static tokens.
size_t line: The 1-based line number at which the start of this token occurs.
size_t column: The 1-based column number at which the start of this token occurs. This is measured in bytes and may not be correct when tab characters are involved.
size_t index: The 0-based byte offset from the beginning of the input at which the start of this token occurs.
IdType type: The token type.

Methods

bool opEquals(ref const typeof(this) other) const
bool opEquals(IdType type) const: Returns true if the token has the given type, false otherwise.
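A brief usage sketch of the two overloads, assuming the tok template defined in the TokenId examples later in this document:

```d
Token minusToken = Token(tok!"-");
Token other = Token(tok!"-");
assert(minusToken == tok!"-"); // opEquals(IdType): compares the type only
assert(minusToken == other);   // opEquals(ref const Token): compares tokens
```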

struct LexerRange

Range structure that wraps the lexer's input.

Fields

const(ubyte)[] bytes: The input bytes.
size_t index: The range's current position.
size_t column: The current column number.
size_t line: The current line number.

Methods

size_t mark()() const: Returns a mark at the current position that can then be used with slice.
void seek()(size_t m): Sets the range to the given position. m is the position to seek to.
const(ubyte)[] slice()(size_t m) const: Returns a slice of the input byte array between the given mark and the current position. m is the beginning index of the slice to return.
bool empty()() const: Implements the range primitive empty.
ubyte front()() const: Implements the range primitive front.
const(ubyte)[] peek(size_t p) const: Returns the current item together with the p items that follow it.
bool startsWith(const(ubyte[]) needle) const: Returns true if the range starts with the given byte sequence.
ubyte peekAt()(size_t offset) const
bool canPeek()(size_t p) const: Returns true if it is possible to peek p bytes ahead.
void popFront()(): Implements the range primitive popFront.
void popFrontN()(size_t n): Implements the algorithm popFrontN more efficiently. This function does not detect or handle newlines.
void incrementLine()(size_t i = 1): Increments the range's line number and resets the column counter.

Constructors

this(const(ubyte)[] bytes, size_t index = 0, size_t column = 1, size_t line = 1): bytes is the lexer input, index the initial offset from the beginning of bytes, column the initial column number, and line the initial line number.
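The mark/slice workflow described above can be sketched as follows. This is a hedged sketch that assumes the range struct is named LexerRange and can be constructed directly on a byte array:

```d
// Scan a run of digits and extract the matched text via mark/slice.
auto input = cast(const(ubyte)[]) "42 + 7";
auto r = LexerRange(input);
auto m = r.mark();                         // remember the token start
while (!r.empty && r.front >= '0' && r.front <= '9')
    r.popFront();                          // advance past each digit
auto text = cast(string) r.slice(m);       // the bytes between mark and cursor
assert(text == "42");
```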

Functions

string tokenStringRepresentation(IdType, alias staticTokens, alias dynamicTokens, alias possibleDefaultTokens)(IdType type) pure nothrow @property @nogc @safe

Looks up the string representation of the given token type.
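For example, a hedged usage sketch assuming the calculator constants and the tok template from the examples elsewhere in this document:

```d
IdType plus = tok!"+";
// Map a token type value back to its original symbol.
assert(tokenStringRepresentation!(IdType, staticTokens, dynamicTokens,
    possibleDefaultTokens)(plus) == "+");
```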

Templates

template TokenIdType(alias staticTokens, alias dynamicTokens, alias possibleDefaultTokens)

Template for determining the type used for a token type.

Selects the smallest unsigned integral type that is able to hold the value staticTokens.length + dynamicTokens.length + possibleDefaultTokens.length. For example if there are 20 static tokens, 30 dynamic tokens, and 10 possible default tokens, this template will alias itself to ubyte, as 20 + 30 + 10 < ubyte.max.

Examples

// In our calculator example this means that IdType is an alias for ubyte.
alias IdType = TokenIdType!(staticTokens, dynamicTokens, possibleDefaultTokens);
template TokenId(IdType, alias staticTokens, alias dynamicTokens, alias possibleDefaultTokens, string symbol)

Generates the token type identifier for the given symbol.

There are two special cases:

  • If symbol is "", then the token identifier will be 0.
  • If symbol is "\0", then the token identifier will be the maximum valid token type identifier.

In all cases this template will alias itself to a constant of type IdType. This template will fail at compile time if symbol is not one of the staticTokens, dynamicTokens, or possibleDefaultTokens.

Examples

template tok(string symbol)
{
    alias tok = TokenId!(IdType, staticTokens, dynamicTokens,
        possibleDefaultTokens, symbol);
}
// num and plus are of type ubyte.
IdType plus = tok!"+";
IdType num = tok!"numberLiteral";