Latitude programs are assumed to be encoded in UTF-8.
The following characters are considered to be whitespace: space
(U+0020), horizontal tab (U+0009), carriage return (U+000D), and
line feed (U+000A). In general, whitespace is used to separate
tokens, so unless otherwise stated, multiple whitespace characters
will be treated as equivalent to a single character of whitespace.
[TODO: Should some non-ASCII characters be whitespace?]
Aside from whitespace characters, there are three other classes of characters in the Latitude parser: special, semi-special, and normal.
- The special characters are dot (U+002E), comma (U+002C), colon (U+003A), semicolon (U+003B), parentheses (U+0028 and U+0029), brackets (U+005B and U+005D), braces (U+007B and U+007D), single quotes (U+0027), and double quotes (U+0022).
- The semi-special characters are the digits (U+0030 to U+0039), tilde (U+007E), hash (U+0023), and at-sign (U+0040).
- All other non-whitespace characters are normal. Backtick (U+0060) and backslash (U+005C) are reserved for future extensions to the language and should not be used in identifiers.

An identifier is a nonempty string of characters, of which the first must be a normal character and the remaining characters must be either normal or semi-special.
There are two types of identifiers: standard identifiers and operator identifiers. For the most part, they are equivalent. However, operator identifiers will always expect argument lists when called as methods. An operator identifier is one that consists only of operator characters. Any identifier which is not an operator identifier is a standard identifier.
Operator characters are characters which satisfy all of the following properties.
[...] (U+0024).
Note that an identifier ending with an equal-sign (=) is often
called an assignment identifier, as there is a special syntax which
invokes methods with these names in a convenient way.
The following exceptions take precedence over the above rule.
- Three consecutive dots (...) form a valid identifier, called the ellipsis identifier (and its corresponding variable is called the ellipsis variable). As a consequence of this, a single dot is considered a statement terminator, a pair of consecutive dots is almost always a syntax error, and three dots in a row are parsed as the ellipsis identifier.
- Two consecutive colons (::) form a valid identifier. Note that there is no assignment identifier corresponding to this, as ::= is a special syntactic token and not a valid identifier.
- The following strings are not valid identifiers: =, <-, and =>. These lexemes are treated specially by the language for various syntactic constructs. However, identifiers which happen to contain these strings are acceptable, so == or <=>, for example, are perfectly acceptable as identifiers.
- A plus sign (+) or minus sign (-) followed by at least one digit (0 to 9) is not a valid start to an identifier. That is, any string which begins with a plus sign or minus sign followed by at least one digit, followed by any other characters, is not a valid identifier, as it will be parsed as a numerical literal.

Note also that identifiers that begin with a dollar-sign ($) are
treated specially by scope resolution.
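The identifier rules above can be sketched as a small Python check. This is only an illustration, not Latitude's actual lexer: the names `char_class` and `is_identifier` are made up for this example, the character classes are taken from the code points listed earlier, and treating backtick and backslash as outright rejected is an assumption of the sketch. Operator identifiers are not distinguished here.

```python
WHITESPACE = set(" \t\r\n")
SPECIAL = set(".,:;()[]{}'\"")
SEMI_SPECIAL = set("0123456789~#@")
RESERVED = set("`\\")  # assumption: rejected outright in this sketch
NOT_IDENTIFIERS = {"=", "<-", "=>"}


def char_class(ch):
    """Classify a single character (hypothetical helper)."""
    if ch in WHITESPACE:
        return "whitespace"
    if ch in SPECIAL:
        return "special"
    if ch in SEMI_SPECIAL:
        return "semi-special"
    if ch in RESERVED:
        return "reserved"
    return "normal"


def is_identifier(text):
    """Apply the exceptions first, then the general rule."""
    # Exceptions that are valid despite containing special characters.
    if text in ("...", "::"):
        return True
    # Lexemes reserved for syntactic constructs.
    if not text or text in NOT_IDENTIFIERS:
        return False
    # A sign followed by a digit starts a numerical literal.
    if text[0] in "+-" and len(text) > 1 and text[1] in "0123456789":
        return False
    # General rule: normal first, then normal or semi-special.
    if char_class(text[0]) != "normal":
        return False
    return all(char_class(ch) in ("normal", "semi-special") for ch in text[1:])
```

Under these assumptions `==` and `<=>` are accepted because each is made of normal characters and neither is exactly one of the reserved lexemes, while `::=` is rejected because a colon is special.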
A source file consists of zero or more lines, as defined by the following grammar.
<line> ::= <stmt> "."
<stmt> ::= <chain> <stdname> <rhs> |
           <chain> <opname> <rhs1> |
           <literalish>
<rhs> ::= λ |
          <shortarglist> |
          ":=" <stmt> |
          "::=" <stmt> |
          ":" <arglist> |
          <shortarglist> "=" <stmt> |
          "<-" <stmt>
<rhs1> ::= <verysimplechainl> |
           <shortarglist1> |
           ":=" <stmt> |
           "::=" <stmt> |
           ":" <arglist> |
           <shortarglist> "=" <stmt> |
           "<-" <stmt>
<arglist> ::= λ |
              <arg> <arglist1>
<arglist1> ::= λ |
               "," <arg> <arglist1>
<arg> ::= <chain> <stdname> |
          <chain> <opname> <name> |
          <simplechain> <stdname> <shortarglist> |
          <chain> <opname> <shortarglist1> |
          <literalish>
<chain> ::= <chain> <opname> <name> |
            <simplechain>
<simplechain> ::= <simplechain> <stdname> |
                  <simplechain> <stdname> <shortarglist> |
                  <literalish> |
                  λ
<verysimplechain> ::= <verysimplechainl> <stdname> |
                      <verysimplechainl> <stdname> <shortarglist>
<verysimplechainl> ::= <verysimplechain> |
                       <literal> |
                       λ
<shortarglist> ::= "(" <arglist> ")" |
                   "(" <chain> <name> ":" <arg> ")" |
                   <literal>
<shortarglist1> ::= "(" <arglist> ")" |
                    "(" <chain> <name> ":" <arg> ")"
<literalish> ::= "~" <ename> <literalish> |
                 "(" <stmt> ")" |
                 <literal>
<literal> ::= "{" <linelist> "}" |
              <number> |
              "\"" <string> "\"" |
              "'" <ename> |
              "'(" <symbol> ")" |
              "~" <ename> |
              "[" <arglist> "]" |
              "'[" <literallist> "]" |
              "#\"" <rawstring> "\"#" |
              [ "-" | "+" ] "0" <letter> { <alphanum> } |
              "#'" <ename>
<linelist> ::= <line> <linelist> |
               λ
<literallist> ::= <listlit> <literallist1> |
                  λ
<literallist1> ::= "," <listlit> <literallist1> |
                   λ
<listlit> ::= "\"" <string> "\"" |
              "'" <ename> |
              "~" <ename> |
              <number> |
              "'(" <symbol> ")" |
              "[" <literallist> "]" |
              <rawstring> |
              <name>
<name> ::= <stdname> |
           <opname>
Note that a <letter> is any single lowercase or capital alphabetic
letter. Similarly, an <alphanum> is a letter or single digit. A
<stdname> is any valid standard identifier, and an <opname> is any
valid operator identifier. An <ename> is any nonempty sequence of
normal or semi-special characters.
A <number> is a string of characters which satisfies one of the
following regular expressions.
[-+]?[0-9]+(\.[0-9]+)([eE][-+]?[0-9]+)?
[-+]?[0-9]+([eE][-+]?[0-9]+)
[-+]?[0-9]+
[-+]?[0-9]+(\.[0-9]+)?([eE][-+]?[0-9]+)?i
[-+]?[0-9]+(\.[0-9]+)?([eE][-+]?[0-9]+)?[-+][0-9]+(\.[0-9]+)?([eE][-+]?[0-9]+)?i
The first two forms construct a floating-point number; the third form constructs an integer; the fourth and fifth construct complex numbers. For more information on the different types of numbers, refer to Number.
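As a rough illustration, the five expressions above can be tried directly with Python's `re` module, whose syntax is compatible with them. The function name `classify_number` and the returned labels are invented for this sketch; it only shows which form a given string satisfies, not how Latitude builds the resulting value.

```python
import re

# The five regular expressions, in the order given above.
PATTERNS = [
    ("float", r"[-+]?[0-9]+(\.[0-9]+)([eE][-+]?[0-9]+)?"),
    ("float", r"[-+]?[0-9]+([eE][-+]?[0-9]+)"),
    ("integer", r"[-+]?[0-9]+"),
    ("complex", r"[-+]?[0-9]+(\.[0-9]+)?([eE][-+]?[0-9]+)?i"),
    ("complex",
     r"[-+]?[0-9]+(\.[0-9]+)?([eE][-+]?[0-9]+)?"
     r"[-+][0-9]+(\.[0-9]+)?([eE][-+]?[0-9]+)?i"),
]


def classify_number(text):
    """Return the kind of number literal, or None if no form matches."""
    for kind, pattern in PATTERNS:
        # fullmatch requires the whole string to satisfy the pattern.
        if re.fullmatch(pattern, text):
            return kind
    return None
```

Note that every form requires at least one digit before any decimal point and at least one digit after it, so strings such as `1.` and `.5` satisfy none of the expressions.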
Additionally, integer literals can be prefixed with 0x, 0b, or 0o
(case insensitive, following the optional sign) to indicate that
the literal should be parsed in hexadecimal, binary, or octal,
respectively.
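A minimal sketch of reading such a prefixed literal is shown below. The function name `parse_prefixed_integer` is hypothetical, and rejecting a prefix with no digits after it is an assumption of this example rather than something the text above specifies.

```python
import re

RADIX = {"x": 16, "b": 2, "o": 8}
DIGITS = "0123456789abcdef"
# Optional sign, a zero, the radix letter, then the digits.
PREFIXED = re.compile(r"([-+]?)0([xXbBoO])([0-9a-zA-Z]+)")


def parse_prefixed_integer(text):
    """Parse literals such as 0xFF, -0b101, or +0o17 into a Python int."""
    match = PREFIXED.fullmatch(text)
    if match is None:
        raise ValueError(f"not a prefixed integer literal: {text!r}")
    sign, letter, digits = match.groups()
    radix = RADIX[letter.lower()]
    # Check each digit against the alphabet for the chosen radix.
    allowed = DIGITS[:radix]
    if any(d not in allowed for d in digits.lower()):
        raise ValueError(f"invalid digit for base {radix}: {text!r}")
    value = int(digits, radix)
    return -value if sign == "-" else value
```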
A <symbol> consists of any sequence of non-close-paren characters
and backslash literals. Any character other than a backslash (\) or
a close-parenthesis ()) is interpreted literally as part of the
symbol’s name. This includes spaces and newlines. A backslash
interprets the character immediately following it literally, even if
the character following it is another backslash or a
close-parenthesis. Additionally, the backslash escape sequences,
including Unicode escape sequences, that can be used in string
literals can be used in symbol literals enclosed in parentheses.
A <string> consists of a sequence of non-double-quote characters and
backslash escape sequences. Any character other than a backslash (\)
or a double-quote (") is interpreted literally and placed into the
string literal. A quotation mark closes the string literal. Note that
newlines can be placed within strings freely and will be interpreted
as part of the string.
A backslash causes the character immediately following it to be interpreted specially. The following special translations take place.
Sequence | Result |
---|---|
\n | U+000A |
\r | U+000D |
\t | U+0009 |
\a | U+0007 |
\b | U+0008 |
\f | U+000C |
\v | U+000B |
Additionally, strings can contain Unicode escape sequences. A Unicode
escape sequence consists of \u, followed either by four hex digits
or up to six hex digits enclosed in curly braces. This sequence will
be interpreted as a single Unicode character, whose code point is
equal to the hex digits, interpreted as a hexadecimal numeral.
A backslash followed by any other character (including a second
backslash or a double quote, but excluding a u) will be treated as
that second character literally. A backslash followed by a u which
is not a part of a Unicode escape sequence is an invalid lexical token
and will result in a parse error.
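The escape rules for string literals can be sketched as a decoder over the text between the quotation marks. The names `decode_string_body` and `_read_unicode_escape` are invented for this illustration. Requiring at least one hex digit inside the braces, and reporting errors as `ValueError`, are assumptions of the sketch.

```python
SIMPLE_ESCAPES = {
    "n": "\n", "r": "\r", "t": "\t", "a": "\a",
    "b": "\b", "f": "\f", "v": "\v",
}
HEX = set("0123456789abcdefABCDEF")


def _read_unicode_escape(body, start):
    """Read the digits after a backslash-u; return (code point, next index)."""
    if body[start:start + 1] == "{":
        end = body.find("}", start)
        digits = body[start + 1:end] if end != -1 else ""
        if end == -1 or not 1 <= len(digits) <= 6 or not set(digits) <= HEX:
            raise ValueError("malformed braced Unicode escape")
        return int(digits, 16), end + 1
    digits = body[start:start + 4]
    if len(digits) != 4 or not set(digits) <= HEX:
        raise ValueError("malformed four-digit Unicode escape")
    return int(digits, 16), start + 4


def decode_string_body(body):
    """Translate escape sequences in the contents of a string literal."""
    out = []
    i = 0
    while i < len(body):
        ch = body[i]
        if ch != "\\":
            out.append(ch)  # ordinary characters, including newlines
            i += 1
            continue
        if i + 1 >= len(body):
            raise ValueError("backslash at end of string")
        nxt = body[i + 1]
        if nxt == "u":
            code, i = _read_unicode_escape(body, i + 2)
            out.append(chr(code))
        else:
            # Table lookup, else the following character taken literally.
            out.append(SIMPLE_ESCAPES.get(nxt, nxt))
            i += 2
    return "".join(out)
```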
A <rawstring> starts with a #, followed by the opening delimiter,
then any text, then the closing delimiter. The valid delimiter pairs
are (), [], {}, and "". Inside the string’s contents,
backslash is only considered a special character when succeeded by one
of the delimiter characters or another backslash. In any other case,
it is treated literally. Additionally, unescaped brackets of the same
type as the string delimiters can appear in the string, so long as
they appear in balanced pairs. Aside from the difference in parsing,
raw string literals are equivalent to ordinary string literals and are
treated the same way at runtime.
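The following sketch follows the prose description of raw strings. The function name `read_raw_string` is hypothetical. Two points are assumptions made for the example: "delimiter characters" is read as the opening and closing delimiter of the literal being parsed, and an escaping backslash is dropped while the character after it is kept.

```python
PAIRS = {"(": ")", "[": "]", "{": "}", '"': '"'}


def read_raw_string(src, pos=0):
    """Read a raw string at src[pos]; return (contents, index after it)."""
    opener = src[pos + 1:pos + 2]
    if src[pos:pos + 1] != "#" or opener not in PAIRS:
        raise ValueError("not a raw string literal")
    closer = PAIRS[opener]
    out = []
    depth = 0  # nesting depth of unescaped bracket pairs
    i = pos + 2
    while i < len(src):
        ch = src[i]
        if ch == "\\" and src[i + 1:i + 2] in (opener, closer, "\\"):
            out.append(src[i + 1])  # escaped delimiter or backslash
            i += 2
        elif ch == closer and depth == 0:
            return "".join(out), i + 1
        elif ch == closer:
            depth -= 1  # closes a nested bracket
            out.append(ch)
            i += 1
        elif ch == opener:
            depth += 1  # only reachable for bracket delimiters
            out.append(ch)
            i += 1
        else:
            out.append(ch)  # everything else, including lone backslashes
            i += 1
    raise ValueError("unterminated raw string literal")
```

For quote-delimited raw strings the opening and closing delimiters are the same character, so the first unescaped quote always ends the literal and no nesting occurs.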
Line comments in Latitude start with a semicolon (;) and continue
until the next newline character (U+000A). Block comments start with
an opening brace followed by a star ({*) and end with a star
followed by a closing brace (*}). Note that block comments can be
nested, so matching pairs of {* and *} can appear within a block
comment so long as they are balanced. Comments can be placed in the
code anywhere that whitespace would be ignored, so in particular
comments cannot be placed in the middle of an identifier or inside a
string.
Finally, an additional type of line comment is allowed. If a line
begins with #! then that entire line will be ignored by the parser.
This allows Unix-style shebang lines at the start of the file.
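The comment rules can be illustrated with a small comment stripper. This is a simplified sketch: the name `strip_comments` is made up, each block comment is replaced with a single space (an assumption, chosen because comments sit where whitespace would), and only ordinary double-quoted strings are protected. Raw strings and parenthesized symbol literals are not handled.

```python
def _line_end(src, i):
    """Index of the next newline at or after i, or the end of the text."""
    j = src.find("\n", i)
    return len(src) if j == -1 else j


def _string_end(src, i):
    """Index just past the string literal that opens at src[i]."""
    k = i + 1
    while k < len(src):
        if src[k] == "\\":
            k += 2  # skip the escaped character
        elif src[k] == '"':
            return k + 1
        else:
            k += 1
    raise ValueError("unterminated string literal")


def _block_comment_end(src, i):
    """Index just past the (possibly nested) block comment at src[i]."""
    depth = 0
    k = i
    while k < len(src):
        if src.startswith("{*", k):
            depth += 1
            k += 2
        elif src.startswith("*}", k):
            depth -= 1
            k += 2
            if depth == 0:
                return k
        else:
            k += 1
    raise ValueError("unterminated block comment")


def strip_comments(src):
    out = []
    i = 0
    at_line_start = True
    while i < len(src):
        ch = src[i]
        if at_line_start and src.startswith("#!", i):
            i = _line_end(src, i)  # shebang-style line
        elif ch == '"':
            j = _string_end(src, i)
            out.append(src[i:j])  # strings are copied untouched
            i = j
            at_line_start = False
        elif ch == ";":
            i = _line_end(src, i)  # line comment; the newline is kept
        elif src.startswith("{*", i):
            i = _block_comment_end(src, i)
            out.append(" ")
            at_line_start = False
        else:
            out.append(ch)
            i += 1
            at_line_start = ch == "\n"
    return "".join(out)
```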
Latitude has a notion of operator precedence which can be customized
at runtime. When a new file is loaded or a string is evaluated, the
current operator table is accessed through the operators slot on the
current lexical meta object. This slot should contain a dictionary.
Each value of this dictionary shall have slots prec and assoc
defined on it, according to the specification of the Operator object.
Whenever a chain of operators is encountered, the precedence rules are
considered. Operators with higher precedence will bind more tightly
than those with low precedence. Operators with the same precedence and
the same associativity will associate together, and those with the
same precedence but no associativity or differing associativity will
trigger a ParseError. The built-in precedence rules are as follows.
Operator(s) | Precedence | Associativity |
---|---|---|
=~ == === < <= > >= /= | 5 | None |
++ | 10 | Left |
<> | 15 | Left |
<| | 20 | Right |
|> | 25 | Left |
(new operators) | 30 | Left |
+ - | 35 | Left |
/ | 40 | Left |
* | 45 | Left |
^ | 50 | Right |
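To make the precedence rules concrete, here is a small precedence-climbing sketch in Python over a flat list of tokens. It is not Latitude's parser: `parse_chain`, `TABLE`, and the Python `ParseError` class are stand-ins invented for this example, operands are plain strings, and parentheses and the runtime-customizable operator table are left out. The table values are copied from the built-in rules above.

```python
class ParseError(Exception):
    """Stand-in for the error raised on an ambiguous operator chain."""


TABLE = {}
for ops, prec, assoc in [
    ("=~ == === < <= > >= /=", 5, None),
    ("++", 10, "left"),
    ("<>", 15, "left"),
    ("<|", 20, "right"),
    ("|>", 25, "left"),
    ("+ -", 35, "left"),
    ("/", 40, "left"),
    ("*", 45, "left"),
    ("^", 50, "right"),
]:
    for op in ops.split():
        TABLE[op] = (prec, assoc)

NEW_OPERATOR = (30, "left")  # default for operators not in the table


def _parse(tokens, pos, min_prec, prev):
    """prev is the (prec, assoc) of the operator to the left, if any."""
    if pos >= len(tokens):
        raise ParseError("expected an operand")
    lhs = tokens[pos]
    pos += 1
    while pos < len(tokens):
        op = tokens[pos]
        prec, assoc = TABLE.get(op, NEW_OPERATOR)
        if prec < min_prec:
            break
        # Equal precedence is only allowed with matching associativity.
        if prev is not None and prev[0] == prec:
            if assoc is None or prev[1] != assoc:
                raise ParseError(f"cannot chain operator {op!r}")
        next_min = prec if assoc == "right" else prec + 1
        rhs, pos = _parse(tokens, pos + 1, next_min, (prec, assoc))
        lhs = (op, lhs, rhs)
        prev = (prec, assoc)
    return lhs, pos


def parse_chain(tokens):
    """Turn alternating operands and operators into a nested tuple tree."""
    tree, _ = _parse(tokens, 0, 0, None)
    return tree
```

With the built-in table, `parse_chain(["a", "/", "b", "*", "c"])` groups the multiplication first, since `*` (45) binds more tightly than `/` (40), and a chain such as `a == b == c` raises the error because the comparison operators have no associativity.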