Commit 60ef83694df06a3e880e457cebe9ccaeff00b39e

Authored by Alain Prouté
1 parent 0bc83921

Updated directory lexical_analysis in anubis_distrib/library/

anubis_distrib/library/lexical_analysis/fast_lexer.anubis 0 → 100644
  1 +
  2 +
  3 +
  4 + The Anubis Project
  5 +
  6 + A tool for producing fast buffered lexers.
  7 +
  8 + Copyright (c) Constructive Mathematics 2008.
  9 +
  10 +
  11 + Author: Alain Prouté
  12 +
  13 +
  14 +
  15 + *** Introduction.
  16 +
  17 + This tool is more or less equivalent to the Unix tool LEX/FLEX. It replaces the
  18 + previous, similar tool 'lexer_maker_v2.anubis', which produced lexers that were
  19 + too slow and is now obsolete.
  20 +
  21 + If you want to use this tool, you will have to add:
  22 +
  23 + read lexical_analysis/fast_lexer.anubis
  24 +
  25 + into your source file.
  26 +
  27 +
  28 + Consider a 'source' from which bytes can be read, such as a file, a network connection
  29 + (maybe an SSL connection), a string or a byte array, etc. There are tools for
  30 + getting the bytes from this source one after the other, but in general we are more
  31 + interested in particular sequences of bytes which are called 'tokens'. As an
  32 + example, if the source is the following string:
  33 +
  34 + "344 + 87"
  35 +
  36 + we prefer to read the three 'tokens' "344", "+" and "87" directly (ignoring white
  37 + space) rather than the sequence of bytes '3', '4', '4', ' ', '+', ' ', '8' and '7'.
  38 +
  39 + A 'lexer' is precisely the gadget which does this job easily and fast (and even
  40 + better than described above). It uses lexing streams, which are buffered for
  41 + better performance.
  42 +
  43 +
  44 +
  45 + ---------------------------------- Table of Contents ----------------------------------
  46 +
  47 + *** (1) Regular expressions.
  48 + *** (2) Lexer output.
  49 + *** (3) Lexing streams.
  50 + *** (4) Constructing a lexer.
  51 + *** (5) Plugging several lexers on the same input stream.
  52 +
  53 + ---------------------------------------------------------------------------------------
  54 +
  55 +
  56 +
  57 +
  58 + *** (1) Regular expressions.
  59 +
  60 + Regular expressions are character strings which are used for describing particular sets
  61 + of tokens. Regular expressions are written using ASCII characters, but some of them
  62 + have a special meaning. They are the following:
  63 +
  64 + ( ) [ ] - \ * + | . $ ^ ?
  65 +
  66 + All other characters just represent themselves. For example, the regular expression
  67 + 'abcd' represents only the token 'abcd'.
  68 +
  69 + Parentheses do not represent anything. They are just used for delimiting regular
  70 + expressions. For example '(abcd)' represents the same thing as 'abcd'.
  71 +
  72 + The regular expression '[abcd]' represents the 4 tokens: 'a', 'b', 'c' and 'd'. In
  73 + other words, characters between brackets represent all the tokens made of one and only
  74 + one of these characters. There is a shortcut for ranges of characters. Instead of
  75 + writing
  76 +
  77 + [abcdefghijklmnopqrstuvwxyz]
  78 +
  79 + you may just write '[a-z]'. For example, the regular expression '[a-zA-Z0-9]'
  80 + represents any token made of one and only one alphanumeric character.
  81 +
  82 + If you add a caret just after the opening bracket, the regular expression represents
  83 + all one-byte tokens for all bytes not present within the brackets (i.e. the
  84 + 'complement' in some sense of the previous set). For example, the regular expression
  85 + '[^a-z]' represents all one-byte tokens whose unique character is not a lower case
  86 + letter. Note: a byte is any Word8, so that '[^a-z]' also matches characters with
  87 + codes above 127.
  88 +
  89 + If 'A' is a regular expression, 'A+' represents any non-empty concatenation of tokens
  90 + represented by 'A'. For example, '[a-z]+' represents any non-empty sequence of
  91 + lowercase letters. Similarly, 'A*' represents all the tokens represented by 'A+', plus
  92 + the empty token (the token made of no character at all).
  93 +
  94 + If 'A' and 'B' are regular expressions, 'AB' is a regular expression representing any
  95 + concatenation of a token represented by 'A' and a token represented by 'B'. For
  96 + example, 'a+b+' represents any non-empty sequence of 'a' followed by any non-empty
  97 + sequence of 'b'. As another example, '[A-Z][A-Za-z]*' represents any sequence of
  98 + letters beginning with an upper case letter (hence actually non-empty).
  99 +
  100 + The backslash character escapes the following character. For example, the regular
  101 + expression '\(' represents the token made of the single character '('. Of course, this
  102 + is useful for special characters. However, the sequences '\n', '\r' and '\t' represent
  103 + respectively a line feed, a carriage return and a tab.
  104 +
  105 + If 'A' and 'B' are regular expressions, 'A|B' is a regular expression representing all
  106 + the tokens represented by 'A' and all the tokens represented by 'B'. For example,
  107 + '(a+)|(b+)' represents all non-empty sequences containing either only a's or only b's.
  108 +
  109 + The dot '.' represents any character except '\n'.
  110 +
  111 + If 'A' is a regular expression, '^A' represents any token represented by 'A' provided
  112 + that it appears at the beginning of a line. Similarly, 'A$' represents any token
  113 + represented by 'A' provided that it ends at the end of a line. For example, the regular
  114 + expression '//.*$' matches a one-line Anubis (or C++) comment, and the regular
  115 + expression '^define' matches the keyword 'define' only when it is found in the leftmost
  116 + column.
  117 +
  118 + If 'A' is a regular expression, 'A?' represents all the tokens represented by 'A' plus
  119 + the empty token.
  120 +
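 Combining these operators: as a hedged example (not one used elsewhere in this file),
 a regular expression for optionally signed decimal integers could be written as
 follows.

 (\+|-)?[0-9]+

 Here '(\+|-)?' matches an optional sign (the '+' must be escaped because it is a
 special character) and '[0-9]+' matches a non-empty sequence of digits, so that the
 expression matches tokens such as '42', '+7' and '-128'.
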
  121 +
  122 + When you construct a lexer you provide one or several regular expressions. These
  123 + regular expressions may be syntactically incorrect. For this reason, we have the
  124 + following type for classifying the possible errors:
  125 +
  126 +public type RegExprError:
  127 + premature_end_of_regexpr,
  128 + unexpected_right_par,
  129 + unexpected_right_bracket,
  130 + regexpr_is_empty,
  131 + star_not_following_a_regexpr,
  132 + plus_not_following_a_regexpr,
  133 + question_mark_not_following_a_regexpr,
  134 + non_character_within_brackets,
  135 + misplaced_hyphen,
  136 + unexpected_vbar,
  137 + empty_lexer_description.
  138 +
  139 +
  140 + For your convenience, the next function transforms such an error into a message in
  141 + English.
  142 +
  143 +public define String
  144 + to_English
  145 + (
  146 + RegExprError e
  147 + ).
  148 +
  149 +
  150 +
  151 +
  152 + *** (2) Lexer output.
  153 +
  154 + A single lexer may recognize different sorts of tokens. For example, a lexer may
  155 + recognize 'symbols' (represented say by the regular expression '[a-zA-Z]+'), and
  156 + integers (represented say by the regular expression '[0-9]+'). The role of the lexer is
  157 + not only to recognize such tokens, but also to return them in such a way that their
  158 + sort is obvious. For this reason, it is convenient to define a type of tokens with one
  159 + alternative for each sort of token. In the case of our example, this type could be:
  160 +
  161 + type Token:
  162 + symbol(String name),
  163 + integer(Int value).
  164 +
  165 + The type of tokens for a given lexer is represented in this file by the type parameter
  166 + '$Token'. A lexer returns a datum of type:
  167 +
  168 +public type LexerOutput($Token):
  169 + end_of_input,
  170 + error(ByteArray),
  171 + token($Token).
  172 +
  173 + The lexer returns 'end_of_input' when there is no hope that a next token may be read
  174 + from the input source. In the case of a file this means that the end of the file has
  175 + been reached. In the case of a network connection, this means that the connection has
  176 + been closed or that the read timed out. In the case of a string or a byte array, this
  177 + means that the end of the string or byte array has been reached.
  178 +
  179 + The lexer returns 'error(b)' when no token can be read from the input (but the end of
  180 + the input has not been reached). Some bytes may have been read from the input: they
  181 + could have been the beginning of a token, up to the first byte which cannot be part of
  182 + a token. The next time the lexer is called, it will continue to read from after this
  183 + sequence.
  184 +
  185 + When a token has been recognized, the lexer has the token at its disposal in the form
  186 + of a byte array. In order to transform this byte array into a datum of type '$Token'
  187 + you have to provide a function of type 'ByteArray -> LexerOutput($Token)'. For
  188 + example, if a 'symbol' is to be recognized, the corresponding function could be
  189 + something like this:
  190 +
  191 + (ByteArray b) |-> token(symbol(to_string(b)))
  192 +
  193 + If an integer is to be recognized, the corresponding function could be:
  194 +
  195 + (ByteArray b) |-> if decimal_scan(to_string(b)) is
  196 + {
  197 + failure then error(b),
  198 + success(n) then token(integer(n))
  199 + }
  200 +
  201 + So, in the case of our example (using the type 'Token' above), the lexer may be
  202 + described by the following list of 'lexer items':
  203 +
  204 + [
  205 + lexer_item("[A-Za-z]+",
  206 + success((ByteArray b) |-> token(symbol(to_string(b))))),
  207 + lexer_item("[0-9]+",
  208 + success((ByteArray b) |-> if decimal_scan(to_string(b)) is
  209 + {
  210 + failure then error(b),
  211 + success(n) then token(integer(n))
  212 + }))
  213 + ]
  214 +
  215 + where the type 'LexerItem($Token)' is defined as follows:
  216 +
  217 +public type LexerItem($Token):
  218 + lexer_item(String regular_expression,
  219 + Maybe(ByteArray -> LexerOutput($Token)) action).
  220 +
  221 + If you don't provide a function in a lexer item (using 'failure' instead of 'success'),
  222 + the recognized token is just ignored and the lexer tries to read the next token.
  223 +
  224 + Notice that the most usual use of a lexer is to call it repeatedly until it returns
  225 + 'end_of_input'. However, in some circumstances we want, for example, to check whether
  226 + a whole string matches a regular expression. In this case the lexer is called a first
  227 + time, and if it returns a token it must be called a second time in order to check
  228 + that we have reached the end of the input.
  229 +
  230 +
  231 +
  232 +
  233 + *** (3) Lexing streams.
  234 +
  235 + The lexer recognizes tokens by reading characters from some input. The actual input may
  236 + be either a file, a network connection, a string, a byte array, or anything able to
  237 + provide characters. From any of the above you may construct a 'lexing stream'.
  238 +
  239 +public type LexingStream:... (an opaque type)
  240 +
  241 +public define LexingStream make_lexing_stream(ByteArray b).
  242 +public define LexingStream make_lexing_stream(String s).
  243 +public define Maybe(LexingStream) make_lexing_stream(RStream stream,
  244 + Int buffer_size,
  245 + Int timeout).
  246 +public define Maybe(LexingStream) make_lexing_stream(RWStream stream,
  247 + Int buffer_size,
  248 + Int timeout).
  249 +public define Maybe(LexingStream) make_lexing_stream(SSL_Connection stream,
  250 + Int buffer_size,
  251 + Int timeout).
  252 +
  253 + In the case of a file or network connection (first argument of type 'RStream',
  254 + 'RWStream', 'SSL_Connection') byte arrays are used for buffering the input. The maximal
  255 + size of these buffers must be provided as the second argument. This choice has no
  256 + effect on the behavior of the lexer, except with respect to performance, and the
  257 + lexer can still return tokens longer than this size. The timeout is in seconds and is
  258 + used each time the buffer is reloaded from the actual input. When the timeout expires,
  259 + the lexer gives up as if the end of the input had been reached. So, you may have to
  260 + give a rather high value to this timeout.
  261 +
  262 + 'make_lexing_stream' returns 'failure' if a read error or timeout occurs when the
  263 + buffer is loaded for the first time.
  264 +
  265 + In the case of a byte array or a string, the situation is much simpler. The buffer is
  266 + the byte array or the string itself, no timeout is needed and the result has no
  267 + 'Maybe'.
  268 +
  269 + If you need another kind of lexing stream, have a look at the private part of this
  270 + file, in particular at the actual definition of type 'LexingStream', and write down
  271 + another such function.
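
 As an illustration (only the functions declared above are used; 'some_rstream' is a
 hypothetical datum of type 'RStream', and the buffer size and timeout values are
 arbitrary):

 make_lexing_stream("344 + 87")              // a LexingStream, directly
 make_lexing_stream(some_rstream, 4096, 60)  // a Maybe(LexingStream): 4096 byte
                                             // buffers, 60 second timeout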
  272 +
  273 +
  274 +
  275 +
  276 + *** (4) Constructing a lexer.
  277 +
  278 + In order to construct a lexer, use the following:
  279 +
  280 +public define Result(RegExprError, LexingStream -> One -> LexerOutput($Token))
  281 + make_lexer
  282 + (
  283 + List(LexerItem($Token)) lexer_description
  284 + ).
  285 +
  286 + Thus, a lexer is constructed (if no error occurs) as a function of type:
  287 +
  288 + LexingStream -> One -> LexerOutput($Token)
  289 +
  290 + Applying this function to a lexing stream is understood as 'plugging' it into the
  291 + stream. The result is a function of type:
  292 +
  293 + One -> LexerOutput($Token)
  294 +
  295 + to be used repeatedly until it returns 'end_of_input'.
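
 Here is a sketch of the whole procedure, reusing the hypothetical type 'Token' and
 the lexer items of section (2), plus an extra item which silently skips white space.
 Everything except the '...' parts comes from the declarations above:

 if make_lexer([lexer_item("[A-Za-z]+",
                           success((ByteArray b) |-> token(symbol(to_string(b))))),
                lexer_item("[ \t\n\r]+", failure)]) is
 {
   error(e) then ... , // report to_English(e)
   ok(lexer) then
     with next_token = lexer(make_lexing_stream("foo bar")),
     ... // apply 'next_token' repeatedly until it returns end_of_input
 }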
  296 +
  297 +
  298 +
  299 +
  300 +
  301 + *** (5) Plugging several lexers on the same input.
  302 +
  303 + It is often the case that we have to use several lexers on the same input. This is
  304 + equivalent to saying that we have only one lexer on this input but with several
  305 + different 'states', in the sense of LEX/FLEX for example. In our system there is no
  306 + notion of 'state' for lexers, but several lexers may use the same lexing stream
  307 + concurrently. You can plug them into the same lexing stream, and use them repeatedly
  308 + in any order depending on the sort of thing you want to read from the stream.
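
 For example (a sketch with two hypothetical lexers 'code_lexer' and 'comment_lexer',
 both constructed with 'make_lexer'):

 with ls = make_lexing_stream(source),
      next_code = code_lexer(ls),
      next_comment = comment_lexer(ls),
 ... // call next_code or next_comment, in any order

 Both functions read from the same buffer, so the tokens come out in the order in
 which they appear in the input, whichever lexer reads them.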
  309 +
  310 +
  311 +
  312 +
  313 +
  314 + --- That's all for the public part ! --------------------------------------------------
  315 +
  316 +
  317 +read tools/basis.anubis
  318 +read tools/streams.anubis
  319 +
  320 +
  321 + -------------------------------- Table of Contents ------------------------------------
  322 +
  323 +
  324 + ---------------------------------------------------------------------------------------
  325 +
  326 +
  327 +
  328 +
  329 + *** [1] Parsing regular expressions.
  330 +
  331 +
  332 + *** [1.1] Regular expressions.
  333 +
  334 + Regular expressions are formalized as follows.
  335 +
  336 +public type RegExpr:
  337 + char(Word8), // a
  338 + choice(List(Word8)), // [abc]
  339 + plus(RegExpr), // a+
  340 + star(RegExpr), // a*
  341 + cat(RegExpr,RegExpr), // ab
  342 + or(RegExpr,RegExpr), // (a|b)
  343 + beginning_of_line, // ^
  344 + end_of_line, // $
  345 + dot, // .
  346 + question_mark(RegExpr). // a?
  347 +
  348 +
  349 +
  350 + *** [1.2] Basic regular expressions.
  351 +
  352 + Basic regular expressions are enough for representing all regular expressions. In other
  353 + words, any regular expression is equivalent to a basic regular expression. Furthermore,
  354 + at some point of the construction of lexers we have to handle 'actions'. We introduce
  355 + them here even if we generate them only in 'dfa_compiler.anubis'. This also makes the
  356 + type 'LexerOutput($Token)' required at this point.
  357 +
  358 +public type BasicRegExpr($Token):
  359 + char(Word8),
  360 + star(BasicRegExpr($Token)),
  361 + or(BasicRegExpr($Token),BasicRegExpr($Token)),
  362 + cat(BasicRegExpr($Token),BasicRegExpr($Token)),
  363 + epsilon, // matches the empty sequence of characters
  364 + beginning_of_line,
  365 + end_of_line,
  366 + action(Maybe(ByteArray -> LexerOutput($Token))).
  367 +
  368 + The role of 'epsilon', which matches only the empty lexeme, is to provide a
  369 + representation for the empty choice '[]', and for regular expressions of the form 'A?',
  370 + which are translated into 'or(A,epsilon)'.
  371 +
  372 + The following function transforms a regular expression into an equivalent basic regular
  373 + expression.
  374 +
  375 +public define BasicRegExpr($Token)
  376 + to_basic
  377 + (
  378 + RegExpr e
  379 + ).
  380 +
  381 +
  382 +
  383 + *** [1.3] 'Extended' characters.
  384 +
  385 + 'Extended' characters (used in regular expressions) are defined (and classified) as
  386 + follows.
  387 +
  388 +type ExChar:
  389 + left_par, // (
  390 + right_par, // )
  391 + left_bracket, // [
  392 + right_bracket, // ]
  393 + star, // *
  394 + plus, // +
  395 + or, // |
  396 + dot, // .
  397 + dollar, // $
  398 + caret, // ^
  399 + hyphen, // -
  400 + question_mark, // ?
  401 + char(Word8). // a, b, c, ...
  402 +
  403 +
  404 +
  405 +
  406 + *** [1.4] Getting the next (extended) character from the input stream.
  407 +
  408 + The next function reads an extended character from the input stream. It returns
  409 + 'failure' when it encounters the end of the input.
  410 +
  411 +define Maybe(ExChar)
  412 + next_exchar
  413 + (
  414 + Stream s
  415 + ) =
  416 + if read_byte(s) is
  417 + {
  418 + failure then failure,
  419 + success(c) then
  420 + if c = '\'
  421 + then if read_byte(s) is
  422 + {
  423 + failure then failure,
  424 + success(d) then
  425 + if d = 'n' then success(char('\n')) else
  426 + if d = 'r' then success(char('\r')) else
  427 + if d = 't' then success(char('\t')) else
  428 + success(char(d))
  429 + }
  430 + else if c = '(' then success(left_par)
  431 + else if c = ')' then success(right_par)
  432 + else if c = '[' then success(left_bracket)
  433 + else if c = ']' then success(right_bracket)
  434 + else if c = '|' then success(or)
  435 + else if c = '*' then success(star)
  436 + else if c = '+' then success(plus)
  437 + else if c = '.' then success(dot)
  438 + else if c = '$' then success(dollar)
  439 + else if c = '^' then success(caret)
  440 + else if c = '-' then success(hyphen)
  441 + else if c = '?' then success(question_mark)
  442 + else success(char(c))
  443 + }.
  444 +
  445 +
  446 +
  447 +
  448 +
  449 +
  450 + *** [1.5] Tools.
  451 +
  452 + *** [1.5.1] Truncating a Word32 to a Word8.
  453 +
  454 +define Word8
  455 + truncate_to_Word8
  456 + (
  457 + Word32 x
  458 + ) =
  459 + if x is word32(l1,_) then if l1 is word16(l2,_) then l2.
  460 +
  461 +
  462 +
  463 + *** [1.5.2] Creating a range of consecutive characters.
  464 +
  465 + Given a first character and a last character, create the list of all characters between
  466 + these two (inclusive).
  467 +
  468 +define List(Word8)
  469 + range
  470 + (
  471 + Word8 a,
  472 + Word8 z
  473 + ) =
  474 + if z = a then [a] else [a . range(a+1,z)].
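
 For example, range('a','d') yields the list ['a','b','c','d'].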
  475 +
  476 +
  477 +
  478 +
  479 + *** [1.5.3] Computing the complement of a set of characters.
  480 +
  481 + Compute the 'complement' of a choice, i.e. the list of all characters which do not
  482 + belong to the given choice.
  483 +
  484 +define List(Word8)
  485 + complement_choice
  486 + (
  487 + List(Word8) l,
  488 + List(Word8) result,
  489 + Word32 n
  490 + ) =
  491 + if n = -1 then result else
  492 + with c = truncate_to_Word8(n),
  493 + if member(l,c)
  494 + then complement_choice(l,result,n-1)
  495 + else complement_choice(l,[c . result],n-1).
  496 +
  497 +
  498 +
  499 +
  500 +
  501 + *** [1.5.4] Concatenating a list of regular expressions (in reverse order).
  502 +
  503 + Concatenate a (non-empty) list of RegExpr in reverse order:
  504 +
  505 +define RegExpr
  506 + cat_list
  507 + (
  508 + RegExpr last,
  509 + List(RegExpr) others
  510 + ) =
  511 + if others is
  512 + {
  513 + [ ] then last,
  514 + [h . t] then cat(cat_list(h,t),last)
  515 + }.
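
 For example, if the expression 'abc' has been read, the accumulated list is
 [char('c'),char('b'),char('a')], and

 cat_list(char('c'),[char('b'),char('a')])

 returns cat(cat(char('a'),char('b')),char('c')), i.e. the concatenation in the
 original order.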
  516 +
  517 +
  518 +
  519 +
  520 + *** [1.5.5] Reading a 'choice' of characters.
  521 +
  522 + Reading a 'choice', i.e. the characters within square brackets.
  523 +
  524 +define Result(RegExprError,List(Word8))
  525 + read_choice
  526 + (
  527 + Stream s,
  528 + List(Word8) already_read
  529 + ) =
  530 + if next_exchar(s) is
  531 + {
  532 + failure then error(premature_end_of_regexpr),
  533 + success(x) then
  534 + if x is right_bracket then ok(already_read) else
  535 + if x is char(c) then read_choice(s,[c . already_read]) else
  536 + if x is hyphen then
  537 + if already_read is
  538 + {
  539 + [ ] then error(misplaced_hyphen),
  540 + [a . others] then
  541 + if next_exchar(s) is
  542 + {
  543 + failure then error(premature_end_of_regexpr),
  544 + success(y) then
  545 + if y is char(z)
  546 + then read_choice(s,reverse_append(range(a,z),others))
  547 + else error(non_character_within_brackets)
  548 + }
  549 + }
  550 + else error(non_character_within_brackets)
  551 + }.
  552 +
  553 +
  554 +
  555 +
  556 +
  557 + *** [1.5.6] Reading a complemented 'choice' of characters.
  558 +
  559 + The same as the previous one, but giving the complement of the 'choice'.
  560 +
  561 +define Result(RegExprError,List(Word8))
  562 + read_counter_choice
  563 + (
  564 + Stream s,
  565 + List(Word8) already_read
  566 + ) =
  567 + if read_choice(s,already_read) is
  568 + {
  569 + error(msg) then error(msg),
  570 + ok(l) then ok(complement_choice(l,[],255))
  571 + }.
  572 +
  573 +
  574 +
  575 +
  576 + *** [1.5.7] Reading a 'choice' (general case).
  577 +
  578 + The following function is called when a left bracket has been read. It reads extended
  579 + characters until the right bracket is found.
  580 +
  581 +define Result(RegExprError,List(Word8))
  582 + read_within_brackets
  583 + (
  584 + Stream s
  585 + ) =
  586 + if next_exchar(s) is
  587 + {
  588 + failure then error(premature_end_of_regexpr),
  589 + success(x) then
  590 + if x = caret
  591 + then read_counter_choice(s,[])
  592 + else if x is char(c) then read_choice(s,[c])
  593 + else error(non_character_within_brackets)
  594 + }.
  595 +
  596 +
  597 +
  598 +
  599 +
  600 +
  601 + *** [1.6] Reading a regular expression.
  602 +
  603 +
  604 +
  605 + *** [1.6.1] Right delimiters.
  606 +
  607 +type RightDelimiter:
  608 + right_par,
  609 + end_of_regexpr.
  610 +
  611 +
  612 +
  613 +
  614 + *** [1.6.2] Recursive reading.
  615 +
  616 +define Result(RegExprError,RegExpr)
  617 + read_regexpr
  618 + (
  619 + Stream s,
  620 + List(RegExpr) already_read,
  621 + RightDelimiter delim
  622 + ) =
  623 + if next_exchar(s) is
  624 + {
  625 + failure then
  626 + if delim is
  627 + {
  628 + right_par then
  629 + error(premature_end_of_regexpr),
  630 +
  631 + end_of_regexpr then
  632 + if already_read is
  633 + {
  634 + [ ] then error(regexpr_is_empty),
  635 + [last . others] then
  636 + ok(cat_list(last,others))
  637 + }
  638 + },
  639 +
  640 + success(ec) then
  641 + if ec is
  642 + {
  643 + left_par then
  644 + if read_regexpr(s,[],right_par) is
  645 + {
  646 + error(msg) then
  647 + error(msg),
  648 +
  649 + ok(r1) then
  650 + read_regexpr(s,[r1 . already_read],delim)
  651 + },
  652 +
  653 + right_par then
  654 + if delim is
  655 + {
  656 + right_par then
  657 + if already_read is
  658 + {
  659 + [ ] then
  660 + error(unexpected_right_par),
  661 +
  662 + [last . others] then
  663 + ok(cat_list(last,others))
  664 + },
  665 +
  666 + end_of_regexpr then
  667 + error(unexpected_right_par)
  668 + },
  669 +
  670 + left_bracket then
  671 + if read_within_brackets(s) is
  672 + {
  673 + error(msg) then error(msg),
  674 +
  675 + ok(r1) then if already_read is
  676 + {
  677 + [ ] then
  678 + read_regexpr(s,[choice(r1)],delim),
  679 +
  680 + [last . others] then
  681 + read_regexpr(s,[choice(r1),last . others],delim)
  682 + }
  683 + },
  684 +
  685 + right_bracket then
  686 + error(unexpected_right_bracket),
  687 +
  688 + star then
  689 + if already_read is
  690 + {
  691 + [ ] then
  692 + error(star_not_following_a_regexpr),
  693 +
  694 + [last . others] then
  695 + read_regexpr(s,[star(last) . others],delim)
  696 + },
  697 +
  698 + plus then
  699 + if already_read is
  700 + {
  701 + [ ] then
  702 + error(plus_not_following_a_regexpr),
  703 +
  704 + [last . others] then
  705 + read_regexpr(s,[plus(last) . others],delim)
  706 + },
  707 +
  708 + or then
  709 + if read_regexpr(s,[],delim) is
  710 + {
  711 + error(msg) then error(msg),
  712 +
  713 + ok(r1) then
  714 + if already_read is
  715 + {
  716 + [ ] then error(unexpected_vbar),
  717 + [h . t] then
  718 + ok(or(cat_list(h,t),r1))
  719 + }
  720 + },
  721 +
  722 + dot then
  723 + read_regexpr(s,[dot . already_read], delim),
  724 +
  725 + dollar then
  726 + read_regexpr(s,[end_of_line . already_read], delim),
  727 +
  728 + caret then
  729 + read_regexpr(s,[beginning_of_line . already_read], delim),
  730 +
  731 + hyphen then
  732 + error(misplaced_hyphen),
  733 +
  734 + question_mark then
  735 + if already_read is
  736 + {
  737 + [ ] then
  738 + error(question_mark_not_following_a_regexpr),
  739 +
  740 + [last . others] then
  741 + read_regexpr(s,[question_mark(last) . others],delim)
  742 + },
  743 +
  744 + char(c) then
  745 + read_regexpr(s,[char(c) . already_read], delim)
  746 + }
  747 + }.
  748 +
  749 +
  750 +
  751 +
  752 + *** [1.6.3] Normalizing a regular expression.
  753 +
  754 + This amounts to adding '(^)?' at the beginning of every regular expression not
  755 + beginning with '^', and '($)?' at the end of every regular expression not ending with '$'.
  756 +
  757 +define Bool
  758 + begins_by_bol
  759 + (
  760 + RegExpr re
  761 + ) =
  762 + if re is
  763 + {
  764 + char(Word8 _0) then false,
  765 + choice(List(Word8) _0) then false,
  766 + plus(RegExpr _0) then false,
  767 + star(RegExpr _0) then false,
  768 + cat(RegExpr _0,RegExpr _1) then begins_by_bol(_0),
  769 + or(RegExpr _0,RegExpr _1) then false,
  770 + beginning_of_line then true,
  771 + end_of_line then false,
  772 + dot then false,
  773 + question_mark(RegExpr _0) then false
  774 + }.
  775 +
  776 +define Bool
  777 + ends_by_eol
  778 + (
  779 + RegExpr re
  780 + ) =
  781 + if re is
  782 + {
  783 + char(Word8 _0) then false,
  784 + choice(List(Word8) _0) then false,
  785 + plus(RegExpr _0) then false,
  786 + star(RegExpr _0) then false,
  787 + cat(RegExpr _0,RegExpr _1) then ends_by_eol(_1),
  788 + or(RegExpr _0,RegExpr _1) then false,
  789 + beginning_of_line then false,
  790 + end_of_line then true,
  791 + dot then false,
  792 + question_mark(RegExpr _0) then false
  793 + }.
  794 +
  795 +
  796 +define RegExpr
  797 + normalize
  798 + (
  799 + RegExpr re
  800 + ) =
  801 + with re1 = if begins_by_bol(re) then re else cat(question_mark(beginning_of_line),re),
  802 + if ends_by_eol(re1) then re1 else cat(re1,question_mark(end_of_line)).
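
 For example, normalizing cat(char('a'),char('b')) (the regular expression 'ab')
 yields

 cat(cat(question_mark(beginning_of_line),cat(char('a'),char('b'))),
     question_mark(end_of_line))

 since 'ab' neither begins with '^' nor ends with '$'.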
  803 +
  804 +
  805 +
  806 +
  807 + *** [1.6.4] The tool for parsing regular expressions.
  808 +
  809 +define Result(RegExprError,RegExpr)
  810 + parse_regular_expression
  811 + (
  812 + Stream s
  813 + ) =
  814 + if read_regexpr(s,[],end_of_regexpr) is
  815 + {
  816 + error(msg) then error(msg),
  817 + ok(re) then ok(normalize(re))
  818 + }.
  819 +
  820 +
  821 +
  822 +
  823 +
  824 + *** [1.7] Transforming a regular expression into a basic one.
  825 +
  826 + *** [1.7.1] Expanding a 'choice' of characters.
  827 +
  828 + Given a list of characters (a 'choice sequence'), compute the corresponding basic regular
  829 + expression.
  830 +
  831 +define BasicRegExpr($Token)
  832 + expand_choice
  833 + (
  834 + List(Word8) l
  835 + ) =
  836 + if l is
  837 + {
  838 + [ ] then epsilon,
  839 + [h . t] then
  840 + if t is [ ] then char(h) else
  841 + or(char(h),expand_choice(t))
  842 + }.
  843 +
  844 +
  845 +
  846 + *** [1.7.2] The tool for converting to basic.
  847 +
  848 + Convert a regular expression to a basic one.
  849 +
  850 +public define BasicRegExpr($Token)
  851 + to_basic
  852 + (
  853 + RegExpr r
  854 + ) =
  855 + if r is
  856 + {
  857 + char(c) then char(c),
  858 + choice(l) then expand_choice(l),
  859 + plus(r1) then with br = to_basic(r1), cat(br,star(br)),
  860 + star(r1) then star(to_basic(r1)),
  861 + cat(r1,r2) then cat(to_basic(r1),to_basic(r2)),
  862 + or(r1,r2) then or(to_basic(r1),to_basic(r2)),
  863 + beginning_of_line then beginning_of_line,
  864 + end_of_line then end_of_line,
  865 + dot then expand_choice(reverse_append(range(0,'\n'-1),
  866 + range('\n'+1,255))),
  867 + question_mark(r1) then or(epsilon,to_basic(r1))
  868 + }.
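
 For example, to_basic(plus(char('a'))) returns cat(char('a'),star(char('a'))), and
 to_basic(question_mark(char('a'))) returns or(epsilon,char('a')).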
  869 +
  870 +
  871 +
  872 +
  873 + *** [1.8] Formatting error messages into English.
  874 +
  875 +public define String
  876 + to_English
  877 + (
  878 + RegExprError e
  879 + ) =
  880 + if e is
  881 + {
  882 + premature_end_of_regexpr then "Premature end of regular expression.",
  883 + unexpected_right_par then "Unexpected right parenthesis.",
  884 + unexpected_right_bracket then "Unexpected right bracket.",
  885 + regexpr_is_empty then "Regular expression is empty.",
  886 + star_not_following_a_regexpr then "Found '*' not following any regular expression.",
  887 + plus_not_following_a_regexpr then "Found '+' not following any regular expression.",
  888 + question_mark_not_following_a_regexpr then "Found '?' not following any regular expression.",
  889 + non_character_within_brackets then "Non character within brackets.",
  890 + misplaced_hyphen then "Misplaced hyphen.",
  891 + unexpected_vbar then "Misplaced vertical bar.",
  892 + empty_lexer_description then "Empty lexer description."
  893 + }.
  894 +
  895 +
  896 +
  897 +
  898 +
  899 +
  900 +
  901 +
  902 + *** [2] Lexing streams.
  903 +
  904 + *** [2.1] The type 'LexingStream'.
  905 +
  906 + A lexing stream provides the ad hoc tools for using the low level fast lexers as
  907 + defined in section 13 of predefined.anubis:
  908 +
  909 + - a variable 'buffer_v' containing the current buffer,
  910 + - a variable 'start_v' giving the starting position of the current lexeme within the buffer,
  911 + - a variable 'last_accept_v' giving the last accepting position (if any),
  912 + - a variable 'current_v' giving the current position of reading within the buffer,
  913 + - a function 'reload_buffer' for loading new bytes from the input.
  914 +
  915 +
  916 +public type LexingStream:
  917 + lexing_stream
  918 + (
  919 + Var(ByteArray) buffer_v, // the current buffer
  920 + Var(Int) start_v, // start of lexeme in buffer
  921 + Var(FastLexerLastAccepted) last_accept_v, // last accepting position (if any)
  922 + Var(Int) current_v, // position of reading in buffer
  923 + Int -> Maybe(One) reload_buffer // command for loading the sequel in the buffer
  924 + ).
  925 +
  926 + While we are reading a lexeme, we keep the starting position (offset of first character
  927 + of the current lexeme) in 'start_v' so as to be able to extract the lexeme. We also
  928 + keep the last position at which a lexeme was accepted. This is because the lexer always
  929 + tries to read the longest possible lexeme. If at some point the lexeme is rejected,
  930 + and if there is a last accepting position, the current position goes back to this last
  931 + accepting position, and the lexeme is accepted.
  932 +
  933 + 'reload_buffer' works as follows. It returns 'failure' if there is nothing more to be
  934 + read from the actual input (the connection is down, the end of the file has been
  935 + reached, or time is out). In this case, the current buffer is unchanged.
  936 +
  937 + Otherwise, it reads a chunk of bytes (say V) from the actual input, extracts the
  938 + part of the current buffer starting at the argument (say U), and establishes U+V as
  939 + the new current buffer. The other variables are updated accordingly.
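 This reload logic can be sketched in Python (an illustrative model only;
 'read_more' is a hypothetical stand-in for whatever reads the next chunk from
 the actual input, and the Anubis variables become slots of a dictionary):

```python
def reload_buffer(state, read_more):
    """Drop the consumed bytes and append a fresh chunk (the U+V of the text).

    'state' holds "buffer", "start", "current" and "last_accept" slots.
    Returns False (and leaves the buffer unchanged) when nothing more can
    be read; otherwise updates all positions and returns True.
    """
    chunk = read_more()                 # V: the next bytes from the input
    if chunk is None:                   # nothing more to read
        return False
    dropped = state["start"]            # bytes before the current lexeme
    state["buffer"] = state["buffer"][dropped:] + chunk   # U + V
    state["start"] = 0                  # all positions shift left by 'dropped'
    state["current"] -= dropped
    if state["last_accept"] is not None:
        state["last_accept"] -= dropped
    return True
```

 Note how 'start', 'current' and the last accepting position are all shifted by
 the number of dropped bytes, exactly as in the 'make_lexing_stream' definitions
 below.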
  940 +
  941 +
  942 +
  943 +
  944 + *** [2.2] Constructing lexing streams.
  945 +
  946 + *** [2.2.1] From a byte array.
  947 +
  948 +public define LexingStream
  949 + make_lexing_stream
  950 + (
  951 + ByteArray b
  952 + ) =
  953 + lexing_stream(var(b), // buffer
  954 + var(0), // starting position
  955 + var(none), // last accepting position
  956 + var(0), // current position
  957 + (Int u) |-> failure). // buffer cannot be reloaded
  958 +
  959 +
  960 +
  961 +
  962 + *** [2.2.2] From a string.
  963 +
  964 +public define LexingStream
  965 + make_lexing_stream
  966 + (
  967 + String s
  968 + ) =
  969 + make_lexing_stream(to_byte_array(s)).
  970 +
  971 +
  972 +
  973 +
  974 + *** [2.2.3] From a read only stream.
  975 +
  976 +public define Maybe(LexingStream)
  977 + make_lexing_stream
  978 + (
  979 + RStream stream,
  980 + Int buffer_size,
  981 + Int timeout
  982 + ) =
  983 + if read(stream,buffer_size,timeout) is
  984 + {
  985 + error then failure,
  986 + timeout then failure,
  987 + ok(buffer) then
  988 + with buffer_v = var(buffer),
  989 + start_v = var((Int)0),
  990 + last_accepted_v = var((FastLexerLastAccepted)none),
  991 + current_v = var((Int)0),
  992 + reload_buffer = (Int i) |->
  993 + if read(stream,buffer_size,timeout) is
  994 + {
  995 + error then failure,
  996 + timeout then failure,
  997 + ok(more) then
  998 + //print("Buffer reloaded ("+abs_to_decimal(length(more))+" bytes).\n");
  999 + if length(more) = 0
  1000 + then (with old_buffer = *buffer_v,
  1001 + old_length = length(old_buffer),
  1002 + dropped = *start_v, // number of bytes dropped from old buffer
  1003 + buffer_v <- extract(old_buffer,dropped,old_length);
  1004 + start_v <- 0;
  1005 + current_v <- *current_v - dropped;
  1006 + last_accepted_v <-
  1007 + if *last_accepted_v is
  1008 + {
  1009 + none then none,
  1010 + last(s,a) then last(s,a - dropped)
  1011 + };
  1012 + failure)
  1013 + else (with old_buffer = *buffer_v,
  1014 + old_length = length(old_buffer),
  1015 + dropped = *start_v, // number of bytes dropped from old buffer
  1016 + buffer_v <- extract(old_buffer,dropped,old_length)+more;
  1017 + start_v <- 0;
  1018 + current_v <- *current_v - dropped;
  1019 + last_accepted_v <-
  1020 + if *last_accepted_v is
  1021 + {
  1022 + none then none,
  1023 + last(s,a) then last(s,a - dropped)
  1024 + };
  1025 + success(unique))
  1026 + },
  1027 + success(lexing_stream(buffer_v,
  1028 + start_v,
  1029 + last_accepted_v,
  1030 + current_v,
  1031 + reload_buffer))
  1032 + }.
  1033 +
  1034 +
  1035 +
  1036 +
  1037 + *** [2.2.4] From a read/write stream.
  1038 +
  1039 +public define Maybe(LexingStream)
  1040 + make_lexing_stream
  1041 + (
  1042 + RWStream stream,
  1043 + Int buffer_size,
  1044 + Int timeout
  1045 + ) =
  1046 + make_lexing_stream(weaken(stream),buffer_size,timeout).
  1047 +
  1048 +
  1049 +
  1050 +
  1051 + *** [2.2.5] From an SSL connection.
  1052 +
  1053 +public define Maybe(LexingStream)
  1054 + make_lexing_stream
  1055 + (
  1056 + SSL_Connection stream,
  1057 + Int buffer_size,
  1058 + Int timeout
  1059 + ) =
  1060 + if (Maybe(ByteArray))read(stream,buffer_size,timeout) is
  1061 + {
  1062 + failure then failure,
  1063 + success(buffer) then
  1064 + with buffer_v = var(buffer),
  1065 + start_v = var((Int)0),
  1066 + last_accepted_v = var((FastLexerLastAccepted)none),
  1067 + current_v = var((Int)0),
  1068 + reload_buffer = (Int i) |->
  1069 + if (Maybe(ByteArray))read(stream,buffer_size,timeout) is
  1070 + {
  1071 + failure then failure,
  1072 + success(more) then
  1073 + if length(more) = 0
  1074 + then failure
  1075 + else with old_buffer = *buffer_v,
  1076 + old_length = length(old_buffer),
  1077 + dropped = *start_v, // number of bytes dropped from old buffer
  1078 + buffer_v <- extract(old_buffer,dropped,old_length)+more;
  1079 + start_v <- 0;
  1080 + current_v <- *current_v - dropped;
  1081 + last_accepted_v <-
  1082 + if *last_accepted_v is
  1083 + {
  1084 + none then none,
  1085 + last(s,a) then last(s,a - dropped)
  1086 + };
  1087 + success(unique)
  1088 + },
  1089 + success(lexing_stream(buffer_v,
  1090 + start_v,
  1091 + last_accepted_v,
  1092 + current_v,
  1093 + reload_buffer))
  1094 + }.
  1095 +
  1096 +
  1097 +
  1098 +
  1099 +
  1100 +
  1101 + *** [3] Constructing the automaton.
  1102 +
  1103 + The description of a lexer is given as a list of 'LexerItem($Token)', where the
  1104 + parameter '$Token' represents the type of tokens. Each lexer item is made of a regular
  1105 + expression and an action. If the action is 'failure', the token just read is ignored
  1106 + and the lexer tries to read the next one. Otherwise, the action is applied to the
  1107 + lexeme just read, and the result of the action is returned by the lexer. The type
  1108 + 'LexerOutput($Token)' is defined in 'regexpr_parser.anubis'.
  1109 +
  1110 +
  1111 + A DFA is presented as a list of states. Each state is either accepting or
  1112 + rejecting. Each state has a name (of type Word32), and a list of transitions. Accepting
  1113 + states also have the corresponding 'action'.
  1114 +
  1115 + Each transition has a 'label' and the name of a state (the target state for this
  1116 + transition). Labels are of the following sorts:
  1117 +
  1118 +public type DFA_label:
  1119 + char(Word8),
  1120 + beginning_of_line,
  1121 + end_of_line.
  1122 +
  1123 +public type DFA_transition:
  1124 + transition(DFA_label label,
  1125 + Word32 target_name).
  1126 +
  1127 +public type DFA_state($Token):
  1128 + rejecting (Word32 name,
  1129 + List(DFA_transition) transitions),
  1130 +
  1131 + accepting (Word32 name,
  1132 + List(DFA_transition) transitions,
  1133 + Maybe(ByteArray -> LexerOutput($Token)) action).
  1134 +
  1135 +
  1136 +
  1137 + Now, here is the tool for making the DFA. The type 'RegExprError' is defined in
  1138 + 'regexpr_parser.anubis'.
  1139 +
  1140 +public define Result(RegExprError,List(DFA_state($Token)))
  1141 + make_DFA
  1142 + (
  1143 + List(LexerItem($Token)) lexer_description
  1144 + ).
  1145 +
  1146 +
  1147 +
  1148 + *** [3.1] Pre-labels.
  1149 +
  1150 + These are the labels before the renaming of the DFA.
  1151 +
  1152 + 'beginning_of_line' and 'end_of_line' are also treated as special characters, even if
  1153 + they cannot be present as such in the input. The fast lexer detects their presence
  1154 + based on the neighbourhood of the character '\n', and uses special transitions in that
  1155 + case.
  1156 +
  1157 + On the contrary, 'actions' cannot be considered as matching anything in the
  1158 + input. However, in a given state, an action may be present among the transitions,
  1159 + meaning that in this state, if no transition may be followed, the action must be
  1160 + chosen instead.
  1161 +
  1162 +
  1163 +public type DFA_pre_label($Token):
  1164 + char(Word8),
  1165 + beginning_of_line,
  1166 + end_of_line,
  1167 + action(Maybe(ByteArray -> LexerOutput($Token))).
  1168 +
  1169 +
  1170 +
  1171 +
  1172 + *** [3.2] Decorating basic regular expressions.
  1173 +
  1174 + Given a basic regular expression, we associate a unique integer to each of its leaves
  1175 + (when seen as a tree), which are either a character, a beginning of line or an end of
  1176 + line. Such an integer is called a 'position'.
  1177 +
  1178 + Furthermore, we add three decorations to each basic regular
  1179 + expression:
  1180 +
  1181 + - a flag 'nullable', which, when true, means that the regular expression may match
  1182 + the empty string,
  1183 +
  1184 + - a list of integers, representing all positions which may correspond to the first
  1185 + character of a matching string,
  1186 +
  1187 + - a list of integers, representing all positions which may correspond to the last
  1188 + character in a matching string.
  1189 +
  1190 + Actually, these two lists are lists of pairs (Word32,Label), where
  1191 + the label corresponds to the position.
  1192 +
  1193 +type DecoratedBasicRegExpr($Token):
  1194 + char (Word8,
  1195 + Word32 pos,
  1196 + Bool nullable,
  1197 + List((Word32,DFA_pre_label($Token))) firstpos,
  1198 + List((Word32,DFA_pre_label($Token))) lastpos),
  1199 +
  1200 + bol (Word32 pos,
  1201 + Bool nullable,
  1202 + List((Word32,DFA_pre_label($Token))) firstpos,
  1203 + List((Word32,DFA_pre_label($Token))) lastpos),
  1204 +
  1205 + eol (Word32 pos,
  1206 + Bool nullable,
  1207 + List((Word32,DFA_pre_label($Token))) firstpos,
  1208 + List((Word32,DFA_pre_label($Token))) lastpos),
  1209 +
  1210 + epsilon (Bool nullable,
  1211 + List((Word32,DFA_pre_label($Token))) firstpos,
  1212 + List((Word32,DFA_pre_label($Token))) lastpos),
  1213 +
  1214 + or (DecoratedBasicRegExpr($Token),DecoratedBasicRegExpr($Token),
  1215 + Bool nullable,
  1216 + List((Word32,DFA_pre_label($Token))) firstpos,
  1217 + List((Word32,DFA_pre_label($Token))) lastpos),
  1218 +
  1219 + cat (DecoratedBasicRegExpr($Token),DecoratedBasicRegExpr($Token),
  1220 + Bool nullable,
  1221 + List((Word32,DFA_pre_label($Token))) firstpos,
  1222 + List((Word32,DFA_pre_label($Token))) lastpos),
  1223 +
  1224 + star (DecoratedBasicRegExpr($Token),
  1225 + Bool nullable,
  1226 + List((Word32,DFA_pre_label($Token))) firstpos,
  1227 + List((Word32,DFA_pre_label($Token))) lastpos),
  1228 +
  1229 + action (Maybe(ByteArray -> LexerOutput($Token)),
  1230 + Word32 pos,
  1231 + Bool nullable,
  1232 + List((Word32,DFA_pre_label($Token))) firstpos,
  1233 + List((Word32,DFA_pre_label($Token))) lastpos).
  1234 +
  1235 +
  1236 +
  1237 + The following function adds positions and decorations to a regular expression. Since we
  1238 + have to generate position names, we give the first position to be used, and the
  1239 + function returns the regular expression (with positions and decorations) and the next
  1240 + position free for further use. The computation is simply recursive (there is no 'graph
  1241 + walk' to do, only a 'tree walk').
  1242 +
  1243 +
  1244 +define (DecoratedBasicRegExpr($Token),Word32)
  1245 + decorate
  1246 + (
  1247 + BasicRegExpr($Token) r,
  1248 + Word32 n
  1249 + ) =
  1250 + if r is
  1251 + {
  1252 + char(c) then
  1253 + (char(c,n,false,[(n,char(c))],[(n,char(c))]), n+1),
  1254 +
  1255 + star(r1) then
  1256 + if decorate(r1,n) is (rp1,m) then
  1257 + (star(rp1,
  1258 + true,
  1259 + firstpos(rp1),
  1260 + lastpos(rp1)),m),
  1261 +
  1262 + or(r1,r2) then
  1263 + if decorate(r1,n) is (rp1,m) then
  1264 + if decorate(r2,m) is (rp2,l) then
  1265 + (or(rp1,rp2,
  1266 + if nullable(rp1) then true else nullable(rp2),
  1267 + append(firstpos(rp1),firstpos(rp2)),
  1268 + append(lastpos(rp1),lastpos(rp2))),l),
  1269 +
  1270 + cat(r1,r2) then
  1271 + if decorate(r1,n) is (rp1,m) then
  1272 + if decorate(r2,m) is (rp2,l) then
  1273 + (cat(rp1,rp2,
  1274 + if nullable(rp1) then nullable(rp2) else false,
  1275 + if nullable(rp1) then append(firstpos(rp1),firstpos(rp2)) else firstpos(rp1),
  1276 + if nullable(rp2) then append(lastpos(rp1),lastpos(rp2)) else lastpos(rp2)),l),
  1277 +
  1278 + epsilon then
  1279 + (epsilon(true,[],[]),n),
  1280 +
  1281 + beginning_of_line then
  1282 + (bol(n,false,[(n,beginning_of_line)],[(n,beginning_of_line)]),n+1),
  1283 +
  1284 + end_of_line then
  1285 + (eol(n,false,[(n,end_of_line)],[(n,end_of_line)]),n+1),
  1286 +
  1287 + action(a) then
  1288 + (action(a,n,false,[(n,action(a))],[(n,action(a))]),n+1)
  1289 + }.
  1290 +
  1291 +
  1292 + Notice that the 'firstpos' and 'lastpos' fields in decorated regular expressions are
  1293 + always increasingly ordered lists of distinct integers (when ignoring labels), as may
  1294 + be easily verified by induction from the previous definition. Hint: when we write
  1295 +
  1296 + if decorate(r1,n) is (rp1,m)
  1297 +
  1298 + any position i in rp1 is such that n =< i < m.
  1299 +
  1300 +
  1301 +
  1302 + *** [3.3] Computing the follow table.
  1303 +
  1304 +
  1305 + A 'follow table' tells us which positions can follow a given position (when scanning a
  1306 + string). It also gives the label attached to a position. Its type is:
  1307 +
  1308 +type FollowTable($Token):
  1309 + empty,
  1310 + follow_table(Word32, // position
  1311 + DFA_pre_label($Token), // label
  1312 + List(Word32), // following positions
  1313 + FollowTable($Token) next).
  1314 +
  1315 +
  1316 + Our lists of Word32s will have to remain increasingly sorted (for the purpose of
  1317 + comparison).
  1318 +
  1319 + The following function merges two lists sorted in increasing order, so that the result
  1320 + is still increasingly sorted.
  1321 +
  1322 +define List(Word32)
  1323 + merge_sorted
  1324 + (
  1325 + List(Word32) l1,
  1326 + List(Word32) l2
  1327 + ) =
  1328 + if l1 is
  1329 + {
  1330 + [ ] then l2,
  1331 + [h1 . t1] then
  1332 + if l2 is
  1333 + {
  1334 + [ ] then l1,
  1335 + [h2 . t2] then
  1336 + if h1 = h2 // avoid duplications
  1337 + then [h1 . merge_sorted(t1,t2)]
  1338 + else if h1 -< h2
  1339 + then [h1 . merge_sorted(t1,l2)]
  1340 + else [h2 . merge_sorted(l1,t2)]
  1341 + }
  1342 + }.
  1343 +
  1344 +
  1345 + 'heads' takes a list of pairs, and returns the list of all heads of these pairs. Note
  1346 + that if we apply 'heads' to either a 'firstpos' or a 'lastpos' datum, we get a list of
  1347 + increasingly ordered distinct integers.
  1348 +
  1349 +define List($T)
  1350 + heads
  1351 + (
  1352 + List(($T,$U)) l
  1353 + ) =
  1354 + if l is
  1355 + {
  1356 + [ ] then [ ],
  1357 + [h . t] then if h is (u,v) then
  1358 + [u . heads(t)]
  1359 + }.
  1360 +
  1361 +
  1362 +
  1363 + Adding entries to a follow table. Given:
  1364 +
  1365 + - a list of keys (e1,...,ek) of type (Word32,DFA_pre_label($Token)),
  1366 + - a list of values (t1,...,tn) of type (Word32,DFA_pre_label($Token)),
  1367 + - an A-list of triplets of type (Word32,DFA_pre_label($Token),List(Word32)),
  1368 +
  1369 + update that A-list, adding the keys e1,...,ek if they are not already in the A-list, and
  1370 + putting the head of each ti as a value for each ej. The third element of each triplet (a
  1371 + list of integers) should always remain increasingly sorted, and have distinct elements.
  1372 +
  1373 + First, assume there is only one key (and its label) to add:
  1374 +
  1375 +
  1376 +define FollowTable($Token)
  1377 + add_follow_entry
  1378 + (
  1379 + Word32 key,
  1380 + DFA_pre_label($Token) c,
  1381 + List((Word32,DFA_pre_label($Token))) values,
  1382 + FollowTable($Token) previous
  1383 + ) =
  1384 + if previous is
  1385 + {
  1386 + empty then follow_table(key,c,heads(values),empty),
  1387 + follow_table(k1,c1,v1,t) then
  1388 + if key = k1
  1389 + then follow_table(k1,c1,merge_sorted(heads(values),v1),t)
  1390 + else follow_table(k1,c1,v1,add_follow_entry(key,c,values,t))
  1391 + }.
  1392 +
  1393 +
  1394 + Now, add several keys.
  1395 +
  1396 +define FollowTable($Token)
  1397 + add_follow_entries
  1398 + (
  1399 + List((Word32,DFA_pre_label($Token))) keys,
  1400 + List((Word32,DFA_pre_label($Token))) values,
  1401 + FollowTable($Token) previous
  1402 + ) =
  1403 + if keys is
  1404 + {
  1405 + [ ] then previous,
  1406 + [k1 . ks] then
  1407 + if k1 is (k,c) then
  1408 + add_follow_entries(ks,values,add_follow_entry(k,c,values,previous))
  1409 + }.
  1410 +
  1411 + Appending two follow tables (it is assumed that they have no key in common).
  1412 +
  1413 +define FollowTable($Token)
  1414 + append
  1415 + (
  1416 + FollowTable($Token) t1,
  1417 + FollowTable($Token) t2
  1418 + ) =
  1419 + if t1 is
  1420 + {
  1421 + empty then t2,
  1422 + follow_table(p,l,n,tail1) then follow_table(p,l,n,append(tail1,t2))
  1423 + }.
  1424 +
  1425 +
  1426 + Making the follow_table from a decorated basic regular expression.
  1427 +
  1428 +define FollowTable($Token)
  1429 + make_follow_table
  1430 + (
  1431 + DecoratedBasicRegExpr($Token) r
  1432 + ) =
  1433 + if r is
  1434 + {
  1435 + char(c,n,nb,fp,lp) then follow_table(n,char(c),[],empty),
  1436 + bol(n,nb,fp,lp) then follow_table(n,beginning_of_line,[],empty),
  1437 + eol(n,nb,fp,lp) then follow_table(n,end_of_line,[],empty),
  1438 + epsilon(nb,fp,lp) then empty,
  1439 + or(r1,r2,nb,fp,lp) then append(make_follow_table(r1),make_follow_table(r2)),
  1440 + /* we can use append because r1 and r2 cannot share a
  1441 + key. */
  1442 +
  1443 + cat(r1,r2,nb,fp,lp) then
  1444 + with t = append(make_follow_table(r1),make_follow_table(r2)),
  1445 + /* same remark on append */
  1446 + l1 = lastpos(r1),
  1447 + f2 = firstpos(r2),
  1448 + add_follow_entries(l1,f2,t),
  1449 +
  1450 + star(r1,nb,fp,lp) then
  1451 + with t = make_follow_table(r1),
  1452 + f = firstpos(r1),
  1453 + l = lastpos(r1),
  1454 + add_follow_entries(l,f,t),
  1455 +
  1456 + action(a,n,nb,fb,lp) then follow_table(n,action(a),[],empty)
  1457 + }.
  1458 +
  1459 +
  1460 +
  1461 +
  1462 +
  1463 + Finding an entry in a follow table.
  1464 +
  1465 +define (Word32,DFA_pre_label($Token),List(Word32))
  1466 + follow_table_entry
  1467 + (
  1468 + Word32 p,
  1469 + FollowTable($Token) l
  1470 + ) =
  1471 + if l is
  1472 + {
  1473 + empty then alert, // we should always find it
  1474 + follow_table(n,c,pos,t) then
  1475 + if p = n
  1476 + then (n,c,pos)
  1477 + else follow_table_entry(p,t)
  1478 + }.
  1479 +
  1480 +
  1481 +
  1482 +
  1483 +
  1484 +
  1485 +
  1486 +
  1487 +
  1488 +
  1489 + Names of states in the DFA are initially increasingly sorted lists of Word32s. They are
  1490 + transformed into Word32s when the DFA is renamed (see below). A transition is just a
  1491 + pair made of a label and a state name.
  1492 +
  1493 +type DFA_pre_transition($Token):
  1494 + transition(DFA_pre_label($Token) label,
  1495 + List(Word32) target_name).
  1496 +
  1497 +
  1498 + A state is made of a state name and a list of transitions.
  1499 +
  1500 +type DFA_pre_state($Token):
  1501 + state(List(Word32) name,
  1502 + Maybe(List(DFA_pre_transition($Token))) transitions).
  1503 +
  1504 +
  1505 + The reason why the field 'transitions' has a 'Maybe' is that we may consider
  1506 + 'incomplete' states, which have not yet received their transitions.
  1507 +
  1508 + Note: A DFA is not a tree in general, but a graph. This is the reason why states have
  1509 + names. Since we cannot construct circular data in Anubis, the presence of names
  1510 + nevertheless allows the construction of graphs (including circularities). However, we
  1511 + cannot refer directly to a state, but only to its name.
  1512 +
  1513 + We explain now how the automaton is constructed for a decorated basic regular
  1514 + expression 'r'.
  1515 +
  1516 + First of all, there is an initial state, whose name is firstpos(r). What it means is
  1517 + that in this state, we expect to read a character corresponding to one of these
  1518 + positions.
  1519 +
  1520 + More generally, for any state 's', the name of the state is the list of all positions
  1521 + which may match the next character to be read from the input.
  1522 +
  1523 + Since we don't care about unreachable states, we construct the automaton, starting
  1524 + with the initial state, and adding all the states required by the transitions, until no
  1525 + more states may be added. Of course, this process terminates, since the set of all
  1526 + possible state names is obviously finite (its cardinality is at most 2^p, where p is
  1527 + the number of positions in r).
  1528 +
  1529 + For a given state, with name [p_1,...,p_k], the transitions are given by the labels of
  1530 + p_1,...,p_k. Nevertheless, several positions may have the same label. Hence, for a
  1531 + given label, let q_1,...,q_j be those among p_1,...,p_k which have this label. The
  1532 + target state for the corresponding transition is obtained by taking all the positions
  1533 + which may follow one of q_1,...,q_j.
  1534 +
  1535 + That's all!
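 As a cross-check, the construction just described (the classical 'followpos'
 construction) can be sketched in Python, with frozensets standing for the sorted
 lists of positions used as state names above (an illustration only, not the
 Anubis implementation below):

```python
def make_dfa(first, follow):
    """Subset construction over positions.

    'first' is the set of starting positions (firstpos of the whole regular
    expression); 'follow' maps a position to (label, set of following
    positions), playing the role of the follow table. Returns a dictionary
    {state_name: {label: target_name}} where names are frozensets.
    """
    states = {frozenset(first): None}   # None = transitions not yet computed
    while True:
        todo = [s for s, t in states.items() if t is None]
        if not todo:                    # every state is complete: the DFA is ready
            return states
        name = todo[0]
        by_label = {}                   # group the positions of this state by label
        for p in name:
            label, nxt = follow[p]
            by_label.setdefault(label, set()).update(nxt)
        trans = {lab: frozenset(tgt) for lab, tgt in by_label.items()}
        states[name] = trans
        for tgt in trans.values():      # add newly reached states, still incomplete
            if tgt not in states:
                states[tgt] = None
```

 The loop terminates for the reason given above: there are at most 2^p possible
 state names. Note that the empty frozenset may appear as a state, matching the
 discussion of empty state names below.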
  1536 +
  1537 +
  1538 + Empty state names. What does it mean that the name of a state is empty? It means
  1539 + that reaching this state produces an error. Indeed, a state accepts a string if and
  1540 + only if it contains a position labelled by an action, and has transitions to other
  1541 + states if and only if it contains a position labelled by a character (or
  1542 + 'beginning_of_line' or 'end_of_line').
  1543 +
  1544 + A state which contains an action is an accepting state. Nevertheless, it may also have
  1545 + transitions. Hence, the lexer may possibly accept a longer sequence. But following
  1546 + the transitions may also lead to an error. Hence the lexer must always keep the most
  1547 + recently found solution, and use it (if it exists) when it enters a dead end (and in
  1548 + that case, there is no error at all).
  1549 +
  1550 + When using a solution, the lexer must also apply the action. This action must have been
  1551 + saved by the lexer. Hence it is necessary to number actions, and to create a function
  1552 + for each action.
  1553 +
  1554 +
  1555 +
  1556 +
  1557 + Given a state name [p_1,...,p_k], and the follow table, the function
  1558 + 'prepare_transitions' produces a list of pairs
  1559 +
  1560 + (a , l)
  1561 +
  1562 + where 'a' is a label, and 'l' is the list of all positions with label 'a' which may
  1563 + follow one of p_1,...,p_k. We need an auxiliary function 'insert'.
  1564 +
  1565 +
  1566 +
  1567 +
  1568 +
  1569 +define List(DFA_pre_transition($Token))
  1570 + insert
  1571 + (
  1572 + DFA_pre_label($Token) c,
  1573 + List(Word32) l,
  1574 + List(DFA_pre_transition($Token)) q
  1575 + ) =
  1576 + if q is
  1577 + {
  1578 + [ ] then [transition(c,l)],
  1579 + [h . t] then
  1580 + if h is transition(c1,l1) then
  1581 + if c = c1
  1582 + then [transition(c,merge_sorted(l,l1)) . t]
  1583 + else [h . insert(c,l,t)]
  1584 + }.
  1585 +
  1586 +
  1587 +define List(DFA_pre_transition($Token))
  1588 + prepare_transitions
  1589 + (
  1590 + List(Word32) name,
  1591 + FollowTable($Token) ft
  1592 + ) =
  1593 + if name is
  1594 + {
  1595 + [ ] then [ ],
  1596 + [p1 . p_others] then
  1597 + if follow_table_entry(p1,ft) is (p,c,l) then
  1598 + with q = prepare_transitions(p_others,ft),
  1599 + insert(c,l,q)
  1600 + }.
  1601 +
  1602 +
  1603 +
  1604 +
  1605 + Now, we compute our DFA, i.e. a list of DFA_pre_state($Token)s. We begin with only one
  1606 + state in the list. The name of this state is firstpos(r), and it has not yet received
  1607 + its transitions. In other words, it is:
  1608 +
  1609 + state(firstpos(r),failure)
  1610 +
  1611 + Then, we enter an 'infinite' loop. At each pass, we look for a state which has not yet
  1612 + received its transitions. If there is no such state, the DFA is ready (and we exit the
  1613 + loop). Otherwise, we add its transitions to the state, and this may create new states
  1614 + (without their transitions) in the DFA.
  1615 +
  1616 + We need a function to separate (if possible) an incomplete state from a list of states:
  1617 +
  1618 +define Maybe((DFA_pre_state($Token),List(DFA_pre_state($Token))))
  1619 + separate_incomplete_state
  1620 + (
  1621 + List(DFA_pre_state($Token)) l
  1622 + ) =
  1623 + if l is
  1624 + {
  1625 + [ ] then failure,
  1626 + [s1 . so] then
  1627 + if transitions(s1) is
  1628 + {
  1629 + failure then
  1630 + success((s1,so)),
  1631 + success(_) then
  1632 + if separate_incomplete_state(so) is
  1633 + {
  1634 + failure then failure,
  1635 + success(p) then if p is (i,m) then
  1636 + success((i,[s1 . m]))
  1637 + }
  1638 + }
  1639 + }.
  1640 +
  1641 +
  1642 + We need a function to extract the list of target names from a list of transitions.
  1643 +
  1644 +define List(List(Word32))
  1645 + get_targets
  1646 + (
  1647 + List(DFA_pre_transition($Token)) l
  1648 + ) =
  1649 + if l is
  1650 + {
  1651 + [ ] then [ ],
  1652 + [h . t] then if h is transition(n,target) then
  1653 + [target . get_targets(t)]
  1654 + }.
  1655 +
  1656 +
  1657 + We need a predicate to test if a list of states contains a state of
  1658 + a given name.
  1659 +
  1660 +define Bool
  1661 + is_state_name_in
  1662 + (
  1663 + List(DFA_pre_state($Token)) l,
  1664 + List(Word32) n // sorted list of integers
  1665 + ) =
  1666 + if l is
  1667 + {
  1668 + [ ] then false,
  1669 + [h . t] then
  1670 + if h is state(m,tr) then
  1671 + if n = m // comparing sorted lists of integers
  1672 + then true
  1673 + else is_state_name_in(t,n)
  1674 + }.
  1675 +
  1676 +
  1677 + We need a function to add new states to a list of states. The new states are given in
  1678 + the form of a list of state names and are added without their transitions.
  1679 +
  1680 +define List(DFA_pre_state($Token))
  1681 + add_new_states
  1682 + (
  1683 + List(List(Word32)) names,
  1684 + List(DFA_pre_state($Token)) states
  1685 + ) =
  1686 + if names is
  1687 + {
  1688 + [ ] then states,
  1689 + [h . t] then
  1690 + if is_state_name_in(states,h)
  1691 + then add_new_states(t,states)
  1692 + else add_new_states(t,[state(h,failure) . states])
  1693 + }.
  1694 +
  1695 +
  1696 +
  1697 + We need a function to complete a state which has not yet received its transitions.
  1698 +
  1699 +define List(DFA_pre_state($Token))
  1700 + complete_state
  1701 + (
  1702 + DFA_pre_state($Token) i, // incomplete state
  1703 + List(DFA_pre_state($Token)) o, // other states
  1704 + FollowTable($Token) ft
  1705 + ) =
  1706 + with trans = prepare_transitions(name(i),ft),
  1707 + targets = get_targets(trans),
  1708 + add_new_states(targets,[state(name(i),success(trans)) . o]).
  1709 +
  1710 +
  1711 + Now, here is our 'infinite' loop.
  1712 +
  1713 +define List(DFA_pre_state($Token))
  1714 + make_DFA_pre
  1715 + (
  1716 + List(DFA_pre_state($Token)) l,
  1717 + FollowTable($Token) ft
  1718 + ) =
  1719 + if separate_incomplete_state(l) is
  1720 + {
  1721 + failure then l, // the DFA is ready
  1722 +
  1723 + success(p) then if p is (s,o) then
  1724 + with new = complete_state(s,o,ft),
  1725 + make_DFA_pre(new,ft)
  1726 + }.
  1727 +
  1728 +
  1729 +
  1730 +
  1731 +
  1732 + *** [3.5] Renaming the states of the DFA.
  1733 +
  1734 + Names of states in our DFA are lists of integers. We need to replace them by integers.
  1735 +
  1736 + From a DFA whose state names are lists of integers, we create a list of pairs (old,new)
  1737 + where new is a new name (an integer) and old an old name (a list of integers).
  1738 +
  1739 +define List((List(Word32),Word32)) // an association list
  1740 + name_list
  1741 + (
  1742 + List(DFA_pre_state($Token)) l,
  1743 + Word32 first_new_name
  1744 + ) =
  1745 + if l is
  1746 + {
  1747 + [ ] then [ ],
  1748 + [h . t] then
  1749 + if h is state(old_name,tr) then
  1750 + [(old_name,first_new_name) . name_list(t,first_new_name+1)]
  1751 + }.
  1752 +
  1753 +
  1754 + Given an old name and our association list, we can get the new name.
  1755 +
  1756 +define Word32
  1757 + get_new_name
  1758 + (
  1759 + List(Word32) old_name,
  1760 + List((List(Word32),Word32)) nlist
  1761 + ) =
  1762 + if nlist is
  1763 + {
  1764 + [ ] then alert, // the new name should always exist
  1765 + [h . t] then if h is (o,n) then
  1766 + if old_name = o
  1767 + then n
  1768 + else get_new_name(old_name,t)
  1769 + }.
  1770 +
  1771 +
  1772 + Now, we rename all transitions in a given state. At the same time we separate actual
  1773 + transitions from actions. This is why the following function returns a pair made of a
  1774 + list of transitions, and maybe an action. Since the action is of type:
  1775 +
  1776 + Maybe(ByteArray -> LexerOutput($Token))
  1777 +
  1778 + the optional action is of type:
  1779 +
  1780 + Maybe(Maybe(ByteArray -> LexerOutput($Token)))
  1781 +
  1782 +
  1783 +define (List(DFA_transition),Maybe(Maybe(ByteArray -> LexerOutput($Token))))
  1784 + rename
  1785 + (
  1786 + List(DFA_pre_transition($Token)) l,
  1787 + List((List(Word32),Word32)) nlist
  1788 + ) =
  1789 + if l is
  1790 + {
  1791 + [ ] then ([ ],failure),
  1792 + [h . t] then
  1793 + if rename(t,nlist) is (trs,mbmba) then
  1794 + if h is transition(pre_label,target) then
  1795 + if pre_label is
  1796 + {
  1797 + char(c) then
  1798 + ([transition(char(c),get_new_name(target,nlist)) . trs],mbmba),
  1799 + beginning_of_line then
  1800 + ([transition(beginning_of_line,get_new_name(target,nlist)) . trs],mbmba),
  1801 + end_of_line then
  1802 + ([transition(end_of_line,get_new_name(target,nlist)) . trs],mbmba),
  1803 + action(mba) then if mbmba is
  1804 + {
  1805 + failure then (trs,success(mba)),
  1806 + success(x) then // two actions in the same state: choose the first one.
  1807 + (trs,success(mba))
  1808 + }
  1809 + }
  1810 + }.
  1811 +
  1812 +
  1813 + Now, we rename all the states.
  1814 +
  1815 +define List(DFA_state($Token))
  1816 + rename
  1817 + (
  1818 + List(DFA_pre_state($Token)) l,
  1819 + List((List(Word32),Word32)) nlist
  1820 + ) =
  1821 + if l is
  1822 + {
  1823 + [ ] then [ ],
  1824 + [h . t] then
  1825 + if h is state(old_name,mbtrans) then
  1826 + if mbtrans is
  1827 + {
  1828 + failure then alert, // pre-states must have been completed
  1829 + success(trans) then
  1830 + if rename(trans,nlist) is (trs,mbmba) then
  1831 + if mbmba is
  1832 + {
  1833 + failure then
  1834 + [rejecting(get_new_name(old_name,nlist),trs) . rename(t,nlist)],
  1835 + success(mba) then
  1836 + [accepting(get_new_name(old_name,nlist),trs,mba) . rename(t,nlist)]
  1837 + }
  1838 + }
  1839 + }.
  1840 +
  1841 +
  1842 +
  1843 + *** [3.5] Making the DFA.
  1844 +
  1845 +
  1846 +
  1847 +
  1848 +define Result(RegExprError,BasicRegExpr($Token))
  1849 + prepare_global_regexpr
  1850 + (
  1851 + List(LexerItem($Token)) lexer_description
  1852 + ) =
  1853 + if lexer_description is
  1854 + {
  1855 + [ ] then error(empty_lexer_description),
  1856 + [h . t] then if h is lexer_item(re,a) then
  1857 + if parse_regular_expression(make_stream(re)) is
  1858 + {
  1859 + error(msg) then error(msg),
  1860 + ok(re1) then if t is
  1861 + {
  1862 + [ ] then
  1863 + ok(cat(to_basic(re1),action(a))),
  1864 + [_ . _] then if prepare_global_regexpr(t) is
  1865 + {
  1866 + error(msg) then error(msg),
  1867 + ok(p) then
  1868 + ok(or(cat(to_basic(re1),action(a)),p))
  1869 + }
  1870 + }
  1871 + }
  1872 + }.
  1873 +
  1874 +
  1875 +
  1876 +public define Result(RegExprError,List(DFA_state($Token)))
  1877 + make_DFA
  1878 + (
  1879 + List(LexerItem($Token)) lexer_description
  1880 + ) =
  1881 + if prepare_global_regexpr(lexer_description) is
  1882 + {
  1883 + error(msg) then error(msg),
  1884 + ok(re) then if decorate(re,0) is (br,_) then
  1885 + with dfa = reverse(make_DFA_pre([state(heads(firstpos(br)),failure)],
  1886 + make_follow_table(br))),
  1887 + ok(rename(dfa,name_list(dfa,0)))
  1888 + }.
  1889 +
  1890 +
  1891 +
  1892 +
  1893 +
  1894 + *** [3.6] Translating a DFA into a fast lexer description.
  1895 +
  1896 + The types 'FastLexerTransition' and 'FastLexerState' are defined in 'predefined.anubis',
  1897 + section 13.
  1898 +
  1899 +
  1900 +define List(FastLexerTransition)
  1901 + to_fast_lexer_transitions
  1902 + (
  1903 + List(DFA_transition) l
  1904 + ) =
  1905 + if l is
  1906 + {
  1907 + [ ] then [ ],
  1908 + [h . t] then if h is transition(label,target) then
  1909 + [if label is
  1910 + {
  1911 + char(c) then transition(c,target),
  1912 + beginning_of_line then beginning_of_line(target),
  1913 + end_of_line then end_of_line(target)
  1914 + } . to_fast_lexer_transitions(t)]
  1915 + }.
  1916 +
  1917 +
  1918 +public define List(FastLexerState)
  1919 + to_fast_lexer_description
  1920 + (
  1921 + List(DFA_state($Token)) l
  1922 + ) =
  1923 + if l is
  1924 + {
  1925 + [ ] then [ ],
  1926 + [h . t] then [if h is
  1927 + {
  1928 + rejecting(n,trs) then rejecting(to_fast_lexer_transitions(trs)),
  1929 + accepting(n,trs,a) then accepting(to_fast_lexer_transitions(trs))
  1930 + } . to_fast_lexer_description(t)]
  1931 + }.
  1932 +
  1933 +
  1934 +
  1935 +
  1936 +
  1937 +
  1938 + *** [4] Constructing the lexer.
  1939 +
  1940 + The low-level fast lexer (see 'predefined.anubis', section 13) does not care about
  1941 + actions. Hence, we must manage the actions in parallel. To this end, we use the
  1942 + following type:
  1943 +
  1944 + MVar(Maybe(ByteArray -> LexerOutput($Token)))
  1945 +
  1946 + The action for state 'n' (assumed to be an accepting state, because the multiple
  1947 + variable is never used for rejecting states) is the value stored in slot 'n'. The
  1948 + default value is 'failure', meaning 'ignore this token and read the next
  1949 + one'. If the slot contains a function instead, this function is applied to the lexeme
  1950 + just read, and the lexer returns its result.
  1951 +
  1952 + The multiple variable is filled up by:
  1953 +
  1954 +define One
  1955 + fill_actions
  1956 + (
  1957 + List(DFA_state($Token)) dfa,
  1958 + MVar(Maybe(ByteArray -> LexerOutput($Token))) v
  1959 + ) =
  1960 + if dfa is
  1961 + {
  1962 + [ ] then unique,
  1963 + [h . t] then
  1964 + if h is
  1965 + {
  1966 + rejecting(name,trs) then unique,
  1967 + accepting(name,trs,action) then
  1968 + v(name) <- action
  1969 + };
  1970 + fill_actions(t,v)
  1971 + }.
  1972 +
  1973 +
  1974 + The multiple variable for actions is created by:
  1975 +
  1976 +define MVar(Maybe(ByteArray -> LexerOutput($Token)))
  1977 + get_actions
  1978 + (
  1979 + List(DFA_state($Token)) dfa
  1980 + ) =
  1981 + with ns = length(dfa), // total number of states
  1982 + v = mvar(truncate_to_Word32(ns),
  1983 + (Maybe(ByteArray -> LexerOutput($Token)))failure),
  1984 + fill_actions(dfa,v); v.
  1985 +
  1986 +
  1987 +
  1988 + Now we plug the lexer into a lexing stream:
  1989 +
  1990 +
  1991 +define One -> LexerOutput($Token)
  1992 + plug_lexer
  1993 + (
  1994 + LexingStream stream,
  1995 + (ByteArray input,
  1996 + FastLexerLastAccepted last_accepted,
  1997 + FastLexerBeginningOfLine bol,
  1998 + FastLexerEndOfLine eol,
  1999 + Int position,
  2000 + Word32 starting_state) -> FastLexerOutput lexer,
  2001 + MVar(Maybe(ByteArray -> LexerOutput($Token))) actions
  2002 + ) =
  2003 + with bol_v = var((FastLexerBeginningOfLine)at_beginning_of_line),
  2004 + eol_v = var((FastLexerEndOfLine)not_at_end_of_line),
  2005 + if stream is lexing_stream(buffer_v,start_v,last_accept_v,current_v,reload_buffer) then
  2006 + (One _) |-l-> if lexer(*buffer_v,
  2007 + *last_accept_v,
  2008 + *bol_v,
  2009 + *eol_v,
  2010 + *current_v,
  2011 + 0) // reading a new token always starts in state 0
  2012 + is
  2013 + {
  2014 + rejected(state,end,a) then
  2015 + if a is
  2016 + {
  2017 + not_at_end_of_input then
  2018 + with result = (LexerOutput($Token))error(extract(*buffer_v,*start_v,end)),
  2019 + current_v <- end+1;
  2020 + start_v <- end+1;
  2021 + last_accept_v <- none;
  2022 + result,
  2023 +
  2024 + at_end_of_input then
  2025 + if reload_buffer(*start_v) is
  2026 + {
  2027 + failure then //print("At end (1).\n");
  2028 + end_of_input, // really at end of input
  2029 + success(_) then
  2030 + l(unique) // continue reading this token
  2031 + }
  2032 + },
  2033 +
  2034 + accepted(state,end,a) then
  2035 + if a is
  2036 + {
  2037 + not_at_end_of_input then
  2038 + if *actions(state) is
  2039 + {
  2040 + failure then
  2041 + current_v <- end;
  2042 + start_v <- end;
  2043 + last_accept_v <- none;
  2044 + l(unique), // ignore and try to read the next token
  2045 +
  2046 + success(f) then
  2047 + with result = f(extract(*buffer_v,*start_v,end)),
  2048 + current_v <- end;
  2049 + start_v <- end;
  2050 + last_accept_v <- none;
  2051 + result
  2052 + },
  2053 +
  2054 + at_end_of_input then
  2055 + if reload_buffer(*start_v) is
  2056 + {
  2057 + failure then
  2058 + if *actions(state) is
  2059 + {
  2060 + failure then //print("At end (2).\n");
  2061 + end_of_input, // ignore and don't try to continue
  2062 + success(f) then
  2063 + with result = f(extract(*buffer_v,*start_v,end)),
  2064 + current_v <- end;
  2065 + start_v <- end;
  2066 + last_accept_v <- none;
  2067 + result
  2068 + },
  2069 +
  2070 + success(_) then l(unique) // continue reading this token
  2071 + }
  2072 + }
  2073 + }.
  2074 +
  2075 +
  2076 +
  2077 + Finally, here is the tool for making a lexer:
  2078 +
  2079 +public define Result(RegExprError, LexingStream -> One -> LexerOutput($Token))
  2080 + make_lexer
  2081 + (
  2082 + List(LexerItem($Token)) lexer_description
  2083 + ) =
  2084 + if make_DFA(lexer_description) is
  2085 + {
  2086 + error(msg) then error(msg),
  2087 + ok(List(DFA_state($Token)) dfa) then
  2088 + if make_fast_lexer(to_fast_lexer_description(dfa)) is
  2089 + {
  2090 + unknown_state(n) then alert, // cannot happen
  2091 + ok(fl) then ok((LexingStream ls) |-> plug_lexer(ls,fl,get_actions(dfa)))
  2092 + }
  2093 + }.
  2094 +
  2095 +
  2096 +
0 2097 \ No newline at end of file
anubis_distrib/library/lexical_analysis/fast_lexer_example_1.anubis 0 → 100644
  1 +
  2 +
  3 + The Anubis Project
  4 +
  5 + Tools for lexical analysis.
  6 + A simple example.
  7 +
  8 + Copyright (c) Constructive Mathematics 2007-2008.
  9 +
  10 +
  11 + Author: Alain Prouté
  12 +
  13 +
  14 + In this file, we present a simple example of the use of 'fast_lexer.anubis'. The
  15 + program is a very simplified version of the Unix tool 'grep':
  16 +
  17 +global define One
  18 + fast_lexer_example_1
  19 + (
  20 + List(String) args
  21 + ).
  22 +
  23 + This program receives a regular expression and a filename as its arguments. Its purpose
  24 + is to print to the standard output all the sequences in the file matching the regular
  25 + expression, with line numbers.
  26 +
  27 +define String
  28 + usage =
  29 + "Usage: anbexec fast_lexer_example_1 <regular expression> <file name>\n".
  30 +
  31 +
  32 +
  33 + --- That's all for the public part! ---------------------------------------------------
  34 + Nevertheless, since this is an example, you may want to read the sequel, which is fully
  35 + commented.
  36 +
  37 +
  38 +
  39 +
  40 + -------------------------------- Table of Contents ------------------------------------
  41 +
  42 + *** [1] Tokens.
  43 + *** [2] Preparing the lexer description.
  44 + *** [3] Preparing the lexing stream.
  45 + *** [4] The main loop.
  46 + *** [5] Carrying on.
  47 +
  48 + ---------------------------------------------------------------------------------------
  49 +
  50 +
  51 +
  52 + First of all, we must access the tool:
  53 +
  54 +read lexical_analysis/fast_lexer.anubis
  55 +read lexical_analysis/dfa_compiler.anubis
  56 +read lexical_analysis/regexpr_parser.anubis
  57 +read lexical_analysis/lexing_stream.anubis
  58 +
  59 +
  60 +
  61 + *** [1] Tokens.
  62 +
  63 + The first thing to do is to define the type for representing tokens since 'fast lexer'
  64 + has a parameter '$Token'. In the case of this example, this type is very simple:
  65 +
  66 +type Token:
  67 + matching(String),
  68 + newline.
  69 +
  70 + since each recognized sequence is just considered as a string. However, we also have to
  71 + recognize newline characters in order to be able to count lines.
  72 +
  73 +
  74 +
  75 +
  76 + *** [2] Preparing the lexer description.
  77 +
  78 + Before you can construct your lexer, you must prepare a 'lexer description'. It's of type:
  79 +
  80 + List(LexerItem(Token))
  81 +
  82 + We have one lexer item for the given regular expression and another one for
  83 + newlines. We also need a third one for ignoring everything else.
  84 +
  85 +
  86 +define List(LexerItem(Token))
  87 + prepare_lexer_description
  88 + (
  89 + String regular_expression
  90 + ) =
  91 + [
  92 + /* recognize sequences matching the given regular expression */
  93 + lexer_item(regular_expression,
  94 + success((ByteArray b) |-> token(matching(to_string(b))))),
  95 +
  96 + /* recognize newline characters */
  97 + lexer_item("\n",
  98 + success((ByteArray b) |-> token(newline))),
  99 +
  100 + /* ignore everything else */
  101 + lexer_item(".", /* "." represents any character except '\n' */
  102 + failure)
  103 + ].
  104 +
  105 +
  106 + The lexer will be constructed below by applying the function 'make_lexer' (declared in
  107 + 'fast_lexer.anubis') to this lexer description.
  108 +
  109 +
  110 +
  111 +
  112 +
  113 + *** [3] Preparing the lexing stream.
  114 +
  115 + Lexical analysis is performed from an input stream (of type 'LexingStream'). In the
  116 + case of this example, the input stream is constructed from the given filename. Of
  117 + course, this may fail, since the file may not exist or may not be readable.
  118 +
  119 +define Maybe(LexingStream)
  120 + prepare_input
  121 + (
  122 + String filename
  123 + ) =
  124 + /* try to open the file ('predefined.anubis' section 5.1) */
  125 + if file(filename,read) is
  126 + {
  127 + failure then failure,
  128 + success(f) then make_lexing_stream(f, /* the opened file */
  129 + 1000, /* size of buffer for the lexing stream */
  130 + 100) /* timeout (seconds) */
  131 + }.
  132 +
  133 +
  134 +
  135 +
  136 +
  137 + *** [4] The main loop.
  138 +
  139 + Assuming our lexer is ready as a function of type 'One -> LexerOutput(Token)' (i.e. the
  140 + lexing stream is already plugged into it), we construct the main loop of this program.
  141 + It consists of calling the lexer repeatedly until it returns 'end_of_input'.
  142 +
  143 + When it returns an error (actually a lexical error), we print this error. However, this
  144 + should never happen, because our lexer has a lexer item for ignoring anything not
  145 + matching one of the first two lexer items.
  146 +
  147 + In this loop we also count lines. There is no need for a Var(Int) for that
  148 + purpose. It's much better to use a 'deterministic local variable' in the form of an
  149 + extra argument to our function. The function will be called with the value 1 for this
  150 + argument, which simulates the initialisation of the variable.
  151 +
  152 +define One
  153 + main_loop
  154 + (
  155 + One -> LexerOutput(Token) lexer,
  156 + Int lineno /* no need for a Var(Int) */
  157 + ) =
  158 + /* get the next token or whatever */
  159 + if lexer(unique) is
  160 + {
  161 + end_of_input then /* no more token: exit the main loop */
  162 + unique,
  163 +
  164 + error(b) then
  165 + /* should never happen with this lexer (see the above comment) */
  166 + print("Error: ["+to_string(b)+"]\n");
  167 + /* nevertheless we continue the lexical analysis */
  168 + main_loop(lexer,lineno),
  169 +
  170 + token(t) then
  171 + /* a token has been recognized */
  172 + if t is
  173 + {
  174 + matching(s) then /* print the current line number and the recognized sequence */
  175 + print(abs_to_decimal(lineno)+": "+s+"\n");
  176 + /* continue with the same lineno */
  177 + main_loop(lexer,lineno),
  178 +
  179 + newline then /* continue with an incremented lineno */
  180 + main_loop(lexer,lineno+1)
  181 + }
  182 + }.
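
 The 'deterministic local variable' idiom used by 'main_loop' can be sketched in
 Python (an illustration only, with made-up names):

```python
# A sketch, in Python, of the "deterministic local variable" idiom used by
# 'main_loop' above (illustrative names only): the line counter is threaded
# through as an extra argument instead of being stored in a mutable variable.
def count_lines(tokens, lineno=1):
    if not tokens:                       # end of input: return the final count
        return lineno
    head, *tail = tokens
    # a newline token increments the counter; anything else leaves it unchanged
    return count_lines(tail, lineno + 1 if head == "\n" else lineno)

print(count_lines(["344", "\n", "+", "\n", "87"]))  # prints: 3
```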
  183 +
  184 +
  185 +
  186 +
  187 +
  188 + *** [5] Carrying on.
  189 +
  190 +
  191 +read tools/basis.anubis (needed for UTime subtraction)
  192 +
  193 +
  194 + Now we can define our tool. We have to:
  195 +
  196 + - check that the user gave the two required arguments on the command line,
  197 + - prepare the lexer description,
  198 + - prepare the input stream,
  199 + - run the main loop.
  200 +
  201 +global define One
  202 + fast_lexer_example_1
  203 + (
  204 + List(String) args
  205 + ) =
  206 + /* check for first argument */
  207 + if args is
  208 + {
  209 + [ ] then print(usage),
  210 + [re . t] then
  211 + /* check for second argument */
  212 + if t is
  213 + {
  214 + [ ] then print(usage),
  215 + [filename . _] then
  216 + /* prepare the lexer description and make the lexer */
  217 + if make_lexer(prepare_lexer_description(re)) is
  218 + {
  219 + error(msg) then print("Syntax error in regular expression: "+to_English(msg)+"\n"),
  220 + ok(lexer) then
  221 + /* prepare the input stream */
  222 + if prepare_input(filename) is
  223 + {
  224 + failure then print("cannot open or read file '"+filename+"'.\n"),
  225 + success(ls) then
  226 + with start_time = unow,
  227 + /* run the main loop */
  228 + main_loop(lexer(ls),1);
  229 + if unow - start_time is utime(secs,microsecs) then
  230 + print("Duration: "+abs_to_decimal(secs)+" seconds, "+abs_to_decimal(microsecs)+" microseconds.\n")
  231 + }
  232 + }
  233 + }
  234 + }.
  235 +
  236 +
  237 +
  238 +
  239 +
  240 +
0 241 \ No newline at end of file
anubis_distrib/library/lexical_analysis/lexer_maker_v2_example.lexer 0 → 100644
  1 +
  2 +
  3 + This is an example of use of 'lexer_maker'.
  4 +
  5 +read tools/basis.anubis
  6 +
  7 +
  8 + We want to test email addresses. Below is a regular expression for that
  9 + purpose. Actually, this expression is too naïve. A real one would be more complicated.
  10 +
  11 +#ETL
  12 +
  13 +#email_tester String
  14 +[a-zA-Z0-9\-_]+(\.[a-zA-Z0-9\-_]+)*@[a-zA-Z0-9]+(\.[a-zA-Z0-9]+)+ { (ls,token(text)) }
  15 +#
  16 +
  17 +
  18 + Since '@' is a normal character, a string needs to contain exactly one '@' in order to
  19 + be accepted. What is accepted before and after this '@' has the simplified form:
  20 +
  21 + [a-zA-Z]+(\.[a-zA-Z]+)*
  22 +
  23 + The first part: [a-zA-Z]+ means ``at least one letter''. The last part: (\.[a-zA-Z]+)*
  24 + means: ``a dot followed by at least one letter, and this may be repeated any number of
  25 + times (including zero)''.
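
 The explanation above can be checked with Python's 're' module, whose character-class
 and repetition syntax matches this fragment (an illustration only, not part of the
 lexer description itself):

```python
import re

# The simplified fragment discussed above: at least one letter, then any
# number of groups of a dot followed by at least one letter.
part = r"[a-zA-Z]+(?:\.[a-zA-Z]+)*"

print(bool(re.fullmatch(part, "alice")))         # True: at least one letter
print(bool(re.fullmatch(part, "mail.example")))  # True: dot groups may repeat
print(bool(re.fullmatch(part, ".example")))      # False: must start with a letter
print(bool(re.fullmatch(part, "a.")))            # False: a dot needs letters after it
```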
  26 +
  27 +
  28 + This part of the source file is the 'postamble' (just Anubis text, which is copied 'as
  29 + is' to the lexer_maker output file).
  30 +
  31 + The above stuff produces a function named 'email_tester' into the lexer_maker output
  32 + file. This function is used below:
  33 +
  34 +global define One
  35 + test_email_address
  36 + (
  37 + List(String) args
  38 + ) =
  39 + if args is
  40 + {
  41 + [ ] then print("Usage: test_email_address <address> ... <address>\n"),
  42 + [_ . _] then
  43 + map_forget((String s) |->
  44 + with ls = lexer_state(make_stream(s),[],[],email_tester,true,false,failure),
  45 + if email_tester(ls) is (_,result) then if result is
  46 + {
  47 + end_of_file then print("End of input.\n"),
  48 + token(t) then with result1 = implode(t),
  49 + if length(result1) = length(s)
  50 + then print(s+" (accepted)\n")
  51 + else print(s+" (truncated as: "+result1+")\n"),
  52 + error then print(s+" (rejected)\n")
  53 + },
  54 + args)
  55 + }.
  56 +
  57 +
  58 +
anubis_distrib/library/lexical_analysis/testing_fast_lexer.anubis 0 → 100644
  1 +
  2 +
  3 +
  4 +
  5 +
  6 +
  7 +
  8 +
  9 + This is just for testing 'fast_lexer.anubis'.
  10 +
  11 +read tools/basis.anubis
  12 +read tools/streams.anubis
  13 +
  14 +read regexpr_parser.anubis
  15 +read dfa_compiler.anubis
  16 +
  17 +
  18 +define String
  19 + format
  20 + (
  21 + DFA_pre_label(String) l
  22 + ) =
  23 + if l is
  24 + {
  25 + char(c) then implode[c],
  26 + beginning_of_line then "^",
  27 + end_of_line then "$",
  28 + action(mbf) then if mbf is
  29 + {
  30 + failure then "<ignore>",
  31 + success(f) then if f(constant_byte_array(0,0)) is
  32 + token(s) then s else alert
  33 + }
  34 + }.
  35 +
  36 +define String
  37 + format
  38 + (
  39 + DFA_label l
  40 + ) =
  41 + if l is
  42 + {
  43 + char(c) then implode[c],
  44 + beginning_of_line then "^",
  45 + end_of_line then "$"
  46 + }.
  47 +
  48 +define Printable_tree
  49 + format
  50 + (
  51 + DFA_transition t
  52 + ) =
  53 + if t is transition(label,target_name) then
  54 + ["'", format(label), "'>", target_name, " "].
  55 +
  56 +
  57 +define Printable_tree
  58 + format
  59 + (
  60 + List(DFA_transition) l
  61 + ) =
  62 + if l is
  63 + {
  64 + [ ] then [ ],
  65 + [h . t] then [format(h) . format(t)]
  66 + }.
  67 +
  68 +define Printable_tree
  69 + format
  70 + (
  71 + DFA_state(String) s
  72 + ) =
  73 + if s is
  74 + {
  75 + rejecting(n,trs) then ["\n", to_decimal(n), " (rejecting) ", format(trs)],
  76 + accepting(n,trs,mba) then ["\n", to_decimal(n), " (accepting) ", format(trs),
  77 + if mba is
  78 + {
  79 + failure then "<ignore>",
  80 + success(a) then "<action "+
  81 + if a(constant_byte_array(0,0)) is
  82 + {
  83 + end_of_input then alert,
  84 + error(_) then alert,
  85 + token(s1) then s1
  86 + }+">"
  87 + }]
  88 + }.
  89 +
  90 +
  91 +define Printable_tree
  92 + format
  93 + (
  94 + List(DFA_state(String)) l
  95 + ) =
  96 + if l is
  97 + {
  98 + [ ] then ["\n------------------------\n"],
  99 + [h . t] then
  100 + [format(h) . format(t)]
  101 + }.
  102 +
  103 +
  104 +define One
  105 + syntax
  106 + =
  107 + print("Usage: fast_lexer_test <regular expression> ... <regular expression>\n\n").
  108 +
  109 +
  110 +define String
  111 + format
  112 + (
  113 + RegExpr e
  114 + ) =
  115 + if e is
  116 + {
  117 + char(Word8 c) then implode([c]),
  118 + choice(l) then "["+implode(l)+"]",
  119 + plus(RegExpr e1) then "("+format(e1)+"+"+")",
  120 + star(RegExpr e1) then "("+format(e1)+"*"+")",
  121 + cat(RegExpr e1,RegExpr e2) then format(e1)+format(e2),
  122 + or(RegExpr e1,RegExpr e2) then "("+format(e1)+"|"+format(e2)+")",
  123 + beginning_of_line then "^",
  124 + end_of_line then "$",
  125 + dot then ".",
  126 + question_mark(e1) then "("+format(e1)+")?"
  127 + }.
  128 +
  129 +
  130 +define String
  131 + format
  132 + (
  133 + BasicRegExpr($Token) e
  134 + ) =
  135 + if e is
  136 + {
  137 + char(c) then implode([c]),
  138 + star(e1) then "("+format(e1)+"*"+")",
  139 + or(e1,e2) then "("+format(e1)+"|"+format(e2)+")",
  140 + cat(e1,e2) then format(e1)+format(e2),
  141 + epsilon then "()",
  142 + beginning_of_line then "^",
  143 + end_of_line then "$",
  144 + action(a) then "<action>"
  145 + }.
  146 +
  147 +
  148 +define List(LexerItem(String))
  149 + prepare_lexer_items
  150 + (
  151 + List(String) regexprs,
  152 + Int i
  153 + ) =
  154 + if regexprs is
  155 + {
  156 + [ ] then [ ],
  157 + [h . t] then
  158 + [lexer_item(h,success((ByteArray b) |-> token(to_decimal(i))))
  159 + . prepare_lexer_items(t,i+1)]
  160 + }.
  161 +
  162 +
  163 +define Printable_tree
  164 + format
  165 + (
  166 + List(FastLexerTransition) l
  167 + ) =
  168 + if l is
  169 + {
  170 + [ ] then [ ],
  171 + [h . t] then if h is
  172 + {
  173 + transition(c,s) then
  174 + [implode[c], ":", s, " " . format(t)],
  175 + beginning_of_line(s) then
  176 + ["^:",s, " " . format(t)],
  177 + end_of_line(s) then
  178 + ["$:",s, " " . format(t)]
  179 + }
  180 + }.
  181 +
  182 +define Printable_tree
  183 + format
  184 + (
  185 + List(FastLexerState) l,
  186 + Int i
  187 + ) =
  188 + if l is
  189 + {
  190 + [ ] then ["\n------------------------\n"],
  191 + [h . t] then if h is
  192 + {
  193 + rejecting(trs) then ["\n", i, " rejecting: ", format(trs) . format(t,i+1)],
  194 + accepting(trs) then ["\n", i, " accepting: ", format(trs) . format(t,i+1)]
  195 + }
  196 + }.
  197 +
  198 +
  199 +
  200 +
  201 +define One
  202 + run_fast_lexer
  203 + (
  204 + (ByteArray input,
  205 + FastLexerLastAccepted last_accepted,
  206 + FastLexerBeginningOfLine bol,
  207 + FastLexerEndOfLine eol,
  208 + Int position,
  209 + Word32 starting_state) -> FastLexerOutput fast
  210 + ) =
  211 + with text = prompt("Try it out (q to quit): ") + "\n",
  212 + if text = "q\n" then unique else
  213 + with ba = to_byte_array(text),
  214 + if fast(ba,
  215 + none,
  216 + at_beginning_of_line,
  217 + not_at_end_of_line,
  218 + 0,
  219 + 0) is
  220 + {
  221 + rejected(n,e,a) then print("\""+to_string(extract(ba,0,e))+
  222 + "\" rejected in state "+to_decimal(n)+"\n"),
  223 + accepted(n,e,a) then print("\""+to_string(extract(ba,0,e))+
  224 + "\" accepted in state "+to_decimal(n)+"\n")
  225 + };
  226 + run_fast_lexer(fast).
  227 +
  228 +
  229 +define One
  230 + run_fast_lexer
  231 + (
  232 + List(FastLexerState) l
  233 + ) =
  234 + if make_fast_lexer(l) is
  235 + {
  236 + unknown_state(n) then print("\nUnknown state: "+to_decimal(n)),
  237 + ok(fast) then run_fast_lexer(fast)
  238 + }.
  239 +
  240 +
  241 +global define One
  242 + fast_lexer_test
  243 + (
  244 + List(String) args
  245 + ) =
  246 + if args is [] then syntax else
  247 + map_forget((String e) |-> if parse_regular_expression(make_stream(e)) is
  248 + {
  249 + error(msg) then print("*** Error: "+to_English(msg)+"\n\n"),
  250 + ok(re) then print("Regular expression "+e+" is correct.\n");
  251 + print("Read as: "+format(re)+"\n");
  252 + print("Basic equivalent: "+
  253 + format((BasicRegExpr(String))to_basic(re))+"\n\n")
  254 + },
  255 + args);
  256 + if make_DFA(prepare_lexer_items(args,0)) is
  257 + {
  258 + error(msg) then print("*** Error: "+to_English(msg)+"\n\n"),
  259 + ok(auto) then with fl = to_fast_lexer_description(auto),
  260 + print("Automaton:\n------------------------ ");
  261 + print(format(auto));
  262 + print("Fast Lexer:\n------------------------ ");
  263 + print(format(fl,0));
  264 + run_fast_lexer(fl)
  265 + }.
  266 +
  267 +
  268 +
  269 +
  270 +
  271 +
  272 +
0 273 \ No newline at end of file