Commit 60ef83694df06a3e880e457cebe9ccaeff00b39e
1 parent 0bc83921
Updated directory lexical_analysis in anubis_distrib/library/
Showing 4 changed files with 2666 additions and 0 deletions
anubis_distrib/library/lexical_analysis/fast_lexer.anubis
0 → 100644
| 1 | + | |
| 2 | + | |
| 3 | + | |
| 4 | + The Anubis Project | |
| 5 | + | |
| 6 | + A tool for producing fast buffered lexers. | |
| 7 | + | |
| 8 | + Copyright (c) Constructive Mathematics 2008. | |
| 9 | + | |
| 10 | + | |
| 11 | + Author: Alain Prouté | |
| 12 | + | |
| 13 | + | |
| 14 | + | |
| 15 | + *** Introduction. | |
| 16 | + | |
| 17 | + This tool is more or less equivalent to the Unix tool LEX/FLEX. It replaces the | |
| 18 | + previous version of this (or at least a similar) tool, 'lexer_maker_v2.anubis', which | |
| 19 | + produced lexers that were too slow and is now obsolete. | |
| 20 | + | |
| 21 | + If you want to use this tool, you will have to add: | |
| 22 | + | |
| 23 | + read lexical_analysis/fast_lexer.anubis | |
| 24 | + | |
| 25 | + into your source file. | |
| 26 | + | |
| 27 | + | |
| 28 | + Consider a 'source' from which bytes can be read, such as a file, a network connection | |
| 29 | + (maybe an SSL connection), a string, a byte array, etc. There are tools for | |
| 30 | + getting the bytes from this source one after the other, but in general we are more | |
| 31 | + interested in particular sequences of bytes, which are called 'tokens'. As an | |
| 32 | + example, if the source is the following string: | |
| 33 | + | |
| 34 | + "344 + 87" | |
| 35 | + | |
| 36 | + we prefer to read the three 'tokens': "344", "+" and "87" directly (ignoring white | |
| 37 | + spaces) rather than the sequence of bytes '3', '4', '4', ' ', '+', ' ', '8' and '7'. | |
| 38 | + | |
| 39 | + A 'lexer' is precisely the gadget which does this job easily and fast (and even | |
| 40 | + better than described above). It uses lexing streams, which are buffered for | |
| 41 | + better performance. | |
| 42 | + | |
| 43 | + | |
| 44 | + | |
| 45 | + ---------------------------------- Table of Contents ---------------------------------- | |
| 46 | + | |
| 47 | + *** (1) Regular expressions. | |
| 48 | + *** (2) Lexer output. | |
| 49 | + *** (3) Lexing streams. | |
| 50 | + *** (4) Constructing a lexer. | |
| 51 | + *** (5) Plugging several lexers on the same input stream. | |
| 52 | + | |
| 53 | + --------------------------------------------------------------------------------------- | |
| 54 | + | |
| 55 | + | |
| 56 | + | |
| 57 | + | |
| 58 | + *** (1) Regular expressions. | |
| 59 | + | |
| 60 | + Regular expressions are character strings which are used for describing particular sets | |
| 61 | + of tokens. Regular expressions are written using ASCII characters, but some of them | |
| 62 | + have a special meaning. They are the following: | |
| 63 | + | |
| 64 | + ( ) [ ] - \ * + | . $ ^ ? | |
| 65 | + | |
| 66 | + All other characters just represent themselves. For example, the regular expression | |
| 67 | + 'abcd' represents only the token 'abcd'. | |
| 68 | + | |
| 69 | + Parentheses do not represent anything. They are just used for delimiting regular | |
| 70 | + expressions. For example '(abcd)' represents the same thing as 'abcd'. | |
| 71 | + | |
| 72 | + The regular expression '[abcd]' represents the 4 tokens: 'a', 'b', 'c' and 'd'. In | |
| 73 | + other words, characters between brackets represent all the tokens made of one and only | |
| 74 | + one of these characters. There is a shortcut for ranges of characters. Instead of | |
| 75 | + writing | |
| 76 | + | |
| 77 | + [abcdefghijklmnopqrstuvwxyz] | |
| 78 | + | |
| 79 | + you may just write '[a-z]'. For example, the regular expression '[a-zA-Z0-9]' | |
| 80 | + represents any token made of one and only one alphanumeric character. | |
| 81 | + | |
| 82 | + If you add a caret just after the opening bracket, the regular expression represents | |
| 83 | + all one-byte tokens for all bytes not present within the brackets (i.e. the | |
| 84 | + 'complement', in some sense, of the previous set). For example, the regular expression | |
| 85 | + '[^a-z]' represents all one-byte tokens whose unique character is not a lower case | |
| 86 | + letter. Note: a byte is any Word8, so that '[^a-z]' also matches characters of code | |
| 87 | + above 127. | |
| 88 | + | |
| 89 | + If 'A' is a regular expression, 'A+' represents any non empty concatenation of tokens | |
| 90 | + represented by 'A'. For example, '[a-z]+' represents any non empty sequence of | |
| 91 | + lowercase letters. Similarly, 'A*' represents all the tokens represented by 'A+', plus | |
| 92 | + the empty token (the token made of no character at all). | |
| 93 | + | |
| 94 | + If 'A' and 'B' are regular expressions, 'AB' is a regular expression representing any | |
| 95 | + concatenation of a token represented by 'A' and a token represented by 'B'. For | |
| 96 | + example, 'a+b+' represents any non empty sequence of 'a' followed by any non empty | |
| 97 | + sequence of 'b'. As another example, '[A-Z][A-Za-z]*' represents any sequence of | |
| 98 | + letters beginning with an upper case letter (hence actually non empty). | |
| 99 | + | |
| 100 | + The backslash character quotes the subsequent character. For example the regular | |
| 101 | + expression '\(' represents the token made of the single character '('. Of course, this | |
| 102 | + is useful for special characters. However, the sequences '\n', '\r' and '\t' represent | |
| 103 | + respectively a line feed, a carriage return and a tab character. | |
| 104 | + | |
| 105 | + If 'A' and 'B' are regular expressions, 'A|B' is a regular expression representing all | |
| 106 | + the tokens represented by 'A' and all the tokens represented by 'B'. For example, | |
| 107 | + '(a+)|(b+)' represents all non empty sequences containing either only a's or only b's. | |
| 108 | + | |
| 109 | + The dot '.' represents any character except '\n'. | |
| 110 | + | |
| 111 | + If 'A' is a regular expression, '^A' represents any token represented by 'A' provided | |
| 112 | + that it appears at the beginning of a line. Similarly, 'A$' represents any token | |
| 113 | + represented by 'A' provided that it ends at the end of a line. For example the regular | |
| 114 | + expression '//.*$' matches a one line Anubis (or C++) comment, and the regular | |
| 115 | + expression '^define' matches the keyword 'define' only when it is found in the leftmost | |
| 116 | + column. | |
| 117 | + | |
| 118 | + If 'A' is a regular expression, 'A?' represents all the tokens represented by 'A' plus | |
| 119 | + the empty token. | |
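The operators described above largely coincide with those of Python's `re` module, which can therefore be used to check one's reading of the examples. This is an illustrative sketch only: the real tool works on bytes, while Python `re` works on strings (Python's `.` likewise matches any character except '\n').

```python
import re

# '[a-zA-Z0-9]' matches exactly one alphanumeric character.
assert re.fullmatch(r"[a-zA-Z0-9]", "q")
assert not re.fullmatch(r"[a-zA-Z0-9]", "qq")

# '[^a-z]' matches any single character that is not a lower case letter.
assert re.fullmatch(r"[^a-z]", "Q")
assert not re.fullmatch(r"[^a-z]", "q")

# '[A-Z][A-Za-z]*' matches a letter sequence beginning with an upper case letter.
assert re.fullmatch(r"[A-Z][A-Za-z]*", "Word")

# '(a+)|(b+)' matches non-empty runs made only of a's, or only of b's.
assert re.fullmatch(r"(a+)|(b+)", "aaa")
assert not re.fullmatch(r"(a+)|(b+)", "ab")
```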
| 120 | + | |
| 121 | + | |
| 122 | + When you construct a lexer you provide one or several regular expressions. These regular | |
| 123 | + expressions may be syntactically incorrect. For this reason, we have the following type | |
| 124 | + for classifying the possible errors: | |
| 125 | + | |
| 126 | +public type RegExprError: | |
| 127 | + premature_end_of_regexpr, | |
| 128 | + unexpected_right_par, | |
| 129 | + unexpected_right_bracket, | |
| 130 | + regexpr_is_empty, | |
| 131 | + star_not_following_a_regexpr, | |
| 132 | + plus_not_following_a_regexpr, | |
| 133 | + question_mark_not_following_a_regexpr, | |
| 134 | + non_character_within_brackets, | |
| 135 | + misplaced_hyphen, | |
| 136 | + unexpected_vbar, | |
| 137 | + empty_lexer_description. | |
| 138 | + | |
| 139 | + | |
| 140 | + For your convenience, the next function transforms such an error into a message in | |
| 141 | + English. | |
| 142 | + | |
| 143 | +public define String | |
| 144 | + to_English | |
| 145 | + ( | |
| 146 | + RegExprError e | |
| 147 | + ). | |
| 148 | + | |
| 149 | + | |
| 150 | + | |
| 151 | + | |
| 152 | + *** (2) Lexer output. | |
| 153 | + | |
| 154 | + A single lexer may recognize different sorts of tokens. For example, a lexer may | |
| 155 | + recognize 'symbols' (represented say by the regular expression '[a-zA-Z]+'), and | |
| 156 | + integers (represented say by the regular expression '[0-9]+'). The role of the lexer is | |
| 157 | + not only to recognize such tokens, but also to return them in such a way that their | |
| 158 | + sort is obvious. For this reason, it is convenient to define a type of tokens with one | |
| 159 | + alternative for each sort of token. In the case of our example, this type could be: | |
| 160 | + | |
| 161 | + type Token: | |
| 162 | + symbol(String name), | |
| 163 | + integer(Int value). | |
| 164 | + | |
| 165 | + The type of tokens for a given lexer is represented in this file by the type parameter | |
| 166 | + '$Token'. A lexer returns a datum of type: | |
| 167 | + | |
| 168 | +public type LexerOutput($Token): | |
| 169 | + end_of_input, | |
| 170 | + error(ByteArray), | |
| 171 | + token($Token). | |
| 172 | + | |
| 173 | + The lexer returns 'end_of_input' when there is no hope that a next token may be read | |
| 174 | + from the input source. In the case of a file this means that the end of the file has | |
| 175 | + been reached. In the case of a network connection, this means that the connection has | |
| 176 | + been closed or that the timeout has expired. In the case of a string or a byte array, this means | |
| 177 | + that the end of the string or byte array has been reached. | |
| 178 | + | |
| 179 | + The lexer returns 'error(b)' when no token can be read from the input (but the end of | |
| 180 | + the input has not been reached). Some bytes may have been read from the input: the | |
| 181 | + longest prefix that could still have been the beginning of a token, up to the first | |
| 182 | + byte which cannot be part of a token. The next time the lexer is called, it will | |
| 183 | + continue reading after this sequence. | |
| 184 | + | |
| 185 | + When a token has been recognized, the lexer has the token at its disposal in the form | |
| 186 | + of a byte array. In order to transform this byte array into a datum of type '$Token' | |
| 187 | + you have to provide a function of type 'ByteArray -> LexerOutput($Token)'. For | |
| 188 | + example, if a 'symbol' is to be recognized, the corresponding function could be | |
| 189 | + something like this: | |
| 190 | + | |
| 191 | + (ByteArray b) |-> token(symbol(to_string(b))) | |
| 192 | + | |
| 193 | + If an integer is to be recognized, the corresponding function could be: | |
| 194 | + | |
| 195 | + (ByteArray b) |-> if decimal_scan(to_string(b)) is | |
| 196 | + { | |
| 197 | + failure then error(b), | |
| 198 | + success(n) then token(integer(n)) | |
| 199 | + } | |
| 200 | + | |
| 201 | + So, in the case of our example (using the type 'Token' above), the lexer may be | |
| 202 | + described by the following list of 'lexer items': | |
| 203 | + | |
| 204 | + [ | |
| 205 | + lexer_item("[A-Za-z]+", | |
| 206 | + success((ByteArray b) |-> token(symbol(to_string(b))))), | |
| 207 | + lexer_item("[0-9]+", | |
| 208 | + success((ByteArray b) |-> if decimal_scan(to_string(b)) is | |
| 209 | + { | |
| 210 | + failure then error(b), | |
| 211 | + success(n) then token(integer(n)) | |
| 212 | + })) | |
| 213 | + ] | |
| 214 | + | |
| 215 | + where the type 'LexerItem($Token)' is defined as follows: | |
| 216 | + | |
| 217 | +public type LexerItem($Token): | |
| 218 | + lexer_item(String regular_expression, | |
| 219 | + Maybe(ByteArray -> LexerOutput($Token)) action). | |
| 220 | + | |
| 221 | + If you don't provide a function in a lexer item (using 'failure' instead of 'success'), | |
| 222 | + the recognized token is just ignored and the lexer tries to read the next token. | |
| 223 | + | |
| 224 | + Notice that the most usual use of a lexer is to call it repeatedly until it returns | |
| 225 | + 'end_of_input'. However, in some circumstances, we want to check for example if a whole | |
| 226 | + string matches a regular expression. In this case the lexer is called a first time, and | |
| 227 | + if it returns a token it must be called a second time in order to check that we have | |
| 228 | + reached the end of the input. | |
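The behavior described above can be simulated in Python. This sketch uses Python regular expressions in place of the generated lexer, plus a hypothetical extra item for operators so that the introduction's "344 + 87" example goes through; an item with no action (None here, 'failure' in the real tool) silently skips its token.

```python
import re

# Each (pattern, action) pair plays the role of a lexer_item; action None
# corresponds to 'failure': the matched token is skipped.
ITEMS = [
    (re.compile(r"[A-Za-z]+"),  lambda b: ("symbol", b)),
    (re.compile(r"[0-9]+"),     lambda b: ("integer", int(b))),
    (re.compile(r"[+\-*/]"),    lambda b: ("op", b)),   # hypothetical extra item
    (re.compile(r"[ \t\r\n]+"), None),                  # whitespace is ignored
]

def lex(source):
    """Call the lexer repeatedly until the input is exhausted."""
    pos, tokens = 0, []
    while pos < len(source):
        for rx, action in ITEMS:
            m = rx.match(source, pos)
            if m:
                if action is not None:
                    tokens.append(action(m.group()))
                pos = m.end()
                break
        else:
            return ("error", source[pos])  # no item matches: error(b)
    return ("tokens", tokens)
```

With these items, `lex("344 + 87")` yields `("tokens", [("integer", 344), ("op", "+"), ("integer", 87)])`, matching the introduction.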
| 229 | + | |
| 230 | + | |
| 231 | + | |
| 232 | + | |
| 233 | + *** (3) Lexing streams. | |
| 234 | + | |
| 235 | + The lexer recognizes tokens by reading characters from some input. The actual input may | |
| 236 | + be either a file, a network connection, a string, a byte array, or anything able to | |
| 237 | + provide characters. From any of the above you may construct a 'lexing stream'. | |
| 238 | + | |
| 239 | +public type LexingStream:... (an opaque type) | |
| 240 | + | |
| 241 | +public define LexingStream make_lexing_stream(ByteArray b). | |
| 242 | +public define LexingStream make_lexing_stream(String s). | |
| 243 | +public define Maybe(LexingStream) make_lexing_stream(RStream stream, | |
| 244 | + Int buffer_size, | |
| 245 | + Int timeout). | |
| 246 | +public define Maybe(LexingStream) make_lexing_stream(RWStream stream, | |
| 247 | + Int buffer_size, | |
| 248 | + Int timeout). | |
| 249 | +public define Maybe(LexingStream) make_lexing_stream(SSL_Connection stream, | |
| 250 | + Int buffer_size, | |
| 251 | + Int timeout). | |
| 252 | + | |
| 253 | + In the case of a file or network connection (first argument of type 'RStream', | |
| 254 | + 'RWStream', 'SSL_Connection') byte arrays are used for buffering the input. The maximal | |
| 255 | + size of these buffers must be provided as the second argument. The choice has no | |
| 256 | + effect on the behavior of the lexer, except with respect to performance, and the | |
| 257 | + lexer can still return tokens longer than this size. The timeout is in seconds and is | |
| 258 | + used each time the buffer is reloaded from the actual input. When the timeout expires, | |
| 259 | + the lexer gives up as if the end of the input had been reached. So, you may have to | |
| 260 | + give a rather high value to this timeout. | |
| 261 | + | |
| 262 | + 'make_lexing_stream' returns 'failure' if a read error or timeout occurs when the | |
| 263 | + buffer is loaded for the first time. | |
| 264 | + | |
| 265 | + In the case of a byte array or a string, the situation is much simpler. The buffer is | |
| 266 | + the byte array or the string itself, no timeout is needed and the result has no | |
| 267 | + 'Maybe'. | |
| 268 | + | |
| 269 | + If you need another kind of lexing stream, have a look at the private part of this | |
| 270 | + file, in particular at the actual definition of type 'LexingStream', and write down | |
| 271 | + another such function. | |
| 272 | + | |
| 273 | + | |
| 274 | + | |
| 275 | + | |
| 276 | + *** (4) Constructing a lexer. | |
| 277 | + | |
| 278 | + In order to construct a lexer use the following: | |
| 279 | + | |
| 280 | +public define Result(RegExprError, LexingStream -> One -> LexerOutput($Token)) | |
| 281 | + make_lexer | |
| 282 | + ( | |
| 283 | + List(LexerItem($Token)) lexer_description | |
| 284 | + ). | |
| 285 | + | |
| 286 | + Thus, a lexer is constructed (if no error occurs) as a function of type: | |
| 287 | + | |
| 288 | + LexingStream -> One -> LexerOutput($Token) | |
| 289 | + | |
| 290 | + Applying this function to a lexing stream is understood as 'plugging' it to the | |
| 291 | + stream. The result is a function of type: | |
| 292 | + | |
| 293 | + One -> LexerOutput($Token) | |
| 294 | + | |
| 295 | + to be used repeatedly until it returns 'end_of_input'. | |
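The curried shape can be mimicked in Python: applying the lexer to a lexing stream ('plugging') yields a zero-argument function that is then called repeatedly. The `toy_lexer` below is a hypothetical stand-in that splits on whitespace rather than running a real generated lexer.

```python
def plug(lexer, stream):
    # Partially apply the lexer to the stream; the result is the analogue
    # of the One -> LexerOutput($Token) function.
    it = iter(lexer(stream))
    def next_token():
        return next(it, "end_of_input")
    return next_token

# Hypothetical lexer: splits on whitespace instead of running a DFA.
toy_lexer = str.split

read = plug(toy_lexer, "344 + 87")
assert read() == "344"
assert read() == "+"
assert read() == "87"
assert read() == "end_of_input"
```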
| 296 | + | |
| 297 | + | |
| 298 | + | |
| 299 | + | |
| 300 | + | |
| 301 | + *** (5) Plugging several lexers on the same input. | |
| 302 | + | |
| 303 | + It is often the case that we have to use several lexers on the same input. This is | |
| 304 | + equivalent to saying that we have only one lexer on this input, but with several | |
| 305 | + different 'states' in the sense of LEX/FLEX for example. In our system there is no | |
| 306 | + notion of 'state' for lexers, but several lexers may use the same lexing stream | |
| 307 | + concurrently. You can plug them to the same lexing stream, and use them repeatedly in | |
| 308 | + any order depending on the sort of thing you want to read from the stream. | |
| 309 | + | |
| 310 | + | |
| 311 | + | |
| 312 | + | |
| 313 | + | |
| 314 | + --- That's all for the public part ! -------------------------------------------------- | |
| 315 | + | |
| 316 | + | |
| 317 | +read tools/basis.anubis | |
| 318 | +read tools/streams.anubis | |
| 319 | + | |
| 320 | + | |
| 321 | + -------------------------------- Table of Contents ------------------------------------ | |
| 322 | + | |
| 323 | + | |
| 324 | + --------------------------------------------------------------------------------------- | |
| 325 | + | |
| 326 | + | |
| 327 | + | |
| 328 | + | |
| 329 | + *** [1] Parsing regular expressions. | |
| 330 | + | |
| 331 | + | |
| 332 | + *** [1.1] Regular expressions. | |
| 333 | + | |
| 334 | + Regular expressions are formalized as follows. | |
| 335 | + | |
| 336 | +public type RegExpr: | |
| 337 | + char(Word8), // a | |
| 338 | + choice(List(Word8)), // [abc] | |
| 339 | + plus(RegExpr), // a+ | |
| 340 | + star(RegExpr), // a* | |
| 341 | + cat(RegExpr,RegExpr), // ab | |
| 342 | + or(RegExpr,RegExpr), // (a|b) | |
| 343 | + beginning_of_line, // ^ | |
| 344 | + end_of_line, // $ | |
| 345 | + dot, // . | |
| 346 | + question_mark(RegExpr). // a? | |
| 347 | + | |
| 348 | + | |
| 349 | + | |
| 350 | + *** [1.2] Basic regular expressions. | |
| 351 | + | |
| 352 | + Basic regular expressions are enough for representing all regular expressions. In other | |
| 353 | + words any regular expression is equivalent to a basic regular expression. Furthermore, | |
| 354 | + at some point in the construction of lexers we have to handle 'actions'. We introduce | |
| 355 | + them here even if we generate them only in 'dfa_compiler.anubis'. This also makes the | |
| 356 | + type 'LexerOutput($Token)' required at this point. | |
| 357 | + | |
| 358 | +public type BasicRegExpr($Token): | |
| 359 | + char(Word8), | |
| 360 | + star(BasicRegExpr($Token)), | |
| 361 | + or(BasicRegExpr($Token),BasicRegExpr($Token)), | |
| 362 | + cat(BasicRegExpr($Token),BasicRegExpr($Token)), | |
| 363 | + epsilon, // matches the empty sequence of characters | |
| 364 | + beginning_of_line, | |
| 365 | + end_of_line, | |
| 366 | + action(Maybe(ByteArray -> LexerOutput($Token))). | |
| 367 | + | |
| 368 | + The role of 'epsilon', which matches only the empty lexeme, is to provide a | |
| 369 | + representation for the empty choice '[]', and for regular expressions of the form 'A?', | |
| 370 | + which are translated into 'or(A,epsilon)'. | |
| 371 | + | |
| 372 | + The following function transforms a regular expression into an equivalent basic regular | |
| 373 | + expression. | |
| 374 | + | |
| 375 | +public define BasicRegExpr($Token) | |
| 376 | + to_basic | |
| 377 | + ( | |
| 378 | + RegExpr e | |
| 379 | + ). | |
| 380 | + | |
| 381 | + | |
| 382 | + | |
| 383 | + *** [1.3] 'Extended' characters. | |
| 384 | + | |
| 385 | + 'Extended' characters (used in regular expressions) are defined (and classified) as | |
| 386 | + follows. | |
| 387 | + | |
| 388 | +type ExChar: | |
| 389 | + left_par, // ( | |
| 390 | + right_par, // ) | |
| 391 | + left_bracket, // [ | |
| 392 | + right_bracket, // ] | |
| 393 | + star, // * | |
| 394 | + plus, // + | |
| 395 | + or, // | | |
| 396 | + dot, // . | |
| 397 | + dollar, // $ | |
| 398 | + caret, // ^ | |
| 399 | + hyphen, // - | |
| 400 | + question_mark, // ? | |
| 401 | + char(Word8). // a, b, c, ... | |
| 402 | + | |
| 403 | + | |
| 404 | + | |
| 405 | + | |
| 406 | + *** [1.4] Getting the next (extended) character from the input stream. | |
| 407 | + | |
| 408 | + The next function reads an extended character from the input stream. It returns | |
| 409 | + 'failure' when it encounters the end of the input. | |
| 410 | + | |
| 411 | +define Maybe(ExChar) | |
| 412 | + next_exchar | |
| 413 | + ( | |
| 414 | + Stream s | |
| 415 | + ) = | |
| 416 | + if read_byte(s) is | |
| 417 | + { | |
| 418 | + failure then failure, | |
| 419 | + success(c) then | |
| 420 | + if c = '\' | |
| 421 | + then if read_byte(s) is | |
| 422 | + { | |
| 423 | + failure then failure, | |
| 424 | + success(d) then | |
| 425 | + if d = 'n' then success(char('\n')) else | |
| 426 | + if d = 'r' then success(char('\r')) else | |
| 427 | + if d = 't' then success(char('\t')) else | |
| 428 | + success(char(d)) | |
| 429 | + } | |
| 430 | + else if c = '(' then success(left_par) | |
| 431 | + else if c = ')' then success(right_par) | |
| 432 | + else if c = '[' then success(left_bracket) | |
| 433 | + else if c = ']' then success(right_bracket) | |
| 434 | + else if c = '|' then success(or) | |
| 435 | + else if c = '*' then success(star) | |
| 436 | + else if c = '+' then success(plus) | |
| 437 | + else if c = '.' then success(dot) | |
| 438 | + else if c = '$' then success(dollar) | |
| 439 | + else if c = '^' then success(caret) | |
| 440 | + else if c = '-' then success(hyphen) | |
| 441 | + else if c = '?' then success(question_mark) | |
| 442 | + else success(char(c)) | |
| 443 | + }. | |
| 444 | + | |
| 445 | + | |
| 446 | + | |
| 447 | + | |
| 448 | + | |
| 449 | + | |
| 450 | + *** [1.5] Tools. | |
| 451 | + | |
| 452 | + *** [1.5.1] Truncating a Word32 to a Word8. | |
| 453 | + | |
| 454 | +define Word8 | |
| 455 | + truncate_to_Word8 | |
| 456 | + ( | |
| 457 | + Word32 x | |
| 458 | + ) = | |
| 459 | + if x is word32(l1,_) then if l1 is word16(l2,_) then l2. | |
| 460 | + | |
| 461 | + | |
| 462 | + | |
| 463 | + *** [1.5.2] Creating a range of consecutive characters. | |
| 464 | + | |
| 465 | + Given a first character and a last character, create the list of all characters between | |
| 466 | + these two (included). | |
| 467 | + | |
| 468 | +define List(Word8) | |
| 469 | + range | |
| 470 | + ( | |
| 471 | + Word8 a, | |
| 472 | + Word8 z | |
| 473 | + ) = | |
| 474 | + if z = a then [a] else [a . range(a+1,z)]. | |
| 475 | + | |
| 476 | + | |
| 477 | + | |
| 478 | + | |
| 479 | + *** [1.5.3] Computing the complement of a set of characters. | |
| 480 | + | |
| 481 | + Compute the 'complement' of a choice, i.e. the list of all characters which do not | |
| 482 | + belong to the given choice. | |
| 483 | + | |
| 484 | +define List(Word8) | |
| 485 | + complement_choice | |
| 486 | + ( | |
| 487 | + List(Word8) l, | |
| 488 | + List(Word8) result, | |
| 489 | + Word32 n | |
| 490 | + ) = | |
| 491 | + if n = -1 then result else | |
| 492 | + with c = truncate_to_Word8(n), | |
| 493 | + if member(l,c) | |
| 494 | + then complement_choice(l,result,n-1) | |
| 495 | + else complement_choice(l,[c . result],n-1). | |
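For illustration, the same computation in Python (a sketch: the original recurses from 255 down with an accumulator, here a comprehension over all byte values gives the same set).

```python
def complement_choice(chosen):
    # All byte values 0..255 that do not belong to the given choice.
    excluded = set(chosen)
    return [b for b in range(256) if b not in excluded]

lower = list(range(ord('a'), ord('z') + 1))   # the choice [a-z]
comp = complement_choice(lower)               # the choice [^a-z]
assert len(comp) == 256 - 26
assert ord('A') in comp and ord('q') not in comp
```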
| 496 | + | |
| 497 | + | |
| 498 | + | |
| 499 | + | |
| 500 | + | |
| 501 | + *** [1.5.4] Concatenating a list of regular expressions (in reverse order). | |
| 502 | + | |
| 503 | + Concatenate a (non empty) list of RegExpr in reverse order: | |
| 504 | + | |
| 505 | +define RegExpr | |
| 506 | + cat_list | |
| 507 | + ( | |
| 508 | + RegExpr last, | |
| 509 | + List(RegExpr) others | |
| 510 | + ) = | |
| 511 | + if others is | |
| 512 | + { | |
| 513 | + [ ] then last, | |
| 514 | + [h . t] then cat(cat_list(h,t),last) | |
| 515 | + }. | |
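Because the 'already_read' lists built by the parser hold expressions most-recent-first, cat_list rebuilds the left-to-right concatenation. A Python sketch, with tuples standing in for the RegExpr constructors:

```python
def cat_list(last, others):
    # 'others' holds the earlier expressions in reverse order.
    if not others:
        return last
    h, *t = others
    return ("cat", cat_list(h, t), last)

# For the input "abc", already_read would be ['c', 'b', 'a'], i.e. last = 'c':
assert cat_list("c", ["b", "a"]) == ("cat", ("cat", "a", "b"), "c")
```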
| 516 | + | |
| 517 | + | |
| 518 | + | |
| 519 | + | |
| 520 | + *** [1.5.5] Reading a 'choice' of characters. | |
| 521 | + | |
| 522 | + Reading a 'choice', i.e. the characters within square brackets. | |
| 523 | + | |
| 524 | +define Result(RegExprError,List(Word8)) | |
| 525 | + read_choice | |
| 526 | + ( | |
| 527 | + Stream s, | |
| 528 | + List(Word8) already_read | |
| 529 | + ) = | |
| 530 | + if next_exchar(s) is | |
| 531 | + { | |
| 532 | + failure then error(premature_end_of_regexpr), | |
| 533 | + success(x) then | |
| 534 | + if x is right_bracket then ok(already_read) else | |
| 535 | + if x is char(c) then read_choice(s,[c . already_read]) else | |
| 536 | + if x is hyphen then | |
| 537 | + if already_read is | |
| 538 | + { | |
| 539 | + [ ] then error(misplaced_hyphen), | |
| 540 | + [a . others] then | |
| 541 | + if next_exchar(s) is | |
| 542 | + { | |
| 543 | + failure then error(premature_end_of_regexpr), | |
| 544 | + success(y) then | |
| 545 | + if y is char(z) | |
| 546 | + then read_choice(s,reverse_append(range(a,z),others)) | |
| 547 | + else error(non_character_within_brackets) | |
| 548 | + } | |
| 549 | + } | |
| 550 | + else error(non_character_within_brackets) | |
| 551 | + }. | |
| 552 | + | |
| 553 | + | |
| 554 | + | |
| 555 | + | |
| 556 | + | |
| 557 | + *** [1.5.6] Reading a complemented 'choice' of characters. | |
| 558 | + | |
| 559 | + The same one but giving the complement of the 'choice'. | |
| 560 | + | |
| 561 | +define Result(RegExprError,List(Word8)) | |
| 562 | + read_counter_choice | |
| 563 | + ( | |
| 564 | + Stream s, | |
| 565 | + List(Word8) already_read | |
| 566 | + ) = | |
| 567 | + if read_choice(s,already_read) is | |
| 568 | + { | |
| 569 | + error(msg) then error(msg), | |
| 570 | + ok(l) then ok(complement_choice(l,[],255)) | |
| 571 | + }. | |
| 572 | + | |
| 573 | + | |
| 574 | + | |
| 575 | + | |
| 576 | + *** [1.5.7] Reading a 'choice' (general case). | |
| 577 | + | |
| 578 | + The following function is called when a left bracket has been read. It reads extended | |
| 579 | + characters until the right bracket is found. | |
| 580 | + | |
| 581 | +define Result(RegExprError,List(Word8)) | |
| 582 | + read_within_brackets | |
| 583 | + ( | |
| 584 | + Stream s | |
| 585 | + ) = | |
| 586 | + if next_exchar(s) is | |
| 587 | + { | |
| 588 | + failure then error(premature_end_of_regexpr), | |
| 589 | + success(x) then | |
| 590 | + if x = caret | |
| 591 | + then read_counter_choice(s,[]) | |
| 592 | + else if x is char(c) then read_choice(s,[c]) | |
| 593 | + else error(non_character_within_brackets) | |
| 594 | + }. | |
| 595 | + | |
| 596 | + | |
| 597 | + | |
| 598 | + | |
| 599 | + | |
| 600 | + | |
| 601 | + *** [1.6] Reading a regular expression. | |
| 602 | + | |
| 603 | + | |
| 604 | + | |
| 605 | + *** [1.6.1] Right delimiters. | |
| 606 | + | |
| 607 | +type RightDelimiter: | |
| 608 | + right_par, | |
| 609 | + end_of_regexpr. | |
| 610 | + | |
| 611 | + | |
| 612 | + | |
| 613 | + | |
| 614 | + *** [1.6.2] Recursive reading. | |
| 615 | + | |
| 616 | +define Result(RegExprError,RegExpr) | |
| 617 | + read_regexpr | |
| 618 | + ( | |
| 619 | + Stream s, | |
| 620 | + List(RegExpr) already_read, | |
| 621 | + RightDelimiter delim | |
| 622 | + ) = | |
| 623 | + if next_exchar(s) is | |
| 624 | + { | |
| 625 | + failure then | |
| 626 | + if delim is | |
| 627 | + { | |
| 628 | + right_par then | |
| 629 | + error(premature_end_of_regexpr), | |
| 630 | + | |
| 631 | + end_of_regexpr then | |
| 632 | + if already_read is | |
| 633 | + { | |
| 634 | + [ ] then error(regexpr_is_empty), | |
| 635 | + [last . others] then | |
| 636 | + ok(cat_list(last,others)) | |
| 637 | + } | |
| 638 | + }, | |
| 639 | + | |
| 640 | + success(ec) then | |
| 641 | + if ec is | |
| 642 | + { | |
| 643 | + left_par then | |
| 644 | + if read_regexpr(s,[],right_par) is | |
| 645 | + { | |
| 646 | + error(msg) then | |
| 647 | + error(msg), | |
| 648 | + | |
| 649 | + ok(r1) then | |
| 650 | + read_regexpr(s,[r1 . already_read],delim) | |
| 651 | + }, | |
| 652 | + | |
| 653 | + right_par then | |
| 654 | + if delim is | |
| 655 | + { | |
| 656 | + right_par then | |
| 657 | + if already_read is | |
| 658 | + { | |
| 659 | + [ ] then | |
| 660 | + error(unexpected_right_par), | |
| 661 | + | |
| 662 | + [last . others] then | |
| 663 | + ok(cat_list(last,others)) | |
| 664 | + }, | |
| 665 | + | |
| 666 | + end_of_regexpr then | |
| 667 | + error(unexpected_right_par) | |
| 668 | + }, | |
| 669 | + | |
| 670 | + left_bracket then | |
| 671 | + if read_within_brackets(s) is | |
| 672 | + { | |
| 673 | + error(msg) then error(msg), | |
| 674 | + | |
| 675 | + ok(r1) then if already_read is | |
| 676 | + { | |
| 677 | + [ ] then | |
| 678 | + read_regexpr(s,[choice(r1)],delim), | |
| 679 | + | |
| 680 | + [last . others] then | |
| 681 | + read_regexpr(s,[choice(r1),last . others],delim) | |
| 682 | + } | |
| 683 | + }, | |
| 684 | + | |
| 685 | + right_bracket then | |
| 686 | + error(unexpected_right_bracket), | |
| 687 | + | |
| 688 | + star then | |
| 689 | + if already_read is | |
| 690 | + { | |
| 691 | + [ ] then | |
| 692 | + error(star_not_following_a_regexpr), | |
| 693 | + | |
| 694 | + [last . others] then | |
| 695 | + read_regexpr(s,[star(last) . others],delim) | |
| 696 | + }, | |
| 697 | + | |
| 698 | + plus then | |
| 699 | + if already_read is | |
| 700 | + { | |
| 701 | + [ ] then | |
| 702 | + error(plus_not_following_a_regexpr), | |
| 703 | + | |
| 704 | + [last . others] then | |
| 705 | + read_regexpr(s,[plus(last) . others],delim) | |
| 706 | + }, | |
| 707 | + | |
| 708 | + or then | |
| 709 | + if read_regexpr(s,[],delim) is | |
| 710 | + { | |
| 711 | + error(msg) then error(msg), | |
| 712 | + | |
| 713 | + ok(r1) then | |
| 714 | + if already_read is | |
| 715 | + { | |
| 716 | + [ ] then error(unexpected_vbar), | |
| 717 | + [h . t] then | |
| 718 | + ok(or(cat_list(h,t),r1)) | |
| 719 | + } | |
| 720 | + }, | |
| 721 | + | |
| 722 | + dot then | |
| 723 | + read_regexpr(s,[dot . already_read], delim), | |
| 724 | + | |
| 725 | + dollar then | |
| 726 | + read_regexpr(s,[end_of_line . already_read], delim), | |
| 727 | + | |
| 728 | + caret then | |
| 729 | + read_regexpr(s,[beginning_of_line . already_read], delim), | |
| 730 | + | |
| 731 | + hyphen then | |
| 732 | + error(misplaced_hyphen), | |
| 733 | + | |
| 734 | + question_mark then | |
| 735 | + if already_read is | |
| 736 | + { | |
| 737 | + [ ] then | |
| 738 | + error(question_mark_not_following_a_regexpr), | |
| 739 | + | |
| 740 | + [last . others] then | |
| 741 | + read_regexpr(s,[question_mark(last) . others],delim) | |
| 742 | + }, | |
| 743 | + | |
| 744 | + char(c) then | |
| 745 | + read_regexpr(s,[char(c) . already_read], delim) | |
| 746 | + } | |
| 747 | + }. | |
| 748 | + | |
| 749 | + | |
| 750 | + | |
| 751 | + | |
| 752 | + *** [1.6.3] Normalizing a regular expression. | |
| 753 | + | |
| 754 | + This amounts to adding '(^)?' at the beginning of every regular expression not | |
| 755 | + beginning with '^', and '($)?' at the end of every regular expression not ending with '$'. | |
| 756 | + | |
| 757 | +define Bool | |
| 758 | + begins_by_bol | |
| 759 | + ( | |
| 760 | + RegExpr re | |
| 761 | + ) = | |
| 762 | + if re is | |
| 763 | + { | |
| 764 | + char(Word8 _0) then false, | |
| 765 | + choice(List(Word8) _0) then false, | |
| 766 | + plus(RegExpr _0) then false, | |
| 767 | + star(RegExpr _0) then false, | |
| 768 | + cat(RegExpr _0,RegExpr _1) then begins_by_bol(_0), | |
| 769 | + or(RegExpr _0,RegExpr _1) then false, | |
| 770 | + beginning_of_line then true, | |
| 771 | + end_of_line then false, | |
| 772 | + dot then false, | |
| 773 | + question_mark(RegExpr _0) then false | |
| 774 | + }. | |
| 775 | + | |
| 776 | +define Bool | |
| 777 | + ends_by_eol | |
| 778 | + ( | |
| 779 | + RegExpr re | |
| 780 | + ) = | |
| 781 | + if re is | |
| 782 | + { | |
| 783 | + char(Word8 _0) then false, | |
| 784 | + choice(List(Word8) _0) then false, | |
| 785 | + plus(RegExpr _0) then false, | |
| 786 | + star(RegExpr _0) then false, | |
| 787 | + cat(RegExpr _0,RegExpr _1) then ends_by_eol(_1), | |
| 788 | + or(RegExpr _0,RegExpr _1) then false, | |
| 789 | + beginning_of_line then false, | |
| 790 | + end_of_line then true, | |
| 791 | + dot then false, | |
| 792 | + question_mark(RegExpr _0) then false | |
| 793 | + }. | |
| 794 | + | |
| 795 | + | |
| 796 | +define RegExpr | |
| 797 | + normalize | |
| 798 | + ( | |
| 799 | + RegExpr re | |
| 800 | + ) = | |
| 801 | + with re1 = if begins_by_bol(re) then re else cat(question_mark(beginning_of_line),re), | |
| 802 | + if ends_by_eol(re1) then re1 else cat(re1,question_mark(end_of_line)). | |
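At the level of the regular-expression syntax itself, the rule reads as follows. This is a simplified string-level sketch: the real begins_by_bol/ends_by_eol walk the RegExpr structure, looking under 'cat' nodes rather than at raw characters.

```python
def normalize(pattern):
    # Prefix '(^)?' unless the expression already begins with '^' ...
    if not pattern.startswith("^"):
        pattern = "(^)?" + pattern
    # ... and suffix '($)?' unless it already ends with '$'.
    if not pattern.endswith("$"):
        pattern = pattern + "($)?"
    return pattern

assert normalize("define")  == "(^)?define($)?"
assert normalize("^define") == "^define($)?"
assert normalize("//.*$")   == "(^)?//.*$"
```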
| 803 | + | |
| 804 | + | |
| 805 | + | |
| 806 | + | |
| 807 | + *** [1.6.4] The tool for parsing regular expressions. | |
| 808 | + | |
| 809 | +define Result(RegExprError,RegExpr) | |
| 810 | + parse_regular_expression | |
| 811 | + ( | |
| 812 | + Stream s | |
| 813 | + ) = | |
| 814 | + if read_regexpr(s,[],end_of_regexpr) is | |
| 815 | + { | |
| 816 | + error(msg) then error(msg), | |
| 817 | + ok(re) then ok(normalize(re)) | |
| 818 | + }. | |
| 819 | + | |
| 820 | + | |
| 821 | + | |
| 822 | + | |
| 823 | + | |
| 824 | + *** [1.7] Transforming a regular expression into a basic one. | |
| 825 | + | |
| 826 | + *** [1.7.1] Expanding a 'choice' of characters. | |
| 827 | + | |
| 828 | + Given a list of characters (a 'choice sequence'), compute the corresponding basic | |
| 829 | + regular expression. | |
| 830 | + | |
| 831 | +define BasicRegExpr($Token) | |
| 832 | + expand_choice | |
| 833 | + ( | |
| 834 | + List(Word8) l | |
| 835 | + ) = | |
| 836 | + if l is | |
| 837 | + { | |
| 838 | + [ ] then epsilon, | |
| 839 | + [h . t] then | |
| 840 | + if t is [ ] then char(h) else | |
| 841 | + or(char(h),expand_choice(t)) | |
| 842 | + }. | |
| 843 | + | |
| 844 | + | |
| 845 | + | |
| 846 | + *** [1.7.2] The tool for converting to basic. | |
| 847 | + | |
| 848 | + Convert a regular expression to a basic one. | |
| 849 | + | |
| 850 | +public define BasicRegExpr($Token) | |
| 851 | + to_basic | |
| 852 | + ( | |
| 853 | + RegExpr r | |
| 854 | + ) = | |
| 855 | + if r is | |
| 856 | + { | |
| 857 | + char(c) then char(c), | |
| 858 | + choice(l) then expand_choice(l), | |
| 859 | + plus(r1) then with br = to_basic(r1), cat(br,star(br)), | |
| 860 | + star(r1) then star(to_basic(r1)), | |
| 861 | + cat(r1,r2) then cat(to_basic(r1),to_basic(r2)), | |
| 862 | + or(r1,r2) then or(to_basic(r1),to_basic(r2)), | |
| 863 | + beginning_of_line then beginning_of_line, | |
| 864 | + end_of_line then end_of_line, | |
| 865 | + dot then expand_choice(reverse_append(range(0,'\n'-1), | |
| 866 | + range('\n'+1,255))), | |
| 867 | + question_mark(r1) then or(epsilon,to_basic(r1)) | |
| 868 | + }. | |
| 869 | + | |
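The same rewriting, sketched in Python with the tuple encoding used above (an illustration, not the Anubis implementation; 'dot' and 'qmark' are stand-ins for the corresponding RegExpr alternatives):

```python
def expand_choice(chars):
    # [a, b, c] becomes char(a) | (char(b) | char(c)); an empty choice is epsilon.
    if not chars:
        return 'epsilon'
    if len(chars) == 1:
        return ('char', chars[0])
    return ('or', ('char', chars[0]), expand_choice(chars[1:]))

def to_basic(r):
    # 'dot' matches any byte except '\n', as in the Anubis code above.
    if r == 'dot':
        return expand_choice([b for b in range(256) if b != ord('\n')])
    if isinstance(r, str):          # 'bol', 'eol' are already basic
        return r
    tag = r[0]
    if tag == 'char':
        return r
    if tag == 'choice':
        return expand_choice(r[1])
    if tag == 'plus':               # r+  ==  r r*
        br = to_basic(r[1])
        return ('cat', br, ('star', br))
    if tag == 'star':
        return ('star', to_basic(r[1]))
    if tag in ('cat', 'or'):
        return (tag, to_basic(r[1]), to_basic(r[2]))
    if tag == 'qmark':              # r?  ==  epsilon | r
        return ('or', 'epsilon', to_basic(r[1]))
    raise ValueError(tag)
```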
| 870 | + | |
| 871 | + | |
| 872 | + | |
| 873 | + *** [1.8] Formatting error messages into English. | |
| 874 | + | |
| 875 | +public define String | |
| 876 | + to_English | |
| 877 | + ( | |
| 878 | + RegExprError e | |
| 879 | + ) = | |
| 880 | + if e is | |
| 881 | + { | |
| 882 | + premature_end_of_regexpr then "Premature end of regular expression.", | |
| 883 | + unexpected_right_par then "Unexpected right parenthesis.", | |
| 884 | + unexpected_right_bracket then "Unexpected right bracket.", | |
| 885 | + regexpr_is_empty then "Regular expression is empty.", | |
| 886 | + star_not_following_a_regexpr then "Found '*' not following any regular expression.", | |
| 887 | + plus_not_following_a_regexpr then "Found '+' not following any regular expression.", | |
| 888 | + question_mark_not_following_a_regexpr then "Found '?' not following any regular expression.", | |
| 889 | + non_character_within_brackets then "Non-character within brackets.", | |
| 890 | + misplaced_hyphen then "Misplaced hyphen.", | |
| 891 | + unexpected_vbar then "Misplaced vertical bar.", | |
| 892 | + empty_lexer_description then "Empty lexer description." | |
| 893 | + }. | |
| 894 | + | |
| 895 | + | |
| 896 | + | |
| 897 | + | |
| 898 | + | |
| 899 | + | |
| 900 | + | |
| 901 | + | |
| 902 | + *** [2] Lexing streams. | |
| 903 | + | |
| 904 | + *** [2.1] The type 'LexingStream'. | |
| 905 | + | |
| 906 | + A lexing stream provides the tools needed for using the low-level fast lexers as | |
| 907 | + defined in section 13 of predefined.anubis: | |
| 908 | + | |
| 909 | + - a variable 'buffer_v' containing the current buffer, | |
| 910 | + - a variable 'start_v' giving the starting position of the current lexeme within the buffer, | |
| 911 | + - a variable 'last_accept_v' giving the last accepting position (if any), | |
| 912 | + - a variable 'current_v' giving the current position of reading within the buffer, | |
| 913 | + - a function 'reload_buffer' for loading new bytes from the input. | |
| 914 | + | |
| 915 | + | |
| 916 | +public type LexingStream: | |
| 917 | + lexing_stream | |
| 918 | + ( | |
| 919 | + Var(ByteArray) buffer_v, // the current buffer | |
| 920 | + Var(Int) start_v, // start of lexeme in buffer | |
| 921 | + Var(FastLexerLastAccepted) last_accept_v, // last accepting position (if any) | |
| 922 | + Var(Int) current_v, // position of reading in buffer | |
| 923 | + Int -> Maybe(One) reload_buffer // command for loading the sequel in the buffer | |
| 924 | + ). | |
| 925 | + | |
| 926 | + While we are reading a lexeme, we keep the starting position (offset of first character | |
| 927 | + of the current lexeme) in 'start_v' so as to be able to extract the lexeme. We also | |
| 928 | + keep the last position at which a lexeme was accepted. This is because the lexer always | |
| 929 | + tries to read the longest possible lexeme. If at some point the lexeme is rejected, | |
| 930 | + and if there is a last accepting position, the current position comes back to this last | |
| 931 | + accepting position, and the lexeme is accepted. | |
| 932 | + | |
| 933 | + 'reload_buffer' works as follows. It returns 'failure' if there is nothing more to be | |
| 934 | + read from the actual input (the connection is down, the end of the file has been | |
| 935 | + reached, or the read timed out). In this case, the current buffer is unchanged. | |
| 936 | + | |
| 937 | + Otherwise, it reads a chunk of characters (say V) from the actual input, extracts the | |
| 938 | + part of the current buffer starting at the argument (say U), and establishes U+V as | |
| 939 | + the new current buffer. The other variables are updated accordingly. | |
| 940 | + | |
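The bookkeeping performed by 'reload_buffer' can be sketched in Python (hypothetical names; a model of the buffer compaction, not of the Anubis stream API — in particular the real code distinguishes a zero-length read, which compacts the buffer and then reports failure, while this sketch folds both into the no-data case):

```python
class LexingBuffer:
    """On reload, the consumed prefix [0, start) is dropped, the new chunk
    is appended, and every recorded position shifts left accordingly."""

    def __init__(self, data: bytes, read_more):
        self.buffer = data          # buffer_v
        self.start = 0              # start_v: start of current lexeme
        self.current = 0            # current_v: read position
        self.last_accept = None     # last_accept_v: (state, position) or None
        self.read_more = read_more  # pulls the next chunk from the input

    def reload(self) -> bool:
        more = self.read_more()
        if not more:                # nothing left: buffer stays unchanged
            return False
        dropped = self.start        # bytes dropped from the old buffer
        self.buffer = self.buffer[dropped:] + more
        self.start = 0
        self.current -= dropped
        if self.last_accept is not None:
            s, a = self.last_accept
            self.last_accept = (s, a - dropped)
        return True
```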
| 941 | + | |
| 942 | + | |
| 943 | + | |
| 944 | + *** [2.2] Constructing lexing streams. | |
| 945 | + | |
| 946 | + *** [2.2.1] From a byte array. | |
| 947 | + | |
| 948 | +public define LexingStream | |
| 949 | + make_lexing_stream | |
| 950 | + ( | |
| 951 | + ByteArray b | |
| 952 | + ) = | |
| 953 | + lexing_stream(var(b), // buffer | |
| 954 | + var(0), // starting position | |
| 955 | + var(none), // last accepting position | |
| 956 | + var(0), // current position | |
| 957 | + (Int u) |-> failure). // buffer cannot be reloaded | |
| 958 | + | |
| 959 | + | |
| 960 | + | |
| 961 | + | |
| 962 | + *** [2.2.2] From a string. | |
| 963 | + | |
| 964 | +public define LexingStream | |
| 965 | + make_lexing_stream | |
| 966 | + ( | |
| 967 | + String s | |
| 968 | + ) = | |
| 969 | + make_lexing_stream(to_byte_array(s)). | |
| 970 | + | |
| 971 | + | |
| 972 | + | |
| 973 | + | |
| 974 | + *** [2.2.3] From a read only stream. | |
| 975 | + | |
| 976 | +public define Maybe(LexingStream) | |
| 977 | + make_lexing_stream | |
| 978 | + ( | |
| 979 | + RStream stream, | |
| 980 | + Int buffer_size, | |
| 981 | + Int timeout | |
| 982 | + ) = | |
| 983 | + if read(stream,buffer_size,timeout) is | |
| 984 | + { | |
| 985 | + error then failure, | |
| 986 | + timeout then failure, | |
| 987 | + ok(buffer) then | |
| 988 | + with buffer_v = var(buffer), | |
| 989 | + start_v = var((Int)0), | |
| 990 | + last_accepted_v = var((FastLexerLastAccepted)none), | |
| 991 | + current_v = var((Int)0), | |
| 992 | + reload_buffer = (Int i) |-> | |
| 993 | + if read(stream,buffer_size,timeout) is | |
| 994 | + { | |
| 995 | + error then failure, | |
| 996 | + timeout then failure, | |
| 997 | + ok(more) then | |
| 998 | + //print("Buffer reloaded ("+abs_to_decimal(length(more))+" bytes).\n"); | |
| 999 | + if length(more) = 0 | |
| 1000 | + then (with old_buffer = *buffer_v, | |
| 1001 | + old_length = length(old_buffer), | |
| 1002 | + dropped = *start_v, // number of bytes dropped from old buffer | |
| 1003 | + buffer_v <- extract(old_buffer,dropped,old_length); | |
| 1004 | + start_v <- 0; | |
| 1005 | + current_v <- *current_v - dropped; | |
| 1006 | + last_accepted_v <- | |
| 1007 | + if *last_accepted_v is | |
| 1008 | + { | |
| 1009 | + none then none, | |
| 1010 | + last(s,a) then last(s,a - dropped) | |
| 1011 | + }; | |
| 1012 | + failure) | |
| 1013 | + else (with old_buffer = *buffer_v, | |
| 1014 | + old_length = length(old_buffer), | |
| 1015 | + dropped = *start_v, // number of bytes dropped from old buffer | |
| 1016 | + buffer_v <- extract(old_buffer,dropped,old_length)+more; | |
| 1017 | + start_v <- 0; | |
| 1018 | + current_v <- *current_v - dropped; | |
| 1019 | + last_accepted_v <- | |
| 1020 | + if *last_accepted_v is | |
| 1021 | + { | |
| 1022 | + none then none, | |
| 1023 | + last(s,a) then last(s,a - dropped) | |
| 1024 | + }; | |
| 1025 | + success(unique)) | |
| 1026 | + }, | |
| 1027 | + success(lexing_stream(buffer_v, | |
| 1028 | + start_v, | |
| 1029 | + last_accepted_v, | |
| 1030 | + current_v, | |
| 1031 | + reload_buffer)) | |
| 1032 | + }. | |
| 1033 | + | |
| 1034 | + | |
| 1035 | + | |
| 1036 | + | |
| 1037 | + *** [2.2.4] From a read/write stream. | |
| 1038 | + | |
| 1039 | +public define Maybe(LexingStream) | |
| 1040 | + make_lexing_stream | |
| 1041 | + ( | |
| 1042 | + RWStream stream, | |
| 1043 | + Int buffer_size, | |
| 1044 | + Int timeout | |
| 1045 | + ) = | |
| 1046 | + make_lexing_stream(weaken(stream),buffer_size,timeout). | |
| 1047 | + | |
| 1048 | + | |
| 1049 | + | |
| 1050 | + | |
| 1051 | + *** [2.2.5] From an SSL connection. | |
| 1052 | + | |
| 1053 | +public define Maybe(LexingStream) | |
| 1054 | + make_lexing_stream | |
| 1055 | + ( | |
| 1056 | + SSL_Connection stream, | |
| 1057 | + Int buffer_size, | |
| 1058 | + Int timeout | |
| 1059 | + ) = | |
| 1060 | + if (Maybe(ByteArray))read(stream,buffer_size,timeout) is | |
| 1061 | + { | |
| 1062 | + failure then failure, | |
| 1063 | + success(buffer) then | |
| 1064 | + with buffer_v = var(buffer), | |
| 1065 | + start_v = var((Int)0), | |
| 1066 | + last_accepted_v = var((FastLexerLastAccepted)none), | |
| 1067 | + current_v = var((Int)0), | |
| 1068 | + reload_buffer = (Int i) |-> | |
| 1069 | + if (Maybe(ByteArray))read(stream,buffer_size,timeout) is | |
| 1070 | + { | |
| 1071 | + failure then failure, | |
| 1072 | + success(more) then | |
| 1073 | + if length(more) = 0 | |
| 1074 | + then failure | |
| 1075 | + else with old_buffer = *buffer_v, | |
| 1076 | + old_length = length(old_buffer), | |
| 1077 | + dropped = *start_v, // number of bytes dropped from old buffer | |
| 1078 | + buffer_v <- extract(old_buffer,dropped,old_length)+more; | |
| 1079 | + start_v <- 0; | |
| 1080 | + current_v <- *current_v - dropped; | |
| 1081 | + last_accepted_v <- | |
| 1082 | + if *last_accepted_v is | |
| 1083 | + { | |
| 1084 | + none then none, | |
| 1085 | + last(s,a) then last(s,a - dropped) | |
| 1086 | + }; | |
| 1087 | + success(unique) | |
| 1088 | + }, | |
| 1089 | + success(lexing_stream(buffer_v, | |
| 1090 | + start_v, | |
| 1091 | + last_accepted_v, | |
| 1092 | + current_v, | |
| 1093 | + reload_buffer)) | |
| 1094 | + }. | |
| 1095 | + | |
| 1096 | + | |
| 1097 | + | |
| 1098 | + | |
| 1099 | + | |
| 1100 | + | |
| 1101 | + *** [3] Constructing the automaton. | |
| 1102 | + | |
| 1103 | + The description of a lexer is given as a list of 'LexerItem($Token)', where the | |
| 1104 | + parameter '$Token' represents the type of tokens. Each lexer item is made of a regular | |
| 1105 | + expression and an action. If the action is 'failure', the token just read is ignored | |
| 1106 | + and the lexer tries to read the next one. Otherwise, the action is applied to the | |
| 1107 | + lexeme just read, and the result of the action is returned by the lexer. The type | |
| 1108 | + 'LexerOutput($Token)' is defined in 'regexpr_parser.anubis'. | |
| 1109 | + | |
| 1110 | + | |
| 1111 | + A DFA is presented as a list of states. Each state is either accepting or | |
| 1112 | + rejecting. Each state has a name (of type Word32), and a list of transitions. Accepting | |
| 1113 | + states also have the corresponding 'action'. | |
| 1114 | + | |
| 1115 | + Each transition has a 'label' and the name of a state (the target state for this | |
| 1116 | + transition). Labels are of the following sorts: | |
| 1117 | + | |
| 1118 | +public type DFA_label: | |
| 1119 | + char(Word8), | |
| 1120 | + beginning_of_line, | |
| 1121 | + end_of_line. | |
| 1122 | + | |
| 1123 | +public type DFA_transition: | |
| 1124 | + transition(DFA_label label, | |
| 1125 | + Word32 target_name). | |
| 1126 | + | |
| 1127 | +public type DFA_state($Token): | |
| 1128 | + rejecting (Word32 name, | |
| 1129 | + List(DFA_transition) transitions), | |
| 1130 | + | |
| 1131 | + accepting (Word32 name, | |
| 1132 | + List(DFA_transition) transitions, | |
| 1133 | + Maybe(ByteArray -> LexerOutput($Token)) action). | |
| 1134 | + | |
| 1135 | + | |
| 1136 | + | |
| 1137 | + Now, here is the tool for making the DFA. The type 'RegExprError' is defined in | |
| 1138 | + 'regexpr_parser.anubis'. | |
| 1139 | + | |
| 1140 | +public define Result(RegExprError,List(DFA_state($Token))) | |
| 1141 | + make_DFA | |
| 1142 | + ( | |
| 1143 | + List(LexerItem($Token)) lexer_description | |
| 1144 | + ). | |
| 1145 | + | |
| 1146 | + | |
| 1147 | + | |
| 1148 | + *** [3.1] Pre-labels. | |
| 1149 | + | |
| 1150 | + These are the labels before the renaming of the DFA. | |
| 1151 | + | |
| 1152 | + 'beginning_of_line' and 'end_of_line' are also treated as special characters, even if | |
| 1153 | + they cannot be present as such in the input. The fast lexer detects their presence | |
| 1154 | + based on the neighbourhood of the character '\n', and uses special transitions in that | |
| 1155 | + case. | |
| 1156 | + | |
| 1157 | + On the contrary, 'actions' cannot be considered as matching anything in the | |
| 1158 | + input. However, in a given state an action may be present among the transitions, just | |
| 1159 | + meaning that in this state, if no transition may be followed, the action must be | |
| 1160 | + chosen instead. | |
| 1161 | + | |
| 1162 | + | |
| 1163 | +public type DFA_pre_label($Token): | |
| 1164 | + char(Word8), | |
| 1165 | + beginning_of_line, | |
| 1166 | + end_of_line, | |
| 1167 | + action(Maybe(ByteArray -> LexerOutput($Token))). | |
| 1168 | + | |
| 1169 | + | |
| 1170 | + | |
| 1171 | + | |
| 1172 | + *** [3.2] Decorating basic regular expressions. | |
| 1173 | + | |
| 1174 | + Given a basic regular expression, we associate a unique integer to each of its leaves | |
| 1175 | + (when seen as a tree), which are either a character, a beginning of line, or an end of | |
| 1176 | + line. Such an integer is called a 'position'. | |
| 1177 | + | |
| 1178 | + Furthermore, we add three decorations to each basic regular | |
| 1179 | + expression: | |
| 1180 | + | |
| 1181 | + - a flag 'nullable', which, when true, means that the regular expression may match | |
| 1182 | + the empty string, | |
| 1183 | + | |
| 1184 | + - a list of integers, representing all positions which may correspond to the first | |
| 1185 | + character of a matching string, | |
| 1186 | + | |
| 1187 | + - a list of integers, representing all positions which may correspond to the last | |
| 1188 | + character in a matching string. | |
| 1189 | + | |
| 1190 | + Actually, these two lists are lists of pairs (Word32,Label), where | |
| 1191 | + the label corresponds to the position. | |
| 1192 | + | |
| 1193 | +type DecoratedBasicRegExpr($Token): | |
| 1194 | + char (Word8, | |
| 1195 | + Word32 pos, | |
| 1196 | + Bool nullable, | |
| 1197 | + List((Word32,DFA_pre_label($Token))) firstpos, | |
| 1198 | + List((Word32,DFA_pre_label($Token))) lastpos), | |
| 1199 | + | |
| 1200 | + bol (Word32 pos, | |
| 1201 | + Bool nullable, | |
| 1202 | + List((Word32,DFA_pre_label($Token))) firstpos, | |
| 1203 | + List((Word32,DFA_pre_label($Token))) lastpos), | |
| 1204 | + | |
| 1205 | + eol (Word32 pos, | |
| 1206 | + Bool nullable, | |
| 1207 | + List((Word32,DFA_pre_label($Token))) firstpos, | |
| 1208 | + List((Word32,DFA_pre_label($Token))) lastpos), | |
| 1209 | + | |
| 1210 | + epsilon (Bool nullable, | |
| 1211 | + List((Word32,DFA_pre_label($Token))) firstpos, | |
| 1212 | + List((Word32,DFA_pre_label($Token))) lastpos), | |
| 1213 | + | |
| 1214 | + or (DecoratedBasicRegExpr($Token),DecoratedBasicRegExpr($Token), | |
| 1215 | + Bool nullable, | |
| 1216 | + List((Word32,DFA_pre_label($Token))) firstpos, | |
| 1217 | + List((Word32,DFA_pre_label($Token))) lastpos), | |
| 1218 | + | |
| 1219 | + cat (DecoratedBasicRegExpr($Token),DecoratedBasicRegExpr($Token), | |
| 1220 | + Bool nullable, | |
| 1221 | + List((Word32,DFA_pre_label($Token))) firstpos, | |
| 1222 | + List((Word32,DFA_pre_label($Token))) lastpos), | |
| 1223 | + | |
| 1224 | + star (DecoratedBasicRegExpr($Token), | |
| 1225 | + Bool nullable, | |
| 1226 | + List((Word32,DFA_pre_label($Token))) firstpos, | |
| 1227 | + List((Word32,DFA_pre_label($Token))) lastpos), | |
| 1228 | + | |
| 1229 | + action (Maybe(ByteArray -> LexerOutput($Token)), | |
| 1230 | + Word32 pos, | |
| 1231 | + Bool nullable, | |
| 1232 | + List((Word32,DFA_pre_label($Token))) firstpos, | |
| 1233 | + List((Word32,DFA_pre_label($Token))) lastpos). | |
| 1234 | + | |
| 1235 | + | |
| 1236 | + | |
| 1237 | + The following function adds positions and decorations to a regular expression. Since we | |
| 1238 | + have to generate position names, we give the first position to be used, and the | |
| 1239 | + function returns the regular expression (with positions and decorations) and the next | |
| 1240 | + position free for further use. The computation is simply recursive (there is no 'graph | |
| 1241 | + walk' to do, only a 'tree walk'). | |
| 1242 | + | |
| 1243 | + | |
| 1244 | +define (DecoratedBasicRegExpr($Token),Word32) | |
| 1245 | + decorate | |
| 1246 | + ( | |
| 1247 | + BasicRegExpr($Token) r, | |
| 1248 | + Word32 n | |
| 1249 | + ) = | |
| 1250 | + if r is | |
| 1251 | + { | |
| 1252 | + char(c) then | |
| 1253 | + (char(c,n,false,[(n,char(c))],[(n,char(c))]), n+1), | |
| 1254 | + | |
| 1255 | + star(r1) then | |
| 1256 | + if decorate(r1,n) is (rp1,m) then | |
| 1257 | + (star(rp1, | |
| 1258 | + true, | |
| 1259 | + firstpos(rp1), | |
| 1260 | + lastpos(rp1)),m), | |
| 1261 | + | |
| 1262 | + or(r1,r2) then | |
| 1263 | + if decorate(r1,n) is (rp1,m) then | |
| 1264 | + if decorate(r2,m) is (rp2,l) then | |
| 1265 | + (or(rp1,rp2, | |
| 1266 | + if nullable(rp1) then true else nullable(rp2), | |
| 1267 | + append(firstpos(rp1),firstpos(rp2)), | |
| 1268 | + append(lastpos(rp1),lastpos(rp2))),l), | |
| 1269 | + | |
| 1270 | + cat(r1,r2) then | |
| 1271 | + if decorate(r1,n) is (rp1,m) then | |
| 1272 | + if decorate(r2,m) is (rp2,l) then | |
| 1273 | + (cat(rp1,rp2, | |
| 1274 | + if nullable(rp1) then nullable(rp2) else false, | |
| 1275 | + if nullable(rp1) then append(firstpos(rp1),firstpos(rp2)) else firstpos(rp1), | |
| 1276 | + if nullable(rp2) then append(lastpos(rp1),lastpos(rp2)) else lastpos(rp2)),l), | |
| 1277 | + | |
| 1278 | + epsilon then | |
| 1279 | + (epsilon(true,[],[]),n), | |
| 1280 | + | |
| 1281 | + beginning_of_line then | |
| 1282 | + (bol(n,false,[(n,beginning_of_line)],[(n,beginning_of_line)]),n+1), | |
| 1283 | + | |
| 1284 | + end_of_line then | |
| 1285 | + (eol(n,false,[(n,end_of_line)],[(n,end_of_line)]),n+1), | |
| 1286 | + | |
| 1287 | + action(a) then | |
| 1288 | + (action(a,n,false,[(n,action(a))],[(n,action(a))]),n+1) | |
| 1289 | + }. | |
| 1290 | + | |
| 1291 | + | |
| 1292 | + Notice that the 'firstpos' and 'lastpos' fields in decorated regular expressions are | |
| 1293 | + always increasingly ordered lists of distinct integers (when ignoring labels), as may | |
| 1294 | + be easily verified by induction from the previous definition. Hint: when we write | |
| 1295 | + | |
| 1296 | + if decorate(r1,n) is (rp1,m) | |
| 1297 | + | |
| 1298 | + any position i in rp1 is such that n =< i < m. | |
| 1299 | + | |
| 1300 | + | |
| 1301 | + | |
| 1302 | + *** [3.3] Computing the follow table. | |
| 1303 | + | |
| 1304 | + | |
| 1305 | + A 'follow table' tells us which positions can follow a given position (when scanning a | |
| 1306 | + string). It also gives the label attached to a position. Its type is: | |
| 1307 | + | |
| 1308 | +type FollowTable($Token): | |
| 1309 | + empty, | |
| 1310 | + follow_table(Word32, // position | |
| 1311 | + DFA_pre_label($Token), // label | |
| 1312 | + List(Word32), // following positions | |
| 1313 | + FollowTable($Token) next). | |
| 1314 | + | |
| 1315 | + | |
| 1316 | + Our lists of Word32s will have to remain increasingly sorted (for the purpose of | |
| 1317 | + comparison). | |
| 1318 | + | |
| 1319 | + The following function merges two lists sorted in increasing order, so that the result | |
| 1320 | + is still increasingly sorted. | |
| 1321 | + | |
| 1322 | +define List(Word32) | |
| 1323 | + merge_sorted | |
| 1324 | + ( | |
| 1325 | + List(Word32) l1, | |
| 1326 | + List(Word32) l2 | |
| 1327 | + ) = | |
| 1328 | + if l1 is | |
| 1329 | + { | |
| 1330 | + [ ] then l2, | |
| 1331 | + [h1 . t1] then | |
| 1332 | + if l2 is | |
| 1333 | + { | |
| 1334 | + [ ] then l1, | |
| 1335 | + [h2 . t2] then | |
| 1336 | + if h1 = h2 // avoid duplications | |
| 1337 | + then [h1 . merge_sorted(t1,t2)] | |
| 1338 | + else if h1 -< h2 | |
| 1339 | + then [h1 . merge_sorted(t1,l2)] | |
| 1340 | + else [h2 . merge_sorted(l1,t2)] | |
| 1341 | + } | |
| 1342 | + }. | |
| 1343 | + | |
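The same merge can be written iteratively in Python (an illustration only; the Anubis version above is recursive):

```python
def merge_sorted(l1, l2):
    """Merge two increasing lists of distinct ints, dropping duplicates,
    so the result is again increasing with distinct elements."""
    out = []
    i = j = 0
    while i < len(l1) and j < len(l2):
        if l1[i] == l2[j]:          # avoid duplications
            out.append(l1[i]); i += 1; j += 1
        elif l1[i] < l2[j]:
            out.append(l1[i]); i += 1
        else:
            out.append(l2[j]); j += 1
    return out + l1[i:] + l2[j:]    # one side is exhausted; append the rest
```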
| 1344 | + | |
| 1345 | + 'heads' takes a list of pairs, and returns the list of all heads of these pairs. Remark | |
| 1346 | + that if we apply 'heads' to either a 'firstpos' or a 'lastpos' datum, we get a list of | |
| 1347 | + increasingly ordered distinct integers. | |
| 1348 | + | |
| 1349 | +define List($T) | |
| 1350 | + heads | |
| 1351 | + ( | |
| 1352 | + List(($T,$U)) l | |
| 1353 | + ) = | |
| 1354 | + if l is | |
| 1355 | + { | |
| 1356 | + [ ] then [ ], | |
| 1357 | + [h . t] then if h is (u,v) then | |
| 1358 | + [u . heads(t)] | |
| 1359 | + }. | |
| 1360 | + | |
| 1361 | + | |
| 1362 | + | |
| 1363 | + Adding entries to a follow table. Given: | |
| 1364 | + | |
| 1365 | + - a list of keys (e1,...,ek) of type (Word32,DFA_pre_label($Token)) | |
| 1366 | + - a list of values (t1,...,tn) of type (Word32,DFA_pre_label($Token)) | |
| 1367 | + - an A-list of triplets of type (Word32,DFA_pre_label($Token),List(Word32)), | |
| 1368 | + | |
| 1369 | + update that A-list, adding keys e1,...,ek if they are not already in the A-list, and | |
| 1370 | + putting each head of ti as a value for each ej. The third element of each triplet (a | |
| 1371 | + list of integers) should always remain increasingly sorted, and have distinct elements. | |
| 1372 | + | |
| 1373 | + First, assume there is only one key (and its label) to add: | |
| 1374 | + | |
| 1375 | + | |
| 1376 | +define FollowTable($Token) | |
| 1377 | + add_follow_entry | |
| 1378 | + ( | |
| 1379 | + Word32 key, | |
| 1380 | + DFA_pre_label($Token) c, | |
| 1381 | + List((Word32,DFA_pre_label($Token))) values, | |
| 1382 | + FollowTable($Token) previous | |
| 1383 | + ) = | |
| 1384 | + if previous is | |
| 1385 | + { | |
| 1386 | + empty then follow_table(key,c,heads(values),empty), | |
| 1387 | + follow_table(k1,c1,v1,t) then | |
| 1388 | + if key = k1 | |
| 1389 | + then follow_table(k1,c1,merge_sorted(heads(values),v1),t) | |
| 1390 | + else follow_table(k1,c1,v1,add_follow_entry(key,c,values,t)) | |
| 1391 | + }. | |
| 1392 | + | |
| 1393 | + | |
| 1394 | + Now, add several keys. | |
| 1395 | + | |
| 1396 | +define FollowTable($Token) | |
| 1397 | + add_follow_entries | |
| 1398 | + ( | |
| 1399 | + List((Word32,DFA_pre_label($Token))) keys, | |
| 1400 | + List((Word32,DFA_pre_label($Token))) values, | |
| 1401 | + FollowTable($Token) previous | |
| 1402 | + ) = | |
| 1403 | + if keys is | |
| 1404 | + { | |
| 1405 | + [ ] then previous, | |
| 1406 | + [k1 . ks] then | |
| 1407 | + if k1 is (k,c) then | |
| 1408 | + add_follow_entries(ks,values,add_follow_entry(k,c,values,previous)) | |
| 1409 | + }. | |
| 1410 | + | |
| 1411 | + Appending two follow tables (it is assumed that they have no key in common). | |
| 1412 | + | |
| 1413 | +define FollowTable($Token) | |
| 1414 | + append | |
| 1415 | + ( | |
| 1416 | + FollowTable($Token) t1, | |
| 1417 | + FollowTable($Token) t2 | |
| 1418 | + ) = | |
| 1419 | + if t1 is | |
| 1420 | + { | |
| 1421 | + empty then t2, | |
| 1422 | + follow_table(p,l,n,tail1) then follow_table(p,l,n,append(tail1,t2)) | |
| 1423 | + }. | |
| 1424 | + | |
| 1425 | + | |
| 1426 | + Making the follow_table from a decorated basic regular expression. | |
| 1427 | + | |
| 1428 | +define FollowTable($Token) | |
| 1429 | + make_follow_table | |
| 1430 | + ( | |
| 1431 | + DecoratedBasicRegExpr($Token) r | |
| 1432 | + ) = | |
| 1433 | + if r is | |
| 1434 | + { | |
| 1435 | + char(c,n,nb,fp,lp) then follow_table(n,char(c),[],empty), | |
| 1436 | + bol(n,nb,fp,lp) then follow_table(n,beginning_of_line,[],empty), | |
| 1437 | + eol(n,nb,fp,lp) then follow_table(n,end_of_line,[],empty), | |
| 1438 | + epsilon(nb,fp,lp) then empty, | |
| 1439 | + or(r1,r2,nb,fp,lp) then append(make_follow_table(r1),make_follow_table(r2)), | |
| 1440 | + /* we can use append because r1 and r2 cannot share a | |
| 1441 | + key. */ | |
| 1442 | + | |
| 1443 | + cat(r1,r2,nb,fp,lp) then | |
| 1444 | + with t = append(make_follow_table(r1),make_follow_table(r2)), | |
| 1445 | + /* same remark on append */ | |
| 1446 | + l1 = lastpos(r1), | |
| 1447 | + f2 = firstpos(r2), | |
| 1448 | + add_follow_entries(l1,f2,t), | |
| 1449 | + | |
| 1450 | + star(r1,nb,fp,lp) then | |
| 1451 | + with t = make_follow_table(r1), | |
| 1452 | + f = firstpos(r1), | |
| 1453 | + l = lastpos(r1), | |
| 1454 | + add_follow_entries(l,f,t), | |
| 1455 | + | |
| 1456 | + action(a,n,nb,fb,lp) then follow_table(n,action(a),[],empty) | |
| 1457 | + }. | |
| 1458 | + | |
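For illustration, the combined effect of 'decorate' and 'make_follow_table' can be sketched in Python (a hedged model: a plain dict stands in for the FollowTable, leaves are the tuples/atoms of the basic form, and action leaves are omitted):

```python
def analyze(r, n=0):
    """Return (nullable, firstpos, lastpos, next_free_pos, follow) for a
    basic regex in the tuple encoding; leaves ('char', c), 'bol', 'eol'
    get positions n, n+1, ... left to right, and 'follow' maps each
    position to the sorted list of positions that may follow it."""
    follow = {}

    def go(r, n):
        if r == 'epsilon':
            return True, [], [], n
        if r in ('bol', 'eol') or r[0] == 'char':
            follow[n] = []                    # a leaf: one fresh position
            return False, [n], [n], n + 1
        tag = r[0]
        if tag == 'star':
            nb, fp, lp, m = go(r[1], n)
            for p in lp:                      # the star may loop back
                follow[p] = sorted(set(follow[p]) | set(fp))
            return True, fp, lp, m
        if tag == 'or':
            nb1, fp1, lp1, m = go(r[1], n)
            nb2, fp2, lp2, l = go(r[2], m)
            return nb1 or nb2, fp1 + fp2, lp1 + lp2, l
        if tag == 'cat':
            nb1, fp1, lp1, m = go(r[1], n)
            nb2, fp2, lp2, l = go(r[2], m)
            for p in lp1:                     # last of r1 is followed by first of r2
                follow[p] = sorted(set(follow[p]) | set(fp2))
            return (nb1 and nb2,
                    fp1 + fp2 if nb1 else fp1,
                    lp1 + lp2 if nb2 else lp2,
                    l)
        raise ValueError(tag)

    nb, fp, lp, m = go(r, n)
    return nb, fp, lp, m, follow
```

On the classic example (a|b)*a, position 2 (the final 'a') follows every position, and nothing follows position 2.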
| 1459 | + | |
| 1460 | + | |
| 1461 | + | |
| 1462 | + | |
| 1463 | + Finding an entry in a follow table. | |
| 1464 | + | |
| 1465 | +define (Word32,DFA_pre_label($Token),List(Word32)) | |
| 1466 | + follow_table_entry | |
| 1467 | + ( | |
| 1468 | + Word32 p, | |
| 1469 | + FollowTable($Token) l | |
| 1470 | + ) = | |
| 1471 | + if l is | |
| 1472 | + { | |
| 1473 | + empty then alert, // we should always find it | |
| 1474 | + follow_table(n,c,pos,t) then | |
| 1475 | + if p = n | |
| 1476 | + then (n,c,pos) | |
| 1477 | + else follow_table_entry(p,t) | |
| 1478 | + }. | |
| 1479 | + | |
| 1480 | + | |
| 1481 | + | |
| 1482 | + | |
| 1483 | + | |
| 1484 | + | |
| 1485 | + | |
| 1486 | + | |
| 1487 | + | |
| 1488 | + | |
| 1489 | + Names of states in the DFA are initially increasingly sorted lists of Word32s. They are | |
| 1490 | + transformed into Word32s when the DFA is renamed (see below). A transition is just a | |
| 1491 | + pair made of a label and a state name. | |
| 1492 | + | |
| 1493 | +type DFA_pre_transition($Token): | |
| 1494 | + transition(DFA_pre_label($Token) label, | |
| 1495 | + List(Word32) target_name). | |
| 1496 | + | |
| 1497 | + | |
| 1498 | + A state is made of a state name and a list of transitions. | |
| 1499 | + | |
| 1500 | +type DFA_pre_state($Token): | |
| 1501 | + state(List(Word32) name, | |
| 1502 | + Maybe(List(DFA_pre_transition($Token))) transitions). | |
| 1503 | + | |
| 1504 | + | |
| 1505 | + The reason why the field 'transitions' has a 'Maybe' is that we may consider | |
| 1506 | + 'incomplete' states, which have not yet received their transitions. | |
| 1507 | + | |
| 1508 | + Note: A DFA is not a tree in general, but a graph. This is the reason why states have | |
| 1509 | + names. Since we cannot construct circular data in Anubis, the presence of names allows | |
| 1510 | + nevertheless the construction of graphs (including circularities). However, we cannot | |
| 1511 | + refer directly to a state, but only to its name. | |
| 1512 | + | |
| 1513 | + We explain now how the automaton is constructed for a decorated basic regular | |
| 1514 | + expression 'r'. | |
| 1515 | + | |
| 1516 | + First of all, there is an initial state, whose name is firstpos(r). What it means is | |
| 1517 | + that in this state, we expect to read a character corresponding to one of these | |
| 1518 | + positions. | |
| 1519 | + | |
| 1520 | + More generally, for any state 's', the name of the state is the list of all positions | |
| 1521 | + which may match the next character to be read from the input. | |
| 1522 | + | |
| 1523 | + Since we don't care about unreachable states, we construct the automaton, starting | |
| 1524 | + with the initial state, and adding all the states required by the transitions, until no | |
| 1525 | + more state may be added. Of course, this process terminates, since the set of all | |
| 1526 | + possible state names is obviously finite (its cardinality is at most 2^p, where p is the | |
| 1527 | + number of positions in r). | |
| 1528 | + | |
| 1529 | + For a given state, with name [p_1,...,p_k], the transitions are given by the labels of | |
| 1530 | + p_1,...,p_k. Nevertheless, several positions may have the same label. Hence, for a | |
| 1531 | + given label, let q_1,...,q_j be those among p_1,...,p_k which have this label. The | |
| 1532 | + target state for the corresponding transition is obtained by taking all the positions | |
| 1533 | + which may follow one of q_1,...,q_j. | |
| 1534 | + | |
| 1535 | + That's all ! | |
| 1536 | + | |
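The construction just described can be sketched in Python (an illustration using plain dicts; 'labels' maps each position to its label and 'follow' to its following positions, hypothetical stand-ins for the FollowTable):

```python
def build_dfa(start_name, follow, labels):
    """A state name is a sorted tuple of positions. Starting from the
    initial state, keep completing states that have not yet received
    their transitions, creating new (incomplete) target states as needed,
    until no incomplete state remains."""
    states = {tuple(start_name): None}       # name -> transitions (None = incomplete)
    while True:
        todo = [nm for nm, tr in states.items() if tr is None]
        if not todo:
            return states                    # the DFA is ready
        name = todo[0]
        trans = {}
        for p in name:                       # group following positions by label
            a = labels[p]
            trans[a] = sorted(set(trans.get(a, [])) | set(follow[p]))
        states[name] = {a: tuple(l) for a, l in trans.items()}
        for target in states[name].values():
            states.setdefault(target, None)  # new states appear incomplete
```

For a*b (position 1 = 'a', position 2 = 'b'), the loop produces the start state {1,2} and the dead-end empty state reached after reading 'b'.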
| 1537 | + | |
| 1538 | + Empty state names. What does it mean that the name of a state is empty? This means | |
| 1539 | + that reaching this state produces an error. Indeed, a state accepts a string if and | |
| 1540 | + only if it contains a position labelled by an action, and has transitions to other | |
| 1541 | + states if and only if it contains a position labelled by a character (or | |
| 1542 | + 'end_of_line'). | |
| 1543 | + | |
| 1544 | + A state which contains an action is an accepting state. Nevertheless, it may also have | |
| 1545 | + transitions. Hence, the lexer may eventually accept a longer sequence. But following | |
| 1546 | + the transitions may also lead to an error. Hence the lexer must always keep the most | |
| 1547 | + recently found solution, and use it (if it exists) if it enters a dead end (and in that | |
| 1548 | + case, there is no error at all). | |
| 1549 | + | |
| 1550 | + When using a solution, the lexer must also apply the action. This action must have been | |
| 1551 | + saved by the lexer. Hence it is necessary to number actions, and to create a function | |
| 1552 | + for each action. | |
| 1553 | + | |
| 1554 | + | |
| 1555 | + | |
| 1556 | + | |
| 1557 | + Given a state name [p_1,...,p_k], and the follow table, the function | |
| 1558 | + 'prepare_transitions' produces a list of pairs | |
| 1559 | + | |
| 1560 | + (a , l) | |
| 1561 | + | |
| 1562 | + where 'a' is a label, and 'l' the list of all positions with label 'a' which may follow | |
| 1563 | + one of p_1,...,p_k. We need an auxiliary function 'insert'. | |
| 1564 | + | |
| 1565 | + | |
| 1566 | + | |
| 1567 | + | |
| 1568 | + | |
| 1569 | +define List(DFA_pre_transition($Token)) | |
| 1570 | + insert | |
| 1571 | + ( | |
| 1572 | + DFA_pre_label($Token) c, | |
| 1573 | + List(Word32) l, | |
| 1574 | + List(DFA_pre_transition($Token)) q | |
| 1575 | + ) = | |
| 1576 | + if q is | |
| 1577 | + { | |
| 1578 | + [ ] then [transition(c,l)], | |
| 1579 | + [h . t] then | |
| 1580 | + if h is transition(c1,l1) then | |
| 1581 | + if c = c1 | |
| 1582 | + then [transition(c,merge_sorted(l,l1)) . t] | |
| 1583 | + else [h . insert(c,l,t)] | |
| 1584 | + }. | |
| 1585 | + | |
| 1586 | + | |
| 1587 | +define List(DFA_pre_transition($Token)) | |
| 1588 | + prepare_transitions | |
| 1589 | + ( | |
| 1590 | + List(Word32) name, | |
| 1591 | + FollowTable($Token) ft | |
| 1592 | + ) = | |
| 1593 | + if name is | |
| 1594 | + { | |
| 1595 | + [ ] then [ ], | |
| 1596 | + [p1 . p_others] then | |
| 1597 | + if follow_table_entry(p1,ft) is (p,c,l) then | |
| 1598 | + with q = prepare_transitions(p_others,ft), | |
| 1599 | + insert(c,l,q) | |
| 1600 | + }. | |
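The same two functions can be sketched in Python (a dict-based follow table standing in for FollowTable; these names mirror the Anubis ones but are only an illustration):

```python
def insert(label, positions, transitions):
    """Merge `positions` into the entry for `label`, keeping each
    target list sorted and duplicate-free (as merge_sorted does)."""
    for i, (lab, tgt) in enumerate(transitions):
        if lab == label:
            transitions[i] = (lab, sorted(set(tgt) | set(positions)))
            return transitions
    transitions.append((label, sorted(positions)))
    return transitions

def prepare_transitions(name, follow):
    """For a state name [p_1,...,p_k], collect for each label the list
    of positions that may follow one of the p_i."""
    transitions = []
    for p in name:
        label, follows = follow[p]  # follow table entry (p, label, follows)
        insert(label, follows, transitions)
    return transitions
```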
| 1601 | + | |
| 1602 | + | |
| 1603 | + | |
| 1604 | + | |
| 1605 | + Now, we compute our DFA, i.e. a list of DFA_pre_state($Token)s. We begin with a single | |
| 1606 | + state in the list. The name of this state is firstpos(r), and it has not yet received | |
| 1607 | + its transitions. In other words, it is: | |
| 1608 | + | |
| 1609 | + state(firstpos(r),failure) | |
| 1610 | + | |
| 1611 | + Then, we enter an 'infinite' loop. At each pass, we look for a state which has not yet | |
| 1612 | + received its transitions. If there is no such state, the DFA is ready (and we exit the | |
| 1613 | + loop). Otherwise, we add its transitions to the state, and this may create new states | |
| 1614 | + (without their transitions) in the DFA. | |
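The loop just described is the classical subset construction. A compact Python sketch of it (dict-based, with state names as sorted tuples of positions; illustrative only):

```python
def make_dfa(start_name, follow):
    """Repeatedly pick a state without transitions, compute them from
    the follow table, and record any previously unseen target states."""
    states = {tuple(start_name): None}  # name -> transitions (None = incomplete)
    while True:
        # look for a state which has not yet received its transitions
        pending = next((n for n, t in states.items() if t is None), None)
        if pending is None:
            return states  # the DFA is ready
        trans = {}
        for p in pending:
            label, follows = follow[p]
            trans.setdefault(label, set()).update(follows)
        trans = {lab: tuple(sorted(tgt)) for lab, tgt in trans.items()}
        states[pending] = trans
        for target in trans.values():
            states.setdefault(target, None)  # new incomplete state
```

The loop terminates because there are only finitely many subsets of positions, so only finitely many states can ever be created.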
| 1615 | + | |
| 1616 | + We need a function to separate (if possible) an incomplete state from a list of states: | |
| 1617 | + | |
| 1618 | +define Maybe((DFA_pre_state($Token),List(DFA_pre_state($Token)))) | |
| 1619 | + separate_incomplete_state | |
| 1620 | + ( | |
| 1621 | + List(DFA_pre_state($Token)) l | |
| 1622 | + ) = | |
| 1623 | + if l is | |
| 1624 | + { | |
| 1625 | + [ ] then failure, | |
| 1626 | + [s1 . so] then | |
| 1627 | + if transitions(s1) is | |
| 1628 | + { | |
| 1629 | + failure then | |
| 1630 | + success((s1,so)), | |
| 1631 | + success(_) then | |
| 1632 | + if separate_incomplete_state(so) is | |
| 1633 | + { | |
| 1634 | + failure then failure, | |
| 1635 | + success(p) then if p is (i,m) then | |
| 1636 | + success((i,[s1 . m])) | |
| 1637 | + } | |
| 1638 | + } | |
| 1639 | + }. | |
| 1640 | + | |
| 1641 | + | |
| 1642 | + We need a function to extract the list of target names from a list of transitions. | |
| 1643 | + | |
| 1644 | +define List(List(Word32)) | |
| 1645 | + get_targets | |
| 1646 | + ( | |
| 1647 | + List(DFA_pre_transition($Token)) l | |
| 1648 | + ) = | |
| 1649 | + if l is | |
| 1650 | + { | |
| 1651 | + [ ] then [ ], | |
| 1652 | + [h . t] then if h is transition(n,target) then | |
| 1653 | + [target . get_targets(t)] | |
| 1654 | + }. | |
| 1655 | + | |
| 1656 | + | |
| 1657 | + We need a predicate to test if a list of states contains a state of | |
| 1658 | + a given name. | |
| 1659 | + | |
| 1660 | +define Bool | |
| 1661 | + is_state_name_in | |
| 1662 | + ( | |
| 1663 | + List(DFA_pre_state($Token)) l, | |
| 1664 | + List(Word32) n // sorted list of integers | |
| 1665 | + ) = | |
| 1666 | + if l is | |
| 1667 | + { | |
| 1668 | + [ ] then false, | |
| 1669 | + [h . t] then | |
| 1670 | + if h is state(m,tr) then | |
| 1671 | + if n = m // comparing sorted lists of integers | |
| 1672 | + then true | |
| 1673 | + else is_state_name_in(t,n) | |
| 1674 | + }. | |
| 1675 | + | |
| 1676 | + | |
| 1677 | + We need a function to add new states to a list of states. The new states are given in | |
| 1678 | + the form of a list of state names and are added without their transitions. | |
| 1679 | + | |
| 1680 | +define List(DFA_pre_state($Token)) | |
| 1681 | + add_new_states | |
| 1682 | + ( | |
| 1683 | + List(List(Word32)) names, | |
| 1684 | + List(DFA_pre_state($Token)) states | |
| 1685 | + ) = | |
| 1686 | + if names is | |
| 1687 | + { | |
| 1688 | + [ ] then states, | |
| 1689 | + [h . t] then | |
| 1690 | + if is_state_name_in(states,h) | |
| 1691 | + then add_new_states(t,states) | |
| 1692 | + else add_new_states(t,[state(h,failure) . states]) | |
| 1693 | + }. | |
| 1694 | + | |
| 1695 | + | |
| 1696 | + | |
| 1697 | + We need a function to complete a state which has not yet received its transitions. | |
| 1698 | + | |
| 1699 | +define List(DFA_pre_state($Token)) | |
| 1700 | + complete_state | |
| 1701 | + ( | |
| 1702 | + DFA_pre_state($Token) i, // incomplete state | |
| 1703 | + List(DFA_pre_state($Token)) o, // other states | |
| 1704 | + FollowTable($Token) ft | |
| 1705 | + ) = | |
| 1706 | + with trans = prepare_transitions(name(i),ft), | |
| 1707 | + targets = get_targets(trans), | |
| 1708 | + add_new_states(targets,[state(name(i),success(trans)) . o]). | |
| 1709 | + | |
| 1710 | + | |
| 1711 | + Now, here is our 'infinite' loop. | |
| 1712 | + | |
| 1713 | +define List(DFA_pre_state($Token)) | |
| 1714 | + make_DFA_pre | |
| 1715 | + ( | |
| 1716 | + List(DFA_pre_state($Token)) l, | |
| 1717 | + FollowTable($Token) ft | |
| 1718 | + ) = | |
| 1719 | + if separate_incomplete_state(l) is | |
| 1720 | + { | |
| 1721 | + failure then l, // the DFA is ready | |
| 1722 | + | |
| 1723 | + success(p) then if p is (s,o) then | |
| 1724 | + with new = complete_state(s,o,ft), | |
| 1725 | + make_DFA_pre(new,ft) | |
| 1726 | + }. | |
| 1727 | + | |
| 1728 | + | |
| 1729 | + | |
| 1730 | + | |
| 1731 | + | |
| 1732 | + *** [3.5] Renaming the states of the DFA. | |
| 1733 | + | |
| 1734 | + Names of states in our DFA are lists of integers. We need to replace them with plain integers. | |
| 1735 | + | |
| 1736 | + From a DFA whose state names are lists of integers, we create a list of pairs (old,new) | |
| 1737 | + where new is a new name (an integer) and old an old name (a list of integers). | |
| 1738 | + | |
| 1739 | +define List((List(Word32),Word32)) // an association list | |
| 1740 | + name_list | |
| 1741 | + ( | |
| 1742 | + List(DFA_pre_state($Token)) l, | |
| 1743 | + Word32 first_new_name | |
| 1744 | + ) = | |
| 1745 | + if l is | |
| 1746 | + { | |
| 1747 | + [ ] then [ ], | |
| 1748 | + [h . t] then | |
| 1749 | + if h is state(old_name,tr) then | |
| 1750 | + [(old_name,first_new_name) . name_list(t,first_new_name+1)] | |
| 1751 | + }. | |
| 1752 | + | |
| 1753 | + | |
| 1754 | + Given an old name and our association list, we can get the new name. | |
| 1755 | + | |
| 1756 | +define Word32 | |
| 1757 | + get_new_name | |
| 1758 | + ( | |
| 1759 | + List(Word32) old_name, | |
| 1760 | + List((List(Word32),Word32)) nlist | |
| 1761 | + ) = | |
| 1762 | + if nlist is | |
| 1763 | + { | |
| 1764 | + [ ] then alert, // the new name should always exist | |
| 1765 | + [h . t] then if h is (o,n) then | |
| 1766 | + if old_name = o | |
| 1767 | + then n | |
| 1768 | + else get_new_name(old_name,t) | |
| 1769 | + }. | |
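Both steps above amount to building an association from old names to consecutive integers, then looking names up. A compact Python equivalent (using a dict in place of the association list; names mirror the Anubis ones for illustration only):

```python
def name_list(states, first=0):
    """Associate each old state name (a tuple of positions) with a
    fresh consecutive integer, starting from `first`."""
    return {old: first + i for i, old in enumerate(states)}

def get_new_name(old_name, nlist):
    """Look an old name up; it should always be present."""
    return nlist[old_name]

names = name_list([(1,), (1, 2), (2,)])
```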
| 1770 | + | |
| 1771 | + | |
| 1772 | + Now, we rename all transitions in a given state. At the same time we separate actual | |
| 1773 | + transitions from actions. This is why the following function returns a pair made of a | |
| 1774 | + list of transitions, and maybe an action. Since the action is of type: | |
| 1775 | + | |
| 1776 | + Maybe(ByteArray -> LexerOutput($Token)) | |
| 1777 | + | |
| 1778 | + the optional action is of type: | |
| 1779 | + | |
| 1780 | + Maybe(Maybe(ByteArray -> LexerOutput($Token))) | |
| 1781 | + | |
| 1782 | + | |
| 1783 | +define (List(DFA_transition),Maybe(Maybe(ByteArray -> LexerOutput($Token)))) | |
| 1784 | + rename | |
| 1785 | + ( | |
| 1786 | + List(DFA_pre_transition($Token)) l, | |
| 1787 | + List((List(Word32),Word32)) nlist | |
| 1788 | + ) = | |
| 1789 | + if l is | |
| 1790 | + { | |
| 1791 | + [ ] then ([ ],failure), | |
| 1792 | + [h . t] then | |
| 1793 | + if rename(t,nlist) is (trs,mbmba) then | |
| 1794 | + if h is transition(pre_label,target) then | |
| 1795 | + if pre_label is | |
| 1796 | + { | |
| 1797 | + char(c) then | |
| 1798 | + ([transition(char(c),get_new_name(target,nlist)) . trs],mbmba), | |
| 1799 | + beginning_of_line then | |
| 1800 | + ([transition(beginning_of_line,get_new_name(target,nlist)) . trs],mbmba), | |
| 1801 | + end_of_line then | |
| 1802 | + ([transition(end_of_line,get_new_name(target,nlist)) . trs],mbmba), | |
| 1803 | + action(mba) then if mbmba is | |
| 1804 | + { | |
| 1805 | + failure then (trs,success(mba)), | |
| 1806 | + success(x) then // two actions in the same state: choose the first one. | |
| 1807 | + (trs,success(mba)) | |
| 1808 | + } | |
| 1809 | + } | |
| 1810 | + }. | |
| 1811 | + | |
| 1812 | + | |
| 1813 | + Now, we rename all the states. | |
| 1814 | + | |
| 1815 | +define List(DFA_state($Token)) | |
| 1816 | + rename | |
| 1817 | + ( | |
| 1818 | + List(DFA_pre_state($Token)) l, | |
| 1819 | + List((List(Word32),Word32)) nlist | |
| 1820 | + ) = | |
| 1821 | + if l is | |
| 1822 | + { | |
| 1823 | + [ ] then [ ], | |
| 1824 | + [h . t] then | |
| 1825 | + if h is state(old_name,mbtrans) then | |
| 1826 | + if mbtrans is | |
| 1827 | + { | |
| 1828 | + failure then alert, // pre-states must have been completed | |
| 1829 | + success(trans) then | |
| 1830 | + if rename(trans,nlist) is (trs,mbmba) then | |
| 1831 | + if mbmba is | |
| 1832 | + { | |
| 1833 | + failure then | |
| 1834 | + [rejecting(get_new_name(old_name,nlist),trs) . rename(t,nlist)], | |
| 1835 | + success(mba) then | |
| 1836 | + [accepting(get_new_name(old_name,nlist),trs,mba) . rename(t,nlist)] | |
| 1837 | + } | |
| 1838 | + } | |
| 1839 | + }. | |
| 1840 | + | |
| 1841 | + | |
| 1842 | + | |
| 1843 | + *** [3.6] Making the DFA. | |
| 1844 | + | |
| 1845 | + | |
| 1846 | + | |
| 1847 | + | |
| 1848 | +define Result(RegExprError,BasicRegExpr($Token)) | |
| 1849 | + prepare_global_regexpr | |
| 1850 | + ( | |
| 1851 | + List(LexerItem($Token)) lexer_description | |
| 1852 | + ) = | |
| 1853 | + if lexer_description is | |
| 1854 | + { | |
| 1855 | + [ ] then error(empty_lexer_description), | |
| 1856 | + [h . t] then if h is lexer_item(re,a) then | |
| 1857 | + if parse_regular_expression(make_stream(re)) is | |
| 1858 | + { | |
| 1859 | + error(msg) then error(msg), | |
| 1860 | + ok(re1) then if t is | |
| 1861 | + { | |
| 1862 | + [ ] then | |
| 1863 | + ok(cat(to_basic(re1),action(a))), | |
| 1864 | + [_ . _] then if prepare_global_regexpr(t) is | |
| 1865 | + { | |
| 1866 | + error(msg) then error(msg), | |
| 1867 | + ok(p) then | |
| 1868 | + ok(or(cat(to_basic(re1),action(a)),p)) | |
| 1869 | + } | |
| 1870 | + } | |
| 1871 | + } | |
| 1872 | + }. | |
| 1873 | + | |
| 1874 | + | |
| 1875 | + | |
| 1876 | +public define Result(RegExprError,List(DFA_state($Token))) | |
| 1877 | + make_DFA | |
| 1878 | + ( | |
| 1879 | + List(LexerItem($Token)) lexer_description | |
| 1880 | + ) = | |
| 1881 | + if prepare_global_regexpr(lexer_description) is | |
| 1882 | + { | |
| 1883 | + error(msg) then error(msg), | |
| 1884 | + ok(re) then if decorate(re,0) is (br,_) then | |
| 1885 | + with dfa = reverse(make_DFA_pre([state(heads(firstpos(br)),failure)], | |
| 1886 | + make_follow_table(br))), | |
| 1887 | + ok(rename(dfa,name_list(dfa,0))) | |
| 1888 | + }. | |
| 1889 | + | |
| 1890 | + | |
| 1891 | + | |
| 1892 | + | |
| 1893 | + | |
| 1894 | + *** [3.7] Translating a DFA into a fast lexer description. | |
| 1895 | + | |
| 1896 | + The types 'FastLexerTransition' and 'FastLexerState' are defined in 'predefined.anubis', | |
| 1897 | + section 13. | |
| 1898 | + | |
| 1899 | + | |
| 1900 | +define List(FastLexerTransition) | |
| 1901 | + to_fast_lexer_transitions | |
| 1902 | + ( | |
| 1903 | + List(DFA_transition) l | |
| 1904 | + ) = | |
| 1905 | + if l is | |
| 1906 | + { | |
| 1907 | + [ ] then [ ], | |
| 1908 | + [h . t] then if h is transition(label,target) then | |
| 1909 | + [if label is | |
| 1910 | + { | |
| 1911 | + char(c) then transition(c,target), | |
| 1912 | + beginning_of_line then beginning_of_line(target), | |
| 1913 | + end_of_line then end_of_line(target) | |
| 1914 | + } . to_fast_lexer_transitions(t)] | |
| 1915 | + }. | |
| 1916 | + | |
| 1917 | + | |
| 1918 | +public define List(FastLexerState) | |
| 1919 | + to_fast_lexer_description | |
| 1920 | + ( | |
| 1921 | + List(DFA_state($Token)) l | |
| 1922 | + ) = | |
| 1923 | + if l is | |
| 1924 | + { | |
| 1925 | + [ ] then [ ], | |
| 1926 | + [h . t] then [if h is | |
| 1927 | + { | |
| 1928 | + rejecting(n,trs) then rejecting(to_fast_lexer_transitions(trs)), | |
| 1929 | + accepting(n,trs,a) then accepting(to_fast_lexer_transitions(trs)) | |
| 1930 | + } . to_fast_lexer_description(t)] | |
| 1931 | + }. | |
| 1932 | + | |
| 1933 | + | |
| 1934 | + | |
| 1935 | + | |
| 1936 | + | |
| 1937 | + | |
| 1938 | + *** [4] Constructing the lexer. | |
| 1939 | + | |
| 1940 | + The low level fast lexer (see 'predefined.anubis', section 13) does not care about | |
| 1941 | + actions. Hence, we must manage actions in parallel. To this end we use the following | |
| 1942 | + type: | |
| 1943 | + | |
| 1944 | + MVar(Maybe(ByteArray -> LexerOutput($Token))) | |
| 1945 | + | |
| 1946 | + The action for state 'n' (assumed to be an accepting state because the multiple | |
| 1947 | + variable is never used for rejecting states) is the value stored in slot 'n'. The | |
| 1948 | + default value is 'failure' meaning 'ignore this token and read the next | |
| 1949 | + one'. Otherwise, the function is applied to the lexeme just read, and the lexer returns | |
| 1950 | + the result of this function. | |
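The idea of a slot-per-state action table, with a default "skip this lexeme" value in every non-accepting slot, can be sketched as follows (a plain list standing in for the MVar; hypothetical names, not the Anubis API):

```python
def get_actions(dfa_states):
    """Build a slot-per-state action table; accepting states carry an
    optional action, every other slot keeps the default None ('skip')."""
    table = [None] * len(dfa_states)
    for kind, name, action in dfa_states:
        if kind == "accepting":
            table[name] = action
    return table

# One rejecting state and one accepting state whose action builds a token:
states = [("rejecting", 0, None),
          ("accepting", 1, lambda lexeme: ("NUM", lexeme.decode()))]
actions = get_actions(states)
```

When the lexer stops in state n, it consults slot n: None means "ignore this lexeme and read the next token", a function is applied to the lexeme just read.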
| 1951 | + | |
| 1952 | + The multiple variable is filled up by: | |
| 1953 | + | |
| 1954 | +define One | |
| 1955 | + fill_actions | |
| 1956 | + ( | |
| 1957 | + List(DFA_state($Token)) dfa, | |
| 1958 | + MVar(Maybe(ByteArray -> LexerOutput($Token))) v | |
| 1959 | + ) = | |
| 1960 | + if dfa is | |
| 1961 | + { | |
| 1962 | + [ ] then unique, | |
| 1963 | + [h . t] then | |
| 1964 | + if h is | |
| 1965 | + { | |
| 1966 | + rejecting(name,trs) then unique, | |
| 1967 | + accepting(name,trs,action) then | |
| 1968 | + v(name) <- action | |
| 1969 | + }; | |
| 1970 | + fill_actions(t,v) | |
| 1971 | + }. | |
| 1972 | + | |
| 1973 | + | |
| 1974 | + Making the multiple variable for actions is performed by: | |
| 1975 | + | |
| 1976 | +define MVar(Maybe(ByteArray -> LexerOutput($Token))) | |
| 1977 | + get_actions | |
| 1978 | + ( | |
| 1979 | + List(DFA_state($Token)) dfa | |
| 1980 | + ) = | |
| 1981 | + with ns = length(dfa), // total number of states | |
| 1982 | + v = mvar(truncate_to_Word32(ns), | |
| 1983 | + (Maybe(ByteArray -> LexerOutput($Token)))failure), | |
| 1984 | + fill_actions(dfa,v); v. | |
| 1985 | + | |
| 1986 | + | |
| 1987 | + | |
| 1988 | + Now we plug the lexer into a lexing stream. | |
| 1989 | + | |
| 1990 | + | |
| 1991 | +define One -> LexerOutput($Token) | |
| 1992 | + plug_lexer | |
| 1993 | + ( | |
| 1994 | + LexingStream stream, | |
| 1995 | + (ByteArray input, | |
| 1996 | + FastLexerLastAccepted last_accepted, | |
| 1997 | + FastLexerBeginningOfLine bol, | |
| 1998 | + FastLexerEndOfLine eol, | |
| 1999 | + Int position, | |
| 2000 | + Word32 starting_state) -> FastLexerOutput lexer, | |
| 2001 | + MVar(Maybe(ByteArray -> LexerOutput($Token))) actions | |
| 2002 | + ) = | |
| 2003 | + with bol_v = var((FastLexerBeginningOfLine)at_beginning_of_line), | |
| 2004 | + eol_v = var((FastLexerEndOfLine)not_at_end_of_line), | |
| 2005 | + if stream is lexing_stream(buffer_v,start_v,last_accept_v,current_v,reload_buffer) then | |
| 2006 | + (One _) |-l-> if lexer(*buffer_v, | |
| 2007 | + *last_accept_v, | |
| 2008 | + *bol_v, | |
| 2009 | + *eol_v, | |
| 2010 | + *current_v, | |
| 2011 | + 0) // reading a new token always starts in state 0 | |
| 2012 | + is | |
| 2013 | + { | |
| 2014 | + rejected(state,end,a) then | |
| 2015 | + if a is | |
| 2016 | + { | |
| 2017 | + not_at_end_of_input then | |
| 2018 | + with result = (LexerOutput($Token))error(extract(*buffer_v,*start_v,end)), | |
| 2019 | + current_v <- end+1; | |
| 2020 | + start_v <- end+1; | |
| 2021 | + last_accept_v <- none; | |
| 2022 | + result, | |
| 2023 | + | |
| 2024 | + at_end_of_input then | |
| 2025 | + if reload_buffer(*start_v) is | |
| 2026 | + { | |
| 2027 | + failure then //print("At end (1).\n"); | |
| 2028 | + end_of_input, // really at end of input | |
| 2029 | + success(_) then | |
| 2030 | + l(unique) // continue reading this token | |
| 2031 | + } | |
| 2032 | + }, | |
| 2033 | + | |
| 2034 | + accepted(state,end,a) then | |
| 2035 | + if a is | |
| 2036 | + { | |
| 2037 | + not_at_end_of_input then | |
| 2038 | + if *actions(state) is | |
| 2039 | + { | |
| 2040 | + failure then | |
| 2041 | + current_v <- end; | |
| 2042 | + start_v <- end; | |
| 2043 | + last_accept_v <- none; | |
| 2044 | + l(unique), // ignore and try to read the next token | |
| 2045 | + | |
| 2046 | + success(f) then | |
| 2047 | + with result = f(extract(*buffer_v,*start_v,end)), | |
| 2048 | + current_v <- end; | |
| 2049 | + start_v <- end; | |
| 2050 | + last_accept_v <- none; | |
| 2051 | + result | |
| 2052 | + }, | |
| 2053 | + | |
| 2054 | + at_end_of_input then | |
| 2055 | + if reload_buffer(*start_v) is | |
| 2056 | + { | |
| 2057 | + failure then | |
| 2058 | + if *actions(state) is | |
| 2059 | + { | |
| 2060 | + failure then //print("At end (2).\n"); | |
| 2061 | + end_of_input, // ignore and don't try to continue | |
| 2062 | + success(f) then | |
| 2063 | + with result = f(extract(*buffer_v,*start_v,end)), | |
| 2064 | + current_v <- end; | |
| 2065 | + start_v <- end; | |
| 2066 | + last_accept_v <- none; | |
| 2067 | + result | |
| 2068 | + }, | |
| 2069 | + | |
| 2070 | + success(_) then l(unique) // continue reading this token | |
| 2071 | + } | |
| 2072 | + } | |
| 2073 | + }. | |
| 2074 | + | |
| 2075 | + | |
| 2076 | + | |
| 2077 | + Finally, the tool for making a lexer. | |
| 2078 | + | |
| 2079 | +public define Result(RegExprError, LexingStream -> One -> LexerOutput($Token)) | |
| 2080 | + make_lexer | |
| 2081 | + ( | |
| 2082 | + List(LexerItem($Token)) lexer_description | |
| 2083 | + ) = | |
| 2084 | + if make_DFA(lexer_description) is | |
| 2085 | + { | |
| 2086 | + error(msg) then error(msg), | |
| 2087 | + ok(List(DFA_state($Token)) dfa) then | |
| 2088 | + if make_fast_lexer(to_fast_lexer_description(dfa)) is | |
| 2089 | + { | |
| 2090 | + unknown_state(n) then alert, // cannot happen | |
| 2091 | + ok(fl) then ok((LexingStream ls) |-> plug_lexer(ls,fl,get_actions(dfa))) | |
| 2092 | + } | |
| 2093 | + }. | |
| 2094 | + | |
| 2095 | + | |
| 2096 | + | |
| 0 | 2097 | \ No newline at end of file | ... | ... |
anubis_distrib/library/lexical_analysis/fast_lexer_example_1.anubis
0 → 100644
| 1 | + | |
| 2 | + | |
| 3 | + The Anubis Project | |
| 4 | + | |
| 5 | + Tools for lexical analysis. | |
| 6 | + A simple example. | |
| 7 | + | |
| 8 | + Copyright (c) Constructive Mathematics 2007-2008. | |
| 9 | + | |
| 10 | + | |
| 11 | + Author: Alain Prouté | |
| 12 | + | |
| 13 | + | |
| 14 | + In this file we present a simple example of use of 'fast_lexer.anubis'. The program | |
| 15 | + generated is a very simplified version of the Unix tool 'grep': | |
| 16 | + | |
| 17 | +global define One | |
| 18 | + fast_lexer_example_1 | |
| 19 | + ( | |
| 20 | + List(String) args | |
| 21 | + ). | |
| 22 | + | |
| 23 | + This program receives a regular expression and a filename as its arguments. Its purpose | |
| 24 | + is to print to the standard output all the sequences in the file matching the regular | |
| 25 | + expression, with line numbers. | |
| 26 | + | |
| 27 | +define String | |
| 28 | + usage = | |
| 29 | + "Usage: anbexec fast_lexer_example_1 <regular expression> <file name>\n". | |
| 30 | + | |
| 31 | + | |
| 32 | + | |
| 33 | + --- That's all for the public part ! -------------------------------------------------- | |
| 34 | + Nevertheless, since this is an example, you may want to read the sequel, which is fully | |
| 35 | + commented. | |
| 36 | + | |
| 37 | + | |
| 38 | + | |
| 39 | + | |
| 40 | + -------------------------------- Table of Contents ------------------------------------ | |
| 41 | + | |
| 42 | + *** [1] Tokens. | |
| 43 | + *** [2] Preparing the lexer description. | |
| 44 | + *** [3] Preparing the lexing stream. | |
| 45 | + *** [4] The main loop. | |
| 46 | + *** [5] Carrying on. | |
| 47 | + | |
| 48 | + --------------------------------------------------------------------------------------- | |
| 49 | + | |
| 50 | + | |
| 51 | + | |
| 52 | + First of all, we must access the tool: | |
| 53 | + | |
| 54 | +read lexical_analysis/fast_lexer.anubis | |
| 55 | +read lexical_analysis/dfa_compiler.anubis | |
| 56 | +read lexical_analysis/regexpr_parser.anubis | |
| 57 | +read lexical_analysis/lexing_stream.anubis | |
| 58 | + | |
| 59 | + | |
| 60 | + | |
| 61 | + *** [1] Tokens. | |
| 62 | + | |
| 63 | + The first thing to do is to define the type for representing tokens since 'fast lexer' | |
| 64 | + has a parameter '$Token'. In the case of this example, this type is very simple: | |
| 65 | + | |
| 66 | +type Token: | |
| 67 | + matching(String), | |
| 68 | + newline. | |
| 69 | + | |
| 70 | + since each recognized sequence is just considered as a string. However, we also have to | |
| 71 | + recognize newline characters in order to be able to count lines. | |
| 72 | + | |
| 73 | + | |
| 74 | + | |
| 75 | + | |
| 76 | + *** [2] Preparing the lexer description. | |
| 77 | + | |
| 78 | + Before you can construct your lexer, you must prepare a 'lexer description'. It's of type: | |
| 79 | + | |
| 80 | + List(LexerItem(Token)) | |
| 81 | + | |
| 82 | + We have one lexer item for the given regular expression, another one for | |
| 83 | + newlines. However, we need a third one for ignoring everything else. | |
| 84 | + | |
| 85 | + | |
| 86 | +define List(LexerItem(Token)) | |
| 87 | + prepare_lexer_description | |
| 88 | + ( | |
| 89 | + String regular_expression | |
| 90 | + ) = | |
| 91 | + [ | |
| 92 | + /* recognize sequences matching the given regular expression */ | |
| 93 | + lexer_item(regular_expression, | |
| 94 | + success((ByteArray b) |-> token(matching(to_string(b))))), | |
| 95 | + | |
| 96 | + /* recognize newline characters */ | |
| 97 | + lexer_item("\n", | |
| 98 | + success((ByteArray b) |-> token(newline))), | |
| 99 | + | |
| 100 | + /* ignore everything else */ | |
| 101 | + lexer_item(".", /* "." represents any character except '\n' */ | |
| 102 | + failure) | |
| 103 | + ]. | |
| 104 | + | |
| 105 | + | |
| 106 | + The lexer will be constructed below by applying the function 'make_lexer' (declared in | |
| 107 | + 'fast_lexer.anubis') to this lexer description. | |
| 108 | + | |
| 109 | + | |
| 110 | + | |
| 111 | + | |
| 112 | + | |
| 113 | + *** [3] Preparing the lexing stream. | |
| 114 | + | |
| 115 | + Lexical analysis is performed from an input stream (of type 'LexingStream'). In the | |
| 116 | + case of this example, the input stream is constructed from the given filename. Of | |
| 117 | + course, this may fail, since the file may not exist or may not be readable. | |
| 118 | + | |
| 119 | +define Maybe(LexingStream) | |
| 120 | + prepare_input | |
| 121 | + ( | |
| 122 | + String filename | |
| 123 | + ) = | |
| 124 | + /* try to open the file ('predefined.anubis' section 5.1) */ | |
| 125 | + if file(filename,read) is | |
| 126 | + { | |
| 127 | + failure then failure, | |
| 128 | + success(f) then make_lexing_stream(f, /* the opened file */ | |
| 129 | + 1000, /* size of buffer for the lexing stream */ | |
| 130 | + 100) /* timeout (seconds) */ | |
| 131 | + }. | |
| 132 | + | |
| 133 | + | |
| 134 | + | |
| 135 | + | |
| 136 | + | |
| 137 | + *** [4] The main loop. | |
| 138 | + | |
| 139 | + Assuming our lexer is ready as a function of type 'One -> LexerOutput(Token)' (i.e. the | |
| 140 | + lexing stream is already plugged into it), we construct the main loop of this program. | |
| 141 | + It consists of calling the lexer repeatedly until it returns 'end_of_input'. | |
| 142 | + | |
| 143 | + When it returns an error (actually a lexical error), we print this error. However, this | |
| 144 | + should never happen, because our lexer has a lexer item for ignoring anything not | |
| 145 | + matching one of the first two lexer items. | |
| 146 | + | |
| 147 | + In this loop we also count lines. There is no need for a Var(Int) for that | |
| 148 | + purpose. It's much better to use a 'deterministic local variable' in the form of an | |
| 149 | + extra argument to our function. The function will be called with the value 1 for this | |
| 150 | + argument, which simulates the initialisation of the variable. | |
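The 'deterministic local variable' pattern, threading the line number as an extra argument instead of mutating a variable, looks like this in Python (a hypothetical token list stands in for the lexer):

```python
def main_loop(tokens, lineno=1, out=None):
    """Consume tokens recursively, threading the current line number
    as an argument rather than mutating a shared variable."""
    if out is None:
        out = []
    if not tokens:
        return out  # end of input: exit the loop
    head, *rest = tokens
    if head == "\n":
        return main_loop(rest, lineno + 1, out)  # incremented lineno
    out.append((lineno, head))
    return main_loop(rest, lineno, out)  # same lineno
```

The initial call with lineno=1 plays the role of initialising the variable.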
| 151 | + | |
| 152 | +define One | |
| 153 | + main_loop | |
| 154 | + ( | |
| 155 | + One -> LexerOutput(Token) lexer, | |
| 156 | + Int lineno /* no need for a Var(Int) */ | |
| 157 | + ) = | |
| 158 | + /* get the next token or whatever */ | |
| 159 | + if lexer(unique) is | |
| 160 | + { | |
| 161 | + end_of_input then /* no more token: exit the main loop */ | |
| 162 | + unique, | |
| 163 | + | |
| 164 | + error(b) then | |
| 165 | + /* should never happen with this lexer (see the above comment) */ | |
| 166 | + print("Error: ["+to_string(b)+"]\n"); | |
| 167 | + /* nevertheless we continue the lexical analysis */ | |
| 168 | + main_loop(lexer,lineno), | |
| 169 | + | |
| 170 | + token(t) then | |
| 171 | + /* a token has been recognized */ | |
| 172 | + if t is | |
| 173 | + { | |
| 174 | + matching(s) then /* print the current line number and the recognized sequence */ | |
| 175 | + print(abs_to_decimal(lineno)+": "+s+"\n"); | |
| 176 | + /* continue with the same lineno */ | |
| 177 | + main_loop(lexer,lineno), | |
| 178 | + | |
| 179 | + newline then /* continue with an incremented lineno */ | |
| 180 | + main_loop(lexer,lineno+1) | |
| 181 | + } | |
| 182 | + }. | |
| 183 | + | |
| 184 | + | |
| 185 | + | |
| 186 | + | |
| 187 | + | |
| 188 | + *** [5] Carrying on. | |
| 189 | + | |
| 190 | + | |
| 191 | +read tools/basis.anubis (needed for UTime subtraction) | |
| 192 | + | |
| 193 | + | |
| 194 | + Now we can define our tool. We have to: | |
| 195 | + | |
| 196 | + - check that the user gave the two required arguments on the command line, | |
| 197 | + - prepare the lexer description, | |
| 198 | + - prepare the input stream, | |
| 199 | + - run the main loop. | |
| 200 | + | |
| 201 | +global define One | |
| 202 | + fast_lexer_example_1 | |
| 203 | + ( | |
| 204 | + List(String) args | |
| 205 | + ) = | |
| 206 | + /* check for first argument */ | |
| 207 | + if args is | |
| 208 | + { | |
| 209 | + [ ] then print(usage), | |
| 210 | + [re . t] then | |
| 211 | + /* check for second argument */ | |
| 212 | + if t is | |
| 213 | + { | |
| 214 | + [ ] then print(usage), | |
| 215 | + [filename . _] then | |
| 216 | + /* prepare the lexer description and make the lexer */ | |
| 217 | + if make_lexer(prepare_lexer_description(re)) is | |
| 218 | + { | |
| 219 | + error(msg) then print("Syntax error in regular expression: "+to_English(msg)+"\n"), | |
| 220 | + ok(lexer) then | |
| 221 | + /* prepare the input stream */ | |
| 222 | + if prepare_input(filename) is | |
| 223 | + { | |
| 224 | + failure then print("cannot open or read file '"+filename+"'.\n"), | |
| 225 | + success(ls) then | |
| 226 | + with start_time = unow, | |
| 227 | + /* run the main loop */ | |
| 228 | + main_loop(lexer(ls),1); | |
| 229 | + if unow - start_time is utime(secs,microsecs) then | |
| 230 | + print("Duration: "+abs_to_decimal(secs)+" seconds, "+abs_to_decimal(microsecs)+" microseconds.\n") | |
| 231 | + } | |
| 232 | + } | |
| 233 | + } | |
| 234 | + }. | |
| 235 | + | |
| 236 | + | |
| 237 | + | |
| 238 | + | |
| 239 | + | |
| 240 | + | |
| 0 | 241 | \ No newline at end of file | ... | ... |
anubis_distrib/library/lexical_analysis/lexer_maker_v2_example.lexer
0 → 100644
| 1 | + | |
| 2 | + | |
| 3 | + This is an example of use of 'lexer_maker'. | |
| 4 | + | |
| 5 | +read tools/basis.anubis | |
| 6 | + | |
| 7 | + | |
| 8 | + We want to test email addresses. Below is a regular expression for that | |
| 9 | + purpose. Actually, this expression is too naïve. A real one would be more complicated. | |
| 10 | + | |
| 11 | +#ETL | |
| 12 | + | |
| 13 | +#email_tester String | |
| 14 | +[a-zA-Z0-9\-_]+(\.[a-zA-Z0-9\-_]+)*@[a-zA-Z0-9]+(\.[a-zA-Z0-9]+)+ { (ls,token(text)) } | |
| 15 | +# | |
| 16 | + | |
| 17 | + | |
| 18 | + Since '@' is a normal character, a string needs to contain exactly one '@' to be | |
| 19 | + accepted. What is accepted before and after this '@' is described by: | |
| 20 | + | |
| 21 | + [a-zA-Z]+(\.[a-zA-Z]+)* | |
| 22 | + | |
| 23 | + The first part: [a-zA-Z]+ means ``at least one letter''. The last part: (\.[a-zA-Z]+)* | |
| 24 | + means: ``a dot followed by at least one letter, and this may be repeated any number of | |
| 25 | + times (including zero)''. | |
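For readers who want to experiment, the same (admittedly naïve) email pattern from the lexer rule above can be tried out with Python's re module:

```python
import re

# The naive email pattern from the rule above: one '@', dot-separated
# word parts on each side (at least one dot required after the '@').
pattern = re.compile(
    r"[a-zA-Z0-9\-_]+(\.[a-zA-Z0-9\-_]+)*@[a-zA-Z0-9]+(\.[a-zA-Z0-9]+)+")

def accepts(address):
    """True if the whole string matches the pattern."""
    return bool(pattern.fullmatch(address))
```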
| 26 | + | |
| 27 | + | |
| 28 | + This part of the source file is the 'postamble' (just Anubis text, which is copied 'as | |
| 29 | + is' to the lexer_maker output file). | |
| 30 | + | |
| 31 | + The above rules produce a function named 'email_tester' in the lexer_maker output | |
| 32 | + file. This function is used below: | |
| 33 | + | |
| 34 | +global define One | |
| 35 | + test_email_address | |
| 36 | + ( | |
| 37 | + List(String) args | |
| 38 | + ) = | |
| 39 | + if args is | |
| 40 | + { | |
| 41 | + [ ] then print("Usage: test_email_address <address> ... <address>\n"), | |
| 42 | + [_ . _] then | |
| 43 | + map_forget((String s) |-> | |
| 44 | + with ls = lexer_state(make_stream(s),[],[],email_tester,true,false,failure), | |
| 45 | + if email_tester(ls) is (_,result) then if result is | |
| 46 | + { | |
| 47 | + end_of_file then print("End of input.\n"), | |
| 48 | + token(t) then with result1 = implode(t), | |
| 49 | + if length(result1) = length(s) | |
| 50 | + then print(s+" (accepted)\n") | |
| 51 | + else print(s+" (truncated as: "+result1+")\n"), | |
| 52 | + error then print(s+" (rejected)\n") | |
| 53 | + }, | |
| 54 | + args) | |
| 55 | + }. | |
| 56 | + | |
| 57 | + | |
| 58 | + | ... | ... |
anubis_distrib/library/lexical_analysis/testing_fast_lexer.anubis
0 → 100644
| 1 | + | |
| 2 | + | |
| 3 | + | |
| 4 | + | |
| 5 | + | |
| 6 | + | |
| 7 | + | |
| 8 | + | |
| 9 | + This is just for testing 'fast_lexer.anubis'. | |
| 10 | + | |
| 11 | +read tools/basis.anubis | |
| 12 | +read tools/streams.anubis | |
| 13 | + | |
| 14 | +read regexpr_parser.anubis | |
| 15 | +read dfa_compiler.anubis | |
| 16 | + | |
| 17 | + | |
| 18 | +define String | |
| 19 | + format | |
| 20 | + ( | |
| 21 | + DFA_pre_label(String) l | |
| 22 | + ) = | |
| 23 | + if l is | |
| 24 | + { | |
| 25 | + char(c) then implode[c], | |
| 26 | + beginning_of_line then "^", | |
| 27 | + end_of_line then "$", | |
| 28 | + action(mbf) then if mbf is | |
| 29 | + { | |
| 30 | + failure then "<ignore>", | |
| 31 | + success(f) then if f(constant_byte_array(0,0)) is | |
| 32 | + token(s) then s else alert | |
| 33 | + } | |
| 34 | + }. | |
| 35 | + | |
| 36 | +define String | |
| 37 | + format | |
| 38 | + ( | |
| 39 | + DFA_label l | |
| 40 | + ) = | |
| 41 | + if l is | |
| 42 | + { | |
| 43 | + char(c) then implode[c], | |
| 44 | + beginning_of_line then "^", | |
| 45 | + end_of_line then "$" | |
| 46 | + }. | |
| 47 | + | |
| 48 | +define Printable_tree | |
| 49 | + format | |
| 50 | + ( | |
| 51 | + DFA_transition t | |
| 52 | + ) = | |
| 53 | + if t is transition(label,target_name) then | |
| 54 | + ["'", format(label), "'>", target_name, " "]. | |
| 55 | + | |
| 56 | + | |
| 57 | +define Printable_tree | |
| 58 | + format | |
| 59 | + ( | |
| 60 | + List(DFA_transition) l | |
| 61 | + ) = | |
| 62 | + if l is | |
| 63 | + { | |
| 64 | + [ ] then [ ], | |
| 65 | + [h . t] then [format(h) . format(t)] | |
| 66 | + }. | |
| 67 | + | |
| 68 | +define Printable_tree | |
| 69 | + format | |
| 70 | + ( | |
| 71 | + DFA_state(String) s | |
| 72 | + ) = | |
| 73 | + if s is | |
| 74 | + { | |
| 75 | + rejecting(n,trs) then ["\n", to_decimal(n), " (rejecting) ", format(trs)], | |
| 76 | + accepting(n,trs,mba) then ["\n", to_decimal(n), " (accepting) ", format(trs), | |
| 77 | + if mba is | |
| 78 | + { | |
| 79 | + failure then "<ignore>", | |
| 80 | + success(a) then "<action "+ | |
| 81 | + if a(constant_byte_array(0,0)) is | |
| 82 | + { | |
| 83 | + end_of_input then alert, | |
| 84 | + error(_) then alert, | |
| 85 | + token(s1) then s1 | |
| 86 | + }+">" | |
| 87 | + }] | |
| 88 | + }. | |
| 89 | + | |
| 90 | + | |
| 91 | +define Printable_tree | |
| 92 | + format | |
| 93 | + ( | |
| 94 | + List(DFA_state(String)) l | |
| 95 | + ) = | |
| 96 | + if l is | |
| 97 | + { | |
| 98 | + [ ] then ["\n------------------------\n"], | |
| 99 | + [h . t] then | |
| 100 | + [format(h) . format(t)] | |
| 101 | + }. | |
| 102 | + | |
| 103 | + | |
| 104 | +define One | |
| 105 | + syntax | |
| 106 | + = | |
| 107 | + print("Usage: fast_lexer_test <regular expression> ... <regular expression>\n\n"). | |
| 108 | + | |
| 109 | + | |
| 110 | +define String | |
| 111 | + format | |
| 112 | + ( | |
| 113 | + RegExpr e | |
| 114 | + ) = | |
| 115 | + if e is | |
| 116 | + { | |
| 117 | + char(Word8 c) then implode([c]), | |
| 118 | + choice(l) then "["+implode(l)+"]", | |
| 119 | + plus(RegExpr e1) then "("+format(e1)+"+"+")", | |
| 120 | + star(RegExpr e1) then "("+format(e1)+"*"+")", | |
| 121 | + cat(RegExpr e1,RegExpr e2) then format(e1)+format(e2), | |
| 122 | + or(RegExpr e1,RegExpr e2) then "("+format(e1)+"|"+format(e2)+")", | |
| 123 | + beginning_of_line then "^", | |
| 124 | + end_of_line then "$", | |
| 125 | + dot then ".", | |
| 126 | + question_mark(e1) then "("+format(e1)+")?" | |
| 127 | + }. | |
| 128 | + | |
| 129 | + | |
| 130 | +define String | |
| 131 | + format | |
| 132 | + ( | |
| 133 | + BasicRegExpr($Token) e | |
| 134 | + ) = | |
| 135 | + if e is | |
| 136 | + { | |
| 137 | + char(c) then implode([c]), | |
| 138 | + star(e1) then "("+format(e1)+"*"+")", | |
| 139 | + or(e1,e2) then "("+format(e1)+"|"+format(e2)+")", | |
| 140 | + cat(e1,e2) then format(e1)+format(e2), | |
| 141 | + epsilon then "()", | |
| 142 | + beginning_of_line then "^", | |
| 143 | + end_of_line then "$", | |
| 144 | + action(a) then "<action>" | |
| 145 | + }. | |
| 146 | + | |
| 147 | + | |
| 148 | +define List(LexerItem(String)) | |
| 149 | + prepare_lexer_items | |
| 150 | + ( | |
| 151 | + List(String) regexprs, | |
| 152 | + Int i | |
| 153 | + ) = | |
| 154 | + if regexprs is | |
| 155 | + { | |
| 156 | + [ ] then [ ], | |
| 157 | + [h . t] then | |
| 158 | + [lexer_item(h,success((ByteArray b) |-> token(to_decimal(i)))) | |
| 159 | + . prepare_lexer_items(t,i+1)] | |
| 160 | + }. | |
| 161 | + | |
| 162 | + | |
| 163 | +define Printable_tree | |
| 164 | + format | |
| 165 | + ( | |
| 166 | + List(FastLexerTransition) l | |
| 167 | + ) = | |
| 168 | + if l is | |
| 169 | + { | |
| 170 | + [ ] then [ ], | |
| 171 | + [h . t] then if h is | |
| 172 | + { | |
| 173 | + transition(c,s) then | |
| 174 | + [implode[c], ":", s, " " . format(t)], | |
| 175 | + beginning_of_line(s) then | |
| 176 | + ["^:",s, " " . format(t)], | |
| 177 | + end_of_line(s) then | |
| 178 | + ["$:",s, " " . format(t)] | |
| 179 | + } | |
| 180 | + }. | |
| 181 | + | |
| 182 | +define Printable_tree | |
| 183 | + format | |
| 184 | + ( | |
| 185 | + List(FastLexerState) l, | |
| 186 | + Int i | |
| 187 | + ) = | |
| 188 | + if l is | |
| 189 | + { | |
| 190 | + [ ] then ["\n------------------------\n"], | |
| 191 | + [h . t] then if h is | |
| 192 | + { | |
| 193 | +        rejecting(trs) then ["\n", to_decimal(i), " rejecting: ", format(trs) . format(t,i+1)], | |
| 194 | +        accepting(trs) then ["\n", to_decimal(i), " accepting: ", format(trs) . format(t,i+1)] | |
| 195 | + } | |
| 196 | + }. | |
| 197 | + | |
| 198 | + | |
| 199 | + | |
| 200 | + | |
| 201 | +define One | |
| 202 | + run_fast_lexer | |
| 203 | + ( | |
| 204 | + (ByteArray input, | |
| 205 | + FastLexerLastAccepted last_accepted, | |
| 206 | + FastLexerBeginningOfLine bol, | |
| 207 | + FastLexerEndOfLine eol, | |
| 208 | + Int position, | |
| 209 | + Word32 starting_state) -> FastLexerOutput fast | |
| 210 | + ) = | |
| 211 | + with text = prompt("Try it out (q to quit): ") + "\n", | |
| 212 | + if text = "q\n" then unique else | |
| 213 | + with ba = to_byte_array(text), | |
| 214 | + if fast(ba, | |
| 215 | + none, | |
| 216 | + at_beginning_of_line, | |
| 217 | + not_at_end_of_line, | |
| 218 | + 0, | |
| 219 | + 0) is | |
| 220 | + { | |
| 221 | + rejected(n,e,a) then print("\""+to_string(extract(ba,0,e))+ | |
| 222 | + "\" rejected in state "+to_decimal(n)+"\n"), | |
| 223 | + accepted(n,e,a) then print("\""+to_string(extract(ba,0,e))+ | |
| 224 | + "\" accepted in state "+to_decimal(n)+"\n") | |
| 225 | + }; | |
| 226 | + run_fast_lexer(fast). | |
| 227 | + | |
| 228 | + | |
| 229 | +define One | |
| 230 | + run_fast_lexer | |
| 231 | + ( | |
| 232 | + List(FastLexerState) l | |
| 233 | + ) = | |
| 234 | + if make_fast_lexer(l) is | |
| 235 | + { | |
| 236 | + unknown_state(n) then print("\nUnknown state: "+to_decimal(n)), | |
| 237 | + ok(fast) then run_fast_lexer(fast) | |
| 238 | + }. | |
| 239 | + | |
| 240 | + | |
| 241 | +global define One | |
| 242 | + fast_lexer_test | |
| 243 | + ( | |
| 244 | + List(String) args | |
| 245 | + ) = | |
| 246 | + if args is [] then syntax else | |
| 247 | + map_forget((String e) |-> if parse_regular_expression(make_stream(e)) is | |
| 248 | + { | |
| 249 | + error(msg) then print("*** Error: "+to_English(msg)+"\n\n"), | |
| 250 | + ok(re) then print("Regular expression "+e+" is correct.\n"); | |
| 251 | + print("Read as: "+format(re)+"\n"); | |
| 252 | + print("Basic equivalent: "+ | |
| 253 | + format((BasicRegExpr(String))to_basic(re))+"\n\n") | |
| 254 | + }, | |
| 255 | + args); | |
| 256 | + if make_DFA(prepare_lexer_items(args,0)) is | |
| 257 | + { | |
| 258 | + error(msg) then print("*** Error: "+to_English(msg)+"\n\n"), | |
| 259 | + ok(auto) then with fl = to_fast_lexer_description(auto), | |
| 260 | + print("Automaton:\n------------------------ "); | |
| 261 | + print(format(auto)); | |
| 262 | + print("Fast Lexer:\n------------------------ "); | |
| 263 | + print(format(fl,0)); | |
| 264 | + run_fast_lexer(fl) | |
| 265 | + }. | |
| 266 | + | |
| 267 | + | |
| 268 | + | |
| 269 | + | |
| 270 | + | |
| 271 | + | |
| 272 | + | |
| 0 | 273 | \ No newline at end of file