Commit 60ef83694df06a3e880e457cebe9ccaeff00b39e

Authored by Alain Prouté
1 parent 0bc83921

Updated directory lexical_analysis in anubis_distrib/library/

anubis_distrib/library/lexical_analysis/fast_lexer.anubis 0 → 100644
  1 +
  2 +
  3 +
  4 + The Anubis Project
  5 +
  6 + A tool for producing fast buffered lexers.
  7 +
  8 + Copyright (c) Constructive Mathematics 2008.
  9 +
  10 +
  11 + Author: Alain Prouté
  12 +
  13 +
  14 +
  15 + *** Introduction.
  16 +
  17 + This tool is more or less equivalent to the Unix tool LEX/FLEX. It replaces the
  18 + previous, similar tool 'lexer_maker_v2.anubis', which produced lexers that were
  19 + too slow and is now obsolete.
  20 +
  21 + If you want to use this tool, you will have to add:
  22 +
  23 + read lexical_analysis/fast_lexer.anubis
  24 +
  25 + into your source file.
  26 +
  27 +
  28 + Consider a 'source' from which bytes can be read, such as a file, a network connection
  29 + (maybe an SSL connection), a string or a byte array, etc. There are tools for
  30 + getting the bytes from this source one after the other, but in general we are more
  31 + interested in particular sequences of bytes which are called 'tokens'. As an
  32 + example, if the source is the following string:
  33 +
  34 + "344 + 87"
  35 +
  36 + we prefer to read the three 'tokens' "344", "+" and "87" directly (ignoring white
  37 + space) rather than the sequence of bytes '3', '4', '4', ' ', '+', ' ', '8' and '7'.
  38 +
  39 + A 'lexer' is precisely the gadget which does this job easily and fast (and even
  40 + better than described above). It uses lexing streams, which are buffered for
  41 + better performance.
  42 +
  43 +
  44 +
  45 + ---------------------------------- Table of Contents ----------------------------------
  46 +
  47 + *** (1) Regular expressions.
  48 + *** (2) Lexer output.
  49 + *** (3) Lexing streams.
  50 + *** (4) Constructing a lexer.
  51 + *** (5) Plugging several lexers on the same input stream.
  52 +
  53 + ---------------------------------------------------------------------------------------
  54 +
  55 +
  56 +
  57 +
  58 + *** (1) Regular expressions.
  59 +
  60 + Regular expressions are character strings which are used for describing particular sets
  61 + of tokens. Regular expressions are written using ASCII characters, but some of them
  62 + have a special meaning. They are the following:
  63 +
  64 + ( ) [ ] - \ * + | . $ ^ ?
  65 +
  66 + All other characters just represent themselves. For example, the regular expression
  67 + 'abcd' represents only the token 'abcd'.
  68 +
  69 + Parentheses do not represent anything. They are just used for delimiting regular
  70 + expressions. For example '(abcd)' represents the same thing as 'abcd'.
  71 +
  72 + The regular expression '[abcd]' represents the 4 tokens: 'a', 'b', 'c' and 'd'. In
  73 + other words, characters between brackets represent all the tokens made of one and only
  74 + one of these characters. There is a shortcut for ranges of characters. Instead of
  75 + writing
  76 +
  77 + [abcdefghijklmnopqrstuvwxyz]
  78 +
  79 + you may just write '[a-z]'. For example, the regular expression '[a-zA-Z0-9]'
  80 + represents any token made of one and only one alphanumeric character.
  81 +
  82 + If you add a caret just after the opening bracket, the regular expression represents
  83 + all one-byte tokens for all bytes not present within the brackets (i.e. the
  84 + 'complement' in some sense of the previous set). For example, the regular expression
  85 + '[^a-z]' represents all one-byte tokens whose unique character is not a lower case
  86 + letter. Note: a byte is any Word8, so that '[^a-z]' also matches characters with
  87 + codes above 127.
  88 +
  89 + If 'A' is a regular expression, 'A+' represents any non-empty concatenation of tokens
  90 + represented by 'A'. For example, '[a-z]+' represents any non-empty sequence of
  91 + lowercase letters. Similarly, 'A*' represents all the tokens represented by 'A+', plus
  92 + the empty token (the token made of no character at all).
  93 +
  94 + If 'A' and 'B' are regular expressions, 'AB' is a regular expression representing any
  95 + concatenation of a token represented by 'A' and a token represented by 'B'. For
  96 + example, 'a+b+' represents any non-empty sequence of 'a' followed by any non-empty
  97 + sequence of 'b'. As another example, '[A-Z][A-Za-z]*' represents any sequence of
  98 + letters beginning with an upper case letter (hence actually non-empty).
  99 +
  100 + The backslash character escapes the following character. For example, the regular
  101 + expression '\(' represents the token made of the single character '('. Of course, this
  102 + is useful for special characters. However, the sequences '\n', '\r' and '\t' represent
  103 + respectively a line feed, a carriage return and a tab.
  104 +
  105 + If 'A' and 'B' are regular expressions, 'A|B' is a regular expression representing all
  106 + the tokens represented by 'A' and all the tokens represented by 'B'. For example,
  107 + '(a+)|(b+)' represents all non-empty sequences containing either only a's or only b's.
  108 +
  109 + The dot '.' represents any character except '\n'.
  110 +
  111 + If 'A' is a regular expression, '^A' represents any token represented by 'A' provided
  112 + that it appears at the beginning of a line. Similarly, 'A$' represents any token
  113 + represented by 'A' provided that it ends at the end of a line. For example, the regular
  114 + expression '//.*$' matches a one-line Anubis (or C++) comment, and the regular
  115 + expression '^define' matches the keyword 'define' only when it is found in the leftmost
  116 + column.
  117 +
  118 + If 'A' is a regular expression, 'A?' represents all the tokens represented by 'A' plus
  119 + the empty token.
  120 +
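 Combining these operators: as a hedged example (not one used elsewhere in this file),
 a regular expression for optionally signed decimal integers could be written as
 follows.

 (\+|-)?[0-9]+

 Here '(\+|-)?' matches an optional sign (the '+' must be escaped because it is a
 special character) and '[0-9]+' matches a non-empty sequence of digits, so that the
 expression matches tokens such as '42', '+7' and '-128'.
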
  121 +
  122 + When you construct a lexer you provide one or several regular expressions. These
  123 + regular expressions may be syntactically incorrect. For this reason, we have the
  124 + following type for classifying the possible errors:
  125 +
  126 +public type RegExprError:
  127 + premature_end_of_regexpr,
  128 + unexpected_right_par,
  129 + unexpected_right_bracket,
  130 + regexpr_is_empty,
  131 + star_not_following_a_regexpr,
  132 + plus_not_following_a_regexpr,
  133 + question_mark_not_following_a_regexpr,
  134 + non_character_within_brackets,
  135 + misplaced_hyphen,
  136 + unexpected_vbar,
  137 + empty_lexer_description.
  138 +
  139 +
  140 + For your convenience, the next function transforms such an error into a message in
  141 + English.
  142 +
  143 +public define String
  144 + to_English
  145 + (
  146 + RegExprError e
  147 + ).
  148 +
  149 +
  150 +
  151 +
  152 + *** (2) Lexer output.
  153 +
  154 + A single lexer may recognize different sorts of tokens. For example, a lexer may
  155 + recognize 'symbols' (represented say by the regular expression '[a-zA-Z]+'), and
  156 + integers (represented say by the regular expression '[0-9]+'). The role of the lexer is
  157 + not only to recognize such tokens, but also to return them in such a way that their
  158 + sort is obvious. For this reason, it is convenient to define a type of tokens with one
  159 + alternative for each sort of token. In the case of our example, this type could be:
  160 +
  161 + type Token:
  162 + symbol(String name),
  163 + integer(Int value).
  164 +
  165 + The type of tokens for a given lexer is represented in this file by the type parameter
  166 + '$Token'. A lexer returns a datum of type:
  167 +
  168 +public type LexerOutput($Token):
  169 + end_of_input,
  170 + error(ByteArray),
  171 + token($Token).
  172 +
  173 + The lexer returns 'end_of_input' when there is no hope that a next token may be read
  174 + from the input source. In the case of a file this means that the end of the file has
  175 + been reached. In the case of a network connection, this means that the connection has
  176 + been closed or that the read timed out. In the case of a string or a byte array, this
  177 + means that the end of the string or byte array has been reached.
  178 +
  179 + The lexer returns 'error(b)' when no token can be read from the input (but the end of
  180 + the input has not been reached). Some bytes may have been read from the input: they
  181 + could have been the beginning of a token, up to the first byte which cannot be part of
  182 + a token. The next time the lexer is called, it will continue to read from after this
  183 + sequence.
  184 +
  185 + When a token has been recognized, the lexer has the token at its disposal in the form
  186 + of a byte array. In order to transform this byte array into a datum of type '$Token'
  187 + you have to provide a function of type 'ByteArray -> LexerOutput($Token)'. For
  188 + example, if a 'symbol' is to be recognized, the corresponding function could be
  189 + something like this:
  190 +
  191 + (ByteArray b) |-> token(symbol(to_string(b)))
  192 +
  193 + If an integer is to be recognized, the corresponding function could be:
  194 +
  195 + (ByteArray b) |-> if decimal_scan(to_string(b)) is
  196 + {
  197 + failure then error(b),
  198 + success(n) then token(integer(n))
  199 + }
  200 +
  201 + So, in the case of our example (using the type 'Token' above), the lexer may be
  202 + described by the following list of 'lexer items':
  203 +
  204 + [
  205 + lexer_item("[A-Za-z]+",
  206 + success((ByteArray b) |-> token(symbol(to_string(b))))),
  207 + lexer_item("[0-9]+",
  208 + success((ByteArray b) |-> if decimal_scan(to_string(b)) is
  209 + {
  210 + failure then error(b),
  211 + success(n) then token(integer(n))
  212 + }))
  213 + ]
  214 +
  215 + where the type 'LexerItem($Token)' is defined as follows:
  216 +
  217 +public type LexerItem($Token):
  218 + lexer_item(String regular_expression,
  219 + Maybe(ByteArray -> LexerOutput($Token)) action).
  220 +
  221 + If you don't provide a function in a lexer item (using 'failure' instead of 'success'),
  222 + the recognized token is just ignored and the lexer tries to read the next token.
  223 +
  224 + Notice that the most usual use of a lexer is to call it repeatedly until it returns
  225 + 'end_of_input'. However, in some circumstances we want, for example, to check whether
  226 + a whole string matches a regular expression. In this case the lexer is called a first
  227 + time, and if it returns a token it must be called a second time in order to check
  228 + that we have reached the end of the input.
  229 +
  230 +
  231 +
  232 +
  233 + *** (3) Lexing streams.
  234 +
  235 + The lexer recognizes tokens by reading characters from some input. The actual input may
  236 + be either a file, a network connection, a string, a byte array, or anything able to
  237 + provide characters. From any of the above you may construct a 'lexing stream'.
  238 +
  239 +public type LexingStream:... (an opaque type)
  240 +
  241 +public define LexingStream make_lexing_stream(ByteArray b).
  242 +public define LexingStream make_lexing_stream(String s).
  243 +public define Maybe(LexingStream) make_lexing_stream(RStream stream,
  244 + Int buffer_size,
  245 + Int timeout).
  246 +public define Maybe(LexingStream) make_lexing_stream(RWStream stream,
  247 + Int buffer_size,
  248 + Int timeout).
  249 +public define Maybe(LexingStream) make_lexing_stream(SSL_Connection stream,
  250 + Int buffer_size,
  251 + Int timeout).
  252 +
  253 + In the case of a file or network connection (first argument of type 'RStream',
  254 + 'RWStream', 'SSL_Connection') byte arrays are used for buffering the input. The maximal
  255 + size of these buffers must be provided as the second argument. This choice has no
  256 + effect on the behavior of the lexer, except with respect to performance, and the
  257 + lexer can still return tokens longer than this size. The timeout is in seconds and is
  258 + used each time the buffer is reloaded from the actual input. When the timeout expires,
  259 + the lexer gives up as if the end of the input had been reached. So, you may have to
  260 + give a rather high value to this timeout.
  261 +
  262 + 'make_lexing_stream' returns 'failure' if a read error or timeout occurs when the
  263 + buffer is loaded for the first time.
  264 +
  265 + In the case of a byte array or a string, the situation is much simpler. The buffer is
  266 + the byte array or the string itself, no timeout is needed and the result has no
  267 + 'Maybe'.
  268 +
  269 + If you need another kind of lexing stream, have a look at the private part of this
  270 + file, in particular at the actual definition of type 'LexingStream', and write down
  271 + another such function.
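
 As an illustration (only the functions declared above are used; 'some_rstream' is a
 hypothetical datum of type 'RStream', and the buffer size and timeout values are
 arbitrary):

 make_lexing_stream("344 + 87")              // a LexingStream, directly
 make_lexing_stream(some_rstream, 4096, 60)  // a Maybe(LexingStream): 4096 byte
                                             // buffers, 60 second timeout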
  272 +
  273 +
  274 +
  275 +
  276 + *** (4) Constructing a lexer.
  277 +
  278 + In order to construct a lexer, use the following:
  279 +
  280 +public define Result(RegExprError, LexingStream -> One -> LexerOutput($Token))
  281 + make_lexer
  282 + (
  283 + List(LexerItem($Token)) lexer_description
  284 + ).
  285 +
  286 + Thus, a lexer is constructed (if no error occurs) as a function of type:
  287 +
  288 + LexingStream -> One -> LexerOutput($Token)
  289 +
  290 + Applying this function to a lexing stream is understood as 'plugging' it into the
  291 + stream. The result is a function of type:
  292 +
  293 + One -> LexerOutput($Token)
  294 +
  295 + to be used repeatedly until it returns 'end_of_input'.
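
 Here is a sketch of the whole procedure, reusing the hypothetical type 'Token' and
 the lexer items of section (2), plus an extra item which silently skips white space.
 Everything except the '...' parts comes from the declarations above:

 if make_lexer([lexer_item("[A-Za-z]+",
                           success((ByteArray b) |-> token(symbol(to_string(b))))),
                lexer_item("[ \t\n\r]+", failure)]) is
 {
   error(e) then ... , // report to_English(e)
   ok(lexer) then
     with next_token = lexer(make_lexing_stream("foo bar")),
     ... // apply 'next_token' repeatedly until it returns end_of_input
 }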
  296 +
  297 +
  298 +
  299 +
  300 +
  301 + *** (5) Plugging several lexers on the same input.
  302 +
  303 + It is often the case that we have to use several lexers on the same input. This is
  304 + equivalent to saying that we have only one lexer on this input but with several
  305 + different 'states', in the sense of LEX/FLEX for example. In our system there is no
  306 + notion of 'state' for lexers, but several lexers may use the same lexing stream
  307 + concurrently. You can plug them into the same lexing stream, and use them repeatedly
  308 + in any order depending on the sort of thing you want to read from the stream.
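
 For example (a sketch with two hypothetical lexers 'code_lexer' and 'comment_lexer',
 both constructed with 'make_lexer'):

 with ls = make_lexing_stream(source),
      next_code = code_lexer(ls),
      next_comment = comment_lexer(ls),
 ... // call next_code or next_comment, in any order

 Both functions read from the same buffer, so the tokens come out in the order in
 which they appear in the input, whichever lexer reads them.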
  309 +
  310 +
  311 +
  312 +
  313 +
  314 + --- That's all for the public part ! --------------------------------------------------
  315 +
  316 +
  317 +read tools/basis.anubis
  318 +read tools/streams.anubis
  319 +
  320 +
  321 + -------------------------------- Table of Contents ------------------------------------
  322 +
  323 +
  324 + ---------------------------------------------------------------------------------------
  325 +
  326 +
  327 +
  328 +
  329 + *** [1] Parsing regular expressions.
  330 +
  331 +
  332 + *** [1.1] Regular expressions.
  333 +
  334 + Regular expressions are formalized as follows.
  335 +
  336 +public type RegExpr:
  337 + char(Word8), // a
  338 + choice(List(Word8)), // [abc]
  339 + plus(RegExpr), // a+
  340 + star(RegExpr), // a*
  341 + cat(RegExpr,RegExpr), // ab
  342 + or(RegExpr,RegExpr), // (a|b)
  343 + beginning_of_line, // ^
  344 + end_of_line, // $
  345 + dot, // .
  346 + question_mark(RegExpr). // a?
  347 +
  348 +
  349 +
  350 + *** [1.2] Basic regular expressions.
  351 +
  352 + Basic regular expressions are enough for representing all regular expressions. In other
  353 + words, any regular expression is equivalent to a basic regular expression. Furthermore,
  354 + at some point of the construction of lexers we have to handle 'actions'. We introduce
  355 + them here even if we generate them only in 'dfa_compiler.anubis'. This also makes the
  356 + type 'LexerOutput($Token)' required at this point.
  357 +
  358 +public type BasicRegExpr($Token):
  359 + char(Word8),
  360 + star(BasicRegExpr($Token)),
  361 + or(BasicRegExpr($Token),BasicRegExpr($Token)),
  362 + cat(BasicRegExpr($Token),BasicRegExpr($Token)),
  363 + epsilon, // matches the empty sequence of characters
  364 + beginning_of_line,
  365 + end_of_line,
  366 + action(Maybe(ByteArray -> LexerOutput($Token))).
  367 +
  368 + The role of 'epsilon', which matches only the empty lexeme, is to provide a
  369 + representation for the empty choice '[]', and for regular expressions of the form 'A?',
  370 + which are translated into 'or(A,epsilon)'.
  371 +
  372 + The following function transforms a regular expression into an equivalent basic regular
  373 + expression.
  374 +
  375 +public define BasicRegExpr($Token)
  376 + to_basic
  377 + (
  378 + RegExpr e
  379 + ).
  380 +
  381 +
  382 +
  383 + *** [1.3] 'Extended' characters.
  384 +
  385 + 'Extended' characters (used in regular expressions) are defined (and classified) as
  386 + follows.
  387 +
  388 +type ExChar:
  389 + left_par, // (
  390 + right_par, // )
  391 + left_bracket, // [
  392 + right_bracket, // ]
  393 + star, // *
  394 + plus, // +
  395 + or, // |
  396 + dot, // .
  397 + dollar, // $
  398 + caret, // ^
  399 + hyphen, // -
  400 + question_mark, // ?
  401 + char(Word8). // a, b, c, ...
  402 +
  403 +
  404 +
  405 +
  406 + *** [1.4] Getting the next (extended) character from the input stream.
  407 +
  408 + The next function reads an extended character from the input stream. It returns
  409 + 'failure' when it encounters the end of the input.
  410 +
  411 +define Maybe(ExChar)
  412 + next_exchar
  413 + (
  414 + Stream s
  415 + ) =
  416 + if read_byte(s) is
  417 + {
  418 + failure then failure,
  419 + success(c) then
  420 + if c = '\'
  421 + then if read_byte(s) is
  422 + {
  423 + failure then failure,
  424 + success(d) then
  425 + if d = 'n' then success(char('\n')) else
  426 + if d = 'r' then success(char('\r')) else
  427 + if d = 't' then success(char('\t')) else
  428 + success(char(d))
  429 + }
  430 + else if c = '(' then success(left_par)
  431 + else if c = ')' then success(right_par)
  432 + else if c = '[' then success(left_bracket)
  433 + else if c = ']' then success(right_bracket)
  434 + else if c = '|' then success(or)
  435 + else if c = '*' then success(star)
  436 + else if c = '+' then success(plus)
  437 + else if c = '.' then success(dot)
  438 + else if c = '$' then success(dollar)
  439 + else if c = '^' then success(caret)
  440 + else if c = '-' then success(hyphen)
  441 + else if c = '?' then success(question_mark)
  442 + else success(char(c))
  443 + }.
  444 +
  445 +
  446 +
  447 +
  448 +
  449 +
  450 + *** [1.5] Tools.
  451 +
  452 + *** [1.5.1] Truncating a Word32 to a Word8.
  453 +
  454 +define Word8
  455 + truncate_to_Word8
  456 + (
  457 + Word32 x
  458 + ) =
  459 + if x is word32(l1,_) then if l1 is word16(l2,_) then l2.
  460 +
  461 +
  462 +
  463 + *** [1.5.2] Creating a range of consecutive characters.
  464 +
  465 + Given a first character and a last character, create the list of all characters between
  466 + these two (inclusive).
  467 +
  468 +define List(Word8)
  469 + range
  470 + (
  471 + Word8 a,
  472 + Word8 z
  473 + ) =
  474 + if z = a then [a] else [a . range(a+1,z)].
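
 For example, range('a','d') yields the list ['a','b','c','d'].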
  475 +
  476 +
  477 +
  478 +
  479 + *** [1.5.3] Computing the complement of a set of characters.
  480 +
  481 + Compute the 'complement' of a choice, i.e. the list of all characters which do not
  482 + belong to the given choice.
  483 +
  484 +define List(Word8)
  485 + complement_choice
  486 + (
  487 + List(Word8) l,
  488 + List(Word8) result,
  489 + Word32 n
  490 + ) =
  491 + if n = -1 then result else
  492 + with c = truncate_to_Word8(n),
  493 + if member(l,c)
  494 + then complement_choice(l,result,n-1)
  495 + else complement_choice(l,[c . result],n-1).
  496 +
  497 +
  498 +
  499 +
  500 +
  501 + *** [1.5.4] Concatenating a list of regular expressions (in reverse order).
  502 +
  503 + Concatenate a (non-empty) list of RegExpr in reverse order:
  504 +
  505 +define RegExpr
  506 + cat_list
  507 + (
  508 + RegExpr last,
  509 + List(RegExpr) others
  510 + ) =
  511 + if others is
  512 + {
  513 + [ ] then last,
  514 + [h . t] then cat(cat_list(h,t),last)
  515 + }.
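
 For example, if the expression 'abc' has been read, the accumulated list is
 [char('c'),char('b'),char('a')], and

 cat_list(char('c'),[char('b'),char('a')])

 returns cat(cat(char('a'),char('b')),char('c')), i.e. the concatenation in the
 original order.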
  516 +
  517 +
  518 +
  519 +
  520 + *** [1.5.5] Reading a 'choice' of characters.
  521 +
  522 + Reading a 'choice', i.e. the characters within square brackets.
  523 +
  524 +define Result(RegExprError,List(Word8))
  525 + read_choice
  526 + (
  527 + Stream s,
  528 + List(Word8) already_read
  529 + ) =
  530 + if next_exchar(s) is
  531 + {
  532 + failure then error(premature_end_of_regexpr),
  533 + success(x) then
  534 + if x is right_bracket then ok(already_read) else
  535 + if x is char(c) then read_choice(s,[c . already_read]) else
  536 + if x is hyphen then
  537 + if already_read is
  538 + {
  539 + [ ] then error(misplaced_hyphen),
  540 + [a . others] then
  541 + if next_exchar(s) is
  542 + {
  543 + failure then error(premature_end_of_regexpr),
  544 + success(y) then
  545 + if y is char(z)
  546 + then read_choice(s,reverse_append(range(a,z),others))
  547 + else error(non_character_within_brackets)
  548 + }
  549 + }
  550 + else error(non_character_within_brackets)
  551 + }.
  552 +
  553 +
  554 +
  555 +
  556 +
  557 + *** [1.5.6] Reading a complemented 'choice' of characters.
  558 +
  559 + The same as the previous one, but giving the complement of the 'choice'.
  560 +
  561 +define Result(RegExprError,List(Word8))
  562 + read_counter_choice
  563 + (
  564 + Stream s,
  565 + List(Word8) already_read
  566 + ) =
  567 + if read_choice(s,already_read) is
  568 + {
  569 + error(msg) then error(msg),
  570 + ok(l) then ok(complement_choice(l,[],255))
  571 + }.
  572 +
  573 +
  574 +
  575 +
  576 + *** [1.5.7] Reading a 'choice' (general case).
  577 +
  578 + The following function is called when a left bracket has been read. It reads extended
  579 + characters until the right bracket is found.
  580 +
  581 +define Result(RegExprError,List(Word8))
  582 + read_within_brackets
  583 + (
  584 + Stream s
  585 + ) =
  586 + if next_exchar(s) is
  587 + {
  588 + failure then error(premature_end_of_regexpr),
  589 + success(x) then
  590 + if x = caret
  591 + then read_counter_choice(s,[])
  592 + else if x is char(c) then read_choice(s,[c])
  593 + else error(non_character_within_brackets)
  594 + }.
  595 +
  596 +
  597 +
  598 +
  599 +
  600 +
  601 + *** [1.6] Reading a regular expression.
  602 +
  603 +
  604 +
  605 + *** [1.6.1] Right delimiters.
  606 +
  607 +type RightDelimiter:
  608 + right_par,
  609 + end_of_regexpr.
  610 +
  611 +
  612 +
  613 +
  614 + *** [1.6.2] Recursive reading.
  615 +
  616 +define Result(RegExprError,RegExpr)
  617 + read_regexpr
  618 + (
  619 + Stream s,
  620 + List(RegExpr) already_read,
  621 + RightDelimiter delim
  622 + ) =
  623 + if next_exchar(s) is
  624 + {
  625 + failure then
  626 + if delim is
  627 + {
  628 + right_par then
  629 + error(premature_end_of_regexpr),
  630 +
  631 + end_of_regexpr then
  632 + if already_read is
  633 + {
  634 + [ ] then error(regexpr_is_empty),
  635 + [last . others] then
  636 + ok(cat_list(last,others))
  637 + }
  638 + },
  639 +
  640 + success(ec) then
  641 + if ec is
  642 + {
  643 + left_par then
  644 + if read_regexpr(s,[],right_par) is
  645 + {
  646 + error(msg) then
  647 + error(msg),
  648 +
  649 + ok(r1) then
  650 + read_regexpr(s,[r1 . already_read],delim)
  651 + },
  652 +
  653 + right_par then
  654 + if delim is
  655 + {
  656 + right_par then
  657 + if already_read is
  658 + {
  659 + [ ] then
  660 + error(unexpected_right_par),
  661 +
  662 + [last . others] then
  663 + ok(cat_list(last,others))
  664 + },
  665 +
  666 + end_of_regexpr then
  667 + error(unexpected_right_par)
  668 + },
  669 +
  670 + left_bracket then
  671 + if read_within_brackets(s) is
  672 + {
  673 + error(msg) then error(msg),
  674 +
  675 + ok(r1) then if already_read is
  676 + {
  677 + [ ] then
  678 + read_regexpr(s,[choice(r1)],delim),
  679 +
  680 + [last . others] then
  681 + read_regexpr(s,[choice(r1),last . others],delim)
  682 + }
  683 + },
  684 +
  685 + right_bracket then
  686 + error(unexpected_right_bracket),
  687 +
  688 + star then
  689 + if already_read is
  690 + {
  691 + [ ] then
  692 + error(star_not_following_a_regexpr),
  693 +
  694 + [last . others] then
  695 + read_regexpr(s,[star(last) . others],delim)
  696 + },
  697 +
  698 + plus then
  699 + if already_read is
  700 + {
  701 + [ ] then
  702 + error(plus_not_following_a_regexpr),
  703 +
  704 + [last . others] then
  705 + read_regexpr(s,[plus(last) . others],delim)
  706 + },
  707 +
  708 + or then
  709 + if read_regexpr(s,[],delim) is
  710 + {
  711 + error(msg) then error(msg),
  712 +
  713 + ok(r1) then
  714 + if already_read is
  715 + {
  716 + [ ] then error(unexpected_vbar),
  717 + [h . t] then
  718 + ok(or(cat_list(h,t),r1))
  719 + }
  720 + },
  721 +
  722 + dot then
  723 + read_regexpr(s,[dot . already_read], delim),
  724 +
  725 + dollar then
  726 + read_regexpr(s,[end_of_line . already_read], delim),
  727 +
  728 + caret then
  729 + read_regexpr(s,[beginning_of_line . already_read], delim),
  730 +
  731 + hyphen then
  732 + error(misplaced_hyphen),
  733 +
  734 + question_mark then
  735 + if already_read is
  736 + {
  737 + [ ] then
  738 + error(question_mark_not_following_a_regexpr),
  739 +
  740 + [last . others] then
  741 + read_regexpr(s,[question_mark(last) . others],delim)
  742 + },
  743 +
  744 + char(c) then
  745 + read_regexpr(s,[char(c) . already_read], delim)
  746 + }
  747 + }.
  748 +
  749 +
  750 +
  751 +
  752 + *** [1.6.3] Normalizing a regular expression.
  753 +
  754 + This amounts to adding '(^)?' at the beginning of every regular expression not
  755 + beginning with '^', and '($)?' at the end of every regular expression not ending with '$'.
  756 +
  757 +define Bool
  758 + begins_by_bol
  759 + (
  760 + RegExpr re
  761 + ) =
  762 + if re is
  763 + {
  764 + char(Word8 _0) then false,
  765 + choice(List(Word8) _0) then false,
  766 + plus(RegExpr _0) then false,
  767 + star(RegExpr _0) then false,
  768 + cat(RegExpr _0,RegExpr _1) then begins_by_bol(_0),
  769 + or(RegExpr _0,RegExpr _1) then false,
  770 + beginning_of_line then true,
  771 + end_of_line then false,
  772 + dot then false,
  773 + question_mark(RegExpr _0) then false
  774 + }.
  775 +
  776 +define Bool
  777 + ends_by_eol
  778 + (
  779 + RegExpr re
  780 + ) =
  781 + if re is
  782 + {
  783 + char(Word8 _0) then false,
  784 + choice(List(Word8) _0) then false,
  785 + plus(RegExpr _0) then false,
  786 + star(RegExpr _0) then false,
  787 + cat(RegExpr _0,RegExpr _1) then ends_by_eol(_1),
  788 + or(RegExpr _0,RegExpr _1) then false,
  789 + beginning_of_line then false,
  790 + end_of_line then true,
  791 + dot then false,
  792 + question_mark(RegExpr _0) then false
  793 + }.
  794 +
  795 +
  796 +define RegExpr
  797 + normalize
  798 + (
  799 + RegExpr re
  800 + ) =
  801 + with re1 = if begins_by_bol(re) then re else cat(question_mark(beginning_of_line),re),
  802 + if ends_by_eol(re1) then re1 else cat(re1,question_mark(end_of_line)).
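
 For example, normalizing cat(char('a'),char('b')) (the regular expression 'ab')
 yields

 cat(cat(question_mark(beginning_of_line),cat(char('a'),char('b'))),
     question_mark(end_of_line))

 since 'ab' neither begins with '^' nor ends with '$'.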
  803 +
  804 +
  805 +
  806 +
  807 + *** [1.6.4] The tool for parsing regular expressions.
  808 +
  809 +define Result(RegExprError,RegExpr)
  810 + parse_regular_expression
  811 + (
  812 + Stream s
  813 + ) =
  814 + if read_regexpr(s,[],end_of_regexpr) is
  815 + {
  816 + error(msg) then error(msg),
  817 + ok(re) then ok(normalize(re))
  818 + }.
  819 +
  820 +
  821 +
  822 +
  823 +
  824 + *** [1.7] Transforming a regular expression into a basic one.
  825 +
  826 + *** [1.7.1] Expanding a 'choice' of characters.
  827 +
  828 + Given a list of characters (a 'choice sequence'), compute the corresponding basic regular
  829 + expression.
  830 +
  831 +define BasicRegExpr($Token)
  832 + expand_choice
  833 + (
  834 + List(Word8) l
  835 + ) =
  836 + if l is
  837 + {
  838 + [ ] then epsilon,
  839 + [h . t] then
  840 + if t is [ ] then char(h) else
  841 + or(char(h),expand_choice(t))
  842 + }.
  843 +
  844 +
  845 +
  846 + *** [1.7.2] The tool for converting to basic.
  847 +
  848 + Convert a regular expression to a basic one.
  849 +
  850 +public define BasicRegExpr($Token)
  851 + to_basic
  852 + (
  853 + RegExpr r
  854 + ) =
  855 + if r is
  856 + {
  857 + char(c) then char(c),
  858 + choice(l) then expand_choice(l),
  859 + plus(r1) then with br = to_basic(r1), cat(br,star(br)),
  860 + star(r1) then star(to_basic(r1)),
  861 + cat(r1,r2) then cat(to_basic(r1),to_basic(r2)),
  862 + or(r1,r2) then or(to_basic(r1),to_basic(r2)),
  863 + beginning_of_line then beginning_of_line,
  864 + end_of_line then end_of_line,
  865 + dot then expand_choice(reverse_append(range(0,'\n'-1),
  866 + range('\n'+1,255))),
  867 + question_mark(r1) then or(epsilon,to_basic(r1))
  868 + }.
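
 For example, to_basic(plus(char('a'))) returns cat(char('a'),star(char('a'))), and
 to_basic(question_mark(char('a'))) returns or(epsilon,char('a')).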
  869 +
  870 +
  871 +
  872 +
  873 + *** [1.8] Formatting error messages into English.
  874 +
  875 +public define String
  876 + to_English
  877 + (
  878 + RegExprError e
  879 + ) =
  880 + if e is
  881 + {
  882 + premature_end_of_regexpr then "Premature end of regular expression.",
  883 + unexpected_right_par then "Unexpected right parenthesis.",
  884 + unexpected_right_bracket then "Unexpected right bracket.",
  885 + regexpr_is_empty then "Regular expression is empty.",
  886 + star_not_following_a_regexpr then "Found '*' not following any regular expression.",
  887 + plus_not_following_a_regexpr then "Found '+' not following any regular expression.",
  888 + question_mark_not_following_a_regexpr then "Found '?' not following any regular expression.",
  889 + non_character_within_brackets then "Non character within brackets.",
  890 + misplaced_hyphen then "Misplaced hyphen.",
  891 + unexpected_vbar then "Misplaced vertical bar.",
  892 + empty_lexer_description then "Empty lexer description."
  893 + }.
  894 +
  895 +
  896 +
  897 +
  898 +
  899 +
  900 +
  901 +
  902 + *** [2] Lexing streams.
  903 +
  904 + *** [2.1] The type 'LexingStream'.
  905 +
  906 + A lexing stream provides the ad hoc tools for using the low level fast lexers as
  907 + defined in section 13 of predefined.anubis:
  908 +
  909 + - a variable 'buffer_v' containing the current buffer,
  910 + - a variable 'start_v' giving the starting position of the current lexeme within the buffer,
  911 + - a variable 'last_accept_v' giving the last accepting position (if any),
  912 + - a variable 'current_v' giving the current position of reading within the buffer,
  913 + - a function 'reload_buffer' for loading new bytes from the input.
  914 +
  915 +
  916 +public type LexingStream:
  917 + lexing_stream
  918 + (
  919 + Var(ByteArray) buffer_v, // the current buffer
  920 + Var(Int) start_v, // start of lexeme in buffer
  921 + Var(FastLexerLastAccepted) last_accept_v, // last accepting position (if any)
  922 + Var(Int) current_v, // position of reading in buffer
  923 + Int -> Maybe(One) reload_buffer // command for loading the sequel in the buffer
  924 + ).
  925 +
  926 + While we are reading a lexeme, we keep the starting position (offset of first character
  927 + of the current lexeme) in 'start_v' so as to be able to extract the lexeme. We also
  928 + keep the last position at which a lexeme was accepted. This is because the lexer always
  929 + tries to read the longest possible lexeme. If at some point the lexeme is rejected,
  930 + and if there is a last accepting position, the current position goes back to this last
  931 + accepting position, and the lexeme is accepted.
  932 +
  933 + 'reload_buffer' works as follows. It returns 'failure' if there is nothing more to be
  934 + read from the actual input (the connection is down, the end of the file has been
  935 + reached, or time is out). In this case, the current buffer is unchanged.
  936 +
  937 + Otherwise, it reads a chunk of bytes (say V) from the actual input, extracts the
  938 + part of the current buffer starting at the argument (say U), and establishes U+V as
  939 + the new current buffer. The other variables are updated accordingly.
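 This reload logic can be sketched in Python (an illustrative model only;
 'read_more' is a hypothetical stand-in for whatever reads the next chunk from
 the actual input, and the Anubis variables become slots of a dictionary):

```python
def reload_buffer(state, read_more):
    """Drop the consumed bytes and append a fresh chunk (the U+V of the text).

    'state' holds "buffer", "start", "current" and "last_accept" slots.
    Returns False (and leaves the buffer unchanged) when nothing more can
    be read; otherwise updates all positions and returns True.
    """
    chunk = read_more()                 # V: the next bytes from the input
    if chunk is None:                   # nothing more to read
        return False
    dropped = state["start"]            # bytes before the current lexeme
    state["buffer"] = state["buffer"][dropped:] + chunk   # U + V
    state["start"] = 0                  # all positions shift left by 'dropped'
    state["current"] -= dropped
    if state["last_accept"] is not None:
        state["last_accept"] -= dropped
    return True
```

 Note how 'start', 'current' and the last accepting position are all shifted by
 the number of dropped bytes, exactly as in the 'make_lexing_stream' definitions
 below.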
  940 +
  941 +
  942 +
  943 +
  944 + *** [2.2] Constructing lexing streams.
  945 +
  946 + *** [2.2.1] From a byte array.
  947 +
  948 +public define LexingStream
  949 + make_lexing_stream
  950 + (
  951 + ByteArray b
  952 + ) =
  953 + lexing_stream(var(b), // buffer
  954 + var(0), // starting position
  955 + var(none), // last accepting position
  956 + var(0), // current position
  957 + (Int u) |-> failure). // buffer cannot be reloaded
  958 +
  959 +
  960 +
  961 +
  962 + *** [2.2.2] From a string.
  963 +
  964 +public define LexingStream
  965 + make_lexing_stream
  966 + (
  967 + String s
  968 + ) =
  969 + make_lexing_stream(to_byte_array(s)).
  970 +
  971 +
  972 +
  973 +
  974 + *** [2.2.3] From a read only stream.
  975 +
  976 +public define Maybe(LexingStream)
  977 + make_lexing_stream
  978 + (
  979 + RStream stream,
  980 + Int buffer_size,
  981 + Int timeout
  982 + ) =
  983 + if read(stream,buffer_size,timeout) is
  984 + {
  985 + error then failure,
  986 + timeout then failure,
  987 + ok(buffer) then
  988 + with buffer_v = var(buffer),
  989 + start_v = var((Int)0),
  990 + last_accepted_v = var((FastLexerLastAccepted)none),
  991 + current_v = var((Int)0),
  992 + reload_buffer = (Int i) |->
  993 + if read(stream,buffer_size,timeout) is
  994 + {
  995 + error then failure,
  996 + timeout then failure,
  997 + ok(more) then
  998 + //print("Buffer reloaded ("+abs_to_decimal(length(more))+" bytes).\n");
  999 + if length(more) = 0
  1000 + then (with old_buffer = *buffer_v,
  1001 + old_length = length(old_buffer),
  1002 + dropped = *start_v, // number of bytes dropped from old buffer
  1003 + buffer_v <- extract(old_buffer,dropped,old_length);
  1004 + start_v <- 0;
  1005 + current_v <- *current_v - dropped;
  1006 + last_accepted_v <-
  1007 + if *last_accepted_v is
  1008 + {
  1009 + none then none,
  1010 + last(s,a) then last(s,a - dropped)
  1011 + };
  1012 + failure)
  1013 + else (with old_buffer = *buffer_v,
  1014 + old_length = length(old_buffer),
  1015 + dropped = *start_v, // number of bytes dropped from old buffer
  1016 + buffer_v <- extract(old_buffer,dropped,old_length)+more;
  1017 + start_v <- 0;
  1018 + current_v <- *current_v - dropped;
  1019 + last_accepted_v <-
  1020 + if *last_accepted_v is
  1021 + {
  1022 + none then none,
  1023 + last(s,a) then last(s,a - dropped)
  1024 + };
  1025 + success(unique))
  1026 + },
  1027 + success(lexing_stream(buffer_v,
  1028 + start_v,
  1029 + last_accepted_v,
  1030 + current_v,
  1031 + reload_buffer))
  1032 + }.
  1033 +
  1034 +
  1035 +
  1036 +
  1037 + *** [2.2.4] From a read/write stream.
  1038 +
  1039 +public define Maybe(LexingStream)
  1040 + make_lexing_stream
  1041 + (
  1042 + RWStream stream,
  1043 + Int buffer_size,
  1044 + Int timeout
  1045 + ) =
  1046 + make_lexing_stream(weaken(stream),buffer_size,timeout).
  1047 +
  1048 +
  1049 +
  1050 +
  1051 + *** [2.2.5] From an SSL connection.
  1052 +
  1053 +public define Maybe(LexingStream)
  1054 + make_lexing_stream
  1055 + (
  1056 + SSL_Connection stream,
  1057 + Int buffer_size,
  1058 + Int timeout
  1059 + ) =
  1060 + if (Maybe(ByteArray))read(stream,buffer_size,timeout) is
  1061 + {
  1062 + failure then failure,
  1063 + success(buffer) then
  1064 + with buffer_v = var(buffer),
  1065 + start_v = var((Int)0),
  1066 + last_accepted_v = var((FastLexerLastAccepted)none),
  1067 + current_v = var((Int)0),
  1068 + reload_buffer = (Int i) |->
  1069 + if (Maybe(ByteArray))read(stream,buffer_size,timeout) is
  1070 + {
  1071 + failure then failure,
  1072 + success(more) then
  1073 + if length(more) = 0
  1074 + then failure
  1075 + else with old_buffer = *buffer_v,
  1076 + old_length = length(old_buffer),
  1077 + dropped = *start_v, // number of bytes dropped from old buffer
  1078 + buffer_v <- extract(old_buffer,dropped,old_length)+more;
  1079 + start_v <- 0;
  1080 + current_v <- *current_v - dropped;
  1081 + last_accepted_v <-
  1082 + if *last_accepted_v is
  1083 + {
  1084 + none then none,
  1085 + last(s,a) then last(s,a - dropped)
  1086 + };
  1087 + success(unique)
  1088 + },
  1089 + success(lexing_stream(buffer_v,
  1090 + start_v,
  1091 + last_accepted_v,
  1092 + current_v,
  1093 + reload_buffer))
  1094 + }.
  1095 +
  1096 +
  1097 +
  1098 +
  1099 +
  1100 +
  1101 + *** [3] Constructing the automaton.
  1102 +
  1103 + The description of a lexer is given as a list of 'LexerItem($Token)', where the
  1104 + parameter '$Token' represents the type of tokens. Each lexer item is made of a regular
  1105 + expression and an action. If the action is 'failure', the token just read is ignored
  1106 + and the lexer tries to read the next one. Otherwise, the action is applied to the
  1107 + lexeme just read, and the result of the action is returned by the lexer. The type
  1108 + 'LexerOutput($Token)' is defined in 'regexpr_parser.anubis'.
  1109 +
  1110 +
  1111 + A DFA is presented as a list of states. Each state is either accepting or
  1112 + rejecting. Each state has a name (of type Word32), and a list of transitions. Accepting
  1113 + states also have the corresponding 'action'.
  1114 +
  1115 + Each transition has a 'label' and the name of a state (the target state for this
  1116 + transition). Labels are of the following sorts:
  1117 +
  1118 +public type DFA_label:
  1119 + char(Word8),
  1120 + beginning_of_line,
  1121 + end_of_line.
  1122 +
  1123 +public type DFA_transition:
  1124 + transition(DFA_label label,
  1125 + Word32 target_name).
  1126 +
  1127 +public type DFA_state($Token):
  1128 + rejecting (Word32 name,
  1129 + List(DFA_transition) transitions),
  1130 +
  1131 + accepting (Word32 name,
  1132 + List(DFA_transition) transitions,
  1133 + Maybe(ByteArray -> LexerOutput($Token)) action).
  1134 +
  1135 +
  1136 +
  1137 + Now, here is the tool for making the DFA. The type 'RegExprError' is defined in
  1138 + 'regexpr_parser.anubis'.
  1139 +
  1140 +public define Result(RegExprError,List(DFA_state($Token)))
  1141 + make_DFA
  1142 + (
  1143 + List(LexerItem($Token)) lexer_description
  1144 + ).
  1145 +
  1146 +
  1147 +
  1148 + *** [3.1] Pre-labels.
  1149 +
  1150 + These are the labels before the renaming of the DFA.
  1151 +
  1152 + 'beginning_of_line' and 'end_of_line' are also treated as special characters, even if
  1153 + they cannot be present as such in the input. The fast lexer detects their presence
  1154 + based on the neighbourhood of the character '\n', and uses special transitions in that
  1155 + case.
  1156 +
  1157 + On the contrary, 'actions' cannot be considered as matching anything in the
  1158 + input. However, in a given state, an action may be present among the transitions,
  1159 + meaning that in this state, if no transition may be followed, the action must be
  1160 + chosen instead.
  1161 +
  1162 +
  1163 +public type DFA_pre_label($Token):
  1164 + char(Word8),
  1165 + beginning_of_line,
  1166 + end_of_line,
  1167 + action(Maybe(ByteArray -> LexerOutput($Token))).
  1168 +
  1169 +
  1170 +
  1171 +
  1172 + *** [3.2] Decorating basic regular expressions.
  1173 +
  1174 + Given a basic regular expression, we associate a unique integer to each of its leaves
  1175 + (when seen as a tree), which are either a character, a beginning of line or an end of
  1176 + line. Such an integer is called a 'position'.
  1177 +
  1178 + Furthermore, we add three decorations to each basic regular
  1179 + expression:
  1180 +
  1181 + - a flag 'nullable', which, when true, means that the regular expression may match
  1182 + the empty string,
  1183 +
  1184 + - a list of integers, representing all positions which may correspond to the first
  1185 + character of a matching string,
  1186 +
  1187 + - a list of integers, representing all positions which may correspond to the last
  1188 + character in a matching string.
  1189 +
  1190 + Actually, these two lists are lists of pairs (Word32,Label), where
  1191 + the label corresponds to the position.
  1192 +
  1193 +type DecoratedBasicRegExpr($Token):
  1194 + char (Word8,
  1195 + Word32 pos,
  1196 + Bool nullable,
  1197 + List((Word32,DFA_pre_label($Token))) firstpos,
  1198 + List((Word32,DFA_pre_label($Token))) lastpos),
  1199 +
  1200 + bol (Word32 pos,
  1201 + Bool nullable,
  1202 + List((Word32,DFA_pre_label($Token))) firstpos,
  1203 + List((Word32,DFA_pre_label($Token))) lastpos),
  1204 +
  1205 + eol (Word32 pos,
  1206 + Bool nullable,
  1207 + List((Word32,DFA_pre_label($Token))) firstpos,
  1208 + List((Word32,DFA_pre_label($Token))) lastpos),
  1209 +
  1210 + epsilon (Bool nullable,
  1211 + List((Word32,DFA_pre_label($Token))) firstpos,
  1212 + List((Word32,DFA_pre_label($Token))) lastpos),
  1213 +
  1214 + or (DecoratedBasicRegExpr($Token),DecoratedBasicRegExpr($Token),
  1215 + Bool nullable,
  1216 + List((Word32,DFA_pre_label($Token))) firstpos,
  1217 + List((Word32,DFA_pre_label($Token))) lastpos),
  1218 +
  1219 + cat (DecoratedBasicRegExpr($Token),DecoratedBasicRegExpr($Token),
  1220 + Bool nullable,
  1221 + List((Word32,DFA_pre_label($Token))) firstpos,
  1222 + List((Word32,DFA_pre_label($Token))) lastpos),
  1223 +
  1224 + star (DecoratedBasicRegExpr($Token),
  1225 + Bool nullable,
  1226 + List((Word32,DFA_pre_label($Token))) firstpos,
  1227 + List((Word32,DFA_pre_label($Token))) lastpos),
  1228 +
  1229 + action (Maybe(ByteArray -> LexerOutput($Token)),
  1230 + Word32 pos,
  1231 + Bool nullable,
  1232 + List((Word32,DFA_pre_label($Token))) firstpos,
  1233 + List((Word32,DFA_pre_label($Token))) lastpos).
  1234 +
  1235 +
  1236 +
  1237 + The following function adds positions and decorations to a regular expression. Since we
  1238 + have to generate position names, we give the first position to be used, and the
  1239 + function returns the regular expression (with positions and decorations) and the next
  1240 + position free for further use. The computation is simply recursive (there is no 'graph
  1241 + walk' to do, only a 'tree walk').
  1242 +
  1243 +
  1244 +define (DecoratedBasicRegExpr($Token),Word32)
  1245 + decorate
  1246 + (
  1247 + BasicRegExpr($Token) r,
  1248 + Word32 n
  1249 + ) =
  1250 + if r is
  1251 + {
  1252 + char(c) then
  1253 + (char(c,n,false,[(n,char(c))],[(n,char(c))]), n+1),
  1254 +
  1255 + star(r1) then
  1256 + if decorate(r1,n) is (rp1,m) then
  1257 + (star(rp1,
  1258 + true,
  1259 + firstpos(rp1),
  1260 + lastpos(rp1)),m),
  1261 +
  1262 + or(r1,r2) then
  1263 + if decorate(r1,n) is (rp1,m) then
  1264 + if decorate(r2,m) is (rp2,l) then
  1265 + (or(rp1,rp2,
  1266 + if nullable(rp1) then true else nullable(rp2),
  1267 + append(firstpos(rp1),firstpos(rp2)),
  1268 + append(lastpos(rp1),lastpos(rp2))),l),
  1269 +
  1270 + cat(r1,r2) then
  1271 + if decorate(r1,n) is (rp1,m) then
  1272 + if decorate(r2,m) is (rp2,l) then
  1273 + (cat(rp1,rp2,
  1274 + if nullable(rp1) then nullable(rp2) else false,
  1275 + if nullable(rp1) then append(firstpos(rp1),firstpos(rp2)) else firstpos(rp1),
  1276 + if nullable(rp2) then append(lastpos(rp1),lastpos(rp2)) else lastpos(rp2)),l),
  1277 +
  1278 + epsilon then
  1279 + (epsilon(true,[],[]),n),
  1280 +
  1281 + beginning_of_line then
  1282 + (bol(n,false,[(n,beginning_of_line)],[(n,beginning_of_line)]),n+1),
  1283 +
  1284 + end_of_line then
  1285 + (eol(n,false,[(n,end_of_line)],[(n,end_of_line)]),n+1),
  1286 +
  1287 + action(a) then
  1288 + (action(a,n,false,[(n,action(a))],[(n,action(a))]),n+1)
  1289 + }.
  1290 +
  1291 +
  1292 + Notice that the 'firstpos' and 'lastpos' fields in decorated regular expressions are
  1293 + always increasingly ordered lists of distinct integers (when ignoring labels), as may
  1294 + be easily verified by induction from the previous definition. Hint: when we write
  1295 +
  1296 + if decorate(r1,n) is (rp1,m)
  1297 +
  1298 + any position i in rp1 is such that n =< i < m.
  1299 +
  1300 +
  1301 +
  1302 + *** [3.3] Computing the follow table.
  1303 +
  1304 +
  1305 + A 'follow table' tells us which positions can follow a given position (when scanning a
  1306 + string). It also gives the label attached to a position. Its type is:
  1307 +
  1308 +type FollowTable($Token):
  1309 + empty,
  1310 + follow_table(Word32, // position
  1311 + DFA_pre_label($Token), // label
  1312 + List(Word32), // following positions
  1313 + FollowTable($Token) next).
  1314 +
  1315 +
  1316 + Our lists of Word32s will have to remain increasingly sorted (for the purpose of
  1317 + comparison).
  1318 +
  1319 + The following function merges two lists sorted in increasing order, so that the result
  1320 + is still increasingly sorted.
  1321 +
  1322 +define List(Word32)
  1323 + merge_sorted
  1324 + (
  1325 + List(Word32) l1,
  1326 + List(Word32) l2
  1327 + ) =
  1328 + if l1 is
  1329 + {
  1330 + [ ] then l2,
  1331 + [h1 . t1] then
  1332 + if l2 is
  1333 + {
  1334 + [ ] then l1,
  1335 + [h2 . t2] then
  1336 + if h1 = h2 // avoid duplications
  1337 + then [h1 . merge_sorted(t1,t2)]
  1338 + else if h1 -< h2
  1339 + then [h1 . merge_sorted(t1,l2)]
  1340 + else [h2 . merge_sorted(l1,t2)]
  1341 + }
  1342 + }.
  1343 +
  1344 +
  1345 + 'heads' takes a list of pairs, and returns the list of all heads of these pairs. Note
  1346 + that if we apply 'heads' to either a 'firstpos' or a 'lastpos' datum, we get a list of
  1347 + increasingly ordered distinct integers.
  1348 +
  1349 +define List($T)
  1350 + heads
  1351 + (
  1352 + List(($T,$U)) l
  1353 + ) =
  1354 + if l is
  1355 + {
  1356 + [ ] then [ ],
  1357 + [h . t] then if h is (u,v) then
  1358 + [u . heads(t)]
  1359 + }.
  1360 +
  1361 +
  1362 +
  1363 + Adding entries to a follow table. Given:
  1364 +
  1365 + - a list of keys (e1,...,ek) of type (Word32,DFA_pre_label($Token)),
  1366 + - a list of values (t1,...,tn) of type (Word32,DFA_pre_label($Token)),
  1367 + - an A-list of triplets of type (Word32,DFA_pre_label($Token),List(Word32)),
  1368 +
  1369 + update that A-list, adding the keys e1,...,ek if they are not already in the A-list, and
  1370 + putting the head of each ti as a value for each ej. The third element of each triplet (a
  1371 + list of integers) should always remain increasingly sorted, and have distinct elements.
  1372 +
  1373 + First, assume there is only one key (and its label) to add:
  1374 +
  1375 +
  1376 +define FollowTable($Token)
  1377 + add_follow_entry
  1378 + (
  1379 + Word32 key,
  1380 + DFA_pre_label($Token) c,
  1381 + List((Word32,DFA_pre_label($Token))) values,
  1382 + FollowTable($Token) previous
  1383 + ) =
  1384 + if previous is
  1385 + {
  1386 + empty then follow_table(key,c,heads(values),empty),
  1387 + follow_table(k1,c1,v1,t) then
  1388 + if key = k1
  1389 + then follow_table(k1,c1,merge_sorted(heads(values),v1),t)
  1390 + else follow_table(k1,c1,v1,add_follow_entry(key,c,values,t))
  1391 + }.
  1392 +
  1393 +
  1394 + Now, add several keys.
  1395 +
  1396 +define FollowTable($Token)
  1397 + add_follow_entries
  1398 + (
  1399 + List((Word32,DFA_pre_label($Token))) keys,
  1400 + List((Word32,DFA_pre_label($Token))) values,
  1401 + FollowTable($Token) previous
  1402 + ) =
  1403 + if keys is
  1404 + {
  1405 + [ ] then previous,
  1406 + [k1 . ks] then
  1407 + if k1 is (k,c) then
  1408 + add_follow_entries(ks,values,add_follow_entry(k,c,values,previous))
  1409 + }.
  1410 +
  1411 + Appending two follow tables (it is assumed that they have no key in common).
  1412 +
  1413 +define FollowTable($Token)
  1414 + append
  1415 + (
  1416 + FollowTable($Token) t1,
  1417 + FollowTable($Token) t2
  1418 + ) =
  1419 + if t1 is
  1420 + {
  1421 + empty then t2,
  1422 + follow_table(p,l,n,tail1) then follow_table(p,l,n,append(tail1,t2))
  1423 + }.
  1424 +
  1425 +
  1426 + Making the follow_table from a decorated basic regular expression.
  1427 +
  1428 +define FollowTable($Token)
  1429 + make_follow_table
  1430 + (
  1431 + DecoratedBasicRegExpr($Token) r
  1432 + ) =
  1433 + if r is
  1434 + {
  1435 + char(c,n,nb,fp,lp) then follow_table(n,char(c),[],empty),
  1436 + bol(n,nb,fp,lp) then follow_table(n,beginning_of_line,[],empty),
  1437 + eol(n,nb,fp,lp) then follow_table(n,end_of_line,[],empty),
  1438 + epsilon(nb,fp,lp) then empty,
  1439 + or(r1,r2,nb,fp,lp) then append(make_follow_table(r1),make_follow_table(r2)),
  1440 + /* we can use append because r1 and r2 cannot share a
  1441 + key. */
  1442 +
  1443 + cat(r1,r2,nb,fp,lp) then
  1444 + with t = append(make_follow_table(r1),make_follow_table(r2)),
  1445 + /* same remark on append */
  1446 + l1 = lastpos(r1),
  1447 + f2 = firstpos(r2),
  1448 + add_follow_entries(l1,f2,t),
  1449 +
  1450 + star(r1,nb,fp,lp) then
  1451 + with t = make_follow_table(r1),
  1452 + f = firstpos(r1),
  1453 + l = lastpos(r1),
  1454 + add_follow_entries(l,f,t),
  1455 +
  1456 + action(a,n,nb,fb,lp) then follow_table(n,action(a),[],empty)
  1457 + }.
  1458 +
  1459 +
  1460 +
  1461 +
  1462 +
  1463 + Finding an entry in a follow table.
  1464 +
  1465 +define (Word32,DFA_pre_label($Token),List(Word32))
  1466 + follow_table_entry
  1467 + (
  1468 + Word32 p,
  1469 + FollowTable($Token) l
  1470 + ) =
  1471 + if l is
  1472 + {
  1473 + empty then alert, // we should always find it
  1474 + follow_table(n,c,pos,t) then
  1475 + if p = n
  1476 + then (n,c,pos)
  1477 + else follow_table_entry(p,t)
  1478 + }.
  1479 +
  1480 +
  1481 +
  1482 +
  1483 +
  1484 +
  1485 +
  1486 +
  1487 +
  1488 +
  1489 + Names of states in the DFA are initially increasingly sorted lists of Word32s. They are
  1490 + transformed into Word32s when the DFA is renamed (see below). A transition is just a
  1491 + pair made of a label and a state name.
  1492 +
  1493 +type DFA_pre_transition($Token):
  1494 + transition(DFA_pre_label($Token) label,
  1495 + List(Word32) target_name).
  1496 +
  1497 +
  1498 + A state is made of a state name and a list of transitions.
  1499 +
  1500 +type DFA_pre_state($Token):
  1501 + state(List(Word32) name,
  1502 + Maybe(List(DFA_pre_transition($Token))) transitions).
  1503 +
  1504 +
  1505 + The reason why the field 'transitions' has a 'Maybe' is that we may consider
  1506 + 'incomplete' states, which have not yet received their transitions.
  1507 +
  1508 + Note: A DFA is not a tree in general, but a graph. This is the reason why states have
  1509 + names. Since we cannot construct circular data in Anubis, the presence of names
  1510 + nevertheless allows the construction of graphs (including circularities). However, we
  1511 + cannot refer directly to a state, but only to its name.
  1512 +
  1513 + We explain now how the automaton is constructed for a decorated basic regular
  1514 + expression 'r'.
  1515 +
  1516 + First of all, there is an initial state, whose name is firstpos(r). What it means is
  1517 + that in this state, we expect to read a character corresponding to one of these
  1518 + positions.
  1519 +
  1520 + More generally, for any state 's', the name of the state is the list of all positions
  1521 + which may match the next character to be read from the input.
  1522 +
  1523 + Since we don't care about unreachable states, we construct the automaton, starting
  1524 + with the initial state, and adding all the states required by the transitions, until no
  1525 + more states may be added. Of course, this process terminates, since the set of all
  1526 + possible state names is obviously finite (its cardinality is at most 2^p, where p is
  1527 + the number of positions in r).
  1528 +
  1529 + For a given state, with name [p_1,...,p_k], the transitions are given by the labels of
  1530 + p_1,...,p_k. Nevertheless, several positions may have the same label. Hence, for a
  1531 + given label, let q_1,...,q_j be those among p_1,...,p_k which have this label. The
  1532 + target state for the corresponding transition is obtained by taking all the positions
  1533 + which may follow one of q_1,...,q_j.
  1534 +
  1535 + That's all!
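 As a cross-check, the construction just described (the classical 'followpos'
 construction) can be sketched in Python, with frozensets standing for the sorted
 lists of positions used as state names above (an illustration only, not the
 Anubis implementation below):

```python
def make_dfa(first, follow):
    """Subset construction over positions.

    'first' is the set of starting positions (firstpos of the whole regular
    expression); 'follow' maps a position to (label, set of following
    positions), playing the role of the follow table. Returns a dictionary
    {state_name: {label: target_name}} where names are frozensets.
    """
    states = {frozenset(first): None}   # None = transitions not yet computed
    while True:
        todo = [s for s, t in states.items() if t is None]
        if not todo:                    # every state is complete: the DFA is ready
            return states
        name = todo[0]
        by_label = {}                   # group the positions of this state by label
        for p in name:
            label, nxt = follow[p]
            by_label.setdefault(label, set()).update(nxt)
        trans = {lab: frozenset(tgt) for lab, tgt in by_label.items()}
        states[name] = trans
        for tgt in trans.values():      # add newly reached states, still incomplete
            if tgt not in states:
                states[tgt] = None
```

 The loop terminates for the reason given above: there are at most 2^p possible
 state names. Note that the empty frozenset may appear as a state, matching the
 discussion of empty state names below.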
  1536 +
  1537 +
  1538 + Empty state names. What does it mean that the name of a state is empty? It means
  1539 + that reaching this state produces an error. Indeed, a state accepts a string if and
  1540 + only if it contains a position labelled by an action, and has transitions to other
  1541 + states if and only if it contains a position labelled by a character (or
  1542 + 'beginning_of_line' or 'end_of_line').
  1543 +
  1544 + A state which contains an action is an accepting state. Nevertheless, it may also have
  1545 + transitions. Hence, the lexer may possibly accept a longer sequence. But following
  1546 + the transitions may also lead to an error. Hence the lexer must always keep the most
  1547 + recently found solution, and use it (if it exists) when it enters a dead end (and in
  1548 + that case, there is no error at all).
  1549 +
  1550 + When using a solution, the lexer must also apply the action. This action must have been
  1551 + saved by the lexer. Hence it is necessary to number actions, and to create a function
  1552 + for each action.
  1553 +
  1554 +
  1555 +
  1556 +
  1557 + Given a state name [p_1,...,p_k], and the follow table, the function
  1558 + 'prepare_transitions' produces a list of pairs
  1559 +
  1560 + (a , l)
  1561 +
  1562 + where 'a' is a label, and 'l' is the list of all positions with label 'a' which may
  1563 + follow one of p_1,...,p_k. We need an auxiliary function 'insert'.
  1564 +
  1565 +
  1566 +
  1567 +
  1568 +
  1569 +define List(DFA_pre_transition($Token))
  1570 + insert
  1571 + (
  1572 + DFA_pre_label($Token) c,
  1573 + List(Word32) l,
  1574 + List(DFA_pre_transition($Token)) q
  1575 + ) =
  1576 + if q is
  1577 + {
  1578 + [ ] then [transition(c,l)],
  1579 + [h . t] then
  1580 + if h is transition(c1,l1) then
  1581 + if c = c1
  1582 + then [transition(c,merge_sorted(l,l1)) . t]
  1583 + else [h . insert(c,l,t)]
  1584 + }.
  1585 +
  1586 +
  1587 +define List(DFA_pre_transition($Token))
  1588 + prepare_transitions
  1589 + (
  1590 + List(Word32) name,
  1591 + FollowTable($Token) ft
  1592 + ) =
  1593 + if name is
  1594 + {
  1595 + [ ] then [ ],
  1596 + [p1 . p_others] then
  1597 + if follow_table_entry(p1,ft) is (p,c,l) then
  1598 + with q = prepare_transitions(p_others,ft),
  1599 + insert(c,l,q)
  1600 + }.
  1601 +
  1602 +
  1603 +
  1604 +
  1605 + Now, we compute our DFA, i.e. a list of DFA_pre_state($Token)s. We begin with only one
  1606 + state in the list. The name of this state is firstpos(r), and it has not yet received
  1607 + its transitions. In other words, it is:
  1608 +
  1609 + state(firstpos(r),failure)
  1610 +
  1611 + Then, we enter an 'infinite' loop. At each pass, we look for a state which has not yet
  1612 + received its transitions. If there is no such state, the DFA is ready (and we exit the
  1613 + loop). Otherwise, we add its transitions to the state, and this may create new states
  1614 + (without their transitions) in the DFA.
  1615 +
  1616 + We need a function to separate (if possible) an incomplete state from a list of states:
  1617 +
  1618 +define Maybe((DFA_pre_state($Token),List(DFA_pre_state($Token))))
  1619 + separate_incomplete_state
  1620 + (
  1621 + List(DFA_pre_state($Token)) l
  1622 + ) =
  1623 + if l is
  1624 + {
  1625 + [ ] then failure,
  1626 + [s1 . so] then
  1627 + if transitions(s1) is
  1628 + {
  1629 + failure then
  1630 + success((s1,so)),
  1631 + success(_) then
  1632 + if separate_incomplete_state(so) is
  1633 + {
  1634 + failure then failure,
  1635 + success(p) then if p is (i,m) then
  1636 + success((i,[s1 . m]))
  1637 + }
  1638 + }
  1639 + }.
  1640 +
  1641 +
  1642 + We need a function to extract the list of target names from a list of transitions.
  1643 +
  1644 +define List(List(Word32))
  1645 + get_targets
  1646 + (
  1647 + List(DFA_pre_transition($Token)) l
  1648 + ) =
  1649 + if l is
  1650 + {
  1651 + [ ] then [ ],
  1652 + [h . t] then if h is transition(n,target) then
  1653 + [target . get_targets(t)]
  1654 + }.
  1655 +
  1656 +
  1657 + We need a predicate to test if a list of states contains a state of
  1658 + a given name.
  1659 +
  1660 +define Bool
  1661 + is_state_name_in
  1662 + (
  1663 + List(DFA_pre_state($Token)) l,
  1664 + List(Word32) n // sorted list of integers
  1665 + ) =
  1666 + if l is
  1667 + {
  1668 + [ ] then false,
  1669 + [h . t] then
  1670 + if h is state(m,tr) then
  1671 + if n = m // comparing sorted lists of integers
  1672 + then true
  1673 + else is_state_name_in(t,n)
  1674 + }.
  1675 +
  1676 +
  1677 + We need a function to add new states to a list of states. The new states are given in
  1678 + the form of a list of state names and are added without their transitions.
  1679 +
  1680 +define List(DFA_pre_state($Token))
  1681 + add_new_states
  1682 + (
  1683 + List(List(Word32)) names,
  1684 + List(DFA_pre_state($Token)) states
  1685 + ) =
  1686 + if names is
  1687 + {
  1688 + [ ] then states,
  1689 + [h . t] then
  1690 + if is_state_name_in(states,h)
  1691 + then add_new_states(t,states)
  1692 + else add_new_states(t,[state(h,failure) . states])
  1693 + }.
  1694 +
  1695 +
  1696 +
  1697 + We need a function to complete a state which has not yet received its transitions.
  1698 +
  1699 +define List(DFA_pre_state($Token))
  1700 + complete_state
  1701 + (
  1702 + DFA_pre_state($Token) i, // incomplete state
  1703 + List(DFA_pre_state($Token)) o, // other states
  1704 + FollowTable($Token) ft
  1705 + ) =
  1706 + with trans = prepare_transitions(name(i),ft),
  1707 + targets = get_targets(trans),
  1708 + add_new_states(targets,[state(name(i),success(trans)) . o]).
  1709 +
  1710 +
  1711 + Now, here is our 'infinite' loop.
  1712 +
  1713 +define List(DFA_pre_state($Token))
  1714 + make_DFA_pre
  1715 + (
  1716 + List(DFA_pre_state($Token)) l,
  1717 + FollowTable($Token) ft
  1718 + ) =
  1719 + if separate_incomplete_state(l) is
  1720 + {
  1721 + failure then l, // the DFA is ready
  1722 +
  1723 + success(p) then if p is (s,o) then
  1724 + with new = complete_state(s,o,ft),
  1725 + make_DFA_pre(new,ft)
  1726 + }.
  1727 +
  1728 +
  1729 +
  1730 +
  1731 +
  1732 + *** [3.5] Renaming the states of the DFA.
  1733 +
  1734 + Names of states in our DFA are lists of integers. We need to replace them by integers.
  1735 +
  1736 + From a DFA whose state names are lists of integers, we create a list of pairs (old,new)
  1737 + where new is a new name (an integer) and old an old name (a list of integers).
  1738 +
  1739 +define List((List(Word32),Word32)) // an association list
  1740 + name_list
  1741 + (
  1742 + List(DFA_pre_state($Token)) l,
  1743 + Word32 first_new_name
  1744 + ) =
  1745 + if l is
  1746 + {
  1747 + [ ] then [ ],
  1748 + [h . t] then
  1749 + if h is state(old_name,tr) then
  1750 + [(old_name,first_new_name) . name_list(t,first_new_name+1)]
  1751 + }.
  1752 +
  1753 +
  1754 + Given an old name and our association list, we can get the new name.
  1755 +
  1756 +define Word32
  1757 + get_new_name
  1758 + (
  1759 + List(Word32) old_name,
  1760 + List((List(Word32),Word32)) nlist
  1761 + ) =
  1762 + if nlist is
  1763 + {
  1764 + [ ] then alert, // the new name should always exist
  1765 + [h . t] then if h is (o,n) then
  1766 + if old_name = o
  1767 + then n
  1768 + else get_new_name(old_name,t)
  1769 + }.
  1770 +
  1771 +
  1772 + Now, we rename all transitions in a given state. At the same time we separate actual
  1773 + transitions from actions. This is why the following function returns a pair made of a
  1774 + list of transitions, and maybe an action. Since the action is of type:
  1775 +
  1776 + Maybe(ByteArray -> LexerOutput($Token))
  1777 +
  1778 + the optional action is of type:
  1779 +
  1780 + Maybe(Maybe(ByteArray -> LexerOutput($Token)))
  1781 +
  1782 +
  1783 +define (List(DFA_transition),Maybe(Maybe(ByteArray -> LexerOutput($Token))))
  1784 + rename
  1785 + (
  1786 + List(DFA_pre_transition($Token)) l,
  1787 + List((List(Word32),Word32)) nlist
  1788 + ) =
  1789 + if l is
  1790 + {
  1791 + [ ] then ([ ],failure),
  1792 + [h . t] then
  1793 + if rename(t,nlist) is (trs,mbmba) then
  1794 + if h is transition(pre_label,target) then
  1795 + if pre_label is
  1796 + {
  1797 + char(c) then
  1798 + ([transition(char(c),get_new_name(target,nlist)) . trs],mbmba),
  1799 + beginning_of_line then
  1800 + ([transition(beginning_of_line,get_new_name(target,nlist)) . trs],mbmba),
  1801 + end_of_line then
  1802 + ([transition(end_of_line,get_new_name(target,nlist)) . trs],mbmba),
  1803 + action(mba) then if mbmba is
  1804 + {
  1805 + failure then (trs,success(mba)),
  1806 + success(x) then // two actions in the same state: choose the first one.
  1807 + (trs,success(mba))
  1808 + }
  1809 + }
  1810 + }.
  1811 +
  1812 +
  1813 + Now, we rename all the states.
  1814 +
  1815 +define List(DFA_state($Token))
  1816 + rename
  1817 + (
  1818 + List(DFA_pre_state($Token)) l,
  1819 + List((List(Word32),Word32)) nlist
  1820 + ) =
  1821 + if l is
  1822 + {
  1823 + [ ] then [ ],
  1824 + [h . t] then
  1825 + if h is state(old_name,mbtrans) then
  1826 + if mbtrans is
  1827 + {
  1828 + failure then alert, // pre-states must have been completed
  1829 + success(trans) then
  1830 + if rename(trans,nlist) is (trs,mbmba) then
  1831 + if mbmba is
  1832 + {
  1833 + failure then
  1834 + [rejecting(get_new_name(old_name,nlist),trs) . rename(t,nlist)],
  1835 + success(mba) then
  1836 + [accepting(get_new_name(old_name,nlist),trs,mba) . rename(t,nlist)]
  1837 + }
  1838 + }
  1839 + }.
  1840 +
  1841 +
  1842 +
  1843 + *** [3.5] Making the DFA.
  1844 +
  1845 +
  1846 +
  1847 +
  1848 +define Result(RegExprError,BasicRegExpr($Token))
  1849 + prepare_global_regexpr
  1850 + (
  1851 + List(LexerItem($Token)) lexer_description
  1852 + ) =
  1853 + if lexer_description is
  1854 + {
  1855 + [ ] then error(empty_lexer_description),
  1856 + [h . t] then if h is lexer_item(re,a) then
  1857 + if parse_regular_expression(make_stream(re)) is
  1858 + {
  1859 + error(msg) then error(msg),
  1860 + ok(re1) then if t is
  1861 + {
  1862 + [ ] then
  1863 + ok(cat(to_basic(re1),action(a))),
  1864 + [_ . _] then if prepare_global_regexpr(t) is
  1865 + {
  1866 + error(msg) then error(msg),
  1867 + ok(p) then
  1868 + ok(or(cat(to_basic(re1),action(a)),p))
  1869 + }
  1870 + }
  1871 + }
  1872 + }.
  1873 +
  1874 +
  1875 +
  1876 +public define Result(RegExprError,List(DFA_state($Token)))
  1877 + make_DFA
  1878 + (
  1879 + List(LexerItem($Token)) lexer_description
  1880 + ) =
  1881 + if prepare_global_regexpr(lexer_description) is
  1882 + {
  1883 + error(msg) then error(msg),
  1884 + ok(re) then if decorate(re,0) is (br,_) then
  1885 + with dfa = reverse(make_DFA_pre([state(heads(firstpos(br)),failure)],
  1886 + make_follow_table(br))),
  1887 + ok(rename(dfa,name_list(dfa,0)))
  1888 + }.
  1889 +
  1890 +
  1891 +
  1892 +
  1893 +
  1894 + *** [3.6] Translating a DFA into a fast lexer description.
  1895 +
  1896 + The types 'FastLexerTransition' and 'FastLexerState' are defined in 'predefined.anubis',
  1897 + section 13.
  1898 +
  1899 +
  1900 +define List(FastLexerTransition)
  1901 + to_fast_lexer_transitions
  1902 + (
  1903 + List(DFA_transition) l
  1904 + ) =
  1905 + if l is
  1906 + {
  1907 + [ ] then [ ],
  1908 + [h . t] then if h is transition(label,target) then
  1909 + [if label is
  1910 + {
  1911 + char(c) then transition(c,target),
  1912 + beginning_of_line then beginning_of_line(target),
  1913 + end_of_line then end_of_line(target)
  1914 + } . to_fast_lexer_transitions(t)]
  1915 + }.
  1916 +
  1917 +
  1918 +public define List(FastLexerState)
  1919 + to_fast_lexer_description
  1920 + (
  1921 + List(DFA_state($Token)) l
  1922 + ) =
  1923 + if l is
  1924 + {
  1925 + [ ] then [ ],
  1926 + [h . t] then [if h is
  1927 + {
  1928 + rejecting(n,trs) then rejecting(to_fast_lexer_transitions(trs)),
  1929 + accepting(n,trs,a) then accepting(to_fast_lexer_transitions(trs))
  1930 + } . to_fast_lexer_description(t)]
  1931 + }.
  1932 +
  1933 +
  1934 +
  1935 +
  1936 +
  1937 +
  1938 + *** [4] Constructing the lexer.
  1939 +
  1940 + The low-level fast lexer (see 'predefined.anubis', section 13) does not care about
  1941 + actions. Hence, we must manage the actions in parallel. To this end, we use the
  1942 + following type:
  1943 +
  1944 + MVar(Maybe(ByteArray -> LexerOutput($Token)))
  1945 +
  1946 + The action for state 'n' (assumed to be an accepting state, because the multiple
  1947 + variable is never used for rejecting states) is the value stored in slot 'n'. The
  1948 + default value is 'failure', meaning 'ignore this token and read the next
  1949 + one'. If the slot contains a function instead, this function is applied to the lexeme
  1950 + just read, and the lexer returns its result.
  1951 +
  1952 + The multiple variable is filled up by:
  1953 +
  1954 +define One
  1955 + fill_actions
  1956 + (
  1957 + List(DFA_state($Token)) dfa,
  1958 + MVar(Maybe(ByteArray -> LexerOutput($Token))) v
  1959 + ) =
  1960 + if dfa is
  1961 + {
  1962 + [ ] then unique,
  1963 + [h . t] then
  1964 + if h is
  1965 + {
  1966 + rejecting(name,trs) then unique,
  1967 + accepting(name,trs,action) then
  1968 + v(name) <- action
  1969 + };
  1970 + fill_actions(t,v)
  1971 + }.
  1972 +
  1973 +
  1974 + The multiple variable for actions is created by:
  1975 +
  1976 +define MVar(Maybe(ByteArray -> LexerOutput($Token)))
  1977 + get_actions
  1978 + (
  1979 + List(DFA_state($Token)) dfa
  1980 + ) =
  1981 + with ns = length(dfa), // total number of states
  1982 + v = mvar(truncate_to_Word32(ns),
  1983 + (Maybe(ByteArray -> LexerOutput($Token)))failure),
  1984 + fill_actions(dfa,v); v.
  1985 +
  1986 +
  1987 +
  1988 + Now we plug the lexer into a lexing stream:
  1989 +
  1990 +
  1991 +define One -> LexerOutput($Token)
  1992 + plug_lexer
  1993 + (
  1994 + LexingStream stream,
  1995 + (ByteArray input,
  1996 + FastLexerLastAccepted last_accepted,
  1997 + FastLexerBeginningOfLine bol,
  1998 + FastLexerEndOfLine eol,
  1999 + Int position,
  2000 + Word32 starting_state) -> FastLexerOutput lexer,
  2001 + MVar(Maybe(ByteArray -> LexerOutput($Token))) actions
  2002 + ) =
  2003 + with bol_v = var((FastLexerBeginningOfLine)at_beginning_of_line),
  2004 + eol_v = var((FastLexerEndOfLine)not_at_end_of_line),
  2005 + if stream is lexing_stream(buffer_v,start_v,last_accept_v,current_v,reload_buffer) then
  2006 + (One _) |-l-> if lexer(*buffer_v,
  2007 + *last_accept_v,
  2008 + *bol_v,
  2009 + *eol_v,
  2010 + *current_v,
  2011 + 0) // reading a new token always starts in state 0
  2012 + is
  2013 + {
  2014 + rejected(state,end,a) then
  2015 + if a is
  2016 + {
  2017 + not_at_end_of_input then
  2018 + with result = (LexerOutput($Token))error(extract(*buffer_v,*start_v,end)),
  2019 + current_v <- end+1;
  2020 + start_v <- end+1;
  2021 + last_accept_v <- none;
  2022 + result,
  2023 +
  2024 + at_end_of_input then
  2025 + if reload_buffer(*start_v) is
  2026 + {
  2027 + failure then //print("At end (1).\n");
  2028 + end_of_input, // really at end of input
  2029 + success(_) then
  2030 + l(unique) // continue reading this token
  2031 + }
  2032 + },
  2033 +
  2034 + accepted(state,end,a) then
  2035 + if a is
  2036 + {
  2037 + not_at_end_of_input then
  2038 + if *actions(state) is
  2039 + {
  2040 + failure then
  2041 + current_v <- end;
  2042 + start_v <- end;
  2043 + last_accept_v <- none;
  2044 + l(unique), // ignore and try to read the next token
  2045 +
  2046 + success(f) then
  2047 + with result = f(extract(*buffer_v,*start_v,end)),
  2048 + current_v <- end;
  2049 + start_v <- end;
  2050 + last_accept_v <- none;
  2051 + result
  2052 + },
  2053 +
  2054 + at_end_of_input then
  2055 + if reload_buffer(*start_v) is
  2056 + {
  2057 + failure then
  2058 + if *actions(state) is
  2059 + {
  2060 + failure then //print("At end (2).\n");
  2061 + end_of_input, // ignore and don't try to continue
  2062 + success(f) then
  2063 + with result = f(extract(*buffer_v,*start_v,end)),
  2064 + current_v <- end;
  2065 + start_v <- end;
  2066 + last_accept_v <- none;
  2067 + result
  2068 + },
  2069 +
  2070 + success(_) then l(unique) // continue reading this token
  2071 + }
  2072 + }
  2073 + }.
  2074 +
  2075 +
  2076 +
  2077 + Finally, here is the tool for making a lexer:
  2078 +
  2079 +public define Result(RegExprError, LexingStream -> One -> LexerOutput($Token))
  2080 + make_lexer
  2081 + (
  2082 + List(LexerItem($Token)) lexer_description
  2083 + ) =
  2084 + if make_DFA(lexer_description) is
  2085 + {
  2086 + error(msg) then error(msg),
  2087 + ok(List(DFA_state($Token)) dfa) then
  2088 + if make_fast_lexer(to_fast_lexer_description(dfa)) is
  2089 + {
  2090 + unknown_state(n) then alert, // cannot happen
  2091 + ok(fl) then ok((LexingStream ls) |-> plug_lexer(ls,fl,get_actions(dfa)))
  2092 + }
  2093 + }.
  2094 +
  2095 +
  2096 +
0 2097 \ No newline at end of file
anubis_distrib/library/lexical_analysis/fast_lexer_example_1.anubis 0 → 100644
  1 +
  2 +
  3 + The Anubis Project
  4 +
  5 + Tools for lexical analysis.
  6 + A simple example.
  7 +
  8 + Copyright (c) Constructive Mathematics 2007-2008.
  9 +
  10 +
  11 + Author: Alain Prouté
  12 +
  13 +
  14 + In this file, we present a simple example of the use of 'fast_lexer.anubis'. The
  15 + program is a very simplified version of the Unix tool 'grep':
  16 +
  17 +global define One
  18 + fast_lexer_example_1
  19 + (
  20 + List(String) args
  21 + ).
  22 +
  23 + This program receives a regular expression and a filename as its arguments. Its purpose
  24 + is to print to the standard output all the sequences in the file matching the regular
  25 + expression, with line numbers.
  26 +
  27 +define String
  28 + usage =
  29 + "Usage: anbexec fast_lexer_example_1 <regular expression> <file name>\n".
  30 +
  31 +
  32 +
  33 + --- That's all for the public part! ---------------------------------------------------
  34 + Nevertheless, since this is an example, you may want to read the sequel, which is fully
  35 + commented.
  36 +
  37 +
  38 +
  39 +
  40 + -------------------------------- Table of Contents ------------------------------------
  41 +
  42 + *** [1] Tokens.
  43 + *** [2] Preparing the lexer description.
  44 + *** [3] Preparing the lexing stream.
  45 + *** [4] The main loop.
  46 + *** [5] Carrying on.
  47 +
  48 + ---------------------------------------------------------------------------------------
  49 +
  50 +
  51 +
  52 + First of all, we must access the tool:
  53 +
  54 +read lexical_analysis/fast_lexer.anubis
  55 +read lexical_analysis/dfa_compiler.anubis
  56 +read lexical_analysis/regexpr_parser.anubis
  57 +read lexical_analysis/lexing_stream.anubis
  58 +
  59 +
  60 +
  61 + *** [1] Tokens.
  62 +
  63 + The first thing to do is to define the type for representing tokens since 'fast lexer'
  64 + has a parameter '$Token'. In the case of this example, this type is very simple:
  65 +
  66 +type Token:
  67 + matching(String),
  68 + newline.
  69 +
  70 + since each recognized sequence is just considered as a string. However, we also have to
  71 + recognize newline characters in order to be able to count lines.
  72 +
  73 +
  74 +
  75 +
  76 + *** [2] Preparing the lexer description.
  77 +
  78 + Before you can construct your lexer, you must prepare a 'lexer description'. It's of type:
  79 +
  80 + List(LexerItem(Token))
  81 +
  82 + We have one lexer item for the given regular expression and another one for
  83 + newlines. We also need a third one for ignoring everything else.
  84 +
  85 +
  86 +define List(LexerItem(Token))
  87 + prepare_lexer_description
  88 + (
  89 + String regular_expression
  90 + ) =
  91 + [
  92 + /* recognize sequences matching the given regular expression */
  93 + lexer_item(regular_expression,
  94 + success((ByteArray b) |-> token(matching(to_string(b))))),
  95 +
  96 + /* recognize newline characters */
  97 + lexer_item("\n",
  98 + success((ByteArray b) |-> token(newline))),
  99 +
  100 + /* ignore everything else */
  101 + lexer_item(".", /* "." represents any character except '\n' */
  102 + failure)
  103 + ].
  104 +
  105 +
  106 + The lexer will be constructed below by applying the function 'make_lexer' (declared in
  107 + 'fast_lexer.anubis') to this lexer description.
  108 +
  109 +
  110 +
  111 +
  112 +
  113 + *** [3] Preparing the lexing stream.
  114 +
  115 + Lexical analysis is performed from an input stream (of type 'LexingStream'). In the
  116 + case of this example, the input stream is constructed from the given filename. Of
  117 + course, this may fail, since the file may not exist or may not be readable.
  118 +
  119 +define Maybe(LexingStream)
  120 + prepare_input
  121 + (
  122 + String filename
  123 + ) =
  124 + /* try to open the file ('predefined.anubis' section 5.1) */
  125 + if file(filename,read) is
  126 + {
  127 + failure then failure,
  128 + success(f) then make_lexing_stream(f, /* the opened file */
  129 + 1000, /* size of buffer for the lexing stream */
  130 + 100) /* timeout (seconds) */
  131 + }.
  132 +
  133 +
  134 +
  135 +
  136 +
  137 + *** [4] The main loop.
  138 +
  139 + Assuming our lexer is ready as a function of type 'One -> LexerOutput(Token)' (i.e. the
  140 + lexing stream is already plugged into it), we construct the main loop of this program.
  141 + It consists of calling the lexer repeatedly until it returns 'end_of_input'.
  142 +
  143 + When it returns an error (actually a lexical error), we print this error. However, this
  144 + should never happen, because our lexer has a lexer item for ignoring anything not
  145 + matching one of the first two lexer items.
  146 +
  147 + In this loop we also count lines. There is no need for a Var(Int) for that
  148 + purpose. It's much better to use a 'deterministic local variable' in the form of an
  149 + extra argument to our function. The function will be called with the value 1 for this
  150 + argument, which simulates the initialisation of the variable.
  151 +
  152 +define One
  153 + main_loop
  154 + (
  155 + One -> LexerOutput(Token) lexer,
  156 + Int lineno /* no need for a Var(Int) */
  157 + ) =
  158 + /* get the next token or whatever */
  159 + if lexer(unique) is
  160 + {
  161 + end_of_input then /* no more token: exit the main loop */
  162 + unique,
  163 +
  164 + error(b) then
  165 + /* should never happen with this lexer (see the above comment) */
  166 + print("Error: ["+to_string(b)+"]\n");
  167 + /* nevertheless we continue the lexical analysis */
  168 + main_loop(lexer,lineno),
  169 +
  170 + token(t) then
  171 + /* a token has been recognized */
  172 + if t is
  173 + {
  174 + matching(s) then /* print the current line number and the recognized sequence */
  175 + print(abs_to_decimal(lineno)+": "+s+"\n");
  176 + /* continue with the same lineno */
  177 + main_loop(lexer,lineno),
  178 +
  179 + newline then /* continue with an incremented lineno */
  180 + main_loop(lexer,lineno+1)
  181 + }
  182 + }.
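
 The 'deterministic local variable' idiom used by 'main_loop' can be sketched in
 Python (an illustration only, with made-up names):

```python
# A sketch, in Python, of the "deterministic local variable" idiom used by
# 'main_loop' above (illustrative names only): the line counter is threaded
# through as an extra argument instead of being stored in a mutable variable.
def count_lines(tokens, lineno=1):
    if not tokens:                       # end of input: return the final count
        return lineno
    head, *tail = tokens
    # a newline token increments the counter; anything else leaves it unchanged
    return count_lines(tail, lineno + 1 if head == "\n" else lineno)

print(count_lines(["344", "\n", "+", "\n", "87"]))  # prints: 3
```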
  183 +
  184 +
  185 +
  186 +
  187 +
  188 + *** [5] Carrying on.
  189 +
  190 +
  191 +read tools/basis.anubis (needed for UTime subtraction)
  192 +
  193 +
  194 + Now we can define our tool. We have to:
  195 +
  196 + - check that the user gave the two required arguments on the command line,
  197 + - prepare the lexer description,
  198 + - prepare the input stream,
  199 + - run the main loop.
  200 +
  201 +global define One
  202 + fast_lexer_example_1
  203 + (
  204 + List(String) args
  205 + ) =
  206 + /* check for first argument */
  207 + if args is
  208 + {
  209 + [ ] then print(usage),
  210 + [re . t] then
  211 + /* check for second argument */
  212 + if t is
  213 + {
  214 + [ ] then print(usage),
  215 + [filename . _] then
  216 + /* prepare the lexer description and make the lexer */
  217 + if make_lexer(prepare_lexer_description(re)) is
  218 + {
  219 + error(msg) then print("Syntax error in regular expression: "+to_English(msg)+"\n"),
  220 + ok(lexer) then
  221 + /* prepare the input stream */
  222 + if prepare_input(filename) is
  223 + {
  224 + failure then print("cannot open or read file '"+filename+"'.\n"),
  225 + success(ls) then
  226 + with start_time = unow,
  227 + /* run the main loop */
  228 + main_loop(lexer(ls),1);
  229 + if unow - start_time is utime(secs,microsecs) then
  230 + print("Duration: "+abs_to_decimal(secs)+" seconds, "+abs_to_decimal(microsecs)+" microseconds.\n")
  231 + }
  232 + }
  233 + }
  234 + }.
  235 +
  236 +
  237 +
  238 +
  239 +
  240 +
0 241 \ No newline at end of file
anubis_distrib/library/lexical_analysis/lexer_maker_v2_example.lexer 0 → 100644
  1 +
  2 +
  3 + This is an example of use of 'lexer_maker'.
  4 +
  5 +read tools/basis.anubis
  6 +
  7 +
  8 + We want to test email addresses. Below is a regular expression for that
  9 + purpose. Actually, this expression is too naïve. A real one would be more complicated.
  10 +
  11 +#ETL
  12 +
  13 +#email_tester String
  14 +[a-zA-Z0-9\-_]+(\.[a-zA-Z0-9\-_]+)*@[a-zA-Z0-9]+(\.[a-zA-Z0-9]+)+ { (ls,token(text)) }
  15 +#
  16 +
  17 +
  18 + Since '@' is a normal character, a string needs to contain exactly one '@' in order to
  19 + be accepted. What is accepted before and after this '@' has the simplified form:
  20 +
  21 + [a-zA-Z]+(\.[a-zA-Z]+)*
  22 +
  23 + The first part: [a-zA-Z]+ means ``at least one letter''. The last part: (\.[a-zA-Z]+)*
  24 + means: ``a dot followed by at least one letter, and this may be repeated any number of
  25 + times (including zero)''.
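
 The explanation above can be checked with Python's 're' module, whose character-class
 and repetition syntax matches this fragment (an illustration only, not part of the
 lexer description itself):

```python
import re

# The simplified fragment discussed above: at least one letter, then any
# number of groups of a dot followed by at least one letter.
part = r"[a-zA-Z]+(?:\.[a-zA-Z]+)*"

print(bool(re.fullmatch(part, "alice")))         # True: at least one letter
print(bool(re.fullmatch(part, "mail.example")))  # True: dot groups may repeat
print(bool(re.fullmatch(part, ".example")))      # False: must start with a letter
print(bool(re.fullmatch(part, "a.")))            # False: a dot needs letters after it
```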
  26 +
  27 +
  28 + This part of the source file is the 'postamble' (just Anubis text, which is copied 'as
  29 + is' to the lexer_maker output file).
  30 +
  31 + The above stuff produces a function named 'email_tester' into the lexer_maker output
  32 + file. This function is used below:
  33 +
  34 +global define One
  35 + test_email_address
  36 + (
  37 + List(String) args
  38 + ) =
  39 + if args is
  40 + {
  41 + [ ] then print("Usage: test_email_address <address> ... <address>\n"),
  42 + [_ . _] then
  43 + map_forget((String s) |->
  44 + with ls = lexer_state(make_stream(s),[],[],email_tester,true,false,failure),
  45 + if email_tester(ls) is (_,result) then if result is
  46 + {
  47 + end_of_file then print("End of input.\n"),
  48 + token(t) then with result1 = implode(t),
  49 + if length(result1) = length(s)
  50 + then print(s+" (accepted)\n")
  51 + else print(s+" (truncated as: "+result1+")\n"),
  52 + error then print(s+" (rejected)\n")
  53 + },
  54 + args)
  55 + }.
  56 +
  57 +
  58 +
anubis_distrib/library/lexical_analysis/testing_fast_lexer.anubis 0 → 100644
  1 +
  2 +
  3 +
  4 +
  5 +
  6 +
  7 +
  8 +
  9 + This is just for testing 'fast_lexer.anubis'.
  10 +
  11 +read tools/basis.anubis
  12 +read tools/streams.anubis
  13 +
  14 +read regexpr_parser.anubis
  15 +read dfa_compiler.anubis
  16 +
  17 +
  18 +define String
  19 + format
  20 + (
  21 + DFA_pre_label(String) l
  22 + ) =
  23 + if l is
  24 + {
  25 + char(c) then implode[c],
  26 + beginning_of_line then "^",
  27 + end_of_line then "$",
  28 + action(mbf) then if mbf is
  29 + {
  30 + failure then "<ignore>",
  31 + success(f) then if f(constant_byte_array(0,0)) is
  32 + token(s) then s else alert
  33 + }
  34 + }.
  35 +
  36 +define String
  37 + format
  38 + (
  39 + DFA_label l
  40 + ) =
  41 + if l is
  42 + {
  43 + char(c) then implode[c],
  44 + beginning_of_line then "^",
  45 + end_of_line then "$"
  46 + }.
  47 +
  48 +define Printable_tree
  49 + format
  50 + (
  51 + DFA_transition t
  52 + ) =
  53 + if t is transition(label,target_name) then
  54 + ["'", format(label), "'>", target_name, " "].
  55 +
  56 +
  57 +define Printable_tree
  58 + format
  59 + (
  60 + List(DFA_transition) l
  61 + ) =
  62 + if l is
  63 + {
  64 + [ ] then [ ],
  65 + [h . t] then [format(h) . format(t)]
  66 + }.
  67 +
  68 +define Printable_tree
  69 + format
  70 + (
  71 + DFA_state(String) s
  72 + ) =
  73 + if s is
  74 + {
  75 + rejecting(n,trs) then ["\n", to_decimal(n), " (rejecting) ", format(trs)],
  76 + accepting(n,trs,mba) then ["\n", to_decimal(n), " (accepting) ", format(trs),
  77 + if mba is
  78 + {
  79 + failure then "<ignore>",
  80 + success(a) then "<action "+
  81 + if a(constant_byte_array(0,0)) is
  82 + {
  83 + end_of_input then alert,
  84 + error(_) then alert,
  85 + token(s1) then s1
  86 + }+">"
  87 + }]
  88 + }.
  89 +
  90 +
  91 +define Printable_tree
  92 + format
  93 + (
  94 + List(DFA_state(String)) l
  95 + ) =
  96 + if l is
  97 + {
  98 + [ ] then ["\n------------------------\n"],
  99 + [h . t] then
  100 + [format(h) . format(t)]
  101 + }.
  102 +
  103 +
  104 +define One
  105 + syntax
  106 + =
  107 + print("Usage: fast_lexer_test <regular expression> ... <regular expression>\n\n").
  108 +
  109 +
  110 +define String
  111 + format
  112 + (
  113 + RegExpr e
  114 + ) =
  115 + if e is
  116 + {
  117 + char(Word8 c) then implode([c]),
  118 + choice(l) then "["+implode(l)+"]",
  119 + plus(RegExpr e1) then "("+format(e1)+"+"+")",
  120 + star(RegExpr e1) then "("+format(e1)+"*"+")",
  121 + cat(RegExpr e1,RegExpr e2) then format(e1)+format(e2),
  122 + or(RegExpr e1,RegExpr e2) then "("+format(e1)+"|"+format(e2)+")",
  123 + beginning_of_line then "^",
  124 + end_of_line then "$",
  125 + dot then ".",
  126 + question_mark(e1) then "("+format(e1)+")?"
  127 + }.
  128 +
  129 +
  130 +define String
  131 + format
  132 + (
  133 + BasicRegExpr($Token) e
  134 + ) =
  135 + if e is
  136 + {
  137 + char(c) then implode([c]),
  138 + star(e1) then "("+format(e1)+"*"+")",
  139 + or(e1,e2) then "("+format(e1)+"|"+format(e2)+")",
  140 + cat(e1,e2) then format(e1)+format(e2),
  141 + epsilon then "()",
  142 + beginning_of_line then "^",
  143 + end_of_line then "$",
  144 + action(a) then "<action>"
  145 + }.
  146 +
  147 +
  148 +define List(LexerItem(String))
  149 + prepare_lexer_items
  150 + (
  151 + List(String) regexprs,
  152 + Int i
  153 + ) =
  154 + if regexprs is
  155 + {
  156 + [ ] then [ ],
  157 + [h . t] then
  158 + [lexer_item(h,success((ByteArray b) |-> token(to_decimal(i))))
  159 + . prepare_lexer_items(t,i+1)]
  160 + }.
  161 +
  162 +
  163 +define Printable_tree
  164 + format
  165 + (
  166 + List(FastLexerTransition) l
  167 + ) =
  168 + if l is
  169 + {
  170 + [ ] then [ ],
  171 + [h . t] then if h is
  172 + {
  173 + transition(c,s) then
  174 + [implode[c], ":", s, " " . format(t)],
  175 + beginning_of_line(s) then
  176 + ["^:",s, " " . format(t)],
  177 + end_of_line(s) then
  178 + ["$:",s, " " . format(t)]
  179 + }
  180 + }.
  181 +
  182 +define Printable_tree
  183 + format
  184 + (
  185 + List(FastLexerState) l,
  186 + Int i
  187 + ) =
  188 + if l is
  189 + {
  190 + [ ] then ["\n------------------------\n"],
  191 + [h . t] then if h is
  192 + {
  193 + rejecting(trs) then ["\n", i, " rejecting: ", format(trs) . format(t,i+1)],
  194 + accepting(trs) then ["\n", i, " accepting: ", format(trs) . format(t,i+1)]
  195 + }
  196 + }.
  197 +
  198 +
  199 +
  200 +
  201 +define One
  202 + run_fast_lexer
  203 + (
  204 + (ByteArray input,
  205 + FastLexerLastAccepted last_accepted,
  206 + FastLexerBeginningOfLine bol,
  207 + FastLexerEndOfLine eol,
  208 + Int position,
  209 + Word32 starting_state) -> FastLexerOutput fast
  210 + ) =
  211 + with text = prompt("Try it out (q to quit): ") + "\n",
  212 + if text = "q\n" then unique else
  213 + with ba = to_byte_array(text),
  214 + if fast(ba,
  215 + none,
  216 + at_beginning_of_line,
  217 + not_at_end_of_line,
  218 + 0,
  219 + 0) is
  220 + {
  221 + rejected(n,e,a) then print("\""+to_string(extract(ba,0,e))+
  222 + "\" rejected in state "+to_decimal(n)+"\n"),
  223 + accepted(n,e,a) then print("\""+to_string(extract(ba,0,e))+
  224 + "\" accepted in state "+to_decimal(n)+"\n")
  225 + };
  226 + run_fast_lexer(fast).
  227 +
  228 +
  229 +define One
  230 + run_fast_lexer
  231 + (
  232 + List(FastLexerState) l
  233 + ) =
  234 + if make_fast_lexer(l) is
  235 + {
  236 + unknown_state(n) then print("\nUnknown state: "+to_decimal(n)),
  237 + ok(fast) then run_fast_lexer(fast)
  238 + }.
  239 +
  240 +
  241 +global define One
  242 + fast_lexer_test
  243 + (
  244 + List(String) args
  245 + ) =
  246 + if args is [] then syntax else
  247 + map_forget((String e) |-> if parse_regular_expression(make_stream(e)) is
  248 + {
  249 + error(msg) then print("*** Error: "+to_English(msg)+"\n\n"),
  250 + ok(re) then print("Regular expression "+e+" is correct.\n");
  251 + print("Read as: "+format(re)+"\n");
  252 + print("Basic equivalent: "+
  253 + format((BasicRegExpr(String))to_basic(re))+"\n\n")
  254 + },
  255 + args);
  256 + if make_DFA(prepare_lexer_items(args,0)) is
  257 + {
  258 + error(msg) then print("*** Error: "+to_English(msg)+"\n\n"),
  259 + ok(auto) then with fl = to_fast_lexer_description(auto),
  260 + print("Automaton:\n------------------------ ");
  261 + print(format(auto));
  262 + print("Fast Lexer:\n------------------------ ");
  263 + print(format(fl,0));
  264 + run_fast_lexer(fl)
  265 + }.
  266 +
  267 +
  268 +
  269 +
  270 +
  271 +
  272 +
0 273 \ No newline at end of file