/**************************************************************************************
_____ ___. .__ ________
/ _ \ ____ __ _\_ |__ |__| ______ \_____ \
/ /_\ \ / \| | \ __ \| |/ ___/ / ____/
/ | \ | \ | / \_\ \ |\___ \ / \
\____|__ /___| /____/|___ /__/____ > \_______ \
\/ The \/ Anubis \/ 2 \/ Project \/
Name of this file: overview (text file)
Purpose of this file: An overview of the whole compiler.
Authors (Name [initials]): Alain Proute' [AP]
Updates ([initials] (date) comment):
[AP] (2007 jun 07) Creation of this file.
**************************************************************************************/
*** How source texts flow through the compiler.
The role of the compiler is to read one or several ``source files'', and to produce one
or several ``output files''. The content of the output files is some kind of
translation of the content of the source files. This translation goes through
intermediate steps (i.e. intermediate languages):
source text
|
| phase 1 (syntax)
V
weak expressions
|
| phase 2 (semantics)
V
strong expressions
|
| phase 3 (code generation)
V
symbolic code
Further translations, for example into another programming language or directly into
the binary code of a physical machine, may also be performed.
The source texts are transformed into weak expressions by a so-called ``lexer-parser''
pair of programs. During this phase, syntax errors may be reported. Weak expressions
are then ``interpreted'', which leads to strong expressions. During this phase,
semantic errors may be reported. Finally, strong expressions are ``compiled'' into
some symbolic ``machine code'', which may be output in several different ways. The
results of intermediate steps may also be output into files. This is interesting in
particular for strong expressions, whose logical coherence may still be checked by an
algorithm, which is not a priori the case for symbolic code.
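The three phases above can be sketched as a chain of functions. The following Python
sketch is purely illustrative: the function names and the string-based "languages" are
hypothetical stand-ins, not the compiler's actual data structures.

```python
# Illustrative pipeline: each phase translates one intermediate language
# into the next.  All names and representations here are hypothetical.

def lex_parse(source: str) -> list:
    """Phase 1 (syntax): source text -> weak expressions."""
    return [("weak", word) for word in source.split()]

def interpret(weak: list) -> list:
    """Phase 2 (semantics): weak expressions -> strong expressions."""
    return [("strong", payload) for _, payload in weak]

def compile_strong(strong: list) -> list:
    """Phase 3 (code generation): strong expressions -> symbolic code."""
    return [("PUSH", payload) for _, payload in strong]

def translate(source: str) -> list:
    """Full translation: source text -> symbolic code."""
    return compile_strong(interpret(lex_parse(source)))
```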
*** The system of types.
The system of types plays a central role in this compiler; the originality of the
Anubis language is mainly due to it. Types are of the following sorts:
- defined types,
- anonymous products,
- dependant functional types (henceforth spelled ``dependent''),
- the type ``Omega'' of truth values,
- witness types,
- quantified types,
- universes.
Each sort of type is normally accompanied by ``constructors'' and ``destructors''.
Constructors enable the construction of data of these types, and destructors enable
the use of these data. However, things are not quite that simple. For example, the type
Omega has a constructor, but no destructor. Instead of being destroyed, data of type
Omega create witness types. Witness types in turn have no constructor and no
destructor, but in many cases they are identified with dependent functional types.
Data of all our types are represented by ``terms''. Hence, what the constructors and
destructors enable is essentially the writing of terms. Terms of type Omega are called
``statements'', and terms of witness types are called ``proofs''.
Types contain only abstract data. Concrete data, like network connections, files, and
other ``objects'', are generated by ``classes'', not by types. Even though types and
classes of course share many similar features, they should not be confused.
*** The syntax.
The syntax of the Anubis language has been simplified since version 1. In version 2, it
is at the same time simpler and more flexible. It is based on a small number of
constructions. The main notions are:
- modules (enabling modular programming),
- paragraphs (what modules, for example source files, contain),
- type expressions (the representation of types),
- terms (the representation of data).
A module is an ordered sequence of paragraphs. Each paragraph is either:
- a ``read'' paragraph, enabling the use of tools from other modules,
- a type definition,
- a datum definition or declaration,
- a macro paragraph.
The Anubis language has both simple and parametric polymorphism. Simple polymorphism,
better called ``polysemy'', is the ability to use the same name several times for
defining several different things (of different types). Parametric polymorphism is the
ability to use ``parameters'' representing arbitrary types. These two tools are
very powerful. However, they introduce many ambiguities into the texts, which must be
resolved by the compiler. This resolution is essentially performed by ``unification''.
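The idea of resolution by unification can be sketched as follows. This is a generic
first-order unification over a hypothetical tuple encoding of types (the occurs check
is omitted for brevity); it is not the compiler's actual algorithm.

```python
# Types are encoded as ('param', name) for a parameter like '$T', or
# ('con', name, [args]) for a defined type like 'List(String)'.
# This encoding is illustrative, not the compiler's internal one.

def resolve(t, subst):
    """Follow substitution links until a non-substituted type is reached."""
    while t[0] == 'param' and t[1] in subst:
        t = subst[t[1]]
    return t

def unify(t1, t2, subst=None):
    """Return a substitution making t1 and t2 equal, or None on failure."""
    if subst is None:
        subst = {}
    t1, t2 = resolve(t1, subst), resolve(t2, subst)
    if t1 == t2:
        return subst
    if t1[0] == 'param':            # bind a parameter (occurs check omitted)
        subst[t1[1]] = t2
        return subst
    if t2[0] == 'param':
        subst[t2[1]] = t1
        return subst
    # Two defined types unify only with the same name and arity.
    if t1[1] != t2[1] or len(t1[2]) != len(t2[2]):
        return None
    for a, b in zip(t1[2], t2[2]):
        subst = unify(a, b, subst)
        if subst is None:
            return None
    return subst
```

For example, unifying 'List($T)' with 'List(String)' binds '$T' to 'String'.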
Terms are classified into the following few sorts:
- simple symbols (also called ``nullary terms''),
- unary terms,
- binary terms,
- tuples,
- declarative terms,
- conditionals,
- local definitions.
This classification is reflected in the weak terms, whose structure mimics the
syntax, but disappears from the strong terms, whose structure is dictated by the
semantics (i.e. by the type system).
Phase 1 is actually made of 3 main tools:
- the lexer, producing 'tokens',
- the preparser, producing 'weak expressions',
- the postparser, producing 'weak paragraphs', 'weak terms' and 'weak types'.
At the level of 'weak expressions', paragraphs, types and terms are not yet
differentiated.
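A phase-1 lexer in the spirit described above can be sketched in a few lines. The
token kinds and the regular expression below are hypothetical; the real lexer's token
set is not specified here.

```python
import re

# One alternative per illustrative token kind: type parameters ('$T'),
# identifiers, numbers, and single punctuation characters.
TOKEN_RE = re.compile(r"\s*(?:(\$[A-Za-z_]\w*)|([A-Za-z_]\w*)|(\d+)|(\S))")

def lex(text):
    """Split source text into a list of (kind, value) tokens."""
    tokens, pos = [], 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if not m:                      # only trailing whitespace remains
            break
        param, ident, num, punct = m.groups()
        if param:
            tokens.append(("param", param))
        elif ident:
            tokens.append(("ident", ident))
        elif num:
            tokens.append(("number", num))
        else:
            tokens.append(("punct", punct))
        pos = m.end()
    return tokens
```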
*** Type definitions.
Type definitions create ``defined types''. Actually, a defined type may be a ``type
schema'', i.e. the type it defines depends on one or several parameters. For example,
the well known type 'List($T)' depends on the parameter '$T'. When a type does not
depend on any parameter, it is called an ``actual type''. For example, 'Bool' and
'List(String)' are actual types, whilst 'List($T)' is not. Nevertheless, a quantified
type like '?$T List($T)' is an actual type, because the universal quantifier '?' binds
the parameter '$T' in this expression.
Type definitions (with or without parameters) may be recursive and even cross
recursive. The name of a type and its number of parameters are part of this type (two
types with distinct names are distinct, even if their definitions are otherwise
identical). Furthermore, the compiler must be careful: two types may have the
same name and the same number of parameters, provided they are never visible together
from the same place.
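The distinction between actual types and type schemas amounts to asking whether a type
expression has free parameters, the quantifier '?' acting as a binder. This can be
sketched as follows, over a hypothetical tuple encoding of type expressions.

```python
# Encoding (illustrative): ('param', name) for '$T', ('con', name, [args])
# for a defined type, ('forall', name, body) for a quantified type '?$T ...'.

def free_params(ty, bound=frozenset()):
    """Collect the free type parameters of a type expression."""
    kind = ty[0]
    if kind == 'param':
        return set() if ty[1] in bound else {ty[1]}
    if kind == 'con':
        free = set()
        for arg in ty[2]:
            free |= free_params(arg, bound)
        return free
    if kind == 'forall':               # the quantifier binds its parameter
        return free_params(ty[2], bound | {ty[1]})
    raise ValueError("unknown type sort: " + kind)

def is_actual(ty):
    """An actual type is a type expression with no free parameters."""
    return not free_params(ty)
```

For example, 'List($T)' is not actual, while 'Bool' and '?$T List($T)' are.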
*** Data definitions.
The purpose of a datum definition is to associate a meaning with one or several signs.
This is also essential for recursion. Signs are of 4 sorts:
- nullary signs (also called ``simple symbols''),
- unary signs,
- binary signs,
- declarative signs.
The sort of a sign is recognizable from the sign itself, except that some signs are
``ambiguous''. Such signs may be unary in some circumstances, and binary in others. The
well-known example is the ``minus sign'', which may be used for constructing unary
terms like '-456' and binary terms like '521 - 76'.
The heart of a datum definition is a term which is called the ``pattern'' of the
definition. The pattern is made only of signs. Some of these signs are declared in the
preamble of the definition, others are not. Those which are not declared are
considered as constants, and each constant found in the pattern receives a definition
(several definitions if it appears several times). This system makes the syntax of
Anubis 2 very flexible. Details of this mechanism are explained in the file
'define_check_2'.
Actually, there is yet another kind of datum definition, that of ``converter''. This
kind of definition has no pattern at all in the source files, but the internal pattern
'internal_symbol(convert)' is implicit. This is part of the automatic conversion system
between types, a very important feature for the users.
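A common way to resolve an ambiguous sign such as '-' is to look at the token that
precedes it. The sketch below is one plausible rule, with hypothetical token kinds;
the compiler's actual disambiguation is described elsewhere.

```python
# Hypothetical rule: a '-' after something that can end a term is binary;
# otherwise (start of a term, after an opening parenthesis or an
# operator) it is unary.  Token kinds here are illustrative.

def sign_arity(prev_token):
    """Return 'unary' or 'binary' for an ambiguous sign, given the
    (kind, value) token preceding it, or None at the start of a term."""
    if prev_token is None:
        return 'unary'                       # '-456' at the start
    kind, _ = prev_token
    if kind in ('number', 'ident', 'close_paren'):
        return 'binary'                      # '521 - 76'
    return 'unary'                           # '(-456)', '3 * -4'
```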
*** Weak expressions.
Weak expressions are just a structuring of the original expressions from the source
files into a tree form. They contain ``position'' information (file path, line number,
column number). Weak expressions may be ambiguous, due to simple and parametric
polymorphism.
*** Strong expressions.
Strong expressions are combinatorial expressions (there are no symbols, but context
depth information, a.k.a. De Bruijn indices), and ambiguities are made explicit. Their
structure is based on the type system, i.e. they mimic the notions of constructors and
destructors of the type system.
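The replacement of named symbols by context-depth information can be illustrated on
lambda terms: each variable becomes the number of binders between its occurrence and
its own binder. The term encoding below is a standard textbook one, not the compiler's.

```python
# Terms (illustrative encoding): ('var', name), ('lam', name, body),
# ('app', function, argument).  In the output, ('var', n) means the
# variable bound by the (n+1)-th enclosing binder (0 = innermost).

def to_de_bruijn(term, env=()):
    """Convert a named term into its De Bruijn (context depth) form."""
    kind = term[0]
    if kind == 'var':
        return ('var', env.index(term[1]))        # depth of the binder
    if kind == 'lam':
        # The bound name becomes the new innermost entry of the context.
        return ('lam', to_de_bruijn(term[2], (term[1],) + env))
    if kind == 'app':
        return ('app', to_de_bruijn(term[1], env),
                       to_de_bruijn(term[2], env))
    raise ValueError("unknown term sort: " + kind)
```

For example, the term "lambda x. lambda y. x y" becomes "lam (lam (1 0))": names
disappear, only depths remain.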
*** Knowledge base.
The compiler has a 'knowledge base', which is made of a family of 'dictionaries'. Each
dictionary is actually a table of so-called 'entries', together with tools for using
the dictionary. Each entry is uniquely identified by its position in the table,
i.e. an integer which is called its 'id'. Among the tools attached to a dictionary, we
have:
- A command for adding a new entry (the dictionary extends automatically if needed).
- A command for getting the entry for a given 'id'.
- One or several 'indexes' for quickly finding an entry, for example from its name.
- A command for saving a dictionary into a file in order to use it next time.
- A command for retrieving a dictionary from a file, discarding out of date entries.
We never have to remove entries from a dictionary (except when retrieving a
dictionary). Indexes are realized by hashtables or B-trees. When a dictionary is
retrieved entry by entry, the out of date entries are discarded. Because of this
discarding, ids change. During the retrieval of each dictionary, a hashtable is
constructed giving the correspondence between previous ids and new ids. This is
used when retrieving subsequent dictionaries. When all dictionaries are retrieved,
these hashtables are discarded.
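The shape of such a dictionary can be sketched as follows. The class and its methods
are hypothetical names; a plain Python dict stands in for the hashtable or B-tree
index.

```python
class Dictionary:
    """Append-only table of entries; each entry's id is its index in the
    table.  A name index gives fast lookup, and polysemy means several
    entries may share one name."""

    def __init__(self):
        self.entries = []
        self.by_name = {}              # index: name -> list of ids

    def add(self, name, payload):
        """Add a new entry; the table extends automatically."""
        entry_id = len(self.entries)
        self.entries.append((name, payload))
        self.by_name.setdefault(name, []).append(entry_id)
        return entry_id

    def get(self, entry_id):
        """Get the entry for a given id."""
        return self.entries[entry_id]

    def find(self, name):
        """All ids registered under this name."""
        return self.by_name.get(name, [])
```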
The dictionaries are the following:
- The 'read modules dictionary' (mod_dict) contains for each entry:
-- The non ambiguous name of the module (absolute path for a source file).
-- The date of reading (date at which the module has been read by the compiler).
-- A flag indicating if the module is currently correct (no error).
- The 'types dictionary' (type_dict) contains for each entry:
-- The id of the module where the type (or type schema) is defined.
-- The list of positions of the paragraphs defining the type in the module.
-- The name of the type.
-- The list of parameters.
-- The checked definition of the type (in the form of a 'strong type').
- The 'data dictionary' (data_dict) contains for each entry:
-- The id of the module where the datum is defined.
-- The list of type parameters for this definition.
-- The name of the sign which is defined by this entry.
-- The signature of the definition (strong meta-type).
-- Maybe the value (a strong term) of the definition if it is not just a declaration.
- The 'actual types implementations dictionary' (implem_dict) contains for each entry:
-- The list of ids of the modules the entry depends on.
-- The actual type itself (a strong type).
-- The implementation.
- The 'closed terms realizations dictionary' (code_dict) contains for each entry:
-- The list of ids of the modules the entry depends on.
-- The id of the datum which is realized.
-- The values of type parameters for this instance (list of strong types).
-- The realization (symbolic code).
The dictionary 'mod_dict' is retrieved first. For each entry, the 'date of reading'
field is compared with the current date (of last modification) of the module. If the
module is out of date, it is discarded. Hence, only those modules which have not been
modified since the previous compilation are present in 'mod_dict', and considered as
'already read'.
Next, 'type_dict' and 'data_dict' are retrieved. Entries for which the new id of the
module is not present in 'mod_dict' are discarded. For other entries, ids are updated,
i.e.:
- The previous id of the module is replaced by the new one.
- The strong types, meta-types and terms appearing within the entry are updated,
because they contain type ids and data ids.
Finally, 'implem_dict' and 'code_dict' are retrieved. If one of the old module ids
in their dependencies list corresponds to a discarded module, they are discarded.
Strong data are updated (old ids replaced by new ids).
After this, the hashtables for the correspondence between old and new ids are
discarded, and the new compilation may begin.
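The entry-by-entry retrieval with id remapping can be sketched as follows; the
function name and the predicate are illustrative, not the compiler's actual interface.

```python
def retrieve(old_entries, is_out_of_date):
    """Retrieve a dictionary entry by entry, discarding out of date
    entries.  Returns the kept entries and the old-id -> new-id table
    used to update ids in subsequently retrieved dictionaries."""
    kept, remap = [], {}
    for old_id, entry in enumerate(old_entries):
        if is_out_of_date(entry):
            continue                   # discarded: no new id is assigned
        remap[old_id] = len(kept)      # new id = new position in the table
        kept.append(entry)
    return kept, remap
```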
Entries belonging to incorrect modules are not saved, but this does not change the ids.
*** Symbolic code.
The symbolic code is obtained from the strong expressions by ``compiling'' them. The
symbolic code is made of instructions working on top of a stack.
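A stack-based instruction set can be illustrated by the following tiny interpreter.
The instructions (PUSH, ADD, MUL) are hypothetical, chosen only to show the principle
of instructions working on top of a stack.

```python
def run(code):
    """Interpret a list of symbolic instructions on a stack and return
    the final stack.  The instruction set is illustrative only."""
    stack = []
    for op, *args in code:
        if op == 'PUSH':
            stack.append(args[0])
        elif op == 'ADD':
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == 'MUL':
            b, a = stack.pop(), stack.pop()
            stack.append(a * b)
        else:
            raise ValueError("unknown instruction: " + op)
    return stack
```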
*** Memory management.
The Anubis 2 language manages the memory automatically. This is required for preserving
the safety of the language. This is essentially an increment/decrement system, which
generates 'vcopy' (virtual copy) and 'vdelete' (virtual delete) instructions in the
symbolic code. Many of these instructions disappear during the peephole optimization
of the symbolic code.
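One such peephole rule can be sketched as follows: a 'vcopy' immediately followed by a
'vdelete' of the same value cancels out. This is a single illustrative rule with a
hypothetical instruction encoding; the real optimizer applies many more.

```python
def peephole(code):
    """Remove each 'vcopy' immediately followed by a 'vdelete' of the
    same slot.  Instructions are (name, slot) pairs; other instructions
    pass through unchanged."""
    out = []
    for instr in code:
        if (out and instr[0] == 'vdelete'
                and out[-1] == ('vcopy', instr[1])):
            out.pop()                  # the vcopy/vdelete pair cancels
        else:
            out.append(instr)
    return out
```

Note that only the vcopy-then-vdelete order is cancelled: a vdelete followed by a
vcopy is left alone, since removing it could change when the value is freed.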