Propositions pour l'amélioration d'Anubis.
*** réorganisation des librairies et plus de lib
POP3 ? Y a-t-il un volontaire ?
*** accès aux lib extérieurs dll avec des appels comme en haskell
Proposer une liste de types de communication Anubis <--> C
*** tutoriel comme celui d'haskell de Hal Daumé III
*** logo original
*** String unicode
Unicode ou UTF8 ? Il m'avais semblé qu'on avait plutôt parlé d'UTF8. On peut
peut être utiliser les deux suivant les cas. Par exemple, Unicode en interne et UTF8
dès qu'on sort sur disque ou sur le réseau.
Par exemple, les fichiers sources Anubis, comment seront-ils lus ?
I have written here some ideas for enhancing Anubis 1. I put as much details as
possible, but there is no warranty that everything is correct. Everything should be
checked from the sources files by anyone making a transformation in the program.
After just one day of writting in this file, I realize that Anubis is the result of so
many successive transformations that it contains a lot of dead code, dead concepts and
unnecessary instructions. Because of this it is far from optimal, and it seems that it
will probably be easier (and safer) to make a new system from scratch rather than
correct all defauts of this one.
*** Removing obsolete things.
Anubis 1 contains a number of obsolete things:
*** (Obsolete 1) Old way of opening files and connections.
At the beginning of Anubis, a file was opened using the syntactical construct:
if (Maybe(RAddr(Int8)))connect to file is
{
...
}
which has been replaced by the primitive 'file', defined in 'predefined.anubis'. We
have the same for TCP/IP connections:
if (Result(NetworkConnectError,RWAddr(Int8)))connect to network ip_address:ip_port is
{
...
}
which has been replaced by the primitive 'connect'.
In order to make some cleanup, we must do the following:
- transform 'predefined.anubis' so as to use the new paradigm everywhere.
- transform file in the library which are still using the old paradigm.
- remove the key locutions 'connect to file' and 'connect to network' from 'lexer.y'.
at that point test if everything works well.
- remove the corresponding grammar rules from 'grammar.y'.
- remove the corresponding interpretation functions from 'interp.c'.
- remove the corresponding code generation from 'compile.c'.
- remove the obsolete messages from 'msgtext.c'.
- remove the unnecessary instructions from 'bytecode.h' and 'vminstr.c'.
- remove the corresponding declarations from 'compil.h'.
- remove the corresponding instruction from the virtual machine.
*** (Obsolete 2) Removing the concept of (semi)-global variable.
Anubis 1 has a concept of 'global' variable (almost the same as in C), but which is not
as 'global' as one may think. Indeed, when a new machine is started (by 'delegate') all
the global variables are duplicated, so that the variables in the old and the new
machine are not the same ones. So, such variables are not really global, but only
global relative to one machine. Furthermore, when the new machine is started, these
variables receive the initial value as decribed in the source file, not the current
value.
This concept is clearly obsolete. Dynamic variables are much better, because they are
more flexible, and allow the sharing of data and communication between virtual
machines. Also the semantic is simpler.
How to remove this concept ? First of all, it should be removed from all the files of
the library. Everywhere a global variable is used, we should use a dynamic variable
instead. Global variables are declared (defined) by paragraphs beginning by the keyword
'variable'. Such variables are used in the following library files:
doc_tools/maml.anubis (1 times)
graphism/image_tools.anubis (many times)
graphism/lzw_gif.anubis (many times)
web/cookies.anubis (4 times)
web/html.anubis (5 times)
web/http_server.anubis (2 times)
web/making_a_web_site.anubis (1 time)
web/multihost_http_server.anubis (3 times)
web/read_html.anubis (5 times)
web/to_html.anubis (4 times)
examples/network/sokoban.anubis (1 time)
A technique for replacing 'global' variables by dynamic variables is the
following. Define a type of 'objects', with just one alternative and with one component
for each variable. Each component will be of type 'Var(...)'. In the main function
(the starting point of the program using the variables), create an instance of this
object (with the right initial values). Then, transmit this object from functions to
functions.
Another technique (maybe more subtle) is to create the dynamic variables, and to do in
such a way that all functions using these variables are defined by an arrow '|->' in a
context where these variables are visible.
Actually, each situations needs probably a special decision. For example, create
several objects instead of one, and mix the two methods above.
When this is done, delete the initial keyword 'variable' (and 'public variable') from
'lexer.y', and test if everything works well.
After this, we may make some cleanup in the compiler and the virtual machine.
- in 'grammar.y': remove the tokens 'yy__variable' and 'yy__p_variable', and the
grammar rules using them. Also remove the non terminal 'VarKW' and the grammar rule
using them.
- the file 'new_var.c' become entirely obsolete.
- the messages E001 et E002 disappear.
- the primitive type 'GAddr' disappears, together with its internal version
'type_GAddr'. Also the token 'yy__GAddr' disappears. This induces some cleanup in
'compil.h', 'compile.c', 'delcode.c', 'grammar.y', 'implem.c', 'lexer.y', 'typedef.c',
'typetools.c'.
- the instructions 'gv_address', 'get_gvv', 'xchg_gvv', 'init_gv, 'del_gv'
disappear. This will induce much cleanup in 'output.cpp', because the special code for
initializing, and deleting global variables will disappear.
*** (Obsolete 3) Removing the 'succeeds as' construct.
There are two syntactical construct:
if x succeeds as y then t
if y succeeds then t
(where x is of type Maybe(?)), which are just abbreviations for:
if x is
{
failure then failure,
success(y) then t
}
if x is
{
failure then failure,
success(_) then t
}
This construct is present in the following files (where it should just be replaced by
its expansion).
tools/maybefloat.anubis
tools/table.anubis
When these easy corrections are made, it is just enough to remove the tokens
'yy__succeeds' and 'yy__succeeds_as' from 'lexer.y' and 'grammar.y', and the
corresponding grammar rules.
There is another notion which replaces this one. It is the 'Kleisli composition'. It is
inspired by the monads of Haskell, and defined in
'library/syntactical_analysis/read_grammar.anubis'. It should maybe be put in some new
file in 'library/tools/'.
*** (Obsolete 4) The syntax: checking every ... wait for ...
This special syntax is used for defining the command 'sleep' (in 'tools/basis.anubis'):
public define One
sleep
(
Int32 ms // number of milliseconds to sleep
) =
checking every ms milliseconds, wait for true then unique.
If instead 'sleep' is a primitive, we may define:
checking every n milliseconds, wait for t then u
as:
wait(n,(One _) |-> t); u
where 'wait' would be defined as:
define One
wait
(
Int32 cheking_period,
One -> Bool test
) =
sleep(checking_period);
if test(unique)
then unique
else wait(checking_period,test).
Another (maybe better) solution could be just to replace 'checking every ...' by a
primitive construct (of type One) like this (where 'test' is of type 'Bool'):
wait_for(n,test) or wait_for(test,n)
which executes 'test' every 'n' milliseconds until it becomes 'true'.
Actually, the problem here is that 'wait_for(n,test)' cannot be just a primitive,
because if it is a primitive, 'test' is executed only once, and the result of this
execution transmitted to 'wait_for' ('call by value' mechanism). We need to simulate
'call by necessity', by replacing a test of type 'Bool' by a test of type 'One ->
Bool', which may be executed as many times as we want.
*** (Obsolete 5) Local variables.
This was an idea of the beginning, but it has never be pushed beyond the simple
intention. The idea was similar to:
with x = a, t
except that 'x' would have been a true variable in the sens that commands like 'x <- b'
would have been accepted in the term 't'. This is of course a non deterministic
concept, and it appears after 5 years of use of Anubis that it is of no use.
Nevertheless, this idea leaved some dead code in the source files of the compiler and
of the virtual machine. The following instructions should be removed (from 'vm.cpp',
'vmtools.cpp', 'bytecode.h'):
new_locvar
read_locvar
write_locvar
Warning: the Lisp tag 'local' in the compiler refers to local 'symbols' in the context,
not to local 'variables', hence should not be removed.
Actually, local variables if they were to exist, would be of type 'RWAddr(?)'. This is
why instructions dealing with these types have in general a case for local
variables. This case should be removed.
The constant 'conn_local' defined in 'vm.h' is a tag for local variables at run time.
*** Transforming (fake) structural instruction into system calls.
This is an easy transformation which should be performed as soon as possible. The
virtual machine executes 2 kinds of instructions:
- structural instructions,
- system calls.
The official list of instructions and system calls may be found in
'anubis_dev/include/bytecode.h'.
However, since system calls have been introduced quite late, there are still some
instructions which should be transformed into system calls. They are (at least) the
following:
print_string
print_int32
connect_file_R
connect_file_W
connect_file_RW
connect_IP_RW
read_int8
write_int8
implode
explode
int8_to_int32
truncate_to_int8
now
convert_time_from_int
convert_time_to_int
listener
accept_connection
listener_shutdown
listener_is_down
alt_number_direct
alt_number_indirect
do_alert
create_var
create_mvar
get_vv
xchg_vv
byte_array_to_ascii
byte_array_to_string
dns
write_file
read_file
file_size
byte_array_length
sha1_hash
get_file_mode
set_file_mode
xchg_mvv
get_mvv
This will make some room for new instructions.
In order to transform such an instruction into a system call, proceed as follows.
- within 'vm.cpp' inhibit the instruction with a:
#define TO_SYSCALL
...
#endif
- In 'bytecode.h' remove the name of the instruction from 'normal_instructions_list',
and add it to 'syscall_list'. Of course 'item' must be changed into 'sc_item'. The
second argument of 'sc_item' is a set of ORed flags. These flags are prefixed by
'mf_using_' and defined in 'bytecode.h'. Put them as needed.
- create a new case in 'syscall.cpp'. The code of the instruction may be copied as
is, except for the following:
- the syscall instruction has size 3. Hence, it must terminate with:
MAM(m_IP) += 1+2;
(see the already existing syscalls).
- remove the instruction from 'vmtools.cpp', 'vminstr.c', and maybe also from
'compil.h'.
*** Introducing an heterogeneous stack.
A big problem with Anubis 1 is that any datum occupies always 32 bits in the stack.
One consequence is that when a datum needs more than 32 bits, it must be implemented
indirectly, i.e. using a pointer to an allocated segment.
For Anubis 1, it is probably not reasonable to consider a completely heterogeneous
stack, i.e. a stack within which we may push data of any number of bytes. Nevertheless,
it would be nice to be able to push data of 64 bits. This will allow 'direct' 64 bit
integers, and allow 'direct' 'IEEE754' numbers.
The virtual machine has two registers:
'R' (like 'Result') of 32 bits
'I' (like 'Index') of 8 bits
It would be necessary to extend 'R' to 64 bits. When the result is 32 bits wide, the 32
highest bits would be just ignored.
However, the (structural) instructions of the virtual machine are most often using the
stack. Most of them should be adjusted (or maybe duplicated) so as to accomodate 64
bits slots in the stack. On the contrary, system calls make no problem, except that we
may have to invent new system calls for handling 64 bits direct data. Below are the
instructions which have something to do with the stack in one way or another. This
means almost all instructions.
check_stack
push
pop1
pop2
pop3
peek
micro_peek
collapse
glue
store_0
store_1
store_2
store_4
unglue
unstore_0
unstore_1
unstore_2
unstore_4
unstore_copy
unstore_copy_mixed
unstore_copy_ptr
unstore_copy_function
push_addr
apply
put_copy_direct
put_micro_copy_direct
put_copy_indirect
put_micro_copy_indirect
put_copy_function
put_micro_copy_function
put_copy_mixed
put_micro_copy_mixed
ret
free_seg_1
free_seg_0
create_var
create_mvar
push_mvar_length
mvar_slots_del
mvar_slots_del_var
mvar_slots_del_mvar
mvar_slots_del_mixed
mvar_slots_del_conn
mvar_slots_del_ptr
mvar_slots_del_function
mvar_slots_del_struct_ptr
free_mvar_seg
xchg_mvv
free_var_seg
get_vv
get_mvv
xchg_vv
remove_monitor
get_var_monitors
get_mvar_monitors
ret_if_zero
get_var_handler
get_mvar_handler
del
del_mixed
del_stack
del_stack_mvar
del_stack_mixed
del_stack_ptr
del_stack_function
del_stack_struct_ptr
del_stack_conn
del_index_direct
del_index_indirect
increment_del
indirect_del
indirect_del_mvar
indirect_del_mixed
indirect_del_ptr
indirect_del_function
del_function
indirect_del_struct_ptr
indirect_del_conn
eq
eq_string
eq_byte_array
push_eq_data
push_before_eq (this one is dead)
increment_eq
jmp_eq_stack
jmp_neq_1
jmp_neq_2
jmp_neq_4
jmp_neq_indexes_large
jmp_neq_indexes_mixed
jmp_neq_string
jmp_neq_byte_array
dec3 (this one is dead)
start
copy_stack_ptr
copy_stack_function
copy_stack_mixed
serialize
unserialize
*** Int32, Int64 and the like.
I took some time to think about these types. When someone says (like the Polish guy on
comp.lang.functional) that computing 10^10 makes an error, he is somewhat right. Of
course, we could argue that the computation is made modulo 2^32, and that the name of
the type is precisely not 'Int' but 'Int32' meaning that the numbers we are considering
are integers modulo 2^32.
Ok, but this is not what users are waiting for. They want true integers, and consider
the computation modulo 2^32 as an overflow not as a normal computation. As a
consequence, we need to introduce true natural numbers of arbitrary size, not only of
32 or 64 or even 128 bits, but of any number of bits. Such a type (of positive integers
only) is generally named 'Nat' (for 'Naturals').
Notice that processors are manipulating fixed size numbers (of 16, 32 bits or more),
but actually, theses data are not numbers. They are 'bigits' (a contraction for 'big
digits'). I mean that processors are computing with these bigits exactly as we, the
humans, are computing with 'decimal digits'. The only difference is that we are
computing in numeration basis 10, while processors are computing in numeration basis
2^16, 2^32 or more.
As an example, when we are adding two digits, say 7 and 4, we get 11, hence actually
the digit 1, and a carry of 1. When we are multiplying two digits, say 6 and 7, we get
42, hence two digits, one for the unities (in this case 2) and one for the tens (in
this case 4). When we are dividing by a digit, we divide a block of two digits. For
example,
45 | 7
+---
3 | 6
But we never do that if the divisor is smaller than or equal to the first digit, like
in this example:
45 | 2
+---
1 | 22
We should have instead divided 4 by 2, i.e. not considered the last digit 5 in a first
step. We would have instead divided in two steps:
45 | 2
+---
05 | 22
1 |
This is how we compute, and this is exactly how processors also compute, but with
bigits instead of digits. For example, the 80386 processor (and its successors
likewise) has an instruction DIV which divides the two bigits number EDX:EAX by the
bigit (say) EBX, producing a one bigit quotient in EAX and a one bigit remainder in
EDX. If EBX is smaller than or equal to EDX, the processor launches an exception.
So, it should be clear that 16 bits, 32 bits or 64 bits words manipulated by processors
are not numbers, but bigits. A number (natural integer of course) may be represented
as a finite sequence of bigits. As a consequence, processors are very well designed,
but programming languages, like C and many others have decided that a number would have
only one bigit, and consequently that all computations are deliberately false.
For this reason the types which are generally called Int16, Int32, etc... should have
another name, or at least it should be made clear that they are computing in the
strange (local) quotient rings Z/(2^32)Z, etc... which are far from being fields
(rings with all non zero elements invertible), and which are even not integral domains,
since all even numbers are divisors of 0.
Now, we are also very often using these pseudo-numbers for other purposes in which they
are vector spaces on the field Z/2Z. This is the case in cryptography, where the XOR
operation is (correctly) identified to the sum in the vector space (Z/2Z)^32.
We also use them just as bit field. This is the case when we use the left or right
shift operators, or when we are masking bits using the bitwise OR or the bitwise
AND. For example, it is the case in several cryptographic algorithms and also when
these pseudo-integers are interpreted as colors in graphical applications.
Consequently, types like 'Int32' are lumber rooms which are interpreted (more or less
consciously) in several different ways. For this reason, I think that these types
should at least be cloned in several versions with appropriate names, and only
the meaningful operations for each clone (and maybe conversions between these clones).
How to implement 'Nat' ? We have at least the choice between:
- 'Nat' is a primitive type,
- Introduce primitive types for bigits, and defined 'Nat' using them.
For example, addition (of bigits) should be typed as follows:
define (Bit,Bigit32) Bigit32 x + Bigit32 y.
where the result is a pair made of the resulting carry and bigit. Similarly, the
multiplication should be typed as follows:
define (Bigit32,Bigit32) Bigit32 x * Bigit32 y.
producing two bigits for the multiplication of two bigits. This is nothing else than an
implementation of the 'table of multiplication' for bigits, analogous to the 'table of
multiplication' for digits:
0 1 2 3 4 5 6 7 8 9
---+----+----+----+----+----+----+----+----+----+----+
0 | 00 | 00 | 00 | 00 | 00 | 00 | 00 | 00 | 00 | 00 |
---+----+----+----+----+----+----+----+----+----+----+
1 | 00 | 01 | 02 | 03 | 04 | 05 | 06 | 07 | 08 | 09 |
---+----+----+----+----+----+----+----+----+----+----+
2 | 00 | 02 | 04 | 06 | 08 | 10 | 12 | 14 | 16 | 18 |
---+----+----+----+----+----+----+----+----+----+----+
3 | 00 | 03 | 06 | 09 | 12 | 15 | 18 | 21 | 24 | 27 |
---+----+----+----+----+----+----+----+----+----+----+
4 | 00 | 04 | 08 | 12 | 16 | 20 | 24 | 28 | 32 | 36 |
---+----+----+----+----+----+----+----+----+----+----+
5 | 00 | 05 | 10 | 15 | 20 | 25 | 30 | 35 | 40 | 45 |
---+----+----+----+----+----+----+----+----+----+----+
6 | 00 | 06 | 12 | 18 | 24 | 30 | 36 | 42 | 48 | 54 |
---+----+----+----+----+----+----+----+----+----+----+
7 | 00 | 07 | 14 | 21 | 28 | 35 | 42 | 49 | 56 | 63 |
---+----+----+----+----+----+----+----+----+----+----+
8 | 00 | 08 | 16 | 24 | 32 | 40 | 48 | 56 | 64 | 72 |
---+----+----+----+----+----+----+----+----+----+----+
9 | 00 | 09 | 18 | 27 | 36 | 45 | 54 | 63 | 72 | 81 |
---+----+----+----+----+----+----+----+----+----+----+
(Notice that I put two digits in each box.)
and of course the division should be typed:
define Maybe((Bigit32 quotient, Bigit32 remainder))
(Bigit32,Bigit32) x / Bigit32 y.
where the result is 'failure' when 'y' is smaller than the high order bigit of 'x'.
Now the type of natural integers could (maybe) be defined as follows:
type Nat:
zero
non_zero(Bigit32 first_bigit, Nat other_bigits).
However, it does not work because the first bigit should never be 0, otherwise the
representation is not one to one. Hence, the 'true' type of natural integers is
actually a quotient of this one, which just means that this is the right type, but that
equality for this type has to be defined in a special way.
For the time being, the Anubis compiler produces an equality code for each type. In
order to make 'Nat' as defined above a 'true' type of natural integers, we would have
to redefine this equality.
Well, this is maybe not the right thing to do. It's probably better to implement 'Nat'
directly as a primitive type. Of course, this means that we need to write several
operations in assembly, because C does not give enough access to the processor (carry,
etc...).
*** IEEE754 floating point numbers.
The current implementation of the type 'Float' is inefficient for two main reasons:
- since a 'Float' occupies 64 bits, it is implemented as a pointer to a segment of 12
bytes. The first 4 bytes contain the counter for the garbage-collector, and the next 8
bytes contain the 'double precision' floating point number as represented by the
processor.
- The current implementation does not take into account the fact that the
representation of an IEEE754 floating point number includes special cases for too big
or too small numbers. As a consequence, we have (for example) the following definition
in 'predefined.anubis':
public define Maybe(Float) Float x + Float y = £avm{ float_plus }.
which could have been:
public define Float Float x + Float y = £avm{ float_plus }.
if this fact was taken into consideration.
The result of this is that we are always computing with 'Maybe(Float)' instead of
'Float', and that this makes a double indirection in the implementation. The result is
that computations are very slow, as shown by the Mandelbrot program.
For these reasons, it would be nice to create another new primitive type: 'IEEE754'. A
datum of this type should be implemented directly (no indirection) as a 64 bits word.
We could have several primitive predicates for testing an IEEE754 number:
define Bool is_too_big(IEEE754 x). // 'infinity'
define Bool is_too_small(IEEE754 x). // 'denormal'
define Bool is_not_a_number(IEEE754 x).
etc... (according to the IEEE754 specification)
We could also have a type like this one:
type IEEE754_Sort:
too_big,
too_small,
normal,
not_a_number.
and a primitive predicate:
define IEEE754_Sort sort(IEEE754 x).
*** What for the future ?
*** Working only on top of stack.
Designing the virtual machine with a register ('R') for holding the results of pieces
of code was a bad idea. It would certainly be better to work only on top of the
stack. There are several reasons for that:
- In most cases, just after a datum has been computed and put in 'R' it is pushed on
the stack (you will be easily convinced of that after a look at a .sc file).
- Working on the stack only is more symmetric in the sens that as well as functions
may take several arguments, they could return several results instead of just one. A
good example is the euclidian division. For the time being, it is defined in
'basis.anubis' as follows:
public define (Int32, // quotient
Int32) // remainder
euclid
(
Int32 a, // divide 'a' by 'b'
Int32 b
) =
...
i.e. it takes two integers as arguments and returns two integers. The type 'Int32' has
a 'direct' implementation, while '(Int32,Int32)' is indirect (it is a pointer to a
segment of 12 bytes: counter, first integer, second integer). This segment is allocated
for each division. If all the computations were done on the stack, it would be
possible to construct the resulting pair '(quotient,remainder)' on the stack without
allocating any segment.
Of course, this symmetry is very 'categorical'. In a cartesian category, we have arrows
like:
A x B x C ---------> D x E
for example, which may represent a function taking 3 arguments, and returning 2
results. Notice that 'Common Lisp' has this notion of multiple return values. Anyway, I
don't want to mimic Common Lisp whose design is, to my opinion, one of the worst things
ever conceived, an horrible denaturation of the original Lisp language.
When we have a tuple to implement, we have two possibilities, even if all components of
the tuple have a direct implementation.
1. We can implement the tuple directly. This means that if the tuple (say '(a,b,c)')
is supposed to be at position 'p' in the stack, 'a' will be at position 'p', 'b' will
be at position 'p+size(a)', and 'c' at position 'p+size(a)+size(b)'.
2. we can implement the tuple indirectly. This means that if the tuple is supposed to
be at position 'p' in the stack, we have a pointer at position 'p', and the data
representing 'a', 'b' and 'c' are in the segment pointed to by this pointer.
These two methods are just two ways of implementing categorical products. From a
categorical viewpoint this means that in a cartesian category we may have two 'product'
functors. Of course, these two functors are naturally isomorphic. In practice, this
means that a datum represented using the first method may be converted into a datum
represented using the second method, and conversely, without loosing information.
This also means that the two methods can be used together in the same program. We could
for example use the first one for anonymous products, and the second one for named
products. We could also use the first one for products whose bit width is not too big
(for example not bigger than 128 bits or 256 bits) and the second one for bigger
products.
*** meilleur support de SQLite3 (type de colonnes, etc...)
*** désallocation de la mémoire.
*** Simplification de la grammaire Anubis:
- Suppression du mot clé 'operation' de la grammaire ('define' suffit)
- ne plus avoir la possibilité de mettre des majuscules à certains mots clé
(Read / read, Define / define, ...)
- Obliger à délimiter les commentaires par des marqueurs (sinon, le parsing du code
est un vrai casse-tête)
- suppression de la grammaire Anubis 2
*** Amélioration de la grammaire Anubis:
- Ajout de guillemets encadrant le chemin du 'read' afin de gerer les chemins
contenant des espaces