Propositions pour l'amélioration d'Anubis. *** réorganisation des librairies et plus de lib POP3 ? Y a-t-il un volontaire ? *** accès aux lib extérieurs dll avec des appels comme en haskell Proposer une liste de types de communication Anubis <--> C *** tutoriel comme celui d'haskell de Hal Daumé III *** logo original *** String unicode Unicode ou UTF8 ? Il m'avais semblé qu'on avait plutôt parlé d'UTF8. On peut peut être utiliser les deux suivant les cas. Par exemple, Unicode en interne et UTF8 dès qu'on sort sur disque ou sur le réseau. Par exemple, les fichiers sources Anubis, comment seront-ils lus ? I have written here some ideas for enhancing Anubis 1. I put as much details as possible, but there is no warranty that everything is correct. Everything should be checked from the sources files by anyone making a transformation in the program. After just one day of writting in this file, I realize that Anubis is the result of so many successive transformations that it contains a lot of dead code, dead concepts and unnecessary instructions. Because of this it is far from optimal, and it seems that it will probably be easier (and safer) to make a new system from scratch rather than correct all defauts of this one. *** Removing obsolete things. Anubis 1 contains a number of obsolete things: *** (Obsolete 1) Old way of opening files and connections. At the beginning of Anubis, a file was opened using the syntactical construct: if (Maybe(RAddr(Int8)))connect to file is { ... } which has been replaced by the primitive 'file', defined in 'predefined.anubis'. We have the same for TCP/IP connections: if (Result(NetworkConnectError,RWAddr(Int8)))connect to network ip_address:ip_port is { ... } which has been replaced by the primitive 'connect'. In order to make some cleanup, we must do the following: - transform 'predefined.anubis' so as to use the new paradigm everywhere. - transform file in the library which are still using the old paradigm. - remove the key locutions 'connect to file' and 'connect to network' from 'lexer.y'. at that point test if everything works well. - remove the corresponding grammar rules from 'grammar.y'. - remove the corresponding interpretation functions from 'interp.c'. - remove the corresponding code generation from 'compile.c'. - remove the obsolete messages from 'msgtext.c'. - remove the unnecessary instructions from 'bytecode.h' and 'vminstr.c'. - remove the corresponding declarations from 'compil.h'. - remove the corresponding instruction from the virtual machine. *** (Obsolete 2) Removing the concept of (semi)-global variable. Anubis 1 has a concept of 'global' variable (almost the same as in C), but which is not as 'global' as one may think. Indeed, when a new machine is started (by 'delegate') all the global variables are duplicated, so that the variables in the old and the new machine are not the same ones. So, such variables are not really global, but only global relative to one machine. Furthermore, when the new machine is started, these variables receive the initial value as decribed in the source file, not the current value. This concept is clearly obsolete. Dynamic variables are much better, because they are more flexible, and allow the sharing of data and communication between virtual machines. Also the semantic is simpler. How to remove this concept ? First of all, it should be removed from all the files of the library. Everywhere a global variable is used, we should use a dynamic variable instead. Global variables are declared (defined) by paragraphs beginning by the keyword 'variable'. Such variables are used in the following library files: doc_tools/maml.anubis (1 times) graphism/image_tools.anubis (many times) graphism/lzw_gif.anubis (many times) web/cookies.anubis (4 times) web/html.anubis (5 times) web/http_server.anubis (2 times) web/making_a_web_site.anubis (1 time) web/multihost_http_server.anubis (3 times) web/read_html.anubis (5 times) web/to_html.anubis (4 times) examples/network/sokoban.anubis (1 time) A technique for replacing 'global' variables by dynamic variables is the following. Define a type of 'objects', with just one alternative and with one component for each variable. Each component will be of type 'Var(...)'. In the main function (the starting point of the program using the variables), create an instance of this object (with the right initial values). Then, transmit this object from functions to functions. Another technique (maybe more subtle) is to create the dynamic variables, and to do in such a way that all functions using these variables are defined by an arrow '|->' in a context where these variables are visible. Actually, each situations needs probably a special decision. For example, create several objects instead of one, and mix the two methods above. When this is done, delete the initial keyword 'variable' (and 'public variable') from 'lexer.y', and test if everything works well. After this, we may make some cleanup in the compiler and the virtual machine. - in 'grammar.y': remove the tokens 'yy__variable' and 'yy__p_variable', and the grammar rules using them. Also remove the non terminal 'VarKW' and the grammar rule using them. - the file 'new_var.c' become entirely obsolete. - the messages E001 et E002 disappear. - the primitive type 'GAddr' disappears, together with its internal version 'type_GAddr'. Also the token 'yy__GAddr' disappears. This induces some cleanup in 'compil.h', 'compile.c', 'delcode.c', 'grammar.y', 'implem.c', 'lexer.y', 'typedef.c', 'typetools.c'. - the instructions 'gv_address', 'get_gvv', 'xchg_gvv', 'init_gv, 'del_gv' disappear. This will induce much cleanup in 'output.cpp', because the special code for initializing, and deleting global variables will disappear. *** (Obsolete 3) Removing the 'succeeds as' construct. There are two syntactical construct: if x succeeds as y then t if y succeeds then t (where x is of type Maybe(?)), which are just abbreviations for: if x is { failure then failure, success(y) then t } if x is { failure then failure, success(_) then t } This construct is present in the following files (where it should just be replaced by its expansion). tools/maybefloat.anubis tools/table.anubis When these easy corrections are made, it is just enough to remove the tokens 'yy__succeeds' and 'yy__succeeds_as' from 'lexer.y' and 'grammar.y', and the corresponding grammar rules. There is another notion which replaces this one. It is the 'Kleisli composition'. It is inspired by the monads of Haskell, and defined in 'library/syntactical_analysis/read_grammar.anubis'. It should maybe be put in some new file in 'library/tools/'. *** (Obsolete 4) The syntax: checking every ... wait for ... This special syntax is used for defining the command 'sleep' (in 'tools/basis.anubis'): public define One sleep ( Int32 ms // number of milliseconds to sleep ) = checking every ms milliseconds, wait for true then unique. If instead 'sleep' is a primitive, we may define: checking every n milliseconds, wait for t then u as: wait(n,(One _) |-> t); u where 'wait' would be defined as: define One wait ( Int32 cheking_period, One -> Bool test ) = sleep(checking_period); if test(unique) then unique else wait(checking_period,test). Another (maybe better) solution could be just to replace 'checking every ...' by a primitive construct (of type One) like this (where 'test' is of type 'Bool'): wait_for(n,test) or wait_for(test,n) which executes 'test' every 'n' milliseconds until it becomes 'true'. Actually, the problem here is that 'wait_for(n,test)' cannot be just a primitive, because if it is a primitive, 'test' is executed only once, and the result of this execution transmitted to 'wait_for' ('call by value' mechanism). We need to simulate 'call by necessity', by replacing a test of type 'Bool' by a test of type 'One -> Bool', which may be executed as many times as we want. *** (Obsolete 5) Local variables. This was an idea of the beginning, but it has never be pushed beyond the simple intention. The idea was similar to: with x = a, t except that 'x' would have been a true variable in the sens that commands like 'x <- b' would have been accepted in the term 't'. This is of course a non deterministic concept, and it appears after 5 years of use of Anubis that it is of no use. Nevertheless, this idea leaved some dead code in the source files of the compiler and of the virtual machine. The following instructions should be removed (from 'vm.cpp', 'vmtools.cpp', 'bytecode.h'): new_locvar read_locvar write_locvar Warning: the Lisp tag 'local' in the compiler refers to local 'symbols' in the context, not to local 'variables', hence should not be removed. Actually, local variables if they were to exist, would be of type 'RWAddr(?)'. This is why instructions dealing with these types have in general a case for local variables. This case should be removed. The constant 'conn_local' defined in 'vm.h' is a tag for local variables at run time. *** Transforming (fake) structural instruction into system calls. This is an easy transformation which should be performed as soon as possible. The virtual machine executes 2 kinds of instructions: - structural instructions, - system calls. The official list of instructions and system calls may be found in 'anubis_dev/include/bytecode.h'. However, since system calls have been introduced quite late, there are still some instructions which should be transformed into system calls. They are (at least) the following: print_string print_int32 connect_file_R connect_file_W connect_file_RW connect_IP_RW read_int8 write_int8 implode explode int8_to_int32 truncate_to_int8 now convert_time_from_int convert_time_to_int listener accept_connection listener_shutdown listener_is_down alt_number_direct alt_number_indirect do_alert create_var create_mvar get_vv xchg_vv byte_array_to_ascii byte_array_to_string dns write_file read_file file_size byte_array_length sha1_hash get_file_mode set_file_mode xchg_mvv get_mvv This will make some room for new instructions. In order to transform such an instruction into a system call, proceed as follows. - within 'vm.cpp' inhibit the instruction with a: #define TO_SYSCALL ... #endif - In 'bytecode.h' remove the name of the instruction from 'normal_instructions_list', and add it to 'syscall_list'. Of course 'item' must be changed into 'sc_item'. The second argument of 'sc_item' is a set of ORed flags. These flags are prefixed by 'mf_using_' and defined in 'bytecode.h'. Put them as needed. - create a new case in 'syscall.cpp'. The code of the instruction may be copied as is, except for the following: - the syscall instruction has size 3. Hence, it must terminate with: MAM(m_IP) += 1+2; (see the already existing syscalls). - remove the instruction from 'vmtools.cpp', 'vminstr.c', and maybe also from 'compil.h'. *** Introducing an heterogeneous stack. A big problem with Anubis 1 is that any datum occupies always 32 bits in the stack. One consequence is that when a datum needs more than 32 bits, it must be implemented indirectly, i.e. using a pointer to an allocated segment. For Anubis 1, it is probably not reasonable to consider a completely heterogeneous stack, i.e. a stack within which we may push data of any number of bytes. Nevertheless, it would be nice to be able to push data of 64 bits. This will allow 'direct' 64 bit integers, and allow 'direct' 'IEEE754' numbers. The virtual machine has two registers: 'R' (like 'Result') of 32 bits 'I' (like 'Index') of 8 bits It would be necessary to extend 'R' to 64 bits. When the result is 32 bits wide, the 32 highest bits would be just ignored. However, the (structural) instructions of the virtual machine are most often using the stack. Most of them should be adjusted (or maybe duplicated) so as to accomodate 64 bits slots in the stack. On the contrary, system calls make no problem, except that we may have to invent new system calls for handling 64 bits direct data. Below are the instructions which have something to do with the stack in one way or another. This means almost all instructions. check_stack push pop1 pop2 pop3 peek micro_peek collapse glue store_0 store_1 store_2 store_4 unglue unstore_0 unstore_1 unstore_2 unstore_4 unstore_copy unstore_copy_mixed unstore_copy_ptr unstore_copy_function push_addr apply put_copy_direct put_micro_copy_direct put_copy_indirect put_micro_copy_indirect put_copy_function put_micro_copy_function put_copy_mixed put_micro_copy_mixed ret free_seg_1 free_seg_0 create_var create_mvar push_mvar_length mvar_slots_del mvar_slots_del_var mvar_slots_del_mvar mvar_slots_del_mixed mvar_slots_del_conn mvar_slots_del_ptr mvar_slots_del_function mvar_slots_del_struct_ptr free_mvar_seg xchg_mvv free_var_seg get_vv get_mvv xchg_vv remove_monitor get_var_monitors get_mvar_monitors ret_if_zero get_var_handler get_mvar_handler del del_mixed del_stack del_stack_mvar del_stack_mixed del_stack_ptr del_stack_function del_stack_struct_ptr del_stack_conn del_index_direct del_index_indirect increment_del indirect_del indirect_del_mvar indirect_del_mixed indirect_del_ptr indirect_del_function del_function indirect_del_struct_ptr indirect_del_conn eq eq_string eq_byte_array push_eq_data push_before_eq (this one is dead) increment_eq jmp_eq_stack jmp_neq_1 jmp_neq_2 jmp_neq_4 jmp_neq_indexes_large jmp_neq_indexes_mixed jmp_neq_string jmp_neq_byte_array dec3 (this one is dead) start copy_stack_ptr copy_stack_function copy_stack_mixed serialize unserialize *** Int32, Int64 and the like. I took some time to think about these types. When someone says (like the Polish guy on comp.lang.functional) that computing 10^10 makes an error, he is somewhat right. Of course, we could argue that the computation is made modulo 2^32, and that the name of the type is precisely not 'Int' but 'Int32' meaning that the numbers we are considering are integers modulo 2^32. Ok, but this is not what users are waiting for. They want true integers, and consider the computation modulo 2^32 as an overflow not as a normal computation. As a consequence, we need to introduce true natural numbers of arbitrary size, not only of 32 or 64 or even 128 bits, but of any number of bits. Such a type (of positive integers only) is generally named 'Nat' (for 'Naturals'). Notice that processors are manipulating fixed size numbers (of 16, 32 bits or more), but actually, theses data are not numbers. They are 'bigits' (a contraction for 'big digits'). I mean that processors are computing with these bigits exactly as we, the humans, are computing with 'decimal digits'. The only difference is that we are computing in numeration basis 10, while processors are computing in numeration basis 2^16, 2^32 or more. As an example, when we are adding two digits, say 7 and 4, we get 11, hence actually the digit 1, and a carry of 1. When we are multiplying two digits, say 6 and 7, we get 42, hence two digits, one for the unities (in this case 2) and one for the tens (in this case 4). When we are dividing by a digit, we divide a block of two digits. For example, 45 | 7 +--- 3 | 6 But we never do that if the divisor is smaller than or equal to the first digit, like in this example: 45 | 2 +--- 1 | 22 We should have instead divided 4 by 2, i.e. not considered the last digit 5 in a first step. We would have instead divided in two steps: 45 | 2 +--- 05 | 22 1 | This is how we compute, and this is exactly how processors also compute, but with bigits instead of digits. For example, the 80386 processor (and its successors likewise) has an instruction DIV which divides the two bigits number EDX:EAX by the bigit (say) EBX, producing a one bigit quotient in EAX and a one bigit remainder in EDX. If EBX is smaller than or equal to EDX, the processor launches an exception. So, it should be clear that 16 bits, 32 bits or 64 bits words manipulated by processors are not numbers, but bigits. A number (natural integer of course) may be represented as a finite sequence of bigits. As a consequence, processors are very well designed, but programming languages, like C and many others have decided that a number would have only one bigit, and consequently that all computations are deliberately false. For this reason the types which are generally called Int16, Int32, etc... should have another name, or at least it should be made clear that they are computing in the strange (local) quotient rings Z/(2^32)Z, etc... which are far from being fields (rings with all non zero elements invertible), and which are even not integral domains, since all even numbers are divisors of 0. Now, we are also very often using these pseudo-numbers for other purposes in which they are vector spaces on the field Z/2Z. This is the case in cryptography, where the XOR operation is (correctly) identified to the sum in the vector space (Z/2Z)^32. We also use them just as bit field. This is the case when we use the left or right shift operators, or when we are masking bits using the bitwise OR or the bitwise AND. For example, it is the case in several cryptographic algorithms and also when these pseudo-integers are interpreted as colors in graphical applications. Consequently, types like 'Int32' are lumber rooms which are interpreted (more or less consciously) in several different ways. For this reason, I think that these types should at least be cloned in several versions with appropriate names, and only the meaningful operations for each clone (and maybe conversions between these clones). How to implement 'Nat' ? We have at least the choice between: - 'Nat' is a primitive type, - Introduce primitive types for bigits, and defined 'Nat' using them. For example, addition (of bigits) should be typed as follows: define (Bit,Bigit32) Bigit32 x + Bigit32 y. where the result is a pair made of the resulting carry and bigit. Similarly, the multiplication should be typed as follows: define (Bigit32,Bigit32) Bigit32 x * Bigit32 y. producing two bigits for the multiplication of two bigits. This is nothing else than an implementation of the 'table of multiplication' for bigits, analogous to the 'table of multiplication' for digits: 0 1 2 3 4 5 6 7 8 9 ---+----+----+----+----+----+----+----+----+----+----+ 0 | 00 | 00 | 00 | 00 | 00 | 00 | 00 | 00 | 00 | 00 | ---+----+----+----+----+----+----+----+----+----+----+ 1 | 00 | 01 | 02 | 03 | 04 | 05 | 06 | 07 | 08 | 09 | ---+----+----+----+----+----+----+----+----+----+----+ 2 | 00 | 02 | 04 | 06 | 08 | 10 | 12 | 14 | 16 | 18 | ---+----+----+----+----+----+----+----+----+----+----+ 3 | 00 | 03 | 06 | 09 | 12 | 15 | 18 | 21 | 24 | 27 | ---+----+----+----+----+----+----+----+----+----+----+ 4 | 00 | 04 | 08 | 12 | 16 | 20 | 24 | 28 | 32 | 36 | ---+----+----+----+----+----+----+----+----+----+----+ 5 | 00 | 05 | 10 | 15 | 20 | 25 | 30 | 35 | 40 | 45 | ---+----+----+----+----+----+----+----+----+----+----+ 6 | 00 | 06 | 12 | 18 | 24 | 30 | 36 | 42 | 48 | 54 | ---+----+----+----+----+----+----+----+----+----+----+ 7 | 00 | 07 | 14 | 21 | 28 | 35 | 42 | 49 | 56 | 63 | ---+----+----+----+----+----+----+----+----+----+----+ 8 | 00 | 08 | 16 | 24 | 32 | 40 | 48 | 56 | 64 | 72 | ---+----+----+----+----+----+----+----+----+----+----+ 9 | 00 | 09 | 18 | 27 | 36 | 45 | 54 | 63 | 72 | 81 | ---+----+----+----+----+----+----+----+----+----+----+ (Notice that I put two digits in each box.) and of course the division should be typed: define Maybe((Bigit32 quotient, Bigit32 remainder)) (Bigit32,Bigit32) x / Bigit32 y. where the result is 'failure' when 'y' is smaller than the high order bigit of 'x'. Now the type of natural integers could (maybe) be defined as follows: type Nat: zero non_zero(Bigit32 first_bigit, Nat other_bigits). However, it does not work because the first bigit should never be 0, otherwise the representation is not one to one. Hence, the 'true' type of natural integers is actually a quotient of this one, which just means that this is the right type, but that equality for this type has to be defined in a special way. For the time being, the Anubis compiler produces an equality code for each type. In order to make 'Nat' as defined above a 'true' type of natural integers, we would have to redefine this equality. Well, this is maybe not the right thing to do. It's probably better to implement 'Nat' directly as a primitive type. Of course, this means that we need to write several operations in assembly, because C does not give enough access to the processor (carry, etc...). *** IEEE754 floating point numbers. The current implementation of the type 'Float' is inefficient for two main reasons: - since a 'Float' occupies 64 bits, it is implemented as a pointer to a segment of 12 bytes. The first 4 bytes contain the counter for the garbage-collector, and the next 8 bytes contain the 'double precision' floating point number as represented by the processor. - The current implementation does not take into account the fact that the representation of an IEEE754 floating point number includes special cases for too big or too small numbers. As a consequence, we have (for example) the following definition in 'predefined.anubis': public define Maybe(Float) Float x + Float y = £avm{ float_plus }. which could have been: public define Float Float x + Float y = £avm{ float_plus }. if this fact was taken into consideration. The result of this is that we are always computing with 'Maybe(Float)' instead of 'Float', and that this makes a double indirection in the implementation. The result is that computations are very slow, as shown by the Mandelbrot program. For these reasons, it would be nice to create another new primitive type: 'IEEE754'. A datum of this type should be implemented directly (no indirection) as a 64 bits word. We could have several primitive predicates for testing an IEEE754 number: define Bool is_too_big(IEEE754 x). // 'infinity' define Bool is_too_small(IEEE754 x). // 'denormal' define Bool is_not_a_number(IEEE754 x). etc... (according to the IEEE754 specification) We could also have a type like this one: type IEEE754_Sort: too_big, too_small, normal, not_a_number. and a primitive predicate: define IEEE754_Sort sort(IEEE754 x). *** What for the future ? *** Working only on top of stack. Designing the virtual machine with a register ('R') for holding the results of pieces of code was a bad idea. It would certainly be better to work only on top of the stack. There are several reasons for that: - In most cases, just after a datum has been computed and put in 'R' it is pushed on the stack (you will be easily convinced of that after a look at a .sc file). - Working on the stack only is more symmetric in the sens that as well as functions may take several arguments, they could return several results instead of just one. A good example is the euclidian division. For the time being, it is defined in 'basis.anubis' as follows: public define (Int32, // quotient Int32) // remainder euclid ( Int32 a, // divide 'a' by 'b' Int32 b ) = ... i.e. it takes two integers as arguments and returns two integers. The type 'Int32' has a 'direct' implementation, while '(Int32,Int32)' is indirect (it is a pointer to a segment of 12 bytes: counter, first integer, second integer). This segment is allocated for each division. If all the computations were done on the stack, it would be possible to construct the resulting pair '(quotient,remainder)' on the stack without allocating any segment. Of course, this symmetry is very 'categorical'. In a cartesian category, we have arrows like: A x B x C ---------> D x E for example, which may represent a function taking 3 arguments, and returning 2 results. Notice that 'Common Lisp' has this notion of multiple return values. Anyway, I don't want to mimic Common Lisp whose design is, to my opinion, one of the worst things ever conceived, an horrible denaturation of the original Lisp language. When we have a tuple to implement, we have two possibilities, even if all components of the tuple have a direct implementation. 1. We can implement the tuple directly. This means that if the tuple (say '(a,b,c)') is supposed to be at position 'p' in the stack, 'a' will be at position 'p', 'b' will be at position 'p+size(a)', and 'c' at position 'p+size(a)+size(b)'. 2. we can implement the tuple indirectly. This means that if the tuple is supposed to be at position 'p' in the stack, we have a pointer at position 'p', and the data representing 'a', 'b' and 'c' are in the segment pointed to by this pointer. These two methods are just two ways of implementing categorical products. From a categorical viewpoint this means that in a cartesian category we may have two 'product' functors. Of course, these two functors are naturally isomorphic. In practice, this means that a datum represented using the first method may be converted into a datum represented using the second method, and conversely, without loosing information. This also means that the two methods can be used together in the same program. We could for example use the first one for anonymous products, and the second one for named products. We could also use the first one for products whose bit width is not too big (for example not bigger than 128 bits or 256 bits) and the second one for bigger products. *** meilleur support de SQLite3 (type de colonnes, etc...) *** désallocation de la mémoire. *** Simplification de la grammaire Anubis: - Suppression du mot clé 'operation' de la grammaire ('define' suffit) - ne plus avoir la possibilité de mettre des majuscules à certains mots clé (Read / read, Define / define, ...) - Obliger à délimiter les commentaires par des marqueurs (sinon, le parsing du code est un vrai casse-tête) - suppression de la grammaire Anubis 2 *** Amélioration de la grammaire Anubis: - Ajout de guillemets encadrant le chemin du 'read' afin de gerer les chemins contenant des espaces