   /**************************************************************************************
   
   
                     _____              ___.   .__         ________  
                    /  _  \   ____  __ _\_ |__ |__| ______ \_____  \ 
                   /  /_\  \ /    \|  |  \ __ \|  |/  ___/  /  ____/ 
                  /    |    \   |  \  |  / \_\ \  |\___ \  /       \ 
                  \____|__  /___|  /____/|___  /__/____  > \_______ \
                          \/ The \/  Anubis  \/    2   \/  Project \/
   
   
   Name of this file:           overview (text file)
   
   Purpose of this file:        An overview of the whole compiler. 
   
   

   Authors (Name [initials]):   Alain Proute'   [AP]
   
   Updates ([initials] (date) comment): 
      [AP] (2007 jun 07) Creation of this file.
   
   
   **************************************************************************************/
   
   
   
   *** How source texts flow through the compiler. 
   
   The role of the compiler is to read one or several ``source files'', and to produce one
   or several ``output files''.  The content of the output files is some kind of
   translation of the content of the source files.  This translation goes through
   intermediate steps (i.e. intermediate languages):
   
                     source text 
                          |
                          |  phase 1 (syntax)
                          V
                  weak expressions 
                          |
                          |  phase 2 (semantics)
                          V   
                 strong expressions 
                          |
                          |  phase 3 (code generation)
                          V
                    symbolic code
   
    Further translation, for example into another programming language or directly into
    the binary code of a physical machine, may also be performed.
    
    The source texts are transformed into weak expressions by a so-called ``lexer-parser''
    pair of programs.  During this phase, syntax errors may be generated.  Weak expressions
    are then ``interpreted'', which leads to strong expressions.  During this phase,
    semantic errors may be generated.  Finally, strong expressions are ``compiled'' into
    some symbolic ``machine code'', which may be output in several different ways.  The
    result of intermediate steps may also be output to files.  This is interesting in
    particular for strong expressions, whose logical coherence can still be checked by an
    algorithm, which is not a priori the case for symbolic code.
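
    As an illustration, the flow above can be sketched as three composed functions.
    All names here (parse, interpret, compile_strong) are hypothetical placeholders,
    not the actual compiler's API:

```python
# A minimal sketch of the three compilation phases, with placeholder bodies.
# None of these names belong to the real Anubis compiler.

def parse(source: str) -> list:
    """Phase 1 (syntax): source text -> weak expressions."""
    return source.split()  # stands in for the lexer-parser pair

def interpret(weak: list) -> list:
    """Phase 2 (semantics): weak expressions -> strong expressions."""
    return [(tok, "typed") for tok in weak]

def compile_strong(strong: list) -> list:
    """Phase 3 (code generation): strong expressions -> symbolic code."""
    return [f"PUSH {tok}" for tok, _ in strong]

def compile_source(source: str) -> list:
    return compile_strong(interpret(parse(source)))
```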
   
   
   
   
   
   
   *** The system of types. 
   
    The system of types plays a central role in this compiler; most of the originality of
    the Anubis language is due to it.  Types are of the following sorts:
   
     - defined types, 
     - anonymous products, 
      - dependent functional types, 
     - the type ``Omega'' of truth values, 
     - witness types, 
     - quantified types, 
     - universes. 
   
    Each sort of type normally comes with ``constructors'' and ``destructors''.
    Constructors enable the construction of data of these types, and destructors enable
    the use of these data.  However, things are not quite that simple.  For example, the
    type Omega has a constructor, but no destructor.  Instead of being destroyed, data of
    type Omega create witness types.  Witness types in turn have no constructor and no
    destructor, but in many cases they are identified with dependent functional types. 
   
   Data of  all our types are represented  by ``terms''. Hence, what  the constructors and
   destructors enable is essentially the writing  of terms. Terms of type Omega are called
   ``statements'', and terms of witness types are called ``proofs''.
   
    Types contain only abstract data.  Concrete data, like network connections, files, and
    other ``objects'', are generated by ``classes'', not by types.  Even though types and
    classes share many similar features, they should not be confused.
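
    As an illustration, the seven sorts of types could be represented by a small AST
    along these lines (a sketch only; the constructor names and fields are hypothetical,
    not the compiler's actual representation):

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class DefinedType:          # e.g. Bool, or List($T) applied to arguments
    name: str
    args: Tuple[object, ...] = ()

@dataclass(frozen=True)
class Product:              # anonymous product of component types
    components: Tuple[object, ...]

@dataclass(frozen=True)
class DependentFun:         # dependent functional type
    domain: object
    codomain: object        # may depend on the argument

@dataclass(frozen=True)
class Omega:                # the type of truth values
    pass

@dataclass(frozen=True)
class Witness:              # witness type created by a statement
    statement: object

@dataclass(frozen=True)
class Quantified:           # e.g. ?$T List($T)
    param: str
    body: object

@dataclass(frozen=True)
class Universe:
    level: int
```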
   
   
   
   *** The syntax. 
   
   The syntax of the Anubis language has been simplified since version 1. In version 2, it
   is at  the same  time simpler  and more  flexible.  It is  based on  a small  number of
   constructions. The main notions are:
   
      - modules (enabling modular programming), 
      - paragraphs (what modules, for example source files, contain), 
      - type expressions (the representation of types), 
      - terms (the representation of data).

    A module is an ordered sequence of paragraphs. Each paragraph is one of:
    
      - a ``read'' paragraph, enabling the use of tools from other modules, 
      - a type definition, 
      - a datum definition or declaration, 
      - a macro paragraph. 
   
    The Anubis language has simple and parametric polymorphism.  Simple polymorphism,
    better called ``polysemy'', is the ability to use the same name several times for
    defining several different things (of different types).  Parametric polymorphism is
    the ability to use ``parameters'' representing arbitrary types.  These two tools are
    very powerful.  However, they introduce many ambiguities in the texts, which must be
    resolved by the compiler.  This resolution is essentially performed by ``unification''.
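
    A minimal sketch of first-order unification over type terms written as (name, arg...)
    tuples, with parameters spelled '$T'.  This encoding is hypothetical, the real
    resolver also has to handle polysemy, and the occurs check is omitted here:

```python
def is_param(t):
    # Parameters are strings beginning with '$', e.g. "$T".
    return isinstance(t, str) and t.startswith("$")

def walk(t, subst):
    # Follow parameter bindings already recorded in the substitution.
    while is_param(t) and t in subst:
        t = subst[t]
    return t

def unify(a, b, subst=None):
    """Return an extended substitution making a and b equal, or None on clash."""
    subst = dict(subst or {})
    a, b = walk(a, subst), walk(b, subst)
    if a == b:
        return subst
    if is_param(a):
        subst[a] = b
        return subst
    if is_param(b):
        subst[b] = a
        return subst
    # Constructed types unify only with the same head name and arity.
    if isinstance(a, tuple) and isinstance(b, tuple) \
            and a[0] == b[0] and len(a) == len(b):
        for x, y in zip(a[1:], b[1:]):
            subst = unify(x, y, subst)
            if subst is None:
                return None
        return subst
    return None  # clash: distinct names or distinct arities
```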

   Terms are classified into the following few sorts:
   
     - simple symbols (also called ``nullary terms''),
     - unary terms, 
     - binary terms, 
     - tuples, 
     - declarative terms, 
     - conditionals,
     - local definitions. 
   
    This classification is reflected in the weak terms, whose structure mimics the
    syntax, but disappears from the strong terms, whose structure is dictated by the
    semantics (i.e. by the type system).
    
    Phase 1 is actually made of three main tools:
   
      - the lexer, producing 'tokens', 
      - the preparser, producing 'weak expressions', 
      - the postparser, producing 'weak paragraphs', 'weak terms' and 'weak types'. 
   
   At  the  level  of  'weak  expressions',  paragraphs,  types  and  terms  are  not  yet
   differentiated.
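
    For illustration, the first of these tools might be sketched as follows (the token
    classes and the regular expression are hypothetical, not the real Anubis lexer):

```python
import re

# A toy lexer: identifiers (possibly parameters like $T), integer literals,
# and single punctuation characters, in that order of preference.
def lex(source: str):
    return re.findall(r"\$?[A-Za-z_]\w*|\d+|\S", source)
```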
   
   
   
   *** Type definitions.
   
    Type definitions create ``defined types''.  Actually, a defined type may be a ``type
    schema'', i.e. the type it defines depends on one or several parameters. For example,
    the well-known type 'List($T)' depends on the parameter '$T'.  When a type does not
    depend on any parameter, it is called an ``actual type''.  For example, 'Bool' and
    'List(String)' are actual types, whilst 'List($T)' is not.  Nevertheless, a quantified
    type like '?$T List($T)' is an actual type, because the universal quantifier '?' binds
    the parameter '$T' in this expression.
   
    Type definitions (with or without parameters) may be recursive and even mutually
    recursive.  The name of a type and its number of parameters are part of this type
    (two types with distinct names are distinct, even if their definitions are otherwise
    identical).  Furthermore, the compiler must be careful, because two types may have the
    same name and the same number of parameters, provided they are never visible together
    from the same place.
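
    This identity rule can be illustrated by keying types on their name, parameter count,
    and defining module (a sketch only; the real compiler's representation is surely
    different):

```python
# Hypothetical illustration: a type's identity includes its name and its
# number of parameters, so List/1 and List/2 would be different types;
# the defining module separates same-named types that are never visible
# together from the same place.
def type_key(module_id: int, name: str, arity: int):
    return (module_id, name, arity)
```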
   
   
   
   *** Data definitions. 
   
    The purpose of a datum definition is to associate a meaning with one or several signs.
    This is also essential for recursion.  Signs are of four sorts:

     - nullary signs (also called ``simple symbols''), 
     - unary signs, 
     - binary signs,
     - declarative signs. 
   
   The sort  of a sign is  recognizable from the sign  itself, except that  some signs are
   ``ambiguous''. Such signs may be unary in some circumstances, and binary in others. The
    well-known example is the ``minus sign'', which may be used for constructing unary
   terms like '-456' and binary terms like '521 - 76'.
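
    A sketch of how such an ambiguous sign might be classified from its left context
    (the token model is hypothetical):

```python
# The minus sign is unary when no completed operand precedes it (start of
# an expression, after '(' or after another operator), binary otherwise.
def classify_minus(prev_token):
    if prev_token is None or prev_token in {"(", "+", "-", "*", "/"}:
        return "unary"
    return "binary"
```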
   
    The heart of a datum definition is a term called the ``pattern'' of the definition.
    The pattern is made only of signs.  Some of these signs are declared in the preamble
    of the definition, others are not.  Those which are not declared are considered
    constants, and each constant found in the pattern receives a definition (several
    definitions if it appears several times).  This system makes the syntax of Anubis 2
    very flexible.  Details of this mechanism are explained in the file 'define_check_2'.
   
    Actually, there is still another kind of datum definition, that of the ``converter''.
    This kind of definition has no pattern at all in the source files, but the internal
    pattern 'internal_symbol(convert)' is implicit.  This is part of the automatic
    conversion system between types, a very important feature for the users.
   
   
   
   *** Weak expressions. 
   
    Weak expressions are simply a structuring of the original expressions from the source
    files into tree form.  They contain ``position'' information (file path, line number,
    column number).  Weak expressions may be ambiguous, due to simple and parametric
    polymorphism.
   
   
   
   
   *** Strong expressions. 
   
    Strong expressions are combinatorial expressions (they contain no symbols, but
    context-depth information, i.e. De Bruijn indices), and ambiguities are explicit.
    Their structure is based on the type system, i.e. they mimic the notions of
    constructors and destructors of the type system.
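
    As an illustration, replacing named variables by context-depth indices can be
    sketched on a toy lambda-term encoding (the term shapes here are hypothetical, not
    the compiler's actual strong expressions):

```python
# Convert named terms to De Bruijn form: a variable becomes the number of
# binders between its use and its binder (0 = innermost).
def to_debruijn(term, env=()):
    kind = term[0]
    if kind == "var":                       # ("var", name)
        return ("ix", env.index(term[1]))
    if kind == "lam":                       # ("lam", name, body)
        return ("lam", to_debruijn(term[2], (term[1],) + env))
    if kind == "app":                       # ("app", function, argument)
        return ("app", to_debruijn(term[1], env), to_debruijn(term[2], env))
    raise ValueError(kind)
```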
   
   
   
   
   *** Knowledge base. 
   
   The compiler has a 'knowledge base', which  is made of a family of 'dictionaries'. Each
   dictionary is  actually a table of  so-called 'entries', together with  tools for using
    the dictionary.  Each entry is uniquely identified by its position in the table,
   i.e. an integer which is called its  'id'. Among the tools attached to a dictionary, we
   have:
   
      - A command for adding a new entry (the dictionary extends automatically if needed).
      - A command for getting the entry for a given 'id'.
      - One or several 'indexes' for quickly finding an entry, for example from its name. 
      - A command for saving a dictionary into a file in order to use it next time.
      - A command for retrieving a dictionary from a file, discarding out-of-date entries. 
   
    We never have to remove entries from a dictionary (except when retrieving a
    dictionary).  Indexes are realized by hashtables or B-trees.  When a dictionary is
    retrieved entry by entry, the out-of-date entries are discarded.  Because of this
    discarding, ids change.  During the retrieval of each dictionary, a hashtable is
    constructed giving the correspondence between previous ids and new ids.  This is
    used when retrieving subsequent dictionaries.  When all dictionaries are retrieved,
    these hashtables are discarded.
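
    A minimal sketch of such a dictionary, with the table position serving as id and one
    name index (the API shown is hypothetical):

```python
# An append-only table of entries; an entry's id is its position in the
# table, and a hashtable indexes entries by name.
class Dictionary:
    def __init__(self):
        self.entries = []      # id -> entry (extends automatically)
        self.by_name = {}      # index: name -> id

    def add(self, entry) -> int:
        new_id = len(self.entries)
        self.entries.append(entry)
        self.by_name[entry["name"]] = new_id
        return new_id

    def get(self, entry_id: int):
        return self.entries[entry_id]
```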
   
   
   The dictionaries are the following:
   
     - The 'read modules dictionary' (mod_dict) contains for each entry:
         -- The unambiguous name of the module (absolute path for a source file).
        -- The date of reading (date at which the module has been read by the compiler).
        -- A flag indicating if the module is currently correct (no error). 
   
     - The 'types dictionary' (type_dict) contains for each entry:
        -- The id of the module where the type (or type schema) is defined.
        -- The list of positions of the paragraphs defining the type in the module. 
        -- The name of the type.
        -- The list of parameters. 
        -- The checked definition of the type (in the form of a 'strong type'). 
   
     - The 'data dictionary' (data_dict) contains for each entry:
        -- The id of the module where the datum is defined. 
        -- The list of type parameters for this definition.
        -- The name of the sign which is defined by this entry. 
        -- The signature of the definition (strong meta-type).
        -- Maybe the value (a strong term) of the definition if it is not just a declaration.
   
     - The 'actual types implementations dictionary' (implem_dict) contains for each entry:
        -- The list of ids of the modules the entry depends on. 
        -- The actual type itself (a strong type). 
        -- The implementation. 
   
     - The 'closed terms realizations dictionary' (code_dict) contains for each entry:
        -- The list of ids of the modules the entry depends on. 
        -- The id of the datum which is realized.
        -- The values of type parameters for this instance (list of strong types).
        -- The realization (symbolic code). 
   
    The dictionary 'mod_dict' is retrieved first.  For each entry, the 'date of reading'
    field is compared with the current date (of last modification) of the module.  If the
    module is out of date, it is discarded.  Hence, only those modules which have not been
    modified since the previous compilation are present in 'mod_dict', and considered
    'already read'.
   
    Next, 'type_dict' and 'data_dict' are retrieved.  Entries for which the new id of the
    module is not present in 'mod_dict' are discarded.  For other entries, ids are
    updated, i.e.:
   
        - The previous id of the module is replaced by the new one.
        - The strong types,  meta-types and terms appearing within  the entry are updated,
          because they contain type ids and data ids.
   
    Finally, 'implem_dict' and 'code_dict' are retrieved.  If one of the old ids of
    modules in their dependency lists corresponds to a discarded module, they are
    discarded.  Strong data are updated (old ids replaced by new ids).
   
    After this, the hashtables for the correspondence between old and new ids are
    discarded, and the new compilation may begin. 

   Entries belonging to incorrect modules are not saved, but this does not change the ids. 
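
    The discarding step and the old-id/new-id hashtable can be sketched as follows (the
    entry shapes and the 'keep' predicate are hypothetical):

```python
# Retrieve a saved table: keep only entries still valid, and record the
# correspondence between previous ids and new ids, since discarding
# entries shifts the positions (ids) of those that follow.
def retrieve(saved_entries, keep):
    kept, old_to_new = [], {}
    for old_id, entry in enumerate(saved_entries):
        if keep(entry):
            old_to_new[old_id] = len(kept)
            kept.append(entry)
    return kept, old_to_new
```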
   
   
   
   
   *** Symbolic code. 
   
    The symbolic code is obtained by ``compiling'' the strong expressions.  It is made of
    instructions working on top of a stack.
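
    For illustration, an interpreter for such stack instructions might look like this
    (the instruction set shown, PUSH and ADD, is hypothetical, not the actual Anubis
    symbolic code):

```python
# A toy stack machine: each instruction pops its operands from the stack
# and pushes its result back.
def run(code):
    stack = []
    for instr in code:
        op, *args = instr
        if op == "PUSH":
            stack.append(args[0])
        elif op == "ADD":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
    return stack
```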
   

   
   
   
   *** Memory management. 
   
    The Anubis 2 language manages memory automatically.  This is required for preserving
    the safety of the language.  It is essentially an increment/decrement system, which
    generates 'vcopy' (virtual copy) and 'vdelete' (virtual delete) instructions in the
    symbolic code.  Many of these instructions disappear during the peephole optimization
    of the symbolic code.
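
    As an illustration, a peephole rule cancelling an adjacent vcopy/vdelete pair on the
    same value might be sketched as follows (the instruction encoding is hypothetical):

```python
# One peephole rule: a 'vcopy' immediately followed by a 'vdelete' of the
# same value is a no-op, so the pair is removed.
def peephole(code):
    out = []
    for instr in code:
        if out and out[-1][0] == "vcopy" and instr[0] == "vdelete" \
                and out[-1][1] == instr[1]:
            out.pop()                  # the pair cancels out
        else:
            out.append(instr)
    return out
```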