Description of the Anubis Compiler
This file contains a description of the Anubis compiler together with a discussion on
how the next version of the compiler should be designed.
-------------------------------- Table of Contents ------------------------------------
*** (1) Overall description.
*** (1.1) Modules of the compiler.
*** (1.2) How it works.
*** (1.2.1) Phase 1.
*** (1.2.2) Phase 2.
*** (1.3) Garbage collection.
*** (2) Tools.
*** (2.1) Lisp-like representation.
*** (2.2) Exceptions to the Lisp-like 'Expr' representation.
*** (2.3) Catalog of all internal data representations.
*** (2.3.1) Internal representation of types.
*** (2.3.2) Internal representation of terms.
*** (2.3.3) Internal representation of interpretations.
*** (2.3.4) Internal representation of type implementations.
*** (2.3.5) Internal representation of virtual machine instructions.
*** (2.3.5.1) Pseudo instructions.
*** (2.3.5.2) Computing instructions.
*** (2.3.5.3) Garbage collector instructions.
*** (2.3.5.4) Closure instructions.
*** (2.3.5.5) Equality instructions.
*** (2.3.5.6) Dynamic variables instructions.
*** (2.3.5.7) Serialising/unserialising.
*** (2.3.5.8) Starting a new virtual machine.
*** (2.3.5.9) Obsolete and soon obsolete instructions.
*** (2.4) Unification.
*** (2.5) Organization of the knowledge base.
*** (2.6) Indexation of symbols.
*** (3) The lexer.
*** (3.1) Generalities.
*** (3.2) Finding source files.
*** (3.3) The 'read' stack.
*** (3.4) Remembering already read files.
*** (3.5) Computing visibilities.
*** (4) The parser.
*** (5) Computing interpretations of terms.
*** (5.1) Overview.
*** (5.2) Interpreting applicative terms.
*** (5.3) Interpreting conditionals.
*** (5.4) Interpreting selective conditionals.
*** (6) Implementation of types.
*** (6.1) Implementation of primitive types.
*** (6.1.1) 'String'.
*** (6.1.2) 'ByteArray'.
*** (6.1.3) 'Int32'.
*** (6.1.4) 'Float'.
*** (6.1.5) 'Listener'.
*** (6.2) Implementation of defined types.
*** (6.3) Implementation of functional types.
*** (6.4) Implementation of address types.
*** (6.4.1) File and network address types.
*** (6.4.2) Dynamic variable address types.
*** (6.5) Implementation of C structure pointer types.
*** (7) Code generation.
*** (7.1) General conventions.
*** (7.2) Elimination of terminal recursion.
*** (7.3) Generating the code.
*** (7.3.1) Protect.
*** (7.3.2) Lock.
*** (7.3.3) Global symbols.
*** (7.3.4) Strings, Int32, small data and Floats.
*** (7.3.5) Local symbols.
*** (7.3.6) Microlocal symbols.
*** (7.3.7) Closures.
*** (7.3.8) Applicative terms.
*** (7.3.9) Conditionals.
*** (7.3.10) Selective conditionals.
*** (7.3.11) Local definitions ('with ...').
*** (7.3.12) Delegate.
*** (7.3.13) Serialize/Unserialize.
*** (7.4) Peephole optimisation.
*** (8) Garbage collector code.
*** (9) Equality code.
*** (10) Serialisation/unserialisation.
*** (10.1) How it works.
*** (10.2) Descriptions of small types.
*** (10.3) Pseudo-instructions set.
*** (10.4) Serialising.
*** (10.5) Unserialising.
*** (11) Outputting the code (making an *.adm file).
*** (12) How to make a system of macros?
---------------------------------------------------------------------------------------
*** (1) Overall description.
*** (1.1) Modules of the compiler.
The Anubis compiler is made of the following main parts:
- the lexer (lexer.y)
- the parser (grammar.y)
- a module for analysing type definitions (typedef.c)
- a module for analysing data definitions (opdef.c)
- the interpretations module (this is not an interpreter) (interp.c)
- the compiler proper (generation/optimization of symbolic code) (compile.c)
- the module for computing the implementations of data types (implem.c)
- the output module (making the '*.adm' files) (output.cpp)
It also contains secondary modules. Some of them are always used:
- a module for deciding if a type is recursive (rectype.c)
- tools for working on data types (unify.c, typetools.c, typewidth.c, typecmp.c, unknowns.c)
- computing the implicit destructors (destruct.c)
- computing the deletion code (delcode.c)
- computing the equality code (eqcode.c)
- a module containing various tools for manipulating expressions (expr.cpp)
- a system for managing the memory of the compiler itself (mallocz.c)
- a set of messages and functions for printing messages (msgtexts.c, show.c)
- a set of predefined paragraphs (predef.c, predef.anubis, predefined.anubis)
- tools for computing the size and binary format of instructions (vminstr.c)
Some others are used only on demand (options of the compiler):
- making an HTML index of all public symbols (index.c)
- outputting a C version of some Anubis types together with the constructors (dumpct.c)
- making a 'polish' translation (polish.c)
- producing a symbolic code dump (*.sc) (symcode.c)
*** (1.2) How it works.
The compiler works in two phases:
- phase 1: reading source files, checking paragraphs, computing the interpretations.
- phase 2: generating a module for each 'global' paragraph.
*** (1.2.1) Phase 1.
During phase 1, the compiler reads source files and does the following for each
paragraph:
1. reading the paragraph (using the parser and the lexer). This may produce a
syntax error.
2. depending on the kind of paragraph:
- 'read' paragraph: checking whether the file referred to has already been read. The
file is read only if it has not already been read; in either case, visibility is set
up correctly. See the section on the lexer below.
- 'replaced by' paragraph: this is the same as 'read' paragraphs, except that
visibility is computed differently. See the section on the lexer below.
- 'type' paragraph: the type definition is checked. See the section 'Checking a
type definition' below.
- 'define' paragraph: the definition is checked. See the section 'Checking a datum
definition' below.
- 'C constructor' paragraph. An Anubis type is translated into some program usable
from within a C program. (the output, for all those paragraphs, is written in the
files 'construc.h.out' and 'construc.c.out').
At most one error is reported per paragraph. If an error is found, the analysis restarts
at the next paragraph.
*** (1.2.2) Phase 2.
Phase 2 is performed only if there is no error after phase 1. Phase 2 performs the
following tasks:
- check that all type definitions are complete. Since a type may be defined using
several paragraphs, it is possible to define incomplete types (types for which not
all alternatives have been defined).
- determine which types are infinite (recursive).
- Dump C versions of types (if required).
- make the modules (*.adm).
- dump implementations of types (if required).
- print statistics (if required).
- print possible errors after phase 2.
*** (1.3) Garbage collection.
Data are implemented in two different ways:
- directly
- indirectly
'Directly' means that what we have in hand is the datum itself, and 'indirectly' means
that what we have in hand is a pointer to a segment of memory containing the datum.
Data may be 'created', 'duplicated' and 'destroyed'. When we duplicate (or 'copy') a
direct datum, we have no choice, we just copy all the bits of the datum. When we
duplicate an indirect datum, we have the choice between:
- allocating a new segment of memory, and copying the content of the original segment
into the new segment (this is called 'real' copy),
- just duplicating the pointer (this is called 'virtual' copy).
Notice that these methods are semantically equivalent for abstract data. For concrete
data (for example: dynamic variables, connections, etc...), indirect implementation and
virtual copy are required.
Of course, virtual copy is faster, but it creates 'shared data'. When a (necessarily
indirect) datum is shared, we have several pointers pointing to the same segment of
memory. Then, the problem is just to know when the segment is no longer used, so that we
can return it to the pool of memory. This problem is easily solved by 'reference
counting', which works as follows.
The first 32-bit word of any allocated segment is reserved for reference counting. It
contains an unsigned integer, called the 'counter' (for this segment). This counter is
either:
- zero, in which case, the segment is 'permanent' and will never be freed,
- non-zero, in which case its value is the number of pointers pointing to this
segment.
Permanent segments are created in general (always?) at compile time. For example, if
you write a character string in your source text, the compiler creates a permanent
segment for holding this string (this segment is within the program code itself). Of
course, the counter of a permanent datum is never changed and always remains 0.
Non-permanent data are created dynamically at run time. The function
'AllocateDataSegment' (see the virtual machine source code) receives the number of
32-bit words to allocate, and returns a pointer to the segment (actually cast to
'U32'). The first 32-bit word of the segment (the counter) is already initialized to
1.
When we need to make a virtual copy of a (non-permanent) indirect datum, we just have
to increment the counter of this segment (so that it always represents the number of
(significant) pointers to the segment). Notice that this will never overflow, because
an overflow would mean a number of pointers greater than the total memory could hold.
When we want to delete a (non-permanent) indirect datum, we perform a 'virtual
deletion'. This means that we decrement the counter. If the counter becomes 0, the
segment may be freed.
But freeing the segment is not as simple as this, because the segment itself may
contain pointers which are counted within other segments. Freeing the segment means
destroying these pointers, so that we must also decrement the corresponding
counters. This process is recursive, because destroying a pointer may decrement
another counter to zero, freeing another segment, and so on.
Hence, the process of freeing the segments must be coded as a (recursive) program. This
is what we call 'garbage collector code'. This code is generated by the compiler and
depends only on the definition of types. Actually, there is one garbage collector
function for each defined type which may contain indirect data (i.e. 'mixed' and
'large' types). Some other types (like functional types) require special garbage
collector functions.
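The reference counting scheme described in this section can be sketched in C as
follows. This is a minimal illustration only: the function names 'allocate_segment',
'virtual_copy' and 'virtual_delete' are assumptions, and the real virtual machine's
'AllocateDataSegment' and generated garbage collector code differ in detail (in
particular, the recursive deletion of inner pointers is generated per type by the
compiler).

```c
#include <stdint.h>
#include <stdlib.h>

typedef uint32_t U32;

/* Allocate a segment of 'words' 32-bit words plus the counter word.
   The counter (first word) is initialized to 1: one pointer refers
   to the new segment. */
U32 *allocate_segment(U32 words)
{
    U32 *seg = (U32 *)malloc((words + 1) * sizeof(U32));
    seg[0] = 1;
    return seg;
}

/* Virtual copy: a counter of 0 marks a permanent segment and is never
   touched; otherwise, count one more pointer to the segment. */
U32 *virtual_copy(U32 *seg)
{
    if (seg[0] != 0)
        seg[0]++;
    return seg;
}

/* Virtual deletion: decrement the counter; when it reaches 0, the
   segment may be freed. The generated garbage collector code would
   first recursively delete the pointers contained in the segment. */
void virtual_delete(U32 *seg)
{
    if (seg[0] == 0)          /* permanent: never freed */
        return;
    if (--seg[0] == 0) {
        /* ... recursively virtual_delete inner pointers here ... */
        free(seg);
    }
}
```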
*** (2) Tools.
*** (2.1) Lisp-like representation.
The Anubis compiler uses a particular style of programming inspired by Lisp. The
advantage is that this system is very flexible. The disadvantage is that, not being
typed, it is not very safe. I think I will never do that again. It works as follows.
The type 'Expr' represents all symbolic 'expressions' to be manipulated by the
compiler. This includes tokens generated by the lexer, syntactical trees generated by
the parser, interpretations generated by the interpretations module, contexts,
environments, symbolic code, ...
Actually, 'Expr' is just a clone of U32, the type of unsigned 32-bit integers. These
integers are 'dynamically' typed (using the low bits) as follows:
0 mod 2 ---------------------------------> (signed) integers
1 mod 2 ---> 1 mod 32 ------------------> tags
3 mod 32 ------------------> temporary pairs
5 mod 32 ------------------> strings
7 mod 32 ---> 7 mod 64 ---> user defined type variables
39 mod 64 ---> internal type variables (unknowns)
9 mod 32 ------------------> permanent pairs
In other words:
xxxxxxxx xxxxxxxx xxxxxxxx xxxxxxx0 integer
xxxxxxxx xxxxxxxx xxxxxxxx xxx00001 tags
xxxxxxxx xxxxxxxx xxxxxxxx xxx00011 temporary pairs
xxxxxxxx xxxxxxxx xxxxxxxx xxx00101 strings
xxxxxxxx xxxxxxxx xxxxxxxx xx000111 user defined type variables (like $T)
xxxxxxxx xxxxxxxx xxxxxxxx xx010111 unknowns
xxxxxxxx xxxxxxxx xxxxxxxx xxx01001 permanent pairs
Notice, for example, that pairs (both temporary and permanent) have their 5 lowest
bits taken by the tag. Hence, the maximal number of temporary pairs is 2^27, and the
same for permanent pairs. This makes at most 134217728 pairs of each sort. For
temporary pairs, this is
not a problem, but for permanent pairs, this may be a problem. Nevertheless, since a
pair occupies 8 bytes, an array of permanent pairs containing 134217728 pairs would
contain 2^30 bytes, i.e. 1 gigabyte. So, with a machine of less than 1 gigabyte of
memory, no problem will happen (except that very big programs may not be compiled).
A good program should be protected against such problems independently of the size of
available memory. Examining the code, I see that this is not the case for Anubis
1. Indeed, the function 'pcons' (defined in 'expr.cpp'), which is used for making a new
permanent pair, has no protection against an enlargement of the array of permanent
pairs above the limit. This should be fixed as soon as possible. (Similarly, 'cons' for
temporary pairs is not well protected).
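For illustration, the tag tests implied by these bit patterns could be written as C
macros along the following lines. This is a sketch: the actual macros are defined in
'compil.h', and the names used here (apart from 'consp') are assumptions.

```c
typedef unsigned int Expr;   /* 'Expr' is a clone of U32 */

/* Dynamic type tests following the bit patterns above
   (sketch; the real macros in 'compil.h' may differ). */
#define is_integer(e)    (((e) & 1)  == 0)   /* 0 mod 2   */
#define is_tag(e)        (((e) & 31) == 1)   /* 1 mod 32  */
#define is_temp_pair(e)  (((e) & 31) == 3)   /* 3 mod 32  */
#define is_string(e)     (((e) & 31) == 5)   /* 5 mod 32  */
#define is_type_var(e)   (((e) & 63) == 7)   /* 7 mod 64  */
#define is_unknown(e)    (((e) & 63) == 39)  /* 39 mod 64 */
#define is_perm_pair(e)  (((e) & 31) == 9)   /* 9 mod 32  */

/* A pair is either a temporary or a permanent pair. */
#define consp(e)         (is_temp_pair(e) || is_perm_pair(e))
```

With this encoding, each test is a single mask-and-compare, which is why the low
bits (rather than a separate type field) carry the dynamic type.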
All 'Expr's except temporary and permanent pairs are called 'atoms'. Pairs, either
temporary or permanent, are just made of two 'Expr's, and represented (in the comments
of the source of the compiler) as follows:
(a . b)
where 'a' and 'b' are two 'Expr's respectively called the 'head' and 'tail' of the
pair. The macro 'cons' enables the construction of such pairs.
As in Lisp, lists are represented using pairs and the tag 'nil' (the empty list) as
follows (for a list made of 4 'Expr's):
(a b c d)
which is actually a shortcut for:
(a . (b . (c . (d . nil))))
Such lists may be easily constructed using the macros 'list1', 'list2', ... defined in
'compil.h'.
Now, an 'Expr' like (a . (b . (c . d))) is written '(a b c . d)', and called a
'multiple cons'. A set of macros, called 'mcons3', 'mcons4', ... enable a direct
construction of such data.
Examples of use and meaning of these macros (and some others):
The C term represents
---------------------------------------------------------------
nil ( )
cons(a,b) (a . b)
mcons3(a,b,c) (a b . c)
mcons4(a,b,c,d) (a b c . d)
etc...
list1(a) (a) same as (a . nil)
list2(a,b) (a b) same as (a b . nil)
etc...
If x = (a b c d e):
car(x) a
second(x) b
third(x) c
etc...
cdr(x) (b c d e)
cdr2(x) (c d e)
cdr3(x) (d e)
etc...
Notice that 'mcons2' does not exist because it is the same as 'cons'.
'consp(x)' is 1 if and only if x is a pair (i.e. not an atom). Otherwise, it is 0. This
predicate ('consp' is a traditional Lisp abbreviation for 'cons predicate') is very
often used as follows:
while(consp(x))
{
Expr head = car(x);
... do something with the head of x ...
x = cdr(x); // continue with the tail of x
}
// in most cases, when we arrive here, we have x == nil (the empty list).
Normally, pairs are created as temporary, and made permanent only if
needed. The point is that, while working on a paragraph, the compiler constructs
data which will not be part of the final result. When the analysis of a paragraph is
finished, only the data to be kept are made permanent (using the function 'save'), and
the array of temporary pairs is reset to zero for the next paragraph. This is how
memory is managed in the compiler (this has nothing to do with garbage collection in a
virtual machine).
Integers are coded on 31 bits (bits 1 to 31); bit number 0, whose value is 0,
indicates that the 'Expr' is an integer. The file 'compil.h' contains all the macros
needed to test the type of an 'Expr', or to construct 'Expr's.
Tags are used for representing simple atomic concepts. For example, the tag 'nil'
represents the empty list. They are also used for 'tagging' non atomic data. For
example, when the function
(Int32 i, String s) |-> a
is read from a source file, the parser constructs the following 'Expr':
(lambda <lc> (([Int32] . i) ([String] . s)) . [a])
where '[Int32]' is a representation of the type 'Int32' (similarly for 'String'), and
'[a]' is a representation of the term 'a'.
Here, the tag 'lambda' represents the arrow '|->'.
*** (2.2) Exceptions to the Lisp-like 'Expr' representation.
Actually, this system has exceptions. Normally, all tokens returned by the lexer have a
value of type 'Expr', i.e. a datum containing its own (dynamic) type. By examining such
a value, it is possible to know whether we have an integer, a string, a pair, etc...
However, the two tokens 'yy__integer' and 'yy__char' have a value of type 'integer'.
This means that when the lexer reads a number (regular expression: [0-9]+), it returns
the token 'yy__integer', with a value to be interpreted as a C integer, not as a
Lisp-like integer. We must be careful, because the integer 6, for example, is
represented as follows in the two systems:
...0110 (C integer: this is not an 'Expr')
...01100 (Lisp-like integer: this is an 'Expr')
^
|
+--------- this bit (0) means that this 'Expr' is an integer
The reason for these exceptions is that we want the compiler to be able to read 32 bits
integers, not only 31 bits integers from source files.
The parser contains the following grammar rule:
Term: yy__integer %prec prec_symbol { $$ = mcons3(integer,linecol(),$1); }
This means that when an integer has been read by the lexer, the parser constructs the
following multiple cons as a representation of this integer:
(integer <lc> . <n>)
where <n> is the value of the integer as a C integer, not as a Lisp-like integer.
As a consequence, we must always remember that a list whose head is the tag 'integer'
always has the form (integer <lc> . <n>), where <n> is not an 'Expr'. It would be
meaningless to apply a predicate like 'is_integer' to <n>, and the result would be
unpredictable.
The exceptions are the following (where only <Cint> is not an 'Expr'):
(integer <lc> . <Cint>)
(small_datum <type> . <Cint>)
(anb_int32 <lc> . <Cint>)
(load_int32 . <Cint>)
(I hope there are no other exceptions, but I'm not absolutely sure).
Of course, the existence of these exceptions is very bad. A better solution would be
to represent (say) the integer 746542 as the list (7 4 6 5 4 2), where the elements of
the list would be Lisp-like integers. In the compilation step, the compiler may
transform this representation into a packed integer of 32, 64 bits,... maybe with a
warning in the case of an overflow. Actually, this is what already happens at a
smaller scale. Indeed, the current version of the compiler finds 3 interpretations for
the integer 180 (of types 'Int8', 'Int32' and 'Float'), but only 2 for 4567, because
this number is greater than 255.
*** (2.3) Catalog of all internal data representations.
The compiler represents the following data (internally):
- 'types' (as produced by the parser)
- 'terms' (as produced by the parser)
- 'interpretations' (as produced by 'term_interpretations' and related functions)
- 'implementations' (as produced by 'type_implementation' and related functions)
- 'instructions' (as produced by 'compile_term' and related functions)
Note: All data are represented either by Lisp-like atoms (string, tags, ...), or by
lists. In the case of a list, the head of list is always a tag.
*** (2.3.1) Internal representation of types.
<type> :=
yy__Symbol just a type name in the form of a Lisp-like string
For example, the types 'One', 'Bool' etc ... are in this category. The file
'compil.h' contains the definitions of a lot of 'predefined strings'. Within the C
source, such symbols may be represented by the C symbols 'pdstr_One', 'pdstr_Bool'
etc... ('pdstr_' for 'predefined strings').
Equivalently, 'pdstr_Bool' may be replaced by 'new_string("Bool")'. This has exactly
the same effect, except that 'pdstr_Bool' is faster at execution (of the compiler).
Remark: 'new_string' (defined in 'expr.cpp') does not allocate a new segment for the
string if the string already exists. In this case, it returns the already existing
Lisp-like string.
All types (without parameters) whose names are created by the user are also
represented like this.
yy__utvar user type variable (type parameter)
Type parameters have a special Lisp-like representation as atoms (see above).
type_String a Lisp-like tag representing the type 'String'
type_ByteArray idem
type_Int32 idem
type_Float idem
type_Listener idem
Being more 'primitive', the above are represented by Lisp-like tags instead of
Lisp-like strings.
(type_RAddr . <type>) represents 'RAddr(T)'; type_RAddr is a Lisp-like tag
(type_WAddr . <type>) idem
(type_RWAddr . <type>) idem
(type_GAddr . <type>) idem (obsolete: global variable)
(type_Var . <type>) idem (dynamic variable)
(type_MVar . <type>) idem (multiple dynamic variable)
The above are called 'address types'.
(app_ts <name> . <types>) type scheme applied to type operands
'app_ts' is for 'apply type scheme'. For example, 'List(String)' is represented as:
(app_ts new_string("List") type_String)
and 'Result(One,String)' as:
(app_ts new_string("Result") pdstr_One type_String)
Notice that there is no dot in the above Exprs. '(app_ts new_string("List")
. type_String)' is erroneous, because the second tail ('cdr2') of the list must be a
list of types, not a single type.
(type_struct_ptr . <id>) encapsulated C structure type
<id> is a Lisp-like integer identifying the structure.
(functype <types> . <type>) functional type
<types> is the list of the representations of the arguments types, and <type> is the
representation of the target type.
There is still another sort of 'types' represented internally: the 'type
unknowns'. They are also a special sort of Lisp-like Exprs. They are not generated by
the parser, but in general by the interpretation functions. When a symbol is found,
representing a scheme, all the type variables in this scheme are replaced by fresh
unknowns. Of course, there is a generator of fresh unknowns: 'fresh_unknown'.
From this we may conclude that defined types are those which are represented either by
a string or by (app_ts ...).
*** (2.3.2) Internal representation of terms.
Note: <lc> (generated by the lexer) is a packed version of the triplet
(filename,line,column). There are macros for extracting the components:
file_in(<lc>) a C char *
line_in(<lc>) a C int
col_in(<lc>) a C int
As usual, <Cint> represents a C integer (i.e. not a Lisp-like integer).
<term> :=
(alert <lc> . <filename>) <filename> is a Lisp-like string
(protect <lc> . <term>)
(lock <lc> <term> . <term>)
(debug_avm <lc> . <term>)
(avm <lc> . AVM) (AVM is explained below)
(terminal <lc> . <term>)
(symbol <lc> . <Lisp-like string>)
(integer <lc> . <Cint>)
(fpnum <lc> <mantissa> . <exponent>) (<mantissa> and <exponent> are Lisp-like integers)
(of_type <lc> <type> . <term>) (explicit typing)
(string <lc> . <string>) (<string> is a Lisp-like string)
(with <lc> <symbol> <value> . <term>)
(delegate <lc> <term> . <term>)
(lambda <lc> <FArgs> . <term>) (|->)
(rec_lambda <lc> <name> <FArgs> . <term>) (|-f->)
(app <lc> <term> . <terms>) (applicative term)
(cond <lc> <term> . <cases>) (conditional)
(select_cond <lc> <term> <case> . <term>) (selective conditional)
(serialize <lc> . <term>)
(unserialize <lc> . <term>)
The following are still used, but should become obsolete (and replaced by system calls).
(anb_read ...)
(anb_write ...)
(anb_exchange ...)
(connect_to_file ...)
(connect_to_IP ...)
(wait_for ...)
*** (2.3.3) Internal representation of interpretations.
The function 'term_interpretations' (in 'interp.c') computes the list of all
interpretations of a term (in a given appropriate context). An interpretation is a
pair:
(head . env)
where 'head' is called an 'interpretation head', and where env is an 'environment'.
Here, we describe interpretation heads.
Unfortunately, I've used the same tags as for terms. It would be safer to use different
tags. Furthermore, tags for some sort of expressions should have a uniform prefix, for
example:
typ_ for types
ter_ for terms
int_ for interpretations
imp_ for implementations
ins_ for instructions
The tags should also be numbered. This will limit the risk of forgetting a tag in a
switch. Hence for example, we should have the following tags for terms:
ter_00_alert
ter_01_protect
ter_02_lock
etc...
and a dummy one for marking the end of the list:
ter_20_dummy
Interpretation heads are the following:
(protect <lc> . <head>)
(lock <lc> <term> . <term>)
(avm <lc> . <instructions>)
(debug_avm <lc> . <head>)
(terminal <lc> . <head>)
(operation <lc> <opid> <name> <parms> <type> . <types>)
(string <lc> . <string>)
(anb_int32 <lc> . <Cint>)
(small_datum <type> . <Cint>)
(fpnum <lc> <mantissa> . <exponent>)
(local <name> <i> . <type>)
(micro_local <name> <d> <i> . <type>)
(closure <lc> (f_micro_ctxt <fname> <ftype> (<sym> . <type>) ...) <args> . <body>)
(app <lc> <head> . <heads>)
(cond <lc> <head> ((<name> (<sym> . <type>) ...) <lc> . <head>) ...)
(select_cond_interp <lc> <test head> <index> <clause head> <head then> . <head else>)
(with <lc> <symbol> <head> . <head>)
(delegate <lc> <head> . <head>)
(serialize <lc> . <head>)
(unserialize <lc> <type> . <head>)
The following are still used but should become obsolete:
(anb_read ...)
(anb_write ...)
(anb_exchange ...)
(wait_for ...)
*** (2.3.4) Internal representation of type implementations.
Type implementations are computed in 'implem.c'. We never compute an implementation for
a scheme, but only for an actual type (i.e. with no type parameter). For example,
'Maybe(Int8)' has an implementation different from 'Maybe(String)', but 'Maybe($T)' has
no implementation at all.
In the future, it would be nice to have quantification 'à la Girard' on types:
forall $T ...
exists $T ...
In this case, we would have to implement for example the type:
forall $T Maybe($T)
which has no parameter, since $T is bound in this expression.
Internal representations of implementations are the following. Again the same tags are
used, which is quite unfortunate.
<implem> :=
type_String
type_ByteArray
type_Int32
type_Float
(functype <implem> . <implems>)
(type_RAddr . <implem>)
(type_WAddr . <implem>)
(type_RWAddr . <implem>)
(type_GAddr . <implem>) (obsolete)
(type_MVar . <implem>)
It is very unfortunate that dynamic variables are not implemented as '(type_Var
. <implem>)'. Instead, this type is implemented as a large type. I don't understand
how this can work properly (but it seems to work miraculously). This should be
changed of course.
(type_struct_ptr . <id>)
(small_type <nalt> <iw> <alt geom> ... <alt geom>)
(mixed_type <nalt> <iw> <alt geom> ... <alt geom>)
(large_type <nalt> <iw> <alt geom> ... <alt geom>)
where:
<alt geom> :=
(small_alt (<imp> <offset> . <width>) ... (<imp> <offset> . <width>))
(mixed_alt (<imp> <offset> . <width>) ... (<imp> <offset> . <width>))
(large_alt (<imp> <offset> . <width>) ... (<imp> <offset> . <width>))
*** (2.3.5) Internal representation of virtual machine instructions.
*** (2.3.5.1) Pseudo instructions.
These instructions are of size 0.
(label . <addr>) position within the code
(ret_point . <addr>) same as label
(header ...) used for decorating the .sc files
(comment ...) idem
(context ...) idem
*** (2.3.5.2) Computing instructions.
(glue_index . <index>) used for small types
(glue_mixed_index . <index>) used for mixed types
(store_index . <index>) used for large types
These instructions are used for constructing a datum of a defined type. They put the
right index (i.e. alternative number of the datum) into the datum.
For a small datum, 'glue_index' puts the index within the least significant bits (of
R). This is the first operation when constructing the datum, so that the content of R
is actually overwritten.
For a mixed datum, the index is put into the 2 least significant bits by
'glue_mixed_index'.
For a large datum, the index (a byte) is put at offset 4 (just after the counter) into
the segment.
(glue . <bit width>) used for small and mixed types
(store_0 . ?) used for mixed and large types
(store_1 . ?)
(store_2 . ?)
(store_4 . ?)
These instructions are used for putting the components into the datum under
construction (of some defined type).
The instruction '(glue . <bit width>)' ORs the register R with a left shift of the top
of stack. The number of bits of the shift is given by the operand.
'(store_? . <offset>)' stores a word of 0, 1, 2 or 4 bytes at the offset given by the
operand within the segment pointed to by R.
(index_direct . <bit width>)
index_indirect
These instructions are used by conditionals. A datum of a defined type is in the
register 'R', and they put its index (alternative number) into the index register
'I'. The first one is used for small and mixed data, the second one for large data.
(switch <addr> ... <addr>)
This instruction checks the content of the index register 'I', and jumps to the
corresponding address. It is used by conditionals for jumping to the right case.
(unglue <bit width> . <right shift>)
(unstore <offset> . <n>) where <n> is a Lisp-like integer equal to 0, 1, 2 or 4
(unstore_copy . <offset>)
(unstore_copy_mixed <offset> . <mask>)
(unstore_copy_ptr . <offset>)
(unstore_copy_function . <offset>)
These instructions are used by conditionals for pushing the values of resurgent symbols
onto the stack.
'(unglue <bit width> . <right shift>)' is used for extracting a component from a small
datum. The extracted component is pushed onto the stack.
'unstore' instructions are similar but for indirect data. Some of them need to make a
virtual copy of the datum to be extracted.
(jmp . <addr>)
This simple 'jump' instruction is used at the end of cases in conditionals for jumping
to the next code (which is common to all cases).
(apply . k)
(ret . k)
When '(apply . k)' is about to be executed, the stack contains (starting at the top of
stack) the arguments of a function and a return address. The function to be 'applied'
to these arguments is in 'R'.
This instruction acts in two different ways depending on the nature of the function,
which may be either 'top level' or a 'closure'. See the section on closures and
functions for more explanations.
'(ret . k)' is just an ordinary return instruction (except that it stops the virtual
machine if the return address is zero).
'k' is the number of arguments of the function.
(load_Int32 . <Cint>)
(load_float <mantissa> . <exponent>)
(string . <string index>)
These instructions are used for loading constants into 'R'.
(check_stack . <n>)
This instruction checks if the stack is big enough for pushing '<n>' 32 bits words. If
it is not, it puts the virtual machine in the state 'needs_bigger_stack' and gives up,
so that the scheduler may enlarge the stack of this machine.
(push_addr . <addr>)
This instruction pushes a return address on top of stack.
(address . <addr>)
The same but the address is put into 'R'.
push
This instruction pushes the content of 'R' onto the stack.
(select_index_direct <bit width> <index> . <addr>)
(select_index_indirect <index> . <addr>)
These instructions are generated for selective conditionals.
(peek <name> . <depth>)
(micro_peek <name> <depth> . <micro depth>)
The instruction 'peek' takes the 32 bits word at depth '<depth>' in the stack and puts
it into 'R'. The datum is not virtually copied. The compiler generates a 'copy'
instruction to follow this one, if needed, depending on the type of the datum.
'micro_peek' is similar, except that the datum to be put into 'R' is located in a
micro context. The micro context is at depth <depth> in the stack, and the datum is
at position <micro depth> in the micro context.
protect
(unprotect . <addr>)
unlock
lock
odd_align
This instruction is put in front of the code of a top level function if this code would
be at an even address.
pop1
pop2
pop3
These instructions just remove 1, 2 or 3 32 bits words from the top of stack.
invalid
This instruction should never be executed. It is generated in the garbage collector
code for small alternatives in mixed types. This means that the garbage collector code
must never be called for a datum belonging to a small alternative of a mixed type. Other
instructions with a <mask> operand must use this mask to avoid such a call.
give_up
This instruction makes the virtual machine give up as if its time slice were
finished.
start_debug_avm
stop_debug_avm
These instructions set and unset a flag. When this flag is set, the virtual machine
prints every instruction executed, along with other information.
finish
(program . <instructions>)
do_alert
*** (2.3.5.3) Garbage collector instructions.
The null 32 bits word '0' is always a valid datum (of any type) for the garbage
collector. Of course, this datum doesn't need to be collected, but the garbage
collector may be passed such a datum.
Allocation of memory:
(alloc . <number of bytes>)
The pointer to the resulting segment is put in R. The counter (the first 32 bits word
of the segment) always contains 1 after the allocation.
Making virtual copies. The following instructions are used for making a virtual copy of
a possibly indirect datum. This amounts to incrementing the counter which is at the
beginning of the segment. However, these instructions are generated based on the type
of the datum only, and this is not enough to know if the counter exists (some data are
direct). Hence the existence of several variants of these instructions. Also, these
instructions always check if the counter is zero. If it is, the datum is permanent,
and the counter is not incremented.
Copying virtually within the stack. The datum is at position <depth> in the stack (0 is
the top of stack).
(copy_stack_ptr . <depth>)
The counter always exists (except if the datum is the null word). The counter is
incremented if it is non zero.
(copy_stack_function . <depth>)
The datum represents a function, i.e. the address of some piece of code if it is odd,
and a pointer to a closure if it is even. The counter exists only in the second case.
(copy_stack_mixed <depth> . <mask>)
The <mask> has at most 4 bits. Each bit indicates if the corresponding alternative is
large or small. The counter exists only if the alternative is large.
Virtually copying in R. This is the same except that the datum to be copied is in R.
copy_ptr
copy_function
(copy_mixed . <mask>)
The next one should become obsolete:
copy
It is almost the same as 'copy_ptr', except that it does not check for permanent
data. It should disappear and be replaced by 'copy_ptr'.
Virtual deletion within the stack.
(collapse . <depth>)
This instruction deletes the datum at position <depth> in the stack. The datum is
direct, so the instruction only has to move the <depth> words at the top of the stack
one slot downwards, and adjust the stack pointer. For example, (collapse . 3) works as
follows:
the stack before the stack after
a b c d e f g .... a b c e f g ...
('d' has disappeared)
The next instructions virtually delete a datum which is in R. These instructions are
generated only for those types which have indirect data.
del_ptr
'del_ptr' expects in R a pointer to a segment not containing any indirect data. In this
case, the deletion amounts to decrementing the counter pointed to by R. However, R may
contain 0 (in this case the instruction does nothing), or the counter may be zero (in
this case, the datum is permanent and the instruction does nothing). Otherwise, if the
counter becomes zero, the segment is freed.
del_conn
This is similar to 'del_ptr' except that the segment pointed to represents a
'connection' (file or network). The instruction works as 'del_ptr', except that if the
counter becomes zero, it also closes the connection.
del_function
This is similar to 'del_ptr', except that functions are either addresses of code (if
odd) or pointers to closures (if even). In the first case nothing is done. In the
second case, 'del_function' works like 'del_ptr', except that the address of the
deletion code is found within the closure itself instead of as an operand of the
instruction. This is because the types of the data in the micro context cannot be
known from the type of the function.
(del . <addr>)
'(del . <addr>)' expects a datum belonging to a large type in R. '<addr>' is the
address of the deletion code for this type. The instruction checks if R is zero; in
that case it does nothing. Otherwise, R contains a pointer to a valid segment. If the
counter is zero (permanent datum) the instruction does nothing. Otherwise, the counter
is decremented. If the counter becomes zero, a return address and the datum (in R) are
pushed onto the stack, and the deletion code at '<addr>' is jumped to.
(del_mixed <mask> . <addr>)
This is the same as '(del . <addr>)' except that the datum in R belongs to a mixed
type. The mask is used to check if the datum is direct or indirect. If it is direct,
nothing is done. If it is indirect, the same as for '(del . <addr>)' is performed.
(del_struct_ptr . <id>)
This is the same instruction, but for C structure encapsulation pointers. Somewhat
curiously, it is coded in the compiler ('compile.c'), but has never been generated (the
proof: it is not coded in 'vminstr.c'). This is probably due to the fact that
encapsulated C structures are manipulated only from within 'predefined.anubis'. Hence,
this is not a big problem, because only a modification of 'predefined.anubis' can lead
to the generation of this instruction.
(del_mvar . <addr>)
This instruction is even more phantomatic than the previous one. This is more
problematic, because multiple variables may be manipulated by anyone. Hence, this
instruction should be coded as soon as possible.
Note: We should also have a '(del_var . <addr>)'. The type 'Var' is very badly
implemented as a large type (one of my worst ideas). Everything concerning this type
should be examined and corrected. I suspect that this creates at least memory leaks,
but maybe also more serious problems.
(del_stack_ptr . <depth>)
(del_stack_conn . <depth>)
(del_stack_function . <depth>)
(del_stack <depth> . <addr>)
(del_stack_mixed <depth> <mask> . <addr>)
(del_stack_struct_ptr <id> . <depth>)
(del_stack_var <depth> . <addr>) (phantom)
(del_stack_mvar <depth> . <addr>)
The above instructions are the same as the previous ones, except that the datum to be
virtually deleted is in the stack at depth <depth> instead of being in R.
mvar_slots_del_ptr
mvar_slots_del_conn
mvar_slots_del_function
(mvar_slots_del . <addr>)
(mvar_slots_del_mixed <mask> . <addr>)
(mvar_slots_del_struct_ptr . <id>)
(mvar_slots_del_var . <addr>)
(mvar_slots_del_mvar . <addr>)
Again, these instructions are the same as above, except that they have to virtually
delete the data in all the slots of the multiple variable.
indirect_del_ptr
indirect_del_conn
indirect_del_function
(indirect_del . <addr>)
(indirect_del_var . <addr>) (phantom)
(indirect_del_mvar . <addr>)
(indirect_del_mixed <mask> . <addr>)
(indirect_del_struct_ptr . <id>)
Again, these instructions work like the above, except that the datum to be virtually
deleted is pointed to by the top of stack.
(increment_del . <byte width>)
This instruction is used when we are really deleting an indirect datum. A pointer to
some position within the segment to be freed is on top of stack. This instruction just
increments this pointer by the number of bytes indicated by its operand.
del_index_direct
This instruction expects a datum of a mixed type on top of stack, belonging to a large
alternative. This datum is a 32 bits word which may be decomposed as follows:
pppppppp pppppppp pppppppp ppppppii
i.e. into a field of 30 bits and a field of 2 bits. The two bits 'ii' give the number of
the alternative the datum belongs to, and
pppppppp pppppppp pppppppp pppppp00
is the pointer to the segment. The role of the instruction is:
- put 'ii' into the index register I
- replace 'ii' by '00' on top of stack
- duplicate the top of stack.
The reason why we duplicate the top of stack is that one pointer will be incremented
for walking into the segment, and the other one will be used for freeing the segment.
del_index_indirect
This instruction is similar to 'del_index_direct', except that it concerns a large
datum. The index is read from the data segment, not from the manipulation word. The
pointer is also duplicated on top of stack.
free_seg_0
free_seg_1
These two instructions are just calling the 'FreeDataSegment' function of the memory
allocator. The first one transmits the word at *(SP-1) (top of stack: depth 0 in the
stack), whilst the second one transmits the word at *(SP-2) (depth 1 in the stack).
*** (2.3.5.4) Closure instructions.
These instructions are used for constructing new closures.
(put_copy_direct <depth> . <position>)
(put_copy_indirect <depth> . <position>)
(put_copy_function <depth> . <position>)
(put_copy_mixed <mask> <depth> . <position>)
A memory segment (the closure under construction) is pointed to by 'R'. The <depth>
operand is the stack depth of the datum to be put. The <position> operand is the
position where the datum must be put in the micro context of the closure. The offset
(in bytes) within the closure where the datum must be put is 4*(3+<position>), i.e.
<position> is 0 for the first datum of the micro context, 1 for the next one, etc...
For indirect data, the counter is incremented, because this is a duplication.
(put_micro_copy_direct <depth> <micro depth> . <position>)
(put_micro_copy_indirect <depth> <micro depth> . <position>)
(put_micro_copy_function <depth> <micro depth> . <position>)
(put_micro_copy_mixed <mask> <depth> <micro depth> . <position>)
Same as above, except that the datum is located in another micro context which is at
depth <depth> in the stack. The datum itself is at position <micro depth> in this other
micro context. Again, <micro depth> is 0 for the first datum in the micro context, 1
for the next one, etc...
(put_closure_labels <code> . <del code>)
This instruction puts its two operands at offsets 4 and 8 into the closure (pointed to
by 'R').
Below is an example of how a closure is constructed. First of all, here is the source
text (copy-pasted from the sources of our web site):
(DBRow(MTxt) dbrt) |-> if dbrt is dbrow(_,_,t) then
if t is mtxt(_,v,time) then
success(mtxt(new_name,v,time))
and below is the code generated for the construction of the closure (copy-pasted from
the .sc file):
| ; |-> in '/home/alp/www_dev/anubis-language/_1_6/translate_item.anubis' at line 451
| ; function code at label 31092
146195 | alloc 16
| ;--- stack ------------------------
| ; 0 (return address)
| ; 1 i LangInfo
| ; 2.0 tdb DB(TDB)
| ; 2.1 old_name String
| ; 2.2 new_name String
| ; 3 (return address)
| ;----------------------------------
146197 | put_micro_copy_indirect 2 2 0
146201 | put_closure_labels 31092 31093
'alloc 16' allocates 4 32 bits words for the closure: counter, code, del and just one
datum in the micro context (the value of 'new_name').
Remark that 'new_name' is the sole local symbol occurring free in the above term, and
that its value is within an older micro context located at depth 2 in the stack. The
value of 'new_name' is at position 2 in this older micro context, so that the right
instruction for constructing the new micro context is:
146197 | put_micro_copy_indirect 2 2 0
which means: 'take the datum at position 2 in the micro context at depth 2, and put it
at position 0 in the new micro context' (of course, this counts as a virtual copy). The
'indirect' variant is used because the type is 'String'. The last instruction:
146201 | put_closure_labels 31092 31093
puts the address (label 31092) of the code for executing the function and the address
(label 31093) of the garbage collector code for deleting the closure itself.
*** (2.3.5.5) Equality instructions.
In the next version of Anubis, equality code should be generated as generic code (like
for implicit destructors), so that all the instructions below will become obsolete.
Furthermore, equality (with Boolean value) should be generated only for 'separable'
types, i.e. types for which this boolean equality, in the mathematical sense, is
computable. From a topos-theoretic point of view, this means that it should be generated
only for types 'T' such that the internal equality '(T,T) -> Omega' factors through the
canonical arrow 'Bool -> Omega' (sending 'true' onto 'true' and 'false' onto 'false').
However, I don't know if there is an algorithm able to decide this property. I know
only a recursive class of types having this property. Of course, internal equality
exists for all (abstract) types and may be defined by the Leibniz rule.
In Anubis 1, equality code is generated for all types (except 'Float', but this may be
the result of some misunderstanding). For concrete types (address types and several
primitive types, structure pointer types) equality is physical (compare the addresses
of the data). For functional types it is also physical, despite the fact that in some
cases, we could be more clever (when we know that the product of the types of the
arguments is finite). Theoretically, boolean equality should not be generated for
functional types in most cases.
So, we have something special to generate only for defined types. For 'x' and 'y' to
be equal, the principle of comparison is as follows:
1. 'x' and 'y' must belong to the same alternative,
2. if so, the components must be equal two by two.
Of course, the second condition shows that this is a recursive process. Hence, an
'equality code' is generated for each defined type. This code uses special
instructions, which is quite stupid, because ordinary instructions would have done the
job (maybe somewhat more slowly).
(false_jmp . <addr>)
(true_jmp . <addr>)
Put either 'false' (i.e. 0) or 'true' (i.e. 1) in 'R' and jump (unconditionally) to the
operand address.
(jmp_false . <addr>)
Jump to address if the content of 'R' is 'false'.
(jmp_eq_stack . <addr>)
This instruction compares the two words on top of stack, puts 1 (true) in R if they are
equal and performs a jump to 'addr'. Otherwise the next instruction is executed, and R
remains empty.
(jmp_neq_indexes_large . <addr>)
(jmp_neq_indexes_mixed <bit width> <mask> . <addr>)
(jmp_neq_string . <addr>)
(jmp_neq_byte_array . <addr>)
(jmp_neq 0 . <addr>)
(jmp_neq 1 . <addr>)
(jmp_neq 2 . <addr>)
(jmp_neq 4 . <addr>)
(eq . <depth>)
This instruction tests the equality of the two words at depths '<depth>' and '<depth>+1'
in the stack. The register R is set to 1 (true) if they are equal, and to 0 (false)
otherwise.
eq_string
eq_byte_array
Tests the equality of the two strings or byte arrays on top of stack (depths 0 and 1),
and puts either 'true' (if equal) or 'false' (if not equal) in R. Of course, the
strings or byte arrays are walked through.
push_eq_data
(increment_eq . <byte width>)
*** (2.3.5.6) Dynamic variables instructions.
(get_var_handler . <addr>)
(get_mvar_handler . <addr>)
get_var_monitors
get_mvar_monitors
push_mvar_length
remove_monitor
ret_if_zero
get_vv
xchg_vv
get_mvv
xchg_mvv
create_var
free_var_seg
free_mvar_seg
*** (2.3.5.7) Serialising/unserialising.
(serialize . <addr>)
(unserialize . <addr>)
*** (2.3.5.8) Starting a new virtual machine.
(start <depth> . <addr>)
This instruction is generated by the keyword 'delegate' and starts a new virtual
machine. '<depth>' is the number of slots on top of the stack of the old machine which
should be copied into the stack of the new machine. No virtual copies are needed,
because they are already performed (by 'copy_stack_...' instructions). The new machine
starts execution at the instruction following this one, whilst the old machine jumps to
the operand address.
*** (2.3.5.9) Obsolete and soon obsolete instructions.
The following are still used but should become obsolete (used for global variables):
(create_vars ...)
(gv_address ...)
(init_gv ...)
(del_gv ...)
get_gvv
xchg_gvv
initialization_address
variables_deletion_address
The following have never been used (experimental?):
(call ...)
location
(code_for ...)
(type_list ...)
push_before_eq
(dec3 . <addr>)
*** (2.4) Unification.
Since the Anubis language allows the use of type parameters (parametric polymorphism),
we need to be able to unify types.
'Environments' are lists of the form:
((v_1 . T_1) ... (v_k . T_k))
where each 'v_i' is an 'unknown', and where each 'T_i' is a type (which may also
contain unknowns). Such an environment just says that:
v_1 has value T_1
...
v_k has value T_k
If a pair is of the form '(v_i . v_j)', i.e. if the value of 'v_i' is the unknown
'v_j', we assume that 'v_i > v_j' (recall that they are 'Expr's, hence integers). This
is to avoid some kind of circularity.
Because of the Lisp-like representation, unification is fairly simple. It works as
follows, where 'u(A,B)/E' is the result of unifying 'A' with 'B' relatively to
environment 'E'. The result is a new environment. '$1' and '$2' represent two arbitrary
unknowns.
u($1,$1)/E = E
u($1,$2)/E = (($1 . $2) . E) (if $1 > $2)
u($1,$2)/E = (($2 . $1) . E) (otherwise)
u($1,T)/E = (($1 . T) . E)
u(A,A)/E = E (for any atomic A)
| not_unifiable if u(A1,B1)/E = not_unifiable
u((A1 . B1),(A2 . B2))/E = |
| u(B1,B2)/(u(A1,A2)/E) otherwise
u(A,B)/E = not_unifiable in all other situations
We have no problem with the fake 'Expr' <Cint> because we unify only types, and types
never contain a <Cint>.
It may be the case that the two expressions to be unified are relative to different
environments. In that case we just have to merge the two environments before unifying
the two expressions.
We also want to avoid circularities, and to that end we use a non circularity test
(like some Prolog systems). The problem is that the above rules would give the
following for example:
u($1,Maybe($1))/E = (($1 . Maybe($1)) . E)
But this is not correct because, a type 'T' cannot be equal to 'Maybe(T)'. Hence, when
we have to compute 'u($1,T)/E', we use the function 'depends_on' to test if 'T' depends
on '$1' directly or indirectly. If it is the case, the unification fails.
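The rules above, together with the 'depends_on' occurs check, can be sketched in C. This
is an illustrative stand-in, not the compiler's code: the real unifier works on
integer-encoded 'Expr's, and the '$1 > $2' ordering rule is omitted here for brevity.
All names below are ours.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Terms are unknowns, atoms, or pairs (cons cells), mirroring the
   Lisp-like 'Expr' representation with a small tagged struct. */
typedef enum { T_UNKNOWN, T_ATOM, T_PAIR } Tag;

typedef struct Term {
    Tag tag;
    int id;                      /* unknown number or atom id */
    struct Term *car, *cdr;      /* used when tag == T_PAIR */
} Term;

static Term *mk(Tag tag, int id, Term *car, Term *cdr) {
    Term *t = malloc(sizeof *t);
    t->tag = tag; t->id = id; t->car = car; t->cdr = cdr;
    return t;
}

/* Environment: association list ((v . T) ...) */
typedef struct Env {
    int var;
    Term *val;
    struct Env *next;
} Env;

static Term *env_lookup(Env *e, int var) {
    for (; e; e = e->next) if (e->var == var) return e->val;
    return NULL;
}

static Env *env_bind(int var, Term *val, Env *e) {
    Env *n = malloc(sizeof *n);
    n->var = var; n->val = val; n->next = e;
    return n;
}

/* Occurs check: does 't' depend on unknown 'var', directly or
   through the environment ('depends_on' in the real compiler)? */
static int depends_on(Term *t, int var, Env *e) {
    if (t->tag == T_UNKNOWN) {
        Term *v;
        if (t->id == var) return 1;
        v = env_lookup(e, t->id);
        return v ? depends_on(v, var, e) : 0;
    }
    if (t->tag == T_PAIR)
        return depends_on(t->car, var, e) || depends_on(t->cdr, var, e);
    return 0;
}

/* Unify 'a' with 'b' in environment '*e'; returns 1 on success,
   0 for 'not_unifiable'. */
static int unify(Term *a, Term *b, Env **e) {
    if (a->tag == T_UNKNOWN) {
        Term *v = env_lookup(*e, a->id);
        if (v) return unify(v, b, e);
        if (b->tag == T_UNKNOWN && a->id == b->id) return 1; /* u($1,$1) */
        if (depends_on(b, a->id, *e)) return 0;              /* circularity */
        *e = env_bind(a->id, b, *e);
        return 1;
    }
    if (b->tag == T_UNKNOWN) return unify(b, a, e);
    if (a->tag == T_ATOM) return b->tag == T_ATOM && a->id == b->id;
    if (b->tag != T_PAIR) return 0;
    return unify(a->car, b->car, e) && unify(a->cdr, b->cdr, e);
}
```

So unifying '$1' with an atom succeeds and binds '$1', while unifying '$1' with a pair
containing '$1' (the 'Maybe($1)' situation) is rejected by the occurs check.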
Important note: Unification is used each time we have to compare types. Indeed, a type
containing unknowns is not yet completely known. The comparison may succeed if some
unknowns are instantiated. This is why comparing types without unifying is generally an
error.
Also, all the computations of unification and interpretations of terms are purely
deterministic (no side effects at all). This guarantees that when we try several
possibilities, there is no clash between them.
It is often the case that we want to dereference a type relatively to its
environment. This means replacing (recursively) all type unknowns by their values (when
they have one). This is performed by the command 'dereference_type'.
*** (2.5) Organization of the knowledge base.
The compiler stores the result of checking the paragraphs into several arrays:
struct Type_struct *types; (array of type definitions)
struct Operation_struct *operations; (array of definitions, formerly called 'operations')
struct Variable_struct *variables; (array of global variables (obsolete))
See the definitions of these structures in 'compil.h'.
Checked paragraphs are stored into these arrays, which work like stacks. The stack
pointers are called 'next_type' and 'next_operation'.
*** (2.6) Indexation of symbols.
In a previous version of Anubis, when the interpretations of a symbol were searched
for, the arrays of types and operations were scanned sequentially. During the year
2003 (as far as I can remember), it appeared that this was too slow (especially with
Olivier's program, containing thousands of types and symbols). I decided to index
these arrays by a hashing method. The result was that the compiler became twice as
fast.
The 'symbol index' is just an array of 256 arrays of 'SymbolIndexEntry'. When a symbol
is searched for, its 'hash' is first computed. This hash is a U8 (computed by
'u8_symbol_hash' in 'interp.c'). With this hash we may go directly to the right array
of 'SymbolIndexEntry'.
This secondary array is searched for an entry corresponding to our symbol (each entry
contains the name of a symbol). When such an entry is found, we get the index of the
actual definition in the array of types or operations.
In the future this system should be replaced by a B-tree, indexed on names of symbols.
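The lookup path described above can be sketched as follows. The bucket sizes, the
'insert'/'lookup' helpers and the toy hash function are ours; the real hash is
'u8_symbol_hash' in 'interp.c'.

```c
#include <assert.h>
#include <string.h>

typedef unsigned char U8;

/* Illustrative stand-in for 'u8_symbol_hash': reduce a name to
   8 bits, selecting one of 256 buckets. */
static U8 u8_hash(const char *name) {
    unsigned h = 0;
    while (*name) h += (unsigned char)*name++;
    return (U8)h;
}

/* One entry of a secondary array: symbol name plus the index of
   the actual definition in the array of types or operations. */
typedef struct {
    const char *name;
    int definition_index;
} SymbolIndexEntry;

/* The symbol index: 256 buckets of entries (fixed sizes here are
   only for the sketch; the compiler's arrays grow dynamically). */
typedef struct {
    SymbolIndexEntry entries[32];
    int count;
} Bucket;

static Bucket symbol_index[256];

static void insert(Bucket index[256], const char *name, int def) {
    Bucket *b = &index[u8_hash(name)];
    b->entries[b->count].name = name;
    b->entries[b->count].definition_index = def;
    b->count++;
}

static int lookup(Bucket index[256], const char *name) {
    Bucket *b = &index[u8_hash(name)];
    for (int i = 0; i < b->count; i++)
        if (strcmp(b->entries[i].name, name) == 0)
            return b->entries[i].definition_index;
    return -1;  /* not found */
}
```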
*** (3) The lexer.
*** (3.1) Generalities.
The role of the lexer is essentially to cut the input into 'tokens'. 'lexer.y' is a
FLEX file, which must be compiled by 'flex'. This produces the file 'lex.yy.c' which
becomes part of the C source.
However, the lexer performs several other tasks:
- handling 'read' and 'replaced by' paragraphs, computing visibilities,
- counting lines and columns, and level of nested comments,
- working on several special tokens (strings, floats, named arrows,...).
The lexer has several 'states' (in the sense of FLEX). They are the following:
- INITIAL (used when reading comments outside paragraphs)
- COM (used for reading /* ... */ comments)
- PAR (used for reading paragraphs)
- INCL (used for reading file names after either 'read' or 'replaced by')
- STR (used for reading strings)
- STRTL (used when a string is too long)
- CONF (used for reading the configuration file)
- CONFCOM (used for reading comments within the configuration file)
The names of all the tokens returned by the lexer are prefixed by 'yy__'. They are
declared in 'grammar.y'.
*** (3.2) Finding source files.
The compiler must identify files clearly, not only by the name, because different files
in different directories may have the same name. Notice that the path which is given
after the keyword 'read' is not enough for identifying the file. This is because this
path is itself relative.
The compiler keeps in memory the absolute path of a 'current' directory. Notice that
this current directory is a notion internal to Anubis. It has nothing to do with the
current directory in the sense of the host operating system. Anyway, the Anubis
compiler never changes the current directory in the sense of the host system.
When the compiler starts, the 'current' directory is the directory from which the
Anubis compiler has been launched.
When the compiler needs to find the source file corresponding to a path, either the
path given on the command line, or a path found after a keyword 'read' or 'replaced
by', it looks for the file using the following absolute paths successively:
<current directory>/<given path>
<my_anubis directory>/<given path>
<anubis directory>/<given path>
except if <given path> is absolute. In this case only this path is used.
When the file has been found, the directory of this file becomes the new 'current'
directory.
*** (3.3) The 'read' stack.
(also called 'include stack' in the source of the compiler).
Of course, when a 'read' or 'replaced by' keyword is encountered, the compiler must
read another file, but must be able to come back for reading the rest of the current
file. This is why it needs a stack.
Each 'frame' in this stack contains the following information:
- a datum of type YY_BUFFER_STATE, needed by FLEX/BISON for each file,
- the absolute path of the file currently read
- the current line number in the file currently read,
- the absolute path of the 'current' directory for the file currently read.
Note: at the time of writing this, this stack is actually made of 4 distinct stacks,
one for each item. These stacks grow in parallel, and hence share a unique stack
pointer. It would be a good idea to replace this by a single stack whose slots would be
a C structure.
When the end of a file is reached, a frame is popped from the stack, and the reading of
the 'calling' file may be continued.
At the time of writing this text, this stack may contain at most 100 frames.
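The single-structure stack suggested in the note above could look like this. This is a
sketch: the field names and the 'void *' stand-in for YY_BUFFER_STATE are ours.

```c
#include <assert.h>
#include <string.h>

enum { MAX_INCLUDE_DEPTH = 100 };   /* current limit: 100 frames */

/* One frame per file being read, instead of four parallel stacks. */
typedef struct {
    void *flex_buffer;              /* YY_BUFFER_STATE in the compiler */
    char file_path[1024];           /* absolute path of the file read */
    int  line_number;               /* current line in that file */
    char current_dir[1024];         /* 'current' directory for it */
} ReadFrame;

static ReadFrame read_stack[MAX_INCLUDE_DEPTH];
static int read_sp = 0;             /* single shared stack pointer */

/* Push a frame when a 'read' or 'replaced by' is encountered. */
static int push_frame(const char *path, const char *dir, int line) {
    if (read_sp >= MAX_INCLUDE_DEPTH) return 0;   /* too deep */
    ReadFrame *f = &read_stack[read_sp++];
    f->flex_buffer = NULL;
    strncpy(f->file_path, path, sizeof f->file_path - 1);
    f->file_path[sizeof f->file_path - 1] = 0;
    strncpy(f->current_dir, dir, sizeof f->current_dir - 1);
    f->current_dir[sizeof f->current_dir - 1] = 0;
    f->line_number = line;
    return 1;
}

/* At end of file, pop a frame and resume the 'calling' file. */
static ReadFrame *pop_frame(void) {
    return read_sp > 0 ? &read_stack[--read_sp] : NULL;
}
```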
*** (3.4) Remembering already read files.
The compiler remembers which files have been read. Indeed, during the same compilation,
it may encounter several times a request for reading the same file. Of course, the file
is read only once, and the compiler has just to adjust visibility.
The absolute paths of the files already read are stored into the array
'already_included' (of at most 'max_already_included' = 1024 paths, at the time of
writing this text).
*** (3.5) Computing visibilities.
A file B is visible from a file A only if the file A contains one of the paragraphs:
read B
replaced by B
Normally, a file containing a 'replaced by' paragraph does not contain any other
paragraph (but it may contain off paragraph comments). This kind of file is called a
'relay' file, and is used when the location of a file of the distribution is changed.
'read' is not transitive. In other words, if we have
read B
in file A, and
read C
in file B, this does not make file C visible from file A. However, if we have:
read B
in file A and
replaced by C
in file B, C is visible from file A.
The visibility information is stored in several arrays. First of all, notice that
each file is identified by an 'id' which is the position of its path in the array
'already_included'. Notice that 'predefined.anubis' (which is 'pre-read') has id 0.
- the array 'number_of_visible' gives for each file id, the number (of type U16) of
files which are visible from this one.
- the array 'is_relay_file' is an array of booleans (actually an array of U8). For
each file it indicates if the file is a relay file or not.
- the two dimensional array 'visibility'. This array provides some kind of
'visibility stack' for each file id. For example, if the number of files visible from
the file whose file id is idA is 3, and if the ids of these files are idA, idB and
idC, we have:
visibility[idA][0] = idA
visibility[idA][1] = idB
visibility[idA][2] = idC
Notice that since a file is always visible from itself, the first entry into this stack
always contains the id of the file to which this stack belongs.
When file B becomes visible from file A, idB is pushed into the visibility stack of
file A (if not already present). Such stacks are never popped, because if B is
visible from A, it remains visible indefinitely. Notice that such stacks never need to
be enlarged, because the number of files read cannot be greater than
'max_already_included', which is the size of these stacks.
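A minimal sketch of these arrays and of the push-if-absent discipline. The array names
and sizes follow the text, but the code itself is ours, not the compiler's.

```c
#include <assert.h>

typedef unsigned short U16;
typedef unsigned char U8;

enum { MAX_FILES = 1024 };   /* 'max_already_included' */

/* Number of files visible from each file id. */
static U16 number_of_visible[MAX_FILES];
/* Whether each file is a relay file ('replaced by' only). */
static U8  is_relay_file[MAX_FILES];
/* Visibility stack of each file id. */
static U16 visibility[MAX_FILES][MAX_FILES];

/* A file is always visible from itself, so its own id is the
   first entry of its visibility stack. */
static void register_file(U16 id) {
    visibility[id][0] = id;
    number_of_visible[id] = 1;
}

static int is_visible(U16 a, U16 b) {
    for (U16 i = 0; i < number_of_visible[a]; i++)
        if (visibility[a][i] == b) return 1;
    return 0;
}

/* Make file 'b' visible from file 'a' (push if not already there).
   The stack is never popped: visibility is permanent. */
static void make_visible(U16 a, U16 b) {
    if (!is_visible(a, b))
        visibility[a][number_of_visible[a]++] = b;
}
```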
So, the question is just to know when a file becomes visible from another one. First of
all, each file must be visible from itself. Hence this is registered the first time the
file is encountered. As a particular case, 'predefined.anubis' is encountered before
any other one. Furthermore, it must be visible from all other files.
When a relay file becomes visible from some file, it must transmit its visibility to
the file it is pointing to.
*** (4) The parser.
The parser (grammar.y) is a BISON file. It provides the function 'yy_parse' used in
'main.cpp'. It contains essentially the following parts:
- Declaration of tokens and non terminals with their types.
- Precedence and association rules.
- Grammar rules.
The rules of precedence and association should not be changed (and their order is
important) because if they are changed, files which are currently compiling ok could
have syntax errors.
Each grammar rule has an 'action' part (between { and }) which constructs the
syntactical tree corresponding to the rule. These actions are the official source of
information for the Lisp-like format of terms.
*** (5) Computing interpretations of terms.
*** (5.1) Overview.
The Anubis language allows overloading of symbols and type parameters. These features
introduce many ambiguities which must be resolved.
A term may have several distinct interpretations. For example, when you write the
number:
2
it has at least 3 interpretations, one as an 'Int8', one as an 'Int32' and one as a
'Float'. Similarly, a symbol or a binary operator like '+', may have several
definitions, hence several interpretations.
When atomic terms (like numbers and symbols) are combined together to form compound
terms, like applicative terms, conditionals, etc., the compound term may itself have
several interpretations depending on which interpretation we choose for the atomic
subterms of this term.
Hence, what we need is a (necessarily recursive) function for computing the list of all
interpretations of a term. This function is called 'term_interpretations' and has
several auxiliary functions. All these functions are defined in 'interp.c'.
Also, terms must be interpreted relative to a 'context'. A context is normally just a
set of declarations. But in the case of the Anubis compiler, a context is more
precisely an accurate representation of what will be the top of stack at run time. Such
a context is represented as a list, with the head of list being the most recently
pushed item.
For example, if we want to interpret a term like this one in some context C:
with x = a, t
we must first get the list of all interpretations of 'a' relative to C. For each
interpretation of 'a', we may compute the list of interpretations of 't', but it must
be relative to '((x . T) . C)', where 'T' is the type found for this interpretation of
'a', because after 'a' is computed (at run time) its value is pushed into the stack.
(Actually, in the case of this example, the compiler requires that 'a' has only one
interpretation. This is for making the source files more explicit and easier to read
for humans.)
*** (5.2) Interpreting applicative terms.
An applicative term is just made of a function applied to arguments. For example, it
may be:
f(a,b,c)
(or 'a + b'). Such a term is represented internally as:
(app <lc> [f] . ([a] [b] [c]))
where <lc> is a Lisp integer ('Expr') from which the name of the source file, together
with the line and the column of the term may be recovered.
In order to interpret this term, the compiler must first compute all interpretations of
the function itself. It also computes all the interpretations of the tuple of
arguments. This may lead to a very big number of interpretations. For example, if 'a',
'b' and 'c' have respectively 2, 4 and 7 interpretations, this yields 56
interpretations for the tuple '(a,b,c)'. If the number of interpretations of a tuple
exceeds, say, one million, the compiler may become very slow. The remedy (coming from
the user) is to type each argument explicitly in order to reduce this number of
interpretations.
I've tried to proceed another way. Since for each interpretation of the function, we
get types for the arguments, I've tried to recompute the interpretations of the typed
arguments in each case. But experiments showed that this is even slower. For the
future, we need some more clever mechanism. This problem seems to be similar to logical
problems encountered in optimisations of SQL queries.
Now, for each interpretation of the function and, each interpretation of the tuple of
arguments, we must unify the signature of the function (type of the tuple of arguments)
with the type of the actual tuple of arguments. If this unification fails, this
combination is impossible. This is how the number of interpretations may stay at a
reasonable level.
*** (5.3) Interpreting conditionals.
The test is first interpreted. The compiler requires (again for making the source files
more explicit and easier to read for humans) that the test has one and only one
interpretation. For each interpretation of the test, the type of the test is known. Of
course all interpretations whose type is not a type with alternatives are
rejected. However, components in this type may still be unknown, what matters being
that the alternatives are known.
Next, the compiler analyses the cases. We must have one case per alternative, and in
the same order.
For each alternative, the compiler checks that the name of the case is the same as the
name of the alternative, and that the number of resurgent symbols is the same as the
number of components. If a resurgent symbol is explicitly typed, its type is unified
with the type of the corresponding component. If this unification fails, an error
message is sent.
For each case, the compiler constructs a new context for interpreting the body of the
case. Indeed, resurgent symbols are declarations which enrich the context. This may
yield several interpretations for each body of case. The compiler constructs the list
of all interpretations of the tuple of bodies. Again (as for applicative terms) this
may lead to a big number of interpretations and slow down the compiler or even exhaust
the memory.
For each such interpretation of the tuple of bodies, the types of the bodies are
unified. This is because in a conditional, the bodies must all have the same type
(which will be the type of the conditional). When the unification fails, the tuple is
rejected. This leads in general to a reasonable number of interpretations of the tuple
of bodies. Here we could easily be more clever by unifying interpretations of bodies as
soon as they are computed, i.e. unifying the types of the first and second bodies, then
the type of the third body with the type of the first two and so on.
*** (5.4) Interpreting selective conditionals.
The selected alternative is found using the name of the head of case. If there are
several alternatives with this name, the compiler checks the number of resurgent
symbols. If there are several alternatives with the same name and same number of
resurgent symbols, the compiler uses the declarations (if any) of the types of the
resurgent symbols. These types are unified with the actual types taken from the
definition of the type of the test.
Of course, if there is an 'else' case, the type of this case must be the same as the
type of the body of the selected case. This is checked by unification.
*** (6) Implementation of types.
Notice that we implement only 'actual types', not type schemes (for the time being; see
the discussion on quantified types). An actual type is a type containing no unknowns
and no parameters.
Types in Anubis 1 are of the following sorts:
- primitive types: String, Int32, Listener, ...
- defined types: types which are defined by a paragraph, even if this paragraph is in
predef.c or predefined.anubis.
- functional types: (T_1,...,T_k) -> U
- address types: RAddr(T), WAddr(T), RWAddr(T), GAddr(T) (obsolete), Var(T), MVar(T).
- structure pointer types: These types are used for encapsulating C structures
coming from linked libraries (like SSL or SQLite3).
*** (6.1) Implementation of primitive types.
There is a particular implementation for each primitive type.
*** (6.1.1) 'String'.
A string in Anubis 1 is implemented as a pointer:
*
|
V
+-------+-------------------------------+---+
| cnt | a b c d .... | 0 |
+-------+-------------------------------+---+
0 4
The first 4 bytes are the counter for the garbage collector. Notice that strings which
are defined statically in the source files have a 0 at this place, meaning that the
string is permanent (never collected).
Starting at offset 4 are the characters of the string. A trailing 0 is added at the
end, like in C.
*** (6.1.2) 'ByteArray'.
A byte array is a pointer as follows:
*
|
V
+-------+-------+-------------------------------+
| cnt | n | a b c d .... |
+-------+-------+-------------------------------+
0 4 8
where again 'cnt' is the counter, and 'n' the number of bytes in the byte array (not
including 'cnt' and 'n'). The bytes are stored starting at offset 8. There is no
trailing 0.
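The two layouts above can be illustrated with small allocation helpers. The helper
names are ours; only the offsets (counter at 0, data at 4 or 8) come from the text.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef unsigned int U32;

/* Allocate an Anubis-style string: 4-byte counter, then the
   characters, then a trailing 0 like in C. */
static char *anubis_string(const char *s, U32 cnt) {
    size_t n = strlen(s);
    char *p = malloc(4 + n + 1);
    memcpy(p, &cnt, 4);            /* offset 0: collector counter */
    memcpy(p + 4, s, n + 1);       /* offset 4: chars + trailing 0 */
    return p;
}

static const char *string_chars(const char *p) { return p + 4; }

/* Allocate an Anubis-style byte array: 4-byte counter, 4-byte
   length, then the bytes (no trailing 0). */
static unsigned char *anubis_byte_array(const unsigned char *b,
                                        U32 n, U32 cnt) {
    unsigned char *p = malloc(8 + n);
    memcpy(p, &cnt, 4);            /* offset 0: collector counter */
    memcpy(p + 4, &n, 4);          /* offset 4: number of bytes   */
    memcpy(p + 8, b, n);           /* offset 8: the bytes         */
    return p;
}

static U32 byte_array_length(const unsigned char *p) {
    U32 n; memcpy(&n, p + 4, 4); return n;
}
```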
*** (6.1.3) 'Int32'.
An 'Int32' is just a C 'int' (of 32 bits of course), i.e. a 32 bits word containing
the number itself.
*** (6.1.4) 'Float'.
Because data in Anubis 1 cannot have more than 32 bits, double precision floating point
numbers are implemented indirectly (which is very bad for performance):
*
|
V
+--------+-----------------+
| cnt | IEEE754 number |
+--------+-----------------+
0 4 12
where again 'cnt' is the counter for the garbage collector.
*** (6.1.5) 'Listener'.
Despite the fact that 'Listener' is a primitive type, it is implemented exactly as
network address types. See below.
*** (6.2) Implementation of defined types.
Defined types (types with alternatives, defined by paragraphs beginning by 'type' or
'public type') are implemented according to 3 different methods: 'small', 'mixed' and
'large'.
First of all, the width of the type is computed. The width is the number of bits
required for accommodating all data of this type. It is computed as follows.
Some kind of base 2 logarithm of the number of alternatives is computed. This is the
smallest number of bits required for accommodating all alternative numbers (also called
'indexes'). It is:
0 for at most 1 alternative (example: 'Empty' or 'One')
1 for 2 alternatives (example: 'Bit' or 'Bool')
2 for 3 to 4 alternatives
3 for 5 to 8 alternatives
4 for 9 to 16 alternatives
etc...
Since the maximal number of alternatives is limited to 256, the width of an index field
is at most 8.
This logarithm of the number of alternatives is called the 'index width' for this type.
Now, each alternative has its own width. It is the sum of the index width of the type,
and the widths of all components of the alternative. For example, the type
'Maybe(Int8)' has 2 alternatives:
failure
success(Int8)
the index width is 1 (2 alternatives). The width of the first alternative is 1 (= 1 +
0; no component), and the width of the second alternative is 9 (= 1 + 8), because
'Int8' is of width 8 (as expected).
Now, the width of the type is the supremum of the width of its alternatives (0 for
empty types). So, it is 9 for 'Maybe(Int8)' for example.
These computations are performed in 'typewidth.c'.
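The computation described above can be sketched as follows. The helper names are ours;
the real computation lives in 'typewidth.c'.

```c
#include <assert.h>

/* Index width: the smallest number of bits accommodating all
   alternative indexes (a kind of base 2 logarithm). */
static int index_width(int alternatives) {
    int w = 0;
    while ((1 << w) < alternatives) w++;
    return w;
}

/* Width of the type: the supremum of the widths of its
   alternatives, each being index width + sum of the widths of the
   alternative's components. 'component_sums[i]' is that sum for
   alternative i. Returns 0 for empty types. */
static int type_width(int alternatives, const int *component_sums) {
    int iw = index_width(alternatives), sup = 0;
    for (int i = 0; i < alternatives; i++) {
        int w = iw + component_sums[i];
        if (w > sup) sup = w;
    }
    return sup;
}
```

For 'Maybe(Int8)' this gives an index width of 1 and a type width of 9, matching the
worked example above.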
Notice that this width is not well defined in the case of a recursive type. For a
recursive type (or a cross recursive type), the width is infinite (which corresponds to
the fact that the type itself is infinite (if it is not empty)). This is why we do not
compute the width of recursive types. Hence, we need a tool for detecting if a type is
recursive (maybe cross recursive through other types). This tool is defined in
'rectype.c'. Notice that a recursive type which is empty (see may happen: see the
discussion on 'NonEmptyList' in 'predefined.anubis') is still considered as recursive
by Anubis 1, and its width is still infinite.
The width of a type is used to determine the number of bits needed for holding a datum
of this type. Above a certain limit, and in particular for recursive types, the type
must be implemented indirectly, i.e. through a pointer to a segment.
The width discussed above will be called the 'theoretical width'.
Hence, we have another notion: that of 'real width'. The real width, for Anubis 1, is
an integer between 0 and 33 inclusive. The value 33 means that the theoretical width is
at least 33, and that the type will be implemented indirectly. Values from 0 to 32 mean
that the type will be implemented directly within that number of bits.
Of course, allowing 64 bits or 128 bits direct data in the future, means updating all
these programs.
If the real width of a type is at most 32, the type is 'small'. It cannot be recursive,
and all its components, and the components of its components, and so on, are small. Any
datum of this type may be represented on 'width' bits.
If the real width is 33, there are two cases:
- the number of alternatives is at most 4: the type is 'mixed'.
- the number of alternatives is 5 or more: the type is 'large'.
If a type is large, all data of this type are represented indirectly, i.e. as a
pointer to a memory segment:
*
|
V
+----------+---+--------------+--------+---------------+
| cnt | i | c1 | c2 | c3 |
+----------+---+--------------+--------+---------------+
0 4 5
The first 4 bytes of the segment are always occupied by the counter ('cnt') of the
garbage collector. The next byte (at offset 4) is the alternative index for this
datum. It indicates to which alternative the datum belongs. Starting at offset 5 are
the representations of the components of the datum.
Since all pointers returned by the memory allocator are multiples of 4 (the two least
significant bits are always 0), these two bits are available for any purpose. I've
decided to use them for storing the alternative index (this is why this method cannot
be applied if there are more than 4 alternatives). This has the advantage of avoiding a
pointer to a segment for some alternatives. For example, the type 'List(T)' (for any
type 'T') is mixed. The first alternative has width 1, the second alternative has an
infinite width. Hence, it is tempting to avoid the use of an indirection for the first
alternative. This is what the 'mixed' implementation realizes.
Hence, the empty list (for any type 'T') is just the 32 bits word 0, while a non empty
list looks like this:
pppppppp pppppppp pppppppp pppppp01
where the least significant two bits (01) indicate alternative number 1 (second
alternative), and where:
pppppppp pppppppp pppppppp pppppp00
is the pointer:
*
|
V
+----------+-------+---------------+
| cnt | head | tail |
+----------+-------+---------------+
0 4
Notice that there is no alternative number at offset 4 (since it is already in the
least significant bits of the datum itself), and that components are stored starting at
offset 4. 'tail' always occupies 4 bytes, but 'head' may occupy 0, 1, 2 or 4 bytes.
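The tagging trick for mixed types can be sketched with a few macros. The macro names
are ours; only the use of the two low pointer bits comes from the text.

```c
#include <assert.h>
#include <stdint.h>

/* Pointers from the allocator are multiples of 4, so the two
   least significant bits are free to hold the alternative index
   (which is why 'mixed' needs at most 4 alternatives). */
#define MIXED_INDEX(w)   ((uint32_t)(w) & 3u)    /* alternative number */
#define MIXED_POINTER(w) ((uint32_t)(w) & ~3u)   /* segment pointer    */
#define MIXED_MAKE(p, i) ((uint32_t)(p) | (uint32_t)(i))
```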
*** (6.3) Implementation of functional types.
Functions are implemented in two different ways depending on how they are defined.
1. If the function is defined by a paragraph, it is just represented by the address
of its code.
2. If the function is defined by the arrow '|->' (or by the recursive arrow '|-f->'),
it is represented as a 'closure'. A closure is a pointer to a segment like this:
*
|
V
+----------+----------+----------+----------+----------+----------+...
| cnt | code | del | x1 | x2 | x3 |
+----------+----------+----------+----------+----------+----------+... etc
0 4 8 12 16 20 24
where 'cnt' is the counter of the garbage collector as usual. All slots in a closure
occupy 4 bytes. The slot 'code' is the address of the code of the function. The slots
'x1', 'x2', etc... contain the values of the symbols which were present in the context
at the time the function (i.e. the closure) was created, and which are referred to from
within the body of the function definition. This part of the closure is called the
'micro context'. The slot 'del' contains the address of the garbage collector code able
to properly destroy the closure itself.
When we have a function at hand (i.e. a 32 bits word representing a datum whose type is
functional), we must be able to decide if the function is implemented as a simple code
address or as a closure. Again we take advantage of the fact that pointers to memory
segments are divisible by 4. If the pointer is even (least significant bit = 0), the
function is a closure. If it is odd (least significant bit = 1) it is a simple code
address. For this reason the code of a function defined at top level (by a paragraph)
is always positioned at an odd address. This is the reason for the existence of the
instruction 'odd_align' (which is of course of size 1 and never executed).
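The parity test can be written as follows (the function names are ours):

```c
#include <assert.h>
#include <stdint.h>

/* An even word is a pointer to a closure segment (allocator
   pointers are multiples of 4); an odd word is the address of
   odd-aligned top-level code. */
static int is_closure(uint32_t fn)      { return (fn & 1u) == 0; }
static int is_code_address(uint32_t fn) { return (fn & 1u) == 1; }
```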
*** (6.4) Implementation of address types.
There are two sorts of address types:
1. File and network address types: 'RAddr(T)', 'WAddr(T)', 'RWAddr(T)', 'Listener'
2. Dynamic variable address types: 'Var(T)', 'MVar(T)'.
We ignore GAddr(T), which is obsolete.
*** (6.4.1) File and network address types.
A datum of such a type is a pointer:
*
|
V
+----------+---+---+---+---+---------------------------------+
| cnt | t | m | _ | _ | 12 bytes |
+----------+---+---+---+---+---------------------------------+
0 4 5 6 7 8 20
As usual the first 4 bytes are for the counter of the garbage collector. The next four
bytes are:
t type indicator
m mode indicator
_ not used
_ not used
The next 12 bytes starting at offset 8 are used depending on the value of the type
indicator.
The type indicator may take the following values ('conn' is for 'connection', a synonym
(!?) of 'address'):
conn_invalid invalid connection (should never happen)
conn_file file on disk
conn_listener listening TCP/IP connection (server: type 'Listener')
conn_network client TCP/IP connection
conn_ssl_network client SSL connection
conn_local (obsolete: local variable)
conn_global (obsolete: global variable)
conn_stdin standard input
conn_stdout standard output
conn_stderr standard error output
The values of the above are defined in 'vm.h'.
The mode is an OR of the following flags:
conn_read readable connection
conn_write writable connection
conn_read_pw (experimental: connection with password access)
conn_write_pw (idem)
in_progress (connection not yet up)
The 12 bytes after offset 8 are used as follows:
Files (conn_file):
+---+---+---+---+---+---+---+---+---+---+---+---+
| * | | | | | | | | |
+---+---|---+---+---+---+---+---+---+---+---+---+
V
FILE (in the sens of C)
Network (conn_network, conn_listener):
+---+---+---+---+---+---+---+---+---+---+---+---+
| socket handle | | | | | | | | |
+---+---+---+---+---+---+---+---+---+---+---+---+
SSL connection (conn_ssl_network):
+---+---+---+---+---+---+---+---+---+---+---+---+
| (SSL *) | TCP/IP handle | (char *) |
+---+---|---+---+---+---+---+---+---+---|---+---+
V V
SSL struct of OpenSSL server common name
Standard connections (conn_stdin, conn_stdout, conn_stderr):
+---+---+---+---+---+---+---+---+---+---+---+---+
| | | | | | | | | | | | | (not used)
+---+---+---+---+---+---+---+---+---+---+---+---+
*** (6.4.2) Dynamic variable address types.
A simple dynamic variable (type 'Var(T)') is a pointer:
*
|
V
+-----------+-----------+-----------+-----------+
| cnt | value | n | * |
+-----------+-----------+-----------+-----|-----+
0 4 8 12 | 16
V
monitors table
The segment contains four 32 bits words:
- at offset 0: the counter
- at offset 4: the current value of the variable
- at offset 8: size of monitors table (number of slots)
- at offset 12: pointer to monitors table
'n' should be 0 exactly when the pointer to monitors table is NULL. When the variable
is created (by instruction 'create_var'), n is 0 and the pointer to monitors table is
NULL.
A monitor is just a datum of type 'One -> One' (most often it is a closure).
Multiple dynamic variables (type 'MVar(T)'). This is again a pointer:
*
|
V
+----------+----------+----------+----------+----------+-----------------------
| cnt | ns | nm | * | nbw | data ...
+----------+----------+----------+----|-----+----------+-----------------------
0 4 8 12 V 16 20
monitors table
In this segment, we have:
- at offset 0: garbage collector counter
- at offset 4: number of slots for data in the variable
- at offset 8: number of slots for monitors
- at offset 12: pointer to the table of monitors
- at offset 16: normalized bit width (see below)
- at offset 20: data
A multiple dynamic variable is nothing else than a 'mutable array'. This array has a
size (number of data of type T that we can put in the array).
Normally, we should arrange things so that the data slots have just the size required
for type T. For example, if we have a multiple variable of type 'MVar(Bool)', each slot
should occupy just 1 bit, not 32 bits. If we have a multiple variable of type
'MVar(Maybe(Int8))', each data slot should occupy 9 bits.
Because of byte boundaries, we use only powers of 2 for the number of bits of a
slot. Hence, for example, in a multiple variable of type 'MVar(Maybe(Int8))', each data
slot will be of 16 bits, the least number of bits which is a power of 2 and at least
equal to 9.
This number of bits is called the 'normalized bit width'. It may be computed as follows
in C:
if (bw <= 0) nbw = 0; else
if (bw <= 1) nbw = 1; else
if (bw <= 2) nbw = 2; else
if (bw <= 4) nbw = 4; else
if (bw <= 8) nbw = 8; else
if (bw <= 16) nbw = 16; else
nbw = 32;
where 'bw' is the real bit width, and 'nbw' the normalized bit width.
However, for the time being, Anubis 1 sets nbw = 32. This should be worked out, and
maybe we should also consider 24 as an acceptable normalized bit width.
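For testing purposes, the same chain of ifs can be wrapped in a function (the wrapping
is ours; the logic is exactly the one above):

```c
#include <assert.h>

/* Round a real bit width 'bw' up to the next power of 2 (capped
   at 32), as described in the text. */
static int normalized_bit_width(int bw) {
    if (bw <= 0)  return 0;
    if (bw <= 1)  return 1;
    if (bw <= 2)  return 2;
    if (bw <= 4)  return 4;
    if (bw <= 8)  return 8;
    if (bw <= 16) return 16;
    return 32;
}
```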
*** (6.5) Implementation of C structure pointer types.
A datum of type '£StructPtr(Name)' is as follows:
*
|
V
+--------+--------+
| cnt | * |
+--------+----|---+
0 4 | 8
V
C structure
i.e. it is a pointer to a segment of 8 bytes containing only a counter for the
garbage collector, and a pointer to the actual C structure of name 'Name'. Names of C
structures are declared in 'bytecode.h'.
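In C, this 8-byte segment can be sketched as follows (field names are our own; the second word is a 32-bit pointer since the VM targets a 32-bit machine):

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical C view of a '£StructPtr(Name)' datum. */
typedef struct {
    uint32_t cnt;     /* offset 0: garbage collector counter           */
    uint32_t cstruct; /* offset 4: pointer to the actual C structure   */
} StructPtrSegment;
```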
*** (7) Code generation.
The function 'compile_term' (in 'compile.c') generates byte code from
interpretations. Below, we describe how this code is generated.
*** (7.1) General conventions.
Code generation is a recursive process (just because interpretations themselves are of
a recursive nature). The code resulting from the compilation of an interpretation
should always behave the same way with respect to the stack and the register R. Our
conventions are the following.
First of all, consider an interpretation '(head . env)'. We may of course replace all
occurrences of a type unknown in 'head' by its value as given by 'env'. Hence, we
consider only interpretation heads, and assume that they contain no unknowns.
An interpretation head must be compiled relative to a context. This context is a
precise description of what the stack will contain at run time. We represent contexts
by lists, the head of the list being the top of the stack. Each element in the list
corresponds to a slot in the stack. Furthermore, the stack in Anubis 1 is homogeneous:
each slot is a 32 bits word, able to contain a datum of any type.
The Exprs which are listed in a context are the following:
(<symbol> . <type>)
This means that this slot contains a datum of type <type>, and that its name is
<symbol>.
(ret . <arity>)
This means that this slot contains a return address, corresponding to a call to a
function of <arity> arguments.
(f_micro_ctxt <fname> <ftype> (<symbol> . <type>) ...)
This slot contains a 'closure' (i.e. function defined by '|->'). '<fname>' is either
'nil' (in the case of '|->'), or "f" (in the case of '|-f->'). '<ftype>' is the type of
the function. The list of '(<symbol> . <type>)' is the micro context proper. It
indicates in which order the data remembered by the function are placed in the memory
segment of the closure.
(argument . <type>)
This means that the slot is occupied by a datum which has been pushed into the stack
for being an argument to a function call. This datum is anonymous, but its type is
known.
For example, you may see below how the compiler prints the context:
(
(argument . "SSL_Connection")
(argument . "Bool")
(argument . "DenialOfService")
(ret . ...)
("connection" . "SSL_Connection")
(f_micro_ctxt nil ... ("sites" app_ts "List" "Web_Site_Description")
("dos" . "DenialOfService"))
(ret . ...)
)
into a '.sc' file:
| ;--- stack ------------------------
| ; 0 SSL_Connection
| ; 1 Bool
| ; 2 DenialOfService
| ; 3 (return address)
| ; 4 connection SSL_Connection
| ; 5.0 sites List(Web_Site_Description)
| ; 5.1 dos DenialOfService
| ; 6 (return address)
| ;----------------------------------
If '<head>' is an interpretation head, and 'Ctxt' is a context, we denote by:
[<head>]Ctxt
the sequence of instructions resulting from compiling '<head>' relative to the context
'Ctxt'.
Convention 1: The stack after execution of the sequence of instructions in
'[<head>]Ctxt' is the same as before this execution.
Convention 2: After the execution of '[<head>]Ctxt', R contains the value of '<head>'.
*** (7.2) Elimination of terminal recursion.
Since the language has no notion of loop, it is of fundamental importance to eliminate
terminal recursion. Roughly speaking, this amounts to replacing the sequence:
(apply . k)
(ret . 0)
by:
(collapse . k)
(apply . k)
Indeed, when the first sequence is about to be executed, the situation is as follows:
stack R
--------------------------------------------
a_1 ... a_k r s ... f
where 'a_1,...,a_k' are the arguments of the function, 'r' the return address and 'f'
the function itself. 's' is the return address of the previous call.
'(apply . k)' will perform a jump to the code of the function (found in 'R'), and after
the return from this function call, 'a_1 ... a_k r' have been removed from the stack
which becomes:
s ...
Then the instruction '(ret . 0)' removes 's' from the stack and jumps to address 's',
so that the stack becomes:
...
The other sequence of instructions works as follows. '(collapse . k)' removes 'r' from
the stack, which becomes:
a_1 ... a_k s ...
Then '(apply . k)' calls the function 'f', and after the return from this call, the
stack is:
...
as expected.
Actually, all this does is avoid performing the instruction '(ret . ?)' twice in a row
(once with address 'r', and once with address 's').
Actually, the situation is more complex because, in general we have the arguments of
a previous call between 'r' and 's', so that the stack looks like this:
a_1 ... a_k r b_1 ... b_i s ...
When we return from the call to 'f', the stack becomes:
b_1 ... b_i s ...
and the next instruction cannot be '(ret . i)' (unless i == 0), because the top of
stack 'b_1' is not a return address. In this situation, terminal recursion can still
be eliminated. If the recursion is terminal, the next instructions are:
(del_stack ...)
...
(del_stack ...)
(ret . i)
(as many 'del_stack' or 'collapse' instructions as there are arguments, i.e. 'i'). At
that point the machine deletes the arguments from the stack and performs the 'ret'.
So the general situation is the following sequence of instructions:
(apply . k)
(del_stack_? . <depth>)
(collapse . <depth>)
...
(ret . i)
and the stack is the following:
stack R
--------------------------------------------
a_1 ... a_k r b_1 ... b_i s ... f
We replace this sequence by the following one:
(collapse . k)
(del_stack_? . k+<depth>)
(collapse . k+<depth>)
...
(apply . k)
Here is how it works:
Instructions Stack
----------------------------------------------------------------------------------
a_1 ... a_k r b_1 b_2 b_3 ... b_i s ...
(collapse . k) a_1 ... a_k b_1 b_2 b_3 ... b_i s ...
(del_stack_? . k+<depth>) a_1 ... a_k b_2 b_3 ... b_i s ...
(collapse . k+<depth>) a_1 ... a_k b_3 ... b_i s ...
...
a_1 ... a_k s ...
(apply . k) ...
The main difference with the original code is just that 'b_1 ... b_i' are deleted
before the call to 'f'.
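The simplest case of this transformation, '(apply . k) (ret . 0)' becoming '(collapse . k) (apply . k)', can be sketched as a peephole pass over an instruction list. The opcode names and the 'Instr' representation below are our own, not those of 'bytecode.h'; the general case with intervening 'del_stack'/'collapse' instructions is more involved.

```c
/* Our own mock opcodes; the real ones live in bytecode.h. */
enum { OP_APPLY, OP_RET, OP_COLLAPSE };

typedef struct { int op; int arg; } Instr;

/* Minimal sketch of the simplest case of terminal-recursion
   elimination:
       (apply . k) (ret . 0)   ==>   (collapse . k) (apply . k)   */
static void eliminate_tail_call(Instr *code, int n) {
    for (int i = 0; i + 1 < n; i++) {
        if (code[i].op == OP_APPLY
            && code[i + 1].op == OP_RET && code[i + 1].arg == 0) {
            int k = code[i].arg;
            code[i]     = (Instr){ OP_COLLAPSE, k };
            code[i + 1] = (Instr){ OP_APPLY, k };
        }
    }
}
```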
*** (7.3) Generating the code.
*** (7.3.1) Protect.
The interpretation head has the following form:
(protect <lc> . <head>)
The code generated is:
[(protect <lc> . <head>)]Ctxt =
label a:
protect
[<head>]Ctxt
unprotect a
<end code>
The 'protect' instruction is of size 2 (2 bytes). The first byte is the instruction
name 'protect', and the second byte is the 'lock'. This byte is zero when the code is
generated.
At run time, the instruction 'protect' checks the value of the lock (which may be 0 or
1). If it is 1, the subsequent code is locked, and the instruction gives up. If it is
0, the piece of code is not locked. So the instruction puts 1 in the lock so that the
subsequent code is now locked, and IP (Instruction Pointer) is incremented by 2.
The subsequent code '[<head>]Ctxt' is executed. Then the instruction 'unprotect' is
reached. This instruction puts 0 into the byte at offset 'a+1' (i.e. into the lock) so
that the code is now unlocked.
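The two instructions can be sketched as interpreter-loop handlers. The function names and the "return NULL to give up" convention are our own simplifications:

```c
#include <stdint.h>

/* protect: 'ip' points at the instruction byte; the lock is the byte
   just after it.  Returns the new instruction pointer, or NULL to
   signal "give up" (the code is already locked). */
static uint8_t *exec_protect(uint8_t *ip) {
    if (ip[1] == 1) return 0; /* lock is 1: code is locked, give up */
    ip[1] = 1;                /* lock the subsequent code           */
    return ip + 2;            /* the instruction is 2 bytes         */
}

/* unprotect: clears the lock byte at offset a+1. */
static void exec_unprotect(uint8_t *a) {
    a[1] = 0;                 /* the code is now unlocked           */
}
```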
*** (7.3.2) Lock.
*** (7.3.3) Global symbols.
Global symbols are those which are defined by paragraphs.
*** (7.3.4) Strings, Int32, small data and Floats.
*** (7.3.5) Local symbols.
*** (7.3.6) Microlocal symbols.
*** (7.3.7) Closures.
The code generated must construct a closure. Hence, it begins with the instruction 'alloc'.
[(closure <lc> (f_micro_ctxt fname ftype (sym . type)...) <args> . <body>)]Ctxt =
alloc 3+n 32 bits words (n = size of micro context)
put_copy y_0 in closure (at 3+0 word)
...
put_copy y_n-1 in closure (at 3+n-1 word)
put_closure_labels a d
<end code>
'n' is the number of values to be put in the micro context. Each value is taken from
the stack, virtually copied (if needed), and put at its place in the micro context.
This is what the instructions 'put_copy ...' are doing. However these instructions
are specialised depending on the type of the datum to be put. The function
'get_closure_copy' returns the list of these instructions. It receives the description
of the micro context of the closure to be constructed, and the current context
(description of the current stack).
When this is done, the closure is almost ready. Only the code of the function and the
code for garbage collection of the closure are missing. These codes are put into the
closure by the instruction 'put_closure_labels'.
Of course, these two codes must also be generated by the compiler. The code of the
function is generated as follows:
label a:
[<body>]<args>+(<micro_context>)+(ret . k)
del args
del_closure
ret k
When a function represented by a closure is called, the stack is prepared as follows:
a_1 a_2 ... a_k c r ...
(the top of stack on the left side), where 'a_1' ... 'a_k' are the arguments (first
argument on top of stack), 'c' the closure itself, and 'r' the return address. Just
before the function returns the stack must become:
r ...
So the body of the function is compiled relative to a context (denoted
<args>+(<micro_context>)+(ret . k) above) representing the above stack. When this code
has finished, the value to be returned by the function is in R. At that point, we just
have to delete (virtually) the arguments and the closure from the stack, and return.
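The '3+n words' closure segment can be sketched as a C struct. The exact meaning and order of the three header words is an assumption on our part: the text only says that the segment has 3+n words and that 'put_closure_labels' fills in the addresses of the function code and of the GC code.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical layout of a closure segment: 3 header words followed
   by the n captured micro-context values (32-bit VM words). */
typedef struct {
    uint32_t cnt;     /* garbage collector counter (assumed word 0) */
    uint32_t code;    /* address of the function's code             */
    uint32_t gc_code; /* address of the closure's GC code           */
    uint32_t env[];   /* the n captured micro-context values        */
} ClosureSegment;
```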
*** (7.3.8) Applicative terms.
*** (7.3.9) Conditionals.
*** (7.3.10) Selective conditionals.
*** (7.3.11) Local definitions ('with ...').
*** (7.3.12) Delegate.
*** (7.3.13) Serialize/Unserialize.
*** (7.4) Peephole optimisation.
*** (8) Garbage collector code.
*** (9) Equality code.
*** (10) Serialisation/unserialisation.
Serialisation is also called 'marshalling'. Anubis 1 has a built-in
serialisation/unserialisation mechanism. In the next version of Anubis, this mechanism
should be written in Anubis using macros. It will be much simpler (and safer).
In principle any abstract datum could be serialised. In the current system, only the
data of the following types are serialisable:
- the primitive types: String, ByteArray, Int32, Float,
- defined types, provided that all their components are serialisable.
Functional types should in principle be serialisable, but they are not in Anubis 1.
This would require a more sophisticated mechanism, which also has something to do with
dynamically loadable libraries.
Serialisation is not performed by a piece of code, but by a piece of 'pseudo-code'.
The same 'pseudo-code' is used for serialising and for unserialising. The
'pseudo-instructions' have just a different execution. Hence the compiler produces (if
needed) only one such pseudo-code per serialisable type.
Actually the pseudo-code is essentially a description of the type of the data to be
serialised/unserialised.
Note (2006/08/16): I just noticed, looking at the '.sc' file of the web site, that the
compiler generates pseudo code paragraphs like this one:
---------|--------------------------------------------------------
|
| * * * type implementation pseudo code * * *
|
42901 | label 590
| ;--- stack ------------------------
| ; 0 Var(List(Int32))
| ; 1 (return address)
| ;----------------------------------
42901 | type_large_switch 1336
42907 | label 1336
42907 | large_alt_begin 9
42909 | indirect_type_mixed 2 591
42915 | pop1
42916 | large_alt_end 0
42918 | jmp 1335
42923 | label 1335
42923 | swap
42924 | ret 1
---------|--------------------------------------------------------
Strange, isn't it? 'Var(...)' is not serialisable! However, I also checked that such
pieces of code are never called (by 'serialize/unserialize' instructions). I also
checked with the following example:
global define One
truc
(
List(String) args
) =
forget(serialize(var(args))).
that the compiler doesn't see the problem. This is probably due to the very bad
implementation of 'Var(...)' which is implemented as a large type, not as a 'Var'
type. This must be fixed as soon as possible. The best way is to implement 'Var(...)'
correctly.
*** (10.1) How it works.
Serialising means 'transforming into a sequence of bytes'. This sequence of bytes (a
byte array) may be saved on disk, sent over a network connection etc... Of course,
serialisation is interesting because it doesn't lose information. Serialised data may
be unserialised.
Now, when we are unserialising a byte array, expecting a datum of a certain type, we
are not always sure that the byte array is the result of the serialisation of a datum
of this type. Hence, unserialisation may produce errors. This is why the primitive
'unserialize' returns a datum of type 'Maybe($T)'.
Notice that when a datum is serialised, its type is not written into the serialisation.
Hence, when unserialising a datum we need to know the type of the datum. In other
words, the type is the 'key' (or the 'grammar') for unserialising.
Serialisation is fairly simple. It works as follows:
- String: the characters of the string, followed by a zero byte are written.
- ByteArray: a 4 bytes word is written in little-endian order, representing the
number of bytes in the byte array. Then the bytes of the byte array are written.
- Int32: The integer is written as a little endian 4 bytes number.
- Float: The floating point number is written in the IEEE754 8 bytes format.
- small type: The datum is written either as a 0 byte, 1 byte, 2 bytes or 4 bytes
word.
- mixed type: The index of the alternative is first written in the form of a
byte. Then the components are serialised. But in the case of a small alternative, the
components are serialized as a 'small datum', i.e. a 32 bits word containing all the
datum of the alternative (index included). As a consequence, in this case the index
is redundantly serialized. See the pseudo-instruction 'indirect_type_mixed' for more
information.
- large type: Same as for mixed types.
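The primitive cases of this wire format can be sketched in C. The function names are our own, 'buf' stands in for 'serial_buf' and '*next' for 'serial_next', and bounds checking is omitted:

```c
#include <stdint.h>
#include <string.h>

/* Int32: written as a little-endian 4 bytes number. */
static void put_Int32(uint8_t *buf, uint32_t *next, int32_t v) {
    for (int i = 0; i < 4; i++)
        buf[(*next)++] = (uint8_t)(((uint32_t)v >> (8 * i)) & 0xff);
}

/* String: the characters of the string, followed by a zero byte. */
static void put_String(uint8_t *buf, uint32_t *next, const char *s) {
    size_t len = strlen(s);
    memcpy(buf + *next, s, len + 1);  /* len + 1 includes the 0 byte */
    *next += (uint32_t)(len + 1);
}
```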
The result of serialisation is a byte array which needs to be allocated before
serialisation begins. The virtual machine has special registers for serialisation:
U8 *serial_buf; buffer for serialisation
U32 serial_size; size of this buffer
U32 serial_next; next free position in the buffer
Bool unserial_failed; flag for unserialisation
When serialisation begins (i.e. when the instruction 'serialize' is executed), the
datum to be serialised is on top of the stack. The instruction 'serialize' has an
address operand pointing to the pseudo code for the type of this datum. This
instruction works like a call (it calls the pseudo code), hence must push a return
address. However, the return address is inserted just below the datum to be
serialized. The instruction also changes the 'work sort' of the virtual machine from
'computing' to 'serializing', and allocates a data segment of some default size whose
address is put in 'serial_buf'. 'serial_size' is initialized to the size of this
segment, and 'serial_next' is initialized to 8 = 4+4, so as to accommodate room for the
garbage collector counter and the word for the size of the resulting byte array. Of
course, if the segment cannot be allocated, the instruction gives up, waiting for next
time.
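The register setup performed by 'serialize' can be sketched as follows. The default buffer size and the function name are our own guesses; the real instruction also pushes a return address and jumps to the pseudo code, which this sketch leaves out:

```c
#include <stdint.h>
#include <stdlib.h>

/* The serialisation registers named in the text. */
static uint8_t *serial_buf;
static uint32_t serial_size, serial_next;

#define DEFAULT_SERIAL_SIZE 1024  /* "some default size" (our guess) */

/* Returns 0 when the segment cannot be allocated, in which case the
   instruction gives up and will be retried later. */
static int begin_serialize(void) {
    serial_buf = malloc(DEFAULT_SERIAL_SIZE);
    if (!serial_buf) return 0;
    serial_size = DEFAULT_SERIAL_SIZE;
    serial_next = 8;  /* 4 bytes for the GC counter + 4 for the size */
    return 1;
}
```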
From now on, we are serializing, executing some pseudo-code. This pseudo-code ends with
the instructions 'revert_to_computing' and 'ret'.
'revert_to_computing' changes the work sort from 'serializing' to 'computing',
reallocates the resulting byte array at the right size, puts the size at the right
place in the byte array, puts the new byte array in R, and a null pointer in
'serial_buf'.
'ret' returns from the pseudo-code to the normal code.
Unserializing works as follows.
The byte array 'b' to be unserialized is in R. The 'unserialize' instruction has an
address as its unique operand:
unserialize <addr>
It is the address of a piece of pseudo-code, namely the pseudo-code for the type 'T' of
the datum to be extracted from the byte array.
What 'unserialize' does is just:
1. put (the address of) 'b' into the 'serial_buf' register, and
initialize related registers ('serial_size' and 'serial_next'),
2. put 0 in the flag 'unserial_failed'
3. push a return address 'r',
4. put the machine in the state 'unserializing',
5. jump to <addr>
Hence, just after 'unserialize' has executed, we have:
R stack serial_buf unserial_failed
(empty) r ... b 0
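The five steps of 'unserialize' can be sketched in C. The explicit stack, the returned jump address, and the assumption that the byte array stores its size word at offset 4 with the data starting at offset 8 (matching the layout set up by 'serialize') are our own simplifications:

```c
#include <stdint.h>
#include <string.h>

enum WorkSort { COMPUTING, SERIALIZING, UNSERIALIZING };

/* Machine registers named in the text. */
static uint8_t      *serial_buf;
static uint32_t      serial_size, serial_next;
static int           unserial_failed;
static enum WorkSort work_sort = COMPUTING;

/* Sketch of the five steps.  'b' is the byte array, 'r' the return
   address to push, 'addr' the address of the pseudo code. */
static uint32_t exec_unserialize(uint8_t *b, uint32_t *stack, int *top,
                                 uint32_t r, uint32_t addr) {
    serial_buf = b;                    /* step 1: related registers */
    memcpy(&serial_size, b + 4, 4);    /*   (size word at offset 4) */
    serial_next = 8;                   /*   data starts at offset 8 */
    unserial_failed = 0;               /* step 2: clear the flag    */
    stack[++*top] = r;                 /* step 3: push return addr  */
    work_sort = UNSERIALIZING;         /* step 4: change work sort  */
    return addr;                       /* step 5: jump to <addr>    */
}
```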
The execution of the pseudo code yields one of the following two situations:
case 1: unserialization succeeded
R stack serial_buf unserial_failed
d ... b 0
case 2: unserialization did not succeed
R stack serial_buf unserial_failed
d ... b 1
In both cases, we get a datum 'd' of type 'T'. However,
- if 'unserial_failed' is 0, d is the datum successfully extracted from the byte
array,
- if 'unserial_failed' is 1, d is a valid datum of type T, at least from the point of
view of the garbage collector. Nevertheless, it is not significant.
The return address 'r' has been removed from the stack. The byte array 'b' is still in
the 'serial_buf' register. At that time, the machine is still in the 'unserializing'
state. The next instruction:
revert_to_computing <width>
is responsible for testing 'unserial_failed', virtually deleting 'b', putting the
machine in the state 'computing', and:
case 1 (unserial_failed == 0):
- put 'd' into a 'success', so as to construct a datum of type 'Maybe(T)', and push a
zero on the stack. Hence, the situation becomes:
R stack serial_buf unserial_failed
success(d) 0 ... (empty) (empty)
case 2 (unserial_failed == 1):
- push 'd' on the stack, and put a zero ('failure' of type 'Maybe(T)') in R. Hence,
the situation becomes:
R stack serial_buf unserial_failed
failure d ... (empty) (empty)
The next instruction is:
del_stack_? 0
which will correctly delete the top of stack, hence yielding in any case the following
situation:
R stack serial_buf unserial_failed
result ... (empty) (empty)
where result is of type 'Maybe(T)'.
So, if the initial situation is:
R stack
(empty) ...
after execution of:
[term]
unserialize <addr>
revert_to_computing <width>
del_stack_? 0
it becomes:
R stack
d ...
where 'd' is the result of the unserialization (of type 'Maybe(T)').
Notice that the <width> operand is used by 'revert_to_computing' in order to construct
'success(d)'. Indeed,
- if <width> <= 31, success(d) is just (d<<1)|1
- if <width> >= 32, success(d) requires a new data segment of two words, with:
- at offset 0: counter (= 1)
- at offset 4: d
and the pointer to this data segment must be ORed with 1 (mixed index for type
'Maybe').
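The two cases can be sketched as a C function. The function name is our own, allocation-failure handling is omitted, and we use 'uintptr_t' so that the sketch also runs on 64-bit hosts (on the real 32-bit VM these are plain 32-bit words):

```c
#include <stdint.h>
#include <stdlib.h>

/* Sketch of how 'revert_to_computing' builds success(d).
   Returns the new value of R. */
static uintptr_t make_success(uint32_t d, int width) {
    if (width <= 31)
        return ((uintptr_t)d << 1) | 1;  /* small case: (d<<1)|1     */
    uint32_t *seg = malloc(2 * sizeof *seg);
    seg[0] = 1;                          /* offset 0: counter (= 1)  */
    seg[1] = d;                          /* offset 4: d              */
    return (uintptr_t)seg | 1;           /* pointer ORed with mixed  */
}                                        /* index 1 of 'Maybe'       */
```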
To summarize, the code for (unserialize <lc> <type> . <term>) is:
[<term>]
unserialize <addr>
revert_to_computing <width>
<end code>
except if the declared type is ByteArray. In that case, we have:
[<term>]
success 32 needed to put the byte array into a 'success'
<end code>
*** (10.2) Descriptions of small types.
They are lists of integers defined by the following grammar:
<na> := 8 bits integer (number of alternatives)
<iw> := 8 bits integer (index width)
<nc> := 8 bits integer (number of components)
<small type descr> := <iw> <na> <alt descr> ... (as many <alt descr> as <na>)
<alt descr> := <nc> <small type descr> ... (as many as <nc>)
Hence a small type description is a list of 8 bits integers. These integers become part
of serialization/unserialization instructions like 'type_8', 'type_16' etc... (see
'vminstr.c'). One question is: how long may this description be? The answer is not
so obvious! Remember that the number of alternatives is limited to 256, and likewise
for the number of components in an alternative. Now, if 'iw' is the number of bits
required for the index, the total width of the components is at most 'bw - iw', where
'bw' is the bit width of our type. So, let 'dl(iw,bw)' be the maximal length for a
description of a type of index width 'iw' and bit width 'bw'. We have,
dl(iw1,bw1) <= 2 + (2^iw1 * (1 + dl(iw2,bw1 - iw1)))
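Two worked examples, derived mechanically from the grammar above. We assume Bool is a small type with two alternatives and no components, and Maybe(Bool) has a 'failure' alternative (no components) and a 'success' alternative with one Bool component; the index widths (iw = 1) are our assumption.

```c
#include <stdint.h>

/* Bool:  <iw>=1  <na>=2  then two <alt descr>, each with <nc>=0. */
static const uint8_t descr_Bool[] = { 1, 2, 0, 0 };

/* Maybe(Bool):  <iw>=1  <na>=2
     alt 'failure': <nc>=0
     alt 'success': <nc>=1, one component = the description of Bool. */
static const uint8_t descr_Maybe_Bool[] = { 1, 2, 0, 1, 1, 2, 0, 0 };
```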
*** (10.3) Pseudo-instructions set.
The official list of pseudo-instructions is defined in 'bytecode.h'. Recall that these
instructions have two meanings: one for serialisation and one for unserialisation.
action while action while
serialising unserialising
---------------------------------------------------------------------------------------
swap swaps the top of stack nothing
type_0 nothing nothing
type_8
type_16
type_32
type_String
type_ByteArray
type_Int32
type_Float
indirect_type_0
indirect_type_8
indirect_type_16
indirect_type_32
indirect_type_String
indirect_type_ByteArray
indirect_type_Int32
indirect_type_Float
indirect_type_mixed
indirect_type_large
type_mixed_switch
type_large_switch
mixed_alt_begin
large_alt_begin
mixed_alt_end
large_alt_end
type_small_alt
revert_to_computing
type_mixed (obsolete; never used)
type_large (obsolete; never used)
small_mixed_alt (obsolete; never used)
*** (10.4) Serialising.
*** (10.5) Unserialising.
The machine is now doing unserialization. When unserialization begins (just after
the instruction 'unserialize'), the machine is in the following state:
R stack serial_buf unserial_failed
(empty) r ... b 0
where 'r' is a return address (to be removed by the 'ret' at the end of the pseudo
code), and where 'b' is a byte array.
During unserialization, the registers serial_buf and serial_size remain constant. On
the contrary, serial_next is incremented while the byte array is read. It always
gives the offset (relative to serial_buf) of the next byte to be read.
Below, we simulate a successful unserialization for various types. The instruction
'swap' does nothing during unserialization.
Below, 'p' is a pointer to a newly allocated segment, whose size (in bytes) is given
by the instruction ?_alt_begin. 'e' is the result of unserialization when
instructions call another pseudo code. 'pop1' does not pop anything, but puts the
content of R at the address pointed to by the top of stack, and increments that
pointer. 'indirect' instructions which do not call another pseudo code put their
result at the address pointed to by the top of stack.
small type: Bool
R stack
50 | label 20 (empty) r ...
50 | type_8 d r ...
51 | swap
52 | (ret . 1) d ...
primitive type: String
R stack
50 | label 4 (empty) r ...
50 | type_String d r ...
51 | swap
52 | (ret . 1) d ...
mixed type: List(String)
R stack
50 | label 15 (empty) r ...
50 | (type_mixed_switch 2 30 29)
61 | label 30 (empty) r ...
61 | type_32 d r ...
62 | (jmp . 28)
67 | label 29
67 | (mixed_alt_begin . 12) (empty) p+4 p r ...
69 | indirect_type_String (empty) p+8 p r ...
70 | (indirect_type_mixed 2 . 15) e p+8 p r ...
76 | pop1 (empty) p+12 p r ...
77 | (mixed_alt_end . 1) d = p|1 r ...
79 | (jmp . 28)
84 | label 28
84 | swap d r ...
85 | (ret . 1) d ...
large type: Printable_tree (defined in basis.anubis)
R stack
130 | label 20 (empty) r ...
130 | (type_large_switch 5 42 41 40 39 38)
152 | label 42
152 | (large_alt_begin . 5) (empty) p+5 p r ...
154 | (large_alt_end . 0) d = p|0 r ...
156 | (jmp . 37)
161 | label 41
161 | (large_alt_begin . 13) (empty) p+5 p r ...
163 | indirect_type_String (empty) p+9 p r ...
164 | (indirect_type_large . 20) e p+9 p r ...
169 | pop1 (empty) p+13 p r ...
170 | (large_alt_end . 1) d = p|1 r ...
172 | (jmp . 37)
177 | label 40
177 | (large_alt_begin . 10) (empty) p+5 p r ...
179 | indirect_type_8 (empty) p+6 p r ...
180 | (indirect_type_large . 20) e p+6 p r ...
185 | pop1 (empty) p+10 p r ...
186 | (large_alt_end . 2) d = p|2 r ...
188 | (jmp . 37)
193 | label 39
193 | (large_alt_begin . 13) (empty) p+5 p r ...
195 | indirect_type_Int32 (empty) p+9 p r ...
196 | (indirect_type_large . 20) e p+9 p r ...
201 | pop1 (empty) p+13 p r ...
202 | (large_alt_end . 3) d = p|3 r ...
204 | (jmp . 37)
209 | label 38
209 | (large_alt_begin . 13) (empty) p+5 p r ...
211 | (indirect_type_large . 20) e p+5 p r ...
216 | pop1 (empty) p+9 p r ...
217 | (indirect_type_large . 20) e p+9 p r ...
222 | pop1 (empty) p+13 p r ...
223 | (large_alt_end . 4) d = p|4 r ...
225 | (jmp . 37)
230 | label 37
230 | swap d r ...
231 | (ret . 1) d ...
Hence, instructions work as follows during unserialization (assuming no error occurs).
'direct type' instructions:
type_0
type_8
type_16
type_32
type_Int32
type_String
type_ByteArray
type_Float
--> unserialize a datum (by reading the byte array) and put it in R
'switch' instructions:
type_mixed_switch
type_large_switch
--> unserialize an index (always a byte), check it using the second operand (number of
alternatives), and jump to the right label.
'alt_begin' instructions:
mixed_alt_begin
large_alt_begin
--> allocate a segment, push the pointer, duplicate it, and add either 4 or 5 to the
copy.
'indirect non calling type' instructions:
indirect_type_0
indirect_type_8
indirect_type_16
indirect_type_32
indirect_type_Int32
indirect_type_String
indirect_type_ByteArray
indirect_type_Float
--> unserialize a datum and put it at the address pointed to by pointer on top of
stack. Increment this pointer by the size of the datum.
'indirect calling type' instructions:
indirect_type_mixed
indirect_type_large
--> Call a pseudo code. This has the effect of putting an unserialized datum in R
upon return.
'pop1'
--> put the content of R at the address pointed to by the top of stack, and increment
that pointer (always by 4).
'alt_end' instructions:
mixed_alt_end
large_alt_end
--> remove the pointer on top of stack, put the next one in R, and OR it with the
index (given by the instruction itself).
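On the successful path, 'pop1' and the 'alt_end' instructions can be sketched in C. The explicit stack representation is ours, and we use 'uintptr_t' so the sketch runs on 64-bit hosts (the real VM uses 32-bit words):

```c
#include <stdint.h>

/* pop1: write R at the address on top of stack, then advance that
   pointer by 4 (one word).  It does not pop anything. */
static void pop1(uintptr_t *tos, uint32_t R) {
    *(uint32_t *)*tos = R;
    *tos += 4;
}

/* alt_end: remove the write pointer on top of stack, put the base
   pointer below it in R, and OR it with the alternative's index. */
static uintptr_t alt_end(uintptr_t *stack, int *top, unsigned index) {
    --*top;                         /* drop the write pointer        */
    return stack[(*top)--] | index; /* base pointer ORed with index  */
}
```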
Now, unserialization may fail, because the bytes in the byte array are arbitrary. As
soon as such a failure is detected, the flag 'unserial_failed' is set to 1. In that
case, we must continue a 'fake' unserialization, because we must complete partially
constructed data. This is the reason why all instructions must first test the flag.
If the flag is 0, they operate as described above. Nevertheless, since they may detect
a problem, they may also put 1 in the flag.
'Fake execution' of the above instructions is as follows:
'direct type' instructions: put a value of 0 in R as the result of the unserialization
'switch instructions': jump to the first alternative, which always exists
'alt_begin' instructions: do not allocate a segment, but do as if, by pushing a NULL
pointer and duplicating it (hence we have two NULL pointers
on top of the stack)
'indirect non calling type' instructions: there are two cases:
- the two pointers on top of stack are NULL:
do not do anything
- the two pointers on top of stack are non NULL:
put a value of zero as the result of unserialization, and increment the topmost
pointer by the right value.
'indirect calling type' instructions: again there are two cases:
- the two pointers on top of stack are NULL:
do not do anything
- the two pointers on top of stack are non NULL:
do not call any pseudo code, and put 0 in R.
'pop1': two cases:
- the two pointers on top of stack are NULL:
do not do anything
- the two pointers on top of stack are non NULL:
put 0 at the address pointed to by the top of stack, increment the pointer.
'alt_end' instructions: two cases:
- the two pointers on top of stack are NULL:
remove the two NULL pointers and put a NULL pointer in R,
- the two pointers on top of stack are non NULL:
remove the topmost pointer, and put the other one in R, and OR it with the index.
Notice that any instruction which detects a failure, and puts 1 in the flag, should do
as if the flag were already set.
Finally, we must also discuss 'revert_to_computing'. After the unserialization, the state of
the machine is:
R stack serial_buf unserial_failed
d ... b 0/1
where 'd' is a datum, which is significant only if unserial_failed is 0.
In any case, the byte array in serial_buf must be virtually deleted, and a zero must
be put in serial_buf (this will be checked by 'unserialize'/'serialize').
If unserial_failed is:
0 then construct 'success(d)' in R (use the <width> operand to know the
bit width of the type of d), and push a zero on the stack, so that
the next instruction (del_stack_?) may operate.
1 then push R on the stack (it will be deleted by the next instruction),
and put 0 (failure of type 'Maybe(T)') in R.
Note: the sequence of instructions:
unserialize <addr>
revert_to_computing <width>
is always followed by:
del_stack_? 0 (for the type of the unserialized datum)
*** (11) Outputting the code (making an *.adm file).
*** (12) How to make a system of macros?
*** (12.1) Overview.
Anubis has no system of macros, but should have one. This would (as already remarked)
induce many simplifications in the system, and at the same time provide powerful
programming tools for the user. The best system of macros ever created is that of the
Lisp language. This system was natural in Lisp because of the particularly simple form
of its syntax. Of course, the syntax of Anubis is much more complex, but as we shall
see, constructing a system 'à la Lisp' is still possible.
What is a language good for? It is good for representing 'things' (belonging to some
'target' world) by 'expressions'. For example, the natural language represents things
of the real world by words, sentences, etc... but also abstract things like statements,
etc... Of course, each language has a syntax (grammar) but also a semantics. The
'semantics' is the correspondence between the expressions (the 'signifiers') and the
things they represent (the 'signified').
the semantics
|
symbolic side (language) | semantic side (target world)
-------------------------------------+----------------------------------------
|
|
expression |----------------------------> thing
(signifier) | (signified)
|
Note that in most cases (programming languages, mathematics, but not the natural
language), the symbolic side is more concrete (less abstract) than the semantic
side. Indeed, for example in mathematics, we have many expressions for representing the
number (say) 2, like 2, 1+1, 5-3, etc... and these expressions may be manipulated using
computation rules. So, this is in some sense very concrete. On the contrary, the
signified, i.e. the mathematical object 2, is a complete abstraction by its very
nature.
So, we use the language for manipulating the things (or at least we give ourselves the
illusion that we manipulate the things, because what we actually manipulate are the
expressions, for example when we are computing). This leads us in some circumstances to
write down the same expressions, or at least similar expressions, several times. So, we
may want to make this writing automatic. In order to do that, we need to consider the
expressions themselves as a new sort of things (the 'meta-things') and to create a new
language (the 'meta-language') for manipulating the meta-things:
meta-semantics semantics
meta-language | language | target world
-------------------------+---------------------------+--------------------------
| |
meta-expression |-----------> expression |--------------> thing
| |
| |
In most cases, the meta-language is different from the language and simpler. This
is the case for the C language, where the expressions of the meta-language are all
those which begin with '#define', '#if', etc... (a very poor language actually).
However in Lisp, the meta-language is Lisp itself. The creators of Lisp were truly
inspired when they did that. This has several consequences:
1. The meta-language is as powerful as the language itself.
2. The language is also the meta-meta-language, the meta-meta-meta-language, etc...
Fans of Lisp (like me) know how powerful this feature is.
*** (12.2) The trick.
In order to have the language be its own meta-language, it is enough to embed the
language into its own target universe. In other words, it should be possible to
consider all expressions of the language as data of the target universe:
+------------------------------------------------+
| |
| target universe (all sorts of data) |
| |
| |
| |
| |
| |
| +--------------------------+ |
| | | |
| | the language | |
| | | |
| | (a subset of the | |
| | universe) | |
| | | |
| | | |
| +--------------------------+ |
| |
| |
| |
+------------------------------------------------+
Hence the language, being able to represent all data, may also represent the
expressions of the language itself. It may also be able to represent expressions which
themselves represent expressions and so on, so that the language is at the same time
the meta-language, the meta-meta-language, and so on... This is the big trick.
*** (12.3) How it works.
Basically, we need only a set of types (called 'syntactical types') for representing
expressions, and an operator (named the 'expansion' operator) for forcing the compiler
to evaluate data of these types. We may also have a 'quoting' operator for simplifying
the writing of data of these types.
One of the main types for representing expressions in Anubis should be a type of
'terms':
public type Term:
symbol(String name),
applicative(Term function, List(Term) arguments),
function(List(Declaration) arguments, Term body),
conditional(Term test, List(Case) cases),
...
The quoting operator (denoted: ' ) should simplify the writing of data of type
'Term'. For example, instead of writing:
conditional(symbol("x"),
[
case([("failure",[])],symbol("false")),
case([("success",[symbol("u")])],
applicative(symbol("f"),[symbol("u")]))
]
)
we would have just to write this:
'if x is
{
failure then false,
success(u) then f(u)
}
(Note the quoting mark in front of this term.) The above term is not of type 'Bool'
but just of type 'Term'. Thus, the quoting mark just means: 'don't interpret this
term, just consider it as a term'.
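To make this representation concrete, here is a minimal sketch in Python (not Anubis;
the class names merely mirror the alternatives of 'Term' above, with dataclasses
standing in for the sum type) of the quoted conditional written out as a datum:

```python
from dataclasses import dataclass
from typing import List, Tuple

class Term:
    """Base class playing the role of the Anubis sum type 'Term'."""

@dataclass
class Symbol(Term):
    name: str

@dataclass
class Applicative(Term):
    function: Term
    arguments: List[Term]

@dataclass
class Case:
    # a pattern is an alternative name together with its bound variables
    pattern: Tuple[str, List[Term]]
    body: Term

@dataclass
class Conditional(Term):
    test: Term
    cases: List[Case]

# the datum that the quoted term
#   'if x is { failure then false, success(u) then f(u) }
# abbreviates:
quoted = Conditional(
    Symbol("x"),
    [
        Case(("failure", []), Symbol("false")),
        Case(("success", [Symbol("u")]),
             Applicative(Symbol("f"), [Symbol("u")])),
    ],
)
```

The point of the quoting mark is precisely to spare the programmer this verbose
constructor syntax.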
Now, we may write down functions for producing terms (of type 'Term'), type definitions
and all sorts of 'syntactical concepts', considered as data of our semantical
universe. From now on, the expressions of the language may be manipulated as any other
kind of data.
Manipulating expressions as data is called 'meta-programming'. But this is not the
whole story, because the destiny of a datum of type 'Term' (or of some similar type)
is in the end to be interpreted (evaluated).
When we want to evaluate a datum (say 't') of type 'Term', we just have to write:
@t
'@' is the 'expansion' (or 'evaluation') operator. In some sense it is the converse of
quoting. It transforms a datum of type (say) 'Term' into what this datum represents
when we consider it as an expression.
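Python's standard 'ast' module offers a rough analogue of this quote/expand pair,
which may help fix ideas (a sketch only, not Anubis: 'ast.parse' plays the role of
quoting, turning source text into a datum of a syntactical type, and 'compile' plus
'eval' together play the role of '@'):

```python
import ast

# 'quoting': the expression becomes a datum of a syntactical type
# (ast.Expression), not the number it denotes
quoted = ast.parse("1 + 2 * 3", mode="eval")
assert isinstance(quoted, ast.Expression)

# 'expansion': force the evaluation of the datum, as '@' would
value = eval(compile(quoted, "<meta>", "eval"))
print(value)  # -> 7
```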
*** (12.4) An example.
The most difficult part is designing the types: 'TypeDefinition', 'Definition', 'Term',
'Case', etc. They should be a perfect image of the whole language. As an example, we
consider only a small fragment of this family.
public type Type:
scheme (String name, List(Type) operands),
functional (List(Type) source_types, Type target_type),
...
public type Component:
anonymous (Type type),
named (Type type, String name).
public type Alternative:
singleton (String name),
usual (String name, List(Component) components).
public type TypeDefinition:
public (String name,
List(String) parameters,
List(Alternative) alternatives),
private (String name,
List(String) parameters,
List(Alternative) alternatives).
public type Paragraph:
type_definition(TypeDefinition),
...
Now, we may want to define this:
define TypeDefinition
enumerated_type
(
String type_name
List(String) alternative_names
) =
private(type_name,
[],
map((String n) |-> singleton(n), alternative_names)).
define List(String) my_names = ["ga","bu","zo", "meu"].
@enumerated_type("Shadok",my_names).
In this last paragraph (a 'macro-paragraph', beginning with '@'), the term:
enumerated_type("Shadok",my_names)
is of type 'TypeDefinition'. Because of the presence of '@', the compiler will expand
it into an actual paragraph, namely this one:
type Shadok:
ga,
bu,
zo,
meu.
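The two phases of such a macro can be sketched in Python (hypothetical helpers; plain
tuples stand in for the 'TypeDefinition' constructors): 'enumerated_type' builds the
datum, and an 'expand' function renders it back into concrete syntax, which is the
job of '@':

```python
def enumerated_type(type_name, alternative_names):
    # builds the TypeDefinition datum:
    # private(name, [], [singleton(n) for each name n])
    return ("private", type_name, [],
            [("singleton", n) for n in alternative_names])

def expand(type_def):
    # renders the datum as an actual type-definition paragraph (the job of '@')
    _tag, name, _params, alternatives = type_def
    body = ",\n".join("  " + n for (_kind, n) in alternatives)
    return "type " + name + ":\n" + body + "."

print(expand(enumerated_type("Shadok", ["ga", "bu", "zo", "meu"])))
```

Printed, this yields the 'type Shadok: ...' paragraph shown above.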
The syntax of 'macro-paragraphs' is:
@t
(with '@' in the leftmost column) where t is a term of type 'Paragraph' or
'List(Paragraph)'. In the case of a list of paragraphs, the compiler expands the term
into a sequence of paragraphs.
Of course, the expansion operator should also be usable for the expansion of terms (and
any other sort of syntactical entity), not only for paragraphs. For example, we may
write something like this:
define Term
my_cond
(
Term test
List(String) alts
) =
conditional(test,
map((String n) |-> singleton_case(n,length(n)),
alts)).
Now, we may write (within a paragraph):
@my_cond(_x,my_names)
which will be expanded into the term:
if x is
{
ga then 2,
bu then 2,
zo then 2,
meu then 3
}
where 'x' is the expansion of '_x'.
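The same two-phase reading applies to 'my_cond'. A Python sketch (tuples again stand
in for the term constructors, and 'singleton_case(n, length(n))' becomes a pair
holding the literal length of the alternative name):

```python
def my_cond(test, alts):
    # builds a conditional term: one case per alternative name, whose body
    # is the literal length of that name (cf. singleton_case(n, length(n)))
    return ("conditional", test,
            [((n, []), ("literal", len(n))) for n in alts])

term = my_cond(("symbol", "x"), ["ga", "bu", "zo", "meu"])
print(term[2][-1])  # -> (('meu', []), ('literal', 3))
```

Expanding this term with '@' would then produce the 'if x is { ... }' conditional
shown above.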
The expansion may also be implicit. For example, if a term has an interpretation whose
type is a syntactical type, the compiler may also try to interpret its expansion.
Conversely, the quoting may also be implicit. However, this may lead to problems and
should be experimented with.