Commit 4414a71411962dbccb213466c43e89553cf8805c

Authored by Alain Prouté
1 parent 4e802eb7

added find_and_replace.anubis (still under construction)

Showing 1 changed file with 578 additions and 0 deletions
anubis_dev/library/tools/find_and_replace.anubis 0 → 100644
  1 +
  2 + The Anubis Project.
  3 +
  4 + An efficient find and replace tool.
  5 +
  6 + Author: Alain Prouté 2015/07/15
  7 +
  8 +
  9 + *** Introduction.
  10 +
  11 + This file contains a 'multiple find and replace' tool. By 'multiple'
  12 + we mean that we use a dictionary in the form of a list of pairs (key,value)
  13 + and that the occurrences of any key are replaced by the corresponding value.
  14 +
  15 + A key can be part of another key, but the function replaces the longest
  16 + possible occurrence. For example, if 'ab' and 'abc' are two keys in the
  17 + dictionary, and if the text contains 'abc', the function replaces 'abc'
  18 + by the corresponding value, not 'ab'.
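The longest-match rule described above can be illustrated with a small Python sketch. It is not the automaton-based implementation from this file: it is a naive reference version, and the left-to-right scanning order and the function name are assumptions made for illustration only.

```python
def find_and_replace(dictionary, text):
    """Naive reference: at each position, replace the longest key found there."""
    # Try longer keys first so that 'abc' wins over 'ab'. Empty keys are ignored.
    keys = sorted((k for k in dictionary if k), key=len, reverse=True)
    out = []
    i = 0
    while i < len(text):
        for key in keys:
            if text.startswith(key, i):
                out.append(dictionary[key])
                i += len(key)
                break
        else:
            # No key starts at this position: copy one character unchanged.
            out.append(text[i])
            i += 1
    return "".join(out)


result = find_and_replace({"ab": "X", "abc": "Y"}, "abc ab abd")
print(result)  # prints "Y X Xd"
```

The first 'abc' is replaced by the value of 'abc' and not by that of 'ab', while the later 'ab' occurrences, which cannot be extended to 'abc', use the shorter key.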
  19 +
  20 + In order to be efficient, we compile the dictionary into an automaton
  21 + and the automaton is executed by a syscall (defined in predefined.anubis)
  22 + on the given text. The dictionary can be compiled in advance and the
  23 + automaton used any number of times.
  24 +
  25 + The purpose of the program in this file is mainly to compile the dictionary
  26 + into an automaton. The find and replace operation itself is performed by a
  27 + syscall.
  28 +
  29 +
  30 + *** Dictionaries.
  31 +
  32 + The type below defines a single entry in the dictionary. It is designed so that
  33 + you can put either a String or a ByteArray as the key and/or the value.
  34 +
  35 +public type FR_DictEntry:
  36 + entry (String key, String value),
  37 + entry (String key, ByteArray value),
  38 + entry (ByteArray key, String value),
  39 + entry (ByteArray key, ByteArray value).
  40 +
  41 + The dictionary itself is of type 'List(FR_DictEntry)'.
  42 +
  43 +
  44 + *** Compiling a dictionary.
  45 +
  46 + The result of compiling a dictionary is a datum of type 'FR_CompiledDict'.
  47 +
  48 +public type FR_CompiledDict:. (an opaque type)
  49 +
  50 + You can compile your dictionary as follows:
  51 +
  52 +public define FR_CompiledDict
  53 + compile
  54 + (
  55 + List(FR_DictEntry) dictionary
  56 + ).
  57 +
  58 + The type 'FR_CompiledDict' is serializable. Hence you can store the automaton
  59 + in a file, transmit it over the network, etc.
  60 +
  61 +
  62 + *** Finding and replacing.
  63 +
  64 + Once your dictionary is compiled, you can use it for performing 'find and replace'
  65 + operations by using one of:
  66 +
  67 +public define macro String find_and_replace(FR_CompiledDict dict, String text).
  68 +public define macro String find_and_replace(FR_CompiledDict dict, ByteArray text).
  69 +public define macro ByteArray find_and_replace(FR_CompiledDict dict, String text).
  70 +public define macro ByteArray find_and_replace(FR_CompiledDict dict, ByteArray text).
  71 +
  72 + These functions are deterministic because the 'text' is not altered: the result
  73 + is always a new datum. This is why we can call them 'functions' (as opposed
  74 + to 'commands').
  75 +
  76 +
  77 + *** Using this tool with streams.
  78 +
  79 + It can be convenient to use this tool with streams (files, connections,...) instead
  80 + of strings or byte arrays. For example, you may want to copy a file into another one,
  81 + or into a network connection, replacing keys by values. To that end, we provide
  82 + the following:
  83 +
  84 +read tools/connections.anubis
  85 +
  86 +public define Bool find_and_replace(FR_CompiledDict dict,
  87 + Connection source,
  88 + Connection target,
  89 + Int chunk_size).
  90 +
  91 + which returns true if no error occurred.
  92 +
  93 + This command does the job by reading the source connection in chunks of size
  94 + 'chunk_size'. It is written in such a way that keys which span two
  95 + consecutive chunks are handled correctly.
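One way to handle keys that straddle a chunk boundary is to postpone any position too close to the end of the buffer until more data has arrived. The Python sketch below shows that idea under stated assumptions: it uses naive matching instead of the compiled automaton, file-like objects stand in for 'Connection', and the name replace_stream is hypothetical. The commit does not show how the Anubis command does this.

```python
import io


def replace_stream(dictionary, source, target, chunk_size):
    """Copy source to target, replacing keys; returns True if no I/O error occurred."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    keys = sorted((k for k in dictionary if k), key=len, reverse=True)
    max_len = len(keys[0]) if keys else 1
    buf = b""
    eof = False
    try:
        while not eof:
            chunk = source.read(chunk_size)
            eof = len(chunk) == 0
            buf += chunk
            out = []
            i = 0
            while i < len(buf):
                # A key starting here might continue in the next chunk:
                # keep the tail of the buffer and wait for more data.
                if not eof and len(buf) - i < max_len:
                    break
                for key in keys:
                    if buf.startswith(key, i):
                        out.append(dictionary[key])
                        i += len(key)
                        break
                else:
                    out.append(buf[i:i + 1])
                    i += 1
            target.write(b"".join(out))
            buf = buf[i:]
    except OSError:
        return False
    return True


src = io.BytesIO(b"xxabcxx abc")
dst = io.BytesIO()
ok = replace_stream({b"ab": b"1", b"abc": b"2"}, src, dst, 3)
print(ok, dst.getvalue())  # prints: True b'xx2xx 2'
```

With a chunk size of 3 the first 'abc' is split across two reads ('xxa' then 'bcx'), yet the output is the same as if the whole text had been processed at once, because a decision at a given position is only taken once the next max_len bytes are available or the end of the stream has been reached.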
  96 +
  97 +
  98 + *** Remarks.
  99 +
  100 + The syscalls which perform the find and replace operation use Boyer-Moore-like
  101 + techniques for searching. We have tried to produce a tool as efficient in speed
  102 + as possible.
  103 +
  104 +
  105 +
  106 +
  107 + --- That's all for the public part! ------------------------------------------
  108 +
  109 +
  110 +read tools/basis.anubis
  111 +
  112 +
  113 + *** Uniform dictionary entry suitable for our computations.
  114 +
  115 +type DictE:
  116 + e(List(Word8) key,
  117 + ByteArray value).
  118 +
  119 +
  120 + *** Transformation of the original dictionary.
  121 +
  122 +define List(Word8)
  123 + explode
  124 + (
  125 + ByteArray ba,
  126 + Int i,
  127 + Int n,
  128 + List(Word8) so_far
  129 + ) =
  130 + if i > n then reverse(so_far) else
  131 + if nth(i,ba) is
  132 + {
  133 + failure then reverse(so_far),
  134 + success(c) then explode(ba,i+1,n,[c . so_far])
  135 + }.
  136 +
  137 +
  138 +define List(DictE)
  139 + to_Dict
  140 + (
  141 + List(FR_DictEntry) d
  142 + ) =
  143 + map((FR_DictEntry en) |-> if en is
  144 + {
  145 + entry(k,v) then e(explode(k),to_byte_array(v)),
  146 + entry(k,v) then e(explode(k),v),
  147 + entry(k,v) then e(explode(k,0,length(k),[]),to_byte_array(v)),
  148 + entry(k,v) then e(explode(k,0,length(k),[]),v)
  149 + } ,d).
  150 +
  151 +
  152 + *** Getting the list of all keys sorted by increasing size.
  153 +
  154 +define List(List(Word8))
  155 + sorted_keys
  156 + (
  157 + List(DictE) dict
  158 + ) =
  159 + merge_sort(map(key,dict),(List(Word8) l1, List(Word8) l2) |-> length(l1) < length(l2)).
  160 +
  161 +
  162 + *** Formal automaton (first form).
  163 +
  164 +type AutoState1:
  165 + init (Word16 id,
  166 + Int jump,
  167 + Word16 next_state),
  168 + reject (Word16 id,
  169 + Int disp,
  170 + Word8 char,
  171 + Int jump_if_match,
  172 + Word16 next_if_match,
  173 + Int jump_if_dont_match,
  174 + Word16 next_if_dont_match),
  175 + accept (Word16 id,
  176 + Int disp,
  177 + Word8 char,
  178 + Int jump_if_dont_match,
  179 + Word16 next_if_dont_match),
  180 + reject_exit (Word16 id,
  181 + Int disp,
  182 + Word8 char,
  183 + Int jump_if_match,
  184 + Word16 next_if_match),
  185 + accept_exit (Word16 id,
  186 + Int disp,
  187 + Word8 char).
  188 +
  189 +
  190 +
  191 +
  192 + *** Single key automaton.
  193 +
  194 +define (
  195 + List(AutoState1), // the automaton
  196 + Word16 // next available state id
  197 + )
  198 + single_key_auto_aux
  199 + (
  200 + Word16 id, // first available state id
  201 + List(Word8) key,
  202 + Int disp
  203 + ) =
  204 + if key is
  205 + {
  206 + [ ] then ([ ],id),
  207 + [h . t] then
  208 + since single_key_auto_aux(id+1,t,disp-1) is (rest,next_id),
  209 + ([
  210 + if t is [ ]
  211 + then accept_exit(id,disp,h)
  212 + else reject_exit(id,disp,h,-1,id+1)
  213 + . rest], next_id)
  214 + }.
  215 +
  216 +
  217 +define (
  218 + List(AutoState1), // the automaton
  219 + Word16 // next available state id
  220 + )
  221 + single_key_auto
  222 + (
  223 + Word16 id, // first available state id
  224 + List(Word8) key
  225 + ) =
  226 + since single_key_auto_aux(id+1,reverse(key),length(key)-1) is (auto,next_id),
  227 + ([init(id,length(key)-1,id+1) . auto],next_id).
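To make the shape of the single key automaton easier to see, here is a Python transcription of single_key_auto and single_key_auto_aux as they appear above. States are represented as plain tuples whose first element is the constructor name; that representation is an assumption of the sketch and not part of the Anubis code.

```python
def single_key_auto(first_id, key):
    """Python mirror of single_key_auto: returns (states, next_available_id)."""
    last = len(key) - 1
    # The init state jumps to the last character of the key.
    states = [("init", first_id, last, first_id + 1)]
    state_id = first_id + 1
    # Characters are tested from the last one back to the first,
    # which is what reverse(key) achieves in the Anubis version.
    for disp in range(last, -1, -1):
        char = key[disp]
        if disp == 0:
            # First character of the key reached: the whole key matched.
            states.append(("accept_exit", state_id, disp, char))
        else:
            # On a match, move one position to the left (jump -1).
            states.append(("reject_exit", state_id, disp, char, -1, state_id + 1))
        state_id += 1
    return states, state_id


states, next_id = single_key_auto(0, "abc")
for s in states:
    print(s)
# ('init', 0, 2, 1)
# ('reject_exit', 1, 2, 'c', -1, 2)
# ('reject_exit', 2, 1, 'b', -1, 3)
# ('accept_exit', 3, 0, 'a')
print(next_id)  # prints 4
```

Testing the key right to left is consistent with the Boyer-Moore-like approach mentioned in the remarks. The '_exit' states have no 'dont_match' branch yet; glue_state fills it in when several keys are combined.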
  228 +
  229 +
  230 +
  231 + *** Many keys automaton.
  232 +
  233 +
  234 +define AutoState1 -> AutoState1
  235 + glue_state
  236 + (
  237 + Word16 target_id,
  238 + Int init_disp
  239 + ) =
  240 + (AutoState1 s) |->
  241 + if s is
  242 + {
  243 + init(id,jump,next_state) then
  244 + s,
  245 + reject(id,disp,char,jump_if_match,next_if_match,jump_if_dont_match,next_if_dont_match) then
  246 + s,
  247 + accept(id,disp,char,jump_if_dont_match,next_if_dont_match) then
  248 + s,
  249 + reject_exit(id,disp,char,jump_if_match,next_if_match) then
  250 + reject(id,disp,char,jump_if_match,next_if_match,init_disp-disp,target_id),
  251 + accept_exit(id,disp,char) then
  252 + accept(id,disp,char,init_disp-disp,target_id)
  253 + }.
  254 +
  255 +define (
  256 + Word16, // id
  257 + Int, // initial displacement
  258 + Word16 // next state
  259 + )
  260 + init_state
  261 + (
  262 + List(AutoState1) l
  263 + ) =
  264 + if l is
  265 + {
  266 + [ ] then should_not_happen((0,0,0)),
  267 + [h . t] then
  268 + if h is init(id,j,n) then (id,j,n)
  269 + else init_state(t)
  270 + }.
  271 +
  272 +
  273 + Gluing a many-keys automaton onto a single-key automaton.
  274 +
  275 +define List(AutoState1)
  276 + glue_auto
  277 + (
  278 + List(AutoState1) auto_many,
  279 + List(AutoState1) auto_single
  280 + ) =
  281 + since init_state(auto_single) is (id,disp,next),
  282 + map(glue_state(next,disp),auto_many) + auto_single.
  283 +
  284 +define (
  285 + List(AutoState1), // the automaton
  286 + Word16 // next available state id
  287 + )
  288 + many_keys_auto
  289 + (
  290 + Word16 id, // first available id
  291 + List(List(Word8)) keys
  292 + ) =
  293 + if keys is
  294 + {
  295 + [ ] then ([ ],id),
  296 + [key1 . others] then
  297 + since single_key_auto(id,key1) is (auto1,id1),
  298 + since many_keys_auto(id1,others) is (auto_rest,next_id),
  299 + (glue_auto(auto_rest,auto1), next_id)
  300 + }.
  301 +
  302 +
  303 +
  304 +
  305 + *** Formal automaton (second form).
  306 +
  307 +type PatternChar: // a character at a given position
  308 + known (Word8 char,
  309 + Int disp).
  310 +
  311 +define List(PatternChar)
  312 + add
  313 + (
  314 + Word8 char,
  315 + Int disp,
  316 + List(PatternChar) pattern
  317 + ) =
  318 + with e = known(char,disp),
  319 + if e:pattern
  320 + then pattern
  321 + else [e . pattern].
  322 +
  323 +
  324 +type AutoState2:
  325 + init (Word16 id,
  326 + List(PatternChar) pattern,
  327 + Int jump,
  328 + Word16 next_state),
  329 + reject (Word16 id,
  330 + List(PatternChar) pattern,
  331 + Int disp,
  332 + Word8 char,
  333 + Int jump_if_match,
  334 + Word16 next_if_match,
  335 + Int jump_if_dont_match,
  336 + Word16 next_if_dont_match),
  337 + accept (Word16 id,
  338 + List(PatternChar) pattern,
  339 + Int disp,
  340 + Word8 char,
  341 + Int jump_if_dont_match,
  342 + Word16 next_if_dont_match),
  343 + reject_exit (Word16 id,
  344 + List(PatternChar) pattern,
  345 + Int disp,
  346 + Word8 char,
  347 + Int jump_if_match,
  348 + Word16 next_if_match),
  349 + accept_exit (Word16 id,
  350 + List(PatternChar) pattern,
  351 + Int disp,
  352 + Word8 char).
  353 +
  354 +
  355 +define AutoState1
  356 + by_id
  357 + (
  358 + List(AutoState1) l,
  359 + Word16 i
  360 + ) =
  361 + if l is
  362 + {
  363 + [ ] then should_not_happen(init(0,0,0)),
  364 + [h . t] then
  365 + if id(h) = i then h else by_id(t,i)
  366 + }.
  367 +
  368 +
  369 +
  370 +define (
  371 + List(AutoState2), // resulting tree
  372 + Word16 // next available state id
  373 + )
  374 + unfold
  375 + (
  376 + List(AutoState1) auto,
  377 + AutoState1 root,
  378 + Word16 root_id,
  379 + List(PatternChar) pattern
  380 + ) =
  381 + if root is
  382 + {
  383 + init(id,jump,next_state) then
  384 + if unfold(auto,by_id(auto,next_state),root_id+1,pattern) is (tree,next_id) then
  385 + ([init(root_id,pattern,jump,root_id+1)
  386 + . tree],next_id),
  387 +
  388 + reject(id,disp,char,jump_if_match,next_if_match,jump_if_dont_match,next_if_dont_match) then
  389 + if unfold(auto,by_id(auto,next_if_match),root_id+1,add(char,disp,pattern)) is (left,nid1) then
  390 + if unfold(auto,by_id(auto,next_if_dont_match),nid1,pattern) is (right,nid2) then
  391 + ([reject(root_id,pattern,disp,char,jump_if_match,root_id+1,
  392 + jump_if_dont_match,nid1)
  393 + . left+right], nid2),
  394 +
  395 + accept(id,disp,char,jump_if_dont_match,next_if_dont_match) then
  396 + if unfold(auto,by_id(auto,next_if_dont_match),root_id+1,pattern) is (right,nid1) then
  397 + ([accept(root_id,pattern,disp,char,jump_if_dont_match,root_id+1)
  398 + . right], nid1),
  399 +
  400 + reject_exit(id,disp,char,jump_if_match,next_if_match) then
  401 + if unfold(auto,by_id(auto,next_if_match),root_id+1,add(char,disp,pattern)) is (left,nid1) then
  402 + ([reject_exit(root_id,pattern,disp,char,jump_if_match,root_id+1)
  403 + . left], nid1),
  404 +
  405 + accept_exit(id,disp,char) then
  406 + ([accept_exit(root_id,pattern,disp,char)],root_id+1)
  407 + }.
  408 +
  409 +
  410 +
  411 +
  412 +
  413 + *** Testing.
  414 +
  415 +
  416 + define List(AutoState1)
  417 + dummy
  418 + =
  419 + [
  420 + init(0,4,1),
  421 + reject(1,'a',-1,2,2,3),
  422 + accept(2,'b',1,2),
  423 + reject_exit(3,'c',1,1),
  424 + accept_exit(4,'d')
  425 + ].
  426 +
  427 +
  428 +define String pad2(Int n) =
  429 + if abs(n) < 10 then "0"+abs_to_decimal(n) else abs_to_decimal(n).
  430 +
  431 +define String
  432 + tds
  433 + (
  434 + Int n
  435 + ) =
  436 + (if n >= 0 then "+" else "-")+pad2(n).
  437 +
  438 +define String format (AutoState1 a) =
  439 + with sep = " | ",
  440 + no_jmp = " - ",
  441 + no_next = " -- ",
  442 + if a is
  443 + {
  444 + init(id,j,n) then
  445 + concat([to_hexa(id)," ",tds(j),to_hexa(n),no_jmp,no_next,"init"],sep),
  446 + reject(id,disp,c,jm,nm,jnm,nnm) then
  447 + concat([to_hexa(id),implode([c]),tds(jm),to_hexa(nm),tds(jnm),to_hexa(nnm),"reject"],sep),
  448 + accept(id,disp,c,jnm,nnm) then
  449 + concat([to_hexa(id),implode([c]),no_jmp,no_next,tds(jnm),to_hexa(nnm),"accept"],sep),
  450 + reject_exit(id,disp,c,jm,nm) then
  451 + concat([to_hexa(id),implode([c]),tds(jm),to_hexa(nm),no_jmp,no_next,"reject/exit"],sep),
  452 + accept_exit(id,disp,c) then
  453 + concat([to_hexa(id),implode([c]),no_jmp,no_next,no_jmp,no_next,"accept/exit"],sep),
  454 + }.
  455 +
  456 +
  457 +define String
  458 + auto1_banner
  459 + =
  460 + " state | char | jmp if match | next if match | jmp if nomatch | next if nomatch | action\n".
  461 +
  462 +define String
  463 + auto1_sep
  464 + =
  465 + "-----------+--------------+----------------+-----------------+----------------+-----------------+-----------------\n".
  466 +
  467 +define One
  468 + show
  469 + (
  470 + List(AutoState1) l
  471 + ) =
  472 + if l is
  473 + {
  474 + [ ] then unique,
  475 + [h . t] then print(format(h)); print("\n"); show(t)
  476 + }.
  477 +
  478 +
  479 +
  480 +define String to_hexa (List(Word16) l) = "?".
  481 +
  482 +define Maybe(Word8)
  483 + find
  484 + (
  485 + Int d,
  486 + List(PatternChar) l
  487 + ) =
  488 + if l is
  489 + {
  490 + [ ] then failure,
  491 + [h . t] then if h is known(c,d1) then
  492 + if d = d1
  493 + then success(c)
  494 + else find(d,t)
  495 + }.
  496 +
  497 +define String
  498 + format
  499 + (
  500 + List(PatternChar) l,
  501 + Int max,
  502 + Int i
  503 + ) =
  504 + if i >= max then ""
  505 + else if find(i,l) is
  506 + {
  507 + failure then "_"+format(l,max,i+1),
  508 + success(c) then implode([c])+format(l,max,i+1)
  509 + }.
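The pattern rendering done by 'find' and the three-argument 'format' can be sketched in Python as follows. A pattern is taken to be a list of (char, disp) pairs standing for the 'known' constructor; the name render_pattern is hypothetical.

```python
def render_pattern(pattern, width):
    """Show known characters at their displacement and '_' everywhere else."""
    known = {}
    for char, disp in pattern:
        # As in 'find', the first entry listed for a displacement wins.
        known.setdefault(disp, char)
    return "".join(known.get(i, "_") for i in range(width))


print(render_pattern([("c", 2), ("b", 1)], 5))  # prints "_bc__"
```

Each state of the second-form automaton carries such a pattern, so the printed table shows which characters of the text are already known to match when that state is reached.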
  510 +
  511 +
  512 +
  513 +define String format (AutoState2 a, Int max) =
  514 + with sep = " | ",
  515 + no_jmp = " - ",
  516 + no_next = " -- ",
  517 + if a is
  518 + {
  519 + init(id,p,j,n) then
  520 + concat([to_hexa(id)," ",tds(j),to_hexa(n),no_jmp,no_next,"init ",format(p,max,0)],sep),
  521 + reject(id,p,disp,c,jm,nm,jnm,nnm) then
  522 + concat([to_hexa(id),implode([c]),tds(jm),to_hexa(nm),tds(jnm),to_hexa(nnm),"reject ",format(p,max,0)],sep),
  523 + accept(id,p,disp,c,jnm,nnm) then
  524 + concat([to_hexa(id),implode([c]),no_jmp,no_next,tds(jnm),to_hexa(nnm),"accept ",format(p,max,0)],sep),
  525 + reject_exit(id,p,disp,c,jm,nm) then
  526 + concat([to_hexa(id),implode([c]),tds(jm),to_hexa(nm),no_jmp,no_next,"reject/exit",format(p,max,0)],sep),
  527 + accept_exit(id,p,disp,c) then
  528 + concat([to_hexa(id),implode([c]),no_jmp,no_next,no_jmp,no_next,"accept/exit",format(p,max,0)],sep),
  529 + }.
  530 +
  531 +
  532 +
  533 +define One
  534 + show
  535 + (
  536 + List(AutoState2) l,
  537 + Int max
  538 + ) =
  539 + if l is
  540 + {
  541 + [ ] then unique,
  542 + [h . t] then print(format(h,max)); print("\n"); show(t,max)
  543 + }.
  544 +
  545 +define String
  546 + auto2_banner
  547 + =
  548 + " state | char | jmp if match | next if match | jmp if nomatch | next if nomatch | action | pattern \n".
  549 +
  550 +define String
  551 + auto2_sep
  552 + =
  553 + "-----------+--------------+----------------+-----------------+----------------+-----------------+------------------------+--------------\n".
  554 +
  555 +
  556 +
  557 +global define One
  558 + fr_test
  559 + (
  560 + List(String) args
  561 + ) =
  562 + print(auto2_banner);
  563 + print(auto2_sep);
  564 + if many_keys_auto(0,[explode("cd"),
  565 + explode("abc"),
  566 + explode("abcde")]) is (auto,next_id) then
  567 + if unfold(auto,by_id(auto,7),0,[]) is (auto2,_) then
  568 + show(auto2,5);
  569 + print(auto2_sep).
  570 +
  571 +
  572 +
  573 +
  574 +
  575 +
  576 +
  577 +
  578 +