JSON パーサの性能評価 - 主題のない日記

Guile用のライブラリをR6RS修正したものとの比較で30倍程度の差があるようである。 まぁ、性能に関して遅い方がいいなどという要件は見たことがないので、ここはえいやっとやってしまおうということにした。

せっかく書き直すのだから、後で多少色を付けやすいように(text json)というライブラリを作り、(json)はそれを使用するという形にすることにした。まぁ、実装はゴリゴリと地道にパーサを書くだけなので省略。

っで、ベンチを取る。上記の記事で使用したデータは非公開のようなので、Twitterから適当にJSONデータを取得。(Sagittarius-net-twitterを使用した。これももう少しTwitter API v1.1に対応しないとなぁ・・・) 25KBと多少小さめだがまぁいいだろう。ベンチマークには以下のスクリプトを使用。
(import (rnrs) 
 (prefix (text json) new:)
 (prefix (json) old:)

(define (run-bench json-read)
  (call-with-input-file "bench.json"
    (lambda (in)
      (time (json-read in)))))

(run-bench old:json-read)
(run-bench new:json-read)
;;  (json-read in)
;;  0.815046 real    1.310000 user    0.000000 sys

;;  (json-read in)
;;  0.032002 real    0.031000 user    0.000000 sys


Implementing non portable syntax-case

There are 2 portable syntax-case implementations, psyntax and SRFI-72. Both can be run on R5RS implementations and provide not only syntax-case but also library mechanism defined in R6RS. It is very reasonable to have such a huge mechanism instead of just providing syntax-case only because bunch of things pretty much related with syntax-case (or more precisely procedural macro and library system).

You can use it as your expander however if an implementation already has own module system then it might not a good idea to use it as it is. As my understanding, at least SRFI-72, expands all bindings to only one namespace which makes the namespace or symbol table really huge and may impacts memory usage. And again, this is reasonable decision because R5RS doesn't provide module system and if you need to write portable library then everything needs to be defined in one single namespace.

Since Sagittarius is using none of them but implementing own syntax-case expander and library system, it wouldn't hurt to write an article for this. Plus, apart from my lack of searching skill, I couldn't find any article mentioning how to implement it, other then couple of papers (could be that's enough for everybody...), I thought I can contaminate the search engine result a bit more with mine. This is basically my memo before I forget everything.

WARNING: The following sections may or may not contain mistakes. DO NOT TRUST :P

How to implement syntax-case on top of own library system

Implementing syntax-case is not an easy task to do especially in a portable way. The 2 portable implementation provides syntax-case expander and library system. This is because R5RS doesn't provide module system and R6RS procedual macros require phasing to resolve from where the bindings are imported during macro evaluation (not expansion) period. However if an implementations already has own library system, it is better to implement syntax-case on top of it so that the implementations can reuse the existing libraries without modifying anything. The following sections describe how Sagittarius' macro expander is implemented on top of own library system.


On Sagittarius, 2 Scheme objects were introduced for syntax-case, identifiers and macros. Identifier has the following slots:
  • name: raw name of the identifier, symbol
  • envs: compiler environment frame, alist
  • library: a library where this identifier belongs, library object
  • identify: a unique symbol generated per macro expansion, uninterned symbol
  • pending: a flag that indicates if the identifier is a template variable or not, boolean
The last slot, pending, is used to make an identifier unique so that evaluating template variable makes unique identifier each time. This may not needed if macro expander itself generates unique bindings like SRFI-72. However doing so may cause symbol explosion so I decided to introduce this. Internally, macro expansion is done by renaming symbols or identifiers. Each expansion remembers which symbol or identifier renamed to which so that if a symbol is renamed to an identifier twice then it will be the same object in sense of eq?.

Macro has the followings slots:
  • transformer: a procedure which expands given expassion
  • envs: a compile time environment when this macro is bound, vector
Whenever a macro is invoked, then VM swaps current macro environment with the one the macro has. And new identity is generated to rename symbols or identifier properly. Swapped environment and identity are restored when the transformer returns.


Renaming, on this context, means converting symbol or identifier to new identifier. There are 2 rules when renaming is invoked:
  • Symbols are renamed to be global identifier (passing null environment frame)
  • Not all identifiers are renamed (pattern variables, non local variables, etc.)
The idea of the first rules is inspired by SRFI-72 implementation which very first step is wrap all symbols to identifiers with empty environment. To avoid unnecessary memory allocation, I've decided to delay this step as late as possible. Thus raw symbols and identifiers created with empty environment are treated the same.

The second one is preservation. For example, local variables need to be preserved so that it can be referred. The same goes pattern variable and pending identifier which the pending slot contains #t.

Renaming is done 3 times in total. The first one is during syntax-case compilation. In this step, both pattern and template are renamed. Then the pattern variables are stored in compile time environment. The second one is during syntax compilation. In this step, only raw symbols are renamed to identifiers. The last one is during expansion. The target template needs to be renamed so that it can generate fresh identifier each time. After the expansion, raw symbols are also renamed if there is. The raw symbol conversion is required because of R6RS procedures. (actually not needed if I don't care about compliance but unfortunately I do.)

Identifier equivalence

Comparing identifiers are crucial. R6RS defines 2 types of comparison procedures, bound-identifier=? and free-identifier=?. On Sagittarius, returning #t from bound-identifier=? means given 2 identifiers have the same identities and libraries. Thus they are generated on the same macro expansion phase. Free identifiers are pretty much simple, it just checks if the given 2 identifiers are bound to the same value. This also means different named identifier can be #t in sense of free-identifier=?. (R6RS allows implementations to behave like this.)

Local variables are also simply looked up. Compiler just needs to compare variables in sense of bound-identifier=?. Because of the trick described above section, the same source identifiers are renamed to the same object. This makes look up process simply and easy. Even though the process is simple, however, comparing identifier and symbol needs special treatment. In basic concept, raw symbols and global identifiers are the same. The special treatment is when an identifier contains the same environment frame as current compile time environment frame. Variable look up is done per frame and whenever an identifier is generated by macro expander, then it contains environment frame of that moment. Following is the example case:
(import (rnrs))

(let ((a 1))
  (let-syntax ((foo (syntax-rules () ((_) a))))
    (let ((a 2))
      (+ (foo) a))))
The macro foo contains the first a which will be an identifier when it's expanded. However the compiler environment frame contains raw symbol a. So to look up it properly when the identifier contains the same frame, then it needs to compare the target as either raw symbol or global identifier.

Pit falls

Fresh template variables: R6RS requires the following template variable to be unique:
(import (rnrs))

(define-syntax define-dummy
  (syntax-rules ()
     (define dummy))))

;; -> unbound variable error
As I described above section, SRFI-72 resolves this by renaming all bindings to unique symbols. However this might be overkill if implementations already have its module system or namespace. A symbol can have different bindings if module system allows. On Sagittarius, a binding uses the raw symbol name of an identifier. Thus without any treatment, above piece of code can be run without problem. To resolve this, I introduced pending slot. The syntax which creates bindings such as define, then it always swap raw symbol name of pending identifier to uninterned symbol so that the identifier can not be seen out side of the template where it's defined. This is rather ugly solution, so I want something better.

Pattern variables: pattern variables are also identifiers except its environment frame is not a valid compiler environment frame. This is needed because pattern variables need to be preserved. However preserving pattern variable made incompatibility of portable implementations. The particular issue is following:
(import (rnrs))

(define-syntax foo
  (syntax-rules ()
    ((_ x)
     (let-syntax ((bar (syntax-rules (x)
                         ((_ a) #t)
                         ((_ b) #f))))
       (bar b)))))

(foo a)
;; -> #f
The pattern variable a is not renamed so the first pattern of bar won't match (because the pattern variable a and input variable a are not the same in sense of free-identifier=?). I believe this behaviour is still R6RS compliant since it doesn't specify this type of pattern variable renaming (if I read it correctly). To avoid this, the following is one of the solution:
(import (rnrs))

(define-syntax foo
  (syntax-rules ()
    ((_ x)
     (let-syntax ((a (syntax-rules ())))
       (let-syntax ((bar (syntax-rules (x)
                           ((_ a) #t)
                           ((_ b) #f))))
         (bar b))))))

(foo a)
;; -> #t
This behaviour is, again if I read R6RS correctly, specified in R6RS. In this case, pattern variables need to be renamed.


Using portable implementations is much easier then implementing own syntax-case. As long as implementations don't have own module system, it is better to use it. I just wanted to show at least there is a way not to use portable implementations. If you can integrate syntax-case on your own implementation, then core part can be really small (less than 1000 LoC in my case).

味園ユニバース(La La La at Rock Bottom)

ロッテルダムでやっている国際映画フェスティバルで日本語の映画がやるということで、同僚(もうすぐ元になるが)と映画を見に行った。詳細はこちら(オランダ語):味園ユニバース(La La La at Rock Bottom)





私はちょいちょいポータブルなライブラリを書く。 R6RSかR7RSかはそのときの気分次第ではあるが。最近はR7RSの方が多い気がする、処理系の更新があるから。よく使うR6RS処理系(Mosh、Ypsilon)は更新とまってるんだよねぇ、はぁ・・・。ポータブルなものを書く理由は特にない、あるとすれば名前を売るか自己満足くらいだろう。Schemerが増えてSagittariusへのフィードバックが多くなるといいなぁとかも考えてはいるが、期待はしていない(正直特に反応もないし)。これはポータブルなものに限った話ではないのだが、ライブラリを書く際に部品が足りていないと非常に辛い。例えば優先度つきキューを考えてみる(最近適当なのを書いたのだ)、こんな超が付くほど有名なデータ構造は標準もしくは準標準でサポートされていてほしいと思うのが多少なりともプログラムを書いた人間なら思うだろう(学習用とかは除く)。しかし、残念ながらSchemeにはどちらにもないのである。

なければその場で書いてもいいだろう、 実際適当な実装であれば50行もあれば書けてしまうのだ。問題はそれがこれか何万回と続く可能性があることだ。誰もが使う可能性があるデータ構造、アルゴリズムというものは裏を返せば(返さなくても)自分もよく使うのである。R5RS以前のSchemeはモジュールという概念がなく、そういったものを標準の範囲内でまとめておくことができなかったが、R6RS以降はライブラリがあるのだ、使わない手はない。

もう一つ理由がある。インターフェースの統一である。例えばスレッド関連のSRFIとしてSRFI-18があるのだが、不思議なことに採用している処理系はそんなに多くない。POSIXスレッドモデルを採用している処理系であれば互換レイヤを書くのはそんなに難しくないのだが、世の中そんなに甘くない。私がよく使うR6RS処理系、MoshとYpsilon、はMutexすらユーザに見せていない。 そうすると同期を必要とするライブラリを書くのが非常に面倒になる(それらをサポートしないという手ももちろんあるが・・・)。こうなったときに処理系側にサポートを強要することが可能な一つの手段としてSRFIがあると思っている。






Schemeにはhygienic macroなるものがあり、清潔なマクロとか健全なマクロとか訳されるのであるが、それを使うと変数の衝突などの問題をあまり考えなくてもよくなるである。Scheme界では長年このマクロについて議論されてきたらしいのだが、R6RSで終に低レベルかつ健全なマクロの一つであるsyntax-caseが標準に入ることになった。これでマクロ展開時に式変形以上のことができると喜んだのもつかの間、R7RSではまた不便なsyntax-rulesだけに戻されてしまった。理由はよく分からない。風の噂ではR7RS-largeではExplicit Renamingと呼ばれる別の低レベルな健全マクロが入るとされている。

erマクロとはどんなものか?大まかに言えば、Common Lispのマクロに健全性を追加する機能を備えたものである。与えられた式の解析、どのシンボルを衝突しないものにするか等は全てユーザーに任される。どのような見た目になるのか、伝統的なaifをerマクロで作ってみることにした。
(define-syntax aif
   (lambda (form rename compare)
     (let ((test (cadr form))
           (then (caddr form))
           (els  (if (null? (cdddr form))
                     (cadddr form))))
       `(,(rename 'let) ((it ,test))
         (,(rename 'if) it ,then ,els))))))

(aif (assq 'a '((a . 0))) it)
;; -> (a . 0)
;; Oops!
(aif (ass 'a '((a . 0))))

    program: (aif (assq 'a '((a . 0))))
    source: "macro.scm":17
  &who car
  &message pair required, but got ()
  &irritants ()

stack trace:
  [1] raise
  [2] caddr
  [3] #f
    src: (caddr form)
  [4] macro-transform
  [5] pass1
  [6] #f
  [7] with-error-handler
  [8] load

(import (rnrs))

(define-syntax aif
  (lambda (x)
    (syntax-case x ()
      ((aif test then)
       #'(aif test then #f))
      ((aif test then else)
       (with-syntax ((it (datum->syntax #'aif 'it)))
         #'(let ((it test))
             (if it then else)))))))

(aif (assq 'a '((a . 0))) it)
;; -> (a . 0)

(aif (ass 'a '((a . 0))))
    program: (aif (assq 'a '((a . 0))))
    source: "macro.scm":32
    subform: #f
    form: (aif (assq 'a '((a . 0))))
  &who aif
  &message invalid syntax

stack trace:
  [1] raise
  [2] macro-transform
  [3] pass1
  [4] #f
  [5] with-error-handler
  [6] load




Shallow string copy

Since Sagittarius 0.5.10, symbols and keywords are also target of GC. And both are using string as it's real name. The strings were copied and made as literal string before so that symbol->string could just return the given symbols name. It's not the same story now. The whole purpose of GCable symbols and keywords is saving memory. For example, what would happen if symbols were not GCed:
(dotimes (i 10000000)
  (string->symbol (format "symbol.~a" i)))
;; -> 10000000 symbols will be created
You would think nobody writes this type of code. But what if the input was user dependent? To avoid these type of out of memory, I've decided to introduce GCable symbols and keywords.

Now, I didn't make literal string GCable. I don't remember the exact reason but I think it was too naive implementation to do it and caused a lot of problem (maybe I should try again but feels like it would be very tricky to do it since symbols' and keywords' names are string). So whenever symbol->string is called and the name is not a literal string then the procedure copies the string. Thus it can be O(n) process. Stupid isn't it? So I'm thinking to introduce shallow string copy that only takes O(1).

To do it, I need to reconstruct C world string structure. It's currently defined like this:
struct SgStringRec
  unsigned int literalp: 1;
  int size             : (SIZEOF_INT*CHAR_BIT-1);
  SgChar value[1];
This is because of GC efficiency. Strings are allocated as atomic memory means there is no pointer inside. This makes GC impact a bit lighter because collector won't traverse inside of allocated pointer (according to Boehm GC document). If I make shallow string copy takes O(1) then the structure should be something like this:
struct SgStringRec
  unsigned int literalp: 1;
  int size             : (SIZEOF_INT*CHAR_BIT-1);
  SgChar *start;
  SgChar *end; /* or add one more flag if this is shallow copied */
Then copy string just puts the source pointer. If modification happens, it just make sure the pointer will be deep copied. Seems fine, the consideration is how much GC performance impact it would cause.

The GC impact can be 2 parts:
  • Making string non atomic pointer
  • Limited range shallow copy
The first one is Boehm GC thing so I just need to profile it. The second one can be a problem. Lets say you create a huge string and substring it like this:
(define (split-string s)
  (values (substring s 0 10)
          (substring s 10 20)
          (substring s 20 30)))
The number of character you want is only 30. However if the input string is, let's say, 10000, then the original source pointer would remain until all 3 strings are GCed (or modified). On the other hand, if string copy happens so many times with small string, then actual memory consumption would be smaller. Well, it's not about memory consumption but processing order so this might be kinda trade off.

So to estimate how much impact we would have, I've counted the number of possibly affected procedures.
Not so many. This result even includes test cases. symbol->string is not called in many place. So the overhead of O(n) process may not be an issue. (Well, I know this is rubbish because it doesn't show how many it's actually called.)

Hmmmm, should I?



  • 識別子の比較は、名前とその識別子の色で行われる
  • 識別子が含む色はリストである(colors)
  • 色はマクロが呼ばれる度に生成される
  • 色は一意のシンボル(eq?でのみ比較される何か)である
  • ソースの式に含まれるシンボルは全て識別子に変換される
    • その際の色は空(nil)
  • 識別子が含む環境は、リネームされるたびに追加されている
  • 束縛は識別子の名前と色をキーとして作られる
  • 全ての束縛には一意の名前が付けられる
    • ライブラリからエクスポートされる際にはそれに対して別名がつく





(import (rnrs))

(define-syntax aif
  (lambda (x)
    (syntax-case x ()
      ((_ test then)
       #'(aif test then #f))
      ((k test then else)
       (with-syntax ((it (datum->syntax #'k 'it)))
         #'(let ((it test))
             (if it then else)))))))

(aif (assq 'a '((a . 0))) it)
;; -> error

何が問題かといえば、Sagittariusはこれをエラーにしない。さらにまずいことに僕自身がこの動作を期待してコードを書いている点である。datum->syntaxを多用した記憶はあまりないのだが、CLOS周りとかはかなり処理系の中身べったりで書いているので問題が起きる。また記憶が正しかったら(binary data)辺りもこれ系の動作を期待しているような気がする。

(define-syntax aif1
  (syntax-rules ()
    ((_ test then)
     (aif1 test then #f))
    ((_ test then else)
     (aif test then else))))

(aif1 (assq 'a '((a . 0))) it)



Environment frames on identifiers

Happy new year! I've kinda missed Japanese holidays around new years day since I needed to work on 2nd January. Some friends even remained me that it's holiday. Shoot!

I've written the article about the bug of datum->syntax. Now I think I've got an idea to resolve without doing O(n^2) identifier rewriting thing during showering. It's always good to let it sit for couple of days (can be weeks or even months).

The problem is that if a renamed identifier is a local variable however variable look up can't find it properly because it's comparing somewhat awkwardly. The reason why it's doing weirdly is more like historical reason (mostly my ignorance of hygienic macro). It might be cleaner if it can refer just comparing environment frame however because of the lack of knowledge, it's now a huge spaghetti even I don't understand why it's like that. Currently identifiers can hold one set of compile time environment frames. The structure of env frame is like this:
(type . ((var val) ...))
type indicates if this is a local variable frame or a pattern variable frame. To look up proper variables from this, it required (or not) somewhat very complicated pile of shit. It basically tries to find proper frames of given identifiers. However because macro expander doesn't rename all identifiers, some frames are the same in identifiers that should be different variables.

The idea came up during showering is making identifiers can contain multiple sets of env frames. The frames should be used from the latest to oldest and macro expander should always rename with prepended env frames. This seems identifier can hold proper environment where it's defined no matter how many it's got renamed. The issues I can see are:
  • Cache file explosion
  • Memory consumption
The first one should not be big deal but I may face this since I've already got huge cache file issue. The second one is sort of inevitable if we rename identifiers each macro expansion.

I'm not totally sure if this works or not yet but sounds like a plan. Need to consider a bit more before start doing this.