Thursday, December 16, 2010

Shed Skin 0.7, Type Inference Scalability

I have just released version 0.7 of Shed Skin, a (restricted) Python-to-C++ compiler. The new version comes with lots of minor fixes and some optimizations, as well as a new Windows package. The Windows package was upgraded to a recent version of MinGW, which means it now includes GCC 4.5. I hope to package each new release for Windows again from now on.

There are also two nice new examples, one of which is over 1,000 lines (sloccount), for a total of 52 examples. Both are ray tracers, and it's always nice to look at the output of ray tracers. The larger one has a GUI and uses the multiprocessing module in combination with a Shed Skin generated extension module to do its heavy lifting. The other one uses a very slow but more realistic ray tracing technique, for a very nice result. Thanks to Eric Uhrhane and Jonas Wagner for sharing their programs!

I'd like to take the opportunity to talk a bit about type inference scalability. This graph shows the time Shed Skin needs for each of the current examples (seconds versus sloccount; the point on the right is the C64 emulator). I have to mention that two examples are left out here, because of a non-termination issue - that is, Shed Skin aborts after a certain amount of iterations. The reason I haven't looked into fixing this issue yet is that they do compile and run fine after Shed Skin aborts, because the type analysis is already precise enough at this point (though it takes some time). Precisely because of that, I don't think this issue would be hard to fix.

So while the current type analysis is obviously not perfect, I think this graph shows that type inference may not be all that hard as some people would have thought, at least for statically restricted code (the restriction does make it easier!). And I'm sure a team of programmers or someone smarter than me could do a much better job.

For those with some knowledge of the type inference literature, I'd like to explain at this point how Shed Skin achieves this scalability. Who knows this might be useful to people starting out in this area. It is actually quite simple conceptually (but don't think you can implement this correctly in a few weeks or even months for a large language such as Python!).

Shed Skin combines Ole Agesen's Cartesian Product Algorithm (CPA) with the data-polymorphic part of John Plevyak's Iterative Flow Analysis (IFA), both published in 1996 if I'm correct, and as suggested by Agesen. Because the CPA algorithm has a the tendency to explode for larger programs (usually in combination with the first, very imprecise iteration of IFA), Shed Skin 0.6 adds an incremental approach to the mix. It simply repeats its analysis for increasingly larger versions of a program, starting at nothing, essentially, and slowly adding to the callgraph until the whole program is analyzed. This seems to greatly avoid the CPA from exploding.