Sunday, November 08, 2009

Shedskin: what next?

(Shed Skin is an experimental (restricted) Python-to-C++ compiler.)

After playing with Disco for a while, and after that being away for over a month, I'm ready to put some time into Shed Skin again. I've just applied a few patches that I received (thanks!!), fixed some minor issues myself and added a few optimizations. But I'm not really sure which 'big' thing to work on next, and I don't really know of a nice program to test with at the moment (such as Minilight or Disco). I'd love to hear some suggestions!

update: I posted a feature plan for 0.2.1 on the mailing list. thanks for the feedback!

29 comments:

Alec Thomas said...

How about seamless integration with the distutils tool-chain? It'd be really useful to be able to use shedskin for creating modules in a similar way to Cython, but with seamless fallback to pure Python.

Tom said...

Here's a suggestion that would probably be difficult to implement, but I think it would open up a whole new category of uses for shedskin.

Python is a great language for prototyping apps - I frequently write an initial prototype in Python and then convert it to C++ later. As a result, I would like to be able to use Shedskin to do a one time, 'best possible' conversion of a Python program to C++.

My Python programs are typically in 'RPython' style - if you want to be able to convert your Python program to any statically typed language then this is best - so they are suitable for conversion. However, it would be rare that one of my Python apps would work with shedskin since I am bound to use at least one module that shedskin doesn't support.

I would like shed-skin to be able to continue running even after encountering something that can't be converted. For example, if shedskin were to encounter an import for an unsupported module then it should create an empty file for that module with the members that are used.

This way, I might have to do some tweaking to get the generated C++ program to compile, but I would be able to use shedskin on all of my programs.

James said...

Great work!

One suggestion comes to mind now (I will try to think of others): add Shed Skin benchmark implementations to the great computer language shootout. Of course, Shed Skin is not a language in the usual sense, but the shootout does include Psyco benchmarks. (At least an earlier shootout website did, e.g. the URL below.)

http://shootout.alioth.debian.org/gp4/

Bertik said...

Hi. First of all thanks for the great work on shedskin. I really appreciate it.

If you want to implement something similar to minilight then i would suggest this:

http://www.kevinbeason.com/smallpt/

You will find implementations in other languages here:

http://leonardo-m.livejournal.com/79395.html

and here (in C++, Boo (a python-like language for the clr) and C#):

http://boolang.googlegroups.com/web/BooTestCases(Casts-ConstantFolding).zip?gda=Ab_gSl0AAABhYyo--a8H8WVZILzkxgc0NGI3_80IYY7j923lBkc2XLIIwHexjAQBQpJR-Jkh1fCaQ__-agx50wYxzs-guIwHactuCrO6nX14jzwWK3XqmuU2_747KStNgkfeVUa7Znk


It's a relatively short computation-intensive benchmark with a nice result and a good stress test for shed skin. It would also allow comparing the quality of the code generated by shedskin with the original C++ implementation.

I would love to see the results.

illume said...

Hi,

I second the distutils integration!

My use case for shedskin so far is using it with cpython modules. However for one project I didn't use the shedskin module just because it would require work for me to integrate it with the distutils. Even though it gave me a nice 2-3x speedup for that code, it wasn't worth the work to package it.

If I could have my setup.py try and use the pre generated cpp files, or try and use shedskin to compile to cpp... and then finally compile the cpp to a .so/.dll then use that from a .py that would be great. Extra points for generating a .py which looks for _module.so and uses the shedskin version if it has been installed.
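The "use the compiled version if it has been installed" part of this idea can be sketched in a few lines of plain Python. This is only an illustration, not something Shed Skin generates: `_fastcalc` is a hypothetical name for an extension module built with `shedskin -e`, and `dot` is a made-up function that both versions would have to provide with the same signature.

```python
# Wrapper module sketch: prefer a compiled extension, fall back to pure Python.
# `_fastcalc` is a hypothetical module name used only for illustration.
try:
    from _fastcalc import dot  # compiled version, if it has been installed
    USING_COMPILED = True
except ImportError:
    USING_COMPILED = False

    def dot(a, b):
        """Pure-Python fallback with the same signature as the compiled version."""
        return sum(x * y for x, y in zip(a, b))


print(dot([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]))
```

Callers import `dot` from the wrapper and never need to know which version they got; the `USING_COMPILED` flag is there only so a test suite can report which path was taken.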

Probably a lot of the pyrex/cython code could be reused for this.

I guess some framework for running tests would be useful for shedskin too. So it can run the tests for the module and then generate the code? Perhaps only using the shedskin version if the tests pass? This would allow profile guided optimisation to be used in there too right?


Another goal could be to implement some missing builtins. eg, map? There are python versions of those functions around(like in pypy). So would using the python versions be easy enough?

I can't think of much else at the moment.

cu!

Blaine said...

I have a few questions about GPLv3 Licensing. First: is the resulting C++ source code, that links against builtin.hpp GPLv3? I will assume this is "yes".

If yes, this would mean that I am required to provide the source code (the .cpp and .hpp files) along with the distribution of the binary that results when I compile the c++ file.

Now, what about when I make a python module and compile it? I can understand that if the C++ is GPLv3, I need to distribute the source code of the module.
* But what about a script that imports the GPLv3 module at runtime? Am I required to make this application GPLv3 as well?
* If this is true, how does this affect *other* modules that are imported by this application. I imagine that if I import the "os" module it is not affected.

I hope this isn't too confusing. I am aware that shed skin's source code must be distributed with the binary - this much makes sense.

Thanks,
Blaine

srepmub said...

thanks for the suggestions so far!

@alec, illume:

I agree it would be very useful to have seamless integration with distutils, but I'm afraid we'll have to wait until someone gets a big enough itch to look into this (especially the Windows part..)

@tom:

I'm afraid compile-time type inference won't allow such kinds of fallback mechanisms. but did you know you can build extension modules with shedskin? (shedskin -e, see the tutorial). this way, you may be able to split your code into compileable and non-compileable code, and still keep everything in Python. shedskin's goal really is not to compile arbitrary code, but to allow you to write fast computational code in Python, possibly at the cost of some effort.

@james:

thanks, but although many of the benchmarks compile fine, the shootout doesn't accept 'experimental' language implementations!

@bertik:

thanks! if you could send me a working Python version, I'd love to try it out.

srepmub said...

@blaine:

no need to worry, only the compiler core is GPL. all the stuff in lib/ is licensed under an MIT-like license (see the LICENSE file). so there are (afaics) no restrictions on compiled programs.

srepmub said...

in the meantime, I found a nice new test case here, and with some minor modifications got it to compile:

http://www.xs4all.nl/~rjoris/maximummatching.html

Joseph Coffey said...

If you make shedskin work with numpy like cython does that would probably expand the audience.

Most people who use python and care about performance use it in a scientific context.

srepmub said...

I agree Numpy support would be great, and would be happy to support anyone looking into adding this. the first hurdle would be to create a type model (see the tutorial) to see if Numpy can be supported at all. if this turns out to be impossible, it may be useful to look into supporting/wrapping some other library, such as Eigen.

illume said...

It looks like weave.blitz converts numpy expressions into C++ using the blitz++ library.

see http://www.scipy.org/PerformancePython

cu.

srepmub said...

thanks. the example they give is interesting, because you can just use shedskin _instead_ of numpy here, to get a similar speedup without needing anything but basic python syntax. I benchmarked the double-loop (for i.. for j) part (10,000 runs):

-using numpy indexing (u[i, j]): 9 seconds
-using python lists (u[i][j]): 1.6 seconds (avoiding indexing with tuples..)
-shedskin: 0.020 seconds
-shedskin -bw: 0.010 seconds

so now I'm wondering, how many people use numpy in this way, that is, just to get around python's speed limitations, in which case shedskin might be used instead, and how many actually use its more fancy linear algebra features?
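For readers who have not seen the page, the list-based variant of the double loop looks roughly like the sketch below. It is an assumption-laden reconstruction rather than the exact benchmarked code: the update formula follows the usual Laplace time step, while the grid size, the boundary values and the helper name `make_grid` are made up for illustration.

```python
def time_step(u, dx, dy):
    """One in-place sweep over the interior points of grid u (a list of lists)."""
    dx2, dy2 = dx * dx, dy * dy
    denom = 2.0 * (dx2 + dy2)
    for i in range(1, len(u) - 1):
        # Look up the three rows once per i, so the inner loop indexes flat lists.
        row_above, row, row_below = u[i - 1], u[i], u[i + 1]
        for j in range(1, len(row) - 1):
            row[j] = ((row_above[j] + row_below[j]) * dy2 +
                      (row[j - 1] + row[j + 1]) * dx2) / denom


def make_grid(n):
    """n x n grid of zeros with the top edge held at 1.0 (made-up boundary condition)."""
    grid = [[0.0] * n for _ in range(n)]
    grid[0] = [1.0] * n
    return grid


u = make_grid(5)
time_step(u, 0.1, 0.1)
print(u[1][1])
```

Nothing here goes beyond lists, floats and `for` loops, which is the kind of code the u[i][j] timing above refers to.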

srepmub said...

another option of course, that doesn't require numpy support in shedskin, is to generate an extension module (shedskin -e) for anything that may otherwise be slow. so you can just use numpy in the main program, and pass for example a matrix as a list to the extension module.

illume said...

hi,

Those are some good results :)

Lots of people using numpy are looking for nice ways to speed things up. As you can see from that page, there are a number of different techniques available.

I think people use numpy for its elegance, as well as for the easy speedup(since it is available on so many platforms). Also it does provide many fancy algorithms :)

Personally, both are true for me, for different programs. Sometimes I use the linalg module... and other times I just use numpy.array.

However, lots find it troublesome to learn a different way of programming. I see this all the time from people trying to speed up their programs with pygame... and finding numpy weird. People coming from a matlab background find it quite nice of course.

The performancepython wiki page would be a good page to add shedskin to if that example can be sped up by it. I couldn't see the pure python version of the code, it imports numpy. However I guess the data structure could be replaced with a list... rather than a numpy array. Then removing all of the other non python stuff... it looks like it should compile with shedskin.

... I started converting the laplace.py here to just use python... but there's a few numpy things that need converting to more normal python style.
http://rene.f0o.com/~rene/stuff/numpy_laplace.py


One option would be for shedskin to support the new buffer interface in python. This is how pygame Surfaces talk with numpy with the surfarray module. This would let shedskin integrate more as a python module. This way you could pass in a pygame Surface.get_buffer(), or a numpy array, and use the shedskin module on that data.

The buffer interface is just a way to tell what the shape and strides of the data are. I guess it would be similar to implementing the python array module.
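To make "shape and strides" concrete, here is a small sketch using only the standard library's `array` and `memoryview` (Python 3.3 or later for `cast`). It shows what a consumer of the buffer interface gets to see; how Shed Skin itself would consume such a buffer is not shown here and would be speculation.

```python
from array import array

# A flat, typed buffer of doubles, similar in spirit to the data behind a NumPy array.
data = array('d', [0.0, 1.0, 2.0, 3.0, 4.0, 5.0])

# No copy is made: the view shares memory with `data`.
view = memoryview(data)
print(view.format, view.itemsize, view.ndim, view.shape, view.strides)

# Writing through the view changes the original array.
view[2] = 42.0
print(data[2])

# Reinterpret the same memory as a 2 x 3 grid, again without copying.
# The detour through bytes ('B') is needed because cast() requires one side
# of the conversion to be a byte format.
grid = view.cast('B').cast('d', (2, 3))
print(grid.shape, grid.strides)
print(grid.tolist())
```

The metadata (format, item size, shape, strides) plus a pointer to the data is all a compiled module would need in order to work on the values in place.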

So in that example... other than converting it, you could compile a module with shedskin. Then the shedskin module would look at the buffer shape, data type and pointer.

Another option would be to convert the numpy array tolist()/from python lists, and feed it to the shedskin compiled module. (this is mentioned in the shedskin tutorial).


Perhaps code generated by shedskin that works on the right dimensioned lists of floats, could be modified fairly easily to work with numpy arrays too.


cheers,

James said...

One limitation of Shed Skin that makes it harder to use than I would like is the fact that "builtin objects are completely converted for each call/return from Shed Skin to CPython types and back, including their contents."

Suppose I want to modify a large NumPy matrix, M, many times:

for i in range(10000):
    M = modify(M)
    # use NumPy functions here
    # to do other things with M

While it is tempting to simply compile the modify function (using tolist() in the input to modify() and array() to reformat the function’s output to NumPy format), the call/return copy overhead will make the loop too slow – especially when M is large (say 10000 x 10000). Currently the only solution is to compile the entire code loop above with Shed Skin, which will require lots of hand-coded translations if the code loop contains a lot of NumPy functions.

The ultimate solution may be to add NumPy support to Shed Skin – but if this proves difficult to implement, then in the meantime the ability to eliminate the call/return copy overhead would make it easy to speed up this example by simply compiling modify().
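A rough back-of-the-envelope estimate shows why the copying dominates in this example. The figures are assumptions: the matrix size and loop count come from the example above, and 8 bytes per element is a lower bound that counts only a raw C double and ignores the extra cost of Python float objects and list structure.

```python
# Back-of-the-envelope estimate of the data converted by the loop above.
rows = cols = 10000          # matrix size from the example
iterations = 10000           # loop count from the snippet
bytes_per_element = 8        # assumption: one C double per element (lower bound)
conversions_per_call = 2     # once for the argument, once for the return value

elements = rows * cols
bytes_per_conversion = elements * bytes_per_element
total_bytes = bytes_per_conversion * conversions_per_call * iterations

print("elements per matrix: %d" % elements)
print("MB per conversion: %.0f" % (bytes_per_conversion / 1e6))
print("TB converted over the whole loop: %.0f" % (total_bytes / 1e12))
```

Even under these optimistic assumptions, every call to modify() moves hundreds of megabytes before any real work is done, which is why avoiding the copy matters more here than speeding up the function body.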

srepmub said...

thanks for the suggestions! but I have to say I'm quite happy with the simple integration that is possible at the moment. it's too bad some programs cannot be optimized using shedskin this way, but it never really has been my goal to support arbitrary programs. of course I'd still be happy to support anyone looking further into such issues, but I don't think I will start working on these myself.

Ben McDonald said...

In the tutorial, you describe how to call C++ by creating a type model file, *.py. Would you consider a type model generator that uses a C++ header?

C++, maybe unfortunately, is still widely used. I still have to use it for graphics and Computer Vision. If Shedskin could effortlessly call C++ it may be very useful for some developers.

I don't know enough about Shedskin to guess how this could be done but even a generator that can only parse basic C++ code could be useful.

Thanks

srepmub said...

it sounds like you'd want to modify something like SWIG for this purpose, so you still get to develop/debug your code with CPython. then when you're done developing, it could automatically generate a type model and glue code.

but then again, Python is usually fast enough in glue-code situations, and if there's some particularly slow computation, in many cases it should be possible to isolate the respective code, and use shedskin -e to generate an extension module.

brentp said...

i agree with @illume that it would be great if shedskin's lists supported the array/buffer interface -- which would give seamless integration with numpy without copying/memory overhead.
is something like that possible?
http://docs.python.org/dev/c-api/buffer.html

Spinner said...

Great. Are you planning on anything at PyCon 2010 this year? A sprint maybe?

srepmub said...

I have to admit this array/buffer (and memoryview?) interfacing idea does sound quite interesting.. I will have a look at the documentation, and think about how to use this in shedskin. thanks illume for a potentially very useful suggestion! if anyone would like to play with this already, please start a discussion on the mailing list.

srepmub said...

no I won't be at pycon 2010.. but I should probably go to python conferences more, if only to promote shedskin..

James said...

How about Scipy 2010? It will be in Austin, TX in late June:

http://blog.jarrodmillman.com/2009/11/scipy-2010-coming-to-austin-tx-628-74.html

This would be an ideal forum for Shed Skin.

And there might be a way to raise some $ to help you with the trip expenses...

Byron said...

To make collaboration easier for all the git-affine people out there, I set up a mirror of shedskin on gitorious:

http://gitorious.org/shedskin

Once I heard about Facebook's HipHop approach to speeding up PHP code, I was hoping something like it existed for Python as well. Hopefully many others will look for it too, and boost your project's development speed :).

Regards,
Sebastian

srepmub said...

nice, thanks!! I've actually just started using GIT myself for some other project, and am looking forward to reading 'pro git', which I received today.. :) I'd like to use GIT for shedskin too, but I'm not ready to leave the googlecode site (yet). is it possible to have the googlecode SVN repo automatically track the gitorious repo?? that would be wonderful.

srepmub said...

oh, I see some other projects that use googlecode, but use an external GIT repo instead of google-hosted SVN. I'd like to do that too at some point.

Byron said...

To enable a smooth transition, you should be able to use git-svn to pull the complete history of your subversion repository. You'd get essentially the same repository that I mirror on gitorious.

You could now work using all features of git, and at some point push your changes back to subversion using
git svn dcommit.

Nonetheless, your local repository could be hosted on gitorious as mainline as well, allowing others to clone it there and collaborate. Git and subversion interact quite flawlessly :).

If you are interested in testing that workflow and if you would like to claim your 'shedskin' project name on gitorious, please let me know and I will be happy to help :).

srepmub said...

thanks! I think I will try to work with the gitorious repo a bit, and if I'm happy with that, just drop SVN completely, and link to the GIT repo from the googlecode 'source' tab, as I've seen other projects do.