I had heard that pypy was fast. Like really fast.
Well, it’s true! In the following post, I’ll show you how one data
mining algorithm went from 23 seconds (cpython) to 4 seconds (pypy).
Without any modification, tweak, or special compiler/interpreter switch.
I actually installed pypy 1.5.2 from the archlinux community repository
so I did not compile it.
The Story
I recently searched for an implementation of a frequent item set mining
algorithm in Python but I could not find a library that was easy to use
and that implemented a recent algorithm (apriori is quite old now).
I ended up implementing three frequent item set mining algorithms in
python and, although I was pleased with the result, I found that they
were slow. The following session with Python 3.2.1 shows how much time it
took to run 10 rounds of three different frequent item set mining
algorithms with the library I
created:
Python 2.7.2 (default, Jun 29 2011, 11:10:00)
[GCC 4.6.1] on linux2
Type "help", "copyright", "credits" or "license" for more...
>>> from pymining import itemmining
>>> itemmining.test_perf(seed=200)
Random transactions generated
Done round 0
Done round 1
Done round 2
Done round 3
Done round 4
Done round 5
Done round 6
Done round 7
Done round 8
Done round 9
FP-Growth (pruning off) took: 97.9762558937
Computed 1491 frequent item sets.
Done round 0
Done round 1
Done round 2
Done round 3
Done round 4
Done round 5
Done round 6
Done round 7
Done round 8
Done round 9
Relim took: 22.12490201
Computed 1491 frequent item sets.
Done round 0
Done round 1
Done round 2
Done round 3
Done round 4
Done round 5
Done round 6
Done round 7
Done round 8
Done round 9
Sam took: 50.3823070526
Computed 1491 frequent item sets.
This morning, I thought about pypy and gave it a shot:
Python 2.7.1 (?, May 28 2011, 20:50:19)
[PyPy 1.5.0-alpha0 with GCC 4.6.0] on linux2
Type "help", "copyright", "credits" or "license" for more...
And now for something completely different: ...
>>>> from pymining import itemmining
>>>> itemmining.test_perf(seed=200)
Random transactions generated
Done round 0
Done round 1
Done round 2
Done round 3
Done round 4
Done round 5
Done round 6
Done round 7
Done round 8
Done round 9
FP-Growth (pruning off) took: 21.6948671341
Computed 1491 frequent item sets.
Done round 0
Done round 1
Done round 2
Done round 3
Done round 4
Done round 5
Done round 6
Done round 7
Done round 8
Done round 9
Relim took: 4.04977107048
Computed 1491 frequent item sets.
Done round 0
Done round 1
Done round 2
Done round 3
Done round 4
Done round 5
Done round 6
Done round 7
Done round 8
Done round 9
Sam took: 20.4720339775
Computed 1491 frequent item sets.
Yup, you read this correctly. cpython was ~5 times slower than pypy for
FP-Growth and relim, and ~2 times slower for sam.
Although pypy is speedy, it still does not work with many c extensions
(particularly cython extensions), so I have to do some of my work in two
virtual environments. For a few seconds, I would not bother, but
considering the kind of data I work with these days, it could save me hours.
Interested in pymining?
Fork it on github! Or just
install it with pip:
pip install -e git://github.com/bartdag/pymining#egg=pymining