On the speed of pypy

I had heard that pypy was fast. Like really fast.

Well, it’s true! In the following post, I’ll show you how one data mining algorithm went from 23 seconds (cpython) to 4 seconds (pypy). Without any modification, tweak, or special compiler/interpreter switch. I actually installed pypy 1.5.2 from the archlinux community repository so I did not compile it.

The Story

I recently searched for an implementation of a frequent item set mining algorithm in Python but I could not find a library that was easy to use and that implemented a recent algorithm (apriori is quite old now).

I ended up implementing three frequent item set mining algorithms in python and, although I was pleased with the result, I found that they were slow. The following session with Python 3.2.1 shows how much time it took to run 10 rounds of three different frequent item set mining algorithms with the library I created:

Python 2.7.2 (default, Jun 29 2011, 11:10:00)
 [GCC 4.6.1] on linux2
 Type "help", "copyright", "credits" or "license" for more...
 >>> from pymining import itemmining
 >>> itemmining.test_perf(seed=200)
 Random transactions generated

 Done round 0
 Done round 1
 Done round 2
 Done round 3
 Done round 4
 Done round 5
 Done round 6
 Done round 7
 Done round 8
 Done round 9
 FP-Growth (pruning off) took: 97.9762558937
 Computed 1491 frequent item sets.
 Done round 0
 Done round 1
 Done round 2
 Done round 3
 Done round 4
 Done round 5
 Done round 6
 Done round 7
 Done round 8
 Done round 9
 Relim took: 22.12490201
 Computed 1491 frequent item sets.
 Done round 0
 Done round 1
 Done round 2
 Done round 3
 Done round 4
 Done round 5
 Done round 6
 Done round 7
 Done round 8
 Done round 9
 Sam took: 50.3823070526
 Computed 1491 frequent item sets.

This morning, I thought about pypy and gave it a shot:

Python 2.7.1 (?, May 28 2011, 20:50:19)
 [PyPy 1.5.0-alpha0 with GCC 4.6.0] on linux2
 Type "help", "copyright", "credits" or "license" for more...
 And now for something completely different: ...
 >>>> from pymining import itemmining
 >>>> itemmining.test_perf(seed=200)
 Random transactions generated

 Done round 0
 Done round 1
 Done round 2
 Done round 3
 Done round 4
 Done round 5
 Done round 6
 Done round 7
 Done round 8
 Done round 9
 FP-Growth (pruning off) took: 21.6948671341
 Computed 1491 frequent item sets.
 Done round 0
 Done round 1
 Done round 2
 Done round 3
 Done round 4
 Done round 5
 Done round 6
 Done round 7
 Done round 8
 Done round 9
 Relim took: 4.04977107048
 Computed 1491 frequent item sets.
 Done round 0
 Done round 1
 Done round 2
 Done round 3
 Done round 4
 Done round 5
 Done round 6
 Done round 7
 Done round 8
 Done round 9
 Sam took: 20.4720339775
 Computed 1491 frequent item sets.

Yup, you read this correctly. cpython was ~5 times slower than pypy for FP-Growth and relim, and ~2 times slower for sam.

Although pypy is speedy, it still does not work with many c extensions (particularly cython extensions), so I have to do some of my work in two virtual environments. For a few seconds, I would not bother, but considering the kind of data I work with these days, it could save me hours.

Interested in pymining?

Fork it on github! Or just install it with pip:

pip install -e git://github.com/bartdag/pymining#egg=pymining

Compiling PIL on Ubuntu Natty

Again, I just lost a precious hour trying to install the Python Imaging Library in a virtual environment on Ubuntu. Even though I had installed the required dependencies, the install script did not detect that freetype and zlib were installed… The culprit: Ubuntu installs the libraries in a very weird directory and you need to set these directories in the PIL setup.py script.

First, install the required dependencies:

apt-get install python-dev \
libfreetype6-dev zlib1g-dev libjpeg8-dev

tar -xvzf Imaging-1.1.7.tar.gz
cd Imaging-1.1.7.tar.gz
vim setup.py

Then, in the setup.py file, set these two variables

ZLIB_ROOT = ("/usr/lib/i386-linux-gnu", "/usr/include")
FREETYPE_ROOT = ("/usr/lib/i386-linux-gnu",
    "/usr/include/freetype2/freetype")

Then just run python setup.py install when in your virtual environment.

gitli: Lightweight Issue Tracker for git

Last week, I wanted to track issues (bugs, tasks, enhancements) for a private project and I could not find an issue tracker that was lightweight (command line) and dead easy to install and use. Trac is very nice and easy to use, but installing and configuring a trac instance is far from a walk in the park.

I considered using ticgit, which is a true distributed issue tracking system, i.e., tickets are hosted on a git repository and multiple contributors can manage the issues. Unfortunately, to avoid ticket number clash (e.g., you don’t want Bob and Alice to create ticket #3 at the same time on their own repository), the ticket numbers are hash values like d7f2d8f6d6ec3da1a1800a33fb090d590a533bac or d7f2d8. That’s just unacceptable to me because my brain is not wired to think about ticket d7f2d8. I don’t see myself typing that kind of identifiers on the command line too. Finally, I’m not even sure it would work for teams because talking about ticket d7f2d8 adds a lot more cognitive overload than talking about ticket #7.

Long story short, I created my own small issue tracking system for personal projects: gitli. You can pronounce it the way you want, but I prefer to say “jit lee” (jet li anyone?). You can install it the way you install any python package (e.g., pip install gitli) or you can just download the files and put them on your PATH.

Using gitli is easy:

git init
git li init
git li new 'This is my first ticket'
git li -h #get some help!

gitli is not a distributed issue tracker so conflicts can occur if multiple developers create issues on their own repository. Because the datastore is a set of line-based text files, I hope that merging such conflicts will be easy. In any cases, gitli was not intended to supplement centralized (or distributed) issue tracking systems. It’s just a nice hack for personal projects.

Consult the readme file to learn more about gitli commands. I’ll consider feature requests and bug reports because I’m using gitli every day now!

Updating MoinMoin - Major Python Update on ArchLinux

This morning, I upgraded my server, which runs ArchLinux. As you may know, ArchLinux recently updated their Python packages so that the default python points to python3. Python 2.6 was also updated to Python 2.7 (accessible from python2) Everything went relatively well, but I had to recreate all my virtual environments because they were too tightly integrated with python 2.6. I have one virtual environment for each big application running on my server (Trac, MoinMoin, and my ph.d. project, recodoc) and it took longer than expected... MoinMoin must be the worst wiki on earth because every time I want to update it (either MoinMoin or Python), I get a few extra gray hairs. Here are the steps that worked for me:

  1. Stop your wsgi/fcgi server (in my case, gunicorn)
  2. Backup wiki config (share/moin) and wiki data/underlay directories.
  3. Remove the .pyc files from the backup (e.g., share/moin/config/wikiconfig.pyc)
  4. If upgrading Python, remove the cache files from the backup:
    rm -r path_to_moin/data/pages/*/cache/*
    rm -r path_to_moin/underlay/pages/*/cache/*
  5. Update MoinMoin/Python/Whatever
  6. Restore share/moin directory and wiki data/underlay directories.
  7. Restart wsgi/fcgi

Steps 4 and 5 are essential if you are upgrading Python. Indeed, compiled python files are not necessarily compatible from one version to the other (that was my first problem). MoinMoin cache files contain python bytecode that may not be compatible. Backing up the data/underlay directory is not necessary (as long as you delete the cache files when upgrading Python), but I would hate losing my wiki data during an update. Frankly, the only reason I stick with MoinMoin is because it's Python and not PHP. MoinMoin updates are just plain unusable, broken, and evil. Come on, why do I need to run such potentially nasty commands like rm -r with stars? Ok, back to work.

Python for the Java Programmer - Part 3

Three more Python tricks (ok, one is about a standard python module) that I always get wrong the first time.

Reading a Web Page

import urllib2
page = urllib2.urlopen('http://www.infobart.com')
content = page.read()  

# content is now a "byte string" (ascii-like) 
# which does not support unicode. 
# Do this to properly encode your content: 
encoding=page.headers['content-type'].split('charset=')[-1] 
ucontent = unicode(content, encoding)  

# Reference: http://stackoverflow.com/questions/1020892/urllib2-read-to-unicode [/cc] 

Sorting Lists

>>> l1 = ['BBB','aaa','ccc','DDD'] 
>>> # Standard sort 
>>> sorted(l1) 
['BBB', 'DDD', 'aaa', 'ccc'] 
>>> 
>>> # Passing a function that will be applied 
>>> # to each value before sorting. Good for 
>>> # standalone function
>>> sorted(l1, key=str.lower)
['aaa', 'BBB', 'ccc', 'DDD'] 
>>> 
>>> # Using a lambda that will return the key 
>>> # for each value: better if you want to invoke 
>>> # an instance method. 
>>> sorted(l1, key=lambda s: s.lower()) 
['aaa', 'BBB', 'ccc', 'DDD'] 
>>> 
>>> # Also works with list.sort() 
>>> l1.sort(key=lambda s:s.lower()) 
>>> 
l1 ['aaa', 'BBB', 'ccc', 'DDD']

String Formatting

b1 = True 
f1 = 7.0/22.0 
i1 = 234 
s1 = 'World!'  

s = 'Hello ' + s1 + str(b1) + str(f1) + str(i1) 
# s = 'Hello World!True0.318181818182234'   

s = 'Hello %s %r %.2f %i' % (s1, b1, f1, i1) 
# s = 'Hello World! True 0.32 234'  

# Preferred way of formatting a string 
s = 'Hello {0} {1} {2:.2} {3}'.format(s1, b1, f1, i1) 
# s = 'Hello World! True 0.32 234' 

Python for the Java Programmer – Part 2

Three more Python tricks I miss a lot when I have to program in Java!

Using a Context Manager

from threading import lock;
lock = Lock()
with lock:
    print('I have the lock!')
    print('I still have the lock!')
print('I no longer have the lock')

# no more try/finally:
# try { 
#lock.lock(); 
# } finally { 
#   lock.unlock(); 
# } 
# Can also be used with files or your own objects!

List Comprehension and Filtering

>>> # List comprehension 
>>> l1 = [i for i in xrange(10)] 
>>> l1 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] 
>>> # List transformation 
>>> l2 = [i*2 for i in l1] 
>>> l2 [0, 2, 4, 6, 8, 10, 12, 14, 16, 18] 
>>> # List Filtering 
>>> l3 = [i for i in l2 if i % 3 == 0] 
>>> l3 [0, 6, 12, 18]

Accessing the index in a “for each” loop

my_list = ['a','b','c','d','e']
for (index,letter) in enumerate(my_list):
    print('%i: %s' % (index,letter))

# Will print: 
# 0: a 
# 1: b 
# 2: c 
# 3: d 
# 4: e  

# In Java, you have to either: 
# int i = 0; # for (String s : myList) { 
#     System.out.println(i + " " + s); 
#     i++; # } # # OR # # for (int i = 0; i

Python for the Java Programmer - Part 1

I’m launching a new series of short posts on small Python tricks that I slowly discovered and that were not immediately obvious coming from a Java background.

How to call Exception.printStackTrace()?

from traceback import print_exc

if __name__ == '__main__':
    try:
        i = 1 / 0
    except Exception:
        print_exc()

Looping through two collections

namesList = ['Alice','Bob','Carl','David','Earl']
agesList = [25,40,10,5,19]
for (name,age) in zip(namesList,agesList):
    print('%s is %i yrs old' % (name,age))

# For very long lists, izip is better because
# it does not create a list (like range and xrange): 
from itertools import izip

for (name,age) in izip(namesList,agesList):
    print('%s is %i yrs old' % (name,age))

Was the loop interrupted?

namesList = ['Alice','Carl','David','Earl']
for name in namesList:
    if name == 'Bob':
        break
else:
    raise Exception('Bob was not found')
    # Note: else is only executed if no break was executed