Job queue in bash

All right, I’ve been busy starting a new job while finishing my Ph.D. during the weekends, but on the plus side, I have tons of little tricks to post. Let’s start with this one.

I configured our server to perform a build after each push to our Mercurial repositories: tests, coverage analysis, documentation generation, email to the team, etc. I wanted to make sure that only one build could run at a time, so I tried the at command in Linux, only to realize later that although at has the concept of a job queue, it does not ensure that jobs are executed one at a time. *

Finally, I found a nice trick in bash to do this. It is not perfect (there is still the possibility that two jobs will run concurrently or that the queue will get corrupted), but for our use case, this is acceptable and quite unlikely.

To create a bash queue, create a file, say /opt/scripts_output/queue, then add this code at the beginning of your script:

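# Read the pid of the last job in the queue (empty if the queue is empty)
towait=$(tail -n 1 /opt/scripts_output/queue)
# Register this job so that the next one waits for it
echo $$ >> /opt/scripts_output/queue

# kill -0 sends no signal; it only checks that the process still exists
while [ -n "$towait" ] && kill -0 "$towait" >/dev/null 2>&1; do
    sleep 1
done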

Essentially,

  1. The bash script reads the pid of the last job.
  2. The script writes its pid to the queue file.
  3. The script waits for the last job to terminate (kill -0) and then proceeds.

Because steps 1 and 2 are not atomic, two scripts started at exactly the same time could read the same pid and then run concurrently. One could add a small but unfair lock file to keep the solution simple, but the ordering of the jobs would not necessarily be preserved.
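For what it is worth, here is a minimal sketch of such a lock. It assumes that mkdir is atomic on the file system that holds the queue, which is the case on local file systems. The function name enqueue_locked and the .lock suffix are made up for this illustration and are not part of my build script.

```shell
#!/bin/bash
# Hypothetical helper (name and ".lock" suffix are illustrative only).
# Reads the last pid and appends our own while holding a crude lock,
# so that two scripts cannot both read the same "last" pid.
# Sets the global variable towait for the caller.
enqueue_locked() {
    local queue_file="$1"
    local lock_dir="$1.lock"

    # mkdir either creates the directory or fails: only one script wins.
    until mkdir "$lock_dir" 2>/dev/null; do
        sleep 1
    done

    # Steps 1 and 2 from the post, now protected by the lock.
    towait=$(tail -n 1 "$queue_file" 2>/dev/null)
    echo "$$" >> "$queue_file"

    # Release the lock; a script killed before this line leaves a stale lock.
    rmdir "$lock_dir"
}
```

After calling enqueue_locked /opt/scripts_output/queue, the script would wait on $towait with the same kill -0 loop as before. The lock is unfair: whichever waiting script retries first gets it, so jobs may not enter the queue in the order they were started.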

You can factor this process into a bash function and use it in different scripts. As long as all scripts point to the same queue file, you will be (relatively) sure that they are only executed sequentially.
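Such a function could look like the sketch below. The name wait_for_turn is hypothetical, and the extra check against the script's own pid is my addition so that calling the function twice from the same script does not wait forever.

```shell
#!/bin/bash
# Hypothetical helper: wait_for_turn is an illustrative name.
# Usage: wait_for_turn /path/to/queue_file
wait_for_turn() {
    local queue_file="$1"
    local towait

    # 1. Read the pid of the last job (empty if the queue is empty or missing).
    towait=$(tail -n 1 "$queue_file" 2>/dev/null)

    # 2. Append our own pid so that the next job waits for us.
    echo "$$" >> "$queue_file"

    # 3. Poll until the previous job is gone. kill -0 sends no signal;
    #    it only checks that the pid still exists.
    while [ -n "$towait" ] && [ "$towait" != "$$" ] \
            && kill -0 "$towait" 2>/dev/null; do
        sleep 1
    done
}
```

Each build script would then start with a single line such as wait_for_turn /opt/scripts_output/queue. The race between steps 1 and 2 is still there; the function only removes the copy and paste.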

For us, if a problem occurs because of this script, I get an error in my mailbox and I just wait for the next push (or I manually relaunch the build process). This has never happened since I wrote this script, but we know it is a possibility. If you want to use this little bash trick for mission-critical systems, that’s another story.

* Actually, when a job is started by at, the job goes to another queue and at moves on to the next job and starts it if the time is right. The end result is that all jobs in a queue can run concurrently.

pymining is now available on pypi!

Just a quick post to say that I released the very first version of pymining on pypi. It is now super easy to install this small but hopefully useful library:

pip install pymining

The library includes three frequent item set mining algorithms and one association rule mining algorithm. As shown in the previous post, running this library with pypy results in impressive performance.

Feature requests, improvements, and bug reports are welcome as usual.

Happy data mining :–)

On the speed of pypy

I had heard that pypy was fast. Like really fast.

Well, it’s true! In this post, I’ll show you how one data mining algorithm went from 22 seconds (cpython) to 4 seconds (pypy), without any modification, tweak, or special compiler/interpreter switch. I actually installed pypy 1.5.2 from the archlinux community repository so I did not compile it.

The Story

I recently searched for an implementation of a frequent item set mining algorithm in Python but I could not find a library that was easy to use and that implemented a recent algorithm (apriori is quite old now).

I ended up implementing three frequent item set mining algorithms in Python and, although I was pleased with the result, I found that they were slow. The following session with Python 2.7.2 shows how much time it took to run 10 rounds of three different frequent item set mining algorithms with the library I created:

Python 2.7.2 (default, Jun 29 2011, 11:10:00)
 [GCC 4.6.1] on linux2
 Type "help", "copyright", "credits" or "license" for more...
 >>> from pymining import itemmining
 >>> itemmining.test_perf(seed=200)
 Random transactions generated

 Done round 0
 Done round 1
 Done round 2
 Done round 3
 Done round 4
 Done round 5
 Done round 6
 Done round 7
 Done round 8
 Done round 9
 FP-Growth (pruning off) took: 97.9762558937
 Computed 1491 frequent item sets.
 Done round 0
 Done round 1
 Done round 2
 Done round 3
 Done round 4
 Done round 5
 Done round 6
 Done round 7
 Done round 8
 Done round 9
 Relim took: 22.12490201
 Computed 1491 frequent item sets.
 Done round 0
 Done round 1
 Done round 2
 Done round 3
 Done round 4
 Done round 5
 Done round 6
 Done round 7
 Done round 8
 Done round 9
 Sam took: 50.3823070526
 Computed 1491 frequent item sets.

This morning, I thought about pypy and gave it a shot:

Python 2.7.1 (?, May 28 2011, 20:50:19)
 [PyPy 1.5.0-alpha0 with GCC 4.6.0] on linux2
 Type "help", "copyright", "credits" or "license" for more...
 And now for something completely different: ...
 >>>> from pymining import itemmining
 >>>> itemmining.test_perf(seed=200)
 Random transactions generated

 Done round 0
 Done round 1
 Done round 2
 Done round 3
 Done round 4
 Done round 5
 Done round 6
 Done round 7
 Done round 8
 Done round 9
 FP-Growth (pruning off) took: 21.6948671341
 Computed 1491 frequent item sets.
 Done round 0
 Done round 1
 Done round 2
 Done round 3
 Done round 4
 Done round 5
 Done round 6
 Done round 7
 Done round 8
 Done round 9
 Relim took: 4.04977107048
 Computed 1491 frequent item sets.
 Done round 0
 Done round 1
 Done round 2
 Done round 3
 Done round 4
 Done round 5
 Done round 6
 Done round 7
 Done round 8
 Done round 9
 Sam took: 20.4720339775
 Computed 1491 frequent item sets.

Yup, you read this correctly. cpython was ~5 times slower than pypy for FP-Growth and relim (about 4.5x and 5.5x, respectively), and ~2.5 times slower for sam.
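If you want to check the ratios yourself, the arithmetic is just the cpython time divided by the pypy time. The small sketch below uses the timings copied from the two sessions above; the function name speedups is made up for the example.

```shell
#!/bin/bash
# Timings copied from the two sessions above (seconds for 10 rounds).
# Each ratio is cpython time / pypy time.
speedups() {
    awk 'BEGIN {
        printf "FP-Growth: %.1fx\n", 97.9762558937 / 21.6948671341
        printf "Relim: %.1fx\n", 22.12490201 / 4.04977107048
        printf "Sam: %.1fx\n", 50.3823070526 / 20.4720339775
    }'
}

speedups
```

This prints 4.5x for FP-Growth, 5.5x for Relim, and 2.5x for Sam.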

Although pypy is speedy, it still does not work with many C extensions (particularly cython extensions), so I have to do some of my work in two virtual environments. To save a few seconds, I would not bother, but considering the kind of data I work with these days, it could save me hours.

Interested in pymining?

Fork it on github! Or just install it with pip:

pip install -e git://github.com/bartdag/pymining#egg=pymining

Compiling PIL on Ubuntu Natty

Again, I just lost a precious hour trying to install the Python Imaging Library in a virtual environment on Ubuntu. Even though I had installed the required dependencies, the install script did not detect that freetype and zlib were installed… The culprit: Ubuntu installs the libraries in a very weird directory and you need to set this directory in the PIL setup.py script.

First, install the required dependencies:

apt-get install python-dev \
libfreetype6-dev zlib1g-dev libjpeg8-dev

Then extract the archive and open setup.py:

tar -xvzf Imaging-1.1.7.tar.gz
cd Imaging-1.1.7
vim setup.py

Then, in the setup.py file, set these two variables:

ZLIB_ROOT = ("/usr/lib/i386-linux-gnu", "/usr/include")
FREETYPE_ROOT = ("/usr/lib/i386-linux-gnu",
    "/usr/include/freetype2/freetype")

Then just run python setup.py install when in your virtual environment.
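The paths above are the ones from my 32-bit install; on a 64-bit Ubuntu the directory would be x86_64-linux-gnu instead. If you are not sure where the libraries ended up, a small search like the sketch below can help. The function name find_lib_dir is hypothetical, and it assumes a find that supports -maxdepth (GNU and BSD find do).

```shell
#!/bin/bash
# Hypothetical helper: prints the directory of the first file matching
# a library name pattern, looking at most two levels below the given roots.
# Usage: find_lib_dir PATTERN DIR...
find_lib_dir() {
    local pattern="$1"
    shift
    local hit
    hit=$(find "$@" -maxdepth 2 -name "$pattern" 2>/dev/null | head -n 1)
    if [ -n "$hit" ]; then
        dirname "$hit"
    fi
}

# Where did zlib and freetype end up on this machine?
find_lib_dir 'libz.so*' /usr/lib
find_lib_dir 'libfreetype.so*' /usr/lib
```

Whatever directory this prints is the first element to put in ZLIB_ROOT and FREETYPE_ROOT. If nothing is printed, the development package is probably not installed.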

Question marks and keyboard layouts

I recently switched from KDE to XFCE because I could not stand KDE bugs anymore. The upside is that XFCE is super simple, fast, and minimalist. The downside is that everything looks “blocky” (as in “lego blocks”).

Anyway, two issues have been bugging me these past few days and I finally worked around them today.

Accented characters displayed as question marks in terminal

I write in French and English so I sometimes need to type accented characters on the command line. Unfortunately, after installing XFCE on archlinux, I found that every time I typed an accented character in the terminal, it was replaced by a question mark.

After messing around with my configuration files for three days, I finally found the solution on the archlinux wiki:

# in ~/.bashrc
export LC_ALL=
export LC_COLLATE="C"
export LANG=en_US.utf8

# in /etc/environment
LANG="en_US.UTF-8"
LC_ALL="en_US.UTF-8"

I believe that only the /etc/environment change is necessary, but setting your language in your bash file is considered to be a good practice and other programs might use these variables to determine your locale. Obviously, this is in addition to configuring your /etc/locale.gen file properly.
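To see which of these variables actually wins, remember the precedence: LC_ALL overrides LC_CTYPE, which overrides LANG. That is presumably why the .bashrc snippet clears LC_ALL. The sketch below applies this rule to the current environment; the function names are made up for the example, and the UTF-8 check only looks at the end of the locale name.

```shell
#!/bin/bash
# Prints the locale that decides character encoding, given the values of
# LC_ALL, LC_CTYPE and LANG (in that order). The first non-empty value wins.
effective_ctype() {
    local value
    for value in "$1" "$2" "$3"; do
        if [ -n "$value" ]; then
            echo "$value"
            return 0
        fi
    done
    # Nothing set at all: programs fall back to the POSIX (C) locale.
    echo "POSIX"
}

# Succeeds if the locale name ends with a UTF-8 codeset (utf8 or UTF-8).
is_utf8_locale() {
    case "$1" in
        *.[Uu][Tt][Ff]8 | *.[Uu][Tt][Ff]-8) return 0 ;;
        *) return 1 ;;
    esac
}

current=$(effective_ctype "${LC_ALL:-}" "${LC_CTYPE:-}" "${LANG:-}")
if is_utf8_locale "$current"; then
    echo "$current: accented characters should work"
else
    echo "$current: not a UTF-8 locale"
fi
```

If the script reports something like C or POSIX in your terminal, one of the settings above did not take effect in that session.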

Keyboard layouts and xfce4-xkb-plugin

This was an extremely annoying bug that probably cost me a few years of my life expectancy in stress. After installing the xfce4-xkb-plugin, I added an extra keyboard layout (English Canada) and I set the layout switch key combination to alt-shift. Unfortunately, the plugin, as other users have reported, has a tendency to “forget” the layout and/or the key combination every once in a while.

The plugin seems to be patched in various distributions, but since archlinux rarely patches programs (which is generally a good thing), I was stuck with a buggy version.

After some research, I uninstalled the plugin and added this script in /etc/X11/xinit/xinitrc.d/50-keyboard-bart

#!/bin/bash

# Reset keyboard options
setxkbmap -option ""
# Set layout
setxkbmap -layout "us,ca"
# Set international keyboard
setxkbmap -model "pc105"
# Use alt-shift to switch layout
setxkbmap -option "grp:alt_shift_toggle"

If you go down that path, don’t forget to make the script executable. You also need to restart your X session (or you can just execute the script in the current session). The downside is that you do not see your current layout on your panel, but I see this as less clutter :–)

HelloWorld moving to Posterous

After evaluating blogging, micro-blogging, and sort-of-in-the-middle blogging platforms, I decided to leave my self-managed wordpress tech blog and transfer all my posts to posterous.

Although my tech blog started well (they all do), I never found the motivation to post regularly. It is surprising how logging in to your own website/blog can be a deterrent. I must say that fighting with the editor was not pleasant and left some scars: publishing code is a pain and line breaks seem to be interpreted randomly in the wordpress post editor (it is slightly better in blogger’s post editor, but not perfect).

Still, there are many small things I would like to post regularly so I tried to find a blogging platform that would suit my needs. After a good two hours of research, I selected Posterous because:

  1. You can post by email and they seem to be good at it. Talk about a low barrier to publishing :–)

  2. You can import your wordpress posts, but you will still need to edit them if you used fancy html or square tags (e.g., [cc])

  3. You can automatically cross-post to Facebook and Twitter, something that I had to manually perform each time I published a new post.

  4. After signing up, I discovered that it is ridiculously easy to customize your theme… You can select one of their themes, then select “advanced editing” and you get a nice html+css page that you can edit. For example, I increased the font size and changed the syntax highlighting font of the default theme.

Only time will tell if this is enough to vanquish my general apathy when it comes to blogging!

gitli: Lightweight Issue Tracker for git

Last week, I wanted to track issues (bugs, tasks, enhancements) for a private project and I could not find an issue tracker that was lightweight (command line) and dead easy to install and use. Trac is very nice and easy to use, but installing and configuring a trac instance is far from a walk in the park.

I considered using ticgit, which is a true distributed issue tracking system, i.e., tickets are hosted on a git repository and multiple contributors can manage the issues. Unfortunately, to avoid ticket number clashes (e.g., you don’t want Bob and Alice to create ticket #3 at the same time on their own repository), the ticket numbers are hash values like d7f2d8f6d6ec3da1a1800a33fb090d590a533bac or d7f2d8. That’s just unacceptable to me because my brain is not wired to think about ticket d7f2d8. I don’t see myself typing that kind of identifier on the command line either. Finally, I’m not even sure it would work for teams because talking about ticket d7f2d8 adds a lot more cognitive load than talking about ticket #7.

Long story short, I created my own small issue tracking system for personal projects: gitli. You can pronounce it the way you want, but I prefer to say “jit lee” (jet li anyone?). You can install it the way you install any python package (e.g., pip install gitli) or you can just download the files and put them on your PATH.

Using gitli is easy:

git init
git li init
git li new 'This is my first ticket'
git li -h #get some help!

gitli is not a distributed issue tracker, so conflicts can occur if multiple developers create issues in their own repositories. Because the datastore is a set of line-based text files, I hope that merging such conflicts will be easy. In any case, gitli was not intended to replace centralized (or distributed) issue tracking systems. It’s just a nice hack for personal projects.

Consult the readme file to learn more about gitli commands. I’ll consider feature requests and bug reports because I’m using gitli every day now!

Category not showing up in Eclipse Update Site

This issue has plagued my update sites for years and I finally found a workaround.

The problem: you create an Eclipse update site, you add a category, then you add a feature under the category. You build the update site (using either Build or Build All). Sometimes, when you try to install the features from the update site with "Install New Software..." you don't see your category and you need to uncheck "Group items by category". Even if you try to rebuild everything, delete the artifacts.jar and content.jar, you still cannot see your category. Annoying isn't it?

The solution: remove the category and the features from your update site. Save. Add them again. Build All. You need to do this every time you make a change (e.g., update the version number of a feature). This is a silly bug in Eclipse, but it is the only way I could reliably work around the issue.

Root cause: it seems that there is some caching of the site.xml involved and that restarting Eclipse may help, but it never worked for me.

Updating MoinMoin - Major Python Update on ArchLinux

This morning, I upgraded my server, which runs ArchLinux. As you may know, ArchLinux recently updated their Python packages so that the default python points to python3. Python 2.6 was also updated to Python 2.7 (accessible from python2).

Everything went relatively well, but I had to recreate all my virtual environments because they were too tightly integrated with Python 2.6. I have one virtual environment for each big application running on my server (Trac, MoinMoin, and my Ph.D. project, recodoc) and it took longer than expected...

MoinMoin must be the worst wiki on earth because every time I want to update it (either MoinMoin or Python), I get a few extra gray hairs. Here are the steps that worked for me:

  1. Stop your wsgi/fcgi server (in my case, gunicorn)
  2. Back up wiki config (share/moin) and wiki data/underlay directories.
  3. Remove the .pyc files from the backup (e.g., share/moin/config/wikiconfig.pyc)
  4. If upgrading Python, remove the cache files from the backup:
    rm -r path_to_moin/data/pages/*/cache/*
    rm -r path_to_moin/underlay/pages/*/cache/*
  5. Update MoinMoin/Python/Whatever
  6. Restore share/moin directory and wiki data/underlay directories.
  7. Restart wsgi/fcgi
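Steps 3 and 4 can be scripted so that you do not have to type rm -r with stars by hand. The sketch below is only an illustration: the function name clean_moin_backup is made up, and it assumes that the backup directory contains the wiki config plus the data and underlay directories laid out as in the commands above.

```shell
#!/bin/bash
# Hypothetical helper: cleans a MoinMoin backup before restoring it.
# Assumes BACKUP_DIR contains the config files plus data/ and underlay/.
# Usage: clean_moin_backup BACKUP_DIR
clean_moin_backup() {
    local backup_dir="$1"
    local cache

    # Step 3: remove compiled Python files anywhere in the backup.
    find "$backup_dir" -type f -name '*.pyc' -delete

    # Step 4: empty every per-page cache directory, keeping the directory.
    for cache in "$backup_dir"/data/pages/*/cache \
                 "$backup_dir"/underlay/pages/*/cache; do
        [ -d "$cache" ] || continue
        find "$cache" -mindepth 1 -delete
    done
}
```

Run it against the backup copy only, never against the live wiki, and check the result before restoring.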

Steps 3 and 4 are essential if you are upgrading Python. Indeed, compiled python files are not necessarily compatible from one version to the other (that was my first problem), and MoinMoin cache files contain python bytecode that may not be compatible either. Backing up the data/underlay directory is not strictly necessary (as long as you delete the cache files when upgrading Python), but I would hate losing my wiki data during an update.

Frankly, the only reason I stick with MoinMoin is that it's Python and not PHP. MoinMoin updates are just plain unusable, broken, and evil. Come on, why do I need to run such potentially nasty commands like rm -r with stars? Ok, back to work.