A Successful Optimization

Optimizing code has never been my strong point (my emphasis has always been on avoiding bugs, quite successfully for the most part). Quite often my attempts at speeding up my code yielded either no improvement or a negative "improvement", but the case I describe here turned out dramatically well.

I was working on a program that took another program as input and translated all the variable names into short ones. This was for a web application, so using shorter names would accomplish two things: reduce the amount of code sent to the browser, and make the source harder to read.

In order to do this, I had to build a lookup table, and since I was using Perl, a hash was the obvious choice:

$hashname{$old_variable_name} = $new_variable_name;

At first I just built a single hash structure, and populated it as just indicated.

Then as I read through the sources, for each line that wasn't a comment, I would go through all the names and substitute any that were found.
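This first version can be sketched in Python (the original was Perl, and the names in the table here are made-up placeholders): one flat rename table, with every known name tried against every non-comment line.

```python
import re

# Hypothetical rename table; in the original this mapped every
# function and variable name in the source to a generated short name.
renames = {"total_count": "a", "compute_sum": "b", "user_input": "c"}

def shorten(lines):
    out = []
    for line in lines:
        if not line.lstrip().startswith("#"):   # skip comment lines
            # The naive approach: try every known name on every line.
            for old, new in renames.items():
                if old in line:
                    line = re.sub(r"\b%s\b" % re.escape(old), new, line)
        out.append(line)
    return out

src = ["# add up the values",
       "total_count = compute_sum(user_input)"]
print(shorten(src))  # → ['# add up the values', 'a = b(c)']
```

Note the cost: every line pays for a lookup of every name in the table, whether or not any of them can possibly appear.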

The running time for this was around two minutes.

Then I decided to make a separate hash for each letter of the alphabet, using the first letter of each name to determine which hash it would go into. But since some letters were very common and others didn't appear at all, this wasn't very balanced.

During the scan I could check if a given letter was present in the line of code, and if not, skip checking the entries in the corresponding hash altogether.
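The per-letter scheme can be sketched like this (again Python rather than Perl, with invented placeholder names): bucket the rename table by first letter, and skip an entire bucket whenever its letter never occurs in the line, since no name starting with that letter can possibly match.

```python
import re
from collections import defaultdict

# Hypothetical rename table; keys are the old names.
renames = {"total_count": "a", "trim": "b", "compute_sum": "c", "user_input": "d"}

# One bucket per first letter of the old name.
buckets = defaultdict(dict)
for old, new in renames.items():
    buckets[old[0]][old] = new

def shorten_line(line):
    for letter, names in buckets.items():
        # If the letter is absent from the line, no name in this
        # bucket can match, so skip the whole bucket.
        if letter not in line:
            continue
        for old, new in names.items():
            line = re.sub(r"\b%s\b" % re.escape(old), new, line)
    return line

print(shorten_line("total_count = compute_sum(user_input)"))  # → a = c(d)
```

The single-character membership test is cheap, and each hit on it rules out a whole group of expensive name lookups at once.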

This reduced the run time by about half, to around one minute.

Then I tried to figure out a way to balance the lists better, and I hit upon an idea based on how Huffman codes are assembled.
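The scheme itself isn't spelled out above, so what follows is one guess at what a Huffman-inspired balancing could look like, a reconstruction rather than necessarily the original idea. Huffman coding starts from symbol frequencies, so instead of filing each name under its first letter, file it under its least frequent letter. Rare letters then soak up names that would otherwise pile into common buckets, and the skip test still holds: if a bucket's letter is absent from a line, no name in that bucket can occur in it.

```python
from collections import Counter, defaultdict

# Made-up name list standing in for the real identifiers.
names = ["total_count", "compute_sum", "user_input", "trim", "copy_all"]

# Letter frequencies across all names, as Huffman coding would start
# from symbol frequencies.
freq = Counter(ch for name in names for ch in name if ch.isalpha())

# Reconstruction of the balancing idea (not confirmed by the article):
# bucket each name under its least frequent letter, ties broken
# alphabetically, instead of under its first letter.
buckets = defaultdict(list)
for name in names:
    rarest = min(sorted(ch for ch in set(name) if ch.isalpha()),
                 key=lambda ch: freq[ch])
    buckets[rarest].append(name)

print(dict(sorted(buckets.items())))
```

By first letter, these five names would crowd into just three buckets (t, c, u); by rarest letter they spread into four, and the rarest letter is also a far more selective skip test than a common first letter.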

This improved algorithm reduced the run time to about 18 seconds, less than 1/5 of the original two minutes, an improvement of more than 80%.

Here are some numbers. There are 441 functions and 349 variables. The longest list of names is 28 for functions and 18 for variables. Most of the lists are shorter, and since there are 52 of them (a - z and A - Z) for each category, they average 8.5 names for functions and 6.7 for variables.
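The averages follow directly from the totals; a quick check of the arithmetic:

```python
functions, variables, num_lists = 441, 349, 52
print(round(functions / num_lists, 1))   # → 8.5
print(round(variables / num_lists, 1))   # → 6.7
```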

Gathering some statistics showed that the final version looked up around 64 identifiers per line on average, as opposed to 790 (every one of the 441 function names and 349 variable names) with the original algorithm. No wonder the running time dropped by so much!