1) Introduction:
One thing I notice as I read forums and blogs is that a lot of people are moving to languages that are more modern and simpler than C, such as Python, Java and C#. Most of them never learn even the basic concepts of algorithm complexity and data structures. I believe this trend is creating a generation of programmers who have no idea how to solve big problems by themselves.
I want to be perfectly clear here: I am not saying that you shouldn't stick to your favorite language. I just strongly suggest that every programmer should at least learn the differences between data structures, when to use one over another and, especially, how their performance scales.
I have lost count of how many times I have seen people who don't know the difference between a hash table and a binary tree, or who simply believe that the <insert language name here> optimizer will do a better job than they ever could.
In this post I will walk through a real-world problem I faced in my still-short career as a professional programmer. I will not, of course, reveal the name or the context of the application.
2) The Problem:
The system I was working on needs to keep track of the files saved on disk (in this case, an NFS). It does this by recording some data about each file in a text file.
The system is accessed by users over the network and, when a user logs in, it loads the text file's information into memory. Because multiple processes access the data at the same time, the system has to check whether any files were removed and update the internal list and the text file accordingly (we couldn't use locks because of the NFS implementation in use). Another important mechanism to avoid collisions was the filename pattern: each name was built from the file's creation timestamp, the PID of the process that created it, the size of the file and the name of the machine running that process.
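To make that pattern concrete, here is a minimal sketch of how such a collision-avoiding name could be built in C. The exact layout the real system used is not shown in this post, so the "timestamp_pid_size_hostname" order below is just an assumption for illustration:

    /* Hypothetical filename builder: the real pattern is not disclosed,
     * only its ingredients (timestamp, PID, file size, machine name). */
    #include <stdio.h>
    #include <time.h>
    #include <unistd.h>

    static void make_name(char *out, size_t outlen, long filesize)
    {
        char host[64];
        gethostname(host, sizeof(host) - 1);  /* name of the machine */
        host[sizeof(host) - 1] = '\0';
        snprintf(out, outlen, "%ld_%d_%ld_%s",
                 (long)time(NULL),            /* creation timestamp */
                 (int)getpid(),               /* PID of the creating process */
                 filesize,                    /* size of the file */
                 host);
    }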
One day I received the task of looking into the function that checks whether the files in the list still exist in the text, because it was REALLY slow (several minutes in some cases). Since the program was accessed over the internet, the session would time out before the check finished and the user couldn't reach their information.
Please keep in mind that this is a pretty old system and, when it was written, the number of files a user could keep was really small. As usage scaled over the years, the CPU cost got way too high and something had to be done.
3) The Original Approach:
The solution the program originally used was the most "intuitive" one. Step by step, it worked like this (a rough C sketch follows the list):
1. Find the text file size using stat.
2. Create a buffer with malloc big enough to fit the whole text.
3. Load the text into the buffer.
4. Mark each end of line as an end of string.
5. Read the nth entry of the files list (where n starts at 1).
6. Check each line of the buffer to see whether it matches the nth entry of the file list (there was a small optimization here: a matched line was marked as read, so it wouldn't be checked twice).
7. Increase n by one.
8. If n <= the number of files in the list, go to step 5; otherwise, go to step 9.
9. Free the buffer and finish.
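Here is a condensed sketch (not the original code) of the search at the heart of those steps, including the "mark as read" optimization; the helper names are mine:

    #include <stdio.h>
    #include <string.h>

    /* Linear search of one list entry among the lines of the loaded text.
     * Missing entries are the worst case: every line gets inspected. */
    static int find_linear(const char *entry, char **lines, char *seen, size_t n)
    {
        for (size_t i = 0; i < n; i++) {
            if (!seen[i] && strcmp(entry, lines[i]) == 0) {
                seen[i] = 1;   /* don't check the same line twice */
                return 1;
            }
        }
        return 0;
    }

    int main(void)
    {
        char *lines[] = { "a.dat", "b.dat", "c.dat" };  /* stands in for the buffer */
        char seen[3] = { 0 };
        const char *list[] = { "c.dat", "x.dat" };      /* the in-memory file list */
        for (size_t i = 0; i < 2; i++)  /* O(n) per entry, O(n^2) overall */
            printf("%s: %s\n", list[i],
                   find_linear(list[i], lines, seen, 3) ? "found" : "missing");
        return 0;
    }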
So, why was this algorithm slow? The first reason is that it used a linear search: for each entry in the file list, it had to walk through a lot of records, and in the worst case (which happened every time a file was no longer in the text) it had to look at every single one.
The second is that the data wasn't binary: the algorithm compared whole strings to check whether the name of a file matched an entry in the list.
4) Sandboxing and New Approaches:
As I have said before, this is a small part of a big program. Testing any change in the program itself would require me to create a user, create several fake files and go through the authentication process and the normal sequence of operations just to reach the point where the problematic algorithm runs (not to mention that I could end up hitting the timeout).
To avoid all those steps, I created a small standalone program that would exercise the algorithm for me. I also coded a Python script to generate a fake test list, as big as I needed. This was my sandbox environment.
As for new approaches, I decided to use a sorted array (on which I could run a binary search), and my contractors asked me to try a hash table too. For the first solution I used two different ordering methods: one parsing the entries' names so that I could handle the timestamps, sizes and PIDs as integers, and the other using the MD5 of the file name. I will explain the methods in more detail in the next two sections.
5) Hash Tables 101:
Most people have already used a hash table; it comes in most languages as a ready-to-use container under several names, such as map, dict or hashmap. Still, very few people actually know what a hash table is or how it works.
A hash table implementation is just an array plus a function that maps a key to the index in the array where its value really lives. For example, let's say we need to map an ID to an employee name. Then we:
- Create an array of n positions (let's use 211 for this example).
- Create a function that takes an integer key and maps it to an index of the array (in this example, we are going to use id % 211).
- Insert each employee at the index the function gives us:
- John Doe, id 67302 (67302 % 211 = 204)
- Alice A., id 98321 (98321 % 211 = 206)
- Floating Duck, id 48021 (48021 % 211 = 124)
- Burning Witch, id 20109 (20109 % 211 = 64)
- King Julian, id 32302 (32302 % 211 = 19)
This scheme looks great, but it comes with a few problems. The first one is called a collision. To understand it, imagine we want to add a new employee to the hash table: Jon Snow, id 96657. When we apply our hash function to Snow's id we get 19 as the result, but index 19 of the table is already taken by King Julian. In other words, a collision happens when two or more keys map to the same index.
There are several ways to resolve a collision, for instance: keeping a small list at each position instead of mapping the index directly to the wanted value, using the closest empty entry, and many others. The main point is that all of them introduce one big problem: they make searching for values slower.
It is also important to notice that, since collisions may happen, we have to add one extra step to our search algorithm: each entry must store its key, and after hashing we must check whether the entry we landed on is really the one we are looking for and not just a collision. Keeping the key with the record is what allows any collision-resolution strategy to match the candidate entries against the key we want.
The second problem is the limited space. As you add more entries, you will eventually need to expand the table (and the fuller the table, the more likely collisions become). After the table grows, the hash function must be updated so that its results cover the whole new size. For instance, our example uses the mod operation, which will never produce an index equal to or bigger than 211. So, if we want to grow the table to 483 positions, we have to recompute the index of every stored entry, which can take quite some time as the number of keys increases.
Finally, the last problem is that the memory cost is pretty big: in our example we had to create an array of 211 elements to store only 5, because we need plenty of free space to map our entries into, and we definitely want to avoid expanding the table.
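To tie these pieces together, here is a minimal, illustrative chained hash table using the same 211-position array and id % 211 function from the example. This is a sketch of the concept, not the implementation used in the real system:

    #include <stdio.h>
    #include <stdlib.h>

    #define TABLE_SIZE 211

    struct entry {
        int key;                 /* stored so lookups can tell real hits from collisions */
        const char *name;
        struct entry *next;      /* chaining: colliding entries go into a small list */
    };

    static struct entry *table[TABLE_SIZE];

    static unsigned hash(int id) { return (unsigned)id % TABLE_SIZE; }

    static void put(int id, const char *name)
    {
        struct entry *e = malloc(sizeof *e);
        e->key = id;
        e->name = name;
        e->next = table[hash(id)];   /* prepend to the bucket's list */
        table[hash(id)] = e;
    }

    static const char *get(int id)
    {
        for (struct entry *e = table[hash(id)]; e; e = e->next)
            if (e->key == id)        /* the extra key check described above */
                return e->name;
        return NULL;
    }

    int main(void)
    {
        put(32302, "King Julian");
        put(96657, "Jon Snow");      /* 96657 % 211 == 32302 % 211 == 19: a collision */
        printf("%s\n", get(96657));  /* still prints "Jon Snow", thanks to the key check */
        return 0;
    }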
6) Binary Search 101:
First, let's say we have an array of integers with these values:
318, 394, 73, 950, 286, 457, 825, 487, 504, 876,
521, 560, 192, 1, 627, 483, 717, 338, 70, 541
If we want to find 717 in this array, we have to look for it element by element until we find it, after seventeen tries. If we sort the array, we get this:
1, 70, 73, 192, 286, 318, 338, 394, 457, 483,
487, 504, 521, 541, 560, 627, 717, 825, 876, 950
Now, if we compare 717 with the element in the middle of the sorted array (483, the tenth), we know that 717 can only be in the upper half. This illustrates how binary search works: with a single check, we eliminate half of the remaining positions. A real search repeatedly looks at the middle point of the current range, turns it into the new lower or upper limit, picks a new middle point and so on. Here is the step by step for this array:
- lower = 1 upper = 20 mid = 10. Value = 483.
- lower = 10 upper = 20 mid = 15. Value = 560.
- lower = 15 upper = 20 mid = 17. Value = 717.
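In C, a textbook version looks like the sketch below. Note that it uses 0-based indices, so the intermediate steps differ slightly from the 1-based walkthrough above:

    #include <stdio.h>

    /* Classic binary search over the sorted array from the example. */
    static int binsearch(const int *a, int n, int wanted)
    {
        int lower = 0, upper = n - 1;
        while (lower <= upper) {
            int mid = (lower + upper) / 2;
            if (a[mid] == wanted) return mid;
            if (a[mid] < wanted) lower = mid + 1;  /* discard the lower half */
            else upper = mid - 1;                  /* discard the upper half */
        }
        return -1;  /* not present */
    }

    int main(void)
    {
        int a[] = { 1, 70, 73, 192, 286, 318, 338, 394, 457, 483,
                    487, 504, 521, 541, 560, 627, 717, 825, 876, 950 };
        /* a handful of comparisons instead of seventeen */
        printf("717 found at index %d\n", binsearch(a, 20, 717));
        return 0;
    }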
We must keep in mind that none of the costs of this approach is especially hard to deal with. The sort has to run only once, so it is not likely to have a huge impact, and the need for a single sortable key is easy to satisfy, even more so here, since the filenames were already designed to avoid that kind of problem.
7) Implementations used and complexity:
To decide which algorithm to use, I coded three solutions:
- A hash table, using the filename as a key.
- A binary search using the MD5 of the name as the key.
- A binary search using the parsed name data as multiple keys.
To implement the binary searches I used the <stdlib.h> header: the qsort function to sort the array and bsearch to run the binary search. In the MD5 variant I used the MD5 function from OpenSSL's <openssl/md5.h> header to generate the keys.
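The actual code isn't shown in this post, but a minimal sketch of how qsort and bsearch fit together for string keys would look like this (the array contents are made up for illustration):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* qsort/bsearch comparators receive pointers to the array elements;
     * for an array of char* that means pointers to char*. */
    static int cmp(const void *a, const void *b)
    {
        return strcmp(*(char *const *)a, *(char *const *)b);
    }

    int main(void)
    {
        char *names[] = { "c.dat", "a.dat", "d.dat", "b.dat" };
        size_t n = sizeof names / sizeof names[0];

        qsort(names, n, sizeof names[0], cmp);  /* sort once, O(n log n) */

        char *wanted = "b.dat";                 /* each lookup is O(log n) */
        char **hit = bsearch(&wanted, names, n, sizeof names[0], cmp);
        printf("%s %s\n", wanted, hit ? "found" : "missing");

        /* The MD5 variant would instead sort and search the 16-byte digests
         * produced by MD5() from <openssl/md5.h>, compared with memcmp. */
        return 0;
    }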
From the complexity point of view, the average case for the whole check (assuming the parsing of the entries is always O(1)) would be:
- Original method: n * n = O(n^2)
- Hash: n * (1 + n/k) = O(n), where k is the number of buckets
- Binary Search: n * log2 n = O(n log n)
And the worst case:
- Original method: n * n = O(n^2)
- Hash: n * n = O(n^2), when every key collides
- Binary Search: n * log2 n = O(n log n)
8) Results and conclusions:
Using my fake file list generator I created lists with 150, 1500, 15000 and 150000 entries. Here are the results (times are in microseconds; I use dots every three digits to make big numbers easier to read, not to denote a floating point):
150 entries:
Total time of the method 'Comparative Search' is 265 us.
Total time of the method 'Hash Table' is 169 us.
Total time of the method 'Binary Search' is 361 us.
Total time of the method 'Binary Search with md5' is 339 us.
1500 entries:
Total time of the method 'Comparative Search' is 21.885 us.
Total time of the method 'Hash Table' is 1.301 us.
Total time of the method 'Binary Search' is 1.616 us.
Total time of the method 'Binary Search with md5' is 1.747 us.
15000 entries:
Total time of the method 'Comparative Search' is 2.103.593 us.
Total time of the method 'Hash Table' is 34.684 us.
Total time of the method 'Binary Search' is 15.006 us.
Total time of the method 'Binary Search with md5' is 16.783 us.
150000 entries:
Total time of the method 'Comparative Search' is 215.411.798 us.
Total time of the method 'Hash Table' is 400.879 us.
Total time of the method 'Binary Search' is 162.278 us.
Total time of the method 'Binary Search with md5' is 188.064 us.
Looking at those results we can clearly see that, as the number of entries increases, the time taken by the old algorithm (called Comparative Search) grows quadratically: every time the number of entries grows 10 times, its processing time grows nearly 100 times.
Another thing we can notice is that we would expect the hash table to be faster than the binary search, but that is not what happens here. The reason is the time lost to collisions. The lesson is that hash tables are not that great when you have a lot of insertions and you can't determine the keys beforehand.
A final side note: some people may call me a cheater because I left the initialization times out of the measurements and because quicksort's worst case is O(n^2). I want to make it clear that this worst case is actually impossible here, due to the way the filename generation works. If you do want to take initialization into account, the average initialization time for the hash table is about 10% of its search time, while for the binary search it is about 20%.
9) References and Further Reading:
Hash Tables: Wikipedia.
Big O Notation: Wikipedia.
Quicksort: Wikipedia.
Hash table C functions: search.h information, hcreate man page, hsearch man page and hdestroy man page.
Binary search C functions: qsort man page and bsearch man page.
This site, with several different sorting algorithms and visual comparisons applied to several special cases.