Thursday, June 25, 2009

Pseudo-Uniqueness

The need to give things unique names is something that is present in many applications, from variable names to database fields. Openjet is no different and we already have two cases where we need the system to automatically generate unique identifiers.

The first thing we need a unique ID for is a visitor. In every part of the system we want to be able to refer to a certain visitor, to be able to retrieve and store information specifically for that user and track what the user does. Secondly, we are going to store points of interest, which have coordinates, a name, and a locale (language and culture setting). These three fields define the point of interest, and if they are the same as one already existing the points are considered equal.

Of course, we could get a unique ID for each item by just inserting them into the corresponding database table and let MySQL auto-increment the ID. In the case of visitors though, we do not want someone to be able to guess other people's IDs so it would have to be more complex than an number that increases by one for each visitor. When it comes to the points of interest we want a globally unique identifier since it might be compared with IDs from other types of locations. Because of this we cannot rely on MySQL to generate the IDs.

So what did we do? Well, I wrote a utility class that creates hashes from input strings of course! Using the MD5 algorithm for hashing, we will get a 32-digit hexadecimal number which will be virtually unique. Now I can hear some of you think "Virtually unique? How can you be sure it will not collide with an already existing ID?". Well the short answer is that we cannot!

If you are interested, the long answer now follows.

If I generate a 32-digit hexadecimal number, it consists of 128 bits. 128 bits can be set in 2^128 different combinations, which means that the chance of getting the same hash for another item is 1 in 340282366920938463463374607431768211456 (39 digits). Lets say we have an unrealistically busy site, generating a million new visitors or points of interest a day. Then after a thousand years we will have generated less than 4*10^11 different hashes. This means that have a chance of 4*10^11 in 2^128 of calculating the same hash at this point which is a probabiltiy in the magnitude of 10^-28.

Maybe our site is really successful and continues until no life can exist on earth (8Gyears from now). Well I am not going to bore you with the calculations for this one, but at that point we would have generated 3.2*10^21 different hashes. This means we would have the overwhelming probability of below 10^-18 of hitting one of the hashes we already used.

This is why I say virtually unique. The chance of getting the same ID is so small that it will probably never happen, and if it does the probability is so small that it probably did not. So I now leave it up to you to decide if this is unique enough for you!

No comments:

Post a Comment