Have you ever thought that how does Google perform a fast search on a wide variety of file types? For example, how Google is able to suggest you a list of search expressions while you are typing in your keywords?

Another example is Google image search: You upload an image and Google finds the similar photos for you in no time.

The key to this magic is SimHash. SimHash is a mechanism/algorithm invented by Charikar (see the full patent). SimHash comes from the combination of Similarity and Hash. This means that instead of comparing the objects with each other to find the similarity, we convert them to an N-bit number that represents the object (known as Hash) and compare them. In the other words, if instead of the object, we maintain a number that represents the object, so that we will be able to compare those numbers to find the similarity of the two objects.

The basics of SimHash are as below:

- Convert the object to a hash value. (From my experience, this is better to be an unsigned integer number).
- Count the number of matching bits. For example, are the bit 1 in two hash values the same? Are the 2
^{nd}bits the same? - Depending on the size of the hash value (number of bits), you will have a number between 0 and N, where N is the length of the hash value. This is called the Hamming Distance. Hamming distance was introduced by Richard Hamming in 1950 (see here).
- The number you have achieved must be normalized and finally be represented in a value such as percentage. To do so, we can use a simple formula such as Similarity = (HashSize-HamminDistance)/HashSize.

Since the hash value can be used to represent any kind of data, such as a text or an image file, it can be used to perform a fast search on almost any file type.

To calculate the hash value, we have to decide the hash size, which normally is 32 or 64. As I said before, an unsigned value works better. We also need to choose a chunk size. The chunk size will be used to break the data down to small pieces, called shingles. For example, if we decide to convert a string such as “Hello World” to a hash value, If the chunk size is 3, our chunks would be:

1-Hel

2-ell

3-llo

Etc.

To convert a binary data to a hash value, you will have to break it down to chunks of bits. E.g. you have to pick every K bits. Google says that N=64 and K=3 are recommended.

To calculate the hash value in a SimHash manner, we have to take the following steps:

- Tokenize the data. To tokenize the data, we will have to break it down to small chunks as mentioned above and store the chunks in an array.
- Create an array (called a vector) of size N, where N is the size of the hash (let’s call this array V).
- Loop over the array of tokens (assume that
*i*is the index of each token), - Loop over the bits of each token (assume that
*j*is the index of each bit), - If Bit[j] of Token[i] is 1, then add 1 to V[j] otherwise subsidize 1 from V[j]
- Assume that the fingerprint is an unsigned value (32 or 64 bit) and is named F.
- Once the loops finish, go through the array V, and if V[i] is greater than 0, set bit
*i*in F to 1 otherwise to 0. - Return F as the fingerprint.

Here is the code:

private int DoCalculateSimHash(string input)

{

ITokeniser tokeniser = new Tokeniser();

var hashedtokens = DoHashTokens(tokeniser.Tokenise(input));

var vector = new int[HashSize];

for (var i = 0; i < HashSize; i++)

{

vector[i] = 0;

}

foreach (var value in hashedtokens)

{

for (var j = 0; j < HashSize; j++)

{

if (IsBitSet(value, j))

{

vector[j] += 1;

}

else

{

vector[j] -= 1;

}

}

}

var fingerprint = 0;

for (var i = 0; i < HashSize; i++)

{

if (vector[i] > 0)

{

fingerprint += 1 << i;

}

}

return fingerprint;

}

And the code to calculate the hamming distance is as below:

private static int GetHammingDistance(int firstValue, int secondValue)

{

var hammingBits = firstValue ^ secondValue;

var hammingValue = 0;

for (int i = 0; i < 32; i++)

{

if (IsBitSet(hammingBits, i))

{

hammingValue += 1;

}

}

return hammingValue;

}

You may use different ways to tokenize a given data. For example you may break a string value down to words, or to n-letter words, or to n-letter overlapping pieces. From my experience, if we assume that N is the size of each chunk and M is the number of overlapping characters, N=4 and M=3 are the best choices.

You may download the full source code of SimHash from SimHash.CodePlex.com. Bear in mind that SimHash is patented by Google!