Can Techies be Leaders?


Although I am a techie myself, one thing I am passionate about is teaching management skills to new managers and emerging talent, both online and on-site.

One thing I have learned through teaching management and mentoring talent is that transitioning from a tech person to a manager is quite difficult for most people. I often hear from my senior engineers that “I do not see management capabilities in myself”. You may think this is a self-esteem issue, which is partly correct; however, we have to bear in mind that a lack of confidence can arise from a lack of knowledge and skills.

I have also witnessed that companies tend to promote technical staff (e.g. senior developers) to tech leadership positions. Although in rare cases it turns out to be a successful experience, in most cases it does not end well.

In fact my personal experience with this was appalling! In 2005, when I was asked to manage a team for the first time, I was a senior developer who had many good ideas and was able to deliver work fast and with good quality. But I am not sure that was enough for me to become a manager. From what I remember, the first few weeks went well, but then issues started coming up. There were quite a few reasons for that, and the most important among them was my “Technical Mentality”!  A “Technical Mentality” is the thinking process of technical people (e.g. engineers, developers, etc.), which is very black and white. 1 + 1 is always equal to 2 in a techie’s mind, and white is always white!  Plus, they tend to think that they are always right, and it is too hard to convince them otherwise!

So you may ask, “isn’t 1 + 1 equal to 2?”. Or you may ask, “what do you mean that white is not always white?”.

The fact is that for good managers and leaders there is always a gray area, and they tend to find a win-win solution when disagreements arise. They are flexible, and instead of insisting on their own opinion or solution, they listen actively and try to understand where the other person is coming from. For them a win-win situation is better than a triumph in a technical debate, which may damage the other person’s, or even the entire team’s, engagement, positive energy and morale.

You would not believe how many techies I know who dislike someone just because that person is not as good as they are at technical subjects! Unlike these types of techies, leaders never like or dislike their teammates for their strengths or weaknesses. A manager’s job is to focus on people’s strengths and help them improve their weaknesses. What I believe, and always say, is this: if you do not love your team, you cannot do anything good for them!

But is this “techie mindset” fixable? My answer to this question is “Hell yes!”. A techie mindset is like a crooked tooth! What it needs is some force in the right direction, and over time it will go where it should! For tech people and new managers the force is courses and/or books, and the direction comes from mentoring! Just like any other job, one can become a manager by learning the concepts and skills, putting them into practice and making them a habit!


Here is a comprehensive and easy-to-follow management/leadership course for beginners. The reviews on this course say that its students have discovered new potential in themselves and have learned management skills that improved their leadership.

Setting up a TeamCity 2017 Cluster


It is very common for companies, teams and engineers to set up TeamCity in a way that prevents TeamCity from unleashing its real power!

The two major mistakes made in setting up a TeamCity-based continuous integration and delivery system are:

  • Putting the build agent software on TeamCity server (server hosting both TeamCity software and the build agent software).
  • Installing TeamCity as a single server.

Although TeamCity allows you to install both the TeamCity web application and its build agent service on the same computer, that does not mean it is a good idea! Installing both on the same machine is fine for learning purposes only; otherwise this setup will cause headaches. So why shouldn’t you put the TeamCity 2017 web site and its build agent on the same computer?

  • Because servers should have a single responsibility! One server should not have two jobs, which means your server must be either a TeamCity server or a build agent server.
  • Because build agents run the risk of becoming corrupted and you may need to kill them at times! If this happens, you would have to kill your TeamCity server too!
  • Because you may want to take advantage of cloud build agents in order to save money. If you set up your build agent on your TeamCity server, you will not be able to terminate or stop your build agents when you are not using them.
  • Because you cannot scale out your build agents. The more concurrent builds you have, the more build agents TeamCity will spin up, but if you put the agent and the web application on the same server this cannot be done.

Setting up TeamCity as a single server is also not a good idea for production use, as it will not be able to scale out. The figure below shows a typical (and not ideal) TeamCity setup:

Single-server setup of TeamCity

As you can see, the TeamCity server will be able to spin up multiple TeamCity build agents (if you set up cloud build agents), but it will not be able to scale out the actual TeamCity application. Nor will it be highly available, as a failure in the TeamCity server will take down your whole CI/CD system.

In order to have a highly available and highly scalable TeamCity setup, you need to use a centralized database such as SQL Server or MySQL, so that all the projects, build configurations, templates and so on can be shared amongst the TeamCity servers.

Apart from a shared database, the data directory of the TeamCity servers has to be shared too, so when you install the servers make sure you choose the same shared (e.g. network) location for all of them. You can also update the path of the data directory later by modifying the configuration file.
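As a sketch (the host name and paths here are made up, and exact property names may vary between TeamCity versions), each node can be pointed at the shared data directory via the TEAMCITY_DATA_PATH environment variable, while the shared database is configured in the data directory’s config/database.properties file:

```
# Every TeamCity node must point at the same shared (e.g. NFS) data directory
export TEAMCITY_DATA_PATH=/mnt/shared/teamcity-data

# $TEAMCITY_DATA_PATH/config/database.properties -- shared MySQL database
connectionUrl=jdbc:mysql://db.example.com:3306/teamcity
connectionProperties.user=teamcity
connectionProperties.password=********
```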

Scalable and Highly Available setup of TeamCity

Speaking in Amazon Web Services language, in the above diagram our TeamCity servers are part of an Auto Scaling group, which means that as the load on the servers goes up, AWS will spin up more and more TeamCity servers.

If you want your setup to be highly available as well, you will have to make sure that your Auto Scaling group stretches across multiple Availability Zones (physical locations) too.

To learn more about CI/CD with TeamCity 2017 and AWS CodeDeploy, you can see this online course. I have set a very special price on this course for my blog readers 🙂

TeamCity 2017: Build and deploy the modern way!

Grafana, Graphite and StatsD: Visualize your metrics!


Hey guys,

I have recently published two new courses on Udemy, and one of them became a “Best Seller” very fast, in only two weeks! This course is interesting for a lot of DevOps engineers, because monitoring and visualising the metrics of infrastructure, websites and applications is an absolute must-have skill for every DevOps engineer!

Developers also show interest in this course, because the almighty Grafana-Graphite pair is an excellent toolset for instrumentation and health checks of applications and websites.

In my course, over 25 lectures and 5.5 hours of video, I explain how you can extract and visualise metrics from various sources, whether they are supported by Grafana out of the box or not!

If you are an IT professional I highly recommend this course to you, because in my experience good instrumentation and visibility into your systems’ metrics is what sets a good software system apart from an average one!

Here is a coupon for my blog readers, which offers 80% discount!

Grafana, Graphite and StatsD: Visualize your metrics


Working with AWS S3 through C#


A while ago I needed to use AWS S3 (Amazon’s cloud-based file storage) to store some files and then download them or get their listing through C#. As ironic as it sounds, I noticed that there was no ready-made .NET repository implementation nor documentation for this, so I decided to create a file repository in C# which lets .NET developers access S3 programmatically.

Here is a rundown as to how you would work with AWS S3 through C#:

In order to use any of the Amazon Web Services (AWS) APIs, you will have to add the NuGet package provided by Amazon. Simply bring up the NuGet Package Manager window and search for the keyword AWS. The first item in the search results is most likely AWS SDK for .NET, which must be installed before you can access S3.


Once the SDK is installed, we have to find the properties of our S3 bucket and place them somewhere in the web.config (or app.config) file. Normally these three properties of the S3 bucket are required in order to access it securely:

 

  1. Secret key
  2. Access key
  3. Region end point

These details will be provided to you by your cloud administrator. Here is a list of region endpoints that you can place in your configuration file (e.g. us-west-1):

 

  • US Standard * (us-east-1): endpoint s3.amazonaws.com (Northern Virginia or Pacific Northwest) or s3-external-1.amazonaws.com (Northern Virginia only); no location constraint required; HTTP and HTTPS
  • US West (Oregon) region (us-west-2): endpoint s3-us-west-2.amazonaws.com; location constraint us-west-2; HTTP and HTTPS
  • US West (N. California) region (us-west-1): endpoint s3-us-west-1.amazonaws.com; location constraint us-west-1; HTTP and HTTPS
  • EU (Ireland) region (eu-west-1): endpoint s3-eu-west-1.amazonaws.com; location constraint EU or eu-west-1; HTTP and HTTPS
  • EU (Frankfurt) region (eu-central-1): endpoint s3.eu-central-1.amazonaws.com; location constraint eu-central-1; HTTP and HTTPS
  • Asia Pacific (Singapore) region (ap-southeast-1): endpoint s3-ap-southeast-1.amazonaws.com; location constraint ap-southeast-1; HTTP and HTTPS
  • Asia Pacific (Sydney) region (ap-southeast-2): endpoint s3-ap-southeast-2.amazonaws.com; location constraint ap-southeast-2; HTTP and HTTPS
  • Asia Pacific (Tokyo) region (ap-northeast-1): endpoint s3-ap-northeast-1.amazonaws.com; location constraint ap-northeast-1; HTTP and HTTPS
  • South America (Sao Paulo) region (sa-east-1): endpoint s3-sa-east-1.amazonaws.com; location constraint sa-east-1; HTTP and HTTPS

 

In order to avoid adding the secret key, access key and region endpoint to the <appSettings> section of your configuration file, and to make this tool more organised, I have created a configuration class. This class lets you access a custom configuration section dedicated to S3. To configure your app.config (or web.config) file, add these <sectionGroup> and <section> elements to your configuration file:

 

<configSections>
  <sectionGroup name="AspGuy">
    <section name="S3Repository"
             type="Aref.S3.Lib.Strategies.S3FileRepositoryConfig, Aref.S3.Lib"
             allowLocation="true"
             allowDefinition="Everywhere" />
  </sectionGroup>
</configSections>

The S3FileRepositoryConfig class inherits from the ConfigurationSection class and has properties that map to the configuration elements of your .config file. A sample configuration for S3 looks like this:

 

<AspGuy>
  <S3Repository
    S3.ReadFrom.AccessKey="xxxxxxxxxx"
    S3.ReadFrom.SecretKey="yyyyyyyyyyyyyyyyy"
    S3.ReadFrom.Root.BucketName="-bucket-name-"
    S3.ReadFrom.RegionName="ap-southeast-2"
    S3.ReadFrom.RootDir="">
  </S3Repository>
</AspGuy>

Note that <AspGuy> comes from the name attribute of the <sectionGroup name="AspGuy"> element, and the <S3Repository> tag comes from the name of the <section> element. Each property of S3FileRepositoryConfig is mapped to an attribute of the <S3Repository> element.

Apart from SecretKey, AccessKey and BucketName, you can specify a root directory name as well. This setting lets you begin accessing the S3 bucket from a specific folder rather than from its root, and it is optional. For example, imagine a bucket with the following folder structure:

  • Dir1
  • Dir1/Dir1_1
  • Dir1/Dir1_2

If you set the RootDir property to “” and then call the GetSubDirNames method of the S3 file repository, it will return “Dir1”, because Dir1 is the only top-level folder in the bucket. If you set RootDir to “Dir1” and then call GetSubDirNames, you will get two entries: “Dir1_1” and “Dir1_2”.

Here is the code of the configuration class mentioned above:

 
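A minimal sketch of what that configuration class can look like, using System.Configuration’s attribute-based properties (treat the member names as illustrative, not the exact code from the repository):

```csharp
using System.Configuration;

// Maps the <S3Repository> element and its attributes to typed properties.
public class S3FileRepositoryConfig : ConfigurationSection
{
    [ConfigurationProperty("S3.ReadFrom.AccessKey", IsRequired = true)]
    public string AccessKey { get { return (string)this["S3.ReadFrom.AccessKey"]; } }

    [ConfigurationProperty("S3.ReadFrom.SecretKey", IsRequired = true)]
    public string SecretKey { get { return (string)this["S3.ReadFrom.SecretKey"]; } }

    [ConfigurationProperty("S3.ReadFrom.Root.BucketName", IsRequired = true)]
    public string BucketName { get { return (string)this["S3.ReadFrom.Root.BucketName"]; } }

    [ConfigurationProperty("S3.ReadFrom.RegionName", IsRequired = true)]
    public string RegionName { get { return (string)this["S3.ReadFrom.RegionName"]; } }

    [ConfigurationProperty("S3.ReadFrom.RootDir", DefaultValue = "")]
    public string RootDir { get { return (string)this["S3.ReadFrom.RootDir"]; } }

    // Loads the section registered under <AspGuy><S3Repository> in the .config file.
    public static S3FileRepositoryConfig Load()
    {
        return (S3FileRepositoryConfig)ConfigurationManager.GetSection("AspGuy/S3Repository");
    }
}
```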

For the repository class I have created an interface, to remove the dependency of clients (e.g. a web service that may need to work with various file storages) on S3. This lets you add your own implementations for the file system, FTP and other storage types and use them through dependency injection. Here is the code of this interface:

 
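A sketch of the interface consistent with the method descriptions below (the parameter names and exact signatures are my assumptions):

```csharp
using System.Collections.Generic;

// Storage-agnostic file repository contract; an S3-backed class is one possible
// implementation, but a file-system or FTP one could satisfy it too.
public interface IFileRepository
{
    void Download(string fileName, string localPath); // fetch a hosted file to disk
    void ChangeDir(string relativePath);              // "/x" = absolute from RootDir
    IEnumerable<string> GetFileNames();               // files in the current folder
    IEnumerable<string> GetSubDirNames();             // sub-folders of the current folder
    void AddFile(string localPath);                   // upload a file
    bool FileExists(string fileName);                 // is the file already there?
    void DeleteFile(string fileName);                 // remove the file
}
```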

In this interface:

  • Download: Downloads a file hosted on S3 to disk.
  • ChangeDir: Changes the current directory/folder to the given directory. If the new directory (relativePath parameter) starts with / then the path will be representing an absolute path (starting from the RootDir) otherwise it will be a relative path and will start from the current directory/folder.
  • GetFileNames: Retrieves the file names of the current folder
  • GetSubDirNames: Retrieves the name of folders in the current folder
  • AddFile: Uploads a file to S3
  • FileExists: Checks to see if a file is already on S3
  • DeleteFile: Deletes the file from S3

The implementation of these methods is quite simple using the AWS SDK for .NET. The only tricky part is that S3 does not support folders: in S3 everything is a key-value pair and the structure of entries is completely flat. What we do, however, is use the forward slash character to represent folders, and then use this character as a delimiter to emulate a folder structure.
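For example, the immediate “sub-folders” under a prefix can be listed by asking S3 to group keys on the / delimiter; the grouped prefixes come back in CommonPrefixes. This is a sketch against the AWS SDK for .NET, with the bucket and prefix values as placeholders:

```csharp
using System.Collections.Generic;
using Amazon.S3;
using Amazon.S3.Model;

public static class S3Folders
{
    // Lists the immediate "sub-folders" under a prefix (e.g. "Dir1/")
    // by grouping keys on the first "/" after that prefix.
    public static IEnumerable<string> GetSubDirNames(IAmazonS3 client, string bucket, string prefix)
    {
        var request = new ListObjectsRequest
        {
            BucketName = bucket,
            Prefix = prefix,   // start from this "folder"
            Delimiter = "/"    // group keys on the next slash
        };
        var response = client.ListObjects(request);
        foreach (var commonPrefix in response.CommonPrefixes)
        {
            // "Dir1/Dir1_1/" -> "Dir1_1"
            yield return commonPrefix.Substring(prefix.Length).TrimEnd('/');
        }
    }
}
```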


You can clone the GitHub repository for this code to play with it, and feel free to send a pull request if you would like to improve it. The repository is located at https://github.com/aussiearef/S3FileRepository


Building high-performance ASP.NET applications


If you are building public-facing web sites, one of the things you want to achieve by the end of the project is good performance under load. That means you have to make sure your product works under a heavy load (e.g. 50 concurrent users, or 200 users per second, etc.), even if you don’t expect that much load at the moment. Chances are the web site will attract more and more users over time, and if it cannot tolerate the load it will start flaking, leaving you with an unhappy customer and a ruined reputation.

There are many articles on the Internet about improving the performance of ASP.NET web sites, which all make sense; however, I think there are some more things you can do to save yourself from facing massive dramas. So what steps can be taken to produce a high-performance ASP.NET or ASP.NET MVC application?

  • Load test your application from early stages

The majority of developers tend to leave load testing (if they ever do it) until the application is developed and has passed the integration and regression tests. Even though performing a load test at the end of the development process is better than not doing it at all, it may be way too late to fix performance issues once your code has already been written. A very common example is an application that does not respond properly under load, for which scaling out (adding more servers) is considered, but sometimes this is simply not possible because the code is not suitable for it. For instance, when the objects stored in Session are not serializable, adding more web nodes or worker processes is impossible. If you find out at an early stage of development that your application may need to be deployed on more than one server, you will do your tests in an environment that is close to your final environment in terms of configuration, number of servers and so on, and your code will be adapted a lot more easily.

  • Use high-performance libraries

Recently I was diagnosing the performance issues of a web site and came across a hot spot in the code where JSON messages coming from a third-party web service had to be de-serialized several times. Those JSON messages were de-serialized with Newtonsoft.Json, and it turned out that Newtonsoft.Json was not the fastest library when it came to de-serialization. We replaced Json.NET with a faster library (e.g. ServiceStack’s serializer) and got a much better result.

Again, if the load test had been done at an early stage, when we picked Json.NET as our serialization library, we would have found that performance issue much sooner, and would not have had to make so many changes to the code and re-test it entirely.

  • Is your application CPU-intensive or IO-intensive?

Before you start implementing your web site, while the project is being designed, one thing you should think about is whether your site is CPU-intensive or IO-intensive. This is important for choosing a strategy to scale your product.

For example, if your application is CPU-intensive you may want to use a synchronous pattern with parallel processing, whereas for a product with many IO-bound operations, such as communicating with external web services or network resources (e.g. a database), the Task-based Asynchronous Pattern may be more helpful for scaling out. Plus, you may want to have a centralized caching system in place, which will let you create Web Gardens and Web Farms in the future, spreading the load across multiple worker processes or servers.

  • Use Task-based Asynchronous Model, but with care!

If your product relies on many IO-bound operations, or includes long-running operations that may make expensive IIS threads wait for an operation to complete, you had better think about using the Task-based Asynchronous Pattern for your ASP.NET MVC project.

There are many tutorials on the Internet about asynchronous ASP.NET MVC actions (like this one), so in this blog post I will refrain from explaining them. However, I do have to point out that traditional synchronous actions in an ASP.NET (MVC) site keep IIS threads busy until the operation is done or the request is processed. This means that if the site is waiting for an external resource (e.g. a web service) to respond, the thread stays busy. The number of threads in .NET’s thread pool that can be used to process requests is limited too; therefore, it’s important to release threads as soon as possible. A task-based asynchronous action or method releases the thread until the request is processed, then grabs a new thread from the thread pool and uses it to return the result of the action. This way, many requests can be processed by a few threads, which leads to better responsiveness for your application.

Although the Task-based Asynchronous Pattern can be very handy for the right applications, it must be used with care. There are a few concerns you must keep in mind when you design or implement a project based on the Task-based Asynchronous Pattern (TAP). You can see many of them in here; however, the biggest challenge developers face when using the async and await keywords is knowing that in this context they have to deal with threads slightly differently. For example, given a method that returns a Task (e.g. Task&lt;Product&gt;), you can call .Wait() on that task, or simply read task.Result, to force the task to run and fetch the result. In a method or action built on TAP, either of those calls will block your running thread, make your program sluggish and potentially even cause deadlocks.
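As a contrived sketch of the difference (the URL and names here are made up):

```csharp
using System.Net.Http;
using System.Threading.Tasks;

public static class NameService
{
    private static readonly HttpClient Client = new HttpClient();

    // Good: await releases the calling thread while the HTTP call is in flight.
    public static async Task<string> GetNameAsync(int code)
    {
        var body = await Client.GetStringAsync("http://example.com/names/" + code);
        return body.Trim();
    }

    // Bad: .Result blocks the calling thread until the task completes, and in
    // an ASP.NET (or UI) synchronization context this pattern can deadlock.
    public static string GetNameBlocking(int code)
    {
        return GetNameAsync(code).Result;
    }
}
```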

  • Distribute caching and session state

    It’s very common for developers to build a web application on a single development machine and assume the product will run on a single server too, whereas that’s usually not the case for big public-facing web sites. They often get deployed to more than one server behind a load balancer. Even though you can still deploy a web site with in-proc caching on multiple servers using sticky sessions (where the load balancer directs all requests belonging to the same session to a single server), you may have to keep multiple copies of session data and cached data. For example, if you deploy your product on a web farm of four servers and keep the session data in-proc, when a request comes through, the chance of hitting a server that already contains the cached data is 1 in 4, or 25%, whereas with a centralized caching mechanism in place, the chance of finding a cached item for every request is 100%. This is crucial for web sites that rely heavily on cached data.

    Another advantage of having a centralized caching mechanism (using something like AppFabric or Redis) is the ability to implement a proactive caching system around the actual product. A proactive caching mechanism may be used to pre-load the most popular items into the cache before they are even requested by a client. This can massively improve the performance of a big data-driven application, provided you manage to keep the cache synchronized with the actual data source.

  • Create Web Gardens

As mentioned before, in an IO-bound web application that involves quite a few long-running operations (e.g. web service calls), you may want to free up your main thread as much as possible. By default every web site runs under one main thread, which is responsible for keeping your web site alive, and unfortunately when it’s too busy, your site becomes unresponsive. One way of adding more “main threads” to your application is adding more worker processes to your site under IIS. Each worker process includes a separate main thread, so if one is busy there will be another one to process incoming requests.

Having more than one worker process will turn your site into a Web Garden, which requires your Session and Application data to be persisted out-of-process (e.g. on a State Server or SQL Server).

  • Use caching and lazy loading in a smart way

    There is no need to emphasize that if you cache a commonly accessed bit of data in memory, you will reduce the database and web service calls. This specifically helps IO-bound applications which, as I said before, can cause a lot of grief when the site is under load.

    Another approach to improving the responsiveness of your site is Lazy Loading. Lazy Loading means that the application does not load a certain piece of data up front, but it knows where that data is. For example, if there is a drop-down control on your web page that is meant to display a list of products, you don’t have to load all the products from the database when the page loads. You can add a jQuery function to your page which populates the drop-down list the first time it is pulled down. You can apply the same technique in many places in your code, such as when you work with LINQ queries and CLR collections.
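On the server side, .NET’s Lazy&lt;T&gt; gives you the same kind of deferral for expensive data; in this sketch the product loader stands in for a real database call:

```csharp
using System;
using System.Collections.Generic;

public class ProductCatalog
{
    // The factory does not run until .Value is read for the first time;
    // subsequent reads reuse the already-created list.
    private readonly Lazy<List<string>> _products =
        new Lazy<List<string>>(LoadProductsFromDatabase);

    public List<string> Products { get { return _products.Value; } }

    private static List<string> LoadProductsFromDatabase()
    {
        // Stand-in for an expensive database or web-service call.
        return new List<string> { "Keyboard", "Mouse", "Monitor" };
    }
}
```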

  • Do not put C# code in your MVC views

    Your ASP.NET MVC views are compiled at run time, not at compile time. Therefore, if you include too much C# code in them, that code will not be compiled into your DLL files. Not only does this damage the testability of your software, it also makes your site slower, because every view takes longer to display (as it must first be compiled). Another downside of adding code to the views is that they cannot run asynchronously, so if you decide to build your site based on the Task-based Asynchronous Pattern (TAP), you won’t be able to take advantage of asynchronous methods and actions in the views.

    For example if there is a method like this in your code:

    public async Task<string> GetName(int code)
    {
        var result = …
        return await result;
    }

This method can be run asynchronously in the context of an asynchronous ASP.NET MVC action like this:

    public async Task<ActionResult> Index(CancellationToken ctx)
    {
        var name = await GetName(100);
        return View();
    }

But if you call this method in a view, because the view is not asynchronous, you will have to run it in a thread-blocking way like this:

var name = GetName(100).Result;

.Result blocks the running thread until GetName() completes, so the execution of the app halts for a while, whereas when this code is called with the await keyword the thread is not blocked.

  • Use Fire & Forget when applicable

If two or more operations do not form a single transaction, you probably do not have to run them sequentially. For example, if users can sign up and create an account on your web site, and once they register you save their details in the database and then send them an email, you don’t have to wait for the email to be sent to finalize the operation.

In such a case, the best approach is probably to have the email sent on a background thread and return to the main flow immediately. This is called a fire-and-forget mechanism, and it can improve the responsiveness of an application.
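A minimal sketch of the idea (the service and method names are made up); note that background work started this way can be lost if the IIS application pool recycles, so anything critical belongs in a proper queue:

```csharp
using System;
using System.Threading.Tasks;

public class SignUpService
{
    public void Register(string email)
    {
        SaveAccount(email); // part of the transaction - must succeed

        // Fire and forget: the welcome email is sent on a pool thread and the
        // method returns immediately. Failures are logged, not propagated.
        Task.Run(() =>
        {
            try { SendWelcomeEmail(email); }
            catch (Exception ex) { Console.Error.WriteLine(ex); }
        });
    }

    private void SaveAccount(string email) { /* ... persist the user ... */ }
    private void SendWelcomeEmail(string email) { /* ... SMTP call ... */ }
}
```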

  • Build for x64 CPU

32-bit applications are limited to a lower amount of memory and have access to fewer calculation features/instructions of the CPU. To overcome these limitations, if your server is a 64-bit one, make sure your site is running under 64-bit mode (by making sure the option for running a site under 32-bit mode in IIS is not enabled). Then compile and build your code for x64 CPU rather than Any CPU.

One example of x64 being helpful: to improve the responsiveness and performance of a data-driven application, a good caching mechanism is a must. In-proc caching is a memory-consuming option because everything is stored within the memory boundaries of the site’s application pool. For an x86 process, the amount of memory that can be allocated is limited to 4 GB, so if loads of data are added to the cache, this limit will soon be reached. If the same site is built explicitly for an x64 CPU, this memory limit goes away, so more items can be added to the cache, meaning less communication with the database and better performance.

  • Use monitoring and diagnostic tools on the server

    There may be many performance issues that you never see with the naked eye because they never appear in error logs. Identifying performance issues is even more daunting when the application is already on production servers, where you have almost no chance of debugging.

    To find slow processes, thread blocks, hangs, errors and so forth, it’s highly recommended to install a monitoring and/or diagnostic tool on the server and have it track and monitor your application constantly. I personally have used New Relic (which is a SaaS product) to check the health of our online sites. See HERE for more details and to create your free account.

  • Profile your running application

    Once you finish developing your site, deploy it to IIS, attach a profiler (e.g. the Visual Studio Profiler) and take snapshots of various parts of the application: for example, the purchase operation or the user sign-up operation. Then check whether there is any slow or blocking code. Finding those hot spots at an early stage may save you a great amount of time, reputation and money.

Web API and returning a Razor view


There are scenarios in which an API in a Web API application needs to return formatted HTML rather than a JSON message. For example, we worked on a project where most APIs performed a search and returned the results as JSON or XML, while a few of them had to return HTML to be used by an Android app (in a WebView container).

One solution would be breaking the controller into two: one inheriting from the MVC Controller class and the other deriving from ApiController. However, since those APIs are in the same category in terms of functionality, I would keep them in the same controller.

Moreover, using ApiController and returning HttpResponseMessage lets us modify the implementation details in the future without having to change the return type (e.g. from ActionResult to HttpResponseMessage), and it will also make it easier to upgrade to Web API 2 later.

The advent of IHttpActionResult in Web API 2 allows developers to return custom data. In case you are not using ASP.NET MVC 5 yet, or you are after an easier way, keep reading!

To parse and return a Razor view in a Web API project, simply add some views to your application, just as you would for a normal ASP.NET MVC project. Then, through NuGet, find and add RazorEngine, which is a cool tool for reading and parsing Razor views.

Inside the API, simply create an object to act as a model, load the content of the view as text, pass the view’s body and the model to RazorEngine, and get a parsed version of the view back. Since the API is meant to return HTML, the content type must be set to text/html.
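Putting those pieces together might look like the sketch below. The inline template string stands in for the .cshtml file you would load from disk, and Razor.Parse is the old RazorEngine 3.x API (newer versions use Engine.Razor.RunCompile instead):

```csharp
using System.Net;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Web.Http;
using RazorEngine;

public class ReportsController : ApiController
{
    public HttpResponseMessage Get(int id)
    {
        // In a real project you would File.ReadAllText(...) the view file;
        // the inline template keeps this example self-contained.
        const string template = "<h1>Hello @Model.Name</h1>";
        var model = new { Name = "World" };

        // RazorEngine compiles the template and binds the model to it.
        var html = Razor.Parse(template, model);

        var response = new HttpResponseMessage(HttpStatusCode.OK)
        {
            Content = new StringContent(html)
        };
        response.Content.Headers.ContentType = new MediaTypeHeaderValue("text/html");
        return response;
    }
}
```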


The model is bound to type “dynamic”, which lets the view accept a wide range of types. You can move the code from your API into a helper class (or something similar) and create a function which accepts a view name and a model, and returns the rendered HTML.

The magic behind the Google Search


Have you ever wondered how Google performs fast searches on a wide variety of file types? For example, how is Google able to suggest a list of search expressions while you are still typing in your keywords?

Another example is Google image search: you upload an image and Google finds similar photos for you in no time.

The key to this magic is SimHash. SimHash is a mechanism/algorithm invented by Charikar (see the full patent). The name comes from combining Similarity and Hash. Instead of comparing objects with each other to find their similarity, we convert each one to an N-bit number that represents it (known as a hash) and compare those. In other words, if we maintain a number that represents each object, we can compare those numbers to find the similarity of two objects.

The basics of SimHash are as below:

  1. Convert the object to a hash value. (From my experience, this is better as an unsigned integer.)
  2. Count the number of matching bits. For example, are the first bits of the two hash values the same? Are the 2nd bits the same?
  3. Depending on the size of the hash value (number of bits), you will get a number between 0 and N, where N is the length of the hash value. This is called the Hamming distance. Hamming distance was introduced by Richard Hamming in 1950 (see here).
  4. The number you get must be normalized and finally represented as a value such as a percentage. To do so, we can use a simple formula such as Similarity = (HashSize - HammingDistance) / HashSize. For example, with a 64-bit hash and a Hamming distance of 8, the similarity is (64 - 8) / 64 = 87.5%.

Since the hash value can be used to represent any kind of data, such as a text or an image file, it can be used to perform a fast search on almost any file type.

To calculate the hash value, we first have to decide on the hash size, which is normally 32 or 64. As I said before, an unsigned value works better. We also need to choose a chunk size. The chunk size is used to break the data down into small pieces, called shingles. For example, if we decide to convert a string such as “Hello World” to a hash value and the chunk size is 3, our chunks would be:

1-Hel

2-ell

3-llo

Etc.

To convert binary data to a hash value, you have to break it down into chunks of bits, i.e. take every K bits. Google says that N=64 and K=3 are recommended.

To calculate the hash value in a SimHash manner, we have to take the following steps:

  1. Tokenize the data. To tokenize the data, break it down into small chunks as mentioned above and store the chunks in an array.
  2. Create an array (called a vector) of size N, where N is the size of the hash (let’s call this array V).
  3. Loop over the array of tokens (let i be the index of each token).
  4. Loop over the bits of each token (let j be the index of each bit).
  5. If bit j of token i is 1, then add 1 to V[j]; otherwise subtract 1 from V[j].
  6. Assume that the fingerprint is an unsigned value (32 or 64 bits) named F.
  7. Once the loops finish, go through the array V; if V[i] is greater than 0, set bit i of F to 1, otherwise to 0.
  8. Return F as the fingerprint.

Here is the code:

private int DoCalculateSimHash(string input)
{
    // Break the input into overlapping chunks and hash each one.
    ITokeniser tokeniser = new Tokeniser();
    var hashedTokens = DoHashTokens(tokeniser.Tokenise(input));

    // One counter per bit of the final hash.
    var vector = new int[HashSize];
    for (var i = 0; i < HashSize; i++)
    {
        vector[i] = 0;
    }

    // For each token, add 1 to the counter of every set bit
    // and subtract 1 for every unset bit.
    foreach (var value in hashedTokens)
    {
        for (var j = 0; j < HashSize; j++)
        {
            if (IsBitSet(value, j))
            {
                vector[j] += 1;
            }
            else
            {
                vector[j] -= 1;
            }
        }
    }

    // A positive counter becomes a 1-bit in the fingerprint, otherwise a 0-bit.
    var fingerprint = 0;
    for (var i = 0; i < HashSize; i++)
    {
        if (vector[i] > 0)
        {
            fingerprint += 1 << i;
        }
    }

    return fingerprint;
}

And the code to calculate the hamming distance is as below:

private static int GetHammingDistance(int firstValue, int secondValue)
{
    // XOR leaves a 1 in every bit position where the two values differ.
    var hammingBits = firstValue ^ secondValue;
    var hammingValue = 0;
    for (int i = 0; i < 32; i++)
    {
        if (IsBitSet(hammingBits, i))
        {
            hammingValue += 1;
        }
    }
    return hammingValue;
}
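The IsBitSet helper used in both snippets is not shown above; a self-contained version of the bit math, including the normalized similarity score from the formula earlier, might look like this:

```csharp
using System;

public static class SimHashMath
{
    // True when the bit at position 'pos' (0-based, least significant first) is set.
    public static bool IsBitSet(int value, int pos)
    {
        return (value & (1 << pos)) != 0;
    }

    // Number of bit positions in which the two values differ.
    public static int GetHammingDistance(int firstValue, int secondValue)
    {
        var hammingBits = firstValue ^ secondValue;
        var hammingValue = 0;
        for (int i = 0; i < 32; i++)
        {
            if (IsBitSet(hammingBits, i)) hammingValue++;
        }
        return hammingValue;
    }

    // Normalizes the Hamming distance into a 0..1 similarity score:
    // Similarity = (HashSize - HammingDistance) / HashSize.
    public static double GetSimilarity(int firstHash, int secondHash, int hashSize = 32)
    {
        return (hashSize - GetHammingDistance(firstHash, secondHash)) / (double)hashSize;
    }
}
```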

You may use different ways to tokenize the given data. For example, you may break a string down into words, into n-letter chunks, or into n-letter overlapping pieces. From my experience, if N is the size of each chunk and M is the number of overlapping characters, N=4 and M=3 are the best choices.

You may download the full source code of SimHash from SimHash.CodePlex.com. Bear in mind that SimHash is patented by Google!