Introduction

The volume of data on the internet today bears testimony to the exponential rise in user activity in the digital space. We have more users, and more devices that make it easy for them to create and consume digital data. However, our data is only as good as how well we maintain and protect it. Data security has been a perennial topic at the forefront of any application. As wide as its gamut gets, data security is indispensable, inevitable and (somewhat) ubiquitous. We're not going to discuss the length and breadth of this humongous topic, but rather confine ourselves to a particular method of data security.

Data security via Secrecy

Data security via secrecy has been a magic pill which has in most cases served the purpose. Two successful paradigms which provide security via secrecy are:

  1. Encryption - Encryption secures data following a traditional lock-and-key model. A secret key is used to encrypt our message into gibberish, and then either the same key (symmetric key) or a different key (asymmetric/public key) is used to decrypt the ciphertext back into the original message. Encryption is used when we have a sensitive message to transmit but the receiver definitely needs the original message back. So we take our message, encrypt it, send it over the network, decrypt it at the receiver, and then process it.
  2. Hashing - Hashing is another way of providing data security, wherein we take our message and hash it (using a hash function) to generate a gibberish output. The catch here is that, unlike encryption, there is no corresponding de-hashing process. It's unidirectional and converts a message into a digest. Hashing converts a message into a fixed-length output irrespective of the message's length (see the sketch after this list). This implies that two or more messages might collide onto the same digest; this is called a collision. The challenge is to select a clever hash function which reduces (if not removes) the number of collisions.
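
To make the fixed-length property concrete, here is a minimal Python sketch using the standard hashlib module (the sample messages are made up for illustration). Inputs of very different lengths all map to a 256-bit digest:

import hashlib

# Messages of very different lengths (sample values for illustration only)
messages = [b"hi", b"a considerably longer message", b"x" * 10000]

for m in messages:
    digest = hashlib.sha256(m).hexdigest()
    # Every digest is 64 hex characters (256 bits), regardless of input size
    print(len(m), "->", digest[:16] + "...", "digest length:", len(digest))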

 

Encryption is an esoteric topic of its own, so we'll set it aside and focus on hashing. As discussed, hashing is a ONE-WAY process of generating obscure (protected) data from the original data. Since it's unidirectional, there is only a minimal probability of recovering the original data from the obscured data; at least, that is what it's meant to guarantee. Hashing isn't unfamiliar jargon: most of us have run into situations that use hashing in one way or another. In this article we'll focus on how we can use hashing to enhance the obscurity of our sensitive data, or in other words, how we make it cryptographically secure.

A simple hash function h could very much look like:

h(x) = x%5

This hash function takes a number x and returns the remainder when it's divided by 5. It makes a manageable hash function but a ridiculously poor cryptographic hash function (CHF). Let's understand the rationale behind a CHF.

Cryptographic Hash Function (CHF)

In a nutshell, a cryptographic hash function does everything a regular hash function does, but not vice versa. A regular hash function needs to comply with certain rules to be cryptographically secure:

  1. Pre-image resistant - This rule states that given a hash h, it should be highly improbable to derive a message m whose digest/hash equals h, i.e. h = hash(m). This rule is in harmony with the unidirectional property of a cryptographic hash function.
  2. Collision resistant - As we know, hashing produces a fixed-length message digest irrespective of the input, so it's possible that two or more different inputs hash onto the same digest. This is called a collision. A cryptographic hash function should keep collisions to a bare minimum (~ zero).

 

Now, let's apply the hash function above to check whether it qualifies as a CHF.

  1. Pre-image resistance - let's say our hash h is 3. Then we have an (infinite) set of candidate values of x: 3, 8, 13, 18, ... Thus it is quite vulnerable to pre-image attacks.
  2. Collision resistance - as the infinite set of values of x above shows, we have a considerable number of collisions (both points are demonstrated in the sketch after this list).
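
A quick sketch (toy code, just to illustrate the point) that brute-forces pre-images and exhibits collisions for the function h(x) = x % 5:

def h(x):
    # The toy hash function from above: remainder when divided by 5
    return x % 5

target = 3
# Pre-image search: trivially finds many x with h(x) == 3
preimages = [x for x in range(50) if h(x) == target]
print("pre-images of 3:", preimages)   # [3, 8, 13, 18, ...]

# Collision: any two values in the same residue class collide
print("h(3) == h(8):", h(3) == h(8))   # True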

 

CHFs serving their true purpose

Some popular hash families which are quite ubiquitous:

  1. MD* - Producing a 128-bit message digest, MD4, MD5, etc. are among the most popular message digest algorithms.
  2. SHA (Secure Hash Algorithm) - SHA-1 (160-bit digest), SHA-2 (different digest lengths), SHA-3 (arbitrary digest length), SHA256Crypt, SHA512Crypt, etc. SHA-0 and SHA-1 have suffered successful attacks, while SHA-2 has so far only been subject to theoretical attacks.
  3. PBKDF (Password-Based Key Derivation Function) - A PBKDF generates a secret key from a passphrase/password. As the name says, it's a key derivation function, usually pseudo-random, which deliberately slows down generation of the secret key, making brute-force guessing substantially more expensive. A very popular algorithm in this family is PBKDF2, commonly used to derive 256-bit keys (see the sketch after this list).
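
The sketch below (Python's standard hashlib; parameter values chosen purely for illustration) computes digests from the MD*, SHA, and PBKDF2 families mentioned above:

import hashlib, os

message = b"hello world"

print("MD5    :", hashlib.md5(message).hexdigest())     # 128-bit digest
print("SHA-1  :", hashlib.sha1(message).hexdigest())    # 160-bit digest
print("SHA-256:", hashlib.sha256(message).hexdigest())  # 256-bit digest

# PBKDF2: derive a 256-bit key from a passphrase, deliberately slowly
salt = os.urandom(16)
key = hashlib.pbkdf2_hmac("sha256", b"my passphrase", salt, 100000, dklen=32)
print("PBKDF2 :", key.hex())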

 

The MD* and SHA families are good CHFs designed to create digests for large amounts of data. The catch is that they do so very fast. Because they are fast at creating digests, they are also fast to attack: an adversary can test enormous numbers of candidate messages in little time. Such CHFs are a perfect fit for situations like creating file digests. But if our requirement is to hash passwords, then in my opinion such CHFs do not fit the bill. Hashing passwords requires meticulous effort to preserve their security: the process should make it highly improbable to guess or derive the original message back. Using a hash function which hashes fast allows the guessing process to run faster as well (see the timing sketch below). PBKDF-based hashing, in spite of its welcome lethargy in digest generation, suffers from some more advanced vulnerabilities that we'll touch on as we move along.
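
A rough timing sketch (illustrative only; exact numbers depend on hardware) contrasting a fast general-purpose hash with a deliberately slow key derivation function:

import hashlib, os, time

password = b"correct horse battery staple"
salt = os.urandom(16)

start = time.perf_counter()
hashlib.sha256(salt + password).hexdigest()
print("SHA-256 (1 round)   : %.6f s" % (time.perf_counter() - start))

start = time.perf_counter()
hashlib.pbkdf2_hmac("sha256", password, salt, 200000)
print("PBKDF2 (200k rounds): %.6f s" % (time.perf_counter() - start))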

Breaking Benjamin

There exists some probability that our hash digests could be broken in spite of the fact that we made them pre-image resistant. Successful techniques have been deployed to break the unidirectional hallmark of hashing and derive the original message:

  1. Brute force - Usually the least efficient, this technique tries all possible character combinations until it generates a digest matching the target. It is computationally expensive, but guaranteed to succeed eventually, given enough time.
  2. Dictionary attacks - This attack does a bit better than brute force by reducing the search space to a (pretty long) list of well-structured dictionary words and their combinations (a minimal example follows this list). The list also includes the most popular yet vulnerable passwords.
  3. Rainbow tables - Rainbow tables are an advanced method of cracking hashes. They are a practical application of the space-time trade-off (STO), which says that one can either save memory at the cost of processing time, or speed up processing by using huge amounts of memory. Rainbow tables are pre-hashed sets of data covering every possible combination of characters, which means these tables are going to be ridiculously large. We start by checking whether our target hash is present in the tables. If it is, we directly retrieve the message for that hash. If not, we reduce the hash to some intermediary message and hash that intermediary message, repeating the process until we find a hash that matches our target. If, along this chain of reduced and re-hashed intermediary messages, a hash matches the target, then the intermediary message that produced it is the required message. All of this is made possible by the gigantic rainbow tables holding the pre-hashed data: we speed up processing by spending tremendous amounts of memory.
  4. Hardware acceleration - This attack method is the icing on the cake. We will discuss two categories here.
    1. GPU/FPGA acceleration - Certain hash algorithms are vulnerable to non-PC hardware acceleration, especially GPU/Field Programmable Gate Array (FPGA) acceleration. FPGAs are small semiconductor devices which are manufactured and then programmatically configured for a particular business function. These non-PC hardware attacks require little memory to run, which makes them perfect candidates for attacking hash algorithms that do not have substantial memory requirements. FPGAs are also designed for parallelism, which makes them faster still at cracking hashes. For example, hash algorithms like MD5, SHA-1, etc. are fast hash functions, which means that (where feasible) the cracking process can also be fast. Further, if we have a GPU to accelerate the attack using any of the methods defined above, we can test a substantial number of candidate combinations in a limited time. PBKDF2 is a well-known slow hashing algorithm, yet it is still susceptible to GPU acceleration.
    2. OpenMP (Open Multi-Processing) - OpenMP is an API used to explicitly direct multi-threaded, shared-memory parallelism. It allows developers to inject parallelism, using Uniform Memory Access or Non-Uniform Memory Access, into an existing piece of code without having to rewrite it. As multi-core architectures get embedded into our computer hardware, our processing ability gets augmented; however, this has also proven to be our Achilles' heel. With multi-core support, OpenMP can be used to accelerate hash computation in any of the techniques described above. Tools like John the Ripper (JTR) are now empowered to use OpenMP to enhance their hash-cracking ability, and Hashcat also uses multi-threading to utilize all cores available in the system to crack hashes.
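
As a simple illustration of a dictionary attack against an unsalted, fast hash (the word list and target password below are made up for the example):

import hashlib

# Hypothetical leaked, unsalted MD5 hash of a user's password
target_hash = hashlib.md5(b"sunshine").hexdigest()

# A (very short) dictionary; real word lists contain millions of entries
wordlist = [b"password", b"123456", b"qwerty", b"sunshine", b"letmein"]

for candidate in wordlist:
    if hashlib.md5(candidate).hexdigest() == target_hash:
        print("password recovered:", candidate.decode())
        break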

 

Old toys in town

As of now, certain (futile) methods are employed by many to add more entropy to the hashing process. Why futile? Because they look neat from a 60,000 ft view but are rather messy on the inside. Such methods include:

  1. Iterate - Many of us believe that iterating the hashing algorithm increases entropy. This is partly true, because iterations do slow down the hashing process, but it doesn't add immunity. The problem isn't that our hashing doesn't loop through enough rounds before generating the digest; the problem is that the underlying hash is fast, and hence guessing, where feasible, is also fast. Further, if the hashing algorithm can be accelerated on non-PC hardware, then iterating x times doesn't add much benefit.
  2. Add salt - Salts are random identifiers which are attached to the original message before it is hashed (see the sketch after this list). As we know, hashing converts all inputs into a fixed-length digest, so collisions exist. Say there are two users, Alice and Bob, whose passwords are coincidentally the same; they would then generate the same hash. Now, if I were able to retrieve the original password for that hash (using rainbow tables, perhaps), then I would have unlocked the password not only for Alice but also for Bob, and maybe for many others. But if we add a random identifier (a salt) to each original message before hashing, the resulting hashes will differ even though the original messages were the same. Even so, we are ultimately still hashing the message with a fast hash function, which might be accelerated by a GPU or FPGA.
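
A minimal sketch of salting (using hashlib and os.urandom; note that this is still a fast hash, so salting alone is not sufficient, as the item above notes):

import hashlib, os

password = b"hunter2"   # Alice and Bob coincidentally chose the same password

# Each user gets their own random salt
salt_alice, salt_bob = os.urandom(16), os.urandom(16)

hash_alice = hashlib.sha256(salt_alice + password).hexdigest()
hash_bob = hashlib.sha256(salt_bob + password).hexdigest()

# Same password, different salts -> different digests, so cracking one hash
# no longer unlocks the other account
print(hash_alice != hash_bob)   # True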

 

This boils down to the fact that we need slower hash functions. The slower it is to hash a message, the longer it takes to attack the hash (where that is possible at all). Further, we also need CHFs that are resistant to hardware acceleration.

Data security via Complexity

The amount of effort that goes into breaking a hash is referred to as the work factor associated with the hash. The higher the work factor, the more difficult it is to break the hash, but also the heavier it is to compute in the first place.

  1. Bcrypt - By bcrypt we refer to the hashing algorithm based on the Blowfish cipher (which supports keys up to 448 bits). Bcrypt is deliberately slow and is largely immune to GPU/FPGA attacks. For example, bcrypt run on an AMD 7970 GPU was found to be slower than when run on a CPU, which demonstrates its resilience to GPU acceleration. The bcrypt algorithm also demands more memory than PBKDF2, which helps make it resistant to non-PC attacks. The expensive key setup is repeated 2^work times for a work factor of work; this can be really expensive to compute, and correspondingly slow to crack. Bcrypt is the default password hashing algorithm in OpenBSD and in PHP 5.5 (the password_hash() function). A typical bcrypt hash looks like:

          $2a$06$qWo8bf7STORVWWjBt7Wd7uMg3l.CkGYHH8LuiiZihmxaI1qzP8/wq

The hash can be dissected as follows (a generation/verification sketch follows the list):

  1. The $ symbols are delimiters
  2. 2a describes the version of bcrypt used. The most recent version at the time this article was written is 2a.
  3. The number 06 refers to the work factor used in the hash process. This could be adjusted to make the hashing really slow or fast.
  4. qWo8bf7STORVWWjBt7Wd7u is the 22 character salt used while hashing some data
  5. Mg3l.CkGYHH8LuiiZihmxaI1qzP8/wq is the 31-character ciphertext generated
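
A minimal sketch using the Python bcrypt package (assumed to be installed) to generate and verify a hash like the one dissected above:

import bcrypt

password = b"s3cret-passw0rd"

# gensalt() embeds the work factor (cost) into the salt; 12 rounds here
hashed = bcrypt.hashpw(password, bcrypt.gensalt(rounds=12))
print(hashed)   # e.g. b"$2b$12$<22-char salt><31-char hash>"; prefix depends on library version

# Verification re-hashes the candidate with the stored salt and compares
print(bcrypt.checkpw(password, hashed))         # True
print(bcrypt.checkpw(b"wrong-guess", hashed))   # False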

The variation of hashing time with work factor could look something like the chart originally shown here (chart omitted; data courtesy: Joseph).

Some biggies who vote for bcrypt:

  1. Twitter - https://dev.twitter.com/docs/security/best-practices#Password_Retention
  2. OpenBSD - Bcrypt is used by default in the OpenBSD OS. An excerpt from their documentation reads:

    "We have implemented bcrypt and deployed it as part of the OpenBSD operating system. Bcrypt has been the default password scheme since OpenBSD 2.1."

EDIT: Bcrypt has a low memory requirement, approximately 4 KB. This makes it potentially vulnerable to FPGA attacks: as FPGAs are designed around a parallel paradigm, a considerably large FPGA, if employed, could crack bcrypt quite efficiently. To address this particular issue, scrypt was proposed as a solution.

  2. Scrypt - Scrypt is another PBKDF which surpasses predecessors like PBKDF2 because of its exorbitant memory requirements, making it immune to FPGA attacks and partially resistant to GPU attacks. Although scrypt is an algorithm very similar to bcrypt, it has higher computational complexity and memory requirements. The point to note, however, is that scrypt is relatively new and has not yet been scrutinized to the same extent.
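
Python's hashlib exposes scrypt directly (requires OpenSSL 1.1 or newer; the parameters below are illustrative, not a recommendation):

import hashlib, os

password = b"s3cret-passw0rd"
salt = os.urandom(16)

# n: CPU/memory cost (power of two), r: block size, p: parallelism
key = hashlib.scrypt(password, salt=salt, n=2**14, r=8, p=1, dklen=32)
print(key.hex())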

There's a certain price to be paid when using tools like the above, and it's related to the space-time trade-off. I personally used bcrypt to hash and store user passwords. Because of the work factor associated with the algorithm, which makes the digest expensive to attack, the entire process was time consuming. For example, in a node.js environment on a 64-bit Windows system (with an appreciable hardware configuration), hashing a 9-character password with a work factor of 12 costs as much as 0.6 s. There are two perspectives to this: on one side, bcrypt is safe, but on the flip side, 0.6 s is quite long for a login check. So, finally, it's an individual's call to arbitrate on this based on their requirements, while respecting the principle of STO.
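
A rough sketch for estimating the login-time cost of different work factors (timings will vary with hardware; the 0.6 s figure above is the author's own measurement, and the password value here is made up):

import time
import bcrypt

password = b"9charpass"   # a 9-character password, as in the example above

for cost in (10, 12, 14):
    start = time.perf_counter()
    bcrypt.hashpw(password, bcrypt.gensalt(rounds=cost))
    print("work factor %d: %.2f s" % (cost, time.perf_counter() - start))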

Conclusion

As our hardware prowess grows in line with Moore's law, the arms race around data security intensifies as well. Based on the degree of obscurity required for our data, we need to use the right set of tools to achieve the purpose. All decisions we make have consequences. One might choose any CHF from the myriad of options, based on one's requirements; however, the choice shouldn't be myopic. Factors such as hardware acceleration, OpenMP (Open Multi-Processing) attacks on multi-processor systems, and malware or advanced hash-cracking tools were once outside the viewport of application security. But now that these factors are apparent, our decisions should account for them.