Academic Integrity: tutoring, explanations, and feedback — we don’t complete graded work or submit on a student’s behalf.

I came across some source code that loosely does the below in order to achieve a

ID: 648116 • Letter: I

Question

I came across some source code that loosely does the below in order to achieve a 32 bit hash.

The input string is passed through MD5 to get 16 bytes Hash (as usual). Then the 16 bytes are split into 4 byte words . Each of these words are ? with each other in order to get a final 4 byte word, this is considered as a 32 bit hash.

The application does not really care about security in sense of pre-image resistance etc. but all it wants is Collision Resistance.

So, does the above guarantee no collisions (any better than birthday-bound)? Of course it depends on the input string's entropy etc., but given that we don't know before hand what possible strings could be passed as input: how do we analyze such scheme for collisions?

Explanation / Answer

As far as I understand, the scheme is:

MD5(x)=a1||a2||a3||a4?H(x)=a1?a2?a3?a4,

with ai 4-byte/32-bit words.

Obviously you can't guarantee a unique 32-bit hash from an unbounded domain, due to the pidgeonhole principle. Neither can you make finding collisions infeasible, since 216 MD5 invocations takes a few seconds if that.

If MD5 was an ideal 128-bit hash, it wouldn't matter whether you take the first 4 bytes, XOR the 32-bit words or anything else. All of them would be ideal 32-bit hashes. However, it isn't and you can find collisions in about 233 operations. That doesn't directly lead to an attack on this, because it's slower than brute force, but it's possible it could be extended. There's a chance the XOR could make such an attack slightly harder or easier than truncation, but it really doesn't matter with a 32-bit hash since brute force is sufficient.

Assuming you don't let an attacker control the inputs (since that would be totally broken), the question isn't about the collision resistance, but pseudorandomness of MD5. MD5 is still thought to behave like a PRF in the sense that if the input strings are chosen "randomly", the output is uniformly random as well. (Which is why HMAC-MD5 is secure.) XOR doesn't affect that, so you should have the 32-bit hash behave like a PRF as well.

So, with inputs that aren't under the attacker's control, you should see collisions at the birthday bound. No sooner or later, other than what variance you expect with a 32-bit hash. However, there's no real benefit to the XOR, so you might as well just use the first 4-byte word.

Hire Me For All Your Tutoring Needs
Integrity-first tutoring: clear explanations, guidance, and feedback.
Drop an Email at
drjack9650@gmail.com
Chat Now And Get Quote