Following a recent post by A. Bouchez about an optimized CRC32 hash, I took it as an opportunity to re-run a small String Hashing Shootout on the worst hash function collision torture test I know: ZIP codes in UTF-16 (Delphi’s default String format).

Following a recent post by A. Bouchez about an optimized CRC32 hash, I took it as an opportunity to re-run a small String Hashing Shootout on the worst hash function collision torture test I know: ZIP codes in UTF-16 (Delphi’s default String format).

CRC32 is just too good...
http://www.delphitools.info/2014/08/25/string-hashing-shootout

Comments

  1. There is a theorem that states that hash collisions are fewer if the hash list size is a prime. Have you tested with prime sized hash lists?

    ReplyDelete
  2. Lars Fosdal Are you referring to http://stackoverflow.com/questions/1145217/why-should-hash-functions-use-a-prime-number-modulus ?

    Primes are part of some hash functions, and can be part of hash tables, but for different reasons.

    When part of hash functions it's to help spread the entropy to all bits, but the result of a hash function should have equal entropy on all its bits, and power-of-two reduction of hashes won't introduce bias (while a prime reduction would).

    When part of the hash-table design, it's for reasons of stride for some hash table designs.

    ReplyDelete
  3. Yes. I've used primes for hash table sizes with positive outcome on collision reduction before, but I can't remember if it was CRC16, CRC32 or BobJenkins.

    ReplyDelete
  4. I would guess that the hash was poor (or the table growth conditions too strict, so teh entropy loss was offset by the gains on the stride).

    ReplyDelete
  5. It still would be interesting to see if it had any effect on your results.

    ReplyDelete
  6. I guess it'll depend on how you do the reduction to prime...
    The closest prime to 65536 is 65537, how did you went from a 0..2^32-1 hash to a 0..65536 range? (x mod 65537) or floor( x*(65537/(2^32)) ) or something else?

    ReplyDelete
  7. Ok non-normalized collisions values:

    mod 65536:
    - CRC32: 41408
    - KR32: 81296
    - others: 48340 to 48739

    mod 65537:
    - CRC32: 48136
    - KR32: 50662
    - others: 48485 to 48952

    So with a prime size, KR32 is less ugly but still the worst, CRC32 lost most of its advantage, and all others got sligthly more collisions, even though there is an extra bucket.

    ReplyDelete
  8. So the idea of a prime sized hash table only helps for the worst of hashing expressions?

    ReplyDelete
  9. Using a prime modulo to reduce the hash basically means you're doing a poor man's hash of the hash from 32 bits down to whatever you need, so it helps keep all the entropy from a 32 bits hash that had less than 32 bits of entropy.

    A good N bits hash on the other hand will have N bits of entropy.

    You can think of a good hash as a random number generator using your data as seed. By doing a modulo by a prime, you introduce what is called "modulo bias", so you reduce the quality of the good hash (see http://stackoverflow.com/questions/10984974/why-do-people-say-there-is-modulo-bias-when-using-a-random-number-generator).

    Doing a modulo by a power of two (or an "and" by power of two minus one) selects an integral number of bits, so you don't introduce any bias, and it's fast/cheap/unambiguous.

    ReplyDelete

Post a Comment