TObjectDictionary Advice Needed

TObjectDictionary Advice Needed
I've been using Delphi for many years but have never got round to using Delphi's Generics capabilities. I need to write an application which handles lists of zip code objects (with lat/long, state info etc). I'd like to be able to access the zip codes quickly. So I thought TObjectDictionary might be a good solution (and it would be fun to learn about generics in the process). 

If I wasn't using generics I'd create a TZipcodeList object and keep the data in a sorted TList. I'd then use a binary search to access the zip codes.

From my experience with computer chess I understand hashing. I think TObjectDictionary hashes the key. I imagine it then uses this hashed value to access a table (using the lower xxx bits) and retrieve the value. So if I want to store all 43000 zip codes in a TObjectDictonary list, how big will the list be? Is it x2, x10 or x100 the number of entries.  If it only has 43k entries then I imaging a binary search of the hashed key will be the best way to go. But I don't see how this would be any faster than my solution using a sorted TList. So I assume it will be (considerably) bigger than 43k entries. Does anyone have any experience with the size of TObjectDictionary based lists? And is an TObjectDictionary approach really any faster than a sorted list and binary search access? 

Also, it seems the TObjectDictionary object has procedures which are not virtual e.g. Add. Is this correct? So descended objects cannot override them.  If so why?

Steve

Comments

  1. If you are going to use generics, the collections at all in Spring4D are the ones to go: they are much cleaner and consistent than the ones shipping with Delphi.
     Disclaimer: I'm on the Spring4D team (:
    http://spring4d.org

    ReplyDelete
  2. The generics in Generics.Collections were not designed with extensibility in mind actually. That is why there are almost no virtual methods and some things are also tightly coupled to each other (like the use of TList inside of TEnumerable).
    I am honestly not an expert when it comes to algorithm performance measurement but if I am not completely wrong in a sorted list your operations like Add and Remove are always O(n) or O(log n) but in a dictionary they are O(1) or O(n). The sorted list has a smaller memory footprint though but I don't know about the numbers. Spring4D does not use an own implementation for its dictionary but is backed by the one from Generics.Collections. The sorted list uses binary search on the backing array.

    ReplyDelete
  3. The TDictionary in Generics.Collections uses an open addressing scheme with linear probing.

    To make sure it's performance doesn't degrade too much due to collisions it ensures that the load factor is never greater than 0.75. When growing capacity it doubles, meaning it always has an overhead of 50-100%. However if you store objects, which you indicate you will, then we're talking of storing just a pointer or two per element, so that's negligible (at worst <700kByte on 32bit, twice that on x64).

    Ideally it has O(1) lookup/add/remove complexity, compared to O(log n) for a binary search in a sorted list. However the TDictionary implementation is a bit shit so can be slow at times if you're unlucky.

    Though, if you have a good hash function and keep the load factor at around 0.5, it should work out all right.

    As for not being overridable, I'd consider writing my own wrapper anyway. I strongly dislike the interface of Delphi's TDictionary.

    ReplyDelete
  4. In terms of dictionary performance, the main thing to watch is the hash function, the default one in Delphi is rather poor.

    In your case, memory is not going to be a problem, so if your zipcodes are the typical numbers, rather than a sorted list or a hash (which are just going to add complexity for a negligible memory gain, if any), I would personally just make 100k entries array/list, and directly access in there.

    Odds are this could end up using less memory than the dictionary (no need to store the hash per entry, no load factor), much less than a binary tree (no internal pointers), and only marginally more than a sorted list (which will waste to over-allocation unless you manage Capacity manually).
    It would also be faster than all the others, with much simpler source code & debugging (inspecting dictionaries is very very cumbersome f.i.)

    ReplyDelete
  5. Thanks for the input - I've learnt a lot!

    Stefan Glienke it looks as if generics really aren't extensible. I found a simple inherited object will compile but gives a "invalid typecast" error when run.

    Asbjørn Heid thanks - this is helpful. The TObjectDictionary is clearly powerful. I think it may not be suitable for all situations.

    Eric Grange great idea. However, the application must generalize for postcodes in other countries so a simple 100k list won't work.

    Steve

    ReplyDelete
  6. As Eric has said, I would be inclined to keep it simple, and would probably use an array of records. However, with respect to footprint, the issue of place names is a concern, both because they will likely dominate the memory footprint, and because there will be a good deal of redundant information there. 

    When you say you'd like to access the zip codes "quickly", what do you mean? Any solution discussed here will give "quick" access. But doing a lookup in the user interface is obviously far different than if you have to process a list with 10 million entries, and do a lookup for each of them.

    ReplyDelete
  7. Bill Meyer a simple array won't work in this case as the app needs to also accommodate international postcodes when used outside of the US.

    I was really using this as a vehicle for leaning about Delphi's generics capabilities. My conclusion is they cannot be inherited from and a wrapper is the best way to go.

    By today's "big data" standards by list of 43k zipcodes is really quite small. So I think I'm going to stick with my sorted list and binary search approach, which is you point out is fast enough.

    Thanks, Steve

    ReplyDelete
  8. A wrapper is in any case a good idea I think. It allows you to have a nice interface that does what you need, and it allows you to change the implementation if the need arises.

    If you decide to use the dictionary, I strongly suggest you do not use the default hash function, and instead use Murmur hash or something like it.

    ReplyDelete
  9. Asbjørn Heid wow - I've just done a quick Google search on GetHashCode. I didn't realize it existed. I can really see how TDictionary could be super powerful. I can pre-calculate the hash values and store them in the object. I had previously assumed the hashing was done on the fly as needed - resulting in an overhead. It also looks as if there is a Delphi implementation of the MurmurHash here: http://stackoverflow.com/questions/3690608/simple-string-hashing-function
    Thanks!

    ReplyDelete
  10. While you can precompute the hash value, it doesn't often make sense, as the value is only calculated once when adding the object to the dictionary. 

    However if you will add/remove the same object from one or more dictionaries, then it can make sense to cache the hash value.

    When performing a lookup, the lookup key must be hashed of course, but that's hard to avoid.

    I'm assuming you'll use a string for the key?

    ReplyDelete
  11. Asbjørn Heid in my case I think precomputing the hash values makes sense. I have many zipcode lists and I'd like to quickly find out if a list holds a specific zipcode. When I create the zipcode object I'll calculate the hash and use it for the look-up. Great stuff!

    P.S. Yes - the key is a string!

    ReplyDelete
  12. Asbjørn Heid how do I make the TObjectDictionary use the new hashing system? I can override the GetHashCode in the TObject, but there doesn't seem to be a calculate hash code event in TObjectDictionary!

    Thanks - Steve

    ReplyDelete
  13. If the key will always and only be a numeric zipcode, I would propose using them as indexes for a simple array of TObject. Would be unflexible, but nothing would be faster. Space wasted is only (100000-NumberOfExistingZip) * SizeOf(Pointer)  ~ 450 kb .

    TDictionary gives you only the advantage of additional debugging problems and you save some time during development, because there is no need to manually do the typecasting for the TList mthod overrides.

    ReplyDelete
  14. Steve Maughan You supply your own IEqualityComparer implementation to the dictionary constructor. The IEqualityComparer is responsible for computing the hash and compare two objects for equality. Here's a quick example: https://www.dropbox.com/s/y51d4kh95z13mkb/CustomHasher.dpr

    Since string is your key, it's difficult to pre-compute in a sensible way. The dictionary doesn't take the hash value directly but requests it on demand by the IEqualityComparer implementation.

    Anyway, I suggest you leave that as a possible optimization for later. It will involve making a new type for the key. I suggest a record, where you can store the zip code as string and the hash value directly.

    ReplyDelete
  15. Oliver Funcke As mentioned by Steve Maughan, he requires storing international zip codes as well, and they may be alphanumeric.

    ReplyDelete
  16. Asbjørn Heid Many thanks! This has been a great thread - I've learnt loads!

    ReplyDelete
  17. Hi Asbjørn Heid - I've done some experiments to assess speed. For my simple test application, which uses zip code strings of five characters as the key, the default hashing algorithm was quicker than Murmur2 or Murmur3. The difference was 500 ms vs. 700 ms. Is there any other reason why I should use Murmur other than speed? Or is it the case Murmur will outperform the default when the size increases due to Murmur's uniformlt spread random distribution?

    ReplyDelete
  18. Steve Maughan The key point is that the TDictionary in Delphi absolutely requires a good hash function, or it can perform very poorly in certain situations. So, in the name of defensive programming, I recommend using the Murmur hash over the default. Mostly because I know what Murmur can provide, while the default hash function is likely to have worse characteristics (I haven't studied it in detail mostly because it's not well known).

    So the trade-off is slightly slower best-case speed, but likely much better worst-case.

    ReplyDelete
  19. Steve Maughan You might want to try FNV-1a (https://en.wikipedia.org/wiki/Fowler%E2%80%93Noll%E2%80%93Vo_hash_function), it can be implemented in a couple of lines, and exhibits both good spread and high performance IME.

    ReplyDelete
  20. Steve Maughan Generics are extensible and the error you had is most likely on your side.

    ReplyDelete
  21. Stefan Glienke you're right!

    On my create event I had:

    fDictionary := TObjectDictionary.Create([doOwnsValues, doOwnsKeys]);

    Since the key is a string it cannot be "owned" and caused a typecast error at run-time. The key's type is known at compile-time, so I would have thought this could be picked up by the compiler.

    ReplyDelete
  22. Steve Maughan The compiler cannot evaluate any logic inside your class.

    ReplyDelete
  23. Steve Maughan like Stefan Glienke mentioned, this is not compile-time logic. It is run-time logic, so ideally, `TObjectDictionary` should check the validity of the arguments.

    ReplyDelete
  24. Jeroen Wiert Pluimers It does. The error he mentioned is raised by the constructor when the TKey or TValue type conflicts with the passed TDictionaryOwnerships.

    ReplyDelete

Post a Comment