Wait... WHAT!?

Exercise for the reader - how much faster would this be with a hash table or binary search?

unit Data.Db;

function TFields.FindField(const FieldName: string): TField;
var
  I: Integer;
  HashValue: Cardinal;
begin
  if FList.Count > 0 then
  begin
    HashValue := TNamedItem.HashName(FieldName);
    for I := 0 to FList.Count - 1 do
    begin
      Result := FList.Items[I];
      if (Result.FFieldNameHashValue = HashValue) and
         (AnsiCompareText(Result.FFieldName, FieldName) = 0) then
        Exit;
    end;
  end;
  Result := nil;
end;
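Taking up the exercise: below is a minimal sketch of what a hashed lookup could look like, caching the fields in a TDictionary keyed on the upper-cased field name. TFieldCache and all of its members are invented for illustration; this is emphatically not how Data.DB implements FindField, just one way to get O(1) average lookups.

```pascal
uses
  System.SysUtils, System.Generics.Collections, Data.DB;

type
  // Hypothetical wrapper: build once after the dataset opens, then reuse.
  TFieldCache = class
  private
    FMap: TDictionary<string, TField>;
  public
    constructor Create(Fields: TFields);
    destructor Destroy; override;
    function FindField(const FieldName: string): TField;
  end;

constructor TFieldCache.Create(Fields: TFields);
var
  I: Integer;
begin
  FMap := TDictionary<string, TField>.Create(Fields.Count);
  for I := 0 to Fields.Count - 1 do
    // Normalize once at insert time so lookups stay case-insensitive.
    FMap.AddOrSetValue(AnsiUpperCase(Fields[I].FieldName), Fields[I]);
end;

destructor TFieldCache.Destroy;
begin
  FMap.Free;
  inherited;
end;

function TFieldCache.FindField(const FieldName: string): TField;
begin
  // O(1) average lookup instead of a linear scan over all fields.
  if not FMap.TryGetValue(AnsiUpperCase(FieldName), Result) then
    Result := nil;
end;
```

Whether this actually wins depends on the field count and the hash function, which is exactly what the comments below debate.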

Comments

  1. Do you remember what they did in XE2 with normalizing TRect?

  2. How many fields do you have?

    If it's dozens or hundreds, a hashtable would win; if it's half a dozen, the hashtable would lose (both in terms of performance and code complexity).

    Naive binary search would likely lose to both that snippet and a hashtable (because AnsiCompareText is not cheap), but a binary search on the hash might work better than both if you have no more than dozens of fields.

    If you have dozens or hundreds of fields, I suspect this method is going to be more canary than problem ;)

    Also a bigger problem is if you're calling that code multiple times per record (in which case you deserve the crappy performance), rather than once per field.

  3. Lars, I tend to disagree. This is a list that is not sorted. Sure, you could have a sorted and hashed fields dictionary used to speed up searches, in case you need to search often (hint: use persistent fields).
    Oh, you can do that, with just a little code. And why is it not in each and every TDataSet by default? Because in many cases it is not used, and it would imply an overhead in each and every application.
    I know this might sound like making excuses, but a library has design principles in terms of being general purpose and balancing speed/memory/time for ALL users. Adding a hash for the field name for faster searches (like shown above) still costs extra to some, but probably is a good trade-off. Pushing FindField to the max might not suit every user as well... and adding that feature when needed is really within reasonably easy reach.

    I'm not pointing this out for the specific example, but because while we have made, and probably still make, mistakes that we need to correct, there are scenarios in which there are design decisions that are not in the best interest of some of our users, but they tend to balance different usage scenarios. And you know developers use Delphi for the most diverse applications out there. Good enough you have the power to see (full source code) and to add your own "improvements".

    I've witnessed other cases in which a speed up for a customer would result in a slow down for another, and even a complete crash!

  4. This is not a hash table. This is just a list with hash values. The code shows that the developers tried to reduce the number of AnsiCompareText calls.

  5. See below for my current quick fix, which creates the indexed lookup the first time the FieldByName function is called. Note that I did this in a wrapper and not by modifying the code.

    The 220+ views have from a few to 70+ columns.

    This significantly sped up reading records and accessing fields by name instead of by order - which IMO is the future-proof way of reading row values (i.e. no sequence dependencies).

    If there is a better way to set property values from a query, than

    Obj.Prop1 := DataSet.FieldByName('PropField1').AsPropertyType;
    Obj.Prop2 := DataSet.FieldByName('PropField2').AsPropertyType;
    ...
    I would be willing to try it out.

    function TPSDRecordset_FD_Abstract.FieldByName(const FieldName: string): TField;
    var
      F: TField;
      ix: Integer;
    begin
      if not Assigned(IndexedFieldList) then
      begin
        IndexedFieldList := TFieldIndexList.Create;
        IndexedFieldList.Sorted := False;
        for F in DataSet.Fields do
          IndexedFieldList.AddObject(LowerCase(F.FieldName), TObject(F));
        IndexedFieldList.Sorted := True;
      end;
      if IndexedFieldList.Find(LowerCase(FieldName), ix) then
        Result := TField(IndexedFieldList.Objects[ix])
      else
        raise EPSDDbFieldNotFound.Create('Field "' + FieldName + '" not found in record set!');
    end;

  6. Furthermore...
    There are two more things I wonder about in this code.
    1. Why create a hash, but still compare both the hash and the field name?

    2. MSSQL supports unicode table and column names.  
    CREATE TABLE អតិថិជន
    (
     ឈ្មោះអ្នកខ្ចី INT
    )
    Is this supported?  (Thinking about that AnsiCompareText)

  7. Two strings might have the same hash. And I'm not convinced about the Ansi part either.

  8. 1. A hash function can return the same value for completely different strings.
    2. Usually when accessing such fields: 
    - names of the fields are enclosed in quotation marks (in SQL-query);
    - names of the fields are case sensitive.

  9. Eric Grange - For most reads, we use the simple way of just passing the record set to the object instance and let it pick out the fields from the current row.  If it is necessary to read a large number of rows, we have routines that indeed do only fetch the field once.

  10. Is this an ideal candidate for TDictionary?

  11. Steve Maughan, I just tested that on the same dataset.  It added about 250ms to the load time.  Plain generic TDictionary is not a speed demon.

  12. Steve Maughan Lars Fosdal And TDictionary is case sensitive, in that case, no? So it actually had an easier job.

  13. Николай Зверев Case sensitivity can be turned off, and tends to be off by default - at least for MSSQL, MySQL, and Oracle.  We are about to move to PostgreSQL, where that is NOT the case - so that will get interesting.

  14. Marco Cantù The Ansi in the name is misleading. It actually calls the CompareStringW function in kernel32.

  15. Uwe Raabe Good catch, I should have known...

  16. Uwe Raabe - That's one less worry. Thanks!

  17. Eric Grange - I lower-cased the field names on TDictionary.Add and TryGetValue.
    Any other alternative you could recommend that would beat a sorted TStringList with a custom compare string (SysUtils.CompareStr)?

  18. Lars Fosdal since you're creating a HashValue for the field name anyway, have you tried to use this as a custom hash for the TDictionary? 

    Also, how many fields are there? I would have thought TDictionary would have been faster once the number of fields goes up.

  19. Steve Maughan - I don't make a hash, I just use the lower case field name.  Field counts: 75, 74, 73, 59, 59, 55, 54, 54, 50, 49, 46, 46, 45, 45, 43, 43, 41, 40, 39, 39, 39, 38, 38, 37, 36, 36, 36, 35, 35, 34, 32, 32, 32, 31, 31, 30, 29, 29, 29, 27, 27, 27, 27, 25, 24, 24, 24, 23, 23, 23, 22, 22, 22, 21, 21, 20, 20, 20, 19, 19, 19, 19, 19, 18, 18, 18, 18, 18, 17, 17, 17, 17, 17, 17, 17, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 15, 15, 15, 15, 15, 15, 14, 14, 14, 14, 14, 14, 14, 14, 14, 13, 13, 13, 13, 12, 12, 12, 12, 11, 11, 11, 11, 11, 11, 11, 11, 11, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 9, 9, 9, 9, 9, 9, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 2, 2, 2, 2, 2, 1, 1, 1, 1

  20. Lars Fosdal I'm amazed TDictionary is slower since it only costs the hashing of the FieldName and rest is a direct look-up.

  21. Well, given how pervasive the Generics Collections have become, it makes you wonder if they should do more to ensure that they are super fast.  
    F.x. Why do we call BobJenkinsHash, when it's just a wrapper for HashLittle?

  22. In one complex form we have I found that over half the time spent during its 15+ second loading time is due to FindField. Most of it was actually spent hashing the strings. If you look at the HashName function you should begin to cry (unless they've sorted it out since XE3).

  23. The brute force simple binary search cuts the time quite a bit, but I think a proper hash list might speed it up further.

  24. In a loop, always use a local variable with each field.
    It will be faster and more readable.
    The whole db.pas unit is far from optimized. For instance, TDataSet is a real bottleneck.

  25. A. Bouchez - If you need to read many rows, that really helps - but if you have many relations, and a large set of data, you still will need to look up a lot of fields, and if each of those is a linear enumeration through the list of fields, it really is seriously slow.

    To make an efficient lookup mechanism, there are already several constraints that help. We know there will not be duplicate field names. We know the number of fields will be fairly moderate, and we know the list will remain constant once built, for the lifetime of the retrieval. We could - in theory - cache the list between uses, if we make the assumption that the database schema doesn't change between uses.
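    As a rough illustration of the "binary search on the hash" idea mentioned earlier in the thread, under exactly those constraints (no duplicates, moderate count, list constant once built), a sketch might look like the following. TFieldHashEntry, TFieldHashIndex, and FindFieldByHash are hypothetical names; it reuses TNamedItem.HashName from Data.DB, and assumes the index array was built once, sorted ascending by hash, after the dataset opened.

```pascal
type
  TFieldHashEntry = record
    Hash: Cardinal;
    Field: TField;
  end;
  // Hypothetical: built once after Open, sorted ascending by Hash.
  TFieldHashIndex = array of TFieldHashEntry;

function FindFieldByHash(const Index: TFieldHashIndex;
  const FieldName: string): TField;
var
  Lo, Hi, Mid: Integer;
  H: Cardinal;
begin
  Result := nil;
  H := TNamedItem.HashName(FieldName);
  Lo := 0;
  Hi := High(Index);
  Mid := 0;
  // Standard binary search on the hash value only.
  while Lo <= Hi do
  begin
    Mid := (Lo + Hi) div 2;
    if Index[Mid].Hash < H then
      Lo := Mid + 1
    else if Index[Mid].Hash > H then
      Hi := Mid - 1
    else
      Break;
  end;
  if Lo > Hi then
    Exit; // hash not present at all
  // Different names can share a hash; walk the run of equal hashes
  // and confirm the actual name with a single string comparison each.
  while (Mid > 0) and (Index[Mid - 1].Hash = H) do
    Dec(Mid);
  while (Mid <= High(Index)) and (Index[Mid].Hash = H) do
  begin
    if AnsiCompareText(Index[Mid].Field.FieldName, FieldName) = 0 then
      Exit(Index[Mid].Field);
    Inc(Mid);
  end;
end;
```

    This keeps the AnsiCompareText calls down to the (rare) colliding entries, at the cost of a one-time sort when the index is built.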

  26. You might want to try Spring4D generic stuff and compare if it is faster than the built in stuff. It is better designed, but I'm not sure if it is faster.

  27. Eric Grange Indeed. When profiling the slow loading form, I found like 70% of the time was spent in string allocation and Move. The first due to AnsiUpperCase and then the Move you mention.

    The reason FieldByName was called so much during startup on this form was due to our DevExpress grids, which apparently use it to populate the grids, along the lines of
      col.Value := FieldByName(col.DataBinding.FieldName).Value;
    One of the grids contained rather a lot of records.

    Unfortunately our "main" table is rather denormalized (almost 2^9 columns, doh!), so in our case the hashing can still pay off, but I'm certain a FastCode-style rewrite of that hash function would do wonders.
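    A FastCode-style replacement would presumably avoid the temporary string that AnsiUpperCase allocates by folding case inline. Here is a hypothetical FNV-1a style sketch - this is not the RTL's HashName, the case folding is ASCII-only (a real drop-in would have to match the comparison's full case-folding rules), and it assumes overflow checks are off for the wrapping multiplication:

```pascal
{$OVERFLOWCHECKS OFF} // FNV-1a relies on Cardinal wrap-around
function HashFieldName(const S: string): Cardinal;
var
  I: Integer;
  C: Cardinal;
begin
  Result := 2166136261;          // FNV offset basis
  for I := 1 to Length(S) do
  begin
    C := Ord(S[I]);
    // Fold ASCII lower-case to upper-case without allocating a string.
    if (C >= Ord('a')) and (C <= Ord('z')) then
      Dec(C, 32);
    Result := (Result xor C) * 16777619; // FNV prime
  end;
end;
```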

  28. Jeroen Wiert Pluimers FWIW the hash table and hash set currently in Spring4D are just wrappers around Generics.Collections.TDictionary. TDictionary is fast on a good day, but requires a good hash function as it's very prone to clustering, in which case you're back to O(n).

  29. Asbjørn Heid thanks for clarifying that.

  30. ----
    If there is a better way to set property values from a query, than

    Obj.Prop1 := DataSet.FieldByName('PropField1').AsPropertyType;
    Obj.Prop2 := DataSet.FieldByName('PropField2').AsPropertyType;
    ...
    I would be willing to try it out.
    -----

    Is this code in a loop? Or even just executed once, but executed again and again as some current rec somewhere else changes?

    Don't want to use persistent fields at design time because they tie you down too much?

    Then set up "semi-persistent" fields: declare an instance member for every field you use, populate them using FieldByName once when you open the dataset and use the instance members in the rest of your code.

    But I am sure you know about that? So what am I missing here?
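    The "semi-persistent" fields approach described above might be sketched like this; TCustomerReader and the field names are invented purely for illustration:

```pascal
type
  TCustomerReader = class
  private
    FDataSet: TDataSet;
    // Instance members for every field used, bound once per open.
    FName: TField;
    FBalance: TField;
  public
    procedure BindFields;
    procedure LoadCurrentRow(out AName: string; out ABalance: Currency);
  end;

procedure TCustomerReader.BindFields;
begin
  // One FieldByName call per field, once per open - not once per row.
  FName := FDataSet.FieldByName('Name');
  FBalance := FDataSet.FieldByName('Balance');
end;

procedure TCustomerReader.LoadCurrentRow(out AName: string;
  out ABalance: Currency);
begin
  // Inside the per-row loop, only cheap TField accesses remain.
  AName := FName.AsString;
  ABalance := FBalance.AsCurrency;
end;
```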

  31. Marjan Venema - No offence, but you should have read all the previous comments.

  32. Did anyone create their own dataset and add a new implementation of FindField? It would be interesting to know if the speed gains are mainly in benchmarks, or if there is a noticeable difference in real world applications.

  33. Lars Fosdal None taken. I did skip over them a bit, thinking they were mostly about hash tables/functions etc.

