Wait... WHAT!?

Exercise for the reader - how much faster would this be with a hash table or binary search?

unit Data.Db;

function TFields.FindField(const FieldName: string): TField;
var
  I: Integer;
  HashValue: Cardinal;
begin
  if FList.Count > 0 then
  begin
    HashValue := TNamedItem.HashName(FieldName);
    for I := 0 to FList.Count - 1 do
    begin
      Result := FList.Items[I];
      if (Result.FFieldNameHashValue = HashValue) and
         (AnsiCompareText(Result.FFieldName, FieldName) = 0) then
        Exit;
    end;
  end;
  Result := nil;
end;
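Taking up the exercise: below is a minimal sketch of what a hashed lookup could look like, caching the fields in a TDictionary keyed on the upper-cased field name. TFieldCache and all of its members are invented for illustration; this is emphatically not how Data.DB implements FindField, just one way to get O(1) average lookups.

```pascal
uses
  System.SysUtils, System.Generics.Collections, Data.DB;

type
  // Hypothetical wrapper: build once after the dataset opens, then reuse.
  TFieldCache = class
  private
    FMap: TDictionary<string, TField>;
  public
    constructor Create(Fields: TFields);
    destructor Destroy; override;
    function FindField(const FieldName: string): TField;
  end;

constructor TFieldCache.Create(Fields: TFields);
var
  I: Integer;
begin
  FMap := TDictionary<string, TField>.Create(Fields.Count);
  for I := 0 to Fields.Count - 1 do
    // Normalize once at insert time so lookups stay case-insensitive.
    FMap.AddOrSetValue(AnsiUpperCase(Fields[I].FieldName), Fields[I]);
end;

destructor TFieldCache.Destroy;
begin
  FMap.Free;
  inherited;
end;

function TFieldCache.FindField(const FieldName: string): TField;
begin
  // O(1) average lookup instead of a linear scan over all fields.
  if not FMap.TryGetValue(AnsiUpperCase(FieldName), Result) then
    Result := nil;
end;
```

Whether this actually wins depends on the field count and the hash function, which is exactly what the comments below debate.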

Comments

  1. Do you remember what they did in XE2 with normalizing TRect?

  2. How many fields do you have?

    If it's dozens or hundreds, a hashtable would win; if it's half a dozen, the hashtable would lose (both in terms of performance and code complexity).

    Naive binary search would likely lose to both that snippet and a hashtable (because AnsiCompareText is not cheap), but a binary search on the hash might work better than both if you have no more than dozens of fields.

    If you have dozens or hundreds of fields, I suspect this method is going to be more canary than problem ;)

    Also a bigger problem is if you're calling that code multiple times per record (in which case you deserve the crappy performance), rather than once per field.

  3. Lars, I tend to disagree. This is a list that is not sorted. Sure, you could have a sorted and hashed fields dictionary used to speed up searches, in case you need to search often (hint: use persistent fields).
    Oh, you can do that, with just a little code. And why is it not in each and every TDataSet by default? Because in many cases it is not used, and it would imply an overhead in each and every application.
    I know this might sound like making excuses, but a library has design principles in terms of being general purpose and balancing speed/memory/time for ALL users. Adding a hash for the field name for faster searches (like shown above) still costs extra to some, but probably is a good trade-off. Pushing FindField to the max might not suit every user as well... and adding that feature when needed is really within reasonably easy reach.

    I'm not pointing this out for the specific example, but because while we have made, and probably still make, mistakes that we need to correct, there are scenarios in which there are design decisions that are not in the best interest of some of our users, but they tend to balance different usage scenarios. And you know developers use Delphi for the most diverse applications out there. Good enough you have the power to see (full source code) and to add your own "improvements".

    I've witnessed other cases in which a speed up for a customer would result in a slow down for another, and even a complete crash!

  4. This is not a hash table. This is just a list with hash values. The code shows that the developers tried to reduce the number of AnsiCompareText calls.

  5. See below for my current quick fix, which creates the indexed lookup the first time the FieldByName function is called. Note that I did this in a wrapper and not by modifying the code.

    The 220+ views have from a few to 70+ columns.

    This significantly sped up reading records and accessing fields by name instead of by order - which IMO is the future-proof way of reading row values (i.e. no sequence dependencies).

    If there is a better way to set property values from a query, than

    Obj.Prop1 := DataSet.FieldByName('PropField1').AsPropertyType;
    Obj.Prop2 := DataSet.FieldByName('PropField2').AsPropertyType;
    ...
    I would be willing to try it out.

    function TPSDRecordset_FD_Abstract.FieldByName(const FieldName: string): TField;
    var
      F: TField;
      ix: Integer;
    begin
      if not Assigned(IndexedFieldList) then
      begin
        IndexedFieldList := TFieldIndexList.Create;
        IndexedFieldList.Sorted := False;
        for F in DataSet.Fields do
          IndexedFieldList.AddObject(LowerCase(F.FieldName), TObject(F));
        IndexedFieldList.Sorted := True;
      end;
      if IndexedFieldList.Find(LowerCase(FieldName), ix) then
        Result := TField(IndexedFieldList.Objects[ix])
      else
        raise EPSDDbFieldNotFound.Create('Field "' + FieldName + '" not found in record set!');
    end;

  6. Furthermore...
    There are two more things I wonder about in this code.
    1. Why create a hash, but still compare both the hash and the field name?

    2. MSSQL supports unicode table and column names.  
    CREATE TABLE អតិថិជន
    (
     ឈ្មោះអ្នកខ្ចី INT
    )
    Is this supported?  (Thinking about that AnsiCompareText)

  7. Two strings might have the same hash. And I'm not convinced about the Ansi part either.

  8. 1. A hash function can return the same value for completely different strings.
    2. Usually when accessing such fields: 
    - names of the fields are enclosed in quotation marks (in SQL-query);
    - names of the fields are case sensitive.

  9. Eric Grange - For most reads, we use the simple way of just passing the record set to the object instance and let it pick out the fields from the current row.  If it is necessary to read a large number of rows, we have routines that indeed do only fetch the field once.

  10. Is this an ideal candidate for TDictionary?

  11. Steve Maughan, I just tested that on the same dataset.  It added about 250ms to the load time.  Plain generic TDictionary is not a speed demon.

  12. Steve Maughan Lars Fosdal And TDictionary is case sensitive, in that case, no? So it actually had an easier job.

  13. Николай Зверев Case sensitivity can be turned off, and tends to be off by default - at least for MSSQL, MySQL, and Oracle.  We are about to move to PostgreSQL, where that is NOT the case - so that will get interesting.

  14. Marco Cantù The Ansi in the name is misleading. It actually calls the CompareStringW function in kernel32.

  15. Uwe Raabe Good catch, I should have known...

  16. Uwe Raabe - That's one less worry. Thanks!

  17. Eric Grange - I lower-cased the field names on TDictionary.Add and TryGetValue.
    Any other alternative you could recommend that would beat a sorted TStringList with a custom compare string (SysUtils.CompareStr)?

  18. Lars Fosdal since you're creating a HashValue for the field name anyway, have you tried to use this as a custom hash for the TDictionary? 

    Also, how many fields are there? I would have thought TDictionary would have been faster once the number of fields goes up.

  19. Steve Maughan - I don't make a hash, I just use the lower case field name.  Field counts: 75, 74, 73, 59, 59, 55, 54, 54, 50, 49, 46, 46, 45, 45, 43, 43, 41, 40, 39, 39, 39, 38, 38, 37, 36, 36, 36, 35, 35, 34, 32, 32, 32, 31, 31, 30, 29, 29, 29, 27, 27, 27, 27, 25, 24, 24, 24, 23, 23, 23, 22, 22, 22, 21, 21, 20, 20, 20, 19, 19, 19, 19, 19, 18, 18, 18, 18, 18, 17, 17, 17, 17, 17, 17, 17, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 15, 15, 15, 15, 15, 15, 14, 14, 14, 14, 14, 14, 14, 14, 14, 13, 13, 13, 13, 12, 12, 12, 12, 11, 11, 11, 11, 11, 11, 11, 11, 11, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 9, 9, 9, 9, 9, 9, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 2, 2, 2, 2, 2, 1, 1, 1, 1

  20. Lars Fosdal I'm amazed TDictionary is slower since it only costs the hashing of the FieldName and rest is a direct look-up.

  21. Well, given how pervasive the Generics Collections have become, it makes you wonder if they should do more to ensure that they are super fast.  
    F.x. Why do we call BobJenkinsHash, when it's just a wrapper for HashLittle?

  22. In one complex form we have I found that over half the time spent during its 15+ second loading time is due to FindField. Most of it was actually spent hashing the strings. If you look at the HashName function you should begin to cry (unless they've sorted it out since XE3).

  23. The brute force simple binary search cuts the time quite a bit, but I think a proper hash list might speed it up further.

  24. In a loop, always use a local variable with each field.
    It will be faster and more readable.
    The whole db.pas unit is far from optimized. For instance, TDataSet is a real bottleneck.

  25. A. Bouchez - If you need to read many rows, that really helps - but if you have many relations, and a large set of data, you still will need to look up a lot of fields, and if each of those is a linear enumeration through the list of fields, it really is seriously slow.

    To make an efficient lookup mechanism, there are already several constraints that help. We know there will not be duplicate field names. We know the number of fields will be fairly moderate, and we know the list will remain constant once built, for the lifetime of the retrieval. We could - in theory - cache the list between uses, if we make the assumption that the database schema doesn't change between uses.
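    As a rough illustration of the "binary search on the hash" idea mentioned earlier in the thread, under exactly those constraints (no duplicates, moderate count, list constant once built), a sketch might look like the following. TFieldHashEntry, TFieldHashIndex, and FindFieldByHash are hypothetical names; it reuses TNamedItem.HashName from Data.DB, and assumes the index array was built once, sorted ascending by hash, after the dataset opened.

```pascal
type
  TFieldHashEntry = record
    Hash: Cardinal;
    Field: TField;
  end;
  // Hypothetical: built once after Open, sorted ascending by Hash.
  TFieldHashIndex = array of TFieldHashEntry;

function FindFieldByHash(const Index: TFieldHashIndex;
  const FieldName: string): TField;
var
  Lo, Hi, Mid: Integer;
  H: Cardinal;
begin
  Result := nil;
  H := TNamedItem.HashName(FieldName);
  Lo := 0;
  Hi := High(Index);
  Mid := 0;
  // Standard binary search on the hash value only.
  while Lo <= Hi do
  begin
    Mid := (Lo + Hi) div 2;
    if Index[Mid].Hash < H then
      Lo := Mid + 1
    else if Index[Mid].Hash > H then
      Hi := Mid - 1
    else
      Break;
  end;
  if Lo > Hi then
    Exit; // hash not present at all
  // Different names can share a hash; walk the run of equal hashes
  // and confirm the actual name with a single string comparison each.
  while (Mid > 0) and (Index[Mid - 1].Hash = H) do
    Dec(Mid);
  while (Mid <= High(Index)) and (Index[Mid].Hash = H) do
  begin
    if AnsiCompareText(Index[Mid].Field.FieldName, FieldName) = 0 then
      Exit(Index[Mid].Field);
    Inc(Mid);
  end;
end;
```

    This keeps the AnsiCompareText calls down to the (rare) colliding entries, at the cost of a one-time sort when the index is built.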

  26. You might want to try Spring4D generic stuff and compare if it is faster than the built in stuff. It is better designed, but I'm not sure if it is faster.

  27. Eric Grange Indeed. When profiling the slow loading form, I found like 70% of the time was spent in string allocation and Move. The first due to AnsiUpperCase and then the Move you mention.

    The reason FieldByName was called so much during startup on this form was due to our DevExpress grids, which apparently use it to populate the grids, along the lines of
      col.Value := FieldByName(col.DataBinding.FieldName).Value;
    One of the grids contained rather a lot of records.

    Unfortunately our "main" table is rather denormalized (almost 2^9 columns, doh!), so in our case the hashing can still pay off, but I'm certain a FastCode-style rewrite of that hash function would do wonders.
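    A FastCode-style replacement would presumably avoid the temporary string that AnsiUpperCase allocates by folding case inline. Here is a hypothetical FNV-1a style sketch - this is not the RTL's HashName, the case folding is ASCII-only (a real drop-in would have to match the comparison's full case-folding rules), and it assumes overflow checks are off for the wrapping multiplication:

```pascal
{$OVERFLOWCHECKS OFF} // FNV-1a relies on Cardinal wrap-around
function HashFieldName(const S: string): Cardinal;
var
  I: Integer;
  C: Cardinal;
begin
  Result := 2166136261;          // FNV offset basis
  for I := 1 to Length(S) do
  begin
    C := Ord(S[I]);
    // Fold ASCII lower-case to upper-case without allocating a string.
    if (C >= Ord('a')) and (C <= Ord('z')) then
      Dec(C, 32);
    Result := (Result xor C) * 16777619; // FNV prime
  end;
end;
```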

  28. Jeroen Wiert Pluimers FWIW the hash table and hash set currently in Spring4D are just wrappers around Generics.Collections.TDictionary. TDictionary is fast on a good day, but requires a good hash function as it's very prone to clustering, in which case you're back to O(n).

  29. Asbjørn Heid thanks for clarifying that.

  30. ----
    If there is a better way to set property values from a query, than

    Obj.Prop1 := DataSet.FieldByName('PropField1').AsPropertyType;
    Obj.Prop2 := DataSet.FieldByName('PropField2').AsPropertyType;
    ...
    I would be willing to try it out.
    -----

    Is this code in a loop? Or even just executed once, but executed again and again as some current rec somewhere else changes?

    Don't want to use persistent fields at design time because they tie you down too much?

    Then set up "semi-persistent" fields: declare an instance member for every field you use, populate them using FieldByName once when you open the dataset and use the instance members in the rest of your code.

    But I am sure you know about that? So what am I missing here?
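    The "semi-persistent" fields approach described above might be sketched like this; TCustomerReader and the field names are invented purely for illustration:

```pascal
type
  TCustomerReader = class
  private
    FDataSet: TDataSet;
    // Instance members for every field used, bound once per open.
    FName: TField;
    FBalance: TField;
  public
    procedure BindFields;
    procedure LoadCurrentRow(out AName: string; out ABalance: Currency);
  end;

procedure TCustomerReader.BindFields;
begin
  // One FieldByName call per field, once per open - not once per row.
  FName := FDataSet.FieldByName('Name');
  FBalance := FDataSet.FieldByName('Balance');
end;

procedure TCustomerReader.LoadCurrentRow(out AName: string;
  out ABalance: Currency);
begin
  // Inside the per-row loop, only cheap TField accesses remain.
  AName := FName.AsString;
  ABalance := FBalance.AsCurrency;
end;
```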

  31. Marjan Venema - No offence, but you should have read all the previous comments.

  32. Did anyone create their own dataset and add a new implementation of FindField? It would be interesting to know if the speed gains are mainly in benchmarks, or if there is a noticeable difference in real world applications.

  33. Lars Fosdal None taken. I did skip over them a bit, thinking they were mostly about hash tables/functions etc.

