Time to focus on the compiler purpose, right?
IMHO a compiler is there to generate correct and efficient code...
http://blog.synopse.info/post/2015/06/21/Why-FPC-may-be-a-better-compiler-than-Delphi
Well this is exactly what I've been telling Marco Cantù and he just won't listen. Sadly.
A free product such as FreePascal is faster than Delphi. Think about this for a moment. This is the current reality.
Results were posted by +Eric Grange some time ago on his website.
I think it's pretty clear that the vast majority of the Delphi customer base doesn't care much about raw performance as long as it's "good enough".
On the flip side, FPC seems to lack some big language features...
Asbjørn Heid This has to be one of the most common optimization tricks out there. I don't know why Delphi doesn't do it.
José Ramírez I read a lot about x86 optimization tricks back in the 386 days, and I can't remember seeing this one. So I'm not so sure about "most common". But I guess these days it should be well known to the compiler guys.
However, it should be fairly clear the Delphi compiler hasn't received much tender love and care in the optimization department for the last 10 years, if not more.
Asbjørn Heid Did you check http://www.getlazarus.org/language ? The FPC 3 branch did give a lot of nice language features.
There are still some missing features, e.g. anonymous methods or advanced RTTI, but they have dedicated branches in the FPC repository, and their implementation path is IMHO superior to Delphi's.
A. Bouchez AFAIK they are lacking a couple of features: delayed loading, anonymous methods, some generics stuff, and that's about it. It's getting there though.
However! Their 32-bit compiler can generate SSE2-4.1 code, which Delphi cannot! And it can statically link GCC code and probably Clang as well. Bit-packed records..
I've personally statically linked OpenSSL with GCC & FPC before for Indy.
Overall it's looking pretty good.
A. Bouchez Yea, for me the most glaring is the lack of anonymous methods and "reference to function". I use it quite a lot these days.
Once FPC gets that I'll take a good look at it.
Pfff, a better compiler just because it can do some optimizations the Delphi one cannot, while lacking some language features that the Delphi compiler has had for more than half a decade? Yeah, sure...
Stefan Glienke A faster compiler vs "some" features. I'll take the faster compiler, and so would anyone with half a brain. If speed doesn't matter I may just start writing C#.
José Ramírez Yeah, I bet that some optimized divisions are making your applications ten times faster...
Stefan Glienke It's not just about divisions, it's other stuff as well, like floating-point performance etc. You wouldn't care anyway, you deal mostly with interfaces. :)
It's old, but I doubt much has changed anyway.
https://www.delphitools.info/2014/05/08/delphi-xe6-32bits-and-scimark/
Uhh, getlazarus.org? I thought lazarus-ide.org was their main page?
José Ramírez If they got rid of the excessive stack juggling the Delphi compiler likes to do, it would matter a LOT more than the integer division optimization.
But the point still stands. People who care about performance go elsewhere these days. Even Java can be significantly faster than Delphi.
I don't have too many complaints on the code - as I deal with databases - which are slower than molasses compared to x86 code.
Personally, I would like to see a proper overhaul of the IDE to fix the lockups, leaks, and debugger crashes - taking priority over compiler work (beyond obvious fixes).
Everyone has their pet desires. I'd like a compiler that produced much better code because that is of critical importance to me. It shouldn't be one or the other. Why can't we have a compiler that produces efficient code, a language with rich features, a stable IDE, and a high quality RTL?
I am forced to believe that there is a substantial disconnect between the corporate vision and the everyday needs which concern those of us who depend on the tools for our livelihood. And "trust us" doesn't get it done.
Bill Meyer Without radical change in corporate management, nothing will change.
One thing is certain - nothing said in this community has the slightest effect on the EMBT corporate vision (or lack thereof).
We can perhaps sway the priorities of the Delphi team - but that is about all.
It should be certain, as we were certainly lectured on the irrelevance of our opinions many times in the newsgroups. I can't recall any of us ever being right....
Which is why I prefer this community to focus on actual coding. There is enough poisonous and futile debate elsewhere.
Lars Fosdal One of the great things in this community is that it continues to deliver much value.
Lars Fosdal Another great thing is that when legitimate rants appear, they almost always offer valid criticism, and once the issue has been raised and received a round of comments, it dies down. So different from the experience in certain newsgroups. ;)
Is the Delphi 64-bit compiler using LLVM?
José Ramírez Not the desktop one, no.
LLVM is only used in the mobile compilers; dcc32, dcc64, and dccosx all use the legacy compiler.
David Heffernan WOW, even OSX uses the legacy one. They use different compilers for the same language. Must be a PITA to maintain all of them, especially the legacy stuff.
I bet the LLVM stuff they use is like 5 years old.
But then again not surprising at all from a company that forgets to fix TList before releasing it to the public.
José Ramírez The legacy compiler wasn't written by them either. They bought it and modified it to their needs. I also think it's so fast because it was written (originally at least) outside the building! Very cynical!
David Heffernan Now that's funny. So that explains the ASM code in the RTL/VCL. Fake it until you make it! :D
Compiler != RTL/VCL
Lars Fosdal Have you ever opened the RTL or VCL? It's riddled with assembly, and guess why: because the compiler is slow! That's what I meant with fake it (speed with asm) until you make it (a better compiler).
Please clarify, José Ramírez - are you saying the RTL assembly code is slow, the compiler is slow, or the compiler-generated code is slow?
I wonder if anyone ever made an overview of which ASM blocks were added or modified in which Delphi version.
Lars Fosdal The RTL code has tons of assembly instead of real code to compensate for the speed lost to the poor compiler. I am sure you understand. Add tons of IFDEFs plus other nonsense and you get lost quickly.
And since the IDE is compiled with the Delphi compiler, it would fix a lot of your problems. Hence a better compiler is needed.
If they only added /LARGEADDRESSAWARE to allow 4GB allocation in the IDE it would make a lot of difference!
I guess sending mass promotion emails is more important for the company.
There are currently approx. 720 asm blocks in the RTL/VCL, spread over 48 units out of a total of 2050+ units (per XE7.1 Enterprise). Without having checked - I assume several of the asm blocks are duplicated for Win32 vs Win64.
unit:
2943 occurrences in 2059 files
function:
222172 occurrences in 1870 files
procedure:
138664 occurrences in 1719 files
asm:
778 occurrences in 48 files
This hardly qualifies as "tons."
As for getting lost - the debugger puts you right on the spot - and if you can't read assembly, you either study up - or press shift-F8.
Can the compiler do better? Not this one, I guess - and moving to a new one requires more than a bit of work and regression testing up the wazzoo. I'd like to see a LLVM compiler for the desktop too - but it is likely to mean yet another round of new and exciting bugs - and come at the cost of not fixing the existing problems with the IDE.
With limited resource, you can do limited work. Even if the devs probably don't handwrite the mass promotion emails.
Lars Fosdal You may think that's not a lot, but without those 700 or so asm blocks the application would be dead slow. It's not how often they are used but where they are used. And by fixing the compiler you fix the IDE partially as well, since it is used to compile the IDE itself.
Well, you can't do much about ErrorInsight, since that's written in .NET, I presume C#.
C++Builder
asm:
Found 288 occurrences in 47 files
__emit:
Found 58 occurrences in 13 files
out of a total of 2256 files (*.cpp *.c *.h *.inc)
Lars Fosdal Most of those are in string constants ;)
In System.hpp there is 0 ASM code.
Also - none of the asm or __emit references are in string constants, but are embedded asm in the C code.
Oh... and a slight oversight:
There are 191 .asm files in the C++Builder source/cpprtl directories
There is 1 (one) in the VCL source directory.
So - we have established the fact that the current Delphi compiler for Win/OSX is old, and that the RTL needs ASM to keep its speed up.
Now, deal with it.
It's certainly the case that the legacy Delphi compiler has never produced especially efficient code. Well, perhaps it did 20 years ago, but it hasn't moved on.
As somebody who does primarily floating point work, the poor quality of the code that the compiler emits (both dcc32 and dcc64) is frustrating. Code like mine, heavily based on floating point, would be significantly faster if it had codegen of similar quality to any of the well known C++ compilers.
David Heffernan - I don't think you'll find anyone that contests the need for a compiler overhaul.
Some of the math support in CPP looks to be hard coded assembly. In theory, some of that could be ported to Delphi as well - but, it wouldn't solve the stack handling issues in the current Delphi compiler.
The Delphi language does give you a lot of headroom for tweaking routines in assembly, though - which should give us the ability to work on optimized floating point libraries for the community?
It would be cheaper to move the entire project over to C++ than to tinker around with assembly. It's not exactly trivial to write an optimized assembly routine. And how exactly will you target SSE-AVX2 hardware with your custom assembly? This stuff is for the compiler to handle.
+Lars It's not enough to write your own asm routines. You need help from the compiler.
IMO, it's trivial to write an optimized assembly routine compared to moving to a new compiler and ensuring that it is backwards compatible with billions of lines of Delphi code (with its interspersed asm).
That is a pretty big task. It was easier for the new platforms - as you had no backwards compatibility to care for. I'm not saying it can't be done - but - it is a lot of work - and we won't know if the benefits will be tangible until it has been done. What if it turns out that it is the RTL/VCL that has most of the bottlenecks?
David Heffernan - For generic cross platform FPU code, yes.
Lars Fosdal For specific single target code too. If the compiler is emitting slow x87 instructions instead of fast SSE instructions, or doing lame stack juggling, for built in operators, then there's not much we can do. Nobody is going to want to convert their entire code base to asm.
For instance, I use a third party sparse solver. The original C code executes much faster than my Pascal port, which is a literal port. The bulk of the code is loops and expressions using built in arithmetic operators. No amount of hand crafted asm libs is going to help here. Nothing short of a total asm port, or linking to code emitted by a different compiler, will help.
As you might imagine, I link to the C version. But for my own code, that is subject to development, that's not a viable option.
David Heffernan - Good points.
Who cares about x86 nowadays? Delphi-generated x64 math code is roughly 50% faster than the x86 version with optimization.
Vitali Burkov - You'd be surprised at how incredibly slow corporate IT can be rolling out new OS versions. I know of companies that roll out 32-bit images on machines that come from the vendor with 8+ GB RAM - simply because 32-bit Windows is the corporate standard.
Vitali Burkov Not in my experience it is not. More like 10-20%. If you use 32-bit third-party libraries then you may be stuck with 32 bit.
I'd say >99% of my clients now use my x64 version. Of course, dcc64 produces much worse code than respectable C++ compilers. Typical really.
Lars Fosdal For companies that do serious floating point number crunching I think x64 is prevalent. People running business apps are much more likely to cling to 32 bit systems.
Define serious fp number crunching?
FPC produces nice x64 code, and a much smaller executable than dcc64, from my findings.
Lars Fosdal Algorithmic Trading down in central London?
Lars Fosdal Do I need to?
Most algorithmic trading models are pretty simple. Their biggest advantage was to be faster to react than your average trader. Now they need to be faster than your competing algorithms. But - it's not really number crunching with extreme performance requirements. High frequency traders are not highly regarded by their peers, it seems. Then again, traders have no sense of value - just sense of margin.
David Heffernan - Well, there isn't that much shelfware out there which really is computationally heavy. It usually falls into math and statistics (Mathematica, Wolfram, SAS Institute, etc), business intelligence (Oracle, Qlikview), and to a certain degree - audio and video processing. On the scientific side, you usually add more horsepower on the hardware side (parallelism, CUDA et al.). Other examples would be welcome.
Oh, forgot modelling and analysis tools. I know a guy that writes programs in Delphi to feed Abaqus data for FEM analysis.
+Lars I write FEM code in Delphi
That explains your need, for sure. Is CUDA an option?
+Lars Not really. As I see it, GPU programming is good for small-kernel, highly parallelisable tasks. We have a large kernel. Probably a re-write from scratch might benefit from GPU. Perhaps using a bespoke GPU sparse solver would help. None of that is very tractable for us.
David Heffernan FEM in Delphi, impressive. Just curious, have you looked at the FEniCS project, in particular DOLFIN?
Asbjørn Heid No. Sorry.
David Heffernan They generate C++ code "just-in-time" style based on a higher-level description of the equations (in Python currently). The C++ code is compiled as a .so/.dll and linked back in. Quite fancy. Since the code is generated they can exploit symmetries and constants in the input. And since it's C++ code, it's quite fast...
Anyway, as I said earlier, just getting rid of the stack juggling should result in a very significant performance boost in not just integer code but also floating-point code. As such this should be a quite high priority on the compiler side IMO.
Asbjørn Heid Stack juggling is just the first step, but I am afraid the two compilers are so far behind the rest of the competition out there that a radical move may be needed.
+David Heffernan Have you checked out the Dew Research Math library for Delphi? I think they have fast math algorithms that use SSE etc.
José Ramírez My linear algebra libraries are better than theirs, at least for my needs.
José Ramírez I doubt the Delphi compiler will ever match C++ in speed for non-trivial code, mostly due to language restrictions. As such it's a pointless goal, I think.
Asbjørn Heid Hence why it's pointless to apply tiny patches to the compiler; just replace it.
Asbjørn Heid The only real restriction affecting perf is that classes have to be heap allocated. Otherwise there's no reason that a good Delphi compiler could not be as fast as a good C++ compiler. And as for classes being on the heap, stack allocated classes in C++ for non-POD types are a minefield anyway, so it's hardly a limitation. If you want stack allocated structures, use records.
José Ramírez David Heffernan C++ has const and move semantics, stack-based objects with copy/move constructors and destructors, template metaprogramming, constexpr...
A Delphi compiler doesn't stand a chance due to the Delphi language in its current form. Sure you can do "asm in Delphi"-like programming and probably get fairly close, but that defeats the point of using a higher-level language.
Asbjørn Heid This is exactly why we are porting our project to C++. But I have a hard time deciding between wxWidgets and Qt.
Asbjørn Heid I don't see much of that presenting serious hurdles to perf. I think a better compiler would get >90% of the way there.
José Ramírez The cross-platform open source project I worked on moved from wxWidgets to Qt. The MOC stuff and such is a bit of a pain, but other than that I vastly prefer Qt.
David Heffernan I can agree on 70% ;)
Lars Fosdal About RTL bottlenecks and LLVM: some of our users, developing for Android and iOS, found out that memory allocation was very slow on those devices. Sounds as if the default implementation (using POSIX __malloc/_free) may be the culprit...
Did anyone experiment with this?
A. Bouchez Sounds like a poor Memory Manager, as expected. FastMM4 doesn't exist for mobile.
+José FastMM is a poor-performing MM.
Particularly for multithreaded apps.
While FastMM may not be the greatest memory allocator, a core issue is IMO that the Delphi language by its nature leads to an excessive amount of memory allocations.
Lars Fosdal For desktop apps, FastMM4 is just great, and for multi-threaded apps, if you do not abuse memory allocation, FastMM4 has great performance. It is much lighter and more efficient than the Windows API local heap, or the C runtime malloc, for instance.
Asbjørn Heid You are right. But it depends on how you write your code. In practice, FastMM4 does its best to avoid reallocation most of the time (e.g. string concatenation using FastMM4 is faster than TStringBuilder, AFAIR). Some new features (like anonymous methods) do silently allocate memory...
In all cases, you should always know a little about what is under the hood. ;)
IMHO we should not blame FastMM4: it is one of the best parts of the current Delphi RTL - and it was contributed by Pierre Le Riche, so was not done on the Embarcadero side. I wish the latest version of the RTL followed its lead!
A. Bouchez - Is this info out of date?
http://stackoverflow.com/questions/6072269/need-multi-threading-memory-manager
Our server app (a service) has 30-50 threads, depending on the number of clients - and more than 60% of the threads do concurrent memory allocations - albeit rarely at a high rate. There may be 5 to 10 worker threads which can peak high rate on allocations, network and db traffic at the same time, during closing of large deliveries.
The best part of FastMM4 is that it has rarely, if ever, failed me.
Lars Fosdal SAPMM did appear since, and sounded nice to me - may help you in your case: http://blog.synopse.info/post/2013/12/05/New-Open-Source-Multi-Thread-ready-Memory-Manager%3A-SAPMM
Updated project repository is https://github.com/alan008/sapmm
For typical mORMot servers, memory allocation is pretty low, so we found out that FastMM4 was good enough in practice, even on production servers.
The FPC memory manager is very robust, and multi-thread friendly, but it is slower than FastMM4 in the context of single threaded work.
I don't agree with A. Bouchez here. My experience is that FastMM is slower than almost any other memory manager under contention. Slower than the malloc from msvcrt in my tests. Best is to avoid heap allocation where possible. The Emba RTL often makes that hard.
David Heffernan This is what I wrote: FastMM4 is great, unless you have contention.
And AFAIR the NeverSleepOnThreadContention flag helps.
+A. Bouchez That option makes things worse in my experience. You get higher CPU utilisation but slower runtime.
David Heffernan The Emba RTL is slow outside some particular areas (the fastcode functions and some oldies); you have to avoid it in the first place when you are after performance, contention or no contention.
Eric Grange Sadly, this is what I wrote some years ago - http://blog.synopse.info/post/2011/05/20/How-to-write-fast-multi-thread-Delphi-applications
ReplyDeleteA. Bouchez as an update to your article, you could use the full Knuth quote
https://en.wikipedia.org/wiki/Program_optimization#When_to_optimize
and the subsequent one
"In established engineering disciplines a 12% improvement, easily obtained, is never considered marginal and I believe the same viewpoint should prevail in software engineering"
In the tests that I've done, Lazarus is nowhere near as fast as Delphi. In some cases, yeah, but generally I'm not impressed.
Eric Grange
Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%.
Johan Bontes I disagree, most programmers do not care about performance at all. And by most, I mean an overwhelming majority.
Usually they do not even care about data volumes.
Eric Grange This sounds like the hard truth. I've seen this in so many companies... programmers do not plan for performance (since it is a hard task, especially when your C# or Java environment abstracts everything away), and managers don't focus on it (mostly due to the cost). Then the clients start to arrive in numbers, the whole system's capacity turns out to be underestimated, and the programmers are asked to work performance wonders... on a code base which was never designed nor prepared for it.
A typical issue is abusing the DB (putting everything in the RDBMS), which becomes a true bottleneck. Or never looking at the execution plan of a complex SQL query auto-magically generated by LINQ or the ORM. Or never using the step-by-step debugger (even within the core libraries) to see what is involved in a single line of code. It looks like the "but it works on my dev workstation!" syndrome.
I've seen companies disappear due to such bad design. Or others spending huge amounts of money to rewrite the initial quick & dirty project. And even some companies leaving Delphi for C# or Java, because the technical managers believed in the wonders of performance told by the marketers, and put brand new coders, fresh out of school, still wet behind the ears, on the "future flagship product of our company". With the success you can imagine.
A. Bouchez
Speaking of performance, I managed to outperform the Fastcrc32c hash :-) using hand-coded assembly of MurmurHash3. (Well, on my machine at least.)
Johan Bontes Will you share your implementation? Also the non-asm code? :)
Johan Bontes AFAIR MurmurHash3 has more collisions than crc32c. This is to be taken into account. Would you post your numbers on http://synopse.info/forum/profile.php?id=1374 ? Did you test crc32c with hardware acceleration (i.e. the crc32c() function in SynCommons, not crc32fast)? I hope your MurmurHash3 implementation is more correct than the benchmark code you posted on your forum, which was buggy.
José Ramírez
Here's the code:
{$pointermath on}
function PascalMurmurHash3(const [ref] HashData; Len, Seed: Integer): Integer;
const
c1 = $cc9e2d51;
c2 = $1b873593;
r1 = 15;
r2 = 13;
m = 5;
n = $e6546b64;
f1 = $85ebca6b;
f2 = $c2b2ae35;
var
i: NativeInt;
Len1, Len2: integer;
k: integer;
data: PCardinal;
label case1, case2, case3, final;
type
ByteArray = array[0..0] of byte;
begin
Result:= seed;
data:=@HashData;
Len1:= (Len shr 2)-1;
for i:= 0 to Len1 do begin
k:= data[i];
k:= k * integer(c1);
k:= (k shl r1) or (k shr (32-r1));
k:= k * c2;
Result:= Result xor k;
Result:= (Result shl r2) or (Result shr (32-r2));
Result:= Result * m + integer(n);
end; {for i}
k:= 0;
Len2:= Len;
case Len and $3 of
1: goto case1;
2: goto case2;
3: goto case3;
else goto final;
end;
case3:
Dec(Len2);
Inc(k, PByte(Data)[Len2] shl 16);
case2:
Dec(Len2);
Inc(k, PByte(Data)[Len2] shl 8);
case1:
Dec(Len2);
Inc(k, PByte(Data)[Len2]);
k:= k * integer(c1);
k:= (k shl r1) or (k shr (32 - r1));
k:= k * c2;
Result:= Result xor k;
final:
Result:= Result xor Len;
Result:= Result xor (Result shr 16);
Result:= Result * integer(f1);
Result:= Result xor (Result shr 13);
Result:= Result * integer(f2);
Result:= Result xor (Result shr 16);
end;
function MurmurHash3(const [ref] HashData; Len: integer; Seed: integer = 0): integer;
{$ifdef purepascal}
inline;
begin
Result:= PascalMurmurHash3(HashData, Len, Seed);
end;
{$else}
const
c1 = $CC9E2D51;
c2 = $1B873593;
r1 = 15;
r2 = 13;
m = 5;
n = $E6546B64;
f1 = $85EBCA6B;
f2 = $C2B2AE35;
{$REGION 'asm'}
{$IFDEF CPUx86}
asm
push EBX
push EDI
push ESI
xchg ECX,EDX
//EAX = data
//ECX = count in bytes
//EDX = seed
mov ESI,ECX //save Len for the final xor (ESI was never initialized)
add EAX, ECX
neg ECX
add ECX,4
jg @remaining_bytes
@loop:
mov EDI,[EAX+ECX-4]
imul EDI,EDI,c1
rol EDI,r1
imul EDI,EDI,c2
xor EDX,EDI
rol EDX,r2
lea EDX,[EDX*4+EDX+n]
add ECX,4
jle @loop
@remaining_bytes:
cmp ECX,4
jz @finalization
movzx EBX,byte ptr [EAX+ECX-4]
cmp ECX,3//inc RCX
jz @process_remaining
ror EBX,8//ror R9d,24
mov BL,byte ptr [EAX+ECX-3]
cmp ECX,2//inc RCX
jz @shift_back
ror EBX,8//ror R9d,24
mov bl,byte ptr [EAX+ECX-2]
rol EBX,8
@shift_back:
rol EBX,8
@process_remaining:
imul EBX,EBX,c1
rol EBX,r1
imul EBX,EBX,c2
xor EDX,EBX
@finalization:
xor EDX,ESI
mov EAX,EDX
shr EDX,16
xor EDX,EAX
imul EDX,EDX,f1
mov EAX,EDX
shr EDX,13
xor EDX,EAX
imul EDX,EDX,f2
mov EAX,EDX
shr EDX,16
xor EAX,EDX
pop ESI
pop EDI
pop EBX
end;
{$ENDIF}
{$IFDEF CPUx64}
asm
.NOFRAME
xchg R10,RDI
xchg R11,RSI
mov RAX,RCX
mov RCX,RDX
mov RDX,R8
//RAX = data
//RCX = count in bytes
//RDX = seed
mov ESI,ECX
//make index negative based.
//and ECX,not(3)
neg RCX
sub RAX, RCX
add RCX, 4
jg @remaining_bytes
@loop:
mov EDI,[RAX+RCX-4]
imul EDI,EDI,c1
rol EDI,r1
imul EDI,EDI,c2
xor EDX,EDI
rol EDX,r2
lea EDX,[EDX*4+EDX+n] // *5 + n
//lea RAX,qword ptr [RAX+4]
add RCX,4
jle @loop
@remaining_bytes:
cmp RCX,4
//mov ECX,ESI
//and ESI,$3
jz @finalization
movzx r9,byte ptr [RAX+RCX-4]
cmp RCX,3//inc RCX
jz @process_remaining
ror R9,8//ror R9d,24
mov R9b,byte ptr [RAX+RCX-3]
cmp RCX,2//inc RCX
jz @shift_back
ror R9,8//ror R9d,24
mov R9b,byte ptr [RAX+RCX-2]
rol R9,8
@shift_back:
rol R9,8
@process_remaining:
imul R9d,R9d,c1
rol R9d,r1
imul R9d,R9d,c2
xor EDX,R9d
@finalization:
xor EDX,ESI
mov EAX,EDX
shr EDX,16
xor EDX,EAX
imul EDX,EDX,f1
mov EAX,EDX
shr EDX,13
xor EDX,EAX
imul EDX,EDX,f2
mov EAX,EDX
shr EDX,16
xor EAX,EDX
xchg R10,RDI
xchg R11,RSI
end;
{$ENDIF}
{$ENDREGION}
{$ENDIF}
A. Bouchez
Collisions are the same as crc32, see: http://www.strchr.com/hash_functions (the numbers in square brackets are the collisions).
My lowly laptop/workstations do not have crc32c built in, so I was only able to test against the FastCRC32c code.
The results were that for hashes of small data (e.g. string[10]) MurmurHash was significantly faster, and for large data (e.g. string[1000]) slightly slower.
See the code in the post above.
The lookup table needs a while to warm up in the cache.
Murmurhash3 only has the loop, so it will always be faster for intermittent hashing.
Given the fact that hashing is a core ingredient of TDictionary, and small keys are very common there, I think a fast hash for small data is a big win.
Ideally the hash should be selectable based on the size of the key.
Numbers
I'm working on a performance-test suite for my Fastcode project, there's a (basic) speed test for murmurhash3 at: https://github.com/JBontes/FastCode/tree/master/Test
Johan Bontes Your crc32c() test is still broken on GitHub. You are using Length(HashTest) as the length instead of Length(HashTest[i]).
A. Bouchez
Oh, I'll get right on it.
With the correct length, the timings for crc32c and murmur are very close, both on small and large sets.
Committed the corrected code to GitHub.