Time to focus on the compiler purpose, right?

IMHO a compiler is there to generate correct and efficient code...
http://blog.synopse.info/post/2015/06/21/Why-FPC-may-be-a-better-compiler-than-Delphi

Comments

  1. Well this is exactly what I've been telling Marco Cantù and he just won't listen. Sadly.

    A free product such as FreePascal generates faster code than Delphi. Think about this for a moment. This is the current reality.

    Results were posted by +Eric Grange some time ago on his website.

  2. I think it's pretty clear that the vast majority of the Delphi customer base doesn't care much about raw performance as long as it's "good enough".

    On the flip side, FPC seems to lack some big language features...

  3. Asbjørn Heid This has to be one of the most common optimization tricks out there. I don't know why Delphi doesn't do it.

  4. José Ramírez I read a lot about x86 optimization tricks back in the 386 days, and I can't remember seeing this one. So I'm not so sure about "most common". But I guess these days it should be well known to the compiler guys (the trick is sketched below).

    However, it should be fairly clear that the Delphi compiler hasn't received much tender loving care in the optimization department for the last 10 years, if not more.
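    For context, a minimal sketch of the trick in question, assuming it is the reciprocal-multiplication rewrite of integer division by a constant described in the linked post (the magic constant below is for unsigned division by 3):

    // What an optimizing compiler can emit for "x div 3": a multiply and a
    // shift instead of a slow DIV instruction.
    function Div3(x: Cardinal): Cardinal; inline;
    begin
      Result := (UInt64(x) * UInt64($AAAAAAAB)) shr 33;
    end;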

  5. Asbjørn Heid Did you check http://www.getlazarus.org/language ? The FPC 3 branch added a lot of nice language features.
    There are still some missing features, e.g. anonymous methods or advanced RTTI, but they have dedicated branches in the FPC repository, and their implementation path is IMHO superior to Delphi's.

  6. A. Bouchez AFAIK they are lacking a couple of features: delayed loading, anonymous methods, some generics stuff, and that's about it. It's getting there, though.

    However! Their 32-bit compiler can generate SSE2-SSE4.1 code, which Delphi cannot! And it can statically link GCC code, and probably Clang as well. Bit-packed records...

    I've personally statically linked OpenSSL with GCC & FPC before for Indy.

    Overall it's looking pretty good.

  7. A. Bouchez Yeah, for me the most glaring omission is the lack of anonymous methods and "reference to function". I use them quite a lot these days.

    Once FPC gets that I'll take a good look at it.

  8. Pfff, a better compiler just because it can do some optimizations the Delphi one cannot, while lacking language features that the Delphi compiler has had for more than half a decade? Yeah, sure...

  9. Stefan Glienke A faster compiler vs "some" features. I'll take the faster compiler, and so would anyone with half a brain. If speed doesn't matter, I might as well just start writing C#.

  10. José Ramírez Yeah, I bet that some optimized divisions are making your applications ten times faster...

  11. Stefan Glienke It's not just about divisions, it's other stuff as well, like floating point performance etc. You wouldn't care anyway, you deal mostly with interfaces. :)

    It's old, but I doubt much has changed.
    https://www.delphitools.info/2014/05/08/delphi-xe6-32bits-and-scimark/

  12. José Ramírez If they got rid of the excessive stack juggling the Delphi compiler likes to do, it would matter a LOT more than the integer division optimization.

    But the point still stands. People who care about performance go elsewhere these days. Even Java can be significantly faster than Delphi.

  13. I don't have too many complaints about the code - as I deal with databases, which are slower than molasses compared to x86 code.

    Personally, I would like to see a proper overhaul of the IDE to fix the lockups, leaks, and debugger crashes - taking priority over compiler work (beyond obvious fixes).

  14. Everyone has their pet desires. I'd like a compiler that produced much better code because that is of critical importance to me. It shouldn't be one or the other. Why can't we have a compiler that produces efficient code, a language with rich features, a stable IDE, and a high quality RTL?

  15. I am forced to believe that there is a substantial disconnect between the corporate vision and the everyday needs which concern those of us who depend on the tools for our livelihood. And "trust us" doesn't get it done.

  16. Bill Meyer Without radical change in corporate management nothing will change.

  17. One thing is certain - nothing said in this community has the slightest effect on the EMBT corporate vision (or lack thereof).

    We can perhaps sway the priorities of the Delphi team - but that is about all.

  18. It should be certain, as we were certainly lectured on the irrelevance of our opinions many times in the newsgroups. I can't recall any of us ever being right....

  19. Which is why I prefer this community to focus on actual coding. There is enough poisonous and futile debate elsewhere.

  20. Lars Fosdal One of the great things in this community is that it continues to deliver much value.

  21. Lars Fosdal Another great thing is that when legitimate rants appear, they almost always offer valid criticism, and once the issue has been raised and received a round of comments, it dies down. So different to the experience in certain newsgroups. ;)

  22. Is the Delphi 64-bit compiler using LLVM?

  23. José Ramírez Not the desktop one, no.

  24. LLVM is only used in the mobile compilers. dcc32, dcc64, and dccosx all use the legacy compiler.

  25. David Heffernan WOW, even OSX uses the legacy compiler. They use different compilers for the same language. Must be a PITA to maintain all of them, especially the legacy stuff.
    I bet the LLVM stuff they use is like 5 years old.
    But then again, not surprising at all from a company that forgets to fix TList before releasing it to the public.

  26. José Ramírez The legacy compiler wasn't written by them either. They bought it in and modified it to their needs. I also think it's so fast because it was written (originally at least) outside the building! Very cynical!

  27. David Heffernan Now that's funny. So that explains the ASM code in the RTL/VCL. Fake it until you make it! :D

  28. Lars Fosdal Have you ever opened the RTL or VCL? It's riddled with assembly, and guess why: because the compiler is slow! That's what I meant by fake it (speed with asm) until you make it (a better compiler).

  29. Please clarify, José Ramírez - Are you saying the RTL assembly code is slow, the compiler is slow, or the compiler generated code is slow?

    I wonder if anyone ever made an overview of which ASM blocks were added or modified in which Delphi version.

  30. Lars Fosdal The RTL code has tons of assembly instead of plain Pascal to compensate for the poor compiler. I am sure you understand. Add tons of IFDEFs + other nonsense and you get lost quickly.

    And since the IDE is compiled with the Delphi compiler, improving it would fix a lot of your problems as well. Hence a better compiler is needed.

    If they only added /LARGEADDRESSAWARE to allow 4 GB of address space in the IDE, it would make a lot of difference (see the sketch below)!

    I guess sending mass promotion emails is more important for the company.
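    For reference (not from the thread): a Delphi application can set that PE flag itself with a one-line directive; the IDE would need Embarcadero to do the same. A minimal sketch:

    // Typically placed in the project .dpr; the constant comes from Winapi.Windows.
    {$SETPEFLAGS IMAGE_FILE_LARGE_ADDRESS_AWARE}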

  31. There are currently approx. 720 asm blocks in the RTL/VCL spread over 48 units out of a total of 2050+ units (per XE7.1 Enterprise).  Without having checked - I assume several of the asm blocks are duplicated for Win32 vs Win64.

    unit:      2943 occurrences in 2059 files
    function:  222172 occurrences in 1870 files
    procedure: 138664 occurrences in 1719 files
    asm:       778 occurrences in 48 files

    This hardly qualifies as "tons."

    As for getting lost - the debugger puts you right on the spot - and if you can't read assembly, you either study up - or press shift-F8.

    Can the compiler do better?  Not this one, I guess - and moving to a new one requires more than a bit of work and regression testing up the wazoo.  I'd like to see an LLVM compiler for the desktop too - but it is likely to mean yet another round of new and exciting bugs - and come at the cost of not fixing the existing problems with the IDE.

    With limited resources, you can only do limited work. Even if the devs probably don't handwrite the mass promotion emails.

  32. Lars Fosdal You may think that's not a lot, but without those 700 or so asm blocks the application would be dead slow. It's not how often they are used but where they are used. And by fixing the compiler you partially fix the IDE as well, since it is used to compile the IDE itself.

    Well, you can't do much about Error Insight, since that's written in .NET - C#, I presume.

  33. C++Builder:
    asm:    found 288 occurrences in 47 files
    __emit: found 58 occurrences in 13 files

    out of a total of 2256 files (*.cpp *.c *.h *.inc)

  34. Lars Fosdal Most of those are in string constants ;)
    In System.hpp there is no asm code.

  35. Also - none of the asm or __emit references are in string constants, but are embedded asm in the C code.

    Oh... and a slight oversight:
    There are 191 .asm files in the C++Builder source/cpprtl directories
    There is 1 (one) in the VCL source directory.

    So - we have established the fact that the current Delphi compiler for Win/OSX is old, and that the RTL needs ASM to keep its speed up.
    Now, deal with it.

  36. It's certainly the case that the legacy Delphi compiler has never produced especially efficient code. Well, perhaps it did 20 years ago, but it hasn't moved on.

    As somebody who does primarily floating point work, the poor quality of the code that the compiler emits (both dcc32 and dcc64) is frustrating. Code like mine, heavily based on floating point, would be significantly faster if it had codegen of similar quality to any of the well known C++ compilers.

  37. David Heffernan - I don't think you'll find anyone that contests the need for a compiler overhaul.

    Some of the math support in CPP looks to be hard coded assembly. In theory, some of that could be ported to Delphi as well - but, it wouldn't solve the stack handling issues in the current Delphi compiler.

    The Delphi language does give you a lot of headroom for tweaking routines in assembly, though - which should give us the ability to work on optimized floating point libraries for the community?

  38. It would be cheaper to move the entire project over to C++ than to tinker around with assembly. It's not exactly trivial to write an optimized assembly routine. And how exactly will you target SSE-AVX2 hardware with your custom assembly? This stuff is for the compiler to handle.

  39. +Lars It's not enough to write your own asm routines. You need help from the compiler.

  40. IMO, it's trivial to write an optimized assembly routine, compared to moving to a new compiler and ensuring that it is backwards compatible with billions of lines of Delphi code (with its interspersed asm).

    That is a pretty big task.  It was easier for the new platforms - as you had no backwards compatibility to care for. I'm not saying it can't be done - but - it is a lot of work - and we won't know if the benefits will be tangible until it has been done.  What if it turns out that it is the RTL/VCL that has most of the bottlenecks?

  41. David Heffernan - For generic cross platform FPU code, yes.

  42. Lars Fosdal For specific single target code too. If the compiler is emitting slow x87 instructions instead of fast SSE instructions, or doing lame stack juggling, for built in operators, then there's not much we can do. Nobody is going to want to convert their entire code base to asm.

    For instance, I use a third party sparse solver. The original C code executes much faster than my Pascal port, which is a literal port. The bulk of the code is loops and expressions using built in arithmetic operators. No amount of hand crafted asm libs is going to help here. Nothing short of a total asm port, or linking to code emitted by a different compiler, will help.

    As you might imagine, I link to the C version. But for my own code, that is subject to development, that's not a viable option.
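    For completeness, a sketch of what "linking to code emitted by a different compiler" can look like on the Delphi side (file and routine names are made up; whether a given object file links depends on its format, so a DLL import is often the practical route):

    // Static object file produced by a C compiler:
    {$L solver.obj}
    function sparse_solve(a: PDouble; n: Integer): Integer; cdecl; external;

    // Or, more portably, import from a DLL built with the other compiler:
    function sparse_solve_dll(a: PDouble; n: Integer): Integer; cdecl;
      external 'solver.dll' name 'sparse_solve';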

  43. Who cares about x86 nowadays? Delphi-generated x64 math code is roughly 50% faster than the x86 version with optimization.

  44. Vitali Burkov - You'd be surprised at how incredibly slow corporate IT can be at rolling out new OS versions.  I know of companies that roll out 32-bit images on machines that come from the vendor with 8+ GB RAM - simply because 32-bit Windows is the corporate standard.

  45. Vitali Burkov Not in my experience. More like 10-20%. If you use 32-bit third-party libraries then you may be stuck with 32 bit.

    I'd say >99% of my clients now use my x64 version. Of course, dcc64 produces much worse code than respectable C++ compilers.  Typical, really.

  46. Lars Fosdal  For companies that do serious floating point number crunching I think x64 is prevalent. People running business apps are much more likely to cling to 32 bit systems.

  47. Define serious fp number crunching?

  48. FPC produces nice x64 code, and much smaller executables than dcc64, from my findings.

  49. Lars Fosdal Algorithmic Trading down in central London?

  50. Most algorithmic trading models are pretty simple.  Their biggest advantage was to be faster to react than your average trader. Now they need to be faster than your competing algorithms.  But - it's not really number crunching with extreme performance requirements. High frequency traders are not highly regarded by their peers, it seems.  Then again, traders have no sense of value - just sense of margin.

  51. David Heffernan - Well, there isn't that much off-the-shelf software out there which really is computationally heavy.  It usually falls into math and statistics (Mathematica, Wolfram, SAS Institute, etc.), business intelligence (Oracle, QlikView), and to a certain degree audio and video processing.  On the scientific side, you usually add more horsepower on the hardware side (parallelism, CUDA et al.). Other examples would be welcome.

  52. Oh, forgot modelling and analysis tools. I know a guy that writes programs in Delphi to feed Abaqus data for FEM analysis.

  53. That explains your need, for sure. Is CUDA an option?

  54. +Lars Not really. As I see it, GPU programming is good for small-kernel, highly parallelisable tasks. We have a large kernel. Probably a rewrite from scratch might benefit from the GPU. Perhaps using a bespoke GPU sparse solver would help. None of that is very tractable for us.

  55. David Heffernan FEM in Delphi, impressive. Just curious, have you looked at the FEniCS project, in particular DOLFIN?

  56. David Heffernan They generate C++ code "just-in-time" style based on a higher-level description of the equations (in Python currently). The C++ code is compiled as a .so/.dll and linked back in. Quite fancy. Since the code is generated they can exploit symmetries and constants in the input. And since it's C++ code, it's quite fast...

    Anyway, as I said earlier, just getting rid of the stack juggling should result in a very significant performance boost, not just in integer code but also in floating-point code. As such this should be quite a high priority on the compiler side, IMO.

  57. Asbjørn Heid Stack juggling is just the first step, but I am afraid the two compilers are so far behind the rest of the competition out there that a radical move may be needed.

    +David Heffernan Have you checked out Dew Research Math library for Delphi? I think they have fast math algorithms that use SSE etc.

  58. José Ramírez My linear algebra libraries are better than theirs, at least for my needs.

  59. José Ramírez I doubt the Delphi compiler will ever match C++ in speed for non-trivial code, mostly due to language restrictions. As such it's a pointless goal, I think.

  60. Asbjørn Heid Hence it's pointless to apply tiny patches to the compiler - just replace it.

  61. Asbjørn Heid The only real restriction affecting perf is that classes have to be heap-allocated. Otherwise there's no reason that a good Delphi compiler could not be as fast as a good C++ compiler. And as for classes being on the heap, stack-allocated classes in C++ for non-POD types are a minefield anyway, so it's hardly a limitation. If you want stack-allocated structures, use records (see the sketch below).
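    For illustration, a minimal sketch of the record approach (a made-up TVec3 type, not from the thread): record instances live on the stack, or inline inside other structures, and need no Create/Free.

    type
      TVec3 = record
        X, Y, Z: Double;
        function Dot(const B: TVec3): Double;
      end;

    function TVec3.Dot(const B: TVec3): Double;
    begin
      Result := X * B.X + Y * B.Y + Z * B.Z;
    end;

    procedure Demo;
    var
      A, B: TVec3;  // stack-allocated, no heap traffic, nothing to Free
    begin
      A.X := 1; A.Y := 2; A.Z := 3;
      B.X := 4; B.Y := 5; B.Z := 6;
      Writeln(A.Dot(B));
    end;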

  62. José Ramírez David Heffernan C++ has const and move semantics, stack-based objects with copy/move constructors and destructors, template metaprogramming, constexpr...

    A Delphi compiler doesn't stand a chance due to the Delphi language in its current form. Sure you can do "asm in Delphi"-like programming and probably get fairly close, but that defeats the point of using a higher-level language.

  63. Asbjørn Heid This is exactly why we are porting our project to C++. But I have a hard time deciding between wxWidgets and Qt.

  64. Asbjørn Heid I don't see much of that presenting serious hurdles to perf. I think a better compiler would get >90% of the way there.

  65. José Ramírez The cross-platform Open Source project I worked on moved from WxWidgets to Qt. The MOC stuff and such is a bit of a pain, but other than that I vastly prefer Qt.

  66. David Heffernan I can agree on 70% ;)

  67. Lars Fosdal About RTL bottlenecks and LLVM: some of our users, developing for Android and iOS, found out that memory allocation was very slow on those devices. It sounds like the default implementation (using the POSIX __malloc/_free) may be the culprit...
    Did anyone experiment with this?

  68. A. Bouchez Sounds like a poor Memory Manager, as expected. FastMM4 doesn't exist for mobile.

  69. +José FastMM is a poor performing MM.

  70. Particularly for multithreaded apps.

  71. While FastMM may not be the greatest memory allocator, a core issue is IMO that the Delphi language by its nature leads to an excessive number of memory allocations.

  72. Lars Fosdal For desktop apps, FastMM4 is just great, and for multi-threaded apps, if you do not abuse memory allocation, FastMM4 has great performance. It is much lighter and more efficient than the Windows API local heap, or the C runtime malloc, for instance.
    Asbjørn Heid You are right. But it depends on how you write your code. In practice, FastMM4 does its best to avoid reallocation most of the time (e.g. a string concatenation using FastMM4 is faster than TStringBuilder AFAIR). Some new features (like anonymous methods) do silently allocate memory (see the small example below)...
    In all cases, you should always know a little about what is under the hood. ;)
    IMHO we should not blame FastMM4: it is one of the best parts of the current Delphi RTL - and it was contributed by Pierre Le Riche, so it was not done on the Embarcadero side. I wish the latest versions of the RTL were following its lead!
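    As an aside, a minimal example of that silent allocation (hypothetical code, purely illustrative): capturing a local variable in an anonymous method makes the compiler allocate a hidden heap object to hold the captured state.

    uses System.SysUtils; // for TFunc<Integer>

    function MakeCounter: TFunc<Integer>;
    var
      Count: Integer;
    begin
      Count := 0;
      // Count is captured, so a hidden reference-counted object is allocated
      // on the heap behind the scenes; each call to MakeCounter allocates again.
      Result :=
        function: Integer
        begin
          Inc(Count);
          Result := Count;
        end;
    end;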

  73. A. Bouchez - Is this info out of date?
    http://stackoverflow.com/questions/6072269/need-multi-threading-memory-manager

    Our server app (a service) has 30-50 threads, depending on the number of clients - and more than 60% of the threads do concurrent memory allocations, albeit rarely at a high rate.  There may be 5 to 10 worker threads which can peak with a high rate of allocations, network and DB traffic at the same time, during the closing of large deliveries.

    The best part of FastMM4 is that it has rarely, if ever, failed me.

  74. Lars Fosdal SAPMM has appeared since then, and it sounds nice to me - it may help in your case: http://blog.synopse.info/post/2013/12/05/New-Open-Source-Multi-Thread-ready-Memory-Manager%3A-SAPMM
    The updated project repository is https://github.com/alan008/sapmm
    For typical mORMot servers, memory allocation is pretty low, so we found out that FastMM4 was good enough in practice, even on production servers.
    The FPC memory manager is very robust and multi-thread friendly, but it is slower than FastMM4 for single-threaded work.

  75. I don't agree with A. Bouchez here. My experience is that FastMM is slower than almost any other memory manager under contention. Slower than the malloc from msvcrt in my tests. Best is to avoid heap allocation where possible. The Emba RTL often makes that hard.

  76. David Heffernan This is what I wrote: FastMM4 is great, unless you have contention.
    And AFAIR the NeverSleepOnThreadContention flag helps (see below for where it is set).
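    For reference, that flag is a conditional define in FastMM4Options.inc (it can also be set as a project-level define before FastMM4 is compiled); a minimal sketch, not a recommendation:

    // FastMM4Options.inc (or the project's conditional defines)
    {$define NeverSleepOnThreadContention}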

  77. +A. Bouchez That option makes things worse in my experience. You get higher CPU utilisation but slower runtime.

  78. David Heffernan The Emba RTL is slow outside some particular areas (the FastCode functions and some oldies); you have to avoid it in the first place when you are after performance, contention or no contention.

  79. A. Bouchez as an update to your article, you could use the full Knuth quote

    https://en.wikipedia.org/wiki/Program_optimization#When_to_optimize

    and the subsequent one

    "In established engineering disciplines a 12% improvement, easily obtained, is never considered marginal and I believe the same viewpoint should prevail in software engineering"

  80. In the tests that I've done, Lazarus is nowhere near as fast as Delphi. In some cases, yeah, but generally I'm not impressed.

    Eric Grange 
    Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%.

  81. Johan Bontes I disagree, most programmers do not care about performance at all. And by most, I mean an overwhelming majority.
    Usually they do not even care about data volumes.

  82. Eric Grange This sounds like the hard truth. I've seen this in so many companies... programmers do not plan for performance (since it is a hard task, especially when your C# or Java environment abstracts everything away), and managers don't focus on it (mostly due to the cost). Then the clients start to arrive in numbers, the whole system turns out to have been underestimated, and the programmers are asked to do wonders of performance... on a code base which was never designed nor prepared for it.
    A typical issue is abusing the DB (putting everything in the RDBMS), which becomes a true bottleneck. Or never looking at the execution plan of a complex SQL query auto-magically generated by LINQ or the ORM. Never using the step-by-step debugger (even within the core libraries) to see what is involved in a single line of code. It looks like the "but it works on my dev workstation!" syndrome.
    I've seen companies disappear due to such bad design. Or others spending huge amounts of money to rewrite the initial quick & dirty project. And even some companies leaving Delphi for C# or Java, because the technical managers believed in the wonders of performance promised by the marketers, and put brand new coders, fresh out of school and still wet behind the ears, on the "future flagship product of our company". With the success you can imagine.

  83. A. Bouchez
    Speaking of performance, I managed to outperform the Fastcrc32c hash :-) using hand-coded assembly of MurmurHash3 (well, on my machine at least).

  84. Johan Bontes Will you share your implementation? Also the non-asm code? :)

  85. Johan Bontes AFAIR MurmurHash3 has more collisions than crc32c. This is to be taken into account. Would you post your numbers on http://synopse.info/forum/profile.php?id=1374 ? Did you test crc32c with hardware acceleration (i.e. the crc32c() function in SynCommons, not crc32fast)? I hope your MurmurHash3 implementation is more correct than the benchmark code you posted on your forum, which was buggy.

  86. José Ramírez
    Here's the code:
    {$pointermath on}
    function PascalMurmurHash3(const [ref] HashData; Len, Seed: Integer): Integer;
    const
      c1 = $cc9e2d51;
      c2 = $1b873593;
      r1 = 15;
      r2 = 13;
      m = 5;
      n = $e6546b64;
      f1 = $85ebca6b;
      f2 = $c2b2ae35;
    var
      i: NativeInt;
      Len1, Len2: integer;
      k: integer;
      data: PCardinal;
    label case1, case2, case3, final;
    type
      ByteArray = array[0..0] of byte;
    begin
      Result:= seed;
      data:=@HashData;
      Len1:= (Len shr 2)-1;
      for i:= 0 to Len1 do begin
        k:= data[i];
        k:= k * integer(c1);
        k:= (k shl r1) or (k shr (32-r1));
        k:= k * c2;
        Result:= Result xor k;
        Result:= (Result shl r2) or (Result shr (32-r2));
        Result:= Result * m + integer(n);
      end; {for i}
      k:= 0;
      Len2:= Len;
      case Len and $3 of
        1: goto case1;
        2: goto case2;
        3: goto case3;
        else goto final;
      end;
    case3:
      Dec(Len2);
      Inc(k, PByte(Data)[Len2] shl 16);
    case2:
      Dec(Len2);
      Inc(k, PByte(Data)[Len2] shl 8);
    case1:
      Dec(Len2);
      Inc(k, PByte(Data)[Len2]);
      k:= k * integer(c1);
      k:= (k shl r1) or (k shr (32 - r1));
      k:= k * c2;
      Result:= Result xor k;
    final:
      Result:= Result xor Len;

      Result:= Result xor (Result shr 16);
      Result:= Result * integer(f1);
      Result:= Result xor (Result shr 13);
      Result:= Result * integer(f2);
      Result:= Result xor (Result shr 16);
    end;


    function MurmurHash3(const [ref] HashData; Len: integer; Seed: integer = 0): integer;
    {$ifdef purepascal}
    inline;
    begin
      Result:= PascalMurmurHash3(HashData, Len, Seed);
    end;
    {$else}
    const
      c1 = $CC9E2D51;
      c2 = $1B873593;
      r1 = 15;
      r2 = 13;
      m = 5;
      n = $E6546B64;
      f1 = $85EBCA6B;
      f2 = $C2B2AE35;
    {$REGION 'asm'}
    {$IFDEF CPUx86}
    asm
        push EBX
        push EDI
        push ESI
        xchg ECX,EDX
        //EAX = data
        //ECX = count in bytes
        //EDX = seed
        mov  ESI,ECX        //save the original length for the finalization (hash xor len)
        add EAX, ECX
        neg ECX
        add ECX,4
        jg @remaining_bytes
    @loop:
        mov  EDI,[EAX+ECX-4]
        imul EDI,EDI,c1
        rol  EDI,r1
        imul EDI,EDI,c2
        xor  EDX,EDI
        rol  EDX,r2
        lea  EDX,[EDX*4+EDX+n]
        add  ECX,4
        jle @loop
    @remaining_bytes:
        cmp  ECX,4
        jz @finalization
        movzx EBX,byte ptr [EAX+ECX-4]
        cmp  ECX,3//inc  RCX
        jz @process_remaining
        ror  EBX,8//ror  R9d,24
        mov  BL,byte ptr [EAX+ECX-3]
        cmp  ECX,2//inc  RCX
        jz @shift_back
        ror  EBX,8//ror  R9d,24
        mov  bl,byte ptr [EAX+ECX-2]
        rol  EBX,8
    @shift_back:
        rol  EBX,8
    @process_remaining:
        imul EBX,EBX,c1
        rol  EBX,r1
        imul EBX,EBX,c2
        xor  EDX,EBX
    @finalization:
        xor  EDX,ESI
        mov  EAX,EDX
        shr  EDX,16
        xor  EDX,EAX
        imul EDX,EDX,f1
        mov  EAX,EDX
        shr  EDX,13
        xor  EDX,EAX
        imul EDX,EDX,f2
        mov  EAX,EDX
        shr  EDX,16
        xor  EAX,EDX
        pop  ESI
        pop  EDI
        pop  EBX
    end;
    {$ENDIF}
    {$IFDEF CPUx64}
    asm
        .NOFRAME
        xchg R10,RDI
        xchg R11,RSI
        mov  RAX,RCX
        mov  RCX,RDX
        mov  RDX,R8
        //RAX = data
        //RCX = count in bytes
        //RDX = seed
        mov  ESI,ECX
        //make index negative based.
        //and  ECX,not(3)
        neg  RCX
        sub  RAX, RCX
        add  RCX, 4
        jg @remaining_bytes
    @loop:
        mov  EDI,[RAX+RCX-4]
        imul EDI,EDI,c1
        rol  EDI,r1

  87. imul EDI,EDI,c2
        xor  EDX,EDI
        rol  EDX,r2
        lea  EDX,[EDX*4+EDX+n] // *5 + n
        //lea  RAX,qword ptr [RAX+4]
        add  RCX,4
        jle @loop
    @remaining_bytes:
        cmp  RCX,4
        //mov  ECX,ESI
        //and  ESI,$3
        jz @finalization
        movzx r9,byte ptr [RAX+RCX-4]
        cmp  RCX,3//inc  RCX
        jz @process_remaining
        ror  R9,8//ror  R9d,24
        mov  R9b,byte ptr [RAX+RCX-3]
        cmp  RCX,2//inc  RCX
        jz @shift_back
        ror  R9,8//ror  R9d,24
        mov  R9b,byte ptr [RAX+RCX-2]
        rol  R9,8
    @shift_back:
        rol  R9,8
    @process_remaining:
        imul R9d,R9d,c1
        rol  R9d,r1
        imul R9d,R9d,c2
        xor  EDX,R9d
    @finalization:
        xor  EDX,ESI
        mov  EAX,EDX
        shr  EDX,16
        xor  EDX,EAX
        imul EDX,EDX,f1
        mov  EAX,EDX
        shr  EDX,13
        xor  EDX,EAX
        imul EDX,EDX,f2
        mov  EAX,EDX
        shr  EDX,16
        xor  EAX,EDX
        xchg R10,RDI
        xchg R11,RSI
    end;
    {$ENDIF}
    {$ENDREGION}
    {$ENDIF}

  88. A. Bouchez

    Collisions are the same as crc32, see: http://www.strchr.com/hash_functions (the numbers in square brackets are the collisions).

    My lowly laptop/workstations do not have crc32c built in, so I was only able to test against the FastCRC32c code.
    The results were that for hashes of small data (e.g. string[10]) MurmurHash was significantly faster, and for large data (e.g. string[1000]) slightly slower.
    See the code in the post above.

    The lookup table needs a while to warm up in the cache.
    MurmurHash3 only has the loop, so it will always be faster for intermittent hashing.

    Given the fact that hashing is a core ingredient in TDictionary, and small keys are very common there, I think a fast hash for small data is a big win.
    Ideally the hash should be selectable based on the size of the key (sketched below).
     
    Numbers: I'm working on a performance-test suite for my FastCode project; there's a (basic) speed test for MurmurHash3 at: https://github.com/JBontes/FastCode/tree/master/Test
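    For illustration, a hypothetical size-based dispatcher (not from the thread; it assumes the PascalMurmurHash3 from the snippet above and a crc32c(crc, buf, len) function as in SynCommons, and the threshold is just a placeholder to be tuned by benchmarking):

    function HashKey(const Data; Len: Integer): Cardinal;
    const
      SmallKeyThreshold = 64; // assumed cut-off
    begin
      if Len <= SmallKeyThreshold then
        Result := Cardinal(PascalMurmurHash3(Data, Len, 0))
      else
        Result := crc32c(0, PAnsiChar(@Data), Len); // hardware-accelerated for longer keys
    end;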

  89. Johan Bontes Your crc32c() test is still broken on GitHub. You are using Length(HashTest) as the length instead of Length(HashTest[i]).
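    Presumably the fix is a one-liner in the benchmark loop, something along these lines (the surrounding names are guesses; only the Length() change comes from the comment above):

    // wrong: hashes Length(HashTest) bytes - the element count of the test array
    // Hash := crc32c(0, PAnsiChar(HashTest[i]), Length(HashTest));
    // right: hash each test string with its own length
    Hash := crc32c(0, PAnsiChar(HashTest[i]), Length(HashTest[i]));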

  90. A. Bouchez
    Oh, I'll get right on it.
    With the correct length, the timings for crc32c and murmur are very close, both on small and large sets.
    Committed the corrected code to GitHub.

