Time to focus on the compiler purpose, right?

IMHO a compiler is there to generate correct and efficient code...
http://blog.synopse.info/post/2015/06/21/Why-FPC-may-be-a-better-compiler-than-Delphi

Comments

  1. Well this is exactly what I've been telling Marco Cantù and he just won't listen. Sadly.

    A free product such as FreePascal generates faster code than Delphi. Think about this for a moment. This is the current reality.

    Results were posted by +Eric Grange some time ago on his website.

  2. I think it's pretty clear that the vast majority of the Delphi customer base doesn't care much about raw performance as long as it's "good enough".

    On the flip side, FPC seems to lack some big language features...

  3. Asbjørn Heid This has to be one of the most common optimization tricks out there. I don't know why Delphi doesn't do it.

  4. José Ramírez I read a lot about x86 optimization tricks back in the 386 days, and I can't remember seeing this one. So I'm not so sure about "most common". But I guess these days it should be well known to the compiler guys (the trick is sketched below).

    However, it should be fairly clear that the Delphi compiler hasn't received much tender loving care in the optimization department for the last 10 years, if not more.
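    For context, a minimal sketch of the trick in question, assuming it is the reciprocal-multiplication rewrite of integer division by a constant described in the linked post (the magic constant below is for unsigned division by 3):

    // What an optimizing compiler can emit for "x div 3": a multiply and a
    // shift instead of a slow DIV instruction.
    function Div3(x: Cardinal): Cardinal; inline;
    begin
      Result := (UInt64(x) * UInt64($AAAAAAAB)) shr 33;
    end;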

  5. Asbjørn Heid Did you check http://www.getlazarus.org/language ? The FPC 3 branch added a lot of nice language features.
    There are still some missing features, e.g. anonymous methods or advanced RTTI, but they have dedicated branches in the FPC repository, and their implementation path is IMHO superior to Delphi's.

  6. A. Bouchez AFAIK they are lacking a couple of features: delayed loading, anonymous methods, some generics stuff, and that's about it. It's getting there, though.

    However! Their 32-bit compiler can generate SSE2-SSE4.1 code, which Delphi cannot! And it can statically link GCC code, and probably Clang as well. Bit-packed records...

    I've personally statically linked OpenSSL with GCC & FPC before for Indy.

    Overall it's looking pretty good.

  7. A. Bouchez Yeah, for me the most glaring omission is the lack of anonymous methods and "reference to function". I use them quite a lot these days.

    Once FPC gets that I'll take a good look at it.

  8. Pfff, a better compiler just because it can do some optimizations the Delphi one cannot, while lacking language features that the Delphi compiler has had for more than half a decade? Yeah, sure...

  9. Stefan Glienke A faster compiler vs "some" features. I'll take the faster compiler, and so would anyone with half a brain. If speed doesn't matter, I might as well just start writing C#.

  10. José Ramírez Yeah, I bet that some optimized divisions are making your applications ten times faster...

  11. Stefan Glienke It's not just about divisions, it's other stuff as well, like floating point performance etc. You wouldn't care anyway, you deal mostly with interfaces. :)

    It's old, but I doubt much has changed.
    https://www.delphitools.info/2014/05/08/delphi-xe6-32bits-and-scimark/

  12. José Ramírez If they got rid of the excessive stack juggling the Delphi compiler likes to do, it would matter a LOT more than the integer division optimization.

    But the point still stands. People who care about performance go elsewhere these days. Even Java can be significantly faster than Delphi.

  13. I don't have too many complaints about the code - as I deal with databases, which are slower than molasses compared to x86 code.

    Personally, I would like to see a proper overhaul of the IDE to fix the lockups, leaks, and debugger crashes - taking priority over compiler work (beyond obvious fixes).

  14. Everyone has their pet desires. I'd like a compiler that produced much better code because that is of critical importance to me. It shouldn't be one or the other. Why can't we have a compiler that produces efficient code, a language with rich features, a stable IDE, and a high quality RTL?

  15. I am forced to believe that there is a substantial disconnect between the corporate vision and the everyday needs which concern those of us who depend on the tools for our livelihood. And "trust us" doesn't get it done.

  16. Bill Meyer Without radical change in corporate management nothing will change.

  17. One thing is certain - nothing said in this community has the slightest effect on the EMBT corporate vision (or lack thereof).

    We can perhaps sway the priorities of the Delphi team - but that is about all.

  18. It should be certain, as we were certainly lectured on the irrelevance of our opinions many times in the newsgroups. I can't recall any of us ever being right....

  19. Which is why I prefer this community to focus on actual coding. There is enough poisonous and futile debate elsewhere.

  20. Lars Fosdal One of the great things in this community is that it continues to deliver much value.

  21. Lars Fosdal Another great thing is that when legitimate rants appear, they almost always offer valid criticism, and once the issue has been raised and received a round of comments, it dies down. So different to the experience in certain newsgroups. ;)

  22. Is the Delphi 64-bit compiler using LLVM?

  23. José Ramírez Not the desktop one, no.

  24. LLVM is only used in the mobile compilers. dcc32, dcc64, and dccosx all use the legacy compiler.

  25. David Heffernan WOW, even OSX uses the legacy compiler. They use different compilers for the same language. Must be a PITA to maintain all of them, especially the legacy stuff.
    I bet the LLVM stuff they use is like 5 years old.
    But then again, not surprising at all from a company that forgets to fix TList before releasing it to the public.

  26. José Ramírez The legacy compiler wasn't written by them either. They bought it in and modified it to their needs. I also think it's so fast because it was written (originally at least) outside the building! Very cynical!

  27. David Heffernan Now that's funny. So that explains the ASM code in the RTL/VCL. Fake it until you make it! :D

  28. Lars Fosdal Have you ever opened the RTL or VCL? It's riddled with assembly, and guess why: because the compiler is slow! That's what I meant by fake it (speed with asm) until you make it (a better compiler).

  29. Please clarify, José Ramírez - Are you saying the RTL assembly code is slow, the compiler is slow, or the compiler generated code is slow?

    I wonder if anyone ever made an overview of which ASM blocks were added or modified in which Delphi version.

  30. Lars Fosdal The RTL code has tons of assembly instead of plain Pascal to compensate for the poor compiler. I am sure you understand. Add tons of IFDEFs + other nonsense and you get lost quickly.

    And since the IDE is compiled with the Delphi compiler, improving it would fix a lot of your problems as well. Hence a better compiler is needed.

    If they only added /LARGEADDRESSAWARE to allow 4 GB of address space in the IDE, it would make a lot of difference (see the sketch below)!

    I guess sending mass promotion emails is more important for the company.
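    For reference (not from the thread): a Delphi application can set that PE flag itself with a one-line directive; the IDE would need Embarcadero to do the same. A minimal sketch:

    // Typically placed in the project .dpr; the constant comes from Winapi.Windows.
    {$SETPEFLAGS IMAGE_FILE_LARGE_ADDRESS_AWARE}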

  31. There are currently approx. 720 asm blocks in the RTL/VCL spread over 48 units out of a total of 2050+ units (per XE7.1 Enterprise).  Without having checked - I assume several of the asm blocks are duplicated for Win32 vs Win64.

    unit:      2943 occurrences in 2059 files
    function:  222172 occurrences in 1870 files
    procedure: 138664 occurrences in 1719 files
    asm:       778 occurrences in 48 files

    This hardly qualifies as "tons."

    As for getting lost - the debugger puts you right on the spot - and if you can't read assembly, you either study up - or press shift-F8.

    Can the compiler do better?  Not this one, I guess - and moving to a new one requires more than a bit of work and regression testing up the wazoo.  I'd like to see an LLVM compiler for the desktop too - but it is likely to mean yet another round of new and exciting bugs - and come at the cost of not fixing the existing problems with the IDE.

    With limited resources, you can only do limited work. Even if the devs probably don't handwrite the mass promotion emails.

  32. Lars Fosdal You may think that's not a lot, but without those 700 or so asm blocks the application would be dead slow. It's not how often they are used but where they are used. And by fixing the compiler you partially fix the IDE as well, since it is used to compile the IDE itself.

    Well, you can't do much about Error Insight, since that's written in .NET - C#, I presume.

  33. C++Builder:
    asm:    found 288 occurrences in 47 files
    __emit: found 58 occurrences in 13 files

    out of a total of 2256 files (*.cpp *.c *.h *.inc)

  34. Lars Fosdal Most of those are in string constants ;)
    In System.hpp there is no asm code.

  35. Also - none of the asm or __emit references are in string constants, but are embedded asm in the C code.

    Oh... and a slight oversight:
    There are 191 .asm files in the C++Builder source/cpprtl directories
    There is 1 (one) in the VCL source directory.

    So - we have established the fact that the current Delphi compiler for Win/OSX is old, and that the RTL needs ASM to keep its speed up.
    Now, deal with it.

  36. It's certainly the case that the legacy Delphi compiler has never produced especially efficient code. Well, perhaps it did 20 years ago, but it hasn't moved on.

    As somebody who does primarily floating point work, the poor quality of the code that the compiler emits (both dcc32 and dcc64) is frustrating. Code like mine, heavily based on floating point, would be significantly faster if it had codegen of similar quality to any of the well known C++ compilers.

  37. David Heffernan - I don't think you'll find anyone that contests the need for a compiler overhaul.

    Some of the math support in CPP looks to be hard coded assembly. In theory, some of that could be ported to Delphi as well - but, it wouldn't solve the stack handling issues in the current Delphi compiler.

    The Delphi language does give you a lot of headroom for tweaking routines in assembly, though - which should give us the ability to work on optimized floating point libraries for the community?

  38. It would be cheaper to move the entire project over to C++ than to tinker around with assembly. It's not exactly trivial to write an optimized assembly routine. And how exactly will you target SSE-AVX2 hardware with your custom assembly? This stuff is for the compiler to handle.

  39. +Lars It's not enough to write your own asm routines. You need help from the compiler.

  40. IMO, it's trivial to write an optimized assembly routine, compared to moving to a new compiler and ensuring that it is backwards compatible with billions of lines of Delphi code (with its interspersed asm).

    That is a pretty big task.  It was easier for the new platforms - as you had no backwards compatibility to care for. I'm not saying it can't be done - but - it is a lot of work - and we won't know if the benefits will be tangible until it has been done.  What if it turns out that it is the RTL/VCL that has most of the bottlenecks?

  41. David Heffernan - For generic cross platform FPU code, yes.

  42. Lars Fosdal For specific single target code too. If the compiler is emitting slow x87 instructions instead of fast SSE instructions, or doing lame stack juggling, for built in operators, then there's not much we can do. Nobody is going to want to convert their entire code base to asm.

    For instance, I use a third party sparse solver. The original C code executes much faster than my Pascal port, which is a literal port. The bulk of the code is loops and expressions using built in arithmetic operators. No amount of hand crafted asm libs is going to help here. Nothing short of a total asm port, or linking to code emitted by a different compiler, will help.

    As you might imagine, I link to the C version. But for my own code, that is subject to development, that's not a viable option.
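    For completeness, a sketch of what "linking to code emitted by a different compiler" can look like on the Delphi side (file and routine names are made up; whether a given object file links depends on its format, so a DLL import is often the practical route):

    // Static object file produced by a C compiler:
    {$L solver.obj}
    function sparse_solve(a: PDouble; n: Integer): Integer; cdecl; external;

    // Or, more portably, import from a DLL built with the other compiler:
    function sparse_solve_dll(a: PDouble; n: Integer): Integer; cdecl;
      external 'solver.dll' name 'sparse_solve';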

  43. Who cares about x86 nowadays? Delphi-generated x64 math code is roughly 50% faster than the x86 version with optimization.

  44. Vitali Burkov - You'd be surprised at how incredibly slow corporate IT can be at rolling out new OS versions.  I know of companies that roll out 32-bit images on machines that come from the vendor with 8+ GB RAM - simply because 32-bit Windows is the corporate standard.

  45. Vitali Burkov Not in my experience. More like 10-20%. If you use 32-bit third-party libraries then you may be stuck with 32 bit.

    I'd say >99% of my clients now use my x64 version. Of course, dcc64 produces much worse code than respectable C++ compilers.  Typical, really.

  46. Lars Fosdal  For companies that do serious floating point number crunching I think x64 is prevalent. People running business apps are much more likely to cling to 32 bit systems.

  47. Define serious fp number crunching?

  48. FPC produces nice x64 code, and much smaller executables than dcc64, from my findings.

  49. Lars Fosdal Algorithmic Trading down in central London?

  50. Most algorithmic trading models are pretty simple.  Their biggest advantage was to be faster to react than your average trader. Now they need to be faster than your competing algorithms.  But - it's not really number crunching with extreme performance requirements. High frequency traders are not highly regarded by their peers, it seems.  Then again, traders have no sense of value - just sense of margin.

  51. David Heffernan - Well, there isn't that much off-the-shelf software out there which really is computationally heavy.  It usually falls into math and statistics (Mathematica, Wolfram, SAS Institute, etc.), business intelligence (Oracle, QlikView), and to a certain degree audio and video processing.  On the scientific side, you usually add more horsepower on the hardware side (parallelism, CUDA et al.). Other examples would be welcome.

  52. Oh, forgot modelling and analysis tools. I know a guy that writes programs in Delphi to feed Abaqus data for FEM analysis.

  53. That explains your need, for sure. Is CUDA an option?

  54. +Lars Not really. As I see it, GPU programming is good for small-kernel, highly parallelisable tasks. We have a large kernel. Probably a rewrite from scratch might benefit from the GPU. Perhaps using a bespoke GPU sparse solver would help. None of that is very tractable for us.

  55. David Heffernan FEM in Delphi, impressive. Just curious, have you looked at the FEniCS project, in particular DOLFIN?

  56. David Heffernan They generate C++ code "just-in-time" style based on a higher-level description of the equations (in Python currently). The C++ code is compiled as a .so/.dll and linked back in. Quite fancy. Since the code is generated they can exploit symmetries and constants in the input. And since it's C++ code, it's quite fast...

    Anyway, as I said earlier, just getting rid of the stack juggling should result in a very significant performance boost, not just in integer code but also in floating-point code. As such this should be quite a high priority on the compiler side, IMO.

  57. Asbjørn Heid Stack juggling is just the first step, but I am afraid the two compilers are so far behind the rest of the competition out there that a radical move may be needed.

    +David Heffernan Have you checked out Dew Research Math library for Delphi? I think they have fast math algorithms that use SSE etc.

  58. José Ramírez My linear algebra libraries are better than theirs, at least for my needs.

  59. José Ramírez I doubt the Delphi compiler will ever match C++ in speed for non-trivial code, mostly due to language restrictions. As such it's a pointless goal, I think.

  60. Asbjørn Heid Hence it's pointless to apply tiny patches to the compiler - just replace it.

  61. Asbjørn Heid The only real restriction affecting perf is that classes have to be heap-allocated. Otherwise there's no reason that a good Delphi compiler could not be as fast as a good C++ compiler. And as for classes being on the heap, stack-allocated classes in C++ for non-POD types are a minefield anyway, so it's hardly a limitation. If you want stack-allocated structures, use records (see the sketch below).
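    For illustration, a minimal sketch of the record approach (a made-up TVec3 type, not from the thread): record instances live on the stack, or inline inside other structures, and need no Create/Free.

    type
      TVec3 = record
        X, Y, Z: Double;
        function Dot(const B: TVec3): Double;
      end;

    function TVec3.Dot(const B: TVec3): Double;
    begin
      Result := X * B.X + Y * B.Y + Z * B.Z;
    end;

    procedure Demo;
    var
      A, B: TVec3;  // stack-allocated, no heap traffic, nothing to Free
    begin
      A.X := 1; A.Y := 2; A.Z := 3;
      B.X := 4; B.Y := 5; B.Z := 6;
      Writeln(A.Dot(B));
    end;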

  62. José Ramírez David Heffernan C++ has const and move semantics, stack-based objects with copy/move constructors and destructors, template metaprogramming, constexpr...

    A Delphi compiler doesn't stand a chance due to the Delphi language in its current form. Sure you can do "asm in Delphi"-like programming and probably get fairly close, but that defeats the point of using a higher-level language.

  63. Asbjørn Heid This is exactly why we are porting our project to C++. But I have a hard time deciding between wxWidgets and Qt.

  64. Asbjørn Heid I don't see much of that presenting serious hurdles to perf. I think a better compiler would get >90% of the way there.

  65. José Ramírez The cross-platform Open Source project I worked on moved from WxWidgets to Qt. The MOC stuff and such is a bit of a pain, but other than that I vastly prefer Qt.

  66. David Heffernan I can agree on 70% ;)

  67. Lars Fosdal About RTL bottlenecks and LLVM: some of our users, developing for Android and iOS, found out that memory allocation was very slow on those devices. It sounds like the default implementation (using the POSIX __malloc/_free) may be the culprit...
    Did anyone experiment with this?

  68. A. Bouchez Sounds like a poor Memory Manager, as expected. FastMM4 doesn't exist for mobile.

  69. +José FastMM is a poor performing MM.

  70. Particularly for multithreaded apps.

  71. While FastMM may not be the greatest memory allocator, a core issue is IMO that the Delphi language by its nature leads to an excessive number of memory allocations.

  72. Lars Fosdal For desktop apps, FastMM4 is just great, and for multi-threaded apps, if you do not abuse memory allocation, FastMM4 has great performance. It is much lighter and more efficient than the Windows API local heap, or the C runtime malloc, for instance.
    Asbjørn Heid You are right. But it depends on how you write your code. In practice, FastMM4 does its best to avoid reallocation most of the time (e.g. a string concatenation using FastMM4 is faster than TStringBuilder AFAIR). Some new features (like anonymous methods) do silently allocate memory (see the small example below)...
    In all cases, you should always know a little about what is under the hood. ;)
    IMHO we should not blame FastMM4: it is one of the best parts of the current Delphi RTL - and it was contributed by Pierre Le Riche, so it was not done on the Embarcadero side. I wish the latest versions of the RTL were following its lead!
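    As an aside, a minimal example of that silent allocation (hypothetical code, purely illustrative): capturing a local variable in an anonymous method makes the compiler allocate a hidden heap object to hold the captured state.

    uses System.SysUtils; // for TFunc<Integer>

    function MakeCounter: TFunc<Integer>;
    var
      Count: Integer;
    begin
      Count := 0;
      // Count is captured, so a hidden reference-counted object is allocated
      // on the heap behind the scenes; each call to MakeCounter allocates again.
      Result :=
        function: Integer
        begin
          Inc(Count);
          Result := Count;
        end;
    end;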

  73. A. Bouchez - Is this info out of date?
    http://stackoverflow.com/questions/6072269/need-multi-threading-memory-manager

    Our server app (a service) has 30-50 threads, depending on the number of clients - and more than 60% of the threads do concurrent memory allocations, albeit rarely at a high rate.  There may be 5 to 10 worker threads which can peak with a high rate of allocations, network and DB traffic at the same time, during the closing of large deliveries.

    The best part of FastMM4 is that it has rarely, if ever, failed me.

  74. Lars Fosdal SAPMM has appeared since then, and it sounds nice to me - it may help in your case: http://blog.synopse.info/post/2013/12/05/New-Open-Source-Multi-Thread-ready-Memory-Manager%3A-SAPMM
    The updated project repository is https://github.com/alan008/sapmm
    For typical mORMot servers, memory allocation is pretty low, so we found out that FastMM4 was good enough in practice, even on production servers.
    The FPC memory manager is very robust and multi-thread friendly, but it is slower than FastMM4 for single-threaded work.

  75. I don't agree with A. Bouchez here. My experience is that FastMM is slower than almost any other memory manager under contention. Slower than the malloc from msvcrt in my tests. Best is to avoid heap allocation where possible. The Emba RTL often makes that hard.

  76. David Heffernan This is what I wrote: FastMM4 is great, unless you have contention.
    And AFAIR the NeverSleepOnThreadContention flag helps (see below for where it is set).
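    For reference, that flag is a conditional define in FastMM4Options.inc (it can also be set as a project-level define before FastMM4 is compiled); a minimal sketch, not a recommendation:

    // FastMM4Options.inc (or the project's conditional defines)
    {$define NeverSleepOnThreadContention}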

  77. +A. Bouchez That option makes things worse in my experience. You get higher CPU utilisation but slower runtime.

  78. David Heffernan The Emba RTL is slow outside some particular areas (the FastCode functions and some oldies); you have to avoid it in the first place when you are after performance, contention or no contention.

  79. A. Bouchez as an update to your article, you could use the full Knuth quote

    https://en.wikipedia.org/wiki/Program_optimization#When_to_optimize

    and the subsequent one

    "In established engineering disciplines a 12% improvement, easily obtained, is never considered marginal and I believe the same viewpoint should prevail in software engineering"

  80. In the tests that I've done, Lazarus is nowhere near as fast as Delphi. In some cases, yeah, but generally I'm not impressed.

    Eric Grange 
    Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%.

  81. Johan Bontes I disagree, most programmers do not care about performance at all. And by most, I mean an overwhelming majority.
    Usually they do not even care about data volumes.

  82. Eric Grange This sounds like the hard truth. I've seen this in so many companies... programmers do not plan for performance (since it is a hard task, especially when your C# or Java environment abstracts everything away), and managers don't focus on it (mostly due to the cost). Then the clients start to arrive in numbers, the whole system turns out to have been underestimated, and the programmers are asked to do wonders of performance... on a code base which was never designed nor prepared for it.
    A typical issue is abusing the DB (putting everything in the RDBMS), which becomes a true bottleneck. Or never looking at the execution plan of a complex SQL query auto-magically generated by LINQ or the ORM. Never using the step-by-step debugger (even within the core libraries) to see what is involved in a single line of code. It looks like the "but it works on my dev workstation!" syndrome.
    I've seen companies disappear due to such bad design. Or others spending huge amounts of money to rewrite the initial quick & dirty project. And even some companies leaving Delphi for C# or Java, because the technical managers believed in the wonders of performance promised by the marketers, and put brand new coders, fresh out of school and still wet behind the ears, on the "future flagship product of our company". With the success you can imagine.

  83. A. Bouchez
    Speaking of performance, I managed to outperform the Fastcrc32c hash :-) using hand-coded assembly of MurmurHash3 (well, on my machine at least).

  84. Johan Bontes Will you share your implementation? Also the non-asm code? :)

  85. Johan Bontes AFAIR MurmurHash3 has more collisions than crc32c. This is to be taken into account. Would you post your numbers on http://synopse.info/forum/profile.php?id=1374 ? Did you test crc32c with hardware acceleration (i.e. the crc32c() function in SynCommons, not crc32fast)? I hope your MurmurHash3 implementation is more correct than the benchmark code you posted on your forum, which was buggy.

  86. José Ramírez
    Here's the code:
    {$pointermath on}
    function PascalMurmurHash3(const [ref] HashData; Len, Seed: Integer): Integer;
    const
      c1 = $cc9e2d51;
      c2 = $1b873593;
      r1 = 15;
      r2 = 13;
      m = 5;
      n = $e6546b64;
      f1 = $85ebca6b;
      f2 = $c2b2ae35;
    var
      i: NativeInt;
      Len1, Len2: integer;
      k: integer;
      data: PCardinal;
    label case1, case2, case3, final;
    type
      ByteArray = array[0..0] of byte;
    begin
      Result:= seed;
      data:=@HashData;
      Len1:= (Len shr 2)-1;
      for i:= 0 to Len1 do begin
        k:= data[i];
        k:= k * integer(c1);
        k:= (k shl r1) or (k shr (32-r1));
        k:= k * c2;
        Result:= Result xor k;
        Result:= (Result shl r2) or (Result shr (32-r2));
        Result:= Result * m + integer(n);
      end; {for i}
      k:= 0;
      Len2:= Len;
      case Len and $3 of
        1: goto case1;
        2: goto case2;
        3: goto case3;
        else goto final;
      end;
    case3:
      Dec(Len2);
      Inc(k, PByte(Data)[Len2] shl 16);
    case2:
      Dec(Len2);
      Inc(k, PByte(Data)[Len2] shl 8);
    case1:
      Dec(Len2);
      Inc(k, PByte(Data)[Len2]);
      k:= k * integer(c1);
      k:= (k shl r1) or (k shr (32 - r1));
      k:= k * c2;
      Result:= Result xor k;
    final:
      Result:= Result xor Len;

      Result:= Result xor (Result shr 16);
      Result:= Result * integer(f1);
      Result:= Result xor (Result shr 13);
      Result:= Result * integer(f2);
      Result:= Result xor (Result shr 16);
    end;


    function MurmurHash3(const [ref] HashData; Len: integer; Seed: integer = 0): integer;
    {$ifdef purepascal}
    inline;
    begin
      Result:= PascalMurmurHash3(HashData, Len, Seed);
    end;
    {$else}
    const
      c1 = $CC9E2D51;
      c2 = $1B873593;
      r1 = 15;
      r2 = 13;
      m = 5;
      n = $E6546B64;
      f1 = $85EBCA6B;
      f2 = $C2B2AE35;
    {$REGION 'asm'}
    {$IFDEF CPUx86}
    asm
        push EBX
        push EDI
        push ESI
        xchg ECX,EDX
        //EAX = data
        //ECX = count in bytes
        //EDX = seed
        mov  ESI,ECX        //save the original length for the finalization (hash xor len)
        add EAX, ECX
        neg ECX
        add ECX,4
        jg @remaining_bytes
    @loop:
        mov  EDI,[EAX+ECX-4]
        imul EDI,EDI,c1
        rol  EDI,r1
        imul EDI,EDI,c2
        xor  EDX,EDI
        rol  EDX,r2
        lea  EDX,[EDX*4+EDX+n]
        add  ECX,4
        jle @loop
    @remaining_bytes:
        cmp  ECX,4
        jz @finalization
        movzx EBX,byte ptr [EAX+ECX-4]
        cmp  ECX,3//inc  RCX
        jz @process_remaining
        ror  EBX,8//ror  R9d,24
        mov  BL,byte ptr [EAX+ECX-3]
        cmp  ECX,2//inc  RCX
        jz @shift_back
        ror  EBX,8//ror  R9d,24
        mov  bl,byte ptr [EAX+ECX-2]
        rol  EBX,8
    @shift_back:
        rol  EBX,8
    @process_remaining:
        imul EBX,EBX,c1
        rol  EBX,r1
        imul EBX,EBX,c2
        xor  EDX,EBX
    @finalization:
        xor  EDX,ESI
        mov  EAX,EDX
        shr  EDX,16
        xor  EDX,EAX
        imul EDX,EDX,f1
        mov  EAX,EDX
        shr  EDX,13
        xor  EDX,EAX
        imul EDX,EDX,f2
        mov  EAX,EDX
        shr  EDX,16
        xor  EAX,EDX
        pop  ESI
        pop  EDI
        pop  EBX
    end;
    {$ENDIF}
    {$IFDEF CPUx64}
    asm
        .NOFRAME
        xchg R10,RDI
        xchg R11,RSI
        mov  RAX,RCX
        mov  RCX,RDX
        mov  RDX,R8
        //RAX = data
        //RCX = count in bytes
        //RDX = seed
        mov  ESI,ECX
        //make index negative based.
        //and  ECX,not(3)
        neg  RCX
        sub  RAX, RCX
        add  RCX, 4
        jg @remaining_bytes
    @loop:
        mov  EDI,[RAX+RCX-4]
        imul EDI,EDI,c1
        rol  EDI,r1

  87. imul EDI,EDI,c2
        xor  EDX,EDI
        rol  EDX,r2
        lea  EDX,[EDX*4+EDX+n] // *5 + n
        //lea  RAX,qword ptr [RAX+4]
        add  RCX,4
        jle @loop
    @remaining_bytes:
        cmp  RCX,4
        //mov  ECX,ESI
        //and  ESI,$3
        jz @finalization
        movzx r9,byte ptr [RAX+RCX-4]
        cmp  RCX,3//inc  RCX
        jz @process_remaining
        ror  R9,8//ror  R9d,24
        mov  R9b,byte ptr [RAX+RCX-3]
        cmp  RCX,2//inc  RCX
        jz @shift_back
        ror  R9,8//ror  R9d,24
        mov  R9b,byte ptr [RAX+RCX-2]
        rol  R9,8
    @shift_back:
        rol  R9,8
    @process_remaining:
        imul R9d,R9d,c1
        rol  R9d,r1
        imul R9d,R9d,c2
        xor  EDX,R9d
    @finalization:
        xor  EDX,ESI
        mov  EAX,EDX
        shr  EDX,16
        xor  EDX,EAX
        imul EDX,EDX,f1
        mov  EAX,EDX
        shr  EDX,13
        xor  EDX,EAX
        imul EDX,EDX,f2
        mov  EAX,EDX
        shr  EDX,16
        xor  EAX,EDX
        xchg R10,RDI
        xchg R11,RSI
    end;
    {$ENDIF}
    {$ENDREGION}
    {$ENDIF}

  88. A. Bouchez

    Collisions are the same as crc32, see: http://www.strchr.com/hash_functions (the numbers in square brackets are the collisions).

    My lowly laptop/workstations do not have crc32c built in, so I was only able to test against the FastCRC32c code.
    The results were that for hashes of small data (e.g. string[10]) MurmurHash was significantly faster, and for large data (e.g. string[1000]) slightly slower.
    See the code in the post above.

    The lookup table needs a while to warm up in the cache.
    MurmurHash3 only has the loop, so it will always be faster for intermittent hashing.

    Given the fact that hashing is a core ingredient in TDictionary, and small keys are very common there, I think a fast hash for small data is a big win.
    Ideally the hash should be selectable based on the size of the key (sketched below).
     
    Numbers: I'm working on a performance-test suite for my FastCode project; there's a (basic) speed test for MurmurHash3 at: https://github.com/JBontes/FastCode/tree/master/Test
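    For illustration, a hypothetical size-based dispatcher (not from the thread; it assumes the PascalMurmurHash3 from the snippet above and a crc32c(crc, buf, len) function as in SynCommons, and the threshold is just a placeholder to be tuned by benchmarking):

    function HashKey(const Data; Len: Integer): Cardinal;
    const
      SmallKeyThreshold = 64; // assumed cut-off
    begin
      if Len <= SmallKeyThreshold then
        Result := Cardinal(PascalMurmurHash3(Data, Len, 0))
      else
        Result := crc32c(0, PAnsiChar(@Data), Len); // hardware-accelerated for longer keys
    end;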

  89. Johan Bontes Your crc32c() test is still broken on GitHub. You are using Length(HashTest) as the length instead of Length(HashTest[i]).
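    Presumably the fix is a one-liner in the benchmark loop, something along these lines (the surrounding names are guesses; only the Length() change comes from the comment above):

    // wrong: hashes Length(HashTest) bytes - the element count of the test array
    // Hash := crc32c(0, PAnsiChar(HashTest[i]), Length(HashTest));
    // right: hash each test string with its own length
    Hash := crc32c(0, PAnsiChar(HashTest[i]), Length(HashTest[i]));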

  90. A. Bouchez
    Oh, I'll get right on it.
    With the correct length, the timings for crc32c and murmur are very close, both on small and large sets.
    Committed the corrected code to GitHub.

