I hope a fix for this will be back-ported for many Delphi versions.
System.pas:
procedure YieldProcessor;
{$IF (defined(CPUX86) or defined(CPUX64)) and defined(ASSEMBLER)}
asm
PAUSE
end;
{$ELSE}
begin
end;
{$ENDIF}
Called from TMonitor.Spin, TThread.SpinWait and TInternalConditionVariable.LockQueue.
Duplicates in TMonitor.Spin, TThread.SpinWait, TInternalConditionVariable.LockQueue and getmem.inc.
Originally shared by Kristian Köhntopp
The "Pause" instruction changed timing dramatically in Skylake. Spinlock implementation based on pause will need adjustments.
https://aloiskraus.wordpress.com/2018/06/16/why-skylakex-cpus-are-sometimes-50-slower-how-intel-has-broken-existing-code/
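To check what a PAUSE-based yield costs on a given machine, a rough timing loop like the one below could help. This is only a sketch: `PauseOnce` is a hypothetical local copy of the YieldProcessor body quoted above, the iteration count is arbitrary, and the result includes call and loop overhead, so it is only useful for comparing one CPU against another.

```pascal
program PauseCost;
{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.Diagnostics;

// Hypothetical local copy of the YieldProcessor body quoted above.
procedure PauseOnce;
{$IF (defined(CPUX86) or defined(CPUX64)) and defined(ASSEMBLER)}
asm
  PAUSE
end;
{$ELSE}
begin
end;
{$ENDIF}

const
  Iterations = 10000000; // arbitrary; large enough to smooth out timer noise

var
  SW: TStopwatch;
  I: Integer;
  NsPerCall: Double;
begin
  SW := TStopwatch.StartNew;
  for I := 1 to Iterations do
    PauseOnce;
  SW.Stop;
  // Milliseconds -> nanoseconds, divided by the number of calls.
  NsPerCall := SW.Elapsed.TotalMilliseconds * 1.0e6 / Iterations;
  Writeln(Format('%.2f ns per call (includes call and loop overhead)', [NsPerCall]));
end.
```

Running the same binary on a pre-Skylake and a Skylake machine would show whether the per-call cost differs, which is a first step before blaming any particular spin lock.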
Why would they start back porting now? Nothing ever gets back ported.
But before anybody gets too excited, I think somebody would need to show how this might actually affect some real-world programs.
Fix? I cannot even find the bug report... has this even been reported?
I cannot see any reports (either internal or external). I can report it internally, but then the issue will not be visible to everyone...
Roy Nelson What would you report? Has anybody even shown that there is an issue with RTL spin lock code? And why tie this to Intel? What's the PAUSE latency like on AMD processors? Clue, it's different again.
David Heffernan if the performance hit is as bad as the guy claims then obviously we would need to test, or at the very least have it logged, to see if we can replicate what the chap says... and if we do see the same issue in the RTL, to hopefully be able to fix it?
Roy Nelson Nobody can report anything yet because nobody has produced any evidence that there is anything wrong. That's my point. This thread is a little pointless. Just because somebody somewhere wrote a crappy spin lock doesn't mean that the Delphi RTL spinlocks are also crappy. Not that I'd have much confidence in Delphi RTL synchronisation code but that's another matter and we should give it the benefit of the doubt at least.
/sub
Reading the comments in SyncObjs.pas since Delphi XE "This type is modeled after a similar type available in .NET 4.0 as System.Threading." and this being a problem in .NET 4, there is now http://qc.embarcadero.com/wc/qcmain.aspx?d=144063
Jeroen Wiert Pluimers That's a very poor bug report because it is based entirely on speculation. It's also in the wrong place. It's meant to be Quality Portal.
I am disappointed.
David Heffernan feel free to amend it or show in another way you do better.
Jeroen Wiert Pluimers The onus is on the submitter to submit good reports.
David Heffernan They did backports of fixes from Tokyo to Berlin (and even Seattle).
The only thing the Delphi RTL doesn't seem to suffer from is the multiply-by-processor-count issue. Most stuff looks quite similar to .NET (from a brief look). It has an exponential backoff and seems to trigger PAUSE a lot. If I had the CPU I'd love to test that right now.
Except for Bobcat (2011) with 6 and Jaguar (2013) with 46, AMD doesn't seem to have extra latency added to PAUSE. But I found no data for Ryzen. So Intel does seem like the odd one out here, currently.
www.agner.org/optimize/instruction_tables.pdf
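For readers who have not looked at this kind of code, the "exponential backoff plus PAUSE" pattern mentioned above can be sketched roughly as follows. This is not the RTL implementation; `PauseOnce`, `TryAcquire`, `Release` and the cap `MaxPausesPerRound` are hypothetical names and values chosen for illustration. The point is that the number of PAUSE calls doubles per failed attempt, so a more expensive PAUSE stretches every round unless the cap is kept small.

```pascal
program BackoffSketch;
{$APPTYPE CONSOLE}
{$ASSERTIONS ON}

uses
  System.SysUtils, System.Classes, System.SyncObjs;

// Hypothetical local copy of the YieldProcessor body from the original post.
procedure PauseOnce;
{$IF (defined(CPUX86) or defined(CPUX64)) and defined(ASSEMBLER)}
asm
  PAUSE
end;
{$ELSE}
begin
end;
{$ENDIF}

const
  MaxPausesPerRound = 64; // assumed cap; a real value would need measuring

// Tries to flip Lock from 0 to 1, backing off between failed attempts.
// Returns True if the lock was taken within MaxAttempts attempts.
function TryAcquire(var Lock: Integer; MaxAttempts: Integer): Boolean;
var
  Attempt, Pauses, I: Integer;
begin
  Pauses := 1;
  for Attempt := 1 to MaxAttempts do
  begin
    if TInterlocked.CompareExchange(Lock, 1, 0) = 0 then
      Exit(True);
    for I := 1 to Pauses do
      PauseOnce;
    if Pauses < MaxPausesPerRound then
      Pauses := Pauses * 2   // exponential backoff
    else
      TThread.Yield;         // cap reached: give up the time slice instead
  end;
  Result := False;
end;

procedure Release(var Lock: Integer);
begin
  TInterlocked.Exchange(Lock, 0);
end;

var
  Lock: Integer;
begin
  Lock := 0;
  Assert(TryAcquire(Lock, 10));      // free lock is taken at once
  Assert(not TryAcquire(Lock, 10));  // held lock: all attempts fail
  Release(Lock);
  Assert(TryAcquire(Lock, 10));      // free again after release
  Writeln('ok');
end.
```

Whether the RTL's own counts are too high on Skylake is exactly the open question in this thread; the sketch only shows where the PAUSE cost enters the loop.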
Alexander Benikowski AMD PAUSE can be 50 clocks I think
David Heffernan The tables I linked say something along the lines of 40+ ops. But Steamroller from 2014 has only 8. So it varies. But no extra latency as in Intel's PAUSE.
I guess FastMM4 is affected too, since the asm explicitly calls "pause" in its spinlocks. IMHO this may have a much bigger impact on multi-threaded apps on Skylake.
Good reports... bad reports... if they are reported in QC and have never been opened... they are all dead reports.
The only 2 "pause" instructions in FastMM 4.991 are used when NeverSleepOnThreadContention is active. One in FastGetMem, another in FastFreeMem. As long as you keep this disabled, it should be OK.
I have added this as an internal report (RS-88454), copying Jeroen Wiert Pluimers' QC text... So we will keep an eye on this...
Roy Nelson thanks Roy!
Why is it even possible to add new reports to QC?
Alexandre Machado In the cut-down version embedded in Delphi (not the full FastMM4) there is no such conditional and `pause` is always executed in the asm. So the problem may occur in 99% of Win32/Win64 Delphi programs (i.e. the ones not compiled with external FastMM4), for each and every multi-threaded memory allocation. So it is NOT OK at all. :(
A. Bouchez Where did you get the "99%" number from? I don't know a single Delphi application built with the built-in memory manager. Unless we are talking about "Hello World" type of applications. So, in my case 100% of Delphi applications I know are safe.
Alexandre Machado So you are in the 1%. :) Of course, this was a guess, most probably wrong, but there is (was) no benefit in using FastMM4 instead of the built-in memory manager, which is a cut-down version of FastMM4, at least since Delphi 2006 when it was introduced, IIRC. At all the companies I worked for or audited, they use the built-in heap, and only use FastMM4 for full debug mode. The Delphi IDE itself doesn't use FastMM4, but only the cut-down internal version, which sadly uses pause. Of course, it is not heavily multi-threaded, so I guess it won't affect its speed. :)
I understand that a lot of people use the full version of FastMM4 because they want that extra debug functionality, but I've never needed it personally and I would bet most companies use the built-in FastMM. Would love to see some statistics tho, are there any?
I'm with A. Bouchez. Most companies I know only use full FastMM for the debug stuff.