Using SSE 4.2 Text Instructions with Delphi.

- June 30, 2015

http://blog.synopse.info/post/2015/06/30/Faster-String-process-using-SSE-4.2-Text-Processing-Instructions-STTNI

Comments

José Ramírez, June 30, 2015 at 10:26 AM
Thanks! You can unroll it on non-SSE 4.2 architectures. By the way, this is also possible for UTF-16. In fact, it may be even faster for UTF-16.

utf16slen_sse4_2a from Intel.

Horácio Filho, June 30, 2015 at 10:31 AM
IMHO, injecting assembly code into Delphi programs (outside the RTL) is not a good path for the language. I would rather invest in parallel and concurrent code, which can also bring amazing results. It is just an opinion :D

A. Bouchez, June 30, 2015 at 10:32 AM
José Ramírez: Unrolling is not ideal anymore. It was a good habit back in the Pentium 4 days, but newer CPUs have much better branch prediction. For instance, a rolled AES is now faster than an unrolled AES, in my experiments. See http://www.agner.org/optimize/

José Ramírez, June 30, 2015 at 10:35 AM
Horácio Filho: We need a better compiler; then this ASM stuff wouldn't be necessary in the majority of cases.

A. Bouchez: There are UTF-16 functions here as well.

http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf

A. Bouchez, June 30, 2015 at 10:35 AM
Horácio Filho: You can't do text parsing in parallel (unless the input is... parallel, like maybe English text or a CSV file, but definitely not JSON or XML). All our optimized asm functions have a tuned "pure pascal" version. In fact, StrLen and StrComp have pure pascal, x86, SSE2 and SSE4.2 versions, and the appropriate one is selected when the program starts.
The idea is to stay away from the RTL, which introduces CPU locks, unexpected memory allocations, hidden try..finally blocks, a lack of inlined functions, and other sub-optimal patterns.
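The "selected when the program starts" idea can be illustrated with a function pointer that is assigned once at startup. This is only a minimal sketch, not the SynCommons code: `CpuHasSSE42`, `StrLenOptimizedPlaceholder`, `StrLenPurePascal` and `StrLenImpl` are hypothetical names, the CPU check is a stub that always returns False, and the "optimized" variant simply forwards to the fallback so that the program stays runnable.

```pascal
program DispatchSketch;

{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

type
  TStrLenFunc = function(P: PAnsiChar): Integer;

// Portable fallback: plain byte-by-byte scan.
function StrLenPurePascal(P: PAnsiChar): Integer;
begin
  Result := 0;
  if P <> nil then
    while P[Result] <> #0 do
      Inc(Result);
end;

// Placeholder for an optimized variant. A real one would use SSE 4.2 asm;
// here it only forwards to the fallback so that the sketch stays runnable.
function StrLenOptimizedPlaceholder(P: PAnsiChar): Integer;
begin
  Result := StrLenPurePascal(P);
end;

// Hypothetical CPU check. A real implementation would query CPUID;
// this stub always reports "not supported".
function CpuHasSSE42: Boolean;
begin
  Result := False;
end;

var
  StrLenImpl: TStrLenFunc;

begin
  // Pick the implementation once, at startup.
  if CpuHasSSE42 then
    StrLenImpl := StrLenOptimizedPlaceholder
  else
    StrLenImpl := StrLenPurePascal;

  // Callers only ever go through the function pointer.
  WriteLn(StrLenImpl('HAL'));
end.
```

Callers never test the CPU themselves; they pay one indirect call, and the choice is made a single time.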

Andreas Hausladen, June 30, 2015 at 11:04 AM
Your SSE4 version isn't a 100% replacement for StrComp/StrLen because it makes an assumption about the memory allocation of the strings.

This code crashes with an access violation because StrCompSSE42 accesses a memory page that isn't committed.
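var
  I: PtrInt;
  S1, S2: PAnsiChar;
begin
  // Reserve two pages, but commit only the first one.
  S1 := VirtualAlloc(nil, 4096 * 2, MEM_RESERVE, PAGE_READWRITE);
  S1 := VirtualAlloc(S1, 4096, MEM_COMMIT, PAGE_READWRITE);
  // Place the 4-byte string ('HAL' + #0) at the very end of the committed page.
  S1 := S1 + 4096 - 4;
  StrCopy(S1, 'HAL');
  S2 := 'HAL';
  // StrLenSSE42(S1);
  // The 16-byte read runs into the uncommitted second page: access violation.
  StrCompSSE42(S1, S2);
end;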

José Ramírez, June 30, 2015 at 11:13 AM
What about Wide versions? :O

Dirk Carstensen, June 30, 2015 at 11:39 AM
Nice! Is it possible to replace RTL functions with $DOPATCHTRTL? So I just add SynCommons to the project's uses clause and gain performance? Best regards, Dirk

José Ramírez, June 30, 2015 at 11:40 AM
Dirk Carstensen: Technically, yes.

Asbjørn Heid, June 30, 2015 at 2:22 PM
Andreas Hausladen: Good point. NVIDIA's OpenGL driver made similar assumptions, causing my Delphi code to crash. Very annoying. The optimized version should have a check + fallback.

A. Bouchez, June 30, 2015 at 2:57 PM
Indeed. This is documented as such: the SSE2 and SSE4.2 versions of StrLen() and StrComp() may read a few bytes after the incoming buffer, so they are not safe to use with, e.g., memory-mapped files. This is why there are StrLenPas() and StrCompFast() functions, which are safe.
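The "check + fallback" suggested earlier in the thread could look roughly like the sketch below. It relies on memory protection being applied per page: a 16-byte read that stays inside the page holding its first byte cannot fault, even if it runs past the end of the string. The names `ChunkStaysInPage`, `PAGE_SIZE` and `CHUNK_SIZE` are hypothetical, the 4 KB page size is an assumption (as in the VirtualAlloc example above), and this is not how SynCommons is documented to behave. The addresses in the demo are made up and are never dereferenced.

```pascal
program PageCheckSketch;

{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

const
  PAGE_SIZE  = 4096; // assumption: 4 KB pages
  CHUNK_SIZE = 16;   // bytes touched by one 16-byte SSE read

// Hypothetical helper: True when a CHUNK_SIZE-byte read starting at P
// stays inside the memory page that contains P.
function ChunkStaysInPage(P: Pointer): Boolean;
var
  Offset: NativeUInt;
begin
  Offset := NativeUInt(P) and (PAGE_SIZE - 1); // position within the page
  Result := Offset <= PAGE_SIZE - CHUNK_SIZE;
end;

begin
  // 4 bytes before the end of a page: a 16-byte read would cross into
  // the next page, so a caller should fall back to a byte-wise routine.
  WriteLn(ChunkStaysInPage(Pointer(NativeUInt($10FFC))));

  // Exactly 16 bytes before the end of a page: the read still fits.
  WriteLn(ChunkStaysInPage(Pointer(NativeUInt($10FF0))));
end.
```

A caller would test the pointer before each 16-byte read and switch to a byte-by-byte loop (such as the safe StrLenPas() mentioned above) only for the few bytes near a page boundary, so the fast path is kept for almost all of the input.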

David Heffernan, July 1, 2015 at 1:28 AM
A. Bouchez: In what circumstances could it be reasonable to read beyond the end of the buffer?

José Ramírez, July 1, 2015 at 5:13 AM
A. Bouchez: Does the data have to be aligned?

A. Bouchez, July 1, 2015 at 9:20 AM
David Heffernan: The PcmpIstrI xmm0,dqword [edx+eax] opcode reads 16 bytes of memory and compares them to the 16 bytes in xmm0, in a single opcode and a few cycles. It is pretty reasonable, since it is much faster.
You won't have any problem with memory allocated from the heap (e.g. with strings), from the stack, or from the read-only code section of the exe, since in all these cases there should be additional data after the text.
The only potential issue arises when reading from a memory-mapped file, which is pretty uncommon. In this case, you have other functions to fall back on.
José Ramírez: The PcmpIstrI instruction does not require proper alignment, AFAIK from Intel's documentation. This is one of the beauties of this instruction: it was designed for XML parsing, where the start position of the text is almost never aligned.

David Heffernan, July 1, 2015 at 10:08 AM
A. Bouchez: I don't see why there would necessarily be valid addresses beyond a heap block.

A. Bouchez, July 1, 2015 at 12:50 PM
David Heffernan: AFAIK, this is how FastMM4 works.