Using SSE 4.2 Text Instructions with Delphi.

http://blog.synopse.info/post/2015/06/30/Faster-String-process-using-SSE-4.2-Text-Processing-Instructions-STTNI

Comments

  1. Thanks! You can unroll it on non-SSE 4.2 architectures. By the way, this is also possible for UTF16. In fact it may be even faster for UTF16.

    utf16slen_sse4_2a from Intel.

  2. IMHO, injecting assembly code into Delphi programs (outside the RTL) is not a good path for the language. I would rather invest in parallel and concurrent code, which can also bring amazing results. It is just an opinion :D

  3. José Ramírez Unrolling is not ideal anymore. It was a good habit in the old Pentium 4 days, but new CPUs have much better branch prediction. For instance, a rolled AES is now faster than an unrolled AES, in my experiments. See http://www.agner.org/optimize/

  4. Horácio Filho We need a better compiler, and then this ASM stuff wouldn't be necessary in the majority of cases.

    A. Bouchez There are UTF16 functions here as well.

    http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf

  5. Horácio Filho You can't do text parsing in parallel (unless the input is itself parallel, like maybe English text or a CSV file, but definitely not JSON or XML). All our optimized asm functions have a tuned "pure pascal" version. In fact StrLen and StrComp have pure pascal, x86, SSE2 and SSE4.2 versions, and the one to use is selected when the program starts.
    The idea is to stay away from the RTL, which introduces CPU locks, unexpected memory allocations, hidden try..finally blocks, lack of inlined functions, and other sub-optimized patterns.
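
    For readers following along, the start-up selection described in this comment could be sketched roughly as below. This is only an illustration, not the actual SynCommons code: TStrLenFunc, StrLenPascal, StrLenSSE42Stub, CpuHasSSE42 and StrLenFast are hypothetical names, the CPU probe is stubbed out, and the "SSE4.2" routine is a stand-in that simply calls the Pascal one.

```pascal
program StrLenDispatch;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

type
  TStrLenFunc = function(S: PAnsiChar): NativeInt;

// Portable fallback: a byte-by-byte scan that never reads past the #0.
function StrLenPascal(S: PAnsiChar): NativeInt;
begin
  Result := 0;
  if S = nil then
    Exit;
  while S[Result] <> #0 do
    Inc(Result);
end;

// Stand-in for an SSE4.2 routine (hypothetical). Real code would be
// written in asm around PCMPISTRI; here it just defers to the fallback.
function StrLenSSE42Stub(S: PAnsiChar): NativeInt;
begin
  Result := StrLenPascal(S);
end;

// Hypothetical CPU probe. A real one would execute CPUID with EAX=1
// and test bit 20 of ECX; it is stubbed so the sketch stays portable.
function CpuHasSSE42: Boolean;
begin
  Result := False;
end;

var
  StrLenFast: TStrLenFunc;

begin
  // Pick the implementation once, when the program starts.
  if CpuHasSSE42 then
    StrLenFast := StrLenSSE42Stub
  else
    StrLenFast := StrLenPascal;

  // Callers go through the variable and never test the CPU again.
  WriteLn(StrLenFast('HAL'));
end.
```

    The point of the indirection is that the CPU test is paid once at start-up rather than on every call.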

  6. Your SSE4 version isn't a 100% replacement for StrComp/StrLen because it makes an assumption about the memory allocation of the strings.

    This code crashes with an access violation because StrCompSSE42 accesses a memory page that isn't committed.
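      var
        S1, S2: PAnsiChar;
      begin
        // Reserve two pages, but commit only the first one.
        S1 := VirtualAlloc(nil, 4096 * 2, MEM_RESERVE, PAGE_READWRITE);
        S1 := VirtualAlloc(S1, 4096, MEM_COMMIT, PAGE_READWRITE);
        // Place the 4 bytes of 'HAL'#0 at the very end of the committed page.
        S1 := S1 + 4096 - 4;
        StrCopy(S1, 'HAL');
        S2 := 'HAL';
    //    StrLenSSE42(S1);
        // The 16-byte read crosses into the uncommitted second page.
        StrCompSSE42(S1, S2);
      end;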

  7. Nice! Is it possible to replace the RTL functions with $DOPATCHTRTL? So just add SynCommons to the project's uses clause and win performance? Best regards, Dirk

  8. Andreas Hausladen Good point. NVIDIA's OpenGL driver made similar assumptions, causing my Delphi code to crash. Very annoying. The optimized version should have a check + fallback.

  9. Indeed. This is documented as such: the SSE2 and SSE4.2 versions of StrLen() and StrComp() may read a few bytes after the incoming buffer, so they are not OK, e.g. with mapped files. This is the reason why there are StrLenPas() and StrCompFast() functions, which are safe.

  10. A. Bouchez In what circumstances could it be reasonable to read beyond the end of the buffer?

  11. A. Bouchez Does the data have to be aligned?

  12. David Heffernan The PcmpIstrI xmm0,dqword [edx+eax] opcode reads 16 bytes of memory and compares them to the 16 bytes of xmm0, in a single opcode and a few cycles. It is pretty reasonable, since it is much faster.
    You won't have any problem with memory allocated from the heap (e.g. with strings), from the stack, or from the read-only code section of the exe, since all should have extra information after the text.
    Any potential issue arises when reading from a mapped file, which is pretty uncommon. In this case, you have other functions to fall back on.
    José Ramírez The PcmpIstrI instruction does not require proper alignment, AFAIK from Intel's doc. This is one of the beauties of this instruction: it has been defined for XML parsing, so here the start position of the text is almost never aligned.
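
    As a rough illustration of the "check + fallback" idea raised earlier in the thread, the sketch below tests whether a 16-byte read starting at a given address stays inside a single memory page. It is not taken from SynCommons: BlockStaysInPage is a hypothetical helper, and the 4096-byte page size is an assumption borrowed from the repro in comment 6.

```pascal
program PageGuard;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

const
  PAGE_SIZE  = 4096; // assumption: 4 KB pages, as in the repro in comment 6
  BLOCK_SIZE = 16;   // bytes read by one PCMPISTRI memory operand

// True if the 16 bytes starting at Addr all lie in the page containing Addr.
function BlockStaysInPage(Addr: NativeUInt): Boolean;
begin
  Result := (Addr and (PAGE_SIZE - 1)) <= PAGE_SIZE - BLOCK_SIZE;
end;

begin
  // Offset 4080: the read covers bytes 4080..4095 and stays in the page.
  WriteLn(BlockStaysInPage(4096 - 16));
  // Offset 4092 (the 'HAL' case): the read covers 4092..4107 and spills over.
  WriteLn(BlockStaysInPage(4096 - 4));
end.
```

    A caller could use the fast 16-byte path only when such a check succeeds and scan byte by byte otherwise. Whether the extra test is worth its cost on every block is a separate question that this sketch does not try to answer.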

  13. +A. Bouchez I don't see why there should for sure be valid addresses beyond a heap block.

  14. David Heffernan AFAIK this is how FastMM4 works.

