Mobile platforms support too?
It all depends: how big your data is, what architectures you are targeting, and so on. You need to supply more details to get a better answer, I suspect. I also doubt that the Delphi Developers group is the best place to get advice on floating-point performance.
Don't forget that memory access is slow, and modern CPUs will do what they can to munch on instructions while waiting for memory. Unless you are actually streaming data, you'll likely see very little, if any, benefit from using SIMD.
Okay, more details. Two functions are of interest:
dot(const v1, v2: TVector3f): TFloat;
cross(const v1, v2: TVector3f): TVector3f;
where TFloat might be either single or double.
Platform: x64.
Maybe someone has a .pas file with XMM instructions for those functions, or maybe someone has a library compiled with the Intel C++ compiler ready to share?
For now I just want to compare the two implementations, with and without SIMD, performance-wise.
I want to use those functions in a sphere-against-triangle collision detection procedure.
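For the with/without-SIMD comparison, a scalar baseline might look like the sketch below. It is written in C++ rather than Delphi, since the thread points to C++ compilers for this kind of experiment. `Vec3` is a hypothetical stand-in for TVector3f, and the template parameter plays the role of TFloat (single or double). The cross product returns a vector here, the dot product a scalar.

```cpp
// Hypothetical stand-in for the thread's TVector3f; T plays the role of TFloat.
template <typename T>
struct Vec3 {
    T x, y, z;
};

// Scalar dot product: three multiplies, two adds.
template <typename T>
T dot(const Vec3<T>& a, const Vec3<T>& b) {
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

// Scalar cross product: six multiplies, three subtractions.
template <typename T>
Vec3<T> cross(const Vec3<T>& a, const Vec3<T>& b) {
    return {a.y * b.z - a.z * b.y,
            a.z * b.x - a.x * b.z,
            a.x * b.y - a.y * b.x};
}
```

Instantiating it as `Vec3<float>` or `Vec3<double>` covers both precisions, so any SIMD variant can be checked and timed against the same reference.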
Vitali Burkov Is it one (or a few) spheres against a few million triangles? Or many vs many?
It is performance-critical code that will be called "a million times", Asbjørn Heid.
The caching part is implemented already; the question is about the CD function.
Vitali Burkov It's been a few years since I wrote serious SSE code, but from what I know I'd be surprised if you see any major benefit from using SSE unless you pass arrays as inputs.
As for FPU fallback... unless you have plans to port to other platforms, I wouldn't waste time on that. Mainstream CPUs from a decade ago already support SSE2, which is all you need.
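To make the "arrays as inputs" point concrete, here is a rough C++ sketch that computes four dot products per loop iteration from a structure-of-arrays layout. The function name `dot_batch`, the separate x/y/z arrays, and the `SKETCH_HAVE_SSE` macro are assumptions made for illustration, not anything from the thread. The scalar tail loop doubles as a fallback when SSE is unavailable.

```cpp
#include <cstddef>

#if defined(__SSE__) || defined(_M_X64)
#include <xmmintrin.h>  // SSE1 intrinsics
#define SKETCH_HAVE_SSE 1
#endif

// Hypothetical batch routine: out[i] = dot(a[i], b[i]) for n vector pairs,
// with each component stored in its own array (structure of arrays).
void dot_batch(const float* ax, const float* ay, const float* az,
               const float* bx, const float* by, const float* bz,
               float* out, std::size_t n) {
    std::size_t i = 0;
#ifdef SKETCH_HAVE_SSE
    // Four dot products at a time; unaligned loads avoid alignment assumptions.
    for (; i + 4 <= n; i += 4) {
        __m128 r = _mm_mul_ps(_mm_loadu_ps(ax + i), _mm_loadu_ps(bx + i));
        r = _mm_add_ps(r, _mm_mul_ps(_mm_loadu_ps(ay + i), _mm_loadu_ps(by + i)));
        r = _mm_add_ps(r, _mm_mul_ps(_mm_loadu_ps(az + i), _mm_loadu_ps(bz + i)));
        _mm_storeu_ps(out + i, r);
    }
#endif
    // Scalar tail (and full fallback when SSE is not available).
    for (; i < n; ++i) {
        out[i] = ax[i] * bx[i] + ay[i] * by[i] + az[i] * bz[i];
    }
}
```

In this layout every SIMD lane does useful work and no horizontal add is needed, which is where most of the benefit over a single 3-vector dot product would come from.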
I've never written SSE code in Delphi, only x86 and x64 asm, Asbjørn Heid.
Here is the triangle-sphere intersection code I've found: http://realtimecollisiondetection.net/blog/?p=103.
It looks like it could benefit from being implemented using SSE.
I also found this article about dot and cross products implemented with SSE, in C++ I guess:
http://tomjbward.co.uk/simd-optimized-dot-and-cross/
What do you think, is it possible to port it to Delphi?
The problem is that 3 is an odd number, so it is not terribly amenable to SSE. I suspect you'll need to do some learning rather than hoping to get some code from somewhere that just "works".
It's not a problem, David Heffernan. I'll just set W=0.
It's a problem in the sense that it's hard to get the full benefit. There's also the issue of alignment. You likely won't have fully aligned data, although I think modern SSE units are more forgiving than in days past.
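The W=0 padding idea can be sketched with SSE intrinsics in C++ roughly as follows. `Vec4`, `dot3`, `cross3` and the `SKETCH_HAVE_SSE` macro are hypothetical names, the code assumes the W lane is kept at 0 by the caller, and unaligned loads are used so that nothing depends on 16-byte alignment. A scalar path is kept for builds without SSE.

```cpp
#if defined(__SSE__) || defined(_M_X64)
#include <xmmintrin.h>  // SSE1 intrinsics
#define SKETCH_HAVE_SSE 1
#endif

// Hypothetical padded 3-vector: v[0..2] = x, y, z; v[3] (W) must stay 0.
struct Vec4 {
    float v[4];
};

float dot3(const Vec4& a, const Vec4& b) {
#ifdef SKETCH_HAVE_SSE
    __m128 p  = _mm_mul_ps(_mm_loadu_ps(a.v), _mm_loadu_ps(b.v)); // (xx, yy, zz, 0)
    __m128 hi = _mm_movehl_ps(p, p);                              // (zz, 0, zz, 0)
    __m128 s  = _mm_add_ps(p, hi);                                // lane0 = xx+zz, lane1 = yy+0
    __m128 s1 = _mm_shuffle_ps(s, s, _MM_SHUFFLE(1, 1, 1, 1));    // broadcast lane1
    return _mm_cvtss_f32(_mm_add_ss(s, s1));                      // xx + zz + yy
#else
    return a.v[0] * b.v[0] + a.v[1] * b.v[1] + a.v[2] * b.v[2];
#endif
}

Vec4 cross3(const Vec4& a, const Vec4& b) {
    Vec4 r;
#ifdef SKETCH_HAVE_SSE
    __m128 va = _mm_loadu_ps(a.v);
    __m128 vb = _mm_loadu_ps(b.v);
    // Rotate lanes to (y, z, x, w).
    __m128 a_yzx = _mm_shuffle_ps(va, va, _MM_SHUFFLE(3, 0, 2, 1));
    __m128 b_yzx = _mm_shuffle_ps(vb, vb, _MM_SHUFFLE(3, 0, 2, 1));
    // t = (cross.z, cross.x, cross.y, 0); one more rotation restores x, y, z order.
    __m128 t = _mm_sub_ps(_mm_mul_ps(va, b_yzx), _mm_mul_ps(a_yzx, vb));
    _mm_storeu_ps(r.v, _mm_shuffle_ps(t, t, _MM_SHUFFLE(3, 0, 2, 1)));
#else
    r.v[0] = a.v[1] * b.v[2] - a.v[2] * b.v[1];
    r.v[1] = a.v[2] * b.v[0] - a.v[0] * b.v[2];
    r.v[2] = a.v[0] * b.v[1] - a.v[1] * b.v[0];
    r.v[3] = 0.0f;
#endif
    return r;
}
```

Note how much of the dot product is shuffling and horizontal adding rather than arithmetic; that overhead, plus the wasted fourth lane, is one way to read the "hard to get the full benefit" remark.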
Do you use SSE yourself, David Heffernan?
+Vitali For some things, yes.
How large is the performance gain? The best case, the worst case? Is it worth it?
For a 3-vector scalar product I'm not sure there is a discernible gain. It will depend on many factors. I suggest you do some exploration and learning. Prediction is hard even when you have experience and knowledge, and futile without.
I'd also question whether Delphi is the best tool for exploration. You might do better with a C++ compiler with good SSE intrinsics.
If you try a pure Pascal version for x64, don't forget to specify {$EXCESSPRECISION OFF} for more speed: http://docwiki.embarcadero.com/RADStudio/Seattle/en/Floating_point_precision_control_%28Delphi_for_x64%29
See https://www.delphitools.info/2011/09/09/happy-excessprecision-off/
Okay, thank you David Heffernan. I've got to sleep ).
I noticed that at least Delphi 10 has support for SSE4 instructions in the asm block, so that's something.
If you really need the highest performance for a specific case like intersections, then unless you have very few objects, the dot, cross and other vector stuff will probably take second place to higher-level intersection detection.
Depending on how your scene is structured, there are many techniques that can apply: BSP, quad-trees, voxels, depth maps... which can make a whole lot more difference (orders of magnitude) than SIMD (2x faster, best case).
The fastest intersection test is the one you do not do :)
Also, in the particular case of primitive collision detection, the implementation depends a lot on what you want out of the "detection": a simple boolean result, a point+vector, etc. Also, if the relative size of the primitives is known, this can sometimes drastically cut down on the math (for instance, huge sphere vs. small triangle or small sphere vs. huge triangle can be handled with simpler math than if both are of similar size).
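When only a boolean answer is needed, one cheap early-out is a plane check: a sphere whose center is farther from the triangle's plane than its radius cannot touch the triangle. The C++ sketch below shows just that rejection step, not a full intersection routine; all names are hypothetical and it is not taken from any participant's code. It compares squared quantities so no square root or normalization is needed.

```cpp
// Hypothetical minimal vector type and helpers for this sketch.
struct Vec3f {
    float x, y, z;
};

static Vec3f sub(const Vec3f& a, const Vec3f& b) {
    return {a.x - b.x, a.y - b.y, a.z - b.z};
}

static float dot(const Vec3f& a, const Vec3f& b) {
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

static Vec3f cross(const Vec3f& a, const Vec3f& b) {
    return {a.y * b.z - a.z * b.y,
            a.z * b.x - a.x * b.z,
            a.x * b.y - a.y * b.x};
}

// Returns true when the sphere lies entirely on one side of the plane through
// triangle (a, b, c), so it cannot intersect the triangle. A false result
// means "maybe": a complete test still has to check edges and vertices.
bool sphere_misses_triangle_plane(const Vec3f& a, const Vec3f& b, const Vec3f& c,
                                  const Vec3f& center, float radius) {
    Vec3f n = cross(sub(b, a), sub(c, a));  // unnormalized plane normal
    float d = dot(sub(center, a), n);       // signed distance scaled by |n|
    // distance > radius  <=>  d^2 > radius^2 * |n|^2
    return d * d > radius * radius * dot(n, n);
}
```

A degenerate (zero-area) triangle gives a zero normal, so the check returns false and the decision falls through to whatever fuller test follows.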
Yeah, I agree, Eric Grange. The CD scheme affects performance a lot more than a constant-factor improvement in the primitive test.
But, as I've said, the caching part is implemented already and it is pretty good: only a few triangles are checked per query.
Parallelization is implemented as well.
I'm just hoping to get this 2x performance gain the C++ developers enjoy "for free".
Vitali Burkov I do it with asm, by adapting the code from Intel/AMD libraries to Delphi calling conventions.
However, Delphi does not support inlining asm routines, and it does not support intrinsics either (which would allow inlining). Even with plain FPU code, when inlining, the compiler will still juggle the registers: it basically only eliminates the call/ret, not the stack juggling (it can sometimes even result in worse juggling). So the benefits from SIMD-optimized cross/dot primitives are smaller than what you get in C++.
So for really critical code, you need to code most of the routine's math in SIMD directly, not just the individual vector operators; otherwise you may lose some of what you gained from SIMD to stack juggling or call overhead.