Delphi Developers Archive

- December 14, 2015

So how do you, the performance heads, compute vector dot and cross products in single and double precision in Delphi? SIMD with a fallback to the FPU, maybe?

Comments

Horácio Filho, December 14, 2015 at 8:38 AM
Mobile platform support too?

David Heffernan, December 14, 2015 at 9:28 AM
It all depends: how big your data is, what architectures you are targeting, and so on. You need to supply more details to get a better answer, I suspect. I also doubt that the Delphi Developers group is the best place to get advice on floating-point performance.

Asbjørn Heid, December 14, 2015 at 11:06 AM
Don't forget that memory access is slow, and modern CPUs will do what they can to munch on instructions while waiting for memory. Unless you are actually streaming data, you'll likely see very little, if any, benefit from using SIMD.

Vitali Burkov, December 14, 2015 at 11:07 AM
Okay, more details. Two functions are of interest:
function dot(const v1, v2: TVector3f): TFloat;
function cross(const v1, v2: TVector3f): TVector3f;
where TFloat might be either Single or Double.
Platform: x64.
Maybe someone has a .pas file with XMM instructions for those functions, or maybe someone has a library compiled with the Intel C++ compiler ready to share?

For now I just want to compare the two implementations, with and without SIMD, performance-wise.

I want to use those functions in a Sphere against Triangle collision detection procedure.
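
For reference, a plain Pascal baseline of the two functions named above might look like the sketch below. It is only an illustration, not code from the thread: the record layout of TVector3f, the TFloat alias and the Vec helper are assumptions, and TFloat can be switched to Double to compare precisions.

```pascal
program DotCrossBaseline;

{$APPTYPE CONSOLE}

type
  // Assumption: TFloat is an alias chosen at compile time (Single or Double).
  TFloat = Single;

  // Assumption: a plain three-component record, as the signatures suggest.
  TVector3f = record
    X, Y, Z: TFloat;
  end;

// Hypothetical helper that builds a vector from its components.
function Vec(const X, Y, Z: TFloat): TVector3f;
begin
  Result.X := X;
  Result.Y := Y;
  Result.Z := Z;
end;

// Scalar (dot) product: the sum of the component-wise products.
function Dot(const v1, v2: TVector3f): TFloat;
begin
  Result := v1.X * v2.X + v1.Y * v2.Y + v1.Z * v2.Z;
end;

// Vector (cross) product: a vector perpendicular to both inputs.
function Cross(const v1, v2: TVector3f): TVector3f;
begin
  Result.X := v1.Y * v2.Z - v1.Z * v2.Y;
  Result.Y := v1.Z * v2.X - v1.X * v2.Z;
  Result.Z := v1.X * v2.Y - v1.Y * v2.X;
end;

var
  A, B, C: TVector3f;
begin
  A := Vec(1, 2, 3);
  B := Vec(4, 5, 6);
  C := Cross(A, B);
  // 1*4 + 2*5 + 3*6 = 32
  Writeln('dot   = ', Dot(A, B):0:1);
  // (2*6 - 3*5, 3*4 - 1*6, 1*5 - 2*4) = (-3, 6, -3)
  Writeln('cross = ', C.X:0:1, ' ', C.Y:0:1, ' ', C.Z:0:1);
end.
```

A baseline like this gives the "without SIMD" side of the comparison and a known-good result to check any SIMD version against.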

Asbjørn Heid, December 14, 2015 at 11:09 AM
Vitali Burkov Is it one (or a few) spheres against a few million triangles? Or many vs many?

Vitali Burkov, December 14, 2015 at 11:14 AM
It is performance-critical code that will be called "a million times", Asbjørn Heid.
The caching part is implemented already; the question is about the collision detection (CD) function.

Asbjørn Heid, December 14, 2015 at 11:20 AM
Vitali Burkov It's been a few years since I wrote serious SSE code, but from what I know I'd be surprised if you see any major benefit from using SSE unless you pass arrays as inputs.

As for the FPU fallback... unless you plan to port to other platforms, I wouldn't waste time on that. Decade-old mainstream CPUs support SSE2, which is all you need.

Vitali Burkov, December 14, 2015 at 11:31 AM
I've never written SSE code in Delphi, only x86 and x64 asm, Asbjørn Heid.
Here is the triangle-sphere intersection code I've found: http://realtimecollisiondetection.net/blog/?p=103
It looks like it could benefit from being implemented using SSE.
I also found this article about dot and cross products implemented with SSE, for a C++ compiler, I guess:
http://tomjbward.co.uk/simd-optimized-dot-and-cross/
What do you think, is it possible to port it to Delphi?

David Heffernan, December 14, 2015 at 11:56 AM
The problem is that 3 is an odd number, so a 3-vector is not terribly amenable to SSE, whose 128-bit registers hold four singles or two doubles. I suspect you'll need to do some learning rather than hoping to get some code from somewhere that just "works".

Vitali Burkov, December 14, 2015 at 11:59 AM
It's not a problem, David Heffernan. I'll just set W=0.

David Heffernan, December 14, 2015 at 12:09 PM
It's a problem in the sense that it's hard to get the full benefit. There's also the issue of alignment. You likely won't have fully aligned data. I think modern SSE units are more forgiving than in days past.
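
The W=0 padding and the alignment concern can be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not code from the thread: TVector4f and DotSSE are hypothetical names, the asm path assumes the Win64 calling convention (const records of this size passed by reference in RCX and RDX, Single result in XMM0), and unaligned loads (movups) are used so the records need no 16-byte alignment. Other targets fall back to plain Pascal.

```pascal
program DotSseSketch;

{$APPTYPE CONSOLE}

type
  // Hypothetical padded vector: W stays 0 so it adds nothing to the sum.
  TVector4f = record
    X, Y, Z, W: Single;
  end;

{$IFDEF WIN64}
// Assumes Win64: v1 arrives by reference in RCX, v2 in RDX,
// and the Single result is returned in the low lane of XMM0.
function DotSSE(const v1, v2: TVector4f): Single;
asm
  movups  xmm0, [rcx]      // unaligned load of v1 (X, Y, Z, W)
  movups  xmm1, [rdx]      // unaligned load of v2
  mulps   xmm0, xmm1       // four products: XX, YY, ZZ, WW
  movhlps xmm1, xmm0       // low lanes of xmm1 = (ZZ, WW)
  addps   xmm0, xmm1       // lane 0 = XX + ZZ, lane 1 = YY + WW
  movaps  xmm1, xmm0
  shufps  xmm1, xmm1, $55  // broadcast lane 1 to all lanes
  addss   xmm0, xmm1       // lane 0 = XX + YY + ZZ + WW
end;
{$ELSE}
// Fallback for targets where the asm above does not apply.
function DotSSE(const v1, v2: TVector4f): Single;
begin
  Result := v1.X * v2.X + v1.Y * v2.Y + v1.Z * v2.Z + v1.W * v2.W;
end;
{$ENDIF}

var
  A, B: TVector4f;
begin
  A.X := 1; A.Y := 2; A.Z := 3; A.W := 0;
  B.X := 4; B.Y := 5; B.Z := 6; B.W := 0;
  // 1*4 + 2*5 + 3*6 + 0*0 = 32
  Writeln('dot = ', DotSSE(A, B):0:1);
end.
```

Note that one of the four lanes does no useful work here, and the horizontal add at the end costs several instructions, which is part of why the gain for a single 3-vector product may be hard to see.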

Vitali Burkov, December 14, 2015 at 12:12 PM
Do you use SSE yourself, David Heffernan?

David Heffernan, December 14, 2015 at 12:27 PM
+Vitali For some things, yes.

Vitali Burkov, December 14, 2015 at 12:32 PM
How large is the performance gain? The best case, the worst case? Is it worth it?

David Heffernan, December 14, 2015 at 12:38 PM
For a 3-vector scalar product I'm not sure there is a discernible gain. It will depend on many factors. I suggest you do some exploration and learning. Prediction is hard even when you have experience and knowledge, and futile without.

I'd also question whether or not Delphi is the best tool for exploration. You might do better with a C++ compiler with good SSE intrinsics.

Radek Červinka, December 14, 2015 at 12:42 PM
If you try a pure Pascal version for x64, don't forget to specify {$EXCESSPRECISION OFF} for better speed with Single values: http://docwiki.embarcadero.com/RADStudio/Seattle/en/Floating_point_precision_control_%28Delphi_for_x64%29

See also https://www.delphitools.info/2011/09/09/happy-excessprecision-off/
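
As a rough illustration of where the directive goes, the sketch below times a Single-precision dot product in a loop. On x64 the default setting computes intermediate Single results in Double precision, and {$EXCESSPRECISION OFF} keeps them in Single. The record type, the iteration count and the loop itself are assumptions made for the example; the directive is wrapped in a CPUX64 check as a precaution for other targets.

```pascal
program ExcessPrecisionSketch;

{$APPTYPE CONSOLE}

{$IFDEF CPUX64}
  {$EXCESSPRECISION OFF}  // keep Single expressions in single precision
{$ENDIF}

uses
  System.Diagnostics;

type
  // Assumption: a plain three-component Single vector.
  TVector3f = record
    X, Y, Z: Single;
  end;

function Dot(const v1, v2: TVector3f): Single;
begin
  Result := v1.X * v2.X + v1.Y * v2.Y + v1.Z * v2.Z;
end;

const
  Iterations = 10000000;  // hypothetical workload size

var
  A, B: TVector3f;
  Sum: Double;
  I: Integer;
  Timer: TStopwatch;
begin
  A.X := 1; A.Y := 2; A.Z := 3;
  B.X := 4; B.Y := 5; B.Z := 6;
  Sum := 0;

  Timer := TStopwatch.StartNew;
  for I := 1 to Iterations do
  begin
    A.X := I and 7;           // vary the input a little on each pass
    Sum := Sum + Dot(A, B);   // use the result so the call is not dead code
  end;
  Timer.Stop;

  Writeln('sum = ', Sum:0:1);
  Writeln('elapsed ms = ', Timer.ElapsedMilliseconds);
end.
```

Building the same program once with the directive and once without gives a first, crude measurement; a loop this simple is dominated by call overhead, so the numbers are only indicative.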

Vitali Burkov, December 14, 2015 at 12:44 PM
Okay, thank you David Heffernan. I've got to sleep ).

Asbjørn Heid, December 14, 2015 at 1:31 PM
I noticed at least Delphi 10 has support for SSE4 instructions in the asm block, so that's something.

Eric Grange, December 15, 2015 at 5:29 AM
If you really need highest performance for a specific case like intersections, then unless you have very few objects, the dot, cross and other vector stuff will probably take second place to higher level intersection detection.

Depending on how your scene is structured, there are many techniques that can apply: BSP, quad-trees, voxels, depth maps... which can make a whole lot more difference (orders of magnitude) than SIMD (2x faster, best case).

The fastest intersection test is the one you do not do :)

Also, in the particular case of primitive collision detection, the implementation depends a lot on what you want out of the "detection": whether it's a simple boolean result, a point+vector, etc. If the relative size of the primitives is known, this can sometimes drastically cut down on the math (for instance, huge sphere vs small triangle or small sphere vs huge triangle can be handled with simpler math than if both are of similar size).

Vitali Burkov, December 15, 2015 at 7:26 AM
Yeah, I agree, Eric Grange. The CD scheme affects the performance a lot more than a constant-factor improvement in the primitive test.
But, as I've said, the caching part is implemented already and it is pretty good: only a few triangles are checked per query.
Parallelization is implemented as well.
I'm just hoping to get this 2x performance gain the C++ developers enjoy "for free".

Eric Grange, December 16, 2015 at 12:28 AM
Vitali Burkov I do it with asm, by adapting code from the Intel/AMD libraries to Delphi's calling conventions.

However, Delphi does not support inlining asm routines, and does not support intrinsics either (which would allow inlining). Also, even with plain FPU code, inlining still leaves the compiler juggling registers: it basically only eliminates the call/ret, not the stack juggling (and can sometimes result in worse juggling). So the benefits from SIMD-optimized cross/dot primitives are smaller than what you get in C++.

So for really critical code, you need to code most of the routine's math in SIMD directly, not just the individual vector operators; otherwise you may lose some of what you gained from SIMD to stack juggling or call overhead.