Recursive Parallel For Loops

Recursive Parallel For Loops
I have an algorithm which isn't a sorting algorithm but conceptually works similar to the Quicksort algorithm i.e. it partitions a list and then works on the two partitions recursively. I'd like to utilize multi cores and have separate threads for each partition. However, everything just locks up. It seems the Parallel Programming Library creates too many thread, all waiting on one another. In my case there are approximately 100 threads created. I would have thought the PPL and the OS would have given resources to the active threads but that doesn't seem to be the case. I seem to have the same problem as in this thread (https://goo.gl/GVJ1R2).

The solutions proposed on Stack Overflow aren't helpful. They suggest forking at the root (or first depth) but I'd like to utilize 16+ cores when available.

Is there any approach I can take to solving this problem? Would the OmniThread Library offer a better solution than the PPL?

Comments

  1. Hard to say from here without seeing any code. There are too many things to keep in mind when doing multi-threading. No library will do that for you.
    Divide and conquer tends to fire back when you divide too much and the overhead becomes the limiting factor.

    ReplyDelete
  2. Charges when executing (Tparallel.For) greater than simple (For ... Loop) on small to one million values. You should study the sorting material using CUDA, and there will be more benefits.

    ReplyDelete
  3. Ришат Бикчантаев Any examples to use CUDA with Delphi?

    ReplyDelete
  4. With OTL, you can attack partitioning algorithms with Fork/Join: otl.17slon.com - 6. How-to

    Keep in mind, though, that fork/join has quite high overhead so you should find a sweet spot - the best subproblem size where you should start working on the problem single-threadedly.

    ReplyDelete
  5. Thanks for the input. I'm going to look at Primož Gabrijelčič's OTL in a bit more detail. I also came up with another solution. Give the number of available cores as parameter. After the partition only use the threaded version if there are 2 or more available cores (and divide the number of cores by two when calling the threaded version), else use the single threaded version.

    Here's some pseudo code:

    procedure QuickSort(low, high, cores: integer)
    begin
    PartitionList;
    if cores > 1 then
    Do_In_Parallel(
    QuickSort(low, mid - 1, cores div 2);
    QuickSort(mid + 1, high, cores div 2);
    )
    else
    begin
    QuickSort(low, mid - 1, cores div 2);
    QuickSort(mid + 1, high, cores div 2);
    end;
    end;

    All this requires is an initial estimate of the number of cores.

    ReplyDelete
  6. I don't think the point is that OTL is inherently to be preferred. That's not how I read Primož Gabrijelčič​'s comment at all. He's pointing to an example of an algorithmic approach. That approach can be transferred onto any decent parallel library. The identity of the library is of secondary relevance.

    ReplyDelete
  7. David Heffernan Thanks. I haven't come across the Fork/Join pattern using the PPL. From what I can see it looks to be more low level than "algorithmic". I'll do some more research, but if you can point me to anything I'd appreciate it.

    ReplyDelete
  8. I'm starting to drink the functional programming kool aid. Paralellizing code becomes trivial when everything is a function and the language supports immutability at its core. clojuredocs.org - pmap - clojure.core | ClojureDocs - Community-Powered Clojure Documentation and Examples

    ReplyDelete

Post a Comment