Multithreading is hard.

Multithreading is hard.

I often say this and, boy, am I right!

Few days ago I've published a tutorial (link below) on updating a progress bar from a parallel for loop. Relatively simple code, nothing to go wrong, true?

Wrong! Today I reviewed one of the examples as something was bothering me and then I noticed that all of the progress bar examples (that is, sections from 1.a to 2.c in the tutorial) are bad! Each of them can exhibit unwanted behaviour!

While I prepare an explanation article and fixed code examples (which may take a day or three) you are kindly invited to find the problems in my code and post them here :)

A winner (criteria will be the number of bugs found and quality of explanations) gets a free copy of my "Parallel Programming with OmniThreadLibrary" book plus free access to my "High-level Multithreading with OmniThreadLibrary" webinars.

Comments

  1. for 1.a - 2.c there is a small race condition:

    First a Counter is Interlocked Incremented(thats OK), then some calculation is done(who is faster?) and then a call to Postmessage is done. However, who managed to call Postmessage first? The calculation (if statement, or in the other examples the Round and division PLUS the if statement) run asynchronus. Depending on the interrupt behaviour of the OS and its timing it is possible that thread A starts calculating before Thread B, but Thread B manages to call PostMessage/Queue before Thread A because the OS interrupted Thread A.

    ReplyDelete
  2. 1a) while the update of the ProgressBar is done using Postmessage, the final setting is done using TThread.queue. Which means, it is possible that TParalell.For finishes, TThread.queue is executed and the queued proc is executed before all other windowmessages are executed. (not tested, but i suspect this). Which results in a progressbar jumping to 100%, then back and doing some progress after finishing. I think a fix here is to simply not rely on PostMessage but use TThread.queue.

    ReplyDelete
  3. 1.c) in additon to the race condition mentioned by me in the first comment(which applies here too), this line is a race condition in itself:
    TInterlocked.Exchange(FProgress, Round(new / total * 100));

    So it is possible that Threa B writes first and then Thread A which reults in a progress of 90%, 80% for example and order is completely screwed.

    ReplyDelete
  4. Alexander: First is indeed the problem, the second is implementation-dependent and as such could indeed pose as a problem. I think that in current Win implementation this cannot happen.

    The third is a real problem, though.

    (I found only 1 and 3, not 2 so - great work!)

    ReplyDelete
  5. That covers .a and .c examples. What about 2.b?

    ReplyDelete
  6. i don't seen an error in 2B which is different from the others, but then i am kind of busy with other stuff right now and can't imagine the source properly. Have to write it this evening to see if i find something strange

    ReplyDelete
  7. The issue in all your samples is that you let the worker threads determine the position. This invariably leads to races because a worker thread may get paused by the OS between the increment and the posting of the update.

    You just have various manifestations of this issue.

    The solution is for the workers to post/queue amount of work done, and the code updating the progress bar (which runs synchronously) to increment the position based on total work done.

    ReplyDelete
  8. Alternatively the update code simply reads the current value of "processed". But I prefer the other way, feels more explicit and clean.

    ReplyDelete
  9. Doing some tests with my code (using 1.b mode), I do this to avoid "jumping progress":

    if ProgressBar1.Position < NewPos then
      ProgressBar1.Position:=NewPos;

    I'm not getting any issue, maybe only the (not big) problem with TThread.Queue is all updates are executed almost at the end, when there are many parallel work items and most have already finished.

    ReplyDelete
  10. David Berneda your code removes the symptomes but doesn't fix it ;)

    ReplyDelete
  11. Asbjørn Heid True, but that's basically what Alexander already found.

    There's something else going deeply wrong in the 1.b example, though.

    ReplyDelete
  12. Primož Gabrijelčič I admit I just glossed over the code, and Alexander seemed to split it up in cases when it's clearly the same error in all examples, so I didn't read the details.

    As for 1.b, I can't see anything obvious wrong once you fix the position update. It's a bit confusing though as the progress bar takes some time to animate its way to 100%.

    ReplyDelete
  13. Okay i think i got 1b, thanks to Stefan Glienke, he told
    me that new is shared in this case!
    While the anonymous method is queued several times, there is only ONE instance of it, which is queued multiple times. new is captured in a hidden Object, and this is the same Object for all instances of the anonymous procedure.

    So lets do this:
    new := 1
    queue
    new := 2
    queue
    new := 3
    queue

    if we process the queued events, all of them will calculate the exact same result, as new is 3 FOR ALL of them. Combined with the race condition, the last queued call specifies the value of new, and this can be 80% or something more random.


    So new is like a PInteger here, and all calls work with the same PInteger. Last one wins.

    ReplyDelete
  14. Alexander Benikowski Ah good catch.

    While C++'s lambda syntax is rather horrible, it makes it much harder to do such accidents.

    ReplyDelete
  15. I also think this is a good illustration of why multi-threading is hard. You have to really know the details of what is going on under the hood to know that something is safe or unsafe, and you need to remember all of that all the time while writing multi-threaded code.

    ReplyDelete
  16. Asbjørn Heid​ i suspected that the capturing might cause a problem. But credits go to Stefan for explaining me precisely (again... -.-) what exactly is going on there. He spottet that new is shared

    ReplyDelete
  17. Alexander Benikowski Shame on me for not spotting it though, I've been bitten by this before. I blame lack of caffeine :D

    ReplyDelete
  18. Congratulations, Alexander Benikowski , capturing is indeed the problem here! Tricky to spot, especially as everything works correctly in my dumb example because threads execute in a very predictable manner.

    You got the prize, so the link for free book + webinars is already on way to you!

    ReplyDelete
  19. After looking into the debugger I have to correct myself a bit: the nested anonymous methods are not backed by the same object instance but in fact the deepest one captures the new variable which is then accessed by the one that has it as local variable. But the result is the same.

    When you put a breakpoint into the line where it sets the progressbar position you see this method name in the callstack:

    Unit1.TForm1.FormCreate$1$ActRec.$0$Body$2$ActRec.$0$0$Body$3$ActRec.$0$0$0$Body

    Unit1.TForm1.FormCreate$1$ActRec - class that backs the anonymous method being passed to TTask.Run
    Unit1.TForm1.FormCreate$1$ActRec.$0$Body$2$ActRec - class that backs the anonymous method being passed to TParallel.For
    Unit1.TForm1.FormCreate$1$ActRec.$0$Body$2$ActRec.$0$0$Body$3$ActRec - class that backs the anonymous method being passed to TThread.Queue

    In the local variables window you can see that the nested ones hold a reference to their parent (__Parent) and their variables as fields.

    ReplyDelete
  20. Too much (anonymous method) magic is killing the magic... Implementation details hide a lot of hidden code and classes. Explicit variable definition is sometimes much easier to debug, especially in multi-thread. Anonymous methods are hiding references to variables within their execution context, as hidden automagically generated classes.
    Making the implicit explicit sounds safer to me. This sounds like a typical example of saving 1 minute of code writing, for loosing 1 hour/day/week to find out the root cause.
    BTW, shouldn't local variables NOT being shared among instances? I suspect all of us (especially coming from JavaScript or C#) would have expect this behavior...

    ReplyDelete
  21. Another problem I've found with TThread.Queue is the proc is called after TTask has finished (and in my case I need them before task ends).
    One way to solve this is checking <>nil like:

    FTask:=TTask.Run...
    ...
                  TThread.Queue(nil,procedure
                  begin
                    // Safety check
                    if tmpTask<>nil then
                       ....
                  end);
    ...
    FTask.Wait;
    FTask:=nil;

    Trying TThread.Synchronize instead of Queue, locks/freezes execution.

    ReplyDelete
  22. David Berneda Sounds line your code might have a deadlock.

    ReplyDelete
  23. Asbjørn Heid​ yes I guess vcl progressbar animarion postmessages get delayed. Should work fine in fmx, I'll do the test.

    ReplyDelete
  24. David Berneda I think your problem is that your Main Thread is stucked in FTask.Wait(); so it cant process the queued procedure.

    ReplyDelete

Post a Comment