I'm experiencing a weird issue with "zombie processes", and I'm curious if anyone has experienced anything similar.

We're using Task Scheduler to start our program every 3 minutes. The program, part of an integration solution, imports data into our database and writes some output files. It runs from a local disk on Windows Server 2008 in a "cloud VM".

The problem is that the process sometimes simply hangs after reaching the final "end." of the program. Task Scheduler somehow doesn't notice, and it never terminates the process.

To work around this I wrote a "watchdog" program. Just before shutting down, the main program acquires a named mutex and never releases it. Once it holds the mutex, it writes its PID to a named memory-mapped file, writes to a log file, and then calls "end.".
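The shutdown side of this handshake might look roughly like the following. This is a minimal sketch, not the original Delphi code. It assumes a POSIX system and swaps in an exclusive fcntl.flock lock on a file for the named mutex, with the PID written into that same file instead of a named memory-mapped file. The function name begin_shutdown and any lock path are hypothetical.

```python
import fcntl
import os


def begin_shutdown(lock_path):
    """Acquire the (hypothetical) shutdown lock and publish this process's PID.

    Stand-in for: acquire the named mutex, then write the PID to the named
    memory-mapped file. The lock is deliberately never released.
    """
    fd = os.open(lock_path, os.O_RDWR | os.O_CREAT, 0o644)
    fcntl.flock(fd, fcntl.LOCK_EX)   # blocks until this process owns the lock
    os.ftruncate(fd, 0)              # drop a PID left behind by an earlier run
    os.write(fd, str(os.getpid()).encode("ascii"))
    # No LOCK_UN and no close(): the kernel drops an flock() lock when the last
    # descriptor for it is closed, which for this process happens on exit.
    return fd
```

The main program would call begin_shutdown with some agreed-upon path just before writing its final log line and returning.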

The watchdog tries to acquire the same mutex for up to a minute. If it fails, it reads the PID from the memory-mapped file and terminates that process.
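The watchdog side can be sketched under the same assumptions: POSIX, flock standing in for the named mutex, and SIGKILL standing in for OpenProcess plus TerminateProcess. The name watchdog_check is hypothetical. On Windows, the one-minute attempt would instead be a single WaitForSingleObject call with a 60000 ms timeout. One difference worth keeping in mind: a Windows mutex is owned by a thread. If the owning thread ends without releasing it, a waiter gets WAIT_ABANDONED and is granted ownership, even if the process itself is still around.

```python
import fcntl
import os
import signal
import time


def watchdog_check(lock_path, timeout=60.0, poll_interval=0.1):
    """Try to take the shutdown lock for up to `timeout` seconds.

    Returns None if the lock could be taken (the program finished, or has not
    reached shutdown yet); otherwise kills the PID recorded in the lock file
    and returns it.
    """
    fd = os.open(lock_path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        deadline = time.monotonic() + timeout
        while True:
            try:
                # flock() has no timeout parameter, so poll with LOCK_NB instead
                # of the single timed wait a Win32 mutex would allow.
                fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
            except BlockingIOError:
                if time.monotonic() >= deadline:
                    break  # still held after the timeout: treat the owner as hung
                time.sleep(poll_interval)
            else:
                # The lock came free; this also happens when its owner went away
                # (compare WAIT_ABANDONED on Windows).
                fcntl.flock(fd, fcntl.LOCK_UN)
                return None
        os.lseek(fd, 0, os.SEEK_SET)
        pid = int(os.read(fd, 32).decode("ascii").strip())
        os.kill(pid, signal.SIGKILL)  # stand-in for OpenProcess + TerminateProcess
        return pid
    finally:
        os.close(fd)
```

Returning None when the lock is free mirrors the design described above: the watchdog only acts when the lock is still held after the timeout.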

This works fine in testing. In production, however, a zombie process sometimes still lingers. The log file shows that it successfully acquired the mutex, yet the watchdog does not detect that the process is hung. Since the test case (acquire the mutex, then loop forever) works fine, it would seem that the OS started cleaning up the process, releasing the mutex, but never finished.

I thought it might be due to kernel object namespaces, but I also ran the test app as a scheduled task, set up the same way as the main app, and it does get killed.

I'm pretty clueless here. Does anyone have an idea what might be going on?

Comments

  1. Instead of coordinating a dozen apps and loads of configuration, did you try writing a simple task scheduler that starts the process and loads a watchdog DLL into it? If the process fails to end, the DLL kills the current process from its watchdog thread, and voilà!

  2. Might try that. I was just really surprised that the process is left in this state.

    I might try a watchdog thread to begin with.

  3. Weird stuff indeed, but I wouldn't lose sleep over it. Just add the thread and move on (:

  4. Use OutputDebugString to output messages from the destructors. It could be a deadlock or an exception in a destructor.

  5. Hm, that'd be a lot of destructors... big project :) Not sure how to get to them all. If it happens, it must be in a thread; why else would the OS clean up the mutex...

    Thanks for the suggestion though, I'll keep it in mind.

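The watchdog-thread suggestion from the comments can be sketched in the same spirit, with Python standing in for the Delphi program. arm_exit_watchdog is a hypothetical helper that would be armed right before the final log line, and os._exit plays the role of killing the current process. Treat it as a sketch: it only helps if the hang happens while the timer thread can still run. On Windows, ExitProcess terminates all other threads before it calls the DLLs' DLL_PROCESS_DETACH handlers, so a hang at that stage would not be caught by an in-process thread.

```python
import os
import threading


def _force_exit():
    # Ends the whole process immediately, from any thread, without running
    # cleanup handlers (the closest Python analog to killing ourselves).
    os._exit(1)


def arm_exit_watchdog(timeout, on_timeout=None):
    """Start a daemon timer that calls `on_timeout` (default: force-exit)
    if the process is still alive after `timeout` seconds."""
    timer = threading.Timer(timeout, on_timeout or _force_exit)
    timer.daemon = True  # the watchdog itself must never keep the process alive
    timer.start()
    return timer
```

If the program exits normally, the daemon timer simply dies with the process; the on_timeout parameter exists mainly so the behavior can be exercised without actually exiting.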
