To date every attempt to solve or identify where the memory leak is coming from when processing Ruby dialplans has met with failure. That’s despite a concerted effort analysing and tracing the dialplan processing including the purchase of a relatively expensive memory profiling tool. The latest attempts seemed to be going somewhere initially by removing the Thread.Aborts – which seemed to be leaving something in the scripting engine in a bad state leading to the leak – and instead interrupting the execution of the engine by setting a variable.
The interrupt execution approach seemed to work fine both in testing and in the first few days on sipsorcery.com but then the leak re-appeared and even worse some weird things started happening where log messages from one dialplan execution appeared in another dialplan’s execution. That’s bad and my guess is that within the internals of the DLR (Dynamic Language Runtime) the script interpreters are being cached and interrupting the execution of the interpreters left some internal states awry.
It’s somewhat difficult to troubleshoot and test the dialplan execution because once the script starts getting interpreted then there is no easy way to debug through it. The lines of script get converted into lambda methods and executed as a series of instructions on the interpreter. It’s a bit like debugging a C application in assembly mode, i.e. next to impossible for all but the most trivial application.
I’ve put together a timeline of the issue so I can one day look back on it with fond memories when the issue is eventually solved.
- May/Jun 2008 Memory leak first appeared on mysipswitch.com, mysipswitch forums: Instability Jun 2008
- 6 Jul 2008 Isolated leak to the dialplan processing and IronRuby engine, IronRuby forums: Memory Leak with Certain Script Exceptions
- 31 Jul 2008 Stage 1 of software upgrade to separate mysipswitch server agents mysipswitch forums: Software Update 31 Jul 2008
- 6 Sep 2008 Stage 2 of software upgrade to separate mysipswitch server agents mysipswitch forums: SIPSwitch Upgrade – 6 Sep 2008 and mysipswitch blog: Pear Skidding
- Sep 2008 to Jun 2009 sipsorcery.com upgrade from mysipswith.com under heavy development. No further investigation into the dialplan processing leak undertaken with the hope being that later versions of IronRuby, which is also under heavy development, would not have the same behaviour when interacting with sipsorcery,
- 24 Jun 2009 sipsorcery.com went live,
- 7 Jul 2009 call volume increases on sipsorcery.com and memory leak behaviour in dialplan processing observed. Undertook investigation of the leak this time debugging into the IronRuby and DLR libraries,
- 9 Jul 2009 Hypothesis that Thread.Abort call to halt completed dialplans was causing the leak. Delved into DLR design and discovered “interrupt execution” which is the theoretical design solution to the problem IronRuby forums: Interrupt Execution and DLR forums: Interrupt Execution,
- 11 Jul 2009 Upgraded sipsorcery.com to use an interrupt execution apporach in the DLR library and removed the Thread.Abort calls,
- 15 Jul 2009 Memory leak behaviour on sipsorcery.com coupled with crossover of dialplan log messages,
- 16 Jul 2009 Removed the interrupt execution changes made to the Microsoft.Scripting (DLR) assembly which means dialplans will not be terminated and will be left to run to completion,
- 17 Jul 2009 Memory leak behaviour caused app server to become unresponsive and not process calls,
- 18 Jul 2009 Initial implementation work for call-dispatcher/isolated-process approach for dialplan processing.
- 23 Aug 2009 Dial plan processing failed (“Long Running Dialplan” message).