WARNING: PRE-RELEASE VERSION, NOT AN OFFICIAL BOOST LIBRARY. StackTrack is not an official Boost library. It has not been submitted for review as a potential Boost library, although the current intention is to seek review after a stable release with significant positive experience in the Boost community.
StackTrack captures data during application runs, which can then be used to investigate the cause for anomalous behaviors.
As instructions are processed on various threads within a process, StackTrack captures event data whenever a thread passes a checkpoint. Checkpoints are created by modifying the application's source code. They can be inserted to cover the entry and exit of C++ block scopes, or at specific locations within a block. The minimum data collected for each event includes identifiers for the checkpoint and current thread, a sequence counter, and the current time. Additional optional information can be attached to an event.
In most configurations, StackTrack event data will be captured and stored in a compact binary format. For typical production systems, an individual application process will queue events in shared memory, and a separate StackTrack-enabled process will read the events and store them to files. StackTrack provides utilities to process the raw event data, either directly extracting a variety of statistical information or converting it to XML format for further analysis with other tools.
"There are programs runnin' all over the place. The ones doing their job, doing what they were meant to do, are invisible. You'd never even know they were there. But the other ones... well, you hear about them all the time"
-- the Oracle, Matrix Reloaded
When used properly, multi-threading can provide significant advantages for large applications. Success, however, can be particularly challenging when problems appear in production systems. Many tools for debugging and profiling are difficult to use in a production environment, and sometimes problems disappear when the application isn't running under a production load. Traditional logging and tracing systems often face a similar paradox, where the overhead created by increasing the captured detail modifies the timing enough to mask the problem. Project teams are left to tedious second-guessing of the application logic, sometimes even searching for possible flaws by attempting iterative patches on a production system, bypassing the standard production QA release process. Problems of this nature can introduce serious risks for any business process depending on the application, as the difficulties in resolving them lead to ongoing production failures without a reliable approach to finding a solution.
StackTrack seeks to improve the situation, providing a means to capture key details about an application's progress through mechanisms explicitly designed for use while an application is running in a production environment. This requires operating with extremely low overhead, ideally incurring near zero impact on the performance of the monitored application. It also requires capturing enough detail to provide substantially increased visibility into the timing of and interactions between threads in the application, to assist the project team in tuning performance and isolating defects.
StackTrack is designed specifically for use in production environments, so that it can remain enabled in situations where the alternatives would introduce prohibitive runtime overhead, increase the risk of application failure, or simply be too complicated to operate. While StackTrack has similarities with other systems for profiling and logging, the bias towards "always on" monitoring in production environments introduces some challenges.
The following table illustrates some of the issues considered in the design and implementation of StackTrack, and constraints associated with those issues. Some of the concerns, like accuracy, are fairly obvious and universal (until we start writing software for quantum systems...). Some variations which could be disastrous in a production environment can be quite valid and useful in other contexts (like fine-grained instrumentation for profiling or exploring the execution path in complex sections of legacy code).
| Concern | The Wrong Answer for Production Systems ... |
|---|---|
| Stability | allow failures in monitoring to impact normal application behavior |
| Performance | degrade overall system performance, especially under full load |
| Simplicity | require significant learning curve to instrument application code for monitoring |
| Completeness | lose data - especially at the very end - when the application crashes |
| Accuracy | provide faulty or incomplete data |
| Granularity | capture data at too many locations (like every function call) |
| Relevance | analyze data within the application and discard the raw data |
| Blocking | change application timing by blocking on resources like mutex locks or I/O |
| Portability | use language features not consistently available in compilers used for currently deployed applications |
| Reversibility | omit compile-time mechanism to disable monitoring without modifying instrumented application source code |
"Nothing did she remember save a darkness that lay behind her, and a shadow of fear;"
-- J.R.R. Tolkien, The Silmarillion
Mission-critical applications running in a production environment should never, ever crash. Unfortunately, many of them appear unaware of this, and proceed to reveal their bad manners by crashing anyway. Modern conveniences like just-in-time debugger attachment are often unavailable; at best, a core dump file gets captured. If you're lucky, the state of the application's final moment is preserved with enough detail to recognize symptoms surrounding the failure. But without a deeper understanding of the past, deadly consequences can remain hidden in seemingly innocent, reasonable solutions.
"You are the eventuality of an anomaly which despite my sincerest efforts I have been unable to eliminate from what is otherwise a harmony of mathematical precision"
-- the Architect, Matrix Reloaded
Determining the cause for production crashes can be hard work, especially when the best one can hope for is a reasonably detailed snapshot of the application state at the moment it crashed. StackTrack provides a way to look further back, and analyze the behavior leading up to the crash. It can't magically pinpoint the problem, but it can help identify likely suspects.
Many production defects are caused by timing, lying dormant until just the right chance alignment brings multiple threads into contention for an unguarded shared resource. StackTrack can be very useful in tracking these down. Other defects arise from unexpected inputs, which can be recorded in the StackTrack data stream. While some defects may leave no traces in the data captured by StackTrack, other problems which had remained stubbornly hidden may become easily identifiable as soon as they occur in a StackTrack monitored application run.
Production problems vary in degree. Various manifestations of accelerated process mortality, like core dumps and the infamous BSOD, tend to be rather dramatic. But other less flamboyant problems can be just as serious. One commonly frustrating scenario is mysterious performance degradation. An application may scream through benchmark tests, tear through "simulated load conditions" without breaking a sweat, yet slow to a crawl when put in production. When none of the usual suspects (CPU utilization, network saturation, page thrashing, etc.) show signs of stress, then the stress will gladly transfer to whoever is responsible for maintaining the application...
Sometimes things aren't so bad, but they need to get better soon. Average peak loads on the credit card processing module still leave about 5% headroom before hitting gridlock... but it's the end of October and loads usually jump 30% when people start shopping for Christmas. One month till the merger is finalized, and management's already declared that one of the billing apps will be retired, and you really don't want it to be yours.
Sometimes the hard part is simply knowing where to look. StackTrack can make it easy to find out where your application is dragging. Certain patterns of critical details (e.g. "everything's fine until more than ten threads are hitting the database at once") can be easily correlated with the time of performance dips by analyzing StackTrack information.
| Copyright © 2005 James Fowler |