Author
StackTrack was written by James Fowler, founder of
Open Sea Consulting. As of the
initial release, he is solely responsible for any bugs, typos, or other
deficiencies in StackTrack...
Acknowledgements
The Boost Community, for the wonderful things they have accomplished so far...
Joel de Guzman and Eric Niebler, for creating QuickBook, Thomas
Guest for adding dynamic source code highlighting, and the many others
working to improve the process for creating Boost documentation.
Brian Baatz, for enlightening conversation on profiling interfaces.
Ion Gaztañaga, for providing the shmem library which works very
well with Boost and is used by StackTrack.
ACE - The initial prototype for shared memory support in StackTrack
used ACE, which worked well. Unfortunately, requiring ACE just for
basic shared memory support would make StackTrack more difficult to use
for projects which don't already depend on ACE. There are, however,
some "bonus features" planned for ACE users in future releases...
This section is a holding area for various questions, ideas, and so on which are still in progress.
Design Decisions which might be made adjustable (current behavior, if any, in italics) :
-
Shared Memory
-
Overflow : what happens when writing a log entry would require the thread to block?
-
block indefinitely?
-
timed block?
-
drop the data and move on?
-
Identification
-
named?
-
anonymous (from user's perspective)
-
Process granularity
-
unique buffer per process
-
shared buffer for all processes on system
-
Timing
-
Granularity
-
every call gets a unique timer value?
-
common shared timer value updated on background thread?
-
Format
-
store as raw (highest efficiency), convert in post-processing?
-
convert to common (double?) format?
-
Calibration
-
expected accuracy for calibrating relative time to "world" time
-
+/- 1 second relative to "world" time
-
average before/after hires time at "world" time transition - introduces startup delay during calibration
-
periodic recalibration?
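To make the overflow options above concrete, here is a minimal sketch of how they might be exposed as a policy on the writing side. Everything in it is an assumption made for illustration: the names overflow_policy and spsc_log are hypothetical, a plain in-process array stands in for the shared memory segment, and the log entry is reduced to an int. It is not StackTrack's actual implementation.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <thread>

// Hypothetical names; this is an illustration, not StackTrack's API.
enum class overflow_policy { block, timed_block, drop };

// Single-producer / single-consumer ring of N log entries. A plain array
// stands in for the shared memory segment.
template <std::size_t N>
class spsc_log {
    static_assert(N > 0, "capacity must be positive");
public:
    // Returns true if the entry was stored, false if it was dropped.
    bool write(int entry, overflow_policy policy,
               std::chrono::microseconds timeout = std::chrono::microseconds(100)) {
        assert(timeout.count() >= 0);
        const auto deadline = std::chrono::steady_clock::now() + timeout;
        const std::size_t tail = tail_.load(std::memory_order_relaxed);
        // The ring is full while the writer is N entries ahead of the reader.
        while (tail - head_.load(std::memory_order_acquire) == N) {
            const bool give_up =
                policy == overflow_policy::drop ||
                (policy == overflow_policy::timed_block &&
                 std::chrono::steady_clock::now() >= deadline);
            if (give_up) {
                ++dropped_;
                return false;
            }
            std::this_thread::yield();  // "block": keep waiting for the reader
        }
        slots_[tail % N] = entry;
        tail_.store(tail + 1, std::memory_order_release);
        return true;
    }

    // Removes the oldest entry, if any, freeing a slot for the writer.
    bool read(int& out) {
        const std::size_t head = head_.load(std::memory_order_relaxed);
        if (head == tail_.load(std::memory_order_acquire)) return false;
        out = slots_[head % N];
        head_.store(head + 1, std::memory_order_release);
        return true;
    }

    // Number of entries discarded so far (only the writer updates this).
    std::size_t dropped() const { return dropped_; }

private:
    int slots_[N];
    std::atomic<std::size_t> head_{0};  // next slot to read
    std::atomic<std::size_t> tail_{0};  // next slot to write
    std::size_t dropped_ = 0;
};
```

With the drop policy, a write to a full buffer returns immediately and the traced thread never stalls, at the cost of gaps in the log. The block policy keeps every entry but ties the traced program's progress to the reader.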
Prerequisites
The following need to be configured and available:
Make sure you already have BoostBook successfully installed and working before you try generating the docs for StackTrack!
BoostBook as contained in Boost 1.32 works (for the author) with Doxygen 1.3.9.1. Since StackTrack makes use of some
nested classes (which Doxygen 1.3.9.1 didn't seem to handle well), an attempt was made to use Doxygen 1.4.1 which
failed miserably. Some changes have been made to $BOOST/tools/boostbook/xsl/doxygen/doxygen2boostbook.xsl which allow it to
work better, but it still needs some work. For now, be happy with pre-1.4.x versions unless you're interested in wrestling with
BoostBook's Doxygen support.
QuickBook
Get QuickBook from CVS, build it, and have the executable in your path.
Create a "quickbook.jam" in $BOOST/tools/build/v2/tools (if it doesn't exist) containing:
import type ;
import boostbook ;
type.register QUICKBOOK : qbk ;
import generators ;
generators.register-standard quickbook.inline-file : QUICKBOOK : XML ;
actions inline-file
{
    quickbook $(>) $(<)
}
Build the docs
run
bjam --v2
from the directory $BOOST/libs/stacktrack/docs
-
optional : BoostBook update for Doxygen 1.4.x (works with the Boost 1.32 release and Doxygen 1.3.9.1, but some doc details will be omitted)
-
this update is still in progress
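// Functor that captures the current high resolution time.
struct hires_momemt {
    typedef some_type raw_type;      // platform-specific raw time representation
    void operator()( raw_type & );   // stores "now" into the argument
};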
Interface for capturing high resolution time
-
can be very challenging to balance:
-
resolution (as high as possible!)
-
reliability and accuracy (otherwise resolution doesn't mean much...)
-
efficiency (all this with zero runtime overhead
please, and fetch me a fresh cup of coffee while you're at it)
-
variations in interface
-
hard to find one ideal portable API to build on, lots of choices
-
ACE has some nice timer wrappers,
but that doesn't help unless you're using ACE
-
QueryPerformanceCounter(...) & QueryPerformanceFrequency(...) on Windows
-
time() is ubiquitous, but with pitiful resolution...
-
gettimeofday() available on various POSIX systems
-
gethrtime() on Solaris and HP-UX systems (maybe RTLinux too)
-
reading TSC (time stamp counter) values on Pentium-based systems
-
and there are assuredly a few more out there
-
representation - some use an integer type, some use structs, ...
-
resolution - from very coarse (1 second from time()) to very fine (nanoseconds for gethrtime())
-
context - "wall clock" time (like gettimeofday()), process time (like clock()), and more...
-
scaling - constant (sec/usec in gettimeofday()) vs. runtime dynamic (QueryPerformanceCounter(...) / QueryPerformanceFrequency(...))
-
offset - fixed ( gettimeofday()) vs. floating (gethrtime())
-
overflow for floating offset representations
-
calibration for floating offset representations (baseline for conversion to known offset)
-
variations in behavior - due to implementation or dynamic environmental influences
-
known issues, like QueryPerformanceCounter() sometimes skipping
-
degree of separation from actual "real time clock" hardware
-
query RTC hardware directly (should be most stable - but not necessarily fastest...)
-
query counters slaved to CPU clock (like the "rdtsc" instruction for Pentiums)
-
query global internal "current time" value incremented by periodic process (interrupts...)
-
local uncertainty, i.e. the first (in hard real time) of two nearly simultaneous calls may return a
slightly "later" value than the second
-
called in one thread?
-
called in multiple threads on one CPU?
-
called in multiple processes on one CPU?
-
called in multiple threads/processes on multiple CPUs?
-
performance impact - overhead varies
-
based on API calls used, can differ by platform / kernel version / etc...
-
based on frequency called (impact on caches, SMP systems)
-
precision - useful precision may be less than representation allows
-
response to system "idling" - load-based CPU speed throttling, sleep & hibernation periods, etc.
-
drift - representations with "wall clock" context but floating offsets may (or may not) diverge over time from calibration points
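Two of the variations listed above, scaling and overflow, come down to simple arithmetic, so a small sketch may help. The function names below (timeval_to_seconds, ticks_to_seconds, elapsed_ticks) are hypothetical, and the raw values are passed in as plain integers rather than fetched from gettimeofday() or QueryPerformanceCounter(), so the code stays portable. It only illustrates the conversions, under those assumptions.

```cpp
#include <cassert>
#include <cstdint>

// Constant scaling, gettimeofday() style: the units (seconds and
// microseconds) are fixed by the API.
inline double timeval_to_seconds(long sec, long usec) {
    assert(usec >= 0 && usec < 1000000);
    return static_cast<double>(sec) + static_cast<double>(usec) / 1e6;
}

// Runtime scaling, QueryPerformanceCounter() style: the tick rate is only
// known after asking the system, so it is passed in as a parameter.
inline double ticks_to_seconds(std::int64_t ticks, std::int64_t ticks_per_second) {
    assert(ticks_per_second > 0);
    // Split into whole seconds and a remainder so that very large counts
    // are not rounded when they are converted to double.
    const std::int64_t whole = ticks / ticks_per_second;
    const std::int64_t rest  = ticks % ticks_per_second;
    return static_cast<double>(whole) +
           static_cast<double>(rest) / static_cast<double>(ticks_per_second);
}

// Floating offset with a narrow counter: unsigned subtraction gives the
// right interval even if the counter wrapped once between the two samples.
inline std::uint32_t elapsed_ticks(std::uint32_t earlier, std::uint32_t later) {
    return static_cast<std::uint32_t>(later - earlier);
}
```

For example, ticks_to_seconds(2500000, 1000000) yields 2.5, and elapsed_ticks(0xFFFFFFF0u, 0x10u) yields 32 even though the counter wrapped between the two samples.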
thoughts on requirements for an ideal portable C++ high resolution time capture mechanism
-
generic interface
-
specifies opaque "raw_hrtime" form in which time is stored
-
not necessarily an object, may use POD type as representation
-
don't want to require constructor/destructor
-
common signatures for key operations
-
store "now" in an instance of raw_hrtime
-
move/copy instance of raw_hrtime
-
convert from raw_hrtime into normalized form(s)
-
separate conversions for floating and fixed offsets
-
conversion to integral POD type(s)
-
also provide converted equivalents for scale and offset
-
conversion to double (in seconds)
-
overflow check
-
get scaling factor
-
get offset
-
traits to make variations in interface of underlying API available for compile-time or run-time use
-
size & alignment requirements for raw_hrtime
-
resolution, context, mode (fixed|variable) for scaling and offset
-
traits / operations addressing potential variations in behavior
-
optional, may only represent "best guess"
-
common signatures for potentially optional operations (default implementation may be feasible)
-
calibration
-
drift detection
-
ideally zero runtime overhead added by generic wrapper for each capture of "now" in raw_hrtime
-
highest priority is capture
-
secondary priority is direct (unprocessed) output
-
output raw_hrtime without forcing conversion
-
do NOT apply - or even query - scaling and offset
-
optionally store additional information as necessary to decode the raw form
-
traits on interface variation
-
scaling/offset values
-
can treat as raw (possibly aligned) opaque chunk of memory
-
local manipulation only by explicit request
-
deferred (lazy-evaluation) for any conversions
-
use scaling and offset to provide normalized representation
-
calibration on demand
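The requirements above could be met in several ways. The following is one possible shape, sketched under stated assumptions: std::chrono::steady_clock stands in for a platform-specific API, and the names steady_hrtime, hrtime_traits and elapsed_seconds are hypothetical. Capture does nothing but store a raw value. Scaling is only queried when a conversion is explicitly requested.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <type_traits>

// Hypothetical implementation of the generic interface on top of
// std::chrono::steady_clock (standing in for a platform-specific API).
struct steady_hrtime {
    typedef std::chrono::steady_clock::rep raw_hrtime;  // opaque raw form

    // Compile-time traits describing this implementation.
    static constexpr bool fixed_scale  = true;   // tick length known at compile time
    static constexpr bool fixed_offset = false;  // epoch is unspecified ("floating")

    // Capture: the only operation on the hot path. No scaling, no offset.
    static void now(raw_hrtime& out) {
        out = std::chrono::steady_clock::now().time_since_epoch().count();
    }

    // Scaling factor in seconds per tick, queried only when converting.
    static double scale() {
        typedef std::chrono::steady_clock::period period;
        return static_cast<double>(period::num) / static_cast<double>(period::den);
    }
};

// Traits that storage code can use without knowing the implementation.
template <typename Impl>
struct hrtime_traits {
    typedef typename Impl::raw_hrtime raw_hrtime;
    static_assert(std::is_trivial<raw_hrtime>::value &&
                  std::is_standard_layout<raw_hrtime>::value,
                  "raw_hrtime must be usable as a plain chunk of memory");
    static constexpr std::size_t size      = sizeof(raw_hrtime);
    static constexpr std::size_t alignment = alignof(raw_hrtime);
};

// Deferred conversion: applied on request, typically long after capture.
template <typename Impl>
double elapsed_seconds(typename Impl::raw_hrtime from,
                       typename Impl::raw_hrtime to) {
    assert(to >= from);
    return static_cast<double>(to - from) * Impl::scale();
}
```

Because the offset of this implementation is floating, only differences between two raw values are converted here. A fixed-offset implementation could additionally offer a conversion to absolute time.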
"Friendly" class wrapper for hires time type
-
A common class interface implemented around the generic raw_hrtime interface should be able to provide
-
Convenient features like automatically fetching "now" in the constructor
-
automatic conversion to normalized form (like double)
-
comparison and math operations
-
iostream output
-
and do all this in clean, portable code
-
A templatized version can be created to allow compile-time variation across multiple raw_hrtime implementations
-
A concrete version could provide "best effort" default support for hi-res time
-
may need more than one...
-
A hybrid version could potentially allow for selection of the "closest" available implementation based on filtering raw_hrtime variants
-
example: VARIABLE_OFFSET_OK + MAX_SMP_CONSISTENCY
-
may or may not be worth it if this requires runtime performance hit...
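As a rough illustration of the templatized variant, the sketch below wraps a raw implementation in a small class with a capturing constructor, conversion to double, comparison, subtraction and iostream output. The names hires_time, steady_impl and to_text are hypothetical, and std::chrono::steady_clock is assumed as the underlying timer purely to keep the example portable.

```cpp
#include <cassert>
#include <chrono>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical raw implementation, assumed for this example only.
struct steady_impl {
    typedef std::chrono::steady_clock::rep raw_hrtime;
    static void now(raw_hrtime& out) {
        out = std::chrono::steady_clock::now().time_since_epoch().count();
    }
    static double scale() {  // seconds per tick
        typedef std::chrono::steady_clock::period period;
        return static_cast<double>(period::num) / static_cast<double>(period::den);
    }
};

// "Friendly" wrapper: convenience first, built on the raw interface.
template <typename Impl>
class hires_time {
public:
    hires_time() { Impl::now(raw_); }  // fetch "now" automatically
    explicit hires_time(typename Impl::raw_hrtime raw) : raw_(raw) {}

    typename Impl::raw_hrtime raw() const { return raw_; }

    // Normalized form: seconds since the implementation's own epoch.
    double seconds() const { return static_cast<double>(raw_) * Impl::scale(); }

    friend bool operator<(const hires_time& a, const hires_time& b) {
        return a.raw_ < b.raw_;
    }
    friend bool operator==(const hires_time& a, const hires_time& b) {
        return a.raw_ == b.raw_;
    }
    // Difference between two moments, in seconds.
    friend double operator-(const hires_time& a, const hires_time& b) {
        return static_cast<double>(a.raw_ - b.raw_) * Impl::scale();
    }
    friend std::ostream& operator<<(std::ostream& os, const hires_time& t) {
        return os << t.seconds() << 's';
    }

private:
    typename Impl::raw_hrtime raw_;
};

// Convenience: format a moment through its stream operator.
template <typename Impl>
std::string to_text(const hires_time<Impl>& t) {
    std::ostringstream os;
    os << t;
    return os.str();
}
```

Typical use would be to construct one hires_time<steady_impl> before and one after the code of interest and subtract them. All conversions happen in the operators, so code that only stores raw() pays nothing for them.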
This section is out of date and somewhat redundant...
Performance
-
StackTrack is designed for minimum impact at runtime
-
Data is gathered in a very minimalist, binary format
-
Post-processing is required to produce useful analysis
-
this significantly reduces the time and space costs of acquiring trace data
-
post-processing is currently done via a Perl script
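The split between cheap capture and later analysis can be sketched as follows. The record layout and the names trace_record, append, decode and to_seconds are assumptions made for this example. StackTrack's real binary format is not shown in this document, and its post-processing is done by a Perl script rather than in C++.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical record layout; StackTrack's real format is not shown here.
struct trace_record {
    std::uint64_t raw_time;  // unconverted timer value
    std::uint32_t site_id;   // identifies the instrumented location
    std::uint32_t kind;      // e.g. 0 = enter, 1 = exit
};

// Capture side: append the record's bytes as-is. No formatting, no scaling.
inline void append(std::vector<unsigned char>& log, const trace_record& rec) {
    const unsigned char* bytes = reinterpret_cast<const unsigned char*>(&rec);
    log.insert(log.end(), bytes, bytes + sizeof rec);
}

// Post-processing side: recover the records from the raw bytes.
inline std::vector<trace_record> decode(const std::vector<unsigned char>& log) {
    assert(log.size() % sizeof(trace_record) == 0);
    std::vector<trace_record> out(log.size() / sizeof(trace_record));
    if (!out.empty()) std::memcpy(out.data(), log.data(), log.size());
    return out;
}

// Conversion to seconds is deferred until analysis time.
inline double to_seconds(const trace_record& rec, double seconds_per_tick) {
    return static_cast<double>(rec.raw_time) * seconds_per_tick;
}
```

The capture path copies sixteen bytes and does nothing else. Everything that costs time, such as scaling, formatting and matching enter and exit records, is left to the post-processing step.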
Connectivity
-
For simple use cases, data can be written directly to local
files
-
this introduces a problem: either the files must be flushed after each write (which
can cause steep overhead) or data may be lost if the
process dies with buffered, unwritten data.
-
StackTrack supports a shared-memory protocol which allows minimum overhead
while reducing the risk of data loss due to buffered file I/O.