A recent patch posted to the lkml aimed to make it possible to use kdb and kdump at the same time, but instead led to an interesting discussion about RAS (Reliability, Availability, and Serviceability) tools. Vivek Goyal compared the two main philosophies: "so basically there are two kind of users. One who believes that despite the kernel [having] crashed something meaningful can be done," versus "exec on panic, which thinks that once [the] kernel is crashed nothing meaningful can be done." When the discussion focused on kdb, Keith Owens noted:
"The problem above applies to all the RAS tools, not just kdb. My stance is that _all_ the RAS tools (kdb, kgdb, nlkd, netdump, lkcd, crash, kdump etc.) should be using a common interface that safely puts the entire system in a stopped state and saves the state of each cpu. Then each tool can do what it likes, instead of every RAS tool doing its own thing and they all conflict with each other, which is why this thread started."
Andrew Morton summarized the current state of affairs: "lots of different groups, little commonality in their desired functionality, little interest in sharing infrastructure or concepts." In response to an earlier patch Keith posted to a lesser-trafficked mailing list, Andrew suggested it be resubmitted in a working form for a full review: "much of the onus is upon the various RAS tool developers to demonstrate why it is unsuitable for their use and, hopefully, to explain how it can be fixed for them."
A kernel crash dump is a snapshot of system state taken at the time the kernel crashed, useful for finding and debugging the problem that caused the crash in the first place. There is no standard mechanism for automatically collecting a crash dump on Linux, but a number of existing projects are working toward meeting this goal efficiently. A "Linux Kernel Dump Summit" was recently mentioned on the lkml, with participants from some of the many crash dump projects looking to standardize the dump process and the information collected. A followup email noted, "as memory size grows, the time and space for capturing kernel crash dumps really matter." It went on to examine two approaches: partial dumps, and full dumps that are compressed. The former risks not collecting information necessary for proper debugging, while the latter risks greatly increasing the amount of time required to collect a dump.
There are a number of existing projects for automatically collecting kernel crash dumps on Linux, including Linux Kernel Crash Dump (LKCD), Mini Kernel Dump (mkdump), kdump, and diskdump. Some of these projects also include tools for examining the obtained dumpfiles. Other projects focus just on tools for analyzing kernel crash dumps, including the Perl-based Alicia (the Advanced LInux Crash-dump Interactive Analyzer) and Red Hat's crash analysis tool, "loosely based on the SVR4 UNIX crash command, but significantly enhanced by completely merging it with the GNU gdb debugger."