[RFC v2][PATCH 6/9] Checkpoint/restart: initial documentation

From: Oren Laadan
Date: Wednesday, August 20, 2008 - 8:06 pm

Covers application checkpoint/restart, overall design, interfaces
and checkpoint image format.

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
---
  Documentation/checkpoint.txt |  177 ++++++++++++++++++++++++++++++++++++++++++
  1 files changed, 177 insertions(+), 0 deletions(-)
  create mode 100644 Documentation/checkpoint.txt

diff --git a/Documentation/checkpoint.txt b/Documentation/checkpoint.txt
new file mode 100644
index 0000000..fdc69cb
--- /dev/null
+++ b/Documentation/checkpoint.txt
@@ -0,0 +1,177 @@
+
+	=== Checkpoint-Restart support in the Linux kernel ===
+
+Copyright (C) 2008 Oren Laadan
+
+Author:		Oren Laadan <orenl@cs.columbia.edu>
+
+License:	The GNU Free Documentation License, Version 1.2
+		(dual licensed under the GPL v2)
+Reviewers:
+
+Application checkpoint/restart [CR] is the ability to save the state
+of a running application so that it can later resume its execution
+from the time at which it was checkpointed. An application can be
+migrated by checkpointing it on one machine and restarting it on
+another. CR can provide many potential benefits:
+
+* Failure recovery: by rolling back to a previous checkpoint.
+
+* Improved response time: by restarting applications from checkpoints
+  instead of from scratch.
+
+* Improved system utilization: by suspending long-running,
+  CPU-intensive jobs and resuming them when load decreases.
+
+* Fault resilience: by migrating applications off of faulty hosts.
+
+* Dynamic load balancing: by migrating applications to less loaded
+  hosts.
+
+* Improved service availability and administration: by migrating
+  applications before host maintenance so that they continue to run
+  with minimal downtime.
+
+* Time-travel: by taking periodic checkpoints and restarting from
+  any previous checkpoint.
+
+
+=== Overall design
+
+Checkpoint and restart is done in the kernel as much as possible. The
+kernel exports a relatively opaque 'blob' of data to userspace which can
+then be handed to the new kernel at restore time.  The 'blob' contains
+data and state of select portions of kernel structures such as VMAs
+and mm_structs, as well as copies of the actual memory that the tasks
+use. Any changes in this blob's format between kernel revisions can be
+handled by an in-userspace conversion program. The approach is similar
+to virtually all of the commercial CR products out there, as well as
+the research project Zap.
+
+Two new system calls are introduced to provide CR: sys_checkpoint and
+sys_restart.  The checkpoint code basically serializes internal kernel
+state and writes it out to a file descriptor, and the resulting image
+is stream-able. More specifically, it consists of 5 steps:
+  1. Pre-dump
+  2. Freeze the container
+  3. Dump
+  4. Thaw (or kill) the container
+  5. Post-dump
+Steps 1 and 5 are optimizations that reduce application downtime:
+"pre-dump" runs before the container is frozen, e.g. the pre-copy for
+live migration, and "post-dump" runs after the container resumes
+execution, e.g. writing the data back to secondary storage.
+
+The restart code basically reads the saved kernel state from a
+file descriptor, and re-creates the tasks and the resources they need
+to resume execution. The restart code is executed by each task that
+is restored in a new container to reconstruct its own state.
+
+
+=== Interfaces
+
+int sys_checkpoint(pid_t pid, int fd, unsigned long flags);
+  Checkpoint a container whose init task is identified by pid, to the
+  file designated by fd. The flags argument is reserved for future use
+  (should be 0 for now).
+  Returns: a positive integer that identifies the checkpoint image
+  (for future reference in case it is kept in memory) upon success,
+  0 if it returns from a restart, and -1 if an error occurs.
+
+int sys_restart(int crid, int fd, unsigned long flags);
+  Restart a container from a checkpoint image identified by crid, or
+  from the blob stored in the file designated by fd. The flags argument
+  is reserved for future use (should be 0 for now).
+  Returns: 0 on success and -1 if an error occurs.
+
+Thus, if checkpoint is initiated by a process in the container, one
+can use logic similar to fork():
+	...
+	crid = checkpoint(...);
+	switch (crid) {
+	case -1:
+		perror("checkpoint failed");
+		break;
+	default:
+		fprintf(stderr, "checkpoint succeeded, CRID=%d\n", crid);
+		/* proceed with execution after checkpoint */
+		...
+		break;
+	case 0:
+		fprintf(stderr, "returned after restart\n");
+		/* proceed with action required following a restart */
+		...
+		break;
+	}
+	...
+To initiate a restart, a process in an empty container can use
+logic similar to execve():
+	...
+	if (restart(crid, ...) < 0)
+		perror("restart failed");
+	/* only get here if restart failed */
+	...
+
+
+=== Checkpoint image format
+
+The checkpoint image format is composed of records, each consisting of
+a pre-header that identifies its contents, followed by a payload. (The
+idea here is to enable parallel checkpointing in the future in which
+multiple threads interleave data from multiple processes into a single
+stream).
+
+The pre-header is defined by "struct cr_hdr" as follows:
+
+struct cr_hdr {
+	__s16 type;
+	__s16 len;
+	__u32 id;
+};
+
+Here, the 'type' field identifies the type of the payload and 'len'
+gives its length in bytes. The 'id' identifies the owner object
+instance; its meaning varies depending on the type. For example,
+for type CR_HDR_MM, the 'id' identifies the task to which this MM
+belongs. The payload also varies depending on the type, for instance,
+the data describing a task_struct is given by a 'struct cr_hdr_task'
+(type CR_HDR_TASK) and so on.
+
+The format of the memory dump is as follows: for each VMA, there is a
+'struct cr_vma'; if the VMA is file-mapped, it is followed by the file
+name. The cr_vma->npages field indicates how many pages were dumped for
+this VMA. Next comes the actual data: first the addresses of all the
+dumped pages, followed by the contents of all the dumped pages (npages
+entries each). Then comes the next VMA and so on.
+
+To illustrate this, consider a single simple task with two VMAs: one
+is file mapped with two dumped pages, and the other is anonymous with
+three dumped pages. The checkpoint image will look like this:
+
+cr_hdr + cr_hdr_head
+cr_hdr + cr_hdr_task
+	cr_hdr + cr_hdr_mm
+		cr_hdr + cr_hdr_vma + cr_hdr + string
+			addr1, addr2
+			page1, page2
+		cr_hdr + cr_hdr_vma
+			addr3, addr4, addr5
+			page3, page4, page5
+		cr_hdr + cr_mm_context
+	cr_hdr + cr_hdr_thread
+	cr_hdr + cr_hdr_cpu
+cr_hdr + cr_hdr_tail
+
+
+=== Changelog
+
+[2008-Jul-29] v1:
+In this incarnation, CR only works on a single task. The address space
+may consist of only private, simple VMAs - anonymous or file-mapped.
+Both checkpoint and restart will ignore the first argument (pid/crid)
+and instead act on the calling task.
+
+[2008-Aug-09] v2:
+* Added utsname->{release,version,machine} to checkpoint header
+* Pad header structures to 64 bits to ensure compatibility
+* Address comments from LKML and linux-containers mailing list
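
The record layout described under "Checkpoint image format" (a 'struct cr_hdr' pre-header followed by 'len' bytes of payload) could be walked from userspace roughly as follows. This is only a sketch: cr_next_record() is a hypothetical helper that is not part of this patch, the struct uses <stdint.h> types in place of the kernel's __s16/__u32, and it assumes the payload follows the pre-header directly with no extra padding and in native byte order.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Userspace mirror of the pre-header shown in the documentation. */
struct cr_hdr {
	int16_t type;
	int16_t len;
	uint32_t id;
};

/*
 * Hypothetical helper (not part of the patch): fetch the record that
 * starts at *off inside buf.
 * Returns 1 and advances *off past the record on success, 0 when the
 * buffer has been fully consumed, and -1 if the record is truncated or
 * its length is negative.
 * Assumes the payload directly follows the pre-header, unpadded.
 */
int cr_next_record(const unsigned char *buf, size_t size, size_t *off,
		   struct cr_hdr *hdr, const unsigned char **payload)
{
	size_t left;

	if (*off > size)
		return -1;
	left = size - *off;
	if (left == 0)
		return 0;
	if (left < sizeof(*hdr))
		return -1;

	/* Copy the header out to avoid unaligned access. */
	memcpy(hdr, buf + *off, sizeof(*hdr));
	if (hdr->len < 0 || (size_t)hdr->len > left - sizeof(*hdr))
		return -1;

	*payload = buf + *off + sizeof(*hdr);
	*off += sizeof(*hdr) + (size_t)hdr->len;
	return 1;
}
```

A caller would start with off = 0 and loop while the helper returns 1, dispatching on hdr.type; because each record carries its own 'id', records from different owners can be told apart even if they are interleaved in the stream.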
-- 
1.5.4.3
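
The per-VMA memory dump layout from the documentation (all page addresses first, then all page contents, npages entries each) implies simple offset arithmetic. The sketch below illustrates it under two assumptions that the patch text does not state: address entries are taken to be 64 bits wide and pages 4096 bytes. The cr_ex_* names are hypothetical and exist only for this illustration.

```c
#include <stddef.h>
#include <stdint.h>

#define CR_EX_PAGE_SIZE 4096u	/* assumed page size, illustration only */

/* Offset, within one VMA's data area, of the address entry for page i. */
size_t cr_ex_addr_offset(size_t i)
{
	return i * sizeof(uint64_t);	/* assumes 64-bit address entries */
}

/* Offset of the contents of page i: all npages addresses come first. */
size_t cr_ex_page_offset(size_t npages, size_t i)
{
	return npages * sizeof(uint64_t) + i * CR_EX_PAGE_SIZE;
}

/* Total size of the data area for a VMA with npages dumped pages. */
size_t cr_ex_data_size(size_t npages)
{
	return npages * (sizeof(uint64_t) + CR_EX_PAGE_SIZE);
}
```

For the anonymous VMA in the illustration (three dumped pages), addr3..addr5 would occupy the first 24 bytes of the data area under these assumptions, and page3 would start right after them.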

