Tuesday, January 11, 2011

Failsafe and manual management of kernels on EC2

On 10.10 and later, Ubuntu images use the Amazon provided pv-grub to load kernels that live inside the image. The selected kernel is controlled by /boot/grub/menu.lst. This makes it possible to install a new kernel via 'dpkg -i' or 'apt-get dist-upgrade' and then reboot into the new kernel.

The file /boot/grub/menu.lst is managed by grub-legacy-ec2 package. The program 'update-grub-legacy-ec2' is called on installation of Ubuntu kernels through files that are installed in /etc/kernel/postinst.d and /etc/kernel/postrm.d.

By default, as with other Ubuntu systems, the kernel with the highest revision will be the behavior will be automatically selected as the default, and selected on the next boot. Because EC2 images is read-only, you may want to manually manage your selected kernel. This can be done by modifying /boot/grub/menu.lst to use the grub "fallback" code.

I'll launch an instance of the current released maverick (ami-ccf405a5 in us-east-1 ubuntu-maverick-10.10-i386-server-20101225). Then, on the instance, create hard links to the default kernel and ramdisk so even on apt removal, they'll stick around, and then change /boot/grub/menu.lst to use those kernels.

sudo ln /boot/vmlinuz-$(uname -r) /boot/vmlinuz-failsafe
sudo ln /boot/initrd.img-$(uname -r) /boot/initrd.img-failsafe

Then, copy the existing entry in /boot/grub/menu.lst to a new entry above the automatic section. I've changed/added:

# You can specify 'saved' instead of a number. In this case, the default entry
# is the entry saved with the command 'savedefault'.
# WARNING: If you are using dmraid do not use 'savedefault' or your
# array will desync and will not let you boot your system.
default saved


# Put static boot stanzas before and/or after AUTOMAGIC KERNEL LIST

# this is the failsafe kernel, it will be '0' as it is the first
# entry in this file
title Failsafe kernel
root (hd0)
kernel /boot/vmlinuz-failsafe root=LABEL=uec-rootfs ro console=hvc0 FAILSAFE
initrd /boot/initrd.img-failsafe

title Ubuntu 10.10, kernel 2.6.35-24-virtual
root (hd0)
kernel /boot/vmlinuz-2.6.35-24-virtual root=LABEL=uec-rootfs ro console=hvc0 TEST-KERNEL
initrd /boot/initrd.img-2.6.35-24-virtual
savedefault 0

And then update grub to store that the first kernel is the 'saved', which for grub 1 (or 0.97) modifies /boot/grub/default.

sudo grub-set-default 0
sudo reboot

Now, a reboot will boot into the failsafe kernel (which we can verify by checking /proc/cmdline) and see 'FAILSAFE'. Then, to test our "TEST-KERNEL", run:

sudo grub-set-default 1
sudo reboot

After this reboot, the system come up into "TEST-KERNEL" (per /proc/cmdline) but /boot/grub/default will contain '0', indicating that on subsequent boot, the FAILSAFE will run. In this way, if your kernel failed to boot all the way up, you can then just issue:

euca-reboot-instances i-15b77779

And you'll boot back into the FAILSAFE kernel.

The above basically allows you to manually manage your kernels while letting grub-legacy-ec2 still write entries to /boot/grub/menu.lst.

I chose to use hardlinks for the 'failsafe' kernels, so that even on dpkg removal, the files would still exist. Because the 10.10 Ubuntu kernels have the EC2 network and disk drivers built in, you'll still be able to boot even after a dpkg removal of the failsafe kernel or an errant 'rm -Rf /lib/modules/2*'

No comments:

Post a Comment