How to Set Up a ZFS Root Pool Mirror in Oracle Solaris 11 Express
One of the first things to do when setting up
a new system is to mirror your boot disk. This protects you against system disk
failures: If one of the two mirrored boot disks fails, the system can continue
running from the other disk without downtime. You can even boot from the
surviving mirror half and continue using the system normally, until you have
replaced the failed half.
At today's low prices for boot-sized drives, mirroring is a no-brainer for increasing your system's availability, even on a home server.
Unfortunately, the steps you need to complete before you're running off a mirrored ZFS root pool are not yet a no-brainer. While there is a piece of documentation entitled How to Configure a Mirrored Root Pool, it only covers how to add the second disk to your root pool; it does not cover how to prepare and lay out a fresh disk so Solaris will accept it as a bootable second half of an rpool mirror. That, for historical reasons, is slightly more complicated than just saying zpool attach.
Over the weekend, I sat down and played a bit with the current Oracle Solaris 11 Express release in VirtualBox and tested, re-tested and investigated all the steps currently necessary to get your root pool mirrored, including some common issues and variations.
Here's a complete, step-by-step guide with
background information on how to mirror your ZFS root pool:
The Basic Plan
After a standard
install of Oracle Solaris 11 Express, we'll have our system disk configured as
a ZFS root pool called rpool. The rpool disk is set up as an fdisk partition with some SMI partitions ("slices") on top. The fdisk part exists for compatibility with other OSes; the SMI slicing reserves some room on the physical disk for the boot blocks and GRUB.
This is different from a regular ZFS data disk
which would normally use EFI (not fdisk) labels and no further partitioning.
So here's the basic plan for turning a fresh disk into an rpool mirror:
1. Find out what disks you have and what their device names are.
2. Create a Solaris fdisk partition spanning the whole disk (x86 only).
3. Replicate the first disk's slice layout using an SMI label.
4. Attach the second disk to the root pool with zpool attach.
5. Install GRUB onto the second disk (x86 only).
6. Make sure the system can boot off the second disk.
You see, the official documentation only covers step 4 above, and lets you guess about the other steps. Here's the full sequence in more detail:
Step 1: Find Out About Your Disks
Hard drives in Solaris show up in the /dev/rdsk directory as raw devices, and the same drives with the same names show up again in /dev/dsk. The former are used for raw partitioning and other low-level operations, while the latter are the standard way to access disks from a day-to-day point of view, such as setting up ZFS pools.
Here's a typical device name: c0t0d0s0. The naming convention is simple: Controller 0, SCSI target 0, disk 0 and Solaris slice 0. Of course, the digits may vary and even become multi-digit in larger systems, such as c12t18d5s8, but the convention is always the same.
PATA systems omit the t0 part, because
PATA doesn't support "targets" like SCSI or SATA does. This will give
you devices like: c0d0s0.
Sometimes, when dealing with DOS partitions, you'll see a p0 part instead of the (Solaris-specific) s0 piece. This simply refers to DOS partition 0 (or any other DOS primary partition).
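If you want to see how this plays out on a real system, just list the device nodes for one disk; here's a quick look, using the second disk from the transcripts below (substitute your own device name):
admin@s11test:~$ ls /dev/dsk/c7t1d0*
On x86 you'll typically get the s0 through s15 slice nodes plus the p0 through p4 fdisk partition nodes for each disk; SPARC systems stick to the s slices.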
So before we do anything, we need to figure
out what disks we are dealing with, what device names they have and if they're
used somewhere else already. Two commands will help us here:
· zpool status will print information about running zpools. This should tell you what the device name for your existing root pool ("rpool") is. On my system, I get this:

admin@s11test:~$ zpool status
  pool: rpool
 state: ONLINE
  scan: resilvered 2.61G in 0h16m with 0 errors on Sun Mar 13 21:01:06 2011
config:

        NAME        STATE     READ WRITE CKSUM
        rpool       ONLINE       0     0     0
          c7t0d0s0  ONLINE       0     0     0

errors: No known data errors
This means my rpool sits on controller 7,
target 0, disk 0 and slice 0.
· The easiest, interactive way of figuring out all of your disks in the system would be the format command, but we don't want to spend time going through menus and needless interactivity. Here's a less common, but effective option: cfgadm. This command will tell you what disks we have in the system:

admin@s11test:~$ cfgadm -s "select=type(disk)"
Ap_Id                          Type         Receptacle   Occupant     Condition
sata0/0::dsk/c7t0d0            disk         connected    configured   ok
sata0/1::dsk/c7t1d0            disk         connected    configured   ok
Not surprisingly, the
second disk in our system therefore sits on target 1 of the same controller.
Since cfgadm only knows about hardware, not
(software) slices, it omits any "s" part.
Now we know what disks we have, which of them
is used for rpool already, and which ones are available as a second mirror half
for our rpool.
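By the way, if you like format's view of the world but not its menus, there's a well-known little trick: feed it an empty stdin, and it will print its disk list and exit immediately (a sketch, not an officially documented mode):
admin@s11test:~$ format < /dev/null
This prints the same AVAILABLE DISK SELECTIONS list you'll see in the transcripts below, then drops you straight back to the shell.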
Step 2: Create a Solaris fdisk Partition (x86 Only)
Solaris disk partitioning works differently in the SPARC and in the x86 world:
· SPARC: Disks are labeled using special, Solaris-specific "SMI labels". No need for special boot magic or GRUB etc. here, as the SPARC systems' OpenBoot PROM is intelligent enough to handle the boot process by itself.
· x86: For reasons of compatibility with the rest of the x86 world, Solaris uses a primary fdisk partition labeled Solaris2, so it can coexist with other OSes. Solaris then treats its fdisk partition as if it were the whole disk and proceeds by using an SMI label on top of that to further slice the disk into smaller partitions. These are then called "slices". The boot process uses GRUB, again for compatibility reasons, with a special module that is capable of booting off a ZFS root pool.
So for x86, the first thing to do now is to
make sure that the disk has an fdisk partition of type "Solaris2"
that spans the whole disk. For SPARC, we can skip this step.
fdisk doesn't know about Solaris slices; it only cares about DOS-style partitions. Therefore, device names are different when dealing with fdisk: We'll refer to the first partition now and call it "p0". This works even if there are no partitions defined on the disk; it's just a way to address the disk in DOS partition mode.
Again, we could use fdisk in interactive
mode and wiggle ourselves through the menus, but I prefer the command line way.
Here's how to check if your disk already has some kind of DOS partitioning:
admin@s11test:~# fdisk -W - c7t1d0p0
* /dev/rdsk/c7t1d0p0 default fdisk table
* Dimensions:
*    512 bytes/sector
*     63 sectors/track
*    255 tracks/cylinder
*   2088 cylinders
*
* systid:
*    1: DOSOS12
*    2: PCIXOS
*    4: DOSOS16
(lots of id specifications omitted...)
*  191: SUNIXOS2
*  238: EFI_PMBR
*  239: EFI_FS
*
* Id    Act  Bhead  Bsect  Bcyl   Ehead  Esect  Ecyl   Rsect    Numsect
  0     0    0      0      0      0      0      0      0        0
  0     0    0      0      0      0      0      0      0        0
  0     0    0      0      0      0      0      0      0        0
  0     0    0      0      0      0      0      0      0        0
The second "-" tells the -W option to write to standard output instead of to a file. SUNIXOS2 (191) really means SOLARIS2; this is the partition type that we'll create soon.
Here's how to apply a default Solaris fdisk partition to a disk in one simple step:
admin@s11test:~# fdisk -B c7t1d0p0
That's it. Be careful and double-check that you got the device name right! If you're unsure, you can still use the interactive version (fdisk c7t1d0p0) and work through the menus by hand.
Now let's verify that we got what we wanted:
admin@s11test:~# fdisk -W - c7t1d0p0
* /dev/rdsk/c7t1d0p0 default fdisk table
* Dimensions:
*    512 bytes/sector
*     63 sectors/track
*    255 tracks/cylinder
*   2088 cylinders
*
* systid:
*    1: DOSOS12
*    2: PCIXOS
*    4: DOSOS16
(stuff omitted...)
*  191: SUNIXOS2
*  238: EFI_PMBR
*  239: EFI_FS
*
* Id    Act  Bhead  Bsect  Bcyl   Ehead  Esect  Ecyl   Rsect    Numsect
  191   128  0      1      1      254    63     1023   16065    33527655
  0     0    0      0      0      0      0      0      0        0
  0     0    0      0      0      0      0      0      0        0
  0     0    0      0      0      0      0      0      0        0
Here's the fdisk partition we wanted. Its type is 191, which equals SOLARIS2 (you can double-check using the interactive version of fdisk), and it spans the whole disk.
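If you'd like to verify the partition type non-interactively, you can pull the Id column of the first table row out of the -W output. A minimal sketch, assuming the output format shown above (comment lines start with *, data rows have ten fields):
admin@s11test:~# fdisk -W - c7t1d0p0 | awk '$1 !~ /^\*/ && NF == 10 { print $1; exit }'
For the disk we just partitioned, this should print 191.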
Step 3: Replicate the Slice Layout (SMI Label)
Before ZFS can do its magic, we need to tell it where on the disk the rpool's mirror data is supposed to live, and what blocks are off-limits because they're supposed to host the GRUB bootloader. This is done using a Solaris SMI label that breaks down our Solaris2 fdisk partition into Solaris "slices".
Again, there's an interactive way using the format command, which involves many steps (print out the original disk's layout, set it up step by step on the second disk, write the label), but we want to be cool here, so we'll do it in a single step, again:
admin@s11test:~# prtvtoc /dev/rdsk/c7t0d0s0 | fmthard -s - /dev/rdsk/c7t1d0s0
fmthard:  New volume table of contents now in place.
That's it. You can check what the new Solaris-style partitioning looks like on the second disk and compare it to the first one. Here's my first disk:
admin@s11test:~# format
Searching for disks...done

AVAILABLE DISK SELECTIONS:
       0. c7t0d0 <ATA    -VBOX HARDDISK  -1.0   cyl 2085 alt 2 hd 255 sec 63>
          /pci@0,0/pci8086,2829@d/disk@0,0
       1. c7t1d0 <ATA    -VBOX HARDDISK  -1.0   cyl 2085 alt 2 hd 255 sec 63>
          /pci@0,0/pci8086,2829@d/disk@1,0
Specify disk (enter its number): 0
selecting c7t0d0
[disk formatted]
/dev/dsk/c7t0d0s0 is part of active ZFS pool rpool. Please see zpool(1M).

FORMAT MENU:
        disk       - select a disk
        type       - select (define) a disk type
        partition  - select (define) a partition table
        current    - describe the current disk
        format     - format and analyze the disk
        fdisk      - run the fdisk program
        repair     - repair a defective sector
        label      - write label to the disk
        analyze    - surface analysis
        defect     - defect list management
        backup     - search for backup labels
        verify     - read and display labels
        save       - save new disk/partition definitions
        inquiry    - show vendor, product and revision
        volname    - set 8-character volume name
        !<cmd>     - execute <cmd>, then return
        quit
format> p

PARTITION MENU:
        0      - change `0' partition
        1      - change `1' partition
        2      - change `2' partition
        3      - change `3' partition
        4      - change `4' partition
        5      - change `5' partition
        6      - change `6' partition
        7      - change `7' partition
        select - select a predefined table
        modify - modify a predefined partition table
        name   - name the current table
        print  - display the current table
        label  - write partition map and label to the disk
        !<cmd> - execute <cmd>, then return
        quit
partition> p
Current partition table (original):
Total disk cylinders available: 2085 + 2 (reserved cylinders)

Part      Tag    Flag     Cylinders        Size            Blocks
  0       root    wm       1 - 2084       15.96GB    (2084/0/0) 33479460
  1 unassigned    wm       0               0         (0/0/0)           0
  2     backup    wu       0 - 2084       15.97GB    (2085/0/0) 33495525
  3 unassigned    wm       0               0         (0/0/0)           0
  4 unassigned    wm       0               0         (0/0/0)           0
  5 unassigned    wm       0               0         (0/0/0)           0
  6 unassigned    wm       0               0         (0/0/0)           0
  7 unassigned    wm       0               0         (0/0/0)           0
  8       boot    wu       0 -    0        7.84MB    (1/0/0)       16065
  9 unassigned    wm       0               0         (0/0/0)           0

partition> q
And here's my second disk:
admin@s11test:~# format
Searching for disks...done

AVAILABLE DISK SELECTIONS:
       0. c7t0d0 <ATA    -VBOX HARDDISK  -1.0   cyl 2085 alt 2 hd 255 sec 63>
          /pci@0,0/pci8086,2829@d/disk@0,0
       1. c7t1d0 <ATA    -VBOX HARDDISK  -1.0   cyl 2085 alt 2 hd 255 sec 63>
          /pci@0,0/pci8086,2829@d/disk@1,0
Specify disk (enter its number): 1
selecting c7t1d0
[disk formatted]

FORMAT MENU:
        disk       - select a disk
        type       - select (define) a disk type
        partition  - select (define) a partition table
        current    - describe the current disk
        format     - format and analyze the disk
        fdisk      - run the fdisk program
        repair     - repair a defective sector
        label      - write label to the disk
        analyze    - surface analysis
        defect     - defect list management
        backup     - search for backup labels
        verify     - read and display labels
        save       - save new disk/partition definitions
        inquiry    - show vendor, product and revision
        volname    - set 8-character volume name
        !<cmd>     - execute <cmd>, then return
        quit
format> p

PARTITION MENU:
        0      - change `0' partition
        1      - change `1' partition
        2      - change `2' partition
        3      - change `3' partition
        4      - change `4' partition
        5      - change `5' partition
        6      - change `6' partition
        7      - change `7' partition
        select - select a predefined table
        modify - modify a predefined partition table
        name   - name the current table
        print  - display the current table
        label  - write partition map and label to the disk
        !<cmd> - execute <cmd>, then return
        quit
partition> p
Current partition table (original):
Total disk cylinders available: 2085 + 2 (reserved cylinders)

Part      Tag    Flag     Cylinders        Size            Blocks
  0       root    wm       1 - 2084       15.96GB    (2084/0/0) 33479460
  1 unassigned    wu       0               0         (0/0/0)           0
  2     backup    wu       0 - 2084       15.97GB    (2085/0/0) 33495525
  3 unassigned    wu       0               0         (0/0/0)           0
  4 unassigned    wu       0               0         (0/0/0)           0
  5 unassigned    wu       0               0         (0/0/0)           0
  6 unassigned    wu       0               0         (0/0/0)           0
  7 unassigned    wu       0               0         (0/0/0)           0
  8       boot    wu       0 -    0        7.84MB    (1/0/0)       16065
  9 unassigned    wu       0               0         (0/0/0)           0

partition> q
Note: This is a typical x86 layout. It will likely look different on SPARC systems, as they don't use a special slice to host the boot blocks. But the basic idea of how to replicate the partition table is the same.
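If you'd rather skip the format menus entirely, a quicker sanity check is to diff the two VTOCs directly. Here's a sketch, assuming a bash-compatible shell with process substitution (the comment header printed by prtvtoc contains the device name, so we filter comments out first):
admin@s11test:~# diff <(prtvtoc /dev/rdsk/c7t0d0s0 | grep -v '^\*') \
                      <(prtvtoc /dev/rdsk/c7t1d0s0 | grep -v '^\*')
No output means the two slice tables match.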
Great! We're almost there.
Step 4: Attach the Second Disk to the Root Pool
Now that our second disk is prepared, the rest is quite easy. From now on, we can just follow the standard Solaris documentation for mirroring the root pool.
The right command to use here is zpool attach. Notice that this is different from zpool add: By attaching a disk to an existing disk, we mean attaching it to its mirror (you can attach more than one disk to a mirror). By adding a disk to a pool, we mean expanding the pool size in the sense of striping in another disk (or set of mirrored/RAID-Z disks). For mirroring, zpool attach is the way to go.
Remember? Slice 0 is the one we reserved for the rpool's mirrored data:
admin@s11test:~# zpool attach rpool c7t0d0s0 c7t1d0s0
invalid vdev specification
use '-f' to override the following errors:
/dev/dsk/c7t1d0s0 overlaps with /dev/dsk/c7t1d0s2
Wait, what happened? ZFS is complaining that two slices overlap. If ZFS uses slice 0 and something else uses slice 2, that something could overwrite some of ZFS's data!
In this particular case, ZFS's worries are unfounded: Slice 2 by convention spans the whole disk and is tagged "backup" (see the output of format above), so traditional disk backup solutions have a way of easily performing raw backups of whole disks. Today it's hardly used, but the convention remains for historical reasons.
Therefore, we can safely override this little nit and get our mirror done:
admin@s11test:~# zpool attach -f rpool c7t0d0s0 c7t1d0s0
Make sure to wait until resilver is done before rebooting.
admin@s11test:~# zpool status
  pool: rpool
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Tue Mar 15 18:17:32 2011
    13.9M scanned out of 2.72G at 594K/s, 1h19m to go
    13.3M resilvered, 0.50% done
config:

        NAME          STATE     READ WRITE CKSUM
        rpool         ONLINE       0     0     0
          mirror-0    ONLINE       0     0     0
            c7t0d0s0  ONLINE       0     0     0
            c7t1d0s0  ONLINE       0     0     0  (resilvering)

errors: No known data errors
Great! Everything's working fine now. Before
we make the second disk bootable, we should really wait until it has finished
resilvering. We don't want to boot into a half-baked root pool, do we?
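If you'd rather not babysit the pool, a small shell loop can do the waiting for you; here's a minimal sketch, assuming the status wording shown in the output above:
while zpool status rpool | grep 'resilver in progress' > /dev/null; do
    sleep 60    # poll once a minute until the resilver message disappears
done
Once the loop exits, you're ready for the next step.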
Here's the end state, freshly resilvered:
admin@s11test:~# zpool status
  pool: rpool
 state: ONLINE
  scan: resilvered 2.72G in 0h15m with 0 errors on Tue Mar 15 18:33:23 2011
config:

        NAME          STATE     READ WRITE CKSUM
        rpool         ONLINE       0     0     0
          mirror-0    ONLINE       0     0     0
            c7t0d0s0  ONLINE       0     0     0
            c7t1d0s0  ONLINE       0     0     0

errors: No known data errors
Step 5: Install GRUB on the Second Disk (x86 Only)
Since x86 systems depend on a bootloader that is installed on disk, we need to perform a final step so that the system can boot off the second disk, too, in case the first one fails completely.
This is a simple install of GRUB onto the second disk. GRUB, ZFS and Solaris will then figure everything out automatically in case you have to boot from the second disk instead of the original one.
admin@s11test:~# installgrub /boot/grub/stage1 /boot/grub/stage2 /dev/rdsk/c7t1d0s0
stage2 written to partition 0, 277 sectors starting at 50 (abs 16115)
stage1 written to partition 0 sector 0 (abs 16065)
Since we're dealing with a low-level operation (boot blocks etc.), we address the device using its raw device path. The s0 part is still needed so GRUB knows which slice to boot from.
Almost done!
Step 6: Make Sure the System Can Boot Off the Second Disk
This is one of the little things that often gets overlooked but then becomes critical in case of a real failure: The system crashes because the first disk is completely borked, or you force a reboot and the first disk fails to come up again. How does the system know it's supposed to boot from the second half of the mirror?
Managing the Solaris boot behavior and its mechanisms is described thoroughly in the documentation:
· SPARC: Here you usually set up aliases for your bootable mirror halves in the OpenBoot PROM, then assign them to the boot-device variable as a list of possible devices to boot from (e.g.: "disk1 disk2 net"; see the eeprom sketch after this list). Check out the SPARC Enterprise Servers section of the Oracle System Documentation area, find the administration guide for your particular system, then consult the sections on booting.
· x86: Most BIOSes have a section where you can configure what disks to boot from, in what order, and what to do if a disk is not bootable. Here's a list of current Oracle Sun x86 system documentation. Again, look for the boot section of your system's admin manual.
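On SPARC, you don't even have to drop into the OBP to set this up: the eeprom command can read and write OpenBoot variables from the running system. A hedged sketch (disk1 and disk2 are assumed devaliases that you've already pointed at your two mirror halves; yours may differ):
admin@s11test:~# eeprom boot-device="disk1 disk2"
admin@s11test:~# eeprom boot-device
boot-device=disk1 disk2
The second command simply reads the variable back to confirm the new boot order.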
Play With It! And Check Out Some Man Pages!
How do you know if this really works? How do
you develop confidence for something critical like booting from a second mirror
half, surviving a disk disaster, etc.?
Here's the easiest option: Use VirtualBox to set up a test system like I did. It comes with ready-to-use suggestions for a standard Solaris machine. Then configure a second virtual disk and play with the commands above. Set up a mirrored rpool, bring down the machine, unconfigure the original disk, then see if it can boot from the second mirror half, and so on.
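In case you want to script that test setup, the second virtual disk can be created and attached from the command line, too. A sketch, assuming a VM named s11test with a SATA controller called "SATA Controller" (both names are guesses; check your own VM's configuration):
VBoxManage createhd --filename s11test-disk2.vdi --size 16384
VBoxManage storageattach s11test --storagectl "SATA Controller" \
    --port 1 --device 0 --type hdd --medium s11test-disk2.vdi
The --size argument is in MB, so 16384 roughly matches the 16 GB first disk in the transcripts above.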
BTW: I did not find a way to tell VirtualBox what disk to boot from (it only allows you to specify what type of device to boot from, not which individual disk), so I resorted to just pulling out (figuratively speaking) the original boot disk, then testing whether the system boots from the mirrored one.
In short: Play, experiment, break it, etc.,
until you know what's going on and are confident to make it happen on your real
system.
Finally, here's a list of useful man pages to check out, including links:
· cfgadm(1M): Query and modify your system's hardware configuration.
· fdisk(1M): Manipulate DOS-style partitions.
· prtvtoc(1M): Print a disk's partitioning information in a machine-readable format.
· fmthard(1M): Write a partition table to a disk.
· format(1M): Interactive formatting utility.
· zpool(1M): Manipulate ZFS pools.
· installgrub(1M): Install the GRUB boot loader.
I hope this article has made rpool mirroring a
little easier for you from now on!
Your Take
There are endless variations to the above, and
sometimes I've been more verbose, or more simplified for the sake of
ease-of-use. I'm sure there are many different ways to achieve the same result,
so here's your chance to share your favorite mirrored rpool tricks!
What's your routine for mirroring rpools? Did you find other good tutorials to share? ('cause I didn't, at least nothing obvious on Google...)
Feel free to write a comment!