Sunday 24 March 2013

Solaris ODS (Online Disk Suite)


Sun's volume manager has many names:
  • Online Disk Suite (ODS) - the name used throughout this document
  • Solstice Disk Suite (SDS)
  • Solaris Logical Volume Manager (Solaris LVM)
ODS is a disk storage management solution which offers:
  • High Availability
  • Improved Performance
  • Simplified disk management
Raid Levels
The disk management software offers the common raid levels
raid 0 (Striping/Concatenation) A number of disks are combined to give the appearance of one very large disk, either by striping data across them or by concatenating them end to end.
Advantages
   Improved performance
   Can Create very large Volumes
Disadvantages
   Not highly available (if one disk fails, the volume fails)
raid 1 (Mirroring) A single disk is mirrored by another disk; if one disk fails the system is unaffected as it can use the mirror.
Advantages
   Improved performance
   Highly Available (if one disk fails the mirror takes over)

Disadvantages
   Expensive (requires double the number of disks)
raid 5 (Striping with parity) RAID stands for Redundant Array of Inexpensive Disks. Data is striped with parity across 3 or more disks; if one of the disks fails, the data on the failed disk is reconstructed from the parity information.
Advantages
   Improved performance (reads only)
   Not expensive

Disadvantages
   Slow write operations (caused by having to calculate and write the parity)
Metadevice and Metadevice Database
A metadevice is a name for a group of physical slices that appear as a single logical device (virtual device). The maximum default number of metadevices is 128 but this can be adjusted by editing /kernel/drv/md.conf and changing the nmd parameter (1024 maximum).
A metadevice database (otherwise known as the state database) stores information about the ODS configuration and tracks changes made to it; this database is what makes the ODS configuration persistent across reboots. The database has multiple copies known as replicas (a minimum of 3 is required), which ensures that the database is always valid. You should spread the replicas across different disks in case a disk fails, thus reducing single points of failure. The database is never more than 10MB and is generally stored on a single slice of each disk.
ODS uses a majority consensus algorithm to determine whether a replica is corrupted. When changes are made each replica is updated in turn, in case a power failure happens during the update, so when the system is started the state held by the majority of the replicas is used. The algorithm guarantees the following:
  • The system will stay running as long as at least half of the state database replicas are available
  • The system will panic if fewer than half of the state database replicas are available
  • The system will not reboot unless a majority (half plus one) of the total state database replicas is available
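The three rules can be sketched as a small shell function. This is only an illustration of the arithmetic, not something ODS provides; the function name replica_quorum is my own.

```shell
# replica_quorum TOTAL AVAILABLE
# Prints how the replica rules apply to a given replica count.
# (Hypothetical helper, it does not talk to ODS at all.)
replica_quorum() {
    total=$1
    available=$2
    # A running system carries on with at least half of the replicas.
    if [ $((available * 2)) -ge "$total" ]; then
        running="stays running"
    else
        running="panics"
    fi
    # A reboot needs a strict majority (half plus one).
    if [ $((available * 2)) -gt "$total" ]; then
        boot="can reboot"
    else
        boot="cannot reboot"
    fi
    echo "$running, $boot"
}

replica_quorum 4 2   # exactly half available
replica_quorum 3 2   # majority available
replica_quorum 3 1   # fewer than half available
```

With 4 replicas split over two disks, losing one disk leaves exactly half, so the system stays up but will not reboot cleanly; that is one reason to spread replicas over three or more disks.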
Hot Spares
ODS uses a hot spare pool, which is a collection of disk slices reserved by ODS that will automatically be used when a disk slice fails. Hot spares provide increased data protection, however I have very rarely used them as I normally replace a failed disk pretty quickly. See the Sun Documentation for detailed information on hot spares.
Growing/Shrinking Filesystem
Expanding a filesystem under ODS is possible, although not without problems. Shrinking a filesystem under ODS is not possible; normally you create a new, smaller filesystem, copy the data across and then cut over to the new filesystem.
This is one area where the Veritas volume manager excels, as it is very easy to grow and shrink a filesystem.
Filesystem Logging
ODS uses translogs to log changes made to the filesystem; if the system crashes the log is replayed, thus avoiding an fsck (which can take a long time depending on the size of the filesystem). However, newer versions of Solaris offer UFS logging. Here is a list of advantages/disadvantages of both:
ODS logging
  • Can be mirrored and therefore survive better from disk failures
  • Does not support root filesystem
UFS logging
  • Simple to implement (just update /etc/vfstab and add logging option)
  • Does not require its own slice
  • Supports root filesystem
  • Tighter connection to the unix kernel which results in less overhead
My preference is to use UFS logging, and since its introduction in Solaris 7 I have only ever used this.
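Since UFS logging is just a mount option in /etc/vfstab, it is easy to check which filesystems are missing it. Below is a minimal sketch, assuming the usual seven-field vfstab layout with the mount options in the last field; the function name and the sample entries are made up for illustration.

```shell
# ufs_without_logging: reads vfstab-format text on stdin and prints the mount
# point of every ufs entry whose mount options do not include "logging".
ufs_without_logging() {
    awk '
        /^[ \t]*#/ || NF < 7 { next }      # skip comments and short lines
        $4 == "ufs" {
            n = split($7, opts, ",")       # field 7 holds the mount options
            found = 0
            for (i = 1; i <= n; i++)
                if (opts[i] == "logging") found = 1
            if (!found) print $3           # field 3 is the mount point
        }
    '
}

# Sample entries (hypothetical devices), the second one has no logging option
ufs_without_logging <<'EOF'
/dev/md/dsk/d0   /dev/md/rdsk/d0   /        ufs  1  no   logging
/dev/md/dsk/d10  /dev/md/rdsk/d10  /export  ufs  2  yes  -
EOF
```

On a real system you would feed it the live file, for example ufs_without_logging < /etc/vfstab.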
Naming Convention
There is no set standard on what you call your metadevices but I have my own convention and undoubtedly there are many others.
The main metadevice (raid 0, 1 or 5), which is where the filesystem will be placed, will always end in 0, so for example d0, d10, d20, d30, d40, etc
A sub-mirror will end in either a 1 (first sub-mirror) or a 2 (second sub-mirror), so for example d1 and d2, d11 and d12, d21 and d22, etc
A raid slice will end in 1..n (n depends on the number of disks), so for example d1 & d2 & d3, d21 & d22 & d23, etc
So for example:
  • Mirrored metadevice - I would create the mirror metadevice as d0 and have two sub-mirrors called d1, d2
  • Mirrored metadevice - I would create the mirror metadevice as d10 and have two sub-mirrors called d11, d12
  • Raid 5 device - I would create the raid device as d20 and have the raid-slices called d21, d22, d23
This is my own preference and you are welcome to have your own naming convention.
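The convention is simple enough to express as a shell helper, which can be handy when scripting many metadevices. This is only a sketch of my own naming scheme; md_names is a made-up function and it only works for up to 9 components.

```shell
# md_names BASE COUNT
# BASE is the main metadevice and must end in 0 (d0, d10, d20, ...).
# Prints the main metadevice followed by COUNT component names.
md_names() {
    base=$1
    count=$2
    case $base in
        d*0) ;;
        *) echo "main metadevice name must end in 0: $base" >&2; return 1 ;;
    esac
    prefix=${base%0}              # d10 -> d1, d0 -> d
    names=$base
    i=1
    while [ "$i" -le "$count" ]; do
        names="$names $prefix$i"  # d1 + 1 -> d11, d + 1 -> d1
        i=$((i + 1))
    done
    echo "$names"
}

md_names d0 2     # mirror with two sub-mirrors: d0 d1 d2
md_names d20 3    # raid 5 with three slices:    d20 d21 d22 d23
```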
File Locations
ODS uses a number of different files, below are the most useful ones:
/kernel/drv/md.conf This file is the ODS device driver configuration file. The only modifiable field is 'nmd', which represents the number of metadevices supported by the driver. If you change this file you must reboot the system for the changes to take effect.
In a configuration that uses a lot of devices I increase this to the maximum 1024.
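Changing nmd is a one-word edit, which can be sketched as a sed filter. The sample line below is how I remember the md.conf entry looking, so treat the layout as an assumption and check your own file first; set_nmd is a made-up helper name.

```shell
# set_nmd VALUE: filter that rewrites the nmd=<number> setting read from stdin.
set_nmd() {
    sed "s/nmd=[0-9][0-9]*/nmd=$1/"
}

# Sample line in the style of md.conf (assumed layout, not copied from a live system)
echo 'name="md" parent="pseudo" nmd=128 md_nsets=4;' | set_nmd 1024
```

In practice I would run the filter against a backup copy of /kernel/drv/md.conf, compare the result with the original, and only then put it in place and reboot.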
/etc/lvm/mddb.cf This file keeps track of the metadevice state database replicas, each state database replica has a unique entry in this file. You can display the file using 'cat' but do not edit it manually.
/etc/lvm/md.tab This file is used by the metainit, metadb and metahs commands. Each entry contains the rest of the command line for use by metainit, metadb and metahs.
This file can be edited manually or populated from the output of the command 'metastat -p'
/etc/lvm/md.cf This file is a copy of the md.tab file and is used for disaster recovery purposes, it is automatically updated.
Meta Commands
I am not going to explain in detail how ODS works but simply supply a list of commands that I use regularly. If you want a more detailed explanation then I suggest you refer to the Sun Documentation.
Metadatabase Commands
Create metadb -a -f -c 3 c0t0d0s6 c1t0d0s6 c2t0d0s6
-a - attach metadatabase to device
-f - create the initial metadatabase and force deletion of replicas below the minimum of one
-c - specifies the number of replicas to be placed on each device
Add metadb -a -c 3 c3t0d0s6
Remove metadb -d c3t0d0s6
Display metadb -i
Repairing # The only way to repair a replica is to delete all the replicas on the device and
# recreate them
# First confirm that the replicas are corrupted and you have the device name
metadb -i
# Delete the corrupted replicas and reboot
metadb -d c3t0d0s6
reboot
# Now recreate them making sure you have 3 copies
metadb -a -c 3 c3t0d0s6
metadb -i
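To spot the corrupted replicas in the first step without reading the flags by eye, the metadb output can be filtered. This is a sketch under two assumptions of mine: that each replica line ends with first blk, block count and a /dev/... device path, and that error states are the capital-letter flags W, M, D, F, S and R from the metadb -i legend. The function name bad_replicas and the sample output are made up.

```shell
# bad_replicas: reads "metadb" output on stdin and prints the device of every
# replica line that carries one of the capital-letter (error) status flags.
bad_replicas() {
    awk '
        $NF ~ /^\/dev\// {                   # replica lines end in a device path
            flags = ""
            for (i = 1; i <= NF - 3; i++)    # fields before first blk, block count, device
                flags = flags $i
            if (flags ~ /[WMDFSR]/) print $NF
        }
    '
}

# Sample output (hypothetical), the second replica has a write error flag
bad_replicas <<'EOF'
        flags           first blk       block count
     a m  p  luo        16              8192            /dev/dsk/c0t0d0s6
      W   p  l          16              8192            /dev/dsk/c3t0d0s6
EOF
```

On a live system the equivalent would be metadb -i | bad_replicas; the legend lines printed by -i are ignored because they do not end in a device path.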
Metadevice Commands
Create Concatenated device metainit d0 3 1 c1t0d0s0 1 c2t0d0s0 1 c3t0d0s0
d0 - metadevice name
3 - number of stripes (each stripe here is a single slice, so the three slices are concatenated)
1 c1t0d0s0 - number of slices in the stripe followed by the device name
Create stripe metadevice metainit d0 1 2 c1t0d0s0 c2t0d0s0 -i 64k
d0 - metadevice name
1 - total number of stripes
2 c1t0d0s0 c2t0d0s0 - number of slices to be added to stripe followed by device name
-i 64k - interlace (stripe unit) size
Create Mirror metadevice # first create two metadevices (these will become sub-mirrors)
metainit d11 1 1 c2t0d0s0
metainit d12 1 1 c3t0d0s0

# Then create the mirror metadevice using the metadevice d11 (now called a sub-mirror)
metainit d10 -m d11
# Then attach the second sub-mirror using the metadevice d12 created above to the mirror d10
metattach d10 d12
# Display the mirrored metadevice and confirm that the mirror has completed its resyncing operation,
# this may take a long time depending on the size of the mirror device
metastat d10
Create Raid 5 metadevice # When creating a raid 5 metadevice you need a minimum of 3 slices

metainit d10 -r c1t0d0s0 c2t0d0s0 c3t0d0s0

-r - specifies that it is a raid 5 configuration
Mirroring the root filesystem # Let's say you want to mirror the main disk which has the following filesystems configured, we will be using
# c1t0d0 as the new mirror disk
#
# We hope to achieve the following device configuration
# d0 - mirrored metadevice which contains the root filesystem
#    d1 - a sub-mirror metadevice of d0 (c0t0d0s0)
#    d2 - a sub-mirror metadevice of d0 (c1t0d0s0)
#
# If either c0t0d0s0 or c1t0d0s0 fails the other will take over, thus the system will continue to work as normal
# The first step is to make sure the partition information is the same on the new mirror disk (c1t0d0)
# basically copies the partition information to the new mirror device

prtvtoc /dev/rdsk/c0t0d0s2 | fmthard -s - /dev/rdsk/c1t0d0s2
# Then we want to install the boot block on the new mirror device, this allows you to boot from this disk should
# the other disk fail

installboot /usr/platform/`uname -i`/lib/fs/ufs/bootblk /dev/rdsk/c1t0d0s0
# Create the first metadevice which will become a sub-mirror of d0
# NOTE: although we are using the existing root slice this does not delete any data, we have also
# specified the -f (force) option as the filesystem is mounted

metainit -f d1 1 1 c0t0d0s0
# Create the second metadevice which will become the other sub-mirror of d0, we do not need the -f option (force)
# as there is no filesystem on the new device

metainit d2 1 1 c1t0d0s0
# At this point we have two metadevices d1 (contains root filesystem) and d2 (the new disk)
# we now create the mirror metadevice d0

metainit d0 -m d1
# We now have to update the /etc/system and /etc/vfstab with the new root metadevice information

metaroot d0
# Now reboot the server so that the new mirror metadevice is mounted and the kernel parameters for ODS
# are loaded, we lock the filesystem before rebooting making sure all buffers have been written to the
# filesystem

lockfs -fa
reboot

# Once the server has been rebooted attach the second sub-mirror

metattach d0 d2

# The bigger the root filesystem, the longer the resyncing of the two sub-mirrors will take

metastat d0
# Once the mirrors are sync'ed you have a root filesystem that is highly available, you can now perform
# the same task with other filesystems such as /var, swap, /usr, etc
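For the other filesystems the steps are the same apart from metaroot, which only applies to root; for anything else you edit /etc/vfstab by hand. The sketch below just prints the commands for one slice so they can be reviewed before being run. It assumes the same two disks as above (c0t0d0 and c1t0d0) and a mirror name ending in 0; mirror_plan is a made-up helper name.

```shell
# mirror_plan MIRROR SLICE
# Prints (does not run) the commands to mirror one slice of c0t0d0 onto c1t0d0.
# MIRROR is the mirror metadevice name ending in 0, e.g. d10; SLICE is e.g. s1.
mirror_plan() {
    mirror=$1
    slice=$2
    prefix=${mirror%0}    # d10 -> d1, giving sub-mirrors d11 and d12
    echo "metainit -f ${prefix}1 1 1 c0t0d0${slice}"
    echo "metainit ${prefix}2 1 1 c1t0d0${slice}"
    echo "metainit ${mirror} -m ${prefix}1"
    echo "# edit /etc/vfstab to use /dev/md/dsk/${mirror} and /dev/md/rdsk/${mirror}, then reboot"
    echo "metattach ${mirror} ${prefix}2"
}

# Example: plan a mirror d10 for slice 1
mirror_plan d10 s1
```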
Other ODS Commands
Display Metadatabase metadb -i
Display Metadevices metastat
Display metadevice in md.tab format metastat -p

ODS Errors
A list of some of the more common errors of ODS
"no such file or directory error" when trying to configure a metadevice # update the nmd parameter in the /kernel/drv/md.conf file, I normally increase this to its maximum of 1024.
Metadevice in maintenance state # Disks do go bad from time to time, and there is a difference between a total disk failure and a
# disk with bad data blocks. If you replace the disk and use the same disk slice, the same
# command is used in both cases
# First access the disk via format, if you can then run an analyze on the disk to repair/map out any bad
# data blocks

format -> select disk -> analyze -> read

# If you cannot access the disk via format then physically replace the disk, then run the below command
# to repair ODS, you must do this for each metadevice configured for that disk

metareplace -e d0 c1t0d0s0

# If you want to replace the disk with a different disk then run
metareplace d0 c1t0d0s0 <new device name>
# Again confirm that the disk has re-synced

metastat d0 

