uwe.menges

January 2009

Maybe you have already run into the problem that your data grows and grows while the disk size stays the same, or you don't want to copy your existing data over to a bigger disk. For databases, it's usually easy to specify more data files at different locations. For normal file systems, adding more mount points to existing structures is often undesirable.

One solution is LVM (Logical Volume Manager). There's a competing technology from IBM (EVMS), but as I'm more used to LVM, that's what I'll describe here. Everything below refers to LVM2.

Note:

A word on the units used in this article. I'll use the IEC notation (KiB, MiB, GiB, TiB) for all 2-based units (read more at http://en.wikipedia.org/wiki/Kibibyte), except in pasted command output.

Although the lvs/pvs man page says that capitalized unit abbreviations select SI (metric) units (1 KB == 1000 bytes) instead of 2-based units (1 KiB == 1024 bytes), this only applies to displayed values, i.e. it doesn't matter whether you pass "-L 2g" or "-L 2G" to lvcreate. If you really want to (or have to) be exact when working with sizes, specify the PE count instead of bytes (that is, use "-l" instead of "-L").
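To make PE-based sizing concrete, here is a minimal sketch that converts a size in MiB into the extent count expected by "-l". The helper name size_to_pe is made up for this illustration, and the 4MiB PE size is an assumption (it is the LVM2 default); check your VG's actual PE size before relying on the numbers.

```shell
# Hypothetical helper: convert a size in MiB to a PE count,
# rounding up because LVM allocates whole extents only.
size_to_pe() {
    size_mib=$1
    pe_mib=${2:-4}    # assumed PE size in MiB (LVM2 default)
    echo $(( (size_mib + pe_mib - 1) / pe_mib ))
}

size_to_pe 10240   # 10GiB            -> 2560
size_to_pe 2500    # exact multiple   -> 625
size_to_pe 2501    # rounded up       -> 626
```

The resulting number is what you would hand to lvcreate via "-l" instead of giving a byte size via "-L".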

Basics

LVM introduces three terms: Physical Volumes (PV), Volume Groups (VG) and Logical Volumes (LV). The PVs are aggregated into VGs, which are split up again into LVs. These two gluing mechanisms are the core idea that provides the flexibility.

PVs can be any block device (complete disks, partitions, or even more sophisticated, layered constructs). I recommend using partitions rather than complete disks for practical reasons (e.g. to prevent a not-so-mindful colleague from using the unpartitioned, "free" disk). I'd only make exceptions for devices larger than 2TiB, because the msdos partition table can't cope with partitions larger than 2TiB and I don't like GPT ([s]fdisk doesn't support GPT).

PVs are themselves separated into Physical Extents (PEs), which are the smallest manageable unit known to LVM. Default PE size is 4MiB (with LVM1, one had to raise the PE size because one LV could only take up 65536 PEs, but this limit is gone with LVM2 so there's no need to change the default anymore).

Please note that for data integrity and availability, the PVs of production systems should reside on redundant disks (e.g. RAID5, hardware or software).

VGs can be built from any PVs, but you should minimize the number of VGs. We have already seen a case where one VG per PV was created, thus losing one of the two gluing layers, and with it the flexibility. Also, some operations only work inside a VG and not across VGs. I usually create two VGs, one for the system itself and one for data.

LVs are then used to make space from a VG accessible as a block device. Don't make the LVs too large at first, because once the VG is fully allocated you lose the flexibility of the second glue layer (unless you add more disks, of course). Estimate the amount of data that will reside on the LV and keep the slack space reasonably small. Don't allocate all available VG space yet! While growing LVs and file systems is an easy task and can be done online (i.e. while the file system is mounted), shrinking is more complex and error-prone and can only be done offline. Please see the section "EXT3 Caveats" below for some details about online resizing with EXT3 file systems.

Hands-on

I'll describe some real-life examples here, with commands and options that have proved useful during administration tasks. Be careful when using these commands, because some can be very destructive to your data if used on the wrong devices. (I used a VMware virtual machine for the following examples.)

Creating PVs in bulk

If you have plenty of disks or LUNs, you don't have to spend half a day wading through fdisk's text interface or some graphical UI if you have sfdisk available. Assume we have 10 disks/LUNs sdb..sdk that we want to use solely for LVM, and they don't have any partition tables yet. Following my recommendation above, I don't want to use the complete disks but instead create LVM PV partitions (partition type 8e) on them, and use these. Initializing PVs, thus making them known to LVM, is done with "pvcreate". I'll put both partition creation and PV initialization into one step.

If we're on bash version 3 or later, we can easily do

    # for i in {b..k}; do echo ',,8e' | sfdisk /dev/sd$i && pvcreate /dev/sd${i}1; done

and are already done! :-) If we're on an older bash version, we have to write the sequence explicitly:

    # for i in b c d e f g h i j k; do echo ',,8e' | sfdisk /dev/sd$i &&
    pvcreate /dev/sd${i}1; done

This will create one big type 8e partition covering the whole space on each disk, and initialize it as LVM PV. I usually check if all is in the desired state:

    # pvs
      PV         VG   Fmt  Attr PSize PFree
      /dev/sdb1       lvm2 --   2.00G 2.00G
      /dev/sdc1       lvm2 --   2.00G 2.00G
      /dev/sdd1       lvm2 --   2.00G 2.00G
      /dev/sde1       lvm2 --   2.00G 2.00G
      /dev/sdf1       lvm2 --   2.00G 2.00G
      /dev/sdg1       lvm2 --   2.00G 2.00G
      /dev/sdh1       lvm2 --   2.00G 2.00G
      /dev/sdi1       lvm2 --   2.00G 2.00G
      /dev/sdj1       lvm2 --   2.00G 2.00G
      /dev/sdk1       lvm2 --   2.00G 2.00G
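
Because the partitioning loop above is destructive, it can help to print the commands first and review the device list before running anything. The following is only a sketch: the function name plan_pvs is invented here, and it merely echoes the same sfdisk/pvcreate pipeline used above without touching any device.

```shell
# Hypothetical dry-run helper: print the commands for each disk letter
# instead of executing them, so the device list can be reviewed first.
plan_pvs() {
    for letter in "$@"; do
        printf "echo ',,8e' | sfdisk /dev/sd%s && pvcreate /dev/sd%s1\n" \
            "$letter" "$letter"
    done
}

# Prints one command line per disk; nothing is executed.
plan_pvs b c d
```

Once the printed lines look right, they can be pasted into a root shell or the real loop can be run.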

Creating a VG and LV

So the PV creation went well. Now we can create a VG on these PVs. I'll not use all PVs because I want to play around a bit later:

    # vgcreate vg0 /dev/sd[b-f]1
      Volume group "vg0" successfully created
    # vgs
      VG   #PV #LV #SN Attr   VSize VFree
      vg0    5   0   0 wz--n- 9.98G 9.98G

If you want to see more detail, you can use vgdisplay. But let's create an LV on this VG now:

    # lvcreate -n data -L 2g vg0
      Logical volume "data" created

Let's take a detailed look at the created LV. For this, it's handy to use the -o option for lvs (also for vgs and pvs) to display other (-o ...) or more (when using -o +...) columns (devices in this case):

    # lvs -o +devices
      LV   VG   Attr   LSize Origin Snap%  Move Log Copy%  Devices    
      data vg0  -wi-a- 2.00G                               /dev/sdb1(0)
      data vg0  -wi-a- 2.00G                               /dev/sdc1(0)

As you can see, the 2g LV spans two PVs, although in theory it should fit into one. This is presumably because each PV loses a little space to the partition table and LVM's own metadata, so it holds slightly fewer PEs than a 2GiB LV needs. You can see the overhead by examining the PVs again:

    # pvs -o +used,pe_count,pe_alloc_count
      PV         VG   Fmt  Attr PSize PFree Used  PE  Alloc
      /dev/sdb1  vg0  lvm2 a-   2.00G    0  2.00G 511   511
      /dev/sdc1  vg0  lvm2 a-   2.00G 1.99G 4.00M 511     1
      /dev/sdd1  vg0  lvm2 a-   2.00G 2.00G    0  511     0
      [...]
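
The allocation in this output can be checked with a little shell arithmetic. This is just a sketch using the numbers shown above (511 PEs per PV, 4MiB per PE); the variable names are chosen for this example only.

```shell
pe_mib=4          # default PE size in MiB
pv_pe=511         # PEs per PV, taken from the PE column above
lv_mib=2048       # size of the 2g LV in MiB

lv_pe=$(( lv_mib / pe_mib ))      # PEs the LV needs
spill=$(( lv_pe - pv_pe ))        # PEs that do not fit on the first PV

echo "LV needs $lv_pe PEs, first PV holds $pv_pe, second PV gets $spill"
echo "allocatable space per PV: $(( pv_pe * pe_mib ))MiB"
```
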

So it takes only 1 PE (4MiB) from the second PV. Now let's create a file system on that LV. I'm using ext3 here, which requires some precautions (see below). I used reiser3 before, but since you can't even change the file system's label online, and reiserfsck does terrible things when reiser3 file systems are stored inside files on a reiser3 file system (e.g. if you have Xen virtual machines), I prefer good old ext3.

EXT3 Caveats

The issue is that file systems created by mke2fs from e2fsprogs before 1.39 are not well suited to online resizing by default: you can only grow them online up to the next 16GiB boundary. (In theory, one could run the ext2prepare utility on the offline file system to circumvent that limitation, but this has already failed on me, and ext2prepare isn't even included in RHEL/SLES.) So if you have e2fsprogs<1.39, it's wise to request the online resize feature explicitly via the "-E resize=" parameter. The value is in blocks (most likely 4KiB each), e.g. "-E resize=88888888" will enable the file system to grow online to at least 339GiB. Note that this reserves some space for the online resize: ca. 3MiB in the above case, and ca. 33MiB when using 9x9 (= up to 3.72TiB online resize) instead of 8x8.
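The relation between the "-E resize=" block count and the resulting size limit is simple arithmetic, sketched below. The function names are invented for this example, and the 4KiB block size is an assumption; a file system created with a different block size needs a different factor.

```shell
block_bytes=4096    # assumed file system block size (4KiB)

# Upper bound in whole GiB that a given "-E resize=" block count allows.
resize_limit_gib() {
    echo $(( $1 * $block_bytes / 1024 / 1024 / 1024 ))
}

# Block count to pass to "-E resize=" for a target maximum in GiB.
resize_blocks_for_gib() {
    echo $(( $1 * 1024 * 1024 * 1024 / $block_bytes ))
}

resize_limit_gib 88888888       # -> 339
resize_limit_gib 999999999      # -> 3814 (roughly 3.72TiB)
resize_blocks_for_gib 1024      # 1TiB target -> 268435456
```

The second function goes the other way round, which is handy if you would rather start from the largest size you expect the file system to reach.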

Since e2fsprogs 1.39, a config file /etc/mke2fs.conf exists which enables the "resize_inode" feature by default. Please note that it is not possible to change that feature afterwards; you have to pass the parameter to mke2fs when creating the file system. So, knowing that, let's create an ext3 fs (although the "-E resize" is useless with this small test case, I'll give it here for completeness):

    # mke2fs -j -E resize=88888888 /dev/vg0/data
    [...]
    # mount /dev/vg0/data /mnt
    # df /mnt
    Filesystem           1K-blocks      Used Available Use% Mounted on
    /dev/mapper/vg0-data   2064208     66136   1893216   4% /mnt

Online Resize

Now that we have the file system mounted, let's extend it online (pretending there is already important production data on it that can't stand downtime, and pretending we already made a backup of that data before resizing). Online resizing is a two-step process: we first extend the underlying block device using lvextend, then use ext2online to resize the file system (since e2fsprogs 1.39, resize2fs can also resize online, and I expect ext2online to be deprecated in favour of resize2fs at some point).

    # lvextend -L +2g /dev/vg0/data
      Extending logical volume data to 4.00 GB
      Logical volume data successfully resized
    # ext2online /mnt
    ext2online v1.1.18 - 2001/03/18 for EXT2FS 0.5b
    # df /mnt
    Filesystem           1K-blocks      Used Available Use% Mounted on
    /dev/mapper/vg0-data   4128448     66304   3852456   2% /mnt

So this worked without problems, as expected. If you do a larger online resize, you'll see the additional space arrive bit by bit during the resize, e.g. if you watch with df in another console.
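Which tool performs the second step depends on the e2fsprogs version, as described above. The sketch below encodes just that rule; the function name pick_resize_tool is hypothetical, and the version string has to be supplied by you (how to query it is deliberately left out).

```shell
# Hypothetical helper: choose the online resize tool for a given
# e2fsprogs version string such as "1.38" or "1.41.3".
pick_resize_tool() {
    major=${1%%.*}          # part before the first dot
    rest=${1#*.}            # everything after the first dot
    minor=${rest%%.*}       # second version component
    if [ "$major" -gt 1 ] || { [ "$major" -eq 1 ] && [ "$minor" -ge 39 ]; }; then
        echo resize2fs      # 1.39 and later can resize online
    else
        echo ext2online     # older versions need the separate tool
    fi
}

pick_resize_tool 1.38      # -> ext2online
pick_resize_tool 1.39      # -> resize2fs
pick_resize_tool 1.41.3    # -> resize2fs
```
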

Striped LVs

LVM also includes striping capabilities: if you have multiple PVs in a VG, you can use the -i switch to instruct LVM to stripe an LV across some or all of the PVs for performance reasons (this is the equivalent of RAID0).

For this, I'll put the remaining PVs (sd[g-k]1) into the VG:

    # vgextend vg0 /dev/sd[g-k]1
      Volume group "vg0" successfully extended
    # lvcreate -n raid0 -L 5g -i 5 vg0
      Using default stripesize 64KB
      Logical volume "raid0" created

To show only the interesting columns, I'll run

    # lvs -o name,size,stripes,devices
      LV    LSize #Str Devices                                                         
      data  4.00G    1 /dev/sdb1(0)                                                    
      data  4.00G    1 /dev/sdc1(0)                                                    
      data  4.00G    1 /dev/sdd1(0)                                                    
      raid0 5.00G    5 /dev/sde1(0),/dev/sdf1(0),/dev/sdg1(0),/dev/sdh1(0),/dev/sdi1(0)

So this is really a striped volume, ready to create a file system on. I was able to get up to 234MiB/s I/O throughput with four older 60MiB/s SCSI disks on one of our test systems, and the testers were really pleased with the speedup.
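To illustrate how striping spreads the data, here is a simplified round-robin model of the layout, using the 64KB stripe size and the 5 stripes from the lvcreate call above. The function name is made up, and the model ignores everything except the order in which stripe-sized chunks are handed out to the PVs.

```shell
stripe_kib=64     # stripe size reported by lvcreate above
stripes=5         # value given to -i

# Simplified model: which stripe (0-based) holds a given LV offset in KiB?
stripe_for_offset() {
    echo $(( ($1 / $stripe_kib) % $stripes ))
}

stripe_for_offset 0      # first chunk           -> 0
stripe_for_offset 64     # second chunk          -> 1
stripe_for_offset 200    # inside fourth chunk   -> 3
stripe_for_offset 320    # wraps around          -> 0
```

Sequential I/O therefore touches all five PVs in turn, which is where the throughput gain comes from.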
