CONFINE nodes may have important limitations regarding permanent storage. If several virtual machines or containers are to be run on a node, one for each sliver, this points us to either accessing sliver images through the network or finding ways to share as much data as possible between slivers. Since stable access to the community network (CN) is not guaranteed, we must opt for local data sharing. This solution suggests providing researchers with standard base images to build upon, which both increases the chances of data sharing and eases sliver image preparation. Data sharing also helps reduce memory usage by leveraging the block cache.
Besides, it may be desirable to let researchers perform standard administrative tasks on slivers (e.g. installing standard packages), further easing testbed usage by leveraging software distribution efforts and common systems administration knowledge. This rules out some ad-hoc solutions such as shared read-only system directories.
Generally speaking, a union mount point is the result of merging or stacking several stores [2] into a single directory. What makes this interesting for us is that a union mount point D can be built by merging a read-only base store DR and a writable store DW. D shows the content of DR plus any later changes, which are stored in DW by copying the original object and storing its modified version [3].
As long as DR remains untouched, it can be combined with different writable stores DW1, DW2… to produce several union mount points D1, D2… each one keeping its own state. The total size required is that of DR plus that of the current changes of each union mount point Di.
Thus, for a given set of slivers based on the same image, we can keep a single read-only base image store and a writable store for each sliver. Before starting a given sliver, we union mount its base store and its writable store:

/var/lib/lxc/.base/debian-squeeze-java-i386-2011111003 (RO)
  + /var/lib/lxc/.store/myslice (RW)
  = /var/lib/lxc/myslice/rootfs/ (RW)
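Such a union could be mounted, for instance, with Aufs or overlayfs. This is a sketch using the example paths above and the option syntax of the versions shipped at the time (requires root, and the respective kernel module):

```shell
# Aufs: writable branch listed first, read-only base branch second.
mount -t aufs \
  -o br=/var/lib/lxc/.store/myslice=rw:/var/lib/lxc/.base/debian-squeeze-java-i386-2011111003=ro \
  none /var/lib/lxc/myslice/rootfs

# overlayfs (as found in OpenWrt kernels >= 2.6.37): lower layer is the
# read-only base, upper layer receives the changes.
mount -t overlayfs \
  -o lowerdir=/var/lib/lxc/.base/debian-squeeze-java-i386-2011111003,upperdir=/var/lib/lxc/.store/myslice \
  none /var/lib/lxc/myslice/rootfs
```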
There are many solutions providing directory-based union mount points [4] on Linux (none of which are currently part of the mainline kernel), several of them based on FUSE. The following are of special interest because they do not require FUSE and are readily available:
Aufs can merge many directories using different policies. Included in Debian's kernel since Squeeze.
mini_fo can merge two directories (RO+RW). Included in OpenWrt Backfire's kernel < 2.6.37.
Overlayfs can merge two directories (RO+RW). Included in OpenWrt Backfire's kernel >= 2.6.37.
Btrfs is a file system included in mainline Linux since 2.6.29. Although considered experimental by its developers, it is quite stable. It can mount a read-only base seed file system and add a writable block device on top of it.
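The seeding mechanism just described could be set up along these lines (device names are hypothetical; requires root and btrfs-progs):

```shell
# Create the file system holding the base image, then flag it as a
# read-only seed.
mkfs.btrfs /dev/sdb1
btrfstune -S 1 /dev/sdb1

# Mount the seed and add a writable device on top of it; after the
# remount, all changes go to the writable device only.
mount /dev/sdb1 /var/lib/lxc/myslice/rootfs
btrfs device add /dev/sdc1 /var/lib/lxc/myslice/rootfs
mount -o remount,rw /var/lib/lxc/myslice/rootfs
```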
rsync can be used to distribute a sliver image to a node having the sliver's base image and writable store already mounted under its union mount point. It can also be used to compact an existing writable store into a new one based on a common (or different, e.g. updated) base image.
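One way the compaction could work is with rsync's --compare-dest option, which skips files identical to those in a reference directory. This is a minimal local sketch of the idea (temporary directories, no root needed; -c makes rsync compare file contents rather than timestamps):

```shell
# Hypothetical miniature: a "base" tree and a "merged" (union) tree.
base=$(mktemp -d); merged=$(mktemp -d); newstore=$(mktemp -d)
echo unchanged > "$base/etc.conf"
echo unchanged > "$merged/etc.conf"   # identical to base: will be skipped
echo modified  > "$merged/app.conf"   # not in base: will be copied

# The new store only receives files that differ from the base.
rsync -a -c --compare-dest="$base"/ "$merged"/ "$newstore"/
ls "$newstore"
```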
Deduplicating file systems [5] store each different object (block or file) with a given content only once, regardless of how many times it occurs in the file system. Each object is immutable: a write operation altering its data creates a new, different object [6]. For instance, an exact copy of a directory D1 into another one D2 in the same file system (cp -a D1 D2) takes minimal space. Changing a file or directory under D2 does not affect the contents under D1, and vice versa.
Thus, for a given set of slivers based on the same image, we just deploy all slivers in the same deduplicating file system and let it take care of storing redundant data only once. If we keep a base image in the file system, the total size required is approximately that of the base image plus the unique changes of all the slivers:
/var/lib/lxc/ (deduplicating)
  /var/lib/lxc/.base/debian-squeeze-java-i386-2011111003/ (optional, unused)
  /var/lib/lxc/myslice1/rootfs/
  /var/lib/lxc/myslice2/rootfs/
There are a few deduplicating file systems on Linux (none of which are currently part of the mainline kernel), most of them based on FUSE:
lessfs is a FUSE file system implementing deduplication.
In Copy-on-Write (CoW) file systems, a copy of an object is implemented by creating a new object referring to the original data, so the copy doesn't take extra space. When the data shared by any of these objects is to be modified, a copy of the original data is created and modified instead, so the modified object no longer points to the original data. Data which is not shared is modified directly.
For instance, a CoW copy of a directory D into another one D1 in the same file system (cp -a --reflink D D1) takes minimal space. Changing a file or directory under D1 does not affect the contents under D, and vice versa.
Thus, for a given set of slivers based on the same image, we create CoW copies of the latter. If we keep the base image D in the file system, the total size required is that of D plus that of the current changes of each CoW copy Di:
/var/lib/lxc/ (CoW)
  /var/lib/lxc/.base/debian-squeeze-java-i386-2011111003/
  /var/lib/lxc/myslice1/rootfs/ (RW)
  /var/lib/lxc/myslice2/rootfs/ (RW)
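On a file system with reflink support (e.g. Btrfs), creating a sliver from the base image in the layout above could then look as follows (paths hypothetical):

```shell
# CoW copy of the base image into a new sliver root: the copy shares all
# data blocks with the base until either side is modified.
cp -a --reflink \
  /var/lib/lxc/.base/debian-squeeze-java-i386-2011111003 \
  /var/lib/lxc/myslice1/rootfs
```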
The aforementioned Btrfs also allows two kinds of CoW: CoW copies and snapshot subvolumes. CoW copies are ordinary files and directories which share data blocks, and they can be created with cp -a --reflink. Subvolumes are hierarchies in a file system, isolated from upper levels, which can be mounted independently, hopefully with their own restrictions. Snapshot subvolumes are CoW copies of other subvolumes.
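Assuming the base image was created as a Btrfs subvolume, the snapshot variant could be sketched like this (paths hypothetical; requires root and btrfs-progs):

```shell
# Create the base image as a subvolume and populate it once...
btrfs subvolume create /var/lib/lxc/.base/debian-squeeze-java-i386-2011111003

# ...then snapshot it per sliver: instantaneous, no extra space needed.
btrfs subvolume snapshot \
  /var/lib/lxc/.base/debian-squeeze-java-i386-2011111003 \
  /var/lib/lxc/myslice1/rootfs
```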
Under Debian unstable (Linux 3.0.0), both Aufs (included in the stock kernel's modules) and lessfs (compiled by hand) were successfully tested as LXC root file systems. lessfs reduced the size of the original image from 410 MiB to 230 MiB using compression, and an additional copy only required 12 MiB more, but it used more than 700 MiB of RAM and 80% of CPU (on a 2 GHz Core 2 Duo).
Other deduplicating file systems were discarded because of system requirements unlikely to be met by nodes.
Overlayfs was tested on a customized OpenWrt image (by Pau) as an LXC root file system. The container booted and worked, but LXC support there is still precarious (networking is not virtualized yet, so the container sees the host's interfaces, and halting the container locks the console when using overlayfs).
mini_fo was tested on OpenWrt Backfire, with no LXC support.
Btrfs seeding was tested on the Debian machine (Linux 3.1.0) using loop devices for both seed and writable images. Although some related problems remain, a seed-based Btrfs file system can be successfully used as LXC root.
A Btrfs CoW copy of a base image was also successfully used as LXC root in the same test machine. Copying the 510 MiB base image took a few seconds and used an extra 61 MiB, though reported free space remained the same.
A Btrfs snapshot of a base image subvolume was also successfully used as LXC root in the same test machine. Creating a snapshot of the 510 MiB base image was instantaneous and no extra space was required.
While current deduplicating file systems impose hardware requirements that nodes are unlikely to meet, Btrfs and especially union mount points are lightweight options for saving disk space on sliver images. However, there is no mainline union implementation, though distributions tend to include one (Aufs for Debian/Voyage, mini_fo/overlayfs for OpenWrt). Btrfs, though mainline, is not included by default in Voyage nor OpenWrt.
Although LXC support is not complete in either Voyage Linux (kernel support missing) or OpenWrt (isolation problems), all the aforementioned technologies seem to get along well with LXC.
Finally, some issues, like setting per-sliver quotas, still need research and testing.
[2] Depending on the particular solution, the merged stores will either be already available directories, or entire file systems and block devices.
[3] If the union mount point operates on directories, the whole file will be copied; if it operates on the file system, only the affected blocks may be copied. Thus the former approach may impose an extra penalty on big files.
[5] Deduplication can also occur a posteriori on previously existing, redundant data.