Sheepdog is a storage system that provides a simple key-value interface to the Sheepdog client (the QEMU block driver), as shown in figure 2-1. Sheepdog consists of multiple nodes. The right side of figure 2-1 shows the architecture of a regular cluster file system (SAN file system). In contrast, Sheepdog does not require shared storage, and it provides a key-value interface, which is better suited to building a scalable system in a distributed environment.
Figure 2-1: Comparison of the Sheepdog architecture and a regular cluster file system architecture
Figure 2-2 shows the components that make up Sheepdog.
Figure 2-2: Sheepdog components
A Sheepdog client divides a VM image into fixed-size objects (4 MB by default) and stores them on the distributed storage system. Each object is identified by a globally unique 64-bit id and is replicated to multiple nodes (shown in figure 3).
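The division into fixed-size objects can be illustrated with a minimal sketch. It only shows the arithmetic of mapping a byte range of a VM image onto 4 MB objects; the function names `locate` and `split_request` are hypothetical and are not taken from the Sheepdog source, and the sketch does not model how the 64-bit object id is built.

```python
# Illustrative sketch only: names are hypothetical, not Sheepdog's API.
OBJECT_SIZE = 4 * 1024 * 1024  # 4 MB, the default object size stated in the text


def locate(offset, object_size=OBJECT_SIZE):
    """Return (object index, offset inside that object) for a byte offset in the image."""
    if offset < 0:
        raise ValueError("offset must be non-negative")
    return divmod(offset, object_size)


def split_request(offset, length, object_size=OBJECT_SIZE):
    """Split an I/O request into per-object pieces of (index, start, length)."""
    pieces = []
    end = offset + length
    while offset < end:
        index, start = divmod(offset, object_size)
        # A piece never crosses an object boundary.
        chunk = min(object_size - start, end - offset)
        pieces.append((index, start, chunk))
        offset += chunk
    return pieces
```

A request that straddles an object boundary is split into one piece per object, so each piece can be sent to the nodes holding that object independently.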
Figure 3: Virtual disk image
Sheepdog objects are grouped into three types.
Sheepdog uses consistent hashing to decide where objects are stored (shown in figure 4). Consistent hashing is a scheme that provides hash table functionality in which adding or removing nodes does not significantly change the mapping of objects to nodes. Because the hash spreads objects evenly across the ring, I/O load is balanced across the nodes. A mechanism that places data intelligently rather than pseudo-randomly is future work.
Figure 4: Consistent hashing
Each node is placed on the consistent hashing ring based on its own id. To determine where to store an object, the Sheepdog client takes the object id, finds the corresponding point on the ring, and walks clockwise to determine the target nodes. For example, when the object id is 70 and its redundancy is three, the object is stored on machines B, C, and D.
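The clockwise walk described above can be sketched as follows. This is a minimal illustration, not Sheepdog's implementation: the function name `target_nodes` is hypothetical, the ring positions are invented so that the result matches the example in the text (figure 4 may use different values), and the sketch assumes that ids are already mapped to points on the ring and that a node sitting exactly on the object's point is the first target.

```python
import bisect

# Illustrative sketch only: names and ring positions are hypothetical.


def target_nodes(ring, object_point, copies):
    """Pick `copies` distinct nodes by walking clockwise from `object_point`.

    `ring` is a list of (point, node_name) pairs sorted by point.
    """
    points = [point for point, _ in ring]
    # First node at or after the object's point; wraps around via the modulo below.
    start = bisect.bisect_left(points, object_point)
    chosen = []
    for step in range(len(ring)):
        name = ring[(start + step) % len(ring)][1]
        if name not in chosen:
            chosen.append(name)
        if len(chosen) == copies:
            break
    return chosen


# Hypothetical node positions, chosen to reproduce the example in the text.
RING = [(20, "A"), (80, "B"), (140, "C"), (200, "D"), (250, "E")]
```

With these assumed positions, `target_nodes(RING, 70, 3)` selects B, C, and D. Removing one node from `RING` only changes the targets of objects whose walk passed through that node, which is the property that keeps the object mapping stable when nodes join or leave.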
In most cases, Sheepdog clients can access their images independently because we do not allow multiple clients to access the same image. However, some operations, such as updating global information, must be done exclusively (e.g. cloning a VDI, locking a VDI). To implement this in a highly available system, we use a group communication system (GCS). Group communication systems provide specific guarantees such as total ordering of messages. We use corosync, one of the best-known group communication systems. Sheepdog nodes are grouped into three types:
VDI operations are done in the following steps:
When the master goes down, one of the submasters becomes the new master.