Design and Implementation

Architecture

Sheepdog is a storage system that provides a simple key-value interface to the Sheepdog client (the QEMU block driver), as shown in figure 2-1. Sheepdog consists of multiple nodes. The right side of figure 2-1 shows the architecture of a regular cluster file system (SAN file system). In contrast, Sheepdog does not require shared storage, and its key-value interface is better suited to building a scalable system in a distributed environment.

Figure 2-1: Comparison of the Sheepdog architecture and a regular cluster file system architecture

Figure 2-2 shows the Sheepdog components. Sheepdog consists of the following:

  • Disk I/O manager daemon (called sheep)
  • Cluster manager daemon (called dog)
  • Client (patched KVM/QEMU)

Figure 2-2: Sheepdog components

Virtual Disk Image (VDI)

A Sheepdog client divides a VM image into fixed-size objects (4 MB by default) and stores them on the distributed storage system. Each object is identified by a globally unique 64-bit ID and is replicated to multiple nodes (shown in figure 3).

Figure 3: Virtual disk image
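
The fixed-size split described above can be illustrated with a little arithmetic. The following is a minimal sketch, not Sheepdog's actual code: the function names `locate` and `objects_for_range` are hypothetical, and the layout of the 64-bit object ID is not covered because the text does not specify it. Only the 4 MB default object size comes from this document.

```python
OBJECT_SIZE = 4 * 1024 * 1024  # 4 MB default object size, as stated in the text


def locate(offset, object_size=OBJECT_SIZE):
    """Return (object index, offset inside that object) for a byte offset.

    Hypothetical helper: it only shows how a position in the VM image
    maps onto one of the fixed-size objects.
    """
    if offset < 0:
        raise ValueError("offset must be non-negative")
    return divmod(offset, object_size)


def objects_for_range(offset, length, object_size=OBJECT_SIZE):
    """Return the indices of every object touched by an I/O request."""
    if length <= 0:
        return []
    first = offset // object_size
    last = (offset + length - 1) // object_size
    return list(range(first, last + 1))


# A 10-byte write that starts 4 bytes before the first object boundary
# touches two objects.
print(locate(OBJECT_SIZE + 5))                    # (1, 5)
print(objects_for_range(OBJECT_SIZE - 4, 10))     # [0, 1]
```

A request that crosses an object boundary has to be split into one request per object, which is why the second helper returns a list.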

Object

Sheepdog objects are grouped into three types.

  • Super Object: A super object contains globally shared data such as a list of VM images.
  • VDI Object: A VDI object contains metadata for a VM image such as image name, disk size, creation time, etc.
  • Data Object: A VM image is divided into data objects. Sheepdog clients generally access these objects.

Sheepdog uses consistent hashing to decide where objects are stored (shown in figure 4). Consistent hashing is a scheme that provides hash table functionality in which the addition or removal of nodes does not significantly change the mapping of objects. I/O load is balanced across the nodes by the properties of the hash table. A mechanism that places data intelligently rather than randomly is future work.

Figure 4: Consistent hashing

Each node is placed on the consistent hashing ring based on its own ID. To determine where to store an object, the Sheepdog client takes the object ID, finds the corresponding point on the ring, and walks clockwise to determine the target nodes. For example, when the object ID is 70 and the redundancy is three, the object is stored on machines B, C, and D.
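
The clockwise walk can be sketched in a few lines. This is an illustration under stated assumptions, not Sheepdog's implementation: the ring positions in `RING` are invented so that the example from the text (object ID 70, redundancy three, nodes B, C, and D) works out, and the function name `target_nodes` is hypothetical.

```python
from bisect import bisect_left

# Hypothetical ring positions. In Sheepdog the position is derived from
# the node's own ID; these numbers are made up for the example.
RING = [(20, "A"), (80, "B"), (140, "C"), (200, "D"), (250, "E")]


def target_nodes(object_id, ring, redundancy):
    """Walk clockwise from object_id and return `redundancy` nodes."""
    ring = sorted(ring)
    points = [point for point, _ in ring]
    # Index of the first node at or after the object's point on the ring.
    start = bisect_left(points, object_id)
    # Never return more replicas than there are nodes.
    count = min(redundancy, len(ring))
    # The modulo wraps the walk around the end of the ring.
    return [ring[(start + i) % len(ring)][1] for i in range(count)]


print(target_nodes(70, RING, 3))   # ['B', 'C', 'D']

# Removing node C only changes the replica set of objects that used C.
without_c = [entry for entry in RING if entry[1] != "C"]
print(target_nodes(70, without_c, 3))   # ['B', 'D', 'E']
```

The second call shows the property the text relies on: when a node leaves, the object keeps its other replicas and only picks up the next node on the ring.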

VDI Operation

In most cases, Sheepdog clients can access their images independently, because we do not allow multiple clients to access the same image. However, some operations, such as updating global information, must be performed exclusively (e.g. cloning a VDI, locking a VDI). To implement this in a highly available system, we use a group communication system (GCS). Group communication systems provide specific guarantees such as total ordering of messages. We use corosync, one of the best-known group communication systems. Sheepdog nodes are grouped into three types:

  • Master: the leader of the Sheepdog nodes. Only the master can update the super object. The master is also responsible for detecting dead nodes.
  • Submaster: a node that is in the corosync group but is not the master.
  • Slave: a node that is not in the corosync group.

VDI operations are performed in the following steps:

  1. The node joins the group and becomes a submaster.
  2. The node multicasts an operation message to the group.
  3. The submasters receive the operation message.
  4. The master updates the super object.
  5. The master multicasts an ack to all group members.
  6. The submasters receive the ack message.
  7. The submasters update the cluster information in their memory.
  8. If there are enough members in the group, the node leaves the group. Otherwise it remains a submaster.

When the master goes down, one of the submasters becomes the new master.
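
The eight steps can be traced with a toy, single-process model. This is only a sketch and does not use the corosync API: the classes `Node` and `Group`, the function `vdi_operation`, and the `min_members` threshold are all hypothetical. It also assumes that the longest-standing member acts as master and that "enough members" means more than `min_members`; the text does not specify either rule.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.cluster_info = {}  # in-memory cluster information
        self.received = []      # messages delivered to this node, in order


class Group:
    """Toy stand-in for a group communication system."""

    def __init__(self, min_members=3):
        self.members = []        # join order; assumed: first member is master
        self.super_object = {}   # globally shared data
        self.min_members = min_members  # assumed threshold for step 8

    @property
    def master(self):
        return self.members[0]

    def multicast(self, message):
        # Every member sees every message in the same (total) order.
        for member in self.members:
            member.received.append(message)


def vdi_operation(group, node, key, value):
    # Step 1: join the group and become a submaster.
    if node not in group.members:
        group.members.append(node)
    # Steps 2-3: multicast the operation; all members receive it.
    group.multicast(("op", key, value))
    # Step 4: only the master updates the super object.
    group.super_object[key] = value
    # Steps 5-6: the master multicasts the ack; all members receive it.
    group.multicast(("ack", key, value))
    # Step 7: submasters update their in-memory cluster information.
    for member in group.members:
        if member is not group.master:
            member.cluster_info[key] = value
    # Step 8: leave if the group is large enough without this node.
    if len(group.members) > group.min_members and node is not group.master:
        group.members.remove(node)


group = Group(min_members=2)
a, b, c = Node("a"), Node("b"), Node("c")
group.members = [a, b]
vdi_operation(group, c, "vdi:test", "locked")
print(group.super_object)                  # {'vdi:test': 'locked'}
print([m.name for m in group.members])     # ['a', 'b']
```

In this model, removing the first member from `group.members` makes the next member the master, which mirrors the failover sentence above in the simplest possible way.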
