Replication Fail Over Demo (Summer School Hands On 2013)
This tutorial was meant for students who took part in the XtreemFS hands-on session at the Contrail Summer School 2013 in Almere, Netherlands. The tutorial describes how to install a simple XtreemFS installation and then how to store and play a replicated video file and simulate the failure of a replica.
The tutorial assumes that XtreemFS will be installed on a single machine (to make things easy). You can use any supported Linux distribution, whether it's your private laptop or a virtual machine.
The following instructions are similar to the quickstart guide which is available at https://www.xtreemfs.org/quickstart.php.
You have to install the `xtreemfs-server` and `xtreemfs-client` packages. Please follow the instructions from the download page: https://xtreemfs.org/download.php
As user root, start the XtreemFS servers (run `sudo -s` to become the root user):
- Start the Directory Service:
/etc/init.d/xtreemfs-dir start
- Start the Metadata Server:
/etc/init.d/xtreemfs-mrc start
- Start the OSD (Object Storage Device):
/etc/init.d/xtreemfs-osd start
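Optionally, verify that all three services came up. This assumes the init scripts support the common `status` action:

```
/etc/init.d/xtreemfs-dir status
/etc/init.d/xtreemfs-mrc status
/etc/init.d/xtreemfs-osd status
```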
Run the XtreemFS client to create and mount a new volume.
- If not already loaded, load the FUSE kernel module:
modprobe fuse
- Create a new volume with the default settings.
mkfs.xtreemfs localhost/myVolume
- Create a mount point:
mkdir ~/xtreemfs
- Mount XtreemFS to this mount point.
mount.xtreemfs localhost/myVolume ~/xtreemfs
Depending on your distribution, you may have to add users to a special group to allow them to mount FUSE file systems. In openSUSE, users must be in the group `trusted`; in Ubuntu, in the group `fuse`. You may need to log out and log in again for the new group membership to become effective.
- Now you can write files into the XtreemFS volume at `~/xtreemfs`. For example, start a file manager like Nautilus to copy files into the XtreemFS volume.
- Have a look at the MRC web interface, which is available at https://localhost:30636. You'll see the new volume and statistics for it (e.g., the number of files).
- To unmount XtreemFS on your local system:
umount.xtreemfs ~/xtreemfs
- Finally, remove the volume from the MRC:
rmfs.xtreemfs localhost/myVolume
After setting up a simple XtreemFS installation, you'll store a video file in XtreemFS that is replicated across three replicas using the XtreemFS Read/Write replication. Then you'll play back the video from XtreemFS: the client retrieves the file content from the file's primary replica. To simulate a replica outage, you'll kill the replica's process and watch the video stall briefly until the client fails over to another replica.
To store a file with three replicas in XtreemFS, you also need three OSDs. Therefore, we'll shut down the previously started OSD and create three configuration files for three new OSDs. These steps have to be executed as user root.
- Stop the single OSD.
/etc/init.d/xtreemfs-osd stop
- For each OSD, a new configuration file has to be created. You can either take the shortcut and download the configuration files provided by us, or create them on your own to understand the necessary changes.
You can download our version of the configuration files as follows:
cd /etc/xos/xtreemfs/
wget https://www.zib.de/berlin/demo/osd1.config.properties
wget https://www.zib.de/berlin/demo/osd2.config.properties
wget https://www.zib.de/berlin/demo/osd3.config.properties
Alternatively, go through the following steps to create and edit the three OSD configuration files manually:
cd /etc/xos/xtreemfs/
cp osdconfig.properties osd1.config.properties
cp osdconfig.properties osd2.config.properties
cp osdconfig.properties osd3.config.properties
Now the following configuration options have to be changed in each file `osd{1,2,3}.config.properties` and set to a unique value:
| Option | first OSD | second OSD | third OSD |
|---|---|---|---|
| listen.port | 32641 | 32642 | 32643 |
| http_port | 30641 | 30642 | 30643 |
| object_dir | /var/lib/xtreemfs/osd1/ | /var/lib/xtreemfs/osd2/ | /var/lib/xtreemfs/osd3/ |
| uuid | osd1 | osd2 | osd3 |
You can use a command-line text editor to edit each configuration file. Beginners may want to use `nano`, e.g., run:
nano osd1.config.properties
Here's a short explanation of each option:
| Option | Description |
|---|---|
| listen.port | TCP port where XtreemFS clients connect to. |
| http_port | TCP port of the web interface. |
| object_dir | Directory where file contents are stored. |
| uuid | Universally unique identifier (UUID) under which the OSD registers at the DIR (Directory Service). |
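If you prefer not to edit the three files by hand, the following sketch derives them from the stock template with `sed`. It assumes the default `osdconfig.properties` still contains its stock `listen.port`, `http_port`, `object_dir`, and `uuid` lines; double-check the results before starting the OSDs:

```
cd /etc/xos/xtreemfs/
for i in 1 2 3; do
  sed -e "s|^listen.port.*|listen.port = 3264$i|" \
      -e "s|^http_port.*|http_port = 3064$i|" \
      -e "s|^object_dir.*|object_dir = /var/lib/xtreemfs/osd$i/|" \
      -e "s|^uuid.*|uuid = osd$i|" \
      osdconfig.properties > osd$i.config.properties
done
```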
- After configuring the three new OSDs, you have to start them. To do so, use the more advanced start script `/usr/share/xtreemfs/xtreemfs-osd-farm`. Run:
/usr/share/xtreemfs/xtreemfs-osd-farm start
- Now the three new OSDs register at the DIR and will be used by all volumes in the future.
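To check that all three OSD processes are actually running, you can grep for them (a generic check; it assumes each OSD's command line contains the name of its configuration file):

```
pgrep -af 'osd[123].config.properties'   # or: ps aux | grep config.properties
```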
As local user, create and mount a volume called `replicated` as was done before:
- Create the volume:
mkfs.xtreemfs localhost/replicated
- Mount the volume:
mount.xtreemfs localhost/replicated ~/xtreemfs
Now you use the XtreemFS tool `xtfsutil` to configure the file replication. We set the replication factor to three replicas and use the Read/Write replication with quorums:
xtfsutil --set-drp --replication-policy quorum --replication-factor 3 ~/xtreemfs
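You can verify the volume's new default replication policy by pointing `xtfsutil` at the mount point; it prints the volume's settings (the exact output labels may differ between XtreemFS versions):

```
xtfsutil ~/xtreemfs
```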
For the demonstration we use the freely available short film "Sintel", which was produced by the Blender Foundation. It can be downloaded from https://www.sintel.org/download/. We'll use the version with the smallest file size, which is "1024 x 436 (123 MB, mp4, 5.1)". For example, download it using the command-line tool `wget`:
cd ~/xtreemfs
wget "https://mirrorblender.top-ix.org/movies/sintel-1024-surround.mp4"
While the file is being written into XtreemFS, have a look at the visualization of open, replicated files.
- To do so, browse to the DIR web interface: https://localhost:30638/replica_status
At the top, all registered OSDs and their availability are shown (green = up, red = down). Below the first row of OSDs, each open file is shown with an indication of whether each OSD holds its current primary replica (blue) or a backup replica (green).
How did the system end up in this state? Here's a detailed description of what happened:
- The client creates the file at the MRC, which assigns three replicas to it.
- Then the client starts to write data to the OSD of the first replica.
- The OSD receives the write request and tries to acquire the lease for the primary state. To do so, it contacts the remaining replicas.
- After the OSD has received a majority of acknowledgments (including its own vote) for the lease, it becomes the primary for the file. With three replicas, a majority means two votes. The lease is automatically renewed in the background as long as the file is open at the OSD.
- Now the OSD accepts write requests from clients: the data is propagated to all backup replicas. Once a majority of replicas has successfully written the data, the primary acknowledges the client's write request.
After the download has finished, play the file directly from XtreemFS using the video player `mplayer`:
mplayer ~/xtreemfs/sintel-1024-surround.mp4
Please mute the sound to not disturb other students :-)
Have a look at the DIR web interface to find out which replica is the current primary of the video file.
Now shut down the respective OSD on your machine. For example, if the video is served by the OSD `osd1`, run the following command to shut down only osd1:
/usr/share/xtreemfs/xtreemfs-osd-farm stop osd1
(Of course, you can also forcefully terminate the process by using `kill`.)
Now the video will stall for several seconds. After the primary's lease has expired, another replica can try to become primary for the file. Once this has succeeded, the new primary continues answering read requests from the client and the video resumes.
Here are more details about what happens in the background during the failover:
- After the new primary has acquired the lease, it has to ensure that it has the latest version of the file.
- To do so, it asks all backup replicas for their list of chunks and the chunks' version numbers.
- The primary merges the lists of chunks and determines whether it is missing any chunks. If so, it downloads them from the backup replicas.
- After the primary has finished downloading missing chunks, it is allowed to respond to client read requests.
To restart the stopped OSD, e.g. osd1, run the script with the `start` parameter:
/usr/share/xtreemfs/xtreemfs-osd-farm start osd1
If you enjoyed the instructions so far, here are several optional tasks you can try out as well:
Real geeks watch the video with ASCII characters only ;-) To do so, run mplayer with the "ASCII art" video output driver:
mplayer -vo aa ~/xtreemfs/sintel-1024-surround.mp4
(If the driver is not available on your system, try to install the package `libaa`.)
An advantage of the primary/backup approach is that the primary can respond to read requests from its local replica. You might therefore be wondering whether the client can still read the file if all remaining backup replicas are unavailable. Find out! Run:
/usr/share/xtreemfs/xtreemfs-osd-farm stop <OSD config name>
for each OSD which holds a backup replica.
After stopping the last backup replica, the video playback will resume - but only for several seconds until it stops. What happened? When all backup replicas are unavailable, the primary can no longer renew its lease: there is no majority of replicas left to agree that the primary may renew it. After the lease expires, the primary must no longer serve read requests (with possibly outdated data), because a backup replica may have become primary in the meantime and the file may have been modified.
A failover is possible only after the lease times out, so the lease timeout defines the maximum time of unavailability. The default lease timeout is set to 14 seconds plus a grace period of 1 second (to take differences between local clocks into account). Therefore, clients retry a failed request after 15 seconds by default.
To achieve faster failovers, reduce the lease timeout in the configuration of each OSD. The grace period is left at its default value of 1 second. Set the lease timeout to 5 seconds with the following option in each `/etc/xos/xtreemfs/osd{1,2,3}.config.properties` file:
flease.lease_timeout_ms = 5000
The parameter name starts with "flease." because XtreemFS' distributed scalable lease algorithm is called "Flease".
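A quick way to apply the setting to all three files (a sketch; it assumes the option is not already present in the files, in which case you should edit the existing line instead):

```
for f in /etc/xos/xtreemfs/osd{1,2,3}.config.properties; do
  echo "flease.lease_timeout_ms = 5000" >> "$f"
done
```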
After changing each configuration file, restart all OSDs:
/usr/share/xtreemfs/xtreemfs-osd-farm stop
/usr/share/xtreemfs/xtreemfs-osd-farm start
Unmount the volume and remount it with shorter request timeout values:
umount.xtreemfs ~/xtreemfs
mount.xtreemfs localhost/replicated ~/xtreemfs --retry-delay 6 --connect-timeout 6 --request-timeout 6
Now simulate the failover again and observe that the failover time has decreased. Please note that shorter lease timeouts increase the system load because replicas have to renew active leases more often.
You can also simulate the failover while you're downloading the video file. Repeat the download with the `wget` command-line tool as described above and stop the current primary. You'll see that the download stalls as well and resumes once the system has failed over to a new replica.
By default, the replicas are assigned randomly to a new file, i.e., any ordering of the three OSDs is possible. The client always tries to access the first replica, then the next one, and so on.
You can influence the placement of replicas and their selection in the client by configuring the "OSD Selection" and "Replica Selection" policies. For example:
- The undocumented policy 3998 will always sort OSDs by their UUID. Once enabled, the list of replicas of new files will always be ordered "[osd1, osd2, osd3]". Enable it as follows:
xtfsutil --set-osp 1000,3998 ~/xtreemfs
- The second undocumented policy 3999 will always reverse the list of replicas, i.e., if it is set as "Replica Selection" policy, the client will go through the list of replicas starting from the back.
xtfsutil --set-rsp 3999 ~/xtreemfs
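To see the effect of both policies, you can inspect the replica list of a newly created file with `xtfsutil` (it prints per-file information including the list of replicas; the exact output labels may vary between versions):

```
touch ~/xtreemfs/test-file
xtfsutil ~/xtreemfs/test-file
```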
The mentioned policies are only used for demonstration. In practice, you can try several more sophisticated policies e.g., to sort the list of replicas based on the measured latency or the distance between data centers. See the user guide for more information: https://www.xtreemfs.org/userguide.php
Instead of three OSDs per machine, you can run one OSD per machine. The advanced quickstart on the XtreemFS website describes the necessary steps: https://www.xtreemfs.org/quickstart_distr.php
When you run a truly distributed setup, you'll notice that more data is transferred between primary and backup replicas than is written by the client. For example, use the command-line tool `nload` to visualize the current throughput on each machine:
nload -u K -U K
The reason for this behavior is documented in our bug tracker: https://github.com/xtreemfs/xtreemfs/issues/259
Every write received by the primary replica results in a new version of the affected chunk. Due to a limitation in the current code, the complete chunk is sent to the backup replicas, even if the client updated less data. For example, adding 8 kB to an existing chunk with a previous size of 56 kB will result in transferring the complete new 64 kB chunk.
For example, "wget
" writes with a buffer of 8 kB to XtreemFS. This will result in a severe traffic overhead. You can compensate this by reducing the chunk size of the volume e.g., set it to 8 kB:
xtfsutil --set-dsp -w 1 -s 8 ~/xtreemfs
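Note that `--set-dsp` changes the volume's default striping policy, which applies to files created afterwards; existing files keep their chunk size. To confirm the new setting, you can print the volume settings again (the same hedged check as before):

```
xtfsutil ~/xtreemfs
```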
However, larger chunk sizes are generally better as they reduce the footprint on the OSD: if there are fewer chunks to be versioned, the memory and file system load decreases.