Overview

Ceph Object Storage as fast as it gets or Benchmarking Ceph


CenterDevice is distributed document management and sharing software without any single centralized component. In the next evolution of the product, we are going to use the distributed object store Ceph to store our encrypted documents. In this article, my colleague Daniel Schneller and I present benchmarks for CenterDevice’s current Ceph installation. As it turns out, Ceph is as fast as it gets. In fact, Ceph gives you a reliable, cheap, and performant alternative to commercial SANs and distributed file storage.

Setup

We use four Dell PowerEdge 510 servers with 128 GB RAM and 14×4 TB disks each: two mirrored disks for the OS and 12 disks for Ceph storage. The servers are connected via 4×1 Gbit network interfaces to two redundant switches. The interfaces are bonded in pairs to form high-availability links according to IEEE 802.3ad, so a link or a switch can fail while connectivity continues over the remaining link. Furthermore, using layer 3+4 link selection via LACP, different TCP streams may use both links of a bond, enhancing network throughput.
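
As a quick sanity check (not part of the benchmark itself), the kernel exposes the negotiated bonding state under /proc/net/bonding; for the cluster bond configured below this is bond2. The output should report the 802.3ad mode, the layer3+4 transmit hash policy, and an MII status of "up" for both slave interfaces:
> cat /proc/net/bonding/bond2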

Before measuring Ceph’s Object Store performance, we establish a baseline for the expected maximum performance by measuring the performance of the disks and the network.

Baseline – Disk

For the disk performance baseline, we proceed in two steps. First, we measure the performance of a single disk. Second, we determine the performance of the whole disk subsystem of one server by stressing all disks at once.

To determine the write speed, we use dd to write arbitrary data as fast as possible. It is important to circumvent the OS’s disk cache to get realistic results (oflag=direct). We write a 10 GB file by reading from /dev/zero:
> dd if=/dev/zero of=/var/lib/ceph/osd/ceph-10/deleteme bs=10G count=1 oflag=direct

For the second step, we start the same dd process for all 12 data disks simultaneously — every dd is put in the background to run in parallel.
> for i in `mount | grep osd | awk '{print $3}'`; do dd if=/dev/zero of=$i/deleteme bs=10G count=1 oflag=direct & done
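
While the twelve dd processes are running, it can be helpful to watch the per-disk throughput in a second terminal, for example with iostat from the sysstat package (not part of the original measurement, just a convenient cross-check); this prints extended per-device statistics in megabytes, refreshed every 5 seconds:
> iostat -xm 5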

For determining the read speed, we read the created files, again bypassing the cache (iflag=direct), first from one and then from all disks:
> time dd if=/var/lib/ceph/osd/ceph-10/deleteme of=/dev/null bs=10G count=1 iflag=direct
> for i in `mount | grep osd | awk '{print $3}'`; do dd if=$i/deleteme of=/dev/null bs=10G count=1 iflag=direct & done

The results are shown in figure 1. As you can see, a single disk can write up to 154 MB/s and read up to 222 MB/s. If all disks are used at the same time, the average per-disk write speed decreases to 93 MB/s and the read speed to 142 MB/s, which is still a good result: roughly 1.1 GB/s of aggregate write and 1.7 GB/s of aggregate read throughput per server.

Figure 1 — Write performance for 1 and 12 disks in parallel.

Baseline – Network

Ceph suggests using two separate, dedicated networks: a cluster network and a public network. Ceph uses the cluster network for internal synchronization and replication, while the public network handles client requests. We use our four physical links, bonded in pairs, to provide one link for each of the two networks. The configuration of a bonded interface on Ubuntu is shown below. We use a little trick to give the VLAN-specific interface a human-readable name instead of something like bond0.105, which makes the configuration less error-prone.

> cat /etc/network/interfaces
...
auto bond2
iface bond2 inet manual
    bond-slaves p2p3 p2p4                   # interfaces to bond
    bond-mode 802.3ad                       # activate LACP
    bond-miimon 100                         # monitor link health
    bond-xmit_hash_policy layer3+4          # use layer 3+4 for link selection
    pre-up ip link set dev bond2 mtu 9000   # set jumbo frames

auto vlan-ceph-clust
iface vlan-ceph-clust inet static
    # little trick to set a human readable interface name
    pre-up ip link add link bond2 name vlan-ceph-clust type vlan id 105
    pre-up ip link set dev vlan-ceph-clust mtu 9000   # jumbo frames
    post-down ip link delete vlan-ceph-clust          # unset human readable interface name
    address 10.102.5.12                               # IP config
    netmask 255.255.255.0
    network 10.102.5.0
    broadcast 10.102.5.255
...
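
Ceph itself is told which of the two VLANs to use for which purpose via the public network and cluster network options in ceph.conf. Here is a minimal sketch, assuming the cluster subnet from the interface configuration above; the public subnet used here (10.102.4.0/24) is an assumption for illustration only:

> cat /etc/ceph/ceph.conf
...
[global]
# client-facing ("public") network; this subnet is an assumption for illustration
public network = 10.102.4.0/24
# replication ("cluster") network, matching the vlan-ceph-clust interface above
cluster network = 10.102.5.0/24
...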

The theoretical maximum performance of the bonded network interfaces is 2 Gbit/s = 250 MB/s. iperf is an excellent tool to measure network throughput, so we use it for this measurement. We start an iperf server process on node01 and two iperf clients on node02 and node03, each sending via two parallel TCP streams:

[node02] > iperf -c node01.ceph-cluster -P 2
[node03] > iperf -c node01.ceph-cluster -P 2
[node01] > iperf -s -B node01.ceph-cluster
------------------------------------------------------------
Server listening on TCP port 5001
Binding to local address node01.ceph-cluster
TCP window size: 85.3 KByte (default)
------------------------------------------------------------
[ 4] local 10.102.5.11 port 5001 connected with 10.102.5.12 port 49412
[ 5] local 10.102.5.11 port 5001 connected with 10.102.5.12 port 49413
[ 6] local 10.102.5.11 port 5001 connected with 10.102.5.13 port 59947
[ 7] local 10.102.5.11 port 5001 connected with 10.102.5.13 port 59946
[ ID] Interval Transfer Bandwidth
[ 4] 0.0-10.0 sec 342 MBytes 286 Mbits/sec
[ 5] 0.0-10.0 sec 271 MBytes 227 Mbits/sec
[SUM] 0.0-10.0 sec 613 MBytes 513 Mbits/sec
[ 6] 0.0-10.0 sec 293 MBytes 246 Mbits/sec
[ 7] 0.0-10.0 sec 338 MBytes 283 Mbits/sec
[SUM] 0.0-10.0 sec 631 MBytes 529 Mbits/sec

The results are rather disappointing: we achieve only roughly 513 Mbit/s + 529 Mbit/s ≈ 1042 Mbit/s, or about 130 MB/s in total. What’s wrong here?
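
Before digging deeper, a quick check worth doing (not part of the original measurement) is to see how the individual TCP streams were actually distributed across the two physical links, since the layer 3+4 hash policy pins each stream to exactly one slave interface. The per-slave RX/TX counters give a first hint:
> ip -s link show p2p3
> ip -s link show p2p4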

Comments

  • QHartman

    30 May 2014 by QHartman

    How do you have your journals configured for this benchmarking? How many OSDs are you running on each machine?

    • Lukas Pustina

      Hi QHartman,

      we have 12 hard disks per server for Ceph and run one OSD per disk, so there are 12 OSDs per machine. The journals use the default configuration. We considered buying SSDs and sharing them among multiple OSDs, but we are currently quite happy with the performance, so we keep this as a possible future improvement.

  • Rhea

    Hi Lukas

    Did you compare Ceph object storage to any others? Also, have you had any issues with Ceph in production?

    • Lukas Pustina

      Hi Rhea,

      we used GlusterFS in the past, but realized that it does not scale for our application. That’s why we decided to use Ceph as an object store.

      The Ceph object store is extremely mature and resilient, and we haven’t had any problems yet. CephFS, on the other hand, is not production-ready in my opinion; we use it only for sharing files between nodes for administration purposes.

      I hope this helps,
      Lukas

  • Arihanth Jain

    17 December 2014 by Arihanth Jain

    Hi Lucas,

    Thanks for sharing the benchmark work.

    For the 4 load generator test (rados bench run in parallel on all 4 nodes), how did you collect the measurements from all nodes and then aggregate them to obtain the graph? Were they collected from the individual nodes with rados running in the background and then combined somehow?

    In our setup there is a management node (apart from the 8 storage nodes); does it make sense to perform an 8 load generator test and collect the results from the management node alone using a tool like collectl?
