Measuring your OpenStack Cloud with Gnocchi and Ceph storage backend


To solve our performance problems with Gnocchi and the whole OpenStack telemetry stack, we tried Gnocchi with Ceph as the storage backend, starting with OpenStack-Ansible Newton. The experience wasn’t good. Sooner or later, we saw slow requests and stuck PGs in our Ceph cluster. In one case, only deleting the Gnocchi pool saved our cluster.

As a result, we switched back to MongoDB as the storage backend for Ceilometer. It did not perform well, but at least it did not put our whole storage cluster at risk.

This left us with our performance problems, but then we stumbled upon the following performance tests for Gnocchi. One was done by Julien Danjou, the developer of Gnocchi. They got us thinking about what had gone wrong with our setup.

So with OpenStack-Ansible Pike and a new cloud, we gave Gnocchi another try. After our experience with Gnocchi and Ceph, we didn’t want to take the performance tests for granted. And as every setup is a bit different, we set up a simple performance test of our own. We started 700 VMs over time and then got a cup of coffee. OK, more than one cup. After some days, we ran into the same Ceph problems we already knew: more and more slow requests.

As we use OpenStack-Ansible for our cloud and a three-controller setup, we deployed Gnocchi on each of our controllers. By default, OpenStack-Ansible uses file as the storage backend and MySQL as the coordination backend. We changed the storage backend to Ceph and kept the rest of the default settings.

gnocchi_storage_driver: ceph

tooz does not recommend MySQL as a coordination backend (https://docs.openstack.org/tooz/latest/user/drivers.html), so we used ZooKeeper. As OpenStack-Ansible cannot include a role for everything, we had to integrate the ZooKeeper role (https://github.com/openstack/ansible-role-zookeeper.git) into our setup:

conf.d:
zookeeper_hosts:
{% for server in groups['control_nodes'] %}
 {{ server }}:
   ip: {{ hostvars[server]['ansible_default_ipv4']['address'] }}
{% endfor %}
env.d:
component_skel:
 zookeeper_server:
   belongs_to:
     - zookeeper_all

container_skel:
 zookeeper_container:
   belongs_to:
     - infra_containers
     - shared-infra_containers
   contains:
     - zookeeper_server
   properties:
     service_name: zookeeper

Now we could set up ZooKeeper as the coordination backend for Gnocchi:

gnocchi_coordination_url: "zookeeper://{% for host in groups['zookeeper_all'] %}{{ hostvars[host]['container_address'] }}:2181{% if not loop.last %},{% endif %}{% endfor %}"

gnocchi_pip_packages:
 - cryptography
 - redis
 - gnocchiclient
# this is what we want:
#  - "gnocchi[mysql,ceph,ceph_alternative_lib,redis]"
# but as there is no librados >=12.2 pip package we have to first install ceph without alternative support
# after adding the ceph repo to gnocchi container, python-rados>=12.2.0 is installed and linked automatically
# and gnocchi will automatically take up the features present in the used rados lib.
 - "gnocchi[mysql,ceph,redis]"
 - keystonemiddleware
 - python-memcached
# additional pip packages needed for the zookeeper coordination backend
 - tooz
 - lz4
 - kazoo

A word of caution: the name of the Ceph alternative lib implementation (ceph_alternative_lib) varies between Gnocchi versions.

With ZooKeeper as the coordination backend, the work is distributed across all metric processors on all controllers.
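To make it easier to see what the `gnocchi_coordination_url` template renders to, here is a minimal Python sketch of the same join logic. The function name and the container addresses are hypothetical and only serve as an illustration; port 2181 is taken from the template above.

```python
def zookeeper_coordination_url(addresses, port=2181):
    """Join host:port pairs with commas, mirroring the Jinja loop
    used for gnocchi_coordination_url."""
    hosts = ",".join(f"{address}:{port}" for address in addresses)
    return f"zookeeper://{hosts}"


# Hypothetical container addresses, for illustration only.
example_addresses = ["172.29.236.11", "172.29.236.12", "172.29.236.13"]
print(zookeeper_coordination_url(example_addresses))
```

Every Gnocchi instance gets the full list of ZooKeeper servers, so the coordination backend stays reachable if a single controller is down.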

But that didn’t solve our problem either. The problem seemed to be our Ceph cluster. Searching the web, we found a lot of bug tickets showing that other people had experienced the same problem. Those bug tickets put us on the right track: newer versions of Gnocchi can separate the storage of your data. You can use one storage type for incoming, short-lived data and another for long-term storage.
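As a rough illustration of this split, the following Python sketch renders the two relevant sections of a Gnocchi configuration file: one driver for incoming data and one for long-term storage. The helper name is hypothetical, and the chosen drivers (Redis for incoming, Ceph for storage) are just an example combination; a real gnocchi.conf needs further options per driver.

```python
import configparser
import io


def render_gnocchi_conf(incoming_driver, storage_driver):
    """Render a minimal config fragment with separate drivers for
    incoming (short-lived) data and long-term storage."""
    conf = configparser.ConfigParser()
    conf["incoming"] = {"driver": incoming_driver}
    conf["storage"] = {"driver": storage_driver}
    buffer = io.StringIO()
    conf.write(buffer)
    return buffer.getvalue()


# Example combination only: Redis for incoming data, Ceph for storage.
print(render_gnocchi_conf("redis", "ceph"))
```

The point of the split is that the many small, short-lived writes of incoming measures no longer hit the Ceph cluster directly.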

The next step was to set up the storage layer for our incoming data. From the list of supported backends, we chose Redis, as recommended. To set up the Redis cluster, we chose this Ansible role. Next, we had to configure Gnocchi with OpenStack-Ansible to use the Redis cluster as incoming storage:

gnocchi_conf_overrides:
 incoming:
   driver: redis
   redis_url: redis://{{ hostvars[groups['redis-master'][0]]['ansible_default_ipv4']['address'] }}:{{ hostvars[groups['redis-master'][0]]['redis_sentinel_port'] }}?sentinel=master01{% for host in groups['redis-slave'] %}&sentinel_fallback={{ hostvars[host]['ansible_default_ipv4']['address'] }}:{{ hostvars[host]['redis_sentinel_port'] }}{% endfor %}

gnocchi_distro_packages:
 - apache2
 - apache2-utils
 - libapache2-mod-wsgi
 - git
 - build-essential
 - python-dev
 - libpq-dev
 - python-rados
# additional package for python redis client
 - python-redis
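The `redis_url` template above is dense, so here is a minimal Python sketch of the string it builds: the first sentinel as the main address, the sentinel master name as a query parameter, and one `sentinel_fallback` parameter per additional sentinel. The function name and the addresses are hypothetical; `master01` is the master name from our configuration, and 26379 is the default Redis Sentinel port.

```python
def redis_sentinel_url(master, fallbacks, master_name="master01"):
    """Build a sentinel-style redis:// URL.

    master and each entry in fallbacks are (address, sentinel_port) tuples.
    """
    host, port = master
    url = f"redis://{host}:{port}?sentinel={master_name}"
    for fallback_host, fallback_port in fallbacks:
        url += f"&sentinel_fallback={fallback_host}:{fallback_port}"
    return url


# Hypothetical addresses, for illustration only.
print(redis_sentinel_url(
    ("10.0.0.1", 26379),
    [("10.0.0.2", 26379), ("10.0.0.3", 26379)],
))
```

Listing the other sentinels as fallbacks lets the client still discover the current Redis master when the first sentinel is unreachable.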

We ran our performance test again and eureka! No more slow requests in Ceph. Our performance test included 700 VMs, each with one vCPU and one GB of RAM. We weren’t interested in the VMs themselves, only in the telemetry data they would generate.

We assume it will take some time for our cloud to grow beyond 700 VMs. In the meantime, our cluster might evolve (for example, our Ceph currently has only SSD journals, no SSD storage), Gnocchi will evolve, and our knowledge of Gnocchi and Ceph will evolve. So we expect our current setup to cope with the upcoming load. All in all, it will give us enough time to experiment with more hints from this talk and aim for 10,000 VMs. We hope this article will help other people integrate Gnocchi into their OpenStack setup.


Christian Zunker

Christian Zunker creates and operates distributed software systems and infrastructures. In his current role as a System Engineer, he builds OpenStack-based clouds for cc cloud GmbH and its customers.

Daniel Marks is creating and operating complex distributed software systems and infrastructures. In his current role as Senior Cloud Engineer, he builds OpenStack-based clouds for codecentric and its customers.
