OpenShift cluster tests

We subject our clusters to a lot of automated tests in the widest sense – monitoring, health checks, load tests, penetration tests, vulnerability scans, the list goes on – but every so often I come across test cases that are not well served by any of them. They are usually specific to the way a cluster is used, or how the organisation operating it works. Sometimes there is no objectively correct or incorrect answer, no obvious expected value to specify in our assert statements. I will look at three examples to explain why I think these tests are worth your while.

The first test concerns the OpenShift default of letting all authenticated users create (or, more accurately, request) projects. Let’s say we want to deny non-admin users this power. How do we make sure we have complied with this rule?

Second, our architecture may require an application scaled to three pods to be distributed across three data centre zones for high availability. We need a test that shows that the built infrastructure matches the architectural requirement.

Third, let’s assume we have just experienced an unplanned downtime. Communication between two projects has failed. Clearly remediation comes first, but how would the administrator go about writing a test that makes sure the pod network is configured correctly?

The three scenarios have a number of things in common. Each requires direct access to the cluster state held in the master’s etcd database. That interaction alone ensures that these are not inexpensive tests in performance terms. Broadly speaking, these tests should run daily, preferably at a time of reduced load, not every hour of the day. Running them is perhaps most useful after cluster maintenance or upgrades. We will look at sample implementations of these tests in just a moment.

How much work will creating tests like these involve? Thankfully, very little. If we are unsure what to test, a quick glance at our operational guidelines or architecture documentation will help us get started. Writing tests will come naturally to anyone familiar with OpenShift, and should take no more than five minutes in most cases. Kubernetes gives us all the tools we need to implement our test runner.

Test setup

[Figure: Test runner components. A CronJob object triggers a pod deployment at half past midnight. The pod's ConfigMap supplies the tests and its service account supplies cluster-reader access to the OpenShift cluster.]

The CronJob object triggers nightly test runs. The payload is a lightweight single-container pod with Kate Ward’s unit test framework shUnit2, oc client, and assorted tools (curl, psql, mysql, jq, awk). All test data is taken from a ConfigMap mounted at launch. The ConfigMap in turn is generated from a folder of test scripts in Git. We will return to the scripts in just a moment.

For now the CronJob object waits for the appointed hour, then triggers a test run. shunit2 processes the test suite (consisting of all test scripts in /etc/openshift-unit.d) and then reports results. Due to a limitation of the CronJob API prior to Kubernetes 1.8, the pod reports success (zero) even when tests fail: returning an error would lead to constant redeployments and considerable load on the cluster.
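The exit-status workaround can be illustrated with a small wrapper. This is a sketch only: the function name run_and_report and the log message are hypothetical, and the actual entrypoint of the test image may be organised differently.

```shell
#!/bin/sh
# Hypothetical wrapper, not taken from the project itself.

# Run the given command, write any failure to standard output (where the
# log server picks it up) and always return zero, so that the CronJob
# controller does not keep redeploying the pod.
run_and_report() {
  status=0
  "$@" || status=$?
  if [ "$status" -ne 0 ]; then
    echo "test run failed with exit status ${status}"
  fi
  return 0
}

# Example call (the command is a placeholder for the real test run):
#   run_and_report sh ./run-suite.sh
```

The failure remains visible in the logs, but the container itself exits cleanly.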

[Figure: Permissions. Three boxes arranged horizontally, labelled from left to right: initial setup (cluster-admin), service account (cluster-reader) and regular use (with restricted SCC and non-privileged SC).]

From a permissions point of view, administrator access is required to create the project initially, but from that point onward the service account is read-only and the container runs with the ‘restricted’ security context constraint and a non-privileged security context.

Logs are written to standard output and so managed by the existing log server. Using only the default suite of tests, the test pod reports the following:

test_project_quotas
test_nodes_ready
test_nodes_no_warnings
test_security_context_privileged
test_anyuid
test_self_provisioner

Ran 6 tests.

OK

These tests are just placeholders, however. The tests that matter are the ones that reflect your organisation’s individual rules, guidelines and decisions.

Roles and permissions

Let’s return to the first example mentioned in the introduction, that is, the self-provisioner rule. The test checks that the self-provisioner cluster role has been removed from the groups system:authenticated and system:authenticated:oauth, which usually means that an administrator has issued the following command:

$ oc adm policy remove-cluster-role-from-group \
  self-provisioner \
  system:authenticated system:authenticated:oauth
cluster role "self-provisioner" removed: ["system:authenticated" "system:authenticated:oauth"]

Using the oc tool, verifying that this has not been forgotten or reversed at a later point is as straightforward as asking who is entitled to create (the verb) projectrequests (the resource):
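A minimal sketch of such a test follows. It assumes that oc adm policy who-can lists each entitled group on its own line, so that counting the lines that mention system:authenticated is enough. The helper name count_authenticated_groups is hypothetical, and the guard around suite_addTest is only there so that the file can also be sourced outside shUnit2.

```shell
#!/bin/sh
# Sketch of a shUnit2 test file; the helper name is hypothetical.

# Count the lines of `oc adm policy who-can` output (read from standard
# input) that mention system:authenticated or system:authenticated:oauth.
# grep -c exits non-zero when the count is 0, hence the `|| true`.
count_authenticated_groups() {
  grep -c 'system:authenticated' || true
}

test_self_provisioner() {
  count=$(oc adm policy who-can create projectrequests | count_authenticated_groups)
  assertEquals " non-admin users must not create project requests;" 0 "${count}"
}

# Register the test only when the file is loaded by shUnit2.
if command -v suite_addTest >/dev/null 2>&1; then
  suite_addTest test_self_provisioner
fi
```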

The shUnit2 framework intrudes only very slightly on the code here. The utility function suite_addTest allows the framework to combine many files in a test suite with a single return value. The test code must reside in a function whose name contains the word test. The writer also needs to be familiar with the framework’s assert functions, assertEquals in this case. Placing a space at the start of the string and a semicolon at the end are conventions that make error messages more legible:

test_self_provisioner
ASSERT: non-admin users must not create project requests; expected:<0> but was:<1>

Whereas most infrastructure tests strive for objectivity and test coverage, cluster tests like this one are unrepentantly subjective and selective. A comparison with rspec-puppet tests is instructive. Here is a brief excerpt from a Puppet manifest with matching rspec-puppet test:

The test asserts the following:

This approach makes it much harder to argue that some properties (e.g. users with basic-user credentials are allowed to create projects) are more important than others (e.g. there’s a JSON file which is read-only unless you are the owner). If we place the two tests side by side, we are reminded that rspec-puppet strives for full map coverage, whereas we are focused on points of interest. These points of interest may seem arbitrary, but so, perhaps, are the decisions and operational guidelines they support and reinforce.

Architecture

How do we test for high availability, the second example outlined in the introduction? Anti-affinity rules give us fine-grained control over placement on nodes, but unless we have only one node per zone, we cannot rely on the scheduler alone here. One alternative approach is to identify the nodes and examine the zone label:

As before, we start with plain oc requests and refine the output using basic command line tools. The label zone expresses anti-affinity, the label region affinity: services are spread out across zones and concentrated in regions. We fetch the nodes first (note the use of the wide switch), then extract the zone from the node definition before counting the number of unique zones. The expected number is three.
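The steps just described might be sketched as follows. The project name myproject, the selector app=myapp and the helper names are hypothetical. The sketch also assumes that the wide pod listing has a NODE column whose position can be read from the header line, and that every node carries a zone label.

```shell
#!/bin/sh
# Sketch only: project name, label selector and helper names are assumptions.

# Print the NODE column of `oc get pods -o wide` output read from stdin.
# The column position is taken from the first NODE entry in the header line.
pod_nodes() {
  awk 'NR == 1 { for (i = 1; i <= NF; i++) if ($i == "NODE" && !col) col = i; next }
       col     { print $col }'
}

# Print the number of distinct zones that host pods matching selector $2
# in project $1. Nodes without a zone label are ignored.
count_zones() {
  oc get pods -n "$1" -l "$2" -o wide | pod_nodes | sort -u |
    while read -r node; do
      oc get node "$node" -o jsonpath='{.metadata.labels.zone}{"\n"}'
    done | sort -u | grep -c . || true
}

test_high_availability() {
  count=$(count_zones myproject app=myapp)
  assertEquals " pods must be spread across three zones;" 3 "${count}"
}

# Register the test only when the file is loaded by shUnit2.
if command -v suite_addTest >/dev/null 2>&1; then
  suite_addTest test_high_availability
fi
```

Splitting the parsing into small functions keeps the oc calls in one place, which makes it easy to try the logic against canned output before pointing it at a cluster.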

Post-incident review

So far, operational guidelines and architectural decisions have directed our test selection. Incidents are another valuable guide. Making sure they occur only once trumps trying to anticipate weaknesses in our infrastructure.

For example, our multi-tenant cluster might contain a project alice which accesses a project bob using a pod network join:

$ oc adm pod-network join-projects --to=bob alice

Let’s assume that the join between the two projects has been lost. Perhaps an additional join from alice to eve was created. The fact that one (the original join is gone) does not intuitively follow from the other (an apparently unrelated new join was created) makes this all the more likely. Affected services then run into timeouts and stop processing requests.

The problem is quickly diagnosed and fixed, but having suffered one service failure, we really ought to write a test that alerts us should the join disappear again:

To follow the test, we need to appreciate what happens when oc adm pod-network join-projects is called: the source project’s network ID is changed to that of the destination project. Once the two projects share a network ID, they can communicate with each other. (Hence the unfortunate side-effect of creating an additional join from project alice to eve: alice receives the network ID of eve and can no longer reach services in project bob.) The test only has to fetch the network IDs of alice and bob, de-duplicate and count lines. If the join is still in place, the line count will be one.
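The logic described above might be sketched like this. The helper name count_network_ids is hypothetical, and the sketch assumes that each project's NetNamespace object exposes its network ID in a top-level netid field.

```shell
#!/bin/sh
# Sketch only: the helper name and the netid field path are assumptions.

# Print the number of distinct network IDs used by the given projects.
count_network_ids() {
  for project in "$@"; do
    oc get netnamespace "$project" -o jsonpath='{.netid}{"\n"}'
  done | sort -u | grep -c . || true
}

test_join_alice_bob() {
  count=$(count_network_ids alice bob)
  assertEquals " projects alice and bob must share a network ID;" 1 "${count}"
}

# Register the test only when the file is loaded by shUnit2.
if command -v suite_addTest >/dev/null 2>&1; then
  suite_addTest test_join_alice_bob
fi
```

A count of two means that alice and bob have ended up on separate networks again.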

Choosing a language

In case you are wondering why this is not a Go application, I have to confess to some library envy. Clearly the command line component would have been much more elegant, for example, and there is more repetition in the tests than I would like. The exports script seeks to address this by bundling frequently used queries such as ‘list all projects created by users’, but that does not make up for the fact that we give up the luxury of one-line web servers, Bootstrap reports adorned with canvas charts, parallel execution for oc and non-oc test cases, and so on.

Those quibbles, however, hardly justify switching to a different language. If we were to do so, which language should we choose? Go? JavaScript? Python? Ruby? Each of these choices would exclude many users who happen to have prioritised other languages. Shell scripting is familiar to most OpenShift users and a natural extension of the way they interact with OpenShift anyway. Nearly everything of substance in our tests, moreover, relies on oc calls; no standard library can abstract away the fundamental awkwardness of building an application around system calls. They only feel entirely natural in a shell environment.

Shorter paths, fewer destinations

Many tests are essential. The ones we have considered here, strictly speaking, are not. It comes down to an individual assessment of risk and usefulness. Personally, I am much more willing to grant anyuid powers to a service account if I know the next nightly test will fail should I forget to remove them later. Sometimes safety nets get in the way, but they can also have a liberating effect.

This approach allows us to specify test conditions at the appropriate level and above all quickly, with minimal investment in infrastructure and training. The goal is the shortest path to a small number of valuable points of interest, not comprehensive map coverage: sightseeing, not cartography.

You may also find that there is nothing that colleagues cannot express more succinctly and elegantly than you thought possible, especially in the world of Bash. Learning from other people’s tests may be, for me, the most enjoyable aspect of it all.

For those still undeterred, log into your administrator’s account and set the timer:

$ git clone https://github.com/gerald1248/openshift-unit.git
$ make -C openshift-unit

Gerald Schmidt

Spends too much time with clouds, public and private. On cloudless days you can catch him working on open source projects of dubious usefulness.