Have you ever wanted to find a list of nodes that updated a specific resource in a period of time? For example, “show me all the nodes in production that had an application service restart in the last hour,” or “which nodes have updated their apt cache recently?”
I have released a new knife plugin to do that, but first some background.
At CHEF, we run the community’s cookbook site, Supermarket. We monitor the systems that run the site with Sensu. The current infrastructure runs instances on Amazon Web Services EC2, with an Elastic Load Balancer (ELB) in front of them. As a corrective action for a Supermarket outage, CHEF’s operations team added a new check for elevated HTTP 500 responses from the application servers behind the ELB. One thing we found was that when Supermarket was deployed and the unicorn server restarted, we would see elevated 500s, but the site often wouldn’t actually be impacted.
The Sensu check is run from a “relay” node. That is, it isn’t run on the application servers or the Sensu server – it’s run out of band since it’s for the ELB. One might imagine we could have similar checks for other services that aren’t run on “managed nodes,” but that’s neither here nor there. The issue is that we get an alert message whose first part looks like this: [i-d1dfd5d9/check-elb-backend-500].
That first part is the node name and the check that alerted. The node name here is the monitoring relay that runs the check, not the actual node or nodes where Supermarket was deployed and restarted. This is where Chef Reporting comes into play. In Chef Reporting, we can view information about recent Chef client runs, which gives us a graph of those runs.
If we look at the reports in the Chef Manage console, we can drill down into the details of a single run.
The run details show that unicorn was restarted in this run. That’s great, but if I’m getting this alert at a time when I’m not particularly coherent (e.g., 2 AM), I want a command in a playbook that I can run to get more information quickly without having to log into the web UI and click around imprecisely. CHEF publishes a knife-reporting gem that has a couple of handy sub-commands to retrieve this run data. For example, we can list runs.
Or, we can display a specific run.
This is handy, but a little limited. What if I want to display only the runs in which a particular resource, such as the unicorn service, was updated?
That’s where my knife-report-resource plugin helps. At first, it was very much specific to finding unicorn restarts on Supermarket app servers. However, I wanted to make it more general purpose, as I think people would want to be able to find when arbitrary resources were updated. This is how it works:
- Query the Chef Server for a particular set of nodes. For example, 'role:supermarket-app AND chef_environment:supermarket-prod'.
- Get all the Chef client runs for a specified time period up until the current time. By default, it starts from one hour ago, but we can pass an ISO8601 timestamp.
- Iterate over all the runs looking for runs by the nodes that were returned by the search query, gathering the specified resource type and name.
- Display some nice output with the node’s FQDN, the run’s UUID, and a timestamp.
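The steps above can be sketched roughly as follows. This is not the plugin's actual code: the `Run` struct, the `runs_updating` method, the node names, and the sample data are all hypothetical stand-ins for whatever the Chef Server search and Chef Reporting really return. It only illustrates the filtering logic, assuming the search results and runs are already in memory.

```ruby
require "time"

# Hypothetical stand-in for a Chef client run record; the real Chef
# Reporting payload may be shaped differently.
Run = Struct.new(:run_id, :node_name, :start_time, :resources)

# Return [fqdn, run_id, timestamp] rows for runs that updated the given
# resource on one of the searched nodes within the time window.
# `since` defaults to one hour ago and may be an ISO8601 string.
def runs_updating(runs, fqdn_by_node, type, name,
                  since: Time.now.utc - 3600, now: Time.now.utc)
  since = Time.iso8601(since) if since.is_a?(String)
  runs.select { |run|
    fqdn_by_node.key?(run.node_name) &&           # node matched the search
      run.start_time.between?(since, now) &&      # run is inside the window
      run.resources.any? { |r| r[:type] == type && r[:name] == name }
  }.map { |run| [fqdn_by_node[run.node_name], run.run_id, run.start_time.iso8601] }
end

# Made-up sample data: one searched node and four runs.
now = Time.iso8601("2015-01-01T12:00:00Z")
fqdn_by_node = { "app1" => "app1.example.com" }
runs = [
  Run.new("run-a", "app1",  now - 600,  [{ type: "service", name: "unicorn" }]),
  Run.new("run-b", "app1",  now - 7200, [{ type: "service", name: "unicorn" }]), # too old
  Run.new("run-c", "relay", now - 300,  [{ type: "service", name: "unicorn" }]), # not in search
  Run.new("run-d", "app1",  now - 120,  [{ type: "package", name: "git" }]),     # other resource
]

rows = runs_updating(runs, fqdn_by_node, "service", "unicorn",
                     since: now - 3600, now: now)
rows.each { |fqdn, run_id, time| puts "#{fqdn} #{run_id} #{time}" }
# prints: app1.example.com run-a 2015-01-01T11:50:00Z
```

Only `run-a` survives all three filters: it belongs to a node from the search, falls inside the last hour, and includes the requested resource.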
For the earlier example, the plugin prints one line per matching run. Then, we can drill down further into one of these runs with the knife-reporting sub-command that displays a specific run.
Hopefully you find this plugin useful! It is a RubyGem available on RubyGems.org, and the source is available on GitHub.