Monitoring Processes with Monit and Slack

(This post was originally published on Cubicle Rebels’ engineering blog. Check it out!)

Our clients rely on us to provide rock-solid long-term stability for their projects, and one critical aspect of maintaining uptime is real-time logging and monitoring of long-running services (also called daemons) on our servers.

We use Slack exclusively for communication and notifications regarding technical matters. Because of Slack’s great webhook support, we decided to hook it up to Monit and have Monit automatically notify us and attempt to restart the process if anything goes down. Here’s how we went about it.

For the purposes of demonstration, let’s assume that we have an application called Beerist (a social network for beer drinkers). It happens to be a Rails app, and we want to use Unicorn as our Rack server so that we can serve requests concurrently with forked OS processes.

1 - Set Slack up

The first thing you want to do is to get the Webhook URL from Slack by visiting this page and signing in. You will be prompted to choose a channel to post to, but it doesn’t really matter which one you pick because you will be able to override the (default) channel in your payload. After you get the URL, don’t close the page just yet - it contains useful information we need for Step 3.

2 - Install Monit on your server

Monit is available on most Unix-like systems. If you’re running Ubuntu, the usual sudo apt-get update and sudo apt-get install monit will suffice. You can verify that Monit is working by starting it: sudo service monit start. Running monit status -v afterwards prints Monit’s status in verbose mode, which will come in handy later.

3 - Familiarize yourself with Slack’s incoming webhooks endpoint

Go back to the page with the Webhook URL, or visit the equivalent documentation page here, and documentation on message attachments here. Read through them, but I’ll give a brief overview here anyway, as well as the options we went with.

We’ll be sending serialized JSON in the request body. The JSON payload structure looks like this:

{
  "text":        String,
  "channel":     String,
  "username":    String,
  "icon_emoji":  String,
  "attachments": [
    {
      "fields": [
        {
          "title": String,
          "short": Boolean,
          "value": String
        },
        {
          "title": String,
          "short": Boolean,
          "value": String
        },
        // ... add more fields here if you wish
      ]
    }
  ]
}
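To make the structure concrete, here is a minimal Ruby sketch that builds such a payload as a hash and serializes it. Note that each entry in attachments is an object with its own fields array. The helper name build_payload and every concrete value (channel, username, emoji, field titles) are assumptions made for illustration, not values from our actual script.

```ruby
require "json"

# Build the webhook payload as a plain Ruby hash.
# All concrete values below are illustrative assumptions.
def build_payload(service:, message:, channel: "#ops")
  {
    text: message,
    channel: channel,               # overrides the webhook's default channel
    username: "monit",
    icon_emoji: ":rotating_light:",
    attachments: [
      {
        fields: [
          { title: "Service", short: true, value: service },
          { title: "Status",  short: true, value: message }
        ]
      }
    ]
  }
end

# Serialize to the JSON string that goes into the request body.
puts JSON.pretty_generate(build_payload(service: "Unicorn", message: "Unicorn is down!"))
```

Keeping the payload as a hash until the last moment means the serializer takes care of quoting and escaping for you.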

4 - Configuring Monit

The hardest part of this tutorial is this step. Monit can monitor all sorts of things, and depending on what you want to monitor, this can either be very painful or very easy.

In the case of processes, Monit will require a pidfile as well as the shell command for starting (and stopping) the process, if you want Monit to be able to automatically restart the process for you.

Let’s say our app happens to be located at /apps/beerist, and our unicorn pidfile is located at /apps/beerist/current/tmp/pids/unicorn.pid.

Open the Monit configuration file at /etc/monit/monitrc. It should be well commented, with some sensible defaults already in place. Monit has its own internal DSL, which explains the weird, terse syntax. The only thing you really want to look at and change is the set daemon setting near the top of the file. Monit runs as a daemon which wakes itself up periodically to monitor whatever it’s supposed to monitor. This setting controls how frequently Monit wakes up, and is specified in seconds. In this case, we have it set to just 5 seconds.

Under the Services section, start by adding the following line:

check process unicorn with pidfile /apps/beerist/current/tmp/pids/unicorn.pid

Remember that we want Monit to notify us AND restart the server if Unicorn is down. Let’s add the following test:

check process unicorn with pidfile /apps/beerist/current/tmp/pids/unicorn.pid
  if does not exist for 1 cycle
    do something
    else if succeeded for 3 cycles then do something else

if does not exist in this case refers to the process indicated by the process ID in the pidfile. We will come back to do something in just a bit. Meanwhile, let’s figure out how we can get Monit to send POST requests.
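If you are curious what such an existence test boils down to, the idea can be sketched in a few lines of Ruby: read the pid from the pidfile and ask the OS whether a process with that pid is alive. This illustrates the concept only, not Monit's implementation, and the method name process_running? is made up for this example.

```ruby
# Returns true when the pidfile exists and names a live process.
# A sketch of the idea behind an existence test, not Monit's own code.
def process_running?(pidfile)
  pid = Integer(File.read(pidfile).strip)
  Process.kill(0, pid) # signal 0 only checks for existence, nothing is sent
  true
rescue Errno::ENOENT, ArgumentError
  false # pidfile is missing, or its contents are not a number
rescue Errno::ESRCH
  false # no process with that pid
rescue Errno::EPERM
  true  # the process exists but belongs to another user
end

puts process_running?("/apps/beerist/current/tmp/pids/unicorn.pid")
```

A stale pidfile (one left behind by a crashed process) is reported as not running, which is exactly the case we want to be alerted about.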

5 - Write a script to send POST requests

Out of the box, Monit’s alert system only supports email-based notifications. Seriously though, this is 2015 - who has time for email?

Thankfully, we don’t have to use Monit’s built-in alert system, because Monit allows us to run arbitrary scripts with the exec action, which means we can do something like:

check process unicorn with pidfile /apps/beerist/current/tmp/pids/unicorn.pid
  if does not exist for 1 cycle
    then exec "/usr/bin/ruby /home/cooluser/post_to_slack.rb --service Unicorn --message 'Unicorn is down!' --alert danger"
    else if succeeded for 3 cycles then exec "/usr/bin/ruby /home/cooluser/post_to_slack.rb --service Unicorn --message 'Unicorn is back up!' --alert good"

So, we can write a script to do the POST-ing to Slack for us.
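The core of such a script is small. Below is a rough sketch using only Ruby's standard library. It is not our actual script: the WEBHOOK_URL value is a placeholder you would replace with the URL from Step 1, and the method names build_request and post_to_slack are assumptions for this example.

```ruby
require "json"
require "net/http"
require "uri"

# Placeholder only: substitute the Webhook URL you obtained in Step 1.
WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"

# Build (but do not send) a POST request carrying the JSON payload.
def build_request(uri, payload)
  request = Net::HTTP::Post.new(uri)
  request["Content-Type"] = "application/json"
  request.body = JSON.generate(payload)
  request
end

# Send the request over HTTPS and return the response object.
# Short timeouts keep a slow network from stalling the caller for long.
def post_to_slack(webhook_url, payload)
  uri = URI.parse(webhook_url)
  Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == "https",
                  open_timeout: 5, read_timeout: 5) do |http|
    http.request(build_request(uri, payload))
  end
end
```

Calling post_to_slack(WEBHOOK_URL, { text: "Unicorn is down!" }) with a real URL sends the message. Splitting request building from sending makes the former easy to check without touching the network.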

To speed things up a little, I’m going to just link to the script we wrote for this purpose, which you are free to use. It’s quite long but if you want to roll your own you can definitely make it shorter.
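If you do roll your own, the flags used in the Monit configuration above can be handled with Ruby's built-in OptionParser. This is a minimal sketch; the method name parse_options and the default values are assumptions, and our linked script may handle its arguments differently.

```ruby
require "optparse"

# Parse the --service, --message and --alert flags.
# The defaults are assumptions made for this sketch.
def parse_options(argv)
  options = { service: "unknown", message: "", alert: "warning" }
  OptionParser.new do |opts|
    opts.on("--service NAME") { |v| options[:service] = v }
    opts.on("--message TEXT") { |v| options[:message] = v }
    opts.on("--alert LEVEL")  { |v| options[:alert] = v }
  end.parse(argv)
  options
end

p parse_options(["--service", "Unicorn", "--message", "Unicorn is down!", "--alert", "danger"])
```

Each flag takes exactly one argument, so a multi-word message has to arrive as a single argument. That is why the message needs to be quoted in the exec line.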

Assuming you use our Ruby script, please ensure that you have some version of Ruby installed on your system, and take note of the location of the ruby executable in the current path (whereis ruby) because it is good practice to indicate the full path to the executable.

In our case, the ruby executable is located at /usr/bin/ruby (as you saw earlier) and our script is in the home directory of cooluser.

At this point you could already test that the setup works, but we’ll wait until the end of the next step. For now, run monit status -v to make sure that we’re on the right track:

cooluser@cubiclerebels:/home/cooluser# monit status -v
Runtime constants:
 Control file       = /etc/monit/monitrc
 Log file           = /var/log/monit.log
 Pid file           = /var/run/monit.pid
 Id file            = /var/lib/monit/id
 Debug              = True
 Log                = True
 Use syslog         = False
 Is Daemon          = True
 Use process engine = True
 Poll time          = 5 seconds with start delay 0 seconds
 Expect buffer      = 256 bytes
 Event queue        = base directory /var/lib/monit/events with 100 slots
 Mail from          = cooluser@cubiclerebels.com
 Mail subject       = $SERVICE $EVENT at $DATE
 Mail message       = message: Monit $ACTI..(truncated)
 Start monit httpd  = True
 httpd bind address = localhost
 httpd portnumber   = 9999
 httpd signature    = True
 Use ssl encryption = False
 httpd auth. style  = Host/Net allow list
 
The service list contains the following entries:
 
Process Name          = unicorn
 Pid file             = /apps/beerist/tmp/pids/unicorn.pid
 Monitoring mode      = active
 Existence            = if does not exist 1 times within 1 cycle(s) then exec '/home/cooluser/unicorn_down.sh' timeout 0 cycle(s) else if succeeded 3 times within 3 cycle(s) then exec '/usr/bin/ruby /home/cooluser/post_to_slack.rb --service Unicorn --message Unicorn is back up! --alert good' timeout 0 cycle(s)
 Pid                  = if changed 1 times within 1 cycle(s) then alert
 Ppid                 = if changed 1 times within 1 cycle(s) then alert
 
System Name           = awesomepossums.cubiclerebels.cluster
 Monitoring mode      = active

6 - Write a wrapper script

Due to Monit’s DSL limitations, each test can only be associated with one action. If an action is not specified for a monitored process, the default action is to restart the process when the process associated with the pidfile is no longer running.

However, once we indicate that we want to run exec if something happens, then that becomes the action, and Monit will no longer attempt to restart the process. And you can’t do something like:

if does not exist for 1 cycle
  then exec "/usr/bin/ruby /home/cooluser/post_to_slack.rb --service Unicorn --message 'Unicorn is down!' --alert danger" and restart

because the Monit DSL is painfully limited.

Thankfully, there is a way to sidestep this limitation, and that is by writing a wrapper script that will execute the (multiple) commands that we need to be executed.

Open a new file called unicorn_wrapper.sh or something and place the following lines inside:

#!/bin/bash
 
/usr/bin/ruby /home/cooluser/post_to_slack.rb --service Unicorn --message 'Unicorn is down!' --alert danger
 
cd /apps/beerist/ && RAILS_ENV=production BUNDLE_GEMFILE=/apps/beerist/Gemfile bundle exec unicorn -c /apps/beerist/config/unicorn.rb -E deployment -D;

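The last, ugly-looking line is our shell command for starting Unicorn with the correct context. For other processes like Apache, the command is usually a lot shorter, such as /etc/init.d/apache2 start on Ubuntu. Remember to make the wrapper executable (chmod +x /home/cooluser/unicorn_wrapper.sh), since Monit runs it directly.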
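If you would rather keep everything in Ruby, the same wrapper idea can be sketched as below. This is an alternative we did not deploy, shown only to make the "run several commands from one exec" pattern explicit. The names run_steps and STEPS are made up for this example.

```ruby
# Run each command in order and record whether it succeeded.
# A failed notification does not stop the restart from being attempted.
def run_steps(commands)
  commands.map { |argv| system(*argv) ? true : false }
end

# The same two steps as the shell wrapper: notify Slack, then start Unicorn.
STEPS = [
  ["/usr/bin/ruby", "/home/cooluser/post_to_slack.rb",
   "--service", "Unicorn", "--message", "Unicorn is down!", "--alert", "danger"],
  ["/bin/bash", "-c",
   "cd /apps/beerist/ && RAILS_ENV=production BUNDLE_GEMFILE=/apps/beerist/Gemfile " \
   "bundle exec unicorn -c /apps/beerist/config/unicorn.rb -E deployment -D"]
].freeze
```

Ending the script with run_steps(STEPS) would execute both steps. Passing each command as an argument list avoids a second layer of shell quoting for the message.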

With this script, we can replace

check process unicorn with pidfile /apps/beerist/current/tmp/pids/unicorn.pid
  if does not exist for 1 cycle
    then exec "/usr/bin/ruby /home/cooluser/post_to_slack.rb --service Unicorn --message 'Unicorn is down!' --alert danger"
    else if succeeded for 3 cycles then exec "/usr/bin/ruby /home/cooluser/post_to_slack.rb --service Unicorn --message 'Unicorn is back up!' --alert good"

with

check process unicorn with pidfile /apps/beerist/current/tmp/pids/unicorn.pid
  if does not exist for 1 cycle
    then exec "/home/cooluser/unicorn_wrapper.sh"
    else if succeeded for 3 cycles then exec "/usr/bin/ruby /home/cooluser/post_to_slack.rb --service Unicorn --message 'Unicorn is back up!' --alert good"

And this will correctly run the script and attempt to restart the Unicorn process. Beer drinkers rejoice!

7 - Testing everything

At this point, run monit -t to check the control file for syntax errors, monit reload to reload the configuration, and monit status -v to make sure that the process is being monitored.

We can manually test if it works by killing the process and seeing if it revives. Something like ps auwx | grep unicorn will reveal the master Unicorn process, which you can kill with kill XXXX. If everything works correctly, Slack should receive a notification in the channel you indicated, and Unicorn should be restarted (under a different pid), after which you should receive another notification.

cooluser@cubiclerebels:/home/cooluser# ps auwx | grep unicorn
cooluser 5155  2.6  7.6 324864 129944 ?       Sl   12:25   0:09 unicorn master -c /apps/beerist/config/unicorn.rb -E deployment -D
cooluser 5403  0.0  7.4 325020 125944 ?       Sl   12:26   0:00 unicorn worker[0] -c /apps/beerist/config/unicorn.rb -E deployment -D
cooluser 5406  0.0  7.4 325020 125752 ?       Sl   12:26   0:00 unicorn worker[1] -c /apps/beerist/config/unicorn.rb -E deployment -D
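If you find yourself repeating this test often, picking the master pid out of the listing can be scripted. The sketch below assumes the command column contains "unicorn master", as in the output above. The method name unicorn_master_pid is invented for this example.

```ruby
# Extract the pid of the Unicorn master from `ps auwx` output.
# Returns nil when no master process is listed.
def unicorn_master_pid(ps_output)
  line = ps_output.each_line.find { |l| l.include?("unicorn master") }
  line && Integer(line.split[1]) # the second column of ps auwx is the pid
end

SAMPLE = <<~PS
  cooluser 5155  2.6  7.6 324864 129944 ?       Sl   12:25   0:09 unicorn master -c /apps/beerist/config/unicorn.rb -E deployment -D
  cooluser 5403  0.0  7.4 325020 125944 ?       Sl   12:26   0:00 unicorn worker[0] -c /apps/beerist/config/unicorn.rb -E deployment -D
PS

puts unicorn_master_pid(SAMPLE) # prints 5155
```

On a live server you would feed it the real output, for example unicorn_master_pid(`ps auwx`), and then kill that pid to trigger the alert.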

If this works for you on the first try, congratulations!

8 - Extensions

We’ve tried to be as general as possible when describing the steps, so you can adapt this technique to monitor arbitrary processes, or even anything at all. Take a look at Monit’s documentation for an exhaustive list of the kinds of things Monit can monitor, and the kinds of tests you can set up on those monitors, including CPU/memory usage, disk space usage, uid/gid/pid changes, and even network tests (ICMP pings, bandwidth, socket connections, etc). The possibilities are endless, but the syntax remains very similar, if not exactly the same.

9 - Conclusion

As you can probably tell, we take reliability very seriously here at Cubicle Rebels, and this is just one of the things we do to make sure everything goes up and stays up. There’s a lot more to do with this technique, such as ensuring that the script is automatically uploaded when servers are provisioned.

Stay tuned for more articles and tutorials!