Gatus: Automated Health Dashboard


Gatus is a project that I started because I wanted to have a reliable automated health dashboard that was as low maintenance as possible.

Gatus is a developer-oriented health dashboard that lets you monitor your services using HTTP, ICMP, TCP, and even DNS queries. It evaluates the result of each query against a list of conditions on values like the status code, the response time, the certificate expiration, the body and many others. The icing on the cake is that each of these health checks can be paired with alerting via Slack, PagerDuty, Discord, Twilio and more.


Originally, determining whether a service was healthy or not was going to be very simple: is the status 5xx? If yes, the service is unhealthy. After implementing that, however, I felt that it was far too limited: while it could help determine availability, it couldn’t really tell the difference between “everything is ok” and “everything is not ok”.

While having 2xx responses is definitely a good start, what about the content of an endpoint? What if, for some unknown reason, your service is up, but it’s returning no content at all? This wasn’t a question that Gatus could answer, so I decided to implement conditions, which would be used as the basis for determining whether a service was healthy or not.

Here are a few examples:

Condition	Description	Passing values	Failing values
`[STATUS] == 200`	Status must be equal to 200	200	201, 404, …
`[STATUS] < 300`	Status must be lower than 300	200, 201, 299	301, 302, …
`[STATUS] <= 299`	Status must be less than or equal to 299	200, 201, 299	301, 302, …
`[STATUS] > 400`	Status must be greater than 400	401, 402, 403, 404	400, 200, …
`[STATUS] == any(200, 429)`	Status must be either 200 or 429	200, 429	201, 400, …
`[CONNECTED] == true`	Connection to host must’ve been successful	true	false
`[RESPONSE_TIME] < 500`	Response time must be below 500ms	100ms, 200ms, 300ms	500ms, 501ms
`[IP] == 127.0.0.1`	Target IP must be 127.0.0.1	127.0.0.1	0.0.0.0
`[BODY] == 1`	The body must be equal to 1	1	`{}`, `2`, …
`[BODY].user.name == john`	JSONPath value of `$.user.name` is equal to `john`	`{"user":{"name":"john"}}`
`[BODY].data[0].id == 1`	JSONPath value of `$.data[0].id` is equal to 1	`{"data":[{"id":1}]}`
`[BODY].age == [BODY].id`	JSONPath value of `$.age` is equal to JSONPath value of `$.id`	`{"age":1,"id":1}`
`len([BODY].data) < 5`	Array at JSONPath `$.data` has less than 5 elements	`{"data":[{"id":1}]}`
`len([BODY].name) == 8`	String at JSONPath `$.name` has a length of 8	`{"name":"john.doe"}`	`{"name":"bob"}`
`has([BODY].errors) == false`	JSONPath `$.errors` does not exist	`{"name":"john.doe"}`	`{"errors":[]}`
`has([BODY].users) == true`	JSONPath `$.users` exists	`{"users":[]}`	`{}`
`[BODY].name == pat(john*)`	String at JSONPath `$.name` matches pattern `john*`	`{"name":"john.doe"}`	`{"name":"bob"}`
`[BODY].id == any(1, 2)`	Value at JSONPath `$.id` is equal to `1` or `2`	1, 2	3, 4, 5
`[CERTIFICATE_EXPIRATION] > 48h`	Certificate expiration is more than 48h away	49h, 50h, 123h	1h, 24h, …
`[DOMAIN_EXPIRATION] > 720h`	The domain must expire in more than 720h	4000h	1h, 24h, …

To give you a better idea of what the configuration file looks like, here’s a sample:

endpoints:
  - name: website                 # Name of your endpoint, can be anything
    url: "https://twin.sh/health"
    interval: 5m                  # Duration to wait between every status check (default: 60s)
    conditions:
      - "[STATUS] == 200"         # Status must be 200
      - "[BODY].status == UP"     # The json path "$.status" must be equal to UP
      - "[RESPONSE_TIME] < 300"   # Response time must be under 300ms

The configuration above checks the health of the service website by sending a request to https://twin.sh/health every 5 minutes, and validates that the status is 200, that the body has the field status with the value UP, and that the response time is less than 300ms.

To help you better visualize what [BODY].status == UP means, here’s what the body of https://twin.sh/health looks like:

{
  "status": "UP"
}
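
Other placeholders from the table above can be appended to the same list of conditions. As a rough sketch, the variant below keeps the same endpoint and adds two more checks; the 48h threshold is simply borrowed from the table, not a recommendation:

```yaml
endpoints:
  - name: website
    url: "https://twin.sh/health"
    interval: 5m
    conditions:
      - "[STATUS] == 200"                   # Status must be 200
      - "[BODY].status == UP"               # The json path "$.status" must be equal to UP
      - "has([BODY].errors) == false"       # The body must not contain an "errors" field
      - "[CERTIFICATE_EXPIRATION] > 48h"    # Certificate must remain valid for more than 48h
```

All conditions must pass for the endpoint to be considered healthy, so each added line narrows what counts as “everything is ok”.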

The checks so far are simple. You could have significantly more specific conditions:

endpoints:
  - name: some-endpoint-that-does-not-exist
    url: https://example.org/api/v1/posts/1234
    interval: 5m
    conditions:
      - "[STATUS] == 200"
      - "[BODY].id == 1234"
      - "len([BODY].title) > 0"
      - "len([BODY].author) > 0"
      - "len([BODY].tags) >= 3"

The above would expect a status of 200, as well as a body with an id of 1234, a title and an author that are each at least 1 character long, and a list of at least 3 tags, e.g.:

{
  "id": 1234,
  "title": "Gardening tips or something like that",
  "author": "johndoe",
  "tags": [
    "gardening",
    "grass",
    "nature",
    "tips"
  ] 
}

You could add more specific conditions, such as "[BODY].tags[0] == gardening" and "[BODY].author == johndoe", if you know that the name of the first tag is gardening and you know that the author’s name is johndoe respectively.

If each tag were an object with a name field instead of just a string, your condition would be "[BODY].tags[0].name == gardening" and the expected body would look something like:

{
  "id": 1234,
  "title": "Gardening tips or something like that",
  "author": "johndoe",
  "tags": [
    {
      "name": "gardening"
    },
    {
      "name": "grass"
    },
    {
      "name": "nature"
    },
    {
      "name": "tips"
    }
  ] 
}

Regardless, Gatus can be used for both use cases:

- You know exactly what response to expect (e.g. you know the author’s name is johndoe)
- You have an idea what the response looks like, but you don’t know exactly what the data itself is (e.g. you know that there should be at least 3 tags, because the minimum number of tags allowed is 3)
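
To make the contrast concrete, here is a sketch of the two cases side by side against the same example URL. The endpoint names are made up for illustration, and the first endpoint assumes the variant of the response where tags are plain strings:

```yaml
endpoints:
  # Case 1: the exact response is known
  - name: posts-exact-values
    url: https://example.org/api/v1/posts/1234
    interval: 5m
    conditions:
      - "[STATUS] == 200"
      - "[BODY].id == 1234"
      - "[BODY].author == johndoe"
      - "[BODY].tags[0] == gardening"

  # Case 2: only the shape of the response is known
  - name: posts-expected-shape
    url: https://example.org/api/v1/posts/1234
    interval: 5m
    conditions:
      - "[STATUS] == 200"
      - "len([BODY].title) > 0"
      - "len([BODY].author) > 0"
      - "len([BODY].tags) >= 3"
```

The first endpoint breaks as soon as the data changes, which is what you want for fixtures and static content; the second keeps passing as long as the response stays structurally valid.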

Among other things, Gatus also supports GraphQL requests:

endpoints:
  - name: filter-users-by-gender
    url: https://example.com/playground
    method: POST
    graphql: true
    body: |
      {
        users(gender: "female") {
          id
          name
          gender
          avatar
        }
      }
    headers:
      Content-Type: application/json
    conditions:
      - "[STATUS] == 200"
      - "len([BODY].data.users) > 0"
      - "[BODY].data.users[0].gender == female"
      - "[BODY].data.users[0].id > 0"
      - "len([BODY].data.users[0].name) > 0"

For further documentation, please see the GitHub repository at github.com/TwiN/gatus.

Finally, some of you may be asking yourselves the following:

Why would I use Gatus when I can just use Prometheus’ Alertmanager or CloudWatch?

Neither of these can tell you that there’s a problem if there are no clients actively calling the endpoint. That’s because metric-based monitoring mostly relies on existing traffic, which effectively means that unless your clients are already experiencing an issue, you won’t be notified.

Gatus, on the other hand, allows you to configure health checks for each of your features, which in turn allows it to monitor these features and potentially alert you before any clients are impacted.
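
As a sketch of what pairing a health check with an alert might look like, the fragment below attaches a Slack alert to the website endpoint used earlier. The webhook URL is a placeholder, the thresholds are arbitrary, and the key names (`alerting`, `alerts`, `failure-threshold`, `success-threshold`, `send-on-resolved`) are assumed from the project README rather than taken from this post, so double-check them before relying on this.

```yaml
alerting:
  slack:
    webhook-url: "https://hooks.slack.com/services/REPLACE/WITH/YOURS"  # placeholder

endpoints:
  - name: website
    url: "https://twin.sh/health"
    interval: 5m
    conditions:
      - "[STATUS] == 200"
      - "[BODY].status == UP"
    alerts:
      - type: slack
        failure-threshold: 3        # Alert after 3 consecutive failed checks (arbitrary)
        success-threshold: 2        # Resolve after 2 consecutive successful checks (arbitrary)
        send-on-resolved: true      # Also notify when the endpoint recovers
        description: "health check failed"
```

Because the check runs on its own schedule, the alert can fire even when no client traffic is reaching the service.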

A simple way to tell whether you may want to look into Gatus is to ask yourself whether you’d receive an alert if your load balancer were to go down right now. Would any of your existing alerts be triggered? Your metrics won’t report an increase in errors if no traffic makes it to your applications. This puts you in a situation where your clients are the ones who notify you about the degradation of your services, rather than you reassuring them that you’re working on fixing the issue before they even know about it.

To close this off, I’d love to say that Gatus is a fancy acronym, but in fact, it’s just a play on the word status.

I’ve also built a managed solution for Gatus out of Gatus, if that’s something you’re interested in.