
Using AWS services to monitor the health of non-AWS resources

Join me in this technical blog post as we explore the AWS landscape and find out how to monitor the health of non-AWS resources using Amazon Route 53, Amazon Cloudwatch and AWS Lambda.

In 2018 we did an intake of a project containing some legacy applications. These applications expose a health check behind a publicly available REST call, but there was no monitoring or alerting in place, or at least none that was directly available to our team.

This triggered me to find out how I could leverage AWS to set up monitoring and alerting for non-AWS resources. In this blog post, I will be exploring different AWS services to see what they have to offer. I’m writing this post, not in hindsight, but as I’m exploring several possible solutions in the AWS landscape.

If you have questions or remarks, don’t hesitate to reach out and send me an email.

On July 25th 2019, we’re hosting an AWS event at our office in Leuven. If you’re reading this in time and you’re nearby, you can register via Meetup and come over if you’re interested.

OK, let’s get going!

I’ll be using the AWS console because I think this is the easiest way to get things to work when you’re doing a POC. You get a nice GUI for all of the AWS services, which provides a lot of info and support.

There are 2 constraints I’m imposing upon myself:

1) Use AWS as much as possible.

Recently some colleagues and I passed the AWS Certified Solutions Architect Associate exam. By studying for this exam, we’ve learned a lot about the available AWS services out there, and I feel like applying those learnings in this POC.

2) Don’t change anything in the legacy apps and don’t install anything on the legacy infrastructure.

Although I see benefit for the project in setting up monitoring, I have to keep in mind that this is not an actual assignment from our customer, nor are they paying for this POC. I’m doing this experiment in my free time, and I want to make sure I don’t break our customer’s applications or the infrastructure they are hosted on.

Legacy health check interface

You can check the health of our legacy apps by doing a REST call to a “/status” path. If you get a 200 back, it means the web app itself is “alive”.

The response body is a JSON object that contains info about the health of the components (for lack of a better word) the web app integrates with, e.g. a database, elasticsearch, another app, third party system, … The JSON structure of the response is as follows:

{
 "COMPONENT1": "UP",
 "COMPONENT2": "WARNING",
 "COMPONENT3": "CRITICAL",
 "COMPONENT4": "DOWN"
}


Let’s agree that for a health check to be considered healthy, we want a 200 response and all components must have status UP or WARNING.
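
To make that rule concrete, here’s a minimal Python sketch of how I think about “healthy” for the rest of this post. The function name and shape are my own, purely for illustration:

# Statuses we agreed to treat as healthy; CRITICAL or DOWN means unhealthy.
VALID_STATES = ['UP', 'WARNING']

def is_healthy(status_code, body):
    # body is the parsed JSON response, e.g. {"COMPONENT1": "UP", ...}
    return status_code == 200 and all(state in VALID_STATES for state in body.values())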

Route 53

Amazon Route 53 is Amazon’s DNS service. You can use it to register a domain, route traffic to your resources and check the health of these resources. The cool thing is, you can also use the Route 53 health check feature for non-AWS resources. You can target any endpoint you like, as long as it’s accessible by AWS.

So I started off by setting up a health check for one of the apps using the Route 53 console.

Route53 new health check

When configuring a health check via the console, there is an “advanced configuration” section where you can choose the option “string matching”. You can then fill in the JSON response body that would match a healthy status in the “search string” field. For example:

{
 "COMPONENT1": "UP",
 "COMPONENT2": "UP"
}
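
For reference, you could also create the same health check programmatically. A minimal boto3 sketch, assuming an HTTPS endpoint with string matching; the domain, path and search string are placeholders:

import uuid
import boto3

route53 = boto3.client('route53')

route53.create_health_check(
    CallerReference=str(uuid.uuid4()),  # any unique string, used for idempotency
    HealthCheckConfig={
        'Type': 'HTTPS_STR_MATCH',                 # HTTPS check with string matching
        'FullyQualifiedDomainName': 'my-app.com',  # placeholder endpoint
        'Port': 443,
        'ResourcePath': '/status',
        'SearchString': '{ "COMPONENT1": "UP", "COMPONENT2": "UP" }',  # must appear in the response body
        'RequestInterval': 30,                     # standard interval
        'FailureThreshold': 3,
    },
)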


When all is set up, your Route 53 health checks homepage could look like this:

Route53 health checks home

Let’s have a look at what AWS is charging us for our brand new health check. With a base price of $0.75 per month for a non-AWS endpoint, plus an extra $2.00 each for HTTPS and for string matching, we’re going to be charged about $4.75 per month for our health check.

Cloudwatch

Now that we have our health checks installed, let’s try to set up some monitoring and alerting.

Amazon Cloudwatch is a service that gives you insight into what is going on with your cloud resources and applications. You can collect logs and metrics, set up monitoring dashboards and raise alarms.

Note that AWS will charge you for metrics, API calls, dashboards, alarms, logs and events. You can find pricing info here.

As it turns out, when you set up health checks in Route 53, metrics are automatically available in Cloudwatch. You can already have a peek by going to the Cloudwatch Metrics section in the AWS console. Fill in “Route 53” in the search bar, click “Health Check Metrics” and explore. You should be seeing something like this:

Cloudwatch Route53 health check metrics

A value of 1 means our health check was OK, a value of 0 means the health check failed.
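
If you prefer code over the console, you should be able to read that same metric with boto3, along these lines. The health check ID is a placeholder, and note that Route 53 publishes its health check metrics in us-east-1:

import boto3
from datetime import datetime, timedelta

# Route 53 health check metrics live in the us-east-1 region
cloudwatch = boto3.client('cloudwatch', region_name='us-east-1')

stats = cloudwatch.get_metric_statistics(
    Namespace='AWS/Route53',
    MetricName='HealthCheckStatus',
    Dimensions=[{'Name': 'HealthCheckId', 'Value': 'abcd1234'}],  # placeholder ID
    StartTime=datetime.utcnow() - timedelta(hours=1),
    EndTime=datetime.utcnow(),
    Period=300,
    Statistics=['Minimum'],  # 1 = healthy for the whole period, 0 = at least one failure
)
print(stats['Datapoints'])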

Dashboards

In Cloudwatch you can gather your metrics in custom-made dashboards. You create your own layout and drop in metric widgets. For our apps, I made a layout with the different environments from left to right, and I’ll be dropping in metric widgets for each app in each environment column. I haven’t added all apps yet, but for now the result looks like this:

Cloudwatch Dashboard
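
Dashboards can also be managed through the API. Here’s a rough boto3 sketch of a one-widget dashboard for the Route 53 metric; the dashboard name, widget title and health check ID are placeholders:

import json
import boto3

cloudwatch = boto3.client('cloudwatch', region_name='us-east-1')

dashboard_body = {
    'widgets': [
        {
            'type': 'metric',
            'x': 0, 'y': 0, 'width': 6, 'height': 6,
            'properties': {
                'title': 'MyApp - ACC',  # placeholder widget title
                'metrics': [
                    ['AWS/Route53', 'HealthCheckStatus', 'HealthCheckId', 'abcd1234'],  # placeholder ID
                ],
                'stat': 'Minimum',
                'period': 300,
                'region': 'us-east-1',
            },
        },
    ],
}

cloudwatch.put_dashboard(
    DashboardName='legacy-health-checks',  # placeholder dashboard name
    DashboardBody=json.dumps(dashboard_body),
)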

Alarms

With Cloudwatch Alarms, you can perform actions based on metric values. For example, you could create an alarm that is triggered when our health check metric drops to 0. There are several actions AWS can perform, but the one we’re interested in is sending a notification to the Simple Notification Service (SNS). Just think of SNS as a pub/sub mechanism: when you and your team members subscribe to the topic with your email addresses, you will get a notification whenever the alarm publishes to it.

When you’re creating a health check in the Route 53 console, it’s very easy to configure an alarm along with your health check and link it to an SNS topic. You can always view, modify and create alarms in the Cloudwatch console.
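
The console does this for you, but for completeness, a boto3 sketch of such an alarm might look roughly like this. The alarm name, health check ID and SNS topic ARN are placeholders:

import boto3

cloudwatch = boto3.client('cloudwatch', region_name='us-east-1')

cloudwatch.put_metric_alarm(
    AlarmName='my-app-acc-health-check',  # placeholder name
    Namespace='AWS/Route53',
    MetricName='HealthCheckStatus',
    Dimensions=[{'Name': 'HealthCheckId', 'Value': 'abcd1234'}],  # placeholder ID
    Statistic='Minimum',
    Period=60,
    EvaluationPeriods=1,
    Threshold=1,
    ComparisonOperator='LessThanThreshold',  # alarm when the metric drops below 1, i.e. to 0
    AlarmActions=['arn:aws:sns:us-east-1:123456789012:health-alerts'],  # placeholder topic ARN
)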

Voila, we have successfully set up health checks, monitoring and alerting, using nothing but AWS services!

Route53 Diagram

Limitations

Unfortunately, there were some limitations with the Route 53 health checks, at least for what I was trying to achieve…

  • String matching

String matching turned out to be very limited: I could only match an exact string. This is a problem, since I said earlier that all components should have status “UP” or “WARNING” to be considered healthy, and none should be “CRITICAL” or “DOWN”. I didn’t find any documentation on how to express “and” or “or” conditions, or regular expressions, in string matching. An extra problem was that the order of the components in the REST response is not fixed :-( There is a way to combine health checks, but it didn’t feel like the right path to follow.

  • Request interval

When setting up a health check, you can choose a standard (30 seconds) or fast (10 seconds) request interval. But you should know that there are typically about 15 checkers sending requests from all kinds of regions, which means your endpoint will be hit every 2 seconds or so. You can decrease the number of checkers to 3, but even then your endpoint still gets a request roughly every 10 seconds.

Let’s see if we can swap Route 53 health checks for something else in the AWS toolbox.

Lambda

AWS Lambda is Amazon’s solution for serverless. It lets you run code (which AWS calls a function) without having to worry about provisioning or managing servers. Suppose I could write some code that calls the status endpoints of our apps and parses the result. I would just have to upload my code to AWS Lambda and configure how and when it will get triggered.

There are quite a few triggers available for Lambda functions. Most of them connect other AWS services to your Lambda and are either push or pull based. For example, you can trigger a Lambda function whenever a file gets uploaded into Amazon’s Simple Storage Service. You can also use the Invoke API to trigger a Lambda from your custom application. For this, there are several language-specific SDKs made available by AWS.
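
As a tiny illustration, invoking a function from your own code with the Python SDK would look something like this; the function name and payload are just examples:

import json
import boto3

lambda_client = boto3.client('lambda')

lambda_client.invoke(
    FunctionName='my-health-check',  # placeholder function name
    InvocationType='Event',          # asynchronous; use 'RequestResponse' to wait for the result
    Payload=json.dumps({'endpoint': 'https://my-app.com/status'}),
)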

For our health checks, I would like to trigger my Lambda every 15 minutes. Cloudwatch Event Rules can do just that! You can create a rule that fires on a cron expression or at a fixed rate. Here’s a screenshot of the Cloudwatch Events console for creating a rule:

Cloudwatch Events Rule
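
The same rule can be created with the SDK. A minimal boto3 sketch, assuming a 15-minute rate expression; the rule name is a placeholder:

import boto3

events = boto3.client('events')

events.put_rule(
    Name='health-check-every-15-minutes',   # placeholder rule name
    ScheduleExpression='rate(15 minutes)',  # or a cron expression such as 'cron(0/15 * * * ? *)'
    State='ENABLED',
)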

OK, now let’s see if I can write some code to do the health check…

AWS Lambda supports several runtimes. At the time of writing, you can choose amongst .NET, Go, Java, Node.js, Python and Ruby. You can write code from scratch, use a blueprint (a template with some sample code) or browse the Lambda repository for off-the-shelf applications. Since I have been writing Java code for most of my professional career, that was the go-to runtime for me, but then I stumbled upon a Python blueprint called lambda-canary. This blueprint uses a Cloudwatch Events trigger, just like I’ve described above, and the Python code checks a URL for a valid response. Sweet!

After googling some Python tutorials, I managed to modify the code in order to check the status of the components. I read the response as JSON and see whether all of my components are in a valid state:

# Statuses we agreed to treat as healthy (see above)
VALID_STATES = ['UP', 'WARNING']

def check_health(json_result, components):
    print('Endpoint response: {}'.format(json_result))

    # Every expected component must report a valid state
    for component in components:
        if json_result[component] in VALID_STATES:
            print('Component {} is in valid state {}'.format(component, json_result[component]))
        else:
            print('Component {} is in invalid state {}'.format(component, json_result[component]))
            return False

    return True


In the Lambda Designer, you can edit your function in an embedded code editor and create test events. This gives you a pretty short feedback loop for some trial and error hacking :-)

Lambda designer

OK, so now I have a Lambda function that checks my app status, and a rule that fires every 15 minutes. So far, so good. But actually I have several apps and different environments I want to check. How can I do this without duplicating my Lambda function?

I was looking at stuff like aliases and layers, but they seem to serve other purposes.

The best I could find was letting my Cloudwatch event rule invoke the same Lambda function multiple times, but with different data. I did this by following this blog post, where AWS describes using constant JSON text as the payload for an event. When you look at the screenshot in that post, you can also see a button in the lower right corner that says “Add target”. This means a single rule can invoke multiple targets when it fires.

I changed my Python code to accept a payload that looks like this:

{
  "app": "MyApp",
  "env": "ACC",
  "endpoint": "https://my-app.com/status",
  "components": [
    "component1",
    "component2"
  ]
}
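
Putting it together, my handler roughly reads those fields from the event and reuses check_health from above. This is a simplified sketch rather than the exact blueprint code; the urllib call and the error message are my own:

import json
from urllib.request import urlopen

def lambda_handler(event, context):
    # The event is the constant JSON payload configured on the Cloudwatch Events target
    app = event['app']
    env = event['env']

    with urlopen(event['endpoint']) as response:  # raises on non-2xx responses
        json_result = json.loads(response.read())

    if not check_health(json_result, event['components']):
        # Raising makes the invocation count towards the Lambda Errors metric
        raise Exception('Health check failed for {} ({})'.format(app, env))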


OK great! Now we have a single rule that triggers a generic Lambda function to check the status of all my apps in all environments.
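
If you’d rather script this wiring than click “Add target” in the console, a boto3 sketch could look roughly like this, with one target per app/environment combination. The rule name, function ARN and payloads are placeholders:

import json
import boto3

events = boto3.client('events')

events.put_targets(
    Rule='health-check-every-15-minutes',  # the rule created earlier (placeholder name)
    Targets=[
        {
            'Id': 'my-app-acc',  # any identifier that is unique within the rule
            'Arn': 'arn:aws:lambda:eu-west-1:123456789012:function:health-check',  # placeholder ARN
            'Input': json.dumps({
                'app': 'MyApp',
                'env': 'ACC',
                'endpoint': 'https://my-app.com/status',
                'components': ['component1', 'component2'],
            }),
        },
        # ... one more target per app/environment, each with its own constant JSON input
    ],
)

One thing the console handles for you when you add a Lambda target is the resource-based permission that lets events.amazonaws.com invoke the function; if you script the wiring like this, you’d also have to add that permission yourself.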

Lambda Pricing

Let’s have a look at what this Lambda would cost. Lambda functions are charged based on the number of requests, duration and memory usage. I must say I find it hard to understand how to calculate the cost, but here’s how I think it works…

  • Requests

You pay $0.0000002 per request. At an interval of 15 minutes, this is about 2880 requests per month per app. In total, that’s $0.000576 per month per app.

  • Duration

You pay $0.0000166667 per GB-second. If I do a test call, I can see that my Lambda uses about 72MB of memory and runs for about 600ms. So if I cap my memory usage at 128MB, the cost is $0.000000208 per 100ms. That’s $0.000001248 for one function call of 600ms, or $0.00359424 for 2880 requests per month per app.

In total, the cost would be $0.00417024 per app per month. I found a Lambda cost calculator online that gives about the same result. That’s way cheaper than the cost of a Route 53 health check!
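
For the sceptics, here’s the same arithmetic as a quick Python sanity check (prices as listed above; the tiny difference with my numbers comes from rounding the per-100ms price):

REQUEST_PRICE = 0.0000002        # dollars per request
GB_SECOND_PRICE = 0.0000166667   # dollars per GB-second

requests_per_month = 4 * 24 * 30        # one invocation every 15 minutes ~= 2880
gb_seconds_per_call = 0.600 * 0.125     # 600 ms billed at 128 MB (0.125 GB)

request_cost = requests_per_month * REQUEST_PRICE
duration_cost = requests_per_month * gb_seconds_per_call * GB_SECOND_PRICE

print(request_cost)                  # ~0.000576
print(duration_cost)                 # ~0.0036
print(request_cost + duration_cost)  # ~0.0042 dollars per app per month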

But wait, there’s more: AWS states in their documentation there’s an indefinite free tier of 1M requests and 400,000 GB-seconds per month. For 128MB, that’s 3,200,000 seconds worth of free compute time and more than enough for our little Lambda.

Custom metrics

OK so the next step is to get some metrics on a dashboard for this Lambda. Luckily, Lambda already gives you some metrics out of the box:

Lambda metrics

We can use the Errors metric, since our lambda-canary blueprint raises an exception whenever the health check fails. Unfortunately, I see no way to split up these metrics based on the parameters (which app and environment) I’m invoking the function with.

Bummer!

Let’s find out what we can do about this…

Oh snap! I found some documentation here and here that says you can define your very own type of metric in AWS. And looking at the examples, it seems pretty easy to let your Lambda push metric data to Cloudwatch. My Python code and custom metric data look like this:

import boto3

cloudwatch = boto3.client('cloudwatch')

# Push one data point: 1 if the health check passed, 0 if it failed.
# The app and env dimensions let us tell the different apps and environments apart.
response = cloudwatch.put_metric_data(
    MetricData=[
        {
            'MetricName': 'Status',
            'Dimensions': [
                {'Name': 'app', 'Value': app},
                {'Name': 'env', 'Value': env},
            ],
            'Unit': 'None',
            'Value': 1 if status_healthy else 0
        },
    ],
    Namespace='MyAppName'
)


The Namespace is like the category under which your metrics are grouped. Cloudwatch keeps a separate metric for every combination of dimension values.

Custom metrics

If you want to push metrics to Cloudwatch like this, your Lambda is going to need extra permissions. You may or may not have noticed, but when we created our Lambda function in the console, AWS automatically created an execution role. This is the set of permissions your function has to go crazy in the AWS world. Your function starts off with just some basic Cloudwatch Logs permissions, to push its logs out to Cloudwatch. If you want your function to do something extra, like sending metrics to Cloudwatch or storing a file in the Simple Storage Service (S3), you’re going to have to edit its execution role and grant some extra permissions. You can start reading here if you’d like to know more.

My extra set of permissions looks like this:

{
 "Effect": "Allow",
 "Action": [
  "cloudwatch:PutMetricData"
 ],
 "Resource": "*"
}
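
You can paste this straight into the role in the IAM console, or script it. A hedged boto3 sketch of attaching it as an inline policy might look like this; the role and policy names are placeholders:

import json
import boto3

iam = boto3.client('iam')

policy = {
    'Version': '2012-10-17',
    'Statement': [
        {
            'Effect': 'Allow',
            'Action': ['cloudwatch:PutMetricData'],
            'Resource': '*',
        },
    ],
}

iam.put_role_policy(
    RoleName='health-check-lambda-role',  # placeholder: the function's execution role
    PolicyName='allow-put-metric-data',   # placeholder policy name
    PolicyDocument=json.dumps(policy),
)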


The resource wildcard seems a bit insecure, but I couldn’t find an Amazon Resource Name (ARN) for the Cloudwatch metric I want to push data to. I did find some hints in the documentation that resource-level ARNs simply don’t exist for Cloudwatch metrics.

To go full circle, you can also set up alarms for your custom metrics, so you can get notified when your health check fails, just like we did for the Route 53 health checks.

Wrap-up

To wrap it up, here’s a diagram of what we’ve been building with our Lambda:

Lambda Diagram

I hope you had fun and learned as much as I did while tagging along. I think I’ve shown that you can do a lot with AWS, and that Lambdas are super cheap and becoming quite a powerful service, especially if you glue them together with other AWS services.

If you have anything to add, you can find us on twitter, facebook, linkedin or instagram.

Happy Clouding!
