A Decoupled Event Bus with CloudWatch Events

During beer o'clock at work, I happened upon Roman, a software developer on our API and Integrations team. He asked for my opinion on creating a sort of "contract" between teams when setting up SNS topics and subscribing them to SQS queues. Now, this is how conversations with me often start. I'm a software developer by trade, but I have my feet in cloud infrastructure and security as well, so, at the very least, I'm a good sounding board for people's architecture ideas. I pushed a bit deeper and he finally relented: he doesn't want to think about the infrastructure, only the contract between teams. Really, he wants to emit an event and have it consumed by someone else… if they care enough to consume it.

Identifying the Real Problem

Aha, now the real problem domain has been stated. Often, we get stuck designing architectures around what we already know, and our designs reflect that echo chamber. By reaching out and talking with others, we can often collectively find a better solution. This is nobody's fault; it's difficult to think outside the box if you've never been exposed to new things. In other words, you don't know what you don't know.

I told him that what he's describing is an event or message bus. Also, the infrastructure he's designing (and often what he has to deal with day-to-day) is a leaky abstraction. As a software developer, you and your application shouldn't need to think about every component and layer of the OSI model at all times. A good understanding of these layers and components is key, but often the lower layers leak into software and hamper more than help. He agreed that an event bus is what he wanted, but his design suffered from the fact that his application knows about the event bus (an SNS topic) and that other teams would need help subscribing to the topic. Not to mention, disaster recovery and service failover were not part of the design (and are difficult to build with those services).

In any event bus scenario, there is a producer (the creator of events) and consumers (often more than one; they process the events). There are three problems here. First, the producer knows about the structure of the event bus: it's an SNS topic, it resides in a specific region, and an application needs permission to publish events to it. Second, the consumers know about both the SQS queue and the SNS topic. Finally, in cloud providers you are often working in multiple regions. If you want a reliable event bus, you don't want your application to publish messages to another region (so-called "cross-region" requests), because if the event bus's region goes offline, your app goes offline too.

Someone smart may then say: you should use Kinesis because it is a better event bus. True, multiple clients can consume from one Kinesis stream, but you are still stuck in a single region. Also, your applications know about the Kinesis stream because they need to use the appropriate API calls, thereby making it a leaky abstraction. Lastly, someone is going to need to manage that Kinesis resource, which makes it a shared dependency across teams.

Introducing CloudWatch Events

Instead, I introduced him to CloudWatch Events. No, this doesn't have anything to do with alarms and metrics, but it is housed in the same service. It's a nifty service that already exists in everyone's AWS account by default. Many AWS services emit events all the time, and those events are put onto the event bus that CloudWatch Events provides. You can consume from this event bus by creating a CloudWatch Event Rule, which includes a pattern to match against and one or more targets to send matching events to.

In essence, an event rule uses a pattern to match against the JSON payload of any event flowing through the system. This happens at near real-time speed. You can pattern match as granularly as you want, from everything down to specific actions within an AWS service. Once a match is made, the event payload is sent to a target. Targets include Lambda functions, SNS topics, SQS queues, and even Kinesis streams. New targets are added regularly, so read the docs for the current list.

The payload of a CloudWatch Event is a specific JSON structure. There are four keys of note: source, detail-type, detail, and resources. However, that's about where any standard ends. source and detail-type are free-form strings, resources is an array of free-form strings (usually ARNs), and detail is a JSON object holding any extra data about the event. For instance, when an auto-scaling group creates a new EC2 instance, it emits an event with source: "aws.autoscaling" and detail-type: "EC2 Instance Launch Successful". If you create an event rule to pattern match on these two items, you will know about it any time an ASG successfully creates an EC2 instance.
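
To make that concrete, here's a minimal sketch of creating such a rule with boto3 (the rule name is a placeholder of mine; you'd still need a put_targets call to point the rule at something):

import json
import boto3

events = boto3.client("events")  # the CloudWatch Events API

# Match every successful instance launch by any auto-scaling group.
pattern = {
    "source": ["aws.autoscaling"],
    "detail-type": ["EC2 Instance Launch Successful"],
}

events.put_rule(
    Name="asg-launch-success",  # hypothetical rule name
    EventPattern=json.dumps(pattern),
    State="ENABLED",
)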

Where the Real Use Comes In

At this point, CloudWatch Events is only mildly interesting. You get to know about AWS infrastructure events, but not every application cares about that. That's not where CloudWatch Events shines, though. You can also push custom events from your applications into this service and have them consumed by other services. In this way, Roman's vision of simply emitting an event, without having to care about the contract with other services, is realized. In the event payload, you can set source to your project name, resources to the server it came from, detail-type to the type of event it was, and detail to any extra data about the event. For instance, did you just receive a new customer signup? Great, emit an event, and maybe someone else will care enough to consume it. This is what your payload would look like:

{
  "source": "com.example-corp.customerapp",
  "detail-type": "New Signup",
  "resources": [
    "i-201kd02d02nd02ls"
  ],
  "detail": {
    "plan": "Enterprise500",
    "customer_id": 123456
  }
}
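
Emitting that payload is a single SDK call. Here's a minimal sketch with boto3, mirroring the payload above (note that Detail must be a JSON-encoded string, and put_events accepts up to ten entries per call):

import json
import boto3

events = boto3.client("events")

events.put_events(
    Entries=[
        {
            "Source": "com.example-corp.customerapp",
            "DetailType": "New Signup",
            "Resources": ["i-201kd02d02nd02ls"],
            # Detail is a JSON-encoded string, not a nested object.
            "Detail": json.dumps({
                "plan": "Enterprise500",
                "customer_id": 123456,
            }),
        }
    ]
)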

A consumer that cares about a new signup from this app would have an event rule pattern of:

{
  "source": [
    "com.example-corp.customerapp"
  ],
  "detail-type": [
    "New Signup"
  ]
}

…and you're done. Attach this rule to a target (e.g. a Lambda function) and you can now process the new signup event however you like. Maybe another team comes along that only cares about the big customers, so they write an event rule like the one below and have matching events emailed to them via SNS:

{
  "source": [
    "com.example-corp.customerapp"
  ],
  "detail-type": [
    "New Signup"
  ],
  "detail": {
    "plan": [
      "Enterprise1000"
    ]
  }
}
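
Wiring that up might look like the following boto3 sketch (the rule name and topic ARN are placeholders; note that the SNS topic's own access policy must also allow events.amazonaws.com to publish to it):

import json
import boto3

events = boto3.client("events")

pattern = {
    "source": ["com.example-corp.customerapp"],
    "detail-type": ["New Signup"],
    "detail": {"plan": ["Enterprise1000"]},
}

events.put_rule(
    Name="big-customer-signups",  # hypothetical rule name
    EventPattern=json.dumps(pattern),
    State="ENABLED",
)

events.put_targets(
    Rule="big-customer-signups",
    Targets=[
        {
            "Id": "email-the-team",  # any id unique within the rule
            "Arn": "arn:aws:sns:us-east-1:123456789012:big-customers",  # placeholder ARN
        }
    ],
)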

All of these consumers can be implemented as CloudFormation stacks and launched in any region where the events are emitted. That means they are small, self-contained, serverless microservices. It hits all the buzzwords!

Decoupling Services and Leaky Abstractions

An application (producer) residing on an EC2 instance first needs a generic IAM permission (events:PutEvents) to emit events, and then it uses the AWS SDK to emit a specific event payload. And that's it. The producer doesn't know anything about the event bus, only that it exists somewhere in the region, which takes care of the leaky abstraction. It also doesn't know about any consumers. The consumers, on the other hand, have no idea where the event came from (apart from event rule pattern matching) and live as pure logic in Lambda functions (or SNS topics, etc.). That takes care of the decoupling: the producer and the consumers can have different lifecycles, unbeknownst to each other. Since CloudWatch Events is region-specific, you need to have your consumers in every region where the events are produced, but this is great because it forces you to design for immediate region failover (no cross-region event consumption allowed).
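
As an illustration, a consumer attached as a Lambda target receives the full event envelope as its input, with detail already parsed. A hypothetical handler for the "New Signup" rule:

# Hypothetical Lambda handler for the "New Signup" rule above.
def handler(event, context):
    # The envelope (source, detail-type, region, etc.) is set by
    # CloudWatch Events; "detail" arrives as a parsed dict.
    plan = event["detail"]["plan"]
    customer_id = event["detail"]["customer_id"]
    print(f"New {plan} signup: customer {customer_id}")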

Decoupling Teams

If you have a data team and they want to know about all these events, they don't need to contact your application teams. They simply create an event rule that matches everything (a pattern must contain at least one field, so a handy trick is to match on your account ID, which every event carries) and set a Kinesis stream as the target. Beware, this is a lot of data (like drinking from a firehose!), but at least an entire team is decoupled from the other teams.
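
A sketch of that, assuming a pre-existing Kinesis stream and an IAM role that lets CloudWatch Events write to it (the rule name and both ARNs are placeholders):

import json
import boto3

events = boto3.client("events")

# A pattern must contain at least one field, so match on the account
# ID (present on every event) to catch everything.
events.put_rule(
    Name="all-events-to-kinesis",  # hypothetical rule name
    EventPattern=json.dumps({"account": ["123456789012"]}),
    State="ENABLED",
)

events.put_targets(
    Rule="all-events-to-kinesis",
    Targets=[
        {
            "Id": "firehose-of-everything",
            "Arn": "arn:aws:kinesis:us-east-1:123456789012:stream/all-events",  # placeholder
            # Kinesis targets need a role that CloudWatch Events can
            # assume to put records on the stream.
            "RoleArn": "arn:aws:iam::123456789012:role/events-to-kinesis",  # placeholder
        }
    ],
)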

Cost Considerations

I'm a frugal guy, and I don't like using services that cost a lot of money once you start throwing a bunch of data at them.

In terms of cost, using CloudWatch Events as your custom event bus is incredibly cheap. This is the world of serverless, so you're not paying for idle time, only for actual use. Reading AWS-generated events is free, and emitting your own custom events costs $1.00 per million events. The cost will also include any other AWS resources used in the rule targets (e.g. Lambda functions). For most low-to-medium usage, you're looking at less than $3 per month.

Some Examples

I've used CloudWatch Events to build many things, and it's so easy (especially with CloudFormation) and so decoupled that it works like magic.

Where to Go From Here

CloudWatch Events may solve some problems, and it may not, depending on the problem domain. However, it does a good job of being an event bus that stays out of an application developer's way, and is decoupled enough that producers and consumers don't need to know about each other. The structure of the JSON event payload is simple enough to extend, but rigid enough that a custom schema isn't required.

Try it out for yourself on a small project and see how it goes. I bet it will be all you need without having to resort to Kafka or Kinesis. Save that stuff for the "enterprise" people.