A Decoupled Event Bus with CloudWatch Events
During beer o'clock at work, I happened upon Roman, a software developer on our API and Integrations team. He asked for my opinion on creating a sort of "contract" between teams when setting up SNS topics and subscribing them to SQS queues. This is how conversations with me often start. I'm a software developer by trade, but I have my feet in cloud infrastructure and security as well, so, at the very least, I'm a good sounding board for people's architecture ideas. I pushed a bit deeper and he finally relented: he doesn't want to think about the infrastructure, only the contract between teams. Really, he wants to emit an event and have it consumed by someone else… if they care enough to consume it.
Identifying the Real Problem
Aha, now the real problem domain has been stated. Often, we get stuck designing architectures around what we already know, and our designs reflect that echo chamber. By reaching out and talking with others, we can often collectively find a better solution. This is nobody's fault; it's difficult to think outside the box if you've never been exposed to new things. In other words, you don't know what you don't know.
I told him that what he was describing is an event or message bus. I also told him that the infrastructure he's designing (and often what he has to deal with day-to-day) is a leaky abstraction. As a software developer, you and your application shouldn't need to think about every component and layer of the OSI model at all times. A good understanding of those layers and components is key, but the lower layers often leak into software and hamper more than they help. He agreed that an event bus is what he wanted, but his design suffered from the fact that his application had to know about the event bus (an SNS topic) and that other teams would need help subscribing to the topic. Not to mention, disaster recovery and service failover were not part of the design (and are difficult to build with those services).
In any event bus scenario, there is a producer (the creator of events) and there are consumers (often many; they process the events). There are three problems here. First, the producer knows about the structure of the event bus: it's an SNS topic, it resides in a specific region, and an application needs permission to publish events to it. Second, the consumers know about both the SQS queue and the SNS topic. Third, in cloud providers you are often working in multiple regions. If you want a reliable event bus, you don't want your application to publish messages to another region (so-called "cross-region" requests), because if the event bus region goes offline, your app goes offline too.
Someone smart may then say: you should use Kinesis because it is a better event bus. True, multiple clients can consume from a single Kinesis stream, but you are still stuck in a single region. Your applications also know about the Kinesis stream because they need to use the appropriate API calls, making it another leaky abstraction. Lastly, someone needs to manage that Kinesis resource, which makes it a shared concern and dependency across teams.
Introducing CloudWatch Events
Instead, I introduced him to CloudWatch Events. No, this doesn't have anything to do with alarms and metrics, but it's housed in the same service. It's a nifty service that already exists in every AWS account by default. Many AWS services emit events all the time, and those events are put onto the event bus that CloudWatch Events provides. You can consume from this event bus by creating a CloudWatch Event Rule, which includes a pattern to match against and one or more targets to send matching events to.
In essence, an event rule uses a pattern to match against the JSON payload of any event flowing through the system. This happens at near real-time speed. You can pattern match as granularly as you want, from everything down to specific actions within an AWS service. Once a match is made, the event payload is sent to a target. Targets include Lambda functions, SNS topics, SQS queues, and even Kinesis streams. New targets are being added constantly, so read the docs for more information.
The payload of a CloudWatch Event is a specific JSON structure. There are four main keys: source, detail-type, detail, and resources. However, that's about where any standard ends. source and detail-type are free-form strings, resources is an array of free-form strings, and detail is a JSON blob for any extra data about the event. For instance, when an auto-scaling group creates a new EC2 instance, it emits an event with source: "aws.autoscaling" and detail-type: "EC2 Instance Launch Successful". If you create an event rule to pattern match on these two fields, you will know any time an ASG successfully creates an EC2 instance.
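If you'd rather wire that rule up with the SDK than the console, here's a minimal sketch using boto3. The rule name and Lambda ARN are placeholders of my own, and the Lambda function also needs a resource policy allowing CloudWatch Events to invoke it, which I've omitted:

import json
import boto3

events = boto3.client("events")

# Match the AWS-generated auto-scaling event described above.
events.put_rule(
    Name="asg-launch-notifier",  # hypothetical rule name
    EventPattern=json.dumps({
        "source": ["aws.autoscaling"],
        "detail-type": ["EC2 Instance Launch Successful"],
    }),
)

# Send matching events to an existing Lambda function (placeholder ARN).
events.put_targets(
    Rule="asg-launch-notifier",
    Targets=[{
        "Id": "notify-lambda",
        "Arn": "arn:aws:lambda:us-east-1:123456789012:function:notify",
    }],
)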
Where the Real Use Comes In
At this point, CloudWatch Events is only mildly interesting. You get to know about AWS infrastructure events, but not every application cares about those. That's not where CloudWatch Events shines, though. You can also push custom events from your applications into this service and have them consumed by other services. In this way, Roman's vision is realized: he doesn't have to think about the infrastructure, he simply emits an event. In the event payload, you can set the source to your project name, resources to which server it came from, detail-type to the type of event it was, and detail to any extra data about the event. For instance, did you just receive a new customer signup? Great, emit an event, and maybe someone else will care enough to consume it. This is what your payload would look like:
{
  "source": "com.example-corp.customerapp",
  "detail-type": "New Signup",
  "resources": [
    "i-201kd02d02nd02ls"
  ],
  "detail": {
    "plan": "Enterprise500",
    "customer_id": 123456
  }
}
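Emitting that payload from your application is a single SDK call. Here's a minimal boto3 sketch (note that the SDK capitalizes the keys as Source, DetailType, and Detail, and that Detail must be a JSON-encoded string):

import json
import boto3

events = boto3.client("events")

response = events.put_events(
    Entries=[{
        "Source": "com.example-corp.customerapp",
        "DetailType": "New Signup",
        "Resources": ["i-201kd02d02nd02ls"],
        # Detail is passed as a JSON-encoded string.
        "Detail": json.dumps({
            "plan": "Enterprise500",
            "customer_id": 123456,
        }),
    }],
)

# put_events is batched; failures are reported per entry.
if response["FailedEntryCount"] > 0:
    raise RuntimeError("event was not accepted: %s" % response["Entries"])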
A consumer that cares about a new signup from this app would have an event rule pattern of:
{
  "source": [
    "com.example-corp.customerapp"
  ],
  "detail-type": [
    "New Signup"
  ]
}
…and you're done. Attach this rule to a target (e.g. a Lambda function) and you can process the new signup event however you like. Maybe another team comes along that only cares about the big customers, so they write an event rule like the one below and have the matching events emailed to them via SNS:
{
  "source": [
    "com.example-corp.customerapp"
  ],
  "detail-type": [
    "New Signup"
  ],
  "detail": {
    "plan": [
      "Enterprise1000"
    ]
  }
}
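Back to the Lambda target on the first rule: to make the consumer side concrete, here's a minimal sketch of a handler. CloudWatch Events invokes the function with the full event envelope, so the custom fields arrive under detail:

def handler(event, context):
    # The event arrives as parsed JSON; "detail" holds our custom data.
    plan = event["detail"]["plan"]
    customer_id = event["detail"]["customer_id"]
    print("New %s signup from customer %s" % (plan, customer_id))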
All of these consumers can be implemented as CloudFormation stacks and launched in any region where the events are emitted. That means they are small, self-contained, serverless microservices. It hits all the buzzwords!
Decoupling Services and Leaky Abstractions
An application (producer) residing on an EC2 instance first needs a generic IAM permission (events:PutEvents) to emit events, and then it uses the AWS SDK to emit a specific event payload. And that's it. The producer doesn't know anything about the event bus, only that it exists somewhere in the region. That takes care of any leaky abstractions. It also doesn't know about any consumers. Consumers, on the other hand, have no idea where the event came from (apart from what the event rule pattern matches) and live as pure Lambda (or SNS, etc.) logic. That takes care of decoupling. Both the producer and the consumer can have different lifecycles, unbeknownst to each other. Since CloudWatch Events is region-specific, you need consumers in every region where events are produced, but this is a good thing: it forces you to design for immediate region failover (no cross-region event consumption allowed).
Decoupling Teams
If you have a data team and they want to know about all of these events, they don't need to contact your application teams. They simply create an Event Rule with a catch-all pattern (an empty pattern isn't allowed, but matching on something every event carries, like the account ID, matches everything) and set a Kinesis stream as the target. Beware, this is a lot of data (like drinking from a firehose!), but at least an entire team is decoupled from the other teams.
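As a sketch, that catch-all rule and Kinesis target might look like the following in boto3 (the stream name, role, and account ID are hypothetical; Kinesis targets need a RoleArn that lets CloudWatch Events write to the stream):

import json
import boto3

events = boto3.client("events")

# An empty pattern isn't allowed, so match on the account ID, which
# every event carries. This effectively matches everything.
events.put_rule(
    Name="all-events-to-kinesis",
    EventPattern=json.dumps({"account": ["123456789012"]}),
)

events.put_targets(
    Rule="all-events-to-kinesis",
    Targets=[{
        "Id": "firehose-of-everything",
        "Arn": "arn:aws:kinesis:us-east-1:123456789012:stream/all-events",
        "RoleArn": "arn:aws:iam::123456789012:role/events-to-kinesis",
    }],
)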
Cost Considerations
I'm a frugal guy, and I don't like using services that start costing a lot of money once you throw a bunch of data at them.
In terms of cost, using CloudWatch Events as your custom event bus is incredibly cheap. This is the world of serverless, so you're not paying for idle times, only when the system is being used. Reading AWS-generated events is free, and emitting your own custom events is $1.00 per million events. The cost will also include any other AWS resources used in the rule targets (e.g. Lambda functions). For most low-to-medium usage, you're looking at less than $3 per month.
Some Examples
I've used CloudWatch Events to build many things, and it's so easy (especially with CloudFormation) and decoupled that it works like magic. So far I've created:
- a Slack notifier for when people assume a privileged IAM role
- an OpsGenie notifier for when someone logs in as the AWS root user
- an automatic namer for ASG-backed EC2 instances
- a user data event emitter (I'll dive into this in a future article)
Where to Go From Here
CloudWatch Events may or may not solve your problem, depending on the problem domain. However, it does a good job of being an event bus that stays out of an application developer's way, and it's decoupled enough that producers and consumers don't need to know about each other. The structure of the JSON event payload is simple enough to extend, but rigid enough that a custom schema isn't required.
Try it out for yourself on a small project and see how it goes. I bet it will be all you need without having to resort to Kafka or Kinesis. Save that stuff for the "enterprise" people.