Operationalizing the AlienVault Sensor CloudFormation Template - Part 3

This is part 3 in a series of articles. To follow along via code, visit the GitHub repository.

The last article discussed some refactoring using new(-ish) CloudFormation features, which helps improve the readability of the template and reduce its file size. This article temporarily moves away from template modifications and focuses on how someone can review a CloudFormation template for security and operational risks.

Parameters

Let's start with the Parameters section, because that is going to tell us what dependencies we need to bring into this template from our AWS or organizational environment.

SSHLocation

First off we encounter the SSHLocation parameter, which sets the ingress rule on the security group to allow SSH access into the instance. Hmm, why are we allowing SSH into a security appliance? It has implications for the integrity of the instance itself. Nevertheless, let's assume there's a good business case for it. This parameter is provided with a default of 0.0.0.0/0, which means the SSH port is exposed to the Internet. This is a security risk. Of course, the template designer was nice enough to warn us about not using 0.0.0.0/0, and yet set it as an insecure default anyway. I know people, and people choose defaults first. If people want to shoot themselves in the foot, so be it, but let's put a guard rail up. To reduce the security risk, we can remove the Default attribute and make the parameter a required input. This ensures the person deploying the template considers the implications of opening up SSH to a particular IP address range.
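
As a sketch of that guard rail (the attribute values here are assumptions, not copied from AT&T's template), the parameter loses its Default and gains a constraint so the deployer has to make a conscious choice:

    Parameters:
      SSHLocation:
        Type: String
        Description: CIDR range allowed to reach the sensor over SSH. No default on purpose.
        # Still permits 0.0.0.0/0, so the value needs a human review at deploy time.
        AllowedPattern: '^(\d{1,3}\.){3}\d{1,3}/\d{1,2}$'
        ConstraintDescription: Must be a valid CIDR range of the form x.x.x.x/x.
        MinLength: 9
        MaxLength: 18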

KeyName

Next we have the KeyName parameter, which makes sense because a keypair provides SSH access to the EC2 instance. But this opens up both an operational risk and a security risk. The keypair on an EC2 instance cannot be changed while the instance is running, so how are we supposed to rotate it every X months when the machine is meant to stay up? The security risk is that if a keypair is being used, does that mean we are expected to share it with the vendor? That opens up the possibility of a third-party vendor accessing a machine in our environment, which is not a situation I would want without some guarantees. But even if the keypair is kept within our organization, how do we share the private key material with more than one person? A generic shared account like this violates the core security principle of non-repudiation (i.e. we can no longer identify the specific person using the key). The less optimal solution here is to restrict the keypair to a small set of people, rotate it regularly, and rotate the instance with it. The better solution is to remove the keypair entirely and find another way to access the instance (SSM Session Manager, perhaps).
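
A minimal sketch of the keyless approach, assuming the instance role is called SensorInstanceRole (the real template may name it differently): drop the KeyName property from the instance and attach the AWS-managed Session Manager policy to the role, so access is brokered by SSM instead of a shared private key.

    Resources:
      SensorInstanceRole:                  # assumed resource name
        Type: AWS::IAM::Role
        Properties:
          AssumeRolePolicyDocument:
            Version: '2012-10-17'
            Statement:
              - Effect: Allow
                Principal:
                  Service: ec2.amazonaws.com
                Action: sts:AssumeRole
          ManagedPolicyArns:
            # Grants the SSM agent what it needs for Session Manager; no keypair required.
            - arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore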

APITermination

Next is APITermination, and there is no good description for it. Reading the rest of the template we can see that this parameter allows one to disable the ability for the EC2 instance to be terminated by someone using the AWS CLI or console. As this sensor is a security product, it is good that this defaults to true.
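
Something as small as the following would close the documentation gap while keeping the safe default (a sketch; the Description text is my own, not AT&T's):

    Parameters:
      APITermination:
        Type: String
        Description: When 'true', protects the sensor instance from termination via the AWS CLI or console.
        Default: 'true'
        AllowedValues: ['true', 'false']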

HTTPLocation

HTTPLocation is the next parameter and suffers from the same fate as SSHLocation, so the solution is the same.

PublicIP

The parameter PublicIP is interesting because we just went through the exercise of understanding SSHLocation and HTTPLocation, and now it turns out the EC2 instance could potentially be reachable only on a privately-routed network. This is interesting, and may reduce some security risks for us, but the description leaves me wanting to learn more.

VpcId and SubnetId

VpcId and SubnetId tell me that the EC2 instance will reside in a pre-existing VPC and subnet in my AWS region. This is a security risk: what does the sensor need to communicate with to warrant a particular subnet? Does it connect to other resources in the subnet? Should it be in its own VPC and subnet? There is also an operational risk: a sensor living in an existing subnet takes up an IP address and can accept traffic from other resources in it.

TrafficMirroring and NodeName

There is no problem with TrafficMirroring or NodeName. However, NodeName tells me that we may have multiple nodes in the same region, and that may have implications for the stack exports we set up in the last article. Embedding the node name in the export name will solve this problem (e.g. !Sub "alienvault:${NodeName}:endpoint:url").
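
As a sketch (the output name and the source of the URL are assumptions), the export from the previous article could be reworked so two nodes in the same region no longer collide:

    Outputs:
      SensorEndpointUrl:
        Description: Endpoint URL for this sensor node.
        Value: !Sub "http://${SensorInstance.PublicDnsName}"   # assumed instance resource name
        Export:
          # Node name embedded so each node gets a unique export
          Name: !Sub "alienvault:${NodeName}:endpoint:url"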

Skippable Sections

We can skip the Conditions and Metadata blocks as they do not provision resources.

Mappings

The Mappings block is interesting here. It tells me that they have predefined AMIs for this sensor. When I investigated the AMIs in AWS, I found that they were owned by AlienVault, which is a good thing. The problem here is an operational risk. When the stack is deployed, it will use the AMI that is set for its region. That AMI could be unavailable for whatever reason and now the template cannot be used. How do I deploy the sensor to new regions (i.e. there is no ap-north-1 mapping)? Also, this mapping means that AT&T must update the template with new AMIs every so often, which raises the question of how we are supposed to keep our server up to date with new OS-level packages. Does the server update itself? The solution that mitigates part of the operational risk is to remove the AMI mapping from the template and, instead, require the customer to provide an AMI ID as a parameter. The list of AMI IDs would then be downloaded from a URL on the AT&T website. That way the template is evergreen, meaning it does not show its age: a template that is a year old can still be used, allowing for a more deterministic deployment.
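
A sketch of that change (resource and parameter names are assumptions): replace the per-region mapping lookup with a typed parameter, so CloudFormation validates the AMI ID at deploy time and the mapping table disappears entirely.

    Parameters:
      SensorAmiId:
        Type: AWS::EC2::Image::Id
        Description: AMI ID for the AlienVault sensor, taken from the vendor's published list for this region.

    Resources:
      SensorInstance:                      # assumed resource name
        Type: AWS::EC2::Instance
        Properties:
          ImageId: !Ref SensorAmiId        # replaces the per-region Mappings lookup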

Resources

The meat of the template is in the Resources section.

The EC2 Instance

The EC2 instance resource is fairly straightforward, but it presents a major operational risk. This is, by its nature, a snowflake server. Snowflakes are things that are unique, and we try to avoid them in the Cloud simply because Cloud servers tend to be ephemeral and disappear from time to time (like clouds often do). Seeing a single EC2 instance being provisioned like this tells me that AT&T has not considered using an auto-scaling group (ASG) for self-healing (bringing up a new instance when the old one has died). They have not considered how often an EC2 retirement notice event happens. Self-healing is one of the core principles of resiliency, and for a large security company that supposedly integrates with AWS, that is a surprising omission. The solution here is to provision an ASG for self-healing, not elasticity, which also brings the ability to start the server in another availability zone (AZ) to prevent a full failure. AZ outages do happen, and multi-AZ deployment is an AWS best practice for resiliency.
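
A minimal sketch of the self-healing approach (all names are assumptions and properties are abbreviated): a launch template plus an auto-scaling group pinned at one instance, spread over two subnets in different AZs.

    Resources:
      SensorLaunchTemplate:
        Type: AWS::EC2::LaunchTemplate
        Properties:
          LaunchTemplateData:
            ImageId: !Ref SensorAmiId            # assumed parameter from the Mappings fix above
            InstanceType: m5.xlarge              # placeholder; use the sensor's documented size
            SecurityGroupIds:
              - !Ref SensorSecurityGroup         # assumed security group resource
      SensorGroup:
        Type: AWS::AutoScaling::AutoScalingGroup
        Properties:
          MinSize: '1'
          MaxSize: '1'
          DesiredCapacity: '1'                   # self-healing, not elasticity
          VPCZoneIdentifier:
            - !Ref SubnetIdA                     # assumed parameters; one subnet per AZ
            - !Ref SubnetIdB
          LaunchTemplate:
            LaunchTemplateId: !Ref SensorLaunchTemplate
            Version: !GetAtt SensorLaunchTemplate.LatestVersionNumber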

Instance IAM Role

Next up is the IAM role, used by the EC2 instance to reach out and do…stuff. Based on the permissions it is given, we can define what that "stuff" is. Whoa. It's not good, and has multiple security and operational risks. This is a poorly-written IAM policy that any SRE, infrastructure developer, or Cloud security engineer would fail. It suffers from three major issues:

  1. It bundles all the actions into a single statement. This makes it easy to read, but difficult to create a least privilege profile.
  2. It uses wildcards in the actions, which can grant more permissions than one expects.
  3. It uses a wildcard in the resource, thereby granting each action access to every resource.

A lot of these permissions make sense when you consider that the sensor is trying to discover assets and events within an entire AWS account. But let's drill down into specifics. Looking at the S3 action s3:Get*, they are granting themselves access to download every object from every bucket in the AWS account. They are also granting themselves the ability to get torrent links (because why not?) for those objects. This is presumably done for ease of use, but it is a very insecure default for a major security company to take. The sensor may need to download logs, but not every bucket holds logs, and sometimes these buckets contain customer or health data, access to which would violate an organization's compliance requirements. The solution here is to split these actions into separate policy statements (or policy resources) and then scope the resources to only those the sensor is allowed to see. If the sensor needs to read from a new resource, update the stack template with the new resource, then apply the change to the stack. I'd also like to see some clarity around why the sensor needs some of these permissions, so this definitely warrants a follow-up with AT&T and comments added to the template.
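
As a sketch of what splitting and scoping could look like (bucket names and statement IDs are placeholders, not AT&T's actual requirements), each group of actions gets its own statement and only the read-only discovery calls keep a wildcard resource:

    PolicyDocument:
      Version: '2012-10-17'
      Statement:
        - Sid: ReadApprovedLogBuckets
          Effect: Allow
          Action:
            - s3:GetObject
            - s3:ListBucket
          Resource:
            - arn:aws:s3:::example-log-bucket        # placeholder; list only approved buckets
            - arn:aws:s3:::example-log-bucket/*
        - Sid: DescribeAssetsForDiscovery
          Effect: Allow
          Action:
            - ec2:DescribeInstances                  # describe calls are account-wide by nature
          Resource: '*'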

Security Groups / Firewall

Next we have the security groups and their ingress rules. While most seem okay – wait a sec, are they actually opening up port 80? There's no TLS involved, so whatever communication happens with this machine will be in plaintext. Even a self-signed certificate could be used here, along with switching the port to 443. Port 80 also looks to be labelled as a health-check port, which raises the question of whether AT&T can even reach the health check if HTTPLocation is not open to 0.0.0.0/0 or, worse, the instance sits in a privately-routed subnet. Why would a major security company rely on port 80? My solution would be to consider putting the machine behind an application load balancer (ALB; costs ~$18/mo), provisioning an ACM SSL certificate, and terminating SSL at the ALB. It's a minimum level of security for a security sensor.
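
A rough sketch of terminating TLS in front of the sensor (the domain, load balancer, and target group are all assumptions; the ALB and target group resources themselves are omitted for brevity):

    Resources:
      SensorCertificate:
        Type: AWS::CertificateManager::Certificate
        Properties:
          DomainName: sensor.example.com             # placeholder domain
          ValidationMethod: DNS
      SensorHttpsListener:
        Type: AWS::ElasticLoadBalancingV2::Listener
        Properties:
          LoadBalancerArn: !Ref SensorLoadBalancer   # assumed ALB resource
          Port: 443
          Protocol: HTTPS
          Certificates:
            - CertificateArn: !Ref SensorCertificate
          DefaultActions:
            - Type: forward
              TargetGroupArn: !Ref SensorTargetGroup # assumed target group forwarding to port 80 on the sensor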

EBS Volume

Lastly we have the DataStorage resource, which is an EBS volume. It concerns me a little that data is being stored by the sensor at all, when the sensor is supposed to be pushing everything to a Security Operations Centre (SOC). It's at least encrypted, which is good, but it's also just the default KMS encryption, so it's more like checkbox security. Depending on the type of data the sensor is capturing, and we don't know that information, default encryption may or may not be good enough for your organization. In addition, is 100 GB enough space? What happens to the sensor if the volume is full? What happens if the EC2 instance is terminated abruptly, leaving data on the EBS volume? Is the customer expected to provision a new EC2 instance and then reattach the data storage volume to do some data transfer? It's not explained here. And lastly, why is the DeletionPolicy set to Delete when this is a data storage volume? If the stack is deleted, so too will the EBS volume be deleted and, given its resource name, wouldn't we want it to be retained?
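
The deletion policy, at least, is a one-line fix. A sketch with the volume properties abbreviated:

    Resources:
      DataStorage:
        Type: AWS::EC2::Volume
        DeletionPolicy: Retain                       # keep the volume (and its data) if the stack is deleted
        Properties:
          Size: 100
          Encrypted: true
          AvailabilityZone: !GetAtt SensorInstance.AvailabilityZone   # assumed instance resource name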


So that's the end of the review. There are numerous security and operational risks with deploying the template as-is. It will need some risk mitigation in place, which is the path the next article starts down.