Deploying Threat Intel Platforms, From GitHub.

It all started with a crazy research project.

A few years ago, I was involved with an odd research project. The team was a mix of individuals, most of us not sharing a work location, a work culture (we were contractors on the project) or a common methodology for configuring, deploying and managing both our prototype and our CI infrastructure. In years past, the paradigm was simple: your sysadmin team (e.g., YOU) had a folder on a laptop with all the nuts and bolts required to deploy an application. You (or someone close to you) racked the servers, configured the OS and, if you were a superstar, documented it in a text file somewhere.

As things like AWS, Ansible and TravisCI became more mainstream, 'crazy devs' (hi!) decided we wanted to automate the deployment of our code. We wanted to treat it more like a pipeline rather than deal with the traditional QA process, which required more people and usually months to execute on. QA? We wanted to use production as QA!

We were trying to chain Ansible with an earlier CI solution (a popular one whose name I can't remember anymore; life comes at you fast). The problem wasn't so much managing and automating the code deployment as managing the playbooks that deployed the application(s). We could have kept those playbooks in with the core code, but that's more overhead in the repo and more people touching the core code who didn't need to. It also meant mixing docs and issues with the core code, when really those were separate concerns.

Early Stages.

It may not mean much in the earlier stages, but as you bring more "non-developer-focused resources" into the mix, you may not want them having access to other, more sensitive areas of the application. Revision control is helpful, but so is reducing complexity for those whose core competency isn't code development. I don't really care what minor version of Python you want on the box, and they don't really care what the code does, just that "make install" does what it should.

Then there are the lifecycle changes of the code vs "the service". What do you do when you start a new version of the core code? Do you copy all the playbook "ops" related stuff over to the new repo? How do you transcend versions of the shipped product? We had this exact same problem with earlier CIF deployments too. At this stage, the available CIs were becoming so integrated with things like GitHub that, after some quick Google searching, the answer became clear: put it in a repo, dummy!


You Either Do or Don't.

Most shops fall into one of two camps: either you have a well-documented ops pipeline, or you don't. Very few fall in between. If you have a process and are fully resourced to sustain that pipeline, great! Most of us... are not. GitHub (or GitLab, or hosted git, it doesn't matter) affords you a couple of things 'for free':

  1. Ops playbooks (in Ansible speak) are now version controlled. Want to make changes? Submit a PR. Want to make sure a third party reviews those changes before they make it into the core deployment code? Enforce code reviews.

  2. GitHub/Lab/Whatever usually has a wiki. It's with the code. Mix that with a little ASCIIFlow magic and now your architecture doc is with your code. All backed up and easily accessible.

  3. Infrastructure issue tracking. Have an issue with a piece of the deployment or a piece of its infrastructure? You now have a compartmentalized place to track it!

  4. Separation of duties. Want to involve parts of your non-developer team with your deployment, but want to keep their minds free of the issues related to the code itself? Want to give them more flexibility to augment the ops part of your service, but not the development part? Done.

  5. Ability to open-source and garner community help with the core of your codebase, but not your operational playbooks (e.g., CIF :)).

  6. Resilience. Since you're using things like Ansible (and something like AWS), your infrastructure is now documented and backed up BY DEFAULT, not as an afterthought. This mitigates much of your technical debt risk as a manager.

  7. The "fork ability" to have other teams in your org literally fork your work and test new ideas, without disrupting the original code base.

  8. Want your other teams to be able to test and replicate the service exactly as it's set up in production? Fork it! They find a bug in the deployment? Submit a PR!

The last two CIF deployments we've done follow this paradigm. We have a repo called "ses_ops" in which we keep our custom Ansible playbooks for each piece of the infrastructure. SES is the name of the service we run our CIF instance under. It looks something like this:


Within each component, the Ansible playbooks are responsible for bringing up and configuring our AWS infrastructure. Since we don't use an 'all in one' CIF instance, we have to bring up load balancers, configure Elasticsearch nodes and press cif-router AMIs (disk images; the cif-routers are immutable). When we release a new version of CIF, we simply run the Ansible code that configures a new AMI, push the new auto-scaling-group config and commit the change to GitHub.

The whole process takes 15 minutes, is fully tracked and mostly automated. I can hand the repo to someone and say: here are all the things you need to stand up the service, and the wiki has a step-by-step process for configuring and executing Ansible. I use it myself, because I'm terrible at remembering the various moving pieces, which, because of this process, I don't have to do that often anymore. Stuff just runs, for years. Done is the engine of more.
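To make the release steps above a bit more concrete (configure a new AMI, push the auto-scaling-group config, commit the change), here is a rough dry-run sketch in Python. It only builds and prints the commands; it doesn't run anything. The playbook file names, the `cif_version` variable and the version string are all made up for illustration and are not the actual contents of our repo.

```python
import shlex


def release_commands(cif_version,
                     ami_playbook="cif_router_ami.yml",   # hypothetical file name
                     asg_playbook="cif_router_asg.yml"):  # hypothetical file name
    """Return the ordered commands for one release, as argument lists."""
    version_var = f"cif_version={cif_version}"  # hypothetical variable name
    return [
        # 1. Configure and press a new immutable cif-router AMI.
        ["ansible-playbook", ami_playbook, "-e", version_var],
        # 2. Push the new auto-scaling-group config.
        ["ansible-playbook", asg_playbook, "-e", version_var],
        # 3. Record the change in the ops repo and push it to GitHub.
        ["git", "commit", "-a", "-m", f"Release CIF {cif_version}"],
        ["git", "push"],
    ]


if __name__ == "__main__":
    # Dry run: print the plan instead of executing it.
    for cmd in release_commands("1.2.3"):  # placeholder version
        print(shlex.join(cmd))
```

Keeping the plan as plain data makes it easy to review in a PR before anyone runs it for real.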

CSIRTG Platform Operations.

This is also how we operate the CSIRTG platform, which is a monolithic Ruby on Rails application. Unlike `ses_ops`, though, where we need to manually configure and more carefully deploy some of the code (CIF is more of a non-standard webapp), CSIRTG is deployed using Elastic Beanstalk. Since Elastic Beanstalk abstracts a bit of the deployment code for us, there isn't really a need for Ansible, and the configurations kept in the code base are very minor. Most of it is kept in the EB app itself, and EB just reads from a special config directory in the code base where we can make tweaks as we go. Everything else is an environment variable.

It's a trade-off, but there are fewer eyes involved with this code base than with the others. The CSIRTG repo is linked to CircleCI: as we submit pull requests, tests are executed, and if they pass, CircleCI automatically pushes the new code to Elastic Beanstalk for deployment. This enables us to push multiple versions of the app every day. If I see a user trigger an odd 500 somewhere, a fix is usually deployed in minutes, not weeks. I don't have to try and remember what commands I need to deploy the new app. I submit a pull request; if it passes, we hit merge and deployment takes place.
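The gating idea above (no green tests, no deploy) fits in a few lines. This is not our CircleCI config, just a minimal Python sketch of the logic; `gate_deploy` and the stand-in test commands are hypothetical names invented for the example.

```python
import subprocess
import sys


def gate_deploy(test_cmd, deploy):
    """Run the test command; call deploy() only if it exits with status 0."""
    result = subprocess.run(test_cmd)
    if result.returncode != 0:
        return False  # tests failed: nothing is deployed
    deploy()
    return True


if __name__ == "__main__":
    # Stand-ins for a real test suite: one that passes, one that fails.
    passing = [sys.executable, "-c", "pass"]
    failing = [sys.executable, "-c", "raise SystemExit(1)"]
    gate_deploy(passing, lambda: print("deploying"))
    gate_deploy(failing, lambda: print("never printed"))
```

In a hosted CI the same decision is made for you by the pipeline; the point is only that the deploy step is unreachable unless the tests pass.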

I've been running this app for ~4 years now, and outside of a few "asset" issues (JavaScript, CSS), we've probably deployed 500+ versions of the code and only run into a handful of issues. The backout plan? EB enables you to do two things. The first is to deploy to only a single set of nodes at a time. If one fails, it stops and your other production nodes are able to keep chugging along on the old code while you figure it out. The second: if something goes really sideways, you can re-deploy the most recent stable version and try again. Of course, this all assumes you don't want to test your changes on a set of QA instances, which EB makes trivial to do. With the success and flexibility we've had using this, we've only needed QA instances a handful of times, usually coupled with changing major Ruby or Rails versions.
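The two backout behaviours described above (deploy one set of nodes at a time and stop on failure, or re-deploy the last stable version) can be simulated in plain Python. This is a toy model of the idea, not how Elastic Beanstalk is implemented; the node names, version labels and the `deploy_batch` health-check callback are all hypothetical.

```python
def rolling_deploy(nodes, new_version, deploy_batch, batch_size=1):
    """Deploy batch by batch and stop at the first failure.

    nodes: dict of node name -> current version (updated in place).
    deploy_batch: callable(batch_names, version) -> True if the batch is healthy.
    Returns True only if every batch succeeded.
    """
    names = list(nodes)
    for start in range(0, len(names), batch_size):
        batch = names[start:start + batch_size]
        if not deploy_batch(batch, new_version):
            # Stop here. In this simulation the failed batch and all later
            # batches are simply left on the version they already had.
            return False
        for name in batch:
            nodes[name] = new_version
    return True


def redeploy_stable(nodes, stable_version):
    """Put every node back on the last known-good version."""
    for name in nodes:
        nodes[name] = stable_version


if __name__ == "__main__":
    fleet = {"web-1": "v1", "web-2": "v1", "web-3": "v1"}
    # Pretend the batch containing web-2 fails its health check.
    ok = rolling_deploy(fleet, "v2", lambda batch, version: "web-2" not in batch)
    print(ok, fleet)
    redeploy_stable(fleet, "v1")
    print(fleet)
```

The useful property is that a bad release never reaches more than one batch, so the rest of production keeps serving the old code while you figure it out.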

But does it do WEBSCALE?

Over the years, these methodologies not only helped us scale sideways but, coupled with a very cheap AWS bill, enabled us to run faster with very few FTE cycles. We don't worry as much about documenting things or about what commands we were supposed to run to deploy something. In order to get it to work, it has to be doc'd; that's the nature of AWS and Ansible.

The more you deploy, the faster you can run, the less anxiety you have, the more you can compete. We don't lose track of longer-running issues, because they're easy to write down in a compartmentalized way. This is key, because if you have a repo with 1000 issues in it, you might as well just delete them all. Nobody is going to comb through that list to see what's important, and with infrastructure-related problems, that can really harm your operation.