Docker on ECS

As 2017 approached, we took stock of our tech stack at MightySignal. Our product was growing nicely but our infrastructure was lagging. We collect a lot of data on mobile apps, mostly in huge bursts of work. We could not automatically scale the cluster to meet demand, so we’d adopted a procedure to increase capacity, complete the job, and downsize. The steps were simple but manual.

Every engineer oversees a large domain when your team is three people. Product work always beckons. Revisiting and monitoring the same capacity problems week after week put undue cognitive load on our team and ate up precious time (weekends included). We targeted year end to remove this burden.

Getting Started

Marco, our in-house Docker expert, strongly recommended a container-based solution. He had joined our team recently and had already started using Docker in development. We decided to run a pilot with our API servers and migrate the rest of the stack if successful. We chose ECS as most of our stack is built on AWS.

Before the migration, we had ~20-25 on-demand EC2 instances running various projects: multiple client-facing UI instances and staging environments, and the same again for the API. Other instances scraped mobile app metadata for various stores and countries, downloaded and stored mobile apps, classified SDKs, monitored external services, and updated our Elasticsearch cluster.

Our primary applications are built with Ruby on Rails, using Redis-backed Sidekiq for background job orchestration. We used Capistrano for deployments and Chef recipes to bake AMIs for our instances.

The Migration

We constructed our Dockerfiles. Using our existing Chef recipes as a guide, we built container assets from Chef files and templates and brought them into our applications’ repositories.
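
For a sense of the result, a Rails Dockerfile in that style looks roughly like the sketch below; the base image, packages, and paths are illustrative rather than our exact files.

# Sketch of a Rails Dockerfile; base image, packages, and paths are illustrative.
FROM ruby:2.3

# System packages the app needs (assumed here: MySQL client headers and Node for the asset pipeline)
RUN apt-get update && apt-get install -y --no-install-recommends \
      libmysqlclient-dev nodejs \
 && rm -rf /var/lib/apt/lists/*

WORKDIR /app

# Install gems first so the bundle layer is cached between code changes
COPY Gemfile Gemfile.lock ./
RUN bundle install --jobs 4 --deployment

# Copy the application code, including the config templates that used to live in Chef
COPY . .

CMD ["bundle", "exec", "rails", "server", "-b", "0.0.0.0"]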

We automated our deployment pipeline with CircleCI. Newly pushed commits trigger test runs. Release tags initiate the creation of container images. Images are then uploaded to ECR to be accessed by ECS.
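
The tag-triggered deployment step itself is a handful of Docker and AWS CLI commands along these lines; the account ID, region, repository name, and tag are placeholders, not our real configuration.

# Placeholder account ID, region, repository, and tag throughout.
$(aws ecr get-login --region us-east-1)   # ECR auth, using the CLI of the era
docker build -t api:master.5 .
docker tag api:master.5 123456789012.dkr.ecr.us-east-1.amazonaws.com/api:master.5
docker push 123456789012.dkr.ecr.us-east-1.amazonaws.com/api:master.5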

Tools

We didn’t get very far before we discovered the need for a tooling upgrade. We decided to standardize on the tag convention <branch name>.<incrementer> to support our tag-based deployment workflow. Introducing mightytag, a convenience layer around git tag.
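
A wrapper like that is little more than a script along these lines; the details here are illustrative, not the exact tool.

#!/bin/bash
# Illustrative sketch of a mightytag-style wrapper: bump the highest
# <branch>.<n> tag for the current branch and push it.
set -euo pipefail

branch=$(git rev-parse --abbrev-ref HEAD)

# Highest existing incrementer for this branch, defaulting to 0
last=$(git tag --list "${branch}.*" | sed "s/^${branch}\.//" | sort -n | tail -1)
next=$(( ${last:-0} + 1 ))

git tag "${branch}.${next}"
git push origin "${branch}.${next}"
echo "Tagged ${branch}.${next}"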

Updating task definitions and services during deployment had to be abstracted away for correctness and our sanity. mightydeploy - you should start noticing a pattern - makes it simple to tweak a set of sensible defaults for container settings to configure each service. It also templates sensitive information into the container’s environment variables at runtime by downloading and merging in a secrets JSON file stored in S3. Once defined, a service can be updated to a desired image by specifying it in the options.

mightydeploy --service WEB --images application=sidekiq-nginx=master.5
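
Behind a command like that, a deploy boils down to a few AWS steps: download the secrets file, render a task definition, register the new revision, and point the service at it. A rough sketch, with placeholder bucket, cluster, and file names, and generate-task-def standing in for the templating step:

# Placeholder names throughout; generate-task-def is a stand-in for the templating step.
aws s3 cp s3://example-secrets-bucket/web.json ./secrets.json

./generate-task-def --image master.5 --secrets ./secrets.json > task-def.json

revision=$(aws ecs register-task-definition \
             --cli-input-json file://task-def.json \
             --query 'taskDefinition.taskDefinitionArn' --output text)

aws ecs update-service --cluster web --service WEB --task-definition "$revision"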

We soon realized that accessing the production Rails console was no longer a simple SSH away. ECS distributes containers across a cluster’s instances. Getting on the console now involves finding the right running task, locating the instance it runs on, and accessing the container.

mightyrun --key ~/.ssh/my_key.pem --service API sidekiq bash

We developed mightyrun to make this simpler. The interface is reminiscent of docker-compose’s. The command above runs bash on the sidekiq container of the API service.
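
Without it, the same dance looks roughly like this; the cluster, service, container, key, and SSH user names are placeholders.

# Placeholder cluster, service, container, and user names.
cluster=API
service=API
container=sidekiq

# 1. Find a running task for the service
task=$(aws ecs list-tasks --cluster "$cluster" --service-name "$service" \
         --query 'taskArns[0]' --output text)

# 2. Trace it to the container instance and then the EC2 host
ci=$(aws ecs describe-tasks --cluster "$cluster" --tasks "$task" \
       --query 'tasks[0].containerInstanceArn' --output text)
ec2_id=$(aws ecs describe-container-instances --cluster "$cluster" \
           --container-instances "$ci" \
           --query 'containerInstances[0].ec2InstanceId' --output text)
host=$(aws ec2 describe-instances --instance-ids "$ec2_id" \
         --query 'Reservations[0].Instances[0].PublicDnsName' --output text)

# 3. SSH in and exec into the matching container
ssh -t -i ~/.ssh/my_key.pem "ec2-user@$host" \
  "docker exec -it \$(docker ps -qf name=$container | head -1) bash"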

Restructuring

Finally, we had to restructure the background queueing logic. Whether intentionally or not, Sidekiq nudges you toward grouping background queues into broad categories; see the default Sidekiq config that defines queues based on priority. We settled on “server group” style queueing that separated concerns: the instances running the scraper queue would scrape app stores while those running the sdk queue would classify SDKs. We enjoyed this convenient setup but eventually felt restricted, as we soon had to manage a wide variety of tasks bucketed together in broadly defined queues. Supporting a new workload meant either adding it to an existing server group or creating and maintaining a new set of instances for it. Since the second option was unappealing, we kept reaching for the first, and our queues grew busier and more overgrown. As the product’s needs grew, jobs from all over the product competed for priority and time on the same instances.

Eager to move away from this setup, we established a 1:1 relationship between services (+ task definitions) and queues. The ios_web_scrape queue will only do iOS web scrapes and can be adjusted via the ios_web_scrape service.
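
Each of those worker services just runs Sidekiq pinned to a single queue. The second queue name and the concurrency values below are illustrative.

# One queue per worker service; queue names beyond ios_web_scrape and the
# concurrency values are illustrative.
bundle exec sidekiq --queue ios_web_scrape --concurrency 2
bundle exec sidekiq --queue sdk_classification --concurrency 4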

Days of adjusting, rewriting, and keyboard-smashing later, we stood up over 30 separate services from a handful of queues. One queue was even broken into 10 (!!) separate services.

A Brave New World

We migrated our entire system in a month. ~5 on-demand EC2 instances manage all client-facing live systems in an auto-scaling group. ~10 spot instances handle all the background jobs. The entire team uses Docker locally to develop and deploys with our CircleCI-based workflow.

Benefits

Our EC2 costs shrank. We run multiple applications per instance - even ones that were previously incompatible - and utilize our instances far more heavily. This allows us to downsize our cluster of on-demand instances. Another huge gain was the addition of spot instances to handle all the background jobs. Our new microservice (hah!) structure lets us identify low-priority background jobs and run them on spot instances whenever we can, saving us ~80% for the same computing power. In total, we saved 73% on instance costs for an ECS setup that reproduced the same functionality.

More importantly, we eliminated the scaling headache that prompted the entire project. Our instance autoscaling groups are still manually resized, but we don’t need to scale for background jobs anymore. With ECS, we just create a service for a particular job and specify a task count that completes the background job in a reasonable time. After configuring all the services, we resized the spot instance autoscaling group so it could handle the load of the entire cluster. Each background task is usually inactive, so we cram containers onto the boxes with low CPU and memory reservations. We allocated ~75 total tasks in the background job cluster over 10 instances. Spot instances are so cheap compared to on-demand ones that we just keep them up all the time.
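
Tuning a particular job is now a matter of adjusting its service’s desired count; the cluster and service names below are placeholders.

# Placeholder cluster and service names; the desired count is whatever
# finishes the job in a reasonable time.
aws ecs update-service --cluster background --service ios_web_scrape --desired-count 8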

I’m equally excited about some other perks. We took a giant step towards immutable infrastructure. Our team is small and none of us are experts on server maintenance; Docker and ECS can monitor and handle misbehaving containers for us. We simplified developer onboarding tremendously: install Docker, clone a repository, and run docker-compose up. No more OS-specific installations of Redis, Elasticsearch, MySQL, etc. Upgrading macOS will be less terror-inducing without worrying about permissions being wiped or other side effects breaking my setup.
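
For reference, the development compose file is in the spirit of the sketch below; the images, versions, and ports are assumptions rather than our exact setup.

# Sketch of a development docker-compose.yml; images, versions, and ports are assumptions.
version: '2'
services:
  web:
    build: .
    command: bundle exec rails server -b 0.0.0.0
    ports:
      - "3000:3000"
    depends_on:
      - mysql
      - redis
      - elasticsearch
  mysql:
    image: mysql:5.6
    environment:
      MYSQL_ALLOW_EMPTY_PASSWORD: "yes"
  redis:
    image: redis:3.2
  elasticsearch:
    image: elasticsearch:2.4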

We offload our advanced querying needs to Elasticsearch to allow our customers to consider all signals and find the mobile market data they seek. We construct millions of documents with all the relevant app attributes - user base size, ratings by country, SDK install and uninstall histories, etc. - and deposit them into our Elasticsearch cluster weekly. Compute-intensive tasks like these have seen a huge performance boost: we can now spread 20 processes across 10 machines rather than have 20 Sidekiq threads on 1 machine competing for resources.

Downsides

Deployments on ECS are slower and more opaque, especially those behind load balancers. Capistrano was simple and well understood. To deploy now, we run mightydeploy and then watch the ECS events log to find out when the service re-establishes a steady state. It can be quick or it can lag; either way, only the minimal information in the events log surfaces. We’re still searching for ways to increase visibility. Including the CircleCI image building, our deployment time increased from around a minute to between 5 and 10 minutes.
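
In the meantime, checking on a deploy mostly means polling the service’s recent events until the steady state message appears, along these lines (cluster and service names are placeholders):

# Placeholder cluster and service names; repeat until the
# "has reached a steady state." event shows up.
aws ecs describe-services --cluster web --services WEB \
  --query 'services[0].events[:5].[createdAt,message]' --output text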

The lack of built-in Docker image layer caching in CircleCI caused us significant headaches in the beginning. We were able to significantly speed up image creation by revising our CircleCI deployment stage to pull down a previously pushed image and use it locally as the layer cache. This will apparently be addressed natively in CircleCI 2.0; in the meantime, kudos to CircleCI for having a platform flexible enough to make the workaround possible.
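
For the curious, the workaround is roughly the following; the repository URL and tags are placeholders, and newer Docker releases also need the pulled image passed explicitly via --cache-from to treat it as a cache.

# Placeholder repository URL and tags.
# Pull the most recently pushed image so its layers are available locally...
docker pull 123456789012.dkr.ecr.us-east-1.amazonaws.com/api:latest || true

# ...then build against it (newer Docker needs --cache-from to reuse those layers).
docker build --cache-from 123456789012.dkr.ecr.us-east-1.amazonaws.com/api:latest \
  -t 123456789012.dkr.ecr.us-east-1.amazonaws.com/api:master.6 .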

It’s too early to call our experiment a success, but the early signs are promising. Lower costs, less maintenance, and new scaling capabilities in exchange for slower and more complex deployments. And most exciting of all, we have one slightly less busy engineering team.

