Scheduled Tasks with ECS and Step Functions

Written by Abimael Martinez, August 03, 2022

Background

Amazon Web Services allows you to schedule tasks using ECS and CloudWatch, and for noncritical jobs that only need to run on a regular cadence, this is enough. The setup is simple, and we can manage many scheduled tasks as code with Terraform, without the overhead and complexity of cloud orchestration systems such as Kubernetes.

But there’s a caveat to task scheduling with ECS + CloudWatch. The duo is missing two essential features for our workflow: task locks and timeouts. This post will cover how we leveraged two other AWS services, Step Functions and DynamoDB, to add these missing features.

The Problem

In the Real-time Bidding (RTB) team within NextRoll, we previously managed scheduled tasks using Celery, which is good in some respects, such as being flexible and integrating with Python (the language in which we wrote most of our tasks). But it also had significant drawbacks: it was hard to maintain and limited our programming language choices to Python. More importantly, our code deploys were tied to Celery deploys. That interaction caused minor but recurrent problems, such as task locks not getting released or multiple tasks of the same kind getting dispatched.

Because we were going to spend a significant amount of time fixing our problems with Celery anyway, we decided to spend some of it investigating and comparing alternatives. To narrow the search, we defined some minimum requirements for the task management system:

Cron-like scheduling

The essential feature. It consists of running a task at fixed times, dates, or intervals.

Singleton behavior

We should be able to constrain a task definition so that only a single instance of it can run at any given time. This behavior helps prevent race conditions and resource starvation scenarios while producing repeatable side effects.

Timeouts

Tasks should also be constrained to run only for a limited, configurable duration. This behavior prevents runaway resource consumption and frozen tasks, and it helps identify bugs more quickly.

Some Solution(s)

We found a couple of options that met our minimum requirements, but they didn’t quite fit our taste:

Code it into each of the tasks, perhaps via shared libraries.

Timeouts and locking do not always work reliably when managed by the task itself. Funny bugs and unexpected situations are not uncommon, as we experienced when attempting this approach with Python in a previous iteration.

We prefer a solution that manages the task from outside its container, for reliability and ease of use. It also allows more flexibility when choosing a programming language for a task, because we don’t need an existing library in the target language to manage it.

Use a cloud orchestration system such as Kubernetes or Mesos.

Orchestration systems have their complexities and management overhead. We didn’t have a Kubernetes expert in our team, for example. And other systems don’t have managed solutions in AWS. Although we prefer cloud orchestration systems to code-in-tasks, we continued to look for something simpler.

While good in some aspects, these solutions drove us to keep looking for alternatives. One of those was ECS + CloudWatch scheduling, which we almost discarded because there is no way to achieve two of our minimum requirements (singleton behavior and timeouts) with ECS and CloudWatch configuration alone.

Enter Step Functions

Looking past the discouraging GitHub issues, we found a suggestion in the comments that could perhaps solve our problem. We looked at the Step Functions service and found that, in combination with ECS and CloudWatch, it met our requirements:

A managed system with a low overhead configuration
Outside-the-container task management for singleton behavior and timeouts
Bonus: integrations with other AWS services

Although it is a novel approach, documentation is plentiful, and it was easy to test before completely committing to the solution. We used the Step Functions integration with DynamoDB to implement the singleton behavior with per-task locks, and the integration with ECS tasks to define timeouts.

But talk is cheap. On to the code!

Defining the Step Functions

AWS Step Functions provides serverless orchestration for modern applications.

A step function is a set of steps, or states (as in a state machine), defined in JSON. You can define those steps in the AWS console, whose visual editor is useful for prototyping, or, as we prefer, with Terraform, keeping our infrastructure as code.

Our goal here is to use Terraform to configure a single task with singleton behavior that times out after a configurable amount of time. After setting this up, you should be able to trigger the task from the Step Functions section of the AWS console.

Using the aws_sfn_state_machine resource from the Terraform AWS provider, we define a step function that checks for and adds a lock corresponding to the task. If that succeeds, it runs the task and removes the lock after the task finishes. Here’s the code for the step function itself; the “steps” are nested inside the States map as key/value pairs:

resource "aws_sfn_state_machine" "task_lock" {
	name     = var.task_name
	role_arn = var.sfn_role

	definition = jsonencode({
		Comment = "Task Lock"
		StartAt = "CheckLock"
		States = {
			# The CheckLock, RunTask, and RemoveLock steps defined below go here.
		}
	})
}

Now we’ll define the CheckLock step. This step adds an item that corresponds to the task to DynamoDB, but only if that item doesn’t already exist. If the write succeeds, we run the ECS task (the RunTask step). Here’s the step definition:

CheckLock = {
	Type     = "Task"
	Resource = "arn:aws:states:::dynamodb:putItem"
	Parameters = {
		TableName = var.table_name
		Item = {
			Task   = { S = var.task_name }
			Locked = { BOOL = true }
		}
		ConditionExpression = "attribute_not_exists(Task)"
	}
	Next = "RunTask"
}

Here are the details:

Type is the type of step we’re defining.
Resource is the ARN of the service integration the step invokes, here the DynamoDB putItem call.
Parameters are the parameters passed to the Resource, and they differ for each kind of resource. In particular, ConditionExpression tells DynamoDB to return an error if the item key is already in the table.
Next is the step that runs after this one.
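
As written, a failed condition check makes the whole execution fail. One way to treat "the lock is already held" as a normal outcome is to catch that specific error and end the run early. The following is a minimal sketch, not our production definition: the AlreadyRunning state name is hypothetical, and ending with a Succeed state (rather than a Fail state) is an assumption about how you want skipped runs reported.

```hcl
# Fragment of the States map inside jsonencode({...}); not a standalone file.
CheckLock = {
	Type     = "Task"
	Resource = "arn:aws:states:::dynamodb:putItem"
	Parameters = {
		TableName = var.table_name
		Item = {
			Task   = { S = var.task_name }
			Locked = { BOOL = true }
		}
		ConditionExpression = "attribute_not_exists(Task)"
	}
	# Step Functions reports a failed condition check under this error name.
	Catch = [{
		ErrorEquals = ["DynamoDB.ConditionalCheckFailedException"]
		Next        = "AlreadyRunning"
	}]
	Next = "RunTask"
}

# Hypothetical terminal state: skip this run without marking the execution as failed.
AlreadyRunning = {
	Type = "Succeed"
}
```

Any other DynamoDB error is not caught here and still fails the execution, which is usually what you want for misconfigured tables or permissions.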

For the RunTask step we’ll run an ECS task and wait for it to finish. After the task finishes, we’ll go to the RemoveLock step:

RunTask = {
	Type     = "Task"
	Resource = "arn:aws:states:::ecs:runTask.sync"
	Parameters = {
		Cluster        = var.task_cluster
		TaskDefinition = aws_ecs_task_definition.default.arn
	}
	TimeoutSeconds = var.timeout_minutes * 60
	Next = "RemoveLock"
}

In the snippet above, the key parts are the runTask.sync resource, where the .sync suffix means the step function waits for the ECS task to complete, and TimeoutSeconds, after which the step function terminates the ECS task if it hasn’t finished already.
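
Note that when RunTask times out or the ECS task fails, the step fails and the execution stops before it reaches RemoveLock, so the lock item stays in the table. A minimal sketch of one way to handle that, assuming you are happy to route every caught error straight to the lock cleanup:

```hcl
# Fragment of the States map inside jsonencode({...}); not a standalone file.
RunTask = {
	Type     = "Task"
	Resource = "arn:aws:states:::ecs:runTask.sync"
	Parameters = {
		Cluster        = var.task_cluster
		TaskDefinition = aws_ecs_task_definition.default.arn
	}
	TimeoutSeconds = var.timeout_minutes * 60
	# States.ALL is a wildcard that matches States.Timeout and States.TaskFailed, among others.
	Catch = [{
		ErrorEquals = ["States.ALL"]
		Next        = "RemoveLock"
	}]
	Next = "RemoveLock"
}
```

One trade-off of this sketch: an execution whose task failed or timed out still ends successfully once the lock is removed. If you want those runs to show up as failures, an alternative is to send the Catch to a separate cleanup step that deletes the lock and then transitions to a Fail state.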

Finally, we remove the lock:

RemoveLock = {
	Type     = "Task"
	Resource = "arn:aws:states:::dynamodb:deleteItem"
	Parameters = {
		TableName = var.table_name
		Key = {
			Task = { S = var.task_name }
		}
	}
	End = true
}

Here, the only new parameter is End. It marks this as the last step: once it completes, no steps remain and the step function finishes successfully.
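
The snippets above rely on a handful of input variables, such as var.task_name, var.table_name, and var.timeout_minutes. A sketch of how they might be declared follows; the types, descriptions, and the default timeout are assumptions for illustration, not values from our configuration.

```hcl
variable "task_name" {
	type        = string
	description = "Name of the task; also used as the lock key in DynamoDB"
}

variable "sfn_role" {
	type        = string
	description = "ARN of the IAM role the state machine assumes"
}

variable "table_name" {
	type        = string
	description = "Name of the DynamoDB table that stores task locks"
}

variable "task_cluster" {
	type        = string
	description = "ECS cluster the task runs on"
}

variable "timeout_minutes" {
	type        = number
	description = "Maximum run time before the task is stopped"
	default     = 30 # assumed default, pick one that suits your tasks
}
```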

Next = “Steps”

In this post, we’ve worked on a simplified version of a task-locking Step Function. It is missing a couple of important sections that allow tasks and operations to fail gracefully, such as the Catch and Retry sections. Without them, a task that fails or times out never reaches RemoveLock, so its lock stays in place. You will also need to define the permissions for the step function role and the task scheduling using CloudWatch. But those should be easy to figure out.
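
For the scheduling part, a minimal sketch is a CloudWatch event rule with a schedule expression whose target is the state machine. The variables var.schedule_expression and var.events_role are hypothetical names introduced here, and the role they refer to is assumed to already exist.

```hcl
# Hypothetical inputs: var.schedule_expression and var.events_role are not defined in this post.
resource "aws_cloudwatch_event_rule" "schedule" {
	name                = "${var.task_name}-schedule"
	schedule_expression = var.schedule_expression # e.g. "rate(1 hour)" or "cron(0 3 * * ? *)"
}

resource "aws_cloudwatch_event_target" "start_execution" {
	rule = aws_cloudwatch_event_rule.schedule.name
	arn  = aws_sfn_state_machine.task_lock.arn
	# This role must be allowed to call states:StartExecution on the state machine.
	role_arn = var.events_role
}
```

With this in place, each scheduled event starts a new execution, and the lock in CheckLock decides whether the task actually runs.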

An additional recommendation: encapsulate these configurations in a Terraform module that allows easy reuse and shared resources (such as the DynamoDB table).
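
As a rough illustration of that layout, the lock table can be created once and passed to each module instance. The module path, module inputs, and resource names below are hypothetical; only the Task key matches the snippets in this post.

```hcl
# Shared lock table: one item per task, keyed by the task name.
resource "aws_dynamodb_table" "task_locks" {
	name         = "scheduled-task-locks" # hypothetical name
	billing_mode = "PAY_PER_REQUEST"
	hash_key     = "Task"

	attribute {
		name = "Task"
		type = "S"
	}
}

# Hypothetical module wrapping the state machine; each call defines one scheduled task.
module "nightly_report" {
	source          = "./modules/scheduled_task"
	task_name       = "nightly-report"
	table_name      = aws_dynamodb_table.task_locks.name
	sfn_role        = var.sfn_role
	task_cluster    = var.task_cluster
	timeout_minutes = 30
}
```

Adding another scheduled task then amounts to one more module block that reuses the same table.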

Conclusion

Step Functions are an effective tool to integrate different services from AWS without writing code. The one we’ve defined for managing our scheduled tasks has run without issue, and its operation has been simple. We’ve found that Step Functions are a service we will keep close by in our dev toolbox.