Scheduled Tasks with ECS and Step Functions

Written by Abimael Martinez, August 03, 2022

Background

Amazon Web Services lets you schedule tasks using ECS and CloudWatch, and for noncritical jobs that only need to run regularly, that is enough. It is simple, and we can manage many scheduled tasks as code using Terraform, without the overhead and complexity of cloud orchestration systems such as Kubernetes.

But there’s a caveat to task scheduling with ECS + CloudWatch. The duo is missing two essential features for our workflow: task locks and timeouts. This post will cover how we leveraged two other AWS services, Step Functions and DynamoDB, to add these missing features.

The Problem

In the Real-time Bidding (RTB) team within NextRoll, we previously managed scheduled tasks using Celery, which is good in some respects, such as being flexible and integrating with Python (the language in which we wrote most of our tasks). But it also had significant drawbacks: it was hard to maintain and limited our programming language choices to Python. More importantly, our code deploys were tied to Celery deploys. That interaction caused minor but recurrent problems, such as task locks not getting released or multiple tasks of the same kind getting dispatched.

Because we were going to invest a significant amount of time fixing our problems with Celery anyway, we decided to spend some of it investigating and comparing alternatives. To narrow the search, we defined some minimum requirements for the task management system, including singleton behavior (only one instance of a given task running at a time) and task timeouts.

Some Solution(s)

We found a couple of options that met our minimum requirements, but they weren’t quite to our taste.

While good in some aspects, these solutions drove us to keep looking for alternatives. One of those was ECS + CloudWatch scheduling, which we almost discarded because there is no way to achieve our minimum requirements (singleton behavior or timeout behavior) with only ECS and CloudWatch configurations.

Enter Step Functions

Looking past the discouraging GitHub issues, we found a suggestion in the comments that could perhaps solve our problem. We looked at the Step Functions service and found that, in combination with ECS and CloudWatch, it met our requirements.

Although it is a novel approach, documentation is plentiful, and it was easy to test before fully committing to the solution. We used the Step Functions integration with DynamoDB to implement the singleton behavior with per-task locks, and the integration with ECS tasks to define timeouts.

But talk is cheap. On to the code!

Defining the Step Functions

AWS Step Functions provides serverless orchestration for modern applications.

A Step Function is a state machine: a set of steps, or states, defined in JSON. You can define those steps in the AWS console, whose visual editor is useful for prototyping, or, as we prefer, with Terraform, keeping our infrastructure as code.

Our goal here is to use Terraform to configure a single task with singleton behavior that times out after a configurable amount of time. After setting this up, you should be able to trigger the task from the Step Functions section in the AWS console.
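The snippets in this post read their settings from Terraform input variables. As a rough sketch of how those inputs might be declared: the variable names are the ones the snippets reference, while the types, descriptions, default, and validation rule are assumptions for illustration.

```hcl
# Hypothetical declarations for the var.* references used in this post.
# Only the names come from the snippets; everything else is an assumption.

variable "task_name" {
  type        = string
  description = "Name of the scheduled task; also used as the lock key."
}

variable "sfn_role" {
  type        = string
  description = "ARN of the IAM role assumed by the state machine."
}

variable "table_name" {
  type        = string
  description = "Name of the DynamoDB table that stores task locks."
}

variable "task_cluster" {
  type        = string
  description = "Name or ARN of the ECS cluster that runs the task."
}

variable "timeout_minutes" {
  type        = number
  description = "Minutes to wait for the ECS task before timing out."
  default     = 60 # assumed default, not taken from the post

  validation {
    condition     = var.timeout_minutes > 0
    error_message = "The timeout_minutes value must be greater than zero."
  }
}
```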

Using the aws_sfn_state_machine resource from the Terraform AWS provider, we define a step function that checks for a lock corresponding to the task and adds one if none exists. If that succeeds, it runs the task and removes the lock after the task finishes. Here’s the code for the step function itself; the “steps” are nested inside the States map as key/value pairs:

resource "aws_sfn_state_machine" "task_lock" {
	name     = var.task_name
	role_arn = var.sfn_role

	definition = jsonencode({
		Comment = "Task Lock"
		StartAt = "CheckLock"
		States = {
			# The CheckLock, RunTask, and RemoveLock steps defined below go here.
		}
	})
}

Now we’ll define the CheckLock step. This step adds an item corresponding to the task to DynamoDB, on the condition that one doesn’t already exist. If that succeeds, we move on to run the ECS task (the RunTask step). Here’s the step definition:

CheckLock = {
	Type     = "Task"
	Resource = "arn:aws:states:::dynamodb:putItem"
	Parameters = {
		TableName = var.table_name
		Item = {
			Task   = { S = var.task_name }
			Locked = { BOOL = true }
		}
		ConditionExpression = "attribute_not_exists(Task)"
	}
	Next = "RunTask"
}

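Here are the details: the Task type with the dynamodb:putItem resource tells Step Functions to call DynamoDB’s PutItem directly; Item is the lock record, keyed by the task name; and the ConditionExpression makes the write succeed only if no item with that Task key exists, so a second execution fails at this step while the lock is held.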
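The lock table itself is not shown in this post. A minimal sketch of what it could look like, assuming a table whose partition key is the Task attribute that CheckLock writes; the resource name task_locks and the on-demand billing mode are assumptions:

```hcl
# Hypothetical lock table. The partition key must match the attribute
# used in CheckLock's Item and ConditionExpression ("Task").
resource "aws_dynamodb_table" "task_locks" {
  name         = var.table_name
  billing_mode = "PAY_PER_REQUEST" # assumption: lock traffic is low and bursty
  hash_key     = "Task"

  # Only key attributes are declared here; Locked is stored as a
  # regular, schemaless attribute on each item.
  attribute {
    name = "Task"
    type = "S"
  }
}
```

Because the key is the task name, one table can hold the locks for every scheduled task.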

For the RunTask step we’ll run an ECS task and wait for it to finish. After the task finishes, we’ll go to the RemoveLock step:

RunTask = {
	Type     = "Task"
	Resource = "arn:aws:states:::ecs:runTask.sync"
	Parameters = {
		Cluster        = var.task_cluster
		TaskDefinition = aws_ecs_task_definition.default.arn
	}
	TimeoutSeconds = var.timeout_minutes * 60
	Next = "RemoveLock"
}

In the snippet above, the key parts are the runTask.sync resource, where the .sync suffix means the step function waits for the ECS task to complete, and TimeoutSeconds, after which the step function will terminate the ECS task if it hasn’t finished already.
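One consequence worth spelling out: when TimeoutSeconds elapses, the state fails with the States.Timeout error, and with the RunTask definition as written nothing releases the lock. A minimal sketch of one way to handle that, assuming you want any RunTask failure to still release the lock, is to add a Catch block that routes errors to RemoveLock. This is an illustration, not necessarily how our production definition looks.

```hcl
# Sketch of a RunTask entry for the States map, with an added Catch.
RunTask = {
  Type     = "Task"
  Resource = "arn:aws:states:::ecs:runTask.sync"
  Parameters = {
    Cluster        = var.task_cluster
    TaskDefinition = aws_ecs_task_definition.default.arn
  }
  TimeoutSeconds = var.timeout_minutes * 60

  # Assumption: on any caught error, including States.Timeout,
  # release the lock instead of leaving it behind.
  Catch = [
    {
      ErrorEquals = ["States.ALL"]
      Next        = "RemoveLock"
    }
  ]

  Next = "RemoveLock"
}
```

Note that with this shape a failed or timed-out task ends in RemoveLock, whose End = true reports the execution as successful. If you need failures to stay visible, route the Catch to a separate cleanup step that is followed by a Fail state.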

Finally, we remove the lock:

RemoveLock = {
	Type     = "Task"
	Resource = "arn:aws:states:::dynamodb:deleteItem"
	Parameters = {
		TableName = var.table_name
		Key = {
			Task = { S = var.task_name }
		}
	}
	End = true
}

Here, the only new parameter is End. It marks this as the final state: no steps remain, and the execution finishes successfully.

Next = “Steps”

In this post, we’ve worked on a simplified version of a task-locking Step Function. It is missing a couple of important sections that will allow tasks and operations to fail gracefully, such as the Catch and Retry sections. You will also need to define the permissions for the step function role and the task scheduling using CloudWatch. But those should be easy to figure out.
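As a starting point for the permissions and the scheduling, here is a hedged sketch. It assumes the state machine and task definition resources from this post (aws_sfn_state_machine.task_lock and aws_ecs_task_definition.default); the four variables declared at the top are hypothetical, and the wildcard resources should be narrowed for real use.

```hcl
# Hypothetical inputs, introduced only for this sketch.
variable "table_arn" { type = string }           # ARN of the lock table
variable "task_role_arns" { type = list(string) } # ECS task and execution roles
variable "schedule_expression" { type = string }  # e.g. "rate(1 hour)"
variable "events_role_arn" { type = string }      # role allowed states:StartExecution

# Policy document to attach to the role behind var.sfn_role.
data "aws_iam_policy_document" "sfn" {
  statement {
    actions   = ["dynamodb:PutItem", "dynamodb:DeleteItem"]
    resources = [var.table_arn]
  }

  statement {
    actions   = ["ecs:RunTask"]
    resources = [aws_ecs_task_definition.default.arn]
  }

  statement {
    actions   = ["ecs:StopTask", "ecs:DescribeTasks"]
    resources = ["*"]
  }

  statement {
    # ECS needs the state machine to pass the task's IAM roles.
    actions   = ["iam:PassRole"]
    resources = var.task_role_arns
  }

  statement {
    # The .sync integration tracks task completion through a managed
    # events rule; narrow this to that rule's ARN in real use.
    actions   = ["events:PutTargets", "events:PutRule", "events:DescribeRule"]
    resources = ["*"]
  }
}

# Schedule: a CloudWatch Events rule that starts the state machine.
resource "aws_cloudwatch_event_rule" "schedule" {
  name                = "${var.task_name}-schedule"
  schedule_expression = var.schedule_expression
}

resource "aws_cloudwatch_event_target" "start_execution" {
  rule     = aws_cloudwatch_event_rule.schedule.name
  arn      = aws_sfn_state_machine.task_lock.arn
  role_arn = var.events_role_arn
}
```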

An additional recommendation: encapsulate these configurations in a Terraform module that allows easy reuse and shared resources (such as the DynamoDB table).
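To illustrate that recommendation, this is roughly what calling such a module could look like. The module path, task names, ARNs, and cluster name are all placeholders, and a real module would also need inputs describing the ECS task definition (image, CPU, memory, and so on), which are left out here.

```hcl
# Hypothetical shared values; every scheduled task uses the same lock table.
locals {
  lock_table = "scheduled-task-locks"
  sfn_role   = "arn:aws:iam::123456789012:role/scheduled-task-sfn" # placeholder ARN
  cluster    = "scheduled-tasks"                                   # placeholder name
}

module "nightly_report" {
  source = "./modules/scheduled_task" # hypothetical module path

  task_name       = "nightly-report"
  sfn_role        = local.sfn_role
  table_name      = local.lock_table
  task_cluster    = local.cluster
  timeout_minutes = 30
}

module "hourly_cleanup" {
  source = "./modules/scheduled_task" # same module, different task

  task_name       = "hourly-cleanup"
  sfn_role        = local.sfn_role
  table_name      = local.lock_table
  task_cluster    = local.cluster
  timeout_minutes = 10
}
```

Each call gets its own state machine and lock key, while the table, role, and cluster are defined once.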

Conclusion

Step Functions are an effective tool to integrate different services from AWS without writing code. The one we’ve defined for managing our scheduled tasks has run without issue, and its operation has been simple. We’ve found that Step Functions are a service we will keep close by in our dev toolbox.