Amazon Cloud¶
AWS security credentials¶
Nextflow uses AWS security credentials to make programmatic calls to AWS services.
You can provide your AWS access keys using the standard AWS variables shown below:
AWS_ACCESS_KEY_ID
AWS_SECRET_ACCESS_KEY
AWS_DEFAULT_REGION
If AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY are not defined in the environment, Nextflow will attempt to retrieve credentials from your ~/.aws/credentials or ~/.aws/config files. The default profile can be overridden via the environment variable AWS_PROFILE (or AWS_DEFAULT_PROFILE).
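For example, the variables can be exported in the shell session used to launch Nextflow; the values below are placeholders:
export AWS_ACCESS_KEY_ID="<your access key id>"
export AWS_SECRET_ACCESS_KEY="<your secret access key>"
export AWS_DEFAULT_REGION="us-east-1"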
Alternatively, AWS credentials can be specified in the Nextflow configuration file.
See AWS configuration for more details.
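For example, a minimal configuration sketch along these lines (the values are placeholders; keeping secrets out of version-controlled files is advisable):
aws {
    accessKey = '<your access key id>'
    secretKey = '<your secret access key>'
    region    = 'us-east-1'
}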
Note
Credentials can also be provided by using an IAM Instance Role. The benefit of this approach is that it spares you from managing/distributing AWS keys explicitly. Read the IAM Roles documentation and this blog post for more details.
AWS IAM policies¶
IAM policies are the mechanism used by AWS to define permissions for IAM identities. In order to access certain AWS services, the proper policies must be attached to the identity associated with the AWS credentials.
The minimal permission policies to be attached to the AWS account used by Nextflow are:
To interface AWS Batch:
"batch:DescribeJobQueues" "batch:CancelJob" "batch:SubmitJob" "batch:ListJobs" "batch:DescribeComputeEnvironments" "batch:TerminateJob" "batch:DescribeJobs" "batch:RegisterJobDefinition" "batch:DescribeJobDefinitions"
To be able to see the EC2 instances:
"ecs:DescribeTasks" "ec2:DescribeInstances" "ec2:DescribeInstanceTypes" "ec2:DescribeInstanceAttribute" "ecs:DescribeContainerInstances" "ec2:DescribeInstanceStatus"
To pull container images stored in the ECR repositories:
"ecr:GetAuthorizationToken" "ecr:BatchCheckLayerAvailability" "ecr:GetDownloadUrlForLayer" "ecr:GetRepositoryPolicy" "ecr:DescribeRepositories" "ecr:ListImages" "ecr:DescribeImages" "ecr:BatchGetImage" "ecr:GetLifecyclePolicy" "ecr:GetLifecyclePolicyPreview" "ecr:ListTagsForResource" "ecr:DescribeImageScanFindings"
S3 policies¶
Nextflow also requires policies to access S3 buckets in order to:
use the work directory
pull input data
publish results
Depending on the pipeline configuration, the above actions may all involve a single bucket or, more likely, be spread across multiple buckets. Once the list of buckets used by the pipeline has been identified, there are two alternative ways to give Nextflow access to these buckets:
grant access to all buckets by attaching the policy “s3:*” to the IAM identity. This works only if the buckets do not set their own access policies (see point 2);
for more fine-grained control, assign the following policy to each bucket (replace the placeholders with the actual values):
{ "Version": "2012-10-17", "Id": "<my policy id>", "Statement": [ { "Sid": "<my statement id>", "Effect": "Allow", "Principal": { "AWS": "<ARN of the nextflow identity>" }, "Action": [ "s3:GetObject", "s3:PutObject", "s3:DeleteObject" ], "Resource": "arn:aws:s3:::<bucket name>/*" }, { "Sid": "AllowSSLRequestsOnly", "Effect": "Deny", "Principal": "*", "Action": "s3:*", "Resource": [ "arn:aws:s3:::<bucket name>", "arn:aws:s3:::<bucket name>/*" ], "Condition": { "Bool": { "aws:SecureTransport": "false" } } } ] }
See the bucket policy documentation for additional details.
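As an example, once the JSON above has been saved to a local file (e.g. a hypothetical bucket-policy.json), the policy could be applied with the AWS CLI as sketched below; the bucket name is a placeholder:
aws s3api put-bucket-policy --bucket <bucket name> --policy file://bucket-policy.json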
AWS Batch¶
Note
Requires Nextflow version 0.26.0 or later.
AWS Batch is a managed computing service that allows the execution of containerised workloads in the Amazon cloud infrastructure. It dynamically provisions the optimal quantity and type of compute resources (e.g., CPU or memory optimized compute resources) based on the volume and specific resource requirements of the jobs submitted.
Nextflow provides built-in support for AWS Batch, which allows the seamless deployment of a Nextflow pipeline in the cloud, offloading the process executions as Batch jobs.
AWS CLI¶
Nextflow requires access to the AWS command line tool (aws) from the container in which the job runs, in order to stage the required input files and to copy back the resulting output files to S3 storage.
The aws tool can be made available in the container in two ways:
1 - installed in the Docker image(s) used during the pipeline execution
2 - installed in a custom AMI (Amazon Machine Image) to use in place of the default AMI when configuring AWS Batch (see next section).
The latter approach is preferred because it allows the use of existing Docker images without the need to add the AWS CLI tool to them.
See the sections below to learn how to create a custom AMI and install the AWS CLI tool in it.
Get started¶
1 - In the AWS Console, create a Compute Environment (CE) in your AWS Batch service:
if you are using a custom AMI (see following sections), the AMI ID must be specified in the CE configuration
make sure to select an AMI (either custom or existing) with Docker installed (see following sections)
make sure the policy AmazonS3FullAccess (granting access to S3 buckets) is attached to the instance role configured for the CE
if you plan to use Docker images from the Amazon ECR container registry, make sure the AmazonEC2ContainerServiceforEC2Role policy is also attached to the instance role
2 - In the AWS Console, create (at least) one Job Queue and bind it to the Compute Environment.
3 - In the AWS Console, create an S3 bucket for the bucket-dir (see below) and others for the input data and results, if/as needed (an example command is shown after this list).
4 - Make sure your pipeline processes specify one or more Docker containers by using the container directive.
5 - Container images must be published in a Docker registry that can be reached by AWS Batch, such as Docker Hub, Quay or the Amazon ECR container registry.
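For step 3, as an alternative to the AWS Console, the buckets could also be created with the AWS CLI, assuming it is configured locally; the bucket name and region below are placeholders:
aws s3 mb s3://my-nextflow-bucket --region us-east-1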
Configuration¶
When configuring your pipeline:
import the nf-amazon plugin
specify the AWS Batch executor
specify one or more AWS Batch queues for the execution by using the queue directive.
An example nextflow.config file is shown below:
plugins {
id 'nf-amazon'
}
process {
executor = 'awsbatch'
queue = 'my-batch-queue'
container = 'quay.io/biocontainers/salmon'
}
aws {
batch {
// NOTE: this setting is only required if the AWS CLI tool is installed in a custom AMI
cliPath = '/home/ec2-user/miniconda/bin/aws'
}
region = 'us-east-1'
}
Different queues bound to the same or different Compute environments can be configured according to each process’ requirements.
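For example, a configuration sketch assigning different queues to different processes via process selectors; the process names and queue names used here are hypothetical:
process {
    executor = 'awsbatch'
    container = 'quay.io/biocontainers/salmon'

    withName: bigMemoryTask {
        queue = 'highmem-queue'
    }
    withName: shortTask {
        queue = 'spot-queue'
    }
}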
Custom AMI¶
There are several reasons why you might need to create your own AMI (Amazon Machine Image) to use in your Compute environments. Typically:
you do not want to modify your existing Docker images and prefer to install the CLI tool on the hosting environment
the existing AMI (selected from the marketplace) does not have Docker installed
you need to attach larger storage to your EC2 instance (the default ECS instance AMI has only a 30 GB storage volume, which may not be enough for most data analysis pipelines)
you need to install additional software, not available in the Docker image used to execute the job
Create your custom AMI¶
In the EC2 Dashboard, click the Launch Instance button, then choose AWS Marketplace in the left pane and enter ECS in the search box. In the result list, select Amazon ECS-Optimized Amazon Linux 2 AMI, then continue as usual to configure and launch the instance.
Note
The selected instance has a bootstrap volume of 8 GB and a second 30 GB EBS volume for computation, which is hardly enough for real-world genomic workloads. Make sure to specify an amount of storage for the second volume that is large enough for the needs of your pipeline execution.
When the instance is running, SSH into it (or connect with the Session Manager service) and install the AWS CLI tool or any other tool that may be required (see next sections).
Once done, create a new AMI by using the Create Image option in the EC2 Dashboard or the AWS command line tool.
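For example, using the AWS command line tool, the image could be created with a command along these lines; the instance ID and image name are placeholders:
aws ec2 create-image --instance-id <instance id> --name "my-nextflow-ami" --description "ECS-optimized AMI with AWS CLI"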
The new AMI ID needs to be specified when creating the Batch Compute Environment.
Warning
Any installation must be completed on the EC2 instance BEFORE creating the AMI.
AWS CLI installation¶
Warning
The AWS CLI tool must be installed in your custom AMI by using a self-contained package manager such as Conda.
The reason is that when the AWS CLI tool executes using Conda, it will use the version of Python supplied by Conda.
If you don't use Conda and install the AWS CLI using something like pip, the aws command will attempt to run using the version of Python found in the running container, which won't be able to find the necessary dependencies.
The following snippet shows how to install AWS CLI with Miniconda in the home folder:
cd $HOME
sudo yum install -y bzip2 wget
wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh -b -f -p $HOME/miniconda
$HOME/miniconda/bin/conda install -c conda-forge -y awscli
rm Miniconda3-latest-Linux-x86_64.sh
When complete, verify that the AWS CLI package works correctly:
$ ./miniconda/bin/aws --version
aws-cli/1.19.79 Python/3.8.5 Linux/4.14.231-173.361.amzn2.x86_64 botocore/1.20.79
Note
The aws tool will be placed in a directory named bin in the main installation folder.
Modifying this directory structure after the installation will cause the tool to not work properly.
To configure Nextflow to use this installation, specify the cliPath parameter in the AWS Batch configuration as shown below:
aws.batch.cliPath = '/home/ec2-user/miniconda/bin/aws'
Replace the path above with the one matching the location where the aws tool is installed in your AMI.
Note
When using a version of Nextflow prior to 19.07.x, the config setting executor.awscli should be used instead of aws.batch.cliPath.
Docker installation¶
Docker is required by Nextflow to execute tasks on AWS Batch. The Amazon ECS-Optimized Amazon Linux 2 AMI has Docker installed; however, if you create your AMI starting from a different AMI that does not have Docker installed, you need to install it manually.
The following snippet shows how to install Docker on an Amazon EC2 instance:
sudo yum update -y
sudo amazon-linux-extras install docker
sudo yum install docker
sudo service docker start
Then, add the ec2-user to the docker group so you can execute Docker commands without using sudo:
sudo usermod -a -G docker ec2-user
You may have to reboot your instance to provide permissions for the ec2-user to access the Docker daemon. This has to be done BEFORE creating the AMI from the current EC2 instance.
Amazon ECS container agent installation¶
The ECS container agent is a component of Amazon Elastic Container Service (Amazon ECS) and is responsible for managing containers on behalf of Amazon ECS. AWS Batch uses Amazon ECS to execute containerized jobs and therefore requires the agent to be installed on compute resources within your Compute environments.
The ECS container agent is included in the Amazon ECS-Optimized Amazon Linux 2 AMI, but if you select a different AMI you can also install it on any EC2 instance that supports the Amazon ECS specification.
To install the agent, follow these steps:
sudo amazon-linux-extras disable docker
sudo amazon-linux-extras install -y ecs
sudo systemctl enable --now ecs
To test the installation:
curl -s http://localhost:51678/v1/metadata | python -mjson.tool
Note
The AmazonEC2ContainerServiceforEC2Role policy must be attached to the instance role in order to be able to connect the EC2 instances created by the Compute Environment to the ECS cluster.
Jobs & Execution¶
Custom job definition¶
Nextflow automatically creates the Batch Job definitions needed to execute your pipeline processes, so it is not required to define them before running your workflow.
However, you may still need to specify a custom Job Definition to fine-tune the configuration settings of a specific job, e.g. to define custom mount paths or other special Batch Job settings.
To do that, first create a Job Definition in the AWS Console (or by other means) and note its name. You can then associate a process execution with this Job Definition by using the container directive and specifying, in place of the container image name, the Job Definition name prefixed by the job-definition:// string, as shown below:
process.container = 'job-definition://your-job-definition-name'
Pipeline execution¶
The pipeline can be launched either on a local computer or on an EC2 instance. The latter is suggested for heavy or long-running workloads.
Pipeline input data can be stored either locally or in an S3 bucket.
The pipeline execution must specify an S3 bucket where the jobs' intermediate results are stored, using the -bucket-dir command line option. For example:
nextflow run my-pipeline -bucket-dir s3://my-bucket/some/path
Warning
The bucket path should include at least a top-level directory name, e.g. use s3://my-bucket/work not just s3://my-bucket.
Hybrid workloads¶
Nextflow allows the use of multiple executors in the same workflow application. This feature enables the deployment of hybrid workloads, in which some jobs are executed on the local computer or local computing cluster and some jobs are offloaded to the AWS Batch service.
To enable this feature use one or more Process selectors in your Nextflow configuration file to apply the AWS Batch configuration only to a subset of processes in your workflow. For example:
aws {
region = 'eu-west-1'
batch {
cliPath = '/home/ec2-user/miniconda/bin/aws'
}
}
process {
withLabel: bigTask {
executor = 'awsbatch'
queue = 'my-batch-queue'
container = 'my/image:tag'
}
}
The above configuration snippet will deploy the execution with AWS Batch only for processes annotated with the bigTask label; the remaining processes will run on the local computer.
Volume mounts¶
Custom container volume mounts can be provided as shown below:
aws {
region = 'eu-west-1'
batch {
volumes = '/tmp'
}
}
Multiple volumes can be specified using comma-separated paths. The usual Docker volume mount syntax can be used to specify complex volumes for which the container path is different from the host path, or to specify the read-only option. For example:
aws {
region = 'eu-west-1'
batch {
volumes = ['/tmp', '/host/path:/mnt/path:ro']
}
}
The above snippet defines two volume mounts for the jobs executed in your pipeline. The first mounts the host path /tmp to the same path in the container, using read-write access mode. The second mounts the path /host/path in the host environment to /mnt/path in the container, using read-only access mode.
Note
This feature requires Nextflow version 19.07.x or later.
Troubleshooting¶
Problem: The Pipeline execution terminates with an AWS error message similar to the one shown below:
JobQueue <your queue> not found
Make sure you have defined an AWS region in the Nextflow configuration file and that it matches the region in which your Batch environment has been created.
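For example, assuming your Batch environment was created in the eu-west-1 region:
aws.region = 'eu-west-1'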
Problem: A process execution fails reporting the following error message:
Process <your task> terminated for an unknown reason -- Likely it has been terminated by the external system
This may happen when Batch is unable to execute the process script. A common cause of this problem is that the Docker container image you have specified uses a non-standard entrypoint which does not allow the execution of the Bash launcher script required by Nextflow to run the job.
This may also happen if the AWS CLI doesn’t run correctly.
Other places to check for error information:
The .nextflow.log file.
The Job execution log in the AWS Batch dashboard.
The CloudWatch logs found in the /aws/batch/job log group.
Problem: A process execution is stalled in the RUNNABLE status and the pipeline output is similar to the one below:
executor > awsbatch (1)
process > <your process> (1) [ 0%] 0 of ....
It may happen that the pipeline execution hangs indefinitely because one of the jobs is held in the queue and never gets executed. In the AWS Console, the queue reports the job as RUNNABLE but it never moves from there.
There are multiple reasons why this can happen. They are mainly related to the Compute Environment workload/configuration, the docker service or container configuration, network status, etc.
This AWS page provides several resolutions and tips to investigate and work around the issue.
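As a starting point for the investigation, the stalled job and its Compute Environment can also be inspected with the AWS CLI, for example (the identifiers below are placeholders):
aws batch describe-jobs --jobs <job id>
aws batch describe-compute-environments --compute-environments <compute environment name>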
Advanced configuration¶
Read the AWS Batch configuration section to learn more about advanced Batch configuration options.