Azure
New in version 21.04.0.
Azure Blob Storage
Nextflow has built-in support for Azure Blob Storage. Files stored in an Azure blob container can be accessed transparently in your pipeline script like any other file in the local file system.
The Blob storage account name and key need to be provided in the Nextflow configuration file as shown below:
azure {
storage {
accountName = "<YOUR BLOB ACCOUNT NAME>"
accountKey = "<YOUR BLOB ACCOUNT KEY>"
}
}
Alternatively, the Shared Access Token can be specified with the sasToken option instead of accountKey.
Tip
When creating the Shared Access Token, make sure to allow the resource types Container and Object and allow the permissions: Read, Write, Delete, List, Add, Create.
Tip
The value of sasToken is the token without its leading ? character.
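As a minimal sketch of the SAS-based alternative, the storage scope could look like the following. The placeholder values are assumptions to be replaced with your own account name and token:

```groovy
azure {
    storage {
        accountName = "<YOUR BLOB ACCOUNT NAME>"
        // Use sasToken instead of accountKey; paste the token without the leading '?'
        sasToken = "<YOUR SAS TOKEN>"
    }
}
```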
Once the Blob Storage credentials are set, you can access the files in the blob container like local files by prepending the file path with az:// followed by the container name. For example, a blob container named my-data with a file named foo.txt can be specified in your Nextflow script as az://my-data/foo.txt.
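For illustration, a pipeline script might reference that blob as shown in the sketch below. It assumes the storage credentials above are configured and that the my-data container and foo.txt blob exist:

```groovy
workflow {
    // file() resolves the az:// URI using the configured Blob Storage credentials
    def foo = file('az://my-data/foo.txt')

    // The resulting object can be passed around like any local file
    Channel.of(foo).view()
}
```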
Azure Batch
Tip
This section describes how to manually set up and use Nextflow with Azure Batch. You may be interested in using Batch Forge in Seqera Platform, which automatically creates the required Azure infrastructure for you with minimal intervention.
Azure Batch is a managed computing service that allows the execution of containerised workloads in the Azure cloud infrastructure.
Nextflow provides built-in support for Azure Batch, allowing the seamless deployment of Nextflow pipelines in the cloud, in which tasks are offloaded as Batch jobs.
Read the Azure Batch executor section to learn more about the azurebatch executor in Nextflow.
Get started
Create a Batch account in the Azure portal. Take note of the account name and key.
Make sure to adjust your quotas to the pipeline’s needs. There are limits on certain resources associated with the Batch account. Many of these limits are default quotas applied by Azure at the subscription or account level. Quotas impact the number of Pools, CPUs and Jobs you can create at any given time.
Create a Storage account and, within that, an Azure Blob Container in the same location where the Batch account was created. Take note of the account name and key.
If you plan to use Azure Files, create an Azure File share within the same Storage account and upload your input data.
Associate the Storage account with the Azure Batch account.
Make sure every process in your pipeline specifies one or more Docker containers with the container directive.
Make sure all of your container images are published in a Docker registry that can be accessed by your Azure Batch environment, such as Docker Hub, Quay, or Azure Container Registry.
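As a rough sketch of the container requirement, each process declares its image with the container directive. The process name and the public ubuntu:22.04 image below are only illustrative choices, not part of the original guide:

```groovy
process SAY_HELLO {
    // Image pulled from Docker Hub; any registry reachable from Azure Batch works
    container 'ubuntu:22.04'

    output:
    stdout

    script:
    """
    echo "Hello from Azure Batch"
    """
}

workflow {
    SAY_HELLO().view()
}
```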
A minimal Nextflow configuration for Azure Batch looks like the following snippet:
process {
executor = 'azurebatch'
}
azure {
storage {
accountName = "<YOUR STORAGE ACCOUNT NAME>"
accountKey = "<YOUR STORAGE ACCOUNT KEY>"
}
batch {
location = '<YOUR LOCATION>'
accountName = '<YOUR BATCH ACCOUNT NAME>'
accountKey = '<YOUR BATCH ACCOUNT KEY>'
autoPoolMode = true
}
}
In the above example, replace the location placeholder with the name of your Azure region and the account placeholders with the values corresponding to your configuration.
Tip
The list of Azure regions can be found by executing the following Azure CLI command:
az account list-locations -o table
Finally, launch your pipeline with the above configuration:
nextflow run <PIPELINE NAME> -w az://YOUR-CONTAINER/work
Replace <PIPELINE NAME> with a pipeline name, e.g. nextflow-io/rnaseq-nf, and YOUR-CONTAINER with a blob container in the storage account defined in the above configuration.
See the Batch documentation for further details about the configuration for Azure Batch.
Pools configuration
When using the autoPoolMode option, Nextflow automatically creates a pool of compute nodes appropriate for your pipeline.
By default, the cpus and memory directives are used to find the smallest machine type that fits the requested resources in the Azure machine family specified by machineType. If memory is not specified, 1 GB of memory is allocated per CPU. When no options are specified, Nextflow only uses one compute node of the type Standard_D4_v3.
To specify multiple Azure machine families, use a comma-separated list with glob (*) values in the machineType directive. For example, the following will select any machine size from the D or E v5 machines with an additional data disk, denoted by the d suffix:
process.machineType = "Standard_D*d_v5,Standard_E*d_v5"
For example, the following process requests 16 CPUs, so when using autoPoolMode it will create a pool of the smallest matching machine that fits the request, Standard_E16d_v5:
process EXAMPLE_PROCESS {
machineType "Standard_E*d_v5"
cpus 16
memory 8.GB
script:
"""
echo "cpus: ${task.cpus}"
"""
}
Note that when creating tasks that use fewer than 4 CPUs, Nextflow will create a pool with machines that have 4 times the number of CPUs required in order to pack more tasks onto each machine. This means the pipeline spends less time waiting for machines to be created, start up and join the Azure Batch pool. Similarly, if a process requires fewer than 8 CPUs, Nextflow will use a machine with double the number of CPUs required. If you wish to override this behaviour, you can use a specific machineType directive, e.g. a machineType directive of Standard_E2d_v5 will always use a Standard_E2d_v5 machine.
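The override can also be applied from the configuration file. The sketch below assumes a process named EXAMPLE_PROCESS, as in the earlier example, and pins it to one exact machine size instead of a glob:

```groovy
process {
    withName: 'EXAMPLE_PROCESS' {
        cpus = 2
        // An exact size (no glob) disables the automatic over-sizing described above
        machineType = 'Standard_E2d_v5'
    }
}
```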
The pool is not removed when the pipeline terminates, unless the configuration setting deletePoolsOnCompletion = true is added in your Nextflow configuration file.
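A minimal sketch of that setting, combined with autoPoolMode, might look as follows. Account and location settings are omitted for brevity:

```groovy
azure {
    batch {
        autoPoolMode = true
        // Remove the pools created by Nextflow once the pipeline has finished
        deletePoolsOnCompletion = true
    }
}
```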
Pool-specific settings should be provided in the auto pool configuration scope. If you wish to use a single machine size for all processes, you can specify a fixed vmType for the auto pool:
azure {
batch {
pools {
auto {
vmType = 'Standard_D2_v2'
}
}
}
}
Warning
To avoid any extra charges in the Batch account, remember to clean up the Batch pools or use auto scaling.
Warning
Make sure your Batch account has enough resources to satisfy the pipeline’s requirements and the pool configuration.
Warning
Nextflow uses the same pool ID across pipeline executions if the pool features have not changed. Therefore, when using deletePoolsOnCompletion = true, make sure the pool is completely removed from the Azure Batch account before re-running the pipeline. The following message is returned when the pool is still shutting down:
Error executing process > '<process name> (1)'
Caused by:
Azure Batch pool '<pool name>' not in active state
Named pools
If you want more precise control over the compute node pools used in your pipeline, such as using a different pool depending on the task, you can use the queue directive in Nextflow to specify the ID of an Azure Batch compute pool that should be used to execute that process.
The pool is expected to be already available in the Batch environment, unless the setting allowPoolCreation = true is provided in the azure.batch config scope in the pipeline configuration file. In the latter case, Nextflow will create the pools on-demand.
The configuration details for each pool can be specified using a snippet as shown below:
azure {
batch {
pools {
foo {
vmType = 'Standard_D2_v2'
vmCount = 10
}
bar {
vmType = 'Standard_E2_v3'
vmCount = 5
}
}
}
}
The above example defines the configuration for two node pools. The first will provision 10 compute nodes of type Standard_D2_v2, the second 5 nodes of type Standard_E2_v3. See the Azure configuration section for the complete list of available configuration options.
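To route processes to the foo and bar pools defined above, the queue directive can be set through process selectors. In this sketch the labels small and large are hypothetical names that your processes would need to declare:

```groovy
process {
    executor = 'azurebatch'

    // Hypothetical labels: tasks labelled 'small' go to pool foo, 'large' to pool bar
    withLabel: small {
        queue = 'foo'
    }
    withLabel: large {
        queue = 'bar'
    }
}

azure {
    batch {
        // Let Nextflow create foo and bar on demand if they do not exist yet
        allowPoolCreation = true
    }
}
```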
Warning
The pool name can only contain alphanumeric, hyphen and underscore characters.
Warning
If the pool name includes a hyphen, make sure to wrap it with single quotes. For example:
azure {
batch {
pools {
'foo-2' {
...
}
}
}
}
Requirements on pre-existing named pools
When Nextflow is configured to use a pool already available in the Batch account, the target pool must satisfy the following requirements:
The pool must be declared as dockerCompatible (Container Type property).
The task slots per node must match the number of cores for the selected VM. Otherwise, Nextflow will return an error like “Azure Batch pool ‘ID’ slots per node does not match the VM num cores (slots: N, cores: Y)”.
Unless you are using Fusion, all tasks must have AzCopy available in the path. If azure.batch.copyToolInstallMode = 'node', this requires every node to have the azcopy binary located at $AZ_BATCH_NODE_SHARED_DIR/bin/.
Pool autoscaling
Azure Batch can automatically scale pools based on parameters that you define, saving you time and money. With automatic scaling, Batch dynamically adds nodes to a pool as task demands increase, and removes compute nodes as task demands decrease.
To enable this feature for pools created by Nextflow, add the option autoScale = true to the corresponding pool configuration scope. For example, when using autoPoolMode, the setting looks like:
azure {
batch {
pools {
auto {
autoScale = true
vmType = 'Standard_D2_v2'
vmCount = 5
maxVmCount = 50
}
}
}
}
Nextflow uses the formula shown below to determine the number of VMs to be provisioned in the pool:
// Get pool lifetime since creation.
lifespan = time() - time("{{poolCreationTime}}");
interval = TimeInterval_Minute * {{scaleInterval}};
// Compute the target nodes based on pending tasks.
// $PendingTasks == The sum of $ActiveTasks and $RunningTasks
$samples = $PendingTasks.GetSamplePercent(interval);
$tasks = $samples < 70 ? max(0, $PendingTasks.GetSample(1)) : max( $PendingTasks.GetSample(1), avg($PendingTasks.GetSample(interval)));
$targetVMs = $tasks > 0 ? $tasks : max(0, $TargetDedicatedNodes/2);
targetPoolSize = max(0, min($targetVMs, {{maxVmCount}}));
// For the first interval deploy the initial vmCount nodes, for other intervals scale up/down as per tasks.
$TargetDedicatedNodes = lifespan < interval ? {{vmCount}} : targetPoolSize;
$NodeDeallocationOption = taskcompletion;
The above formula initialises a pool with the number of VMs specified by the vmCount option, and scales up the pool on-demand, based on the number of pending tasks, up to maxVmCount nodes. If no jobs are submitted for execution, it scales down to zero nodes automatically.
If you need a different strategy, you can provide your own formula using the scaleFormula option. See the Azure Batch documentation for details.
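As a hedged sketch of a custom strategy, the configuration below supplies a simplified formula that sets the node count to the latest pending task count, capped at 10. The formula is an illustrative assumption built from the variables and functions used in the default formula, not a recommended policy. Check it against the Azure Batch autoscale documentation before use.

```groovy
azure {
    batch {
        pools {
            auto {
                autoScale = true
                vmType = 'Standard_D2_v2'
                // Triple single quotes keep the $ characters literal (no Groovy interpolation)
                scaleFormula = '''
                    $tasks = max(0, $PendingTasks.GetSample(1));
                    $TargetDedicatedNodes = min($tasks, 10);
                    $NodeDeallocationOption = taskcompletion;
                '''
            }
        }
    }
}
```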
Pool nodes
When Nextflow creates a pool of compute nodes, it selects:
the virtual machine image reference to be installed on the node
the Batch node agent SKU, a program that runs on each node and provides an interface between the node and the Batch service
Together, these settings determine the Operating System and version installed on each node.
By default, Nextflow creates pool nodes based on Ubuntu 20.04, but this behavior can be customised in the pool configuration. Below are configurations for image reference/SKU combinations to select two popular systems.
Ubuntu 20.04 (default):
azure.batch.pools.<name>.sku = "batch.node.ubuntu 20.04"
azure.batch.pools.<name>.offer = "ubuntu-server-container"
azure.batch.pools.<name>.publisher = "microsoft-azure-batch"
CentOS 8:
azure.batch.pools.<name>.sku = "batch.node.centos 8"
azure.batch.pools.<name>.offer = "centos-container"
azure.batch.pools.<name>.publisher = "microsoft-azure-batch"
In the above snippet, replace <name> with the name of your Azure node pool.
See the Azure configuration section and the Azure Batch nodes documentation for more details.
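The same settings can be written in the nested scope form used elsewhere on this page. The sketch below assumes a hypothetical pool named mypool and reuses the Ubuntu 20.04 values listed above:

```groovy
azure {
    batch {
        pools {
            // 'mypool' is a placeholder pool name
            mypool {
                vmType = 'Standard_D2_v2'
                sku = "batch.node.ubuntu 20.04"
                offer = "ubuntu-server-container"
                publisher = "microsoft-azure-batch"
            }
        }
    }
}
```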
Private container registry
New in version 21.05.0-edge.
A private container registry for Docker images can be specified as follows:
azure {
registry {
server = '<YOUR REGISTRY SERVER>' // e.g.: docker.io, quay.io, <ACCOUNT>.azurecr.io, etc.
userName = '<YOUR REGISTRY USER NAME>'
password = '<YOUR REGISTRY PASSWORD>'
}
}
The private registry is an addition, not a replacement, to the existing configuration. Public images from other registries will still be pulled as normal, if they are requested.
Note
When using containers hosted in a private registry, the registry name must also be provided in the container name specified via the container directive using the format: [server]/[your-organization]/[your-image]:[tag]. Read more about fully qualified image names in the Docker documentation.
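For example, a process pulling from a private Azure Container Registry might declare its image as in this sketch. The process name, organization, image and tag are placeholders:

```groovy
process ALIGN {
    // Fully qualified name: [server]/[your-organization]/[your-image]:[tag]
    container '<ACCOUNT>.azurecr.io/my-organization/my-image:1.0'

    script:
    """
    echo "running inside a privately hosted image"
    """
}
```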
Virtual Network
New in version 23.03.0-edge.
Sometimes it might be useful to create a pool in an existing Virtual Network. To do so, the virtualNetwork option can be added to the pool settings as follows:
azure {
batch {
pools {
auto {
autoScale = true
vmType = 'Standard_D2_v2'
vmCount = 5
virtualNetwork = '<YOUR SUBNET ID>'
}
}
}
}
The value of the setting must be the identifier of a subnet available in the virtual network to join. A valid subnet ID has the following form:
/subscriptions/<YOUR SUBSCRIPTION ID>/resourceGroups/<YOUR RESOURCE GROUP NAME>/providers/Microsoft.Network/virtualNetworks/<YOUR VIRTUAL NETWORK NAME>/subnets/<YOUR SUBNET NAME>
Warning
Batch authentication with Shared Keys does not allow linking external resources (like Virtual Networks) to the pool. Therefore, Active Directory authentication must be used in conjunction with the virtualNetwork setting.
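Putting the two requirements together, a configuration might combine the Service Principal credentials described in the Service Principals section with the virtualNetwork setting. This is a sketch under those assumptions, with all values as placeholders:

```groovy
azure {
    // Active Directory credentials replace the Batch and Storage account keys
    activeDirectory {
        servicePrincipalId = '<YOUR SERVICE PRINCIPAL CLIENT ID>'
        servicePrincipalSecret = '<YOUR SERVICE PRINCIPAL CLIENT SECRET>'
        tenantId = '<YOUR TENANT ID>'
    }
    storage {
        accountName = '<YOUR STORAGE ACCOUNT NAME>'
    }
    batch {
        accountName = '<YOUR BATCH ACCOUNT NAME>'
        location = '<YOUR LOCATION>'
        autoPoolMode = true
        pools {
            auto {
                vmType = 'Standard_D2_v2'
                virtualNetwork = '<YOUR SUBNET ID>'
            }
        }
    }
}
```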
Hybrid workloads
Nextflow allows the use of multiple executors in the same workflow application. This feature enables the deployment of hybrid workloads in which some jobs are executed in the local computer or local computing cluster and some jobs are offloaded to Azure Batch.
To enable this feature, use one or more Process selectors in your Nextflow configuration to apply the Azure Batch configuration to the subset of processes that you want to offload. For example:
process {
withLabel: bigTask {
executor = 'azurebatch'
queue = 'my-batch-pool'
container = 'my/image:tag'
}
}
azure {
storage {
accountName = '<YOUR STORAGE ACCOUNT NAME>'
accountKey = '<YOUR STORAGE ACCOUNT KEY>'
}
batch {
location = '<YOUR LOCATION>'
accountName = '<YOUR BATCH ACCOUNT NAME>'
accountKey = '<YOUR BATCH ACCOUNT KEY>'
}
}
With the above configuration, processes with the bigTask label will run on Azure Batch, while the remaining processes will run on the local computer.
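On the pipeline side, a process opts in to Azure Batch simply by carrying the bigTask label. The following sketch uses two hypothetical processes to show the split:

```groovy
process BIG_STEP {
    // Matched by 'withLabel: bigTask', so this task is offloaded to Azure Batch
    label 'bigTask'

    output:
    path 'big.txt'

    script:
    """
    echo "computed on Azure Batch" > big.txt
    """
}

process SMALL_STEP {
    // No label: runs with the default (local) executor
    input:
    path result

    output:
    stdout

    script:
    """
    cat ${result}
    """
}

workflow {
    BIG_STEP()
    SMALL_STEP(BIG_STEP.out)
    SMALL_STEP.out.view()
}
```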
Then launch the pipeline with the -bucket-dir option to specify an Azure Blob Storage path for the jobs computed with Azure Batch, and optionally, use the -work-dir option to specify the local storage for the jobs computed locally:
nextflow run <script or project name> -bucket-dir az://my-container/some/path
Warning
The Azure Blob Storage path needs to contain at least one sub-directory (e.g. az://my-container/work rather than az://my-container).
Note
Nextflow will automatically manage the transfer of input and output files between the local and cloud environments when using hybrid workloads.
Tip
When using Fusion, the -bucket-dir option is not required. Fusion implements a distributed virtual file system that allows seamless access to Azure Blob Storage using a standard POSIX interface, enabling direct mounting of remote blob storage as if it were a local file system. This simplifies and speeds up most operations, bridging the gap between cloud-native storage and data analysis workflows.
Microsoft Entra
Using Microsoft Entra for role-based access control is more secure than using access keys and should be used wherever possible. You can authenticate to Microsoft Entra using a Managed Identity when running on resources within the Azure environment, or by authenticating as an Azure Service Principal when running on external resources.
Required role assignments
To access Azure resources, you must have the relevant role assignments (permissions). To access Azure Blob Storage data (e.g. to retrieve data from a private Azure Storage account), you must have the following permissions:
Storage Blob Data Reader
Storage Blob Data Contributor
To run Nextflow on Azure Batch you must have the following permissions to create and destroy resources in the Azure Batch account:
Batch Contributor
To assign the necessary roles to a Managed Identity or Service Principal, refer to the official Azure documentation.
Managed identities
New in version 24.05.0-edge.
An Azure Managed Identity can be used to authenticate with Azure resources without the use of access keys or credentials. When using a Managed Identity, an Azure resource is able to authenticate because of what it is, instead of using access keys. For example, if Nextflow is running on an Azure Virtual Machine with a managed identity enabled that has the relevant permissions, it can run a Nextflow workflow on Azure Batch and use data from a private storage account with no additional credentials supplied. This is more secure and reliable than using access keys or a Service Principal, which can be compromised. The limitation is that managed identities only work from within the Azure account, i.e. you cannot use a managed identity from an external service.
An Azure Managed Identity comes in two forms, system-assigned and user-assigned. Both are functionally equivalent but are set up in slightly different ways.
System Assigned Managed Identity
When running on an Azure Service such as an Azure Virtual Machine, you can enable system-assigned Managed Identity for that machine. This grants the machine an identity in Azure from which it receives the permissions and role assignments.
First, enable the system-assigned Managed Identity. Once you have done this, the machine has an identity in Microsoft Entra.
Next, add the relevant role assignments to this Managed Identity. On the Azure Portal page for the virtual machine, select ‘Identity’ and then click ‘Azure Role Assignments’ to modify the existing role assignments. Note that you must have the Microsoft.Authorization/roleAssignments/write permission to perform this action.
Make sure the identity has the following role assignments:
Storage Blob Data Reader
Storage Blob Data Contributor
Batch Contributor
Save the changes.
Use the following Nextflow configuration to enable Nextflow to adopt the system-assigned identity while running on this machine:
process.executor = 'azurebatch'
azure {
managedIdentity {
system = true
}
storage {
accountName = '<YOUR STORAGE ACCOUNT NAME>'
}
batch {
accountName = '<YOUR BATCH ACCOUNT NAME>'
location = '<YOUR BATCH ACCOUNT LOCATION>'
}
}
User Assigned Managed Identity
A system-assigned managed identity is essentially ‘anonymous’ and is tied to a single resource. By comparison, a user-assigned managed identity is created by the user and can be assigned to multiple resources. Furthermore, the lifecycle of a user-assigned managed identity is not tied to the resource. See the Azure Documentation for further details.
We can add a user-assigned identity to a resource in a similar manner to a system-assigned identity, but we must create it first.
Create a Managed Identity as per the Azure Documentation.
Assign the relevant role permissions to the Managed Identity as before.
Assign the Managed Identity to the Azure Resource. This can be done at creation time or afterwards. As an example, see the documentation on how to do this for a virtual machine.
Retrieve the client ID for the Managed Identity. On the Azure Portal, this can be found on the ‘Overview’ or ‘Properties’ page as ‘client ID’.
Use the following configuration to enable Nextflow to adopt the user-assigned identity while running on the Azure Resource:
process.executor = 'azurebatch'
azure {
managedIdentity {
clientId = '<USER ASSIGNED MANAGED IDENTITY CLIENT ID>'
}
storage {
accountName = '<YOUR STORAGE ACCOUNT NAME>'
}
batch {
accountName = '<YOUR BATCH ACCOUNT NAME>'
location = '<YOUR BATCH ACCOUNT LOCATION>'
}
}
Service Principals
New in version 22.11.0-edge.
Service Principal credentials can be used to access Azure Batch and Storage accounts. Similar to a Managed Identity, a Service Principal is an account which can have specific permissions and role-based access. Unlike a Managed Identity, however, you must use a secret key to authenticate as a Service Principal. In return, you can authenticate as a Service Principal when operating outside of the Azure account.
The credentials for a Service Principal can be specified as follows:
azure {
activeDirectory {
servicePrincipalId = '<YOUR SERVICE PRINCIPAL CLIENT ID>'
servicePrincipalSecret = '<YOUR SERVICE PRINCIPAL CLIENT SECRET>'
tenantId = '<YOUR TENANT ID>'
}
storage {
accountName = '<YOUR STORAGE ACCOUNT NAME>'
}
batch {
accountName = '<YOUR BATCH ACCOUNT NAME>'
location = '<YOUR BATCH ACCOUNT LOCATION>'
}
}
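To keep the secret out of the configuration file, one option is to read the credentials from the environment. The sketch below assumes that your Nextflow version allows calling System.getenv in the configuration file. The variable names AZURE_CLIENT_ID, AZURE_CLIENT_SECRET and AZURE_TENANT_ID are illustrative choices that you would export yourself before launching the pipeline:

```groovy
azure {
    activeDirectory {
        // Values are resolved from the launching shell's environment (assumed variable names)
        servicePrincipalId = System.getenv('AZURE_CLIENT_ID')
        servicePrincipalSecret = System.getenv('AZURE_CLIENT_SECRET')
        tenantId = System.getenv('AZURE_TENANT_ID')
    }
    storage {
        accountName = '<YOUR STORAGE ACCOUNT NAME>'
    }
    batch {
        accountName = '<YOUR BATCH ACCOUNT NAME>'
        location = '<YOUR BATCH ACCOUNT LOCATION>'
    }
}
```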
Advanced configuration
Read the Azure configuration section to learn more about advanced configuration options.