IMPORTANT: You are viewing a beta version of the official module to install Weights & Biases. This new version is incompatible with earlier versions, and it is not currently meant for production use. Please contact your Customer Success Manager for details before using.
This is a Terraform module for provisioning a Weights & Biases Cluster on AWS. Weights & Biases Local is our self-hosted distribution of wandb.ai. It offers enterprises a private instance of the Weights & Biases application, with no resource limits and with additional enterprise-grade architectural features like audit logging and SAML single sign-on.
This module is intended to run in an AWS account with minimal preparation; however, it does have the following pre-requisites:
- AWS Identity & Access Management (IAM)
- AWS Key Management Service (KMS)
- Amazon Aurora MySQL (RDS)
- Amazon VPC
- Amazon S3
- Amazon Route53
- AWS Certificate Manager (ACM)
- Elastic Load Balancing (ALB)
- AWS Secrets Manager
If you are managing DNS via AWS Route53 the hosted zone entry is created automatically as part of your domain management.
If you're managing DNS outside of Route53, you will need to:
- Create a Route53 zone named {subdomain}.{domain} (e.g. test.wandb.ai)
- Create an NS record in your parent system and point it to the newly created Route53 zone
- Enable the external_dns option in this module
You can learn more about creating a hosted zone for a subdomain, which you will need to do for the subdomain you are planning to use for your Weights & Biases installation. To create this hosted zone with Terraform, use the aws_route53_zone resource.
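As a minimal sketch of the aws_route53_zone approach, the following creates a hosted zone for the subdomain and exposes its name servers so they can be copied into the NS record of the parent DNS system. The resource label wandb, the output name, and the zone name test.wandb.ai are illustrative assumptions, not values required by this module.

```hcl
# Hosted zone for the subdomain that will serve Weights & Biases.
# "test.wandb.ai" is a placeholder; substitute your own {subdomain}.{domain}.
resource "aws_route53_zone" "wandb" {
  name = "test.wandb.ai"
}

# Name servers to copy into the NS record of the parent DNS system.
output "wandb_zone_name_servers" {
  value = aws_route53_zone.wandb.name_servers
}
```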
While this is not required, it is recommended to have an existing ACM certificate. Certificate validation can take up to two hours, causing timeouts during module apply if the cert is generated as one of the resources contained in the module.
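One way to reuse an existing certificate is to look it up with the AWS provider's aws_acm_certificate data source and pass its ARN to the module's acm_certificate_arn input. This is a sketch under assumptions: the data source label and the domain test.wandb.ai are placeholders, and the certificate is assumed to be already issued in the same region.

```hcl
# Look up an already-issued certificate by domain (placeholder domain).
data "aws_acm_certificate" "wandb" {
  domain      = "test.wandb.ai"
  statuses    = ["ISSUED"]
  most_recent = true
}

module "wandb" {
  source    = "<filepath to cloned module directory>"
  namespace = "<prefix for naming AWS resources>"

  # Reuse the existing certificate instead of generating one during apply.
  acm_certificate_arn = data.aws_acm_certificate.wandb.arn

  # ... remaining required variables
}
```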
- Ensure account meets module pre-requisites from above.
- Please note that while some resources are individually and uniquely tagged, all common tags are expected to be configured within the AWS provider as shown in the example code snippet below.
- Create a Terraform configuration that pulls in this module and specifies values of the required variables:

    provider "aws" {
      region = "<your AWS region>"

      default_tags {
        tags = var.common_tags
      }
    }

    module "wandb" {
      source    = "<filepath to cloned module directory>"
      namespace = "<prefix for naming AWS resources>"
    }

- Run terraform init and terraform apply
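For orientation, here is a fuller sketch of the module block that sets every input marked as required in the inputs table of this document. All values are placeholders to replace with your own; the choice of false for kubernetes_legacy_cluster_creator_admin assumes a fresh install rather than an in-place upgrade.

```hcl
module "wandb" {
  source = "<filepath to cloned module directory>"

  namespace   = "<prefix for naming AWS resources>"
  license     = "<your license key>"
  domain_name = "<your domain>"
  zone_id     = "<your Route53 hosted zone ID>"

  eks_cluster_version = "<EKS Kubernetes version>"

  # CIDRs allowed to reach wandb-server (placeholders).
  allowed_inbound_cidr      = ["<IPv4 CIDR>"]
  allowed_inbound_ipv6_cidr = ["<IPv6 CIDR>"]

  # Assumption: fresh install. In-place v7 -> v8 upgrades use true instead.
  kubernetes_legacy_cluster_creator_admin = false
}
```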
By default, the type of Kubernetes instances, number of instances, Redis cluster size, and database instance sizes are standardized via configurations in ./deployment-size.tf and selected via the size input variable.
Available sizes are small, medium, large, xlarge, and xxlarge. The default is small.
All the values set via deployment-size.tf can be overridden by setting the appropriate input variables.
- kubernetes_instance_types - The instance type for the EKS nodes
- kubernetes_min_nodes_per_az - The minimum number of nodes in each AZ for the EKS cluster
- kubernetes_max_nodes_per_az - The maximum number of nodes in each AZ for the EKS cluster
- elasticache_node_type - The instance type for the redis cluster
- database_instance_class - The instance type for the database
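A short sketch of combining a size preset with a partial override: the preset supplies every value, and only the database instance class is replaced. The specific size and instance class shown are illustrative assumptions, not recommendations.

```hcl
module "wandb" {
  source = "<filepath to cloned module directory>"
  # ... other required variables

  # Start from the "medium" preset defined in deployment-size.tf.
  size = "medium"

  # Override a single value; everything else still comes from the preset.
  # The instance class below is only an example.
  database_instance_class = "db.r6g.xlarge"
}
```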
We have added additional variables that make BYOB easier to enable.
bucket_permissions_mode accepts one of three values:
- strict (the default) requires an explicit list of the buckets for proper access, the same as BYOB before 7.3.0.
- restricted makes use of the new variable bucket_restricted_accounts, which is a list of AWS account IDs where the BYOBs can be hosted. ex: ["1234567890", "1234876590"]
- public enables access to any properly configured BYOB outside the calling account. Effectively, this enables cross-account S3 access to ANY AWS account.
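As an illustration of the restricted mode, the module block below limits BYOB access to the two example account IDs quoted above. The account IDs are the placeholders from the text, not real accounts.

```hcl
module "wandb" {
  source = "<filepath to cloned module directory>"
  # ... other required variables

  # Only buckets hosted in the listed AWS accounts are reachable.
  bucket_permissions_mode    = "restricted"
  bucket_restricted_accounts = ["1234567890", "1234876590"]
}
```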
Important
Enabling BYOB or cross-account access, regardless of bucket_permissions_mode, still requires a policy attached to that bucket allowing the EKS node role to perform S3 actions.
To find the role that needs access to your BYOB, go to the bucket section of https://YOUR_WANDB_DEPLOYMENT/console/settings/system or see the module output cluster_node_role.
You can use the Secure Storage Connector submodule to create a bucket that allows access for the deployed cluster.
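If you manage the bucket yourself instead of using the submodule, the bucket policy could look roughly like the sketch below. This is an assumption-heavy illustration: the resource labels are hypothetical, the role ARN and bucket name are placeholders, and the list of S3 actions is an example only, since this document does not state which actions the deployment needs.

```hcl
# Bucket policy granting the EKS node role access to a BYOB bucket.
data "aws_iam_policy_document" "byob" {
  statement {
    effect = "Allow"

    principals {
      type        = "AWS"
      identifiers = ["<ARN of the EKS node role>"]
    }

    # Example actions only (assumption); adjust to what your deployment needs.
    actions = ["s3:GetObject", "s3:PutObject", "s3:ListBucket"]

    resources = [
      "arn:aws:s3:::<your bucket name>",
      "arn:aws:s3:::<your bucket name>/*",
    ]
  }
}

resource "aws_s3_bucket_policy" "byob" {
  bucket = "<your bucket name>"
  policy = data.aws_iam_policy_document.byob.json
}
```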
We have included documentation and reference examples for additional common installation scenarios for Weights & Biases, as well as examples for supporting resources that lack official modules.
Users can update the EKS cluster version to the latest version offered by AWS using the input variable eks_cluster_version. Cluster and nodegroup version updates can only be done in increments of one minor version at a time, so multi-version upgrades must be executed step-wise.
Bump eks_cluster_version to the next minor version and terraform apply. The data.aws_eks_addon_version data sources in modules/app_eks/add-ons.tf automatically resolve the latest addon versions compatible with the new cluster minor. AWS upgrades the control plane first, then rolls managed node groups to a compatible AMI, and addon versions update in place.
Upgrades must be executed step-wise from one minor version to the next — AWS rejects multi-minor upgrades in a single API call (e.g. you cannot go from 1.30 → 1.32 directly; you must go 1.30 → 1.31 → 1.32).
You can pin individual addon versions explicitly via the per-addon eks_addon_*_version overrides; those win over the data-source default.
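A sketch of a single-step minor upgrade with one addon pinned explicitly. The version numbers are examples, and the addon version string is left as a placeholder because the right value depends on what AWS offers for the target cluster minor.

```hcl
module "wandb" {
  source = "<filepath to cloned module directory>"
  # ... other required variables

  # Move exactly one minor version forward per apply (e.g. from 1.30).
  eks_cluster_version = "1.31"

  # Optional: pin CoreDNS instead of taking the data-source default.
  eks_addon_coredns_version = "<CoreDNS addon version>"
}
```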
In most cases no separate addon step is needed before the cluster upgrade — just bump eks_cluster_version and apply. AWS's recommended sequence is: upgrade the control plane, then nodes, then addons. The data sources follow this naturally because they query the API for the latest version compatible with the current cluster minor, which updates automatically after the control plane upgrade.
The pre-roll mechanism exists for the rare case where AWS documents that a specific addon must be at a minimum version before the cluster can be upgraded. When that happens, a human curates an entry in the local.eks_addons_preroll_versions map in modules/app_eks/add-ons.tf, and the operator sets eks_addons_preroll_version to the target minor while leaving eks_cluster_version unchanged:
- Pre-roll the required addon(s). Set eks_addons_preroll_version to the next minor version. terraform apply. Only addons with an entry in the preroll map are affected.
- Bump the cluster. Set eks_cluster_version to the same target minor and unset eks_addons_preroll_version. terraform apply.
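The two phases above can be sketched as follows, assuming an upgrade from 1.30 to 1.31 (example versions). The block shows the first apply; the comments describe the change for the second apply.

```hcl
module "wandb" {
  source = "<filepath to cloned module directory>"
  # ... other required variables

  # Phase 1: cluster stays on the current minor while addons listed in the
  # preroll map move toward the target minor.
  eks_cluster_version        = "1.30"
  eks_addons_preroll_version = "1.31"

  # Phase 2 (next apply): set eks_cluster_version = "1.31" and remove
  # eks_addons_preroll_version (or set it to null).
}
```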
As of the current EKS supported versions (1.30–1.35), no preroll entries are required. The preroll map is intentionally empty. Addons fall into three categories:
- vpc-cni, aws-ebs-csi-driver, aws-efs-csi-driver — forward-compatible; the latest version for the current K8s minor continues to work after the upgrade.
- kube-proxy, metrics-server — locked to the cluster minor by Kubernetes' version-skew policy; they cannot be prerolled and must update after the control plane upgrade.
- coredns — forward-compatible in practice; AWS does not require it to be updated before the cluster upgrade.
Upgrades must be executed in step-wise fashion from one version to the next. You cannot skip versions when upgrading EKS.
| Name | Version |
|---|---|
| terraform | ~> 1.9 |
| aws | ~> 5.95 |
| helm | < 3.0.0 |
| kubernetes | ~> 2.23 |
| null | ~> 3.0 |
| time | ~> 0.13 |
| Name | Version |
|---|---|
| aws | 5.100.0 |
| null | 3.2.4 |
| time | 0.13.1 |
| Name | Source | Version |
|---|---|---|
| acm | terraform-aws-modules/acm/aws | ~> 3.0 |
| app_eks | ./modules/app_eks | n/a |
| app_lb | ./modules/app_lb | n/a |
| database | ./modules/database | n/a |
| file_storage | ./modules/file_storage | n/a |
| iam_role | ./modules/iam_role | n/a |
| kms | ./modules/kms | n/a |
| networking | ./modules/networking | n/a |
| private_link | ./modules/private_link | n/a |
| redis | ./modules/redis | n/a |
| s3_endpoint | ./modules/endpoint | n/a |
| wandb | wandb/wandb/helm | 3.0.0 |
| Name | Type |
|---|---|
| aws_region.current | data source |
| aws_s3_bucket.file_storage | data source |
| aws_sqs_queue.file_storage | data source |
| Name | Description | Type | Default | Required |
|---|---|---|---|---|
| acm_certificate_arn | The ARN of an existing ACM certificate. | string | null | no |
| allowed_inbound_cidr | CIDRs allowed to access wandb-server. | list(string) | n/a | yes |
| allowed_inbound_ipv6_cidr | CIDRs allowed to access wandb-server. | list(string) | n/a | yes |
| allowed_private_endpoint_cidr | Private CIDRs allowed to access wandb-server. | list(string) | [] | no |
| aws_loadbalancer_controller_image_repository | The image repository of the aws-loadbalancer-controller to deploy. | string | "public.ecr.aws/eks/aws-load-balancer-controller" | no |
| aws_loadbalancer_controller_image_tag | The tag of the aws-loadbalancer-controller to deploy. | string | null | no |
| aws_loadbalancer_controller_tags | (Optional) A map of AWS tags to apply to all resources managed by the load balancer controller | map(string) | {} | no |
| bucket_kms_key_arn | n/a | string | "" | no |
| bucket_name | External bucket. Most users will not need these settings. They are meant for users who want a bucket and SQS queue that are in a different account. | string | "" | no |
| bucket_path | path of where to store data for the instance-level bucket | string | "" | no |
| bucket_permissions_mode | Defines the bucket permissions mode, which can be one of: strict, restricted, or public. | string | "strict" | no |
| bucket_restricted_accounts | List of allowed accounts when 'bucket_permissions_mode' is 'restricted'. | list(string) | [] | no |
| clickhouse_endpoint_service_id | The service ID of the VPC endpoint service for Clickhouse | string | "" | no |
| cluster_autoscaler_image_repository | The image repository of the cluster-autoscaler to deploy. | string | "registry.k8s.io/autoscaling/cluster-autoscaler" | no |
| cluster_autoscaler_image_tag | The tag of the cluster-autoscaler to deploy. | string | null | no |
| controller_image_tag | Tag of the controller image to deploy | string | "1.20.0" | no |
| create_elasticache | Boolean indicating whether to provision an elasticache instance (true) or not (false). | bool | true | no |
| create_vpc | Boolean indicating whether to deploy a VPC (true) or not (false). | bool | true | no |
| custom_domain_filter | A custom domain filter to be used by external-dns instead of the default FQDN. If not set, the local FQDN is used. | string | null | no |
| database_engine_version | Version for MySQL Aurora | string | "8.0" | no |
| database_instance_class | Instance type to use by database master instance. Defaults to null and value from deployment-size.tf is used | string | null | no |
| database_kms_key_arn | n/a | string | "" | no |
| database_master_username | Specifies the master_username value to set for the database | string | "wandb" | no |
| database_name | Specifies the name of the database | string | "wandb_local" | no |
| database_performance_insights_kms_key_arn | Specifies an existing KMS key ARN to encrypt the performance insights data if performance_insights_enabled was enabled out of band | string | "" | no |
| database_snapshot_identifier | Specifies whether or not to create this cluster from a snapshot. You can use either the name or ARN when specifying a DB cluster snapshot, or the ARN when specifying a DB snapshot | string | null | no |
| database_sort_buffer_size | Specifies the sort_buffer_size value to set for the database | number | 67108864 | no |
| deletion_protection | If the instance should have deletion protection enabled. The database / S3 can't be deleted when this value is set to true. | bool | true | no |
| domain_name | Domain for accessing the Weights & Biases UI. | string | n/a | yes |
| eks_addon_coredns_version | Override for the CoreDNS addon version. When null, the version is resolved via local.eks_addon_versions in modules/app_eks/add-ons.tf (default lookup by var.eks_cluster_version, with optional preroll override via var.eks_addons_preroll_version for preroll-eligible addons). | string | null | no |
| eks_addon_ebs_csi_driver_version | Override for the EBS CSI driver version. When null, the version is resolved via local.eks_addon_versions in modules/app_eks/add-ons.tf (default lookup by var.eks_cluster_version, with optional preroll override via var.eks_addons_preroll_version for preroll-eligible addons). | string | null | no |
| eks_addon_efs_csi_driver_version | Override for the EFS CSI driver version. When null, the version is resolved via local.eks_addon_versions in modules/app_eks/add-ons.tf (default lookup by var.eks_cluster_version, with optional preroll override via var.eks_addons_preroll_version for preroll-eligible addons). | string | null | no |
| eks_addon_kube_proxy_version | Override for the kube-proxy addon version. When null, the version is resolved via local.eks_addon_versions in modules/app_eks/add-ons.tf (default lookup by var.eks_cluster_version, with optional preroll override via var.eks_addons_preroll_version for preroll-eligible addons). | string | null | no |
| eks_addon_metrics_server_version | Override for the metrics-server addon version. When null, the version is resolved via local.eks_addon_versions in modules/app_eks/add-ons.tf (default lookup by var.eks_cluster_version, with optional preroll override via var.eks_addons_preroll_version for preroll-eligible addons). | string | null | no |
| eks_addon_vpc_cni_version | Override for the VPC CNI addon version. When null, the version is resolved via local.eks_addon_versions in modules/app_eks/add-ons.tf (default lookup by var.eks_cluster_version, with optional preroll override via var.eks_addons_preroll_version for preroll-eligible addons). | string | null | no |
| eks_addons_preroll_version | Optional Kubernetes minor version to roll preroll-eligible addons toward, while the cluster itself stays on var.eks_cluster_version. See local.eks_addons_preroll_versions in modules/app_eks/add-ons.tf for the addons covered. kube-proxy and metrics-server are intentionally excluded from preroll. | string | null | no |
| eks_cluster_tags | A map of AWS tags to apply to all resources managed by the EKS cluster | map(string) | {} | no |
| eks_cluster_version | EKS cluster kubernetes version | string | n/a | yes |
| eks_policy_arns | Additional IAM policy to apply to the EKS cluster | list(string) | [] | no |
| elasticache_node_type | The type of the redis cache node to deploy. Defaults to null and value from deployment-size.tf is used | string | null | no |
| enable_clickhouse | Provision clickhouse resources | bool | false | no |
| enable_flow_log | Controls whether VPC Flow Logs are enabled | bool | false | no |
| enable_helm_operator | Enable or disable applying and releasing W&B Operator chart | bool | true | no |
| enable_helm_wandb | Enable or disable applying and releasing CR chart | bool | true | no |
| enable_s3_https_only | Controls whether HTTPS-only is enabled for s3 buckets | bool | false | no |
| enable_yace | deploy yet another cloudwatch exporter to fetch aws resources metrics | bool | true | no |
| external_dns | Using external DNS. A subdomain must also be specified if this value is true. | bool | false | no |
| external_dns_image_repository | The image repository of the external-dns to deploy. | string | "registry.k8s.io/external-dns/external-dns" | no |
| external_dns_image_tag | The tag of the external-dns to deploy. | string | null | no |
| external_redis_host | host for the redis instance created externally | string | null | no |
| external_redis_params | queryVar params for redis instance created externally | object({}) | null | no |
| external_redis_port | port for the redis instance created externally | string | null | no |
| extra_fqdn | Additional fqdn's must be in the same hosted zone as domain_name. | list(string) | [] | no |
| k8s_namespace | The Kubernetes namespace where W&B resources will be deployed | string | "default" | no |
| keep_flow_log_bucket | Controls whether S3 bucket storing VPC Flow Logs will be kept | bool | true | no |
| kms_clickhouse_key_alias | KMS key alias for AWS KMS Customer managed key used by Clickhouse CMEK. | string | null | no |
| kms_clickhouse_key_policy | The policy that will define the permissions for the clickhouse kms key. | string | "" | no |
| kms_key_alias | KMS key alias for AWS KMS Customer managed key. | string | null | no |
| kms_key_deletion_window | Duration in days to destroy the key after it is deleted. Must be between 7 and 30 days. | number | 7 | no |
| kms_key_policy | The policy that will define the permissions for the kms key. | string | "" | no |
| kms_key_policy_administrator_arn | The principal that will be allowed to manage the kms key. | string | "" | no |
| kubernetes_alb_internet_facing | Indicates whether or not the ALB controlled by the Amazon ALB ingress controller is internet-facing or internal. | bool | true | no |
| kubernetes_alb_subnets | List of subnet ID's the ALB will use for ingress traffic. | list(string) | [] | no |
| kubernetes_instance_types | EC2 Instance type for primary node group. Defaults to null and value from deployment-size.tf is used | list(string) | null | no |
| kubernetes_map_accounts | REMOVED. AWS account numbers for the aws-auth ConfigMap. EKS module v20 uses access entries, which require a per-principal ARN; account-wide trust is no longer expressible. See docs/v8-upgrade-guide.md for migration paths. The variable is retained as a tripwire and will be removed in a future release. | list(string) | [] | no |
| kubernetes_map_roles | Additional IAM roles to add to the aws-auth configmap. | list(object({...})) | [] | no |
| kubernetes_map_users | Additional IAM users to add to the aws-auth configmap. | list(object({...})) | [] | no |
| kubernetes_max_nodes_per_az | Maximum number of nodes for the EKS cluster. Defaults to null and value from deployment-size.tf is used | number | null | no |
| kubernetes_min_nodes_per_az | Minimum number of nodes for the EKS cluster. Defaults to null and value from deployment-size.tf is used | number | null | no |
| kubernetes_node_disk_size_gb | Size of the node root volume in GB. | number | null | no |
| kubernetes_public_access | Indicates whether or not the Amazon EKS public API server endpoint is enabled. | bool | false | no |
| kubernetes_public_access_cidrs | List of CIDR blocks which can access the Amazon EKS public API server endpoint. | list(string) | [] | no |
| kubernetes_legacy_cluster_creator_admin | Whether the cluster-creator admin access entry is a legacy AWS-managed resource carried over from a prior v17 installation. In both paths the cluster creator ends up with admin permissions; the variable only controls who manages the entry, not whether the permissions exist. Required, no default. Set explicitly per scenario: (1) true: for terraform-aws-wandb v7 -> v8 in-place upgrades. AWS auto-migrates the legacy cluster-creator binding into an access entry as part of the CONFIG_MAP -> API_AND_CONFIG_MAP transition; that entry is AWS-owned, not terraform-state-owned. Setting this false causes terraform to try creating its own entry, resulting in a 409 ResourceInUseException. (2) false: for fresh terraform-aws-wandb v8 installs. AWS does not auto-create a cluster-creator access entry for clusters created at API_AND_CONFIG_MAP without a CONFIG_MAP-only predecessor; terraform must create the entry itself to bootstrap the in-apply kubernetes/helm providers. Forwarded to the community EKS module's enable_cluster_creator_admin_permissions input (inverted) via modules/app_eks. See docs/v8-upgrade-guide.md for the full rationale. | bool | n/a | yes |
| license | Weights & Biases license key. | string | n/a | yes |
| namespace | String used for prefix resources. | string | n/a | yes |
| network_cidr | CIDR block for VPC. | string | "10.10.0.0/16" | no |
| network_database_subnet_cidrs | List of private subnet CIDR ranges to create in VPC. | list(string) | [...] | no |
| network_database_subnets | A list of the identities of the database subnetworks in which resources will be deployed. | list(string) | [] | no |
| network_elasticache_subnet_cidrs | List of private subnet CIDR ranges to create in VPC. | list(string) | [...] | no |
| network_elasticache_subnets | A list of the identities of the subnetworks in which elasticache resources will be deployed. | list(string) | [] | no |
| network_id | The identity of the VPC in which resources will be deployed. | string | "" | no |
| network_private_subnet_cidrs | List of private subnet CIDR ranges to create in VPC. | list(string) | [...] | no |
| network_private_subnets | A list of the identities of the private subnetworks in which resources will be deployed. | list(string) | [] | no |
| network_public_subnet_cidrs | List of public subnet CIDR ranges to create in VPC. | list(string) | [...] | no |
| operator_chart_version | Version of the operator chart to deploy | string | "1.4.2" | no |
| other_wandb_env | Extra environment variables for W&B | map(any) | {} | no |
| parquet_wandb_env | Extra environment variables for W&B | map(string) | {} | no |
| preserve_aws_auth_configmap | v17 -> v20 in-place upgrade transition flag. See modules/app_eks/aws_auth_legacy.tf and docs/v8-upgrade-guide.md. | bool | false | no |
| private_link_allowed_account_ids | List of AWS account IDs allowed to access the VPC Endpoint Service | list(string) | [] | no |
| private_only_traffic | Enable private only traffic from customer private network | bool | false | no |
| public_access | Whether this instance is accessible via a public domain. | bool | false | no |
| size | Deployment size for the instance | string | "small" | no |
| subdomain | Subdomain for accessing the Weights & Biases UI. Default creates record at Route53 Route. | string | null | no |
| system_reserved_cpu_millicores | (Optional) The amount of 'system-reserved' CPU millicores to pass to the kubelet. For example: 100. A value of -1 disables the flag. | number | 70 | no |
| system_reserved_ephemeral_megabytes | (Optional) The amount of 'system-reserved' ephemeral storage in megabytes to pass to the kubelet. For example: 1000. A value of -1 disables the flag. | number | 750 | no |
| system_reserved_memory_megabytes | (Optional) The amount of 'system-reserved' memory in megabytes to pass to the kubelet. For example: 100. A value of -1 disables the flag. | number | 100 | no |
| system_reserved_pid | (Optional) The amount of 'system-reserved' process ids [pid] to pass to the kubelet. For example: 1000. A value of -1 disables the flag. | number | 500 | no |
| use_chainguard_redis | Whether CHAINGUARD redis is deployed in the cluster | bool | false | no |
| use_ctrlplane_redis | Whether redis is deployed in the cluster via ctrlplane | bool | false | no |
| use_external_redis | Boolean indicating whether to use the redis instance created externally | bool | false | no |
| use_internal_queue | n/a | bool | false | no |
| weave_wandb_env | Extra environment variables for W&B | map(string) | {} | no |
| yace_sa_name | n/a | string | "wandb-yace" | no |
| zone_id | Domain for creating the Weights & Biases subdomain on. | string | n/a | yes |
| Name | Description |
|---|---|
| bucket_name | n/a |
| bucket_path | n/a |
| bucket_queue_name | n/a |
| bucket_region | n/a |
| cluster_certificate_authority_data | n/a |
| cluster_endpoint | Surfaced so callers can configure kubernetes/helm providers directly from module outputs instead of data "aws_eks_cluster". See modules/app_eks/outputs.tf for the v18+/v20 upgrade rationale. |
| cluster_name | n/a |
| cluster_node_role | n/a |
| database_connection_string | n/a |
| database_instance_type | n/a |
| database_password | n/a |
| database_username | n/a |
| eks_max_nodes_per_az | n/a |
| eks_min_nodes_per_az | n/a |
| eks_node_instance_type | n/a |
| elasticache_connection_string | n/a |
| kms_clickhouse_key_arn | The Amazon Resource Name of the KMS key used to encrypt Weave data at rest in Clickhouse. |
| kms_key_arn | The Amazon Resource Name of the KMS key used to encrypt data at rest. |
| network_id | The identity of the VPC in which resources are deployed. |
| network_private_subnets | The identities of the private subnetworks deployed within the VPC. |
| network_public_subnets | The identities of the public subnetworks deployed within the VPC. |
| private_link_availability_zones | The Availability Zones where the Private Link NLB endpoints are available |
| private_link_service_id | The ID of the VPC Endpoint Service for Private Link |
| private_link_service_name | The service name of the VPC Endpoint Service for Private Link |
| redis_instance_type | n/a |
| standardized_size | n/a |
| url | The URL to the W&B application |
| wandb_spec | n/a |
See our upgrade guide here
The terraform-aws-modules/eks/aws pin moves from ~> 17.23 to ~> 20.37
on this branch. aws-eks v18 (and again in aws-eks v20) reorganized inputs, outputs, and
internal resource addresses, so a plain terraform apply against an
existing aws-eks-v17-managed cluster wants to destroy and recreate the EKS cluster,
node groups, IAM roles, and KMS key. To make this an in-place upgrade
instead — preserving the cluster control plane, its OIDC issuer, IAM
roles, security groups, and KMS key — this branch carries:
- Five name-preservation inputs on the module "eks" invocation in modules/app_eks/main.tf (iam_role_name, iam_role_use_name_prefix, cluster_security_group_name, cluster_security_group_use_name_prefix, cluster_security_group_description, plus prefix_separator = "") to match v17-era resource names.
- Twelve moved {} blocks in modules/app_eks/moved.tf and modules/app_eks/aws_auth_legacy.tf for the v17 -> v20 address renames.
- A var.preserve_aws_auth_configmap flag that, when true, adopts the v17-era kube-system/aws-auth ConfigMap into wandb-side state for the authentication-mode cutover, then cleanly destroys it on a follow-up apply when set to false.
Notes: The per-AZ aws_launch_template and aws_eks_node_group resources are replaced on the upgrade apply for two reasons:
- v20 naming change. The community module hardcodes a "-" separator in name_prefix that v17 did not have, making the old name a ForceNew drift.
- AL2023 migration. This module now mandates ami_type = "AL2023_x86_64_STANDARD" on all node groups. Amazon Linux 2 reached end-of-life June 2025. ami_type is a ForceNew attribute on aws_eks_node_group, so any cluster whose node groups were running AL2 will have its node groups replaced when upgrading to this module version. Clusters already on AL2023 see an in-place update only.
Both replacements are graceful: v20 sets lifecycle.create_before_destroy = true on both resources, so new AL2023 nodes come up and go Ready before old nodes drain and terminate. EC2 quota must briefly accommodate 2× steady-state capacity per AZ during the apply window. See docs/v8-upgrade-guide.md for the full impact table, capacity pre-flight checklist, and rollback procedure.
Recommended upgrade sequence:
- Upgrade terraform-aws-wandb v7.x to v8.x on the current K8s version with preserve_aws_auth_configmap = true. Node groups are replaced in this apply (naming + AL2023, combined into one graceful CBD roll).
- Re-run with preserve_aws_auth_configmap = false after ~1 hour to retire the aws-auth ConfigMap.
- Proceed with individual EKS minor-version bumps one at a time: 1.30 → 1.31 → … → 1.34.
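The first two steps of this sequence can be sketched as the module settings below. This is an illustration under assumptions: it covers only the two flags discussed in this document, and the remaining configuration is elided; docs/v8-upgrade-guide.md remains the authoritative procedure.

```hcl
module "wandb" {
  source = "<filepath to cloned module directory>"
  # ... existing configuration, with eks_cluster_version left unchanged

  # Step 1: adopt the v17-era aws-auth ConfigMap during the v7.x -> v8.x apply.
  preserve_aws_auth_configmap = true

  # In-place v7 -> v8 upgrades set this to true (see the inputs table).
  kubernetes_legacy_cluster_creator_admin = true

  # Step 2 (follow-up apply after ~1 hour): change
  # preserve_aws_auth_configmap to false to retire the ConfigMap.
}
```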
5.0.0 introduced autoscaling to the EKS cluster and made the size variable the preferred way to set the cluster size.
Previously, unless the size variable was set explicitly, there were default values for the following variables:
- kubernetes_instance_types
- kubernetes_node_count
- elasticache_node_type
- database_instance_class

The size variable now defaults to small, and the following variables can be used to partially override the values set by the size variable:
- kubernetes_instance_types
- kubernetes_min_nodes_per_az
- kubernetes_max_nodes_per_az
- elasticache_node_type
- database_instance_class
For more information on the available sizes, see the Cluster Sizing section.
If having the cluster scale nodes in and out is not desired, the kubernetes_min_nodes_per_az and
kubernetes_max_nodes_per_az can be set to the same value to prevent the cluster from scaling.
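For example, pinning both bounds to the same number keeps the node count per AZ fixed. The value 2 is an arbitrary illustration.

```hcl
module "wandb" {
  source = "<filepath to cloned module directory>"
  # ... other required variables

  # Equal min and max leave the autoscaler no room to scale in or out.
  kubernetes_min_nodes_per_az = 2
  kubernetes_max_nodes_per_az = 2
}
```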
This upgrade is also intended to be used when upgrading eks to 1.30.
We have upgraded the following dependencies and Kubernetes addons:
- MySQL Aurora (8.0.mysql_aurora.3.07.1)
- redis (7.1)
- external-dns helm chart (v1.15.0)
- aws-efs-csi-driver (v2.0.7-eksbuild.1)
- aws-ebs-csi-driver (v1.35.0-eksbuild.1)
- coredns (v1.11.3-eksbuild.1)
- kube-proxy (v1.30.0-eksbuild.1)
- vpc-cni (v1.18.3-eksbuild.3)
⚠️ Please remove the enable_dummy_dns and enable_operator_alb variables, as they are no longer valid flags. They were provided to support older versions of the module that relied on an ALB not created by the ingress controller.
- If egress access for retrieving the wandb/controller image is not available, Terraform apply may experience failures.
- It's necessary to supply a license variable within the module, as shown:

    module "wandb" {
      version = "4.x"

      # ...
      license = "<your license key>"
      # ...
    }

- An external KMS key can be provided to encrypt the database, Redis, and S3 buckets.
- To provide KMS keys, set the KMS key ARN values in:
  - database_kms_key_arn
  - bucket_kms_key_arn
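A minimal sketch of supplying externally managed keys through these two inputs. The ARNs are placeholders, and using a separate key per resource is an assumption; the same key ARN could be passed to both.

```hcl
module "wandb" {
  source = "<filepath to cloned module directory>"
  # ... other required variables

  # ARNs of existing KMS keys (placeholders).
  database_kms_key_arn = "<ARN of the KMS key for the database>"
  bucket_kms_key_arn   = "<ARN of the KMS key for the S3 bucket>"
}
```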
To allow cross-account KMS keys, the keys must be accessible to the WandB account.
This can be done by adding the following statement to the key policy.
{
"Sid": "Allow use of the key",
"Effect": "Allow",
"Principal": {
"AWS": [
"arn:aws:iam::<Account_id>:root"
]
},
"Action": [
"kms:Encrypt",
"kms:Decrypt",
"kms:ReEncrypt*",
"kms:GenerateDataKey*",
"kms:DescribeKey",
"kms:CreateGrant"
],
"Resource": "*"
}
v7 changes how the module references storage: instead of using Terraform's count, it always creates a "defaultBucket", which can be overridden later or by providing an initial bucket.
We are considering this a major change because of the Terraform moved block that migrates the resource. After moving to v7, applying an earlier version of the module may result in Terraform deleting your bucket.
The create_bucket variable has been removed due to the above.
- No changes required by you
- ~>4.0 version required for AWS Provider