Migrate Element/Matrix from Ansible to Helm/ArgoCD #1556

Open
LukasCuperDT wants to merge 24 commits into main from Metrix_reloader

LukasCuperDT commented Mar 6, 2026

Summary

Replaces the legacy Ansible-managed Matrix/Synapse deployment with a fully declarative Helm chart deployment managed by ArgoCD. Migrates the existing matrix.otc-service.com instance (users, rooms, message history) from the old cluster (otcinfra) to the new cluster (otcinfra2).

Changes

Added — Helm Charts

  • Upstream Element chart (upstream/element/): Wraps matrix-stack v26.2.3 from oci://ghcr.io/element-hq/ess-helm, deploying:
    • Synapse v1.125.0 with external PostgreSQL RDS (SSL verify-ca)
    • Element Web v1.11.86 (dark theme, branded "OTC Chat")
    • MatrixRTC with LiveKit SFU v1.9.1 (forced TCP for OTC security group compatibility)
    • Well-Known delegation with base domain redirect
    • HAProxy + Redis
  • Local Element chart (local/element/): Supplementary templates for:
    • Element Call deployment with LiveKit integration
    • SFU TCP LoadBalancer service (workaround for OTC ELB PreferDualStack incompatibility)
    • Matrix media PersistentVolume (SFS Turbo NFS)
    • Vault-injected secrets (signing key, registration secret, DB credentials, SSL CA)
    • Init DB job for database bootstrapping
  • Maubot Kustomize (kustomize/maubot/): Standalone Maubot deployment with prod overlay

Removed — Ansible Roles & Playbooks

  • playbooks/roles/matrix/ (8 files, ~3200 lines): Synapse K8s role with homeserver.yaml.j2 template
  • playbooks/roles/maubot/ (5 files, ~340 lines): Maubot K8s role with config template
  • playbooks/service-matrix.yaml
  • playbooks/service-maubot.yaml

Key Configuration

| Component | Value |
|---|---|
| Server name | matrix.otc-service.com (preserved from legacy) |
| Authentication | Zitadel OIDC SSO (allow_existing_users: true) |
| Images | All mirrored to quay.io/opentelekomcloud/ |
| TLS | cert-manager with letsencrypt-prod ClusterIssuer |
| WebRTC | ICE forced to TCP (OTC security group blocks UDP) |
| Database | External RDS with SSL verify-ca |

Migration Details

  • Database dump/restore from old PostgreSQL to new RDS
  • DNS records updated: matrix, element, call, and matrixrtc under .otc-service.com now point to 80.158.58.167
  • Original signing key preserved to maintain federation identity
  • 3 users, 24 rooms, 112,482 events migrated

… Element Call

Upstream chart (element):
- matrix-stack v26.2.3 (ESS Community) as dependency
- Synapse v1.125.0, Element Web v1.11.86, LiveKit v1.9.1
- All images mirrored to quay.io/opentelekomcloud with pullSecrets
- External OTC RDS PostgreSQL with SSL verify-ca
- NFS media storage via SFS Turbo
- MatrixRTC with LiveKit SFU + lk-jwt-service
- Well-known delegation with HAProxy
- AVP (ArgoCD Vault Plugin) placeholders for all secrets

Local chart (element-additional-manifests):
- Element Call standalone frontend (v0.17.0)
- NFS PersistentVolume/PVC for Synapse media
- K8s Secrets with AVP placeholders (synapse, db creds, db SSL CA)

DNS: *.eco.tsi-dev.otc-service.com (matrix, element, call, matrixrtc)

…renewal

Maubot:
- Kustomize base + prod overlay for Maubot bot framework
- StatefulSet, Service, Ingress (/_matrix/maubot on matrix.eco.tsi-dev.otc-service.com)
- NFS PV on shared SFS Turbo (192.168.171.186, subpath /maubot)
- Image: quay.io/opentelekomcloud/maubot:v1.0.0
- imagePullSecrets: quay-pull-secret
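
The NFS volume described in this commit could be declared roughly as follows. This is a sketch only: the server address and subpath come from the commit message (a later commit adjusts the NFS path), while the resource name, capacity, access mode, and reclaim policy are assumptions.

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: maubot-data                        # hypothetical name
spec:
  capacity:
    storage: 1Gi                           # assumed size
  accessModes:
    - ReadWriteMany                        # assumed
  persistentVolumeReclaimPolicy: Retain    # assumed
  storageClassName: ""                     # static binding, no dynamic provisioner
  nfs:
    server: 192.168.171.186                # shared SFS Turbo, from the commit message
    path: /maubot                          # subpath as stated in this commit
```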

Element charts:
- All images switched to quay.io/opentelekomcloud with pullSecrets
- cert-manager annotations: duration 8760h (365d), renew-before 720h (30d)
- Element Call pinned to v0.17.0
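
For orientation, the certificate lifetime settings above map onto cert-manager's ingress annotations roughly like this. The fragment assumes the annotations are attached to Ingress metadata; where the charts actually set them is not shown in this PR description.

```yaml
metadata:
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod   # ClusterIssuer named in the PR summary
    cert-manager.io/duration: 8760h                    # 365d
    cert-manager.io/renew-before: 720h                 # 30d
```
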
…nit-db job

- upstream/element: Fix DB user to root, add nginx-cache emptyDir for element-web
- local/element: Add init-db-job PreSync hook to create synapse database
- maubot: Switch config from ConfigMap to Secret, fix NFS path, add config.yaml
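
A PreSync hook for the init-db job might look roughly like the sketch below. Only the hook idea and the database name "synapse" are taken from the commit message; the Job name, image, Secret name, and the exact SQL are assumptions.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: init-db                                  # hypothetical name
  annotations:
    argocd.argoproj.io/hook: PreSync             # run before the main sync
    argocd.argoproj.io/hook-delete-policy: HookSucceeded
spec:
  backoffLimit: 3
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: create-db
          image: postgres:16                     # assumed; the PR mirrors images to quay.io/opentelekomcloud
          envFrom:
            - secretRef:
                name: synapse-db-credentials     # hypothetical Secret providing PGHOST, PGUSER, PGPASSWORD
          command: ["sh", "-c"]
          args:
            - |
              # Create the database only if it does not exist yet.
              psql -d postgres -tAc "SELECT 1 FROM pg_database WHERE datname = 'synapse'" | grep -q 1 ||
              psql -d postgres -c "CREATE DATABASE synapse ENCODING 'UTF8' LC_COLLATE 'C' LC_CTYPE 'C' TEMPLATE template0"
```
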
Legacy 1:1 WebRTC calls fail without a TURN server.
Setting use_exclusively=true forces all calls (including 1:1)
through the LiveKit SFU, which is already deployed and reachable.
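
In Element Web's own config this is the element_call.use_exclusively option. The sketch below assumes the upstream chart accepts extra Element Web config as a JSON string under elementWeb.additional; the key name element-call.json is hypothetical.

```yaml
elementWeb:
  additional:
    element-call.json: |          # hypothetical key name
      {
        "element_call": {
          "use_exclusively": true
        }
      }
```
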
Element Call v0.17.0 preloads /config.json on startup. Without it,
nginx try_files returns index.html as HTML, causing JSON parse
errors that crash the app in both standalone and embedded (widget) mode.

The config provides:
- default_server_config: homeserver base URL
- livekit.livekit_service_url: auth service URL for JWT tokens
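
A minimal sketch of such a config, shipped as a ConfigMap that would be mounted at /config.json. The two config keys are the ones listed above; the ConfigMap name and both hostnames are assumptions for illustration.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: element-call-config       # hypothetical name
data:
  config.json: |
    {
      "default_server_config": {
        "m.homeserver": {
          "base_url": "https://matrix.otc-service.com",
          "server_name": "matrix.otc-service.com"
        }
      },
      "livekit": {
        "livekit_service_url": "https://matrixrtc.otc-service.com"
      }
    }
```
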
ICE connectivity fails with NodePort: kube-proxy SNATs outbound SFU
UDP packets to a random source port, so client STUN responses never
match the expected ICE candidate pair. All pairs show state=failed
with responsesReceived=0.

Fix: hostNetwork=true binds SFU directly to host network, so UDP
source port is the actual SFU port. Also set manualIP since STUN
discovery timed out in the OTC environment.

The OTC CCE worker node security group allows inbound TCP on
NodePort range but blocks inbound UDP. This causes all UDP ICE
candidate pairs to fail (requestsReceived=0, responsesReceived=0).

TCP port 30881 is confirmed reachable externally. Setting
force_tcp=true makes LiveKit SFU advertise only TCP ICE candidates
so clients connect immediately via TCP without waiting for UDP
timeout.

To restore UDP: open inbound UDP 30882 in the CCE node security
group and remove force_tcp.
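
Expressed as LiveKit server config, the setting could look like the fragment below. It assumes force_tcp sits in LiveKit's rtc section; how the fragment is nested inside the chart's values is not shown here.

```yaml
rtc:
  force_tcp: true     # advertise only TCP ICE candidates; remove to restore UDP
```
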
80.158.44.171 is the VPC SNAT gateway (outbound only). It has
no inbound DNAT rules, so browsers cannot connect to NodePort
services through it. The TCP nc test from the Mac succeeded only
because that traffic is routed internally through the VPN.

Changes:
- rtcTcp: NodePort → LoadBalancer (creates OTC ELB with real
  public IP that accepts inbound connections)
- externalTrafficPolicy: Local (avoids kube-proxy SNAT)
- rtcMuxedUdp: disabled (UDP blocked by security group anyway)

Next step: once the ELB is provisioned, update manualIP to its IP.

The SFU TCP LoadBalancer now shares the ingress-nginx ELB
(510d12e5-a578-46e5-acb0-32bc0ffcb04c) which has public IP
80.158.58.167. This IP is confirmed reachable from the internet
and is the same IP used by all HTTPS ingresses.

The upstream matrix-stack chart hard-codes ipFamilyPolicy: PreferDualStack
on exposed services, but OTC union ELBs reject IPv6. Instead:

- Disable rtcTcp in upstream chart (exposedServices.rtcTcp.enabled: false)
- Create sfu-tcp-service.yaml in local chart with:
  - ipFamilyPolicy: SingleStack + ipFamilies: [IPv4]
  - kubernetes.io/elb.id annotation to share the ingress-nginx ELB
  - externalTrafficPolicy: Local
  - LoadBalancer type on port 30881

This ensures a clean 'argocd app sync' deploys the working SFU TCP
service without manual kubectl patches.
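
The service described above could be sketched as follows. The field values come from the commit messages (including the ELB ID of the shared ingress-nginx ELB); the Service name, port name, targetPort, and selector label are assumptions.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: sfu-tcp                                   # hypothetical name
  annotations:
    kubernetes.io/elb.id: 510d12e5-a578-46e5-acb0-32bc0ffcb04c   # shared ingress-nginx ELB
spec:
  type: LoadBalancer
  externalTrafficPolicy: Local                    # avoids kube-proxy SNAT
  ipFamilyPolicy: SingleStack                     # OTC ELBs reject IPv6
  ipFamilies:
    - IPv4
  ports:
    - name: rtc-tcp                               # hypothetical port name
      protocol: TCP
      port: 30881
      targetPort: 30881                           # assumed to match the SFU's TCP port
  selector:
    app.kubernetes.io/name: matrix-rtc-sfu        # assumed label; must match the SFU pods
```
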
Bug A: Disabling exposedServices.rtcTcp also removed tcp_port from the
upstream chart ConfigMap, causing LiveKit to default to port 7881 while
the ELB forwards 30881. Add tcp_port: 30881 to the additional SFU config.

Bug B: The SFU TCP LB service in the local chart used .Release.Name in
its selector, which resolves to element-additional-manifests-otcinfra2
instead of element-otcinfra2 (the upstream chart release). Use a
configurable sfuReleaseName value instead.
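
Both fixes can be pictured with the fragments below. The tcp_port value and the sfuReleaseName key are from the commit message; the file layout, the label key, and the instance suffix are assumptions about how the upstream chart labels its SFU pods.

```yaml
# LiveKit config added via the chart's additional SFU config (Bug A)
rtc:
  tcp_port: 30881          # must match the port the ELB forwards
---
# values.yaml of the local chart (Bug B)
sfuReleaseName: element-otcinfra2
---
# templates/sfu-tcp-service.yaml, selector excerpt (Bug B)
spec:
  selector:
    app.kubernetes.io/instance: "{{ .Values.sfuReleaseName }}-matrix-rtc-sfu"   # label key and suffix assumed
```
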
- Configure Synapse OIDC provider pointing to Zitadel
- Client credentials injected via AVP from secret/element/oidc
- Request project roles scope for group-based access control
- Restrict login to users with element_users Zitadel role
- Map preferred_username to Matrix localpart
- Keep local password login enabled alongside SSO

Zitadel returns roles as nested JSON objects inside a list, which
Synapse's attribute_requirements cannot match with a simple string
check. Access control is handled at the Zitadel application level
instead: only authorised users can log in.

The config: block in synapse.additional is NOT processed through
Helm tpl — it is stored as-is via b64enc. The Helm escaping
{{ '{{' }} was never evaluated and ended up as literal text,
causing Synapse to use the template syntax as the actual username.

Use raw {{ user.preferred_username }} since values files are not
Helm-templatized.
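
A sketch of the resulting Synapse config, with the Jinja expression written raw so that Synapse, not Helm, evaluates it. The Synapse option names are standard; the issuer URL, idp_id, scopes, and the Vault key names client_id and client_secret are assumptions.

```yaml
oidc_providers:
  - idp_id: zitadel                                          # assumed identifier
    idp_name: Zitadel
    issuer: "https://zitadel.example.com"                    # placeholder issuer URL
    client_id: "<path:secret/element/oidc#client_id>"        # AVP placeholder, key name assumed
    client_secret: "<path:secret/element/oidc#client_secret>"
    scopes: ["openid", "profile", "email"]                   # assumed scopes
    allow_existing_users: true
    user_mapping_provider:
      config:
        localpart_template: "{{ user.preferred_username }}"  # raw Jinja, no Helm escaping
password_config:
  enabled: true                                              # local password login stays enabled
```
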
LukasCuperDT changed the title from "Metrix reloader" to "Migrate Element/Matrix from Ansible to Helm/ArgoCD" on Mar 17, 2026
LukasCuperDT marked this pull request as ready for review on Mar 17, 2026, 06:50
LukasCuperDT requested a review from otcbot as a code owner on Mar 17, 2026, 06:50