Migrate Element/Matrix from Ansible to Helm/ArgoCD #1556
Open
LukasCuperDT wants to merge 24 commits into main
Conversation
… Element Call

Upstream chart (element):
- matrix-stack v26.2.3 (ESS Community) as dependency
- Synapse v1.125.0, Element Web v1.11.86, LiveKit v1.9.1
- All images mirrored to quay.io/opentelekomcloud with pullSecrets
- External OTC RDS PostgreSQL with SSL verify-ca
- NFS media storage via SFS Turbo
- MatrixRTC with LiveKit SFU + lk-jwt-service
- Well-known delegation with HAProxy
- AVP (ArgoCD Vault Plugin) placeholders for all secrets

Local chart (element-additional-manifests):
- Element Call standalone frontend (v0.17.0)
- NFS PersistentVolume/PVC for Synapse media
- K8s Secrets with AVP placeholders (synapse, db creds, db SSL CA)

DNS: *.eco.tsi-dev.otc-service.com (matrix, element, call, matrixrtc)
…renewal

Maubot:
- Kustomize base + prod overlay for the Maubot bot framework
- StatefulSet, Service, Ingress (/_matrix/maubot on matrix.eco.tsi-dev.otc-service.com)
- NFS PV on shared SFS Turbo (192.168.171.186, subpath /maubot)
- Image: quay.io/opentelekomcloud/maubot:v1.0.0
- imagePullSecrets: quay-pull-secret

Element charts:
- All images switched to quay.io/opentelekomcloud with pullSecrets
- cert-manager annotations: duration 8760h (365d), renew-before 720h (30d)
- Element Call pinned to v0.17.0
…nit-db job

- upstream/element: Fix DB user to root, add nginx-cache emptyDir for element-web
- local/element: Add init-db-job PreSync hook to create the synapse database
- maubot: Switch config from ConfigMap to Secret, fix NFS path, add config.yaml
Legacy 1:1 WebRTC calls fail without a TURN server. Setting use_exclusively=true forces all calls (including 1:1) through the LiveKit SFU, which is already deployed and reachable.
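In Element Web's config.json this corresponds to the `element_call` section; a minimal sketch:

```json
{
  "element_call": {
    "use_exclusively": true
  }
}
```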
Element Call v0.17.0 preloads /config.json on startup. Without it, nginx try_files returns index.html as HTML, causing JSON parse errors that crash the app in both standalone and embedded (widget) mode. The config provides:
- default_server_config: homeserver base URL
- livekit.livekit_service_url: auth service URL for JWT tokens
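A minimal /config.json covering the two keys above might look like this (hostnames are illustrative, taken from the DNS entries elsewhere in this PR):

```json
{
  "default_server_config": {
    "m.homeserver": {
      "base_url": "https://matrix.eco.tsi-dev.otc-service.com"
    }
  },
  "livekit": {
    "livekit_service_url": "https://matrixrtc.eco.tsi-dev.otc-service.com"
  }
}
```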
ICE connectivity fails with NodePort: kube-proxy SNATs outbound SFU UDP packets to a random source port, so client STUN responses never match the expected ICE candidate pair. All pairs show state=failed with responsesReceived=0. Fix: hostNetwork=true binds SFU directly to host network, so UDP source port is the actual SFU port. Also set manualIP since STUN discovery timed out in OTC environment.
The OTC CCE worker node security group allows inbound TCP on NodePort range but blocks inbound UDP. This causes all UDP ICE candidate pairs to fail (requestsReceived=0, responsesReceived=0). TCP port 30881 is confirmed reachable externally. Setting force_tcp=true makes LiveKit SFU advertise only TCP ICE candidates so clients connect immediately via TCP without waiting for UDP timeout. To restore UDP: open inbound UDP 30882 in the CCE node security group and remove force_tcp.
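The keys described above could land in the SFU's additional config roughly like this; a sketch only, since the key names come from these commit messages but their exact nesting in the chart values is an assumption:

```yaml
rtc:
  tcp_port: 30881   # must match the externally reachable NodePort
  force_tcp: true   # advertise only TCP ICE candidates while inbound UDP is blocked
```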
80.158.44.171 is the VPC SNAT gateway (outbound only): it has no inbound DNAT rules, so browsers cannot connect to NodePort services through it. TCP nc from the Mac worked because it routes through the VPN internally.

Changes:
- rtcTcp: NodePort → LoadBalancer (creates an OTC ELB with a real public IP that accepts inbound connections)
- externalTrafficPolicy: Local (avoids kube-proxy SNAT)
- rtcMuxedUdp: disabled (UDP is blocked by the security group anyway)

Next step: once the ELB is provisioned, update manualIP to its IP.
The SFU TCP LoadBalancer now shares the ingress-nginx ELB (510d12e5-a578-46e5-acb0-32bc0ffcb04c) which has public IP 80.158.58.167. This IP is confirmed reachable from the internet and is the same IP used by all HTTPS ingresses.
The upstream matrix-stack chart hard-codes ipFamilyPolicy: PreferDualStack on exposed services, but OTC union ELBs reject IPv6. Instead:
- Disable rtcTcp in the upstream chart (exposedServices.rtcTcp.enabled: false)
- Create sfu-tcp-service.yaml in the local chart with:
  - ipFamilyPolicy: SingleStack + ipFamilies: [IPv4]
  - kubernetes.io/elb.id annotation to share the ingress-nginx ELB
  - externalTrafficPolicy: Local
  - LoadBalancer type on port 30881

This ensures a clean 'argocd app sync' deploys the working SFU TCP service without manual kubectl patches.
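The resulting local-chart Service could be sketched as follows; the name and selector label are assumptions, the ELB ID and port come from this PR:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: sfu-tcp
  annotations:
    # share the existing ingress-nginx ELB instead of provisioning a new one
    kubernetes.io/elb.id: 510d12e5-a578-46e5-acb0-32bc0ffcb04c
spec:
  type: LoadBalancer
  ipFamilyPolicy: SingleStack   # OTC union ELBs reject IPv6
  ipFamilies: [IPv4]
  externalTrafficPolicy: Local  # avoid kube-proxy SNAT of SFU traffic
  ports:
    - name: rtc-tcp
      port: 30881
      targetPort: 30881
      protocol: TCP
  selector:
    app.kubernetes.io/name: matrix-rtc-sfu   # assumed label; real selector is release-scoped
```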
Bug A: Disabling exposedServices.rtcTcp also removed tcp_port from the upstream chart ConfigMap, causing LiveKit to default to port 7881 while the ELB forwards 30881. Add tcp_port: 30881 to the additional SFU config.

Bug B: The SFU TCP LB service in the local chart used .Release.Name in its selector, which resolves to element-additional-manifests-otcinfra2 instead of element-otcinfra2 (the upstream chart release). Use a configurable sfuReleaseName value instead.
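The Bug B fix could look like this in the local chart's Service template; the label key is an assumption, and sfuReleaseName is supplied from values:

```yaml
# templates/sfu-tcp-service.yaml (excerpt) — select pods of the upstream
# release via a value instead of this chart's .Release.Name
spec:
  selector:
    app.kubernetes.io/instance: {{ .Values.sfuReleaseName | default "element-otcinfra2" }}
```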
- Configure a Synapse OIDC provider pointing to Zitadel
- Client credentials injected via AVP from secret/element/oidc
- Request the project roles scope for group-based access control
- Restrict login to users with the element_users Zitadel role
- Map preferred_username to the Matrix localpart
- Keep local password login enabled alongside SSO
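A hedged sketch of the corresponding homeserver.yaml fragment; the issuer URL is a placeholder, and the AVP placeholder paths are illustrative stand-ins for the real secret references:

```yaml
oidc_providers:
  - idp_id: zitadel
    idp_name: Zitadel
    issuer: https://zitadel.example.com            # placeholder
    client_id: <path:secret/element/oidc#client_id>
    client_secret: <path:secret/element/oidc#client_secret>
    scopes: ["openid", "profile", "email", "urn:zitadel:iam:org:project:roles"]
    user_mapping_provider:
      config:
        # raw Jinja, not Helm-escaped: values files are not run through tpl
        localpart_template: "{{ user.preferred_username }}"
```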
Zitadel returns roles as nested JSON objects inside a list, which Synapse's attribute_requirements cannot match with a simple string check. Access control is handled at the Zitadel application level instead — only authorised users can log in.
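For illustration, the roles claim has roughly this shape (claim name per Zitadel's convention, IDs invented), which a flat attribute/value check in attribute_requirements cannot match:

```json
{
  "urn:zitadel:iam:org:project:roles": [
    { "element_users": { "276012345678901234": "example-org.zitadel.cloud" } }
  ]
}
```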
Force-pushed 5a4be68 to 163c3ad
The config: block in synapse.additional is NOT processed through Helm tpl: it is stored as-is via b64enc. The Helm escaping {{ '{{' }} was therefore never evaluated and ended up as literal text, causing Synapse to use the template syntax as the actual username. Use raw {{ user.preferred_username }} since values files are not Helm-templated.
Force-pushed 163c3ad to 2ffeef8
Force-pushed 9fb4e4b to 94c94ef
Summary
Replaces the legacy Ansible-managed Matrix/Synapse deployment with a fully declarative Helm chart deployment managed by ArgoCD. Migrates the existing matrix.otc-service.com instance (users, rooms, message history) from the old cluster (otcinfra) to the new cluster (otcinfra2).

Changes
Added — Helm Charts
- Upstream chart (upstream/element/): Wraps matrix-stack v26.2.3 from oci://ghcr.io/element-hq/ess-helm
- Local chart (local/element/): Supplementary templates (including the single-stack SFU TCP Service that works around the PreferDualStack incompatibility)
- Maubot (kustomize/maubot/): Standalone Maubot deployment with prod overlay

Removed — Ansible Roles & Playbooks
- playbooks/roles/matrix/ (8 files, ~3200 lines): Synapse K8s role with homeserver.yaml.j2 template
- playbooks/roles/maubot/ (5 files, ~340 lines): Maubot K8s role with config template
- playbooks/service-matrix.yaml
- playbooks/service-maubot.yaml

Key Configuration
- matrix.otc-service.com (preserved from legacy)
- allow_existing_users: true
- Images mirrored to quay.io/opentelekomcloud
- letsencrypt-prod ClusterIssuer

Migration Details
- DNS: matrix, element, call, matrixrtc (.otc-service.com) → 80.158.58.167