Hasura Enterprise 2.48.12 appears to show sustained in-process memory growth under steady workload #10821

@matthew-gizmo

Description

Version Information

Server Version:

  • Hasura Enterprise v2.48.12

CLI Version (for CLI related issue):

  • N/A

Environment

  • Hasura Enterprise Edition
  • Kubernetes deployment
  • Pod resources:
    • requests:
      • memory: 4608Mi
      • cpu: 1000m
    • limits:
      • memory: 8048Mi
      • cpu: 8000m

Database source configuration:

- name: default
  kind: postgres
  configuration:
    connection_info:
      database_url:
        from_env: HASURA_GRAPHQL_DATABASE_URL
      isolation_level: read-committed
      pool_settings:
        connection_lifetime: 120
        idle_timeout: 180
        max_connections: 5
        retries: 1
      use_prepared_statements: false
    read_replicas:
      - database_url:
          from_env: HASURA_GRAPHQL_READ_REPLICA_URL
        isolation_level: read-committed
        pool_settings:
          connection_lifetime: 120
          max_connections: 20
        use_prepared_statements: false
  tables: "!include default/tables/tables.yaml"
  functions: "!include default/functions/functions.yaml"
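For reference, some quick arithmetic on the pool settings above (a sketch, not taken from the issue): the per-pod ceiling on server-side connections is the primary pool's `max_connections` plus the replica pool's, which can be useful when checking whether memory growth tracks connection count.

```shell
# Per-pod connection ceiling implied by the pool_settings above.
primary_max=5     # primary pool max_connections
replica_max=20    # read replica pool max_connections
echo "max server connections per pod: $((primary_max + replica_max))"
# max server connections per pod: 25
```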

What is the current behaviour?

We are seeing sustained memory growth over time in Hasura Enterprise v2.48.12.

Observed behaviour:

  • pod memory rises steadily over time
  • latency worsens as memory rises
  • restarting the pod (or an OOM kill) drops memory back down
  • the pattern then repeats

We initially wanted to rule out container-level working-set / page-cache growth, but inspection inside the pod shows the memory is genuinely held by the graphql-engine process itself.

Evidence collected from inside the pod:

cgroup memory:

$ cat /sys/fs/cgroup/memory.current
5790588928

$ cat /sys/fs/cgroup/memory.peak
7086899200

cgroup breakdown:

$ cat /sys/fs/cgroup/memory.stat
anon 5637496832
file 77258752
kernel 29048832
shmem 0
inactive_anon 5637431296
active_anon 28672
inactive_file 43024384
active_file 34234368
...

process memory:

$ cat /proc/101/smaps_rollup
Rss:             5520072 kB
Pss:             5518638 kB
Pss_Dirty:       5507744 kB
Pss_Anon:        5507744 kB
Pss_File:          10894 kB
Private_Dirty:   5507744 kB
Anonymous:       5507744 kB
Swap:                  0 kB

Interpretation of the above:

  • container memory is about 5.79 GB at sample time
  • Hasura process RSS is about 5.52 GB
  • almost all of that is anonymous private dirty memory
  • file cache is small
  • shmem is zero
  • this does not look like page cache, tmpfs, kernel memory, or another helper process

So this appears to be real in-process memory growth in graphql-engine, not just a misleading container metric.
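The interpretation above can be cross-checked with some quick arithmetic on the sampled values (hard-coded here so the calculation is reproducible from the numbers in this report):

```shell
# Cross-check: what share of cgroup memory is anonymous / process RSS?
cgroup_current=5790588928   # /sys/fs/cgroup/memory.current, bytes
cgroup_anon=5637496832      # "anon" from memory.stat, bytes
proc_rss_kb=5520072         # Rss from /proc/101/smaps_rollup, kB

awk -v cur="$cgroup_current" -v anon="$cgroup_anon" -v rss="$proc_rss_kb" 'BEGIN {
  printf "anon share of cgroup memory: %.1f%%\n", 100 * anon / cur
  printf "process RSS share of cgroup: %.1f%%\n", 100 * (rss * 1024) / cur
}'
# anon share of cgroup memory: 97.4%
# process RSS share of cgroup: 97.6%
```

So roughly 97% of the container's memory is anonymous memory attributable to the graphql-engine process, leaving little room for a page-cache or sidecar explanation.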

What is the expected behaviour?

We would expect Hasura memory usage to remain broadly stable under steady workload, or at least not show repeated sustained growth that:

  • increases latency over time
  • requires pod restarts to recover memory
  • appears as mostly anonymous private dirty process memory

How to reproduce the issue?

  1. Run Hasura Enterprise v2.48.12 in Kubernetes with the resource settings and database pool configuration above.
  2. Allow normal production-like workload to run over time.
  3. Observe that pod/process memory rises steadily, latency worsens as memory rises, and restarting the pod resets memory usage.
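To observe step 3, a sampler along these lines can be run inside the pod (a sketch, assuming cgroup v2 paths and that PID 101 is the graphql-engine process, as in this pod; it prints "n/a" where a path is unavailable):

```shell
# Sample cgroup memory and process RSS every $2 seconds, $1 times.
sample_memory() {
  count=$1; interval=$2
  i=0
  while [ "$i" -lt "$count" ]; do
    cur=$(cat /sys/fs/cgroup/memory.current 2>/dev/null || echo n/a)
    rss=$(awk '/^Rss:/ {print $2 " kB"}' /proc/101/smaps_rollup 2>/dev/null || echo n/a)
    echo "$(date -u +%Y-%m-%dT%H:%M:%SZ) memory.current=$cur rss=$rss"
    i=$((i + 1))
    sleep "$interval"
  done
}

# e.g. one sample per minute for an hour: sample_memory 60 60
sample_memory 2 1
```

Plotting the `memory.current` column over a few hours of steady workload should show the monotonic growth described above.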

Screenshots or Screencast


Please provide any traces or logs that could help here.

Memory diagnostics gathered from inside the pod:

$ cat /sys/fs/cgroup/memory.current
5790588928

$ cat /sys/fs/cgroup/memory.peak
7086899200

$ cat /sys/fs/cgroup/memory.stat
anon 5637496832
file 77258752
kernel 29048832
kernel_stack 1130496
pagetables 17580032
sec_pagetables 0
percpu 2448
sock 0
vmalloc 65536
shmem 0
file_mapped 14290944
file_dirty 0
file_writeback 0
swapcached 0
anon_thp 0
file_thp 0
shmem_thp 0
inactive_anon 5637431296
active_anon 28672
inactive_file 43024384
active_file 34234368
unevictable 0
slab_reclaimable 4948768
slab_unreclaimable 5222592
slab 10171360
workingset_refault_anon 0
workingset_refault_file 10913
workingset_activate_anon 0
workingset_activate_file 0
workingset_restore_anon 0
workingset_restore_file 0
workingset_nodereclaim 0
pgscan 0
pgsteal 0
pgscan_kswapd 0
pgscan_direct 0
pgscan_khugepaged 0
pgsteal_kswapd 0
pgsteal_direct 0
pgsteal_khugepaged 0
pgfault 59022249
pgmajfault 191
pgrefill 0
pgactivate 9843
pgdeactivate 0
pglazyfree 0
pglazyfreed 0
thp_fault_alloc 26
thp_collapse_alloc 0

$ cat /proc/101/smaps_rollup
00200000-7ffdcf679000 ---p 00000000 00:00 0                              [rollup]
Rss:             5520072 kB
Pss:             5518638 kB
Pss_Dirty:       5507744 kB
Pss_Anon:        5507744 kB
Pss_File:          10894 kB
Pss_Shmem:             0 kB
Shared_Clean:       2064 kB
Shared_Dirty:          0 kB
Private_Clean:     10264 kB
Private_Dirty:   5507744 kB
Referenced:      5520072 kB
Anonymous:       5507744 kB
KSM:                   0 kB
LazyFree:              0 kB
AnonHugePages:         0 kB
ShmemPmdMapped:        0 kB
FilePmdMapped:         0 kB
Shared_Hugetlb:        0 kB
Private_Hugetlb:       0 kB
Swap:                  0 kB
SwapPss:               0 kB
Locked:                0 kB

Any possible solutions/workarounds you're aware of?

Current workaround:

  • restarting the pod drops memory back down temporarily
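One way to make that workaround less disruptive (a sketch of our own, not a Hasura recommendation) is a periodic check, e.g. from a sidecar or CronJob, that flags the pod for restart before the kernel OOM-kills it; the 7 GiB threshold here is our choice, sitting below the 8048Mi limit above.

```shell
# Flag the pod for a controlled restart before the OOM killer fires.
THRESHOLD=$((7 * 1024 * 1024 * 1024))   # 7 GiB in bytes, below the 8048Mi limit
current=$(cat /sys/fs/cgroup/memory.current 2>/dev/null || echo 0)
if [ "$current" -ge "$THRESHOLD" ]; then
  echo "memory.current=$current >= $THRESHOLD: restart pod"
else
  echo "memory.current=$current below threshold"
fi
```

This only smooths the restart; it does not address the underlying growth.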

Keywords

hasura enterprise 2.48.12 memory growth
hasura graphql-engine memory leak
hasura high anonymous memory
hasura private dirty memory
hasura kubernetes memory rises over time
hasura latency increases with memory
hasura smaps_rollup anonymous memory
hasura cgroup memory anon

Metadata

Assignees: no one assigned

Labels: k/bug (Something isn't working)
