Restore fails after a while with connection error #1991

@dobesv

Report

We are testing backup and restore. When we ran a restore, it was initially able to connect to MongoDB and download files from S3. However, after some time it failed with this error:

Fatal assertion / 2025-06-30T20:39:40.150+00:00, connect err: ping: server selection error: server selection timeout, current topology: { Type: Single, Servers: [{ Addr: localhost:27872, Type: Unknown, Last error: dial tcp [::1]:27872: connect: connection refused }, ] }

More about the problem

These are the final logs:

2025-06-30T20:44:39.000+0000 D [restore/2025-06-30T20:36:00.578371215Z] remove /data/db/index-449-11338991585176808542.wt
2025-06-30T20:44:39.000+0000 D [restore/2025-06-30T20:36:00.578371215Z] remove /data/db/index-477-11338991585176808542.wt
2025-06-30T20:44:39.000+0000 D [restore/2025-06-30T20:36:00.578371215Z] remove /data/db/collection-454-11338991585176808542.wt
2025-06-30T20:44:39.000+0000 D [restore/2025-06-30T20:36:00.578371215Z] remove /data/db/index-496-11338991585176808542.wt
2025-06-30T20:44:39.000+0000 E [restore/2025-06-30T20:36:00.578371215Z] restore: prepare data: connect to mongo: mongo failed with [F] Fatal assertion / 2025-06-30T20:39:40.150+00:00, connect err: ping: server selection error: server selection timeout, current topology: { Type: Single, Servers: [{ Addr: localhost:27872, Type: Unknown, Last error: dial tcp [::1]:27872: connect: connection refused }, ] }
2025-06-30T20:44:39.000+0000 I change stream was closed
2025-06-30T20:44:39.000+0000 D [restore/2025-06-30T20:36:00.578371215Z] hearbeats stopped
2025-06-30T20:44:39.000+0000 D [restore/2025-06-30T20:36:00.578371215Z] uploading ".pbm.restore/2025-06-30T20:36:00.578371215Z/rs.rs0/log/stagingdb-rs0-0.stagingdb-rs0.stagingdb.svc.cluster.local:27017.0.log" [size hint: -1 (unknown); part size: 10485760 (10.00MB)]
2025-06-30T20:44:40.000+0000 D [agentCheckup] deleting agent status
2025-06-30T20:44:40.000+0000 E [pitr] init: get conf: get: server selection error: context canceled, current topology: { Type: ReplicaSetNoPrimary, Servers: [{ Addr: stagingdb-rs0-0.stagingdb-rs0.stagingdb.svc.cluster.local:27017, Type: Unknown, Last error: dial tcp 100.96.145.43:27017: connect: connection refused }, { Addr: stagingdb-rs0-1.stagingdb-rs0.stagingdb.svc.cluster.local:27017, Type: Unknown, Last error: dial tcp 100.109.101.41:27017: connect: connection refused }, { Addr: stagingdb-rs0-2.stagingdb-rs0.stagingdb.svc.cluster.local:27017, Type: Unknown, Last error: dial tcp 100.109.142.172:27017: connect: connection refused }, ] }
2025-06-30T20:44:40.000+0000 I Exit: <nil>
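In case it helps with triage, here is a minimal sketch for pulling the server address and last error out of the "current topology" part of these messages. The regular expression is only based on the log lines quoted above, not on any documented format, and `parse_topology_servers` is a hypothetical helper name of my own.

```python
import re

# Matches one "{ Addr: ..., Type: ..., Last error: ... }" server entry.
# Assumption: the layout is inferred from the log lines above and may
# differ between driver versions.
SERVER_RE = re.compile(
    r"\{ Addr: (?P<addr>[^,]+), Type: (?P<type>[^,]+), Last error: (?P<err>.*?) \}"
)


def parse_topology_servers(message):
    """Return a list of (addr, type, last_error) tuples found in message."""
    return [(m["addr"], m["type"], m["err"]) for m in SERVER_RE.finditer(message)]
```

Running this over the `[pitr]` line should give one tuple per replica set member, which makes it easier to see that every member refused the connection at that point.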

Steps to reproduce

  1. Create a replset cluster (no sharding or mongos)
  2. Put some data in it
  3. Create a physical manual backup
  4. Delete the cluster
  5. Create the cluster empty again
  6. Try to run a restore of the backup
  7. Check whether the restore completes successfully

Versions

  1. Kubernetes v1.30.12
  2. Operator v1.20.1
  3. Database mongodb

Anything else?

I found it strange that it was trying to connect to mongod on port 27872 rather than the usual 27017. Clearly there's something going on behind the scenes that I don't currently understand.
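To check whether anything is actually listening on that port from inside the pod while the restore is running, a simple TCP probe like the one below could be used. This is only a sketch: `port_is_open` is a hypothetical helper, and it assumes a Python interpreter is available in the container, which may not be the case.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection raises OSError (e.g. ConnectionRefusedError)
        # when nothing accepts the connection.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

For example, `port_is_open("localhost", 27872)` returning `False` during the restore would match the "connection refused" error above.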

CR YAML: https://gist.github.com/dobesv/c2727a9ee382ce80638d61bd0d64ca30
