docker/k8s/batch: increase /dev/shm size for larger datasets #428

@d4l3k

🐛 Bug

When running models that load large datasets via PyTorch dataloaders, /dev/shm must be large enough for data to be transferred between processes. Docker and Kubernetes default /dev/shm to 64MB, which is much too small. Increasing the size doesn't consume memory until it's actually allocated, so we should be safe to set the size to the full memory allocated for the container.
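
To make the size mismatch concrete, here is a rough sketch of how one might estimate the shared memory a batch needs and compare it with what /dev/shm offers. The helper names `batch_bytes` and `shm_is_sufficient` are hypothetical and not part of TorchX or PyTorch; the estimate assumes float32 tensors and ignores any per-tensor overhead.

```python
import shutil


def batch_bytes(batch_size: int, elements_per_item: int, bytes_per_element: int = 4) -> int:
    """Approximate payload size of one batch (float32 by default)."""
    return batch_size * elements_per_item * bytes_per_element


def shm_is_sufficient(required_bytes: int, path: str = "/dev/shm") -> bool:
    """Return True if the filesystem at `path` has at least `required_bytes` free."""
    usage = shutil.disk_usage(path)
    return usage.free >= required_bytes


if __name__ == "__main__":
    # A batch of 4 items with 100,000,000 float32 values each.
    needed = batch_bytes(batch_size=4, elements_per_item=100_000_000)
    print(f"batch needs about {needed / 1e6:.0f} MB")
    print("/dev/shm large enough:", shm_is_sufficient(needed))
```

With the default 64MB /dev/shm, the check would be expected to fail for a batch of this size, since the payload alone is about 1600 MB.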

Module (check all that apply):

  • torchx.spec
  • torchx.component
  • torchx.apps
  • torchx.runtime
  • torchx.cli
  • torchx.schedulers
  • torchx.pipelines
  • torchx.aws
  • torchx.examples
  • other

To Reproduce

Steps to reproduce the behavior:

import torch
from torch.utils.data import Dataset, DataLoader

class BigDataset(Dataset):

    def __init__(self, size):
        self.size = size

    def __len__(self):
        return 100

    def __getitem__(self, idx):
        return torch.zeros((1,self.size))

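# Each item holds 100M float32 values (~400 MB), so a single item already
# exceeds the 64 MB default /dev/shm used to pass data from worker processes.
dataset = BigDataset(100_000_000)
dataloader = DataLoader(dataset, batch_size=4, num_workers=4)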

for i, x in enumerate(dataloader):
    print(i, x.shape)

Save the script above as large-shm.py, then run:

torchx run --scheduler local_docker --wait --log dist.ddp -j 1x1 --script large-shm.py

Expected behavior

The script runs to completion, printing the shape of each batch.

Environment

tristanr@tristanr-arch2 ~> docker version
Client:
 Version:           20.10.12
 API version:       1.41
 Go version:        go1.17.5
 Git commit:        e91ed5707e
 Built:             Mon Dec 13 22:31:40 2021
 OS/Arch:           linux/amd64
 Context:           default
 Experimental:      true

Server:
 Engine:
  Version:          20.10.12
  API version:      1.41 (minimum version 1.12)
  Go version:       go1.17.5
  Git commit:       459d0dfbbb
  Built:            Mon Dec 13 22:30:43 2021
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          v1.6.0
  GitCommit:        39259a8f35919a0d02c9ecc2871ddd6ccf6a7c6e.m
 runc:
  Version:          1.1.0
  GitCommit:        v1.1.0-0-g067aaf85
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0

Additional context

https://stackoverflow.com/questions/46085748/define-size-for-dev-shm-on-container-engine/46434614#46434614
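
For Kubernetes, the usual workaround is to mount a memory-backed `emptyDir` volume at /dev/shm. The sketch below builds the relevant pod spec fragments as plain Python dicts. The helper `shm_volume_patch` and the volume name `dshm` are hypothetical, and this is not necessarily how TorchX would implement the fix; `sizeLimit` is optional and how strictly it is enforced depends on the cluster version. For plain Docker, the equivalent knob is the `--shm-size` flag on `docker run`.

```python
from typing import Any, Dict, Optional


def shm_volume_patch(size_limit: Optional[str] = None) -> Dict[str, Any]:
    """Build pod spec fragments that mount a memory-backed volume at /dev/shm.

    `size_limit` is a Kubernetes quantity string such as "8Gi"; when omitted,
    no explicit limit is set on the volume.
    """
    empty_dir: Dict[str, Any] = {"medium": "Memory"}
    if size_limit is not None:
        empty_dir["sizeLimit"] = size_limit
    return {
        # Goes under the pod's `spec.volumes`.
        "volumes": [{"name": "dshm", "emptyDir": empty_dir}],
        # Goes under each container's `volumeMounts`.
        "volumeMounts": [{"name": "dshm", "mountPath": "/dev/shm"}],
    }


if __name__ == "__main__":
    import json

    print(json.dumps(shm_volume_patch("8Gi"), indent=2))
```

Passing the container's memory request as `size_limit` would match the proposal above of sizing /dev/shm to the full memory allocated for the container.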

Labels

aws_batch, bug, docker, good first issue, kubernetes, module: runner