Closed
Labels
- aws_batch
- bug: Something isn't working
- docker
- good first issue: Good for newcomers
- kubernetes: kubernetes and volcano schedulers
- module: runner: issues related to the torchx.runner and torchx.scheduler modules
Description
🐛 Bug
When running models that load large datasets through PyTorch dataloaders, /dev/shm must be large enough to hold the data transferred between processes. Docker and Kubernetes default /dev/shm to 64MB, which is far too small. Increasing the size doesn't consume memory until it is actually allocated, so we should be safe to set the size to the full memory allocated for the container.
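To confirm what size a container actually received, a quick check from inside it can help. The following is a minimal sketch using only the Python standard library; the helper name `shm_report` and the `at_or_below_default` field are illustrative and not part of torchx, and the 64 MiB constant simply restates the default mentioned above.

```python
import os
import shutil

# Default /dev/shm size cited in this issue for Docker/K8S (assumed to mean 64 MiB).
DOCKER_DEFAULT_SHM_BYTES = 64 * 1024 * 1024


def shm_report(path="/dev/shm"):
    """Return total/used/free bytes for the filesystem at `path`, or None if it is absent."""
    if not os.path.isdir(path):
        return None
    usage = shutil.disk_usage(path)
    return {
        "total": usage.total,
        "used": usage.used,
        "free": usage.free,
        # True when the mount is no larger than the container default.
        "at_or_below_default": usage.total <= DOCKER_DEFAULT_SHM_BYTES,
    }


if __name__ == "__main__":
    report = shm_report()
    if report is None:
        print("/dev/shm not found on this system")
    else:
        mib = 2 ** 20
        print(f"/dev/shm total: {report['total'] / mib:.0f} MiB, "
              f"free: {report['free'] / mib:.0f} MiB")
        if report["at_or_below_default"]:
            print("Shared memory is at or below the 64 MiB default")
```

Running this inside the container before starting training shows whether the scheduler applied a larger shared-memory size.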
Module (check all that applies):
- torchx.spec
- torchx.component
- torchx.apps
- torchx.runtime
- torchx.cli
- torchx.schedulers
- torchx.pipelines
- torchx.aws
- torchx.examples
- other
To Reproduce
Steps to reproduce the behavior:
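import torch
from torch.utils.data import Dataset, DataLoader


class BigDataset(Dataset):
    """Synthetic dataset whose samples are large enough to exhaust a small /dev/shm."""

    def __init__(self, size):
        self.size = size

    def __len__(self):
        return 100

    def __getitem__(self, idx):
        # Each sample is a (1, size) tensor of zeros.
        return torch.zeros((1, self.size))


dataset = BigDataset(100_000_000)
# num_workers > 0 makes worker processes pass batches back to the main process.
dataloader = DataLoader(dataset, batch_size=4, num_workers=4)
for i, x in enumerate(dataloader):
    print(i, x.shape)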
Save the script above as large-shm.py, then run:
torchx run --scheduler local_docker --wait --log dist.ddp -j 1x1 --script large-shm.py
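As a rough size estimate for the reproduction above, the arithmetic can be sketched in plain Python. This assumes `torch.zeros` uses the default float32 dtype (4 bytes per element) and that each collated batch passes through shared memory on its way from a worker to the main process; the helper names are illustrative only. Prefetching across several workers could push the real peak higher than this single-batch figure.

```python
FLOAT32_BYTES = 4  # assumption: default dtype torch.float32
DOCKER_DEFAULT_SHM_BYTES = 64 * 1024 * 1024  # 64 MiB default cited in this issue


def sample_bytes(size, itemsize=FLOAT32_BYTES):
    """Bytes in one sample of shape (1, size)."""
    return 1 * size * itemsize


def batch_bytes(size, batch_size, itemsize=FLOAT32_BYTES):
    """Bytes in one collated batch of shape (batch_size, 1, size)."""
    return batch_size * sample_bytes(size, itemsize)


if __name__ == "__main__":
    per_sample = sample_bytes(100_000_000)
    per_batch = batch_bytes(100_000_000, batch_size=4)
    print(per_sample)  # 400000000
    print(per_batch)   # 1600000000
    print(per_batch > DOCKER_DEFAULT_SHM_BYTES)  # True
```

Under these assumptions a single batch is about 1.6 GB, well above the default /dev/shm size.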
Expected behavior
The script runs to completion and prints the index and shape of each batch.
Environment
tristanr@tristanr-arch2 ~> docker version
Client:
 Version:           20.10.12
 API version:       1.41
 Go version:        go1.17.5
 Git commit:        e91ed5707e
 Built:             Mon Dec 13 22:31:40 2021
 OS/Arch:           linux/amd64
 Context:           default
 Experimental:      true

Server:
 Engine:
  Version:          20.10.12
  API version:      1.41 (minimum version 1.12)
  Go version:       go1.17.5
  Git commit:       459d0dfbbb
  Built:            Mon Dec 13 22:30:43 2021
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          v1.6.0
  GitCommit:        39259a8f35919a0d02c9ecc2871ddd6ccf6a7c6e.m
 runc:
  Version:          1.1.0
  GitCommit:        v1.1.0-0-g067aaf85
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0
Additional context