Description
tl;dr: SQLite will replace logs.json
Our current implementation
We use a Logger object that stores data as lists of "values" associated with "keys" in a Python dictionary. This dictionary lives in RAM. At the end of a train or eval epoch, Logger creates/overwrites a logs.json file in the experiment directory.
logs/myexperiment/logs.json
{
  "train_epoch.epoch": [0, 1, 2, 3, 4, 5],
  "train_epoch.acc_top1": [0.0, 5.7, 13.8, 20.4, 28.1, 37.9]
}
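For reference, the current behaviour is roughly the following (a minimal sketch, not the actual Logger API; method names are illustrative):

import json

class Logger:
    def __init__(self, path):
        self.path = path  # e.g. logs/myexperiment/logs.json
        self.values = {}  # everything lives in RAM until flush()

    def log(self, key, value):
        self.values.setdefault(key, []).append(value)

    def flush(self):
        # rewrites the whole file at each train/eval epoch boundary
        with open(self.path, 'w') as f:
            json.dump(self.values, f)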
Its problems
- If the code crashes before a flush, the data is lost, and we specifically want to use Logger to monitor things such as CPU and GPU memory usage right before a crash!
- We rewrite the full JSON file every time a new value is added.
- We reload the full JSON file every time we want to visualize anything.
Our constraints
- We want to keep our logs in the experiment directory (no SQL/NoSQL database servers; SQLite maybe?).
- We want to write new values only (for instance, at epoch 10 we write only the values of epoch 10).
- We want concurrent reads and writes (at least on different keys).
Some proposals
The following tools store the data on the file system (not in RAM).
h5py (one HDF5 file)
logs/myexperiment/logs.h5
Pros:
- Uses numpy
- Easy to access:
data['train_epoch.epoch'][10]
Cons:
- Extendible datasets (when you don't specify the size up front) seem to require an explicit "resize" before each append; see the sketch after this list.
- We encountered a lot of bugs in the past with HDF5 when reading or writing from multiple threads/processes.
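For the record, appending to an extendible dataset looks roughly like this (a sketch; the dataset name, dtype and file name are illustrative):

import h5py

with h5py.File('logs/myexperiment/logs.h5', 'a') as f:
    if 'train_epoch.acc_top1' not in f:
        # maxshape=(None,) is what makes the dataset extendible
        f.create_dataset('train_epoch.acc_top1', shape=(0,),
                         maxshape=(None,), dtype='f8')
    dataset = f['train_epoch.acc_top1']
    dataset.resize((dataset.shape[0] + 1,))  # explicit resize before each append
    dataset[-1] = 37.9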
LMDB
logs/myexperiment/logs/train_epoch.epoch.lmdb
Pros:
Cons:
- Cumbersome to use (keys and values are raw bytes); see the sketch below.
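To illustrate the point (a sketch with illustrative names): every scalar has to be packed to bytes and unpacked by hand.

import lmdb
import struct

env = lmdb.open('logs/myexperiment/logs/train_epoch.epoch.lmdb')
with env.begin(write=True) as txn:
    # key = epoch index, value = metric; both packed to bytes manually
    txn.put(struct.pack('>q', 10), struct.pack('>d', 37.9))
with env.begin() as txn:
    value = struct.unpack('>d', txn.get(struct.pack('>q', 10)))[0]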
netCDF
logs/myexperiment/logs.nc
One CSV (or binary file) per key
logs/myexperiment/logs/train_epoch.epoch.csv
Pros:
- Very easy to understand and track
Cons:
- Creates one file per tracked variable
- Associating different variables for the same time step requires reading different files and aligning them
- Tedious to implement robustly (we would be reinventing the wheel); see the sketch below.
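Appending itself would be trivial (a sketch; the file layout is illustrative); the hard part is all the bookkeeping around it:

import csv

# append-only: one row per new value, nothing is rewritten
with open('logs/myexperiment/logs/train_epoch.acc_top1.csv', 'a', newline='') as f:
    csv.writer(f).writerow([10, 37.9])  # (epoch, value)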
SQLite
logs/myexperiment/logs.sqlite
Pros:
- Can grow big enough for our logs
- Allows easy concurrent reads and writes
- Caching system (TODO: source)
- Binary encoding
- Indexing (easy to read only what we want)
- Meta-data: timestamp, epoch_id, iteration_id
- Fault-tolerant (committed values survive a crash)
Cons:
- Requires a library to read it, and users must know SQL for custom queries/applications (we could add a wrapper over SQLite in Logger; see the sketch below)
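Such a wrapper could look roughly like this (a sketch; the logs table schema is an assumption, not a decided design):

import sqlite3

conn = sqlite3.connect('logs/myexperiment/logs.sqlite')
conn.execute('CREATE TABLE IF NOT EXISTS logs ('
             'key TEXT, step INTEGER, value REAL, '
             'timestamp DATETIME DEFAULT CURRENT_TIMESTAMP)')
conn.execute('CREATE INDEX IF NOT EXISTS idx_key ON logs (key, step)')

def log(key, step, value):
    # only the new value is written, and it is committed immediately
    with conn:  # commits on success, rolls back on error
        conn.execute('INSERT INTO logs (key, step, value) VALUES (?, ?, ?)',
                     (key, step, value))

log('train_epoch.acc_top1', 10, 37.9)

Reading back a single series is then one indexed query, e.g. SELECT value FROM logs WHERE key = 'train_epoch.acc_top1' ORDER BY step.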
Comparing experiments in SQLite
import sqlite3

# one connection per experiment (schema as sketched above)
databases = [sqlite3.connect(f'logs/{experiment}/logs.sqlite')
             for experiment in all_experiments]
for experiment, database in zip(all_experiments, databases):
    for metric in list_of_metrics:
        # SQLite computes the aggregates; pages may already be in its cache
        min_metric, max_metric = database.execute(
            'SELECT MIN(value), MAX(value) FROM logs WHERE key = ?',
            (metric,)).fetchone()
        # ... agglomerate min_metric / max_metric across experiments in Python