Managing Deep Learning Experiments

The Deep Learning experiment lifecycle generates a rich set of data artifacts, e.g., large datasets, complex model architectures, varied hyperparameters, learned weights, and training logs. To produce an effective model, a researcher often has to iterate over multiple scripts, which makes complex experiments difficult to reproduce.

Lab offers a clean and standardised interface for managing the many moving parts of a Deep Learning experiment.

MNIST Example

Consider the following lab training script. Let’s set up our hyperparameters and the training and test sets (the test set also serves as validation data during training):

import keras
from keras.datasets import mnist
from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.optimizers import RMSprop
from keras.callbacks import TensorBoard

import tempfile

from sklearn.metrics import accuracy_score, precision_score

from lab.experiment import Experiment

BATCH_SIZE = 128
EPOCHS = 20
CHECKPOINT_PATH = 'tf/weights'
num_classes = 10


# the data, split between train and test sets
(x_train, y_train), (x_test, y_test) = mnist.load_data()

x_train = x_train.reshape(60000, 784)
x_test = x_test.reshape(10000, 784)
x_train = x_train.astype('float32')
x_test = x_test.astype('float32')
x_train /= 255
x_test /= 255
print(x_train.shape[0], 'train samples')
print(x_test.shape[0], 'test samples')

# convert class vectors to binary class matrices
y_train = keras.utils.to_categorical(y_train, num_classes)
y_test = keras.utils.to_categorical(y_test, num_classes)
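
To make the preprocessing steps concrete, the sketch below repeats them on a tiny synthetic batch using only NumPy. The random images and the helper name one_hot are made up for illustration; one_hot is meant to mirror what keras.utils.to_categorical produces for integer labels, and is not part of Keras or lab.

```python
import numpy as np


def one_hot(labels, num_classes):
    """Return a (len(labels), num_classes) float32 matrix with a single 1 per row."""
    labels = np.asarray(labels, dtype=int)
    encoded = np.zeros((labels.shape[0], num_classes), dtype="float32")
    encoded[np.arange(labels.shape[0]), labels] = 1.0
    return encoded


# Three fake 28x28 greyscale images with pixel values in [0, 255]
rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(3, 28, 28), dtype=np.uint8)

# Flatten each image to a 784-vector and rescale the pixels to [0, 1]
flat = images.reshape(3, 784).astype("float32") / 255

# Turn integer class labels into binary class matrices
targets = one_hot([5, 0, 9], num_classes=10)
print(flat.shape, targets.shape)
```

Each row of targets contains a single 1 in the column of its class, which is the label format expected by a softmax output trained with categorical cross-entropy.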

Set up a simple model and train:

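# A simple fully connected network (assumed architecture; the layer sizes
# are illustrative and use the layers imported above)
model = Sequential()
model.add(Dense(512, activation='relu', input_shape=(784,)))
model.add(Dropout(0.2))
model.add(Dense(512, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(num_classes, activation='softmax'))

model.compile(loss='categorical_crossentropy',
              optimizer=RMSprop(),
              metrics=['accuracy'])

e = Experiment()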


@e.start_run
def train():

    # Create a temporary directory for tensorboard logs
    output_dir = tempfile.mkdtemp()
    print("Writing TensorBoard events locally to %s\n" % output_dir)
    tensorboard = TensorBoard(log_dir=output_dir)

    # During Experiment execution, tensorboard can be viewed through:
    # tensorboard --logdir=[output_dir]

    model.fit(x_train, y_train,
              batch_size=BATCH_SIZE,
              epochs=EPOCHS,
              verbose=1,
              validation_data=(x_test, y_test),
              callbacks=[tensorboard])

    model.save_weights(CHECKPOINT_PATH)

    y_prob = model.predict(x_test)
    y_classes = y_prob.argmax(axis=-1)
    actual = y_test.argmax(axis=-1)

    accuracy = accuracy_score(y_true=actual, y_pred=y_classes)
    precision = precision_score(y_true=actual, y_pred=y_classes,
                                average='macro')

    # Log artifacts: TensorBoard events and model weights
    e.log_artifacts('tensorboard', output_dir)
    e.log_artifacts('weights', CHECKPOINT_PATH)

    # Log all metrics
    e.log_metric('accuracy_score', accuracy)
    e.log_metric('precision_score', precision)

    # Log parameters
    e.log_parameter('batch_size', BATCH_SIZE)
    e.log_parameter('epochs', EPOCHS)
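
The two logged metrics can be checked by hand. The following sketch, using only NumPy and made-up probabilities for four samples and three classes, shows how argmax converts probabilities and one-hot targets into labels, and how accuracy and macro-averaged precision follow from them. The helper macro_precision is hypothetical and only approximates sklearn's precision_score with average='macro': it averages over a fixed number of classes and scores a class that is never predicted as 0.

```python
import numpy as np


def macro_precision(actual, predicted, num_classes):
    """Unweighted mean of per-class precision; classes never predicted score 0."""
    per_class = []
    for c in range(num_classes):
        predicted_c = predicted == c
        true_positives = np.sum(predicted_c & (actual == c))
        total_predicted = np.sum(predicted_c)
        per_class.append(true_positives / total_predicted if total_predicted else 0.0)
    return float(np.mean(per_class))


# Hypothetical class probabilities for four samples and three classes
y_prob = np.array([[0.7, 0.2, 0.1],
                   [0.1, 0.8, 0.1],
                   [0.3, 0.4, 0.3],
                   [0.2, 0.2, 0.6]])

# One-hot ground truth for the same samples
y_true = np.array([[1, 0, 0],
                   [0, 1, 0],
                   [1, 0, 0],
                   [0, 0, 1]])

y_classes = y_prob.argmax(axis=-1)  # predicted labels: [0, 1, 1, 2]
actual = y_true.argmax(axis=-1)     # true labels:      [0, 1, 0, 2]

accuracy = float(np.mean(y_classes == actual))
precision = macro_precision(actual, y_classes, num_classes=3)
print(accuracy, precision)
```

Three of the four predictions are correct, so accuracy is 0.75. Per-class precision is 1.0, 0.5 and 1.0, so the macro average is 2.5 / 3.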

When training on distributed systems with Horovod, the model.fit step can be moved into a separate file, say horovod-train.py, and launched directly from the train() function:

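import subprocess

# Command-line arguments must all be strings
args = ['-np', str(8),  # 8 GPUs
        '-H', 'localhost:8', 'python',
        'horovod-train.py',
        '--checkpoint', CHECKPOINT_PATH,
        '--batch-size', str(BATCH_SIZE),
        '--epochs', str(EPOCHS)]

# Assumes the horovodrun launcher is available on the PATH
subprocess.run(['horovodrun'] + args, check=True)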

Note that your Horovod script must accept, as command-line arguments, the basic model hyperparameters that you wish to log downstream.
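
One way to accept these hyperparameters is with argparse from the Python standard library. The sketch below shows what the argument handling at the top of horovod-train.py might look like; the flag names follow the launcher call above, while the function name parse_args and the default values are assumptions made for illustration.

```python
import argparse


def parse_args(argv=None):
    """Parse the hyperparameters passed on the command line by the launcher."""
    parser = argparse.ArgumentParser(description="Horovod training script (sketch)")
    parser.add_argument("--checkpoint", required=True,
                        help="path where the model weights are saved")
    parser.add_argument("--batch-size", type=int, default=128,
                        help="batch size used for training")
    parser.add_argument("--epochs", type=int, default=20,
                        help="number of training epochs")
    return parser.parse_args(argv)


# An explicit argument list is used here for demonstration; inside the real
# script, parse_args() with no arguments reads the values from sys.argv.
args = parse_args(["--checkpoint", "tf/weights",
                   "--batch-size", "128",
                   "--epochs", "20"])
print(args.checkpoint, args.batch_size, args.epochs)
```

argparse converts --batch-size into the attribute args.batch_size and casts both numeric flags to int, so the training code can use them directly while the calling script logs the same values with e.log_parameter.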