Managing Deep Learning Experiments
The deep learning experiment lifecycle generates a rich set of artifacts: large datasets, complex model architectures, varied hyperparameters, learned weights, and training logs. Producing an effective model typically means iterating over multiple scripts, which makes complex experiments difficult to reproduce.
Lab offers a clean and standardised interface for managing the many moving parts of a deep learning experiment.
MNIST Example
Consider the following Lab training script. First, set up the hyperparameters and the training and test sets:
import keras
from keras.datasets import mnist
from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.optimizers import RMSprop
from keras.callbacks import TensorBoard
import os
import tempfile
from sklearn.metrics import accuracy_score, precision_score
from lab.experiment import Experiment
BATCH_SIZE = 128
EPOCHS = 20
CHECKPOINT_PATH = 'tf/weights'
num_classes = 10
# the data, split between train and test sets
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train = x_train.reshape(60000, 784)
x_test = x_test.reshape(10000, 784)
x_train = x_train.astype('float32')
x_test = x_test.astype('float32')
x_train /= 255
x_test /= 255
print(x_train.shape[0], 'train samples')
print(x_test.shape[0], 'test samples')
# convert class vectors to binary class matrices
y_train = keras.utils.to_categorical(y_train, num_classes)
y_test = keras.utils.to_categorical(y_test, num_classes)
Set up a simple model and train. The network below is a standard fully-connected classifier; any Keras model will do:

model = Sequential()
model.add(Dense(512, activation='relu', input_shape=(784,)))
model.add(Dropout(0.2))
model.add(Dense(512, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(num_classes, activation='softmax'))

model.compile(loss='categorical_crossentropy',
              optimizer=RMSprop(),
              metrics=['accuracy'])

e = Experiment()

@e.start_run
def train():
    # Create a temporary directory for TensorBoard logs
    output_dir = tempfile.mkdtemp()
    print("Writing TensorBoard events locally to %s\n" % output_dir)
    tensorboard = TensorBoard(log_dir=output_dir)

    # During Experiment execution, TensorBoard can be viewed through:
    # tensorboard --logdir=[output_dir]
    model.fit(x_train, y_train,
              batch_size=BATCH_SIZE,
              epochs=EPOCHS,
              verbose=1,
              validation_data=(x_test, y_test),
              callbacks=[tensorboard])

    # Make sure the checkpoint directory exists before saving weights
    os.makedirs(os.path.dirname(CHECKPOINT_PATH), exist_ok=True)
    model.save_weights(CHECKPOINT_PATH)

    y_prob = model.predict(x_test)
    y_classes = y_prob.argmax(axis=-1)
    actual = y_test.argmax(axis=-1)

    accuracy = accuracy_score(y_true=actual, y_pred=y_classes)
    precision = precision_score(y_true=actual, y_pred=y_classes,
                                average='macro')

    # Log TensorBoard events and model weights as artifacts
    e.log_artifacts('tensorboard', output_dir)
    e.log_artifacts('weights', CHECKPOINT_PATH)

    # Log all metrics
    e.log_metric('accuracy_score', accuracy)
    e.log_metric('precision_score', precision)

    # Log parameters
    e.log_parameter('batch_size', BATCH_SIZE)
    e.log_parameter('epochs', EPOCHS)
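Once the run completes, the saved checkpoint can be sanity-checked by restoring the logged weights and re-evaluating on the test set. A minimal sketch using the standard Keras API, assuming the same compiled model object as above:

# Restore the logged weights into the compiled model and re-evaluate
model.load_weights(CHECKPOINT_PATH)
loss, acc = model.evaluate(x_test, y_test, verbose=0)
print('Restored model accuracy: %.4f' % acc)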
When training on distributed systems with Horovod, the model.fit step can be moved into a separate file, say horovod-train.py, and launched with horovodrun directly from the train() method:
import subprocess

# Launch 8 Horovod worker processes on the local machine
args = ['horovodrun',
        '-np', '8',                 # 8 GPUs
        '-H', 'localhost:8',
        'python', 'horovod-train.py',
        '--checkpoint', CHECKPOINT_PATH,
        '--batch-size', str(BATCH_SIZE),
        '--epochs', str(EPOCHS)]
subprocess.run(args, check=True)
Note that your Horovod script must accept the basic hyperparameters that you wish to log downstream.
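For instance, the argument parsing inside horovod-train.py might look like the following minimal sketch (flag names mirror those passed to horovodrun above):

import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--checkpoint', type=str, required=True)
parser.add_argument('--batch-size', type=int, default=128)
parser.add_argument('--epochs', type=int, default=20)
args = parser.parse_args()

# args.checkpoint, args.batch_size and args.epochs can now be passed
# to model.fit inside the Horovod script and logged by the parent
# Experiment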