.. _dlexperiments:

Managing Deep Learning Experiments
==================================

The Deep Learning experiment lifecycle generates a rich set of data artifacts, e.g., expansive datasets, complex model architectures, varied hyperparameters, learned weights, and training logs. To produce an effective model, a researcher often has to iterate over multiple scripts, which makes complex experiments hard to reproduce.

Lab offers a clean and standardised interface for managing the many moving parts of a Deep Learning experiment.

MNIST Example
~~~~~~~~~~~~~~~~

Consider the following Lab training script. Let's set up our hyperparameters and the training and test sets (the test set is also used as validation data during training):

.. code-block:: python

    import keras
    from keras.datasets import mnist
    from keras.models import Sequential
    from keras.layers import Dense, Dropout
    from keras.optimizers import RMSprop
    from keras.callbacks import TensorBoard

    import tempfile

    from sklearn.metrics import accuracy_score, precision_score

    from lab.experiment import Experiment

    BATCH_SIZE = 128
    EPOCHS = 20
    CHECKPOINT_PATH = 'tf/weights'
    num_classes = 10

    # the data, split between train and test sets
    (x_train, y_train), (x_test, y_test) = mnist.load_data()

    # flatten each 28x28 image into a 784-dimensional vector
    x_train = x_train.reshape(60000, 784)
    x_test = x_test.reshape(10000, 784)

    # scale pixel values to the [0, 1] range
    x_train = x_train.astype('float32')
    x_test = x_test.astype('float32')
    x_train /= 255
    x_test /= 255
    print(x_train.shape[0], 'train samples')
    print(x_test.shape[0], 'test samples')

    # convert class vectors to binary class matrices
    y_train = keras.utils.to_categorical(y_train, num_classes)
    y_test = keras.utils.to_categorical(y_test, num_classes)

Set up a simple model and train:

.. code-block:: python

    e = Experiment()

    @e.start_run
    def train():

        # Create a temporary directory for tensorboard logs
        output_dir = tempfile.mkdtemp()
        print("Writing TensorBoard events locally to %s\n" % output_dir)
        tensorboard = TensorBoard(log_dir=output_dir)

        # NOTE: the model definition is not shown in this listing. A small
        # MLP built from the layers imported above is assumed here so that
        # the script is complete.
        model = Sequential([
            Dense(512, activation='relu', input_shape=(784,)),
            Dropout(0.2),
            Dense(512, activation='relu'),
            Dropout(0.2),
            Dense(num_classes, activation='softmax'),
        ])
        model.compile(loss='categorical_crossentropy',
                      optimizer=RMSprop(),
                      metrics=['accuracy'])

        # During Experiment execution, tensorboard can be viewed through:
        # tensorboard --logdir=[output_dir]
        model.fit(x_train, y_train,
                  batch_size=BATCH_SIZE,
                  epochs=EPOCHS,
                  verbose=1,
                  validation_data=(x_test, y_test),
                  callbacks=[tensorboard])

        model.save_weights(CHECKPOINT_PATH)

        # Turn predicted probabilities and one-hot labels into class indices
        y_prob = model.predict(x_test)
        y_classes = y_prob.argmax(axis=-1)
        actual = y_test.argmax(axis=-1)

        accuracy = accuracy_score(y_true=actual, y_pred=y_classes)
        precision = precision_score(y_true=actual, y_pred=y_classes,
                                    average='macro')

        # Log tensorboard
        e.log_artifacts('tensorboard', output_dir)
        e.log_artifacts('weights', CHECKPOINT_PATH)

        # Log all metrics
        e.log_metric('accuracy_score', accuracy)
        e.log_metric('precision_score', precision)

        # Log parameters
        e.log_parameter('batch_size', BATCH_SIZE)
        e.log_parameter('epochs', EPOCHS)

When training on distributed systems with Horovod, the ``model.fit`` step can be moved into a separate file, say ``horovod-train.py``, and called directly from the ``train()`` method:

.. code-block:: python

    import subprocess

    # Every element must be a string before it is handed to subprocess
    args = ['-np', str(8),  # 8 GPUs
            '-H', 'localhost:8',
            'python', 'horovod-train.py',
            '--checkpoint', CHECKPOINT_PATH,
            '--batch-size', str(BATCH_SIZE),
            '--epochs', str(EPOCHS)]

Note that your Horovod script must accept, as command-line arguments, the model hyperparameters that you wish to log downstream.
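One way to make the Horovod script accept these hyperparameters is a small ``argparse`` interface. The sketch below is illustrative only: ``parse_args`` and ``build_launch_args`` are hypothetical helper names, the defaults simply mirror the constants used earlier, and the choice of launcher (for example ``horovodrun`` or ``mpirun``) is an assumption, since it is not stated above.

.. code-block:: python

    import argparse


    def parse_args(argv=None):
        """Parse the hyperparameters that horovod-train.py would receive."""
        parser = argparse.ArgumentParser(
            description='Horovod training script (sketch)')
        parser.add_argument('--checkpoint', required=True,
                            help='path where the model weights are saved')
        parser.add_argument('--batch-size', type=int, default=128)
        parser.add_argument('--epochs', type=int, default=20)
        return parser.parse_args(argv)


    def build_launch_args(checkpoint, batch_size, epochs, num_procs=8):
        """Build the launcher argument list; every element is a string."""
        return ['-np', str(num_procs),
                '-H', 'localhost:%d' % num_procs,
                'python', 'horovod-train.py',
                '--checkpoint', checkpoint,
                '--batch-size', str(batch_size),
                '--epochs', str(epochs)]


    # Round trip: what the caller builds is what the script parses.
    launch_args = build_launch_args('tf/weights', 128, 20)
    script_args = launch_args[launch_args.index('horovod-train.py') + 1:]
    parsed = parse_args(script_args)
    print(parsed.checkpoint, parsed.batch_size, parsed.epochs)

Inside ``train()``, the list would then be passed to the chosen launcher through ``subprocess``, and the same values logged with ``e.log_parameter`` so that the recorded hyperparameters match what the distributed job actually used.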