Saving model progress is fundamental to any realistic machine learning workflow. Whether you need to recover from an interruption, deploy a finished model, or simply keep the best version seen during a long training run, you need to know the different ways to save and load effectively.

In this practice section, we'll walk through the common scenarios:

- Using `ModelCheckpoint` to automatically save weights during training.
- Loading those weights into a fresh model instance.
- Saving the entire model after training.
- Loading the complete model for inference or further training.

We'll use a simple model and synthetic data so we can focus purely on the mechanics of saving and loading.

## Setup

First, let's import TensorFlow and the other libraries we need, and generate some simple data for a binary classification problem.

```python
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
import numpy as np
import os
import shutil  # For cleaning up saved files

print(f"Using TensorFlow version: {tf.__version__}")

# Generate synthetic data
def generate_data(num_samples=1000, seed=42):
    # Simple 2D features, linearly separable for simplicity
    np.random.seed(seed)
    X = np.random.rand(num_samples, 2) * 10 - 5
    # Simple linear boundary: y > 0.5*x - 1
    y = (X[:, 1] > 0.5 * X[:, 0] - 1).astype(int)
    return X, y

X_train, y_train = generate_data(1000)
# Use a different seed so the validation set isn't a copy of the
# first 200 training rows.
X_val, y_val = generate_data(200, seed=123)

# Define a simple Sequential model
def build_model():
    model = keras.Sequential(
        [
            layers.Dense(16, activation="relu", input_shape=(2,)),
            layers.Dense(8, activation="relu"),
            layers.Dense(1, activation="sigmoid"),  # Binary classification
        ]
    )
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model

# Create directories for saving models/weights
checkpoint_dir = "./training_checkpoints"
saved_model_dir = "./saved_model"

# Clean up previous runs if they exist
if os.path.exists(checkpoint_dir):
    shutil.rmtree(checkpoint_dir)
if os.path.exists(saved_model_dir):
    shutil.rmtree(saved_model_dir)

os.makedirs(checkpoint_dir)
# saved_model_dir will be created by model.save()
```

## 1. Saving Checkpoints During Training

The `ModelCheckpoint` callback automatically saves your model during training. You can configure it to save only the weights or the entire model, and to save at every epoch or only when a monitored metric improves.
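For instance, a "save everything, every epoch" configuration might look like the sketch below. The `filepath` here is illustrative, and the behavior assumes TF 2.x's bundled Keras, where a path without an `.h5` extension produces a SavedModel directory per epoch:

```python
# Hypothetical alternative: save the full model at the end of every epoch,
# regardless of whether any monitored metric improved.
every_epoch_callback = keras.callbacks.ModelCheckpoint(
    filepath="./all_checkpoints/ckpt_epoch_{epoch:02d}",  # illustrative path
    save_weights_only=False,  # save architecture + weights + optimizer state
    save_best_only=False,     # keep a checkpoint for every epoch
    verbose=1,
)
```

This keeps every epoch's state, at the cost of much more disk usage than the best-only approach we use next.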
In this walkthrough, though, we'll save only the weights, and only when the validation loss improves.

```python
model = build_model()

# Configure the ModelCheckpoint callback.
# We save weights only, based on validation loss.
# The filename includes the epoch number and the validation loss.
# Note: with TF 2.x's bundled Keras, a filepath without an .h5 extension
# saves weights in the TensorFlow checkpoint format, which is what
# tf.train.latest_checkpoint expects below. (Keras 3 instead requires
# weights-only paths to end in ".weights.h5".)
checkpoint_path = os.path.join(
    checkpoint_dir, "ckpt_epoch_{epoch:02d}_val_loss_{val_loss:.2f}"
)

checkpoint_callback = keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_path,
    save_weights_only=True,  # Save only the model's weights
    monitor="val_loss",      # Monitor validation loss
    mode="min",              # Save when validation loss decreases
    save_best_only=True,     # Only save the best model seen so far
    verbose=1,               # Print a message when saving
)

print("Starting training with ModelCheckpoint callback...")
history = model.fit(
    X_train, y_train,
    epochs=10,
    batch_size=32,
    validation_data=(X_val, y_val),
    callbacks=[checkpoint_callback],
    verbose=0,  # Keep fit quiet; verbose=1 in the callback reports saves
)
print("\nTraining finished.")
print(f"Checkpoints saved in: {checkpoint_dir}")
print("Files:", os.listdir(checkpoint_dir))

# Find the latest checkpoint (the best one, since save_best_only=True)
latest_checkpoint = tf.train.latest_checkpoint(checkpoint_dir)
print(f"\nLatest (best) checkpoint found: {latest_checkpoint}")
```

You should see output indicating that checkpoints were saved whenever the validation loss improved. The `tf.train.latest_checkpoint` utility returns the path prefix of the most recently written checkpoint in a directory; because we set `save_best_only=True`, the most recent checkpoint is also the best-performing one.

## 2. Loading Weights from a Checkpoint

Now, imagine your training was interrupted, or you simply want to use the best weights you saved. You need to:

1. Create a new instance of your model with the exact same architecture.
2. Load the saved weights into this new instance using `model.load_weights()`.

```python
# Build a new, untrained model instance with the same architecture
new_model = build_model()

# Evaluate the untrained model (performance should be poor)
print("\nEvaluating the new, untrained model:")
loss_untrained, acc_untrained = new_model.evaluate(X_val, y_val, verbose=0)
print(f"Untrained model - Loss: {loss_untrained:.4f}, Accuracy: {acc_untrained:.4f}")

# Load the weights from the best checkpoint saved earlier
if latest_checkpoint:
    print(f"\nLoading weights from: {latest_checkpoint}")
    new_model.load_weights(latest_checkpoint)

    # Evaluate the model with loaded weights (performance should be good)
    print("Evaluating the model with loaded weights:")
    loss_loaded, acc_loaded = new_model.evaluate(X_val, y_val, verbose=0)
    print(f"Model with loaded weights - Loss: {loss_loaded:.4f}, Accuracy: {acc_loaded:.4f}")
else:
    print("\nNo checkpoint found to load.")
```

Notice the significant improvement in accuracy after loading the weights compared with the freshly initialized `new_model`. This confirms that the learned parameters were successfully restored. Remember, `load_weights` only restores the parameters; it doesn't restore the optimizer's state.
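You can verify that last point directly. A quick sanity check, assuming the TF 2.x optimizer API where `optimizer.iterations` counts the gradient updates applied so far:

```python
# Weights were restored, but the optimizer is brand new: its step counter
# is non-zero on the trained model and still zero on the restored one.
print("Trained model optimizer steps: ", model.optimizer.iterations.numpy())
print("Restored model optimizer steps:", new_model.optimizer.iterations.numpy())  # 0
```

If resuming training with intact optimizer state matters to you, save the whole model instead, which is exactly what the next section covers.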
## 3. Saving the Entire Model

Saving only weights is useful, but sometimes you need the whole package: architecture, weights, and the optimizer's state (e.g., to resume training exactly where you left off). The `model.save()` method handles this, saving everything into a directory using the TensorFlow SavedModel format.

```python
# 'model' is the trained model from step 1.
# (We could equally use 'new_model', which holds the loaded weights.)
print(f"\nSaving the entire model to: {saved_model_dir}")
model.save(saved_model_dir)  # Use the originally trained model instance
print("Model saved successfully.")

# List the contents to show the SavedModel structure
print("Contents of the saved model directory:")
for item in os.listdir(saved_model_dir):
    print(f"- {item}")
```

Executing `model.save()` creates a directory containing `saved_model.pb` (the graph definition and metadata), a `variables` directory (the weights), and possibly an `assets` directory. This format is language-neutral and suitable for serving models via TensorFlow Serving or for use in other TensorFlow environments.

## 4. Loading the Entire Model

Loading a SavedModel is straightforward with `tf.keras.models.load_model()`. It restores the architecture, weights, and optimizer state, making the model ready for inference or continued training.

```python
print(f"\nLoading the entire model from: {saved_model_dir}")
loaded_full_model = tf.keras.models.load_model(saved_model_dir)

# Verify the loaded model's architecture
print("\nLoaded model summary:")
loaded_full_model.summary()

# Evaluate the loaded model to confirm it performs as expected
print("\nEvaluating the loaded full model:")
loss_full, acc_full = loaded_full_model.evaluate(X_val, y_val, verbose=0)
print(f"Loaded full model - Loss: {loss_full:.4f}, Accuracy: {acc_full:.4f}")

# You can also make predictions directly
print("\nMaking a prediction with the loaded model:")
sample_prediction = loaded_full_model.predict(X_val[:5])  # First 5 validation samples
print("Predictions:", sample_prediction.flatten())
print("Actual labels:", y_val[:5])
```

The loaded model performs identically to the one we saved, and we didn't need to rebuild the architecture or compile it again (though you might recompile if you want to change the optimizer or metrics for further training).

## Resuming Training (Optional)

Because `model.save()` also stores the optimizer's state, you can resume training and TensorFlow will pick up where it left off, including the learning-rate schedule and accumulated optimizer state such as momentum terms.

```python
# Resume training the loaded model for a few more epochs.
# Note: in fit(), `epochs` is the index of the final epoch (exclusive),
# not a count of additional epochs, so we set it relative to where
# the earlier run stopped.
print("\nResuming training on the loaded model...")
start_epoch = history.epoch[-1] + 1  # Continue the epoch numbering
history_resumed = loaded_full_model.fit(
    X_train, y_train,
    epochs=start_epoch + 5,   # Train for 5 more epochs
    initial_epoch=start_epoch,
    batch_size=32,
    validation_data=(X_val, y_val),
    verbose=1,
)
print("\nResumed training finished.")
```

This demonstrates how loading a full SavedModel allows you to continue the training process precisely, which is invaluable for long-running experiments.
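One version caveat before we summarize: the directory-based `model.save()` workflow above follows the Keras 2 API bundled with TensorFlow through 2.15. Keras 3 (the default from TensorFlow 2.16) instead requires a file extension on `model.save()` and moves SavedModel export to `model.export()`. A minimal sketch under that assumption, with illustrative file names:

```python
# Keras 3 (TF >= 2.16) equivalents; a sketch, not part of the run above.
model.save("my_model.keras")                          # single-file native Keras format
reloaded = keras.models.load_model("my_model.keras")  # full restore: architecture, weights, optimizer

model.export("./exported_model")  # SavedModel directory for TensorFlow Serving
```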
## Summary

In this practice session, you've worked through the essential workflows for saving and loading models in TensorFlow/Keras:

- **`ModelCheckpoint` callback**: Ideal for automatically saving the best weights (or full models) during training runs, providing fault tolerance and capturing optimal states.
- **`model.load_weights()`**: Restores learned parameters into a model instance with the same architecture. Useful when you only need the weights, such as for transfer learning or for inference when you rebuild the model structure yourself.
- **`model.save()`**: Saves the entire model (architecture, weights, optimizer state) in the SavedModel format. This is the standard way to save a model for deployment or for resuming training later.
- **`tf.keras.models.load_model()`**: Loads a model previously saved with `model.save()`, restoring its complete state.

Mastering these techniques ensures that your training efforts are preserved and your models are ready for evaluation, deployment, or further development.