After training your model using model.fit(), the next important step is to assess how well it performs on data it has never seen before. While the validation data used during fit() gives you an indication of generalization during training, a final evaluation on a dedicated test set provides a less biased measure of the model's performance in real-world scenarios. This is where the model.evaluate() method comes in.
The model.evaluate() method computes the loss and any other metrics you specified when compiling the model, using the provided test dataset. It performs a single pass over the entire dataset in batches, calculates the metrics, and returns the final values. Unlike model.fit(), it does not perform any weight updates; it purely measures performance.
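Because no gradients are computed or applied, the model's weights are identical before and after the call. A minimal sketch to confirm this, assuming model, x_test, and y_test are already defined:

import numpy as np

# Snapshot the weights, run evaluation, and confirm nothing changed
weights_before = [w.copy() for w in model.get_weights()]
model.evaluate(x_test, y_test, verbose=0)
weights_after = model.get_weights()

assert all(np.array_equal(a, b) for a, b in zip(weights_before, weights_after))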
To use model.evaluate(), you need two things:

- A trained model that was compiled with a loss function and, optionally, metrics.
- A test dataset (features and corresponding labels) that the model has never seen during training or tuning.
The method can accept test data in several formats, similar to model.fit():

- NumPy arrays: Pass the test features and labels directly: model.evaluate(x_test, y_test).
- tf.data.Dataset: Pass a tf.data.Dataset object that yields tuples of (features, labels): model.evaluate(test_dataset). Using a tf.data.Dataset is often preferred for large datasets as it integrates well with the input pipelines discussed in the next chapter; see the sketch after the example below.

Let's look at a typical usage pattern assuming you have test data x_test and y_test:
# Assume 'model' is your compiled and trained Keras model
# Assume 'x_test' and 'y_test' are your test features and labels
print("Evaluating model on test data...")
results = model.evaluate(x_test, y_test, batch_size=128, verbose=1)
# The 'results' variable contains the loss and metric values
# The order corresponds to model.metrics_names
print("Test Loss:", results[0])
print("Test Accuracy:", results[1]) # Assuming accuracy was the first metric specified
# Alternatively, if you compiled with named metrics:
# results_dict = model.evaluate(x_test, y_test, return_dict=True)
# print(results_dict)
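If your test data lives in a tf.data.Dataset instead, the call looks the same. A minimal sketch, assuming you wrap the same hypothetical x_test and y_test arrays into a batched dataset:

import tensorflow as tf

# Wrap the NumPy test arrays in a batched tf.data.Dataset
test_dataset = tf.data.Dataset.from_tensor_slices((x_test, y_test)).batch(128)

# No batch_size argument here; the dataset already yields batches
dataset_results = model.evaluate(test_dataset, verbose=1)
print("Test loss and metrics:", dataset_results)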
The model.evaluate() method returns the scalar loss value by default, along with the values of any additional metrics included when the model was compiled. If you specified metrics in model.compile(), evaluate() returns a list whose first element is the test loss and whose remaining elements are the computed values for each metric, in the order they were provided. You can check model.metrics_names to see the names corresponding to the returned values. If you instead pass return_dict=True to evaluate(), it returns a dictionary mapping metric names (including 'loss') to their computed scalar values, which is often more readable.
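As a quick illustration of the two return styles, here is a sketch assuming the model was compiled with metrics=['accuracy']:

# List form: pair each returned value with its name from model.metrics_names
results = model.evaluate(x_test, y_test, verbose=0)
print(dict(zip(model.metrics_names, results)))  # e.g. {'loss': ..., 'accuracy': ...}

# Dictionary form: evaluate() builds the mapping for you
results_dict = model.evaluate(x_test, y_test, verbose=0, return_dict=True)
print(results_dict)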
The verbose argument controls the output during evaluation:

- verbose=0: Silent mode, no output.
- verbose=1: Progress bar (the default).
- verbose=2: One line per epoch (less relevant here, since evaluation is a single pass, but consistent with fit).

The primary goal of evaluation is to estimate the model's generalization ability. Compare the test loss and metrics obtained from model.evaluate() with the validation loss and metrics observed during the final stages of model.fit().
Example comparison showing training loss/accuracy, validation loss/accuracy (potentially from the end of training), and the final test loss/accuracy obtained via model.evaluate(). The closeness of validation and test metrics indicates reasonable generalization, although performance is lower than on the training data itself, which is expected.
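One concrete way to make this comparison is to read the last-epoch validation metrics from the History object returned by fit() and print them next to the test results. The sketch below assumes hypothetical x_train, y_train, x_val, and y_val arrays and a model compiled with metrics=['accuracy']:

history = model.fit(x_train, y_train,
                    epochs=10, batch_size=128,
                    validation_data=(x_val, y_val), verbose=0)

test_results = model.evaluate(x_test, y_test, verbose=0, return_dict=True)

# Final-epoch validation metrics next to the held-out test metrics
print(f"Validation loss: {history.history['val_loss'][-1]:.4f}, "
      f"validation accuracy: {history.history['val_accuracy'][-1]:.4f}")
print(f"Test loss: {test_results['loss']:.4f}, "
      f"test accuracy: {test_results['accuracy']:.4f}")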
Remember, the test set should only be used for the final evaluation after all training and hyperparameter tuning (based on the validation set) are complete. Using the test set results to guide further model development invalidates its purpose as an unbiased performance estimate. Evaluating your model rigorously provides confidence in its expected performance when deployed.