After defining your model's architecture and training it with the fit() method, monitoring performance on validation data gives you valuable insight during learning. However, the validation set is often used, implicitly or explicitly, to make decisions during training, such as adjusting hyperparameters or applying early stopping. As a result, its performance metric may not be a completely unbiased estimate of how well your model will perform on entirely new, unseen data.
To get a final, objective assessment of your model's generalization ability, you need to evaluate it on a separate dataset that was held out and never used during training or validation tuning. This is the role of the test set, and Keras provides the evaluate() method for this specific purpose.
The evaluate() Method
The evaluate() method computes the loss and any other metrics you specified during the compile() step on the dataset you provide. It operates in a single pass over the data, making predictions and comparing them to the true labels to calculate the performance metrics. Unlike fit(), it performs no weight updates; its sole purpose is assessment.
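You can confirm this behavior yourself. The following is a minimal sketch, using a hypothetical toy model and random data (not anything from this chapter), that checks the weights before and after a call to evaluate():

import numpy as np
import tensorflow as tf

# Hypothetical toy model and random data, purely for illustration
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(4,)),
    tf.keras.layers.Dense(1)
])
model.compile(optimizer='adam', loss='mse')

x = np.random.rand(32, 4).astype('float32')
y = np.random.rand(32, 1).astype('float32')

weights_before = [w.copy() for w in model.get_weights()]
model.evaluate(x, y, verbose=0)  # forward pass and metric computation only
weights_after = model.get_weights()

# evaluate() never invokes the optimizer, so the weights are unchanged
for before, after in zip(weights_before, weights_after):
    assert np.array_equal(before, after)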
Think of it like this: the training set is the homework your model studies from, the validation set is the practice exam you use to adjust your study strategy, and the test set is the final exam taken under real conditions. Using evaluate() on the test set provides this "final exam" score for your model.
Using the evaluate() method is straightforward. Assuming you have a trained model and your test data split into features (x_test) and labels (y_test), you call evaluate() like this:
# Assume 'model' is your trained Keras model
# Assume 'x_test' and 'y_test' are your test features and labels
# Evaluate the model on the test data
results = model.evaluate(x_test, y_test, batch_size=128) # You can optionally specify a batch size
# Print the results
print("Test Loss:", results[0])
print("Test Accuracy:", results[1]) # Assuming accuracy was the second metric compiled
# If you have more metrics, access them by index:
# print("Test Metric 2:", results[2])
# ...and so on.
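Rather than remembering which index corresponds to which metric, you can pair each value with its name using the model's metrics_names attribute, which is populated once the model has been trained or evaluated:

# Pair each returned value with its metric name
for name, value in zip(model.metrics_names, results):
    print(f"{name}: {value:.4f}")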
The evaluate() method requires the test features and the corresponding true labels. It returns a list (or a scalar, if the model was compiled with only a loss) containing the loss value followed by the values of any metrics specified in model.compile(). The order of the metrics in the returned list matches the order given in the metrics argument during compilation.
For example, if you compiled your model like this:
import tensorflow as tf

model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy', tf.keras.metrics.Precision()])
Then the list returned by model.evaluate(x_test, y_test) would contain:

results[0]: The categorical crossentropy loss calculated on the test set.
results[1]: The accuracy calculated on the test set.
results[2]: The precision calculated on the test set.

The values returned by evaluate() represent your model's performance on data it has never encountered before. This is often considered the most reliable measure of how the model is likely to perform in a real application scenario.
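If you would rather avoid positional indexing entirely, evaluate() also accepts a return_dict argument (available in TensorFlow 2.2 and later) that returns the results as a dictionary keyed by metric name:

# Request a dictionary of named results instead of a list
results = model.evaluate(x_test, y_test, return_dict=True)
print(results)  # e.g. {'loss': 0.41, 'accuracy': 0.87, 'precision': 0.85}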
It's informative to compare the test metrics (from evaluate()) with the final validation metrics observed during training (from the output of fit()).
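As a sketch, assuming the model was compiled with 'accuracy' as its only metric and fit() was called with validation_data (so the History object records a 'val_accuracy' key), the comparison might look like this:

# 'history' is the object returned by model.fit(..., validation_data=...)
final_val_acc = history.history['val_accuracy'][-1]

# With a single compiled metric, evaluate() returns [loss, accuracy]
test_loss, test_acc = model.evaluate(x_test, y_test, verbose=0)

print(f"Final validation accuracy: {final_val_acc:.4f}")
print(f"Test accuracy:             {test_acc:.4f}")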
The following chart illustrates a typical scenario where test performance is slightly lower than validation performance, which itself is lower than training performance.

[Chart: Comparison of final accuracy scores achieved on the training, validation, and test datasets. A slight drop from validation to test accuracy is common and expected.]
In summary, model.evaluate() is the standard Keras function for obtaining a final performance assessment of your trained model on unseen data. It provides the critical metrics needed to understand how well your model generalizes, serving as a benchmark before deploying the model or deciding on next steps for improvement.