Evaluating a model solely on the data it was trained on provides an overly optimistic and often misleading picture of its capabilities. A model might achieve high accuracy on the training set simply by memorizing the data points, including noise and specific quirks, rather than learning the underlying general patterns. This phenomenon, known as overfitting, results in poor performance when the model encounters new, unseen data.

To get a more realistic estimate of how our model will perform in a practical scenario, we need to evaluate it on data it hasn't seen during training. The most straightforward way to achieve this is to partition our dataset into two separate subsets:

- **Training set**: used to train the model. The model learns patterns, relationships, and parameters from this data.
- **Testing set** (or hold-out set): used to evaluate the trained model's performance. This data is kept separate during training and acts as a proxy for new, unseen data.

By training on one subset and testing on another, we simulate deploying the model and assess its generalization ability: its ability to perform well on data not used during its development.

### Implementing Train-Test Split with Scikit-learn

Scikit-learn provides a convenient utility function, `train_test_split`, in the `model_selection` module to perform this partitioning. It handles shuffling the data (important if the data has some inherent order) and splitting it into the desired proportions.

Let's see how to use it. Assume your features are stored in a NumPy array or Pandas DataFrame called `X` and your corresponding target variable (labels or values) in a NumPy array or Pandas Series called `y`.

```python
from sklearn.model_selection import train_test_split
import numpy as np

# Small demonstration data; in practice X and y come from your own dataset
X = np.arange(20).reshape(10, 2)  # example features (10 samples, 2 features)
y = np.arange(10)                 # example target variable

# Perform the split, allocating 25% of the data for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Check the shapes of the resulting arrays
print(f"X_train shape: {X_train.shape}")  # (7, 2)
print(f"X_test shape: {X_test.shape}")    # (3, 2)
print(f"y_train shape: {y_train.shape}")  # (7,)
print(f"y_test shape: {y_test.shape}")    # (3,)
```

### Parameters of train_test_split

- **`X`, `y`**: the arrays or DataFrames containing your features and target variable, respectively. You can also pass additional arrays if needed (e.g., if you have different feature sets).
- **`test_size`**: the proportion of the dataset to include in the test split, typically a float between 0.0 and 1.0. For example, `test_size=0.25` means 25% of the data is used for testing and the remaining 75% for training. Alternatively, you can use `train_size` to specify the proportion for the training set; if only one is specified, the other is inferred. You can also pass an integer representing the absolute number of test samples.
- **`random_state`**: an important parameter for reproducibility. The function shuffles the data before splitting by default, and setting `random_state` to an integer (e.g., 42, 0, 123) ensures that the same random split is generated every time you run the code. This is essential for debugging, comparing different models fairly, and ensuring others can reproduce your results. If you omit `random_state` or set it to `None`, you'll get a different split each time.
- **`shuffle`**: a boolean (defaulting to `True`) indicating whether to shuffle the data before splitting. Shuffling is generally recommended to ensure that the training and test sets are representative samples, especially if the original dataset is ordered in some way (e.g., sorted by time or class).
- **`stratify`**: particularly useful for classification tasks, especially with imbalanced datasets (where some classes are much less frequent than others). By setting `stratify=y`, the function ensures that the proportion of values in the target variable `y` is preserved in both the training and testing sets. For example, if your target variable has 80% Class A and 20% Class B, stratification ensures both `y_train` and `y_test` maintain this approximate 80/20 split. Without stratification, a random split might, by chance, put almost all samples of a rare class into either the training or the testing set, leading to biased training or evaluation (see the short sketch after this list).
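To make the effect of `stratify` concrete, here is a minimal sketch on a small, made-up imbalanced dataset; the arrays `X_demo` and `y_demo` are invented purely for illustration and are not part of any real dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Invented toy data: 10 samples, 8 of class 0 and 2 of class 1 (an 80/20 imbalance)
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])

# Stratified split: the class proportions are preserved in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.5, stratify=y_demo, random_state=42
)

# Count samples per class in each subset; both stay roughly 80/20
print(np.bincount(y_tr))  # [4 1]
print(np.bincount(y_te))  # [4 1]
```

With only two samples of the minority class, an unstratified split could easily place both of them on the same side, which is exactly the biased outcome described above.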
### The Workflow Using Train-Test Split

Once you have split your data, the typical machine learning workflow proceeds as follows:

1. **Choose and instantiate a model**: select a Scikit-learn estimator (e.g., `LinearRegression`, `LogisticRegression`, `KNeighborsClassifier`).

   ```python
   from sklearn.linear_model import LogisticRegression

   model = LogisticRegression()
   ```

2. **Train the model**: fit the model using only the training data (`X_train` and `y_train`). The model should never see the test data during this phase.

   ```python
   # Fit the model ONLY on the training data
   model.fit(X_train, y_train)
   ```

3. **Make predictions**: use the trained model to make predictions on the test features (`X_test`).

   ```python
   # Make predictions ONLY on the test data
   y_pred = model.predict(X_test)
   ```

4. **Evaluate**: compare the predictions (`y_pred`) with the actual target values from the test set (`y_test`) using appropriate evaluation metrics (e.g., accuracy, precision, and recall for classification; MAE, MSE, and R² for regression).

   ```python
   from sklearn.metrics import accuracy_score  # example for classification

   # Evaluate the model's performance on the test set
   accuracy = accuracy_score(y_test, y_pred)
   print(f"Model Accuracy on Test Set: {accuracy:.4f}")
   ```

This evaluation on the test set provides a much more reliable indication of how the model is likely to perform on new, unseen data than evaluating on the training data.

### Limitations and Next Steps

While the train-test split is a fundamental technique for model evaluation, it has a limitation: the performance estimate depends heavily on which specific data points happen to end up in the training versus the testing set. If you get a "lucky" or "unlucky" split, your evaluation metric might be overly optimistic or pessimistic (the sketch at the end of this section illustrates this). Furthermore, by holding out a portion of the data for testing, you reduce the amount of data available for training the model.

To obtain a more reliable performance estimate and make better use of your available data, we often turn to cross-validation techniques, which we will discuss in the next section. Cross-validation involves multiple train-test splits, providing a more averaged and stable measure of model generalization.
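To see concretely how much the test-set estimate can depend on the particular split, here is a rough sketch that repeats the split-fit-evaluate loop with several different `random_state` values. It uses the built-in iris dataset and a logistic regression purely as stand-ins; the exact accuracies you get will differ, and the point is only the spread across seeds.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Stand-in dataset for illustration; any classification dataset would do
X, y = load_iris(return_X_y=True)

# Repeat the same workflow with different splits and compare the estimates
for seed in [0, 1, 2, 3, 4]:
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=seed
    )
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train, y_train)
    acc = accuracy_score(y_test, model.predict(X_test))
    # The accuracy printed here varies from seed to seed even though the
    # model and dataset are identical; that variation is the limitation
    # cross-validation is designed to address.
    print(f"random_state={seed}: test accuracy = {acc:.3f}")
```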