You now understand why we need separate training and test sets: to get an honest assessment of how well our model will perform on new, unseen data. You also know the purpose of each set: the training set is for teaching the model, and the test set is for evaluating it afterward.

But how do we actually perform this split? The process itself is straightforward. Here's a step-by-step guide to the standard train-test split procedure:

The Splitting Process

1. Start with Your Entire Labeled Dataset: Imagine you have collected all your data, complete with the features (inputs) and the target variable (what you want to predict, like 'spam'/'not spam' or a house price). This complete dataset is your starting point.

2. Shuffle Your Data (Usually Recommended): Before splitting, it is generally good practice to randomly shuffle the rows (the individual data points or examples) in your dataset. Why? Sometimes data is collected or stored in a specific order. For example, maybe all the 'spam' emails are listed first, or house price data is ordered by neighborhood. If you split ordered data without shuffling, your training set might contain only one type of example and your test set another, leading to poor training and misleading evaluation. Shuffling makes it likely that each type of example appears in both the training and test sets. We'll touch more on the importance of this randomness later in this chapter.

3. Choose a Split Ratio: You need to decide what proportion of your data will be used for training and what proportion for testing. This ratio is often expressed as percentages, like 80/20 (80% for training, 20% for testing) or 70/30. The choice depends on several factors, including the total size of your dataset. We'll discuss common ratios in the next section.

4. Perform the Split: Divide your shuffled dataset into two distinct, non-overlapping subsets according to the chosen ratio.

   - Training Set: Contains the larger portion of the data (e.g., 80%). This data will be used to train your machine learning model. The model learns patterns, relationships, and rules from this set.
   - Test Set: Contains the remaining, smaller portion (e.g., 20%). This data is set aside and not shown to the model during training. It acts as the unseen data for evaluation purposes.

5. Keep the Test Set Separate: This is a critical step. Once the split is done, treat the test set as if it doesn't exist until you have a final, trained model ready for evaluation. Do not use the test set to make decisions about how to build or tune your model (like choosing which features to use or adjusting model parameters). Using information from the test set during model building contaminates it, and your final evaluation will no longer reflect true performance on genuinely new data.

Visualizing the Split

Think of it like taking your full deck of data cards, shuffling them thoroughly, and then dealing out a certain percentage into a 'training pile' and the rest into a 'testing pile'.

```dot
digraph G {
  rankdir=LR;
  node [shape=box, style=filled, fontname="sans-serif", margin=0.2];
  bgcolor="transparent";

  subgraph cluster_0 {
    label = "Original Dataset";
    style=filled;
    color="#e9ecef";
    node [style=filled, color="#adb5bd"];
    OrigData [label="Full Dataset\n(e.g., 1000 examples)", shape=folder];
  }

  subgraph cluster_1 {
    label = "Shuffle";
    style=filled;
    color="#e9ecef";
    node [style=filled, color="#ced4da"];
    ShuffledData [label="Shuffled Dataset\n(Random Order)", shape=folder];
  }

  subgraph cluster_2 {
    label = "Split (e.g., 80/20)";
    style=filled;
    color="#e9ecef";

    subgraph cluster_train {
      label = "Training Set";
      style=filled;
      color="#a5d8ff";
      node [style=filled, color="#74c0fc", fontcolor="#1c7ed6"];
      TrainSet [label="Training Data\n(80% = 800 examples)\n\nUsed to BUILD the model"];
    }

    subgraph cluster_test {
      label = "Test Set";
      style=filled;
      color="#ffc9c9";
      node [style=filled, color="#ffa8a8", fontcolor="#f03e3e"];
      TestSet [label="Test Data\n(20% = 200 examples)\n\nUsed to EVALUATE the final model"];
    }
  }

  OrigData -> ShuffledData [label=" Shuffle Rows"];
  ShuffledData -> TrainSet [label=" Take 80% "];
  ShuffledData -> TestSet [label=" Take 20% "];

  // Invisible placeholders to structure the layout if needed
  Model [style=invis];
  ModelEval [style=invis];
  TrainSet -> Model [style=invis];
  TestSet -> ModelEval [style=invis];
}
```

A flow showing the dataset being shuffled and then divided into separate training and testing sets.

Most machine learning libraries provide functions to perform this shuffle-and-split operation easily. For instance, in Python's scikit-learn library, the train_test_split function handles shuffling and splitting in one command, taking your features, target variable, and the desired test set size as inputs.

By following this procedure, you create the necessary separation between the data used for learning and the data used for unbiased evaluation, which is essential for understanding how your model is likely to perform in practice.
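To make the steps concrete, here is a minimal sketch of the shuffle-and-split procedure in plain Python, using only the standard library. The helper name `shuffle_and_split`, its parameters, and the toy data are assumptions made for illustration; they are not part of any library.

```python
import random


def shuffle_and_split(features, targets, test_ratio=0.2, seed=42):
    """Hypothetical helper: shuffle rows, then split them into train and test sets."""
    if len(features) != len(targets):
        raise ValueError("features and targets must have the same length")
    if not 0.0 < test_ratio < 1.0:
        raise ValueError("test_ratio must be strictly between 0 and 1")

    # Step 2: shuffle row indices (not the rows themselves) so that
    # each feature row stays aligned with its target value.
    indices = list(range(len(features)))
    random.Random(seed).shuffle(indices)

    # Steps 3 and 4: turn the ratio into a cut point and divide the indices
    # into two non-overlapping groups.
    n_test = round(len(indices) * test_ratio)
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    X_train = [features[i] for i in train_idx]
    X_test = [features[i] for i in test_idx]
    y_train = [targets[i] for i in train_idx]
    y_test = [targets[i] for i in test_idx]
    return X_train, X_test, y_train, y_test


# Toy data, deliberately ordered by label (all 0s first, then all 1s)
features = [[i] for i in range(1000)]
targets = [0] * 500 + [1] * 500

X_train, X_test, y_train, y_test = shuffle_and_split(features, targets, test_ratio=0.2)
print(len(X_train), len(X_test))  # 800 200
```

Fixing the seed makes the split reproducible, so repeated runs evaluate against the same test rows. In practice you would usually call scikit-learn's `train_test_split(X, y, test_size=0.2, random_state=42)` instead, which shuffles by default and returns the four subsets in the same order as this sketch.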