While linear models provide a strong baseline, many real-world relationships aren't linear. Tree-based models offer a different approach, partitioning the feature space into rectangular regions to make predictions. They are fundamental to many state-of-the-art machine learning techniques. This section covers two core tree-based algorithms: Decision Trees and their powerful ensemble extension, Random Forests.
Imagine making a prediction by asking a sequence of yes/no questions based on the features. That's the intuition behind a Decision Tree. It learns a hierarchy of feature-based splits that lead to a final prediction (a class label in classification or a continuous value in regression).
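To make this concrete, here is a minimal sketch of what a learned tree's prediction logic looks like as nested conditionals. The feature names match the example data used below, but the thresholds and structure are made up for illustration, not learned from data:

# Hypothetical decision logic equivalent to a small learned tree.
# Thresholds here are illustrative only.
def predict_one(sample):
    if sample["Feature1"] <= 1.7:          # first yes/no question
        return 0                           # leaf: predict Class 0
    else:
        if sample["Feature2"] <= 1.2:      # second question on another feature
            return 0                       # leaf: predict Class 0
        else:
            return 1                       # leaf: predict Class 1

print(predict_one({"Feature1": 2.5, "Feature2": 1.7}))  # -> 1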
How They Work:
Starting at the root node, the algorithm searches for the feature and threshold that best separate the training samples, typically by minimizing an impurity measure such as Gini impurity or entropy for classification, or by reducing variance (squared error) for regression. The data is split into two child nodes and the process repeats recursively. Splitting stops when a branch reaches a maximum depth (max_depth), contains fewer samples than a minimum threshold (min_samples_split or min_samples_leaf), or when no split can improve the purity/reduce error further.
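As a sketch of the splitting criterion, the Gini impurity used below (criterion='gini') can be computed directly from class proportions. This helper is purely illustrative and not part of scikit-learn's API:

import numpy as np

def gini_impurity(labels):
    """Gini impurity of a node: 1 minus the sum of squared class proportions."""
    labels = np.asarray(labels)
    proportions = np.bincount(labels) / len(labels)
    return 1.0 - np.sum(proportions ** 2)

# A pure node has impurity 0; a 50/50 node has impurity 0.5.
print(gini_impurity([0, 0, 0, 0]))   # 0.0
print(gini_impurity([0, 1, 0, 1]))   # 0.5

# A candidate split is scored by the weighted impurity of its children;
# the tree chooses the split with the largest impurity decrease.
left, right = [0, 0, 0], [1, 1, 0]
weighted = (len(left) * gini_impurity(left) + len(right) * gini_impurity(right)) / 6
print(weighted)                      # ~0.222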
Implementation with scikit-learn:
Let's train a DecisionTreeClassifier on a simple dataset.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, plot_tree, export_graphviz
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt
# import graphviz  # Optional: only needed for the graphviz visualization below
# Sample Data (replace with your actual data)
data = {
'Feature1': [2.5, 0.5, 2.2, 1.9, 3.1, 2.3, 2.0, 1.0, 1.5, 1.1],
'Feature2': [1.7, 0.8, 1.5, 1.0, 2.5, 0.9, 2.0, 1.1, 0.5, 1.3],
'Target': [1, 0, 1, 0, 1, 0, 1, 0, 0, 0]
}
df = pd.DataFrame(data)
X = df[['Feature1', 'Feature2']]
y = df['Target']
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Initialize and train the Decision Tree Classifier
# Limit depth to prevent overfitting and for visualization
dt_clf = DecisionTreeClassifier(criterion='gini', max_depth=3, random_state=42)
dt_clf.fit(X_train, y_train)
# Make predictions
y_pred = dt_clf.predict(X_test)
# Evaluate
accuracy = accuracy_score(y_test, y_pred)
print(f"Decision Tree Accuracy: {accuracy:.4f}")
# Visualize the tree (requires matplotlib and optionally graphviz)
plt.figure(figsize=(12, 8))
plot_tree(dt_clf, filled=True, feature_names=X.columns.tolist(), class_names=['Class 0', 'Class 1'], rounded=True)
plt.title("Simple Decision Tree Structure (max_depth=3)")
plt.show()
# Alternative visualization using graphviz (if installed)
# import graphviz
# dot_data = export_graphviz(dt_clf, out_file=None,
#                            feature_names=X.columns.tolist(),
#                            class_names=['Class 0', 'Class 1'],
#                            filled=True, rounded=True,
#                            special_characters=True)
# graph = graphviz.Source(dot_data)
# graph.render("decision_tree")  # Saves the tree to decision_tree.pdf
# print("Decision tree saved to decision_tree.pdf")
For regression tasks, you would use DecisionTreeRegressor and evaluate using regression metrics like Mean Squared Error (MSE).
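A minimal sketch of the regression variant, using synthetic data (the generated features and coefficients below are just for illustration):

import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Synthetic regression data: a simple linear signal plus noise
rng = np.random.default_rng(42)
X_reg = rng.random((200, 2)) * 10
y_reg = 2 * X_reg[:, 0] - 3 * X_reg[:, 1] + rng.normal(scale=2.0, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X_reg, y_reg, test_size=0.3, random_state=42)

# A shallow tree keeps the model from memorizing the noise
dt_reg = DecisionTreeRegressor(max_depth=4, random_state=42)
dt_reg.fit(X_tr, y_tr)

mse = mean_squared_error(y_te, dt_reg.predict(X_te))
print(f"Decision Tree Regressor MSE: {mse:.4f}")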
Pros and Cons:
On the plus side, decision trees are easy to interpret and visualize, capture non-linear relationships naturally, and require little data preparation (no feature scaling is needed). Their main weakness is a strong tendency to overfit: an unconstrained tree keeps splitting until it memorizes the training data. Limiting max_depth or min_samples_leaf helps, but finding the right balance can be tricky. The tendency to overfit is the most significant drawback, which leads us to ensemble methods like Random Forests.
Instead of relying on a single, potentially unstable and overfit tree, why not build many diverse trees and combine their predictions? This is the core idea behind Random Forests. It's an ensemble method that uses bagging and feature randomness to create a collection of decision trees.
How They Work:
Each tree in the forest is trained on a bootstrap sample of the training data, that is, a random sample drawn with replacement (bagging). In addition, when searching for the best split at each node, only a random subset of the features is considered (controlled by the max_features parameter). This decorrelates the trees: if one feature is very predictive, it won't dominate the splits in all trees. To make a prediction, the forest aggregates the outputs of all its trees, using a majority vote for classification or the mean for regression. This combination of bootstrapping and feature randomness results in trees that are different from each other. Averaging their predictions reduces variance and improves the model's generalization ability compared to a single decision tree, as the small sketch below illustrates.
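The following sketch builds the bagging-plus-voting part by hand from individual DecisionTreeClassifiers. It is a simplified illustration of the idea, not how RandomForestClassifier is implemented internally (which also adds per-split feature subsampling):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagged_predict(X_train, y_train, X_new, n_trees=25, random_state=42):
    """Train n_trees trees on bootstrap samples and combine them by majority vote."""
    rng = np.random.default_rng(random_state)
    X_train, y_train = np.asarray(X_train), np.asarray(y_train)
    votes = []
    for _ in range(n_trees):
        # Bootstrap sample: draw rows with replacement
        idx = rng.integers(0, len(X_train), size=len(X_train))
        tree = DecisionTreeClassifier(random_state=0)
        tree.fit(X_train[idx], y_train[idx])
        votes.append(tree.predict(X_new))
    votes = np.array(votes)  # shape: (n_trees, n_samples)
    # Majority vote across trees for each new sample
    return np.array([np.bincount(col).argmax() for col in votes.T])

# Example: reuse the train/test split from the decision tree section above
print(bagged_predict(X_train, y_train, X_test))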
Implementation with scikit-learn:
Training a RandomForestClassifier or RandomForestRegressor is straightforward.
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
import numpy as np
# Using the same data split as before for classification
rf_clf = RandomForestClassifier(n_estimators=100, # Number of trees in the forest
max_depth=5, # Max depth of individual trees
max_features='sqrt', # Number of features to consider at each split
random_state=42,
n_jobs=-1) # Use all available CPU cores
rf_clf.fit(X_train, y_train)
# Make predictions
y_pred_rf = rf_clf.predict(X_test)
# Evaluate
accuracy_rf = accuracy_score(y_test, y_pred_rf)
print(f"Random Forest Accuracy: {accuracy_rf:.4f}")
# Feature Importances (useful for understanding feature contribution)
importances = rf_clf.feature_importances_
feature_names = X.columns
indices = np.argsort(importances)[::-1]
print("\nFeature Importances:")
for i in indices:
print(f"{feature_names[i]}: {importances[i]:.4f}")
# Example for Regression (requires a different, continuous target 'y')
# from sklearn.metrics import mean_squared_error
# # Generate some sample regression data
# np.random.seed(42)
# X_reg = np.random.rand(100, 2) * 10
# y_reg = 2 * X_reg[:, 0] - 3 * X_reg[:, 1] + np.random.randn(100) * 2 + 5
# X_train_reg, X_test_reg, y_train_reg, y_test_reg = train_test_split(X_reg, y_reg, test_size=0.3, random_state=42)
# rf_reg = RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=-1)
# rf_reg.fit(X_train_reg, y_train_reg)
# y_pred_reg = rf_reg.predict(X_test_reg)
# mse = mean_squared_error(y_test_reg, y_pred_reg)
# print(f"\nRandom Forest Regressor MSE: {mse:.4f}")
We can visualize the feature importances, which reflect how much each feature contributes to reducing impurity across all trees in the forest. Higher values indicate a greater contribution to the model's predictions; the exact values depend on the training data and model parameters.
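A short snippet to produce such a bar chart, reusing the importances and indices computed above and the matplotlib import from earlier:

# Bar chart of feature importances, sorted from most to least important
sorted_names = [feature_names[i] for i in indices]
plt.figure(figsize=(8, 5))
plt.bar(sorted_names, importances[indices])
plt.ylabel("Importance")
plt.title("Random Forest Feature Importances")
plt.tight_layout()
plt.show()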
Pros and Cons:
Random Forests generally achieve much higher accuracy than a single decision tree, are far less prone to overfitting, handle large numbers of features well, and provide useful feature importance estimates. The trade-offs are reduced interpretability compared to a single tree and higher computational cost for training and prediction as the number of trees grows (n_estimators). They also introduce more hyperparameters to tune, such as n_estimators, max_depth, and max_features, for optimal performance.
Tree-based models, particularly Random Forests, are powerful and widely used tools in supervised learning. While single decision trees offer interpretability, their tendency to overfit often makes Random Forests a more practical choice for achieving higher predictive accuracy. Understanding how they work and how to implement them using libraries like scikit-learn is an essential skill for applied data science. The next sections will explore other powerful ensemble methods like Gradient Boosting and delve into systematic ways to tune model hyperparameters.