By Wei Ming T. on Jan 13, 2025
Scikit-learn is a cornerstone library in the Python ecosystem, offering a wide range of machine learning models for supervised and unsupervised learning. Choosing the right model is essential for accurate, efficient, and interpretable results. This guide provides an overview of scikit-learn's models, organized by category, with a short description and a code example for each.
This guide covers all 70+ Scikit-learn models, but you don’t need to know them all. Models marked with a ⭐️ are foundational and widely used, perfect for building your core machine learning skills. Start with these essentials and explore others as needed for specific tasks.
Just remember these...
Ordinary Least Squares (OLS) is a fundamental linear regression model used to estimate the relationship between a dependent variable and one or more independent variables. It works by minimizing the sum of the squared differences between the observed values and the values predicted by the model. OLS is often a starting point in regression analysis due to its simplicity and interpretability.
OLS may not be suitable in the following scenarios:
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X, y)
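As a quick sketch of how the fitted model is inspected (using a small made-up dataset, since the snippets in this guide assume X and y are already defined), the learned coefficients and intercept are exposed directly, which is where much of OLS's interpretability comes from:
import numpy as np
from sklearn.linear_model import LinearRegression
# Hypothetical toy data roughly following y = 3x + 2
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([5.0, 8.0, 11.0, 14.0])
model = LinearRegression()
model.fit(X, y)
print("Coefficients:", model.coef_)    # estimated slope for each feature
print("Intercept:", model.intercept_)  # predicted value when all features are zero
print("Predictions:", model.predict(X))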
Ridge Regression is a type of linear regression that includes a regularization term (L2 regularization) to penalize large coefficients. This regularization helps address issues like multicollinearity and overfitting in the model. Ridge regression is commonly used when there are many features, or the features are highly correlated.
In classification tasks, the Ridge classifier adapts the same principles of Ridge regression to predict discrete class labels instead of continuous outcomes.
Ridge regression and classification may not be ideal in the following scenarios:
from sklearn.linear_model import Ridge
ridge_regression = Ridge(alpha=1.0) # Adjust alpha to control regularization strength
ridge_regression.fit(X, y)
Lasso (Least Absolute Shrinkage and Selection Operator) is a linear regression technique that adds an L1 regularization term to the cost function. Unlike Ridge regression, Lasso not only penalizes large coefficients but can also shrink some of them to zero, effectively performing feature selection. This makes it especially useful when working with datasets with many features, some of which may be irrelevant.
Lasso may not be suitable in the following situations:
from sklearn.linear_model import Lasso
lasso = Lasso(alpha=0.1) # Adjust alpha to control regularization strength
lasso.fit(X, y)
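To see the feature-selection behaviour in practice, it helps to check how many coefficients the fitted model has shrunk exactly to zero (continuing from the snippet above, and assuming X and y are already defined):
import numpy as np
# Coefficients that are exactly zero correspond to features Lasso has effectively dropped
print("Non-zero coefficients:", np.sum(lasso.coef_ != 0), "out of", len(lasso.coef_))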
Multi-task Lasso is an extension of Lasso regression designed for problems where multiple dependent variables (targets) need to be predicted simultaneously. It adds L1 regularization to encourage sparsity in the coefficients, but it also ensures that the selected features are consistent across all tasks. This is particularly useful when the tasks share common underlying patterns.
Multi-task Lasso might not be ideal in the following scenarios:
from sklearn.linear_model import MultiTaskLasso
multi_task_lasso = MultiTaskLasso(alpha=0.1) # Adjust alpha to control regularization strength
multi_task_lasso.fit(X, y)
Elastic-Net is a linear regression model that combines both L1 (Lasso) and L2 (Ridge) regularization. By balancing these two penalties, it addresses some of the limitations of Lasso (e.g., selecting one feature from a group of highly correlated features) while still allowing for feature selection. This makes Elastic-Net especially effective in datasets with many correlated features or when feature selection is needed alongside smooth coefficient shrinkage.
Elastic-Net may not be suitable in the following scenarios:
from sklearn.linear_model import ElasticNet
# alpha controls overall regularization strength
# l1_ratio controls the mix of L1 and L2 penalties (0 = Ridge, 1 = Lasso)
elastic_net = ElasticNet(alpha=0.1, l1_ratio=0.5)
elastic_net.fit(X, y)
Multi-task Elastic-Net is an extension of Elastic-Net designed for problems with multiple dependent variables (targets) that need to be predicted simultaneously. It combines L1 and L2 regularization, ensuring sparsity (via L1) while promoting smoothness and shared feature selection across tasks (via L2). This makes it effective when tasks share common structures or patterns.
Multi-task Elastic-Net may not be ideal in the following situations:
from sklearn.linear_model import MultiTaskElasticNet
# alpha controls overall regularization strength
# l1_ratio adjusts the balance between L1 and L2 penalties (0 = Ridge, 1 = Lasso)
multi_task_elastic_net = MultiTaskElasticNet(alpha=0.1, l1_ratio=0.5)
multi_task_elastic_net.fit(X, y)
Least Angle Regression (LARS) is a regression algorithm designed for high-dimensional data with more features than samples. It is an efficient alternative to traditional methods, particularly for datasets where the features are highly correlated. LARS incrementally selects features to include in the model, ensuring a parsimonious solution while maintaining interpretability.
Unlike Lasso, LARS does not impose a regularization penalty, but it provides a path of solutions where the coefficients are gradually adjusted. This makes it useful for exploring relationships between features and the target variable.
LARS may not be suitable in the following cases:
from sklearn.linear_model import Lars
lars = Lars()
lars.fit(X, y)
LARS Lasso is a variant of the Least Angle Regression (LARS) algorithm that incorporates L1 regularization (similar to Lasso regression). It is specifically designed to handle high-dimensional data where the number of features exceeds the number of samples. LARS Lasso efficiently computes the entire Lasso regularization path, making it an excellent choice for problems involving feature selection.
By combining the strengths of LARS and Lasso, this method provides a sparse solution where some coefficients are exactly zero, leading to improved interpretability and reduced overfitting.
LARS Lasso may not be suitable in the following scenarios:
from sklearn.linear_model import LassoLars
# alpha controls the strength of L1 regularization
lasso_lars = LassoLars(alpha=0.1)
lasso_lars.fit(X, y)
Orthogonal Matching Pursuit (OMP) is a greedy algorithm used for linear regression that iteratively selects features to include in the model. It is particularly effective for high-dimensional data where the number of features exceeds the number of samples. OMP works by identifying the feature most correlated with the current residual, updating the model, and repeating until a stopping criterion (e.g., number of non-zero coefficients or error threshold) is met.
OMP is similar in spirit to LARS, but at each step it recomputes the residual via an orthogonal projection onto the span of the previously selected features.
OMP might not be ideal in the following cases:
from sklearn.linear_model import OrthogonalMatchingPursuit
# n_nonzero_coefs controls the maximum number of features selected
omp = OrthogonalMatchingPursuit(n_nonzero_coefs=1)
omp.fit(X, y)
Bayesian Regression is a probabilistic approach to linear regression that incorporates prior knowledge or beliefs into the model through Bayesian inference. Instead of finding a single set of optimal coefficients, Bayesian regression provides a distribution over possible coefficients, allowing for a more nuanced understanding of uncertainty in the predictions.
Two commonly used Bayesian regression models in scikit-learn are Bayesian Ridge Regression and Automatic Relevance Determination (ARD), both of which place priors on the coefficients to control overfitting and provide interpretable results.
Bayesian regression may not be suitable in the following cases:
from sklearn.linear_model import BayesianRidge
bayesian_ridge = BayesianRidge()
bayesian_ridge.fit(X, y)
Logistic Regression is a supervised learning algorithm used for binary and multi-class classification problems. It models the probability of a categorical outcome based on one or more predictor variables using a logistic (sigmoid) function. The algorithm estimates the likelihood that a given input belongs to a particular class and outputs probabilities, which can be converted to class labels.
Unlike linear regression, logistic regression is specifically designed to handle classification tasks, making it a fundamental and widely used algorithm in machine learning.
Logistic Regression may not be suitable in the following situations:
from sklearn.linear_model import LogisticRegression
logistic_regression = LogisticRegression()
logistic_regression.fit(X, y)
# Make predictions
predictions = logistic_regression.predict(X)
print("Predictions:", predictions)
# Access probabilities
probabilities = logistic_regression.predict_proba(X)
print("Probabilities:", probabilities)
Generalized Linear Models (GLMs) extend linear regression to handle a wider range of response variable types (e.g., binary, count, or continuous data) by using a link function to relate the linear predictor to the response variable. GLMs are a flexible tool for regression analysis, making them useful for tasks beyond simple linear regression.
The most common types of GLMs include:
In scikit-learn, GLMs are available through dedicated estimators such as PoissonRegressor, GammaRegressor, and TweedieRegressor, along with specific variants like Logistic Regression.
GLMs may not be suitable in the following scenarios:
from sklearn.linear_model import PoissonRegressor
glm = PoissonRegressor(alpha=1.0) # Regularization strength controlled by alpha
glm.fit(X, y)
predictions = glm.predict(X)
Stochastic Gradient Descent (SGD) is an optimization algorithm often used for training linear models and neural networks. In machine learning, SGD can be used for a variety of regression and classification tasks by iteratively updating model parameters based on small batches of data, rather than the entire dataset. This makes it efficient for large-scale datasets.
In scikit-learn, SGD is implemented as a flexible tool for fitting linear classifiers, regressors, and support vector machines through the SGDClassifier and SGDRegressor estimators.
SGD may not be suitable in the following situations:
from sklearn.linear_model import SGDClassifier
sgd_classifier = SGDClassifier(loss='log_loss') # 'log_loss' corresponds to logistic regression
sgd_classifier.fit(X, y)
predictions = sgd_classifier.predict(X)
The Perceptron is one of the simplest types of artificial neural networks, designed for binary classification problems. It is a linear classifier that updates its weights iteratively based on misclassified examples. The Perceptron algorithm is particularly effective for linearly separable datasets but does not work well for non-linear problems.
The model predicts a class label using a simple decision rule based on a weighted sum of input features. If the weighted sum exceeds a threshold, the model outputs one class; otherwise, it outputs the other.
The Perceptron may not be suitable in the following scenarios:
from sklearn.linear_model import Perceptron
perceptron = Perceptron()
perceptron.fit(X, y)
Passive Aggressive algorithms are online learning algorithms designed for both classification and regression tasks. They are particularly effective for large-scale and real-time datasets. The name "Passive Aggressive" reflects their behavior: they remain passive if the prediction is correct but aggressively update the model parameters if the prediction is incorrect or the error is significant.
Passive Aggressive algorithms are margin-based, meaning they seek to adjust the model only enough to correct the current mistake, making them computationally efficient for streaming data or situations where the dataset grows incrementally.
Passive Aggressive algorithms may not be suitable in the following cases:
from sklearn.linear_model import PassiveAggressiveClassifier
pac = PassiveAggressiveClassifier()
pac.fit(X, y)
Quantile Regression is a type of regression analysis used to predict conditional quantiles (e.g., the median or other percentiles) of the target variable. Unlike ordinary least squares (OLS) regression, which estimates the conditional mean of the target variable, Quantile Regression models specific points of its conditional distribution, making it useful for datasets with heteroscedasticity or outliers.
By estimating different quantiles, Quantile Regression provides a more complete view of the relationship between features and the target variable, especially in cases where the variance of the target variable changes with predictors.
Quantile Regression may not be suitable in the following cases:
from sklearn.linear_model import QuantileRegressor
# `quantile` specifies the quantile to estimate (e.g., 0.5 for median)
# `alpha` is the regularization strength
quantile_regressor = QuantileRegressor(quantile=0.5, alpha=0.1)
quantile_regressor.fit(X, y)
Polynomial Regression is a technique used to model non-linear relationships between the independent and dependent variables by extending linear regression with polynomial terms. It transforms the original features into polynomial features of a specified degree, allowing the model to capture non-linear patterns while still being considered a linear model (in terms of coefficients).
For example, in a quadratic regression (degree 2), the model includes terms for squared features, in addition to the original features, enabling it to fit a parabolic curve.
Polynomial Regression may not be suitable in the following scenarios:
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
degree = 2 # Degree of the polynomial
polynomial_regression = make_pipeline(PolynomialFeatures(degree), LinearRegression())
polynomial_regression.fit(X, y)
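To make the feature expansion concrete, here is a minimal sketch (with a made-up single-feature array) showing how PolynomialFeatures of degree 2 turns each input x into the terms 1, x, and x²:
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
X_small = np.array([[1.0], [2.0], [3.0]])  # one original feature
poly = PolynomialFeatures(degree=2)
# Each row becomes [1, x, x^2]
print(poly.fit_transform(X_small))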
Linear Discriminant Analysis (LDA) is a classification algorithm that works by finding a linear combination of features that best separates two or more classes. It projects the data onto a lower-dimensional space, maximizing the separation between classes while minimizing the variance within each class. LDA is particularly effective when class distributions are Gaussian and share a common covariance structure.
LDA can also be used for dimensionality reduction, where it seeks to project the data onto the most discriminative directions.
LDA may not be suitable in the following situations:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
lda = LinearDiscriminantAnalysis()
lda.fit(X, y)
Quadratic Discriminant Analysis (QDA) is a classification algorithm that extends Linear Discriminant Analysis (LDA) by allowing each class to have its own covariance matrix. This makes QDA more flexible than LDA as it can model data with non-linear decision boundaries. QDA works particularly well when the class distributions are Gaussian but have different covariance structures.
QDA computes a quadratic decision surface, making it suitable for problems where the relationship between features and class labels is non-linear.
QDA may not be suitable in the following scenarios:
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
qda = QuadraticDiscriminantAnalysis()
qda.fit(X, y)
Kernel Ridge Regression (KRR) is an extension of Ridge Regression that uses kernel functions to model non-linear relationships between features and the target variable. By leveraging the kernel trick, KRR implicitly maps the input data into a higher-dimensional space where linear relationships can capture non-linear patterns. It combines the regularization of Ridge Regression with the flexibility of kernel methods.
Commonly used kernel functions include:
Kernel Ridge Regression may not be suitable in the following scenarios:
from sklearn.kernel_ridge import KernelRidge
# `kernel` specifies the type of kernel (e.g., 'linear', 'poly', 'rbf')
# `alpha` is the regularization strength
# `gamma` controls the kernel bandwidth for RBF or polynomial kernels
krr = KernelRidge(kernel='rbf', alpha=1.0, gamma=0.5)
krr.fit(X, y)
Support Vector Machines (SVM) are powerful supervised learning algorithms used for classification tasks. SVM works by finding the hyperplane that best separates the data into classes, maximizing the margin (distance) between the hyperplane and the nearest data points of each class, known as support vectors. This margin maximization helps SVM achieve robust generalization.
SVM can handle both linear and non-linear classification problems by using kernels to transform the input data into higher-dimensional spaces.
SVM classification may not be suitable in the following scenarios:
from sklearn.svm import SVC
# `kernel`: Specifies the kernel type ('linear', 'poly', 'rbf', 'sigmoid').
# `C`: Regularization parameter (higher values allow less slack for misclassified points).
# `gamma`: Kernel coefficient for 'rbf' and 'poly' kernels ('scale' adjusts automatically).
svm_classifier = SVC(kernel='rbf', C=1.0, gamma='scale')
svm_classifier.fit(X, y)
Support Vector Machines (SVM) can also be applied to regression tasks through Support Vector Regression (SVR). Unlike traditional regression, SVR aims to find a function that fits the data within a specified margin of tolerance, called the epsilon. SVR uses a similar principle to SVM classification, relying on support vectors and kernels to model relationships between features and the target variable.
SVR is particularly useful for handling non-linear regression problems by employing kernels to transform the input space.
SVR may not be suitable in the following scenarios:
from sklearn.svm import SVR
# `kernel`: Specifies the kernel type ('linear', 'poly', 'rbf', 'sigmoid').
# `C`: Regularization parameter (higher values allow less slack for deviations from the margin).
# `epsilon`: Specifies the margin of tolerance for fitting the data.
# `gamma`: Kernel coefficient for 'rbf' and 'poly' kernels ('scale' adjusts automatically).
svr = SVR(kernel='rbf', C=1.0, epsilon=0.1, gamma='scale')
svr.fit(X, y)
KDE is a non-parametric method that estimates the probability density function of a dataset. It smooths the data using a kernel function (e.g., Gaussian) and is useful for understanding the distribution of the data.
from sklearn.neighbors import KernelDensity
# `bandwidth` controls the smoothness of the density estimate
kde = KernelDensity(kernel='gaussian', bandwidth=0.5)
kde.fit(X)
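Once fitted, the density estimate is usually queried with score_samples, which returns log-densities; continuing from the snippet above (and assuming X is a 2-D array of samples):
import numpy as np
# score_samples returns the log of the estimated density at each query point
log_density = kde.score_samples(X)
print(np.exp(log_density)[:5])  # convert back to densities for the first few points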
Unsupervised Nearest Neighbors is a technique used to find the closest data points to a given sample without requiring labeled data. It is commonly used for clustering, anomaly detection, and density estimation. The algorithm works by measuring distances (e.g., Euclidean, Manhattan) between data points to identify their neighbors.
from sklearn.neighbors import NearestNeighbors
# `n_neighbors`: Number of nearest neighbors to find
# `algorithm`: Algorithm used for nearest neighbors search (e.g., 'auto', 'ball_tree', 'kd_tree', 'brute')
nn = NearestNeighbors(n_neighbors=2, algorithm='auto')
nn.fit(X)
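Continuing from the fitted object above, the kneighbors method returns the distances to and indices of each point's nearest neighbors (here queried on the training data itself):
# For every sample in X, find its 2 nearest neighbors among the fitted data
distances, indices = nn.kneighbors(X)
print("Neighbor indices:\n", indices)
print("Neighbor distances:\n", distances)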
Nearest Neighbors Classification is a supervised learning algorithm that predicts the class of a data point based on the majority class of its nearest neighbors. It uses distance metrics such as Euclidean or Manhattan distance to determine which training samples are closest to the input sample.
This method is simple yet effective for many classification tasks, particularly when the decision boundary between classes is non-linear.
from sklearn.neighbors import KNeighborsClassifier
knn_classifier = KNeighborsClassifier(n_neighbors=3)
knn_classifier.fit(X, y)
Nearest Neighbors Regression is a supervised learning algorithm that predicts the target value of a data point based on the average (or weighted average) of the target values of its nearest neighbors. It is particularly effective for non-linear regression problems where the relationship between features and the target variable may be complex.
The distance metric (e.g., Euclidean, Manhattan) determines the proximity of data points, and the prediction is computed from the neighbors' target values.
from sklearn.neighbors import KNeighborsRegressor
# `n_neighbors`: Number of neighbors to consider
# `weights`: Weighting function ('uniform' for equal weights, 'distance' for inverse distance)
knn_regressor = KNeighborsRegressor(n_neighbors=2, weights='uniform')
knn_regressor.fit(X, y)
The Nearest Centroid Classifier is a simple yet effective classification algorithm that assigns a class label to a data point based on the nearest class centroid. The centroid of a class is the mean of all data points belonging to that class in feature space. This algorithm is computationally efficient and works well for linearly separable datasets.
Unlike k-Nearest Neighbors, which uses multiple neighbors, this method relies on the centroid of each class for classification, making it faster and simpler.
from sklearn.neighbors import NearestCentroid
nearest_centroid = NearestCentroid()
nearest_centroid.fit(X, y)
Neighborhood Components Analysis (NCA) is a supervised dimensionality reduction technique that learns a feature transformation to optimize k-Nearest Neighbors (k-NN) classification. It projects the data into a lower-dimensional space where the distance between data points is more meaningful for classification. NCA aims to maximize the probability of correctly classifying a point based on its nearest neighbors.
NCA is particularly useful for improving the performance of k-NN classifiers on datasets where the original feature space does not represent class separability well.
from sklearn.neighbors import NeighborhoodComponentsAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
nca = NeighborhoodComponentsAnalysis(random_state=42)
knn = KNeighborsClassifier(n_neighbors=3)
pipeline = Pipeline([('nca', nca), ('knn', knn)])
pipeline.fit(X_train, y_train)
Gaussian Process Regression (GPR) is a non-parametric, Bayesian regression method that models the relationship between input features and a target variable using a Gaussian Process. GPR provides not only predictions but also uncertainty estimates for those predictions. This makes it especially useful in tasks where understanding uncertainty is as important as the predictions themselves.
GPR defines a prior over functions, and after observing data, it updates this prior to a posterior distribution. Predictions are made based on the posterior mean and variance.
GPR may not be suitable for large datasets, since its computational complexity is O(n^3), where n is the number of training samples.
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel as C
kernel = C(1.0, (1e-3, 1e3)) * RBF(1.0, (1e-3, 1e3))
# `n_restarts_optimizer`: Number of times to restart the optimizer for hyperparameter tuning
# `alpha`: Value added to the diagonal of the kernel matrix for numerical stability
gpr = GaussianProcessRegressor(kernel=kernel, n_restarts_optimizer=10, alpha=1e-10)
gpr.fit(X, y)
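The uncertainty estimates mentioned above are obtained by asking predict for the posterior standard deviation alongside the mean; continuing from the fitted model:
# return_std=True yields the predictive standard deviation for each sample
y_mean, y_std = gpr.predict(X, return_std=True)
print("Predictive mean:", y_mean[:5])
print("Predictive std:", y_std[:5])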
Gaussian Process Classification (GPC) is a non-parametric, probabilistic classification method that uses Gaussian Processes to model the posterior distribution over the latent functions defining class probabilities. GPC outputs class probabilities along with predictions, making it useful when uncertainty quantification is required for classification tasks.
Like Gaussian Process Regression (GPR), GPC leverages kernels to model non-linear relationships between the input features and the target classes.
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF, ConstantKernel as C
kernel = C(1.0, (1e-3, 1e3)) * RBF(1.0, (1e-3, 1e3))
# `n_restarts_optimizer`: Number of optimizer restarts for hyperparameter tuning
# `max_iter_predict`: Maximum number of iterations for prediction convergence
gpc = GaussianProcessClassifier(kernel=kernel, n_restarts_optimizer=10, max_iter_predict=100)
gpc.fit(X, y)
Naive Bayes models are based on Bayes' theorem and assume that features are conditionally independent given the class label.
Gaussian Naive Bayes is a probabilistic classification algorithm based on Bayes' Theorem. It assumes that the features follow a Gaussian (normal) distribution and that they are conditionally independent given the class label. This algorithm is simple, efficient, and often performs well even with limited training data.
The Gaussian assumption makes it particularly suited for continuous features, where the likelihood of a feature is modeled using a Gaussian distribution.
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()
gnb.fit(X, y)
Multinomial Naive Bayes is a variant of the Naive Bayes algorithm designed for classification tasks involving discrete, count-based features. It is commonly used for text classification and natural language processing (NLP) tasks, where feature vectors represent word counts or term frequencies.
The algorithm applies Bayes' Theorem, assuming that features are conditionally independent given the class label. It calculates the likelihood of a class based on the frequency of observed features.
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(documents)
mnb = MultinomialNB()
mnb.fit(X, labels)
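New text must pass through the same fitted vectorizer before prediction; a short continuation of the snippet above (here `documents` and `labels` are assumed to be a list of raw text strings and their class labels, and `new_docs` is a hypothetical input):
new_docs = ["an example document to classify"]
X_new = vectorizer.transform(new_docs)  # reuse the vocabulary learned during fit
print(mnb.predict(X_new))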
Complement Naive Bayes is a variant of Multinomial Naive Bayes designed to address class imbalance issues. It modifies the traditional Multinomial Naive Bayes algorithm by estimating probabilities from the complement of each class, focusing on reducing the impact of imbalanced class distributions. This makes it particularly effective for text classification tasks where one class dominates the dataset.
from sklearn.naive_bayes import ComplementNB
cnb = ComplementNB()
cnb.fit(X, labels)
Bernoulli Naive Bayes is a probabilistic classification algorithm based on Bayes' Theorem. It is specifically designed for binary feature data, where each feature represents the presence or absence of a particular property or attribute. The model assumes conditional independence among features given the class label and calculates probabilities for binary-valued features.
This variant is particularly useful in text classification tasks with binary term occurrence (e.g., presence/absence of words in a document).
from sklearn.naive_bayes import BernoulliNB
bnb = BernoulliNB()
bnb.fit(X, y)
Categorical Naive Bayes is a variant of Naive Bayes designed for categorical feature data. It assumes that each feature has its own categorical distribution conditioned on the class label. This algorithm is particularly useful when dealing with categorical features, such as ordinal or nominal data.
from sklearn.naive_bayes import CategoricalNB
cnb = CategoricalNB()
cnb.fit(X, y)
Decision Tree Classification is a supervised learning algorithm that splits data into subsets based on feature values to create a tree-like structure. Each internal node represents a decision based on a feature, and each leaf node represents a class label. Decision trees are easy to interpret and handle both categorical and numerical data.
from sklearn.tree import DecisionTreeClassifier
dt_classifier = DecisionTreeClassifier(max_depth=3)
dt_classifier.fit(X, y)
Decision Tree Regression is a supervised learning algorithm that predicts a continuous target variable by recursively partitioning the data space into regions and fitting a simple model (e.g., a constant) within each region. It is capable of capturing non-linear relationships.
from sklearn.tree import DecisionTreeRegressor
dt_regressor = DecisionTreeRegressor(max_depth=3)
dt_regressor.fit(X, y)
Gradient-Boosted Trees is an ensemble method that builds a series of decision trees, where each tree corrects the errors of the previous one. It uses gradient descent to minimize a loss function and is highly effective for both regression and classification tasks.
from sklearn.ensemble import GradientBoostingClassifier
gb_classifier = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3)
gb_classifier.fit(X, y)
A Random Forest is an ensemble of decision trees, each trained on a random subset of the data and features. The final prediction is made by averaging (for regression) or majority voting (for classification).
from sklearn.ensemble import RandomForestClassifier
rf_classifier = RandomForestClassifier(n_estimators=100, max_depth=3, random_state=42)
rf_classifier.fit(X, y)
The Bagging Meta-Estimator is an ensemble method that fits multiple base models (e.g., decision trees) on random subsets of the training data and aggregates their predictions. It reduces variance and helps prevent overfitting.
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
base_estimator = DecisionTreeClassifier(max_depth=3)
bagging_classifier = BaggingClassifier(estimator=base_estimator, n_estimators=10, random_state=42)
bagging_classifier.fit(X, y)
Voting Classifier is an ensemble learning technique that combines predictions from multiple models (classifiers) to improve overall classification performance. It supports two modes of aggregation: hard voting, which uses the majority class prediction, and soft voting, which averages predicted probabilities.
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
log_reg = LogisticRegression()
svm = SVC(probability=True)
dt = DecisionTreeClassifier()
voting_clf = VotingClassifier(
estimators=[('lr', log_reg), ('svm', svm), ('dt', dt)],
voting='soft'
)
voting_clf.fit(X, y)
Voting Regressor is an ensemble method that combines predictions from multiple regression models. It takes the average of individual model predictions to provide a more robust and stable regression estimate.
from sklearn.ensemble import VotingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.svm import SVR
lr = LinearRegression()
dt = DecisionTreeRegressor()
svr = SVR()
voting_reg = VotingRegressor(
estimators=[('lr', lr), ('dt', dt), ('svr', svr)]
)
voting_reg.fit(X, y)
Stacked Generalization, or Stacking, is an ensemble learning technique that combines predictions from multiple base models (level-0 models) using a meta-model (level-1 model). The meta-model learns to optimize the final predictions by considering the outputs of the base models as its inputs.
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
base_models = [
('svc', SVC(probability=True)),
('dt', DecisionTreeClassifier())
]
meta_model = LogisticRegression()
stack_clf = StackingClassifier(estimators=base_models, final_estimator=meta_model)
stack_clf.fit(X, y)
AdaBoost (Adaptive Boosting) is an ensemble technique that combines multiple weak classifiers, typically decision trees, to create a strong classifier. It assigns higher weights to misclassified instances in each iteration, forcing subsequent classifiers to focus on these harder-to-classify samples.
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
base_model = DecisionTreeClassifier(max_depth=1)
adaboost = AdaBoostClassifier(estimator=base_model, n_estimators=50, learning_rate=1.0)
adaboost.fit(X, y)
Multi-layer Perceptron (MLP) is a supervised learning algorithm that uses feedforward neural networks. It consists of multiple layers of neurons, including input, hidden, and output layers. MLP learns complex non-linear patterns by optimizing weights using backpropagation.
from sklearn.neural_network import MLPClassifier
mlp = MLPClassifier(hidden_layer_sizes=(100,), activation='relu', solver='adam', max_iter=300)
mlp.fit(X, y)
A Gaussian Mixture Model (GMM) is a probabilistic model that assumes data is generated from a mixture of multiple Gaussian distributions. It is commonly used for clustering, density estimation, and anomaly detection.
from sklearn.mixture import GaussianMixture
gmm = GaussianMixture(n_components=3, covariance_type='full', max_iter=100)
gmm.fit(X)
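Continuing from the fitted mixture, predict gives hard cluster assignments while score_samples returns per-sample log-likelihoods, which can be thresholded for simple anomaly detection:
# Index of the most likely Gaussian component for each sample
labels = gmm.predict(X)
# Log-likelihood of each sample under the fitted mixture; unusually low values
# can be flagged as potential anomalies
log_likelihood = gmm.score_samples(X)
print(labels[:10])
print(log_likelihood[:10])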
Variational Bayesian Gaussian Mixture (VBGM) is a probabilistic model similar to Gaussian Mixture Models (GMM), but with a Bayesian framework that introduces prior distributions over model parameters. This allows VBGM to automatically infer the number of components by controlling the complexity of the model.
from sklearn.mixture import BayesianGaussianMixture
vbgm = BayesianGaussianMixture(n_components=10, covariance_type='full', max_iter=100)
vbgm.fit(X)
Isomap (Isometric Mapping) is a non-linear dimensionality reduction technique that preserves geodesic distances between all points. It builds a graph of nearest neighbors and computes low-dimensional embeddings that maintain the manifold's structure.
from sklearn.manifold import Isomap
isomap = Isomap(n_neighbors=5, n_components=2)
X_transformed = isomap.fit_transform(X)
Locally Linear Embedding (LLE) is a non-linear dimensionality reduction technique that preserves local relationships among data points. It assumes that each point and its neighbors lie on a locally linear patch of the manifold and maps these patches into a lower-dimensional space.
from sklearn.manifold import LocallyLinearEmbedding
lle = LocallyLinearEmbedding(n_neighbors=10, n_components=2)
X_transformed = lle.fit_transform(X)
Modified Locally Linear Embedding (MLLE) is an enhancement of LLE that addresses sensitivity to noise and poor embeddings in certain situations. It introduces regularization and constraints to improve stability and robustness.
from sklearn.manifold import LocallyLinearEmbedding
mlle = LocallyLinearEmbedding(n_neighbors=10, n_components=2, method='modified')
X_transformed = mlle.fit_transform(X)
Hessian Eigenmapping, also known as Hessian Locally Linear Embedding, is a non-linear dimensionality reduction technique that focuses on preserving the local curvature of a manifold. It uses the Hessian operator to capture the local geometry and map the data to a lower-dimensional space.
from sklearn.manifold import LocallyLinearEmbedding
hessian = LocallyLinearEmbedding(n_neighbors=10, n_components=2, method='hessian')
X_transformed = hessian.fit_transform(X)
Spectral Embedding is a graph-based dimensionality reduction technique that uses the Laplacian of the similarity graph to compute embeddings. It is particularly effective for clustering and manifold learning tasks.
from sklearn.manifold import SpectralEmbedding
spectral = SpectralEmbedding(n_components=2)
X_transformed = spectral.fit_transform(X)
Local Tangent Space Alignment (LTSA) is a non-linear dimensionality reduction technique that extends Locally Linear Embedding. LTSA aligns local tangent spaces of the manifold to preserve the global structure in the lower-dimensional representation.
from sklearn.manifold import LocallyLinearEmbedding
ltsa = LocallyLinearEmbedding(n_neighbors=10, n_components=2, method='ltsa')
X_transformed = ltsa.fit_transform(X)
Multi-Dimensional Scaling (MDS) is a dimensionality reduction technique that preserves pairwise distances between data points in the lower-dimensional embedding. It is useful for visualizing high-dimensional data and exploring underlying structures.
from sklearn.manifold import MDS
mds = MDS(n_components=2, metric=True)
X_transformed = mds.fit_transform(X)
t-Distributed Stochastic Neighbor Embedding (t-SNE) is a non-linear dimensionality reduction technique that emphasizes preserving local relationships in high-dimensional data. It is widely used for data visualization by projecting data into two or three dimensions.
from sklearn.manifold import TSNE
tsne = TSNE(n_components=2, perplexity=30, learning_rate=200, random_state=42)
X_transformed = tsne.fit_transform(X)
K-Means is a partition-based clustering algorithm that divides data into k clusters by minimizing the within-cluster variance. It iteratively assigns data points to the nearest cluster centroid and updates the centroids based on the assignments.
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=3, random_state=42)
kmeans.fit(X)
Affinity Propagation is a clustering algorithm that identifies exemplars (representative points) by considering pairwise similarities between data points. It does not require specifying the number of clusters in advance and relies on a message-passing approach.
from sklearn.cluster import AffinityPropagation
affinity = AffinityPropagation(random_state=42)
affinity.fit(X)
Mean Shift is a clustering algorithm that iteratively shifts data points towards the mode of the density estimated from a kernel function. It does not require specifying the number of clusters and automatically detects the number based on the data.
from sklearn.cluster import MeanShift
mean_shift = MeanShift()
mean_shift.fit(X)
Spectral Clustering is a graph-based clustering algorithm that partitions data by leveraging the eigenvectors of the Laplacian matrix of the similarity graph. It is particularly effective for clustering non-convex and non-linearly separable clusters.
from sklearn.cluster import SpectralClustering
spectral = SpectralClustering(n_clusters=3, affinity='nearest_neighbors', random_state=42)
spectral.fit(X)
Hierarchical Clustering builds a hierarchy of clusters using either a bottom-up (agglomerative) or top-down (divisive) approach. It does not require specifying the number of clusters in advance and produces a dendrogram for visualization.
from sklearn.cluster import AgglomerativeClustering
hierarchical = AgglomerativeClustering(n_clusters=3, linkage='ward')
hierarchical.fit(X)
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a density-based clustering algorithm that groups data points into dense regions separated by low-density regions. It can identify clusters of arbitrary shapes and label outliers as noise.
from sklearn.cluster import DBSCAN
dbscan = DBSCAN(eps=0.5, min_samples=5)
dbscan.fit(X)
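After fitting, cluster assignments are available in the labels_ attribute, with noise points marked as -1; a short sketch of how that is typically inspected:
import numpy as np
labels = dbscan.labels_
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)  # exclude the noise label
n_noise = int(np.sum(labels == -1))
print(f"Estimated clusters: {n_clusters}, noise points: {n_noise}")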
HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise) is an extension of DBSCAN that performs hierarchical clustering and selects a flat clustering from the hierarchy. It is more robust to varying density and does not require specifying a fixed epsilon parameter.
from sklearn.cluster import HDBSCAN
hdbscan = HDBSCAN(min_cluster_size=5)
hdbscan.fit(X)
OPTICS (Ordering Points To Identify the Clustering Structure) is a density-based clustering algorithm similar to DBSCAN but capable of identifying clusters with varying densities. It builds a reachability plot to visualize the clustering structure and determine appropriate cluster boundaries.
from sklearn.cluster import OPTICS
optics = OPTICS(min_samples=5, xi=0.05, min_cluster_size=0.1)
optics.fit(X)
BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) is a hierarchical clustering algorithm designed for large datasets. It incrementally constructs a clustering feature tree (CF tree) and performs clustering based on memory constraints.
from sklearn.cluster import Birch
birch = Birch(n_clusters=3)
birch.fit(X)
Principal Component Analysis (PCA) is a linear dimensionality reduction technique that transforms data into a lower-dimensional space by projecting it onto the directions of maximum variance (principal components). It is commonly used for feature reduction and visualization.
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
X_transformed = pca.fit_transform(X)
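The share of variance captured by each principal component is available after fitting and is a common way to decide how many components to keep; continuing from the snippet above:
# Fraction of the total variance explained by each retained component
print("Explained variance ratio:", pca.explained_variance_ratio_)
print("Total variance captured:", pca.explained_variance_ratio_.sum())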
Kernel PCA is a non-linear extension of PCA that uses kernel functions to project data into a high-dimensional space before performing PCA. This allows it to capture non-linear structures in the data.
from sklearn.decomposition import KernelPCA
kpca = KernelPCA(n_components=2, kernel='rbf', gamma=15)
X_transformed = kpca.fit_transform(X)
Truncated Singular Value Decomposition (SVD) is a linear dimensionality reduction technique that reduces the number of features by decomposing the data matrix into its singular values and vectors. It is widely used in text mining and latent semantic analysis.
from sklearn.decomposition import TruncatedSVD
svd = TruncatedSVD(n_components=2, random_state=42)
X_transformed = svd.fit_transform(X)
Dictionary Learning is a sparse representation technique that learns a dictionary of basis vectors from the data. Each data point is represented as a sparse linear combination of these basis vectors. It is commonly used in signal processing and image denoising.
from sklearn.decomposition import DictionaryLearning
dict_learning = DictionaryLearning(n_components=2, alpha=1, max_iter=100, random_state=42)
X_transformed = dict_learning.fit_transform(X)
Factor Analysis is a statistical technique used to model observed variables as linear combinations of latent factors plus noise. It assumes that the data covariance can be explained by a lower-dimensional latent structure, making it useful for exploratory data analysis.
from sklearn.decomposition import FactorAnalysis
factor_analysis = FactorAnalysis(n_components=2, random_state=42)
X_transformed = factor_analysis.fit_transform(X)
Independent Component Analysis (ICA) is a technique that separates a multivariate signal into statistically independent, non-Gaussian components. It is widely used in blind source separation, such as separating mixed audio signals or removing artifacts from EEG data.
from sklearn.decomposition import FastICA
ica = FastICA(n_components=2, random_state=42)
X_transformed = ica.fit_transform(X)
Non-Negative Matrix Factorization (NMF/NNMF) is a dimensionality reduction technique that decomposes a non-negative data matrix into two lower-dimensional non-negative matrices. It is particularly useful for extracting interpretable latent features.
from sklearn.decomposition import NMF
nmf = NMF(n_components=2, random_state=42)
X_transformed = nmf.fit_transform(X)
Latent Dirichlet Allocation (LDA) is a probabilistic generative model commonly used for topic modeling. It assumes that documents are mixtures of topics, and each topic is a distribution over words.
from sklearn.decomposition import LatentDirichletAllocation
lda = LatentDirichletAllocation(n_components=2, random_state=42)
lda.fit(X)
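The per-document topic mixtures described above are obtained with transform, and the per-topic word weights via components_ (continuing from the snippet, and assuming X is a document-term count matrix such as the output of CountVectorizer):
# Each row is a document's distribution over the learned topics
doc_topic = lda.transform(X)
print(doc_topic[:3])
# components_ holds the (unnormalized) word weights for each topic
print("Topic-word matrix shape:", lda.components_.shape)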
Histograms are a basic non-parametric density estimation technique. They divide the data range into bins and count the number of observations within each bin to estimate the probability density.
import numpy as np
import matplotlib.pyplot as plt
plt.hist(X, bins=10, density=True)
plt.show()
Kernel Density Estimation (KDE) is a non-parametric method for estimating the probability density function of a random variable. It uses a kernel function, typically Gaussian, to smooth the density estimate.
from sklearn.neighbors import KernelDensity
kde = KernelDensity(kernel='gaussian', bandwidth=1.0)
kde.fit(X)
Restricted Boltzmann Machines (RBMs) are generative neural network models that learn a joint distribution over the input data and hidden features. They are commonly used for dimensionality reduction, feature learning, and as building blocks for deep belief networks.
from sklearn.neural_network import BernoulliRBM
rbm = BernoulliRBM(n_components=2, learning_rate=0.01, n_iter=100, random_state=42)
rbm.fit(X)
Scikit-learn’s vast range of models provides flexibility for tackling diverse machine learning problems. From regression and classification to clustering and dimensionality reduction, the library ensures there’s a tool for every scenario. By understanding the strengths and limitations of each model, you can select the most suitable one for your dataset and objectives.