Now that you've learned about several fundamental probability distributions, let's put that knowledge into practice. These exercises will help you solidify your understanding of when to use specific distributions, how to calculate probabilities associated with them, and how to work with them using Python. Remember, distributions are the building blocks for modeling uncertainty in data, a frequent task in machine learning.
Exercise 1: Identifying the Right Distribution
For each scenario below, determine which probability distribution (Bernoulli, Binomial, Uniform, or Normal) best describes the random variable mentioned. Explain your reasoning briefly.
- Scenario: A single user visits a webpage. The random variable is whether the user clicks on an advertisement (Yes/No).
- Hint: Think about the number of trials and the possible outcomes.
- Scenario: You survey 100 students and ask if they prefer online or in-person classes. The random variable is the number of students out of 100 who prefer online classes.
- Hint: Consider multiple trials, each with two outcomes, and you're counting successes.
- Scenario: A random number generator produces numbers where any value between 0.0 and 1.0 is equally likely. The random variable is the next number generated.
- Hint: Focus on the equal likelihood across a continuous range.
- Scenario: You measure the height of adult males in a large city. The random variable is the height of a randomly selected male.
- Hint: Think about natural phenomena and how measurements often cluster around an average.
Answers and Reasoning:
- Bernoulli Distribution: This scenario involves a single trial (one user visiting) with two possible outcomes (click or no click). The Bernoulli distribution models the probability of success (e.g., clicking) in a single trial.
- Binomial Distribution: This involves a fixed number of independent trials (100 students). Each trial has two outcomes (prefer online or in-person), and the probability of preferring online is assumed constant for each student. The Binomial distribution models the number of successes in a fixed number of Bernoulli trials.
- Uniform Distribution: The key here is "equally likely" over a continuous interval [0.0, 1.0]. The Uniform distribution assigns equal probability density to all outcomes within a specified range.
- Normal Distribution: Physical measurements like human height often follow a bell-shaped curve, where most values cluster around the average, and values further from the average become less likely. The Normal (Gaussian) distribution is commonly used to model such phenomena.
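To make the mapping from scenario to distribution concrete, here is a minimal sketch that draws one value for each scenario using only Python's standard library. The parameter values (a 10% click rate, a 60% preference for online classes, and heights with a mean of 175 cm and a standard deviation of 7 cm) are illustrative assumptions, not values given in the exercise, and the helper names `bernoulli` and `binomial` are introduced here for illustration.

```python
import random

rng = random.Random(42)  # fixed seed so the run is repeatable

# Assumed, illustrative parameters (not given in the scenarios)
P_CLICK = 0.10           # chance a single visitor clicks the ad
P_ONLINE = 0.60          # chance a single student prefers online classes
HEIGHT_MEAN_CM = 175.0
HEIGHT_SD_CM = 7.0


def bernoulli(p):
    """One trial: 1 (success) with probability p, otherwise 0."""
    return 1 if rng.random() < p else 0


def binomial(n, p):
    """Number of successes in n independent Bernoulli(p) trials."""
    return sum(bernoulli(p) for _ in range(n))


click = bernoulli(P_CLICK)                         # Scenario 1: Bernoulli
online_count = binomial(100, P_ONLINE)             # Scenario 2: Binomial
next_number = rng.uniform(0.0, 1.0)                # Scenario 3: Uniform
height = rng.gauss(HEIGHT_MEAN_CM, HEIGHT_SD_CM)   # Scenario 4: Normal

print(click, online_count, round(next_number, 3), round(height, 1))
```

Notice that the Binomial sampler is built from repeated Bernoulli trials, which mirrors the relationship described in the answers above.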
Exercise 2: Calculating Binomial Probabilities
Imagine a quality control process where the probability of a manufactured widget being defective is p=0.05. You inspect a batch of n=10 widgets. Let X be the random variable representing the number of defective widgets in the batch.
- What is the probability that exactly one widget is defective (P(X=1))?
- What is the probability that at most one widget is defective (P(X≤1))?
Solution Approach:
This scenario fits the Binomial distribution because:
- There's a fixed number of trials (n=10 widgets).
- Each trial is independent (assuming one widget's defectiveness doesn't affect others).
- Each trial has two outcomes (defective or not defective).
- The probability of success (being defective) is constant (p=0.05).
The Probability Mass Function (PMF) for the Binomial distribution gives the probability of observing exactly k successes in n trials:
$$P(X=k) = \binom{n}{k} p^{k} (1-p)^{n-k}$$
where $\binom{n}{k} = \frac{n!}{k!\,(n-k)!}$ is the binomial coefficient, representing the number of ways to choose k successes from n trials.
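The PMF translates almost directly into code. The following is a minimal sketch using only the standard library (`math.comb` requires Python 3.8 or later); `binomial_pmf` is a helper name introduced here for illustration, not a library function.

```python
from math import comb


def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


n, p = 10, 0.05
p_exactly_one = binomial_pmf(1, n, p)
print(f"P(X=1) = {p_exactly_one:.4f}")

# Sanity check: the PMF summed over every possible k must equal 1
total = sum(binomial_pmf(k, n, p) for k in range(n + 1))
print(f"Sum over all k = {total:.6f}")
```

Checking that the probabilities sum to 1 is a quick way to catch a mistake in a hand-written PMF.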
Calculations:
- Probability of exactly one defective widget (k=1):
  - n=10, k=1, p=0.05
  - $\binom{10}{1} = \frac{10!}{1!\,(10-1)!} = \frac{10}{1} = 10$
  - $P(X=1) = \binom{10}{1}(0.05)^{1}(1-0.05)^{10-1} = 10 \times 0.05 \times (0.95)^{9}$
  - $P(X=1) \approx 10 \times 0.05 \times 0.6302 = 0.3151$
So, there's about a 31.5% chance that exactly one widget is defective.
- Probability of at most one defective widget (P(X≤1)):
  - This means either zero widgets are defective (k=0) or exactly one widget is defective (k=1).
  - P(X≤1)=P(X=0)+P(X=1)
  - First, calculate P(X=0):
    - $\binom{10}{0} = \frac{10!}{0!\,(10-0)!} = 1$
    - $P(X=0) = \binom{10}{0}(0.05)^{0}(1-0.05)^{10-0} = 1 \times 1 \times (0.95)^{10}$
    - $P(X=0) \approx 0.5987$
  - Now add P(X=0) and P(X=1):
    - P(X≤1)≈0.5987+0.3151=0.9138
So, there's about a 91.4% chance that one or fewer widgets are defective in the batch.
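The "at most one" calculation is a cumulative sum of PMF values, which can be sketched in a few lines of standard-library Python. The helper names `binomial_pmf` and `binomial_cdf` are introduced here for illustration. The sketch also applies the complement rule to get the probability of two or more defects, an extra quantity that is not asked for in the exercise.

```python
from math import comb


def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def binomial_cdf(k, n, p):
    """P(X <= k): add up the PMF for 0, 1, ..., k."""
    return sum(binomial_pmf(i, n, p) for i in range(k + 1))


n, p = 10, 0.05
p_at_most_one = binomial_cdf(1, n, p)
p_two_or_more = 1 - p_at_most_one  # complement rule

print(f"P(X<=1) = {p_at_most_one:.4f}")
print(f"P(X>=2) = {p_two_or_more:.4f}")
```

Because the code adds unrounded terms, its result rounds to 0.9139, whereas adding the two hand-rounded values gives 0.9138. The gap comes purely from rounding the intermediate results.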
Using Python (Optional):
You can verify these using SciPy:
import scipy.stats as stats
n = 10
p = 0.05
# Probability of exactly k=1 success
prob_k1 = stats.binom.pmf(k=1, n=n, p=p)
print(f"P(X=1) = {prob_k1:.4f}") # Output: P(X=1) = 0.3151
# Probability of at most k=1 successes (using Cumulative Distribution Function - CDF)
prob_k_le_1 = stats.binom.cdf(k=1, n=n, p=p)
print(f"P(X<=1) = {prob_k_le_1:.4f}") # Output: P(X<=1) = 0.9139
# Note: Slight difference due to rounding in manual calculation
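Another way to check these answers is by simulation: inspect many imaginary batches and count how often each event occurs. The sketch below uses only the standard library; the number of simulated batches (100,000) and the seed are arbitrary choices made for this illustration.

```python
import random

rng = random.Random(0)   # fixed seed for repeatability
n, p = 10, 0.05
num_batches = 100_000    # assumed simulation size

exactly_one = 0
at_most_one = 0
for _ in range(num_batches):
    # Each widget is defective with probability p, independently of the others
    defects = sum(rng.random() < p for _ in range(n))
    if defects == 1:
        exactly_one += 1
    if defects <= 1:
        at_most_one += 1

est_p1 = exactly_one / num_batches
est_le1 = at_most_one / num_batches
print(f"Simulated P(X=1)  ~ {est_p1:.3f}")
print(f"Simulated P(X<=1) ~ {est_le1:.3f}")
```

The estimates should land close to the exact values, and they get closer as the number of simulated batches grows.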
Exercise 3: Working with the Normal Distribution
Assume that the scores on a standardized test are normally distributed with a mean (μ) of 1000 and a standard deviation (σ) of 150. Let X be the score of a randomly selected student.
- What is the probability that a student scores below 850 (P(X<850))?
- What is the probability that a student scores between 900 and 1100 (P(900<X<1100))?
Solution Approach:
For the Normal distribution, we calculate probabilities by finding the area under the Probability Density Function (PDF) curve. The Normal PDF has no closed-form antiderivative, so instead of integrating by hand we typically convert the score (X) into a standard score (Z-score) and use a standard normal table or software functions (like SciPy's cdf).
The Z-score tells us how many standard deviations a value is away from the mean:
$$Z = \frac{X - \mu}{\sigma}$$
The standard normal distribution has μ=0 and σ=1.
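If no table is at hand, the standard normal CDF can be computed from the error function in Python's `math` module, using the identity Φ(z) = ½(1 + erf(z/√2)). The sketch below is a minimal illustration; the helper names are introduced here, and the score of 1150 is an arbitrary example value that is not part of the exercise.

```python
from math import erf, sqrt


def z_score(x, mu, sigma):
    """How many standard deviations x lies from the mean."""
    return (x - mu) / sigma


def standard_normal_cdf(z):
    """P(Z < z) for the standard normal, via the error function."""
    return 0.5 * (1 + erf(z / sqrt(2)))


# Illustrative value: a score of 1150 with mu=1000, sigma=150
z = z_score(1150, mu=1000, sigma=150)
prob = standard_normal_cdf(z)
print(f"Z = {z:.2f}, P(Z < z) = {prob:.4f}")
```

A score one standard deviation above the mean gives Z = 1, and roughly 84% of the distribution lies below it.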
Calculations:
- Probability of scoring below 850 (P(X<850)):
  - Convert X=850 to a Z-score: Z=(850−1000)/150=−150/150=−1.0
  - We need to find P(Z<−1.0). This corresponds to the area under the standard normal curve to the left of Z=−1.0.
  - Using a standard normal table or scipy.stats.norm.cdf(), we find this probability.
  - P(Z<−1.0)≈0.1587
So, there's about a 15.9% chance a student scores below 850.
- Probability of scoring between 900 and 1100 (P(900<X<1100)):
  - Convert both scores to Z-scores:
    - Z1=(900−1000)/150=−100/150≈−0.67
    - Z2=(1100−1000)/150=100/150≈0.67
  - We need P(−0.67<Z<0.67). This is the area under the standard normal curve between Z=−0.67 and Z=0.67.
  - We calculate this as P(Z<0.67)−P(Z<−0.67).
  - Using a table or software:
    - P(Z<0.67)≈0.7486
    - P(Z<−0.67)≈0.2514
  - P(900<X<1100)=P(−0.67<Z<0.67)≈0.7486−0.2514=0.4972
So, there's about a 49.7% chance a student scores between 900 and 1100.
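The hand calculation rounds the Z-scores to two decimals because printed tables are indexed that way. The sketch below, using `statistics.NormalDist` from the standard library (Python 3.8+), compares the result for the rounded Z-score with the result for the unrounded value 100/150 to show how much that rounding matters.

```python
from statistics import NormalDist

std_normal = NormalDist()        # mean 0, standard deviation 1
z_exact = (1100 - 1000) / 150    # 0.6666...
z_rounded = round(z_exact, 2)    # 0.67, as used with a printed table

# By symmetry, the lower bound is the negative of the upper bound
p_exact = std_normal.cdf(z_exact) - std_normal.cdf(-z_exact)
p_rounded = std_normal.cdf(z_rounded) - std_normal.cdf(-z_rounded)

print(f"Z rounded to two decimals: {p_rounded:.4f}")
print(f"Z unrounded:               {p_exact:.4f}")
```

The unrounded Z-score gives a probability close to 0.495, about 0.002 lower than the table-based answer. For most practical purposes the two agree, but software results will not match a table lookup to the last digit.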
Using Python (Optional):
import scipy.stats as stats
mu = 1000
sigma = 150
# P(X < 850)
prob_below_850 = stats.norm.cdf(x=850, loc=mu, scale=sigma)
print(f"P(X < 850) = {prob_below_850:.4f}") # Output: P(X < 850) = 0.1587
# P(900 < X < 1100) = P(X < 1100) - P(X < 900)
prob_below_1100 = stats.norm.cdf(x=1100, loc=mu, scale=sigma)
prob_below_900 = stats.norm.cdf(x=900, loc=mu, scale=sigma)
prob_between = prob_below_1100 - prob_below_900
print(f"P(900 < X < 1100) = {prob_between:.4f}") # Output: P(900 < X < 1100) = 0.4950
# Note: Slight difference due to more precise Z-score calculation in SciPy
Figure: The shaded area represents the probability P(900<X<1100) for a Normal distribution with μ=1000 and σ=150.
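Since a probability for a continuous variable is an area under the PDF, the shaded region can also be approximated numerically. The following sketch applies the trapezoidal rule to the Normal PDF using only the standard library. The helper names and the number of steps (10,000) are choices made for this illustration; in practice the CDF is the more convenient tool.

```python
from math import exp, pi, sqrt


def normal_pdf(x, mu, sigma):
    """Density of the Normal(mu, sigma) distribution at x."""
    coeff = 1 / (sigma * sqrt(2 * pi))
    return coeff * exp(-0.5 * ((x - mu) / sigma) ** 2)


def area_under_pdf(a, b, mu, sigma, steps=10_000):
    """Trapezoidal approximation of the integral of the PDF from a to b."""
    width = (b - a) / steps
    total = 0.5 * (normal_pdf(a, mu, sigma) + normal_pdf(b, mu, sigma))
    for i in range(1, steps):
        total += normal_pdf(a + i * width, mu, sigma)
    return total * width


area = area_under_pdf(900, 1100, mu=1000, sigma=150)
print(f"Approximate area between 900 and 1100 = {area:.4f}")
```

The approximation should agree with the CDF-based result to several decimal places, which confirms that subtracting two CDF values and integrating the PDF are the same computation.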
Exercise 4: Sampling from Distributions
As discussed earlier, we can use Python libraries like NumPy to generate random samples that follow a specific probability distribution. This is useful for simulations and for understanding the expected patterns in data.
1. Generate 1000 random samples from a Normal distribution with μ=50 and σ=5.
2. Generate 1000 random samples from a Binomial distribution with n=20 trials and probability of success p=0.7.
3. If you were to create a histogram of the samples generated in step 1, what shape would you expect it to have? What about the samples from step 2?
Python Implementation:
import numpy as np
import matplotlib.pyplot as plt # Optional: for visualization
# 1. Samples from Normal distribution
mu = 50
sigma = 5
normal_samples = np.random.normal(loc=mu, scale=sigma, size=1000)
# print("First 10 Normal samples:", normal_samples[:10])
# 2. Samples from Binomial distribution
n = 20
p = 0.7
binomial_samples = np.random.binomial(n=n, p=p, size=1000)
# print("First 10 Binomial samples:", binomial_samples[:10])
# Optional: Visualize the histograms
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1) # 1 row, 2 cols, plot 1
plt.hist(normal_samples, bins=30, density=True, color='#74c0fc', alpha=0.7)
plt.title('Histogram of Normal Samples (μ=50, σ=5)')
plt.xlabel('Value')
plt.ylabel('Density')
plt.subplot(1, 2, 2) # 1 row, 2 cols, plot 2
# Calculate bin edges for discrete values
bins = np.arange(binomial_samples.min(), binomial_samples.max() + 2) - 0.5
plt.hist(binomial_samples, bins=bins, density=True, color='#f783ac', alpha=0.7)
plt.title('Histogram of Binomial Samples (n=20, p=0.7)')
plt.xlabel('Number of Successes')
plt.ylabel('Probability')
plt.tight_layout()
# plt.show() # Uncomment to display the plot if running locally
Answers:
- Normal Samples Histogram: You would expect the histogram of the Normal samples to be roughly bell-shaped, centered around the mean μ=50. The spread of the histogram would reflect the standard deviation σ=5. With 1000 samples, it should approximate the theoretical Normal PDF quite well.
- Binomial Samples Histogram: You would expect the histogram of the Binomial samples to show the distribution of the number of successes. Since p=0.7 is greater than 0.5, the histogram will likely be skewed slightly to the left, with the peak (most frequent outcome) around n×p=20×0.7=14 successes. It will be a discrete histogram, with bars only at integer values between 0 and 20.
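The claims about the Binomial histogram (a peak at 14 and a slight left skew) can be checked against the exact PMF without drawing any random samples. The sketch below computes the mean, variance, skewness, and mode directly from the PMF using only the standard library. It is one possible check, not part of the original exercise.

```python
from math import comb, sqrt

n, p = 20, 0.7
# Exact probability of each possible count of successes, k = 0..20
pmf = [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]

mean = sum(k * pk for k, pk in enumerate(pmf))
variance = sum((k - mean) ** 2 * pk for k, pk in enumerate(pmf))
skewness = sum((k - mean) ** 3 * pk for k, pk in enumerate(pmf)) / variance**1.5
mode = max(range(n + 1), key=lambda k: pmf[k])

print(f"mean = {mean:.2f}, std = {sqrt(variance):.2f}")
print(f"mode = {mode}, skewness = {skewness:.3f}")
```

A negative skewness means the longer tail points toward smaller values, which is what "skewed to the left" describes. The value here is small in magnitude, so a histogram of 1000 samples will look only mildly asymmetric.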
These exercises provide a starting point for working with probability distributions. As you encounter more complex scenarios in machine learning, you'll find that understanding these fundamental distributions, their properties, and how to use them computationally is a significant advantage.