While databases provide access to internally managed data, many organizations and services expose their data externally through Application Programming Interfaces (APIs). APIs act as standardized contracts that allow different software systems to communicate. For data scientists, web APIs are a significant source of structured, often real-time, data ranging from social media activity and financial markets to weather forecasts and government statistics.
Unlike scraping websites, which involves parsing HTML designed for humans, APIs typically provide data in machine-readable formats like JSON (JavaScript Object Notation) or XML, making data extraction much cleaner and more reliable. We'll focus on REST (Representational State Transfer) APIs, a common architectural style for web APIs that uses standard HTTP methods.
Interacting with a REST API generally follows a request-response pattern:
Client Request: Your Python script (the client) sends an HTTP request to a specific URL (the endpoint) on the API server. This request includes:
- HTTP method: typically GET for retrieving data. Other methods like POST, PUT, and DELETE are used for creating, updating, or deleting resources, but are less common for simple data acquisition.
- Headers: metadata about the request, such as the desired response format (Accept: application/json) or authentication credentials (Authorization: Bearer YOUR_API_KEY).
- Query parameters: key-value pairs appended to the URL to filter or customize the results (e.g., ?city=London&units=metric).
- Body: data sent with POST or PUT, not usually required for GET requests retrieving data.
Server Response: The API server processes the request and sends back an HTTP response, containing:
- Status code: a number indicating the outcome (e.g., 200 OK, 404 Not Found, 401 Unauthorized, 500 Internal Server Error).
- Headers: metadata about the response.
- Body: the requested data, typically as JSON.
Figure: A simplified view of the client-server interaction when fetching data from a web API using an HTTP GET request.
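The request components described above can be inspected without touching the network. The sketch below builds a request with requests.Request and prepares it without sending it. The endpoint https://api.example.com/weather and the placeholder key are hypothetical and serve only as an illustration.

```python
import requests

# Hypothetical endpoint and key, for illustration only; nothing is sent.
request = requests.Request(
    method="GET",
    url="https://api.example.com/weather",
    params={"city": "London", "units": "metric"},
    headers={
        "Accept": "application/json",
        "Authorization": "Bearer YOUR_API_KEY",
    },
)
prepared = request.prepare()

print(prepared.method)             # HTTP method
print(prepared.url)                # endpoint plus encoded query string
print(prepared.headers["Accept"])  # one of the request headers
print(prepared.body)               # None: this GET request carries no body
```

Preparing a request this way is mainly useful for learning and debugging; for ordinary data retrieval, requests.get() performs the same preparation internally.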
The requests Library
The requests library is the de facto standard in Python for making HTTP requests. If you don't have it installed, you can typically install it using pip:
pip install requests
Let's see how to use it to fetch data. We'll use the JSONPlaceholder API (https://jsonplaceholder.typicode.com), a free fake online REST API for testing and prototyping.
Basic GET Request
To fetch a list of posts, you can make a GET request to the /posts endpoint:
import requests
import pandas as pd

# Define the API endpoint URL
url = "https://jsonplaceholder.typicode.com/posts"

try:
    # Send the GET request; the timeout keeps the call from hanging indefinitely
    response = requests.get(url, timeout=10)
    # Raise an exception for bad status codes (4xx or 5xx)
    response.raise_for_status()
    # If we reach this point, the request was successful
    print("Request successful!")
    # Parse the JSON response body
    data = response.json()  # Returns a list of dictionaries
    # Optionally, convert to a pandas DataFrame
    df_posts = pd.DataFrame(data)
    print(f"Successfully retrieved {len(df_posts)} posts.")
    print(df_posts.head())
# Catch the specific exceptions first, then the general one.
except requests.exceptions.HTTPError as e:
    # Handle specific HTTP errors (like 404 Not Found, 401 Unauthorized)
    print(f"HTTP error occurred: {e}")
    print(f"Status Code: {e.response.status_code}")
    # You might want to inspect e.response.text for more details from the API
except requests.exceptions.JSONDecodeError:
    # Handle cases where the response body isn't valid JSON
    print("Failed to decode JSON response.")
    print("Response text:", response.text)  # Log the raw text
except requests.exceptions.RequestException as e:
    # Handle connection errors, timeouts, etc.
    print(f"Request failed: {e}")
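When arranging except clauses for requests, it helps to know that the specific exceptions derive from RequestException, so a handler for the base class placed first would intercept all of them. The quick check below assumes a reasonably recent requests release (JSONDecodeError was added in version 2.27).

```python
import requests

base = requests.exceptions.RequestException

# Each specific exception is a subclass of the general RequestException,
# so specific handlers need to appear before the general one.
for exc in (
    requests.exceptions.HTTPError,
    requests.exceptions.ConnectionError,
    requests.exceptions.Timeout,
    requests.exceptions.JSONDecodeError,  # available in requests >= 2.27
):
    print(f"{exc.__name__}: subclass of RequestException = {issubclass(exc, base)}")
```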
Adding URL Parameters
Many APIs allow you to filter or customize results using query parameters appended to the URL (e.g., ?userId=1). The requests library makes this easy using the params argument, which takes a dictionary.
import requests
import pandas as pd

# Fetch posts only for userId = 5
params = {"userId": 5}
url = "https://jsonplaceholder.typicode.com/posts"

try:
    response = requests.get(url, params=params, timeout=10)
    response.raise_for_status()  # Check for HTTP errors
    data = response.json()
    df_user5_posts = pd.DataFrame(data)
    print(f"\nRetrieved {len(df_user5_posts)} posts for userId=5:")
    print(df_user5_posts.head())
except requests.exceptions.HTTPError as e:
    # Caught before RequestException, its parent class
    print(f"HTTP error occurred: {e} (Status: {e.response.status_code})")
except requests.exceptions.RequestException as e:
    print(f"Request failed: {e}")
requests automatically encodes the dictionary and appends it to the URL correctly (e.g., https://jsonplaceholder.typicode.com/posts?userId=5).
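To see what this encoding looks like, the standard library's urllib.parse.urlencode produces a comparable query string. This is only an approximation of what requests does internally, and the q and tag parameters are made up for the example.

```python
from urllib.parse import urlencode

base_url = "https://jsonplaceholder.typicode.com/posts"

# Simple case from the text: a single key-value pair.
simple = urlencode({"userId": 5})
print(f"{base_url}?{simple}")

# Hypothetical parameters showing how spaces and list values are handled.
# doseq=True repeats the key once per list element.
complex_query = urlencode(
    {"q": "data science", "tag": ["python", "api"]},
    doseq=True,
)
print(complex_query)
```

Passing a dictionary through params saves you from building such strings by hand and from forgetting to escape special characters.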
Custom Headers and Authentication
Real-world APIs often require authentication, typically via an API key. You might also need to set other headers like User-Agent or Accept. Headers are passed as a dictionary to the headers argument.
Common authentication patterns include:
- An API key sent in a header: either the standard Authorization header (e.g., Authorization: Bearer YOUR_KEY or Authorization: ApiKey YOUR_KEY) or a custom header (e.g., X-API-Key: YOUR_KEY).
- An API key passed as a URL query parameter (e.g., ?apiKey=YOUR_KEY).
Let's simulate adding an API key and a custom User-Agent header (though JSONPlaceholder doesn't require them):
import requests

# Replace 'YOUR_ACTUAL_API_KEY' with a real key if needed for another API
api_key = "YOUR_ACTUAL_API_KEY"
headers = {
    "Authorization": f"Bearer {api_key}",
    "User-Agent": "MyDataScienceApplication/1.0",
    "Accept": "application/json",
}
url = "https://jsonplaceholder.typicode.com/todos/1"  # Example: fetch a single 'todo' item

try:
    # Pass the headers dictionary
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()
    todo_item = response.json()
    print("\nRetrieved single todo item (with simulated headers):")
    print(todo_item)
except requests.exceptions.HTTPError as e:
    # Caught before RequestException, its parent class
    print(f"HTTP error occurred: {e} (Status: {e.response.status_code})")
    # If you get a 401 or 403 error, it's likely an authentication issue.
except requests.exceptions.RequestException as e:
    print(f"Request failed: {e}")
Always consult the specific API's documentation for the correct authentication method and required headers.
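In practice, it is usually safer to keep the key out of the source file. The following is a minimal sketch that reads it from an environment variable instead; the variable name MY_API_KEY and the helper build_headers are hypothetical, and the Bearer scheme is only one of the patterns an API might expect.

```python
import os


def build_headers(env_var="MY_API_KEY"):
    """Build request headers, reading the API key from an environment variable."""
    api_key = os.environ.get(env_var)
    if not api_key:
        raise RuntimeError(f"Set the {env_var} environment variable first.")
    return {
        "Authorization": f"Bearer {api_key}",
        "Accept": "application/json",
    }


# Demo only: provide a dummy value so the example runs end to end.
os.environ.setdefault("MY_API_KEY", "dummy-key-for-demo")
headers = build_headers()
print(sorted(headers))  # header names only, so the key itself is not printed
```

The resulting dictionary can be passed to requests.get() through the headers argument in the same way as the hard-coded version.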
Successfully making the request is only half the process. You need to interpret the response correctly.
- Check the status: inspect response.status_code or use response.raise_for_status() to ensure the request was successful (typically status code 200). A 4xx code indicates a client error (bad request, missing authentication, resource not found), while a 5xx code indicates a server error.
- Parse the body: if the response contains JSON, call response.json() to parse it into a Python dictionary or list. Be prepared to handle requests.exceptions.JSONDecodeError if the response isn't valid JSON.
- Handle errors: use try...except blocks to gracefully handle network issues (requests.exceptions.RequestException), HTTP errors (requests.exceptions.HTTPError), and JSON parsing errors. Logging the status code and response text (response.text) upon errors is helpful for debugging.
APIs are shared resources. Most APIs enforce rate limits, restricting the number of requests you can make within a certain time window (e.g., 100 requests per minute). Exceeding these limits might result in temporary blocks (e.g., status code 429 Too Many Requests).
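The status-code classes described above, including the 429 rate-limit case, can be condensed into a small helper. The function name classify_status and its category labels are illustrative choices, not part of requests.

```python
def classify_status(status_code):
    """Map an HTTP status code to a coarse category."""
    if status_code == 429:
        return "rate limited"  # a client error that deserves special handling
    if 200 <= status_code < 300:
        return "success"
    if 400 <= status_code < 500:
        return "client error"
    if 500 <= status_code < 600:
        return "server error"
    return "other"  # informational (1xx) and redirect (3xx) responses


for code in (200, 404, 429, 500):
    print(code, classify_status(code))
```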
Always:
- Pace your requests: when making many calls, pause between them (e.g., using time.sleep()).
- Handle 429 errors: if you receive a 429 status code, your script should wait (the Retry-After header in the response often indicates how long) before trying again.
Respecting API terms ensures continued access and responsible data acquisition.
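One possible way to honor a 429 response is sketched below. The helper get_with_retry, its parameters, and the fallback wait are assumptions made for illustration; a production client would likely add exponential backoff and parse the HTTP-date form of Retry-After as well.

```python
import time

import requests


def get_with_retry(url, session=None, max_attempts=3, default_wait=1.0,
                   sleep=time.sleep):
    """GET a URL, pausing and retrying while the server answers 429."""
    if session is None:
        session = requests.Session()
    response = session.get(url, timeout=10)
    attempts = 1
    while response.status_code == 429 and attempts < max_attempts:
        # Retry-After may hold a number of seconds. It can also be an HTTP
        # date, which this sketch does not parse and replaces with default_wait.
        retry_after = response.headers.get("Retry-After", "")
        wait = float(retry_after) if retry_after.isdigit() else default_wait
        sleep(wait)
        response = session.get(url, timeout=10)
        attempts += 1
    return response
```

Calling get_with_retry(url) returns the last response received, so the caller can still apply raise_for_status() and response.json() as in the earlier examples. The sleep parameter exists only so the pause can be replaced in tests.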
By mastering interaction with web APIs using libraries like requests, you gain access to a vast array of dynamic datasets, significantly expanding the scope of data you can incorporate into your data science projects. This forms a critical part of the data acquisition toolkit, complementing data retrieved from databases and files.
© 2025 ApX Machine Learning