Before you can analyze data, you first have to find it. Data doesn't magically appear in your analysis tools; it resides in various places, waiting to be collected. Think of this step like gathering ingredients before cooking. Depending on what you want to make (or what question you want to answer), you'll need different ingredients found in different locations. This section introduces common places where data lives.
Perhaps the most straightforward way data is stored and shared is in individual files, such as CSV (comma-separated values) files, JSON files, Excel spreadsheets, and plain text files. You've likely encountered many of these already.
These files might be downloaded from websites, received as email attachments, generated by other software, or stored on shared network drives. For many introductory data science tasks, you'll start by loading data from one of these file types.
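As a quick illustration, the sketch below loads a CSV file using the pandas library, a common choice for this task. The filename sales_data.csv is a hypothetical placeholder for whatever file you actually have.

```python
# A minimal sketch of loading tabular data from a file with pandas.
# "sales_data.csv" is a hypothetical placeholder filename.
import pandas as pd

# Read a comma-separated values file into a DataFrame
df = pd.read_csv("sales_data.csv")

# Inspect the first few rows to confirm the data loaded as expected
print(df.head())
```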
Organizations often store large amounts of operational data in databases. A database is an organized collection of data, generally stored and accessed electronically from a computer system.
Think of a database as a highly structured digital filing cabinet. Instead of simple files, it uses specialized software (a Database Management System or DBMS) to store, manage, and retrieve data efficiently.
Many databases are relational, organizing data into tables that can be linked to one another (for example, a Customers table linked to an Orders table). Interacting with these databases typically involves using a language called SQL (Structured Query Language). While you don't need to be an SQL expert right away, it's good to know that databases are a primary source of data in many real-world scenarios. Accessing data from a database usually requires specific connection details (such as an address, username, and password) and often involves writing queries to request the exact pieces of data you need.
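To make this concrete, here is a minimal sketch using Python's built-in sqlite3 module. The database file, table name, and column names are hypothetical stand-ins; a production DBMS would also require host, username, and password details.

```python
# A minimal sketch of querying a database with SQL from Python.
# Uses the built-in sqlite3 module; "company.db" and the "customers"
# table are hypothetical placeholders for a real database.
import sqlite3

# Connect to the database (a networked DBMS would need host, user, password)
connection = sqlite3.connect("company.db")
cursor = connection.cursor()

# Ask for exactly the columns and rows we need using an SQL query
cursor.execute("SELECT name, email FROM customers WHERE country = ?", ("UK",))

for name, email in cursor.fetchall():
    print(name, email)

connection.close()
```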
APIs, or Application Programming Interfaces, are sets of rules and protocols that allow different software applications to communicate and exchange data with each other.
Imagine you want the current weather for London. Rather than making you dig into its internal systems, the weather service might provide an API. Your program can send a request to a specific API web address (an endpoint), perhaps specifying "London," and the API will send back the current weather data, often formatted as JSON.
Many web services offer APIs to access their data, from weather providers to social media platforms and financial data vendors.
Using APIs often requires understanding how to make specific web requests and sometimes involves obtaining access keys or tokens for authentication. They are a powerful way to get dynamic, up-to-date data directly from a source.
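The sketch below shows the general pattern using the requests library. The endpoint URL and the city and api_key parameters are hypothetical placeholders, since every API defines its own endpoints and authentication scheme.

```python
# A minimal sketch of requesting JSON data from a web API.
# The endpoint URL and parameter names are hypothetical placeholders;
# consult the documentation of the API you actually use.
import requests

url = "https://api.example.com/v1/weather"
params = {"city": "London", "api_key": "YOUR_API_KEY"}

response = requests.get(url, params=params, timeout=10)
response.raise_for_status()  # stop early if the request failed

data = response.json()  # parse the JSON body into a Python dictionary
print(data)
```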
Sometimes, the data you want is visible on a website but isn't available through a convenient file download or API. In these cases, it's sometimes possible to write code that automatically extracts information directly from the web page's HTML structure. This process is called web scraping or web harvesting.
While useful, web scraping should be approached with caution. Always check a website's rules (often found in its robots.txt file or on a "Terms of Use" page) before scraping. Excessive scraping can overload a website's server, and scraping private or copyrighted data may be illegal. For beginners, it's usually more practical to start with data from files, databases, or well-defined APIs.
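For completeness, here is a minimal scraping sketch using the requests and BeautifulSoup libraries. The URL and the choice of h2 headings are hypothetical, and you should confirm a site permits scraping before running anything like this.

```python
# A minimal sketch of extracting text from a web page's HTML.
# The URL and the choice of <h2> tags are hypothetical placeholders;
# always check robots.txt and the site's terms before scraping.
import requests
from bs4 import BeautifulSoup

url = "https://example.com/articles"
response = requests.get(url, timeout=10)
response.raise_for_status()

# Parse the HTML and pull out the text of every <h2> heading
soup = BeautifulSoup(response.text, "html.parser")
for heading in soup.find_all("h2"):
    print(heading.get_text(strip=True))
```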
Data can also come from less common or more specialized sources.
Knowing where data might come from is the first step. The next challenge, which we'll cover soon, is how to actually bring that data into your analysis environment – a process often called importing or loading data. The source you choose will depend heavily on the problem you're trying to solve and the data available to you.