A feature store provides a consistent interface for models to access features, but the underlying data pipelines that populate it can operate in two fundamentally different modes: batch and real-time. The primary engineering task is to design these pipelines to compute features in a way that is identical across both contexts, thereby preventing training-serving skew. This decision directly influences your system's latency, cost, and operational complexity.
Batch feature computation involves processing large volumes of data in discrete, scheduled jobs. This is the traditional mode for generating training data and for features that do not require up-to-the-second freshness.
Primary Use Case: Training Data Generation
The most common application of batch processing is to create historical feature sets for model training. A scheduled job, perhaps running daily or weekly, reads raw historical data from a data lake (like Amazon S3 or Google Cloud Storage) or a data warehouse, applies a series of transformations, and writes the resulting features to an offline store. This offline store is typically optimized for high-throughput reads by training jobs, often using columnar formats like Apache Parquet or Delta Lake.
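As a rough sketch, the PySpark job below follows this pattern. The bucket paths, column names, and the user_lifetime_spend feature (which also appears in the comparison table later in this section) are illustrative assumptions, not a reference implementation.

```python
# A minimal PySpark sketch of a scheduled batch feature job.
# Paths, column names, and the feature itself are assumptions for illustration.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("batch_feature_job").getOrCreate()

# Read raw historical events from the data lake (e.g., S3 or GCS).
transactions = spark.read.parquet("s3://data-lake/raw/transactions/")

# Transformation: aggregate per-user lifetime spend as of this run.
features = (
    transactions
    .groupBy("user_id")
    .agg(F.sum("amount").alias("user_lifetime_spend"))
    .withColumn("feature_timestamp", F.current_timestamp())
)

# Write to the offline store in a columnar, read-optimized format.
features.write.mode("overwrite").parquet(
    "s3://feature-store/offline/user_lifetime_spend/"
)
```

A scheduler (for example, a daily cron-like trigger in an orchestrator) would rerun this job on each cycle so the offline store stays aligned with the raw data.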
Architecture and Tooling
A typical batch pipeline is orchestrated to handle terabytes of data efficiently. The process uses distributed computing frameworks to parallelize the workload across a cluster of machines.
A standard batch feature computation pipeline. Data is read in bulk, processed by a distributed engine like Apache Spark using predefined logic, and stored in an offline-optimized format.
The defining characteristics of batch computation are:

- Discrete, scheduled jobs, typically run daily or weekly rather than continuously.
- High feature latency: values may be minutes to days old by the time a model reads them.
- Very large data volumes, on the order of terabytes to petabytes per run.
- Low cost per unit of data processed and comparatively low operational overhead.
In contrast, real-time (or streaming) feature computation processes data as it arrives, typically event-by-event or in small micro-batches. This approach is necessary for features that must reflect the immediate context of a user or system.
Primary Use Case: Online Inference
When a model makes a prediction for a live request, it may require features that are only seconds or milliseconds old. For example, a fraud detection model needs to know if a credit card was just used in another city, or a recommendation engine needs to incorporate the product a user just clicked on. Real-time pipelines provide this by consuming events from a message queue like Apache Kafka or AWS Kinesis, performing transformations, and loading the results into a low-latency online feature store.
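The simplified sketch below illustrates this flow with a plain Kafka consumer updating Redis directly; in practice this logic would usually run inside a stream processor. The topic name, event schema, and user_clicks_last_minute feature are assumptions for the example.

```python
# A simplified streaming feature update: consume click events from Kafka
# and maintain a short-lived per-user counter in a Redis online store.
import json

import redis
from kafka import KafkaConsumer

online_store = redis.Redis(host="localhost", port=6379)

consumer = KafkaConsumer(
    "click-events",                       # assumed topic name
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)

for event in consumer:
    user_id = event.value["user_id"]
    key = f"user_clicks_last_minute:{user_id}"
    # Increment the counter and expire it after 60 seconds. This is a coarse
    # approximation of a one-minute window; a real pipeline would use
    # event-time windowing in the stream processor.
    online_store.incr(key)
    online_store.expire(key, 60)
```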
Architecture and Tooling
The architecture for real-time feature computation is built for speed and continuous availability.
A standard real-time feature computation pipeline. A stream processing engine like Apache Flink consumes live events, applies transformations, and updates a low-latency online store like Redis.
The defining characteristics of real-time computation are:

- Continuous, event-driven processing rather than discrete scheduled jobs.
- Low feature latency: values are milliseconds to seconds old at serving time.
- Small, continuous streams of events rather than bulk datasets.
- Higher cost per unit of data processed and higher operational overhead.
Having two separate codebases for your batch (e.g., PySpark) and real-time (e.g., Flink) pipelines introduces a significant risk. Even a minor difference in how nulls are handled or how a timestamp is rounded can cause a deviation between the feature values seen during training and those used at serving. This is a classic source of training-serving skew.
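The toy example below makes this concrete. It is language-agnostic in spirit (the same divergence can arise between PySpark and Flink code); the function names and dates are invented for illustration. Both functions intend to bucket a timestamp by hour, but one truncates and the other rounds.

```python
# A toy illustration of training-serving skew from duplicated logic.
from datetime import datetime, timedelta


def hour_bucket_batch(ts: datetime) -> datetime:
    # Batch codebase: truncate to the start of the hour.
    return ts.replace(minute=0, second=0, microsecond=0)


def hour_bucket_streaming(ts: datetime) -> datetime:
    # Streaming codebase: round to the *nearest* hour.
    truncated = ts.replace(minute=0, second=0, microsecond=0)
    return truncated + timedelta(hours=1) if ts.minute >= 30 else truncated


event_time = datetime(2024, 5, 1, 14, 45)
print(hour_bucket_batch(event_time))      # 2024-05-01 14:00:00
print(hour_bucket_streaming(event_time))  # 2024-05-01 15:00:00  <- skew
```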
The goal is to ensure the transformation logic is identical. A modern and effective pattern is to adopt a stream-first architecture.
In this model, all feature logic is defined once, within a stream processing framework. This framework becomes the single source of truth.
This unified approach guarantees that the feature f(x_train) is computed by the exact same code as f(x_serve).
A unified stream-first architecture. A single stream processor with shared feature logic serves both the online store from live events and backfills the offline store by replaying historical events.
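One way to picture the "define once" pattern is a single, pure transformation function that both the streaming job and the historical backfill import. The event schema, function names, and in-memory stores below are assumptions for illustration only.

```python
# A minimal sketch of shared feature logic used by both serving paths.

def compute_order_features(event: dict) -> dict:
    """Shared feature logic: identical for live events and replayed history."""
    amount = event.get("amount") or 0.0          # null handling defined in one place
    return {
        "user_id": event["user_id"],
        "order_amount_usd": round(float(amount), 2),
        "is_large_order": float(amount) > 100.0,
    }


def serve_online(event: dict, online_store: dict) -> None:
    # Streaming path: update the online store as each event arrives.
    online_store[event["user_id"]] = compute_order_features(event)


def backfill_offline(historical_events: list[dict]) -> list[dict]:
    # Batch/backfill path: replay history through the exact same function.
    return [compute_order_features(e) for e in historical_events]
```

Because both paths call compute_order_features, any change to null handling or rounding applies to training and serving simultaneously.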
Most production systems use a hybrid approach. The decision of whether to compute a feature in batch or real-time depends entirely on the model's requirements for data freshness versus the acceptable cost.
| Characteristic | Batch Computation | Real-time Computation |
|---|---|---|
| Latency | High (minutes to days) | Low (milliseconds to seconds) |
| Data Volume | Very Large (TB to PB) | Small, continuous events |
| Cost | Low per unit of data | High per unit of data |
| Complexity | Lower operational overhead | Higher operational overhead |
| Primary Use Case | Training data generation | Online inference serving |
| Example Feature | user_lifetime_spend | user_clicks_last_minute |
Your role as an AI infrastructure engineer is not just to build these pipelines but to provide a platform where data scientists can easily define features and the system can intelligently route them to the appropriate computation engine, all while guaranteeing consistency between the training and serving environments.