A feature store provides a consistent interface for models to access features, but the underlying data pipelines that populate it can operate in two fundamentally different modes: batch and real-time. The primary engineering task is to design these pipelines to compute features in a way that is identical across both contexts, thereby preventing training-serving skew. This decision directly influences your system's latency, cost, and operational complexity.
Batch feature computation involves processing large volumes of data in discrete, scheduled jobs. This is the traditional mode for generating training data and for features that do not require up-to-the-second freshness.
Primary Use Case: Training Data Generation
The most common application of batch processing is to create historical feature sets for model training. A scheduled job, perhaps running daily or weekly, reads raw historical data from a data lake (like Amazon S3 or Google Cloud Storage) or a data warehouse, applies a series of transformations, and writes the resulting features to an offline store. This offline store is typically optimized for high-throughput reads by training jobs, often using columnar formats like Apache Parquet or Delta Lake.
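As a rough sketch, the PySpark job below follows this pattern. The bucket paths, column names, and the user_lifetime_spend feature (which also appears in the comparison table later in this section) are illustrative assumptions, not a reference implementation.

```python
# A minimal PySpark sketch of a scheduled batch feature job.
# Paths, column names, and the feature itself are assumptions for illustration.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("batch_feature_job").getOrCreate()

# Read raw historical events from the data lake (e.g., S3 or GCS).
transactions = spark.read.parquet("s3://data-lake/raw/transactions/")

# Transformation: aggregate per-user lifetime spend as of this run.
features = (
    transactions
    .groupBy("user_id")
    .agg(F.sum("amount").alias("user_lifetime_spend"))
    .withColumn("feature_timestamp", F.current_timestamp())
)

# Write to the offline store in a columnar, read-optimized format.
features.write.mode("overwrite").parquet(
    "s3://feature-store/offline/user_lifetime_spend/"
)
```

A scheduler (for example, a daily cron-like trigger in an orchestrator) would rerun this job on each cycle so the offline store stays aligned with the raw data.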
Architecture and Tooling
A typical batch pipeline is orchestrated to handle terabytes of data efficiently. The process uses distributed computing frameworks to parallelize the workload across a cluster of machines.
A standard batch feature computation pipeline. Data is read in bulk, processed by a distributed engine like Apache Spark using predefined logic, and stored in an offline-optimized format.
The defining characteristics of batch computation are:

- Discrete, scheduled jobs, typically run daily or weekly rather than continuously.
- High feature latency: values may be minutes to days old by the time a model reads them.
- Very large data volumes, on the order of terabytes to petabytes per run.
- Low cost per unit of data processed and comparatively low operational overhead.
In contrast, real-time (or streaming) feature computation processes data as it arrives, typically event-by-event or in small micro-batches. This approach is necessary for features that must reflect the immediate context of a user or system.
Primary Use Case: Online Inference
When a model makes a prediction for a live request, it may require features that are only seconds or milliseconds old. For example, a fraud detection model needs to know if a credit card was just used in another city, or a recommendation engine needs to incorporate the product a user just clicked on. Real-time pipelines provide this by consuming events from a message queue like Apache Kafka or AWS Kinesis, performing transformations, and loading the results into a low-latency online feature store.
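The simplified sketch below illustrates this flow with a plain Kafka consumer updating Redis directly; in practice this logic would usually run inside a stream processor. The topic name, event schema, and user_clicks_last_minute feature are assumptions for the example.

```python
# A simplified streaming feature update: consume click events from Kafka
# and maintain a short-lived per-user counter in a Redis online store.
import json

import redis
from kafka import KafkaConsumer

online_store = redis.Redis(host="localhost", port=6379)

consumer = KafkaConsumer(
    "click-events",                       # assumed topic name
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)

for event in consumer:
    user_id = event.value["user_id"]
    key = f"user_clicks_last_minute:{user_id}"
    # Increment the counter and expire it after 60 seconds. This is a coarse
    # approximation of a one-minute window; a real pipeline would use
    # event-time windowing in the stream processor.
    online_store.incr(key)
    online_store.expire(key, 60)
```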
Architecture and Tooling
The architecture for real-time feature computation is built for speed and continuous availability.
A standard real-time feature computation pipeline. A stream processing engine like Apache Flink consumes live events, applies transformations, and updates a low-latency online store like Redis.
The defining characteristics of real-time computation are:

- Continuous, event-driven processing rather than discrete scheduled jobs.
- Low feature latency: values are milliseconds to seconds old at serving time.
- Small, continuous streams of events rather than bulk datasets.
- Higher cost per unit of data processed and higher operational overhead.
Having two separate codebases for your batch (e.g., PySpark) and real-time (e.g., Flink) pipelines introduces a significant risk. Even a minor difference in how nulls are handled or how a timestamp is rounded can cause a deviation between the feature values seen during training and those used at serving. This is a classic source of training-serving skew.
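The toy example below makes this concrete. It is language-agnostic in spirit (the same divergence can arise between PySpark and Flink code); the function names and dates are invented for illustration. Both functions intend to bucket a timestamp by hour, but one truncates and the other rounds.

```python
# A toy illustration of training-serving skew from duplicated logic.
from datetime import datetime, timedelta


def hour_bucket_batch(ts: datetime) -> datetime:
    # Batch codebase: truncate to the start of the hour.
    return ts.replace(minute=0, second=0, microsecond=0)


def hour_bucket_streaming(ts: datetime) -> datetime:
    # Streaming codebase: round to the *nearest* hour.
    truncated = ts.replace(minute=0, second=0, microsecond=0)
    return truncated + timedelta(hours=1) if ts.minute >= 30 else truncated


event_time = datetime(2024, 5, 1, 14, 45)
print(hour_bucket_batch(event_time))      # 2024-05-01 14:00:00
print(hour_bucket_streaming(event_time))  # 2024-05-01 15:00:00  <- skew
```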
The goal is to ensure the transformation logic is identical. A modern and effective pattern is to adopt a stream-first architecture.
In this model, all feature logic is defined once, within a stream processing framework. This framework becomes the single source of truth.
This unified approach guarantees that the feature f(x_train) is computed by the exact same code as f(x_serve).
A unified stream-first architecture. A single stream processor with shared feature logic serves both the online store from live events and backfills the offline store by replaying historical events.
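One way to picture the "define once" pattern is a single, pure transformation function that both the streaming job and the historical backfill import. The event schema, function names, and in-memory stores below are assumptions for illustration only.

```python
# A minimal sketch of shared feature logic used by both serving paths.

def compute_order_features(event: dict) -> dict:
    """Shared feature logic: identical for live events and replayed history."""
    amount = event.get("amount") or 0.0          # null handling defined in one place
    return {
        "user_id": event["user_id"],
        "order_amount_usd": round(float(amount), 2),
        "is_large_order": float(amount) > 100.0,
    }


def serve_online(event: dict, online_store: dict) -> None:
    # Streaming path: update the online store as each event arrives.
    online_store[event["user_id"]] = compute_order_features(event)


def backfill_offline(historical_events: list[dict]) -> list[dict]:
    # Batch/backfill path: replay history through the exact same function.
    return [compute_order_features(e) for e in historical_events]
```

Because both paths call compute_order_features, any change to null handling or rounding applies to training and serving simultaneously.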
Most production systems use a hybrid approach. The decision of whether to compute a feature in batch or real-time depends entirely on the model's requirements for data freshness versus the acceptable cost.
| Characteristic | Batch Computation | Real-time Computation |
|---|---|---|
| Latency | High (minutes to days) | Low (milliseconds to seconds) |
| Data Volume | Very Large (TB to PB) | Small, continuous events |
| Cost | Low per unit of data | High per unit of data |
| Complexity | Lower operational overhead | Higher operational overhead |
| Primary Use Case | Training data generation | Online inference serving |
| Example Feature | user_lifetime_spend | user_clicks_last_minute |
Your role as an AI infrastructure engineer is not just to build these pipelines but to provide a platform where data scientists can easily define features and the system can intelligently route them to the appropriate computation engine, all while guaranteeing consistency between the training and serving environments.