Machine learning systems operate effectively only when the input data reflects the current state of the environment. In traditional batch architectures, feature engineering happens periodically, resulting in stale data during inference. This chapter addresses the engineering challenges associated with shifting these computations to a streaming context, enabling millisecond-level latency for model predictions.
We begin by analyzing online feature generation. You will write Flink jobs that calculate aggregations over sliding windows, such as computing a moving average for a time series. If a model requires the mean value of a specific feature over a window, the stream processor must maintain this state and update it incrementally as new events arrive.
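The sketch below shows one way to express such an incremental mean with Flink's DataStream API. The stand-in source, the key field, and the window sizes (a five-minute window sliding every ten seconds) are illustrative assumptions, not requirements of the pattern.

```java
// Minimal sketch: incrementally maintained moving average over a sliding window.
// Source data, key field, and window sizes are illustrative assumptions.
import org.apache.flink.api.common.functions.AggregateFunction;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.SlidingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class MovingAverageJob {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Stand-in source; a real job would read from Kafka or another connector.
        DataStream<Tuple2<String, Double>> readings = env
                .fromElements(Tuple2.of("sensor-1", 1.0), Tuple2.of("sensor-1", 3.0))
                .returns(Types.TUPLE(Types.STRING, Types.DOUBLE));

        // Five-minute window, re-evaluated every ten seconds.
        DataStream<Double> movingAverage = readings
                .keyBy(reading -> reading.f0)
                .window(SlidingProcessingTimeWindows.of(Time.minutes(5), Time.seconds(10)))
                .aggregate(new MeanAggregate());

        movingAverage.print();
        env.execute("moving-average-feature");
    }

    // Keeps only a running (sum, count) per window pane and emits the mean when
    // the window fires; in practice this would be combined with a
    // ProcessWindowFunction to attach the key to the output.
    public static class MeanAggregate
            implements AggregateFunction<Tuple2<String, Double>, Tuple2<Double, Long>, Double> {

        @Override
        public Tuple2<Double, Long> createAccumulator() {
            return Tuple2.of(0.0, 0L);
        }

        @Override
        public Tuple2<Double, Long> add(Tuple2<String, Double> value, Tuple2<Double, Long> acc) {
            return Tuple2.of(acc.f0 + value.f1, acc.f1 + 1);
        }

        @Override
        public Double getResult(Tuple2<Double, Long> acc) {
            return acc.f1 == 0 ? 0.0 : acc.f0 / acc.f1;
        }

        @Override
        public Tuple2<Double, Long> merge(Tuple2<Double, Long> a, Tuple2<Double, Long> b) {
            return Tuple2.of(a.f0 + b.f0, a.f1 + b.f1);
        }
    }
}
```

Because the AggregateFunction stores only a (sum, count) pair per window pane rather than buffering every element, state stays small even for long windows.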
Following feature generation, we examine architectural patterns for model serving. You will compare the trade-offs between embedding models directly within the Flink TaskManager, which avoids network overhead, and calling external inference services via HTTP or gRPC. This section also discusses resource isolation and scaling the serving tier independently of the stream processing cluster.
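As one illustration of the external-service approach, the following sketch uses Flink's Async I/O operator to call a remote model server over HTTP without blocking the stream. The endpoint URL, JSON payload shape, timeout, and capacity values are placeholder assumptions.

```java
// Sketch of the external inference pattern using Flink Async I/O.
// Endpoint URL, payload format, timeout, and capacity are hypothetical values.
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Collections;
import java.util.concurrent.TimeUnit;

import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.datastream.AsyncDataStream;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.async.ResultFuture;
import org.apache.flink.streaming.api.functions.async.RichAsyncFunction;

public class RemoteScoringFunction extends RichAsyncFunction<String, String> {

    private transient HttpClient httpClient;

    @Override
    public void open(Configuration parameters) {
        httpClient = HttpClient.newHttpClient();
    }

    @Override
    public void asyncInvoke(String featureJson, ResultFuture<String> resultFuture) {
        // Non-blocking call to a hypothetical model server; the operator keeps
        // processing other records while this request is in flight.
        HttpRequest request = HttpRequest.newBuilder(URI.create("http://model-service:8080/predict"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(featureJson))
                .build();

        httpClient.sendAsync(request, HttpResponse.BodyHandlers.ofString())
                .thenAccept(response -> resultFuture.complete(Collections.singleton(response.body())))
                .exceptionally(error -> {
                    resultFuture.completeExceptionally(error);
                    return null;
                });
    }

    // Wiring: cap in-flight requests and time out slow calls so back-pressure
    // propagates instead of requests queuing without bound.
    public static DataStream<String> score(DataStream<String> features) {
        return AsyncDataStream.unorderedWait(
                features, new RemoteScoringFunction(), 500, TimeUnit.MILLISECONDS, 100);
    }
}
```

The embedded alternative would instead load the model inside a rich operator's open() method, trading resource isolation for lower per-record latency.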
The chapter also covers the request-response pattern over asynchronous streams. Since Kafka decouples producers from consumers, building synchronous user-facing applications requires explicit correlation strategies, typically a correlation identifier attached to each request together with a dedicated reply topic. Finally, we integrate these components with a feature store. You will configure the pipeline to write computed features to a low-latency online store, ensuring consistency between the features used to build training datasets and those served at inference time.
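To make the correlation strategy mentioned above concrete, the sketch below attaches a correlation ID and a per-client reply topic to each request, and completes a pending future when the matching reply arrives. The topic names, header keys, and plain String serialization are illustrative assumptions.

```java
// Minimal sketch of the correlated request/reply pattern over Kafka.
// Topic names, header keys, and serialization are illustrative assumptions.
import java.nio.charset.StandardCharsets;
import java.time.Duration;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import java.util.UUID;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.header.Header;

public class InferenceClient {

    private final KafkaProducer<String, String> producer;
    private final Map<String, CompletableFuture<String>> pending = new ConcurrentHashMap<>();
    // Per-client reply topic so replies for other instances are never consumed here.
    private final String replyTopic = "inference-replies-" + UUID.randomUUID();

    public InferenceClient(Properties producerProps) {
        this.producer = new KafkaProducer<>(producerProps);
    }

    // Publish the request with a correlation id and the reply topic attached as headers.
    public CompletableFuture<String> predict(String key, String featureJson) {
        String correlationId = UUID.randomUUID().toString();
        CompletableFuture<String> future = new CompletableFuture<>();
        pending.put(correlationId, future);

        ProducerRecord<String, String> record = new ProducerRecord<>("inference-requests", key, featureJson);
        record.headers().add("correlation-id", correlationId.getBytes(StandardCharsets.UTF_8));
        record.headers().add("reply-topic", replyTopic.getBytes(StandardCharsets.UTF_8));
        producer.send(record);
        return future;
    }

    // Background loop: consume the reply topic and complete the matching future.
    public void pollReplies(KafkaConsumer<String, String> consumer) {
        consumer.subscribe(List.of(replyTopic));
        while (!Thread.currentThread().isInterrupted()) {
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
            records.forEach(reply -> {
                Header header = reply.headers().lastHeader("correlation-id");
                if (header == null) {
                    return;
                }
                String correlationId = new String(header.value(), StandardCharsets.UTF_8);
                CompletableFuture<String> future = pending.remove(correlationId);
                if (future != null) {
                    future.complete(reply.value());
                }
            });
        }
    }
}
```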
8.1 Online Feature Generation
8.2 Model Serving Patterns in Streams
8.3 Request-Response over Async Streams
8.4 Feature Store Integration
8.5 Hands-on Practical: Real-Time Inference Pipeline