Feature engineering transforms raw data into formats suitable for machine learning algorithms. In batch processing, this often involves SQL queries over static partitions of a data warehouse. However, these static partitions introduce significant latency: by the time a feature vector is computed, the user context may have changed, rendering the prediction irrelevant. Online feature generation moves this computation to the stream processing layer, calculating feature values as events arrive so that the model input reflects the immediate state of the environment.
The primary mechanism for generating features on data streams is stateful aggregation. Unlike stateless transformations such as filtering or mapping, feature generation requires context. To calculate the "average transaction value over the last hour" or the "number of clicks in the past 5 minutes," the processor must retain history.
In Apache Flink, we achieve this through windowing logic combined with keyed state. The raw stream is partitioned by a logical entity, such as a user_id or device_id, using the keyBy operator. This ensures that all events for a specific entity are routed to the same parallel task slot, allowing local state access without network shuffling during the window accumulation phase.
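A sketch of the resulting pipeline shape, assuming a DataStream<Transaction> named transactions with a getUserId() accessor (both names are illustrative), using the TumblingEventTimeWindows assigner from the DataStream API:

```java
// Partition by entity, then aggregate per key within 1-hour event-time windows.
DataStream<Double> hourlyAvg = transactions
    .keyBy(txn -> txn.getUserId())                       // all events for a user go to one subtask
    .window(TumblingEventTimeWindows.of(Time.hours(1)))  // window logic over keyed state
    .aggregate(new AvgAggregate());                      // incremental aggregation (defined below)
```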
Figure: Data flow for a window aggregation. Events are sharded by key, processed against local state, and emitted as updated feature vectors.
A common performance bottleneck in feature generation arises from buffering raw events. If a window spans 24 hours and contains millions of events, storing all payloads in the state backend until the window triggers is inefficient. It increases memory pressure and causes spikes in CPU usage during the evaluation phase.
To mitigate this, we utilize incremental aggregation. Instead of storing the events, Flink updates a compact accumulator state. For a simple average, the state consists only of the running sum and the count, updated in place as each event arrives.
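A minimal sketch of such an accumulator (the class and field names are illustrative):

```java
// Hypothetical accumulator for a running average. It holds only two
// fields, no matter how many events fall into the window.
public class AvgAccumulator {
    public double sum;
    public long count;

    public AvgAccumulator(double sum, long count) {
        this.sum = sum;
        this.count = count;
    }

    // Fold one new event into the accumulator in place.
    public void add(double value) {
        sum += value;
        count += 1;
    }
}
```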
When the window closes or triggers, the feature is computed simply as $\text{sum} / \text{count}$. This approach ensures that the storage complexity remains $O(1)$ regardless of the number of events in the window.
In Flink, this is implemented using the AggregateFunction interface. This interface requires defining an accumulator, an add method, and a result retrieval method. The merge method is also critical for session windows where multiple partial aggregates might need to combine.
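A sketch of a running-average feature built on this interface, reusing the AvgAccumulator above; the Transaction event type and its getValue() accessor are assumptions for illustration:

```java
import org.apache.flink.api.common.functions.AggregateFunction;

public class AvgAggregate implements AggregateFunction<Transaction, AvgAccumulator, Double> {

    @Override
    public AvgAccumulator createAccumulator() {
        return new AvgAccumulator(0.0, 0L);
    }

    @Override
    public AvgAccumulator add(Transaction txn, AvgAccumulator acc) {
        acc.add(txn.getValue()); // incremental update, no event buffering
        return acc;
    }

    @Override
    public Double getResult(AvgAccumulator acc) {
        return acc.count == 0 ? 0.0 : acc.sum / acc.count; // sum / count
    }

    @Override
    public AvgAccumulator merge(AvgAccumulator a, AvgAccumulator b) {
        // Needed when partial aggregates combine, e.g. for session windows.
        a.sum += b.sum;
        a.count += b.count;
        return a;
    }
}
```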
Sliding windows are ubiquitous in feature engineering. A feature defined as "count of login failures in the last 15 minutes, updated every minute" implies a window size of 15 minutes and a slide (hop) of 1 minute.
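With the windowing API, such a feature might look like the sketch below; the loginFailures stream, getUserId() accessor, and CountAggregate function are illustrative:

```java
// 15-minute window, evaluated every minute: each event contributes to 15 windows.
DataStream<Long> failureCounts = loginFailures
    .keyBy(evt -> evt.getUserId())
    .window(SlidingEventTimeWindows.of(Time.minutes(15), Time.minutes(1)))
    .aggregate(new CountAggregate()); // an AggregateFunction that increments a long
```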
While simple to define, sliding windows can be computationally expensive if not tuned. A single event belongs to $\lceil \text{size} / \text{slide} \rceil$ windows. For a 1-hour window sliding every minute, every event belongs to 60 overlapping windows. If Flink were to duplicate the event for each window bucket, the state size would explode.
Flink optimizes this by slicing the stream into "panes" based on the greatest common divisor of the window and slide interval. However, for high-throughput feature generation, it is often more efficient to use an Exponential Moving Average (EMA). The EMA places greater weight on recent data points and decays older information mathematically, eliminating the need for strict window boundaries and buffer management.
The recursive formula for EMA is:

$$\text{EMA}_t = \alpha \cdot x_t + (1 - \alpha) \cdot \text{EMA}_{t-1}$$

Here, $\alpha$ represents the smoothing factor, where $0 < \alpha < 1$. A higher $\alpha$ discounts older observations faster. This calculation requires storing only the previous EMA value, making it highly efficient for low-latency feature pipelines.
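A minimal sketch of a per-key EMA feature using keyed ValueState, assuming the stream is already keyed by entity and carries (id, value) tuples; the class name and the ALPHA constant are illustrative choices:

```java
import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;

public class EmaFeature extends RichFlatMapFunction<Tuple2<String, Double>, Tuple2<String, Double>> {
    private static final double ALPHA = 0.1;       // smoothing factor
    private transient ValueState<Double> emaState; // only the previous EMA is stored

    @Override
    public void open(Configuration parameters) {
        emaState = getRuntimeContext().getState(
            new ValueStateDescriptor<>("ema", Double.class));
    }

    @Override
    public void flatMap(Tuple2<String, Double> event,
                        Collector<Tuple2<String, Double>> out) throws Exception {
        Double previous = emaState.value();
        // The first event seeds the EMA; afterwards apply the recursive formula.
        double ema = (previous == null)
            ? event.f1
            : ALPHA * event.f1 + (1 - ALPHA) * previous;
        emaState.update(ema);
        out.collect(Tuple2.of(event.f0, ema));
    }
}
```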
Real-time features depend heavily on the correctness of time. If network issues delay an event, it might arrive after the window for its timestamp has theoretically closed. In fraud detection, processing a transaction out of order could lead to a feature vector that misses a critical signal, resulting in a false negative.
Flink handles this via watermarks. A watermark is a special marker that flows with the stream, carrying a timestamp $t$ and declaring that no more events with a timestamp less than $t$ will arrive. You must configure the watermark generation strategy to balance latency and completeness.
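A sketch of a bounded-out-of-orderness strategy, assuming a Transaction event with a getEventTime() accessor returning epoch milliseconds (both illustrative); the 5-second bound is a tunable assumption:

```java
// Tolerate up to 5 seconds of disorder: larger bounds improve completeness
// at the cost of added latency before windows can fire.
WatermarkStrategy<Transaction> strategy = WatermarkStrategy
    .<Transaction>forBoundedOutOfOrderness(Duration.ofSeconds(5))
    .withTimestampAssigner((txn, recordTs) -> txn.getEventTime());

DataStream<Transaction> timestamped = transactions.assignTimestampsAndWatermarks(strategy);
```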
For online inference, we often cannot afford to wait for late data. The standard pattern is to configure a short allowed lateness period and redirect any extremely late data to a side output for monitoring, ensuring the main feature pipeline remains low-latency.
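A sketch of that pattern, reusing the timestamped stream above; the tag name, window size, and lateness bound are illustrative choices:

```java
// Extremely late events bypass the main pipeline and land in a side output.
final OutputTag<Transaction> lateTag = new OutputTag<Transaction>("late-transactions") {};

SingleOutputStreamOperator<Double> features = timestamped
    .keyBy(txn -> txn.getUserId())
    .window(TumblingEventTimeWindows.of(Time.minutes(5)))
    .allowedLateness(Time.seconds(30))   // short grace period keeps latency low
    .sideOutputLateData(lateTag)
    .aggregate(new AvgAggregate());

DataStream<Transaction> lateEvents = features.getSideOutput(lateTag); // route to monitoring
```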
When generating features for millions of users, the state backend becomes the limiting factor. Storing feature state on the Java Heap is rarely feasible due to Garbage Collection pauses. Instead, we configure RocksDB as the state backend. RocksDB stores state on the local disk (SSD) and relies on off-heap memory for caching.
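A sketch of enabling the RocksDB backend programmatically, assuming Flink 1.13+ where EmbeddedRocksDBStateBackend is available; the checkpoint URI is illustrative:

```java
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

// true enables incremental checkpoints, uploading only changed SST files.
env.setStateBackend(new EmbeddedRocksDBStateBackend(true));
env.getCheckpointConfig().setCheckpointStorage("s3://my-bucket/checkpoints"); // illustrative path
```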
However, RocksDB serialization adds overhead. Every state access requires serializing the key and value to bytes. To optimize this, monitor the state.backend.rocksdb.block-cache-usage metric: if the cache is too small, read amplification will occur as the system frequently fetches data from disk.

The following chart illustrates the latency trade-off when interacting with disk-based state versus in-memory computation as throughput scales.
Figure: Comparison of latency characteristics. Heap state offers lower initial latency but degrades rapidly under memory pressure (GC), while RocksDB maintains stable performance at higher throughputs.
Beyond numerical aggregations, models often require categorical features. In batch systems, techniques like One-Hot Encoding or Target Encoding are applied using the full vocabulary of the dataset. In streams, the vocabulary is unbounded and evolving.
For One-Hot Encoding, maintaining a dynamic dictionary in state is dangerous due to unbounded growth. A common stream-friendly alternative is Feature Hashing (hashing trick). We hash the categorical string to a fixed integer space.
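A minimal sketch of the hashing trick; the bucket count of 2^18 is an illustrative choice, and Java's String.hashCode is used for brevity (a stronger hash such as MurmurHash3 is common in practice):

```java
// Map an unbounded categorical vocabulary into a fixed integer space.
public static int hashFeature(String category) {
    final int NUM_BUCKETS = 1 << 18;                         // fixed feature space
    return Math.floorMod(category.hashCode(), NUM_BUCKETS);  // floorMod avoids negative indices
}
```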
This collision-prone method trades accuracy for constant memory usage, which is a necessary compromise in streaming environments. Alternatively, for Target Encoding (replacing a category with the mean of the target variable), we use the same incremental aggregation logic described earlier: maintain a running sum of the target and a count for each category key, updating the map in real-time.
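A sketch of streaming target encoding, assuming the stream is keyed by the categorical value and each event carries a numeric target such as a 0/1 label (class and state names are illustrative):

```java
import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.typeinfo.TypeHint;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;

public class TargetEncoder extends RichFlatMapFunction<Tuple2<String, Double>, Tuple2<String, Double>> {
    private transient ValueState<Tuple2<Double, Long>> sumCount; // running sum and count per category

    @Override
    public void open(Configuration parameters) {
        sumCount = getRuntimeContext().getState(new ValueStateDescriptor<>(
            "target-sum-count", TypeInformation.of(new TypeHint<Tuple2<Double, Long>>() {})));
    }

    @Override
    public void flatMap(Tuple2<String, Double> event,
                        Collector<Tuple2<String, Double>> out) throws Exception {
        Tuple2<Double, Long> acc = sumCount.value();
        if (acc == null) {
            acc = Tuple2.of(0.0, 0L);
        }
        acc.f0 += event.f1; // running sum of the target
        acc.f1 += 1;        // running count
        sumCount.update(acc);
        out.collect(Tuple2.of(event.f0, acc.f0 / acc.f1)); // encoded value = running mean
    }
}
```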
By mastering these aggregation and state management techniques, you ensure that the feature vectors serving your AI models are both computationally efficient and temporally relevant.