Machine learning models often suffer from performance degradation when deployed to production because of training-serving skew. This phenomenon occurs when the data distribution or feature calculation logic used during inference differs from the historical data used to train the model. The Feature Store acts as the central interface in a streaming architecture that resolves this discrepancy by guaranteeing consistency between the offline training environment and the online inference environment.
To satisfy the competing requirements of low-latency lookup and high-volume historical analysis, feature stores typically implement a dual-database architecture. Flink acts as the computational engine that populates both storage layers simultaneously.
In this architecture, the Flink pipeline serves as the synchronization mechanism. As events traverse the stream, Flink computes the aggregate feature (such as clicks_last_5_minutes) and performs a dual-write operation.
Data flow showing Flink synchronizing computed features to both the Online Store for immediate inference and the Offline Store for historical training data.
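As a sketch of this topology, the fragment below computes a five-minute click count per user and fans the result out to both stores. ClickSource, ClickEvent, ClickCountAggregate, OnlineStoreSink, and buildOfflineSink() are illustrative placeholders, not library classes; the online-store path is developed in detail later in this section.

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

DataStream<FeaturePayload> features = env
    .addSource(new ClickSource())                         // assumed source emitting ClickEvent
    .assignTimestampsAndWatermarks(
        WatermarkStrategy.<ClickEvent>forBoundedOutOfOrderness(Duration.ofSeconds(30))
            .withTimestampAssigner((event, ts) -> event.getTimestamp()))
    .keyBy(ClickEvent::getUserId)
    .window(SlidingEventTimeWindows.of(Time.minutes(5), Time.seconds(30)))
    .aggregate(new ClickCountAggregate());                // emits clicks_last_5_minutes per user

// Online store: low-latency upsert path (a placeholder sink here; an asynchronous variant appears later).
features.addSink(new OnlineStoreSink());

// Offline store: append-only historical log used to build training sets.
features.sinkTo(buildOfflineSink());

env.execute("feature-dual-write");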
The primary engineering challenge when writing to the online store is maintaining high throughput without blocking the Flink operator. A synchronous call to a database like Redis for every event caps throughput at roughly one request per network round-trip per parallel subtask.
To mitigate this, you should use Flink's Asynchronous I/O API, which allows the stream processing operator to keep multiple requests to the external store in flight concurrently. When the operator runs in ordered mode, Flink also preserves the order of results, ensuring that updates to the same key are applied in the correct sequence.
When designing the sink for the online store, you generally use an "Upsert" (Update or Insert) semantic. Since the online store represents the current state, older values are overwritten.
In upsert terms, the stored value is replaced by $v_{\text{new}} = f(s, e)$, where $v_{\text{new}}$ is the new feature value, $s$ is the internal Flink state (like a window accumulator), and $e$ is the triggering event.
The following code illustrates how to implement an asynchronous function to update a Redis feature store. This approach uses the AsyncFunction interface to decouple the database latency from the stream processing throughput.
import io.lettuce.core.RedisClient;
import io.lettuce.core.api.StatefulRedisConnection;
import io.lettuce.core.api.async.RedisAsyncCommands;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.async.ResultFuture;
import org.apache.flink.streaming.api.functions.async.RichAsyncFunction;
import java.util.Collections;
import java.util.concurrent.CompletionStage;

public class RedisFeatureUpdater extends RichAsyncFunction<FeaturePayload, String> {

    private transient RedisClient redisClient;
    private transient StatefulRedisConnection<String, String> connection;

    @Override
    public void open(Configuration parameters) {
        // One Redis client and connection per parallel subtask.
        redisClient = RedisClient.create("redis://localhost:6379");
        connection = redisClient.connect();
    }

    @Override
    public void asyncInvoke(FeaturePayload input, ResultFuture<String> resultFuture) {
        // Asynchronously upsert the feature vector without blocking the operator thread.
        RedisAsyncCommands<String, String> async = connection.async();

        // Key: entity ID, Value: serialized feature vector
        CompletionStage<String> future = async.set(
            "feature:" + input.getEntityId(),
            input.toJson()
        );

        future.whenComplete((res, error) -> {
            if (error != null) {
                resultFuture.completeExceptionally(error);
            } else {
                resultFuture.complete(Collections.singleton(res));
            }
        });
    }

    @Override
    public void close() {
        // Release the connection and client when the subtask shuts down.
        if (connection != null) { connection.close(); }
        if (redisClient != null) { redisClient.shutdown(); }
    }
}
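To attach this function to the feature stream, wrap the stream with Flink's AsyncDataStream. Using orderedWait preserves the order of results discussed earlier; the timeout and capacity values below are illustrative.

// Wire the async updater into the pipeline; timeout and capacity are illustrative.
DataStream<String> acks = AsyncDataStream.orderedWait(
    features,                     // DataStream<FeaturePayload> produced by the aggregation
    new RedisFeatureUpdater(),
    1000, TimeUnit.MILLISECONDS,  // timeout per request
    100                           // maximum in-flight requests per subtask
);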
A critical requirement for the offline store is the ability to reconstruct the state of a feature at any specific historical timestamp. This is known as time travel. If a model predicts fraud based on the number of transactions in the last hour, the training data must reflect the count exactly at the moment the fraud label was generated, not the count at the end of the day.
When Flink writes to the offline store, it must append records rather than update them. Each record in the offline store should contain the entity identifier, the feature name and computed value, and the event timestamp at which the value became valid.
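A minimal sketch of such a record, with illustrative field names:

// Illustrative schema for one row in the append-only offline store.
public record OfflineFeatureRecord(
    String entityId,       // which entity the feature describes
    String featureName,    // e.g. "clicks_last_5_minutes"
    double featureValue,   // the computed aggregate
    long eventTimestamp    // when the value became valid; used by the as-of join
) {}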
The offline store essentially becomes an append-only log of feature mutations. When data scientists generate a training dataset, they apply "as-of join" logic: for each training example, select the most recent feature value for the entity that was available at or strictly before the time the target variable was observed.
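The semantics of this lookup can be sketched in plain Java. Assuming an entity's feature history is held in a map keyed by event timestamp (a simplification of what an offline query engine does over the append-only log), the as-of lookup is a floor search:

import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

// Point-in-time lookup over an entity's append-only feature history.
public class AsOfLookup {

    // Returns the latest feature value written at or before labelTimestamp,
    // or null if no value was available yet (avoiding data leakage).
    public static String featureAsOf(NavigableMap<Long, String> featureHistory,
                                     long labelTimestamp) {
        Map.Entry<Long, String> entry = featureHistory.floorEntry(labelTimestamp);
        return entry == null ? null : entry.getValue();
    }

    public static void main(String[] args) {
        NavigableMap<Long, String> history = new TreeMap<>();
        history.put(1_000L, "{\"clicks_last_5_minutes\": 2}");
        history.put(5_000L, "{\"clicks_last_5_minutes\": 7}");

        // A label observed at t=4500 must see the value written at t=1000,
        // not the later value written at t=5000.
        System.out.println(featureAsOf(history, 4_500L)); // {"clicks_last_5_minutes": 2}
    }
}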
Integrating a feature store introduces a trade-off between data freshness and system complexity. In batch-based feature engineering, features are updated periodically (e.g., daily), leading to a "sawtooth" freshness pattern where data becomes progressively staler until the next batch run. Streaming integration creates a near-flat freshness line, keeping the difference between event time and feature availability minimal.
However, writing to a remote store introduces network latency. Optimizing the Flink Sink often involves buffering writes or using pipelining.
Comparison of feature freshness. Batch updates degrade over time, creating a discrepancy between reality and the model's view. Streaming updates maintain near-zero latency.
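As an illustration of pipelining, the sketch below reuses the Lettuce connection type from the earlier example: it disables auto-flush, queues a batch of SET commands locally, and sends them in a single flush. The batch handling and the five-second timeout are illustrative choices.

import io.lettuce.core.LettuceFutures;
import io.lettuce.core.RedisFuture;
import io.lettuce.core.api.StatefulRedisConnection;
import io.lettuce.core.api.async.RedisAsyncCommands;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.TimeUnit;

public class PipelinedFeatureWriter {

    private final StatefulRedisConnection<String, String> connection;

    public PipelinedFeatureWriter(StatefulRedisConnection<String, String> connection) {
        this.connection = connection;
    }

    // Writes a buffered batch of feature payloads using a single network flush.
    public void writeBatch(List<FeaturePayload> batch) {
        RedisAsyncCommands<String, String> async = connection.async();
        // Queue commands locally instead of flushing each one to the socket.
        connection.setAutoFlushCommands(false);

        List<RedisFuture<String>> futures = new ArrayList<>();
        for (FeaturePayload payload : batch) {
            futures.add(async.set("feature:" + payload.getEntityId(), payload.toJson()));
        }

        // Send all queued commands at once.
        connection.flushCommands();
        // Wait (bounded) for acknowledgements before acking or checkpointing upstream.
        LettuceFutures.awaitAll(5, TimeUnit.SECONDS, futures.toArray(new RedisFuture[0]));
        connection.setAutoFlushCommands(true);
    }
}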
In a distributed system, events often arrive out of order. Flink's watermarking mechanism handles aggregation correctness internally, but updating the external feature store requires specific strategies to prevent "zombie" data, where an old event overwrites a newer value in the online store.
There are two primary strategies to handle this synchronization:
1. Conditional writes (compare-and-swap): Store the feature value alongside a last_updated_timestamp column. The sink function uses a conditional write: the update is only applied if the incoming record's timestamp is greater than the stored timestamp (sketched below).
2. Commutative updates: The sink applies increments (e.g., INCRBY in Redis) or stores raw partial aggregates that are summed at query time. This is resilient to ordering but requires the read path (the inference service) to perform the final computation.

For the offline store, late data is less problematic because it is an append-only log. The challenge there is strictly analytical: ensuring that the "as-of join" query correctly accounts for when the data would have been available for inference, to avoid data leakage during training.
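Returning to the conditional-write strategy for the online store, the sketch below makes the timestamp comparison and the write atomic with a small Lua script executed through the Lettuce client. getEventTimestamp() on FeaturePayload is an assumed accessor; everything else follows the earlier RedisFeatureUpdater example.

// Conditional (compare-and-swap) upsert: only write if the incoming event
// timestamp is newer than the one already stored for this entity.
private static final String CAS_SCRIPT =
    "local stored = tonumber(redis.call('HGET', KEYS[1], 'ts')) " +
    "if stored == nil or tonumber(ARGV[1]) > stored then " +
    "  redis.call('HSET', KEYS[1], 'ts', ARGV[1], 'value', ARGV[2]) " +
    "  return 1 " +
    "end " +
    "return 0";

public void conditionalUpdate(RedisAsyncCommands<String, String> async,
                              FeaturePayload input,
                              ResultFuture<String> resultFuture) {
    async.eval(CAS_SCRIPT,
               ScriptOutputType.INTEGER,
               new String[]{"feature:" + input.getEntityId()},
               String.valueOf(input.getEventTimestamp()),  // assumed accessor
               input.toJson())
         .whenComplete((applied, error) -> {
             if (error != null) {
                 resultFuture.completeExceptionally(error);
             } else {
                 // applied == 1 means the write won; 0 means a newer value was already stored.
                 resultFuture.complete(Collections.singleton(String.valueOf(applied)));
             }
         });
}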