Real-time inference systems frequently require a synchronous contract with the end user. When a user executes a transaction or submits a search query, they expect an immediate result based on the latest data. While Apache Flink and Kafka excel at high-throughput asynchronous processing, bridging the gap between a blocking HTTP client and a non-blocking streaming topology requires an architectural pattern: a request-response loop implemented over an event-driven backbone, often referred to as the Correlation Identifier Pattern.

## The Impedance Mismatch

The fundamental challenge lies in the operational semantics of the protocols. HTTP and gRPC are synchronous by default: the client opens a socket and waits for bytes to return. Kafka is strictly asynchronous: producers push data to a log and disconnect, or move on to the next message, without waiting for consumers to process that specific record.

To serve a machine learning model via Flink while maintaining a RESTful interface for the client, the API Gateway must act as a sophisticated mediator. It translates a blocking request into a message production event and effectively suspends the HTTP exchange (without pinning a thread, given async servlet support) until a corresponding message arrives on a reply channel.

## Architecture of Asynchronous Request-Reply

The implementation relies on two Kafka topics: a request topic and a reply topic. The workflow follows a strict sequence of events involving a Correlation ID, which uniquely identifies the transaction across the distributed system.

1. **Request Generation:** The Gateway receives an HTTP POST request. It generates a unique UUID (the Correlation ID) and instantiates a `CompletableFuture` (or equivalent promise object) in local memory, mapped to that ID.
2. **Ingestion:** The Gateway produces a record to the request topic. The payload contains the feature vector or raw data, and the Correlation ID is injected into the Kafka record headers.
3. **Processing:** Flink consumes the request.
   It performs feature engineering (e.g., joining with windowed aggregates), calls the model, and generates a prediction.
4. **Response:** Flink produces the prediction result to the reply topic. Crucially, it must copy the Correlation ID from the input headers to the output headers.
5. **Completion:** The Gateway consumes from the reply topic. It extracts the Correlation ID, looks up the pending `CompletableFuture` in its local memory map, completes the future with the prediction result, and responds to the HTTP client.

```dot
digraph G {
  rankdir=LR;
  node [shape=box, style=filled, fontname="Helvetica", fontsize=10];
  edge [fontname="Helvetica", fontsize=9];
  subgraph cluster_0 {
    label="Synchronous Domain"; style=filled; color="#e9ecef";
    Client [label="HTTP Client", fillcolor="#a5d8ff", color="#1c7ed6"];
    Gateway [label="API Gateway\n(Async Servlet)", fillcolor="#bac8ff", color="#4263eb"];
  }
  subgraph cluster_1 {
    label="Asynchronous Domain"; style=filled; color="#e9ecef";
    RequestTopic [label="Request Topic\n(Kafka)", fillcolor="#ffc9c9", color="#fa5252"];
    FlinkJob [label="Flink Feature Eng.\n& Inference", fillcolor="#b2f2bb", color="#37b24d"];
    ReplyTopic [label="Reply Topic\n(Kafka)", fillcolor="#ffc9c9", color="#fa5252"];
  }
  Client -> Gateway [label="POST /predict"];
  Gateway -> RequestTopic [label="Produce (ID: 123)"];
  RequestTopic -> FlinkJob [label="Consume"];
  FlinkJob -> ReplyTopic [label="Produce (ID: 123)"];
  ReplyTopic -> Gateway [label="Consume"];
  Gateway -> Client [label="200 OK"];
}
```

The API Gateway acts as the bridge between the synchronous HTTP protocol and the asynchronous Kafka log, using Correlation IDs to map responses back to the pending open connections.

## Scalability and the Return Address Problem

A naive implementation of the pattern above fails in a horizontally scaled environment. If you deploy ten instances of the API Gateway, the instance that produced the request (and holds the open HTTP connection) is not guaranteed to be the one that consumes the reply.
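Before addressing that scaling problem, it helps to see the single-instance flow in code. The following is a minimal, self-contained sketch of the gateway's correlation table: the Kafka produce/consume calls are simulated with direct method calls so the example runs without a broker, and the class name and fake prediction format are illustrative, not from the original text.

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.*;

// Sketch of the gateway's correlation table (steps 1-2 and 5 of the workflow).
// In a real gateway, produceRequest() would be a KafkaProducer send with the
// Correlation ID in the record headers, and onReply() would be invoked by the
// reply-topic consumer loop.
public class CorrelationGateway {
    private final Map<String, CompletableFuture<String>> pending = new ConcurrentHashMap<>();

    // Steps 1-2: register a future under a fresh Correlation ID, then produce.
    public CompletableFuture<String> submit(String featureVector) {
        String correlationId = UUID.randomUUID().toString();
        CompletableFuture<String> future = new CompletableFuture<>();
        pending.put(correlationId, future);
        produceRequest(correlationId, featureVector); // simulated Kafka produce
        return future;
    }

    // Step 5: called once per reply record; unknown IDs are silently dropped.
    public void onReply(String correlationId, String prediction) {
        CompletableFuture<String> future = pending.remove(correlationId);
        if (future != null) {
            future.complete(prediction);
        }
    }

    // Simulated transport: echoes a fake prediction back, standing in for Flink.
    private void produceRequest(String correlationId, String featureVector) {
        onReply(correlationId, "prediction-for:" + featureVector);
    }

    public static void main(String[] args) throws Exception {
        CorrelationGateway gateway = new CorrelationGateway();
        String result = gateway.submit("[0.1, 0.7]").get(1, TimeUnit.SECONDS);
        System.out.println(result); // prints "prediction-for:[0.1, 0.7]"
    }
}
```

The HTTP handler returns the future to the servlet container rather than blocking, which is what lets one gateway instance keep thousands of requests in flight.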
If Gateway-A produces the request but Gateway-B consumes the reply, Gateway-A will eventually time out, resulting in a failed user request despite the model functioning correctly.

To resolve this, we utilize Kafka's partitioning mechanics as a routing layer. This is often implemented with a specific "reply-to" header that indicates not just the topic, but the specific partition or metadata required to route the message back to the correct producer instance.

### Strategy: Gateway-Specific Partitions

In this approach, each Gateway instance is assigned a static partition of the reply topic. When Gateway-Instance-3 produces a request, it includes a header `Reply-Partition: 3`. The Flink job, upon completing the inference, reads this header and explicitly produces the result to partition 3 of the reply topic.

This ensures data locality. Gateway-Instance-3 consumes only from partition 3, guaranteeing it receives only the responses to the requests it initiated. This eliminates the need for Gateways to consume and discard millions of irrelevant messages meant for other instances.

The reliability of this system can be modeled mathematically. Let $T_{timeout}$ be the maximum time the Gateway waits before sending a 504 Gateway Timeout. The total processing latency $L_{total}$ is the sum of network transit, queuing, and inference time:

$$ L_{total} = L_{net} + L_{queue(req)} + L_{inference} + L_{queue(rep)} $$

For the system to be viable, we must ensure that the 99th percentile of latency satisfies:

$$ P_{99}(L_{total}) < T_{timeout} $$

## Handling Timeouts and Ghost State

In high-throughput environments, requests will occasionally time out. If the Flink job is stalled by backpressure or a garbage collection pause, $L_{total}$ may exceed $T_{timeout}$. The Gateway will close the HTTP connection and remove the `CompletableFuture` from memory.

However, the message is still in Kafka. Flink will eventually process it and produce a reply.
When the Gateway receives this "late" reply, it will find no matching ID in its pending map. This scenario, known as a "ghost response," requires careful handling: the Gateway must silently discard messages with unknown Correlation IDs to prevent memory leaks and log flooding.

Furthermore, the map holding pending requests must be implemented with an automatic eviction policy. If the Kafka consumer thread crashes or the reply message is lost, the pending future would otherwise remain in memory forever. Backing the `ConcurrentHashMap` with time-based eviction (or using a library like Caffeine) is mandatory to prevent an `OutOfMemoryError` in the Gateway layer.

```json
{
  "layout": {
    "width": 700, "height": 350,
    "title": "Latency Distribution vs Timeout Threshold",
    "xaxis": {"title": "Latency (ms)", "showgrid": true, "color": "#495057"},
    "yaxis": {"title": "Probability Density", "showgrid": true, "color": "#495057"},
    "plot_bgcolor": "rgba(0,0,0,0)",
    "paper_bgcolor": "rgba(0,0,0,0)",
    "font": {"family": "Helvetica", "color": "#343a40"}
  },
  "data": [
    {
      "x": [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 130, 140, 150],
      "y": [0.005, 0.01, 0.025, 0.04, 0.045, 0.04, 0.025, 0.015, 0.01, 0.005, 0.002, 0.001, 0.0005, 0.0001, 0],
      "type": "scatter", "mode": "lines", "fill": "tozeroy",
      "name": "System Latency", "line": {"color": "#4dabf7"}
    },
    {
      "x": [100, 100], "y": [0, 0.05],
      "type": "scatter", "mode": "lines",
      "name": "HTTP Timeout (100ms)",
      "line": {"color": "#fa5252", "dash": "dash"}
    }
  ]
}
```

This distribution demonstrates the risk of tail latency. The area under the curve to the right of the red dashed line represents the percentage of requests that will result in a timeout error, even if the computation eventually succeeds.

## Serialization and Header Protocol

Efficiency in this pattern dictates that metadata should remain in the headers, keeping the payload strictly for the feature vector or model input.
Kafka headers are essentially key-value pairs of byte arrays. When implementing the Flink `SerializationSchema`, you must extract headers from the incoming `ConsumerRecord` and propagate them to the outgoing `ProducerRecord`; Flink's `KafkaRecordSerializationSchema` allows direct access to the transport headers.

A typical header protocol for an AI inference request might look like this:

- `X-Correlation-ID: 550e8400-e29b-41d4-a716-446655440000` (UUID)
- `X-Reply-Topic: inference.replies.v1`
- `X-Reply-Partition: 4` (the integer ID of the gateway instance)
- `X-Creation-Timestamp: 1678886400000` (used to calculate time in flight)

By standardizing these headers, you decouple the Flink job from the specific gateway implementation. The Flink job becomes a pure function: it accepts data, processes it, and returns it to the address specified in the headers, unaware of whether the caller is a REST API, a WebSocket server, or another internal microservice.
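Because every header value travels as a raw `byte[]`, each field of the protocol above needs an explicit encoding. The following JDK-only sketch shows one plausible codec for these headers (the `InferenceHeaders` class and the choice of UTF-8 strings, a 4-byte int, and an 8-byte long are illustrative assumptions, not mandated by Kafka):

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Illustrative codec for the header protocol above. Kafka transports every
// header value as a byte[]; here strings are UTF-8, the partition is a
// 4-byte big-endian int, and the timestamp is an 8-byte big-endian long.
public class InferenceHeaders {
    public static Map<String, byte[]> encode(String correlationId, String replyTopic,
                                             int replyPartition, long creationTimestamp) {
        Map<String, byte[]> headers = new HashMap<>();
        headers.put("X-Correlation-ID", correlationId.getBytes(StandardCharsets.UTF_8));
        headers.put("X-Reply-Topic", replyTopic.getBytes(StandardCharsets.UTF_8));
        headers.put("X-Reply-Partition", ByteBuffer.allocate(4).putInt(replyPartition).array());
        headers.put("X-Creation-Timestamp", ByteBuffer.allocate(8).putLong(creationTimestamp).array());
        return headers;
    }

    // The Flink job reads this to pick the target partition of the reply topic.
    public static int decodePartition(Map<String, byte[]> headers) {
        return ByteBuffer.wrap(headers.get("X-Reply-Partition")).getInt();
    }

    // The gateway reads this to compute time-in-flight for the reply.
    public static long decodeTimestamp(Map<String, byte[]> headers) {
        return ByteBuffer.wrap(headers.get("X-Creation-Timestamp")).getLong();
    }

    public static void main(String[] args) {
        Map<String, byte[]> h = encode("550e8400-e29b-41d4-a716-446655440000",
                                       "inference.replies.v1", 4, 1678886400000L);
        System.out.println(decodePartition(h));  // prints 4
        System.out.println(decodeTimestamp(h));  // prints 1678886400000
    }
}
```

Inside the Flink job, these same `byte[]` values are simply copied from the consumed record's headers onto the produced record, so the job never needs to interpret anything beyond the reply address.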