Deploying machine learning models within a stream processing pipeline requires a fundamental shift from traditional batch scoring. Batch contexts prioritize throughput, and per-record latency is largely irrelevant. Streaming systems, particularly those built on Flink and Kafka, demand an architecture optimized for end-to-end latency while still sustaining high throughput. How the model is integrated, whether embedded directly in an operator or exposed as an external service, dictates the fault tolerance, scalability, and operational complexity of the pipeline.

## The Embedded Model Pattern

The most performance-efficient approach for low-latency inference is the embedded model pattern. In this architecture, the model artifact is loaded directly into the memory of the Flink TaskManager, and the inference logic executes as a standard operator function, typically a `RichMapFunction` or `RichFlatMapFunction`.

This approach eliminates network overhead. Data stays within the process boundary, passing from the feature generation operator to the inference operator through local memory, or through network buffers if a shuffle is required. For models that can be represented as POJOs (Plain Old Java Objects) or serialized in portable formats such as PMML or ONNX, this yields sub-millisecond inference times.

To implement this, override the `open()` method of the Flink function to load the model. This ensures the model is loaded once per parallel instance rather than once per event.

```java
import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.configuration.Configuration;

public class EmbeddedInference extends RichMapFunction<FeatureVector, Prediction> {

    // Transient: the model is loaded in open(), not shipped with the serialized function.
    private transient ModelWrapper model;

    @Override
    public void open(Configuration parameters) {
        // Load the model from a distributed file system or cache, once per parallel instance
        this.model = ModelLoader.load("s3://models/v1/classifier.onnx");
    }

    @Override
    public Prediction map(FeatureVector features) {
        return model.predict(features);
    }
}
```

However, embedding models introduces significant resource management challenges. The model competes with the Flink runtime for CPU and memory, and a memory leak in the model's native libraries (common in JNI bridges for TensorFlow or PyTorch) can crash the entire TaskManager, destabilizing the pipeline. Furthermore, scaling the inference layer means scaling the entire Flink job, which can lead to over-provisioning when the bottleneck is inference compute rather than stream I/O.

```dot
digraph G {
    rankdir=TB;
    node [style=filled, shape=rect, fontname="Arial"];
    edge [fontname="Arial"];

    subgraph cluster_tm {
        label="Flink TaskManager (JVM)";
        style=filled;
        color="#dee2e6";
        node [style=filled, color="#ffffff"];

        Source [label="Source\n(Kafka Consumer)"];
        FeatureEng [label="Feature\nEngineering"];
        Inference [label="Embedded Model\n(Inference Operator)", color="#d0bfff"];
        Sink [label="Sink\n(Kafka Producer)"];

        Source -> FeatureEng;
        FeatureEng -> Inference [label="Local/Network Buffer"];
        Inference -> Sink;
    }
}
```

*Data flow within a TaskManager using the embedded pattern. The inference logic runs in the same JVM as the stream processing, sharing its resources.*
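The `ModelLoader` and `ModelWrapper` types in the snippet above are placeholders for whatever runtime actually executes the model. As one concrete possibility, here is a minimal sketch of the same operator built on ONNX Runtime's Java API; the local model path, the `"input"` tensor name, and the flat `float[]` feature and score types are assumptions rather than anything prescribed by Flink or ONNX Runtime.

```java
import java.util.Collections;

import ai.onnxruntime.OnnxTensor;
import ai.onnxruntime.OrtEnvironment;
import ai.onnxruntime.OrtSession;
import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.configuration.Configuration;

public class OnnxEmbeddedInference extends RichMapFunction<float[], float[]> {

    private transient OrtEnvironment env;
    private transient OrtSession session;

    @Override
    public void open(Configuration parameters) throws Exception {
        // Assumes the artifact was already staged to a local path,
        // e.g. via Flink's distributed cache or an init container.
        env = OrtEnvironment.getEnvironment();
        session = env.createSession("/opt/models/classifier.onnx", new OrtSession.SessionOptions());
    }

    @Override
    public float[] map(float[] features) throws Exception {
        // The input tensor name and output layout are model-specific assumptions.
        try (OnnxTensor tensor = OnnxTensor.createTensor(env, new float[][] {features});
             OrtSession.Result result = session.run(Collections.singletonMap("input", tensor))) {
            return ((float[][]) result.get(0).getValue())[0];
        }
    }

    @Override
    public void close() throws Exception {
        if (session != null) {
            session.close();
        }
    }
}
```

Because the session lives for the lifetime of the parallel instance, the per-event cost is only tensor construction plus the native inference call, which is what makes sub-millisecond latencies realistic for small models.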
## The External Service Pattern

For large deep learning models, or for scenarios requiring strict resource isolation, the external service pattern is preferred. Here, the Flink operator acts as a client, sending RPC calls (gRPC or REST) to dedicated model serving infrastructure such as TensorFlow Serving, TorchServe, or NVIDIA Triton. This decouples the lifecycle of the stream processor from that of the model: the serving layer scales independently on specialized hardware (GPUs/TPUs), while Flink manages state and data flow on standard commodity hardware.

Implementation relies heavily on Flink's Asynchronous I/O API (`AsyncDataStream`). Synchronous calls would block the operator thread, propagating backpressure upstream and drastically reducing throughput. With the `AsyncWaitOperator`, Flink keeps many requests in flight concurrently.

The total latency $L_{total}$ for a single event in this pattern is:

$$ L_{total} = L_{net} + L_{queue} + L_{inference} + L_{serde} $$

where $L_{net}$ is the network round-trip time, $L_{queue}$ is the wait time in the serving layer's queue, $L_{inference}$ is the model execution time, and $L_{serde}$ is the cost of serializing the feature vector and deserializing the prediction.

| Percentile | Embedded (ms) | External, gRPC (ms) |
|------------|---------------|---------------------|
| P50        | 2             | 25                  |
| P95        | 5             | 45                  |
| P99        | 12            | 120                 |

*Latency distribution comparison between embedded execution and external RPC calls. External calls introduce higher baseline latency and greater variance due to network conditions.*

The external pattern also introduces complexity around ordering and checkpoints. Flink offers two modes for emitting async results:

- **Ordered wait** (`orderedWait`): if event A enters the operator before event B, the result of A is emitted before the result of B. This adds buffering latency, because the operator must hold back faster results until earlier ones arrive.
- **Unordered wait** (`unorderedWait`): results are emitted as soon as the external service responds. This minimizes latency and maximizes throughput but scrambles the event order; if strict ordering is required downstream (for example, in time-series analysis), the stream must be re-ordered using watermarks and windowing.

## Dynamic Model Updates

In production environments, models degrade over time and require retraining. Restarting the pipeline to swap in a new model binary is often unacceptable because of the downtime and state recovery costs.

For the embedded pattern, Flink's Broadcast State pattern offers a solution. Define a secondary stream connected to a model repository or a notification topic. When a new model version becomes available, the control stream broadcasts the model metadata (or the model file itself, if small) to all parallel inference operators, which then hot-swap the model instance in memory; a sketch of such a processor follows the wiring snippet below.

```java
// Control stream for model updates
BroadcastStream<ModelConfig> controlStream = env
    .addSource(new ModelConfigSource())
    .broadcast(modelStateDescriptor);

// Connect the data stream with the control stream
dataStream
    .connect(controlStream)
    .process(new BroadcastModelProcessor());
```
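The `BroadcastModelProcessor` is left undefined above. A minimal sketch of what it could look like, assuming `ModelConfig` exposes a `getModelUri()` accessor and reusing the hypothetical `ModelLoader`/`ModelWrapper` helpers from the embedded example:

```java
import org.apache.flink.streaming.api.functions.co.BroadcastProcessFunction;
import org.apache.flink.util.Collector;

public class BroadcastModelProcessor
        extends BroadcastProcessFunction<FeatureVector, ModelConfig, Prediction> {

    // Broadcast elements reach every parallel instance, so each instance
    // maintains its own copy of the currently active model.
    private transient ModelWrapper model;

    @Override
    public void processBroadcastElement(ModelConfig config, Context ctx, Collector<Prediction> out) throws Exception {
        // Hot-swap: load the new version, then replace the in-memory reference.
        this.model = ModelLoader.load(config.getModelUri());
    }

    @Override
    public void processElement(FeatureVector features, ReadOnlyContext ctx, Collector<Prediction> out) throws Exception {
        // Before the first broadcast arrives there is no model; events are dropped here,
        // but buffering them or bundling a default model are common alternatives.
        if (model != null) {
            out.collect(model.predict(features));
        }
    }
}
```

In practice the active `ModelConfig` would also be written into the broadcast state (via `ctx.getBroadcastState(...)`) so the chosen version survives checkpoints and restores; that bookkeeping is omitted here for brevity.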
In the external service pattern, updates are handled transparently to Flink: the model serving cluster performs a rolling update or canary deployment, and Flink simply keeps sending requests to the load balancer endpoint. This simplifies the operational burden on the data engineering team, but it still requires coordination to ensure schema compatibility between the features generated by Flink and the input expected by the new model version.

## Choosing the Right Architecture

The choice between the embedded and external patterns usually comes down to the complexity of the model and the operational constraints of the organization.

| Factor              | Embedded Pattern               | External Service Pattern |
|---------------------|--------------------------------|--------------------------|
| Model size          | Small to medium (< 500 MB)     | Large (GBs), LLMs        |
| Latency requirement | Ultra-low (< 10 ms)            | Moderate (> 20 ms)       |
| Hardware            | CPU (mostly)                   | GPU/TPU                  |
| Scaling             | Coupled with Flink parallelism | Independent scaling      |
| Deployment          | Requires pipeline update       | Independent of pipeline  |

For many high-frequency trading or fraud detection use cases, the embedded pattern is non-negotiable because of the latency constraints. For recommendation engines or content analysis where deep learning is required, the external pattern provides the necessary flexibility and hardware support.

An emerging hybrid approach, often called the Sidecar Pattern, attempts to bridge these worlds. In a Kubernetes environment, the model container runs in the same pod as the Flink TaskManager, and the two communicate over the loopback interface (localhost). This removes physical network hops, cutting $L_{net}$ significantly, while preserving process isolation so that a model crash cannot bring down the JVM. The trade-off is added complexity in the Kubernetes scheduler and pod resource definitions.
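Whether the endpoint is a remote serving cluster behind a load balancer or a sidecar listening on localhost, the Flink client side looks the same. The following is a hedged sketch of the asynchronous client described earlier, using `AsyncDataStream` with Java's built-in `HttpClient`; the endpoint URL, the 200 ms request timeout, and the `toJson()`/`fromJson()` helpers on the hypothetical `FeatureVector` and `Prediction` types are all assumptions.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import java.util.Collections;
import java.util.concurrent.TimeoutException;

import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.async.ResultFuture;
import org.apache.flink.streaming.api.functions.async.RichAsyncFunction;

public class HttpInferenceFunction extends RichAsyncFunction<FeatureVector, Prediction> {

    private transient HttpClient client;

    @Override
    public void open(Configuration parameters) {
        // One HTTP client per parallel instance, reused across requests.
        this.client = HttpClient.newHttpClient();
    }

    @Override
    public void asyncInvoke(FeatureVector features, ResultFuture<Prediction> resultFuture) {
        HttpRequest request = HttpRequest.newBuilder()
            .uri(URI.create("http://localhost:8501/v1/models/classifier:predict")) // hypothetical endpoint
            .timeout(Duration.ofMillis(200))
            .POST(HttpRequest.BodyPublishers.ofString(features.toJson()))          // toJson() assumed
            .build();

        // Non-blocking: the operator thread stays free while the request is in flight.
        client.sendAsync(request, HttpResponse.BodyHandlers.ofString())
            .thenAccept(response -> resultFuture.complete(
                Collections.singleton(Prediction.fromJson(response.body()))))      // fromJson() assumed
            .exceptionally(err -> {
                resultFuture.completeExceptionally(err);
                return null;
            });
    }

    @Override
    public void timeout(FeatureVector input, ResultFuture<Prediction> resultFuture) {
        // Surface slow requests explicitly instead of failing the operator silently.
        resultFuture.completeExceptionally(new TimeoutException("inference request timed out"));
    }
}
```

Wiring it in with unordered results keeps latency low, at the cost of the ordering caveats discussed above:

```java
DataStream<Prediction> predictions = AsyncDataStream.unorderedWait(
    featureStream, new HttpInferenceFunction(), 500, TimeUnit.MILLISECONDS, 100);
```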