From Business Requirement to Production LLM: A Full-Cycle Scenario
Every production LLM that answers a customer question or automates a workflow passed through the hands of several distinct roles (six, in the scenario below) before it served a single request. Each role hands off a well-defined artifact to the next, and when that handoff is fuzzy, that is usually where projects stall for months. This post walks through one realistic scenario end to end, role by role, showing exactly what each person delivers and what they expect to receive from the person before them.
The scenario
A financial services company wants an internal AI assistant that can answer employee questions about internal policy documents (HR policy, compliance procedures, IT guidelines) so that employees no longer have to search through a document portal. Leadership wants it accurate, auditable, and running entirely within the company’s own private infrastructure for data residency reasons. That short brief is about to pass through six different roles before it becomes a running system.

1. Business stakeholder: defines the requirement, not the solution
The business owner (often a VP of HR Operations or Compliance, sponsored by IT leadership) doesn’t specify a model, a GPU count, or an architecture. They specify outcomes and constraints:
- Functional requirement: employees ask a question in natural language, get an accurate answer grounded in current policy documents, with a citation back to the source document.
- Non-functional requirements: data never leaves the company’s private infrastructure (which rules out public LLM APIs), answers must be auditable (every response must be traceable to the documents and the model version that produced it), and the system must serve a few thousand employees without becoming a bottleneck.
- Success metric: reduce policy-related HR tickets by a target percentage within two quarters.
What gets handed to the next role: a requirements document, not a technical design. This is the artifact that everything downstream gets measured against — if the eventual system is technically impressive but doesn’t cite sources, it has failed the requirement regardless of how well-engineered it is.
2. Solution architect / developer: turns the requirement into a design
The architect (often the senior developer or platform architect on the AI team) translates the business requirement into a concrete technical shape, and this is where “which model, which architecture” gets decided:
- Architecture decision: this is a retrieval-augmented generation (RAG) problem, not a fine-tuning problem — the requirement for citations and current, auditable answers points directly at retrieval over a maintained document index, with a general-purpose LLM doing the reasoning and answer composition on top.
- Model selection: given the data-residency constraint, an open-weights model that can be self-hosted (rather than a hosted API model) is selected, sized to what the available GPU budget can serve at acceptable latency.
- System design: API gateway → retrieval service (vector database + embedding model) → LLM inference service → response with citations. Authentication, rate limiting, and audit logging are designed in from the start because the non-functional requirements demanded them.
- Capacity plan: an estimate of expected concurrent users translates into an estimate of GPU inference capacity needed, which becomes the number the infrastructure team plans hardware against.
What gets handed to the next role: a system design document, a model selection with justification, and a capacity/GPU sizing estimate. This is the artifact the data scientist and infrastructure teams both build from.
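The capacity plan is mostly arithmetic, and writing it down makes the assumptions visible to the infrastructure team. The following is a minimal sketch, not a sizing method from the scenario: every input (the peak active fraction, tokens per answer, latency budget, per-GPU throughput, and headroom) is a hypothetical placeholder that would need to be measured or agreed on for a real deployment.

```python
import math
from dataclasses import dataclass


@dataclass
class CapacityInputs:
    employees: int                 # total user population
    peak_active_fraction: float    # share of employees with a request in flight at peak (assumed)
    tokens_per_answer: int         # average generated tokens per response (assumed)
    target_answer_seconds: float   # latency budget for generation only (assumed)
    gpu_tokens_per_second: float   # aggregate decode throughput of one GPU (assumed, must be benchmarked)
    headroom: float = 0.3          # spare capacity on top of the raw estimate (assumed)


def gpus_needed(c: CapacityInputs) -> int:
    """Estimate how many GPUs are needed to sustain peak generation load."""
    concurrent_requests = c.employees * c.peak_active_fraction
    # Tokens per second the fleet must produce to keep every request inside the budget.
    required_tps = concurrent_requests * c.tokens_per_answer / c.target_answer_seconds
    raw_gpus = required_tps / c.gpu_tokens_per_second
    return max(1, math.ceil(raw_gpus * (1 + c.headroom)))


if __name__ == "__main__":
    inputs = CapacityInputs(
        employees=3000,
        peak_active_fraction=0.02,
        tokens_per_answer=300,
        target_answer_seconds=10,
        gpu_tokens_per_second=1500,
    )
    print(gpus_needed(inputs))
```

This estimate covers LLM generation only. Retrieval latency (embedding the question and searching the vector database) has to be budgeted separately, or the end-to-end latency target will be missed even when the GPU count is right.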
3. Data scientist: builds and validates the model behavior
The data scientist doesn’t build the platform — they own whether the model actually answers correctly, and that’s a genuinely different job from the architect’s:
- Data preparation: policy documents are cleaned, chunked, and prepared for embedding; sensitive documents are tagged with the access permissions that must be preserved when they’re retrieved later (this metadata becomes critical for the Kubernetes-hosted retrieval service downstream — losing it here is how RAG permission-leakage bugs get introduced).
- Embedding and retrieval tuning: choosing and evaluating the embedding model, tuning chunk size and retrieval parameters against a test set of real employee questions with known correct answers.
- Evaluation: building a held-out evaluation set and measuring answer accuracy, citation correctness, and hallucination rate before anything goes near production. This is also where fine-tuning (if needed at all) happens — often a light instruction-tuning pass so the model consistently cites sources in the expected format, run as a training job.
- The training/tuning job itself: this is the point where the data scientist hands a job specification — not infrastructure, a job — to the Slurm operator: “here is a training script, here is the dataset location, here is the expected GPU count and duration.”
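The data-preparation step above can be sketched as follows. This is a simplified illustration, assuming word-based chunks with a fixed overlap; the `Chunk` type, the `chunk_document` helper, and the default sizes are hypothetical, and a real pipeline would more likely split on tokens or document structure. The point it illustrates is that every chunk carries a copy of its source document's access permissions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    doc_id: str
    chunk_index: int
    text: str
    allowed_groups: frozenset  # copied from the source document, never dropped


def chunk_document(doc_id, text, allowed_groups, max_words=200, overlap=40):
    """Split a document into overlapping word windows, preserving permissions."""
    if overlap >= max_words:
        raise ValueError("overlap must be smaller than max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for index, start in enumerate(range(0, len(words), step)):
        window = words[start:start + max_words]
        chunks.append(Chunk(doc_id, index, " ".join(window), frozenset(allowed_groups)))
        # Stop once a window has reached the end of the document.
        if start + max_words >= len(words):
            break
    return chunks


if __name__ == "__main__":
    sample = " ".join(f"w{i}" for i in range(10))
    for chunk in chunk_document("hr-001", sample, {"hr"}, max_words=4, overlap=1):
        print(chunk.chunk_index, chunk.text, sorted(chunk.allowed_groups))
```

Because the permission set travels with each chunk, the retrieval service downstream can filter on it without looking the source document up again.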
What gets handed to the next role: a validated model (or confirmation that a base model needs no fine-tuning), an evaluation report, and — if fine-tuning was needed — a Slurm job submission for training compute.
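The three numbers in the evaluation report (answer accuracy, citation correctness, hallucination rate) can be computed from graded records along these lines. This is a sketch under stated assumptions: the `EvalRecord` fields are hypothetical, the correctness and unsupported-claim labels are assumed to come from a human or automated grader, and a citation is counted as correct only if the answer cites at least one document and nothing outside the expected set.

```python
from dataclasses import dataclass


@dataclass
class EvalRecord:
    question: str
    answer_correct: bool      # grader's verdict on the answer (assumed label)
    cited_doc_ids: set        # documents the model cited
    expected_doc_ids: set     # documents that actually contain the answer
    unsupported_claims: int   # grader's count of claims absent from the retrieved text


def summarize(records):
    """Aggregate per-question grades into the three headline metrics."""
    n = len(records)
    if n == 0:
        raise ValueError("evaluation set is empty")
    correct = sum(r.answer_correct for r in records)
    # A citation is correct if something was cited and every cited document is expected.
    citation_ok = sum(
        bool(r.cited_doc_ids) and r.cited_doc_ids <= r.expected_doc_ids for r in records
    )
    hallucinated = sum(r.unsupported_claims > 0 for r in records)
    return {
        "answer_accuracy": correct / n,
        "citation_correctness": citation_ok / n,
        "hallucination_rate": hallucinated / n,
    }


if __name__ == "__main__":
    records = [
        EvalRecord("q1", True, {"a"}, {"a"}, 0),
        EvalRecord("q2", True, {"a", "b"}, {"a"}, 1),
        EvalRecord("q3", True, set(), {"c"}, 0),
        EvalRecord("q4", False, {"d"}, {"d", "e"}, 0),
    ]
    print(summarize(records))
```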
4. Slurm operator: runs the training/fine-tuning job on the GPU cluster
The Slurm operator doesn’t touch the model’s accuracy — their job is making sure the compute is available, correctly allocated, and the job completes reliably:
- Receives the job specification from the data scientist and translates it into a Slurm batch script: partition selection, GPU count and topology (are the allocated nodes under the same InfiniBand leaf switch for fast synchronization?), wall-clock time limit, and priority/QOS assignment based on project.
- Submits the job, monitors it through DCGM and Prometheus for GPU health and utilization during the run, and watches for the failure modes that are specific to distributed training — a single node dropping out of the job, a storage path saturating during checkpoint writes, a network fabric degradation slowing the all-reduce step.
- On completion, the resulting model checkpoint is written to the storage tier the platform team designated, and the job’s resource usage is logged for cost/capacity accounting.
What gets handed to the next role: a trained/fine-tuned model artifact sitting in shared storage, plus the training run’s metadata (metrics, duration, resource usage) for the MLOps tracking system.
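The translation from job specification to batch script can be sketched as a small renderer. The `JobSpec` fields, the partition and QOS names, the paths, and the `--data` flag on the training script are hypothetical; the `#SBATCH` options used (`--job-name`, `--partition`, `--qos`, `--nodes`, `--gres`, `--time`, `--output`) are standard `sbatch` directives. How the training processes are launched under `srun` depends on the training framework and is left out here.

```python
from dataclasses import dataclass


@dataclass
class JobSpec:
    name: str
    script: str          # training script supplied by the data scientist
    dataset_path: str    # dataset location supplied by the data scientist
    nodes: int
    gpus_per_node: int
    hours: int           # requested wall-clock limit in whole hours
    partition: str = "gpu"   # hypothetical partition name
    qos: str = "normal"      # hypothetical QOS name


def render_sbatch(spec: JobSpec) -> str:
    """Render a job specification as the text of a Slurm batch script."""
    days, hours = divmod(spec.hours, 24)
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={spec.name}",
        f"#SBATCH --partition={spec.partition}",
        f"#SBATCH --qos={spec.qos}",
        f"#SBATCH --nodes={spec.nodes}",
        f"#SBATCH --gres=gpu:{spec.gpus_per_node}",
        f"#SBATCH --time={days}-{hours:02d}:00:00",   # days-hours:minutes:seconds
        f"#SBATCH --output={spec.name}-%j.out",       # %j expands to the job ID
        "",
        f"srun python {spec.script} --data {spec.dataset_path}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    spec = JobSpec("cite-tune", "train.py", "/shared/datasets/policy-sft", 2, 8, 30)
    print(render_sbatch(spec))
```

Keeping the specification as data and the script as generated output means the same record can be logged for the cost and capacity accounting mentioned above.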
5. Kubernetes operator: takes the model from artifact to running service
The Kubernetes operator’s job starts once there’s a model to serve — their concern is availability, scaling, and integration into the surrounding application, not training:
- Packages the model-serving component (using an inference server like vLLM, Triton, or NVIDIA NIM) into a container image, with GPU resource requests defined so that the Kubernetes scheduler, using the GPU capacity advertised by the device plugin, places it on GPU-enabled nodes.
- Deploys the retrieval service (vector database + embedding lookup), the LLM inference service, and the API gateway as separate Kubernetes services, wired together with the access-control metadata the data scientist tagged back in step 3. This is where the citation requirement from step 1, and the document permissions that come with it, actually get enforced at runtime.
- Configures autoscaling based on expected concurrent request load from the architect’s capacity plan, sets up health checks so a failed pod is automatically replaced, and wires the whole thing into the observability stack (Prometheus scraping request latency and GPU utilization, Grafana dashboards for the operations team).
- Runs load testing against the deployed service to confirm it holds up under the concurrent-user estimate from step 2, before it’s opened to real employees.
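As an illustration of the GPU resource request and health check described in this list, the sketch below builds a Deployment manifest as a Python dictionary and prints it as JSON, which `kubectl apply -f` accepts alongside YAML. The service name, image reference, port, and `/health` path are hypothetical; `nvidia.com/gpu` is the resource name exposed by the NVIDIA device plugin. Autoscaling and the Service object are omitted.

```python
import json


def inference_deployment(name, image, gpus_per_pod, replicas, port=8000):
    """Build a minimal Deployment manifest for a GPU-backed inference server."""
    container = {
        "name": name,
        "image": image,
        "ports": [{"containerPort": port}],
        # GPUs are requested through limits; the pod only fits nodes advertising this resource.
        "resources": {"limits": {"nvidia.com/gpu": gpus_per_pod}},
        # Hypothetical health endpoint; use whatever the chosen inference server exposes.
        "readinessProbe": {
            "httpGet": {"path": "/health", "port": port},
            "periodSeconds": 10,
        },
    }
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": {"app": name}},
            "template": {
                "metadata": {"labels": {"app": name}},
                "spec": {"containers": [container]},
            },
        },
    }


if __name__ == "__main__":
    manifest = inference_deployment(
        "policy-llm", "registry.example.internal/policy-llm:0.1", gpus_per_pod=1, replicas=2
    )
    print(json.dumps(manifest, indent=2))
```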
What gets handed to the next role: a running, autoscaling, monitored service — accessible via the API gateway — ready for the infrastructure team to plug into the company network.
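Runtime enforcement of document permissions comes down to filtering retrieved chunks against the caller's groups before any text reaches the prompt. The sketch below assumes the vector search has already returned scored candidates; the `IndexedChunk` type and the group names are hypothetical. In practice the same filter is usually pushed into the vector database query, so that restricted chunks are never returned at all.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexedChunk:
    doc_id: str
    text: str
    allowed_groups: frozenset   # permission metadata tagged during data preparation
    score: float                # similarity score from vector search (assumed precomputed)


def filter_by_permission(candidates, user_groups, top_k=3):
    """Keep only chunks the user may read, then return the best-scoring ones."""
    user_groups = frozenset(user_groups)
    # A chunk is visible if the user belongs to at least one of its allowed groups.
    visible = [c for c in candidates if c.allowed_groups & user_groups]
    visible.sort(key=lambda c: c.score, reverse=True)
    return visible[:top_k]


if __name__ == "__main__":
    candidates = [
        IndexedChunk("hr-001", "Leave policy ...", frozenset({"all-staff"}), 0.71),
        IndexedChunk("comp-007", "Investigation procedure ...", frozenset({"compliance"}), 0.93),
        IndexedChunk("it-012", "VPN guidelines ...", frozenset({"all-staff", "it"}), 0.64),
    ]
    for chunk in filter_by_permission(candidates, {"all-staff"}):
        print(chunk.doc_id, chunk.score)
```

Note that the highest-scoring chunk is dropped for a user outside the compliance group, which is the behaviour the permission tags exist to guarantee.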
6. Infrastructure engineer: connects the service to the rest of the enterprise
The infrastructure/network engineer’s job is making the running service reachable, secure, and integrated into the company’s existing systems, closing the loop back to the original business requirement:
- Configures network segmentation (firewall rules, micro-segmentation policy) so the service is reachable only from the intended internal network, consistent with the data-residency and security requirements from step 1.
- Integrates authentication with the company’s existing identity provider, so employees log in with their existing corporate credentials rather than a separate account.
- Sets up the audit logging pipeline that satisfies the “every response must be traceable” requirement — piping request logs, model version, and cited documents into the company’s existing log retention and compliance system.
- Confirms DR/backup coverage for the storage tier holding the document index and model checkpoints, and documents the runbook for the on-call rotation that will support the service in production.
What gets handed back: a production URL, an operational runbook, and — critically — confirmation that every constraint from the original business requirement in step 1 is actually satisfied, not just technically possible.
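One way to make "every response must be traceable" concrete is a structured audit record emitted for each answer. The sketch below is an assumption about what such a record could contain: the field names are hypothetical, and storing hashes of the question and answer rather than the full text is one possible design choice that the company's retention policy might or might not allow.

```python
import hashlib
import json
from datetime import datetime, timezone


def audit_record(request_id, user_id, question, answer, model_version, cited_doc_ids, now=None):
    """Serialize one response's audit trail as a single JSON log line."""
    now = now or datetime.now(timezone.utc)
    record = {
        "timestamp": now.isoformat(),
        "request_id": request_id,
        "user_id": user_id,
        "model_version": model_version,          # which model produced the answer
        "cited_doc_ids": sorted(cited_doc_ids),  # which documents it was grounded in
        # Hashes let auditors verify stored text without the log holding it (assumed policy).
        "question_sha256": hashlib.sha256(question.encode("utf-8")).hexdigest(),
        "answer_sha256": hashlib.sha256(answer.encode("utf-8")).hexdigest(),
    }
    return json.dumps(record, sort_keys=True)


if __name__ == "__main__":
    line = audit_record(
        "req-42", "u-1001", "How many leave days do I have?", "See HR-001, section 3.",
        model_version="policy-llm-0.1", cited_doc_ids={"hr-001"},
    )
    print(line)
```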
Why the handoffs matter more than any single role
Look back at where things actually break in practice: it’s rarely inside a single role’s work. It’s at the seams. A data scientist who doesn’t tag document permissions correctly creates a security bug that only surfaces when the Kubernetes operator wires up retrieval in step 5. A capacity estimate from the architect that doesn’t account for retrieval latency, not just LLM inference latency, causes the Kubernetes operator’s load test to fail in step 5 for a reason nobody anticipated in step 2. A Slurm operator who doesn’t communicate a training job’s actual GPU-hours back to the architect makes the next capacity plan wrong before the next project even starts.
The organizations that ship LLM systems reliably aren’t the ones with the single best data scientist or the single best Kubernetes operator — they’re the ones where each role understands what artifact the next role needs, and treats the handoff itself as part of the job, not an afterthought once “their part” is done.