
Build your own private LLM in 14 days

From "we want our own AI" to production traffic on Llama-3 70B — the exact playbook we've used for 28 customers. Hardware, fine-tuning, RAG, eval, deploy.

Anjali Sharma · Head of ML · 12 min read · 15 Apr 2026

Why "private LLM" usually fails

Most teams that try to build their own LLM blow through their timeline because they conflate three different projects: training a model, serving inference, and grounding answers in their data. Each has its own complexity. Each has its own way of going wrong. Bundle them and you'll be at month four with nothing in production.

Our 14-day plan separates them. By the end of week two, you'll have a working production endpoint your team can hit. The fine-tuning happens in parallel with the deployment work, not in series.

Days 1–2 · Discovery and architecture

Two questions decide everything that follows:

  1. What do you actually need the LLM to do? Be specific. "Customer support" is not a use case. "Resolve Tier-1 password reset tickets without escalation" is.
  2. What data is the model expected to know? Not "all our docs" — what subset will it pull context from at query time? PDFs? Tickets? Slack? A specific Confluence space?

Output of this phase: a one-page architecture document and a list of 30 evaluation prompts with acceptable answers. The eval prompts are non-negotiable. You can't ship what you can't measure.
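
One shape that works for each eval entry (the field names here are illustrative, not a fixed schema):

prompt: "How do I reset my SSO password without opening a ticket?"
expected: steps matching the current IT policy doc, with a link to the source page
must_refuse: false

Entries where the right behaviour is a refusal get must_refuse: true. They matter as much as the happy-path prompts.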

Days 3–4 · Hardware provisioning

For Llama-3 70B inference at 100+ QPS you need 4× A100 80GB minimum (we usually recommend 8× for headroom). For fine-tuning, double that. We provision this in under 4 hours on Glixy. AWS would take 2–4 weeks for the GPU quota alone.
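
The sizing arithmetic: 70B parameters at FP16 is roughly 70 × 2 = 140 GB of weights alone, so 4× A100 80GB (320 GB total) fits the model with room left for KV cache and activations. The 8× recommendation buys headroom for longer contexts and concurrent requests at that 100+ QPS target.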

Software stack baseline:
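
The exact components vary by customer, but a representative baseline (the specific choices below are illustrative, not prescriptive) looks like:

os: Ubuntu 22.04, recent NVIDIA driver + CUDA
serving: vLLM, tensor-parallel across the GPUs
weights: Llama-3 70B Instruct, FP16 (or AWQ-quantized if memory gets tight)
api: OpenAI-compatible gateway in front of the inference server
observability: request logging plus latency and token-throughput metrics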

Days 5–7 · Base model + RAG pipeline (in parallel)

While one team prepares fine-tuning data, another wires up the production-shape RAG pipeline using the off-the-shelf Llama-3 70B. This is the single most important sequencing call we make. It means you have a working system to test against by day 7, regardless of whether the fine-tune is ready.

RAG components:
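
The component choices depend on the sources picked on days 1–2, but the pipeline shape is consistent (the names below are illustrative):

ingestion: connectors for the chosen sources (PDFs, tickets, Confluence, Slack)
chunking: split documents into retrieval-sized passages, preserving source metadata
embeddings: an embedding model that indexes every passage in a vector database
retrieval: top-k similarity search against the incoming query
assembly: retrieved passages plus the query go into the prompt, each passage tagged with its source link so answers can cite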

Days 8–10 · Fine-tuning (LoRA / QLoRA)

For most use cases, you don't need a full fine-tune. LoRA on a few thousand domain examples gets you 80% of the value at 5% of the cost. Our default setup:

method: qlora
rank: 64
alpha: 128
target_modules: [q_proj, k_proj, v_proj, o_proj, up_proj, down_proj, gate_proj]
lr: 2e-4
epochs: 3
batch_size: 8  # per GPU; effective 64 across 8 GPUs

A 5,000-example fine-tune of Llama-3 70B finishes in 4–6 hours on 8× A100. Cost: roughly ₹14,000 of compute. The output is a few hundred MB of LoRA adapters that load on top of the base weights at inference.
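
A minimal sketch of that run with Hugging Face transformers, peft, and trl. Paths and the dataset name are placeholders, and trl's API moves between versions, so treat this as the shape of the job rather than a pinned script:

import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from trl import SFTConfig, SFTTrainer

# 4-bit NF4 quantization: the 70B base weights shrink to roughly 35 GB
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-70B-Instruct",
    quantization_config=bnb,
    device_map="auto",
)

# mirrors the config above: rank 64, alpha 128, all seven projection modules
lora = LoraConfig(
    r=64,
    lora_alpha=128,
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "up_proj", "down_proj", "gate_proj"],
)

trainer = SFTTrainer(
    model=model,
    peft_config=lora,
    # train.jsonl: one {"text": "..."} example per line
    train_dataset=load_dataset("json", data_files="train.jsonl")["train"],
    args=SFTConfig(
        output_dir="adapters/",
        num_train_epochs=3,
        learning_rate=2e-4,
        per_device_train_batch_size=8,  # effective 64 when launched data-parallel on 8 GPUs
    ),
)
trainer.train()
trainer.save_model("adapters/")  # writes only the LoRA adapters, not the 70B base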

Days 11–12 · Eval, eval, eval

Now run those 30 eval prompts you wrote on day 1. Add another 50 your team has accumulated by now. Run each one through the full RAG pipeline twice: once against the base Llama-3 70B and once against the fine-tuned adapters.

Score each on factual accuracy (vs. expected answer), citation quality (does it ground claims?), refusal correctness (does it say "I don't know" when it should?), and latency. Track in a spreadsheet. The improvements over the base model should be obvious. If they're not, your fine-tune data is wrong, not the method.
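
A sketch of the harness: it captures answers and latency automatically, while accuracy, citation quality, and refusal correctness stay human-scored against the expected answers (the endpoint URL is a placeholder):

import csv, time
from openai import OpenAI

client = OpenAI(base_url="https://llm.internal.example.com/v1", api_key="unused")

def run_eval(prompts, model_name, out_csv):
    # one row per prompt: the answer plus measured latency;
    # the human-scored columns get filled in afterwards
    with open(out_csv, "w", newline="") as f:
        w = csv.writer(f)
        w.writerow(["prompt", "answer", "latency_s",
                    "accuracy", "citations", "refusal"])
        for p in prompts:
            t0 = time.time()
            r = client.chat.completions.create(
                model=model_name,
                messages=[{"role": "user", "content": p}],
            )
            w.writerow([p, r.choices[0].message.content,
                        round(time.time() - t0, 2), "", "", ""])

Run it twice, once with the base model name and once with the fine-tuned one, then diff the two CSVs in your spreadsheet.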

Days 13–14 · Production deploy

Wrap the inference stack in an OpenAI-compatible REST API. This is critical: it lets every existing tool (LangChain, LlamaIndex, your own apps) drop in the new endpoint with one line of config.
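
Concretely, the switch from any OpenAI-compatible client looks like this (the URL and model name are placeholders for your deployment):

from openai import OpenAI

# before: client = OpenAI()   (points at api.openai.com)
# after: same client, private endpoint
client = OpenAI(
    base_url="https://llm.internal.example.com/v1",
    api_key="unused",  # many self-hosted gateways ignore the key
)
resp = client.chat.completions.create(
    model="llama-3-70b-instruct",  # whatever name your server registers
    messages=[{"role": "user", "content": "How do I reset my password?"}],
)
print(resp.choices[0].message.content)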

What week 3 looks like

Yes, you're in production by day 14. But the real work is the next 90 days: drift monitoring, prompt versioning, A/B testing new fine-tunes against production traffic, building a synthetic-data pipeline so retraining gets cheaper. We help with all of it on Growth and Enterprise plans.

The two things people skip and regret

  1. Eval discipline. Without a fixed eval set you cannot tell if a change made things better or worse. Build it on day 1 and never touch it.
  2. Cite everything. Every answer with a source link. The day a customer claims your model "made up" a policy detail, you'll need to point at the exact PDF page that disagrees.

🚀 Want us to run this for you? →

Related: RAG architecture deep dive · Our LLM service