Home / Knowledge / Run Gemma Locally

Hardware guide

How to Run Google Gemma Locally & Offline

Google Gemma is Google's family of open-weight large language models — downloadable model files spanning compact ~1B variants to 27B-class models — and, alongside NVIDIA Nemotron, one of the two families AIOD deploys in fully offline appliances. Gemma's particular strengths: unusually small capable variants, broad multilingual coverage, vision-capable versions, and the most mature local-deployment ecosystem of any open family.

The family at a glance

Variants evolve, so treat this as the durable shape of the line-up and verify the current model cards before ordering hardware:

ClassScaleNatural habitatUse it for
Gemma compact~1–4B paramsUltra-low-power edge · embeddedCompanion tasks, tight battery budgets, always-on assistants
Gemma mid~12B paramsJetson-class · single GPUGeneral assistants with retrieval; multilingual field units
Gemma large~27B paramsHigh-end single GPU · nodeStrongest reasoning in the family; team-scale serving

Two family traits matter for offline work. First, multilingual reach: Gemma models are trained across a very wide language base, which is decisive for humanitarian and civic resilience deployments serving mixed-language communities. Second, recent generations include vision-capable variants — useful when the mission involves reading a rating plate, a wound, or a fault display, not just text.

VRAM: the sizing maths

The same planning rule as any open model — parameters × bytes per parameter sets the floor, plus 10–30% headroom for context and retrieval:

Model sizeFP168-bit4-bit
~4B parameters~8 GB~4 GB~2.5 GB
~12B parameters~24 GB~12 GB~7–8 GB
~27B parameters~54 GB~27 GB~15 GB

The headline: a well-quantised Gemma 27B fits a single high-end consumer GPU, and the 4B variant runs in the kind of memory budget that leaves room for everything else on a Jetson. Long context windows still cost extra — budget for them if your retrieval chunks are large. Or skip the arithmetic and use the hardware sizing calculator.

The software stack

  • Ollama / llama.cpp (GGUF) — Gemma's quantised builds are first-class citizens here; this is the fastest path to a working offline assistant and our default on edge units.
  • vLLM — high-throughput serving for node and multi-user deployments.
  • NVIDIA-optimised paths — Gemma runs well across NVIDIA silicon from Jetson to RTX and DGX-class, which keeps our single-hardware-platform discipline intact whichever family the mission selects.
  • Retrieval — as ever, embedding model and vector index live on-device; a RAG stack that calls an embedding API is not offline.

The licence, honestly

Gemma ships under Google's Gemma terms of use: weights are freely downloadable and commercial use is permitted, subject to a prohibited-use policy. It is open-weight rather than OSI open-source — a distinction that matters to purists and almost never to deployments. What matters for the air gap is simple and satisfied: the weights are files, you may run them, and they live on your machine forever. We document the licence position in every handover pack.

Gemma or Nemotron?

The honest selection logic we use, mission by mission:

  • Pick Gemma when the corpus or user base is multilingual, the power budget is extreme (1–4B class), vision input earns its keep, or you want the deepest quantised-community ecosystem behind you.
  • Pick Nemotron when raw performance-per-watt on NVIDIA silicon is the binding constraint, or the deployment sits on TensorRT-LLM-optimised DGX-class hardware where its tuning shows.
  • Either way — both are open-weight, both run air-gapped, both are yours at handover. The choice is engineering, not allegiance; the full comparison lives on the technology page.
The model is a component. The mission picks the component — never the other way around.

For the end-to-end picture — security envelope, sustainment, corpus design — continue with The Complete Guide to Offline LLM Deployment.

FAQ

Gemma, asked directly.

What is Google Gemma?

Google's family of open-weight large language models — compact ~1B variants to 27B-class models — with strong multilingual coverage and the most mature local-deployment ecosystem of any open family.

Can Gemma run completely offline?

Yes. The weights are downloadable files; with llama.cpp, Ollama or vLLM on local hardware, the model runs with no internet connection at all — which is how AIOD ships it in air-gapped appliances.

How much VRAM does Gemma 27B need?

Roughly 54 GB at FP16, ~27 GB at 8-bit, and ~15 GB at 4-bit before context overhead — so a well-quantised 27B fits a single high-end GPU. Verify against the current model card.

Family-agnostic, mission-led

Tell us the mission. We'll pick the model.

Gemma, Nemotron, or both behind one retrieval layer — sized, benchmarked and sealed into an appliance you own.

DEPLOY@AIOD.APP →