About the course
The era of "just call the GPT-4 API" is evolving. For enterprises concerned with latency, cost, privacy, and edge deployment, Small Language Models (SLMs) are the new frontier. This intensive workshop moves beyond the cloud to show experienced engineers how to deploy, fine-tune, and optimize compact models that rival their giant counterparts in specialized tasks.
We focus on the "Sovereign AI" stack: learning how to run models like Phi-3, Mistral, and Gemma on local hardware, in-browser (WebGPU), and on edge devices. This course is about precision engineering: getting 90% of the performance of a 175B parameter model using only 3B-7B parameters.
Instructor-led online and in-house face-to-face options are available - as part of a wider customised training programme, or as a standalone workshop, on-site at your offices or at one of many flexible meeting spaces in the UK and around the world.
-
- Model Selection & Quantization: Understand the trade-offs between 4-bit, 8-bit, and FP16 precision.
- On-Device Deployment: Master llama.cpp, Ollama, and WebLLM for local and browser-based execution.
- PEFT & LoRA Fine-Tuning: Learn to specialize an SLM for a specific domain (e.g., medical, legal, or codebase-specific) using minimal hardware.
- SLM-Agentic Workflows: Use SLMs as high-speed "sub-agents" for specific tasks like routing, classification, and summarization.
- Privacy-First Architecture: Building "Local-First" AI systems that never send data to a third-party cloud.
-
This hands-on workshop is aimed mainly at:
- Senior Software Engineers looking to reduce API dependency and costs.
- Mobile & IoT Developers wanting to bring "intelligence to the edge."
- DevOps & Platform Engineers tasked with self-hosting private AI infrastructure.
-
Workshop attendees should ideally have:
- Strong proficiency in Python and CLI environments.
- Basic familiarity with PyTorch or Hugging Face is helpful but not required.
- A laptop with at least 16GB of RAM (for the local inference labs).
We can customise the training to match your team's experience and needs - for instance, with more time spent on fundamentals for less experienced data professionals. Get in touch to find out how.
-
This SLM course is available for private / custom delivery for your team - as an in-house face-to-face workshop at your location of choice, or as online instructor-led training via MS Teams (or your own preferred platform).
Get in touch to find out how we can deliver tailored training which focuses on your project requirements and learning goals.
-
The Case for Small
The "Small" Landscape: Comparative analysis of Phi-3 (Microsoft), Mistral-7B, and Gemma (Google).
SLMs vs. LLMs: When is a 3B model actually better than a 1T model?
Benchmarks for Developers: Moving beyond MMLU to real-world latency and throughput metrics.
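To make "real-world latency and throughput metrics" concrete, the sketch below times a stand-in `generate` callable and reports best-case latency and peak throughput. The dummy model and its 128-token output are illustrative assumptions; in a lab, `generate` would wrap a real local-model call.

```python
import time

def measure(generate, prompt, n_runs=3):
    """Time a generate() callable over several runs and report
    best-case latency (seconds) and peak throughput (tokens/sec)."""
    results = []
    for _ in range(n_runs):
        start = time.perf_counter()
        tokens = generate(prompt)
        elapsed = max(time.perf_counter() - start, 1e-9)  # guard against zero
        results.append((elapsed, len(tokens) / elapsed))
    latency = min(r[0] for r in results)      # best-case latency
    tps = max(r[1] for r in results)          # peak tokens per second
    return latency, tps

# Dummy stand-in model: pretends to emit 128 tokens instantly.
def dummy_generate(prompt):
    return ["tok"] * 128

latency, tps = measure(dummy_generate, "Explain GGUF in one line.")
print(f"latency={latency:.4f}s throughput={tps:.0f} tok/s")
```

Reporting the minimum latency and maximum throughput across runs is a common convention for filtering out warm-up and OS scheduling noise.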
Optimization & Quantization
Quantization 101: How GGUF and AWQ formats allow high-performance AI to run on a laptop.
The "Distillation" Pattern: Using a "Teacher" model (GPT-4) to train a "Student" model (Phi-3).
Hardware Acceleration: Utilizing Metal (Mac), CUDA (Nvidia), and WebGPU for zero-lag inference.
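How quantization makes a laptop viable is easy to estimate with back-of-the-envelope arithmetic: weight memory is roughly parameter count times bits per weight. The flat 20% overhead for KV-cache and activations below is an assumed rule of thumb and varies in practice.

```python
def model_memory_gb(n_params_billion, bits_per_weight, overhead=1.2):
    """Rough memory estimate: params x bits-per-weight, plus an
    assumed ~20% overhead for KV-cache and activations."""
    weight_bytes = n_params_billion * 1e9 * bits_per_weight / 8
    return weight_bytes * overhead / 1e9

# A 7B model at three common precisions:
for label, bits in [("FP16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"7B model @ {label}: ~{model_memory_gb(7, bits):.1f} GB")
```

The 4x drop from FP16 to 4-bit is what moves a 7B model from "needs a data-center GPU" to "fits comfortably in 16GB of laptop RAM".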
Fine-Tuning on a Budget
Parameter-Efficient Fine-Tuning (PEFT): Mastering LoRA (Low-Rank Adaptation) and QLoRA.
Domain Specialization: Hands-on lab: Fine-tuning an SLM to understand your company's specific API documentation or coding style.
Dataset Preparation: Curating high-quality "synthetic" data for SLM training.
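Why LoRA fits on minimal hardware comes down to parameter counts: instead of updating a full d_in x d_out weight matrix, LoRA trains two low-rank factors. The sketch below compares the two for a single linear layer; the 4096 hidden size and rank 8 are typical but assumed values.

```python
def lora_params(d_in, d_out, rank):
    """Trainable parameters for a LoRA adapter on one d_in x d_out
    linear layer: factor A (rank x d_in) plus factor B (d_out x rank)."""
    return rank * d_in + d_out * rank

d = 4096                         # hidden size, typical of a 7B model
full = d * d                     # full fine-tune: every weight in the layer
adapter = lora_params(d, d, rank=8)
print(f"full: {full:,}  lora(r=8): {adapter:,}  ratio: {full // adapter}x")
```

At rank 8 the adapter is 256x smaller than the layer it adapts, which is why LoRA (and its quantized variant QLoRA) brings fine-tuning within reach of a single consumer GPU.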
Edge & Browser Integration
AI in the Browser: Using Transformers.js and WebLLM to run models entirely on the client side.
Mobile Deployment: Strategies for iOS/Android integration with ONNX and CoreML.
Privacy-First RAG: Implementing a full Retrieval-Augmented Generation stack that stays 100% local.
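A minimal sketch of the local-first retrieval step: nothing leaves the machine. The bag-of-words "embedding" here is a toy stand-in for a real local embedding model, and the document texts are invented for illustration.

```python
import math
from collections import Counter

def embed(text):
    # Toy "embedding": bag-of-words counts. A real local-first stack
    # would substitute a locally-run embedding model here.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

docs = [
    "GGUF is a file format used by llama.cpp for quantized models.",
    "LoRA adds small low-rank adapters so fine-tuning fits on one GPU.",
    "WebGPU lets the browser run inference on the local GPU.",
]

def retrieve(query, k=1):
    """Return the k documents most similar to the query."""
    q = embed(query)
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

context = retrieve("Which file format does llama.cpp use for quantized models?")[0]
print(context)
```

The retrieved context is then prepended to the user's question and sent to a local SLM, so both the documents and the query stay on-device end to end.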
Orchestration with SLMs
The "Small Agent" Pattern: Using SLMs as fast, cheap routers in a multi-model system.
Speculative Decoding: Using an SLM to "draft" text for a larger model to verify (speeding up inference by 2-3x).
Closing Lab: Building a private, local-first coding assistant using an SLM.
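The routing half of the small-agent pattern can be sketched as a lookup from task type to the cheapest capable model. The model names and routing rules below are illustrative assumptions; in practice the router itself is often a small classifier SLM.

```python
# Route each request to the cheapest model that can handle it.
ROUTES = {
    "classify": "phi-3-mini",    # fast local SLM for narrow tasks
    "summarize": "phi-3-mini",
    "route": "phi-3-mini",
    "default": "mistral-7b",     # larger local model for open-ended work
}

def route(task_type):
    """Pick a model for the task, falling back to the default."""
    return ROUTES.get(task_type, ROUTES["default"])

for task in ("classify", "summarize", "write a design doc"):
    print(task, "->", route(task))
```

The same dispatch structure extends naturally to speculative decoding, where the small model drafts candidate tokens and the large model only verifies them.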
-
Ollama: The industry standard for running, managing, and customizing SLMs such as Mistral and Llama-3 locally (via Modelfiles) on macOS, Linux, and Windows.
Hugging Face Models: The central repository for discovering specialized small models (filter the model search by size), including Microsoft’s Phi-3, Google’s Gemma, and Apple’s OpenELM.
llama.cpp: The foundational C/C++ implementation for high-performance LLM inference that enables "local-first" AI across almost any hardware.
MLX (Apple Silicon): Apple’s specialized machine learning framework designed specifically for high-efficiency training and inference on M-series chips.
Transformers.js: A library for running state-of-the-art machine learning models directly in the browser, enabling 100% private, client-side AI.
LM Studio: A GUI-based tool for discovering, downloading, and running local LLMs, offering a built-in local server that mimics the OpenAI API structure.
Unsloth: A high-speed fine-tuning library that makes training SLMs such as Mistral and Llama roughly 2–5x faster while using up to 70% less memory.
WebLLM: A high-performance in-browser LLM inference engine that utilizes WebGPU for hardware-accelerated local execution.
vLLM: A fast and easy-to-use library for LLM serving and deployment, optimized for high-throughput enterprise environments.
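As an example of the kind of customization Ollama supports, a minimal Modelfile might look like the following (the base model, parameter value, and system prompt are illustrative):

```
FROM mistral
PARAMETER temperature 0.2
SYSTEM You are a concise assistant that answers questions about our internal API documentation.
```

Building it with `ollama create docs-assistant -f Modelfile` produces a named local model that any OpenAI-compatible client on the machine can call.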