Introduction to LLM concepts

In this post, I'll cover various LLM concepts and the questions I asked myself while diving deep into the world of LLMs. I expect to update this post as I continue to learn more about LLMs.
General LLM Concepts and Questions
This section describes general concepts and questions around Large Language Models.
What's a Large Language Model (LLM)?
A Large Language Model (LLM) is an AI system trained on vast amounts of text data to understand and generate human-like language. Using deep learning, specifically transformer architectures, LLMs can answer questions, summarize text, generate content, and more. They predict the most likely next word based on context.
What's the difference between a Base and an Instruct model?
A base model predicts the next word. An instruct model has been fine-tuned to predict the next word in a conversation between a user and a helpful assistant, specifically to follow instructions. There are also chat models, fine-tuned for conversational formats.
You can prompt a base model to follow instructions with proper formatting, but instruct models are trained to do this out-of-the-box.
Usually, base models are used as a foundation for further fine-tuning.
To make a base model act like an instruct model, you can format prompts using chat templates. Learn more about that here. You can also view an example of a chat template in the tokenizer_config.json here.
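As a minimal sketch of applying a chat template with the Hugging Face transformers library (the model name is just an example):

```python
from transformers import AutoTokenizer

# Example instruct model; any model whose tokenizer ships a chat template works.
tok = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-135M-Instruct")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is an LLM?"},
]

# Renders the conversation into the exact text format the model was trained on.
prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)
```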
| Aspect | Base Models | Instruct Models |
|---|---|---|
| Training Objective | General language modeling. | Fine-tuned for instruction following. |
| Focus | Broad and generic text understanding. | Task-specific, user-centric interactions. |
| Performance | Literal, unrefined responses. | Context-aware, tailored responses. |
| Safety and Alignment | Lower alignment with human values. | Higher alignment due to RLHF. |
| Examples of Usage | Pre-training for fine-tuning tasks. | User-facing applications like chatbots. |
RLHF → Reinforcement Learning from Human Feedback
- Base Models: Useful as a starting point for further fine-tuning for specific tasks.
- Instruct Models: Ideal for deployment in interactive environments where users provide natural-language instructions.
You may want to use a base model when training from scratch or when focusing on text/code continuation tasks.
What are Parameters in an LLM?
Parameters are the numerical values in the model that determine how it processes and generates text. They are the trainable weights and biases adjusted during training to learn patterns from data.
Teacher vs Critic vs Judge Model
Teacher Model
A teacher model is a large, high-performing model used to train a smaller student model. This process, called knowledge distillation, involves the teacher generating outputs (soft labels) on data to help the student learn nuanced patterns. Teacher models can also generate synthetic data based on curated datasets.
Critic Model
A critic model evaluates the outputs of another model, focusing on quality, coherence, relevance, or accuracy. It's often used during training to provide feedback on outputs from the teacher model.
Judge Model
The judge model ranks or scores outputs from different models or generations. It acts as an arbiter to select the best response based on predefined criteria, often using human preference data for training.
LLM as a Judge
LLMs can be used as judges to evaluate outputs in a human-like way without needing manual review. You can read more about this in this Hugging Face cookbook.
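As a rough sketch of the idea (the prompt wording and scoring scale are my own illustration, not a standard), an LLM judge boils down to a grading prompt plus some output parsing:

```python
import re

def build_judge_prompt(question: str, answer: str) -> str:
    # Illustrative rubric; real setups tune the criteria and the scale.
    return (
        "You are an impartial judge. Rate the answer below for accuracy "
        "and relevance on a scale of 1 to 5.\n"
        f"Question: {question}\n"
        f"Answer: {answer}\n"
        "Reply with exactly one line: Score: <1-5>"
    )

def parse_score(judge_output: str) -> int | None:
    """Extract the numeric score from the judge model's reply."""
    match = re.search(r"Score:\s*([1-5])", judge_output)
    return int(match.group(1)) if match else None
```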
Critic and Judge Collaboration
In advanced setups:
- Critic provides iterative feedback to improve outputs.
- Judge selects the best version from improved outputs.
This setup is ideal for high-stakes use cases like medical, legal, or technical content generation.
RAG vs Fine Tuning
RAG (Retrieval-Augmented Generation) connects a model to external databases to retrieve relevant content at runtime. Fine-tuning adjusts internal model weights using domain-specific training data. Both aim to improve model performance for specific tasks.
RAG augments prompts with dynamic data; fine-tuning bakes knowledge directly into the model.
Most modern fine-tuning uses PEFT (Parameter-Efficient Fine-Tuning), which adjusts only a small subset of the model's parameters, reducing training costs.
| Aspect | RAG (Retrieval-Augmented Generation) | Fine-Tuning |
|---|---|---|
| Knowledge Source | External, dynamically retrieved | Internal, baked into model’s weights |
| Adaptation Method | Combines retrieval and generation | Updates model weights with training data |
| Scalability | Scales with growing knowledge bases | Requires retraining for new domains |
| Flexibility | Adapts in real-time | Static after training |
| Cost and Time | No retraining cost, but needs infrastructure | Computationally expensive and slower |
| Performance | Dependent on retrieval quality | Highly accurate if trained well |
| Use Cases | Open-domain, dynamic data | Fixed-domain, precision tasks |
Evaluating Models
To evaluate LLM performance, benchmarks are used. Two common ones:
- MMLU: Massive Multitask Language Understanding — tests across various subjects.
- MATH: Evaluates mathematical problem-solving skills.
What is Quantization?
Quantization compresses a model by reducing the precision of its weights and activations. For instance, reducing from 32-bit floats (FP32) to 8-bit integers (INT8). This lowers memory usage and speeds up inference, often with minimal accuracy loss.
Example: a 2B-parameter model in FP16 needs roughly 4GB just for its weights:
(2,000,000,000 × 2 bytes) / 1024³ ≈ 3.7GB
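As a quick sanity check, here's a small helper (my own illustration) that reproduces this arithmetic for a few precisions:

```python
def model_memory_gb(n_params: int, bytes_per_param: float) -> float:
    """Approximate memory needed just to hold the model weights."""
    return n_params * bytes_per_param / 1024**3

print(round(model_memory_gb(2_000_000_000, 4), 2))  # FP32: ~7.45 GB
print(round(model_memory_gb(2_000_000_000, 2), 2))  # FP16: ~3.73 GB
print(round(model_memory_gb(2_000_000_000, 1), 2))  # INT8: ~1.86 GB
```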
More info: Guide to Quantization
What is a Distilled Model?
A distilled model is a smaller, faster version of a larger model, created through knowledge distillation. It aims to retain most of the performance while reducing size and compute needs.
How Knowledge Distillation Works
- A large "teacher" model makes predictions on a dataset.
- A smaller "student" model learns to mimic the teacher’s outputs.
- Training involves both original data and the teacher’s predictions.
The result: similar performance, lower cost — great for real-world use.
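To make the mechanics concrete, here's a minimal PyTorch sketch of the classic distillation loss (temperature and weighting are illustrative hyperparameters):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """Blend the hard-label loss with a soft-label term from the teacher."""
    # Soft targets: match the teacher's softened output distribution.
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)
    # Hard targets: standard cross-entropy on the original labels.
    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1 - alpha) * hard_loss
```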
What are Tokens in the Context of LLMs?
Tokens are the smallest units a model understands, and they aren't always full words. For example, “playful” may be split into the tokens “play” and “ful”. LLMs often have vocabularies of ~32K tokens (e.g. LLaMA 2), far fewer than the ~600K words of English.
Tokenization breaks words into parts. Example: “played” → “play” + “ed”.
Check out this tokenizer in action: Hugging Face Playground
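You can also inspect tokenization locally with the transformers library; how a given word splits depends on the tokenizer's learned vocabulary:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # example tokenizer

# Splits depend on the learned vocabulary, not on English morphology.
print(tok.tokenize("playful"))
print(tok.tokenize("unplayable"))
print(len(tok))  # vocabulary size (~50K for GPT-2)
```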
LLMs use special tokens for structure — the most important is the EOS (End Of Sequence) token. For example: <|im_end|>
You can view these in tokenizer_config.json: SmolLM2 Example
LLMs continue generating text until the EOS token is predicted.
Explore decoding processes: Basic Decoding | Beam Search
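As a sketch of the basic (greedy) decoding loop described in those links, using GPT-2 purely for illustration:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

ids = tok("The capital of France is", return_tensors="pt").input_ids
with torch.no_grad():
    for _ in range(20):  # cap on the number of generated tokens
        logits = model(ids).logits
        next_id = logits[0, -1].argmax()  # greedy: pick the most likely token
        ids = torch.cat([ids, next_id.view(1, 1)], dim=1)
        if next_id.item() == tok.eos_token_id:  # stop once EOS is predicted
            break
print(tok.decode(ids[0]))
```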
What is the Context Window / Context Length / Model Max Length?
The context window is the maximum number of tokens a model can handle in one go — including both input and generated output.
Example: With a 2048-token window, if the prompt uses 1000 tokens, only 1048 remain for the model’s response.
LLMs don’t have memory. In chats, “memory” is simulated by resending previous messages in the context window each time.
Larger context = more memory required.
You can find context limits in:
- tokenizer_config.json → model_max_length
- config.json → max_position_embeddings
If unsure, check: LLM Explorer
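You can also read both values programmatically (the model name is just an example):

```python
from transformers import AutoConfig, AutoTokenizer

model_id = "HuggingFaceTB/SmolLM2-135M-Instruct"  # example model
config = AutoConfig.from_pretrained(model_id)
tok = AutoTokenizer.from_pretrained(model_id)

print(config.max_position_embeddings)  # from config.json
print(tok.model_max_length)            # from tokenizer_config.json
```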
What is the Embedding Size?
Embedding size = number of dimensions used to represent each token numerically. Example: Embedding size 768 means each token is a 768D vector.
Bigger embedding = better nuance, more compute.
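A toy PyTorch example of what that means in practice:

```python
import torch
import torch.nn as nn

# 32,000-token vocabulary, each token mapped to a 768-dimensional vector.
embedding = nn.Embedding(num_embeddings=32_000, embedding_dim=768)

token_ids = torch.tensor([[15, 2047, 9]])  # one 3-token sequence
vectors = embedding(token_ids)
print(vectors.shape)  # torch.Size([1, 3, 768])
```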
What are Weights and Activations?
Weights: Trainable values that control connections between neurons. They “learn” language patterns during training.
Activations: The output of a neuron after applying an activation function (e.g., ReLU, Sigmoid). They introduce non-linearity to help models understand complex relationships.
Together, they power everything from translation to text generation.
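A toy example of a single neuron makes the distinction clear (the values are arbitrary):

```python
import torch
import torch.nn.functional as F

x = torch.randn(4)                      # input features
w = torch.randn(4, requires_grad=True)  # weights: trainable
b = torch.zeros(1, requires_grad=True)  # bias: trainable

pre_activation = x @ w + b              # weighted sum of inputs
activation = F.relu(pre_activation)     # activation: non-linear output
print(activation)
```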
Safetensors vs GGUF Models
These are two common formats for LLMs:
- Safetensors: a safe, fast format for storing model weights; standard in the Hugging Face ecosystem and well suited for fine-tuning and exploration.
- GGUF: a binary format optimized for fast loading and inference (used by llama.cpp), often with quantized weights; ideal for deployment and sharing.
Required vRAM for LLM Models
Estimate vRAM like this:
Number of Parameters × Precision (in bytes) × 1.2
Example for 2B parameters in FP32:
(2,000,000,000 × 4 × 1.2) / 1024³ = ~8.9GB
This assumes:
– Max sequence length: 512
– Batch size: 8
– Single GPU
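In code, that rule of thumb looks like this (a rough estimate only; real usage also depends on optimizer state, sequence length, and batch size):

```python
def vram_estimate_gb(n_params: int, bytes_per_param: int,
                     overhead: float = 1.2) -> float:
    """Rule-of-thumb vRAM estimate: weights x precision x ~20% overhead."""
    return n_params * bytes_per_param * overhead / 1024**3

print(round(vram_estimate_gb(2_000_000_000, 4), 1))  # FP32: ~8.9 GB
print(round(vram_estimate_gb(2_000_000_000, 2), 1))  # FP16: ~4.5 GB
```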
Some models list their vRAM needs: Mistral Sizes
Note: Inference uses less memory than training.
You can also try online tools to check hardware compatibility.
LLM Model Naming
There’s no strict convention, but here’s a common pattern:
codellama-70b-python.Q4_K_S.gguf
- 70B → 70 billion parameters
- Q4 → Quantization level
- _K → Weight rounding method
- _S → Size category (S, M, L)
What does a name like 8x7B mean?
It denotes a mixture-of-experts (MoE) model with 8 expert subnetworks of about 7B parameters each (e.g. Mixtral 8x7B). Because the experts share some layers, the total parameter count is lower than 8 × 7B; Mixtral 8x7B has roughly 47B parameters.
Padding in LLM Models
Padding adds dummy tokens (like <PAD>) to make input sequences the same length for batch processing.
Example:
["Hello", "<PAD>", "<PAD>"] ["How", "are", "you?"] ["Good", "morning", "<PAD>"]
Why it matters:
- Batch Processing: Needed for parallelization
- Fixed-Length Input: Ensures input fits expected tensor shape
- Efficient GPU Use: Aligns data for better performance
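Here's how padding looks with a Hugging Face tokenizer (GPT-2 is used as an example; it ships without a pad token, so one is assigned):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
tok.pad_token = tok.eos_token  # GPT-2 has no pad token by default

batch = tok(["Hello", "How are you?", "Good morning"],
            padding=True, return_tensors="pt")
print(batch["input_ids"])       # shorter sequences are padded to the longest
print(batch["attention_mask"])  # 0 marks padding positions the model ignores
```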
Padding-Free LLMs
These models dynamically handle variable-length sequences.
- Efficiency: No wasted compute on pad tokens
- Memory: Saves space
- Scalability: Great for real-time use cases
Prompts
Prompts are the inputs that guide an LLM’s output. They can be simple or complex, and they set the tone, style, and task.
System Prompt
A hidden instruction to shape model behavior. You can define your own when self-hosting. Some, like Claude, publish theirs: Anthropic’s system prompts
Chain of Thought (CoT)
Prompts that encourage models to think step-by-step. Improves reasoning for complex problems.
No CoT vs Zero-shot CoT vs Manual CoT
- No CoT: Just the question — no reasoning steps
- Zero-shot CoT: Add “Think step by step” to the prompt
- Manual CoT: Explicit step-by-step examples in the prompt
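A concrete illustration of the three styles (the question and wording are my own examples):

```python
question = "If a train travels 120 km in 90 minutes, what is its average speed in km/h?"

# No CoT: just the question.
no_cot = question

# Zero-shot CoT: append a step-by-step trigger phrase.
zero_shot_cot = question + "\nLet's think step by step."

# Manual CoT: show a worked example before the real question.
manual_cot = (
    "Q: If a car travels 60 km in 30 minutes, what is its average speed in km/h?\n"
    "A: 30 minutes is 0.5 hours. 60 km / 0.5 h = 120 km/h. The answer is 120.\n"
    f"Q: {question}\n"
    "A:"
)
```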
LLM Agents
LLM agents use models + tools (like APIs or databases) to reason and act. They're great for automation.
Example use case: An agent monitors logs and fixes issues, or opens support cases based on analysis.
No-code tools: n8n.io
Pro-code tools: LangGraph
Learn more: Hugging Face Agents Course
Model Context Protocol (MCP)
An open standard for connecting LLMs to data securely. Apps can be clients or servers.
Learn more: Anthropic Announcement | Official Docs
LLM Training and Fine-Tuning Concepts
Synthetic Data Generation (SDG)
Generate training data using a teacher model. InstructLab’s SDG 1.5 adds a Critic model that revises and filters before finalizing the data.
More info: InstructLab Blog
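Conceptually, the pipeline looks like this sketch (the function names are hypothetical, not InstructLab's actual API):

```python
def synthetic_data_pipeline(seed_examples, teacher_generate, critic_review):
    """Teacher drafts new examples from seeds; critic revises or rejects them."""
    candidates = [teacher_generate(seed) for seed in seed_examples]
    reviewed = [critic_review(c) for c in candidates]  # revise / filter
    return [r for r in reviewed if r is not None]      # keep only approved data
```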
Overfitting and Underfitting
Overfitting: Model memorizes training data — poor generalization.
Underfitting: Model too simple — can’t learn from training data either.
Required Number of Epochs
Epoch = one pass through the training data. 3–6 is common for fine-tuning. Example: InstructGPT was trained for 16 epochs.
Source: InstructGPT Training Details
Effective Batch Size
The total number of training examples processed before the model's weights are updated. It's the product of the per-GPU batch size, the number of gradient accumulation steps, and the number of GPUs.
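For example:

```python
per_device_batch_size = 4
gradient_accumulation_steps = 8
num_gpus = 2

effective_batch_size = per_device_batch_size * gradient_accumulation_steps * num_gpus
print(effective_batch_size)  # 64 examples per optimizer update
```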