Introduction to LLM concepts

mavazque

In this post, I'll cover various LLM concepts and the questions I asked myself while diving deep into the world of LLMs. I expect to keep updating this post as I continue learning about LLMs.

This post is the result of my exploratory work on LLMs. While I’ve made a strong effort to ensure the information is accurate, I’m far from an expert on the topic, and some details may not be entirely correct. If you notice anything missing or inaccurate, please leave a comment!

General LLM Concepts and Questions

This section describes general concepts and questions around Large Language Models.

What's a Large Language Model (LLM)?

A Large Language Model (LLM) is an AI system trained on vast amounts of text data to understand and generate human-like language. Using deep learning, specifically transformer architectures, LLMs can answer questions, summarize text, generate content, and more. They predict the most likely next word based on context.

What's the difference between a Base and an Instruct model?

A base model predicts the next word. An instruct model has been fine-tuned to predict the next word in a conversation between a user and a helpful assistant, specifically to follow instructions. There are also chat models, fine-tuned for conversational formats.

You can prompt a base model to follow instructions with proper formatting, but instruct models are trained to do this out-of-the-box.

Usually, base models are used as a foundation for further fine-tuning.

To make a base model act like an instruct model, you can format prompts using chat templates. Learn more about that here. You can also view an example of a chat template in the tokenizer_config.json here.
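
As a rough illustration, here's how a chat template can be applied with the Hugging Face transformers library (SmolLM2 is used as the example model; any chat-tuned checkpoint works similarly):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-135M-Instruct")
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is an LLM?"},
]
# Render the conversation into a single string using the model's special tokens
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)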

| Aspect | Base Models | Instruct Models |
|---|---|---|
| Training Objective | General language modeling | Fine-tuned for instruction following |
| Focus | Broad and generic text understanding | Task-specific, user-centric interactions |
| Performance | Literal, unrefined responses | Context-aware, tailored responses |
| Safety and Alignment | Lower alignment with human values | Higher alignment due to RLHF |
| Examples of Usage | Pre-training for fine-tuning tasks | User-facing applications like chatbots |

RLHF → Reinforcement Learning from Human Feedback

  • Base Models: Useful as a starting point for further fine-tuning for specific tasks.
  • Instruct Models: Ideal for deployment in interactive environments where users provide natural-language instructions.

You may want to use a base model when training from scratch or when focusing on text/code continuation tasks.

What are Parameters in an LLM?

Parameters are the numerical values in the model that determine how it processes and generates text. They are the trainable weights and biases adjusted during training to learn patterns from data.
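
For instance, you can count a model's parameters in a couple of lines of Python (a quick sketch assuming the transformers library and the SmolLM2 checkpoint mentioned later in this post):

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM2-135M-Instruct")
# Sum the number of elements in every weight/bias tensor
print(sum(p.numel() for p in model.parameters()))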

Teacher vs Critic vs Judge Model

Teacher Model

A teacher model is a large, high-performing model used to train a smaller student model. This process, called knowledge distillation, involves the teacher generating outputs (soft labels) on data to help the student learn nuanced patterns. Teacher models can also generate synthetic data based on curated datasets.

Critic Model

A critic model evaluates the outputs of another model, focusing on quality, coherence, relevance, or accuracy. It's often used during training to provide feedback on outputs from the teacher model.

Judge Model

The judge model ranks or scores outputs from different models or generations. It acts as an arbiter to select the best response based on predefined criteria, often using human preference data for training.

LLM as a Judge

LLMs can be used as judges to evaluate outputs in a human-like way without needing manual review. You can read more about this in this Hugging Face cookbook.

Critic and Judge Collaboration

In advanced setups:

  1. Critic provides iterative feedback to improve outputs.
  2. Judge selects the best version from improved outputs.

This setup is ideal for high-stakes use cases like medical, legal, or technical content generation.

RAG vs Fine Tuning

RAG (Retrieval-Augmented Generation) connects a model to external databases to retrieve relevant content at runtime. Fine-tuning adjusts internal model weights using domain-specific training data. Both aim to improve model performance for specific tasks.

RAG augments prompts with dynamic data; fine-tuning bakes knowledge directly into the model.

Most modern fine-tuning is done using PEFT (Parameter-Efficient Fine Tuning), which adjusts only parts of the model, reducing training costs.

| Aspect | RAG (Retrieval-Augmented Generation) | Fine-Tuning |
|---|---|---|
| Knowledge Source | External, dynamically retrieved | Internal, baked into model’s weights |
| Adaptation Method | Combines retrieval and generation | Updates model weights with training data |
| Scalability | Scales with growing knowledge bases | Requires retraining for new domains |
| Flexibility | Adapts in real-time | Static after training |
| Cost and Time | No retraining cost, but needs infrastructure | Computationally expensive and slower |
| Performance | Dependent on retrieval quality | Highly accurate if trained well |
| Use Cases | Open-domain, dynamic data | Fixed-domain, precision tasks |
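
To make the RAG flow concrete, here's a minimal sketch; the retriever and llm objects are hypothetical stand-ins for whatever vector store and model client you use:

def answer_with_rag(question, retriever, llm):
    # Retrieve relevant chunks from the external knowledge base at runtime
    docs = retriever.search(question, k=3)  # hypothetical retriever API
    context = "\n".join(doc.text for doc in docs)
    # Augment the prompt with the retrieved context before generating
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
    return llm.generate(prompt)  # hypothetical model client API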

Evaluating Models

To evaluate LLM performance, benchmarks are used. Two common ones:

  • MMLU: Massive Multitask Language Understanding — tests across various subjects.
  • MATH: Evaluates mathematical problem-solving skills.

What is Quantization?

Quantization compresses a model by reducing the precision of its weights and activations. For instance, reducing from 32-bit floats (FP32) to 8-bit integers (INT8). This lowers memory usage and speeds up inference, often with minimal accuracy loss.

Example: A 2B parameter model in FP16 (2 bytes per parameter) takes roughly 4GB:

(2,000,000,000 × 2) / 1024 / 1024 / 1024 = ~3.7GB
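
The same back-of-the-envelope math for other precisions (a quick sketch, ignoring any runtime overhead):

params = 2_000_000_000
for precision, bytes_per_param in [("FP32", 4), ("FP16", 2), ("INT8", 1), ("INT4", 0.5)]:
    gib = params * bytes_per_param / 1024**3
    print(f"{precision}: ~{gib:.1f}GB")  # FP32: ~7.5GB ... INT4: ~0.9GB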

More info: Guide to Quantization

What is a Distilled Model?

A distilled model is a smaller, faster version of a larger model, created through knowledge distillation. It aims to retain most of the performance while reducing size and compute needs.

How Knowledge Distillation Works

  • A large "teacher" model makes predictions on a dataset.
  • A smaller "student" model learns to mimic the teacher’s outputs.
  • Training involves both original data and the teacher’s predictions.

The result: similar performance, lower cost — great for real-world use.
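
A minimal sketch of a distillation loss in PyTorch (assuming classification-style logits; the temperature T and mixing weight alpha are tunable, and the values here are illustrative):

import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft targets: the student mimics the teacher's temperature-softened distribution
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard targets: standard cross-entropy against the original labels
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard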

What are Tokens in the Context of LLMs?

Tokens are the smallest units a model understands, and they are not always full words. For example, the word “playful” may be split into the tokens “play” and “ful”. LLM vocabularies are often around ~32K tokens (e.g. LLaMA 2), far fewer than the ~600K words in the English language.

Tokenization breaks words into parts. Example: “played” → “play” + “ed”.

Check out this tokenizer in action: Hugging Face Playground
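
You can also inspect tokenization locally; a quick sketch using transformers and the SmolLM2 tokenizer (the exact splits vary from tokenizer to tokenizer):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-135M-Instruct")
print(tokenizer.tokenize("playful"))  # subword pieces, e.g. ['play', 'ful']
print(tokenizer.encode("playful"))    # the corresponding token IDs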

LLMs use special tokens for structure — the most important is the EOS (End Of Sequence) token. For example: <|im_end|>

You can view these in tokenizer_config.json: SmolLM2 Example

LLMs continue generating text until the EOS token is predicted.

Explore decoding processes: Basic Decoding | Beam Search

What is the Context Window / Context Length / Model Max Length?

The context window is the maximum number of tokens a model can handle in one go — including both input and generated output.

Example: With a 2048-token window, if the prompt uses 1000 tokens, only 1048 remain for the model’s response.

LLMs don’t have memory. In chats, “memory” is simulated by resending previous messages in the context window each time.

Larger context = more memory required.

You can find context limits in:

  • tokenizer_config.json → model_max_length
  • config.json → max_position_embeddings
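
Both values can be read programmatically; a small sketch using transformers (SmolLM2 used as an example):

from transformers import AutoConfig, AutoTokenizer

model_id = "HuggingFaceTB/SmolLM2-135M-Instruct"
config = AutoConfig.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)
print(config.max_position_embeddings)  # context length from config.json
print(tokenizer.model_max_length)      # limit from tokenizer_config.json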

If unsure, check: LLM Explorer

What is the Embedding Size?

Embedding size = number of dimensions used to represent each token numerically. Example: Embedding size 768 means each token is a 768D vector.

Bigger embedding = better nuance, more compute.
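
You can check it directly on the embedding matrix; a sketch assuming transformers and SmolLM2:

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM2-135M-Instruct")
# The shape is (vocab_size, embedding_size)
print(model.get_input_embeddings().weight.shape)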

What are Weights and Activations?

Weights: Trainable values that control connections between neurons. They “learn” language patterns during training.

Activations: The output of a neuron after applying an activation function (e.g., ReLU, Sigmoid). They introduce non-linearity to help models understand complex relationships.

Together, they power everything from translation to text generation.

Safetensors vs GGUF Models

These are two common formats for LLMs:

  • Safetensors: Stores the raw tensors; the standard format in the Hugging Face ecosystem, ideal for fine-tuning and exploration.
  • GGUF: A binary format optimized for local inference and sharing (used by llama.cpp and similar runtimes), typically holding quantized models.

Required vRAM for LLM Models

Estimate vRAM like this:

Number of Parameters × (Precision in bits / 8) × 1.2

Example for 2B parameters in FP32:

(2,000,000,000 × 4 × 1.2) / 1024³ = ~8.9GB

This assumes:

  • Max sequence length: 512
  • Batch size: 8
  • Single GPU
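
The formula translates directly into a small helper (a sketch; real usage also depends on sequence length, batch size, and the KV cache):

def estimate_vram_gb(num_params, precision_bits, overhead=1.2):
    # bytes per parameter = precision_bits / 8; ~20% extra for overhead
    return num_params * (precision_bits / 8) * overhead / 1024**3

print(estimate_vram_gb(2_000_000_000, 32))  # ≈ 8.9 (FP32)
print(estimate_vram_gb(2_000_000_000, 16))  # ≈ 4.5 (FP16)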

Some models list their vRAM needs: Mistral Sizes

Note: Inference uses less memory than training.

There are also online tools that estimate whether a given model fits your hardware.

LLM Model Naming

There’s no strict convention, but here’s a common pattern:

codellama-70b-python.Q4_K_S.gguf
  • codellama → Model family (the base model)
  • python → Variant/fine-tune (here, tuned for Python code)
  • 70b → 70 billion parameters
  • Q4 → Quantization level (4-bit)
  • _K → Weight rounding method (k-quant)
  • _S → Size category (S, M, L)

What does a name like 8x7B mean?

It’s a mixture-of-experts (MoE) model: 8 expert subnetworks of 7B parameters each, with a router activating only a subset of the experts per token (e.g. Mixtral 8x7B).

Padding in LLM Models

Padding adds dummy tokens (like <PAD>) to make input sequences the same length for batch processing.

Example:

["Hello", "<PAD>", "<PAD>"]
["How", "are", "you?"]
["Good", "morning", "<PAD>"]

Why it matters:

  1. Batch Processing: Needed for parallelization
  2. Fixed-Length Input: Ensures input fits expected tensor shape
  3. Efficient GPU Use: Aligns data for better performance
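
In practice the tokenizer handles padding for you; a short sketch with transformers (falling back to the EOS token if the tokenizer defines no pad token):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-135M-Instruct")
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # fallback pad token
batch = tokenizer(["Hello", "How are you?"], padding=True, return_tensors="pt")
print(batch["input_ids"])       # shorter sequences padded to the longest one
print(batch["attention_mask"])  # 0 marks pad positions the model should ignore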

Padding-Free LLMs

These models dynamically handle variable-length sequences.

  • Efficiency: No wasted compute on pad tokens
  • Memory: Saves space
  • Scalability: Great for real-time use cases

Prompts

Prompts are the inputs that guide an LLM’s output. They can be simple or complex, and they set the tone, style, and task.

System Prompt

A hidden instruction that shapes the model's behavior. You can define your own when self-hosting. Some vendors publish theirs, e.g. Anthropic for Claude: Anthropic’s system prompts

Chain of Thought (CoT)

Prompts that encourage models to think step-by-step. Improves reasoning for complex problems.

No CoT vs Zero-shot CoT vs Manual CoT
  • No CoT: Just the question — no reasoning steps
  • Zero-shot CoT: Add “Think step by step” to the prompt
  • Manual CoT: Explicit step-by-step examples in the prompt
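
For example (illustrative prompts):

  • No CoT: "What is 17 × 24?"
  • Zero-shot CoT: "What is 17 × 24? Think step by step."
  • Manual CoT: "Q: What is 12 × 15? A: 12 × 15 = (12 × 10) + (12 × 5) = 120 + 60 = 180. Q: What is 17 × 24? A:"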

LLM Agents

LLM agents use models + tools (like APIs or databases) to reason and act. They're great for automation.

Example use case: An agent monitors logs and fixes issues, or opens support cases based on analysis.

No-code tools: n8n.io

Pro-code tools: LangGraph

Learn more: Hugging Face Agents Course

Model Context Protocol (MCP)

An open standard for connecting LLM applications to external data sources and tools in a secure, uniform way. Applications act as MCP clients and connect to MCP servers that expose the data or tools.

Learn more: Anthropic Announcement | Official Docs

LLM Training and Fine-Tuning Concepts

Synthetic Data Generation (SDG)

Generate training data using a teacher model. InstructLab’s SDG 1.5 adds a Critic model that revises and filters before finalizing the data.

More info: InstructLab Blog

Overfitting and Underfitting

Overfitting: Model memorizes training data — poor generalization.

Underfitting: Model too simple — can’t learn from training data either.

Required Number of Epochs

Epoch = one pass through the training data. 3–6 is common for fine-tuning. Example: InstructGPT was trained for 16 epochs.

Source: InstructGPT Training Details

Effective Batch Size

The total number of training examples processed before the model weights are updated: per-GPU batch size × gradient accumulation steps × number of GPUs.
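
As a worked example (illustrative numbers):

per_device_batch_size = 4
gradient_accumulation_steps = 8
num_gpus = 2
# The model weights are updated once per effective batch
effective_batch_size = per_device_batch_size * gradient_accumulation_steps * num_gpus
print(effective_batch_size)  # 64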
