In this post, I'll cover various LLM concepts and the questions I asked myself while diving deep into the world of LLMs. I expect to keep updating this post as I learn more about LLMs.
This section describes general concepts and questions around Large Language Models.
A Large Language Model (LLM) is an AI system trained on vast amounts of text data to understand and generate human-like language. Using deep learning, specifically transformer architectures, LLMs can answer questions, summarize text, generate content, and more. They predict the most likely next word based on context.
What is the difference between a Base model and an Instruct model?
A base model predicts the next word. An instruct model has been fine-tuned to predict the next word in a conversation between a user and a helpful assistant, specifically to follow instructions. There are also chat models, fine-tuned for conversational formats.
You can prompt a base model to follow instructions with proper formatting, but instruct models are trained to do this out-of-the-box.
Usually, base models are used as a foundation for further fine-tuning.
To make a base model act like an instruct model, you can format prompts using chat templates. Learn more about chat templates here. You can also view an example of a chat template in a model's tokenizer_config.json here.
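As a minimal sketch of applying a chat template (assuming the Hugging Face transformers library; the SmolLM2 instruct checkpoint is just an example, any instruct-tuned model with a chat template works):

```python
from transformers import AutoTokenizer

# Checkpoint name is illustrative.
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-1.7B-Instruct")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is a chat template?"},
]

# Renders the conversation into the exact text format the model was fine-tuned on,
# including special tokens such as <|im_start|> / <|im_end|>.
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)
```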
| Aspect | Base Models | Instruct Models |
|---|---|---|
| Training Objective | General language modeling. | Fine-tuned for instruction following. |
| Focus | Broad and generic text understanding. | Task-specific, user-centric interactions. |
| Performance | Literal, unrefined responses. | Context-aware, tailored responses. |
| Safety and Alignment | Lower alignment with human values. | Higher alignment due to RLHF. |
| Examples of Usage | Pre-training for fine-tuning tasks. | User-facing applications like chatbots. |
RLHF → Reinforcement Learning from Human Feedback
You may want to use a base model when training from scratch or when focusing on text/code continuation tasks.
Parameters are the numerical values in the model that determine how it processes and generates text. They are the trainable weights and biases adjusted during training to learn patterns from data.
A teacher model is a large, high-performing model used to train a smaller student model. This process, called knowledge distillation, involves the teacher generating outputs (soft labels) on data to help the student learn nuanced patterns. Teacher models can also generate synthetic data based on curated datasets.
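As an illustrative sketch (not any specific framework's API), distillation commonly minimizes the KL divergence between the teacher's and student's token distributions:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Soft-label loss: push the student's token distribution toward the teacher's.

    Both logit tensors have shape (batch, seq_len, vocab_size); the temperature
    softens the distributions so the student also learns from low-probability tokens.
    """
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    # KL divergence, scaled by T^2 as in the classic distillation formulation.
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * temperature**2
```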
A critic model evaluates the outputs of another model, focusing on quality, coherence, relevance, or accuracy. It's often used during training to provide feedback on outputs from the teacher model.
The judge model ranks or scores outputs from different models or generations. It acts as an arbiter to select the best response based on predefined criteria, often using human preference data for training.
LLMs can be used as judges to evaluate outputs in a human-like way without needing manual review. You can read more about this in this Hugging Face cookbook.
In advanced setups, the three roles are chained: the teacher model generates candidate outputs, the critic model reviews and refines them, and the judge model scores or ranks the final responses.
This setup is ideal for high-stakes use cases like medical, legal, or technical content generation.
RAG (Retrieval-Augmented Generation) connects a model to external databases to retrieve relevant content at runtime. Fine-tuning adjusts internal model weights using domain-specific training data. Both aim to improve model performance for specific tasks.
RAG augments prompts with dynamic data; fine-tuning bakes knowledge directly into the model.
Most modern fine-tuning is done using PEFT (Parameter-Efficient Fine-Tuning), which updates only a small subset of the model's parameters, reducing training costs.
| Aspect | RAG (Retrieval-Augmented Generation) | Fine-Tuning |
|---|---|---|
| Knowledge Source | External, dynamically retrieved | Internal, baked into model’s weights |
| Adaptation Method | Combines retrieval and generation | Updates model weights with training data |
| Scalability | Scales with growing knowledge bases | Requires retraining for new domains |
| Flexibility | Adapts in real-time | Static after training |
| Cost and Time | No retraining cost, but needs infrastructure | Computationally expensive and slower |
| Performance | Dependent on retrieval quality | Highly accurate if trained well |
| Use Cases | Open-domain, dynamic data | Fixed-domain, precision tasks |
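To make the distinction concrete, here is a minimal RAG sketch in plain Python. The retriever is a toy stand-in (the function names are hypothetical); a real system would embed the query and search a vector database instead:

```python
def retrieve(query: str, documents: list[str], top_k: int = 3) -> list[str]:
    """Toy retriever: rank documents by word overlap with the query."""
    query_words = set(query.lower().split())
    scored = sorted(documents,
                    key=lambda d: len(query_words & set(d.lower().split())),
                    reverse=True)
    return scored[:top_k]

def build_rag_prompt(query: str, documents: list[str]) -> str:
    # Augment the prompt with retrieved context; the model's weights stay unchanged.
    context = "\n".join(retrieve(query, documents))
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"
```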
To evaluate LLM performance, standardized benchmarks are used; common examples include MMLU (broad academic knowledge) and HellaSwag (commonsense reasoning).
Quantization compresses a model by reducing the precision of its weights and activations. For instance, reducing from 32-bit floats (FP32) to 8-bit integers (INT8). This lowers memory usage and speeds up inference, often with minimal accuracy loss.
Example: a 2B-parameter model in FP16 takes about 4 GB (≈3.7 GiB):
(2,000,000,000 × 2 bytes) / 1024³ ≈ 3.7 GiB
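The same back-of-the-envelope math for several precisions (a sketch that only counts the weights, ignoring activation and KV-cache memory):

```python
def model_memory_gib(num_params: int, bytes_per_param: float) -> float:
    """Approximate weight memory in GiB; ignores activations and KV cache."""
    return num_params * bytes_per_param / 1024**3

params = 2_000_000_000  # a 2B-parameter model
for name, nbytes in [("FP32", 4), ("FP16", 2), ("INT8", 1), ("INT4", 0.5)]:
    print(f"{name}: {model_memory_gib(params, nbytes):.1f} GiB")
# FP32: 7.5 GiB, FP16: 3.7 GiB, INT8: 1.9 GiB, INT4: 0.9 GiB
```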
More info: Guide to Quantization
A distilled model is a smaller, faster version of a larger model, created through knowledge distillation. It aims to retain most of the performance while reducing size and compute needs.
The result: similar performance, lower cost — great for real-world use.
Tokens are the smallest units a model understands, and they are not always full words. For example, “playful” may be split into the tokens “play” and “ful”. LLMs often have vocabularies of ~32K tokens (e.g. LLaMA 2), not the ~600K words of English.
Tokenization breaks words into parts. Example: “played” → “play” + “ed”.
Check out this tokenizer in action: Hugging Face Playground
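A minimal sketch with the Hugging Face transformers library (the checkpoint name is just an example; the exact splits depend on each model's vocabulary):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-1.7B")

# Subword tokenization: one word can map to several tokens.
print(tokenizer.tokenize("played"))      # e.g. ['play', 'ed'], depending on the vocabulary
print(tokenizer.encode("Good morning"))  # the token IDs the model actually sees
```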
LLMs use special tokens for structure — the most important is the EOS (End Of Sequence) token. For example: <|im_end|>
You can view these in tokenizer_config.json: SmolLM2 Example
LLMs continue generating text until the EOS token is predicted.
Explore decoding processes: Basic Decoding | Beam Search
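As a sketch of that decoding loop (greedy decoding with the Hugging Face transformers API; the checkpoint is illustrative):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "HuggingFaceTB/SmolLM2-135M"  # illustrative; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

input_ids = tokenizer("The capital of France is", return_tensors="pt").input_ids
for _ in range(20):  # generate at most 20 new tokens
    logits = model(input_ids).logits
    # Greedy decoding: always pick the single most likely next token.
    next_token = torch.argmax(logits[:, -1, :], dim=-1, keepdim=True)
    input_ids = torch.cat([input_ids, next_token], dim=-1)
    if next_token.item() == tokenizer.eos_token_id:  # stop once EOS is predicted
        break
print(tokenizer.decode(input_ids[0]))
```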
The context window is the maximum number of tokens a model can handle in one go — including both input and generated output.
Example: With a 2048-token window, if the prompt uses 1000 tokens, only 1048 remain for the model’s response.
LLMs don’t have memory. In chats, “memory” is simulated by resending previous messages in the context window each time.
Larger context = more memory required.
You can find context limits in:
– tokenizer_config.json → model_max_length
– config.json → max_position_embeddings
If unsure, check: LLM Explorer
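You can also read these values programmatically (a small sketch with transformers; the checkpoint is an example):

```python
from transformers import AutoConfig, AutoTokenizer

checkpoint = "HuggingFaceTB/SmolLM2-1.7B"  # example checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
config = AutoConfig.from_pretrained(checkpoint)

print(tokenizer.model_max_length)      # from tokenizer_config.json
print(config.max_position_embeddings)  # from config.json
```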
Embedding size = number of dimensions used to represent each token numerically. Example: Embedding size 768 means each token is a 768D vector.
Bigger embedding = better nuance, more compute.
Weights: Trainable values that control connections between neurons. They “learn” language patterns during training.
Activations: The output of a neuron after applying an activation function (e.g., ReLU, Sigmoid). They introduce non-linearity to help models understand complex relationships.
Together, they power everything from translation to text generation.
Two common file formats for LLM weights are GGUF (used by llama.cpp for quantized, CPU-friendly inference) and safetensors (the Hugging Face default for safe, fast tensor storage).
Estimate vRAM like this:
Number of Parameters × Precision (in bytes) × 1.2 (overhead)
Example for 2B parameters in FP32:
(2,000,000,000 × 4 × 1.2) / 1024³ = ~8.9GB
This assumes:
– Max sequence length: 512
– Batch size: 8
– Single GPU
Some models list their vRAM needs: Mistral Sizes
Note: Inference uses less memory than training.
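The same estimate as a reusable sketch (the 1.2 overhead factor and per-parameter byte sizes come from the rule of thumb above):

```python
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}

def estimate_vram_gib(num_params: int, precision: str = "fp32", overhead: float = 1.2) -> float:
    """Rule-of-thumb vRAM estimate: parameters x bytes per parameter x overhead."""
    return num_params * BYTES_PER_PARAM[precision] * overhead / 1024**3

print(f"{estimate_vram_gib(2_000_000_000, 'fp32'):.1f} GiB")  # ~8.9 GiB
```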
Try these tools to check hardware compatibility:
There’s no strict convention, but here’s a common pattern:
codellama-70b-python.Q4_K_S.gguf
– 70B → 70 billion parameters
– Q4 → Quantization level
– _K → Weight rounding method
– _S → Size category (S, M, L)
A name like mixtral-8x7b indicates a mixture of experts: 8 expert models, each with 7B parameters.
Padding adds dummy tokens (like <PAD>) to make input sequences the same length for batch processing.
Example:
["Hello", "<PAD>", "<PAD>"]
["How", "are", "you?"]
["Good", "morning", "<PAD>"]
Why it matters: batched training and inference require every sequence in a batch to have the same length, and an attention mask tells the model to ignore the padding tokens. Some attention implementations avoid padding by handling variable-length sequences dynamically.
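A small sketch of padding in practice (Hugging Face transformers; the checkpoint is illustrative):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-1.7B")
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # common workaround when no pad token is defined

batch = ["Hello", "How are you?", "Good morning"]
# padding=True pads every sequence to the longest one in the batch;
# the attention_mask marks real tokens (1) vs. padding (0).
encoded = tokenizer(batch, padding=True, return_tensors="pt")
print(encoded["input_ids"])
print(encoded["attention_mask"])
```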
Prompts are the inputs that guide an LLM’s output. They can be simple or complex, and they set the tone, style, and task.
A system prompt is a hidden instruction that shapes model behavior. You can define your own when self-hosting. Some vendors, like Anthropic with Claude, publish theirs: Anthropic’s system prompts
Chain-of-thought prompts encourage models to think step-by-step (e.g., by adding “Let’s think step by step” to the prompt), which improves reasoning on complex problems.
LLM agents use models + tools (like APIs or databases) to reason and act. They're great for automation.
Example use case: An agent monitors logs and fixes issues, or opens support cases based on analysis.
No-code tools: n8n.io
Pro-code tools: LangGraph
Learn more: Hugging Face Agents Course
MCP (Model Context Protocol) is an open standard for connecting LLMs to data securely. Apps can act as MCP clients or servers.
Learn more: Anthropic Announcement | Official Docs
Synthetic data generation (SDG) produces training data using a teacher model. InstructLab’s SDG 1.5 adds a Critic model that revises and filters the generated samples before finalizing the data.
More info: InstructLab Blog
Overfitting: Model memorizes training data — poor generalization.
Underfitting: Model too simple — can’t learn from training data either.
Epoch = one pass through the training data. 3–6 is common for fine-tuning. Example: InstructGPT was trained for 16 epochs.
Source: InstructGPT Training Details
The effective batch size is the total number of training examples processed before the model’s weights are updated: per-GPU batch size × number of GPUs × gradient-accumulation steps.
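As a quick sketch of the arithmetic (all values illustrative):

```python
per_gpu_batch_size = 8     # examples per GPU per step
num_gpus = 4
gradient_accumulation = 2  # steps accumulated before each weight update

effective_batch_size = per_gpu_batch_size * num_gpus * gradient_accumulation
print(effective_batch_size)  # 64 examples per weight update
```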