Studyify

Technical Encyclopedia

A high-fidelity guide covering the full spectrum of Large Language Model theory, from foundational principles to production architecture.

Phase: Beginner

AI vs ML vs DL

Technical Specification

Understanding the hierarchy of Artificial Intelligence.

Conceptual Analogy

"AI is the planet, Machine Learning is a continent on that planet, and Deep Learning is a specific city on that continent."

Architecture Schema
graph TD
    A[Artificial Intelligence] --> B[Machine Learning]
    B --> C[Deep Learning]
    C --> D[LLMs / Transformers]

Implementation Details

Artificial Intelligence is the broad discipline of creating intelligent machines. Machine Learning is a subset where machines learn from data without explicit programming. Deep Learning is a subset of ML using multi-layered neural networks.

Supervised Learning

Technical Specification

Learning with a teacher (labeled data).

Conceptual Analogy

"Like a teacher showing flashcards: 'This is a Cat', 'This is a Dog'. Eventually, the student learns to identify them alone."

Architecture Schema
flowchart LR
    A[Input Image] --> B[Model]
    B --> C[Prediction: Cat]
    C -- Compare --> D[Label: Cat]
    D -- Correct! --> B

Implementation Details

A training paradigm where the model learns to map inputs (features) to outputs (labels) based on example input-output pairs.

Neural Network

Technical Specification

A computer system inspired by the human brain.

Conceptual Analogy

"A team of people solving a puzzle. Each person focuses on one small piece, passes their finding to the next person, until the full picture is revealed."

Architecture Schema
graph LR
    A[Input Layer] --> B[Hidden Layer 1]
    B --> C[Hidden Layer 2]
    C --> D[Output Layer]

Implementation Details

A layered system of interconnected computational units that learns to recognize underlying relationships in data, loosely mimicking the way biological neurons pass signals to one another.

Training

Technical Specification

Teaching the model by showing it examples and correcting mistakes.

Conceptual Analogy

"Like school. The model takes a test, gets a grade (Loss), and studies to do better next time (Gradient Descent)."

Implementation Details

The process of optimizing the model's parameters (weights and biases) to minimize the loss function on a dataset.
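The loop below is a minimal sketch of that process, assuming a toy one-parameter model y = w * x and a mean squared error loss; the data, learning rate, and epoch count are arbitrary illustrative choices, not any particular framework's defaults:

```python
def train(xs, ys, lr=0.01, epochs=100):
    """Fit y ~ w * x by repeatedly scoring the model (loss) and nudging w downhill."""
    w = 0.0  # start with an untrained parameter
    for _ in range(epochs):
        # Gradient of the mean squared error (w*x - y)^2 with respect to w
        grad = sum(2 * x * (w * x - y) for x, y in zip(xs, ys)) / len(xs)
        w -= lr * grad  # update the parameter against the gradient
    return w

# Data drawn from y = 3x; training recovers a weight close to 3
weight = train([1.0, 2.0, 3.0], [3.0, 6.0, 9.0])
```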

Epoch

Technical Specification

One complete pass through the entire training dataset.

Conceptual Analogy

"One full reading of the textbook."

Implementation Details

A hyperparameter defining the number of times the learning algorithm will work through the entire training dataset.

Inference

Technical Specification

Using the trained model to make predictions.

Conceptual Analogy

"Taking the final exam using what you learned."

Implementation Details

The stage where a trained model is deployed to generate predictions on new, unseen data.

Phase: Intermediate

Artificial Neuron

Technical Specification

The atomic unit of a neural network.

Conceptual Analogy

"A tiny decision switch. It takes signals in, weighs them, and if the signal is strong enough, it fires."

Architecture Schema
graph LR
    X1[Input 1] -- w1 --> S((Sum))
    X2[Input 2] -- w2 --> S
    B[Bias] --> S
    S --> A[Activation]
    A --> Y[Output]

Implementation Details

Computes y = f(∑(w * x) + b). It performs a weighted sum of inputs, adds a bias, and passes it through an activation function.
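That computation fits in a few lines; sigmoid is used here as one arbitrary choice of the activation f:

```python
import math

def neuron(inputs, weights, bias):
    """y = f(sum(w_i * x_i) + b): weighted sum, plus bias, through an activation."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 / (1 + math.exp(-z))  # sigmoid squashes the sum into (0, 1)

# A strongly positive weighted sum "fires" close to 1:
# neuron([1.0, 1.0], [2.0, 3.0], 0.5) computes sigmoid(5.5), roughly 0.996
```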

Tokenization

Technical Specification

Breaking text into chunks (tokens) for the model.

Conceptual Analogy

"Chopping a sentence into Lego bricks. 'I love AI' -> ['I', 'love', 'AI']."

Architecture Schema
graph TD
    A["Raw Text: 'Learning AI'"] --> B[Tokenizer]
    B --> C["Tokens: [1024, 4522]"]

Implementation Details

The process of converting raw text into integer IDs from a fixed vocabulary. BPE (Byte Pair Encoding) is commonly used to balance word-level and character-level splitting.
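As a toy illustration, the greedy longest-match lookup below maps text to IDs from a tiny invented vocabulary; it is a stand-in for real BPE, which learns its merges from data:

```python
def tokenize(text, vocab):
    """Greedy longest-match tokenizer over a fixed vocabulary (a toy stand-in for BPE).
    Characters covered by no vocabulary entry map to the <unk> ID."""
    ids, i = [], 0
    while i < len(text):
        # Try the longest vocabulary entry that matches at position i
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                ids.append(vocab[text[i:j]])
                i = j
                break
        else:  # no entry matched even a single character
            ids.append(vocab["<unk>"])
            i += 1
    return ids

# Invented vocabulary for demonstration only
vocab = {"<unk>": 0, "learn": 1, "ing": 2, " ": 3, "AI": 4}
tokens = tokenize("learning AI", vocab)  # splits into 'learn', 'ing', ' ', 'AI'
```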

Embedding

Technical Specification

Converting tokens into meaningful number lists (vectors).

Conceptual Analogy

"GPS coordinates for words. 'King' and 'Queen' are close together on the map."

Architecture Schema
graph TD
    Token[Token: 'Cat'] --> Embed[Embedding Layer]
    Embed --> Vector["Vector: [0.1, 0.9, -0.4...]"]

Implementation Details

Dense vector representations where semantically similar words map to nearby points in high-dimensional space.
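The "nearby points" idea can be made concrete with cosine similarity. The 3-dimensional vectors below are made up for illustration; real models learn embeddings with hundreds or thousands of dimensions:

```python
import math

# Toy embedding table (invented values; real embeddings are learned during training)
embeddings = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.7, 0.2],
    "apple": [0.1, 0.2, 0.9],
}

def cosine_similarity(a, b):
    """Similar directions give a value near 1; unrelated directions give a value near 0."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# In this space, 'king' sits far closer to 'queen' than to 'apple'
```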

Gradient Descent

Technical Specification

The algorithm used to minimize errors.

Conceptual Analogy

"Descending a misty mountain. You feel the slope with your feet and execute a step downwards."

Architecture Schema
graph TD
    A[Calculate Loss] --> B[Compute Gradient]
    B --> C[Update Weights]
    C --> A

Implementation Details

An iterative optimization algorithm for finding a local minimum of a differentiable function.
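The "step downhill" loop is a few lines when written against a single variable. This sketch minimizes the toy function f(x) = (x - 2)^2, whose gradient 2(x - 2) is known in closed form; the learning rate and step count are arbitrary:

```python
def gradient_descent(grad, x0, lr=0.1, steps=200):
    """Repeatedly step against the gradient (the 'slope felt underfoot')."""
    x = x0
    for _ in range(steps):
        x -= lr * grad(x)  # move in the direction of steepest descent
    return x

# f(x) = (x - 2)^2 has gradient 2*(x - 2) and its minimum at x = 2
x_min = gradient_descent(lambda x: 2 * (x - 2), x0=10.0)
```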

Bias

Technical Specification

An adjustable threshold for a neuron.

Conceptual Analogy

"Like a personal preference. Even if the input arguments are weak, a high bias might make you say 'Yes' anyway."

Implementation Details

A learnable parameter added to the weighted sum, allowing the activation function to be shifted left or right.

Weight

Technical Specification

The strength of a connection between neurons.

Conceptual Analogy

"Like the volume knob on a radio channel. High weight means the signal is loud and important; low weight means it's ignored."

Implementation Details

A learnable parameter that scales the input signal. Training involves adjusting these weights to minimize error.

Activation Function

Technical Specification

Decides if a neuron should 'fire' or stay silent.

Conceptual Analogy

"The rule for the switch. 'If the total pressure is above 50, open the floodgate.'"

Implementation Details

A non-linear function (like ReLU or GELU) applied to the neuron's output, enabling the network to learn complex patterns.
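Both functions mentioned above are short enough to write out; the GELU shown uses the common tanh approximation rather than the exact Gaussian CDF form:

```python
import math

def relu(x):
    """ReLU: pass positive signals through, silence negative ones."""
    return max(0.0, x)

def gelu(x):
    """GELU (tanh approximation): a smooth alternative to ReLU used in Transformers."""
    return 0.5 * x * (1 + math.tanh(math.sqrt(2 / math.pi) * (x + 0.044715 * x ** 3)))
```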

Loss Function

Technical Specification

A score of how wrong the model is.

Conceptual Analogy

"The difference between your answer and the correct answer on a test. Lower score is better!"

Implementation Details

A mathematical function (e.g., Cross-Entropy) that quantifies the discrepancy between the predicted output and the actual target.
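For a single classified example, cross-entropy reduces to the negative log of the probability the model gave the correct class, as this sketch shows:

```python
import math

def cross_entropy(predicted_probs, target_index):
    """Cross-entropy for one example: -log(probability assigned to the true class).
    Confident-and-right scores near 0; confident-and-wrong scores are large."""
    return -math.log(predicted_probs[target_index])

# A confident correct prediction (0.9 on the true class) is penalized far less
# than a confident wrong one (0.1 on the true class)
```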

Backpropagation

Technical Specification

Calculating who is to blame for an error.

Conceptual Analogy

"If a company loses money, you trace back from the CEO -> Manager -> Worker to find where the mistake happened."

Implementation Details

The algorithm for computing the gradient of the loss function with respect to the weights using the chain rule.
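The chain rule can be traced by hand for a single neuron with loss L = (w*x - target)^2; each local derivative below is one link in the chain from the loss back to the weight:

```python
def backprop_single_neuron(x, w, target):
    """Manually apply the chain rule to L = (w*x - target)^2."""
    # Forward pass
    y = w * x            # prediction
    diff = y - target
    loss = diff ** 2
    # Backward pass: blame flows from the loss back to the weight
    dloss_ddiff = 2 * diff                     # dL/d(diff)
    ddiff_dy = 1.0                             # d(diff)/dy
    dy_dw = x                                  # dy/dw
    dloss_dw = dloss_ddiff * ddiff_dy * dy_dw  # chain rule: multiply the links
    return loss, dloss_dw

# With x=2, w=3, target=10: y=6, loss=16, and dL/dw = 2*(-4)*1*2 = -16
```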

Batch Size

Technical Specification

Number of examples the model sees before updating itself.

Conceptual Analogy

"Do you grade homework one by one, or collect 32 of them and grade them all at once?"

Implementation Details

The number of training samples to work through before the model's internal parameters are updated.

Top-K Sampling

Technical Specification

Limiting choices to the top K best options.

Conceptual Analogy

"Instead of picking from every word in the dictionary, only consider the top 5 most likely next words."

Implementation Details

A decoding strategy that filters the distribution to only the top K most probable next tokens.
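A sketch of the filtering step, operating on an invented token-to-probability table; the surviving probabilities are renormalized before sampling:

```python
import random

def top_k_sample(token_probs, k, rng=random):
    """Keep only the k most probable tokens, renormalize, then sample among them."""
    top = sorted(token_probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in top)
    tokens = [t for t, _ in top]
    weights = [p / total for _, p in top]
    return rng.choices(tokens, weights=weights, k=1)[0]

# Invented next-token distribution for illustration
probs = {"cat": 0.5, "dog": 0.3, "car": 0.15, "xyz": 0.05}
# With k=2, only 'cat' or 'dog' can ever be chosen
```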

Determinism

Technical Specification

Getting the exact same result every time.

Conceptual Analogy

"Calculating 2+2 (always 4) vs asking a friend for a story (always different)."

Implementation Details

A property where the model produces the same output for a given input, usually achieved by setting temperature to 0 (greedy decoding) and fixing any random seeds.

Parameters

Technical Specification

The total number of adjustable weights and biases in the model.

Conceptual Analogy

"The number of synapses in a brain. More parameters generally mean more knowledge and intelligence."

Implementation Details

The sum of all weights and biases in the neural network. Larger models generally have higher capacity but require more compute.

Temperature

Technical Specification

Controls the creativity/randomness of the model.

Conceptual Analogy

"Low temp = Robot, always predictable. High temp = Poet, creative but maybe crazy."

Implementation Details

A hyperparameter used to scale the logits before applying softmax. Higher values flatten the distribution (more random), lower values sharpen it (more deterministic).
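The scaling step is a one-line change to softmax, sketched below with arbitrary example logits:

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide logits by T before softmax: T < 1 sharpens, T > 1 flattens."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]
# A low temperature concentrates probability on the top logit;
# a high temperature spreads it across the alternatives
```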

Context Window

Technical Specification

How much text the model can remember at once.

Conceptual Analogy

"Short-term memory. If it's too full, the model forgets the beginning of the conversation."

Implementation Details

The maximum number of tokens the model can process in a single forward pass.

Phase: Expert

Transformer Architecture

Technical Specification

The architecture behind GPT, BERT, and modern LLMs.

Conceptual Analogy

"An assembly line that processes the entire sentence at once (Parallel) rather than word-by-word (Sequential/RNN)."

Architecture Schema
graph TD
    Input --> Embed
    Embed --> Enc[Encoder Blocks]
    Enc --> Dec[Decoder Blocks]
    Dec --> Output

Implementation Details

Introduced in 'Attention Is All You Need' (2017). Relies entirely on self-attention mechanisms to draw global dependencies between input and output.

Self-Attention Mechanism

Technical Specification

Allows the model to focus on relevant parts of the input.

Conceptual Analogy

"Reading a sentence and looking back at previous words to understand 'it' or 'they'."

Architecture Schema
graph TD
    X[Input] --> Q[Query]
    X --> K[Key]
    X --> V[Value]
    Q -- MatMul --> S[Scores]
    K -- MatMul --> S
    S -- Softmax --> W[Weights]
    W -- MatMul --> Out[Output]
    V -- MatMul --> Out

Implementation Details

Computes attention scores using Query (Q), Key (K), and Value (V) matrices: Attention(Q, K, V) = softmax(QK^T / sqrt(d_k))V.
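The formula can be written out directly. This sketch works on plain Python lists of row vectors (one per token), with no batching, masking, or multiple heads:

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(K[0])
    out = []
    for q in Q:
        # This query's score against every key, scaled by sqrt(d_k)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Weighted sum of the value vectors
        out.append([sum(w * v[i] for w, v in zip(weights, V)) for i in range(len(V[0]))])
    return out
```

With a query that strongly matches the first key, nearly all attention (and hence the first value vector) flows to the output.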

Query (Q)

Technical Specification

What a token is looking for.

Conceptual Analogy

"Like a Google search bar. The word 'Bank' sends out a query: 'Are we talking about money or rivers?'"

Implementation Details

A vector derived from the input embedding representing the current token's search intent.

Key (K)

Technical Specification

What a token contains/offers.

Conceptual Analogy

"Like the tag on a file folder. 'River' has a key saying 'I am about nature/water'."

Implementation Details

A vector derived from the input embedding that serves as a label or index for matching with Queries.

Value (V)

Technical Specification

The actual information passed along.

Conceptual Analogy

"The content inside the folder. If the Query matches the Key, you get the Value."

Implementation Details

A vector derived from the input embedding containing the actual information to be aggregated if the attention score is high.

Encoder-Decoder

Technical Specification

A common architecture for sequence-to-sequence tasks.

Conceptual Analogy

"Like a translator. The encoder understands the input language, and the decoder generates the output in another language."

Architecture Schema
graph TD
    Input --> Encoder
    Encoder --> Context
    Context --> Decoder
    Decoder --> Output

Implementation Details

An architecture where an encoder processes the input sequence into contextual representations and a decoder generates an output sequence conditioned on them (classic RNN versions compressed the input into a single fixed-size context vector; Transformer decoders attend over all encoder states). Used in machine translation and summarization.

Softmax Function

Technical Specification

Converts numbers into probabilities that sum to one.

Conceptual Analogy

"Like a popularity contest. It takes raw scores and turns them into percentages, showing how popular each option is."

Architecture Schema
graph TD
    Input[Logits: -1, 0, 3] --> Softmax
    Softmax --> Output[Probabilities: 0.02, 0.05, 0.94]

Implementation Details

A function that maps a vector of arbitrary real-valued scores (logits) to a probability distribution: each output lies between 0 and 1, and the outputs sum to 1.
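A direct implementation, including the standard max-subtraction trick so large logits do not overflow the exponential:

```python
import math

def softmax(logits):
    """Exponentiate and normalize so outputs are in (0, 1) and sum to 1."""
    m = max(logits)  # subtracting the max changes nothing mathematically but avoids overflow
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# softmax([-1, 0, 3]) gives roughly [0.017, 0.047, 0.936]
```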

Visual Encoder (ViT)

Technical Specification

The part of the model that learns to 'see' images.

Conceptual Analogy

"The eyes and visual cortex of the AI. It turns pixels into understanding."

Implementation Details

Usually a Vision Transformer (ViT) that chops an image into patches and processes them similarly to text tokens to extract visual features.

Spectrogram

Technical Specification

A visual picture of sound frequencies.

Conceptual Analogy

"Sheet music for a computer. It shows low and high notes over time."

Implementation Details

A visual representation of the spectrum of frequencies of a signal as it varies with time.

Multimodal

Technical Specification

Capable of understanding Text, Audio, and Images.

Conceptual Analogy

"A genius who can read, listen, and see, instead of just reading."

Implementation Details

The ability of a single model to process and relate information from multiple modalities (text, vision, audio) in a shared embedding space.

Quantization

Technical Specification

Reducing the precision of numbers to make the model smaller and faster.

Conceptual Analogy

"Like lowering the resolution of an image. It looks almost the same but takes up way less space."

Implementation Details

The process of mapping continuous infinite values to a smaller set of discrete finite values, e.g., converting 32-bit floats to 4-bit integers.
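A minimal sketch of affine (min-max) quantization, one simple scheme among many: floats are snapped onto 2^bits evenly spaced integer levels, then mapped back to approximate the originals.

```python
def quantize(values, bits=4):
    """Map floats onto 2^bits integer levels and back (affine min-max quantization)."""
    lo, hi = min(values), max(values)
    levels = 2 ** bits - 1                 # e.g. 15 levels for 4-bit
    scale = (hi - lo) / levels or 1.0      # guard against an all-constant input
    q = [round((v - lo) / scale) for v in values]  # small ints: the stored form
    dq = [lo + qi * scale for qi in q]             # dequantized approximations
    return q, dq

# Values survive within half a quantization step of their originals
```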

End of Technical Documentation