AI & LLM Glossary

A manifest describing an A2A agent's capabilities, endpoint, and authentication requirements.

AI Coding Assistant

A tool that helps write, review, or debug code using AI.

Alignment

The process of ensuring a model behaves in ways consistent with human values and intentions.

Annotation

The process of adding labels or metadata to training data.

API

Application Programming Interface, a set of rules for software components to communicate.

API Pricing

The cost structure for using a model or service via API, usually per input and output tokens.

Artificial General Intelligence

Hypothetical AI that can understand, learn, and perform any intellectual task a human can do across any domain.

Artificial Intelligence

The broad field of creating machines that can perform tasks requiring human-like intelligence, such as reasoning, learning, and perception.

Assistant

An AI application that helps users complete tasks through conversation.

Attention

A mechanism that lets a model focus on relevant parts of input when producing each output token.

Autonomy

The degree to which an agent can operate without human intervention.

Autoregressive

Generating output one token at a time, using previously generated tokens as context for the next.

B

Backpropagation

The algorithm used to train neural networks by propagating error gradients backward and updating weights.

Batch

A subset of training data processed together in one forward and backward pass.

Batch Processing

Sending groups of requests together, often at a lower price but with higher latency.

Batch Size

The number of training examples used in a single update step.

Benchmark

A standardized dataset and task used to compare models.

Bias

A learnable value added to a neuron's input before activation, helping the model fit data better.

BLEU

A metric that evaluates the quality of generated text by comparing it to reference text.

BM25

A ranking function used in keyword search to estimate document relevance.
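As a rough illustration of how such a ranking function scores documents, here is a minimal sketch of the common Okapi BM25 formula. The function name `bm25_scores`, the whitespace tokenizer, the sample documents, and the default values `k1=1.5` and `b=0.75` are assumptions chosen for the example, not part of any particular search engine.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document in `docs` against `query` with Okapi BM25.

    Tokenization is a lowercase whitespace split (an assumption made
    for brevity; real systems use proper text analyzers).
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs

    # Document frequency: number of documents containing each term.
    doc_freq = Counter()
    for toks in tokenized:
        doc_freq.update(set(toks))

    scores = []
    for toks in tokenized:
        term_freq = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue
            n_t = doc_freq[term]
            # Smoothed inverse document frequency (always positive).
            idf = math.log((n_docs - n_t + 0.5) / (n_t + 0.5) + 1.0)
            tf = term_freq[term]
            # Length normalization: b controls the penalty for long documents.
            norm = tf + k1 * (1.0 - b + b * len(toks) / avg_len)
            score += idf * tf * (k1 + 1.0) / norm
        scores.append(score)
    return scores


# Hypothetical mini-corpus for illustration.
docs = [
    "refund policy for card payments",
    "baseball season schedule",
    "how to request a refund",
]
scores = bm25_scores("refund payments", docs)
best = max(range(len(docs)), key=lambda i: scores[i])
```

The first document matches both query terms, so it ranks highest; the baseball document shares no terms and scores zero.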

BPE

Byte Pair Encoding, a subword tokenization algorithm used by many language models.

C

Caching

Storing and reusing previous results to reduce latency and cost.

CDN

Content Delivery Network, a geographically distributed network that speeds up content delivery.

Chain-of-Thought

Prompting a model to show its reasoning step by step before giving a final answer.

Chatbot

A conversational interface that uses AI to interact with users.

Chunking

Splitting documents into smaller pieces before embedding and storing them.

CI/CD

Continuous Integration and Continuous Deployment, automated pipelines for building and releasing software.

Class Imbalance

When some classes or outcomes are far more common than others in training data.

Classification

Assigning input data to predefined categories.

Claude Code

An agentic terminal coding tool powered by Claude.

Clustering

Grouping data points into clusters based on similarity.

Code Execution

Running code generated by a model, usually in a controlled environment.

Cold Start

Difficulty in making predictions for new users or items with little historical data.

Collaborative Filtering

Making recommendations based on patterns across many users.

Completion

The text generated by a model in response to a prompt.

Constitutional AI

A training approach where models critique and revise their own outputs according to a set of principles.

Content Filter

A system that blocks or flags disallowed content.

Content-Based Filtering

Making recommendations based on attributes of items a user has liked.

Context Window

The context window is the maximum number of tokens a model can consider in a single forward pass. It includes the system prompt, user messages, retrieved documents, and the model's own generated output. If the total exceeds the window, the oldest tokens are dropped or the request fails. Context windows vary widely. Small models may handle 4,000 tokens, while frontier models can process 128,000, 1,000,000, or even 10,000,000 tokens. Long context is useful for summarizing books, analyzing large codebases, and holding extended conversations without losing earlier details. A larger window does not always mean better results. Very long inputs can dilute attention, making the model miss important details. Techniques like RAG, selective summarization, and hierarchical chunking help fit the most relevant information into the window without exceeding the limit.
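One simple way an application might keep a conversation inside the window is to drop the oldest messages first. The sketch below assumes a crude word-based token count and hypothetical names (`count_tokens`, `fit_to_window`, `reserve_for_output`); a real application would use the model's own tokenizer and limits.

```python
def count_tokens(text):
    """Rough token count: whitespace-separated words (an assumption;
    real tokenizers usually produce more tokens than words)."""
    return len(text.split())


def fit_to_window(system_prompt, messages, max_tokens, reserve_for_output):
    """Keep the newest messages that fit the remaining token budget.

    The system prompt and a reserve for the model's output are
    subtracted first. Oldest messages are dropped first, and the kept
    messages are returned in their original order.
    """
    budget = max_tokens - reserve_for_output - count_tokens(system_prompt)
    kept = []
    for message in reversed(messages):  # walk from newest to oldest
        cost = count_tokens(message)
        if cost > budget:
            break
        kept.append(message)
        budget -= cost
    kept.reverse()
    return kept


# Hypothetical conversation history, oldest message first.
history = [
    "My order number is 1234",
    "It arrived damaged last week",
    "Can I get a refund or a replacement",
]
kept = fit_to_window(
    system_prompt="You are a support assistant",
    messages=history,
    max_tokens=30,
    reserve_for_output=10,
)
```

With these toy numbers only the two newest messages fit, so the order number is lost. That loss is the reason summarization or retrieval is often preferred over plain truncation.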

Continual Pre-training

Further pre-training a model on additional domain-specific data before fine-tuning.

Conversational AI

AI systems designed for natural language dialogue.

Copilot

An AI assistant embedded in a workflow to augment human work.

Cost Per Million Tokens

A common pricing unit for API-based language models.

Curriculum Learning

Training a model on easier examples first and gradually increasing difficulty.

Cursor

An AI-native code editor built on VS Code with strong agentic features.

D

Data Augmentation

Creating additional training examples from existing data to improve model robustness.

Data Leakage

When information that would not be available at prediction time, such as test data or the target value itself, inappropriately influences model training.

Debugging

The process of finding and fixing errors in software.

Decoder

A model component that generates output sequences from an encoded representation.

Deep Learning

A branch of machine learning based on multi-layer neural networks that can learn complex patterns from large amounts of data.

Dense Model

A model where all parameters are active during every forward pass.

DevOps

Practices that combine software development and IT operations to shorten delivery cycles.

Diffusion Model

A generative model that learns to reverse a noise-adding process to create images, audio, or video.

Domain Adaptation

Adapting a model to perform better on data from a specific domain or industry.

Dropout

A regularization technique that randomly disables neurons during training to prevent co-adaptation.

E

Edge Deployment

Running models close to end users to reduce latency.

Embedding

An embedding is a list of numbers, usually called a vector, that represents the meaning of a piece of data. Semantically similar items end up close together in this numeric space, which lets a computer compare meaning using distance rather than exact keyword matches. Embedding models are trained to produce these vectors. For text, the model reads a sentence or document and outputs a dense vector, often with hundreds or thousands of dimensions. You can then measure similarity with cosine similarity or Euclidean distance. Two sentences about payment processing will have embeddings closer to each other than a sentence about baseball, even if they share no common words. Embeddings power search, recommendations, clustering, and RAG. A typical RAG system stores document chunks as vectors in a vector database and retrieves the nearest neighbors to a user's query embedding. Embeddings can also represent images, audio, and other modalities when the model is trained on multimodal data.
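To make the distance comparison concrete, here is a small sketch of cosine similarity. The three-dimensional vectors are hand-made stand-ins for real embeddings, which typically have hundreds or thousands of dimensions, and the variable names are illustrative only.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hand-made vectors standing in for embeddings of three sentences.
payments_a = [0.9, 0.1, 0.0]
payments_b = [0.8, 0.2, 0.1]
baseball = [0.0, 0.2, 0.9]

sim_related = cosine_similarity(payments_a, payments_b)
sim_unrelated = cosine_similarity(payments_a, baseball)
```

The two payment vectors point in nearly the same direction, so their similarity is close to 1, while the baseball vector is nearly orthogonal and scores close to 0.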

Embedding Model

A model that converts data into dense numerical vectors that capture semantic meaning.

Encoder

A model component that processes input into a dense internal representation.

Encoder-Decoder

An architecture that first encodes input into a representation and then decodes it into output.

Entity

A distinct object or concept represented in a knowledge graph, such as a person, place, or product.

Epoch

One complete pass through the entire training dataset during model training.

Evaluation

The process of measuring a model's performance on tasks or benchmarks.

Explainability

The degree to which a model's decisions can be understood by humans.

Extractive Summarization

Creating a summary by selecting existing sentences or phrases from the source.

F

F1 Score

The harmonic mean of precision and recall.
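As a worked example, the sketch below computes precision, recall, and their harmonic mean from raw counts. The function name and the sample counts are made up for illustration.

```python
def precision_recall_f1(true_positives, false_positives, false_negatives):
    """Return (precision, recall, f1) computed from confusion counts."""
    predicted_positives = true_positives + false_positives
    actual_positives = true_positives + false_negatives
    precision = true_positives / predicted_positives if predicted_positives else 0.0
    recall = true_positives / actual_positives if actual_positives else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # Harmonic mean: dominated by the smaller of the two values.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# 8 correct detections, 2 false alarms, 4 misses.
precision, recall, f1 = precision_recall_f1(8, 2, 4)
```

Here precision is 8/10 and recall is 8/12, giving an F1 of 8/11, which sits closer to the lower value (recall) than an arithmetic mean would.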

Factuality

The degree to which generated content is factually correct.

Fairness

The property of a model treating different groups equitably.

Feature

An individual measurable property or characteristic of data used by a model.

Feature Engineering

Transforming raw data into features that improve model performance.

Few-Shot Learning

Learning a task from only a few examples, often by including them in the prompt.

Few-Shot Prompting

Including examples of desired input-output pairs in the prompt to guide the model.

Fine-tuning

Fine-tuning is the process of further training a pre-trained model on a smaller, task-specific dataset so it becomes better at a particular job. The base model already knows grammar, facts, and reasoning from pre-training; fine-tuning teaches it the style, format, or domain you care about. Common reasons to fine-tune include matching a brand voice, classifying support tickets, extracting structured fields from documents, and improving performance on low-resource languages. You typically need hundreds to thousands of high-quality examples. Each example pairs an input with the desired output, and the model's weights are updated to reduce the error on those examples. Fine-tuning is not always the right first step. Prompt engineering, retrieval augmentation, and few-shot examples are faster and cheaper to iterate. Fine-tuning becomes worthwhile when the behavior you want is hard to describe in a prompt, must be consistent at scale, or needs to run without sending long examples every request. Techniques like LoRA and QLoRA make fine-tuning feasible on consumer hardware by training small adapter matrices instead of updating all weights; QLoRA additionally keeps the base model quantized to save memory.

First-Token Latency

The time until the first token of a response is received.

Foundation Model

A large model trained on broad data that can be adapted to many downstream tasks.

Function Calling

A model capability to generate calls to external functions with structured arguments.

G

Generalization

A model's ability to perform well on new, unseen data rather than only memorizing training examples.

Generative AI

AI systems that create new content such as text, images, audio, or code based on learned patterns.

GGUF

A binary format for storing quantized models for efficient local inference.

GitHub Copilot

An AI pair programmer from GitHub that provides code suggestions across editors.

Gradient Descent

An optimization algorithm that iteratively adjusts parameters to minimize a model's loss function.

GraphQL

A query language for APIs that allows clients to request exactly the data they need.

Ground Truth

The accurate reference answer used to evaluate model predictions.

gRPC

A high-performance RPC framework that uses protocol buffers for service definitions.

Guardrails

Controls that constrain model behavior to stay within acceptable boundaries.

H

Hallucination

When a model generates plausible-sounding but false or unsupported information.

Human-in-the-Loop

A design where humans review or approve agent actions at key decision points.

Hybrid Search

Combining vector similarity with keyword or structured filtering for retrieval.

Hyperparameter

A configuration value set before training begins, such as learning rate or batch size.

I

IDE

Integrated Development Environment, a software application that provides tools for coding.

In-Context Learning

A model's ability to learn a task from examples embedded directly in the prompt.

Inference

The process of running a trained model on new input data to produce an output or prediction.

Ingestion Pipeline

The process of loading, chunking, embedding, and storing documents for retrieval.

Input Token

A token counted from the prompt sent to a model.

Instruction Tuning

Fine-tuning a model on instruction-following examples to improve its ability to respond to user requests.

Interpretability

The study of understanding how models represent and process information internally.

J

Jailbreak

A prompt crafted to bypass a model's safety guidelines or restrictions.

JSON Mode

A model configuration that constrains output to valid JSON.

JWT

JSON Web Token, a compact format for securely transmitting claims between parties.

K

Keyword Search

Retrieving documents based on exact or approximate word matches.

Knowledge Base

A structured repository of information an AI system can query or retrieve from.

Knowledge Graph

A network of entities and relationships used to represent structured knowledge.

KV Cache

A cache of key and value tensors used to speed up autoregressive generation by avoiding redundant computation.

L

Label

The correct output associated with a training example.

Large Language Model

A neural network trained on vast text data to understand and generate human language.

Latency

The delay between sending a request and receiving a response; for streaming models it is often measured to the first token.

Learning Rate

A hyperparameter that controls how much model weights are updated during each training step.

llama.cpp

A C/C++ inference engine for running Llama and many other open-weight models efficiently on consumer hardware.

Load Balancing

Distributing requests across multiple servers to improve reliability and performance.

Local Model

A model that runs on local hardware without requiring cloud API calls.

Logging

Recording events and messages from software for analysis.

Long Context

The ability of a model to process very large context windows, often hundreds of thousands of tokens.

LoRA

Low-Rank Adaptation, a parameter-efficient fine-tuning method that updates small adapter matrices instead of all weights.

Loss Function

A function that measures how far a model's predictions are from the correct answers during training.

M

Machine Learning

A subset of AI where systems improve at tasks through experience and data without being explicitly programmed.

Matrix Factorization

A technique that decomposes user-item interaction matrices into latent factors.

Max Tokens

The maximum number of tokens a model is allowed to generate in a response.

MCP

MCP stands for Model Context Protocol. It is an open standard that lets AI clients connect to external tools, data sources, and prompts through a single, consistent interface. Anthropic introduced MCP in late 2024, and it has since been adopted by Claude Desktop, Cursor, Cline, VS Code, Windsurf, and a growing list of community clients. An MCP server is a small program that exposes three things: tools the model can call, resources the client can read, and prompts that help users accomplish common tasks. An MCP client discovers those capabilities and decides when to invoke them. Transport is usually stdio for local servers and HTTP for remote ones, originally with Server-Sent Events and, in newer revisions of the specification, Streamable HTTP. For developers, MCP removes the need to build a custom integration for every API. You write one server, and any compatible client can use it. For users, it means AI assistants can securely access files, databases, SaaS tools, and web services without each client reinventing the wheel.
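MCP messages follow JSON-RPC 2.0, so a client's tool discovery and tool invocation can be pictured as two small JSON requests. The sketch below only builds the message text and does not talk to a server; the helper `make_request`, the tool name `get_weather`, and its `city` argument are hypothetical.

```python
import json


def make_request(request_id, method, params=None):
    """Serialize a JSON-RPC 2.0 request as a single line of JSON."""
    message = {"jsonrpc": "2.0", "id": request_id, "method": method}
    if params is not None:
        message["params"] = params
    return json.dumps(message)


# Ask the server which tools it exposes.
list_tools = make_request(1, "tools/list")

# Invoke one tool by name with structured arguments.
# "get_weather" and "city" are made-up names for illustration.
call_tool = make_request(
    2,
    "tools/call",
    {"name": "get_weather", "arguments": {"city": "Paris"}},
)
```

In practice an SDK handles this framing, along with the initialization handshake that precedes these calls.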

MCP Client

An application that connects to MCP servers and uses their capabilities.

MCP Server

A program that exposes tools, resources, and prompts via the Model Context Protocol.

Mechanistic Interpretability

A research area that reverse-engineers neural networks to understand their internal circuits.

Metric

A quantitative measure of model performance, such as accuracy, F1, or BLEU.

Mixture of Experts

An architecture where only a subset of specialized sub-networks is activated per input, improving efficiency.

MLOps

Practices for deploying and maintaining machine learning models in production.

Model Context Protocol

An open standard that lets AI assistants connect to external data sources and tools through a common interface.

Moderation

Filtering or flagging content that violates safety policies.

Monitoring

Tracking system health, performance, and behavior over time.

Multi-Agent System

A system where multiple agents collaborate, compete, or delegate tasks to achieve complex goals.

Multi-Head Attention

Running multiple attention mechanisms in parallel to capture different kinds of relationships between tokens.

Multimodal Model

A model that can process, and in some cases generate, multiple types of data, such as text, images, and audio.

N

Named Entity Recognition

Identifying and classifying named entities such as people, organizations, and locations in text.

Natural Language Processing

The field focused on enabling computers to understand, interpret, and generate human language.

Nearest Neighbor

Finding the closest data points to a query in a vector space.

Neural Network

A computational model inspired by biological neurons, organized in layers that process input data to produce outputs.

Normalization

Scaling data to a standard range or distribution.

O

OAuth

An authorization framework for delegated access to resources.

Observability

The ability to understand internal system state from external outputs.

Ollama

A tool for running open-source models locally with simple commands.

On-Device

Running a model locally on a user's device rather than on a remote server.

One-Shot

Performing a task after seeing a single example.

ONNX

An open format for representing machine learning models, enabling cross-framework deployment.

OpenAPI

A specification format for describing HTTP APIs.

Orchestration

Coordinating multiple tools, agents, or services to complete a workflow.

Output Parsing

Extracting structured data from model outputs, often using schemas or regular expressions.

Output Token

A token generated by the model in its response.

Overfitting

When a model learns training data too closely and performs poorly on unseen data.

Overlap

Shared text between adjacent chunks to preserve context across chunk boundaries.
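A minimal sketch of chunking with overlap, assuming word-based chunks for simplicity; production pipelines usually count tokens or characters and respect sentence boundaries. The function name `chunk_words` is illustrative.

```python
def chunk_words(text, chunk_size, overlap):
    """Split text into word chunks where adjacent chunks share `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this chunk already reaches the end of the text
    return chunks


chunks = chunk_words("a b c d e f g h", chunk_size=4, overlap=2)
```

Each chunk repeats the last two words of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk.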

P

Parameter

An internal variable of a neural network, such as a weight or bias, that is learned during training and determines model behavior.

Parameter-Efficient Fine-Tuning

Methods that adapt a pre-trained model to new tasks while updating only a small fraction of parameters.

Perplexity

A measure of how well a probability model predicts a sample; lower is better.
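Perplexity is the exponential of the average negative log-probability the model assigned to each actual token. The sketch below assumes you already have those per-token probabilities; the sample values are invented.

```python
import math


def perplexity(token_probabilities):
    """exp of the mean negative log-probability per token."""
    n = len(token_probabilities)
    avg_neg_log_prob = -sum(math.log(p) for p in token_probabilities) / n
    return math.exp(avg_neg_log_prob)


# A model that is usually confident about the correct next token.
confident = perplexity([0.9, 0.8, 0.95])

# A model that guesses uniformly among four options at every step.
uncertain = perplexity([0.25, 0.25, 0.25, 0.25])
```

The uniform guesser has a perplexity of 4, matching the intuition that perplexity is the effective number of choices the model is hesitating between.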

Persona

A defined identity or character assigned to a model in a prompt.

Personalization

Tailoring outputs or recommendations to individual users.

Pipeline

A linear sequence of data processing or model steps.

Planning

The process of deciding which actions to take and in what order to achieve a goal.

Plugin

An add-on module that extends a system's capabilities.

Pre-training

Training a model on a large corpus to learn general language patterns before task-specific adaptation.

Precision

The proportion of predicted positives that are actually correct.

Preference Model

A model that learns to rank outputs based on which ones humans prefer.

Preprocessing

Cleaning and transforming raw data before it is fed to a model.

Prompt

The input text given to a language model to elicit a desired response.

Prompt Chaining

Breaking a complex task into a sequence of prompts where each step uses the previous output.

Prompt Engineering

Prompt engineering is the practice of crafting inputs to a language model so it produces better outputs without changing the model's weights. It covers word choice, structure, examples, constraints, and the order in which information appears. A well-engineered prompt can turn a mediocre response into a precise, actionable one. Effective prompts are usually clear, specific, and formatted. They state the task, define the audience, set the output format, and include any constraints. Adding examples, known as few-shot prompting, helps the model understand patterns that are hard to describe in words. Asking the model to reason step by step before answering, known as chain-of-thought prompting, often improves reasoning and arithmetic. Prompt engineering is iterative. You write a prompt, test it on diverse inputs, measure the results, and refine. Tools like the VePrompts Prompt Optimizer can surface issues such as ambiguity, missing constraints, or conflicting instructions. Good prompt engineering is often the fastest way to improve an AI feature before investing in fine-tuning or custom infrastructure.

Prompt Injection

An attack where malicious input overrides or leaks system instructions.

Prompt Template

A reusable prompt pattern with placeholders for variable input; in MCP, servers can provide such templates for common tasks.

Protocol Buffers

A language-neutral binary serialization format developed by Google.

Q

Quality Assurance

Processes designed to ensure products meet quality standards.

Quantization

Reducing the precision of model weights to decrease memory usage and increase inference speed.

Query Rewriting

Transforming a user query to improve retrieval, such as expanding acronyms or adding synonyms.

Quota

A maximum allowance of usage for an account or key.

R

RAG

RAG stands for Retrieval-Augmented Generation. It is a pattern that gives a language model access to information outside its training data by fetching relevant documents at query time and including them in the prompt. Instead of memorizing facts, the model reasons over retrieved snippets, which makes answers more accurate, current, and traceable. A typical RAG pipeline has four stages. First, documents are split into chunks and converted into embeddings using an embedding model. Second, those embeddings are stored in a vector database. Third, when a user asks a question, the system embeds the query and searches the database for the closest chunks. Finally, the retrieved chunks are added to the prompt as context, and the model generates an answer grounded in that evidence. RAG is especially useful when answers depend on private data, such as internal wikis, support tickets, or product documentation. It also reduces hallucination because the model can cite the retrieved text. Teams often tune RAG by changing chunk size, overlap, reranking algorithms, and query rewriting strategies.
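The four stages can be sketched end to end in a few lines. This toy version substitutes word-count vectors for a real embedding model and a plain list for a vector database, so it only matches on shared words; `embed`, `retrieve`, `build_prompt`, and the sample chunks are all hypothetical, and the final prompt is built but not sent to any model.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: a word-count vector. A real system would call an
    embedding model here."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query, index, top_k=2):
    """Return the top_k stored chunks closest to the query."""
    query_vec = embed(query)
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]


def build_prompt(question, context_chunks):
    """Place retrieved chunks ahead of the question as grounding context."""
    context = "\n".join(f"- {chunk}" for chunk in context_chunks)
    return (
        "Answer using only the context below.\n"
        f"Context:\n{context}\n"
        f"Question: {question}"
    )


# Stages 1 and 2: chunk, embed, and store (a list plays the vector database).
chunks = [
    "Refunds are issued within 5 business days",
    "Our office is closed on public holidays",
    "Shipping is free for orders over 50 dollars",
]
index = [(chunk, embed(chunk)) for chunk in chunks]

# Stages 3 and 4: retrieve for the query, then assemble the grounded prompt.
question = "How many days until refunds are issued"
top_chunks = retrieve(question, index, top_k=1)
prompt = build_prompt(question, top_chunks)
```

Swapping `embed` for a real embedding model and the list for a vector database changes the quality of retrieval but not the shape of the pipeline.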

Rate Limit

A cap on the number of requests or tokens allowed in a time window.

Re-ranking

A second-stage model that scores and reorders retrieved documents for better relevance.

ReAct

A pattern where an agent Reasons and Acts in alternating steps to solve tasks.

Reasoning Model

A model optimized for step-by-step logical reasoning and complex problem solving.

Recall

The proportion of actual positives that were correctly identified.

Recommendation

Suggesting items or actions to users based on data.

Red Teaming

Attempting to find vulnerabilities, biases, or harmful behaviors in a model.

Reflection

An agent evaluating its own output and revising it based on critique.

Regularization

Techniques used to reduce overfitting by discouraging overly complex models.

Reinforcement Learning

Learning by interacting with an environment and receiving rewards or penalties.

Reinforcement Learning from Human Feedback

Training a model using human preference signals to make outputs more helpful and harmless.

Relationship

A connection between two entities in a knowledge graph.

Resource

Read-only data exposed by an MCP server that a client can pull into context.

REST

Representational State Transfer, an architectural style for designing networked APIs.

Retrieval-Augmented Generation

Generating responses grounded in retrieved external documents to improve accuracy and recency.

Reward Model

A model trained to score outputs according to human preferences, used in RLHF.

RLHF

Short for Reinforcement Learning from Human Feedback.

Robustness

A model's ability to maintain performance under noisy or adversarial inputs.

Role Prompting

Asking the model to assume a specific role or persona to shape its responses.

ROUGE

A set of metrics for evaluating automatic summarization by comparing overlap with reference summaries.

S

Safety

Practices that reduce harmful, unethical, or dangerous model outputs.

Sandbox

An isolated execution environment that limits what code can access.

Scaling

Adjusting compute resources to handle varying workloads.

Self-Attention

Attention applied within a single sequence, allowing each token to relate to every other token.

Self-Correction

An agent identifying and fixing its own mistakes.

Semantic Search

Searching by meaning rather than exact keyword matches, often using embeddings.

Sentiment Analysis

Determining the emotional tone or opinion expressed in text.

Sequence-to-Sequence

A model architecture that maps an input sequence to an output sequence.

Similarity Search

Finding items with embeddings close to a query embedding, usually by cosine or Euclidean distance.

Skill

A specific capability advertised by an agent or service.

Sparse Model

A model where most parameters are inactive for any given input, reducing compute per forward pass.

SSE

Server-Sent Events, an HTTP-based transport for streaming messages from server to client.

Standardization

Transforming data to have zero mean and unit variance.
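A short sketch of the transformation, often called z-scoring, using the population standard deviation. The function name and sample values are illustrative, and the sketch assumes the data is not constant (a zero standard deviation would cause a division by zero).

```python
import statistics


def standardize(values):
    """Subtract the mean and divide by the population standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


# Sample data with mean 5 and population standard deviation 2.
scaled = standardize([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
```

After the transformation the values have mean 0 and standard deviation 1, so features measured in different units become comparable.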

STDIO

A transport that uses standard input and output for local MCP server communication.

Stop Sequence

A string that signals the model to stop generating further tokens.

Structured Output

Requiring the model to produce output conforming to a defined schema.

Summarization

Producing a shorter version of a longer text while preserving key information.

Supervised Learning

Training a model on labeled input-output pairs.

Synthetic Data

Data generated by models or simulations rather than collected from real-world sources.

System Prompt

A system prompt is the high-level instruction that sets the model's role, tone, constraints, and behavior for a conversation. It is sent once at the start of the context and influences every response that follows. While users see the assistant's reply, they usually do not see the system prompt unless the application exposes it. A good system prompt is specific and scoped. Instead of saying "you are helpful," it might say "you are a senior React reviewer who gives concise feedback in bullet points, flags security issues, and never writes full code replacements." This reduces ambiguity and makes the model's output more consistent across sessions. System prompts are also the first line of defense for safety and product requirements. You can use them to enforce output formats, reject off-topic requests, require citations, or ask the model to disclose uncertainty. Because they carry so much influence, small changes to a system prompt often produce larger improvements than adding more examples to user messages.

T

Temperature

Temperature is a sampling parameter that controls how random a language model's outputs are. It scales the logits, or raw scores, that the model assigns to each possible next token before a token is chosen. A lower temperature makes the model more conservative and deterministic; a higher temperature makes it more creative and varied. At temperature zero, the model almost always picks the highest-scoring token, which is ideal for tasks like code generation, factual answers, and structured output where consistency matters. At temperature one or above, the model is more willing to sample lower-scoring tokens, which can produce surprising phrasing, creative writing, and diverse brainstorming ideas. There is no universal best setting. Coding and data extraction usually benefit from low temperatures around 0.1 to 0.3. Marketing copy, fiction, and idea generation often feel better at 0.7 to 1.0. If outputs are too repetitive, raise the temperature. If they become erratic or off-topic, lower it.
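The scaling step can be shown directly: each logit is divided by the temperature before the softmax turns the scores into probabilities. The sketch below uses made-up logits for three candidate tokens, and the function name is illustrative. Temperature zero is not handled because it would divide by zero; in practice it is treated as always picking the top token.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive; treat 0 as greedy argmax")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Invented raw scores for three candidate next tokens.
logits = [2.0, 1.0, 0.1]
cold = softmax_with_temperature(logits, 0.2)
neutral = softmax_with_temperature(logits, 1.0)
hot = softmax_with_temperature(logits, 2.0)
```

At 0.2 nearly all probability lands on the top token, while at 2.0 the distribution flattens and the lower-scoring tokens are sampled much more often.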

Test-Time Compute

Spending more computation during inference to improve output quality, such as through reasoning or search.

Testing

Evaluating software to ensure it behaves as expected.

Throttling

Slowing or limiting requests to enforce rate limits.

Throughput

The number of tokens or requests processed per unit of time.

Token