Appendix B - UOp Summary

UOps form the intermediate representation (IR) in tinygrad’s computation graph after the initial Tensor operations are defined and before final code generation. They represent a lower-level, more explicit graph that undergoes several optimization and transformation passes.

Meta / Framework Ops

These UOps are primarily used by the tinygrad framework itself for graph structure, scheduling, device management, and metadata, rather than direct computation on tensor data.

`SINK`

Description: Represents the final output(s) of a computation graph or kernel. It acts as a root node for graph traversal and scheduling.
Purpose: Marks the end of a computation path that needs to be realized. Used to define the boundaries of kernels during scheduling.
Args: Optional KernelInfo containing metadata about the kernel (name, local dims, etc.).
Sources: One or more UOps that produce the final results (often STORE or ASSIGN ops).
Stage: Graph Construction, Scheduling (marks kernel boundaries).
Notes: Essential for defining what needs to be computed. Multiple UOps can feed into a single SINK. Simplified away before final rendering.

`KERNEL`

Description: An internal UOp used during scheduling to encapsulate the operations belonging to a single kernel launch.
Purpose: Groups UOps together that will be executed as one unit on the target device. Helps manage kernel boundaries and dependencies.
Args: Kernel object containing the kernel’s AST and metadata.
Sources: UOps representing the inputs required by the kernel (often BUFFER or other KERNEL outputs via ASSIGN).
Stage: Scheduling.
Notes: This is a temporary node used by the scheduler and is expanded/replaced before final code generation.

`NAME`

Description: Assigns a name (string) to a UOp, typically used for naming kernels or variables in the generated code.
Purpose: Code readability and debugging. Allows generated kernels/variables to have meaningful names based on the high-level operations.
Args: str - the desired name.
Sources: Usually none, but can wrap other ops during specific transformations.
Stage: Scheduling, Code Generation.
Notes: Often associated with SINK or DEFINE_VAR.

`DEVICE`

Description: Represents a target device (e.g., “CPU”, “CUDA:0”).
Purpose: Specifies the device for memory allocation or computation. Used as a source for BUFFER, COPY, CONST.
Args: str - the device name.
Sources: None.
Stage: Graph Construction, Scheduling.

`MULTI`

Description: Represents a tensor sharded across multiple devices.
Purpose: Manages data parallelism by holding references to UOps representing shards on different devices.
Args: (Optional[int], tuple[bool, ...]) - The sharding axis (or None) and a tuple indicating which shards hold real data (vs padding/zeros).
Sources: Multiple UOps, each representing a shard on a specific device.
Stage: Graph Construction (created by Tensor.shard).
Notes: Handled by a specific rewrite pass (get_multi_map) to distribute operations across shards.

`COPY`

Description: Copies data from one buffer/device to another.
Purpose: Explicit data movement between devices or cloning a buffer.
Args: bool - clone=True forces a new allocation even on the same device.
Sources: (DEVICE, UOp) - Target device and the UOp to copy from.
Stage: Graph Construction, Scheduling.
Notes: Simplified away if source and destination devices are the same and clone=False. Can become BufferXfer for optimized transfers.

`ASSIGN`

Description: Represents an assignment operation. At the Tensor level, it signifies tensor.assign(other_tensor). At the UOp level after lowering, it often represents updating an accumulator (DEFINE_ACC).
Purpose: In-place modification of data (conceptually) or accumulator updates.
Args: None.
Sources: (target, value) - The UOp being assigned to (often BUFFER or DEFINE_ACC) and the new value UOp.
Stage: Graph Construction, Lowering (Accumulator updates).
Notes: High-level ASSIGN is often lowered to STORE or kernel operations. Low-level ASSIGN is crucial for reductions.

`BIND`

Description: Binds a symbolic Variable (DEFINE_VAR) to a concrete integer value (CONST).
Purpose: Resolves symbolic dimensions or parameters at runtime or during JIT compilation.
Args: None.
Sources: (DEFINE_VAR, CONST) - The variable and the constant value it’s bound to.
Stage: Graph Construction (Symbolic Shapes), JIT.
Notes: Removed during scheduling/lowering by substituting the variable with its bound constant value.

`DEFINE_VAR`

Description: Defines a symbolic variable, typically representing a dimension size.
Purpose: Allows for operations on tensors with shapes that are not known at compile time (symbolic shapes).
Args: (name: str, min_val: int, max_val: int) - Name and range of the variable.
Sources: None.
Stage: Graph Construction (Symbolic Shapes).
Notes: Usually bound using BIND before execution.

`UNIQUE`

Description: Represents a unique identifier, typically used for buffer allocation.
Purpose: Ensures that different BUFFER UOps representing distinct allocations get unique identities, even if they have the same size/dtype/device.
Args: int - A unique integer.
Sources: None.
Stage: Graph Construction (Buffer creation).

`EMPTY`

Description: Represents an empty tensor (placeholder).
Purpose: Used internally, often as a starting point for tensor creation before data is specified (e.g., Tensor.empty).
Args: None.
Sources: None.
Stage: Graph Construction.
Notes: Usually replaced or filled quickly.

`NOOP`

Description: A no-operation instruction.
Purpose: Used as a placeholder or identity operation during graph transformations, often inserted and then removed by subsequent passes. Can sometimes force materialization in specific backends (e.g., before BITCAST).
Args: None.
Sources: (UOp,) - The UOp to pass through.
Stage: Intermediate Transformations.

`CUSTOM` / `CUSTOMI`

Description: Represents a custom operation defined by a string, potentially with specific backend implementations. CUSTOMI implies inline code.
Purpose: Allows extending tinygrad with operations not covered by standard Ops, often for backend-specific intrinsics or complex fused operations.
Args: str - A format string for the custom code.
Sources: Variable number of UOps, used to fill the format string.
Stage: Lowering, Code Generation.

Constants

`CONST`

Description: Represents a scalar constant value.
Purpose: Embeds constant values directly into the computation graph.
Args: ConstType (int, float, bool) - The constant value.
Sources: Usually none, but can have a VIEW(DEVICE) source to indicate device placement.
Stage: All stages.
Notes: Tensor.full creates CONST ops expanded to the correct shape.

`VCONST`

Description: Represents a vector constant value.
Purpose: Similar to CONST but for vector types.
Args: tuple[ConstType, ...] - The tuple of constant values.
Sources: Usually none.
Stage: All stages.
Notes: Often lowered to VECTORIZE(CONST, CONST, ...) during rendering.

Movement Ops

These ops change the logical view (shape, strides, offset, mask) of data without necessarily moving or copying it in memory immediately. They primarily manipulate the ShapeTracker associated with a buffer.

`RESHAPE`

Description: Changes the shape of the tensor while preserving the total number of elements.
Purpose: Modifies the logical dimensions.
Args: tuple[sint, ...] - The new shape.
Sources: (UOp,) - The input UOp.
Stage: High-level, Lowering.
Notes: Lowered into VIEW ops.

`PERMUTE`

Description: Reorders the dimensions of the tensor.
Purpose: Changes the logical order of axes.
Args: tuple[int, ...] - The permutation order.
Sources: (UOp,) - The input UOp.
Stage: High-level, Lowering.
Notes: Lowered into VIEW ops.

`EXPAND`

Description: Expands dimensions of size 1 to a larger size.
Purpose: Broadcasting.
Args: tuple[sint, ...] - The target shape with expanded dimensions.
Sources: (UOp,) - The input UOp.
Stage: High-level, Lowering.
Notes: Lowered into VIEW ops. Stride becomes 0 for expanded dimensions.

`PAD`

Description: Adds padding to the tensor along specified dimensions.
Purpose: Increases the size of dimensions, typically for convolutions or alignment.
Args: tuple[tuple[sint, sint], ...] - Padding amounts (before, after) for each dimension.
Sources: (UOp,) - The input UOp.
Stage: High-level, Lowering.
Notes: Lowered into VIEW ops. Adjusts offset and mask.

`SHRINK`

Description: Shrinks the tensor along specified dimensions by selecting a sub-region.
Purpose: Cropping or selecting parts of a tensor.
Args: tuple[tuple[sint, sint], ...] - Start and end indices (exclusive) for shrinking each dimension.
Sources: (UOp,) - The input UOp.
Stage: High-level, Lowering.
Notes: Lowered into VIEW ops. Adjusts offset and mask.

`FLIP`

Description: Reverses the order of elements along specified dimensions.
Purpose: Data augmentation or specific algorithms requiring reversed views.
Args: tuple[bool, ...] - A boolean tuple indicating which axes to flip.
Sources: (UOp,) - The input UOp.
Stage: High-level, Lowering.
Notes: Lowered into VIEW ops. Modifies strides and offset.

Lowering / Indexing Ops

These UOps appear during the lowering process, translating logical views and operations into memory accesses and validity checks.

`VIEW`

Description: Represents a logical view (ShapeTracker) applied to a base buffer UOp. This is the primary way movement ops are represented after initial lowering.
Purpose: Encapsulates shape, stride, offset, and mask information without creating new data. Connects logical tensor operations to underlying buffer representations.
Args: ShapeTracker - The view information.
Sources: (UOp,) - The base UOp (often BUFFER, CONST, or another VIEW). Can also have a DEVICE source for CONST.
Stage: Lowering, Scheduling.
Notes: Multiple VIEW ops are merged into one. CONTIGUOUS ops often trigger realization before a VIEW. VIEW ops are pushed towards memory operations (LOAD/STORE) or constants during simplification.

`INDEX`

Description: Calculates a memory address/index based on a buffer and logical indices, potentially applying a validity mask.
Purpose: Translates multi-dimensional logical indexing into a linear memory index for LOAD and STORE. Encapsulates the ShapeTracker.to_indexed_uops logic.
Args: None.
Sources: (buffer_uop, logical_indices_uop, Optional[valid_uop]) - The buffer (e.g., DEFINE_GLOBAL), the calculated index expression, and an optional validity mask (dtypes.bool).
Stage: Lowering (post rewrite_shapetracker_with_index), Codegen.
Notes: This is part of the “new style” load/store introduced to simplify rendering.

`VALID`

Description: Represents the validity mask derived from a ShapeTracker’s mask attribute.
Purpose: Computes whether a given logical index corresponds to a valid element within the original (unpadded, unshrunk) data. Used for masking operations, especially loads/stores near boundaries.
Args: None.
Sources: (VIEW,) - The VIEW UOp containing the ShapeTracker with mask information.
Stage: Lowering.
Notes: Often simplified or combined with index calculations. Becomes the valid part of INDEX or the gate in LOAD/STORE.

`GEP` (Get Element Pointer)

Description: Extracts specific elements from a vector UOp.
Purpose: Accessing individual lanes of a vectorized operation or constant.
Args: tuple[int, ...] - The indices of the elements to extract.
Sources: (UOp,) - The vector UOp.
Stage: Codegen, Final Rendering.
Notes: The inverse of VECTORIZE. Allows scalar operations on elements previously combined into a vector.

Memory Ops

These UOps deal directly with memory allocation, definition, and access.

`BUFFER`

Description: Represents a raw memory buffer allocated on a specific device. This is the “base” UOp for most tensor data after initial allocation.
Purpose: Holds the reference to the actual allocated memory used by tensors.
Args: int - Size of the buffer in elements.
Sources: (DEVICE, UNIQUE) - The device and a unique identifier.
Stage: Graph Construction, Scheduling.
Notes: BUFFER UOps map to Buffer objects which manage the actual memory. BUFFER itself doesn’t have a ShapeTracker; VIEW ops are applied on top.

`BUFFER_VIEW`

Description: Represents a view of an existing BUFFER UOp, potentially with an offset. Introduced for DISK buffers.
Purpose: Allows accessing parts of a larger buffer (e.g., a file on disk) without loading the entire thing.
Args: (size: int, offset: int) - Size in elements and offset in elements from the base buffer.
Sources: (BUFFER,) - The base buffer.
Stage: Scheduling (Specific backends like DISK).

`DEFINE_GLOBAL`

Description: Defines a global buffer in the kernel arguments.
Purpose: Declares input/output buffers passed into the kernel.
Args: int (optional, buffer index) or None.
Sources: None.
Stage: Final Lowering (inside linearize_uop), Codegen.
Notes: Has a PtrDType. The renderer uses this to generate kernel signatures.

`DEFINE_LOCAL`

Description: Defines a buffer in local (shared) memory.
Purpose: Allocation of shared memory for intermediate results accessible by threads within a workgroup.
Args: str - Name for the local buffer.
Sources: None.
Stage: Lowering, Codegen.
Notes: Has a PtrDType with local=True. Requires synchronization (BARRIER).

`DEFINE_ACC`

Description: Defines an accumulator register or variable, typically initialized to an identity element and used within reduction loops.
Purpose: Holds the intermediate state during reduction operations.
Args: (int,) - An accumulator index/identifier.
Sources: (initial_value, *reduce_ranges) - The identity element (CONST) and the RANGE UOps defining the reduction loops.
Stage: Lowering (created from REDUCE_AXIS).
Notes: Lowered REDUCE_AXIS becomes DEFINE_ACC -> ALU -> ASSIGN(acc).

`LOAD`

Description: Loads data from memory (global or local).
Purpose: Reading data from a buffer into registers/variables for computation.
Args: Optional load configuration (e.g., cache hints, not currently used extensively).
Sources:
- Old style (pre-linearize): (BUFFER, VIEW, Optional[STORE]) - Buffer, ShapeTracker view, optional dependency.
- New style (post-linearize): (INDEX, Optional[alt_value], Optional[gate], Optional[BARRIER]) - Indexed address, value if gate is false, gate condition, barrier dependency.
Stage: Lowering, Codegen.
Notes: Validity checks (from ShapeTracker masks) are incorporated into the INDEX or gate.

`STORE`

Description: Stores data into memory (global or local).
Purpose: Writing computation results back to a buffer.
Args: None.
Sources:
- Old style (pre-linearize): (BUFFER, VIEW, value) - Buffer, ShapeTracker view, value to store.
- New style (post-linearize): (INDEX, value, Optional[gate]) - Indexed address, value to store, optional gate condition.
Stage: Lowering, Codegen.
Notes: Often the final operation(s) feeding into a SINK.

Core Compute / ALU Ops

These perform basic element-wise arithmetic, logical, comparison, and transcendental operations. They generally expect sources to have the same shape and dtype (except for comparisons and specific cases like WHERE).

Unary (`EXP2`, `LOG2`, `SIN`, `SQRT`, `RECIP`, `NEG`)

Description: Apply standard unary mathematical functions.
Purpose: Element-wise computation.
Args: None.
Sources: (UOp,) - The input UOp.
Stage: All stages.
Notes: NEG is often represented as x * -1. Transcendental ops (EXP2, LOG2, SIN) might be rewritten to use approximations or backend-specific implementations.

Binary (`ADD`, `MUL`, `IDIV`, `MAX`, `MOD`, `CMPLT`, `CMPNE`, `XOR`, `SHL`, `SHR`, `OR`, `AND`, `SUB`, `FDIV`, `POW`)

Description: Apply standard binary mathematical or logical functions.
Purpose: Element-wise computation between two operands.
Args: None.
Sources: (UOp, UOp) - The two input UOps.
Stage: All stages.
Notes:
- Commutative ops (ADD, MUL, MAX, CMPNE, XOR, AND, OR) might have sources swapped during optimization.
- IDIV is integer division (truncates towards zero). FDIV (internal) represents float division (x / y), often lowered to x * RECIP(y).
- CMPLT/CMPNE output bool dtype.
- SUB is often represented as x + (-y).

Ternary (`WHERE`, `MULACC`)

Description: Apply standard ternary functions.
Purpose: Element-wise computation involving three operands.
Args: None.
Sources:
- WHERE: (condition, true_value, false_value) - Condition must be bool.
- MULACC: (a, b, c) - Computes a * b + c.
Stage: All stages.
Notes: MULACC (Multiply-Accumulate) can often map efficiently to hardware FMA (Fused Multiply-Add) instructions.

`CAST`

Description: Changes the data type of the elements.
Purpose: Type conversion (e.g., float to int, int to float, float16 to float32).
Args: None.
Sources: (UOp,) - The input UOp.
Stage: All stages.
Notes: Behavior depends on the source and destination types (truncation, rounding, etc.).

`BITCAST`

Description: Reinterprets the bits of the elements as a different data type of the same size.
Purpose: Low-level manipulation, often used for specific algorithms or interacting with hardware types (e.g., float <-> int).
Args: None.
Sources: (UOp,) - The input UOp.
Stage: All stages.
Notes: Does not change the underlying bit pattern, only the type interpretation. Requires source and destination dtypes to have the same itemsize.

Reduce Ops

`REDUCE_AXIS`

Description: Performs a reduction operation (like sum, max) along specified axes.
Purpose: Aggregates data across dimensions.
Args: (Ops, tuple[int, ...]) - The reduction operation (e.g., Ops.ADD, Ops.MAX) and the axes to reduce.
Sources: (UOp,) - The input UOp.
Stage: High-level, Lowering.
Notes: Lowered into a combination of DEFINE_ACC, RANGE loops, ALU ops, and ASSIGN. Can be split or grouped during optimization.

`WMMA` (Warp Matrix Multiply Accumulate)

Description: Represents a hardware-accelerated matrix multiplication operation performed cooperatively by a group of threads (warp/wavefront).
Purpose: Leverages specialized hardware units (like Tensor Cores on NVIDIA GPUs, Matrix Cores on AMD GPUs, AMX on Apple Silicon) for high-performance matrix multiplication.
Args: (name, dims, dtype_in, dtype_out, device, threads, upcast_axes, reduce_axes) - Detailed configuration for the WMMA operation.
Sources: (A, B, C) - Input matrices A, B, and accumulator C.
Stage: Codegen (inserted by optimization passes like apply_tensor_cores).
Notes: Highly backend-specific. Requires specific data layouts and operand types.

Control Flow Ops

`RANGE`

Description: Represents a loop range, typically used for iterating over tensor dimensions.
Purpose: Defines the iteration space for loops in the generated code.
Args: int - An identifier for the loop variable (axis index).
Sources: (start, end) - UOps defining the start (inclusive) and end (exclusive) of the loop.
Stage: Lowering (created by get_index), Codegen.
Notes: Rendered as for loops in C-style backends. Used as sources for DEFINE_ACC.

`IF`

Description: Represents a conditional block.
Purpose: Conditional execution in the generated code.
Args: None.
Sources: (condition, Optional[BARRIER]) - The boolean condition UOp and an optional barrier dependency.
Stage: Lowering, Codegen.
Notes: Requires a corresponding ENDIF. Code between IF and ENDIF is executed only if the condition is true. Used for gating LOAD/STORE.

`ENDRANGE` / `ENDIF`

Description: Marks the end of a RANGE or IF block, respectively.
Purpose: Defines the scope of loops and conditionals.
Args: None.
Sources: (RANGE,) or (IF,) - The corresponding start block UOp.
Stage: Lowering, Codegen.
Notes: Rendered as closing braces } in C-style backends.

`BARRIER`

Description: Represents a synchronization point, typically for local (shared) memory.
Purpose: Ensures that all threads in a workgroup reach this point before any thread proceeds, making writes to shared memory visible to other threads.
Args: None.
Sources: Usually (STORE,) operations to local memory that need to complete before subsequent reads. Can also be a source for IF.
Stage: Lowering, Codegen.
Notes: Essential for correctness when using shared memory for reductions or caching.

Vectorization / Structure Ops

`VECTORIZE`

Description: Combines multiple scalar UOps into a single vector UOp.
Purpose: Explicitly represents vector operations for backends that support them. Also used internally as a structural node (e.g., for VCONST).
Args: None.
Sources: (scalar_uop_1, scalar_uop_2, ...) - The scalar UOps to be combined.
Stage: Codegen, Final Rendering.
Notes: The inverse of GEP. Renderers translate this into vector types and operations if supported.

`UNROLL`

Description: Represents loop unrolling during code generation. Structurally similar to VECTORIZE but used for axes that are fully unrolled rather than vectorized.
Purpose: Optimization to reduce loop overhead by duplicating the loop body. Also used structurally in Tensor Core lowering.
Args: tuple[tuple[int, int], ...] - The axes being unrolled and their sizes ((axis, size), ...).
Sources: (UOp,) - The UOp whose unrolled axes are represented.
Stage: Codegen (inserted by Kernel.apply_opt), Expansion.
Notes: The expander pass (do_expand) processes UNROLL ops, effectively performing the unrolling by duplicating and adjusting source UOps.

`CONTRACT`

Description: Represents the contraction (summation) part of a vectorized reduction or Tensor Core operation. Inverse of UNROLL for specific axes.
Purpose: Used structurally during the expansion/contraction passes related to vectorization and Tensor Cores.
Args: tuple[tuple[int, int], ...] - The axes being contracted and their sizes ((axis, size), ...).
Sources: (UOp,) - The UOp (often an UNROLL) being contracted.
Stage: Expansion.

`CAT`

Description: Concatenates multiple vector UOps. (Internal use).
Purpose: Used during expansion/vectorization passes to combine vectors.
Args: None.
Sources: Multiple vector UOps.
Stage: Expansion.
Notes: Lowered to VECTORIZE with GEP sources before rendering.

Internal / Removed Early Ops

These ops exist briefly during graph construction or early simplification but are typically removed before significant lowering or scheduling.

`CONTIGUOUS` / `CONTIGUOUS_BACKWARD`

Description: Marks a requirement for the data to be in contiguous memory layout. CONTIGUOUS_BACKWARD affects the backward pass only.
Purpose: Triggers realization or specific memory layouts. Often used before operations that require contiguous inputs (like some COPY operations or external calls).
Args: None.
Sources: (UOp,) - The input UOp.
Stage: Graph Construction, Early Simplification.
Notes: Usually removed by simplification rules (sym) if the input is already known to be contiguous or if the operation can be achieved via a VIEW.

`DETACH`

Description: Removes the UOp from the computation graph for gradient calculation purposes.
Purpose: Implements Tensor.detach().
Args: None.
Sources: (UOp,) - The input UOp.
Stage: Graph Construction, Early Simplification.
Notes: Removed by simplification rules (sym).

`BLOCK` / `BLOCKSTART` / `BLOCKFORK` / `BLOCKEND`

Description: Internal ops used by linearize_uop to group UOps into basic blocks based on control flow (RANGE, IF).
Purpose: Facilitates structuring the UOp list into code blocks for rendering.
Args: BasicBlock or int.
Sources: Variable, depending on the block structure.
Stage: Codegen (linearize_uop).
Notes: These are temporary structural nodes used only within the linearization process and do not appear in the final UOp list passed to the renderer.

This summary covers the primary UOps and their roles. The exact behavior and interactions can be complex, as they are subject to numerous rewrite rules during the compilation process.

Meta / Framework Ops

SINK

KERNEL

NAME

DEVICE

MULTI

COPY

ASSIGN

BIND

DEFINE_VAR

UNIQUE

EMPTY

NOOP

CUSTOM / CUSTOMI

Constants

CONST

VCONST

Movement Ops

RESHAPE

PERMUTE

EXPAND

PAD

SHRINK

FLIP

Lowering / Indexing Ops

VIEW

INDEX

VALID

GEP (Get Element Pointer)