1.1 - UOp Summary
UOps form the intermediate representation (IR) in tinygrad’s computation graph after the initial Tensor operations are defined and before final code generation. They represent a lower-level, more explicit graph that undergoes several optimization and transformation passes.
Meta / Framework Ops
These UOps are primarily used by the tinygrad framework itself for graph structure, scheduling, device management, and metadata, rather than direct computation on tensor data.
SINK
- Description: Represents the final output(s) of a computation graph or kernel. It acts as a root node for graph traversal and scheduling.
- Purpose: Marks the end of a computation path that needs to be realized. Used to define the boundaries of kernels during scheduling.
- Args: Optional KernelInfo containing metadata about the kernel (name, local dims, etc.).
- Sources: One or more UOps that produce the final results (often STORE or ASSIGN ops).
- Stage: Graph Construction, Scheduling (marks kernel boundaries).
- Notes: Essential for defining what needs to be computed. Multiple UOps can feed into a single SINK. Simplified away before final rendering.
KERNEL
- Description: An internal UOp used during scheduling to encapsulate the operations belonging to a single kernel launch.
- Purpose: Groups UOps together that will be executed as one unit on the target device. Helps manage kernel boundaries and dependencies.
- Args: Kernel object containing the kernel’s AST and metadata.
- Sources: UOps representing the inputs required by the kernel (often BUFFER or other KERNEL outputs via ASSIGN).
- Stage: Scheduling.
- Notes: This is a temporary node used by the scheduler and is expanded/replaced before final code generation.
NAME
- Description: Assigns a name (string) to a UOp, typically used for naming kernels or variables in the generated code.
- Purpose: Code readability and debugging. Allows generated kernels/variables to have meaningful names based on the high-level operations.
- Args: str - The desired name.
- Sources: Usually none, but can wrap other ops during specific transformations.
- Stage: Scheduling, Code Generation.
- Notes: Often associated with SINK or DEFINE_VAR.
DEVICE
- Description: Represents a target device (e.g., “CPU”, “CUDA:0”).
- Purpose: Specifies the device for memory allocation or computation. Used as a source for BUFFER, COPY, and CONST.
- Args: str - The device name.
- Sources: None.
- Stage: Graph Construction, Scheduling.
MULTI
- Description: Represents a tensor sharded across multiple devices.
- Purpose: Manages data parallelism by holding references to UOps representing shards on different devices.
- Args: (Optional[int], tuple[bool, ...]) - The sharding axis (or None) and a tuple indicating which shards hold real data (vs padding/zeros).
- Sources: Multiple UOps, each representing a shard on a specific device.
- Stage: Graph Construction (created by Tensor.shard).
- Notes: Handled by a specific rewrite pass (get_multi_map) to distribute operations across shards.
COPY
- Description: Copies data from one buffer/device to another.
- Purpose: Explicit data movement between devices or cloning a buffer.
- Args: bool - clone=True forces a new allocation even on the same device.
- Sources: (DEVICE, UOp) - Target device and the UOp to copy from.
- Stage: Graph Construction, Scheduling.
- Notes: Simplified away if source and destination devices are the same and clone=False. Can become BufferXfer for optimized transfers.
ASSIGN
- Description: Represents an assignment operation. At the Tensor level, it signifies tensor.assign(other_tensor). At the UOp level after lowering, it often represents updating an accumulator (DEFINE_ACC).
- Purpose: In-place modification of data (conceptually) or accumulator updates.
- Args: None.
- Sources: (target, value) - The UOp being assigned to (often BUFFER or DEFINE_ACC) and the new value UOp.
- Stage: Graph Construction, Lowering (Accumulator updates).
- Notes: High-level ASSIGN is often lowered to STORE or kernel operations. Low-level ASSIGN is crucial for reductions.
BIND
- Description: Binds a symbolic Variable (DEFINE_VAR) to a concrete integer value (CONST).
- Purpose: Resolves symbolic dimensions or parameters at runtime or during JIT compilation.
- Args: None.
- Sources: (DEFINE_VAR, CONST) - The variable and the constant value it’s bound to.
- Stage: Graph Construction (Symbolic Shapes), JIT.
- Notes: Removed during scheduling/lowering by substituting the variable with its bound constant value.
DEFINE_VAR
- Description: Defines a symbolic variable, typically representing a dimension size.
- Purpose: Allows for operations on tensors with shapes that are not known at compile time (symbolic shapes).
- Args: (name: str, min_val: int, max_val: int) - Name and range of the variable.
- Sources: None.
- Stage: Graph Construction (Symbolic Shapes).
- Notes: Usually bound using BIND before execution.
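The relationship between DEFINE_VAR and BIND can be sketched in plain Python. This is a toy model, not tinygrad's API: `ToyVar` and `resolve_shape` are hypothetical names, and the only behaviour taken from the text above is that a variable carries a name and a (min, max) range, and that binding substitutes a concrete integer for it.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ToyVar:
    """Hypothetical stand-in for a DEFINE_VAR: a name plus an inclusive range."""
    name: str
    min_val: int
    max_val: int

    def bind(self, value: int):
        # Stand-in for BIND: pair the variable with a concrete value in range.
        if not (self.min_val <= value <= self.max_val):
            raise ValueError(f"{value} is outside [{self.min_val}, {self.max_val}]")
        return (self, value)

def resolve_shape(shape, bindings):
    # Substitute each bound variable with its constant, as the notes on BIND describe.
    return tuple(bindings[d] if isinstance(d, ToyVar) else d for d in shape)

seq = ToyVar("seq_len", 1, 128)
var, val = seq.bind(17)
concrete = resolve_shape((4, seq), {var: val})
print(concrete)  # (4, 17)
```

Keeping the range on the variable lets a graph be built once for any value in that range and only specialised when the binding is known.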
UNIQUE
- Description: Represents a unique identifier, typically used for buffer allocation.
- Purpose: Ensures that different BUFFER UOps representing distinct allocations get unique identities, even if they have the same size/dtype/device.
- Args: int - A unique integer.
- Sources: None.
- Stage: Graph Construction (Buffer creation).
EMPTY
- Description: Represents an empty tensor (placeholder).
- Purpose: Used internally, often as a starting point for tensor creation before data is specified (e.g., Tensor.empty).
- Args: None.
- Sources: None.
- Stage: Graph Construction.
- Notes: Usually replaced or filled quickly.
NOOP
- Description: A no-operation instruction.
- Purpose: Used as a placeholder or identity operation during graph transformations, often inserted and then removed by subsequent passes. Can sometimes force materialization in specific backends (e.g., before BITCAST).
- Args: None.
- Sources: (UOp,) - The UOp to pass through.
- Stage: Intermediate Transformations.
CUSTOM / CUSTOMI
- Description: Represents a custom operation defined by a string, potentially with specific backend implementations. CUSTOMI implies inline code.
- Purpose: Allows extending tinygrad with operations not covered by standard Ops, often for backend-specific intrinsics or complex fused operations.
- Args: str - A format string for the custom code.
- Sources: Variable number of UOps, used to fill the format string.
- Stage: Lowering, Code Generation.
Constants
CONST
- Description: Represents a scalar constant value.
- Purpose: Embeds constant values directly into the computation graph.
- Args: ConstType (int, float, bool) - The constant value.
- Sources: Usually none, but can have a VIEW(DEVICE) source to indicate device placement.
- Stage: All stages.
- Notes: Tensor.full creates CONST ops expanded to the correct shape.
VCONST
- Description: Represents a vector constant value.
- Purpose: Similar to CONST but for vector types.
- Args: tuple[ConstType, ...] - The tuple of constant values.
- Sources: Usually none.
- Stage: All stages.
- Notes: Often lowered to VECTORIZE(CONST, CONST, ...) during rendering.
Movement Ops
These ops change the logical view (shape, strides, offset, mask) of data without necessarily moving or copying it in memory immediately. They primarily manipulate the ShapeTracker associated with a buffer.
RESHAPE
- Description: Changes the shape of the tensor while preserving the total number of elements.
- Purpose: Modifies the logical dimensions.
- Args: tuple[sint, ...] - The new shape.
- Sources: (UOp,) - The input UOp.
- Stage: High-level, Lowering.
- Notes: Lowered into VIEW ops.
PERMUTE
- Description: Reorders the dimensions of the tensor.
- Purpose: Changes the logical order of axes.
- Args: tuple[int, ...] - The permutation order.
- Sources: (UOp,) - The input UOp.
- Stage: High-level, Lowering.
- Notes: Lowered into VIEW ops.
EXPAND
- Description: Expands dimensions of size 1 to a larger size.
- Purpose: Broadcasting.
- Args: tuple[sint, ...] - The target shape with expanded dimensions.
- Sources: (UOp,) - The input UOp.
- Stage: High-level, Lowering.
- Notes: Lowered into VIEW ops. Stride becomes 0 for expanded dimensions.
PAD
- Description: Adds padding to the tensor along specified dimensions.
- Purpose: Increases the size of dimensions, typically for convolutions or alignment.
- Args: tuple[tuple[sint, sint], ...] - Padding amounts (before, after) for each dimension.
- Sources: (UOp,) - The input UOp.
- Stage: High-level, Lowering.
- Notes: Lowered into VIEW ops. Adjusts offset and mask.
SHRINK
- Description: Shrinks the tensor along specified dimensions by selecting a sub-region.
- Purpose: Cropping or selecting parts of a tensor.
- Args: tuple[tuple[sint, sint], ...] - Start and end indices (end exclusive) for shrinking each dimension.
- Sources: (UOp,) - The input UOp.
- Stage: High-level, Lowering.
- Notes: Lowered into VIEW ops. Adjusts offset and mask.
FLIP
- Description: Reverses the order of elements along specified dimensions.
- Purpose: Data augmentation or specific algorithms requiring reversed views.
- Args: tuple[bool, ...] - A boolean tuple indicating which axes to flip.
- Sources: (UOp,) - The input UOp.
- Stage: High-level, Lowering.
- Notes: Lowered into VIEW ops. Modifies strides and offset.
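How several movement ops reduce to bookkeeping on shape, strides, and offset can be illustrated with a small pure-Python model. This is a sketch only: `ToyView` is a hypothetical class and not tinygrad's ShapeTracker, it omits masks (so PAD is not covered) and RESHAPE, and it assumes row-major contiguous strides for the base buffer.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ToyView:
    """Hypothetical, mask-free stand-in for a view: shape, strides, offset."""
    shape: tuple
    strides: tuple
    offset: int = 0

    @staticmethod
    def contiguous(shape):
        # Row-major strides: the last axis moves fastest.
        strides, acc = [], 1
        for size in reversed(shape):
            strides.append(acc)
            acc *= size
        return ToyView(tuple(shape), tuple(reversed(strides)))

    def permute(self, order):
        # PERMUTE: reorder shape and strides together; no data moves.
        return ToyView(tuple(self.shape[i] for i in order),
                       tuple(self.strides[i] for i in order), self.offset)

    def expand(self, new_shape):
        # EXPAND: only size-1 axes may grow, and they get stride 0.
        strides = []
        for old, new, st in zip(self.shape, new_shape, self.strides):
            if old == new:
                strides.append(st)
            elif old == 1:
                strides.append(0)
            else:
                raise ValueError("can only expand axes of size 1")
        return ToyView(tuple(new_shape), tuple(strides), self.offset)

    def shrink(self, bounds):
        # SHRINK: move the offset to the start corner and narrow the shape.
        offset = self.offset + sum(s * st for (s, _), st in zip(bounds, self.strides))
        return ToyView(tuple(e - s for s, e in bounds), self.strides, offset)

    def flip(self, axes):
        # FLIP: negate the stride and start from the last element of that axis.
        offset, strides = self.offset, []
        for size, st, flipped in zip(self.shape, self.strides, axes):
            if flipped:
                offset += (size - 1) * st
                strides.append(-st)
            else:
                strides.append(st)
        return ToyView(self.shape, tuple(strides), offset)

    def index(self, idx):
        # Linear position in the underlying buffer for a logical index.
        return self.offset + sum(i * st for i, st in zip(idx, self.strides))

base = ToyView.contiguous((2, 3))          # strides (3, 1)
transposed = base.permute((1, 0))          # shape (3, 2), strides (1, 3)
broadcast = ToyView.contiguous((1, 3)).expand((4, 3))  # strides (0, 1)
flipped = base.flip((False, True))         # strides (3, -1), offset 2
cropped = base.shrink(((0, 2), (1, 3)))    # shape (2, 2), offset 1
```

In this model every op returns a new view over the same buffer, which matches the idea above that movement ops change the logical view without copying data.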
Lowering / Indexing Ops
These UOps appear during the lowering process, translating logical views and operations into memory accesses and validity checks.
VIEW
- Description: Represents a logical view (ShapeTracker) applied to a base buffer UOp. This is the primary way movement ops are represented after initial lowering.
- Purpose: Encapsulates shape, stride, offset, and mask information without creating new data. Connects logical tensor operations to underlying buffer representations.
- Args: ShapeTracker - The view information.
- Sources: (UOp,) - The base UOp (often BUFFER, CONST, or another VIEW). Can also have a DEVICE source for CONST.
- Stage: Lowering, Scheduling.
- Notes: Multiple VIEW ops are merged into one. CONTIGUOUS ops often trigger realization before a VIEW. VIEW ops are pushed towards memory operations (LOAD/STORE) or constants during simplification.
INDEX
- Description: Calculates a memory address/index based on a buffer and logical indices, potentially applying a validity mask.
- Purpose: Translates multi-dimensional logical indexing into a linear memory index for LOAD and STORE. Encapsulates the ShapeTracker.to_indexed_uops logic.
- Args: None.
- Sources: (buffer_uop, logical_indices_uop, Optional[valid_uop]) - The buffer (e.g., DEFINE_GLOBAL), the calculated index expression, and an optional validity mask (dtypes.bool).
- Stage: Lowering (post rewrite_shapetracker_with_index), Codegen.
- Notes: This is part of the “new style” load/store introduced to simplify rendering.
VALID
- Description: Represents the validity mask derived from a ShapeTracker’s mask attribute.
- Purpose: Computes whether a given logical index corresponds to a valid element within the original (unpadded, unshrunk) data. Used for masking operations, especially loads/stores near boundaries.
- Args: None.
- Sources: (VIEW,) - The VIEW UOp containing the ShapeTracker with mask information.
- Stage: Lowering.
- Notes: Often simplified or combined with index calculations. Becomes the valid part of INDEX or the gate in LOAD/STORE.
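The interplay of an index expression, a validity mask, and a gated load can be sketched for the simplest case, a 1-D buffer read through a padded view. The helper names `padded_index` and `gated_load` are hypothetical and the code is a plain-Python illustration rather than tinygrad's lowering.

```python
def padded_index(i, size, pad_before):
    """Map logical index i of a padded 1-D view to (physical_index, valid)."""
    phys = i - pad_before            # index expression into the real buffer
    valid = 0 <= phys < size         # mask: does i fall on real data?
    return phys, valid

def gated_load(buf, idx, gate, alt_value=0.0):
    # Only touch memory when the gate is true; otherwise yield the alt value.
    return buf[idx] if gate else alt_value

buf = [10.0, 20.0, 30.0]
pad_before, pad_after = 1, 2
padded = [gated_load(buf, *padded_index(i, len(buf), pad_before))
          for i in range(pad_before + len(buf) + pad_after)]
print(padded)  # [0.0, 10.0, 20.0, 30.0, 0.0, 0.0]
```

The gate matters because the index expression alone can land outside the buffer (here, -1 for the first padded element), so the load must not be issued for invalid positions.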
GEP (Get Element Pointer)
- Description: Extracts specific elements from a vector UOp.
- Purpose: Accessing individual lanes of a vectorized operation or constant.
- Args: tuple[int, ...] - The indices of the elements to extract.
- Sources: (UOp,) - The vector UOp.
- Stage: Codegen, Final Rendering.
- Notes: The inverse of VECTORIZE. Allows scalar operations on elements previously combined into a vector.
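The inverse relationship between GEP and VECTORIZE can be shown with tuples standing in for vector values. The functions `vectorize` and `gep` below are hypothetical helpers for illustration; the choice to return a bare scalar for a single index is an assumption of this sketch.

```python
def vectorize(*scalars):
    # Combine scalars into one "vector" value (a tuple in this toy model).
    return tuple(scalars)

def gep(vec, indices):
    # Extract the requested lanes; a single index yields a scalar.
    picked = tuple(vec[i] for i in indices)
    return picked[0] if len(picked) == 1 else picked

v = vectorize(1.0, 2.0, 3.0, 4.0)
lane2 = gep(v, (2,))          # 3.0
swizzle = gep(v, (3, 0))      # (4.0, 1.0)
rebuilt = vectorize(*(gep(v, (i,)) for i in range(len(v))))  # equals v
```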
Memory Ops
These UOps deal directly with memory allocation, definition, and access.
BUFFER
- Description: Represents a raw memory buffer allocated on a specific device. This is the “base” UOp for most tensor data after initial allocation.
- Purpose: Holds the reference to the actual allocated memory used by tensors.
- Args: int - Size of the buffer in elements.
- Sources: (DEVICE, UNIQUE) - The device and a unique identifier.
- Stage: Graph Construction, Scheduling.
- Notes: BUFFER UOps map to Buffer objects which manage the actual memory. BUFFER itself doesn’t have a ShapeTracker; VIEW ops are applied on top.
BUFFER_VIEW
- Description: Represents a view of an existing BUFFER UOp, potentially with an offset. Introduced for DISK buffers.
- Purpose: Allows accessing parts of a larger buffer (e.g., a file on disk) without loading the entire thing.
- Args: (size: int, offset: int) - Size in elements and offset in elements from the base buffer.
- Sources: (BUFFER,) - The base buffer.
- Stage: Scheduling (Specific backends like DISK).
DEFINE_GLOBAL
- Description: Defines a global buffer in the kernel arguments.
- Purpose: Declares input/output buffers passed into the kernel.
- Args: int (optional, buffer index) or None.
- Sources: None.
- Stage: Final Lowering (inside linearize_uop), Codegen.
- Notes: Has a PtrDType. The renderer uses this to generate kernel signatures.
DEFINE_LOCAL
- Description: Defines a buffer in local (shared) memory.
- Purpose: Allocation of shared memory for intermediate results accessible by threads within a workgroup.
- Args: str - Name for the local buffer.
- Sources: None.
- Stage: Lowering, Codegen.
- Notes: Has a PtrDType with local=True. Requires synchronization (BARRIER).
DEFINE_ACC
- Description: Defines an accumulator register or variable, typically initialized to an identity element and used within reduction loops.
- Purpose: Holds the intermediate state during reduction operations.
- Args: (int,) - An accumulator index/identifier.
- Sources: (initial_value, *reduce_ranges) - The identity element (CONST) and the RANGE UOps defining the reduction loops.
- Stage: Lowering (created from REDUCE_AXIS).
- Notes: Lowered REDUCE_AXIS becomes DEFINE_ACC -> ALU -> ASSIGN(acc).
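The DEFINE_ACC -> ALU -> ASSIGN(acc) pattern corresponds to an ordinary accumulator loop. The sketch below reduces the last axis of a flat row-major buffer in plain Python; `reduce_rows`, `IDENTITY`, and `ALU` are hypothetical names, and the comments map each statement to the UOp it loosely stands for.

```python
import operator

# Identity element per reduction op: the accumulator's initial value.
IDENTITY = {"ADD": 0.0, "MAX": float("-inf")}
ALU = {"ADD": operator.add, "MAX": max}

def reduce_rows(data, rows, cols, op):
    """Reduce the last axis of a flat row-major (rows x cols) buffer."""
    out = []
    for r in range(rows):                 # loop over the axis that is kept
        acc = IDENTITY[op]                # DEFINE_ACC with the identity element
        for c in range(cols):             # RANGE over the reduced axis
            value = data[r * cols + c]    # LOAD
            acc = ALU[op](acc, value)     # ALU op, then ASSIGN back to acc
        out.append(acc)                   # STORE of the final accumulator
    return out

data = [1, 2, 3, 4, 5, 6]
sums = reduce_rows(data, 2, 3, "ADD")     # [6.0, 15.0]
maxes = reduce_rows(data, 2, 3, "MAX")    # [3, 6]
```

Starting from the identity element means the loop body is uniform: no special case is needed for the first iteration.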
LOAD
- Description: Loads data from memory (global or local).
- Purpose: Reading data from a buffer into registers/variables for computation.
- Args: Optional load configuration (e.g., cache hints, not currently used extensively).
- Sources:
  - Old style (pre-linearize): (BUFFER, VIEW, Optional[STORE]) - Buffer, ShapeTracker view, optional dependency.
  - New style (post-linearize): (INDEX, Optional[alt_value], Optional[gate], Optional[BARRIER]) - Indexed address, value if gate is false, gate condition, barrier dependency.
- Stage: Lowering, Codegen.
- Notes: Validity checks (from ShapeTracker masks) are incorporated into the INDEX or gate.
STORE
- Description: Stores data into memory (global or local).
- Purpose: Writing computation results back to a buffer.
- Args: None.
- Sources:
  - Old style (pre-linearize): (BUFFER, VIEW, value) - Buffer, ShapeTracker view, value to store.
  - New style (post-linearize): (INDEX, value, Optional[gate]) - Indexed address, value to store, optional gate condition.
- Stage: Lowering, Codegen.
- Notes: Often the final operation(s) feeding into a SINK.
Core Compute / ALU Ops
These perform basic element-wise arithmetic, logical, comparison, and transcendental operations. They generally expect sources to have the same shape and dtype (except for comparisons and specific cases like WHERE).
Unary (EXP2, LOG2, SIN, SQRT, RECIP, NEG)
- Description: Apply standard unary mathematical functions.
- Purpose: Element-wise computation.
- Args: None.
- Sources: (UOp,) - The input UOp.
- Stage: All stages.
- Notes: NEG is often represented as x * -1. Transcendental ops (EXP2, LOG2, SIN) might be rewritten to use approximations or backend-specific implementations.
Binary (ADD, MUL, IDIV, MAX, MOD, CMPLT, CMPNE, XOR, SHL, SHR, OR, AND, SUB, FDIV, POW)
- Description: Apply standard binary mathematical or logical functions.
- Purpose: Element-wise computation between two operands.
- Args: None.
- Sources: (UOp, UOp) - The two input UOps.
- Stage: All stages.
- Notes:
  - Commutative ops (ADD, MUL, MAX, CMPNE, XOR, AND, OR) might have sources swapped during optimization.
  - IDIV is integer division (truncates towards zero).
  - FDIV (internal) represents float division (x / y), often lowered to x * RECIP(y).
  - CMPLT/CMPNE output bool dtype.
  - SUB is often represented as x + (-y).
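A few of the notes above are easy to get wrong when reasoning from Python semantics, so here is a small illustration. The helpers `idiv`, `recip`, `fdiv`, and `sub` are hypothetical scalar functions written for this sketch; they model the stated behaviour (truncation towards zero, division via reciprocal, subtraction via negation) and are not tinygrad code.

```python
def idiv(x: int, y: int) -> int:
    # Integer division truncating towards zero (C-style),
    # unlike Python's // operator, which floors.
    q = abs(x) // abs(y)
    return q if (x < 0) == (y < 0) else -q

def recip(y: float) -> float:
    return 1.0 / y

def fdiv(x: float, y: float) -> float:
    # Float division expressed as a multiply by the reciprocal.
    return x * recip(y)

def sub(x, y):
    # Subtraction expressed as addition of the negation.
    return x + (-y)

print(idiv(-7, 2), -7 // 2)   # -3 -4
print(fdiv(1.0, 4.0))         # 0.25
print(sub(5, 3))              # 2
```

The difference between `idiv(-7, 2)` and `-7 // 2` is the reason truncation is called out explicitly: generated C-style code and a Python reference would otherwise disagree on negative operands.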
Ternary (WHERE, MULACC)
- Description: Apply standard ternary functions.
- Purpose: Element-wise computation involving three operands.
- Args: None.
- Sources:
  - WHERE: (condition, true_value, false_value) - Condition must be bool.
  - MULACC: (a, b, c) - Computes a * b + c.
- Stage: All stages.
- Notes: MULACC (Multiply-Accumulate) can often map efficiently to hardware FMA (Fused Multiply-Add) instructions.
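As an element-wise illustration of the two ternary ops, the sketch below applies them to Python lists. `where` and `mulacc` are hypothetical helpers; note that this `mulacc` rounds twice (after the multiply and after the add), whereas a hardware FMA rounds once, so results can differ in the last bit for floats.

```python
def where(cond, true_vals, false_vals):
    # The condition must be boolean, mirroring the dtype requirement above.
    if not all(isinstance(c, bool) for c in cond):
        raise TypeError("condition must be bool")
    return [t if c else f for c, t, f in zip(cond, true_vals, false_vals)]

def mulacc(a, b, c):
    # a * b + c per element (not fused: two roundings for floats).
    return [x * y + z for x, y, z in zip(a, b, c)]

selected = where([True, False], [1, 2], [10, 20])   # [1, 20]
fused = mulacc([1, 2], [3, 4], [5, 6])              # [8, 14]
```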
CAST
- Description: Changes the data type of the elements.
- Purpose: Type conversion (e.g., float to int, int to float, float16 to float32).
- Args: None.
- Sources: (UOp,) - The input UOp.
- Stage: All stages.
- Notes: Behavior depends on the source and destination types (truncation, rounding, etc.).
BITCAST
- Description: Reinterprets the bits of the elements as a different data type of the same size.
- Purpose: Low-level manipulation, often used for specific algorithms or interacting with hardware types (e.g., float <-> int).
- Args: None.
- Sources: (UOp,) - The input UOp.
- Stage: All stages.
- Notes: Does not change the underlying bit pattern, only the type interpretation. Requires source and destination dtypes to have the same itemsize.
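The difference between CAST and BITCAST can be demonstrated with Python's standard `struct` module. This is an analogy on scalar values, not tinygrad code; the function names are hypothetical, and the cast shown assumes C-style truncation for float to int.

```python
import struct

def bitcast_f32_to_u32(x: float) -> int:
    # Reinterpret the 4 bytes of a float32 as an unsigned 32-bit integer.
    return struct.unpack("<I", struct.pack("<f", x))[0]

def bitcast_u32_to_f32(bits: int) -> float:
    # The reverse reinterpretation: same bits, read as float32.
    return struct.unpack("<f", struct.pack("<I", bits))[0]

def cast_f32_to_i32(x: float) -> int:
    # A value conversion: the number is preserved (truncated), the bits are not.
    return int(x)

# Both formats are 4 bytes wide, matching the itemsize requirement.
assert struct.calcsize("<f") == struct.calcsize("<I") == 4

print(hex(bitcast_f32_to_u32(1.0)))      # 0x3f800000
print(bitcast_u32_to_f32(0x3F800000))    # 1.0
print(cast_f32_to_i32(1.9))              # 1
```

A bitcast of 1.0 yields 0x3f800000 (the IEEE 754 encoding), while a cast of 1.0 yields the integer 1; only the former round-trips every bit pattern.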
Reduce Ops
REDUCE_AXIS
- Description: Performs a reduction operation (like sum, max) along specified axes.
- Purpose: Aggregates data across dimensions.
- Args: (Ops, tuple[int, ...]) - The reduction operation (e.g., Ops.ADD, Ops.MAX) and the axes to reduce.
- Sources: (UOp,) - The input UOp.
- Stage: High-level, Lowering.
- Notes: Lowered into a combination of DEFINE_ACC, RANGE loops, ALU ops, and ASSIGN. Can be split or grouped during optimization.
WMMA (Warp Matrix Multiply Accumulate)
- Description: Represents a hardware-accelerated matrix multiplication operation performed cooperatively by a group of threads (warp/wavefront).
- Purpose: Leverages specialized hardware units (like Tensor Cores on NVIDIA GPUs, Matrix Cores on AMD GPUs, AMX on Apple Silicon) for high-performance matrix multiplication.
- Args: (name, dims, dtype_in, dtype_out, device, threads, upcast_axes, reduce_axes) - Detailed configuration for the WMMA operation.
- Sources: (A, B, C) - Input matrices A, B, and accumulator C.
- Stage: Codegen (inserted by optimization passes like apply_tensor_cores).
). - Notes: Highly backend-specific. Requires specific data layouts and operand types.
Control Flow Ops
RANGE
- Description: Represents a loop range, typically used for iterating over tensor dimensions.
- Purpose: Defines the iteration space for loops in the generated code.
- Args: int - An identifier for the loop variable (axis index).
- Sources: (start, end) - UOps defining the start (inclusive) and end (exclusive) of the loop.
- Stage: Lowering (created by get_index), Codegen.
- Notes: Rendered as for loops in C-style backends. Used as sources for DEFINE_ACC.
IF
- Description: Represents a conditional block.
- Purpose: Conditional execution in the generated code.
- Args: None.
- Sources: (condition, Optional[BARRIER]) - The boolean condition UOp and an optional barrier dependency.
- Stage: Lowering, Codegen.
- Notes: Requires a corresponding ENDIF. Code between IF and ENDIF is executed only if the condition is true. Used for gating LOAD/STORE.
ENDRANGE / ENDIF
- Description: Marks the end of a RANGE or IF block, respectively.
- Purpose: Defines the scope of loops and conditionals.
- Args: None.
- Sources: (RANGE,) or (IF,) - The corresponding start block UOp.
- Stage: Lowering, Codegen.
- Notes: Rendered as closing braces } in C-style backends.
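How a linear list of RANGE, IF, ENDIF, and ENDRANGE markers turns into nested C-style blocks can be sketched with a toy renderer. The `(op, arg)` tuple format, the `STMT` pseudo-op, and the `render` function are assumptions made for this example and do not reflect tinygrad's renderer interface.

```python
def render(ops):
    """Render a linear list of (op, arg) pairs as indented C-style code."""
    lines, depth = [], 0
    for op, arg in ops:
        if op in ("ENDRANGE", "ENDIF"):
            depth -= 1                       # leave the scope, then close it
            lines.append("  " * depth + "}")
        elif op == "RANGE":
            name, start, end = arg           # start inclusive, end exclusive
            lines.append("  " * depth +
                         f"for (int {name} = {start}; {name} < {end}; {name}++) {{")
            depth += 1
        elif op == "IF":
            lines.append("  " * depth + f"if ({arg}) {{")
            depth += 1
        else:                                # any other op: a plain statement
            lines.append("  " * depth + f"{arg};")
    return "\n".join(lines)

program = [
    ("RANGE", ("ridx0", 0, 4)),
    ("IF", "ridx0 < 3"),
    ("STMT", "out[ridx0] = in[ridx0] * 2"),
    ("ENDIF", None),
    ("ENDRANGE", None),
]
code = render(program)
print(code)
```

Because each END marker simply closes the most recent open block, the list must be properly nested; that ordering is what the linearization step is responsible for producing.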
BARRIER
- Description: Represents a synchronization point, typically for local (shared) memory.
- Purpose: Ensures that all threads in a workgroup reach this point before any thread proceeds, making writes to shared memory visible to other threads.
- Args: None.
- Sources: Usually (STORE,) operations to local memory that need to complete before subsequent reads. Can also be a source for IF.
- Stage: Lowering, Codegen.
- Notes: Essential for correctness when using shared memory for reductions or caching.
Vectorization / Structure Ops
VECTORIZE
- Description: Combines multiple scalar UOps into a single vector UOp.
- Purpose: Explicitly represents vector operations for backends that support them. Also used internally as a structural node (e.g., for VCONST).
- Args: None.
- Sources: (scalar_uop_1, scalar_uop_2, ...) - The scalar UOps to be combined.
- Stage: Codegen, Final Rendering.
- Notes: The inverse of GEP. Renderers translate this into vector types and operations if supported.
UNROLL
- Description: Represents loop unrolling during code generation. Structurally similar to VECTORIZE but used for axes that are fully unrolled rather than vectorized.
- Purpose: Optimization to reduce loop overhead by duplicating the loop body. Also used structurally in Tensor Core lowering.
- Args: tuple[tuple[int, int], ...] - The axes being unrolled and their sizes ((axis, size), ...).
- Sources: (UOp,) - The UOp whose unrolled axes are represented.
- Stage: Codegen (inserted by Kernel.apply_opt), Expansion.
- Notes: The expander pass (do_expand) processes UNROLL ops, effectively performing the unrolling by duplicating and adjusting source UOps.
CONTRACT
- Description: Represents the contraction (summation) part of a vectorized reduction or Tensor Core operation. Inverse of UNROLL for specific axes.
- Purpose: Used structurally during the expansion/contraction passes related to vectorization and Tensor Cores.
- Args: tuple[tuple[int, int], ...] - The axes being contracted and their sizes ((axis, size), ...).
- Sources: (UOp,) - The UOp (often an UNROLL) being contracted.
- Stage: Expansion.
CAT
- Description: Concatenates multiple vector UOps. (Internal use).
- Purpose: Used during expansion/vectorization passes to combine vectors.
- Args: None.
- Sources: Multiple vector UOps.
- Stage: Expansion.
- Notes: Lowered to VECTORIZE with GEP sources before rendering.
Internal / Removed Early Ops
These ops exist briefly during graph construction or early simplification but are typically removed before significant lowering or scheduling.
CONTIGUOUS / CONTIGUOUS_BACKWARD
- Description: Marks a requirement for the data to be in contiguous memory layout. CONTIGUOUS_BACKWARD affects the backward pass only.
- Purpose: Triggers realization or specific memory layouts. Often used before operations that require contiguous inputs (like some COPY operations or external calls).
- Args: None.
- Sources: (UOp,) - The input UOp.
- Stage: Graph Construction, Early Simplification.
- Notes: Usually removed by simplification rules (sym) if the input is already known to be contiguous or if the operation can be achieved via a VIEW.
DETACH
- Description: Removes the UOp from the computation graph for gradient calculation purposes.
- Purpose: Implements Tensor.detach().
- Args: None.
- Sources: (UOp,) - The input UOp.
- Stage: Graph Construction, Early Simplification.
- Notes: Removed by simplification rules (sym).
BLOCK / BLOCKSTART / BLOCKFORK / BLOCKEND
- Description: Internal ops used by linearize_uop to group UOps into basic blocks based on control flow (RANGE, IF).
- Purpose: Facilitates structuring the UOp list into code blocks for rendering.
- Args: BasicBlock or int.
- Sources: Variable, depending on the block structure.
- Stage: Codegen (linearize_uop).
). - Notes: These are temporary structural nodes used only within the linearization process and do not appear in the final UOp list passed to the renderer.
This summary covers the primary UOps and their roles. The exact behavior and interactions can be complex, as they are subject to numerous rewrite rules during the compilation process.