import os
os.environ["CPU"] = "1"
# os.environ["TRACEMETA"] = "0"
os.environ["DEBUG"] = "4"
# os.environ["NOOPT"] = "1"
1 - UOps
As we saw in the previous chapter, UOps are the intermediate, device-independent representation of the computation tree that sits between the user-facing Tensor
and the device-specific code that is generated to perform the computations.
import tinygrad as tg
from tinygrad import Tensor, dtypes
UOp is the basic building block of TinyGrad, used to represent the computation tree that we create by manipulating Tensors:
class UOp(...):
  op:Ops
  dtype:DType = dtypes.void
  src:tuple[UOp, ...] = tuple()
  arg:Any = None
  ...
- The `op` is the type of the operation, like `ADD`, `MUL` or `CONST`
- The `dtype` is one of the tinygrad dtypes, like `dtypes.float32` or `dtypes.uint8`
- The `src` is a tuple of the UOps this UOp operates on
- The `arg` changes meaning depending on the `op`: for a `CONST` it can be a number like `3.0`, while for a `VIEW` op it can be the new shape.
UOp reference
UOps are used throughout TinyGrad, some are specific to certain stages of processing (from Tensors to code), some are valid at any stage.
Take a look at the UOp Reference to get a feel for the UOps we have.
UOp is a singleton
As noted by mesozoic-egg@github, UOp is a singleton.
It’s implemented using a MetaClass:
class UOpMetaClass(type):
  ucache:dict[tuple, weakref.ReferenceType[UOp]] = {}
  def __call__(cls, op:Ops, dtype:DType=dtypes.void, src:tuple[UOp,...]=tuple(), arg:Any=None, _buffer:Buffer|None=None):
    if (wret:=UOpMetaClass.ucache.get(key:=(op, dtype, src, arg), None)) is not None and (ret:=wret()) is not None: return ret
    UOpMetaClass.ucache[key] = ref = weakref.ref(created:=super().__call__(*key))
    ...
    return created
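The caching idea can be reproduced in plain Python without tinygrad. The `InternMeta` and `Node` classes below are invented for illustration; they only sketch the weakref-cache pattern, not the real `UOpMetaClass`:

```python
import weakref

class InternMeta(type):
    # Maps (class, constructor args) to a weak reference, so cache entries
    # vanish automatically once nobody holds the object anymore.
    cache: dict[tuple, "weakref.ReferenceType"] = {}

    def __call__(cls, *args):
        key = (cls,) + args
        if (wret := InternMeta.cache.get(key)) is not None and (ret := wret()) is not None:
            return ret
        InternMeta.cache[key] = weakref.ref(created := super().__call__(*args))
        return created

class Node(metaclass=InternMeta):
    def __init__(self, op, *src):
        self.op, self.src = op, src

a = Node("ADD", Node("CONST", 3), Node("CONST", 5))
b = Node("ADD", Node("CONST", 3), Node("CONST", 5))
print(a is b)  # True: identical trees intern to the same object
```

Because the children are interned first, the constructor arguments for the two `ADD` nodes are the very same objects, so the cache lookup hits and both trees share one root.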
@dataclass(eq=False, slots=True)
class UOp(MathTrait, metaclass=UOpMetaClass):
  ...
  def __del__(self):
    if (ref:=UOpMetaClass.ucache.get(k:=(self.op, self.dtype, self.src, self.arg))) is not None:
      del UOpMetaClass.ucache[k]
(TinyGrad really loves its `:=` operators)
The main idea: if you have two UOp (sub-)trees, they are very easy to compare, because if they are identical, the roots of both trees will be the same object.
from tinygrad.ops import UOp, Ops
# Create two identical UOp trees (3 * 5 + 2)
x1 = UOp(Ops.CONST, dtype=dtypes.int, arg=5)
mul1 = UOp(Ops.MUL, dtype=dtypes.int, src=(UOp(Ops.CONST, dtype=dtypes.int, arg=3), x1))
add1 = UOp(Ops.ADD, dtype=dtypes.int, src=(mul1, UOp(Ops.CONST, dtype=dtypes.int, arg=2)))

# Second tree
x2 = UOp(Ops.CONST, dtype=dtypes.int, arg=5)
mul2 = UOp(Ops.MUL, dtype=dtypes.int, src=(UOp(Ops.CONST, dtype=dtypes.int, arg=3), x2))
add2 = UOp(Ops.ADD, dtype=dtypes.int, src=(mul2, UOp(Ops.CONST, dtype=dtypes.int, arg=2)))
id(add1) == id(add2)
True
# Third tree is different (3 * 5 + 1)
x3 = UOp(Ops.CONST, dtype=dtypes.int, arg=5)
mul3 = UOp(Ops.MUL, dtype=dtypes.int, src=(UOp(Ops.CONST, dtype=dtypes.int, arg=3), x3))
add3 = UOp(Ops.ADD, dtype=dtypes.int, src=(mul3, UOp(Ops.CONST, dtype=dtypes.int, arg=1)))
id(add1) == id(add3)
False
Symbolic evaluation
Another cool feature of UOps: if all inputs are constants and the result is a scalar, the tree can be evaluated without generating any device code at all:
add1
UOp(Ops.ADD, dtypes.int, arg=None, src=(
UOp(Ops.MUL, dtypes.int, arg=None, src=(
UOp(Ops.CONST, dtypes.int, arg=3, src=()),
UOp(Ops.CONST, dtypes.int, arg=5, src=()),)),
UOp(Ops.CONST, dtypes.int, arg=2, src=()),))
add1.simplify()
UOp(Ops.CONST, dtypes.int, arg=17, src=())
Another way to do the same is to cast the UOp to `float` or `int`, depending on its dtype.
int(add1)
17
UOp.render()
`.render()` converts the UOp tree into a C expression.
This is not how TinyGrad generates the code normally, but it’s useful for debugging.
add1.render(simplify=False)
'((3*5)+2)'
This only works for a limited subset of trees with simple operations, though.
If the tree cannot be rendered, the function returns `str(op)` instead.
UOp creation helpers
In many cases, the UOp class has methods for creating specific UOps. It's often more convenient and concise to use them.
For example, `UOp.const()` creates either a `CONST` or a `VCONST` (vector const, used internally for buffers), and also takes care of matching the arg type to the dtype:
UOp.const(dtypes.float16, 2)
UOp(Ops.CONST, dtypes.half, arg=2.0, src=())
Note that the arg has been converted to a `float`, even though we passed in an `int`.
There are a few that are very straightforward:
# The SINK is the end of a computation graph
def sink(self, *srcs:UOp): return UOp(Ops.SINK, dtypes.void, (self,)+srcs)
# Detach from the backprop
def detach(self): return UOp(Ops.DETACH, self.dtype, (self,))
def cast(self, dtype:DType): return UOp(Ops.CAST, dtype, (self,))
def bitcast(self, dtype:DType): return UOp(Ops.BITCAST, dtype, (self,))
def load(self, *src:UOp, **kwargs): return UOp(Ops.LOAD, src=(self,)+src, **kwargs)
def store(self, *src:UOp, **kwargs): return UOp(Ops.STORE, dtypes.void, (self,)+src, **kwargs)
# The RANGE is actually a `for` loop
def range(dtype:DType, start:sint, end:sint, idx:int): return UOp(Ops.RANGE, dtype=dtype, src=(sint_to_uop(start), sint_to_uop(end)), arg=idx)
def assign(self, x:UOp): return UOp(Ops.ASSIGN, self.dtype, (self,x))
def contiguous(self): return self.alu(Ops.CONTIGUOUS)
def contiguous_backward(self): return self.alu(Ops.CONTIGUOUS_BACKWARD)
Toposort
Quite often we need to traverse a UOp tree in "topological order".
`UOp.toposort` is a property that returns a dictionary with the UOps as keys and all values being None.
(This emulates an ordered set, which Python lacks.)
print("===== 3 * 5 + 2 =====")
for o in add1.toposort.keys():
print(o.op, o.arg)
===== 3 * 5 + 2 =====
Ops.CONST 3
Ops.CONST 5
Ops.MUL None
Ops.CONST 2
Ops.ADD None
You get the idea: the children always come before their parents.
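This ordering can be reproduced in plain Python with a small post-order walk over a toy tree. The `N` class and `toposort` helper below are invented for illustration (not tinygrad code), but the dict-as-ordered-set trick is the same:

```python
class N:
    def __init__(self, name, *src):
        self.name, self.src = name, src

def toposort(node, out=None):
    # Insert children before their parent; dict keys keep insertion order,
    # so iterating the keys yields a valid topological order.
    if out is None: out = {}
    for s in node.src:
        if s not in out: toposort(s, out)
    out[node] = None
    return out

tree = N("add", N("mul", N("3"), N("5")), N("2"))
print([n.name for n in toposort(tree)])  # ['3', '5', 'mul', '2', 'add']
```

Note that the order matches the `3 * 5 + 2` output above: both constants of the `MUL` come first, then the `MUL`, then the remaining constant, then the `ADD`.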
Other UOp methods
When reading the TinyGrad code, you will often see other UOp methods called. To make that easier, let's go over some popular ones.
.replace()
Despite its name, this does not replace anything; it creates a new UOp that is a copy of the original, except for the fields (op, dtype, src, arg) you want to change:
add1.replace(op=Ops.SUB)
UOp(Ops.SUB, dtypes.int, arg=None, src=(
UOp(Ops.MUL, dtypes.int, arg=None, src=(
UOp(Ops.CONST, dtypes.int, arg=3, src=()),
UOp(Ops.CONST, dtypes.int, arg=5, src=()),)),
UOp(Ops.CONST, dtypes.int, arg=2, src=()),))
`add1` did not change:
add1
UOp(Ops.ADD, dtypes.int, arg=None, src=(
UOp(Ops.MUL, dtypes.int, arg=None, src=(
UOp(Ops.CONST, dtypes.int, arg=3, src=()),
UOp(Ops.CONST, dtypes.int, arg=5, src=()),)),
UOp(Ops.CONST, dtypes.int, arg=2, src=()),))
UOps are actually supposed to be immutable, but this is not enforced for performance reasons:
# NOTE: this should be frozen, but frozen is slower
@dataclass(eq=False, slots=True)
class UOp(MathTrait, metaclass=UOpMetaClass):
...
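The copy-with-changes semantics of `.replace()` are the same as `dataclasses.replace` on a frozen dataclass. A minimal sketch (the `Op` record here is invented for illustration, not the real UOp):

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Op:
    op: str
    arg: object = None

add = Op("ADD")
sub = replace(add, op="SUB")  # builds a new object; `add` is untouched
print(add, sub)  # Op(op='ADD', arg=None) Op(op='SUB', arg=None)
```

tinygrad gets the same behavior with a hand-written method instead of `frozen=True`, since freezing makes attribute access slower.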
UOp to code
from tinygrad.engine.schedule import create_schedule_with_vars
from tinygrad.engine.realize import lower_schedule_item
You did a bunch of Tensor operations, constructed a chonky UOp tree, and now you want to actually compute it.
a = (Tensor.full((10, 10), 1) + Tensor.full((10, 10), 2)).contiguous()
a.lazydata
UOp(Ops.CONTIGUOUS, dtypes.int, arg=None, src=(
UOp(Ops.ADD, dtypes.int, arg=None, src=(
UOp(Ops.EXPAND, dtypes.int, arg=(10, 10), src=(
UOp(Ops.RESHAPE, dtypes.int, arg=(1, 1), src=(
UOp(Ops.CONST, dtypes.int, arg=1, src=(
x4:=UOp(Ops.VIEW, dtypes.void, arg=ShapeTracker(views=(View(shape=(), strides=(), offset=0, mask=None, contiguous=True),)), src=(
UOp(Ops.DEVICE, dtypes.void, arg='CPU', src=()),)),)),)),)),
UOp(Ops.EXPAND, dtypes.int, arg=(10, 10), src=(
UOp(Ops.RESHAPE, dtypes.int, arg=(1, 1), src=(
UOp(Ops.CONST, dtypes.int, arg=2, src=(
x4,)),)),)),)),))
The first step is to "schedule" the computation. This converts the UOp tree to a lower-level one. You might also notice that it already computed `1+2=3`.
Note: We will cover the `ShapeTracker` in a separate chapter soon.
schedule, vars = a.schedule_with_vars()
schedule, vars
([ScheduleItem(ast=UOp(Ops.SINK, dtypes.void, arg=None, src=(
UOp(Ops.STORE, dtypes.void, arg=None, src=(
UOp(Ops.DEFINE_GLOBAL, dtypes.int.ptr(100), arg=0, src=()),
UOp(Ops.VIEW, dtypes.void, arg=ShapeTracker(views=(View(shape=(10, 10), strides=(10, 1), offset=0, mask=None, contiguous=True),)), src=()),
UOp(Ops.CONST, dtypes.int, arg=3, src=(
UOp(Ops.VIEW, dtypes.void, arg=ShapeTracker(views=(View(shape=(10, 10), strides=(0, 0), offset=0, mask=None, contiguous=False),)), src=()),)),)),)), bufs=(<buf real:False device:CPU size:100 dtype:dtypes.int offset:0>,), metadata=(contiguous, __add__))],
{})
The next step is to convert the `ScheduleItem` into executable code.
ei = lower_schedule_item(schedule[0])
ei
opened device CPU from pid:549064
E_25_4
0: (25, 4) int.ptr(100) (4, 1) ShapeTracker(views=(View(shape=(25, 4), strides=(4, 1), offset=0, mask=None, contiguous=True),))
[Opt(op=OptOps.UPCAST, axis=0, arg=4)]
void E_25_4(int* restrict data0) {
for (int ridx0 = 0; ridx0 < 25; ridx0++) {
int alu0 = (ridx0<<2);
*(data0+alu0) = 3;
*(data0+(alu0+1)) = 3;
*(data0+(alu0+2)) = 3;
*(data0+(alu0+3)) = 3;
}
}
ExecItem(prg=<tinygrad.engine.realize.CompiledRunner object>, bufs=[<buf real:False device:CPU size:100 dtype:dtypes.int offset:0>], metadata=(contiguous, __add__))
This brings the UOp tree to the lowest level, which maps roughly 1:1 to the generated code:
for o in ei.prg.p.uops:
print(o.op, o.arg, [s.arg for s in o.src if s.op == Ops.CONST] if o.src else "")
Ops.NAME E_25_4
Ops.DEFINE_GLOBAL 0
Ops.CONST 0
Ops.CONST 1
Ops.CONST 2
Ops.CONST 3
Ops.CONST 25
Ops.RANGE 0 [0, 25]
Ops.SHL None [2]
Ops.INDEX None []
Ops.STORE None [3]
Ops.ADD None [1]
Ops.INDEX None []
Ops.STORE None [3]
Ops.ADD None [2]
Ops.INDEX None []
Ops.STORE None [3]
Ops.ADD None [3]
Ops.INDEX None []
Ops.STORE None [3]
Ops.ENDRANGE None []
print(ei.prg.p.src)
void E_25_4(int* restrict data0) {
for (int ridx0 = 0; ridx0 < 25; ridx0++) {
int alu0 = (ridx0<<2);
*(data0+alu0) = 3;
*(data0+(alu0+1)) = 3;
*(data0+(alu0+2)) = 3;
*(data0+(alu0+3)) = 3;
}
}
Let's compile and run the code. We will go into much more detail on the individual steps later.
ei.run()
*** CPU 1 E_25_4 arg 1 mem 0.00 GB tm 7.96us/ 0.01ms ( 0.00 GFLOPS 0.1|0.1 GB/s) ['contiguous', '__add__']
7.961993105709553e-06
The result has been stored to the buffer:
import numpy as np
view = memoryview(a.lazydata.base.realized._buf)
np.frombuffer(view, dtype=np.int32).reshape(a.shape)
array([[3, 3, 3, 3, 3, 3, 3, 3, 3, 3],
[3, 3, 3, 3, 3, 3, 3, 3, 3, 3],
[3, 3, 3, 3, 3, 3, 3, 3, 3, 3],
[3, 3, 3, 3, 3, 3, 3, 3, 3, 3],
[3, 3, 3, 3, 3, 3, 3, 3, 3, 3],
[3, 3, 3, 3, 3, 3, 3, 3, 3, 3],
[3, 3, 3, 3, 3, 3, 3, 3, 3, 3],
[3, 3, 3, 3, 3, 3, 3, 3, 3, 3],
[3, 3, 3, 3, 3, 3, 3, 3, 3, 3],
[3, 3, 3, 3, 3, 3, 3, 3, 3, 3]], dtype=int32)