import os
# os.environ["DEBUG"] = "7"
"CACHELEVEL"] ="0"
os.environ["NOOPT"] = "1" os.environ[
Misc 1 - elf.py and the ELF format
ELF is the standard format for executables and libraries, as well as intermediate object files (.o) on Linux.
TinyGrad uses it for 2 somewhat distinct roles:
- When it generates CPU code (clang or LLVM IR backends), it then needs to load the generated binary code into memory to run it.
We could have built a shared library (.so), but then I think we’d have to save it to use dlopen(), and it would not be portable anyway. So instead of this, TinyGrad only does the compilation step (generates a .o) and implements a minimal ELF loader.
@uuuuvn also pointed out that doing a proper linking step is slow, and there is a bug in the OSX linker that can’t output to stdout.
- When it generates CUDA code, it also comes out as ELF. The sections and relocation rules are different from those found on Linux, but the format is the same. The CUDA Runtime library would be able to load this object file, and that’s what we do for the CUDA backend, but the NV backend does not rely on the CUDA Runtime libraries, so we parse the ELF manually.
The loader and relocation code for host (x86_64 and ARM 64) platforms is implemented in tinygrad/runtime/support/elf.py
Since we only deal with self-contained object files (they don’t access data/functions from other files), the task is less daunting than you might think.
Let’s first look at the ELF format, at least the bits relevant here:
ELF format
ELF Header (Elf64_Ehdr)
- Located at the very beginning of the file.
- Contains essential metadata:
  - e_ident: Magic number (\x7fELF) and info like 64-bit (ELFCLASS64), endianness (ELFDATA2LSB), OS ABI.
  - e_type: File type (e.g., ET_REL for relocatable/object file, ET_EXEC for executable, ET_DYN for shared object). elf.py expects ET_REL.
  - e_machine: Target architecture (e.g., EM_X86_64, EM_AARCH64).
  - e_shoff: Offset to the Section Header Table.
  - e_shnum: Number of entries in the Section Header Table.
  - e_shstrndx: Index of the section header table entry that contains section names.
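To make these offsets concrete, here is a minimal sketch that pulls the header fields out of an object file with plain struct (elf.py itself uses proper struct definitions rather than manual offsets; elftest.o is the file we’ll compile further below):

import struct

with open('elftest.o', 'rb') as f:  # built in the "Running code on CPU" section
    data = f.read()

assert data[:4] == b'\x7fELF'          # e_ident magic
assert data[4] == 2 and data[5] == 1   # ELFCLASS64, ELFDATA2LSB
e_type, e_machine = struct.unpack_from('<HH', data, 16)   # ET_REL == 1, EM_X86_64 == 62
e_shoff, = struct.unpack_from('<Q', data, 40)             # offset of the Section Header Table
e_shentsize, e_shnum, e_shstrndx = struct.unpack_from('<HHH', data, 58)
print(f"type={e_type} machine={e_machine} shoff={e_shoff} shnum={e_shnum} shstrndx={e_shstrndx}")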
Section Header Table:
- An array of Elf64_Shdr structures.
- Each entry describes a section in the file.
- Key Elf64_Shdr fields used by elf.py:
  - sh_name: Offset into the Section Name String Table (.shstrtab) for this section’s name.
  - sh_type: Type of section (e.g., SHT_PROGBITS for code/data, SHT_SYMTAB for symbols, SHT_RELA/SHT_REL for relocations, SHT_STRTAB for strings, SHT_NOBITS for .bss).
  - sh_flags: Attributes like SHF_ALLOC (load into memory), SHF_WRITE, SHF_EXECINSTR.
  - sh_addr: The intended virtual memory address during execution. For .o files (type ET_REL), this is often 0 for most sections, as the final address isn’t known yet. elf.py updates this for sections it places.
  - sh_offset: Offset of the section’s content within the ELF file itself.
  - sh_size: Size of the section’s content in the file (or memory size for SHT_NOBITS).
  - sh_link, sh_info: Used by specific section types (e.g., .symtab uses sh_link to point to its string table).
  - sh_addralign: Required alignment constraint for the section’s start address.
  - sh_entsize: Size of each entry if the section holds a table (like a symbol table or relocation table).
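Continuing the header sketch above (same caveats), we can walk the section header table and resolve each section’s name through .shstrtab:

# Elf64_Shdr is 64 bytes; field order: sh_name, sh_type, sh_flags, sh_addr,
# sh_offset, sh_size, sh_link, sh_info, sh_addralign, sh_entsize.
shdrs = [struct.unpack_from('<IIQQQQIIQQ', data, e_shoff + i * e_shentsize) for i in range(e_shnum)]
shstrtab_off = shdrs[e_shstrndx][4]  # sh_offset of the section name string table
def sec_name(off): return data[shstrtab_off + off:data.index(b'\0', shstrtab_off + off)].decode()
for sh in shdrs: print(f"{sec_name(sh[0]):16} type={sh[1]} size={sh[5]} addralign={sh[8]}")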
Sections:
- Contiguous chunks of bytes (or just metadata for SHT_NOBITS) described by the Section Header Table.
- Common sections relevant here:
  - .text: Executable code (SHT_PROGBITS).
  - .data: Initialized global/static variables (SHT_PROGBITS).
  - .rodata: Read-only data (constants, strings) (SHT_PROGBITS).
  - .bss: Uninitialized global/static variables (SHT_NOBITS). Occupies no space in the file but needs space allocated in memory. (elf.py doesn’t explicitly handle .bss size allocation based on sh_size, it only lays out SHT_PROGBITS).
  - .symtab: Symbol Table (SHT_SYMTAB). Lists defined and referenced symbols (functions, variables).
  - .strtab: String Table (SHT_STRTAB). Stores names for symbols.
  - .shstrtab: Section Header String Table (SHT_STRTAB). Stores names for sections.
  - .rela.text, .rel.text, etc.: Relocation sections (SHT_RELA, SHT_REL). Contain instructions for patching code/data.
Relocations
Even for a self-contained file, the compiler generates an object file with symbols (constants, global vars, functions) that don’t have their final location assigned. This code might:
- Access a function defined elsewhere in the same file (e.g., a helper function in another part of .text).
- Reference constants or static data (e.g., accessing a string literal in .rodata or a static array in .data).
When generating CPU code, PC-relative addressing is used, so the code does not care where it will be placed in memory for execution. Still, the code will reference stuff in .rodata, so it needs to know that address.
TinyGrad (at least for now) does not generate code with subroutines or global variables, which makes our task easier.
Relocations are metadata entries that tell our loader how to patch the machine code once we’ve determined where each section will be placed in memory. For self-contained files, we only need to resolve references within the same object file, not external dependencies.
There are 2 types of relocation entries, REL and RELA:
- For REL, the patch site already contains the addend (e.g., an offset relative to the Program Counter (PC)), so we read that value, combine it with the symbol’s final address, and patch the instruction.
- For RELA, the patch site contains zeros (or any value, really), and the RELA entry carries the addend explicitly. We combine it with the symbol’s address and overwrite the value at the patch site.
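A hypothetical sketch of the difference (patch_pc32_* are made-up names; the real math lives in relocate in elf.py):

import struct

def patch_pc32_rela(image: bytearray, P: int, S: int, A: int):
    # RELA: the addend A comes from the relocation entry; store S + A - P at P.
    struct.pack_into('<i', image, P, S + A - P)

def patch_pc32_rel(image: bytearray, P: int, S: int):
    # REL: the addend is whatever already sits at the patch site, so read it first.
    A = struct.unpack_from('<i', image, P)[0]
    struct.pack_into('<i', image, P, S + A - P)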
Relocation Sections:
- Contain arrays of relocation entries.
- The section name indicates which other section the relocations apply to (e.g., .rela.text contains patches for the .text section). elf.py uses sh.name[5:] or sh.name[4:] to find the target section name.
- There are two main types:
  - SHT_RELA: Entries contain an explicit addend. (Elf64_Rela)
  - SHT_REL: Entries have an implicit addend (usually the value already present at the location being patched). (Elf64_Rel)
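That name slicing is just stripping the .rela/.rel prefix, roughly (sh being the relocation section):

target_name = sh.name[5:] if sh.name.startswith('.rela') else sh.name[4:]  # ".rela.text" -> ".text", ".rel.text" -> ".text"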
Relocation Entries (Elf64_Rela / Elf64_Rel):
Each entry describes a single patch. Key fields:
- r_offset: The offset within the target section where the patch needs to be applied. Let’s call the final address P = section_base + r_offset.
- r_info: A combined field containing:
  - Symbol Index (ELF64_R_SYM(r_info)): An index into the .symtab section. This identifies the symbol whose address is needed for the calculation. Let the symbol’s final address be S.
  - Relocation Type (ELF64_R_TYPE(r_info)): Specifies how to calculate the value to be patched and how to insert it at P. This is architecture-specific. Examples:
    - For x86_64, we only support R_X86_64_PC32: calculate S + A - P (Symbol + Addend - PatchLocation) and store the 32-bit result.
    - For ARM64, we have more options, and we actually have to patch some bits in the target instruction.
- r_addend (Elf64_Rela only): An explicit constant value (A) to be used in the relocation calculation (e.g., S + A - P). For Elf64_Rel, the addend A is implicitly the value already stored in the instruction/data at P before patching.
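Unpacking one raw Elf64_Rela entry is just bit surgery on r_info; a sketch (parse_rela is a made-up helper, entry is 24 raw bytes from a .rela.* section):

import struct

def parse_rela(entry: bytes):
    # Elf64_Rela: r_offset (8 bytes), r_info (8 bytes), r_addend (8 bytes, signed)
    r_offset, r_info, r_addend = struct.unpack('<QQq', entry)
    sym_index = r_info >> 32          # ELF64_R_SYM(r_info)
    r_type = r_info & 0xffffffff      # ELF64_R_TYPE(r_info)
    return r_offset, sym_index, r_type, r_addend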
Running code on CPU
Let’s look at an example:
elftest.c
float foo(int x) { return x + 12345678.f; } // I use float because an int const stays in .text for some reason
!clang -c -x c -march=native --target=x86_64-none-unknown-elf -O0 -fPIC \
-ffreestanding -fno-math-errno -fno-ident -nostdlib elftest.c -o elftest.o
!objdump -r elftest.o
elftest.o: file format elf64-x86-64
RELOCATION RECORDS FOR [.text]:
OFFSET TYPE VALUE
0000000000000010 R_X86_64_PC32 .LCPI0_0-0x0000000000000004
We have a single relocation entry; it’s for the float constant that went into .rodata.
!objdump -d elftest.o
elftest.o: file format elf64-x86-64
Disassembly of section .text:
0000000000000000 <foo>:
0: 55 push %rbp
1: 48 89 e5 mov %rsp,%rbp
4: 89 7d fc mov %edi,-0x4(%rbp)
7: c5 fa 2a 45 fc vcvtsi2ssl -0x4(%rbp),%xmm0,%xmm0
c: c5 fa 10 0d 00 00 00 vmovss 0x0(%rip),%xmm1 # 14 <foo+0x14>
13: 00
14: c5 fa 58 c1 vaddss %xmm1,%xmm0,%xmm0
18: 5d pop %rbp
19: c3 ret
You can see that for the vmovss instruction, the address part is currently 00 00 00 00. The address part is located at 0x10, which matches the offset in the relocation table.
Let’s look at the .rodata.cst4 section. The .cst4 suffix means this .rodata section is for 4-byte constants.
!objdump -s -j .rodata.cst4 elftest.o
elftest.o: file format elf64-x86-64
Contents of section .rodata.cst4:
0000 4e613c4b Na<K
Must be the float
import numpy as np
hex(np.float32(12345678.).view(np.uint32))
'0x4b3c614e'
Yes it is.
The bytes are in reverse order because x86 is a little-endian architecture.
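(The same check without numpy, using struct on the hex dump above:)

import struct
struct.unpack('<f', bytes.fromhex('4e613c4b'))[0]  # little-endian float32 -> 12345678.0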
Let’s load it with elf_loader
from tinygrad.runtime.support.elf import elf_loader, relocate

with open('elftest.o', 'rb') as f:
    elf_bytes = f.read()
image, sections, relocs = elf_loader(elf_bytes)

print(f"Image size: {len(image)} bytes")
print("\nSections:")
for i, section in enumerate(sections):
    print(f" {i}: {section.name} (size: {section.header.sh_size}, addr: {hex(section.header.sh_addr)})")
if relocs:
    print("\nRelocations:")
    for i, (offset, target, r_type, addend) in enumerate(relocs):
        print(f" {i}: offset={hex(offset)}, target={hex(target)}, type={r_type}, addend={addend}")
Image size: 32 bytes
Sections:
0: (size: 0, addr: 0x0)
1: .strtab (size: 94, addr: 0x0)
2: .text (size: 26, addr: 0x0)
3: .rela.text (size: 24, addr: 0x0)
4: .rodata.cst4 (size: 4, addr: 0x1c)
5: .note.GNU-stack (size: 0, addr: 0x20)
6: .llvm_addrsig (size: 0, addr: 0x0)
7: .symtab (size: 96, addr: 0x0)
Relocations:
0: offset=0x10, target=0x1c, type=2, addend=-4
Disassemble the code in the image:
from tinygrad.helpers import capstone_flatdump
capstone_flatdump(image[:0x1a])
0x000000: push rbp
0x000001: mov rbp, rsp
0x000004: mov dword ptr [rbp - 4], edi
0x000007: vcvtsi2ss xmm0, xmm0, dword ptr [rbp - 4]
0x00000c: vmovss xmm1, dword ptr [rip]
0x000014: vaddss xmm0, xmm0, xmm1
0x000018: pop rbp
0x000019: ret
The .rodata was added just after .text (with 2 bytes of padding to align it on a 4-byte boundary).
image[0x1c:0x20].hex()
'4e613c4b'
Now we can apply the relocations. This is pretty much what elf.py:jit_loader does.
Note: jit_loader is not directly related to TinyJit. JIT is an overloaded term, and we compile the kernels … just in time.
import struct
for ploc,tgt,r_type,r_addend in relocs:
    print(f"Relocating at address {hex(ploc)}, PC at {hex(ploc+4)}")
    print(f"Before: 0x{image[ploc:ploc+4].hex()}")
    image[ploc:ploc+4] = struct.pack("<I", relocate(struct.unpack("<I", image[ploc:ploc+4])[0], ploc, tgt+r_addend, r_type))
    print(f"After: 0x{image[ploc:ploc+4].hex()}")
Relocating at address 0x10, PC at 0x14
Before: 0x00000000
After: 0x08000000
The 0x08000000 is actually just 8, keep the endianness in mind. While we are patching at address 0x10, the PC must already be at the next instruction, 0x14, skipping the 4 bytes of 00 00 00 00.
hex(0x14 + 8)
'0x1c'
And this is indeed the address of the constant. Now, let’s run the code. This is normally done in device.py:CPUProgram.__init__
from tinygrad.helpers import mv_address
from tinygrad.device import CPUProgram
import ctypes
from mmap import mmap, PROT_READ, PROT_WRITE, PROT_EXEC, MAP_ANON, MAP_PRIVATE

mem = mmap(-1, len(image), MAP_ANON | MAP_PRIVATE, PROT_READ | PROT_WRITE | PROT_EXEC)
mem.write(image)
CPUProgram.rt_lib["__clear_cache"](ctypes.c_void_p(mv_address(mem)), ctypes.c_void_p(mv_address(mem) + len(image)))
fxn = ctypes.CFUNCTYPE(ctypes.c_float)(mv_address(mem))
fxn(ctypes.c_int32(321))  # CPUProgram.__call__
12345999.0
- Allocate a piece of memory and make it executable. We can’t just mark our image as executable, because the flags are applied to memory pages (4096 bytes), and our image is not aligned to a page boundary.
- Copy the image (lib) into that memory.
- Clear the instruction cache. I’m not entirely sure how the instruction cache works on different architectures; I guess this might be required at least on some of them. Skipping this step did not cause issues for me.
- Wrap it in a python function using ctypes and call it.
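As an aside, CFUNCTYPE also accepts explicit argument types, in which case ctypes converts plain Python ints for you (a sketch reusing mem from above; fxn2 is a made-up name):

fxn2 = ctypes.CFUNCTYPE(ctypes.c_float, ctypes.c_int32)(mv_address(mem))
fxn2(321)  # -> 12345999.0, same result without manual c_int32 wrapping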
NV backend
Now, we also use the same loader with the NV backend without relying on the CUDA Runtime library to load it. Let’s take a look at ops_nv.py:NVProgram.
image, sections, relocs = elf_loader(self.lib, force_section_align=128)
self.lib_gpu = self.dev.allocator.alloc(round_up(image.nbytes, 0x1000) + 0x1000, BufferSpec(cpu_access=True))
Load the ELF and allocate a buffer on the GPU.
Note: We use CUDA Unified Memory, so the buffer is mapped at the same virtual address on both the host and the GPU. I did not know this was even possible.
self.prog_addr, self.prog_sz, self.regs_usage, self.shmem_usage, self.lcmem_usage = self.lib_gpu.va_addr, image.nbytes, 0, 0x400, 0
self.constbufs: dict[int, tuple[int, int]] = {0: (0, 0x160)} # dict[constbuf index, tuple[va_addr, size]]
for sh in sections:
  if sh.name == f".nv.shared.{self.name}": self.shmem_usage = round_up(0x400 + sh.header.sh_size, 128)
  if sh.name == f".text.{self.name}":
    self.prog_addr, self.prog_sz, self.regs_usage = self.lib_gpu.va_addr+sh.header.sh_addr, sh.header.sh_size, max(sh.header.sh_info>>24, 16)
  elif m:=re.match(r'\.nv\.constant(\d+)', sh.name): self.constbufs[int(m.group(1))] = (self.lib_gpu.va_addr+sh.header.sh_addr, sh.header.sh_size)
  elif sh.name.startswith(".nv.info"):
    for typ, param, data in self._parse_elf_info(sh):
      if sh.name == f".nv.info.{self.name}" and param == 0xa: cbuf0_size = struct.unpack_from("IH", data)[1] # EIATTR_PARAM_CBANK
      elif sh.name == ".nv.info" and param == 0x12: self.lcmem_usage = struct.unpack_from("II", data)[1] + 0x240 # EIATTR_MIN_STACK_SIZE
Extract information from the Nvidia-specific headers, like the shared memory size, register usage, etc. of the kernel.
for apply_image_offset, rel_sym_offset, typ, _ in relocs:
  # These types are CUDA-specific, applying them here
  if typ == 2: image[apply_image_offset:apply_image_offset+8] = struct.pack('<Q', self.lib_gpu.va_addr + rel_sym_offset) # R_CUDA_64
  elif typ == 0x38: image[apply_image_offset+4:apply_image_offset+8] = struct.pack('<I', (self.lib_gpu.va_addr + rel_sym_offset) & 0xffffffff)
  elif typ == 0x39: image[apply_image_offset+4:apply_image_offset+8] = struct.pack('<I', (self.lib_gpu.va_addr + rel_sym_offset) >> 32)
  else: raise RuntimeError(f"unknown NV reloc {typ}")
Apply relocations. Nvidia has its own relocation types for the different addressing modes on the GPU: type 2 (R_CUDA_64) stores a full 64-bit address, while 0x38 and 0x39 store the low and high 32-bit halves for instructions that encode an address in two parts.
Note that va_addr is valid on both the host and the GPU.
ctypes.memmove(self.lib_gpu.va_addr, mv_address(image), image.nbytes)
Copy the relocated image into the buffer.
A bit more Nvidia magic needs to happen to kick off the execution, but you get the idea.