import os
# os.environ["DEBUG"] = "7"
"CACHELEVEL"] ="0"
os.environ["NOOPT"] = "1" os.environ[
Misc 1 - elf.py and the ELF format
ELF is the standard format for executables and libraries, as well as intermediate object files (.o) on Linux.
TinyGrad uses it for 2 somewhat distinct roles:
- When it generates CPU code (clang or LLVM IR backends), it then needs to load the generated binary code into memory to run it.
We could have built a shared library (.so), but then I think we’d have to save it to use dlopen(), and it would not be portable anyway. So instead of this, TinyGrad only does the compilation step (generates a .o) and implements a minimal ELF loader.
@uuuuvn also pointed out that doing a proper linking step is slow, and there is a bug in the OSX linker that can’t output to stdout.
- When it generates CUDA code, it also comes out as ELF. The sections and relocation rules are different from those found on Linux, but the format is the same. The CUDA Runtime library would be able to load this object file, and that’s what we do for the CUDA backend, but the NV backend does not rely on the CUDA Runtime libraries, so we parse the ELF manually.
The loader and relocation code for host (x86_64 and ARM 64) platforms is implemented in tinygrad/runtime/support/elf.py
Since we only deal with self-contained object files (they don’t access data/functions from other files), the task is less daunting than you might think.
Let’s first look at the ELF format, at least the bits relevant here:
ELF format
ELF Header (Elf64_Ehdr)
- Located at the very beginning of the file.
- Contains essential metadata:
  - e_ident: Magic number (\x7fELF) and info like 64-bit (ELFCLASS64), endianness (ELFDATA2LSB), OS ABI.
  - e_type: File type (e.g., ET_REL for relocatable/object file, ET_EXEC for executable, ET_DYN for shared object). elf.py expects ET_REL.
  - e_machine: Target architecture (e.g., EM_X86_64, EM_AARCH64).
  - e_shoff: Offset to the Section Header Table.
  - e_shnum: Number of entries in the Section Header Table.
  - e_shstrndx: Index of the section header table entry that contains section names.
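To make these offsets concrete, here is a minimal sketch that pulls the header fields out of an object file with plain struct (elf.py itself uses proper struct definitions rather than manual offsets; elftest.o is the file we’ll compile further below):

import struct

with open('elftest.o', 'rb') as f:  # built in the "Running code on CPU" section
    data = f.read()

assert data[:4] == b'\x7fELF'          # e_ident magic
assert data[4] == 2 and data[5] == 1   # ELFCLASS64, ELFDATA2LSB
e_type, e_machine = struct.unpack_from('<HH', data, 16)   # ET_REL == 1, EM_X86_64 == 62
e_shoff, = struct.unpack_from('<Q', data, 40)             # offset of the Section Header Table
e_shentsize, e_shnum, e_shstrndx = struct.unpack_from('<HHH', data, 58)
print(f"type={e_type} machine={e_machine} shoff={e_shoff} shnum={e_shnum} shstrndx={e_shstrndx}")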
Section Header Table:
- An array of Elf64_Shdr structures.
- Each entry describes a section in the file.
- Key Elf64_Shdr fields used by elf.py:
  - sh_name: Offset into the Section Name String Table (.shstrtab) for this section’s name.
  - sh_type: Type of section (e.g., SHT_PROGBITS for code/data, SHT_SYMTAB for symbols, SHT_RELA/SHT_REL for relocations, SHT_STRTAB for strings, SHT_NOBITS for .bss).
  - sh_flags: Attributes like SHF_ALLOC (load into memory), SHF_WRITE, SHF_EXECINSTR.
  - sh_addr: The intended virtual memory address during execution. For .o files (type ET_REL), this is often 0 for most sections, as the final address isn’t known yet. elf.py updates this for sections it places.
  - sh_offset: Offset of the section’s content within the ELF file itself.
  - sh_size: Size of the section’s content in the file (or memory size for SHT_NOBITS).
  - sh_link, sh_info: Used by specific section types (e.g., .symtab uses sh_link to point to its string table).
  - sh_addralign: Required alignment constraint for the section’s start address.
  - sh_entsize: Size of each entry if the section holds a table (like a symbol table or relocation table).
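Continuing the header sketch above (same caveats), we can walk the section header table and resolve each section’s name through .shstrtab:

# Elf64_Shdr is 64 bytes; field order: sh_name, sh_type, sh_flags, sh_addr,
# sh_offset, sh_size, sh_link, sh_info, sh_addralign, sh_entsize.
shdrs = [struct.unpack_from('<IIQQQQIIQQ', data, e_shoff + i * e_shentsize) for i in range(e_shnum)]
shstrtab_off = shdrs[e_shstrndx][4]  # sh_offset of the section name string table
def sec_name(off): return data[shstrtab_off + off:data.index(b'\0', shstrtab_off + off)].decode()
for sh in shdrs: print(f"{sec_name(sh[0]):16} type={sh[1]} size={sh[5]} addralign={sh[8]}")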
Sections:
- Contiguous chunks of bytes (or just metadata for SHT_NOBITS) described by the Section Header Table.
- Common sections relevant here:
  - .text: Executable code (SHT_PROGBITS).
  - .data: Initialized global/static variables (SHT_PROGBITS).
  - .rodata: Read-only data (constants, strings) (SHT_PROGBITS).
  - .bss: Uninitialized global/static variables (SHT_NOBITS). Occupies no space in the file but needs space allocated in memory. (elf.py doesn’t explicitly handle .bss size allocation based on sh_size, it only lays out SHT_PROGBITS).
  - .symtab: Symbol Table (SHT_SYMTAB). Lists defined and referenced symbols (functions, variables).
  - .strtab: String Table (SHT_STRTAB). Stores names for symbols.
  - .shstrtab: Section Header String Table (SHT_STRTAB). Stores names for sections.
  - .rela.text, .rel.text, etc.: Relocation sections (SHT_RELA, SHT_REL). Contain instructions for patching code/data.
Relocations
Even for a self-contained file, the compiler generates an object file with symbols (constants, global vars, functions) that don’t have their final location assigned. This code might:
- Access a function defined elsewhere in the same file (e.g., a helper function in another part of .text).
- Reference constants or static data (e.g., accessing a string literal in .rodata or a static array in .data).
When generating CPU code, PC-relative addressing is used, so the code does not care where it will be placed in memory for execution. Still, the code will reference stuff in .rodata, so it needs to know that address.
TinyGrad (at least for now) does not generate code with subroutines or global variables, which makes our task easier.
Relocations are metadata entries that tell our loader how to patch the machine code once we’ve determined where each section will be placed in memory. For self-contained files, we only need to resolve references within the same object file, not external dependencies.
There are 2 types of relocation entries, REL and RELA:
- For REL, the patch site already contains the addend (e.g., an offset relative to the Program Counter (PC)), so we read that value, combine it with the symbol’s final address, and patch the instruction.
- For RELA, the patch site contains zeros (or any value, really), and the RELA entry carries the addend explicitly. We combine it with the symbol’s address and overwrite the value at the patch site.
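A hypothetical sketch of the difference (patch_pc32_* are made-up names; the real math lives in relocate in elf.py):

import struct

def patch_pc32_rela(image: bytearray, P: int, S: int, A: int):
    # RELA: the addend A comes from the relocation entry; store S + A - P at P.
    struct.pack_into('<i', image, P, S + A - P)

def patch_pc32_rel(image: bytearray, P: int, S: int):
    # REL: the addend is whatever already sits at the patch site, so read it first.
    A = struct.unpack_from('<i', image, P)[0]
    struct.pack_into('<i', image, P, S + A - P)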
Relocation Sections:
- Contain arrays of relocation entries.
- The section name indicates which other section the relocations apply to (e.g., .rela.text contains patches for the .text section). elf.py uses sh.name[5:] or sh.name[4:] to find the target section name.
- There are two main types:
  - SHT_RELA: Entries contain an explicit addend. (Elf64_Rela)
  - SHT_REL: Entries have an implicit addend (usually the value already present at the location being patched). (Elf64_Rel)
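That name slicing is just stripping the .rela/.rel prefix, roughly (sh being the relocation section):

target_name = sh.name[5:] if sh.name.startswith('.rela') else sh.name[4:]  # ".rela.text" -> ".text", ".rel.text" -> ".text"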
Relocation Entries (Elf64_Rela / Elf64_Rel):
Each entry describes a single patch. Key fields:
- r_offset: The offset within the target section where the patch needs to be applied. Let’s call the final address P = section_base + r_offset.
- r_info: A combined field containing:
  - Symbol Index (ELF64_R_SYM(r_info)): An index into the .symtab section. This identifies the symbol whose address is needed for the calculation. Let the symbol’s final address be S.
  - Relocation Type (ELF64_R_TYPE(r_info)): Specifies how to calculate the value to be patched and how to insert it at P. This is architecture-specific. Examples:
    - For x86_64, we only support R_X86_64_PC32: calculate S + A - P (Symbol + Addend - PatchLocation) and store the 32-bit result.
    - For ARM64, we have more options, and we actually have to patch some bits in the target instruction.
- r_addend (Elf64_Rela only): An explicit constant value (A) to be used in the relocation calculation (e.g., S + A - P). For Elf64_Rel, the addend A is implicitly the value already stored in the instruction/data at P before patching.
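Unpacking one raw Elf64_Rela entry is just bit surgery on r_info; a sketch (parse_rela is a made-up helper, entry is 24 raw bytes from a .rela.* section):

import struct

def parse_rela(entry: bytes):
    # Elf64_Rela: r_offset (8 bytes), r_info (8 bytes), r_addend (8 bytes, signed)
    r_offset, r_info, r_addend = struct.unpack('<QQq', entry)
    sym_index = r_info >> 32          # ELF64_R_SYM(r_info)
    r_type = r_info & 0xffffffff      # ELF64_R_TYPE(r_info)
    return r_offset, sym_index, r_type, r_addend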
Running code on CPU
Let’s look at an example:
elftest.c
float foo(int x) { return x + 12345678.f; } // I use float because an int const stays in .text for some reason
!clang -c -x c -march=native --target=x86_64-none-unknown-elf -O0 -fPIC \
-ffreestanding -fno-math-errno -fno-ident -nostdlib elftest.c -o elftest.o
!objdump -r elftest.o
elftest.o: file format elf64-x86-64
RELOCATION RECORDS FOR [.text]:
OFFSET TYPE VALUE
0000000000000010 R_X86_64_PC32 .LCPI0_0-0x0000000000000004
We have a single relocation entry; it’s for the float constant that went into .rodata.
!objdump -d elftest.o
elftest.o: file format elf64-x86-64
Disassembly of section .text:
0000000000000000 <foo>:
0: 55 push %rbp
1: 48 89 e5 mov %rsp,%rbp
4: 89 7d fc mov %edi,-0x4(%rbp)
7: c5 fa 2a 45 fc vcvtsi2ssl -0x4(%rbp),%xmm0,%xmm0
c: c5 fa 10 0d 00 00 00 vmovss 0x0(%rip),%xmm1 # 14 <foo+0x14>
13: 00
14: c5 fa 58 c1 vaddss %xmm1,%xmm0,%xmm0
18: 5d pop %rbp
19: c3 ret
You can see that for the vmovss instruction, the address part is currently 00 00 00 00. The address part is located at 0x10, which matches the offset in the relocation table.
Let’s look at the .rodata.cst4 section. The .cst4 suffix means this .rodata section is for 4-byte constants.
!objdump -s -j .rodata.cst4 elftest.o
elftest.o: file format elf64-x86-64
Contents of section .rodata.cst4:
0000 4e613c4b Na<K
Must be the float
import numpy as np
hex(np.float32(12345678.).view(np.uint32))
'0x4b3c614e'
Yes it is.
The bytes are in reverse order because x86 is a little-endian architecture.
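(The same check without numpy, using struct on the hex dump above:)

import struct
struct.unpack('<f', bytes.fromhex('4e613c4b'))[0]  # little-endian float32 -> 12345678.0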
Let’s load it with elf_loader
from tinygrad.runtime.support.elf import elf_loader, relocate

with open('elftest.o', 'rb') as f:
    elf_bytes = f.read()
image, sections, relocs = elf_loader(elf_bytes)

print(f"Image size: {len(image)} bytes")
print("\nSections:")
for i, section in enumerate(sections):
    print(f" {i}: {section.name} (size: {section.header.sh_size}, addr: {hex(section.header.sh_addr)})")
if relocs:
    print("\nRelocations:")
    for i, (offset, target, r_type, addend) in enumerate(relocs):
        print(f" {i}: offset={hex(offset)}, target={hex(target)}, type={r_type}, addend={addend}")
Image size: 32 bytes
Sections:
0: (size: 0, addr: 0x0)
1: .strtab (size: 94, addr: 0x0)
2: .text (size: 26, addr: 0x0)
3: .rela.text (size: 24, addr: 0x0)
4: .rodata.cst4 (size: 4, addr: 0x1c)
5: .note.GNU-stack (size: 0, addr: 0x20)
6: .llvm_addrsig (size: 0, addr: 0x0)
7: .symtab (size: 96, addr: 0x0)
Relocations:
0: offset=0x10, target=0x1c, type=2, addend=-4
Disassemble the code in the image:
from tinygrad.helpers import capstone_flatdump
capstone_flatdump(image[:0x1a])
0x000000: push rbp
0x000001: mov rbp, rsp
0x000004: mov dword ptr [rbp - 4], edi
0x000007: vcvtsi2ss xmm0, xmm0, dword ptr [rbp - 4]
0x00000c: vmovss xmm1, dword ptr [rip]
0x000014: vaddss xmm0, xmm0, xmm1
0x000018: pop rbp
0x000019: ret
The .rodata was added just after .text (with 2 bytes of padding to align it on a 4-byte boundary).
image[0x1c:0x20].hex()
'4e613c4b'
Now we can apply the relocations. This is pretty much what elf.py:jit_loader does.
Note: jit_loader is not directly related to TinyJit. JIT is an overloaded term, and we compile the kernels … just in time.
import struct
for ploc,tgt,r_type,r_addend in relocs:
    print(f"Relocating at address {hex(ploc)}, PC at {hex(ploc+4)}")
    print(f"Before: 0x{image[ploc:ploc+4].hex()}")
    image[ploc:ploc+4] = struct.pack("<I", relocate(struct.unpack("<I", image[ploc:ploc+4])[0], ploc, tgt+r_addend, r_type))
    print(f"After: 0x{image[ploc:ploc+4].hex()}")
Relocating at address 0x10, PC at 0x14
Before: 0x00000000
After: 0x08000000
The 0x08000000 is actually just 8, keep the endianness in mind. While we are patching at address 0x10, the PC must already be at the next instruction, 0x14, skipping the 4 bytes of 00 00 00 00.
hex(0x14 + 8)
'0x1c'
And this is indeed the address of the constant. Now, let’s run the code. This is normally done in device.py:CPUProgram.__init__
from tinygrad.helpers import mv_address
from tinygrad.device import CPUProgram
import ctypes
from mmap import mmap, PROT_READ, PROT_WRITE, PROT_EXEC, MAP_ANON, MAP_PRIVATE

mem = mmap(-1, len(image), MAP_ANON | MAP_PRIVATE, PROT_READ | PROT_WRITE | PROT_EXEC)
mem.write(image)
CPUProgram.rt_lib["__clear_cache"](ctypes.c_void_p(mv_address(mem)), ctypes.c_void_p(mv_address(mem) + len(image)))
fxn = ctypes.CFUNCTYPE(ctypes.c_float)(mv_address(mem))
fxn(ctypes.c_int32(321))  # CPUProgram.__call__
12345999.0
- Allocate a piece of memory and make it executable. We can’t just mark our image as executable, because the flags are applied to memory pages (4096 bytes), and our image is not aligned to a page boundary.
- Copy the image (lib) into that memory.
- Clear the instruction cache. I’m not entirely sure how the instruction cache works on different architectures; I guess this might be required at least on some of them. Skipping this step did not cause issues for me.
- Wrap it in a python function using ctypes and call it.
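As an aside, CFUNCTYPE also accepts explicit argument types, in which case ctypes converts plain Python ints for you (a sketch reusing mem from above; fxn2 is a made-up name):

fxn2 = ctypes.CFUNCTYPE(ctypes.c_float, ctypes.c_int32)(mv_address(mem))
fxn2(321)  # -> 12345999.0, same result without manual c_int32 wrapping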
NV backend
Now, we also use the same loader with the NV backend without relying on the CUDA Runtime library to load it. Let’s take a look at ops_nv.py:NVProgram.
image, sections, relocs = elf_loader(self.lib, force_section_align=128)
self.lib_gpu = self.dev.allocator.alloc(round_up(image.nbytes, 0x1000) + 0x1000, BufferSpec(cpu_access=True))
Load the ELF and allocate a buffer on the GPU.
Note: We use CUDA Unified Memory, so the buffer is mapped at the same virtual address on both the host and the GPU. I did not know this was even possible.
self.prog_addr, self.prog_sz, self.regs_usage, self.shmem_usage, self.lcmem_usage = self.lib_gpu.va_addr, image.nbytes, 0, 0x400, 0
self.constbufs: dict[int, tuple[int, int]] = {0: (0, 0x160)} # dict[constbuf index, tuple[va_addr, size]]
for sh in sections:
  if sh.name == f".nv.shared.{self.name}": self.shmem_usage = round_up(0x400 + sh.header.sh_size, 128)
  if sh.name == f".text.{self.name}":
    self.prog_addr, self.prog_sz, self.regs_usage = self.lib_gpu.va_addr+sh.header.sh_addr, sh.header.sh_size, max(sh.header.sh_info>>24, 16)
  elif m:=re.match(r'\.nv\.constant(\d+)', sh.name): self.constbufs[int(m.group(1))] = (self.lib_gpu.va_addr+sh.header.sh_addr, sh.header.sh_size)
  elif sh.name.startswith(".nv.info"):
    for typ, param, data in self._parse_elf_info(sh):
      if sh.name == f".nv.info.{self.name}" and param == 0xa: cbuf0_size = struct.unpack_from("IH", data)[1] # EIATTR_PARAM_CBANK
      elif sh.name == ".nv.info" and param == 0x12: self.lcmem_usage = struct.unpack_from("II", data)[1] + 0x240 # EIATTR_MIN_STACK_SIZE
Extract information from the Nvidia-specific headers, like the shared memory size, register usage, etc. of the kernel.
for apply_image_offset, rel_sym_offset, typ, _ in relocs:
  # These types are CUDA-specific, applying them here
  if typ == 2: image[apply_image_offset:apply_image_offset+8] = struct.pack('<Q', self.lib_gpu.va_addr + rel_sym_offset) # R_CUDA_64
  elif typ == 0x38: image[apply_image_offset+4:apply_image_offset+8] = struct.pack('<I', (self.lib_gpu.va_addr + rel_sym_offset) & 0xffffffff)
  elif typ == 0x39: image[apply_image_offset+4:apply_image_offset+8] = struct.pack('<I', (self.lib_gpu.va_addr + rel_sym_offset) >> 32)
  else: raise RuntimeError(f"unknown NV reloc {typ}")
Apply relocations. Nvidia has its own relocation types for the different addressing modes on the GPU: type 2 (R_CUDA_64) stores a full 64-bit address, while 0x38 and 0x39 store the low and high 32-bit halves for instructions that encode an address in two parts.
Note that va_addr is valid on both the host and the GPU.
ctypes.memmove(self.lib_gpu.va_addr, mv_address(image), image.nbytes)
Copy the relocated image into the buffer.
A bit more Nvidia magic needs to happen to kick off the execution, but you get the idea.