Leetcode-ing with Assembly: Add Two Numbers
I was bored yesterday, so I decided to play with Leetcode, but by coding it in assembly. I initially got inspired by the Indonesian tech discourse to solve the simple “Add Two Numbers” problem using assembly (Link to tweet). Then it comes to my mind, how if I solve other problems with assembly as well. Then I delved into the rabbit hole of playing with the x86 instruction sets.
I am trying to solve this linked list problem called Add Two Numbers
Problem Description
We need to implement a function that accepts two linked lists, let’s call them A and B. For each node in lists A and B, the function must perform base-10 addition, meaning the first node in list A is added to the first node in list B, and so on until the end of the list. The addition must only produce a single-digit base-10 value. If the addition results in two digits, the carry must be added to the next node. It is basically a regular base-10 addition but the number is represented as a linked list.
Solution
Writing Assembly in C
Since there is no option to code in assembly directly, we use inline assembly in the C code. Here’s the placeholder that extends from the template provided by Leetcode. The template already provided the function declaration and we just need to define it. I removed the definition and made it just a declaration/prototype, as I am implementing the function directly in assembly. The assembly code comes in the subsequent line with the __asm__
block. See that we can also use verbatim string (R"()"
quote) so we can just directly write proper assembly within the C source code.
Function Definition
We need to define the function name. In assembly, a function name is just a label. You can just basically treat any label as a function. But it is important to define the label as a specific type and introduce it for global linkage. In this way, the compiler and linker will know that this label is intended to be linked in the later stage of compilation. We do this by using assembly pseudo instruction .glob
and .type
.globl addTwoNumbers .type addTwoNumbers,@function addTwoNumbers: ...
Function Prolog and Epilog
Compilers and operating systems agree on the Application Binary Interface (ABI) to enable interoperability between programs. One aspect in the ABI is calling convention. A calling convention defines how the function should be called, receives values from its caller, limitations in messing around within the processor, and communicates back the result of the operation. There are several calling conventions that may be different from machines, operating systems, and runtime systems. x86 calling convention is different with ARM machines. 32-bit and 64-bit machines also have different calling conventions, also between Windows and Linux/Unix/POSIX. Calling convention is not enforced in any way by the machine, but a well-behaved program must adhere to the convention. Failing to do so may make your program misbehave.
As Leetcode compiles and executes the program on a 64-bit x86 Linux machine, we can assume that the function will have to follow the System V AMD64 ABI calling convention. This calling convention adopts Microsoft’s fastcall convention from the 32-bit era to put some of the function parameters on the register, instead of to the stack like the previous cdecl convention. This is to leverage the higher number of registers available in x86-64 architecture, and it is definitely faster to pass value through register than through stack (which is in memory).
In AMD64 ABI, the function arguments are passed via RDI, RSI, RDX, RCX, R8, and R9 registers as needed. If there is a floating point argument, it is stored in XMM0 to XMM7 registers. If the registers cannot contain all required parameters, then the function argument is stored in the stack. The addTwoNumbers function has four integer/pointer parameters. Therefore, it can be stored entirely in the registers. RDI stores the pointer to the first linked list, while RSI stores the pointer to the second linked list.
The function is allowed to use any of the parameter registers for its internal operation. In addition to that, the function can also use R10 and R11 registers without having to respect its caller. However, if the function requires to use more registers, it has to respect its caller. Those registers are by convention owned by the caller function, hence the callee must preserve any of its previous value before using the register, and restore its value when the control is given back to the caller.
In my solution, I decide to use RDX, RCX, R8, and R9 as the scratch register for its internal. Those registers are not required to be preserved. So, it is up to the caller if they still require the value from that register. Don’t forget to also initialize the register value, such as setting the value to 0, as the previous register is a garbage value from the caller function. To initialize value, we can use xor
instruction to itself.
.globl addTwoNumbers .type addTwoNumbers,@function addTwoNumbers: xor %rdx, %rdx xor %r8, %r8 xor %rcx, %rcx xor %r9, %r9 exit: ... ret .data .align 8 .globl linkedListRet linkedListRet: .zero 1600 .text
Preparing for Return Value
Our function needs to return a pointer to the resulting linked list. By calling convention, the return value needs to be stored in the RAX register. But the problem is: what value will you store there?
A linked list is typically known as a dynamic data structure. Many linked list implementations use dynamically allocated memory using malloc. But to solve this simple fizz-buzz challenge, I am too lazy to use malloc as it creates more hassle. Also, the problem statement does not mandate that the return value must be a dynamically allocated pointer. So, let’s just allocate some static space beforehand to store the linked list.
According to the ListNode definition in the solution template, a linked list node requires 16 bytes of space. Since the problem statement mentioned that the maximum list is 100, so we only requires to allocate 1600 bytes of memory to store all the linked list nodes. We can do this by allocating it to a data section in the assembly. I put the label linkedListRet
to refer to the memory location and initialized 1600 bytes of zeros to it. We then can store the address to linkedListRet
to the register RAX for our use as well as the return value. The allocated space acts as an array of ListNode, although this fact is unknown from the perspective of the caller.
.globl addTwoNumbers .type addTwoNumbers,@function addTwoNumbers: xor %rdx, %rdx xor %r8, %r8 xor %rcx, %rcx xor %r9, %r9 lea linkedListRet(%rip), %rax ... exit: ... ret .data .align 8 .globl linkedListRet linkedListRet: .zero 1600 .text
Walking the List
The algorithm is basically simple: walk the two lists at the same time until both of them reaches the end of list, add the numbers from each node, then insert a new node to the return value. The first basic block load_first_list
loads the value from the first list if the pointer to the linked list is not null. Otherwise, it will skip to load and add the value from the second linked list through the load_second_list
basic block. At the same time, we move the linked list to the next node. Remember that the linked list pointer is stored in RDI and RSI registers respectively, as they come from the function parameter. To access the linked list node, we just need to dereference the pointer in those registers.
load_first_list: xor %edx, %edx test %rdi, %rdi jz load_second_list mov (%rdi), %edx mov 8(%rdi), %rdi load_second_list: test %rsi, %rsi jz store_list add (%rsi), %edx mov 8(%rsi), %rsi
The next basic block, compute_sum
, computes the addition result to a new node for the return value. It checks if the result of the addition is above 10, which indicates it needs to store a carry for the next node. To compute the last digit, we simply subtract the addition result with 10. This is way simpler than using modulo 10 as division operation is way more expensive. We leverage the fact that the result of the addition may never exceed 19 as the value in the linked list only contains a single digit, of which the maximum is 9. We store back the indicator to indicate a carry value in the R9 register, which is added to the result beforehand. If the addition does not produce a carry, the control flow jumps directly to store_node
label, where the storing is performed.
compute_sum: add %r9, %rdx xor %r9, %r9 cmp $10, %edx jb store_node mov $1, %r9 sub $10, %edx
As mentioned before, EAX contains an array of linked list nodes. The index of the current element is stored in the R8 register. We first store the value of the current digit in the new node currently pointed by the index. If both linked lists have been walked entirely, indicated by both pointers pointing to 0, it jumps out of the loop to the loop_exit
label. Otherwise, it obtains the address of the next row of the array for the next linked list node, and stores it as the next node pointer in the current node. The loop then continues back to the load_first_list
block.
store_node: mov %edx, (%rax, %r8, 8) cmpq %rdi, %rsi je loop_exit lea 16(%rax,%r8,8), %rcx mov %rcx, 8(%rax, %r8, 8) add $2, %r8 jmp load_first_list
The final loop_exit
block checks if there is a remaining carry to be put in the node. The last summation may still produce a carry, which then must also be put in the last node of the linked list. This loop_exit
block performs final insertion if required, otherwise, it will set the next node pointer of the final node to 0, indicating a null pointer.
loop_exit: test %r9, %r9 jz exit lea 16(%rax,%r8,8), %rcx mov %rcx, 8(%rax, %r8, 8) add $2, %r8 mov %r9, (%rax, %r8, 8) exit: movq $0, 8(%rax, %r8, 8) ret