Leetcode-ing with Assembly: Add Two Numbers

I was bored yesterday, so I decided to play with Leetcode, but by coding it in assembly. I initially got inspired by the Indonesian tech discourse to solve the simple “Add Two Numbers” problem using assembly (Link to tweet). Then it comes to my mind, how if I solve other problems with assembly as well. Then I delved into the rabbit hole of playing with the x86 instruction sets.

I am trying to solve this linked list problem called Add Two Numbers

Problem Description

We need to implement a function that accepts two linked lists, let’s call them A and B. For each node in lists A and B, the function must perform base-10 addition, meaning the first node in list A is added to the first node in list B, and so on until the end of the list. The addition must only produce a single-digit base-10 value. If the addition results in two digits, the carry must be added to the next node. It is basically a regular base-10 addition but the number is represented as a linked list.

Solution

Writing Assembly in C

Since there is no option to code in assembly directly, we use inline assembly in the C code. Here’s the placeholder that extends from the template provided by Leetcode. The template already provided the function declaration and we just need to define it. I removed the definition and made it just a declaration/prototype, as I am implementing the function directly in assembly. The assembly code comes in the subsequent line with the __asm__ block. See that we can also use verbatim string (R"()" quote) so we can just directly write proper assembly within the C source code.

C Code Placeholders
/**
 * Definition for singly-linked list.
 * struct ListNode {
 *     int val;
 *     struct ListNode *next;
 * };
 */
struct ListNode* addTwoNumbers(struct ListNode* l1, struct ListNode* l2);
__asm__(R"( ... )")

Function Definition

We need to define the function name. In assembly, a function name is just a label. You can just basically treat any label as a function. But it is important to define the label as a specific type and introduce it for global linkage. In this way, the compiler and linker will know that this label is intended to be linked in the later stage of compilation. We do this by using assembly pseudo instruction .glob and .type

.globl addTwoNumbers
.type addTwoNumbers,@function
addTwoNumbers:
    ...

Function Prolog and Epilog

Compilers and operating systems agree on the Application Binary Interface (ABI) to enable interoperability between programs. One aspect in the ABI is calling convention. A calling convention defines how the function should be called, receives values from its caller, limitations in messing around within the processor, and communicates back the result of the operation.  There are several calling conventions that may be different from machines, operating systems, and runtime systems. x86 calling convention is different with ARM machines. 32-bit and 64-bit machines also have different calling conventions, also between Windows and Linux/Unix/POSIX. Calling convention is not enforced in any way by the machine, but a well-behaved program must adhere to the convention. Failing to do so may make your program misbehave.

As Leetcode compiles and executes the program on a 64-bit x86 Linux machine, we can assume that the function will have to follow the System V AMD64 ABI calling convention. This calling convention adopts Microsoft’s fastcall convention from the 32-bit era to put some of the function parameters on the register, instead of to the stack like the previous cdecl convention. This is to leverage the higher number of registers available in x86-64 architecture, and it is definitely faster to pass value through register than through stack (which is in memory).

In AMD64 ABI, the function arguments are passed via RDI, RSI, RDX, RCX, R8, and R9 registers as needed. If there is a floating point argument, it is stored in XMM0 to XMM7 registers. If the registers cannot contain all required parameters, then the function argument is stored in the stack. The addTwoNumbers function has four integer/pointer parameters. Therefore, it can be stored entirely in the registers. RDI stores the pointer to the first linked list, while RSI stores the pointer to the second linked list.

The function is allowed to use any of the parameter registers for its internal operation. In addition to that, the function can also use R10 and R11 registers without having to respect its caller. However, if the function requires to use more registers, it has to respect its caller. Those registers are by convention owned by the caller function, hence the callee must preserve any of its previous value before using the register, and restore its value when the control is given back to the caller.

In my solution, I decide to use RDX, RCX, R8, and R9 as the scratch register for its internal. Those registers are not required to be preserved. So, it is up to the caller if they still require the value from that register. Don’t forget to also initialize the register value, such as setting the value to 0, as the previous register is a garbage value from the caller function. To initialize value, we can use xor instruction to itself.

.globl addTwoNumbers
.type addTwoNumbers,@function
addTwoNumbers:   
    xor %rdx, %rdx
    xor %r8, %r8
    xor %rcx, %rcx
    xor %r9, %r9
    
exit:
    ...
    ret
    
.data
.align 8
.globl linkedListRet
linkedListRet:
    .zero 1600
.text

Preparing for Return Value

Our function needs to return a pointer to the resulting linked list. By calling convention, the return value needs to be stored in the RAX register. But the problem is: what value will you store there?

A linked list is typically known as a dynamic data structure. Many linked list implementations use dynamically allocated memory using malloc. But to solve this simple fizz-buzz challenge, I am too lazy to use malloc as it creates more hassle. Also, the problem statement does not mandate that the return value must be a dynamically allocated pointer. So, let’s just allocate some static space beforehand to store the linked list.

According to the ListNode definition in the solution template, a linked list node requires 16 bytes of space. Since the problem statement mentioned that the maximum list is 100, so we only requires to allocate 1600 bytes of memory to store all the linked list nodes. We can do this by allocating it to a data section in the assembly. I put the label linkedListRet to refer to the memory location and initialized 1600 bytes of zeros to it. We then can store the address to linkedListRet to the register RAX for our use as well as the return value. The allocated space acts as an array of ListNode, although this fact is unknown from the perspective of the caller.

Why the struct requires 16 bytes instead of 12 bytes?

The struct ListNode contains a 4-byte integer and a pointer, which at a glance, should require 12-bytes in a 64-bit machine where the pointer size is 8 bytes. But if you check using sizeof operator, it will return 16 instead. Why so? It is because there is a pointer type that is 8-bytes aligned. Since the previous stack member is only 4 bytes long, it needs to add a 4-byte padding to put the pointer member in the next 8-bytes alignment. Alignment defines the requirement where the object should be placed in the memory. Failing to adhere to this alignment results in undefined behavior. Although nowadays machines would accept unaligned memory access, it may result in performance penalties, especially if the misalignment crosses cache boundaries.

.globl addTwoNumbers
.type addTwoNumbers,@function
addTwoNumbers:   
    xor %rdx, %rdx
    xor %r8, %r8
    xor %rcx, %rcx
    xor %r9, %r9
    
    lea linkedListRet(%rip), %rax
    ...
    
exit:
    ...
    ret
    
.data
.align 8
.globl linkedListRet
linkedListRet:
    .zero 1600
.text

Walking the List

The algorithm is basically simple: walk the two lists at the same time until both of them reaches the end of list, add the numbers from each node, then insert a new node to the return value. The first basic block load_first_list loads the value from the first list if the pointer to the linked list is not null. Otherwise, it will skip to load and add the value from the second linked list through the load_second_list basic block. At the same time, we move the linked list to the next node. Remember that the linked list pointer is stored in RDI and RSI registers respectively, as they come from the function parameter. To access the linked list node, we just need to dereference the pointer in those registers.

load_first_list:
    xor %edx, %edx
    test %rdi, %rdi
    jz load_second_list
    mov (%rdi), %edx
    mov 8(%rdi), %rdi
load_second_list:
    test %rsi, %rsi
    jz store_list
    add (%rsi), %edx
    mov 8(%rsi), %rsi

The next basic block, compute_sum, computes the addition result to a new node for the return value. It checks if the result of the addition is above 10, which indicates it needs to store a carry for the next node. To compute the last digit, we simply subtract the addition result with 10. This is way simpler than using modulo 10 as division operation is way more expensive. We leverage the fact that the result of the addition may never exceed 19 as the value in the linked list only contains a single digit, of which the maximum is 9. We store back the indicator to indicate a carry value in the R9 register, which is added to the result beforehand. If the addition does not produce a carry, the control flow jumps directly to store_node label, where the storing is performed.

compute_sum:
    add %r9, %rdx
    xor %r9, %r9
    cmp $10, %edx
    jb store_node
    mov $1, %r9
    sub $10, %edx

As mentioned before, EAX contains an array of linked list nodes. The index of the current element is stored in the R8 register. We first store the value of the current digit in the new node currently pointed by the index. If both linked lists have been walked entirely, indicated by both pointers pointing to 0, it jumps out of the loop to the loop_exit label. Otherwise, it obtains the address of the next row of the array for the next linked list node, and stores it as the next node pointer in the current node. The loop then continues back to the load_first_list block.

store_node:
    mov %edx, (%rax, %r8, 8)
    cmpq %rdi, %rsi
    je loop_exit
    lea 16(%rax,%r8,8), %rcx
    mov %rcx, 8(%rax, %r8, 8)
    add $2, %r8
    jmp load_first_list

The final loop_exit block checks if there is a remaining carry to be put in the node. The last summation may still produce a carry, which then must also be put in the last node of the linked list. This loop_exit block performs final insertion if required, otherwise, it will set the next node pointer of the final node to 0, indicating a null pointer.

loop_exit:
    test %r9, %r9
    jz exit
    lea 16(%rax,%r8,8), %rcx
    mov %rcx, 8(%rax, %r8, 8)
    add $2, %r8
    mov %r9, (%rax, %r8, 8)
    
exit:
    movq $0, 8(%rax, %r8, 8)
    ret

Final Assembly Code

Assembly Code Solution
.globl addTwoNumbers
.type addTwoNumbers,@function
addTwoNumbers:   
    xor %rdx, %rdx
    xor %r8, %r8
    xor %rcx, %rcx
    xor %r9, %r9
    
    lea linkedListRet(%rip), %rax
    
load_first_list:
    xor %edx, %edx
    test %rdi, %rdi
    jz load_second_list
    mov (%rdi), %edx
    mov 8(%rdi), %rdi
load_second_list:
    test %rsi, %rsi
    jz store_list
    add (%rsi), %edx
    mov 8(%rsi), %rsi
    
compute_sum:
    add %r9, %rdx
    xor %r9, %r9
    cmp $10, %edx
    jb store_node
    mov $1, %r9
    sub $10, %edx
    
store_node:
    mov %edx, (%rax, %r8, 8)
    cmpq %rdi, %rsi
    je loop_exit
    lea 16(%rax,%r8,8), %rcx
    mov %rcx, 8(%rax, %r8, 8)
    add $2, %r8
    jmp load_first_list
    
loop_exit:
    test %r9, %r9
    jz exit
    lea 16(%rax,%r8,8), %rcx
    mov %rcx, 8(%rax, %r8, 8)
    add $2, %r8
    mov %r9, (%rax, %r8, 8)
    
exit:
    movq $0, 8(%rax, %r8, 8)
    ret
    
.data
.align 8
.globl linkedListRet
linkedListRet:
    .zero 1600
.text

You may also like...

Leave a Reply

Your email address will not be published. Required fields are marked *