RISC-V RV32I assembly with Ripes simulator

Assembly is the programming language that most closely resembles pure machine code. The available instructions depend on the architecture and on the extensions it supports. In this tutorial the available instructions are limited to the most basic set of RISC-V instructions. This set is denoted RV32I, meaning the RISC-V 32-bit base integer instructions. Not only does this limited set simplify the explanation and aid understanding; more importantly, the Ripes simulator only supports RV32I and the RV32M extension. Here RV32M indicates that integer multiplication and division instructions are also available. Before continuing with the installation of Ripes, let's briefly consider the extensions available for RISC-V today. Keep in mind that these extensions are being actively worked on: new extensions get drafted and existing extensions get updated.

Extension Overview [1]

A – Atomic instructions
B – Bit manipulation instructions
C – Compressed instructions
D – Double-precision floating-point instructions
E – Embedded applications, resource constrained subset
F – Single-precision floating-point instructions
G – General (I + M + A + F + D)
I – Integer instructions
J – Dynamically translated languages
L – Decimal floating-point instructions
M – Integer multiplication and division instructions
N – User-level interrupt instructions
T – Transactional memory instructions
V – Vector operations instructions

These are the instruction extensions as defined in the user-level ISA specification 2.2 [2]. In addition to this specification there are also the unprivileged ISA specification [3], the privileged ISA specification [4] and a debug specification [5]. All these extensions and specifications might seem excessive, but they allow for great flexibility. This flexibility makes it possible to create tiny low-power microprocessors with only integer arithmetic as well as full desktop processors with vector extensions. Furthermore, even specially designed processors, such as secure processors with separate program and data memory, are possible.

Understanding Basic Assembly

In a nutshell, assembly instructions perform read, write or arithmetic operations on either registers or memory addresses. Registers are fixed-length storage locations inside the processor that can be read from or written to. Typically the available registers depend on the extensions implemented by the processor. One can think of registers as short-term memory: places to store temporary information so it can be written down (to memory) later. For RV32I there are 32 individual 32-bit registers labeled x0 to x31. For convenience many of these registers have specific names that indicate their purpose. With the exception of x0, which always reads as zero, this is just a convention and one is free to use any register for any purpose. Compilers, however, operate under the assumption that certain registers are used consistently, for example to know where arguments and return values are located.

First Instruction

The first instruction to consider is called addi, which is short for add immediate. Just like functions, instructions take arguments, and addi expects three. The first is the destination register that receives the result, the second is a source register and the third is the immediate, a constant integer written directly in the instruction. The sum of the source register and the immediate is stored in the destination register. Consider the example shown below, where the result of 0 + 5 is stored in the a0 (x10) register.

addi a0, zero, 5
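
A few more variations may help to make the operand order concrete. This is only a minimal sketch: the temporary registers t0 and t1 are an arbitrary choice and are not part of the original example. Note that the immediate of addi is a 12-bit signed value, so it has to lie between -2048 and 2047.

```asm
addi a0, zero, 5    # a0 = 0 + 5
addi a0, a0, 3      # a0 = a0 + 3, so a0 is now 8
addi t0, a0, -2     # t0 = a0 - 2, so t0 is 6 and a0 is unchanged
addi t1, t0, 0      # t1 = t0 + 0, which copies t0 into t1
addi x10, x0, 5     # same as the first line, written with the x register names
```

Adding 0 to a register is the usual way to copy a value from one register to another with only the instructions covered so far.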

Since an integer is a 32-bit value it can only represent a limited range of discrete numbers. When adding two sufficiently large integers the result no longer fits in 32 bits and the addition overflows. In other ISAs such as x86 such an overflow sets a flag in a specific register. Such flags allow the program to detect that an overflow has occurred. RISC-V has no such flags register: an overflowing addition simply wraps around and keeps the lower 32 bits, and a program that cares about overflows or carries has to check for them itself with additional instructions. Similarly, an integer division by zero in the M extension does not raise an exception but returns a defined result. Of course just adding, subtracting or multiplying integers is only going to get us so far. Most programs rely on conditions, where a certain set of instructions is only executed if the condition holds. For the following assembly instructions, the equivalent pseudocode as it could be interpreted in a higher-level language such as Java or C is shown beside it.

Conditional Branches / Jumps

blt a0, a1, .conditional # if(a0 < a1) goto conditional;

Above is the branch less than instruction, which jumps to the specified label if the first argument is less than the second argument. The third argument is the label to jump to. Labels are declared in assembly programs with specific names so they can be jumped to later. Efficient assembly code uses small general-purpose snippets of labeled code which can be reused many times. This effectively reduces the size of the program, although it can sometimes come at the cost of performance. In addition to conditional jump instructions, or branch instructions as they are called in RISC-V, there is also the unconditional jump. This instruction always jumps to the specified label and in RISC-V it is denoted simply as j.

j .exit # goto exit;

Labels are more comparable to goto than to a function call because there is a separate assembly instruction, call, which performs the actual call to functions. Calls are more involved than jumps, requiring the correct setup of stack frames and dealing with return data. For now it is enough to know that call exists; basic loops are covered first before dealing with proper functions and their corresponding stack frames.
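
Combining blt and j is already enough to express an if/else. The following is a minimal sketch; the values 3 and 7, the use of a2 and a3 and the label names .then and .done are assumptions made purely for illustration.

```asm
# pseudo code: if (a0 < a1) { a2 = 1; } else { a2 = 2; }
addi a0, zero, 3
addi a1, zero, 7
blt  a0, a1, .then      # 3 < 7, so the branch is taken
addi a2, zero, 2        # else part, skipped in this run
j    .done              # jump over the then part

.then:
    addi a2, zero, 1    # then part

.done:
    addi a3, a2, 0      # a3 = a2, which is 1 in this run
```

Without the j .done after the else part, execution would simply fall through into the instructions after .then, since a label by itself does not stop or redirect anything.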

Storing and Loading from Memory

The final instructions to consider before the first program can be analyzed are the load and store instructions. Known as lw and sw in RISC-V, these mean load word and store word respectively. Read the examples below, but do not worry if it is not immediately clear how they operate.

sw a0, 0(sp) # store the contents of a0 at the address held in sp
lw a0, 0(sp) # load the word at the address held in sp into register a0

An important property of a register is that it can hold an address in memory. This is known as a reference, and with such a reference we can point to specific addresses in memory. The lw and sw instructions load and store information between registers and these memory references. The example below shows the importance of reading and writing to memory, although it is arguably the most complicated example shown yet.

addi sp, zero, 0xff # set the sp register to 0xff in hexadecimal
sw   a0, 0(sp) # use the value 0xff from sp as a memory address to store a0
addi a0, zero, 0 # set register a0 to 0, losing its previous value
lw   a0, 0(sp) # use the value 0xff from sp as a memory address to restore a0
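
The number in front of the parentheses is a byte offset that is added to the address in the register. The sketch below stores two words next to each other to illustrate this. The base address 0x100, the register t0 and the values 11 and 22 are arbitrary assumptions, and it is assumed that the simulator allows writing to that address.

```asm
addi t0, zero, 0x100   # t0 holds the base address 0x100 (an arbitrary choice)
addi a0, zero, 11
addi a1, zero, 22
sw   a0, 0(t0)         # store 11 at address 0x100
sw   a1, 4(t0)         # store 22 at address 0x104, one word (4 bytes) further
lw   a2, 4(t0)         # a2 = 22
lw   a3, 0(t0)         # a3 = 11
```

Because a word is 4 bytes long, consecutive words sit at offsets 0, 4, 8 and so on from the base address.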

This covers all assembly instructions that will subsequently be used in the Ripes simulator. When using this simulator the execution pipeline and results on registers or memory locations can be viewed graphically.

Installing Ripes

There are multiple methods to install Ripes for most operating systems. By far the simplest method is to download a release from GitHub. The information in this post is based on Ripes version 1.0.3 [6].

All three executables, the .exe, .app and AppImage, can be executed directly. As a result, the download location is up to the user to decide and the file can simply be removed afterwards if desired. macOS users are trusted to be familiar with allowing third-party executables to run. Similarly, Linux users are expected to know how to set file permissions to make files executable. Now start Ripes and let's get started.

Ripes simulator processor view

The First Basic Loop

The first example is a basic for loop that starts at 5 and decrements until the value is no longer greater than 0. First a minimal assembly representation of this will be covered. Gradually, other important aspects that also need to be handled in assembly will be covered.

#pseudo code
for(int i = 5; i > 0; i--) {}

# assembly equivalent
addi a0, zero, 5
addi a1, zero, 1
j .loop

.loop:
    blt a0, a1, .exit_loop
    addi a0, a0, -1
    j .loop

.exit_loop:
    ...

Before moving on to improvements, let's analyze the assembly shown above. Initially the value 5 is put into the a0 register, followed by 1 being put into a1. Next an unconditional jump is performed to enter .loop. Inside the loop (a0 < a1) is evaluated and if true a jump to .exit_loop is performed. Obviously 5 is not less than 1, so execution continues normally. Next a0 is decremented by one and the subsequent jump goes back to the start of the loop. Now 4 is still not less than 1, but the looping pattern has become apparent. When the value in a0 has become 0 it is less than the 1 held in the a1 register and a jump to .exit_loop is performed. Below the operation of this simple loop is visually demonstrated step-by-step; of course it is encouraged to perform this stepping in Ripes yourself.
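
As a further variation on the same loop, the sketch below gives the loop body some work to do by summing the values of i. It is an illustration only: the accumulator in a2, the copy into a3 and the use of add (the register-to-register counterpart of addi) are assumptions that go beyond the original example. The initial j .loop is left out here, since execution falls through into the label anyway.

```asm
# pseudo code
# int sum = 0;
# for (int i = 5; i > 0; i--) { sum += i; }

addi a0, zero, 5              # i = 5
addi a1, zero, 1              # constant 1 for the comparison
addi a2, zero, 0              # sum = 0

.loop:
    blt  a0, a1, .exit_loop   # leave the loop once i < 1
    add  a2, a2, a0           # sum += i
    addi a0, a0, -1           # i--
    j    .loop

.exit_loop:
    addi a3, a2, 0            # a3 = sum, which is 5 + 4 + 3 + 2 + 1 = 15
```

Stepping through this in the simulator should show a2 taking the values 5, 9, 12, 14 and 15 while a0 counts down to 0.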

Function Calls and Stack Frames

Inevitably a program at some point becomes so complex that it would be impossible to keep an overview without functions. Even in assembly the concept of functions is well known; however, in assembly a programmer needs to perform various tasks to ensure a call executes correctly. Before the next two assembly instructions, call and ret, can be properly introduced, the special purpose of some registers has to be described. Let's take a look at the registers sp (x2), ra (x1) and s0 (x8). From here on the x labels are no longer of significant value, so they will be omitted in the future. It no longer matters that a0 is labeled x10, for example. Here sp stands for stack pointer, ra stands for return address and s0, otherwise known as fp, stands for frame pointer. The basic purposes of these registers will be explained. However, for a more thorough guide on register conventions one is advised to read the guide by Harry H. Porter III [7], starting on page 146.

Return address (ra)

The return address (ra) is used to return from a call back to where the program was executing instructions before the call. In other architectures such as x86 the return address is stored in main memory. Using a register instead allows calls to be performed much faster and with lower overhead. However, storing the return address in main memory can only be avoided if the function makes no subsequent calls to other functions.

Stack pointer (sp)

The stack pointer (sp) reserves space for calls to store variables in main memory. It must always be modified in multiples of 16 bytes, and reserving space is done by decrementing it. This is typical for most architectures, where the stack and the heap grow toward each other from opposite ends of memory. This makes it possible to use the entirety of available memory effectively without variables on the stack overwriting variables on the heap. After space has been reserved by decrementing sp, it must be incremented again by the same amount before returning from the call.

Frame pointer (s0, fp)

The frame pointer (fp) is not needed with every call; in fact, it is likely that most calls won’t need to use the frame pointer. The frame pointer is used to determine offsets for calls that create variables with dynamic sizes. Additionally, a frame pointer is needed when using a large amount of local variables, due to a limitation in RISC-V on the maximum size of relative offsets. The dynamic use of the frame pointer is outside the scope of this tutorial; however, the value will be correctly set and stored in the examples.

Stack frames

Stack frames are sized in 16-byte chunks; naturally, this makes the smallest possible stack frame 16 bytes large. A stack frame is constructed by decrementing the current value of sp and subsequently storing ra and fp in main memory. The storing of ra and fp is done with a relative offset to sp. Finally, fp is set to sp plus the size of the stack frame, so that it points to the top of the frame. When a stack frame is destructed the original value of fp is restored from main memory, followed by restoring the value of ra. Finally, the value of sp is incremented back up to its original value. Below is an example of the construction and destruction of the most basic stack frame.

# construction
addi    sp, sp, -16
sw      ra, 12(sp)
sw      s0, 8(sp)
addi    s0, sp, 16
# destruction
lw      s0, 8(sp)
lw      ra, 12(sp)
addi    sp, sp, 16
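
To illustrate the 16-byte rule with a slightly larger frame, suppose a hypothetical function needs room for five 32-bit local variables in addition to ra and s0. That is 20 + 8 = 28 bytes, which is rounded up to the next multiple of 16, giving a 32-byte frame. The layout below is a hand-written sketch under that assumption and not compiler output.

```asm
# construction of a 32 byte frame
addi    sp, sp, -32
sw      ra, 28(sp)
sw      s0, 24(sp)
addi    s0, sp, 32
# the five local words fit at -12(s0), -16(s0), -20(s0), -24(s0) and -28(s0)
# destruction
lw      s0, 24(sp)
lw      ra, 28(sp)
addi    sp, sp, 32
```

The offsets of ra and s0 move along with the frame size so that both stay at the top of the frame, directly below the old value of sp.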

Before putting this all together and introducing the call and ret instructions, consider the following function that takes one argument and returns this argument after adding it to itself. The argument is a regular 32-bit integer and all operations with this argument use the a0 register. Upon careful inspection of the assembly instructions it can be observed that many of the sw and lw instructions could be optimized away. Partly this is the job of the compiler, but also of the programmer. Writing return num + num; would have removed almost all of the sw and lw instructions shown below, for example.

#pseudo code
int sum(int num) {
    num += num;
    return num;
}

# construction
addi    sp, sp, -16
sw      ra, 12(sp)
sw      s0, 8(sp)
addi    s0, sp, 16
# store first argument
sw      a0, -12(s0)
# addition of a0 + a0
lw      a0, -12(s0)
add     a0, a0, a0
sw      a0, -12(s0)
# destruction
# load first argument
lw      a0, -12(s0)
lw      s0, 8(sp)
lw      ra, 12(sp)
addi    sp, sp, 16
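
As a rough illustration of how much of this is redundant, the same frame could be kept while leaving num in a0 the whole time. This is a hand-written sketch of what removing the extra loads and stores might look like, not the output of any compiler.

```asm
# construction
addi    sp, sp, -16
sw      ra, 12(sp)
sw      s0, 8(sp)
addi    s0, sp, 16
# num + num, kept in a0 the whole time
add     a0, a0, a0
# destruction
lw      s0, 8(sp)
lw      ra, 12(sp)
addi    sp, sp, 16
```

The argument arrives in a0 and the result is expected in a0, so there is no need to move it through memory in between.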

Function Calls

A fundamental understanding of the critical registers as well as stack frames makes it possible to define complete functions which can be called in assembly. The call instruction is similar to the jump in that it will jump to a label; however, it expects the subsequent instructions to properly construct and destruct a stack frame. When the destruction of the stack frame is completed the ret instruction should be executed. This instruction ensures that execution continues from where the call was made. To demonstrate, consider the same example as previously, but now with appropriate function calls.

int sum(int num) {
    num += num;
    return num;
}

int main(int argc, char** argv) {
    return sum(5);
}

From main the sum function is called with the integer value 5 as its first argument. For simplicity, let’s show the resulting assembly without the construction and destruction of the main stack frame. Instead, focus on how call and ret are used to jump to parts of the assembly code.

sum(int):
addi    sp, sp, -16
sw      ra, 12(sp)
sw      s0, 8(sp)
addi    s0, sp, 16
sw      a0, -12(s0)
lw      a0, -12(s0)
add     a0, a0, a0
sw      a0, -12(s0)
lw      a0, -12(s0)
lw      s0, 8(sp)
lw      ra, 12(sp)
addi    sp, sp, 16
ret

main:
...
addi    a0, zero, 5
call    sum(int)
...
ret

Before calling the sum function the literal 5 is stored in register a0. This is because a0 holds the first argument by convention. Now call jumps to the sum(int): label and starts executing instructions from there. In sum it can be seen that a0 is assumed to contain the value of the first argument because of the sw a0, -12(s0) instruction. Naturally, the following lw a0, -12(s0) instruction is redundant and only emitted by the compiler when optimizations are disabled with -O0. The keen-eyed might notice that the compiler has still managed to sneak one optimization into the assembly, as the return value of the call to sum is assumed to still be in a0 when main executes ret.
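
The return address section earlier noted that ra only has to be saved to memory when a function calls other functions. The sketch below tries to make that visible with a hypothetical function quadruple that calls sum twice. The names quadruple and .end, the plain label sum and the use of a1 are assumptions for illustration, and it is assumed that the simulator has already set sp to a valid stack address.

```asm
main:
    addi    a0, zero, 5
    call    quadruple      # a0 is 20 afterwards
    j       .end           # skip over the function bodies

# hypothetical: int quadruple(int num) { return sum(sum(num)); }
quadruple:
    addi    sp, sp, -16
    sw      ra, 12(sp)     # save ra, the calls below overwrite it
    call    sum            # a0 = 5 + 5 = 10
    call    sum            # a0 = 10 + 10 = 20
    lw      ra, 12(sp)     # restore the address to return to in main
    addi    sp, sp, 16
    ret

sum:
    add     a0, a0, a0     # makes no calls itself, so ra can stay in the register
    ret

.end:
    addi    a1, a0, 0      # a1 = 20
```

If quadruple did not save and restore ra, its final ret would jump back to the instruction after its own second call to sum instead of returning to main.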

Simulating Function Calls in Ripes

When observing the resulting assembly in Ripes, the call and ret instructions no longer exist. This is because these instructions are converted by the assembler depending on the addressing mode [8]. The most common mode is called PC-relative, which uses offsets relative to the current address in the program counter (pc). This special register is only accessible through specific instructions, as is shown later.

With the first call to sum the instruction is translated into two separate instructions. These instructions are auipc x6, 0 and jalr x1, x6, 24. First let’s translate the registers to the names we have used throughout this post; these now become auipc t1, 0 and jalr ra, t1, 24. Here auipc stores the value of the pc register in t1 with an additional offset of 0. The jalr instruction jumps to t1 + 24 and stores the address of the instruction that follows the jalr, its own address + 4, in ra. The + 4 is needed so that upon returning the jalr instruction is not executed again, while the 24 is the relative distance between the auipc instruction and the start of sum(int):. The special auipc instruction makes it possible to store the current value of the program counter pc in a general-purpose register such as t1.

The ret instructions are translated to jalr x0, x1, 0, since this jumps back to the address stored in x1 / ra. The return address that jalr produces is written to the x0 register, which is hardwired to zero, so it is discarded; there is no desire to modify any registers upon returning.

It is important to understand that the call and ret instructions are actually translated into separate instructions, and that these differ per memory model. The underlying instructions used to facilitate call and ret are typically not shown in a disassembly or while assembling.
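
Writing the expansion out by hand can help to follow the address arithmetic. The sketch below assumes that the program is loaded at address 0 and that every instruction is 4 bytes long. The layout, the plain label sum and the label .end are hypothetical, which is why the offset here is 12 rather than the 24 seen in Ripes above.

```asm
main:
    addi  a0, zero, 5     # 0x00
    auipc x6, 0           # 0x04: x6 (t1) = pc + 0 = 0x04
    jalr  x1, x6, 12      # 0x08: x1 (ra) = 0x08 + 4 = 0x0c, jump to 0x04 + 12 = 0x10
    j     .end            # 0x0c: execution resumes here after sum returns
sum:
    add   a0, a0, a0      # 0x10: a0 = 5 + 5 = 10
    jalr  x0, x1, 0       # 0x14: ret, jump to x1 + 0 = 0x0c
.end:
    addi  a1, a0, 0       # 0x18: a1 = 10
```

The offset passed to jalr is always counted from the address of the auipc instruction, because that is the value auipc placed in t1.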

Compiler Optimizations

As a final note, an example of the incredible optimizations compilers are able to perform is shown. The same C code, where the sum of 5 is computed in a function, is used, but now the compiler flag -O0 is changed to -O3.

sum(int):
  slli a0, a0, 1
  ret
main:
  addi a0, zero, 10
  ret

The compiler has managed to infer the computation the sum function performs and to optimize it away by moving 10 into a0 and returning. In other words, it has completely optimized away the function call and the construction and destruction of any stack frames. However, it still left the label of the sum function and the operations it needs to perform. This is because the function could be referenced from other libraries and files that might want to call it. Inside sum the addition has been replaced by slli a0, a0, 1, a shift left by one bit, which doubles the value. It should be noted that this optimization can only be performed because the value 5 is known at compile time. Should the value be retrieved as a program argument, this optimization would not have been possible.

The Next Steps

Keep a lookout for subsequent blog posts on RISC-V, where working with the Maix Bit and the accompanying JTAG debugger will be covered. And maybe consider donating so I can buy one of those SiFive processors as well.