MicroB. Writing BASIC in assembler language and squeezing it into 512 bytes

Want some practice in assembler? Today, I will show step-by-step how to write a BASIC interpreter and run it from the boot sector of your PC. My interpreter includes overlapping subprograms with branching recursion – otherwise, BASIC won’t fit in 512 bytes. It’s quite possible that this project becomes the most sophisticated program in your life. But after writing it by your own hands, you will be able to rightfully call yourself a hacker.

How to use the interpreter

By writing BASIC code for the boot sector, I, in fact, transform my PC into an analogue of old home computers, like Commodore 64 or ZX Spectrum: this programming language was hardcoded in their RAM, and users could start programming in BASIC immediately after the boot-up.

For convenience purposes, I formulate the terms of reference (i.e. requirements to my interpreter) as a user manual.

The interpreter should support two modes: (1) interactive mode; and (2) command line mode. In the interactive mode, commands are executed immediately after the input.

In the command line mode, you input the program code into the memory and then execute the run command.

If you need to delete a string from the program code, simply enter its number in the command line.

How does the interpreter know what mode to use for processing of the entered text? If the string starts with a digit, the interpreter processes it in the command line mode; if not, in the interactive mode.

The maximum program size is 999 strings. The maximum string length is 19 symbols. Note that Backspace key operates as usual – even though the characters don’t disappear on the screen, in the buffer everything is OK.

The programmer operates with:

  • three commands: run (run program), list (display program code on the screen), and new (delete program);
  • 26 variables (from a to z): two-byte integer numbers without signs;
  • expressions that can include numbers, four arithmetic operations, brackets, and variables;
  • three operators: : if, goto, and =; and
  • two functions: print and input.

The interpreter understands the following language constructs:

  • var=expr: assign the expr value to the var variable (from a to z);
  • print expr: display the expr value and move the cursor to the next line;
  • print expr;: display the expr value and leave the cursor on the current line;
  • print "][ello": print the line and move the cursor to the next line;
  • print "][ello";: print the line and leave the cursor on the current line;
  • input var: read value from the keyboard and place it into the var variable (from a to z);
  • goto expr: jump to the specified string in the program; and
  • if expr1 goto expr2: if expr1 is not equal to 0, jump to the line expr2; otherwise go to the line specified after if.

Example if c-5 goto 2 (if c-5 is not equal to 0, jump to line 2).

Coding the interpreter

First, I define memory areas to be used by the program:

  • buffer for text entered in the command line;
  • buffer storing the program code;
  • data array to store variables (from a to z); and
  • pointer to the current program line.

I start from pointing all segment registers to CS. Then I remove the direction flag so that the lines are processed only from left to right (even when the program calls instructions like stosb). I fill the buffer intended for the program source code by the 0x0D symbol (the carriage return character also known as the Enter key).

The program code written in BASIC will be processed as a two-dimensional character array 1000 × 20 in size.

If the user enters more than 19 symbols in one line, this line will extend to the next one. In the current interpreter version, this bug is not fixed. Just remember not to enter more than 19 symbols in one line.

Launching the main loop

First, I restore the stack pointer (SP register) – in case the program in BASIC crashes due to an error.

Then I discard the running pointer (current line of the program) and call the input_line subprogram, which is waiting until the programmer types something. The subprogram saves the inputted line in the SI register.

Next, I check whether the line starts from a digit or not. If it does, it must be written to the buffer allocated for the program code. To do this, I first calculate the address to write the string to. The find_address subprogram is responsible for this (it saves the result to the DI register). After finding the required address, I copy the string there: rep movsb.

If there is no digit in the beginning of the string, it is executed immediately: execute_statement.

Processing program strings

Program strings are processed as follows: I take the first word from the string and sequentially compare it with each record in the @@statements table (see the final piece of code in the end of the article). This table stores commands, operators, and functions understandable for the interpreter.

Note the heuristics I use to save bytes on processing the conditional operator: before the entry point execute_statement, I add an additional entry to the same subprogram: @@if_handler.

As a result, I avoid the need to write an additional handler for constructs like if a-2 goto 10. If the expression result (in this particular case a-2) equals zero, the program does not go into if, i.e. ignores the rest of the string (goto 10).

Done with if. Then I process other commands, operators, and functions. I start from skipping extra spaces added by the programmer for convenience purposes. If a string contains nothing but spaces, it is ignored.

If the string is not empty, I examine it in more detail. First, I go through the @@statements table and compare the string with each of its records. To do this, I read the string size (in case of run, it is 3) and then perform the comparison using repe / cmpsb.

If a match is detected, then the DI register starts pointing to the address of the respective handler, and I jump there: jmp [di]. To get a better understanding of how it works, check the end of the article and review the @@statements table. Hint: labels starting with @@ are the handlers’ addresses.

If the entire table has been checked and no matches identified, then the current string of the program is neither a command, nor an operator, nor a function. Perhaps, this is a variable’s name? I jump to @@to_get_var to check this.

I scroll through the DI register to the next record of the table by adding CX (length of the name of the current command, operator, or function plus two more bytes (the handler’s address)). Then I restore the value of the SI register (rep cmpsb has scrolled it forward) so that it points again to the beginning of the string compared with the table of operators.

Now DI points to the next record in the table. If this is a nonzero record, I jump to @@next_entry to compare the program string (to be specific, its beginning) with this record.

If the entire table has been checked and no matches identified, then the current string of the program is neither a command, nor an operator, nor a function. If so, this is likely an assignment construct (e.g. var=expr). In theory, no more variants are possible (unless the source code contains a syntax error).

Now I have to calculate the expr expression and save the result at the address linked with the var variable. The get_variable subprogram calculates the required address and pushes it into the stack.

After finding the address, I check whether an assignment operator exists after the variable name. If yes, it must be executed. But in order to save a few bytes, I won’t do this right here.

Later in the code, I will have to implement an assignment inside the input function. So, I jump to that piece of code: @@assign. I don’t need the entire input function, only its final part. And I won’t return to execute_statement anymore: the input function will perform the required ret instruction.

If there is no assignment symbol, I print the error notification and terminate the program (i.e. jump to @@main_loop). In the main loop, the interpreter will restore the stack pointer and continue its operation despite the encountered syntax error.

The “list” command

The list command displays on the screen the program listing saved in the buffer.

First, I reset to zero the current string number in the program: xor ax, ax. Then I use the string number to calculate the address from where to read the program string: find_address. After finding the string address, I compare its first symbol with 0x0D (i.e. check whether the string is empty). If it’s empty, I move to the next string.

If the string is not empty, I display it on the screen. First, I display its number: output_number. Then I read the string character-by-character from the buffer until I stumble upon 0x0D. In a similar way, I display it on the screen character-by-character.

After that, I move to the next string of the program (inc ax) and repeat the same operations until I reach max_line (i.e. 1000).

The “input” function

The input function enables you to enter a number from the keyboard. It works as follows.

First, the interpreter calculates the address of the variable specified in the program. Then it displays a question mark and waits until the user prints something.

In case the user inputs not just a number but a sophisticated expression, the interpreter processes the inputted string with process_expr and saves the result at the address calculated in the beginning of the subprogram. The stosw instruction will assist me in this.

Processing expressions

A three-level processing scheme will be applied to expressions:

  1. Addition and subtraction;
  2. Multiplication and division; and
  3. Nested expressions (in brackets), names of variables, and numbers.

At the first level (whose priority is the lowest), I immediately pass the control to the second level by calling expr2_left. After processing higher-priority operations, I look at the next symbol. If this is an addition or subtraction sign, I process it.

First, I save the left value: push ax. Then I pass the right part of the expression to the second level: expr2_right. After performing higher-priority operations, I take the left value (pop cx) and perform the required operation with the right value: add ax, cx or sub ax, cx.

Finally, I jump to @@next_sub_add and launch a cycle to calculate expressions like 1+2+5-4.

At the second level, I perform the same operations as described above. First, I pass the control to the higher-priority level by calling expr3_left. Then I look at the next symbol. If this is a multiplication or division sign, I process it. Similarly to the previous level, I launch a cycle (@@next_div_mul) in the end to process expressions like 3*4*2/1.

Important: thanks to the three-level structure described above, the interpreter automatically takes into account the priority of operations. For instance, 5*6+3*4 is calculated as (5*6)+(3*4).

At the third layer, I process brackets (nested expressions), names of variables, and numbers.

First, I remove spaces from the inputted string. Then I look at its left symbol. If this is an opening bracket, I start recursion by calling process_expr. After processing the nested expression, I check whether it has a closing bracket. If not, I display the error message. If the closing bracket exists, I skip spaces going after it and execute ret.

But what if some other character (i.e. not an opening bracket) is present on the left? Perhaps, this is a variable’s name? Checking this: cmp al 0x40 / jnc @@yes_var.

If my assumption is confirmed, I start reading the variable’s value. If not, then the current symbol is a fragment of a number. I make a step back (dec si) and call dec_str_to_number to read the number.

Subprogram: calculate variable’s address on the basis of its name

This subprogram has two entry points. At the first entry point (get_var), I read the variable’s letter using lodsb; at the second one (get_var_2), I use the variable’s name transmitted via the AL register. Yes, this is a pretty twisted way, but it saves me a few bytes.

Now that I have the variable’s name, I should calculate the address it’s linked with. This operation is performed in three steps:

  1. I execute and al, 0x1F to get the variable’s number; at this point, ASCII values 0x61-0x7A are converted into 0x01-0x1A;
  2. I multiply the result by 2 because each variable uses a 16-bit word in the memory; and
  3. I add to the address its upper part: mov ah, vars>>8.

Voila! I know the memory cell the variable is linked with.

Note that this code is interlinked with the subprogram that skips spaces. The subprogram also has two entry points. The first one (skip_spaces) skips spaces located after the variable; the second one (skip_spaces_2) performs the same operation but leaves the first space intact.

Subprogram: print decimal numbers

The subprogram prints on the screen numbers transmitted via the AX register. It works as follows: AX is recursively divided by 10. After each operation, the remainder in division is pushed into the stack. After reaching zero, I start sequentially exiting the recursion. Prior to each exit, I pop the respective reminder from the stack and display it on the screen.

A few examples. If AX = 4, then after the division by 10, AX will contain 0, and the output_number won’t start a recursion. Instead, it will just display the reminder (i.e. 4), and that’s it.

If AX = 15, then after the division by 10, AX` will contain 1. Accordingly, the subprogram starts a recursion. It displays 1 there, then passes from the internal call into the main one, and prints figure 5 there.

Important: (1) jmp is present in the end; and (2) the program doesn’t have its own ret instruction. This is another hack enabling me to save a few bytes. I don’t call the subprogram displaying the character but jump to it instead. Then this subprogram executes ret.

Subprogram: from decimal string to hexadecimal number

The subprogram translates a string containing a decimal number (the SI register points to it) and writes the result to the AX register.

A decimal string is converted into a hexadecimal number as follows. I use a cycle to go through all digits from left to right. Every time I multiply the accumulator by 10 (at the first time, it contains zero) and add the current digit to it. If I encounter in the string a character that is not a digit, an error occurs, and I exit.

Note that two operations are present between cmp al, 10 and the conditional transfer. So, the final result is always written to the AX register, even in the case of an error.

“Run/goto” commands

Run и goto commands launch the program saved in the buffer. Run launches it from the beginning, while goto, from a certain line. In the interpret, both commands are processed by the same subprogram.

When the interpreter sees the run command, it converts it into goto 0 first and then handles it similarly to goto. To convert run into goto 0, the interpreter resets the AX register to zero and jumps to @@goto_handler.

I start executing the goto command from calculating the expression written after goto. The result (i.e. number of the line to jump to) is written to AX.

Then I use the line number to calculate its address: find_address.

Note that the goto command can be used both in the interactive and command line modes (i.e. it can be used as an operator or entered from the keyboard for immediate execution).

A comparison instruction called cmp word [running], 0 checks whether I have got here from the interactive mode or from the command line mode.

If I got there from the command line mode, I write in the running variable the address from where the program execution should continue. At the next cycle, the interpreter will retrieve the required string from this address and execute it.

But if I get there from the interactive mode (the programmer uses goto, not run, to enter the program), then I jump to the program string specified in goto.

Below is a subprogram that calculates the string address in the program code by its number. The string number is saved in AX; I multiply it by the standard string length and add the address of the first string in the program. This is how the required address is calculated.

Subprogram: input strings from the keyboard

The interpreter must be able to accept program strings inputted from the keyboard. The input_line subprogram is responsible for this. At the input, it receives the command line symbol (>) via the AL register.

First, I display the command line. Then I point to the buffer where the inputted text should be saved. Next, I read the character inputted from the keyboard and save it in the buffer (provided that this is not the Backspace key (0x08)). If this is Backspace, I reduce the DI register by one so that it points to the previous character.

The “print” function

The print function must be able to understand various syntax variants:

  • print moves cursor to the next line;
  • print "Hello" prints text and moves cursor to the next line;
  • print "Hello;" prints text and leaves cursor on the current line;
  • print 5 prints number and moves cursor to the next line; and
  • print 5+2; prints number and leaves cursor on the current line.

The first comparison, cmp al, 0x0D, checks whether this is the first syntax variant (no arguments).

The second comparison, cmp al, '"', checks whether this is one of the two syntax variants used to print text. In this case, the program displays all characters one-by-one until it stumbles upon a closing quote or 0x0D. If it encounters 0x0D, this means that the programmer forgot to enter the second quote. No error messages are displayed in this situation.

Then I check whether a semicolon is present after the quote. If not, the cursor moves to the next line.

At the label @@no_quote, the program processes the two syntax variants used to print numbers. If there is an expression instead of a number, it is calculated. Then the program prints its result and checks whether a semicolon is present after it. If not, the cursor moves to the next line.

Subprograms: input and print characters

The input_key subprogram reads the pressed key using BIOS interrupt 0x16, function 0x00. The ASCII code of the pressed key is written to the AL register.

Important: this program overlaps with the next one; therefore, it does not have the ret instruction in the end. Why use such tricks? To display the character received from the keyboard without spending extra bytes.

The output_char subprogram first checks whether the Enter key is pressed: cmp al, 0xD. If yes, then 0xA is printed in addition to the 0xD character. Without 0xA, the cursor moves left without jumping to the next line. The characters are printed using BIOS: interrupt 0x10, function 0x0E.

Table of commands, functions, and operators

The next step is to create a table of commands, functions, and operators understandable to the interpreter. Each record consists of three fields: string length; command/function/operator name; and entry point for the processing subprogram. Zero in the end of the table notifies the interpreter that the table is over.

The above structure makes it easy to add new commands, functions, and operators. You just add one more line, then insert the processing subprogram in a suitable location, and that’s it. No more changes have to be made in the code.

The last two bytes notify BIOS that the code written in the boot sector is not some nonsense – but an interpretable program (in this particular case, a BASIC interpreter).

Testing the interpreter: the Pascal’s triangle program

Pascal’s triangle is a triangular array of binomial coefficients. The upper row contains figure 1. Each entry of each subsequent row is constructed by adding the number above and to the left with the number above and to the right. Below is the source code to be fed to the interpreter.

If you have done everything right, the interpreter should display something like this.

The interpreter works correctly? If yes, congrats! You have just created a fully featured BASIC interpreter. You may also want to expand its capacity by adding new functions, for instance:

  • add the cls command (clear the screen);
  • add gosub and return operators to be able to write subprograms; and
  • implement the support of data arrays.

Important:

  1. Use NASM to compile the program: nasm -f bin MicroB.asm.asm -o MicroB.asm.com; and
  2. If you don’t want to make changes in the boot sector of your PC, you can test the interpreter in a DOS emulator, for instance, DOSBox.


Leave a Reply

XHTML: You can use these tags: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code class="" title="" data-url=""> <del datetime=""> <em> <i> <q cite=""> <s> <strike> <strong> <pre class="" title="" data-url=""> <span class="" title="" data-url="">