Disassembler

By Chris Dewhurst

Originally published in EUG #60

Thanks to the versatility of BBC Basic, assembly programs are relatively easy to write. We just type in an assembly language program of 6502 mnemonics like 'LDA counter' and 'STA byte', which the BBC or Electron then assembles into machine code proper.

Provided we still have the source code (the assembly language listing) we can make alterations quickly. Perhaps we wish to adjust memory locations addressed by particular instructions, or relocate the whole code by changing the value of P% (where the code starts). Both can usually be achieved with only a few changes to Basic variables.

But what if we are faced with a machine code program and no source code? One example is a machine code file which you normally *RUN. Another is the Basic ROM itself, which is just a gigantic machine code program on a ROM chip.

We need some way of converting the otherwise unintelligible streams of ones and zeroes back into mnemonics, and that's the purpose of a DISASSEMBLER. A disassembler examines the bytes in memory and tries to work out meaningful assembler instructions from them.

Of course, not every byte in memory will form part of 6502 assembly language instructions because a byte might be part of a string of text. So disassemblers usually include some form of memory dumping feature which displays the value of the bytes together with their ASCII representations.

DISASS is a small-scale yet powerful disassembler which does just this - and it boasts an additional feature not usually found on commercial equivalents. You can dump the memory in decimal as well as hexadecimal; similarly you can have decimal as well as hex figures after immediate instructions. For example, LDA #32 can be displayed as well as LDA #&20.

This is of immense benefit if (like me) you know what ASCII codes mean in decimal but have to convert them to hex to understand the display in other disassemblers. The program can also recognise common operating system calls and display them accordingly, for example JSR &FFEE will be shown as JSR oswrch.
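The idea of showing names for operating system calls can be sketched as a simple lookup table. This is only an illustration in Python, not the DISASS code: the addresses are the standard MOS entry points, but which of them DISASS actually recognises is an assumption, and the name 'operand_text' is made up for the example.

```python
# Standard MOS entry points. Which of these DISASS recognises is an assumption.
OS_CALLS = {
    0xFFCE: "osfind",
    0xFFD1: "osgbpb",
    0xFFD4: "osbput",
    0xFFD7: "osbget",
    0xFFDA: "osargs",
    0xFFDD: "osfile",
    0xFFE0: "osrdch",
    0xFFE3: "osasci",
    0xFFE7: "osnewl",
    0xFFEE: "oswrch",
    0xFFF1: "osword",
    0xFFF4: "osbyte",
    0xFFF7: "oscli",
}

def operand_text(address):
    """Return the OS call name if known, otherwise the address in hex."""
    return OS_CALLS.get(address, "&%04X" % address)

print("JSR " + operand_text(0xFFEE))   # JSR oswrch
print("JSR " + operand_text(0x1234))   # JSR &1234
```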

The disassembler is written in machine code for speed and there is a short pause while the code is being assembled. DISASS makes extensive use of Basic ROM routines to save on space. I discussed most of the ROM routines employed here in EUG#53, but there are one or two extras which I'll outline later on. The amount of code saved by using ROM routines means that larger machine code programs can be resident in memory alongside the disassembler itself.

You are asked where you would like disassembly to begin. The number you type in must be a hexadecimal address of four digits or fewer, without the ampersand ('&') prefix. For example, typing in 1100 will start disassembly at location &1100. This is where the disassembler code itself resides so you can disassemble the disassembler!

Type D to display the next machine code instruction. If the disassembler can't recognise the byte as being part of an instruction, question marks (???) are shown instead of a mnemonic. At any time you can type N and input another address and disassembly will continue at the new address. Pressing M will display a memory dump of several bytes starting from the current address.

How many bytes are displayed on one line depends on whether you are in decimal or hexadecimal mode. Press # (hash) to toggle between the two. In decimal mode five bytes will be displayed per line, in hexadecimal eight bytes are shown across each line of the screen. Having this memory dumping feature is extremely useful for displaying strings of data like VDU sequences.
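A memory dump line along these lines might be sketched as follows. The five-per-line and eight-per-line counts come from the description above; the exact column layout and the use of a dot for unprintable bytes are my assumptions, and 'dump_line' is a hypothetical name.

```python
def dump_line(memory, address, decimal):
    """Format one dump line: address, byte values, then the ASCII text."""
    count = 5 if decimal else 8          # bytes per line, as described above
    chunk = memory[address:address + count]
    if decimal:
        values = " ".join("%3d" % b for b in chunk)
    else:
        values = " ".join("%02X" % b for b in chunk)
    # Unprintable bytes are shown as dots (an assumed convention).
    text = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
    return "%04X  %s  %s" % (address, values, text)

# A VDU 22,7 sequence followed by some text.
data = bytes([22, 7]) + b"Hello!!!"
print(dump_line(data, 0, False))   # 0000  16 07 48 65 6C 6C 6F 21  ..Hello!
print(dump_line(data, 0, True))
```

In the decimal line the VDU sequence shows up directly as 22 and 7, which is the benefit the decimal mode is meant to give.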

Pressing # also affects the way numbers are printed after an immediate instruction, for example LDX #16 and LDX #&10 in decimal and hexadecimal mode respectively. It's surprising that most disassemblers don't have this feature. Even professionals don't use hexadecimal numbers all the time in their source code!

So, how does it work? When an assembly language listing is assembled, each mnemonic is translated into an opcode, a number between zero and 255. Not all values between zero and 255 are used: for example, 0 is BRK and 1 is ORA, but no mnemonic has an opcode value of 2. In fact, of the 256 possibilities only 151 are used.

Now, there are only 56 actual machine code instructions, but one instruction can have several addressing modes, which brings the total of valid ("documented") opcodes to 151. Counting the 'non-instruction' as number zero gives 152 values in all, numbered 0 to 151.

Therefore the first thing that DISASS does is to get the byte at the current address and convert it - assuming it's an opcode - to an index into these 152 values. The routine 'find' uses the 32 bytes of data at 'opbits', one bit for each of the 256 possible opcodes. A bit is set if the corresponding opcode is a documented instruction. The opcode becomes a counter which is decremented as 'find' works through the bits, and the index is the number of set bits found up to and including the opcode's own bit.

The result is an index into 'tokens'. Each mnemonic is referred to by a number, including zero which is the 'non-instruction', 0=???. If the 'find' routine failed then the index will be 0 and the three question marks will be printed. The rest of the tokens are numbered as follows: 1=ADC, 2=AND, 3=ASL, 4=BCC, ... 56=TYA.
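The numbering is consistent with the 56 mnemonics simply being listed in alphabetical order after the ??? entry. Here is a small Python check of that, assuming alphabetical order throughout (the article only gives the first few and the last):

```python
# Token 0 is the 'non-instruction'; the rest are assumed to be alphabetical.
MNEMONICS = (
    "??? ADC AND ASL BCC BCS BEQ BIT BMI BNE BPL BRK BVC BVS CLC CLD CLI CLV "
    "CMP CPX CPY DEC DEX DEY EOR INC INX INY JMP JSR LDA LDX LDY LSR NOP ORA "
    "PHA PHP PLA PLP ROL ROR RTI RTS SBC SEC SED SEI STA STX STY TAX TAY TSX "
    "TXA TXS TYA"
).split()

for number in (0, 1, 11, 35, 56):
    print(number, MNEMONICS[number])
```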

The second entry in 'tokens' is 11 which corresponds to BRK. The third entry is 35 which is ORA. The fourth entry is 35=ORA as well - but ORA with a different addressing mode. Opcode 0 is BRK, opcode 1 is ORA with indirect X addressing, opcodes 2,3, and 4 do not exist, and opcode 5 is ORA with zero page addressing. Therefore opcode number 5 is documented opcode number 3 as stored in the 'tokens' table.
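The bitmap lookup can be modelled in a few lines of Python. This is a sketch of the technique rather than a transcription of 'find': the list of documented opcodes is the standard 6502 set, but the order of bits within each byte of the bitmap is my assumption, and the sketch counts upwards where the machine code counts a register down.

```python
# The 151 documented 6502 opcodes, one row per high nybble.
DOCUMENTED = bytes.fromhex(
    "00 01 05 06 08 09 0A 0D 0E "
    "10 11 15 16 18 19 1D 1E "
    "20 21 24 25 26 28 29 2A 2C 2D 2E "
    "30 31 35 36 38 39 3D 3E "
    "40 41 45 46 48 49 4A 4C 4D 4E "
    "50 51 55 56 58 59 5D 5E "
    "60 61 65 66 68 69 6A 6C 6D 6E "
    "70 71 75 76 78 79 7D 7E "
    "81 84 85 86 88 8A 8C 8D 8E "
    "90 91 94 95 96 98 99 9A 9D "
    "A0 A1 A2 A4 A5 A6 A8 A9 AA AC AD AE "
    "B0 B1 B4 B5 B6 B8 B9 BA BC BD BE "
    "C0 C1 C4 C5 C6 C8 C9 CA CC CD CE "
    "D0 D1 D5 D6 D8 D9 DD DE "
    "E0 E1 E4 E5 E6 E8 E9 EA EC ED EE "
    "F0 F1 F5 F6 F8 F9 FD FE"
)

def build_opbits(opcodes):
    """Build a 32-byte bitmap with one bit per possible opcode."""
    bits = bytearray(32)
    for op in opcodes:
        bits[op >> 3] |= 1 << (op & 7)   # bit order is an assumption
    return bytes(bits)

OPBITS = build_opbits(DOCUMENTED)

def is_documented(opcode):
    return bool(OPBITS[opcode >> 3] & (1 << (opcode & 7)))

def find(opcode):
    """Return the index into 'tokens', or 0 for an undocumented opcode."""
    if not is_documented(opcode):
        return 0
    # Count the set bits up to and including the opcode's own bit.
    return sum(1 for op in range(opcode + 1) if is_documented(op))

print(find(0), find(1), find(2), find(5), find(177))   # 1 2 0 3 104
```

Opcode 5 comes out as 3 and opcode 177 as 104, matching the figures quoted in the article.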

OK, so we mentioned two addressing modes there - indirect X and zero page. There are 13 different addressing modes in 6502 assembly language which are summarized in the table below:

   Number  Addressing Mode     Example      No. extra bytes expected
   ------  ---------------     -------      ------------------------
   0.      Zero Page (ZP)      LDA &70                1
   1.      Absolute            LDX &3000              2
   2.      Immediate           LDY #54                1
   3.      ZP,X                LDA &50,X              1
   4.      ZP,Y                LDX &70,Y              1
   5.      Absolute,X          LDY &A00,X             2
   6.      Absolute,Y          LDX &900,Y             2
   7.      (Indirect,X)        ORA (&70,X)            1
   8.      (Indirect),Y        LDA (&80),Y            1
   9.      (Indirect)          JMP (&20E)             2
   10.     Accumulator         ASL A                  0
   11.     Implied             PHA                    0
   12.     Relative            BCC skip               1

The addressing mode number (0-12) is stored in one nybble of memory and, knowing the number of the documented opcode (0-151), 'getam' extracts the relevant nybble from the 'admodes' table. This becomes an index into 'nobytes', the number of extra bytes expected after the mnemonic for that particular addressing mode. There can be none, one, or two extra bytes. For example, LDA (&80),Y has a normal opcode value of 177, a documented opcode value of 104 and an addressing mode value of 8, so one byte is expected after the opcode.
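Packing two addressing mode numbers into each byte, and looking up the number of extra bytes, might look like this in Python. The 'nobytes' values are taken straight from the table; whether the low or the high nybble comes first in 'admodes' is my assumption, as is giving the ??? entry the implied mode, and 'pack_nybbles' is a helper invented for the example.

```python
# Extra bytes expected for addressing modes 0-12, taken from the table.
NOBYTES = [1, 2, 1, 1, 1, 2, 2, 1, 1, 2, 0, 0, 1]

def pack_nybbles(modes):
    """Pack 4-bit mode numbers two to a byte (low nybble first: assumed)."""
    modes = list(modes)
    if len(modes) % 2:
        modes.append(0)                  # pad to a whole number of bytes
    return bytes(modes[i] | (modes[i + 1] << 4)
                 for i in range(0, len(modes), 2))

def getam(admodes, index):
    """Extract the addressing mode nybble for a documented opcode index."""
    byte = admodes[index >> 1]
    return (byte >> 4) if index & 1 else (byte & 0x0F)

# First four entries only: ??? (assumed implied), BRK, ORA (ind,X), ORA zp.
admodes = pack_nybbles([11, 11, 7, 0])
mode = getam(admodes, 2)
print(mode, NOBYTES[mode])   # 7 1
```

Storing the modes as nybbles halves the size of the table, which matters when the disassembler has to share memory with the program being examined.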

The extra byte(s) are stored in &75 and &76. The addressing mode is also used as an index into a list of 13 two-byte addresses (stored as 13 low bytes followed by 13 high bytes) of the routines that deal with each addressing mode. For example, 'zp' will print the value of &75 using the routine in the Basic ROM to print a number in hex. 'abs' just prints the value of &76, the high byte of an absolute address in absolute addressing mode, and calls 'zp' to print the low byte.
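The split table of low bytes followed by high bytes, and the way 'abs' leans on 'zp', can be sketched in Python as below. The handler addresses are invented purely for the demonstration, the function names are hypothetical, and I have left out any '&' prefix since the article doesn't say which routine prints it.

```python
def split_table(addresses):
    """Store addresses as all the low bytes followed by all the high bytes."""
    lows = bytes(a & 0xFF for a in addresses)
    highs = bytes(a >> 8 for a in addresses)
    return lows + highs

def handler_address(table, mode, count=13):
    """Rebuild the two-byte address for one addressing mode."""
    return table[mode] | (table[mode + count] << 8)

def zp(low):
    """Print one byte in hex (the low byte or a zero page address)."""
    return "%02X" % low

def abs_(low, high):
    """Print the high byte, then reuse 'zp' for the low byte."""
    return "%02X" % high + zp(low)

# Made-up handler addresses, one per addressing mode.
handlers = [0x1200 + 0x10 * mode for mode in range(13)]
table = split_table(handlers)
print("%04X" % handler_address(table, 8))   # 1280
print(abs_(0x00, 0x30))                     # 3000
```

Keeping the low and high bytes in separate runs means the mode number can be used directly as an index, with no need to double it first.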

In the case of relative addressing, the data displayed after the mnemonic is a two-byte address. This is in common with normal assembly language even though there is only one byte actually stored after the opcode. That byte is a signed offset: a negative value is a backwards branch and a positive value is a forwards branch. The correct address is worked out and printed on the screen making use of the 'abs' routine.
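Working out the destination of a branch is a small piece of arithmetic. On the 6502 the offset is measured from the address of the instruction following the branch, that is, the branch address plus two. A Python sketch, with 'branch_target' as a made-up name:

```python
def branch_target(address, offset):
    """address is where the branch opcode sits; offset is the byte after it."""
    if offset >= 0x80:
        offset -= 0x100                  # treat &80-&FF as negative
    return (address + 2 + offset) & 0xFFFF

print("&%04X" % branch_target(0x1100, 0x05))   # &1107
print("&%04X" % branch_target(0x1100, 0xFE))   # &1100
```

An offset of &FE is minus two, so the second example is a branch back to itself.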

In immediate mode, and also in the decimal memory dump, the flag at &77 is examined which describes whether the number is to be printed in decimal (&77=0) or hexadecimal (&77=1). The flag is toggled between zero and one when the hash key is pressed. Decimal numbers are printed using the ROM routine at &991F, labelled 'prdec' in line 100.
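The effect of the flag on an immediate operand can be modelled like this. The 0=decimal, 1=hexadecimal meaning is from the article; starting in decimal and toggling with an exclusive-OR are my assumptions, and the class is purely illustrative.

```python
class Display:
    """Models the number base flag kept at &77."""

    def __init__(self):
        self.flag = 0                    # 0 = decimal (assumed starting state)

    def toggle(self):
        """What pressing # does: flip between 0 and 1."""
        self.flag ^= 1

    def immediate(self, value):
        """Format the operand of an immediate instruction."""
        if self.flag:
            return "#&%02X" % value
        return "#%d" % value

display = Display()
print("LDA " + display.immediate(32))   # LDA #32
display.toggle()
print("LDA " + display.immediate(32))   # LDA #&20
```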

I described the routine at &991F, plus 'prahex' (print accumulator in hex, &B545), 'iwa_zp' (copy integer work area to zero page variable, &BE44), and 'prstr' (print a string, &BFCF) in EUG #53. The other ROM routines are detailed below:

'inpstr' is the equivalent of Basic's INPUT. On entry to the routine, &37/38 points to the address in memory where the characters typed in at the keyboard will be stored, in this case &0100. This is the bottom of the stack (page 1) and is usually safe to use. If you call the input routine at &BC0D the maximum number of characters allowed is set to 238; however you can specify your own maximum and call the routine at &BC0F with A set to the maximum number of characters. It's 4 here because we are seeking a 16-bit address at which the disassembly is to start.

'dcdhex' converts a string of hexadecimal characters to a binary number. On entry &19/1A points to the address at which the string of characters starts. Location &1B is used as an index into the string and should be set to zero. On exit the converted number resides in the IWA (locations &2A-2D). The IWA can be copied to another zero page location with 'iwa_zp'.
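What a hex decoding routine has to do can be shown in a few lines of Python. This is a model of the general technique, not a disassembly of the ROM routine: stopping at the first character that isn't a hex digit, accepting only upper case letters, and returning the index alongside the value are all assumptions on my part.

```python
HEX_DIGITS = "0123456789ABCDEF"

def decode_hex(text, max_digits=4):
    """Return (value, index) where index is how many digits were read."""
    value = 0
    index = 0                            # plays the part of location &1B
    while index < len(text) and index < max_digits:
        digit = HEX_DIGITS.find(text[index])
        if digit < 0:
            break                        # not a hex digit: stop here
        value = (value << 4) | digit     # shift in four more bits
        index += 1
    return value, index

print(decode_hex("1100"))   # (4352, 4)
```

4352 is &1100 in decimal, the start address used in the earlier example.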

'escerr' is just the location of Basic's "Escape" error message: BRK followed by the error number (17), then the string ("Escape") terminated by zero. Utilizing this data in the Basic ROM means we don't have to duplicate it in our own programs.
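The layout of such an error block is easy to picture as a string of bytes. A Python sketch, with 'error_block' as a hypothetical helper:

```python
def error_block(number, message):
    """BRK opcode, error number, message text, then a zero terminator."""
    return bytes([0x00, number]) + message.encode("ascii") + b"\x00"

block = error_block(17, "Escape")
print(" ".join("%02X" % b for b in block))   # 00 11 45 73 63 61 70 65 00
```

Nine bytes is not much, but every byte borrowed from the ROM is one fewer in our own program.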

With the help of DISASS you can see how all these ROM routines work - go on, take a peek! An important point to note is that the routine addresses given here apply only to Basic 2 on the BBC B and Electron. There isn't much point in providing addresses for Basic 4 on the BBC Master because the Master uses the 65C12 processor. There are many more opcodes and addressing modes in this advanced 6502; significant alterations would have to be made to DISASS in order to recognise them.