ASSEMBLER PROJECT MANUAL

1.1 Introduction

The purpose of this project is to design and implement an assembler for the 6502 using the UNIX operating system. An assembler accepts a "source module" and produces an "object module". More specifically, this project involves writing a cross assembler because the assembly language programs will be written on a host machine and assembled on the host machine but cannot be executed on the host machine. Typically, the object module will be downloaded to a machine containing a 6502 microprocessor and executed. Although several object modules may be linked together and then relocated to produce a "load module" that can be loaded into memory and executed, it is the intent of this project to skip the linking process and produce an absolute "load module". The modules will be set up in such a way that a linker could link the "object modules".

The linking process resolves unsatisfied external references (i.e. references to symbols defined in another module) that may exist in each object module. The relocation process binds relocatable addresses to absolute memory addresses. Relocatable addresses are expressed in terms of an offset from the starting address of the object module which is not determined until relocation time.

After completing this project you will understand a number of concepts used in language translation, among them parsing, lexical analysis, symbol tables, assembly language processing, the representation of programs in relocatable form, conditional translation, and macros. You will also show that you can design and implement a nontrivial project.

1.2 Target Computer Architecture

This section describes the basic 6502 architecture, which is an accumulator-based architecture as opposed to a general-register architecture. The major difference between the two is that an accumulator-based architecture has only one or two accumulators, and all arithmetic and logical operations must use an accumulator as one of the operands. General-register architectures such as the 68000 and VAX-11 have 8 or 16 general registers that can act as accumulators.

Due to the complexity of the project and the limited amount of time, we will choose the manageable project of writing a 6502 assembler. The 6502 microprocessor has an instruction set of 56 instructions with 13 addressing modes. In comparison, the 68000 has around 60 instructions and 20 addressing modes while the VAX-11 has over 240 instructions and approximately 23 addressing modes. If you can write a 6502 assembler, you can write any other assembler.

The 6502 contains five main 8-bit registers: the accumulator, the X and Y registers (used mainly for indexing), the stack pointer, and the processor status register. The one 16-bit register is the PC (program counter).

1.3 Instruction Format

Instructions on the 6502 consist of a variable number of bytes: 1, 2, or 3. The first byte contains the opcode. The remaining bytes (if they exist) contain either data or an address.

1.4 Addressing Modes

One of the main advantages of the 6502 is its 13 different addressing modes. This allows the programmer a great deal of flexibility. The mode of addressing is given by the operand if it exists; otherwise, the addressing is implied.

1.4.1 Immediate Addressing

This addressing mode specifies an 8-bit constant as an operand. An immediate operand is identified to the assembler using a '#' sign as a prefix.

Example: LDA #05 - Loads the ACC with immediate value 5

1.4.2 Absolute Addressing Mode

This mode allows direct addressing of any of the 65,536 memory locations. The operand can be a variable or a numeric constant written in one of the allowable number bases.

Example: LDA 0xA000 - Loads the ACC with the 8-bit contents of location A000

1.4.3 Zero Page Addressing Mode

This is a form of absolute addressing, but only the first 256 memory locations are addressable (i.e. 0000 - 00FF). The operand can be a variable or a numeric constant written in one of the allowable number bases.

Example: LDA 0x07 - Loads the ACC with the 8-bit contents of location 0007

1.4.4 Accumulator Addressing Mode

Four instructions allow shifting or rotating the contents of the accumulator or a memory location. If the operand is an 'A' or 'a' then accumulator addressing is being used.

Example: ASL A - Shifts the contents of the ACC one bit to the left and shifts a zero into the LSB

1.4.5 Implied Addressing Mode

These instructions have no operand because the 6502 opcode provides enough information.

Example: CLC - Clears the carry flag

1.4.6 Absolute X-index Addressing Mode

The effective address is found by adding the contents of the X-register to the absolute address in the instruction.

Example: LDA 0x1234,X - Loads the ACC with the 8-bit contents found at location 1234 + the value of the X-register

1.4.7 Absolute Y-index Addressing Mode

The effective address is found by adding the contents of the Y-register to the absolute address in the instruction.

Example: LDA 0x1234,Y - Loads the ACC with the 8-bit contents found at location 1234 + the value of the Y-register

1.4.8 Zero-page X-indexed Addressing Mode

The effective address is found by adding the contents of the X-register to the zero-page address in the instruction.

Example: LDA 0x12,X - Loads the ACC with the 8-bit contents found at location 12 + the value of the X-register

1.4.9 Zero-page Y-indexed Addressing Mode

The effective address is found by adding the contents of the Y-register to the zero-page address in the instruction. On the 6502 only the LDX and STX instructions support this mode.

Example: LDX 0x12,Y - Loads the X-register with the 8-bit contents found at location 12 + the value of the Y-register

1.4.10 Indirect Addressing Mode

The JMP instruction is the only instruction that can use this mode. With this addressing mode, the operand specifies the address of the two memory locations that contain the destination address. The operand is enclosed in parentheses.

Example: JMP (0x0200) - Jumps to the address stored at locations 0200 and 0201 (low byte first)

1.4.11 Indexed Indirect Addressing Mode

This mode is a combination of indirect addressing and indexed addressing. The contents of the X-register are added to the zero-page address in the instruction, and the result is used to find the destination address. Only the X-register can be used.

Example: ADC (0xFA,X) - The contents of the X-register are added to address FA and then indirect addressing is used to yield the effective address whose 8-bit contents are added to the ACC

1.4.12 Indirect Indexed Addressing Mode

This mode is a combination of indirect addressing and indexed addressing. The contents of the Y-register are added to the indirect address, and the result is the destination address. Only the Y-register can be used.

Example: ADC (0xFA),Y - The contents of the Y-register are added to the indirect address found at locations FA and FB (low byte first) to yield the effective address whose 8-bit contents are added to the ACC
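
The difference between the two indirect modes can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and the `mem` dictionary standing in for memory are hypothetical and not part of the project. The operand is treated as a one-byte page-zero address, and the wraparound within page zero reflects how the 6502 hardware behaves.

```python
def read_pointer(mem, zp_addr):
    """Read a 16-bit address stored low byte first at zp_addr and zp_addr+1."""
    lo = mem.get(zp_addr & 0xFF, 0)
    hi = mem.get((zp_addr + 1) & 0xFF, 0)   # second byte wraps within page zero
    return (hi << 8) | lo

def indexed_indirect(mem, zp_addr, x):
    """(addr,X): add X to the page-zero address first, then fetch the pointer."""
    return read_pointer(mem, (zp_addr + x) & 0xFF)

def indirect_indexed(mem, zp_addr, y):
    """(addr),Y: fetch the pointer first, then add Y to it."""
    return (read_pointer(mem, zp_addr) + y) & 0xFFFF

# Hypothetical memory contents: pointer 0x2000 at FA/FB, pointer 0x1234 at FC/FD.
mem = {0xFA: 0x00, 0xFB: 0x20, 0xFC: 0x34, 0xFD: 0x12}
print(hex(indexed_indirect(mem, 0xFA, 2)))   # 0x1234
print(hex(indirect_indexed(mem, 0xFA, 5)))   # 0x2005
```

Note that the assembler itself never computes these effective addresses; it only emits the opcode and the one-byte operand. The sketch is meant to make the semantics of the two syntaxes easier to tell apart.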

1.4.13 Relative Addressing Mode

The effective address is specified relative to the address of the next instruction to be executed. The effective address is computed by adding a positive or negative displacement to the current value of the program counter. Relative addressing is only used by the branch instructions.

Example: TOP: INX

BNE TOP - Branches to TOP until X-register becomes zero
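
A minimal sketch of how the displacement byte for a branch might be computed, assuming the usual 6502 convention that a branch instruction occupies two bytes and that the displacement is a signed 8-bit value. The function name is hypothetical.

```python
def branch_displacement(branch_addr, target_addr):
    """Return the one-byte displacement for a 2-byte branch located at branch_addr."""
    next_instr = branch_addr + 2            # address of the instruction after the branch
    disp = target_addr - next_instr
    if not -128 <= disp <= 127:
        raise ValueError("branch out of range")
    return disp & 0xFF                      # two's-complement encoding in one byte

# TOP: INX assembled at 0x0000 (1 byte), BNE TOP assembled at 0x0001.
print(hex(branch_displacement(0x0001, 0x0000)))   # 0xfd, i.e. -3
```

In the example, the branch is at 0001, the next instruction is at 0003, and TOP is at 0000, so the displacement is -3.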

1.5 Assembler Specification

This section describes the assembly language input to the assembler, the form of the source listing, and the object module produced by the assembler as output.

1.5.1 Assembly Language Format

Assembly language statements are of the following forms:

1) label: opcode op1 ; comment

2) label: pseudo-op op1,op2,op3,op4,op5,op6 ; comment

1.5.2 Labels

Labels are symbols (see section 1.5.9.1.2) that are optional. They do not have to begin in the first position, but if present, are the first symbol in the line. They are followed by a colon (optionally preceded and/or followed by blanks and/or tab characters).

1.5.3 Opcodes

A symbol is used to designate a machine operation (opcode) or a pseudo-operation (pseudo-op) which is a directive to the assembler. Opcodes are defined in section 1.5.10.4 and pseudo-ops are defined in section 1.5.10.5. Opcodes and pseudo-ops may begin anywhere on the source line (but after the label if present). One or more blanks and/or tab characters separate the opcode from the operand field(s).

1.5.4 Operand Field

An operand field can contain operand specifiers for instructions or arguments for pseudo-ops. Operands for instructions are described in section 1.5.6. Arguments for pseudo-ops must meet the format requirements specified in section 1.5.10.5. Operand fields are separated by a comma which is optionally preceded and/or followed by one or more blanks and/or tab characters.

1.5.5 Comment Field

The comment field follows the operand field and begins with a ';' (optionally preceded by blanks and/or tab characters). A comment could be on a line by itself. The comment can contain any character and is terminated by a newline (carriage return) character.

1.5.6 Operand Specifiers

    Mode                  Assembler Syntax

 1) Immediate             #constant
 2) Absolute              absolute address
 3) Zero-page             zero-page address
 4) Accumulator           A
 5) Implied               (no operand specification)
 6) Absolute X-index      absolute address,X
 7) Absolute Y-index      absolute address,Y
 8) Zero-page X-index     zero-page address,X
 9) Zero-page Y-index     zero-page address,Y
10) Indirect              (address)
11) Indexed indirect      (address,X)
12) Indirect indexed      (address),Y
13) Relative              label
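
One possible way to recognize these syntactic forms is sketched below. It is a rough illustration, not a required design: the function name and the form names it returns are hypothetical, and it classifies by syntax only. Deciding between absolute and zero-page (which depends on the operand's value) and recognizing relative mode (which depends on the opcode being a branch) would happen elsewhere.

```python
import re

def operand_form(operand):
    """Classify an operand string by its syntax alone."""
    text = operand.replace(" ", "").replace("\t", "")   # blanks and tabs are not significant here
    if text == "":
        return "implied"
    if text.upper() == "A":
        return "accumulator"
    if text.startswith("#"):
        return "immediate"
    if re.fullmatch(r"\(.+,[Xx]\)", text):
        return "indexed_indirect"
    if re.fullmatch(r"\(.+\),[Yy]", text):
        return "indirect_indexed"
    if re.fullmatch(r"\(.+\)", text):
        return "indirect"
    if re.fullmatch(r".+,[Xx]", text):
        return "x_indexed"          # absolute,X or zero-page,X
    if re.fullmatch(r".+,[Yy]", text):
        return "y_indexed"          # absolute,Y or zero-page,Y
    return "address"                # absolute, zero-page, or relative

for sample in ["#05", "0xA000", "A", "", "0x1234,X", "(0x0200)", "(0xFA,X)", "(0xFA),Y"]:
    print(repr(sample), operand_form(sample))
```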

1.5.7 Constants

A constant may be an expression (see section 1.5.9) or an ASCII character string denoted as: 'string' where the string is delimited by single quote characters.

1.5.8 Address

An address is simply an expression (see section 1.5.9).

1.5.9 Expressions

1.5.9.1 Term

A term can be any one of the following:

1.5.9.1.1 Number

A number can be represented in four different bases. A decimal number consists of a sequence of decimal digits without a leading zero. A binary number consists of the characters '0b' followed by a sequence of binary digits (0 or 1). A hexadecimal number consists of the characters '0x' followed by a sequence of hexadecimal digits (0-9, A-F). An octal number consists of a sequence of octal digits (0-7) with a leading zero. Letters in a number may be written in upper or lower case. A number is terminated by a character that is outside the set of valid characters of that base.
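
The four bases can be told apart by the first one or two characters. The following is a minimal sketch, assuming the term has already been isolated by the scanner; `parse_number` is a hypothetical name.

```python
def parse_number(text):
    """Convert a numeric term to an int using the four bases described above."""
    lower = text.lower()
    if lower.startswith("0x"):
        digits, base = lower[2:], 16
    elif lower.startswith("0b"):
        digits, base = lower[2:], 2
    elif lower.startswith("0"):
        digits, base = lower, 8         # leading zero means octal; a lone "0" is just 0
    else:
        digits, base = lower, 10
    allowed = "0123456789abcdef"[:base]
    if not digits or any(ch not in allowed for ch in digits):
        raise ValueError(f"badly formed number: {text!r}")
    return int(digits, base)

print(parse_number("0xA000"), parse_number("0b101"), parse_number("017"), parse_number("25"))
# 40960 5 15 25
```

One consequence worth noticing: a term such as 09 is not a valid number, because the leading zero selects octal and 9 is not an octal digit.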

1.5.9.1.2 Symbol

A symbol is a string of eight or fewer characters. Symbols are made up of letters (upper or lower case), digits, periods, underscores, and/or dollar signs but must begin with a letter. There is no distinction between upper and lower case letters in identifying a symbol.
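
A sketch of symbol checking under two assumptions that are not fixed by the text above: symbols are folded to upper case to make comparison case-insensitive, and an over-long symbol is truncated to its first eight characters (the action listed for error code 08 in section 1.6.5.4). The names used are hypothetical.

```python
import re

SYMBOL_RE = re.compile(r"[A-Za-z][A-Za-z0-9._$]*")
MAX_SYMBOL_LEN = 8

def normalize_symbol(text):
    """Return (canonical_name, too_long); raise ValueError if text is not a symbol."""
    if not SYMBOL_RE.fullmatch(text):
        raise ValueError(f"not a symbol: {text!r}")
    too_long = len(text) > MAX_SYMBOL_LEN
    return text[:MAX_SYMBOL_LEN].upper(), too_long

print(normalize_symbol("top"))             # ('TOP', False)
print(normalize_symbol("Loop_Counter1"))   # ('LOOP_COU', True)
```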

1.5.9.1.3 Current Location Counter

The location counter is syntactically represented by the period (.). It is maintained by the assembler and has the value of the address of the current byte or word. Note: when the location counter is used in the operand field of an instruction, it has the value of the address of that operand and not the value of the address of the beginning of the instruction.

1.5.9.1.4 Evaluation of an Expression

An expression is a combination of terms joined by binary operators. The leading term will not contain a unary operator (+,-) but you can allow this. Expressions are evaluated from left-to-right with no operator precedence. Binary operators are: + (sum), - (difference), * (product), / (integer division)
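
Because there is no operator precedence, evaluation reduces to a single left-to-right loop. The sketch below is deliberately simplified: it handles decimal terms only (symbols, the location counter, and the other bases are assumed to have been resolved already), and it assumes integer division truncates toward zero, which the text above does not specify. The function name is hypothetical.

```python
import re

def eval_left_to_right(expr):
    """Evaluate decimal terms joined by + - * / strictly from left to right."""
    tokens = re.findall(r"\d+|[-+*/]", expr)
    value = int(tokens[0])
    for op, term in zip(tokens[1::2], tokens[2::2]):
        operand = int(term)
        if op == "+":
            value += operand
        elif op == "-":
            value -= operand
        elif op == "*":
            value *= operand
        else:
            # Assumption: truncate toward zero rather than floor.
            quotient = abs(value) // abs(operand)
            value = quotient if (value < 0) == (operand < 0) else -quotient
    return value

print(eval_left_to_right("2+3*4"))    # 20, not 14
print(eval_left_to_right("10-4/2"))   # 3, not 8
```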

1.5.10 Mode of an Expression

1.5.10.1 Absolute Mode

An expression is absolute mode if its value is an assembly-time constant. An expression that consists entirely of numeric terms is absolute. An expression that has the same number of relocatable terms added as subtracted is also of absolute mode.

1.5.10.2 Relocatable Mode

An expression that has exactly one more relocatable term added than subtracted is of relocatable mode; i.e. its value is fixed relative to the start of the program.
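
The absolute and relocatable rules amount to counting relocatable terms by sign. A minimal sketch, assuming the evaluator has already reduced the expression to a list of added and subtracted terms; the function name and the pair representation are hypothetical.

```python
def expression_mode(signed_terms):
    """signed_terms: list of (sign, is_relocatable) pairs, where sign is +1 or -1."""
    balance = sum(sign for sign, is_relocatable in signed_terms if is_relocatable)
    if balance == 0:
        return "absolute"
    if balance == 1:
        return "relocatable"
    raise ValueError("relocation error")

# top - start + 4, with top and start both relocatable: the offsets cancel.
print(expression_mode([(+1, True), (-1, True), (+1, False)]))   # absolute
# top + 4: one relocatable term remains.
print(expression_mode([(+1, True), (+1, False)]))               # relocatable
```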

1.5.10.3 External Mode

An expression is of external mode if it consists entirely of an external symbol (see section 1.5.10.5.7).

1.5.10.4 Opcodes

Opcodes are one byte and are determined according to the instruction and the mode of addressing being used. There are 56 instructions but over 150 opcodes. As an example LDA #07 uses immediate addressing; therefore, the opcode for LDA is A9. The LDA instruction has 8 different opcodes.
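
As an illustration, the eight LDA opcodes can be held in a table keyed by addressing mode. The opcode values are the standard 6502 encodings; the mode names and the `assemble_lda` helper are hypothetical and only meant to show how the opcode and the operand length both follow from the mode.

```python
LDA_OPCODES = {
    "immediate":        0xA9,
    "zero_page":        0xA5,
    "zero_page_x":      0xB5,
    "absolute":         0xAD,
    "absolute_x":       0xBD,
    "absolute_y":       0xB9,
    "indexed_indirect": 0xA1,
    "indirect_indexed": 0xB1,
}

# Modes whose operand is a 16-bit address (stored low byte first).
TWO_BYTE_OPERAND = {"absolute", "absolute_x", "absolute_y"}

def assemble_lda(mode, value):
    """Return the machine-code bytes for LDA in the given addressing mode."""
    opcode = LDA_OPCODES[mode]
    if mode in TWO_BYTE_OPERAND:
        return bytes([opcode, value & 0xFF, (value >> 8) & 0xFF])
    return bytes([opcode, value & 0xFF])

print(assemble_lda("immediate", 0x07).hex(" "))    # a9 07
print(assemble_lda("absolute", 0xA000).hex(" "))   # ad 00 a0
```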

1.5.10.5 Pseudo-Instructions

Pseudo-instructions are directives or commands to the assembler. This assembler will define the following pseudo-ops:

1.5.10.5.1 title name

This pseudo-op assigns a name to the object module. 'name' conforms to the definition of a symbol.

1.5.10.5.2 end

This pseudo-op terminates the source program. Any text in the source file following this statement is ignored.

1.5.10.5.3 blkb/blkw expression

These pseudo-ops reserve storage of bytes or words. The expression which represents the number of information units allocated must be of absolute mode and all symbols used MUST be previously defined.

1.5.10.5.4 byte/word constant-list

These pseudo-ops generate successive bytes or words of data. The constant-list is a list of expressions separated by commas (optionally preceded and/or followed by blanks and/or tab characters). No more than six expressions in the list can exist.

1.5.10.5.5 set symbol,expression

This pseudo-op enters symbol into the symbol table with the value of the expression. Symbols within expression must be previously defined. Multiple set statements with the same symbol are allowed. The most recent value replaces the previous value. Redefinition of a symbol defined other than by a set operation is not allowed.

1.5.10.5.6 entry symbol-list

The symbol(s) specified are to be known globally; i.e. as entry points into this module.

1.5.10.5.7 extern symbol-list

The symbol(s) specified are not defined in this module but are considered to be external symbols.

Note: External references are matched up with entry points defined in other object modules by the relocating linking loader which combines one or more object modules into a single "load module".

1.5.10.5.8 ascii 'string'

This pseudo-op generates successive bytes of ASCII characters. The string begins with a single quote and ends with a single quote. Any legal characters can exist between the delimiters.

1.5.11 Source Listing Format

The assembler must print a listing of the source module where the format of the print line is:

columns:format

  1-4: hex relocatable addresses
    5: pipe character
 6-13: object module representation (in hex)
14-15: blanks
16-19: decimal sequence number (starting with 1)
   20: pipe character
   21: blank
22-80: source line

If an error occurs, the error message must be printed under the line with the error. Also, the object module representation is specified in hex from left to right, with addresses or words split into two bytes separated by a space and in reverse order (low byte first). Further, if a source line is too long to fit, continue it starting in column 22 of the next line down.

Example:

file: prog1 
loc  obj rep   line  source
---  --------  ----  ------
0000|             1|        title prog2
0000|             2|        set count,0x0a
0000|18           3|        clc
0001|6D 00 01     4|   top: adc 0x100
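
A sketch of a formatter that follows the column table in this section. Two things are assumptions rather than requirements from the text: the sequence number is right-justified in columns 16-19, and continuation lines are padded with blanks up to column 21. The sketch does not handle statements that generate more object bytes than fit in columns 6-13. The function name is hypothetical.

```python
SOURCE_WIDTH = 59   # columns 22-80

def listing_lines(address, obj_bytes, seq, source):
    """Return the listing line(s) for one source statement."""
    obj = " ".join(f"{b:02X}" for b in obj_bytes)
    chunks = [source[i:i + SOURCE_WIDTH]
              for i in range(0, len(source), SOURCE_WIDTH)] or [""]
    # cols 1-4 address, col 5 pipe, cols 6-13 object bytes, cols 14-15 blank,
    # cols 16-19 sequence number, col 20 pipe, col 21 blank, cols 22-80 source
    lines = [f"{address:04X}|{obj:<8}  {seq:>4}| {chunks[0]}"]
    lines += [" " * 21 + chunk for chunk in chunks[1:]]   # continuation starts in column 22
    return lines

for line in listing_lines(0x0001, [0x6D, 0x00, 0x01], 4, "top: adc 0x100"):
    print(line)
```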

1.5.12 Object Module Representation

The principal output from the assembler is an object module. This module is a representation of the source module which is acceptable as input to the relocating linking loader. The object module is broken down into eleven parts (some of which are not pertinent for this project description) which are described separately.

1.5.12.1 Magic number

The first longword (4 bytes) contains a "magic number" that tells how the object module has been set up. Use a value of 0x107 which stands for "old format impure" (this means the data and text areas are contiguous which isn't good for sharing data between processes).

1.5.12.2 Size of Text Area

The next longword contains a value indicating the number of bytes the text area occupies (see section 1.5.12.8).

1.5.12.3 Size of Data Area

The next longword contains a value indicating the number of bytes the data area occupies. Use a value of 0 since the data is embedded within the text area in the "old impure format".

1.5.12.4 Size of BSS Area

The next longword contains the size of the BSS or uninitialized data area. Use a value of 0.

1.5.12.5 Size of OMST

Portions of the assembler's symbol table are dumped into the object module to allow passing of information to the linking loader. This information concerns the symbols that are entry points or external symbols. The symbol table does not actually contain the symbols; it contains references into a string table described in section 1.5.12.11. The next longword contains the length in bytes of the symbol table (see section 1.5.12.10).

1.5.12.6 Size of Text Relocation Area

The next longword contains the size (in bytes) of the text relocation information. This area is described in section 1.5.12.9.

1.5.12.7 Size of Data Relocation Area

The next longword contains the size of the data relocation area. Use a value of 0.
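
The seven size fields described so far form a fixed 28-byte header. The sketch below packs them with Python's `struct` module. The byte order is an assumption (little-endian is used here); the text does not state it, so check what your linking loader expects. The function and constant names are hypothetical.

```python
import struct

MAGIC_OLD_IMPURE = 0x107

def pack_header(text_size, omst_size, text_reloc_size):
    """Pack the seven header longwords in the order given in sections 1.5.12.1-1.5.12.7."""
    return struct.pack(
        "<7L",                # assumption: little-endian 4-byte unsigned longwords
        MAGIC_OLD_IMPURE,     # magic number
        text_size,            # size of text area
        0,                    # size of data area (always 0 here)
        0,                    # size of BSS area (always 0 here)
        omst_size,            # size of OMST
        text_reloc_size,      # size of text relocation area
        0,                    # size of data relocation area (always 0 here)
    )

header = pack_header(text_size=4, omst_size=12, text_reloc_size=8)
print(len(header), header.hex(" "))
```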

1.5.12.8 Text Area

This area contains the actual numeric representation of the source module as a sequence of consecutive bytes. Symbolic opcodes, addressing modes, and symbols are represented as absolute or relocatable values. Relocation information is found in the relocation area (see section 1.5.12.9).

1.5.12.9 Relocation Area

The relocation area consists of entries, each occupying two longwords. The first longword of each entry is the relocatable address in the text area whose value is to be relocated. The second longword is broken into five fields. The low order 24 bits is a local symbol ordinal. The next higher bit is the pc relative bit. The next higher two-bit field contains the length of the relocatable symbol (0=byte, 1=word). The next higher bit is set if the expression whose value is found in the byte or word contained a reference to an external symbol. The high order four bits are unused and set to zero.

32-bits: relocatable address
4-bits:  unused
1-bit:   external
2-bits:  length
1-bit:   pc relative
24-bits: ordinal

(32:4:1:2:1:24)

The local symbol ordinal is used for two purposes, depending on the value of "external". If "external" is true (1), then the ordinal is an index (starting at 0) into the OMST, specifying which symbol was used as an external reference. If "external" is false, then the ordinal contains a value specifying the mode of the expression. Use a value of 0 for this case. The pc relative bit is set if the assembler assembled the value relative to the pc.
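
The bit layout of the second longword can be made concrete with a few shifts. This is a sketch under one assumption not stated in the text, namely little-endian byte order for the two longwords; the function name is hypothetical.

```python
import struct

def pack_reloc_entry(address, ordinal, pc_relative, length, external):
    """Pack one relocation entry: address longword, then the (4:1:2:1:24) longword."""
    if not 0 <= ordinal < (1 << 24):
        raise ValueError("ordinal does not fit in 24 bits")
    if length not in (0, 1):
        raise ValueError("length must be 0 (byte) or 1 (word)")
    info = (
        ordinal                        # bits 0-23: local symbol ordinal
        | (int(pc_relative) << 24)     # bit 24: pc relative
        | (length << 25)               # bits 25-26: length (0=byte, 1=word)
        | (int(external) << 27)        # bit 27: external; bits 28-31 stay zero
    )
    return struct.pack("<LL", address, info)   # assumption: little-endian

# A word-sized, non-external, non-pc-relative relocation at text offset 2.
entry = pack_reloc_entry(address=0x0002, ordinal=0, pc_relative=False, length=1, external=False)
print(entry.hex(" "))   # 02 00 00 00 00 00 00 02
```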

Note: There are four different combinations of "external" and "pc relative" which imply different actions to the linking loader which will be discussed in a later handout.

1.5.12.10 Object Module Symbol Table (OMST)

The symbol table found in the object module consists of zero or more entries, each occupying three longwords. The first longword contains a byte offset into the string area where the symbol name (as a sequence of ASCII representations in successive bytes) begins. The next longword contains a mode description of the value of the symbol (0=undefined, 2=absolute, 4=text, 6=data, 8=bss; note that data and bss will never be used since data is placed in the "text" area). If the symbol is also external add a 1. The third longword contains the value of the symbol.

1.5.12.11 String Area

The first longword of the string area contains the length in bytes of the string area. After this, successive bytes contain the symbol names (in ASCII representation, one character per byte) where each is terminated by a byte of zero.
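
The OMST and the string area are built together, since each OMST entry points into the string area. The sketch below makes three assumptions that the text leaves open and that you should confirm before relying on them: longwords are little-endian, the string area length counts its own leading longword, and name offsets are measured from the start of the string area (so the first name is at offset 4). The function name is hypothetical.

```python
import struct

def build_symbol_area(symbols):
    """symbols: list of (name, mode, value). Returns (omst_bytes, string_area_bytes)."""
    names = b""
    omst = b""
    for name, mode, value in symbols:
        offset = 4 + len(names)                    # assumption: offsets include the length longword
        names += name.encode("ascii") + b"\0"      # each name is terminated by a zero byte
        omst += struct.pack("<3L", offset, mode, value)
    string_area = struct.pack("<L", 4 + len(names)) + names
    return omst, string_area

# TOP is a text symbol (mode 4) with value 1; EXT is undefined and external (0 + 1).
omst, strings = build_symbol_area([("TOP", 4, 1), ("EXT", 1, 0)])
print(len(omst), strings)
```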

1.6 Assembler Operation

1.6.1 Algorithms

An assembler performs the task of translating the symbolic assembly language program into the binary machine language program. Forward references usually make the choice of a two-pass assembler desirable.

1.6.2 Pass1

The primary purpose of pass1 is symbol definition. Symbols are defined in one of two ways:

1) by the use of a symbol in a label field
2) by the use of the set pseudo-op

The definition of a symbol is made by associating a value/mode pair with the symbol name in a symbol table. The processing of pass1 is as follows:

lc := 0
repeat
  read source program line
  analyze line to determine linetype
  case linetype of
    comment: pass1ignore
    pseudo: pass1pseudo
    machine: pass1machine
    undef: pass1undefined
  end
  write record to scratch file
until end of pass1

The various kinds of lines are processed differently, as discussed later. Note: The pseudo-ops cause a variety of pass1 activity (see section 1.6.4). Machine operations have their labels defined and their sizes computed so that the location counter may be incremented appropriately. It is recommended for the user's assistance and for debugging your assembler that the symbol table be printed at the end of pass1. When you turn in your final assembler, have the symbol table printed at the end of pass2.
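
The core of pass1, defining labels while advancing the location counter, can be sketched as follows. The input is assumed to be already parsed into (label, size) pairs, which sidesteps line analysis and pseudo-op handling entirely; the function name and the representation are hypothetical, and a real pass1 would also write scratch-file records.

```python
def pass1(parsed_lines):
    """parsed_lines: iterable of (label_or_None, size_in_bytes) pairs."""
    symtab = {}
    errors = []
    lc = 0
    for lineno, (label, size) in enumerate(parsed_lines, start=1):
        if label is not None:
            key = label.upper()                 # symbols are case-insensitive
            if key in symtab:
                errors.append((lineno, 1))      # error 01: ignore second occurrence
            else:
                symtab[key] = lc                # label takes the current location counter
        lc += size
    return symtab, errors, lc

# clc / top: adc 0x100 / bne top / TOP: inx  (the last label is a duplicate)
symtab, errors, lc = pass1([(None, 1), ("top", 3), (None, 2), ("TOP", 1)])
print(symtab, errors, lc)   # {'TOP': 1} [(4, 1)] 7
```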

1.6.3 Pass2

The second pass of the assembler produces the object module and the source listing.

rewind scratch file
lc := 0
repeat
  read a record from the scratch file
  case linetype of
    comment: pass2ignore
    pseudo: pass2pseudo
    machine: pass2machine
  end
until end of pass2

As instructions are assembled, entries are made into the object module's text area and the appropriate source listing lines are printed.

1.6.4 Pseudo-op Processing

1.6.4.1 title

Pass1: must be the first source line
Pass2: enter title name into string area

1.6.4.2 end

Pass1: none
Pass2: print symbol table

1.6.4.3 blkb/blkw

Pass1: if labeled, enter symbol into ST; evaluate and increment lc
Pass2: increment lc and output appropriate number of cleared bytes to object module

1.6.4.4 byte/word

Pass1: if labeled, enter symbol into ST; count number of operands and increment lc appropriately
Pass2: evaluate each operand and place in appropriate unit in object module; increment lc appropriately

1.6.4.5 set

Both passes: evaluate value and enter symbol and value into ST

1.6.4.6 entry

Pass1: enter symbol(s) into ST as entry points
Pass2: no action

1.6.4.7 extern

Pass1: enter symbol(s) into ST as external symbols
Pass2: no action

1.6.4.8 ascii

Pass1: determine the number of characters in the string and increment the lc by the number of characters
Pass2: evaluate each character placing the ascii values in the object module; increment lc appropriately

1.6.5 Data Structures

1.6.5.1 Symbol Table

Each entry in the ST consists of at least the following specifications:

- symbol name
- value/mode pair
- was the value defined by a set pseudo-op
- is the symbol an entry point

The following operations must be performed efficiently:

- insert an entry
- search for a symbol
- access and/or alter the information associated with a symbol
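
One way to meet these requirements is a hash table keyed by the case-folded symbol name. The sketch below is a starting point only: the class and field names are hypothetical, and handling of entry and extern declarations is left out. It shows the one subtle rule, that a symbol may be redefined only when both the old and the new definition come from set.

```python
from dataclasses import dataclass

@dataclass
class Symbol:
    name: str
    value: int
    mode: str                      # e.g. "absolute" or "relocatable"
    defined_by_set: bool = False
    is_entry: bool = False

class SymbolTable:
    def __init__(self):
        self._entries = {}         # dict gives expected constant-time insert and search

    def lookup(self, name):
        """Return the Symbol for name, or None if it is not defined."""
        return self._entries.get(name.upper())

    def define(self, name, value, mode, by_set=False):
        """Insert or update a symbol; return False on an illegal redefinition."""
        key = name.upper()
        existing = self._entries.get(key)
        if existing is not None and not (by_set and existing.defined_by_set):
            return False
        self._entries[key] = Symbol(key, value, mode, defined_by_set=by_set)
        return True

st = SymbolTable()
st.define("count", 10, "absolute", by_set=True)
st.define("count", 11, "absolute", by_set=True)    # allowed: set may redefine set
print(st.lookup("COUNT").value)                     # 11
print(st.define("top", 0, "relocatable"))           # True
print(st.define("TOP", 5, "relocatable"))           # False: duplicate label
```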

1.6.5.2 Scratch File

A two-pass assembler that does not require a programmer to input the source module twice must make the source module available to the second pass after processing in the first pass. A scratch file will therefore be used. This file can contain MUCH more information than just a copy of the source module that can make the second pass much more efficient at little expense to pass1 processing.

Hint: What information is known in pass1 that would have to be recomputed in pass2 if it wasn't saved in a scratch file?

1.6.5.3 Opcode Table

The opcode table provides a mechanism whereby the numeric representation of an instruction mnemonic may be retrieved. The opcode table routines will need to know:

mnemonic - symbolic assembly language symbol
class - the class of the mnemonic i.e. machine, pseudo-op, or undefined mnemonic
operand mode - one of thirteen addressing modes

The only significant operation on this data structure is a search for a matching mnemonic.

1.6.5.4 Error Processing

Clearly, many kinds of errors can be made by an assembly language programmer. In addition, problems can arise for other reasons; e.g. storage overflow in the ST. Every possible error must be considered in the design process. Error processing which is "tacked on" after design is a very poor substitute for careful consideration in the design process.

For the purpose of the project, you are required to handle only the following types of errors. An error code should be printed on the source listing under the source line indicating the first error that occurred.

CODE  CAUSE                                       ACTION
----  ------------------------------------------  -------------------------------------
01    A symbol occurs as a label more than once   Ignore second occurrence
      or in conflicting contexts
02    EVAL detected a badly formed expression     EVAL returns a value of 0,ab
03    Syntax error. Poorly formed expression      Treat entire instruction as a comment
04    An expression gives a wrong mode            Use correct mode if possible
05    Too many operands                           Assemble as much as possible
06    Illegal opcode                              Use an opcode of 00
07    Poorly formed operand specification syntax  Assemble a byte or word of 00
08    Symbol too long                             Use first eight characters
09    Value range error                           Use a value of 00
10    Undefined label                             Use a value of 00
11    Branch out of range                         Use a calculated value mod 128
12    EVAL detected a relocation error            EVAL returns a value of 0,ab
13    Illegal operand mode                        Use a value of 00

Your assembler must run through an entire assembly language program regardless of errors. For each numbered error that occurs, print the cause under the source line. At the end of the assembly print a message of the form:

0 ERROR(s) or

3 ERROR(s).


© Douglas J. Ryan
Douglas J. Ryan/ryandj.pacificu.edu