Assembly Language Style Guidelines - Data Types

8.0 Data Types

Prior to the arrival of MASM, most assemblers provided very little capability for declaring and allocating complex data types. Generally, you could allocate bytes, words, and other primitive machine structures. You could also set aside a block of bytes. As high level languages improved their ability to declare and use abstract data types, assembly language fell farther and farther behind. Then MASM came along and changed all that[24]. Unfortunately, many longtime assembly language programmers haven't bothered to learn the new MASM syntax for things like arrays, structures, and other data types. Likewise, many new assembly language programmers don't bother learning and using these data typing facilities because they're already overwhelmed by assembly language and want to minimize the number of things they've got to learn. This is really a shame because MASM data typing is one of the biggest improvements to assembly language since the use of mnemonics rather than binary opcodes for machine level programming.

Note that MASM is a "high-level" assembler. It does things assemblers for other chips won't do, like checking the types of operands and reporting errors if there are mismatches. Some people who are used to assemblers on other machines find this annoying. However, it's a great idea in assembly language for the same reason it's a great idea in HLLs[25]. These features have one other beneficial side-effect: they help others understand what you're trying to do in your programs. It should come as no surprise, then, that this style guide will encourage the use of these features in your assembly language programs.
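
As a small illustration of the operand type checking described above, consider the following sketch. The variable name Flags is hypothetical, and the rejected instruction is commented out so that the fragment still assembles:

Flags           byte    0, 0

;               mov     ax, Flags               ;Rejected: byte variable, word register.
                mov     al, Flags               ;Accepted: operand sizes match.
                mov     ax, word ptr Flags      ;Accepted: explicit override, reads both bytes.

The explicit "word ptr" override tells both MASM and the reader that the size mismatch is intentional.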

8.1 Defining New Data Types with TYPEDEF

MASM provides a small number of primitive data types. For typical applications, bytes, sbytes, words, swords, dwords, sdwords, and various floating point formats are the most commonly used scalar data types available. You may construct more abstract data types by using these built-in types. For example, if you want a character, you'd normally declare a byte variable. If you want a 16-bit integer, you'd typically use the sword (or word) declaration. Of course, when you encounter a variable declaration like "answer byte ?" it's a little difficult to figure out what the real type is. Do we have a character, a boolean, a small integer, or something else here? Ultimately it doesn't matter to the machine; a byte is a byte is a byte. Its interpretation as a character, boolean, or integer value is defined by the machine instructions that operate on it, not by the way you define it. Nevertheless, this distinction is important to someone who is reading your program (perhaps they are verifying that you've supplied the correct instruction sequence for a given data object). MASM's typedef directive can help make this distinction clear.

In its simplest form, the typedef directive behaves like a textequ. It lets you replace one string in your program with another. For example, you can create the following definitions with MASM:

char            typedef byte
integer         typedef sword
boolean         typedef byte
float           typedef real4
IntPtr          typedef far ptr integer

Once you have declared these names, you can define char, integer, boolean, and float variables as follows:

MyChar          char    ?
I               integer ?
Ptr2I           IntPtr  I
IsPresent       boolean ?
ProfitsThisYear float   ?

Use the existing MASM data types as type building blocks. For most data types you create in your program, you should declare explicit type names using the typedef directive. There is really no excuse for using the built-in primitive types[26].
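
To see what this guideline buys you, compare the two groups of declarations in the sketch below. All of the variable names, and the shortint type name, are made up for illustration. Each group reserves three bytes, but only the second tells the reader what those bytes hold:

char            typedef byte
boolean         typedef byte
shortint        typedef sbyte

Answer1         byte    ?               ;Character? Flag? Small integer?
Done1           byte    ?
Delta1          byte    ?

Answer2         char    ?               ;A character.
Done2           boolean ?               ;A true/false flag.
Delta2          shortint ?              ;A signed 8-bit integer.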

8.2 Creating Array Types

MASM provides an interesting facility for reserving blocks of storage - the DUP operator. This operator is unusual (among assembly languages) because its definition is recursive. The basic definition is (using HyGram notation):

	DupOperator = expression ws* 'DUP' ws* '(' ws* operand ws* ')' %%

Note that "expression" expands to a valid numeric value (or numeric expression), "ws*" means "zero or more whitespace characters" and "operand" expands to anything that is legal in the operand field of a MASM word/dw, byte/db, etc., directive[27]. One would typically use this operator to reserve a block of memory locations as follows:

ArrayName       integer 16 dup (?)      ;Declare array of 16 words.

This declaration would set aside 16 contiguous words in memory.

The interesting thing about the DUP operator is that any legal operand field for a directive like byte or word may appear inside the parentheses, including additional DUP expressions. The DUP operator simply says "duplicate this object the specified number of times." For example, "16 dup (1,2)" says "give me 16 copies of the value pair one and two." If this operand appeared in the operand field of a byte directive, it would reserve 32 bytes, containing the alternating values one and two.
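
A sketch of that byte declaration appears below, alongside a shorter DUP expression and its written-out equivalent. The names Pairs, ShortDup, and ShortList are hypothetical:

Pairs           byte    16 dup (1,2)    ;32 bytes: 1,2,1,2,...,1,2

ShortDup        byte    3 dup (1,2)     ;These two lines reserve the
ShortList       byte    1,2,1,2,1,2     ; same six bytes.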

So what happens if we apply this technique recursively? Well, "4 dup ( 3 dup (0))" when read recursively says "give me four copies of whatever is inside the (outermost) parentheses." This turns out to be the expression "3 dup (0)" that says "give me three zeros." Since the original operand says to give four copies of three copies of a zero, the end result is that this expression produces 12 zeros. Now consider the following two declarations:

Array1		integer	4 dup ( 3 dup (0))
Array2		integer	12 dup (0)

Both definitions set aside 12 integers in memory (initializing each to zero). To the assembler these are nearly identical; to the 80x86 they are absolutely identical. To the reader, however, they are obviously different. Were you to declare two identical one-dimensional arrays of integers, using two different declarations makes your program inconsistent and, therefore, harder to read.

However, we can exploit this difference to declare multidimensional arrays. The first example above suggests that we have four copies of an array containing three integers each. This corresponds to the popular row-major array access function. The second example above suggests that we have a single dimensional array containing 12 integers.

Take advantage of the recursive nature of the DUP operator to declare multidimensional arrays in your programs.
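
The following sketch shows one way the row-major access function might look for such an array. The names Matrix, NumRows, and NumCols, and the choice of BX for the row index and SI for the column index, are assumptions made for this example. The code targets the 16-bit 80x86 and assumes DS addresses the data:

integer         typedef sword

NumRows         =       4
NumCols         =       3
Matrix          integer NumRows dup ( NumCols dup (0))

; Load Matrix[row][col] into AX.
; Element offset = (row * NumCols + col) * 2 bytes per integer.

                mov     ax, NumCols
                mul     bx              ;DX:AX = row * NumCols (DX is clobbered).
                add     ax, si          ;AX = row * NumCols + col.
                shl     ax, 1           ;Scale the index by the size of an integer.
                mov     bx, ax
                mov     ax, Matrix[bx]  ;Fetch the element.

Because the declaration already reads as NumRows rows of NumCols integers, the address computation can be checked against it at a glance.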

8.3 Declaring Structures in Assembly Language

MASM provides an excellent facility for declaring and using structures, unions, and records[28]; for some reason, many assembly language programmers ignore them and manually compute offsets to fields within structures in their code. Not only does this produce hard-to-read code, the result is nearly unmaintainable as well.

When a structure data type is appropriate in an assembly language program, declare the corresponding structure in the program and use it. Do not compute the offsets to fields in the structure manually; use the standard structure "dot-notation" to access fields of the structure.
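
The sketch below contrasts the two approaches. The Point structure and the Origin variable are hypothetical names chosen for this example:

integer         typedef sword

Point           struct
x               integer ?
y               integer ?
Point           ends

Origin          Point   {}

                mov     ax, word ptr Origin+2   ;Avoid: the 2 is a hand-computed offset.
                mov     ax, Origin.y            ;Prefer: MASM supplies the offset.

                mov     bx, offset Origin
                mov     ax, [bx+2]              ;Avoid.
                mov     ax, [bx].Point.y        ;Prefer.

If a field is later added ahead of y, only the structure declaration changes. The dot-notation references pick up the new offset automatically, while each hand-computed 2 must be found and fixed.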

One problem with using structures occurs when you access structure fields indirectly (i.e., through a pointer). Indirect access always occurs through a register (for near pointers) or a segment/register pair (for far pointers). Once you load a pointer value into a register or register pair, the program doesn't readily indicate what pointer you are using. This is especially true if you use the indirect access several times in a section of code without reloading the register(s). One solution is to use a textequ to create a special symbol that expands as appropriate. Consider the following code:

s               struct
a               integer ?
b               integer ?
s               ends
r               s       {}
ptr2r           dword   r
                les     di, ptr2r
                mov     ax, es:[di].s.a         ;No indication this is ptr2r!
                mov     es:[di].s.b, bx         ;Really no indication!

Now consider the following:

s               struct
a               integer ?
b               integer ?
s               ends

sPtr            typedef far ptr s
q               s       {}
r               sPtr    q
r@              textequ <es:[di].s>
                les     di, r
                mov     ax, r@.a        ;Now it's clear this is using r
                mov     r@.b, bx        ;Ditto.

Note that the "@" symbol is a legal identifier character to MASM, hence "r@" is just another symbol. As a general rule you should avoid using symbols like "@" in identifiers, but it serves a good purpose here - it indicates we've got an indirect pointer. Of course, you must always make sure to load the pointer into ES:DI when using the textequ above. If you use several different segment/register pairs to access the data that "r" points at, this trick may not make the code any more readable since you will need several text equates that all mean the same thing.

8.4 Data Types and the UCR Standard Library

The UCR Standard Library for 80x86 Assembly Language Programmers (version 2.0 and later) provides a set of macros that let you declare arrays and pointers using a C-like syntax. The following example demonstrates this capability:

	integer	i, j, array[10], array2[10][3], *ptr2Int
	char	*FirstName, LastName[32]

These declarations emit the following assembly code:

i               integer ?
j               integer ?
array           integer 10 dup (?)
array2          integer 10 dup ( 3 dup (?))
ptr2Int         dword   ?

FirstName       dword   ?
LastName        char    32 dup (?)

For those comfortable with C/C++ (and other HLLs) the UCR Standard Library declarations should look very familiar. For that reason, their use is a good idea when writing assembly code that uses the UCR Standard Library.

[1] Someone who uses TASM all the time may think this is fine, but consider those individuals who don't. They're not familiar with TASM's funny syntax so they may find several statements in this program to be confusing.
[2] Simplified segment directives do make it easier to write assembly language programs that interface with HLLs. However, they only complicate matters in stand-alone assembly language programs.
[3] A lot of old-time programmers believe that assembly instructions should appear in upper case. A lot of this has to do with the fact that old IBM mainframes and certain personal computers like the original Apple II only supported upper case characters.
[4] Note, by the way, that I am not suggesting that this error checking/handling code should be absent from the program. I am only suggesting that it not interrupt the normal flow of the program while reading the code.
[5] Doing so (inserting an 80x86 tutorial into your comments) would wind up making the program less readable to those who already know assembly language since, at the very least, they'd have to skip over this material; at the worst they'd have to read it (wasting their time).
[6] Or whatever other natural language is in use at the site(s) where you develop, maintain, and use the software.
[7] You may substitute the local language in your area if it is not English.
[8] In fact, just the opposite is true. One should get concerned if both implementations are identical. This would suggest poor planning on the part of the program's author(s) since the same routine must now be maintained in two different programs.
[9] Or whatever make program you normally use.
[10] This happens because shorter functions invariably have stronger coupling, leading to integration errors.
[11] Technically, this is incorrect. In some very special cases MASM will generate better machine code if you define your variables before you use them in a program.
[12] Older assemblers on other machines have required the labels to begin in column one, the mnemonic to appear in a specific column, the operand to appear in a specific column, etc. These were examples of fixed-format source line translators.
[13] See the next section concerning comments for more information.
[14] This document will simply use the term comments when referring to standalone comments.
[15] Since the label, mnemonic, and operand fields are all optional, it is legal to have a comment on a line by itself.
[16] It could be worse; you should see what the "superoptimizer" outputs for the signum function. It's even shorter and harder to understand than this code.
[17] This is true regardless of what metric you use to determine the "best" code sequence.
[18] Of course, if the program is a class assignment, you may want to check your instructor's cheating policy before showing your work to your classmates!
[19] The designer of the SNOBOL4 and Icon programming languages.
[20] Note that this does not imply that it is hard to write easy-to-read C programs. Only that if one is sloppy, one can easily write something that is nearly impossible to understand.
[21] Okay, this is a cheap shot. In fact, most of the assembly code on this planet is poorly written.
[22] Actually, MASM 6.x does, but we'll ignore that fact here.
[23] Sometimes, for performance reasons, the code sequence above is justified since straight-line code executes faster than code with jumps. If the program rarely executes the ELSE portion of an if statement, always having to jump over it could be a waste of time. But if you're optimizing for speed, you will often need to sacrifice readability.
[24] Okay, MASM wasn't the first, but such techniques were not popularized until MASM appeared.
[25] Of course, MASM gives you the ability to override this behavior when necessary. Therefore, the complaints from "old-hand" assembly language programmers that this is insane are groundless.
[26] Okay, using some assembler that doesn't support typedef would probably be a good excuse!
[27] For brevity, the productions for these objects do not appear here.
[28] MASM records are equivalent to bit fields in C/C++. They are not equivalent to records in Pascal.
