From: randyhyde@earthlink.net on 19 Nov 2005 16:44

Hi All,

I've read several posts concerning structures and their implementation in assembly language. Given some misconceptions about structures in assembly language, I pieced together the following article about structures in assembly.

Cheers,
Randy Hyde


Structures in Assembly Language Programs
========================================

Structures, or records, are an abstract data type that allows a programmer to collect different objects together into a single, composite object. Structures can help make programs easier to read, write, modify, and maintain. Used appropriately, they can also help your programs run faster. Despite the advantages that structures offer, their appearance in assembly language is a relatively recent phenomenon (within the past two decades or so), and many assemblers still do not support this facility. Furthermore, many "old-timer" assembly language programmers attempt to argue that the appearance of records violates the whole principle of "assembly language programming." This article will certainly refute such arguments and describe the benefits of using structures in an assembly language program.

Despite the fact that records have been available in various assembly languages for years (e.g., Microsoft's MASM assembler introduced structures in 80x86 assembly language in the 1980s), the "lack of support for structures" is a common argument against assembly language by HLL programmers who don't know much about assembly. In some respects, their ignorance is justified -- many assemblers don't support structures or records. A second goal of this article is to educate assembly language programmers to counter claims like "assembly language doesn't support structures." Hopefully, that same education will convince those assembly language programmers who've never bothered to use structures to consider their use.
This article will use the term "record" to denote a structure/record, to avoid confusion with the more general term "data structure". Note, however, that the terms "record" and "structure" are synonymous in this article.

What is a Record (Structure)?
-----------------------------

(For those who don't have any idea of how records are implemented in memory, you may want to consider reading the chapter on this subject in "The Art of Assembly Language Programming" at http://webster.cs.ucr.edu/AoA/Windows/HTML/RecordsUnionsNamespaces.html#1003722)

The whole purpose of a record is to let you encapsulate different, but logically related, data into a single package. Here is a typical record declaration, in HLA, using the RECORD/ENDRECORD declaration:

    type
        student: record
            Name:     string;
            Major:    int16;
            SSN:      char[12];
            Midterm1: int16;
            Midterm2: int16;
            Final:    int16;
            Homework: int16;
            Projects: int16;
        endrecord;

The field names within the record must be unique. That is, the same name may not appear two or more times in the same record. However, in reasonable assemblers (like HLA) that support true structures, all the field names are local to that record. With such assemblers, you may reuse those field names elsewhere in the program.

The RECORD/ENDRECORD type declaration may appear in a variable declaration section (e.g., an HLA STATIC or VAR section) or in a TYPE declaration section. In the previous example the student declaration appears in an HLA TYPE section, so this does not actually allocate any storage for a student variable. Instead, you have to explicitly declare a variable of type student. The following example demonstrates how to do this:

    var
        John: student;

This allocates 28 bytes of storage: four bytes for the Name field (HLA strings are four-byte pointers to character data found elsewhere in memory), 12 bytes for the SSN field, and two bytes for each of the other six fields.
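As a cross-check on that 28-byte layout, here is a small sketch (mine, not the article's) using Python's ctypes, with a c_uint32 standing in for the four-byte HLA string pointer. For this particular record, ctypes' native alignment happens to produce the same offsets as HLA's packed layout, and reading a field back through its raw numeric offset shows that a field access is just "base address + offset":

```python
import ctypes
import struct

class Student(ctypes.Structure):
    # Mirror of the HLA 'student' record; c_uint32 stands in for the
    # four-byte HLA string pointer (an assumption for illustration).
    _fields_ = [
        ("Name",     ctypes.c_uint32),
        ("Major",    ctypes.c_int16),
        ("SSN",      ctypes.c_char * 12),
        ("Midterm1", ctypes.c_int16),
        ("Midterm2", ctypes.c_int16),
        ("Final",    ctypes.c_int16),
        ("Homework", ctypes.c_int16),
        ("Projects", ctypes.c_int16),
    ]

print(ctypes.sizeof(Student))      # 28: 4 + 2 + 12 + 5*2
print(Student.Major.offset)        # 4
print(Student.Midterm1.offset)     # 18

# Storing to Major by name and reading two bytes at offset 4 hit the
# same storage -- exactly what the assembler does under the hood.
john = Student()
john.Major = 1234
print(struct.unpack_from("<h", bytes(john), 4)[0])   # 1234
```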
If the label John corresponds to the base address of this record, then the Name field is at offset John+0, the Major field is at offset John+4, the SSN field is at offset John+6, etc.

To access an element of a structure you need to know the offset from the beginning of the structure to the desired field. For example, the Major field in the variable John is at offset 4 from the base address of John. Therefore, you could store the value in AX into this field using the instruction

    mov( ax, (type word John[4]) );

Unfortunately, memorizing all the offsets to fields in a record defeats the whole purpose of using them in the first place. After all, if you've got to deal with these numeric offsets, why not just use an array of bytes instead of a record?

Using Symbolic Equates to Implement Record Fields
-------------------------------------------------

Some enterprising types have noted that they can improve the readability of their "structure" accesses by using symbolic equates rather than literal numeric constants. That is, they can play games such as this:

    const
        Name          := 0;
        Major         := Name+sizeOfString;
        SSN           := Major+sizeOfInt16;
        Midterm1      := SSN+12*sizeOfChar;
        Midterm2      := Midterm1+sizeOfInt16;
        Final         := Midterm2+sizeOfInt16;
        Homework      := Final+sizeOfInt16;
        Projects      := Homework+sizeOfInt16;
        sizeOfStudent := Projects+sizeOfInt16;

Certainly it is the case that a statement like

    mov( ax, (type word John[Major]) );

is far more readable than

    mov( ax, (type word John[4]) );

Now if your assembler doesn't support structs, this is about as good as it gets for you. Granted, it *is* more readable and maintainable than the earlier version, but that certainly doesn't mean that the result is readable and maintainable. After all,

    mov( ax, (type word John[4]) );

is far more readable than

    byte $66, $a3, $04, $00, $00, $00;

but this doesn't imply that

    mov( ax, (type word John[4]) );

is a particularly readable, maintainable, or good way to access this field of the student structure.
Better, unquestionably, but not good. We'll return to this subject of hacked attempts at record simulation later in this paper.

Assemblers like HLA that support true records commonly let you refer to field names in a record using the same mechanism C/C++ and Pascal use: the dot operator. To store AX into the Major field, you could use "mov( ax, John.Major );" instead of the previous instruction. This is much more readable and certainly easier to use than other schemes people have invented.

Record Constants
----------------

HLA lets you define record constants. In fact, HLA is probably unique among x86 assemblers insofar as it supports both symbolic record constants and literal record constants. Record constants are useful as initializers for static record variables. They are also quite useful as compile-time data structures when using the HLA compile-time language (that is, the macro processor language). This section discusses how to create record constants.

A record literal constant takes the following form:

    RecordTypeName:[ List_of_comma_separated_constants ]

The RecordTypeName is the name of a record data type you've defined in an HLA TYPE section prior to this point. To create a record constant you must have previously defined the record type in a TYPE section of your program.

The constant list appearing between the brackets supplies the data items for each of the fields in the specified record. The first item in the list corresponds to the first field of the record, the second item in the list corresponds to the second field, etc. The data types of each of the constants appearing in this list must match their respective field types. The following example demonstrates how to use a literal record constant to initialize a record variable:

    type
        point: record
            x: int32;
            y: int32;
            z: int32;
        endrecord;

    static
        Vector: point := point:[ 1, -2, 3 ];

This declaration initializes Vector.x with 1, Vector.y with -2, and Vector.z with 3.
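The closest everyday analogue of a literal record constant is positional construction, where the arguments map to fields in declaration order. A quick sketch (my own, illustrative only) using Python's ctypes:

```python
import ctypes

class Point(ctypes.Structure):
    _fields_ = [("x", ctypes.c_int32),
                ("y", ctypes.c_int32),
                ("z", ctypes.c_int32)]

# Positional initializers correspond to fields in declaration order,
# like the HLA literal record constant point:[ 1, -2, 3 ].
Vector = Point(1, -2, 3)

print(Vector.x, Vector.y, Vector.z)   # 1 -2 3
print(ctypes.sizeof(Vector))          # 12: three int32 fields
```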
You can also create symbolic record constants by declaring record objects in the CONST or VAL sections of an HLA program. You access fields of these symbolic record constants just as you would access the field of a record variable, using the dot operator. Since the object is a constant, you can specify the field of a record constant anywhere a constant of that field's type is legal. You can also employ symbolic record constants as record variable initializers. The following example demonstrates this:

    type
        point: record
            x: int32;
            y: int32;
            z: int32;
        endrecord;

    const
        PointInSpace: point := point:[ 1, 2, 3 ];

    static
        Vector: point := PointInSpace;
        XCoord: int32 := PointInSpace.x;

Arrays of Records
-----------------

It is a perfectly reasonable operation to create an array of records. To do so, you simply create a record type and then use the standard array declaration syntax when declaring an array of that record type. The following example demonstrates how you could do this:

    type
        recElement: record
            << fields for this record >>
        endrecord;
        .
        .
        .
    static
        recArray: recElement[4];

Naturally, you can create multidimensional arrays of records as well. You would use the standard row-major or column-major order functions to compute the address of an element within such arrays. The only thing that really changes (from the discussion of arrays) is that the size of each element is the size of the record object.

    static
        rec2D: recElement[ 4, 6 ];

Arrays and Records as Record Fields
-----------------------------------

Records may contain other records or arrays as fields. Consider the following definition:

    type
        Pixel: record
            Pt:    point;
            Color: dword;
        endrecord;

The definition above defines a single point with a 32-bit color component. When initializing an object of type Pixel, the first initializer corresponds to the Pt field, not the x-coordinate field. The following definition is incorrect:

    static
        ThisPt: Pixel := Pixel:[ 5, 10 ];  // Syntactically incorrect!
The value of the first field ('5') is not an object of type point. Therefore, the assembler generates an error when encountering this statement. HLA will allow you to initialize the fields of Pixel using declarations like the following:

    static
        ThisPt: Pixel := Pixel:[ point:[ 1, 2, 3 ], 10 ];
        ThatPt: Pixel := Pixel:[ point:[ 0, 0, 0 ], 5 ];

Accessing Pixel fields is very easy. As in a high-level language, you use a single period to reference the Pt field and a second period to access the x, y, and z fields of point:

    stdout.put( "ThisPt.Pt.x = ", ThisPt.Pt.x, nl );
    stdout.put( "ThisPt.Pt.y = ", ThisPt.Pt.y, nl );
    stdout.put( "ThisPt.Pt.z = ", ThisPt.Pt.z, nl );
        .
        .
        .
    mov( eax, ThisPt.Color );

You can also declare arrays as record fields. The following record creates a data type capable of representing an object with eight points (e.g., a cube):

    type
        Object8: record
            Pts:   point[8];
            Color: dword;
        endrecord;

There are two common ways to nest record definitions. As noted earlier in this section, you can create a record type in a TYPE section and then use that type name as the data type of some field within a record (e.g., the Pt:point field in the Pixel data type above). It is also possible to declare a record directly within another record without creating a separate data type for that record; the following example demonstrates this:

    type
        NestedRecs: record
            iField: int32;
            sField: string;
            rField: record
                i: int32;
                u: uns32;
            endrecord;
            cField: char;
        endrecord;

Generally, it's a better idea to create a separate type rather than embed records directly in other records, but nesting them is perfectly legal and a reasonable thing to do on occasion.

Controlling Field Offsets Within a Record
-----------------------------------------

By default, whenever you create a record, most assemblers automatically assign the offset zero to the first field of that record. This corresponds to records in a high-level language and is the intuitive default condition.
In some instances, however, you may want to assign a different starting offset to the first field of the record. The HLA assembler provides a mechanism that lets you set the starting offset of the first field in the record. The syntax to set the first offset is

    name: record := startingOffset;
        << Record Field Declarations >>
    endrecord;

Using the syntax above, the first field will have the starting offset specified by the startingOffset int32 constant expression. Since this is an int32 value, the starting offset can be positive, zero, or negative.

One circumstance where this feature is invaluable is when you have a record whose base address is actually somewhere within the data structure. The classic example is an HLA string. An HLA string uses a record declaration similar to the following:

    record
        MaxStrLen: dword;
        length:    dword;
        charData:  char[xxxx];
    endrecord;

However, HLA string pointers do not contain the address of the MaxStrLen field; they point at the charData field. The str.strRec record type found in the HLA Standard Library Strings module uses a record declaration similar to the following:

    type
        strRec: record := -8;
            MaxStrLen: dword;
            length:    dword;
            charData:  char;
        endrecord;

The starting offset for the MaxStrLen field is -8. Therefore, the offset for the length field is -4 (four bytes later) and the offset for the charData field is zero. Thus, if EBX points at some string data, then "(type str.strRec [ebx]).length" is equivalent to "[ebx-4]" since the length field has an offset of -4.

Aligning Fields Within a Record
-------------------------------

To achieve maximum performance in your programs, or to ensure that your records properly map to records or structures in some high-level language, you will often need to be able to control the alignment of fields within a record. For example, you might want to ensure that a dword field's offset is an even multiple of four. You use the ALIGN directive in a record declaration to do this.
The following example shows how to align some fields on important boundaries:

    type
        PaddedRecord: record
            c: char;
            align(4);
            d: dword;
            b: boolean;
            align(2);
            w: word;
        endrecord;

Whenever HLA encounters the ALIGN directive within a record declaration, it automatically adjusts the following field's offset so that it is an even multiple of the value the ALIGN directive specifies. It accomplishes this by increasing the offset of that field, if necessary. In the example above, the fields would have the following offsets: c:0, d:4, b:8, w:10.

If you want to ensure that the record's size is a multiple of some value, then simply stick an ALIGN directive as the last item in the record declaration. HLA will emit an appropriate number of bytes of padding at the end of the record to fill it in to the appropriate size. The following example demonstrates how to ensure that the record's size is a multiple of four bytes:

    type
        PaddedRec: record
            << some field declarations >>
            align(4);
        endrecord;

Be aware of the fact that the ALIGN directive in a RECORD only aligns fields in memory if the record object itself is aligned on an appropriate boundary. Therefore, you must ensure appropriate alignment of any record variable whose fields you're assuming are aligned.

If you want to ensure that all fields are appropriately aligned on some boundary within a record, but you don't want to have to manually insert ALIGN directives throughout the record, HLA provides a second alignment option to solve your problem. Consider the following syntax:

    type
        alignedRecord3: record[4]
            << Set of fields >>
        endrecord;

The "[4]" immediately following the RECORD reserved word tells HLA to start all fields in the record at offsets that are multiples of four, regardless of the object's size (and the size of the objects preceding the field). HLA allows any integer expression that produces a value in the range 1..4096 inside these brackets.
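The PaddedRecord offsets above (c:0, d:4, b:8, w:10) are the same layout that natural alignment in a C-style struct produces, so they can be sanity-checked with a ctypes sketch (mine, not the article's); ctypes inserts automatically the padding that the ALIGN directives request explicitly:

```python
import ctypes

class PaddedRecord(ctypes.Structure):
    # Native alignment pads c_uint32 to offset 4 and c_uint16 to offset
    # 10, matching the article's align(4)/align(2) directives.
    _fields_ = [
        ("c", ctypes.c_char),     # offset 0, then 3 pad bytes
        ("d", ctypes.c_uint32),   # offset 4
        ("b", ctypes.c_bool),     # offset 8, then 1 pad byte
        ("w", ctypes.c_uint16),   # offset 10
    ]

for name, _typ in PaddedRecord._fields_:
    print(name, getattr(PaddedRecord, name).offset)

# Trailing padding rounds the total size up to a multiple of the
# record's alignment, like a final align(4) in the declaration.
print(ctypes.sizeof(PaddedRecord))   # 12
```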
If you specify the value one (which is the default), then all fields are packed (aligned on a byte boundary). For values greater than one, HLA will align each field of the record on the specified boundary. For arrays, HLA will align the field on a boundary that is a multiple of the array element's size. The maximum boundary HLA will round any field to is a multiple of 4096 bytes.

Note that if you set the record alignment using this syntactical form, any ALIGN directive you supply in the record may not produce the desired results. When HLA sees an ALIGN directive in a record that is using field alignment, HLA will first align the current offset to the value specified by ALIGN and then align the next field's offset to the global record align value.

Nested record declarations may specify a different alignment value than the enclosing record, e.g.,

    type
        alignedRecord4: record[4]
            a: byte;
            b: byte;
            c: record[8]
                d: byte;
                e: byte;
            endrecord;
            f: byte;
            g: byte;
        endrecord;

In this example, HLA aligns fields a, b, f, and g on dword boundaries; it aligns d and e (within c) on eight-byte boundaries. Note that the alignment of the fields in the nested record is true only within that nested record. That is, if c turns out to be aligned on some boundary other than an eight-byte boundary, then d and e will not actually be on eight-byte boundaries; they will, however, be on eight-byte boundaries relative to the start of c.

In addition to letting you specify a fixed alignment value, HLA also lets you specify a minimum and maximum alignment value for a record. The syntax for this is the following:

    type
        recordname: record[maximum : minimum]
            << fields >>
        endrecord;

Whenever you specify a maximum and minimum value as above, HLA will align all fields on a boundary that is at least the minimum alignment value.
However, if the object's size is greater than the minimum value but less than or equal to the maximum value, then HLA will align that particular field on a boundary that is a multiple of the object's size. If the object's size is greater than the maximum size, then HLA will align the object on a boundary that is a multiple of the maximum size. As an example, consider the following record:

    type
        r: record[ 4:1 ]
            a: byte;      // offset 0
            b: word;      // offset 2
            c: byte;      // offset 4
            d: dword[2];  // offset 8
            e: byte;      // offset 16
            f: byte;      // offset 17
            g: qword;     // offset 20
        endrecord;

Note that HLA aligns g on a dword boundary (not a qword boundary, which would place it at offset 24) since the maximum alignment size is four. Note that since the minimum size is one, HLA allows the f field to be aligned on an odd boundary (since it's a byte).

If an array, record, or union field appears within a record, then HLA uses the size of an array element or the largest field of the record or union to determine the alignment size. That is, HLA will align the field within the outermost record on a boundary that is compatible with the size of the largest element of the nested array, union, or record.

HLA's sophisticated record alignment facilities let you specify record field alignments that match those used by most major high-level language compilers. This lets you easily access data types used in those HLLs without resorting to inserting lots of ALIGN directives inside the record.

Using Records/Structures in Assembly
------------------------------------

In the "good old days" assembly language programmers typically ignored records. Records and structures were treated as unwanted stepchildren from high-level languages that weren't necessary in "real" assembly language programs. Manually counting offsets and hand-coding literal constant offsets from a base address was the way "real" programmers wrote code in early PC applications.
Unfortunately for those "real programmers", the advent of sophisticated operating systems like Windows and Linux put an end to that nonsense. Today, it is very difficult to avoid using records in modern applications because too many API functions require their use. If you look at typical Windows and Linux include files for C or assembly language, you'll find hundreds of different structure declarations, many of which have dozens of different members. Attempting to keep track of all the field offsets in all of these structures is out of the question. Worse, between various releases of an operating system (e.g., Linux), some structures have been known to change, thus exacerbating the problem. Today, it's unreasonable to expect an assembly language programmer to manually track such offsets - most programmers have the reasonable expectation that the assembler will provide this facility for them.

Implementing Structures in an Assembler: Part I
-----------------------------------------------

Unfortunately, properly implementing structures in an assembler takes considerable effort. A large number of the "hobby" (i.e., non-commercial) assemblers were not designed from the start to support sophisticated features such as records/structures. The symbol table management routines in most assemblers use a "flat" layout, with all of the symbols appearing at the same level in the symbol table database. To properly support structures or records, you need a hierarchical structure in your symbol table database. The bad news is that it's quite difficult to retrofit a hierarchical structure over the top of a flat database (i.e., the typical "hobby assembler" symbol table). Therefore, unless the assembler was originally designed to handle structures properly, the result is usually a major hacked-up kludge. Four assemblers I'm aware of, MASM, TASM, OPTASM, and HLA, handle structures well.
Most other assemblers are still trying to simulate structures using a flat symbol table database, with varying results. Probably the first attempt people make at records, when their assembler doesn't support them properly, is to create a list of constant symbols that specify the offsets into the record. Returning to our first example (in HLA):

    type
        student: record
            Name:     string;
            Major:    int16;
            SSN:      char[12];
            Midterm1: int16;
            Midterm2: int16;
            Final:    int16;
            Homework: int16;
            Projects: int16;
        endrecord;

A first attempt might be the following:

    const
        Name         := 0;
        Major        := 4;
        SSN          := 6;
        Midterm1     := 18;
        Midterm2     := 20;
        Final        := 22;
        Homework     := 24;
        Projects     := 26;
        size_student := 28;

With such a set of declarations, you could reserve space for a student "record" by reserving "size_student" bytes of storage (which almost all assemblers handle okay) and then access fields of the record by adding the constant offset to your base address, e.g.,

    static
        John: byte[ size_student ];
        .
        .
        .
    mov( John[Midterm1], ax );

There are several problems with this approach. First of all, the field names are global and must be globally unique. That is, you cannot have two record types that have the same field name (as is possible when the assembler supports true records). The second problem, which is fundamentally more problematic, is the fact that you can attach these constant offsets to any object, not just a "student record" type object. For example, if "ClassAverage" is an array of words, there is nothing stopping you from writing the following when using constant equate values to simulate record offsets:

    mov( ClassAverage[ Midterm1 ], ax );

Finally, and probably the most damning criticism of this approach, is that it is very difficult to maintain code that accesses structures in this manner. Inserting fields into the middle of a record, changing data types, and coming up with globally unique names can create all sorts of problems.
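The flat-equate scheme and its aliasing hazard are easy to demonstrate outside of assembly. In this sketch (my own), the constant offsets work fine against a student buffer, but nothing ties them to that buffer, so applying Midterm1 to an unrelated array "assembles" without complaint:

```python
import struct

# Globally unique constant offsets simulating the student record.
NAME, MAJOR, SSN = 0, 4, 6
MIDTERM1, MIDTERM2, FINAL = 18, 20, 22
SIZE_STUDENT = 28

john = bytearray(SIZE_STUDENT)              # John: byte[ size_student ]
struct.pack_into("<h", john, MIDTERM1, 87)  # store a word at John[Midterm1]
print(struct.unpack_from("<h", john, MIDTERM1)[0])   # 87

# The hazard: the offset attaches to *any* object, not just students.
class_average = bytearray(40)               # an unrelated array of words
struct.pack_into("<h", class_average, MIDTERM1, 87)  # no error raised
```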
In particular, a change in the middle of the record generally requires changing all the following "equates" (constant definitions) to allow for the insertion, deletion, or other modification. As noted earlier in this article, you *can* reduce the maintenance issues somewhat by defining your constants in terms of one another. E.g.,

    const
        Name          := 0;
        Major         := Name+sizeOfString;
        SSN           := Major+sizeOfInt16;
        Midterm1      := SSN+12*sizeOfChar;
        Midterm2      := Midterm1+sizeOfInt16;
        Final         := Midterm2+sizeOfInt16;
        Homework      := Final+sizeOfInt16;
        Projects      := Homework+sizeOfInt16;
        sizeOfStudent := Projects+sizeOfInt16;

Now when you insert, delete, or change a field definition, the offsets "percolate" through the remainder of the definition. However, you *do* have to adjust the definition of the object following the new insertion, deletion, or other modification. IOW, modifications to this "structure" scheme have dependencies. Adding (or otherwise modifying) a field requires that you change *other* fields in addition to the one you're modifying. This is not a good thing if you want easy-to-maintain code. Maintainable code allows you to make only the changes desired, and the rest of the code adjusts appropriately, without having to change unrelated lines of code as well.

One other problem with this approach is that it is difficult to read. Consider the statement:

    Final := Midterm2+sizeOfInt16;

If you look closely, you discover that nothing in this statement tells you anything about what Final is, other than that its value (that is, its offset within the structure) is equal to Midterm2's value plus the value "sizeOfInt16" (which is presumably two, but no guarantees on that). In particular, this statement doesn't tell you *anything* about Final's type or size. To glean that information you have to look at a different, unrelated statement:

    Homework := Final+sizeOfInt16;

It is the declaration of "Homework" where we learn that Final's size is two bytes.
Sure, that's going to be on the next line (usually, though that's not required), but the bottom line is that the statement declaring Final's value should contain this information; you shouldn't have to look elsewhere for it. Many high-level language programmers who've tried to learn assembly language have given up after discovering that they had to maintain records in this fashion in an assembly language program (too bad they didn't start off with a reasonable assembler that properly supports structures).

Types and Sizes (a quick digression)
------------------------------------

One big problem you find with some assemblers is that they don't store any type information in the symbol table along with the symbols. The argument many assembler authors provide is that "type checking doesn't belong in assembly language." This is complete nonsense. All reasonable assemblers do *some* type checking. For example, every assembler I've seen will report an error if you attempt to do something like the following:

    mov( 123456, al );

The literal constant "123456" is too big for an eight-bit register. That is, the *type* of this constant is *not* byte (it's at least an unsigned 17-bit integer). Similarly, instructions like the following are also (generally) rejected by assemblers:

    mov( 1.2345, eax );

Even if the floating-point constant "1.2345" does fit in 32 bits, it generally isn't appropriate to load a floating-point constant into an integer register. Few assemblers will allow this. And most of the ones that do (e.g., HLA) require that you explicitly *state* that you really mean to move the 32-bit bit pattern corresponding to the floating-point value 1.2345 into a 32-bit integer register. For interested parties, here's how you do this in HLA:

    mov( @dword( real32(1.2345) ), eax );

So the argument that "type checking doesn't belong in assembly language" is a non-starter right from the beginning.
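The @dword( real32(...) ) coercion has a direct software analogue: reinterpreting the 32-bit pattern of a float as an integer without converting the value. A sketch (mine) using Python's struct module:

```python
import struct

# Pack 1.2345 as a 32-bit IEEE-754 float, then unpack the same four
# bytes as an unsigned integer -- the bit pattern moves unchanged,
# just as mov( @dword( real32(1.2345) ), eax ) moves bits, not a value.
bits = struct.unpack("<I", struct.pack("<f", 1.2345))[0]
print(hex(bits))   # the raw bit pattern, now usable as an integer

# Reinterpreting back recovers the identical float32 value.
back = struct.unpack("<f", struct.pack("<I", bits))[0]
print(abs(back - 1.2345) < 1e-6)   # True, within float32 precision
```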
Of *course* assembly language has to check operand types in order to verify correct machine code generation. The most *fundamental* type of all, the one that most assemblers deal with, is the *size* of an operand. A typical late-model Intel CPU, for example, works with 8-, 16-, 32-, 64-, 80-, and 128-bit data types. Generally, the two operands of any instruction must be the same size (with obvious exceptions for the "extension" instructions and the like). The following examples all demonstrate typical type mismatches that most assemblers will report:

    mov( al, ax );
    mov( 12345, al );
    mov( st0, eax );
    mov( mm0, st0 );
    mov( xmm0, al );

I cannot imagine any assembly language programmer questioning an assembler that reports an error for one of the above statements. So why, then, do programmers make ridiculous statements like "type checking doesn't belong in assembly language"?

The problems start when you consider memory accesses. As noted earlier in this section, most assemblers maintain little information in the symbol table beyond the symbol's name and a numeric or textual value associated with that name. In particular, a typical "low-level" assembler might take a declaration like the following:

    someVar db ?

and keep only the string "someVar" and the offset to the storage associated with this name in the symbol table. In particular, these types of assemblers generally do *not* store information that tells the assembler that "someVar" was declared as a byte object. Therefore, when the assembler sees statements like the following:

    mov( someVar, al );
    mov( someVar, ax );
    mov( someVar, eax );
    fld( someVar );
    mov( someVar, mm0 );
    mov( someVar, xmm0 );

it has no information available to tell it whether these statements are correct. So the assembler just blindly assumes that they all are; after all, the assembler can encode the offset to "someVar" into the machine instruction, so it must be a legal instruction, right?
This is where some "less than sophisticated" assembly programmers start arguing about how great it is that a particular assembler doesn't do type checking. Why, they can access these objects any way they please "without the assembler getting in their way." But does it really make sense to load the contents of a byte variable into an 80-bit floating-point register (or, worse yet, store an 80-bit floating-point value into a byte variable)? Most of the time the answer is "absolutely not." The vast majority of the time this is a programming error, and it would be nice if the assembler reported that fact.

Before you start hollering and hooting about how "this is assembly language, I should be allowed to do whatever I want," I would point out that the presence of type checking in an assembler in no way prevents you from storing an 80-bit floating-point value starting at the address of a byte memory location, if this is what you really want to do. The only issue is that you must *explicitly* tell the assembler that you're bending the rules, but you know what you're doing and the assembler should let you do it. In HLA, for example, you could write:

    fstp( (type real80 someVar) );

and this would tell HLA that "Yes, I know that someVar isn't a real80 variable, but I want you to treat it as though it were such a variable."

The old-time die-hard assembly programmer might argue "but this is a bunch of red tape that gets in the way of my code." This is absolute nonsense. If you're writing *good* code, you generally declare your variables to be the appropriate type and access those variables using instructions appropriate for that type. Type casting or coercion should be the rare exception. If you're constantly casting an object as some other type, perhaps you need to reconsider the design of your application rather than blaming the assembler for constantly warning you about how you're breaking the rules.
Though type casting (implicit or explicit) probably occurs more often in assembly language than in other languages, it still should be a rare event. That being the case, the argument about how type casting "gets in my way" goes right out the window. In fact, few assemblers do any sort of sophisticated type checking above and beyond making sure the sizes of the operands agree. For example, HLA is probably the most strongly typed "traditional" x86 assembler out there, yet it allows you to do something like this:

    static
        fvar: real32;
        .
        .
        .
    mov( fvar, eax );

The sizes of the two operands agree, so HLA accepts this instruction, even though (technically) it should reject the statement because fvar is a floating-point variable and EAX is an integer register. (This was actually a hard design decision to make in HLA, I might point out. Ultimately, I chose to let this slide because it is often the case that you want to work on a floating-point value as a string of bits, and the EAX register [or any general-purpose register] is what you would use to operate on a bit string; that is, EAX doesn't simply hold integers, it also holds bit strings, and floating-point values are legitimate bit strings. I could go into a long-winded discussion about why HLA will accept "mov( fvar, eax );" but won't accept "fld( ivar );", but that discussion will have to wait for another essay.)

Again, size is the most fundamental piece of type information we can associate with a symbolic name in an assembly language program. You may recall, in the earlier discussion of record simulation via equates, an example like:

    Final := Midterm2+sizeOfInt16;

As noted earlier, this statement doesn't tell us what the size of Final is; we have to look at the next statement to figure this out:

    Homework := Final+sizeOfInt16;

And after all of this, we still can't ask the assembler "what is the size of Final?"
That's because there is no type information associated with the "Final" field when implementing structures in this manner (actually, "Final" is just an assembly-time equate, so symbol table information associated with "Final" really has no bearing on the information associated with the actual Final field in the resulting record). As a result, it is up to the programmer to manually maintain size information for use when manipulating "Final" in their program.

One common manipulation of a record field is to obtain its size for use in some calculation. As you can see in this simple example, Final's size is equal to the constant "sizeOfInt16" (which, presumably, is two bytes). There is nothing magic about the symbol "sizeOfInt16"; it's just another constant declaration appearing elsewhere in the program -- something the programmer had to create and has to manually maintain. In any case, whenever the programmer wants to write some code that utilizes the size of "Final", they'd just use the symbol "sizeOfInt16" in their code, e.g.,

    mov( sizeOfInt16, eax );

But here we see one of the *huge* pitfalls of this scheme -- the code is less maintainable when you do this. Suppose, for example, you need to change Final so that it is a 32-bit integer, or an 8-bit integer, or some other type that doesn't fit in two bytes. Now what? Well, all of a sudden, all the code you've written like "mov( sizeOfInt16, eax );" is now broken. And a simple "search and replace" isn't going to do the job for you, because it's likely that *most* of the "sizeOfInt16" references *don't* have anything to do with "Final". You've got to manually search through your code and determine which occurrences of "sizeOfInt16" apply to "Final" and which do not. What you've got is a maintenance nightmare.
Somewhat more experienced assembly language programmers will note that they can create individual size constants for each record member, e.g.,

const
    NameField       := 0;
    sizeOfNameField := sizeOfStringPtr;

    Major           := NameField+sizeOfNameField;
    sizeOfMajor     := sizeOfInt16;

    SSN             := Major+sizeOfMajor;
    sizeOfSSN       := 12*sizeOfChar;

    Midterm1        := SSN+sizeOfSSN;
    sizeOfMidterm1  := sizeOfInt16;

    Midterm2        := Midterm1+sizeOfMidterm1;
    sizeOfMidterm2  := sizeOfInt16;

    Final           := Midterm2+sizeOfMidterm2;
    sizeOfFinal     := sizeOfInt16;

    Homework        := Final+sizeOfFinal;
    sizeOfHomework  := sizeOfInt16;

    Projects        := Homework+sizeOfHomework;
    sizeOfProjects  := sizeOfInt16;

    sizeOfStudent   := Projects+sizeOfProjects;

Now, if the programmer uses the constant "sizeOfFinal" everywhere they need the size of the final exam's data type, they've only got to make a single change to their program if they decide to change Final's type. So the result is quite a bit more maintainable than the earlier version, but it's still far from ideal. Indeed, we've just doubled the number of statements needed to declare our structure (making it less readable and maintainable), and we still haven't solved the problem of having to change multiple lines when we modify the record's layout (indeed, inserting and deleting fields in the record now requires *more* work than before). The "Holy Grail" we're searching for here, of course, is to be able to change only *one* line in the declaration when we want to modify one field of the record. The big problem with this current example is that the assembler (requiring you to code records this way) doesn't store away any form of size information in the symbol table, so it cannot provide that information back to the programmer. This forces the programmer to *manually* maintain this size information themselves.
Fortunately, some of the better assemblers, e.g., FASM, MASM, TASM, and HLA, *do* maintain a little more information in the symbol table beyond the symbol's name and some numeric value (like the variable's offset or the constant's value). In particular, these assemblers store away the size of a declared object (if appropriate) and provide a "compile-time function" that lets you determine the size of that object. For example, consider the following pseudo-record "by equates" declaration in HLA(*):

const
    // Field offsets for a "point3D" record:

    x :dword := 0;
    y :dword := x + @size(x);
    z :dword := y + @size(y);

(*) This is a specially-designed example that just happens to work for this one special case. This idea does not easily generalize to other pseudo-record types.

The "@size" compile-time function in HLA (similar functions are available in assemblers like FASM, MASM, TASM, and OptASM) returns the size, in bytes, of the operand you pass it. As I've declared the x, y, and z constants in this example to be dwords, the @size function will return four when applied to these three names. Therefore, throughout my program I can use constructs like "@size(z)" and the assembler will automatically substitute the size of z, in bytes, in place of the compile-time function. Therefore, if I decide to use word or byte values, rather than double-word values, the program automatically adjusts, e.g.,

const
    // Field offsets for a "point3D" record:

    x :word := 0;
    y :word := x + @size(x);
    z :word := y + @size(y);

Recompiling the program sets the offsets to 0, 2, and 4, and also automatically updates all the occurrences of @size(z) to be two, rather than four. So by providing a function that computes the size of an object at compile time, it's much easier to create maintainable, readable code.
Of course, we still haven't reached the "Holy Grail" with this example, as you have to modify two lines when you insert or delete a field in the middle of the record, but things *are* getting better.

Implementing Structures in an Assembler, Part II: Using Macros
--------------------------------------------------------------

Manually maintaining all the constant offsets is a maintenance nightmare. So somewhere along the way, some assembly language programmers figured out that they could write macros to handle the declaration of constant offsets for them. For example, here's how you could do this in an HLA program:

program t;

// struct-
//
//  Declares a "structure".
//  Syntax:
//      struct
//      (
//          structName,
//          field1:type1,
//          field2:type2,
//          .
//          .
//          .
//          fieldn:typen
//      );
//
//  Creates a "type declaration" that will
//  reserve sufficient storage for the record
//  and also creates a set of fieldname constants
//  initialized with the offsets of each of these
//  fields.
//
//  Usage: see example immediately following.

#macro struct( _structName_, _dcls_[] ):
    _dcl_, _id_, _type_, _colon_, _offset_;

    // _offset_ is the current field offset we're
    // going to use. Initialize it with zero for
    // the start of the struct.

    ?_offset_ := 0;

    // _dcl_ is going to be a string with the
    // current declaration we're processing (from
    // the _dcls_ array of strings, corresponding
    // to a variable parameter list passed to this
    // macro):

    ?_dcl_:string;

    #for( _dcl_ in _dcls_ )

        // Declarations take the form
        //  fieldName : typename
        //
        // The following statement locates the
        // position of the ":" in the _dcl_ string.

        ?_colon_ := @index( _dcl_, 0, ":" );
        #if( _colon_ = -1 )

            // If we didn't find a ":", then we've
            // got a syntax error. Report it.

            #error
            (
                "Expected <id>:<type> in struct "
                "definition, encountered: ",
                _dcl_
            )

        #else

            // Okay, now extract the field name
            // (which is all the text before the
            // colon) and the type name (which is
            // all the text after the colon):

            ?_id_ := @substr( _dcl_, 0, _colon_ );
            ?_type_ :=
                @substr
                (
                    _dcl_,
                    _colon_ + 1,
                    @length( _dcl_ ) - _colon_
                );

            // Emit the fieldName as a constant
            // that is initialized to the current
            // offset we're computing.

            ?@text( _id_ ) := _offset_;

            // Adjust the current offset beyond the
            // length of the current field:

            ?_offset_ := _offset_ + @size( @text( _type_ ));

        #endif;

    #endfor

    // Create a string that we can use to allocate
    // storage for this "struct" type. That is,
    // the string expands to the syntax that
    // creates an HLA array of bytes, with a
    // sufficient number of bytes for this
    // struct:

    ?_structName_:text := "byte[" + @string( _offset_ ) + "]";

#endmacro

// Declare a struct "threeItems" with the
// three fields: i, j, k:

struct( threeItems, i:byte, j:word, k:dword )

// Create a "threeItems" variable and allocate
// storage for it in the static section:

static
    aStruct: threeItems;

begin t;

    // To access the fields, we must index into
    // the aStruct data structure. Note that we
    // also have to cast each of the fields, as
    // HLA will complain if the types mismatch
    // (this wouldn't be a problem if the assembler
    // didn't support type checking at all, so
    // don't get too excited by the extra type
    // casting taking place; of course, it goes
    // without saying that this *isn't* the way
    // you'd do this in HLA, anyway).

    mov( (type byte aStruct[i]), al );
    mov( (type word aStruct[j]), ax );
    mov( (type dword aStruct[k]), eax );

end t;

The "struct" macro expects a set of valid HLA variable declarations supplied as macro arguments. It generates a set of constants using the supplied variable names, whose offsets are adjusted according to the size of the objects previously appearing in the list.
In this example, HLA creates the following equates:

    i = 0
    j = 1
    k = 3

This declaration also creates a "data type" named "threeItems", which is equivalent to "byte[7]" (since there are seven bytes in this record), that you may use to create variables of type "threeItems", as is done in this example.

Currently, the "macro" approach is the one used by assemblers such as NASM. From the NASM documentation:

>>>>>>>>>>>>>>>>>

4.8.5 STRUC and ENDSTRUC: Declaring Structure Data Types

The core of NASM contains no intrinsic means of defining data structures; instead, the preprocessor is sufficiently powerful that data structures can be implemented as a set of macros. The macros STRUC and ENDSTRUC are used to define a structure data type.

STRUC takes one parameter, which is the name of the data type. This name is defined as a symbol with the value zero, and also has the suffix _size appended to it and is then defined as an EQU giving the size of the structure. Once STRUC has been issued, you are defining the structure, and should define fields using the RESB family of pseudo-instructions, and then invoke ENDSTRUC to finish the definition.

For example, to define a structure called mytype containing a longword, a word, a byte and a string of bytes, you might code

struc   mytype

  mt_long:      resd    1
  mt_word:      resw    1
  mt_byte:      resb    1
  mt_str:       resb    32

endstruc

The above code defines six symbols: mt_long as 0 (the offset from the beginning of a mytype structure to the longword field), mt_word as 4, mt_byte as 6, mt_str as 7, mytype_size as 39, and mytype itself as zero.

[rlh: note the use of a naming convention to overcome the global namespace problem; there is a solution to this using NASM local symbols, read on...]
The reason why the structure type name is defined at zero is a side effect of allowing structures to work with the local label mechanism: if your structure members tend to have the same names in more than one structure, you can define the above structure like this:

struc mytype

  .long:        resd    1
  .word:        resw    1
  .byte:        resb    1
  .str:         resb    32

endstruc

This defines the offsets to the structure fields as mytype.long, mytype.word, mytype.byte and mytype.str.

NASM, since it has no intrinsic structure support, does not support any form of period notation to refer to the elements of a structure once you have one (except the above local-label notation), so code such as mov ax,[mystruc.mt_word] is not valid. mt_word is a constant just like any other constant, so the correct syntax is mov ax,[mystruc+mt_word] or mov ax,[mystruc+mytype.word].

<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<

As you can see, the NASM macro implementation has all the typical problems I've mentioned with the macro implementation, particularly with respect to namespace pollution. This is not to pick on NASM unfairly; *all* macro implementations of records are going to have similar problems. Fancier macros may solve some of them, but you're not going to get a complete solution.

Creating structures with macros solves one of the three major problems: it makes it easier to maintain the constant equates list, as you do not have to manually adjust all the constants when inserting and removing fields in a record. That is, we've achieved the "Holy Grail" of struct field maintenance -- if we want to change, insert, or delete a field, we only have to manipulate that one field's declaration. Unfortunately, there are other maintenance problems associated with the "records as offsets" approach, and this macro does not address those problems.
The Global Namespace Problem
----------------------------

One huge problem with the macro implementation of the previous section is that it does not help with the problem of "global namespace pollution". In true record implementations, the field names are *local* to that record. In the "record fields are just constant offsets" scheme described thus far, each and every field name is a global constant that must be unique in the namespace. True record declarations (similar to those in HLLs) allow you to create record types like the following:

type
    record1:
        record
            field1:byte;
            field2:word;
            field3:dword;
        endrecord;

    record2:
        record
            field1:string;
            field2:cset;
            field3:real80;
        endrecord;

There is never any ambiguity between the field names in record1 or record2. The assembler/compiler is smart enough to maintain a separate list of identifiers for each of these two records, so when the assembler sees something like "rec2Var.field3" it knows to use the second definition of field3, in record2 (assuming "rec2Var" is of type record2). With the macro given in the previous section, you'd get a "duplicate symbol" error when you attempted to declare record2, i.e.,

struct
(
    record1,
    field1:byte,
    field2:word,
    field3:dword
);

struct
(
    record2,
    field1:string,
    field2:cset,
    field3:real80
);

One solution that enterprising assembly programmers have come up with to solve (well, reduce the impact of) the global namespace problem is to adopt a naming convention that reduces the possibility of name conflicts. One such naming convention is to rename all the fields using names like "recname_field1", "recname_field2", etc.
For example, we could rewrite the above structs as:

struct
(
    record1,
    record1_field1:byte,
    record1_field2:word,
    record1_field3:dword
);

struct
(
    record2,
    record2_field1:string,
    record2_field2:cset,
    record2_field3:real80
);

Unfortunately, this has the tendency to create some very unwieldy names, particularly as you begin creating advanced data structures involving nested records and the like. Though "record1_field1" may not seem so bad by itself, keep in mind that this is just the offset, not the whole name. That is, you've still got to apply the variable name to the whole thing, too. Now, if you can imagine a record that has a nested record type, and that nested record type also has a nested record type, you wind up writing code like this:

    mov( structVar[record1_i_record2_j_record3_k], al );

rather than the standard record syntax (which is a bit easier to read):

    mov( structVar.i.j.k, al );

An even worse problem with the "records as constant offsets" approach is that it doesn't prevent you from applying arbitrary offsets to *any* record. Consider the following declarations, which are legal (no duplicate symbol errors):

struct
(
    record1,
    field1:byte,
    field2:byte,
    field3:byte
);

struct
(
    record2,
    field4:dword,
    field5:dword,
    field6:dword
);

static
    r1:record1;
    r2:record2;

Now suppose you do this:

    mov( eax, r1[field5] );

There is no error. Yet r1 doesn't have a "field5" field associated with it, and the offset associated with field5 takes you well beyond the allocated storage for r1. IOW, the results are generally undefined when this happens. Yet the assembler gladly accepts this without complaint and emits code that (most likely) is not going to produce desirable results.

Fancier macros could be written, macros that generate identifiers like "objectname_fieldName" to help solve the global uniqueness problem. However, this introduces some problems of its own.
Suppose I write a "struct" macro similar to the one I gave earlier that generates fieldnames by appending the structure name to each of the field names. Now consider the following two structure declarations:

struct
(
    record1,
    field1:byte,
    field2:word,
    field3:dword
);

struct
(
    record1,
    field4:word,
    field5:word,
    field6:dword
);

Having two structs like this with the same name could be perfectly allowable by the assembler, because the actual names these macros produce are

    record1_field1, record1_field2, record1_field3

and

    record1_field4, record1_field5, record1_field6

and all of these names are globally unique. Note that, effectively, what I've done is create a *union* of two record types. That's because the offsets for the fields field1, field2, and field3 overlap those of field4, field5, and field6 in these records. If I create two variables, r1 and r2, that hold the record1 (#1) and record1 (#2) fields, respectively, I can get into a lot of trouble if I start writing code like:

    mov( al, r2[ record1_field1 ] );   // Whoops!

or

    mov( eax, r1[ record1_field4 ] );  // Whoops!

There is no checking to see that r2 actually has a "record1_field1" associated with it, nor can the assembler check to see that r1 has a "record1_field4" associated with it. IOW, you can mix and match fields from one type declaration in variables that aren't of that type. What a recipe for disaster!

The bottom line is that naming conventions don't really solve the problem. A naming convention doesn't prevent naming conflicts between a local record fieldname and some global symbol (what's to prevent you from defining your own "record1_field1" symbol, for example?). Worse, the use of naming conventions does *not* prevent people from "inserting" fields into your structure.
For example, if you've got a macro that defines the following symbols that correspond to fields of a record:

    record1_field1, record1_field2, record1_field3

there's nothing stopping someone from creating a new symbol, "record1_field4", that has nothing to do with your record but sure *looks* like it's a field of "record1". Namespace pollution works both ways. Not only can a record pollute the global namespace, but the global namespace can pollute the record namespace as well. All of this leads to a bunch of maintenance and readability problems.

These hacks really begin to fail when you attempt to declare nested records, arrays within records, and arrays of records (possibly containing nested records and arrays of records). Trying to keep track of all the possibilities is an open invitation for the introduction of defects into your code. (As the author of one low-end assembler puts it, "why would anyone do this?" [that is, create such advanced structures]. With an assembler that doesn't support true records, I'd have to ask the same question!)

Pseudo-Structs
--------------

It is amusing to watch the evolution of various assemblers for the x86 processor family with respect to record support. MASM is typical of the progression that assemblers go through:

1) When the assembler is first introduced, there is no support for records at all; users who want to utilize records have to create symbolic constants to provide the offsets into their records.

2) As users begin to complain about how nice it would be to have records, the assembler's author(s) discover that it's a *lot* of work to do structures properly, so they offer a stop-gap macro implementation of structures to allow people to get some work done without having to make any serious changes to the assembler itself. These macros are not unlike the examples I gave earlier (BTW, I don't actually recall Microsoft going through this phase, but it is a phase that assemblers like NASM and FASM have gone, or are going, through).
3) The problem with the macro implementation is that all you really get are offsets. No type information is really associated with those offsets, making such record implementations very weak. So the assembler's author(s) move to a scheme I call the "pseudo-struct" implementation (we're going to discuss this mechanism shortly).

4) As the shortcomings of the pseudo-struct approach become ever more apparent, the assembler's author(s) bite the bullet and get around to adding true record support to their product.

In the case of MASM, the first versions (IIRC) did not offer any struct support at all. Somewhere around MASM v4 or v5, Microsoft added struct support in a form I call "pseudo-structs". This was 4-7 years after the original introduction of the assembler. In version 6.0, nearly 10 years after MASM's introduction, Microsoft finally added *true* struct support.

So exactly what is a "pseudo-struct"? Well, a pseudo-struct is a structure declaration, not unlike the "struct" macro I gave earlier in this article, that associates a set of offsets with a set of names. But beyond that, it also associates type information (e.g., object size) with those names. The declaration would be a "true" record declaration except for the fact that names are not local to the particular record (that is, the global namespace pollution problem still exists). Various assemblers address this problem in different ways, but it usually involves some sort of naming convention. The struc facilities in MASM v5.x were a good example of "pseudo-structs". You could declare actual record types, create variables of those types, and access the fields of those variables using record-like "dot notation", but the field names were still global and had to be globally unique. Ultimately, of course, Microsoft bit the bullet and expanded their symbol table data structure to support true local symbols in a structure declaration (in MASM v6, at which point they supported true records).

Type Declarations vs.
Variable Instantiation
--------------------------------------------

Instantiation is the process of binding a block of storage, and possibly a value, to some sort of object (I use this term in the generic sense, not in the OOP sense, here). Record instantiation is yet another area where true record/struct facilities in an assembler win big over other approaches.

Allocation of storage is, perhaps, the *one* thing that most attempts at records get right. After all, to allocate storage all you really need to know is the size of the object, and then you can allocate a block of bytes of that size. Even the "records as constant offsets" implementation can do this. But ultimately, if all you have is a name that begins a block of allocated storage, you can get into a lot of trouble. E.g.,

    mov( record1[someField], al );

How would the assembler ever notify you if "someField" is not actually a field of record1? It can't. The symbol "someField" is just a numeric constant, and the assembler happily obliges your request to access the byte at offset "someField" from the base address of record1. This problem also afflicts the common "macro implementations of structs" like the one that NASM uses. Note that this is not an example of the "evil type checking" that so many naive assembly programmers complain about. This has nothing to do with type checking; types aren't even involved here. What this has to do with is verifying that "someField" actually *is* a field of "record1". This is one of the more important benefits of record instantiation -- not only do you get the storage you ask for, but you get some built-in sanity checking for the record object as well.

Along with storage allocation, providing an (optional) initial value for a record variable declaration is also a part of instantiation. This is one area where type checking turns out to provide an invaluable service. First, let's consider how you'd initialize a record object using NASM's istruc, at, and iend macros.
From the NASM manual:

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

4.8.6 ISTRUC, AT and IEND: Declaring Instances of Structures

Having defined a structure type, the next thing you typically want to do is to declare instances of that structure in your data segment. NASM provides an easy way to do this in the ISTRUC mechanism. To declare a structure of type mytype in a program, you code something like this:

mystruc:
    istruc mytype

        at mt_long, dd      123456
        at mt_word, dw      1024
        at mt_byte, db      'x'
        at mt_str,  db      'hello, world', 13, 10, 0

    iend

The function of the AT macro is to make use of the TIMES prefix to advance the assembly position to the correct point for the specified structure field, and then to declare the specified data. Therefore the structure fields must be declared in the same order as they were specified in the structure definition.

If the data to go in a structure field requires more than one source line to specify, the remaining source lines can easily come after the AT line. For example:

        at mt_str,  db      123,134,145,156,167,178,189
                    db      190,100,0

Depending on personal taste, you can also omit the code part of the AT line completely, and start the structure field on the next line:

        at mt_str
                db      'hello, world'
                db      13,10,0

<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<

This all looks well and good. But it hides an important pitfall -- symbols like mt_long and mt_word are only numeric offsets. No type information is associated with these names. So it is syntactically legal to do something like the following:

mystruc:
    istruc mytype

        at mt_long, db      123
        at mt_word, db      124
        at mt_byte, dd      12345
        at mt_str,  dw      1024

    iend

NASM will happily inject these byte, dword, and word values at the offsets you specify, even though the actual data types do *not* correspond to the types in your struct declaration.
Also consider the following:

mystruc:
    istruc mytype

        at mt_word, dw      1024, 0
        at mt_long, dd      123456,7890
        at mt_byte, db      123
        at mt_str,  db      "Hello World", 13, 10, 0

    iend

What does this generate? What does the assembler report? Why isn't it notifying you that you've messed up and the data definitions don't match the original declaration? The argument that "well, this is assembly language and the programmer *should* know what they're doing" is a ridiculous response. Nobody is perfect and everybody makes mistakes. It's nice when the assembler/compiler can notify you when you've made a mistake, rather than emitting bad code or data without any indication of the problem.

A macro implementation of records need not suffer from the problem we're seeing with NASM here. Consider the definition of FASM structs (from the FASM documentation):

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

2.3.4 Structures

The struc directive is a special variant of the macro directive that is used to define data structures. A macroinstruction defined using the struc directive must be preceded by a label (like the data definition directives) when it's used. This label will be also attached at the
From: hutch-- on 19 Nov 2005 23:35 

I am much of the view that an assembler that does not have native structure and union support is a toothless terror for writing hacky little demos with. While most assembler programmers know how to dereference an address and add offsets to it for each structure member, it falls well short of being able to predefine a structure, with or without embedded unions and nested structures. A perfect example is the structures used for a PE header and the section data. While you can do it the slow, hard, and unreliable way, you write far cleaner and more reliable code when you have structures and unions available. The only reason why these toothless terrors are still with us is that structure and union support is hard to write. 

Regards, 

hutch at movsd dot com
From: Annie on 20 Nov 2005 00:09 On 2005-11-19 randyhyde(a)earthlink.net said: > Hi All, > > I've read several posts concerning structures and their > implementation in assembly language. Given some > misconceptions about structures in assembly language, > I pieced together the following article about structures > in assembly. > _____ > [ snip ] ((( `\ _ _`\ ) Ummm...could you put this in (^ ) ) PDF format, Randy? Hehe! ~-( ) _'((,,,))) ,-' \_/ `\ ( , | `-.-'`-.-'/|_| \ / | | =()=: / ,' aa
From: James Buchanan on 20 Nov 2005 01:07 Annie wrote: > _ _`\ ) > Ummm...could you put this in (^ ) ) > PDF format, Randy? Hehe! ~-( ) > _'((,,,))) Another sophisticated and interesting contribution to the debate from Annie.
From: Betov on 20 Nov 2005 04:18
"hutch--" <hutch(a)movsd.com> ?crivait news:1132461357.066607.288060 @f14g2000cwb.googlegroups.com: > I am much of the view that an assembler than does not have native > structure and union support is a toothless terror Said by our prefered Power Basic Programmer, who redistributes illegaly a weird MicroSoft C-Side Toy, in the absurd hope of damaging the Open Source Mouvement. Funny. Betov. < http://rosasm.org > |