Table of contents

1. What you should know already
2. What you should learn now
   2.1 AT&T-style 80386 assembly language
       2.1.1 Source and destination operand ordering
       2.1.2 Instruction naming
       2.1.3 Register naming
       2.1.4 Constants and addressing modes
             2.1.4.1 Constants
             2.1.4.2 Indirect addressing modes
                     2.1.4.2.1 Immediate indirect
                     2.1.4.2.2 Register indirect
                     2.1.4.2.3 Base register plus offset indirect
                     2.1.4.2.4 Index register times width plus offset
                               indirect
                     2.1.4.2.5 Base register plus index register times
                               width plus offset indirect
   2.2 UNIX policies that affect debugging
       2.2.1 Advanced 80386 registers not available
       2.2.2 Self-modifying code generally not permitted
       2.2.3

Symset is a tool that allows you to create your own debugging
information for FreeBSD/i386 executables. Doing so makes it easier to
reverse engineer an application with gdb (the GNU debugger, as shipped
with FreeBSD). By attaching debugging information to an executable you
can give names to functions and variables, and even associate reverse
engineered source code with a section of assembly. This document
describes the basics of the a.out symbol table (the format in which
debugging information is stored) and how to use symset to create
debugging information for reverse engineering a target application with
GDB.

1. What you should know already

In this tutorial I assume that you already have the following:

1. Knowledge of the Intel 80386 instruction set.
2. Basic knowledge of UNIX and your way around it.
3. Access to an x86 machine running FreeBSD.
4. Basic knowledge of debugging terminology (breakpoints, examining
   data and code).

2. What you should learn now

I recognize that many readers have assembly language backgrounds in DOS
and Windows, but little or no experience with assembly under UNIX. In
this section I introduce you to some of the hurdles you may encounter in
your transition.

2.1 AT&T-style 80386 assembly language

The AT&T assembly language format used in BSD-based UNIX operating
systems (of which FreeBSD is a member), and consequently in the
assembler listings generated by gdb, is quite different from the Intel
format that you have likely learned. If you are not familiar with the
AT&T format, you should read this section.

2.1.1 Source and destination operand ordering

Among the most noticeable differences between the AT&T and Intel formats
is the way they refer to source and destination operands within an
instruction. Under the Intel format, the destination operand appears to
the left of the comma and the source operand to the right. Under the
AT&T format, these roles are reversed: source operands appear on the
left and destination operands appear on the right. For an example, look
at the following set of instructions and how they are represented
differently under the two formats.

+---------------+-------------------+
| Intel         | AT&T              |
+---------------+-------------------+
| PUSH EBP      | pushl %ebp        |
| MOV EBP, ESP  | movl %esp, %ebp   |
| SUB ESP, 48h  | subl $0x48, %esp  |
+---------------+-------------------+
Table 1. AT&T format data assignment direction.

Other differences aside, you will notice that the AT&T format makes
assignments from left to right, and that modifying instructions modify
their right-most arguments. This difference stems primarily from the VAX
assembly format, for which the AT&T style was originally invented. (The
Motorola 68000 and its descendants were heavily influenced by the VAX.
Likewise, their assembly language format moves in this direction as
well!)

2.1.2 Instruction naming

As you probably noticed in the example in Table 1, the AT&T format uses
slightly different names for 80386 instructions than the Intel format.
They differ in keeping with VAX and Motorola traditions, where
instruction names include a suffix which describes the size of the data
they modify.
Under the Intel format, these data sizes are normally indicated (if at
all) with the 'BYTE PTR', 'WORD PTR', and 'DWORD PTR' prefix phrases.
Table 2 gives some examples.

+------------------------------+------------------------+
| Intel                        | AT&T                   |
+------------------------------+------------------------+
| MOVZX EAX, BYTE PTR [ESI+5]  | movzbl 0x5(%esi), %eax |
| SUB EAX, 30h                 | subl $0x30, %eax       |
| DEC WORD PTR [EBX]           | decw (%ebx)            |
| INC CX                       | incw %cx               |
| CMP AL, 5                    | cmpb $0x5, %al         |
+------------------------------+------------------------+
Table 2. Data typing in AT&T format instruction names.

Instruction suffixes are "b" for byte size operations (8 bits), "w" for
word size operations (16 bits), and "l" for double-word operations (32
bits). As you may have noticed in the 'movzbl' (move with zero-extend)
instruction, more than one suffix letter is used when an instruction's
source and destination operands differ in size. The first suffix letter
describes the source operand while the second letter describes the
destination.

(In the remainder of this section I leave out explicit 'BYTE PTR', 'WORD
PTR' and 'DWORD PTR' prefixes from all Intel format examples to save
space, unless they are absolutely necessary. Most assemblers and
debuggers follow this convention as well, because they can determine the
proper sizing of an instruction merely by looking at its operands. These
Intel size prefixes are only really necessary for instructions whose
sizing would otherwise be ambiguous, such as the first and third
examples in Table 2.)

2.1.3 Register naming

All CPU register names in the AT&T format are prefixed with the percent
("%") character. (This differentiates them from labeled memory addresses
of the same name.)

2.1.4 Constants and addressing modes

The AT&T assembly format also differs significantly from the Intel
format in the way that it represents indirect addressing modes (that is,
ways of reading from or writing to memory) and the way in which it
represents constants.
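
Before moving on, the size suffixes from section 2.1.2 can be made
concrete with a small model. The following Python sketch is only an
illustration: the names SUFFIX_BITS, truncate, zero_extend, and dec are
invented here and belong to no assembler and not to gdb. It shows how
much of a value each suffix lets an instruction touch, and why the
two-letter suffix of 'movzbl' reads as "source size, then destination
size".

```python
# Number of bits selected by each AT&T size suffix (section 2.1.2).
SUFFIX_BITS = {"b": 8, "w": 16, "l": 32}


def truncate(value, suffix):
    """Keep only the low-order bits that an operand of this size holds."""
    return value & ((1 << SUFFIX_BITS[suffix]) - 1)


def zero_extend(value, suffixes):
    """Model movzbl / movzbw / movzwl.

    suffixes is the two-letter size suffix: source size first,
    destination size second. The upper destination bits become zero.
    """
    src, dst = suffixes
    return truncate(truncate(value, src), dst)


def dec(value, suffix):
    """Model decb / decw / decl: subtract one, wrapping at operand size."""
    return truncate(value - 1, suffix)


print(hex(truncate(0x12345678, "w")))   # 0x5678: a "w" operation sees 16 bits
print(hex(zero_extend(0x1FF, "bl")))    # 0xff: only the source byte survives
print(hex(dec(0, "w")))                 # 0xffff: decw wraps within 16 bits
print(hex(dec(0, "b")))                 # 0xff: decb wraps within 8 bits
```

The model ignores flags and memory entirely; it is meant only to show
which bits an operand size selects.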

2.1.4.1 Constants

Constants under the AT&T format are written according to the same rules
which govern C: all constants in hexadecimal are prefixed with the
characters "0x" (or "0X"); all constants in octal are prefixed with a
zero; and all constants in decimal appear as-is, without any prefix.
Constants can also be written in binary, in which case they are prefixed
with the characters "0b".

If a constant is used as an immediate value operand inside an
instruction (which is the most common place a constant is used), a
special prefix of "$" is necessary. The dollar sign prefix
differentiates the constant from an immediate indirect address, which is
explained in the section that follows ("2.1.4.2.1 Immediate indirect
addressing mode").

2.1.4.2 Indirect addressing modes

Recall that the 80386 offers the programmer a choice of one of five
indirect addressing modes when writing an instruction: they are
"immediate indirect", "register indirect", "base register + offset
indirect", "index register * width + offset indirect", and "base
register + index register * width + offset indirect". Table 3 shows an
example instruction from each of these categories in both formats.

+-------------+----------------------------+-----------------------------+
| Mode        | Intel                      | AT&T                        |
+-------------+----------------------------+-----------------------------+
| Immediate   | MOV EAX, [0100h]           | movl 0x0100, %eax           |
| Register    | MOV EAX, [ESI]             | movl (%esi), %eax           |
| Reg + Off   | MOV EAX, [EBP-8]           | movl -8(%ebp), %eax         |
| R*W + Off   | MOV EAX, [EBX*4 + 0100h]   | movl 0x100(,%ebx,4), %eax   |
| B + R*W + O | MOV EAX, [EDX + EBX*4 + 8] | movl 0x8(%edx,%ebx,4), %eax |
+-------------+----------------------------+-----------------------------+
Table 3. The five 80386 indirect addressing modes and their syntax.

All AT&T format indirect addressing modes are written to the general
form of "OFFSET(BASE, INDEX, WIDTH)". OFFSET, if present, must be a
constant integer.
BASE and INDEX, if either is present, must be registers. WIDTH, if
present, applies to the register named in INDEX, and must be the
constant 1, 2, 4, or 8. If WIDTH is not specified, a default of 1 is
assumed.

The above paragraph may look intimidating, but it simply states a rule
that you can use to create or comprehend any AT&T format indirect
addressing mode you encounter. Under the Intel format, this syntax is
equivalent to "[INDEX*WIDTH + BASE + OFFSET]"; if any of these
parameters doesn't apply to a particular instruction, it is simply left
out of the form.

2.1.4.2.1 Immediate indirect addressing mode

Under the AT&T format, all immediate indirect addresses are written
simply as an OFFSET with the BASE, INDEX, and WIDTH parameters missing.
Since all three of these parameters reside inside a parenthetical
expression under the AT&T format, the resulting empty parenthetical
expression itself is left out. This leaves an instruction with a
remarkably simple appearance: the immediate indirect address is written
by itself with no special prefix or suffix characters, and constitutes
the entire operand! Thus, the instruction "MOV EAX, DWORD PTR [0100h]"
(Intel format) is written as "movl 0x0100, %eax" under the AT&T format.

(Recall from the discussion about constants that the immediate constant
form of this instruction, "MOV EAX, 100h", would be written as "movl
$0x100, %eax". The dollar sign signifies that the 0x100 is an immediate
constant, rather than an immediate address.)

2.1.4.2.2 Register indirect

A pure register indirect addressing mode instruction, such as "MOV EAX,
[ESI]", is written as the general form with only a BASE parameter: "movl
(%esi), %eax".

2.1.4.2.3 Base register plus offset indirect

A register-plus-offset indirect addressing mode instruction, such as
"MOV EAX, [EBP-8]", is written as the general form with a BASE and
OFFSET parameter (but no INDEX or WIDTH): "movl -8(%ebp), %eax".
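
The general form "OFFSET(BASE, INDEX, WIDTH)" amounts to a single
address calculation, and it can help to see it evaluated. The following
is a minimal Python sketch, not anything gdb or the assembler provides:
the function name effective_address and the register contents in regs
are made up for illustration. It evaluates the operands from Table 3,
including the two indexed forms described in the sections that follow.

```python
def effective_address(offset=0, base=0, index=0, width=1):
    """Address named by the AT&T operand OFFSET(BASE, INDEX, WIDTH).

    base and index are the *contents* of the registers; pass 0 for a
    register that is left out of the operand.
    """
    # The 80386 encodes the scale in two bits, so only these values exist.
    if width not in (1, 2, 4, 8):
        raise ValueError("width must be 1, 2, 4, or 8")
    # 32-bit address arithmetic wraps around, hence the mask.
    return (base + index * width + offset) & 0xFFFFFFFF


# Hypothetical register contents, chosen only for this illustration.
regs = {"esi": 0x2000, "ebp": 0xBFBFF000, "ebx": 3, "edx": 0x8000}

# 0x0100               immediate indirect
print(hex(effective_address(offset=0x100)))                    # 0x100
# (%esi)               register indirect
print(hex(effective_address(base=regs["esi"])))                # 0x2000
# -8(%ebp)             base register plus offset
print(hex(effective_address(offset=-8, base=regs["ebp"])))     # 0xbfbfeff8
# 0x100(,%ebx,4)       index register times width plus offset
print(hex(effective_address(offset=0x100, index=regs["ebx"], width=4)))  # 0x10c
# 0x8(%edx,%ebx,4)     base plus index times width plus offset
print(hex(effective_address(offset=0x8, base=regs["edx"],
                            index=regs["ebx"], width=4)))      # 0x8014
```

Each call computes only where the operand points; the instruction then
reads from or writes to that address.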

2.1.4.2.4 Index register times width plus offset indirect

An index-register-times-width-plus-offset indirect addressing mode
instruction, such as "MOV EAX, [EBX*4 + 0100h]", is written as the
general form with an INDEX, WIDTH, and OFFSET parameter (but no BASE):
"movl 0x100(,%ebx,4), %eax".

2.1.4.2.5 Base register plus index register times width plus offset
indirect

A base-register-plus-index-register-times-width-plus-offset indirect
addressing mode instruction, such as "MOV EAX, [EDX + EBX*4 + 8]", is
written as the general form with all parameters in place: EDX as the
BASE register, EBX as the INDEX register, 4 as the WIDTH, and 8 as the
OFFSET, giving "movl 0x8(%edx,%ebx,4), %eax".

2.2 UNIX policies that affect debugging

UNIX is a monolithic multi-tasking operating system and was designed as
such from the ground up. Some of UNIX's policies affect the way

3. GDB, the GNU Debugger

GDB is the one and only debugger available for FreeBSD, and the one for
which symset was created. This section describes how to launch GDB, load
a target, execute it, generate assembly listings, insert breakpoints,
and view and modify both data and code.

GDB, like many UNIX applications, is a command-line program. Configured
out-of-the-box, it is not quite as friendly as SoftICE or IDA, two major
debugging and disassembly tools for DOS and Windows. However, it has
several powerful features that are unmatched in any other debugger.

If you find debugging via the command line too difficult, there are
graphical front-ends available for GDB. However, I have never tried them
myself, nor do I know which versions are the most recent.

3.1 How to launch gdb

To launch GDB, simply type its name on the command line: "gdb". You must
then load your target with the "file" command: "file ". If you wish to
save keystrokes and time, you may specify the target on the command line
when starting GDB: "gdb ".

3.2 How to set program arguments

If your target requires command-line arguments, then you must set them
with the "set args" command before running the target.

(Note to those who know a lot about arguments: in the strictest UNIX
tradition, the target's executable name is automatically provided by GDB
to the target as the zeroth argument, "argv[0]". The arguments you
provide with the "set args" command will appear as "argv[1]", "argv[2]",
and so on.)

3.3 How to run a target

To run a target, type "run".

3.4 How to view assembly listings

3.5 How to place breakpoints

3.6 How to view and manipulate registers and data

4. Symset

4.1 Symbol table basics

4.1.1 Symbol types

4.1.2 Symbol table entry format

4.2 GDB's data types

4.3 Marking and naming functions

4.4 Marking and naming global variables

4.5 Marking and naming function parameters

5. About the Author

Jeremy Cooper is a contributor to the NetBSD operating system, where he
is the port maintainer of the NetBSD/sun3x architecture. He uses reverse
engineering techniques to verify compiler output and to fix BIOS bugs in
Sun machines.