A Short Course: Compilers, Assemblers, Linkers, Loaders

Source and copyright: https://courses.cs.washington.edu/courses/cse378/97au/help/compilation.html

Assembler: A computer understands only programs written in its own machine language. Programs written in any other language must first be translated into machine language, and that translation is done by software. A program that translates an assembly language program into a machine language program is called an assembler. An assembler that runs on a computer and produces machine code for that same computer is called a self assembler or resident assembler. An assembler that runs on one computer and produces machine code for a different computer is called a cross assembler.

Assemblers are further divided into two types: one-pass assemblers and two-pass assemblers. A one-pass assembler assigns memory addresses to variables and labels and translates the source code into machine code in a single pass. A two-pass assembler reads the source code twice. In the first pass, it reads all the variables and labels and assigns them memory addresses. In the second pass, it reads the source code again and translates it into object code, using the addresses assigned in the first pass.
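The two passes can be sketched in C as follows. This is a minimal illustration, not a real assembler: the toy instruction set (NOP and JMP), the one-word encoding, and every function name are assumptions made up for the example. It shows why the second pass is useful: a jump to a label defined later in the file can only be encoded once the first pass has recorded that label's address.

```c
#include <string.h>

#define MAX_SYMBOLS 16

/* Hypothetical toy machine: every instruction occupies one word.
   "NOP" encodes as 0 and "JMP label" encodes as 0x100 + address of label. */
struct symbol { char name[16]; int address; };

static struct symbol symtab[MAX_SYMBOLS];
static int nsymbols;

/* A source line ending in ':' defines a label. */
static int is_label(const char *line) {
    size_t n = strlen(line);
    return n > 0 && line[n - 1] == ':';
}

/* Pass 1: count instructions and record the address of each label. */
void pass1(const char *const *src, int nlines) {
    int address = 0;
    nsymbols = 0;
    for (int i = 0; i < nlines; i++) {
        if (is_label(src[i])) {
            size_t n = strlen(src[i]) - 1;          /* drop the trailing ':' */
            if (nsymbols < MAX_SYMBOLS && n < sizeof symtab[0].name) {
                memcpy(symtab[nsymbols].name, src[i], n);
                symtab[nsymbols].name[n] = '\0';
                symtab[nsymbols].address = address;
                nsymbols++;
            }
        } else {
            address++;                              /* one word per instruction */
        }
    }
}

/* Returns the address recorded for name, or -1 if it was never defined. */
int lookup(const char *name) {
    for (int i = 0; i < nsymbols; i++)
        if (strcmp(symtab[i].name, name) == 0)
            return symtab[i].address;
    return -1;
}

/* Pass 2: emit one word per instruction, resolving labels via the table.
   Returns the number of words written, or -1 on an undefined label or an
   unknown instruction. */
int pass2(const char *const *src, int nlines, int *out) {
    int n = 0;
    for (int i = 0; i < nlines; i++) {
        if (is_label(src[i]))
            continue;
        if (strcmp(src[i], "NOP") == 0) {
            out[n++] = 0;
        } else if (strncmp(src[i], "JMP ", 4) == 0) {
            int target = lookup(src[i] + 4);
            if (target < 0)
                return -1;
            out[n++] = 0x100 + target;
        } else {
            return -1;
        }
    }
    return n;
}

/* Convenience wrapper: run both passes over the same source. */
int assemble(const char *const *src, int nlines, int *out) {
    pass1(src, nlines);
    return pass2(src, nlines, out);
}
```

Calling assemble on the four lines "JMP end", "NOP", "end:", "NOP" produces three words, and the first word encodes a jump to address 2 even though the label appears after the jump.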

Compiler: A compiler is a program that translates a high level language program into a machine language program. A compiler does more analysis than an assembler: it checks for many kinds of errors, including limit and range violations that it can detect at compile time. In exchange, a compiler takes longer to run and occupies more memory than an assembler, because it goes through the entire program before translating it into machine code. A compiler that runs on a computer and produces machine code for that same computer is known as a self compiler or resident compiler. A compiler that runs on one computer and produces machine code for a different computer is known as a cross compiler.

Interpreter: An interpreter is a program that translates and executes the statements of a program one at a time. It reads one statement, translates it, and executes it; then it reads the next statement, translates it, and executes it. It proceeds this way until all the statements have been translated and executed. A compiler, by contrast, goes through the entire program and translates all of it into machine code before any of it runs. As a rough figure, a compiled program runs 5 to 25 times faster than the same program under an interpreter.

The machine code produced by a compiler is saved and can be run again without retranslation. The machine code produced by an interpreter is not saved. An interpreter is typically a smaller program than a compiler and occupies less memory, so it can be used on a smaller system with limited memory.

Linker: High level languages come with built-in libraries and the header files that describe them. These libraries are predefined and contain basic functions that most programs need. A program called the linker connects the calls in your program to those functions in the libraries. If the linker cannot find a library that defines a function, it reports an error. The compiler automatically invokes the linker as the last step in compiling a program.

Besides the built-in libraries, the linker also links user-defined functions and user-defined libraries. A longer program is usually divided into smaller subprograms called modules, and these modules must be combined before the program can execute. The linker is the program that combines them.

Loader: A loader is a program that loads the machine code of a program into system memory. In computing, a loader is the part of an operating system that is responsible for loading programs. It is one of the essential stages in starting a program, because it places programs into memory and prepares them for execution. Loading a program involves reading the contents of the executable file into memory. Once loading is complete, the operating system starts the program by passing control to the loaded program code. All operating systems that support program loading have loaders. In many operating systems the loader is permanently resident in memory.

Compilers, Assemblers, Linkers, Loaders: A Short Course

This document briefly describes what happens when you compile and run
a program. More details can be found in Compilers, Principles,
Techniques, and Tools
by Aho, Sethi, and Ullman (CSE 401 book)
and Appendix A of Computer Organization and Design by
Patterson and Hennessy (CSE 378 book).

Compiling a Program

When you type cc at the command line a lot of stuff happens.
There are four entities involved in the compilation process:
preprocessor, compiler, assembler, linker (see Figure 1).


Figure 1: The internals of cc.


First, the C preprocessor cpp expands all those macro
definitions and include statements (and anything else that starts with
a #) and passes the result to the actual compiler. The
preprocessor is not so interesting because it just replaces some short
cuts you used in your code with more code. The output of cpp
is just C code; if you didn’t have any preprocessor statements in your
file, you wouldn’t need to run cpp. The preprocessor does
not require any knowledge about the target architecture. If you had
the correct include files, you could preprocess your C files on a
Linux machine and take the output to the instructional machines and
pass that to cc. To see the output of a preprocessed file,
use cc -E.
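As a small illustration of what the preprocessor does, consider the file below. The macro names BUFFER_SIZE and SQUARE and the function scaled_buffer_size are invented for this example. Running cc -E on it should show the contents of stdio.h pasted in place of the #include line and the macros replaced by their definitions, with no #define lines left.

```c
#include <stdio.h>   /* replaced by the full text of the header */

/* Hypothetical macros, named here only for illustration. */
#define BUFFER_SIZE 64
#define SQUARE(x) ((x) * (x))

int scaled_buffer_size(int n) {
    /* After preprocessing, the next line reads:
       return ((n) * (n)) + 64;                      */
    return SQUARE(n) + BUFFER_SIZE;
}
```

The compiler proper never sees SQUARE or BUFFER_SIZE; it only sees the expanded text, which is why the preprocessing step does not depend on the target architecture.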

The compiler effectively translates preprocessed C code into assembly
code, performing various optimizations along the way as well as
register allocation. Since a compiler generates assembly code
specific to a particular architecture, you cannot use the assembly
output of cc from an Intel Pentium machine on one of the
instructional machines (Digital Alpha machines). Compilers are very
interesting, which is one of the reasons why the department offers an
entire course on compilers (CSE 401). To see the assembly code
produced by the compiler, use cc -S.

The assembly code generated by the compilation step is then passed to
the assembler which translates it into machine code; the resulting
file is called an object file. On the instructional machines, both
cc and gcc use the native assembler, as, that
is provided by UNIX. You could write an assembly language program and
pass it directly to as and even to cc (this is what
we do in project 2 with sys.s). An object file is a binary
representation of your program. The assembler gives a memory
location to each variable and instruction; we will see later that
these memory locations are actually represented symbolically or via
offsets. It also makes a list of all the unresolved references that
presumably will be defined in other object files or libraries,
e.g. printf. A typical object file contains the
program text (instructions) and data (constants and strings),
information about instructions and data that depend on absolute
addresses, a symbol table of unresolved references, and possibly some
debugging information. The UNIX command nm allows you to
look at the symbols (both defined and unresolved) in an object file.
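To make the distinction between defined and unresolved symbols concrete, here is a small source file whose object file should contain both kinds. The names greetings_left and greet are made up for the example. After compiling with cc -c, nm on the resulting object file would typically list greet as a text symbol (T), greetings_left as an initialized data symbol (D), and printf as undefined (U); the exact letters and any extra symbols depend on the system and compiler.

```c
#include <stdio.h>

/* Defined in this file: initialized global data. */
int greetings_left = 3;

/* Defined in this file: lives in the program text.
   printf is only declared (by stdio.h), not defined here, so the object
   file records it as an unresolved reference for the linker to fill in. */
int greet(const char *name) {
    greetings_left--;
    return printf("hello, %s\n", name);   /* returns the number of characters written */
}
```

Note that the declaration of printf in stdio.h is enough for the compiler and assembler; the definition is only needed when the linker runs.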

Since an object file will be linked with other object files and
libraries to produce a program, the assembler cannot assign absolute
memory locations to all the instructions and data in a file. Rather,
it writes some notes in the object file about how it assumed
things were laid out. It is the job of the linker to use these notes
to assign absolute memory locations to everything and resolve any
unresolved references. Again, both cc and gcc on
the instructional machines use the native linker, ld. Some
compilers choose to have their own linkers, so that optimizations can
be performed at link time; one such optimization is that of aligning
procedures on page boundaries. The linker produces a binary
executable that can be run from the command interface.
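The way a linker uses the assembler's notes can be sketched as below. The object-file format here (struct object, with a list of word indices that hold addresses) and the function link_object are hypothetical and far simpler than any real format; the sketch covers only relocation, not the resolution of unresolved references. Each object is assembled as if it started at address 0, and the linker adds the object's final base address to every word flagged as address-dependent.

```c
/* Hypothetical object-file format, invented for this sketch. */
struct object {
    int words[8];   /* text and data, assembled as if loaded at address 0 */
    int nwords;
    int reloc[8];   /* indices of the words that hold an address */
    int nreloc;
};

/* Copies obj into image starting at base and adds base to every word the
   assembler flagged as address-dependent. Returns the next free address.
   The caller must make sure image has room for base + obj->nwords words. */
int link_object(const struct object *obj, int base, int *image) {
    for (int i = 0; i < obj->nwords; i++)
        image[base + i] = obj->words[i];
    for (int i = 0; i < obj->nreloc; i++)
        image[base + obj->reloc[i]] += base;
    return base + obj->nwords;
}
```

Linking two objects in sequence passes the return value of the first call as the base of the second, so the second object's addresses are shifted by the size of the first.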

Notice that you could invoke each of the above steps by hand. Since
it is an annoyance to call each part separately as well as pass the
correct flags and files, cc does this for you. For example,
you could run the entire process by hand by invoking /lib/cpp
and then cc -S and then /bin/as and finally
ld. If you think this is easy, try compiling a simple
program in this way.

Running a Program

When you type a.out at the command line, a whole bunch of
things must happen before your program is actually run. The loader
magically does these things for you. On UNIX systems, the loader
creates a process. This involves reading the file and creating an
address space for the process. Page table entries for the
instructions, data and program stack are created and the register set
is initialized. Then the loader executes a jump instruction to the
first instruction in the program. This generally causes a page fault
and the first page of your instructions is brought into memory. On
some systems the loader is a little more interesting. For example, on
systems like Windows NT that provide support for dynamically loaded
libraries (DLLs), the loader must resolve references to such libraries
similar to the way a linker does.


Figure 2 illustrates a typical layout for program memory. It is the
job of the loader to map the program, static data (including globals
and strings) and the stack to physical addresses. Notice that the
stack is mapped to the high addresses and grows down and the program
and data are mapped to the low addresses. The area labeled
heap is where the data you allocate via malloc is
placed. A call to malloc may use the sbrk system
call to add more physical pages to the program’s address space (for
more information on malloc, free and sbrk,
see the man pages).
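One way to get a feel for this layout is to print one address from each region. The sketch below does that; the names global_counter and print_layout are invented for the example, and converting a function pointer to an integer for printing is implementation-defined, though it works on common UNIX systems. The addresses differ from run to run on systems that randomize the address space, so no particular output is shown.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

int global_counter = 42;                          /* static data */

/* Prints one representative address from each region of the layout.
   Returns 0 on success, -1 if malloc fails. */
int print_layout(void) {
    int local = 0;                                /* stack */
    int *heap_cell = malloc(sizeof *heap_cell);   /* heap */
    if (heap_cell == NULL)
        return -1;

    printf("program text: %p\n", (void *)(uintptr_t)print_layout);
    printf("static data:  %p\n", (void *)&global_counter);
    printf("heap:         %p\n", (void *)heap_cell);
    printf("stack:        %p\n", (void *)&local);

    free(heap_cell);
    return 0;
}
```

On a system laid out as in the figure, the stack address should be the largest of the four and the program text address among the smallest, with the heap in between.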


Figure 2: Memory layout.


Procedure Call Conventions

A call to a procedure is a context switch in your program. Just like
any other context switch, some state must be saved by the calling
procedure, or caller, so that when the called procedure, or
callee, returns the caller may continue execution without
distraction. To enable separate compilation, a compiler must follow a
set of rules for use of the registers when calling procedures. This
procedure call convention may be different across compilers
(do cc and gcc use the same calling convention?)
which is why object files created by one compiler cannot always be
linked with those of another compiler.

A typical calling convention involves action on the part of the caller
and the callee. The caller places the arguments to the callee in some
agreed upon place; this place is usually a few registers and the
extras are passed on the stack (the stack pointer may need to be
updated). Then the caller saves the value of any registers it will
need after the call and jumps to the callee’s first instruction. The
callee then allocates memory for its stack frame and saves any
registers whose values are guaranteed to be unaltered through a
procedure call, e.g. return address. When the callee is
ready to return, it places the return value, if any, in a special
register and restores the callee-saved registers. It then pops the
stack frame and jumps to the return address.
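The division of labor between caller and callee can be simulated in C with an explicit register file and stack. Everything here is an assumption made for illustration: the register names (A0 for the argument, V0 for the return value, RA for the return address, S0 as callee-saved, T0 as caller-saved) are borrowed loosely from MIPS-style conventions and do not describe any particular compiler, and the return address is just a stand-in integer because the real jump is done by the C function call.

```c
#include <assert.h>

/* Hypothetical toy machine, invented for this sketch. */
enum { A0, V0, RA, S0, T0, NREGS };

static int reg[NREGS];
static int stack[64];
static int sp = 64;                  /* the stack grows down from the high end */

static void push(int value) { stack[--sp] = value; }
static int pop(void) { return stack[sp++]; }

/* Callee: computes 2 * A0 + 1 and leaves the result in V0. */
static void callee_double_plus_one(void) {
    push(reg[RA]);                   /* prologue: save what must survive the call */
    push(reg[S0]);
    reg[S0] = reg[A0] * 2;           /* the body may now use S0 freely */
    reg[T0] = 0;                     /* T0 is caller-saved, so clobbering it is allowed */
    reg[V0] = reg[S0] + 1;           /* return value goes in the agreed register */
    reg[S0] = pop();                 /* epilogue: restore in reverse order */
    reg[RA] = pop();
}

/* Caller: returns (2 * x + 1) + (x + 100). */
int caller(int x) {
    reg[T0] = x + 100;               /* a value the caller needs after the call */
    reg[S0] = 7;                     /* the callee must hand this back unchanged */

    reg[A0] = x;                     /* argument goes in the agreed register */
    push(reg[T0]);                   /* caller saves the register it still needs */
    reg[RA] = 1;                     /* stand-in for the return address */
    callee_double_plus_one();        /* "jump" to the callee's first instruction */
    reg[T0] = pop();                 /* restore what the caller saved */

    assert(reg[S0] == 7 && reg[RA] == 1);   /* callee-saved registers are intact */
    return reg[V0] + reg[T0];
}
```

The sketch passes a single argument in a register; a fuller version would push any extra arguments on the stack before the call, as the text describes.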

Original version from CSE451, Autumn 1996.
Modified by wolman@cs.washington.edu, Autumn 1997.