
Saturday, 16 June 2012

Libraries Part 1 – Introduction to Static Libraries


Having covered the basics of compilation and linking, the next important topic for a C programmer is libraries. For example, the printf function in the basic Hello World program comes from the standard C library, libc.

So what makes a library? In plain words, a library is a file containing several object files that can be used as a single entity in the linking phase of a program. Linux allows us to create and use two kinds of libraries: static libraries and shared (or dynamic) libraries. Part 1 discusses static libraries, and part 2 will cover shared libraries.

Static Library:
Static libraries are typically stored in special archive files with the extension ‘.a’. They are created from object files with a separate tool, the GNU archiver ar, and are used by the linker to resolve references to functions at link time, as discussed in an earlier post.

In Linux, the standard system libraries are usually found in the directories ‘/usr/lib’ and ‘/lib’. For example, the C math library is typically stored in the file ‘/usr/lib/libm.a’. The corresponding prototype declarations for the functions in this library are given in the header file ‘/usr/include/math.h’. The C standard library itself is stored in ‘/usr/lib/libc.a’ and contains functions specified in the ANSI/ISO C standard, such as ‘printf’; this library is linked by default for every C program.


The basic tool used to create static libraries is a program called 'ar', for 'archiver'. This program can be used to create static libraries, modify object files in the static library, list the names of object files in the library, and so on.

To create a static library, use a command like this:
$ ar rc libutil.a util_X.o util_Y.o util_Z.o 

This command creates a static library named 'libutil.a' and puts copies of the object files "util_X.o", "util_Y.o" and "util_Z.o" in it. If the library file already exists, the object files are added to it, replacing any members that have the same name. The 'c' flag tells ar to create the library if it doesn't already exist. The 'r' flag tells it to insert the files into the archive with replacement; add the 'u' modifier if members should be replaced only when the object file is newer than the copy inside the library.
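For illustration, one of the object files above might come from a source file like the following sketch. The file name util_X.c matches the command above, but the functions util_max and util_clamp are hypothetical examples and not part of any real library.

```c
/* util_X.c -- hypothetical contents of one library member.
 * Compile it to an object file with:  gcc -c util_X.c
 * and then archive it with the ar command shown above.
 */

/* Return the larger of two integers. */
int util_max(int a, int b)
{
    return a > b ? a : b;
}

/* Limit v to the inclusive range [lo, hi]. */
int util_clamp(int v, int lo, int hi)
{
    if (v < lo)
        return lo;
    if (v > hi)
        return hi;
    return v;
}
```

A library like this is normally shipped together with a header file that declares these functions, so that callers can include it.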

Library Naming Convention
Libraries are typically named with the prefix "lib". This is true for all the C standard libraries. When linking, the command-line reference to the library omits both the "lib" prefix and the file suffix, so 'libutil.a' is referred to as '-lutil'.

After an archive is created or modified, it needs to be indexed. This index is later used by the linker to speed up symbol lookup inside the library, and to make sure that the order of the symbols in the library won't matter during linking.

Many of us have seen a linker complain that the index of some library is out of date, and then abort. If your toolchain still issues this complaint, there are two ways to solve it. The first is to run ranlib and regenerate the index. The second is to copy the library with 'cp -p' instead of plain 'cp'. The '-p' flag tells 'cp' to keep all attributes of the file, including its access permissions, owner (if 'cp' is invoked by a superuser) and last modification date. The linker then sees the index inside the file as still up to date. This method is useful for makefiles that need to copy the library to another directory for some reason.

Note: The work of ranlib is now built into the ‘ar’ command in most modern toolchains. Running ranlib is completely equivalent to executing ar -s.

For listing the index, you may use nm -s or nm --print-armap.

An archive with such an index speeds up linking to the library and allows routines in the library to call each other without regard to their placement in the archive.

Using A "C" Library In A Program

After creating our archive, we want to use it in a program. This is done by adding the library's name to the list of object file names given to the linker, using a special flag, normally '-l'. Here is an example:

$ cc main.o -L. -lutil -o prog 

This will create a program using object file "main.o", and any symbols it requires from the "util" static library. Note that we omitted the "lib" prefix and the ".a" suffix when mentioning the library on the link command. The linker attaches these parts back to the name of the library to create a name of a file to look for. Note also the usage of the '-L' flag - this flag tells the linker that libraries might be found in the given directory ('.', referring to the current directory), in addition to the standard locations where the compiler looks for system libraries.

I hope you guys had a good time with static libraries. In part 2, we will discuss shared / dynamic libraries.



Saturday, 24 March 2012

FAQ - The C Compilation Steps

I strongly believe that reading without reflection is like eating without digesting. So here I list a few questions for my eager readers, which I hope will help in understanding the concepts better.

  1. List the basic C compilation steps.
  2. How do you stop the compilation after pre-processing?
  3. How do you find the standard macros predefined by the compiler on a Linux system?
  4. Which section do constant variables go into?
  5. What does the BSS hold?
  6. What does the data segment hold?
  7. Will the variable int a[100], declared as a global, increase the size of the executable? If not, why?
  8. What does a symbol table hold?
  9. What does symbol resolution mean?
  10. What is the difference between strong and weak symbols?
  11. How are readelf and objdump used?


Sunday, 19 February 2012

Behind the Scene – The C Compilation Steps - Part III The Linker


The Linker – Linking is the final step in creating an executable file. During linking, the linker resolves references to external symbols, assigns final addresses to procedures/functions, and revises code and data to reflect the new addresses (the relocation process).

To link object files:

$ gcc hello.o  -o hello

To perform the linking step gcc uses the linker ld, which is a separate program.

In a typical program, code in one source file can refer to variables defined in another source file, and likewise to functions. Let's have a look at what goes into a linker symbol table.
·        Global symbols defined and referenced in the module
·        Global symbols referenced but not defined in this module – externs
·        Non-global symbols for debuggers and core dump analysis
·        Segment names, which are the global symbols for each of the segments
·        Local symbols that are defined and referenced exclusively in the module

The linker reads all the symbols in each input module, extracts the useful information, and builds the link-time symbol table that guides the linking process. Linker symbol tables are similar to those in a compiler. Within the linker there is one table listing the input files and library modules, and another for global symbols across input files. Each time the linker reads an input file, it adds all of the file’s global symbols to the symbol table.

It is important to remember that local linker symbols are not the same as local program variables. The symbol table in .symtab does not contain any symbols that correspond to non-static local program variables. These are managed at run time on the stack, and the linker does not handle them. But for local static variables defined inside a function, the compiler allocates space in the .data or .bss section and creates a unique local linker symbol.

To understand this more clearly, consider two functions fn_a() and fn_b(), each declaring a static variable with the same name x.
int fn_a(void)
{
    static int x = 100;
    /* ... */
    return x;
}
int fn_b(void)
{
    static int x = 0;
    return x;
}
In the above case, the compiler allocates space for two integers (in .data, or in .bss for a zero-initialised one) and exports a pair of unique local linker symbols to the assembler.
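To see that the two variables really are separate objects, here is a small sketch in which each function increments its own x. The counting behaviour is added purely for illustration.

```c
/* Each function owns a distinct static x, although the names are identical.
 * Running nm on the object file typically shows two local symbols with
 * compiler-generated suffixes (the exact names vary by compiler version). */
int fn_a(void)
{
    static int x = 100;   /* initialised once, value kept between calls */
    return ++x;
}

int fn_b(void)
{
    static int x = 0;     /* unrelated to the x inside fn_a */
    return ++x;
}
```

Calling fn_a() twice and then fn_b() once yields 101, 102 and 1: updating one x never touches the other.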

Symbol resolution
During the second phase of linking, the linker resolves the symbol references as it creates the output file. It does so by associating each reference with exactly one symbol definition from the input object files. Symbol resolution is straightforward for references to local symbols that are defined in the same module. When the compiler encounters a symbol that is not defined in the current module, it assumes that it is defined in some other module, generates a symbol table entry and leaves it for the linker to handle. If the linker does not find a definition for the referenced symbol, it prints an error message and terminates.

The real situation is more complex. The output file will usually have a symbol table of its own, so the linker needs to create a new vector of indexes of the symbols to be used in the output file, then map the symbol numbers in outgoing relocation entries to those new indices.

So folks, we need to be familiar with a few terms in this context, as explained below.

Name Mangling:
The names used in object file symbol tables and in linking are often not the same as the names used in the source programs. There are three reasons for this: avoiding name collisions, name overloading and type checking. Here we will also discuss multiply defined global symbols and how they are resolved across object files.

At compile time the compiler exports each global symbol to the assembler as either strong or weak, and the assembler encodes this information implicitly in the symbol table. Functions and initialized global variables get strong symbols. Uninitialized global variables get weak symbols. The linker then applies the following rules.

1.      Multiple strong symbols with the same name are not allowed.
2.      If there is one strong symbol and multiple weak symbols, the strong symbol is chosen.
3.      If there are multiple weak symbols, any one of them is chosen.

Eg 1: Strong symbol
/* strong.c */
int x = 100;
int main() {
 return 0; }
Here ‘main’ and ‘x’ are both strong symbols.

Eg 2: Multiple strong symbols – function name
/* strong.c */
int x = 100;
int main() {
 return 0; }
/* strong2.c */
int main() {
 return 0; }
In this case the linker will generate an error message because the strong symbol ‘main’ is defined multiple times.

Eg 3: Multiple strong symbols – variable name
/* strong.c */
int x = 100;
int main() {
 return 0; }
/* strong3.c */
int x = 500;
void f() {
 }
Here the linker will generate an error message because the strong symbol ‘x’ is defined twice. (The second file defines f() rather than main(), so that ‘x’ is the only duplicated symbol.)

Eg 4: Strong vs weak symbol
/* strong.c */
int x = 100;
int main() {
 return 0; }
/* Weak.c */
int x;
void f() {
 }
In this case ‘x’ is uninitialized in one module, so the linker quietly chooses the strong symbol defined in the other module.

Eg 5: Multiple weak symbols – same type
/* weak.c */
int x;
int main() {
x = 100;
x = x + 100;
 return 0; }
/* Weak1.c */
int x;
void f() {
x = 500;
x = x / 100; }

Here there are two weak symbols, and the linker does not give any warning; both modules end up sharing a single ‘x’. This can lead to run-time bugs that go undetected and are tough to debug, so programmers should be very careful. It is even more dangerous when the two definitions have different types, as explained in the next example.


Eg 6: Strong and weak symbols – different types
/* weak.c */
int x = 777;
int z = 300;
int main() {
x = x + 100;
 return 0; }
/* Weak1.c */
double x;
void f() {
x = 500.123; }

Here the two definitions of ‘x’ have different types: a strong int in weak.c and a weak double in Weak1.c. The linker keeps the 4-byte int, but the code in Weak1.c stores an 8-byte double at that address, so on a typical system the assignment x = 500.123 overwrites the memory of both ‘x’ and ‘z’. This is a very nasty bug, as the build still succeeds (GNU ld may print at most a warning about the symbol's size). GCC has an option that makes the linker report multiply defined global symbols as errors: the -fno-common flag, which is the default from GCC 10 onwards.
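The damage can be simulated inside a single file, without relying on linker behaviour. The sketch below assumes that a double is exactly twice the size of an int, and it uses a hypothetical struct to stand in for x and z sitting next to each other in memory; the actual placement chosen by a linker is not guaranteed.

```c
#include <string.h>

/* Stand-in for the two adjacent int globals x and z from weak.c. */
struct globals {
    int x;
    int z;
};

/* Returns 1 if z was changed by the 8-byte store, 0 if it survived,
 * and -1 if the size assumption does not hold on this platform. */
int clobber_demo(void)
{
    struct globals g = { 777, 300 };
    double d = 500.123;

    if (sizeof d != sizeof g)
        return -1;

    /* This mimics what the other module does when it treats the
     * address of x as the address of a double. */
    memcpy(&g, &d, sizeof d);

    return g.z != 300;
}
```

On a platform that meets the size assumption, z no longer holds 300 after the copy, even though no statement in the program ever assigns to it by name.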

Symbol Relocation:
Once the linker has scanned all of the input files to determine segment sizes, symbol definitions and symbol references, figured out which library modules to include, and decided where in the output address space all of the segments will go, the next stage is relocation. Relocation is the process of adjusting program addresses to account for non-zero segment origins, together with resolving references to external symbols, since the two are frequently handled together.

The figure below gives a better picture of how the executable is made.


With this we come to the end of the compilation process, but there is much more to explain, such as shared libraries, dynamic link libraries and the loader, which we will explore in the coming series.



Monday, 30 January 2012

Behind the Scene – The C Compilation Steps - Part II Compiler & Assembler





Compilation and assembly make up the second stage in the C build process. The compiler takes the output of the pre-processor and generates assembly code, which is written to an assembly file. Assembly language is a human-readable form of machine code whose instructions control the memory and processor directly, in a layer beneath the operating system.

$ gcc -S hello.c

This will create a file called hello.s, which looks like:

        .file   "hello.c"
        .section        .rodata
.LC0:
        .string "Hello World "
        .text
.globl main
        .type   main, @function
main:
        pushl   %ebp
        movl    %esp, %ebp
        andl    $-16, %esp
        subl    $16, %esp
        movl    $.LC0, (%esp)
        call    puts
        movl    $0, %eax
        leave
        ret
        .size   main, .-main
        .ident  "GCC: (Ubuntu/Linaro 4.4.4-14ubuntu5) 4.4.5"
        .section        .note.GNU-stack,"",@progbits

This file contains assembly instructions; if you remember the old microprocessor programming lab exercises from your college days, they will surely help you to understand it better. So it is the compiler that generates the instructions to run your program on your target. Once the assembly code is generated, the compilation step is over and the assembler takes over.


The assembler takes the assembly source code and translates it into machine code, which is stored in an object file. You can create object code from assembly code with


$ as hello.s -o hello.o

You can also create object code directly from C code with


$ gcc -c hello.c

This creates a binary file called hello.o.

On Linux, object files come in ELF (Executable and Linking Format). You can check the file format using the command


$ file hello.o

which gives output like this:
hello.o: ELF 32-bit LSB relocatable, Intel 80386, version 1 (SYSV), not stripped


So what makes up an object file? The object file contains several kinds of information.
1.      Header Information:  File header to describe the object file, like size of code, name of the source file, creation date etc…
2.      Object code:  This is the binary instructions and data generated by a compiler or assembler.
3.      Relocation:  A list of places in the object code that have to be fixed up when the linker changes the address of the object code.
4.      Symbols:  Global symbols defined in this module, symbols to be imported from other modules.
5.      Debugging Information:  Other information about the object code required for the debugger. This includes source file and line information, local symbols etc…


We can view the contents of an object file using the command


$ readelf -a hello.o

ELF Header:
  Magic:   7f 45 4c 46 01 01 01 00 00 00 00 00 00 00 00 00
  Class:                             ELF32
  Data:                              2's complement, little endian
  Version:                           1 (current)
  OS/ABI:                            UNIX - System V
  ABI Version:                       0
  Type:                              REL (Relocatable file)
  Machine:                           Intel 80386
  Version:                           0x1
  Entry point address:               0x0
  Start of program headers:          0 (bytes into file)
  Start of section headers:          220 (bytes into file)
  Flags:                             0x0
  Size of this header:               52 (bytes)
  Size of program headers:           0 (bytes)
  Number of program headers:         0
  Size of section headers:           40 (bytes)
  Number of section headers:         11
  Section header string table index: 8

Section Headers:
  [Nr] Name              Type            Addr     Off    Size   ES Flg Lk Inf Al
  [ 0]                   NULL            00000000 000000 000000 00      0   0  0
  [ 1] .text             PROGBITS        00000000 000034 00001c 00  AX  0   0  4
  [ 2] .rel.text         REL             00000000 000348 000010 08      9   1  4
  [ 3] .data             PROGBITS        00000000 000050 000000 00  WA  0   0  4
  [ 4] .bss              NOBITS          00000000 000050 000000 00  WA  0   0  4
  [ 5] .rodata           PROGBITS        00000000 000050 00000d 00   A  0   0  1
  [ 6] .comment          PROGBITS        00000000 00005d 00002c 01  MS  0   0  1
  [ 7] .note.GNU-stack   PROGBITS        00000000 000089 000000 00      0   0  1
  [ 8] .shstrtab         STRTAB          00000000 000089 000051 00      0   0  1
  [ 9] .symtab           SYMTAB          00000000 000294 0000a0 10     10   8  4
  [10] .strtab           STRTAB          00000000 000334 000013 00      0   0  1
Key to Flags:
  W (write), A (alloc), X (execute), M (merge), S (strings)
  I (info), L (link order), G (group), T (TLS), E (exclude), x (unknown)
  O (extra OS processing required) o (OS specific), p (processor specific)

There are no section groups in this file.

There are no program headers in this file.

Relocation section '.rel.text' at offset 0x348 contains 2 entries:
 Offset     Info    Type            Sym.Value  Sym. Name
0000000c  00000501 R_386_32          00000000   .rodata
00000011  00000902 R_386_PC32        00000000   puts

There are no unwind sections in this file.
Symbol table '.symtab' contains 10 entries:
   Num:    Value  Size Type    Bind   Vis      Ndx Name
     0: 00000000     0 NOTYPE  LOCAL  DEFAULT  UND
     1: 00000000     0 FILE    LOCAL  DEFAULT  ABS hello.c
     2: 00000000     0 SECTION LOCAL  DEFAULT    1
     3: 00000000     0 SECTION LOCAL  DEFAULT    3
     4: 00000000     0 SECTION LOCAL  DEFAULT    4
     5: 00000000     0 SECTION LOCAL  DEFAULT    5
     6: 00000000     0 SECTION LOCAL  DEFAULT    7
     7: 00000000     0 SECTION LOCAL  DEFAULT    6
     8: 00000000    28 FUNC    GLOBAL DEFAULT    1 main
     9: 00000000     0 NOTYPE  GLOBAL DEFAULT  UND puts
No version information found in this file.

Now let’s examine the sections most relevant to a C programmer. I am specifically emphasizing the sections below because understanding them is a must for a C programmer who wants to write better solutions.



To understand the sections better, let's analyze each one with an example program.
A bare-bones C program:
#include <stdio.h>
int main()
{  return 0; }

$ gcc test.c
$ size a.out
   text    data     bss     dec     hex filename
    836     260       8    1104     450 a.out

For those who are not familiar with size command, it gives the following as per the man page.
       The GNU size utility lists the section sizes---and the total size---for
       each of the object or archive files objfile in its argument list.  By
       default, one line of output is generated for each object file or each
       module in an archive.

Now let's add a global variable to the C program as shown below.

#include <stdio.h>
int var;
int main()
{  return 0; }


$ gcc test.c
$ size a.out
   text    data     bss     dec     hex filename
    836     260      12    1108     454 a.out

Now we can see bss increase by 4 bytes, i.e. an uninitialized global variable goes into the bss section.

Let's initialize the global variable and see where it goes.

#include <stdio.h>
int var = 100;
int main()
{  return 0;}


$ gcc test.c
$ size a.out
   text    data     bss     dec     hex filename
    836     264       8    1108     454 a.out

Here we can see that initialized variables go into the data segment. The case is similar for static variables; readers are advised to verify this by making the variable static.

After global and static variables, let's turn to const variables and see where they go.
#include <stdio.h>
const int var = 100;
int main()
{  return 0;}

$ gcc test.c
$ size a.out
   text    data     bss     dec     hex filename
    840     260       8    1108     454 a.out

We can clearly see that the text figure has grown with the addition of the const. The constant itself is placed in the read-only data section (.rodata), which the size command counts as part of text. Hope this exercise has given you a clear picture of each section in an executable.
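The experiments above can be gathered into one file. The following is a minimal sketch with hypothetical names; the section noted in each comment is where GCC on Linux typically places the object, and other toolchains may differ.

```c
#include <stddef.h>

int zeroed[1000];                     /* uninitialised: .bss, only its size is recorded in the file */
int filled[1000] = { 1 };             /* initialised: .data, all 1000 ints are stored in the file   */
const int table[4] = { 1, 2, 3, 4 };  /* read-only: .rodata, which size reports under text          */
static int calls;                     /* static and uninitialised: .bss, not visible to other files */

/* Returns the total number of bytes occupied by the three arrays. */
size_t section_demo(void)
{
    calls++;
    return sizeof zeroed + sizeof filled + sizeof table;
}
```

Compiling this with gcc -c and running size on the object file should show the large array that starts out as zero under bss and the initialised one under data, which is why only the latter makes the file on disk bigger.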


To view the contents of an object file, you can also use objdump command, as given below.

$ objdump -x hello.o

hello.o:     file format elf32-i386
hello.o
architecture: i386, flags 0x00000011:
HAS_RELOC, HAS_SYMS
start address 0x00000000


Sections:
Idx Name          Size      VMA       LMA       File off  Algn
  0 .text         0000001c  00000000  00000000  00000034  2**2
                  CONTENTS, ALLOC, LOAD, RELOC, READONLY, CODE
  1 .data         00000000  00000000  00000000  00000050  2**2
                  CONTENTS, ALLOC, LOAD, DATA
  2 .bss          00000000  00000000  00000000  00000050  2**2
                  ALLOC
  3 .rodata       0000000d  00000000  00000000  00000050  2**0
                  CONTENTS, ALLOC, LOAD, READONLY, DATA
  4 .comment      0000002c  00000000  00000000  0000005d  2**0
                  CONTENTS, READONLY
  5 .note.GNU-stack 00000000  00000000  00000000  00000089  2**0
                  CONTENTS, READONLY
SYMBOL TABLE:
00000000 l    df *ABS*     00000000 hello.c
00000000 l    d  .text         00000000 .text
00000000 l    d  .data        00000000 .data
00000000 l    d  .bss           00000000 .bss
00000000 l    d  .rodata    00000000 .rodata
00000000 l    d  .note.GNU-stack                00000000 .note.GNU-stack
00000000 l    d  .comment              00000000 .comment
00000000 g     F .text        0000001c main
00000000         *UND*     00000000 puts

RELOCATION RECORDS FOR [.text]:
OFFSET   TYPE              VALUE
0000000c R_386_32          .rodata
00000011 R_386_PC32        puts

There are three types of object file –

·        Relocatable object file -  contains binary code and data in a form that can be combined with other relocatable object files.
·        Executable object file – contains binary code and data in a form that can be copied directly into memory and executed.
·        Shared object file – A special type of relocatable object file that can be loaded into memory and linked dynamically.


Compilers and assemblers generate relocatable object files. Linkers generate executable object files.

However understanding the various sections of the object file won’t be complete without a few more words on the Symbol table. 


Symbol Table:


The compiler creates a symbol table containing the name-to-address mappings as part of the object files it produces. A symbol table is a data structure used by the compiler to keep track of the semantics of variables. The figure below gives a simplified view of a sample symbol table.


The name is a byte offset into the string table that points to the null-terminated string name of the symbol. The value is the symbol’s address. For relocatable modules the value is an offset from the beginning of the section where the object is defined. For executable object files the value is an absolute run-time address. The size is the size of the object in bytes. The type is usually either data or function. The binding indicates whether the symbol is local or global. The compiler keeps a compiler symbol table, and the linker keeps a linker symbol table created during the linking process. The readelf and objdump output shown here will look different when we examine the executable; the commands are introduced here so that readers can get familiar with the procedure.
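On Linux these fields correspond to the Elf32_Sym structure declared in <elf.h>. The sketch below decodes the binding and type of a hand-built entry modelled on the main symbol from the readelf output earlier in this post. The helper function names and the st_name offset are made up for illustration.

```c
#include <elf.h>
#include <string.h>

/* Translate the binding stored in st_info into the label readelf prints. */
const char *bind_name(const Elf32_Sym *sym)
{
    switch (ELF32_ST_BIND(sym->st_info)) {
    case STB_LOCAL:  return "LOCAL";
    case STB_GLOBAL: return "GLOBAL";
    case STB_WEAK:   return "WEAK";
    default:         return "OTHER";
    }
}

/* Translate the type stored in st_info into the label readelf prints. */
const char *type_name(const Elf32_Sym *sym)
{
    switch (ELF32_ST_TYPE(sym->st_info)) {
    case STT_NOTYPE:  return "NOTYPE";
    case STT_OBJECT:  return "OBJECT";
    case STT_FUNC:    return "FUNC";
    case STT_SECTION: return "SECTION";
    case STT_FILE:    return "FILE";
    default:          return "OTHER";
    }
}

/* Build an entry resembling the main symbol of hello.o. */
Elf32_Sym make_main_symbol(void)
{
    Elf32_Sym s;
    memset(&s, 0, sizeof s);
    s.st_name  = 1;    /* hypothetical offset into the string table */
    s.st_value = 0;    /* offset from the start of its section      */
    s.st_size  = 28;   /* size of the function in bytes             */
    s.st_info  = ELF32_ST_INFO(STB_GLOBAL, STT_FUNC);
    s.st_shndx = 1;    /* index of the .text section                */
    return s;
}
```

Each Elf32_Sym occupies 16 bytes, which matches the entry size of 10 (hex) shown for .symtab in the section header listing.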


To view a symbol table, use the command
$ nm hello.o
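
As a rough guide to reading nm output, the sketch below annotates a hypothetical source file with the symbol type letter that GNU nm would typically report for each definition when the file is compiled without optimisation. All names are invented for illustration, and the letters can differ with other compilers or flags.

```c
#include <stdio.h>

int counter = 1;          /* D: initialised global data               */
int total;                /* B: uninitialised data (C with -fcommon)  */
static int hidden = 5;    /* d: initialised data, local to this file  */
const int limit = 10;     /* R: read-only data                        */

/* t: function with internal linkage, local to this file */
static int helper(int v)
{
    return v + hidden;
}

/* T: function with external linkage, visible to other files */
int report(void)
{
    total = helper(counter) + limit;
    puts("report called");   /* U: puts is undefined here, resolved at link time */
    return total;
}
```

Upper-case letters mark global symbols and lower-case letters mark local ones, which mirrors the GLOBAL and LOCAL bindings printed by readelf.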