Greetings, folks.

Today we will talk mainly about memory layout in C++. We are going to see:

• how source files become a program;
• what memory segmentation is;
• what the difference between sections and segments is;
• what an object file is and why it is so important;
• what tools can be used to take a closer look at a program's internals.

System Details:

• GCC version 4.8.2
• CentOS 7 x64

### The C++ compilation process

Before we talk about a program's internals, we need to understand how all the pieces are put together. Let's see how program source becomes a fully operational executable file.

Consider the following tiny example written in C++:

#include <vector>
#include <iostream>

int main()
{
    std::vector<int> numbers {1,2,3,4,5,6};

    for (auto n : numbers)
        std::cout << n*n << '\n';

    return 0;
}


# Preprocessor Stage

Command: g++ -std=c++11 -E main.cpp > main.ii
The -E option tells GCC to produce only the preprocessor output and not to run the compiler.

The preprocessor takes a C++ source file, parses it, and treats lines beginning with '#' as directives (#include, #define, #if and others). The output of this step is a "pure" C++ source file without preprocessor directives.

It works on one C++ source file at a time: it replaces #include directives with the content of the respective files (usually just declarations), expands macros (#define), and selects different portions of text depending on #if, #ifdef and #ifndef directives.

The preprocessor works on a stream of preprocessing tokens. Macro substitution is defined as replacing tokens with other tokens (the ## operator merges two tokens into one when the result is a valid token).

After all this, the preprocessor produces a single output that is a stream of tokens resulting from the transformations described above. It also adds some special markers that tell the compiler where each line came from so that it can use those to produce sensible error messages.

Some errors can be produced at this stage with the help of the #if and #error directives.

For example, this small file (simple.cpp):

#ifndef PI
#define PI 3.14159265359
#endif

int main()
{
    float pi = PI;
    return 0;
}


gives us the following preprocessing output:

# 1 "simple.cpp"
# 1 "<command-line>"
# 1 "/usr/include/stdc-predef.h" 1 3 4
# 1 "<command-line>" 2
# 1 "simple.cpp"

int main()
{
float pi = 3.14159265359;
return 0;
}


The preprocessor added linemarkers to the output file. The format is:

# linenum filename flags

According to the GCC online documentation:

After the file name comes zero or more flags, which are '1', '2', '3', or '4'. If there are multiple flags, spaces separate them. Here is what the flags mean:

'1' - This indicates the start of a new file.
'2' - This indicates returning to a file (after having included another file).
'3' - This indicates that the following text comes from a system header file, so certain warnings should be suppressed.
'4' - This indicates that the following text should be treated as being wrapped in an implicit extern "C" block.

# Compiler Stage

The compiler takes the preprocessor's output and translates it into assembly code, which is then turned into an object file. We can ask GCC to stop early and produce only the assembly file.

Command: g++ -std=c++11 -S main.ii
The -S option tells GCC to produce only the assembly code and not to assemble the file.

You can view the assembly file easily, but to demangle the names, use the c++filt utility from
GNU Binutils:

cat main.s | c++filt > main_unmangled.s

# Assembler Stage

The assembler processes the input assembly file and produces an object file.

Command: g++ -std=c++11 -c -o main.o main.s
The -c option tells GCC to produce only the compiled code and not to link.

# Linker Stage

Command: g++ -std=c++11 -o main main.o

A linker converts object files into executables and shared libraries. Those are binary data files written in a format designed as input for the operating system or the loader. There is no special requirement for the object file format to resemble the executable file format. But, usually they are very similar.

The main entity at this stage, as you might have guessed, is the object file: a binary data file written in a format designed as input to the linker. An object file has various sections (the major ones are described later). To combine all the object files into a single executable, the linker merges all sections of the same type into a single section of that type.

It's useful to know that the linker works with symbols and relocations.

Symbols are objects that exist in a single place for the duration of the program.

Quote from Wikipedia:

Typically, an object file can contain three kinds of symbols:

• defined "external" symbols, which allow it to be called by other modules
• undefined "external" symbols, which reference other modules where these symbols are defined
• local symbols, used internally within the object file to facilitate relocation.

For example, in an object file generated from C++ code, there will be a symbol for each function and for each global and static variable. Those are defined symbols. Symbols that the object file references but that are defined in other object files are undefined symbols. During the linking process, the linker assigns an address to each defined symbol and resolves each undefined symbol by finding a defined symbol with the same name. Symbols are put in different sections depending on whether they are global or local, initialized or uninitialized, variable or constant.

Closely related to symbols are relocations.

Quote from Wikipedia:

Relocation is the process of assigning load addresses to various parts of a program and adjusting the code and data in the program to reflect the assigned addresses.

The relocation process is tightly bound to relocation tables and involves a number of specific sections containing extra information, which you may encounter when investigating object code or executable files. For example, one such section is .rela.text, which holds the relocations for the .text section of the object file. I'm not going to dive into all of these, but if you would like to go further, the ELF file format is a good starting point.

Every .o file has a "relocation table" listing every single reference to a symbol that the linker needs to update, and how it will do the update.

Now that we've seen how to run the commands manually to get the intermediate files, let's stick to just one command to produce an executable:

g++ -std=c++11 -o main main.cpp

If you still want to get all the temporary files (expanded source file, assembly file and object code file) and don't want to type these four commands each time, you are free to use GCC's --save-temps flag. It will produce the executable plus all the temporary files for you:

g++ -std=c++11 --save-temps -o main main.cpp

As a quick wrap-up, the figure below shows all stages:

Let's take a closer look at the sections that the object code file and, as a consequence, the running program consist of.

### Investigating Sections and Segments

What follows is a short overview. For a deeper treatment, refer to the dedicated material (see the Reference section).

What is the difference between sections and segments? How is this related to object files and executables?

The contents of an executable file or a shared library that are intended to be loaded into memory are contained within segments. An object file, in turn, has sections rather than segments.

Quote from Wikipedia:

The segments contain information that is necessary for runtime execution of the file, while sections contain important data for linking and relocation. Any byte in the entire file can be owned by at most one section, and there can be orphan bytes which are not owned by any section.

Self-explanatory picture from the same page:

An ELF file has two views: The program header shows the segments used at run-time, whereas the section header lists the set of sections of the binary.

Quote from Wikipedia:

Each segment has a length and set of permissions (for example, read, write, execute) associated with it. A process is only allowed to make a reference into a segment if the type of reference is allowed by the permissions, and if the offset within the segment is within the range specified by the length of the segment. Otherwise, a hardware exception such as a segmentation fault is raised.

But as you get familiar with sections and segments, you will definitely notice that some sections and segments have the same name and contents. How come? The full answer is beyond the scope of this article, but in short, it's because:

... the linker reads sections from the input object files. It sorts and concatenates them into sections in the output file. It maps all the loadable sections into segments in the output file. It lays out the section contents in the output file segments respecting alignment and access requirements, so that the segments may be mapped directly into memory. The sections are mapped to segments based on the access requirements: normally all the read-only sections are mapped to one segment and all the writable sections are mapped to another segment.

How this mapping works can easily be shown with the readelf --segments <input> command:

Skipping to the bottom of the output, we can see which sections have been mapped to which segments (see the "Section to Segment mapping" part).

So, now we are going to look at segments.

### Program Segments

.text segment

The text segment, a.k.a. code segment, contains executable instructions provided by the compiler and assembler.

.data segment

The data segment, a.k.a. initialized data segment, contains initialized:

• global variables (including global static variables)
• static local variables.

The segment's size depends on the size of the values in the source code. The values can be altered at run time.

.rdata/.rodata segment

This segment contains read-only static data, such as string literals and constants.

.bss segment

The BSS segment, a.k.a. uninitialized data segment, contains statically allocated (global and static) variables represented solely by zero-valued bits at program start. BSS stands for Block Started by Symbol, a pseudo-operation that existed in a very old assembler developed for the IBM 704.

Two other important memory regions that every program has are the stack and the heap.

Quote from Wikipedia

The heap area commonly begins at the end of the .bss and .data segments and grows to larger addresses from there. The heap area is managed by malloc, realloc, and free, which may use the brk and sbrk system calls to adjust its size (note that the use of brk/sbrk and a single "heap area" is not required to fulfill the contract of malloc/realloc/free; they may also be implemented using mmap to reserve potentially non-contiguous regions of virtual memory into the process' virtual address space). The heap area is shared by all threads, shared libraries, and dynamically loaded modules in a process.

Quote from Wikipedia

The stack area contains the program stack, a LIFO structure, typically located in the higher parts of memory. A "stack pointer" register tracks the top of the stack; it is adjusted each time a value is "pushed" onto the stack. The set of values pushed for one function call is termed a "stack frame". A stack frame consists at minimum of a return address. Automatic variables are also allocated on the stack.

In fact, once you have an executable file, you can use the readelf --sections <file> and readelf --segments <file> commands to list all the sections and segments the file consists of. However, to fully understand the output you will have to study the ELF file format.

We all know that a picture is worth a thousand words. I'll give you two. The first one shows a simplified view of the memory layout:

The second one I took from this pdf and it shows the memory layout for the following program:

void func(int x, int y)
{
    int a;
    int b[3];
    /* no other auto variable */
    ...
}

int main()
{
    ...
    func(72,73);
    ...
}


Here it is:

### From Theory to Practice

The last thing I'd like to cover before moving on to the code-compile-investigate phase is which tools are useful for this kind of research. There is an enormous number of such tools, but here are the major ones:

size - Lists section sizes and the total size. Useful for a coarse estimate, but not suitable for deep investigation.

readelf - Displays information about ELF files, and a lot of it. I will not be using this tool, as objdump's functionality is quite enough for our test cases, but you should probably stick with this one for an advanced investigation.

objdump - Displays information from object files. It can't show some of the ELF-specific information that readelf can, but it is good enough for us.

Consider the following tiny program:

int main()
{
    return 0;
}


Here are its section sizes:

[gahcep@vmr src]$ g++ --save-temps -o main main.cpp
[gahcep@vmr src]$ size main.o
text    data     bss     dec     hex filename
67       0       0      67      43 main.o


The .data and .bss sections are empty. No surprise here. Let's modify our program a bit:

// Uninitialized global variables go to .bss
int global;

int main()
{
    // Uninitialized static variables go to .bss
    static int st;
    return 0;
}


Let's see the result now:

[gahcep@vmr src]$ size main.o
text    data     bss     dec     hex filename
67       0       8      75      4b main.o

The .bss section has grown by 8 bytes, which is the size of two ints. Check the disassembly of the section:

[gahcep@vmr src]$ objdump -CS -s -j .bss main

main:     file format elf64-x86-64

Disassembly of section .bss:

000000000060102c <__bss_start>:
...

0000000000601030 <global>:
601030:       00 00 00 00                                         ....

0000000000601034 <main::st>:
601034:       00 00 00 00                                         ....


See the <global> and <main::st> identifiers? Those are our guys.

Let's try another edit:

// Uninitialized global variables go to .bss
int uninit_global;

// Initialized global variables go to .data
float init_global = 3.14;

int main()
{
    // Initialized static variables go to .data
    static int st = 77;
    return 0;
}


Result:

[gahcep@vmr src]$ size main.o
text    data     bss     dec     hex filename
67       8       4      79      4f main.o

As expected, .data has grown by 8 bytes (the initialized global and the local static variable) and .bss now contains only 4 bytes (the size of the uninit_global variable). Let's verify this:

[gahcep@vmr src]$ objdump -CS -s -j .data main

main:     file format elf64-x86-64

Contents of section .data:
601028 00000000 c3f54840 4d000000           ......H@M...

Disassembly of section .data:

0000000000601028 <__data_start>:
...

000000000060102c <init_global>:
60102c:       c3 f5 48 40                                         ..H@

0000000000601030 <main::st>:
601030:       4d 00 00 00                                         M...

[gahcep@vmr src]$ objdump -CS -s -j .bss main

main:     file format elf64-x86-64

Disassembly of section .bss:

0000000000601034 <__bss_start>:
601034:       00 00                   add    %al,(%rax)
...

0000000000601038 <uninit_global>:
...

Constants are another interesting story. As you know, they belong in the .rodata section. The previous program has no constants, so .rodata contains none of our data, only symbols added by the toolchain:

[gahcep@vmr src]$ objdump -CS -s -j .rodata main

main:     file format elf64-x86-64

Contents of section .rodata:
400640 01000200 00000000 00000000 00000000  ................

Disassembly of section .rodata:

0000000000400640 <_IO_stdin_used>:
400640:       01 00 02 00 00 00 00 00                             ........

0000000000400648 <__dso_handle>:
...


Now let's make the following edits:

const int MAX = 10000;
const int MIN = 100;

int main()
{
    const float pi = 3.14;
    return 0;
}


The .bss and .data sections are empty:

[gahcep@vmr src]$ size main.o
text    data     bss     dec     hex filename
84       0       0      84      54 main.o

But .rodata isn't:

[gahcep@vmr src]$ objdump -CS -s -j .rodata main

main:     file format elf64-x86-64

Contents of section .rodata:
400650 01000200 00000000 00000000 00000000  ................
400660 10270000 64000000 c3f54840           .'..d.....H@

Disassembly of section .rodata:

0000000000400650 <_IO_stdin_used>:
400650:       01 00 02 00 00 00 00 00                             ........

0000000000400658 <__dso_handle>:
...

0000000000400660 <MAX>:
400660:       10 27 00 00                                         .'..

0000000000400664 <MIN>:
400664:       64 00 00 00 c3 f5 48 40                             d.....H@


The two global constants are easily recognizable (<MAX> and <MIN>). What about main's local pi constant? Because it is local to main, it has no symbol name in the .rodata section, but it is indeed located there: the bytes c3 f5 48 40 listed under <MIN> are 3.14 stored as a little-endian single-precision float (0x4048f5c3).