GCC Myths and Facts

Since my good old Pentium 166 days, I've liked to search for the best optimizations possible so programs can take the maximum advantage of hardware/CPU cycles. If I have a nice piece of hardware, why not run it at its full power, using every little feature? Shouldn't we all try to get the best results from the money invested in our machines?

This article is written for the average desktop Linux user and with the x86 architecture and C/C++ in mind, but some of its content can be applied to all architectures and languages.

GCC 3 Improvements

GCC 3 is the biggest step forward since GCC 2, representing more than ten years of accumulated work and two years of hard development. It has major benefits over its predecessor, including:

Target Improvements

  • A new x86 backend, generating much-improved code.
  • Support for a generic i386-elf target.
  • A new option to emit x86 assembly code using an Intel-style syntax.
  • Better code generated for floating point-to-integer conversions, leading to better performance in many 3D applications.

Language Improvements

  • A new C++ ABI. On the IA-64 platform, GCC is capable of interoperating with other IA-64 compilers.
  • A significant reduction in the size of symbol and debugging information (thanks to the new ABI).
  • A new C++ support library and many C++ bugfixes, vastly improving conformance to the ISO C++ standard.
  • A new inliner for C++.
  • A rewritten C preprocessor, integrated into the C, C++, and Objective C compilers, with many improvements, including ISO C99 support and improvements to dependency generation.

General Optimizations

  • Infrastructure for profile-driven optimizations.
  • Support for data prefetching.
  • Support for SSE, SSE2, 3DNOW!, and MMX instructions.
  • A basic block reordering pass.
  • New tail call and sibling call elimination optimizations.

Why do some programmers and users fail to take advantage of these amazing new features? I admit that some of them are still "experimental", but not all of them. Perhaps the PGCC (Pentium compiler group) project gave rise to several misunderstandings which persist today. (PGCC offered several Pentium-specific optimizations. I looked at it when it first started, but benchmarks showed that the improvement was only about 2%-5% over GCC 2.7.2.3.)

We should clear the air about the GCC misconceptions. Let's start with the most loved and hated optimization: -Ox.

Myths

I use -O69 because it is faster than -O3.

This is wrong!

The highest optimization is -O3.

From the GCC 3.2.1 manual:

       -O3    Optimize yet more.  -O3 turns on all optimizations
              specified   by   -O2   and   also   turns  on  the
              -finline-functions and -frename-registers options.

The most skeptical can verify this in gcc/toplev.c:


/* Scan to see what optimization level has been specified.
   That will determine the default value of many flags. */

-snip-

  if (optimize >= 3)
    {
      flag_inline_functions = 1;
      flag_rename_registers = 1;
    }

If you are using GCC, there's no point in using anything higher than 3.
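
If you want to see for yourself exactly which flags a given -O level enables on your system, a quick check is possible with newer releases (the -Q --help=optimizers query appeared in GCC 4.3; on GCC 3 you are stuck with reading toplev.c). A rough sketch:

    # Ask GCC which optimization flags a given level enables
    gcc -Q -O3  --help=optimizers > o3.txt
    gcc -Q -O69 --help=optimizers > o69.txt
    # The two listings should be identical: anything above 3 buys you nothing
    diff o3.txt o69.txt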

-O2 turns on loop unrolling.

In the GCC manpage, it's clearly written that:

-O2 turns on all optional optimizations except for loop unrolling [...]

Skeptics: check toplev.c.

So when you use -O2, which optimizations are you using?

The -O2 flag turns on the following flags:

  • -O1, which turns on:
    • defer pop (see -fno-defer-pop)
    • -fthread-jumps
    • -fdelayed-branch (only effective on targets that have branch delay slots)
    • -fomit-frame-pointer (only turned on automatically if the machine can debug without a frame pointer; otherwise, you have to specify it explicitly)
    • guess-branch-prob (see -fno-guess-branch-prob)
    • cprop-registers (see -fno-cprop-registers)
  • -foptimize-sibling-calls
  • -fcse-follow-jumps
  • -fcse-skip-blocks
  • -fgcse
  • -fexpensive-optimizations
  • -fstrength-reduce
  • -frerun-cse-after-loop
  • -frerun-loop-opt
  • -fcaller-saves
  • -fforce-mem
  • peephole2 (a machine-dependent option; see -fno-peephole2)
  • -fschedule-insns (if supported by the target machine)
  • -fregmove
  • -fstrict-aliasing
  • -fdelete-null-pointer-checks
  • -freorder-blocks (see -fno-reorder-blocks)

There's no point in writing -O2 -fstrength-reduce and so on, since -O2 already implies all of these flags.
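
As a concrete (purely illustrative) example, a Makefile fragment like the first line below carries only dead weight; the second line produces exactly the same code:

    # Redundant: every extra flag here is already implied by -O2
    CFLAGS = -O2 -fstrength-reduce -fcse-follow-jumps -fgcse -fregmove
    # Equivalent, and easier to read
    CFLAGS = -O2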

Facts

The truth about -O*

This leaves us with -O3, which is -O2 plus:

  • -finline-functions
  • -frename-registers

-finline-functions is useful in some cases (mainly with C++). The maximum size of functions considered for inlining (600 pseudo-instructions by default) can be set with -finline-limit. Unfortunately, if you set a high limit, compilation will probably fail with an out-of-memory error. This option needs a huge amount of memory, increases compile time, and makes the binary bigger. Sometimes you see a gain, and sometimes you don't.
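
For instance, a C++ build that wants to allow larger functions to be inlined might look like the following sketch (bigmodule.cpp and the limit of 1200 are only placeholders; measure before settling on a number):

    # Raise the inlining limit from the default of 600 pseudo-instructions.
    # Expect longer compile times, more memory use, and a bigger binary.
    g++ -O3 -finline-limit=1200 -c bigmodule.cpp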

-frename-registers attempts to avoid false dependencies in scheduled code by making use of registers left over after register allocation. This optimization most benefits processors with many registers. It can, however, make debugging impossible, since variables no longer stay in a "home register". Since the i386 is not a register-rich architecture, I don't think this will have much impact.

A higher -O does not always mean improved performance. -O3 increases the code size and may introduce cache penalties and become slower than -O2. However, -O2 is almost always faster than -O.
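
The only honest way to settle -O2 versus -O3 for your own program is to measure both. A rough sketch (myprog.c stands in for your real source, and a single run of time is only a first approximation):

    gcc -O2 -o myprog-o2 myprog.c
    gcc -O3 -o myprog-o3 myprog.c
    time ./myprog-o2
    time ./myprog-o3
    # Also compare the sizes; a larger binary can mean more cache misses
    size myprog-o2 myprog-o3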

-march and -mcpu

With GCC 3, you can specify the type of processor you're using with -march or -mcpu. Although they look similar, they're not: one selects the architecture to generate code for, the other the CPU to tune for. The available values are:

  • i386
  • i486
  • i586
  • i686
  • pentium
  • pentium-mmx
  • pentiumpro
  • pentium2
  • pentium3
  • pentium4
  • k6
  • k6-2
  • k6-3
  • athlon
  • athlon-tbird
  • athlon-4
  • athlon-xp
  • athlon-mp

-march implies -mcpu, so when you use -march, there's no need to use -mcpu.

-mcpu generates code tuned for the specified CPU, but it does not alter the ABI or the set of available instructions, so you can still run the resulting binary on other CPUs.

When you use -march, you generate code for the specified machine type and all of its available instructions may be used (it turns on flags like -mmmx, -m3dnow, and so on), which means that you probably cannot run the binary on other machine types.
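
A quick way to see the difference (a sketch; pentium3 and myprog.c are only examples) is to compile the same file both ways and compare the generated assembly:

    # Tuned for the Pentium III, but still runs on any i386-class CPU
    gcc -O2 -mcpu=pentium3 -S -o tuned.s myprog.c
    # May use Pentium III-only instructions (MMX/SSE), so it might not
    # run on older processors
    gcc -O2 -march=pentium3 -S -o arch.s myprog.c
    diff tuned.s arch.s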

Conclusion

Fine-tune your Makefile, remove those redundant options, and take a look at the GCC manpage. I bet you will save yourself a lot of time. There's probably a bug somewhere that can be smashed by turning off some of GCC's default flags.

This article discusses only a few of GCC's features, but I won't broaden its scope. I just want to try to clarify some of the myths and misunderstandings. There's a lot left to say, but nothing that can't be found in the Fine Manual, HOWTOs, or around the Internet. If you have patience, a look at the GCC sources can be very rewarding.

When you're coding a program, you'll inevitably run into bugs. Occasionally, you'll find one that's GCC's fault. When you do, stop to think about the time and effort that's gone into the compiler project and all that it's given you. You might think twice before simply flaming GCC.

Recent comments

16 Sep 2006 17:02 hzmonte

Re: Optimization - does O3 always generate faster code than O2?
%Therefore, I guess the conclusion (at least for gcc 3) is
%1. -O2 does not mandate loop unrolling;
%2. with -O or -O2, loop unrolling may or may not be turned on.

To clarify, what gcc 3 means may be:
-O2 does not perform loop unrolling unless it is already performed by -O. Therefore, loop unrolling may or may not be performed under -O or -O2.
And it seems -O3 does not turn on loop unrolling either (unless it is already performed by -O).
Is my understanding correct?

With the wording in the 4.1.1 Manual, I have no clue what it means. In particular, it says "The compiler does not perform loop unrolling or function inlining when you specify -O2." It does not say "-O2 does not perform loop unrolling"; it says "the compiler does not perform loop unrolling". So it seems -O2 will turn off any loop unrolling that is enabled by -O!

16 Sep 2006 16:27 hzmonte

Optimization - does O3 always generate faster code than O2?
Is it possible that code generated using the O2 option runs faster than that using O3, for example? Is it possible that an optimization technique used by O3 is counter-productive for a particular algorithm?

And is there a more detailed explanation (preferably with examples) of each optimization technique used by gcc than the one in the GCC manual?

The article says: "In the GCC manpage, it's clearly written that: -O2 turns on all optional optimizations except for loop unrolling [...]" (In the 4.1.1 manual, the exact wording is: "The compiler does not perform loop unrolling or function inlining when you specify -O2." which is even more confusing.) True, but -O2 turns on all flags that are turned on by -O. And -O turns on -floop-optimize which "optionally" does loop unrolling. Therefore, I guess the conclusion (at least for gcc 3) is

1. -O2 does not mandate loop unrolling;

2. with -O or -O2, loop unrolling may or may not be turned on.

However, based on the 4.1.1 wording, there is simply no loop unrolling under -O2, period. It somehow implies that if there is any loop unrolling optionally turned on by -O, -O2 would disable it. That is strange.

And how does -floop-optimize2 work?

GCC 4.1.1 manual:

-fprofile-use

Enable profile feedback directed optimizations, and optimizations generally profitable only with profile feedback available.

The following options are enabled: -fbranch-probabilities, -fvpt, -funroll-loops, -fpeel-loops, -ftracer.

-funroll-loops

Unroll loops whose number of iterations can be determined at compile time or upon entry to the loop. -funroll-loops implies -frerun-cse-after-loop. It also turns on complete loop peeling (i.e. complete removal of loops with small constant number of iterations). This option makes code larger, and may or may not make it run faster.

Enabled with -fprofile-use.

-floop-optimize

Perform loop optimizations: move constant expressions out of loops, simplify exit test conditions and optionally do strength-reduction and loop unrolling as well.

Enabled at levels -O, -O2, -O3, -Os.

-floop-optimize2

Perform loop optimizations using the new loop optimizer. The optimizations (loop unrolling, peeling and unswitching, loop invariant motion) are enabled by separate flags.

That is, -O turns on -floop-optimize which optionally does loop unrolling. On the other hand, -fprofile-use enables -funroll-loops. And none of the -Ox flags turns on -fprofile-use. Also, none of the -Ox flags turns on -floop-optimize2. And it appears that the manual says that once -floop-optimize2 is turned on, loop unrolling is enabled by a separate flag, presumably -funroll-loops, and implies that -floop-optimize2 would "disable" -floop-optimize because -floop-optimize2 would force the loop optimization techniques to be individually turned on. It follows that if I do this:

gcc -O -fprofile-use myprog.c or

gcc -O -floop-optimize2 myprog.c

No loop optimization is performed because any loop optimization that would otherwise be turned on by -O is turned off by -floop-optimize2. If I want to do loop unrolling using the so-called "new loop optimizer", and also benefit from other optimizations (except loop optimization) offered by -O, then I need to do this:

gcc -O -floop-optimize2 -funroll-loops myprog.c

This would do:

1. optimization (except loop optimization) offered by -O

2. loop unrolling offered by the "new loop optimizer"

but would not do any other loop optimization.

Is my understanding correct?

04 Dec 2004 23:41 oliverthered

Re: Think about when optimizing and what

> For example if you do scientific calculations, you are writing some
> 3d game it can be critical... but if you are just writing a chat
> program or a mail client, speed is not so important

What if the person running the scientific calculations is also talking to someone else over the internet using a chat client?

1: All applications should be optimized.

2: If you're running a scientific application (or povray) or anything else that needs high throughput or a lot of CPU time, profile the code and recompile using the profile; this will help out more than -O?.

3: Optimization is a trade-off with features, stability, and release time. Sometimes you want a fast turnaround with a simple toolkit even if it runs twice as slow; that's kind of what you get with Basic/VB. What you don't get is software that you'd want to use in a server or critical environment, and that's the tradeoff a lot of people took but regretted. Personally, I'd go with Delphi for fast-turnaround RAD in the days of VB.

02 Oct 2004 04:31 tomfm

Re: some points still missing


> Incremental compiling is protected under US patent 5,586,328.
> Open-Source implementations will be illegal until 2017.

Only if the patent holds up in court, which seems unlikely, given that the technique is many years older than that.

17 May 2004 13:10 d_weasel

Re: Think about when optimizing and what

>
> % First of all most people don't even
> know
> % what means optimizing. Otherwise you
> % can't explain why many (ahem)
> % programmers use things like VBasic.
>
>
> Second of all you don't even know what
> means engrish!
> If you are going to take the time to
> bash something at least take the to
> properly formulate your sentences. This
> way your opponents don't have flame
> fodder sitting all over the place!
>
> Ack!! What a horrible blanket statement.
> There are tons of successful projects
> developed in Visual Basic. It a quite
> stable environement to develop and debug
> from. There is very little you can't do
> with VB, and the longer I use it the
> more I realize that you can do basically
> anything with it, given the right
> skills. Since you can call just about
> any API call directly from VB its is
> just as efficient as other languages.
> The only overhead you might have is a
> few K of the VB runtime environment,
> that it likely already installed and
> being actively used on your Window
> machine. It certainly not the best tool
> for every job, but no language is. Its
> also far better than alot of other
> languages/IDEs out there.

Ack I just stuck my foot in my mouth, by improperly 'formulating' a sentence about properly formulating sentences....hahah


*feels like a fool*
close enough!
