Articles / Optimization

Optimization

If you want to write really fast code, what should you do? Hone your assembly skills? Erik Greenwald thinks that's a mistake too many people make, and that it's possible to write even better-optimized code in a high-level language.

Introduction

Many people seem to think that the best road to blindingly fast programs is writing as much as possible in assembly. Typically, this results in buggy software that is actually slower than that written in a higher level language. Premature optimizations can cost huge amounts of time and effort. No optimization should occur before the program works without crashing or exhibiting weird behavior.

Make it Work.

Attempting to optimize a program that doesn't even run is a pointless task. If your program crashes, then not only is it impossible to profile (binaries compiled for profiling write the data file after your program calls exit), but the users won't want to run it. Make sure you have no syntactic or memory errors in your program before even attempting to optimize it. I strongly suggest using all facilities in the language to extend warnings and errors, as well as using tools like Splint and Electric Fence to verify your program will not crash.

Make it Right.

Optimization can often make a program less readable, so you want to guarantee that the program does exactly what it should BEFORE you start optimizing. Any logical bugs should be tracked down and squished. I suggest a lot of test cases, especially at the bounds. "Off by one" errors can be tricky to find in clear code, and virtually impossible in code that has been optimized beyond readability.

Make it Fast.

Once we confirm that the program runs free of both syntactic and logical bugs, we're ready to begin the optimization. Often people will attempt to optimize prematurely or in the wrong place, resulting in wasted effort. To make optimizing easy and truly beneficial, a certain order of optimizations should take place.

Profiling

The first step in the optimization process is to locate and isolate the performance bottlenecks. Amdahl's Law states that the greatest opportunity for improvement is where the most time is spent. If a function is using 50% of the program's processor time and the time spent in that function is halved, the program will execute 25% faster. If we alter another function so it runs eight times as fast, but it only accounts for 10% of the CPU utilization, then we increase the speed of the program less than 2%. Obviously, the key to a good optimization is to improve the parts that consume the most time.

There are several tools available to do profiling, but I will concentrate on GNU gprof. It's a free tool that works with GCC and produces the amount of time spent inside of each function. Most libraries can be compiled with profiling support, so you can even use it to see if a library function is the bottleneck of your application. The gprof manual is available at http://www.gnu.org/manual/gprof/, and the software package is available at the GNU FTP site (and is probably already installed on your system if you use Linux). To get started using gprof, you only need to know how to compile your program to generate profiling information and how to view the generated information. GCC has a flag to enable profiling information generation, and the "gprof" command is used to view it. Note that gprof has to be run in the same directory you ran the program in, and requires the program as an argument.

$ gcc -o myapp myapp.c -pg
$ ./myapp
$ gprof ./myapp

Algorithmic Optimization

Once we know where the time is being consumed, we can start trimming down the time consumption. When you find a function or set of functions that consume a large amount of time, the first question you should ask yourself is "Can I approach this problem in a better, smarter way?" Big-O notation is often used to describe the efficiency of an algorithm, and is useful in this step. When I'm exploring the difference between different algorithms, I like to use a higher level language than what I'm writing the finished code in. This allows me to experiment with some radical changes to the algorithms with very little time and effort invested to explore those avenues.

Language Usage

While the algorithm itself may be sound, the implementation isn't always. If you are satisfied with the algorithm, you should look for how you expressed that algorithm in the programming language to see if there is something you could have done better. The biggest improvement is usually to cache computations instead of computing them several times. Functions that are called frequently may benefit from being inlined. A good knowledge of how the language is implemented can be very useful here.

Compiler Options

Just about every modern compiler has some fairly hefty optimization capabilities these days. Finding the right combination of flags to suit your code may take a little experimentation and reading. The GCC man page describes what the different flags do in a fair amount of detail. The first flags to consider are the "-O" flags. "-O" by itself does light optimization, and is equivalent to "-O1". To disable optimizations, you can use "-O0", and to set more aggressive optimization levels, you can use "-O2" or "-O3". Setting the machine type is also useful, as the compiler can build the assembly to use extended opcodes and registers. Most of us have x86 assemblers, so some of the flags useful for that architecture are "-m486", "-mpentium", and "-mppro", but be careful -- if you set your machine type too high, earlier machines may not be able to execute the code. The last series of flags you should consider are the specific optimization enables. To use the fast math library, for example, you can add -ffast-math to the compile string, but the O flags often incorporate several of these and are probably going to generate better code than if you try to set the "-f" flags yourself. If you know your program would benefit from a flag and the optimization level doesn't add that flag for you, you can force it.

Compiler Hints

Modern compilers do a very good job of figuring out how to convert your code into efficient assembly code. While the #pragma precompiler directives and certain keywords like "register" are still part of the language, they should generally be considered deprecated.

Assembly

When it comes to optimizing, the kneejerk reaction of a lot of people is to write in assembly. This thinking is very wrong. The compiler can almost always generate better assembly than a mere mortal. If you feel the need to write a piece in assembly, you should first examine the assembly that the compiler generates. To make gcc generate an assembly source file instead of an object or binary executable, simply call it with the "-S" flag. Examine the generated assembly file, and if you are absolutely sure that you can build better assembly than gcc, knock yourself out. I strongly suggest doing the assembly as inline assemble in the source and wrapping it and the original version in a precompiler conditional, so the program can still compile on different architectures.

Conclusion

Given this order of optimizing, you can greatly reduce the time and effort that goes into making quality software that operates in an expedient matter. I've only mentioned a few tools to help facilitate development; there are plenty of others out there (some do slightly different things, and some target different compilers/platforms). Happy optimizing.


Erik Greenwald (http://math.smsu.edu/~br0ke/) is a fourth year computer science student at Southwest Missouri State University. A Linux addict since 1996, he is active in Free Software development. He can be reached at erik@smluc.org.


T-Shirts and Fame!

We're eager to find people interested in writing editorials on software-related topics. We're flexible on length, style, and topic, so long as you know what you're talking about and back up your opinions with facts. Anyone who writes an editorial gets a freshmeat t-shirt from ThinkGeek in addition to 15 minutes of fame. If you think you'd like to try your hand at it, let jeff.covey@freshmeat.net know what you'd like to write about.

Recent comments

16 Mar 2005 22:53 Avatar frizzz

assembler is really the higher language!
I miss in this discussion some sentences about adressing.
"high-level"-programmers seem to think, that the meaning
of a value is built in the value - but the meaning is
reasoned by adressing the value.
Every language,
which omits adressing modes is therefore lower than
assembler! It is a subset of the possibilities given by a
CPU. The vastest
lack can be seen in omitting indexed adressing in higher
languages.
In the newest releases of gcc, there is the
feature provided to define a labels adress at runtime -
working similar to the assembler mnemonic 'lea' (load
effective adress), but I could not find out how to deal with
this adress in a 'goto'-statement!
Opposite to this, it is
very simple using assembler to switch branching in only
one command to thousands of branches! An example
could make it clear:

adress_of_label: DD label ; = a "pointer"

......
jmp [adress_of_label] ; indexed jump to label

As You can redefine the value of "adress_of_label" in
every part of Your program, You can branch without a
case-selection depending on status-values before You
jump. You can find a lot of examples in the assembler OS,
which I just published at http://www.rcfriz.de

This OS is made of only 30K code, but exceeds the well
known M$DOS in every respect - only the first step to a
mature kernel...
There a possibilities in using pointers
to adress values in structs - but if You know, how mighty
a
mov eax,[es:eax+ebx*4+displacement]
is as
part of a loop, You will laugh about those ***,who like to
say "C is higher abstracted..." - again: it's lower
abstracted!
But there are other disadvantages, when
those higher level programs are linked. You need not only
more than one (higher) language for that purpose. You
need a crop of bureaucracy!
I'm sorry - english is not
my mothertongue. That's why I want to end with this...


15 Oct 2000 02:52 Avatar arach

Texts/common hangups.
okay, now for some real issue of interest if anyone is still active here:
1) Resources for info, PLEASE!
2) portability:
#ifdef __386__
asm
mov ....
end
#else
C code
#end

That is a really good thing, aye..

Now, for a quesiton/request..
is there anyone who has a tool/toolinprogress to search for repeated structures of code within a program, or several?

that is, not to search for the -same- structure, but to search for alike ones...

for x in 0 to 100 do
for y in 0 to 1000 do
for x2 in x-1 to x+1 do
for y2 in y-1 to y+1 do
colour(x,y)+=colour(x2,y2)
end end
colour(x,y)/=3^3
end end

or whatever..
run it on source, find where 1) the programmer 2) the compiler
is repeating code wich can be reused..
and also, code wich can be fast changed into some more optimized roll of assembler.
there are always certain things that are used more often in programs as assembler.. ideas/thoughts/andsoon?

09 Oct 2000 11:22 Avatar risinghigh

ASSEMBLER
i have written many assembler programs for many machines, like c64, snes, amiga, palm, windows
ans so on. and i must say, the best structured and best optimized programes can only been done
with assembler. its sadly true that the portability of programs for different plattforms is an disease
or nearly impossible,m also learning the new registers, code and whatever is hard to. but its still
true and even will ever be, that an ARTISTIC assembler programmer that knows how to save
tact cycles will ever create a faster code than any c fuckcompiler will do..

..to be honest programming these days, or whatever ppl call these java things went curious these
days.. so who cares..

07 Oct 2000 23:26 Avatar dvl83t9t

Optimization
A few comments on this article, before I throw it away completely and provide a replacement. ;-)

The Intel Optimization Handbook (manual? appendix? docu-thingy? After you read the entire text of the thing, you'll forget its title too ;-) is perfect reading for those long black winter nights. Just make sure you haven't eaten first.

Optimization on real CPU architectures is a relatively simple task: you simply slot data into registers until you run out of program, and you arrange to use the parallel execution units in a relatively straightforward way. The bulk of the optimization time is spent analyzing the code's data flow structure and selecting from so many orthogonal optimization possibilities. Optimizing on RISC is a relatively straightforward operation, but it requires doing a lot of tedious counting that humans are not so good at. Often on RISC architectures the opcodes have concurrency issues with other opcodes, which humans are also not so good at.

Contrast with the IA32 architecture, where most of the task of "optimization" is recoding the program in such a way that it's semantically equivalent, but avoids all the hardware bottlenecks. After you've finished performing relatively expensive optimizations on IA32, you get performance roughly equivalent to what you would get from a RISC architecture before you tried doing optimization at all--and worse, you've already allocated all of the chip's actual resources, so you can't optimize any further. The bulk of the optimization time is spent searching a cookbook of pattern-matching algorithms to find idioms that are slow on IA32 hardware and transparently replacing them with idioms that are fast(er). Humans are good at IA32 optimization simply because they have more imagination than most compilers.

Modern IA32 has a lot of design trade-offs. 8 and 32-bit register operations are twice as fast as 16-bit ones, because Pentiums can execute two 8/32-bit opcodes per clock cycle but only one 16-bit opcode. The parallelization rhythm (the exact arrangement of opcodes to fill up the instruction prefetching pipeline so that you can execute instructions in parallel) within a Pentium I/II/III is different for each CPU generation. REAL CPU's have floating-point registers (not a stack), extension instruction sets that are not mutually exclusive (nor cost 100+ cycles to switch from one to another), and branch prediction that actually works right the first time. IA32 has weird opcodes from the 8086 era that do useful things in exceptional cases.


Another thing to consider, one which I haven't seen mentioned here yet, is that modern CPU's have a huge gap between the nominal clock rate of the CPU and the performance of the actual system. Sure, it says 1GHz on the box, but actually only 0.1% of the memory of the machine really runs at that clock rate--much less, if you count disk space as memory (and with applications that swap or use databases, disk space does count as memory). The other 99% of the RAM runs almost 10 times slower--even slower than that if there's evil RAM tricks like fast-page-mode going on. This is why some programs crawl at a snail's pace on Celerons while they fly on Pentiums or AMD chips: "it's the cache, stupid..."

The sad fact is, no amount of assembly code hand-optimization can compensate for even a handful of L2 cache misses. Your program will lose dozens of CPU cycles for every cache miss, and you won't get them back without redesign at or above the C language level. Conversely, you can get them back by redesigning your data flow, and you don't need assembly language to do that. Also, optimizations of this kind really are portable, at least to other CPU's where the data is the same size and shape, as it were.

My own approach to optimization is the same as my approach to debugging and security: don't put bugs in the program to begin with, don't introduce security holes into the program to begin with, and don't make the program slow to begin with.

Optimization, like debugging and security, begins at the design stage, and carries through until release (and even beyond, if the program is actively maintained after release).

0. Be aware of what the goals of your program really are. If your program has to run in some amount of time between one and ten hours and produce results read by millions of people, it probably doesn't need a lot of optimization, but it might need to be debugged and secure. If it has to run once every 33 milliseconds, but has limited communication with the outside world, it might need more optimization than security.

1. Be aware of the cost of correcting detected mistakes versus undetected ones. Don't write code to specifications that are provably wrong on paper. Don't integrate code into the program without testing it before and after integration.

2. Be aware of the cost of correcting security flaws before and after they become enshrined in working code. Have a security policy in mind when you input data, and carry it through until you're finished with it. Do this for all programs, even if the policy you decide on is "you will run this on a non-networked machine using data from trusted users only."

3. Be aware of the architecture of the CPU, the architecture of the compiler, and the relative costs of your data structures and algorithms. Not all memory runs at the same speed, so size can be as important as speed within a critical section of code. Know the memory structure (L1/L2 cache size) and relative speeds of the target architecture, and keep the speed-critical loops within those sizes. Don't pick algorithms that have O(N^x) running time when algorithms with O(N^(x-1)) running time are available, assuming the cost of implementing both algorithms is equal. Don't optimize a loop to save seven clock cycles per iteration if there's an L2 cache miss at the end--the latency cost for accessing main memory is so high that you'll never notice the difference. Similarly, don't design a data structure that causes unnecessary L2 cache misses--or even L1 misses, if you can avoid them. If you have multiple CPU's with threads, don't use a design that forces all of them to wait on a single lock. And so on.

On the other hand, if you know you're going to take an L2 cache hit, and you can't do anything about it, you can afford to spend a cycle or two doing something else, like run-time correctness insertions, or interpreting the byte-codes of your program.

07 Oct 2000 22:03 Avatar jkominek

register keyword
I'd like to point out that very few, if any compilers use perfect register allocation algorithms. Your compiler does not understand your algorithm, and cannot. I have personally seen cases where specifying which variables to keep in registers improved speed drastically.


(There are a number of other flaws with this post which I'm not going to take the time to itemize. Go read your compiler's source code and see if it looks like it knows more about your CPU than you. If so, don't waste your time writing assembly. If you do know more than it, then you might as well spend as much time as you want on it, because it doesn't matter to the rest of the world if you waste your time.)

Screenshot

Project Spotlight

Kigo Video Converter Ultimate for Mac

A tool for converting and editing videos.

Screenshot

Project Spotlight

Kid3

An efficient tagger for MP3, Ogg/Vorbis, and FLAC files.