Profiling and Optimization
First, this article assumes that you're working with an existing and essentially complete piece of code or product that you want to optimize. However, if you are working on performance-critical code (rendering code falls into this category, for example), you should have started thinking about performance long before you reach this stage. Premature micro-optimization is evil, but algorithmic optimization should always be in the back of your mind.
- ALWAYS profile before and while performing optimizations. There are ALWAYS surprises, and performance bottlenecks are often not where you expect them to be - especially if there are other people writing code besides you. Plus, a good profiler will help you figure out exactly what your problem is - whether it's memory access or CPU.
- Iterate frequently. Don't take one profile, write a week's worth of code, then take another profile. As you change the code, the bottlenecks will change. Don't spend a lot of time optimizing one piece of code down to zero when cutting it in half already makes something else the bottleneck.
- Choosing a test case:
  - If possible, have a reproducible test case in your production environment.
  - A good test case may not be steady-state. It can be more important for your test case to reflect the production environment than for it to be perfectly repeatable.
  - If you are performing micro-optimizations on something you can't reproduce reliably, generate a good synthetic test case that heavily exercises what you're working on, and optimize that.
- Choosing the profiler:
  - If a profiler is slowing down your application, it is probably distorting your results. This is usually because the profiler adds a large amount of instrumentation, which itself does a lot of work. In addition, in network and graphics applications, a large slowdown can change things such as the amount of network traffic received and the amount of time spent blocked waiting for the GPU.
  - For the above reason, I generally prefer time-based sampling profilers over instrumented call-graph profilers. The one disadvantage of most time-based profilers is that they generally don't have call-chain information (they only keep track of the function where CPU time was spent, not where it was called from), although at least one profiler (Shark on OS X) is capable of providing it without performance degradation.
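If you do end up building a synthetic test case for a micro-optimization, a small timing harness makes before/after comparisons less noisy. The following is only a sketch in C++: `median_runtime_ns` is a hypothetical helper (it is not part of any existing codebase), and it assumes that the median of several runs is a reasonable summary of the cost. It complements a real profiler; it does not replace one.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

// Hypothetical helper: runs `work` `repeats` times and returns the median
// wall-clock duration of a single run, in nanoseconds. For an even number
// of runs this is the upper of the two middle samples.
inline std::int64_t median_runtime_ns(const std::function<void()>& work,
                                      int repeats) {
    assert(repeats > 0);
    std::vector<std::int64_t> samples;
    samples.reserve(static_cast<std::size_t>(repeats));

    for (int i = 0; i < repeats; ++i) {
        // steady_clock is monotonic, so the difference is never negative.
        const auto start = std::chrono::steady_clock::now();
        work();
        const auto stop = std::chrono::steady_clock::now();
        samples.push_back(
            std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start)
                .count());
    }

    // nth_element partially sorts so the middle element is the median.
    const auto mid =
        samples.begin() + static_cast<std::ptrdiff_t>(samples.size() / 2);
    std::nth_element(samples.begin(), mid, samples.end());
    return *mid;
}
```

Wrap the code under test in a lambda, for example `median_runtime_ns([&] { update_particles(); }, 25)`, and record the result before and after each change. `update_particles` is a stand-in name here. The median is used rather than the mean so that a single run disturbed by the scheduler does not skew the number.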
A list of profilers that we've used at Linden:
- Shark (Mac) - my preferred profiler right now. Sampling profiler, but it still manages to have call chain information. Great UI. Unfortunately, can't use it for everything, as it's Mac only, and graphics stuff is very different on the Mac vs. PC.
- oprofile (Linux) - another sampling profiler. Has some third-party UIs, and call chain information via --callgraph=N. Fairly easy to use on Linux.
- VTune (Windows) - sampling and call graph profiler. The UI is terrible - but there aren't many other sampling profilers available on Windows.
- GlowCode (Windows) - instrumented call-graph profiler, but a lot faster than typical - so it runs in near real time. If you want call chain information on Windows, this may be the way to go.
- Obviously, always optimize based on the results of a profile.
- Always try to do algorithmic and architecture-based optimizations before code optimizations.
- Memory and cache considerations are often MUCH more important than instruction count. This is a relatively recent phenomenon. On some architectures, you can run dozens of instructions or more in the amount of time that it takes to fetch data from main memory. Thus, when optimizing, you may want to think more about how your algorithms access data than the number of instructions that they use.
- Don't write assembly unless you absolutely have to. It's not easy to make cross-platform, and unless you are REALLY good, compilers are often better than you are. If you need to, restructure the C/C++ code to make it clearer to the compiler what you are attempting to do.
- On the other hand, understand assembly if you're doing micro-optimizations. Profilers will tell you what line your problem is on according to the symbol database, but that line can expand out into hundreds of instructions, only one of which is the problem. Understanding assembly will let you know if that instruction is the divide, or actually the memory access fetching the value to be divided.
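
To illustrate the point about memory access versus instruction count, here is a minimal C++ sketch. The function names are made up for illustration. Both functions do essentially the same arithmetic on a grid stored row by row in a flat array, but the first walks memory contiguously while the second jumps `cols` elements between accesses. On large grids the second version typically runs noticeably slower on cache-based hardware, though by how much depends on the machine and the compiler, so profile it rather than taking this on faith.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sums a rows x cols grid stored row-major in a flat vector, visiting
// elements in the same order they are laid out in memory.
inline long long sum_row_order(const std::vector<int>& grid,
                               std::size_t rows, std::size_t cols) {
    assert(grid.size() == rows * cols);
    long long total = 0;
    for (std::size_t r = 0; r < rows; ++r) {
        for (std::size_t c = 0; c < cols; ++c) {
            total += grid[r * cols + c];  // consecutive addresses
        }
    }
    return total;
}

// Same result and roughly the same instruction count, but each access is
// `cols` elements away from the previous one, which is much less
// cache-friendly once the grid no longer fits in cache.
inline long long sum_column_order(const std::vector<int>& grid,
                                  std::size_t rows, std::size_t cols) {
    assert(grid.size() == rows * cols);
    long long total = 0;
    for (std::size_t c = 0; c < cols; ++c) {
        for (std::size_t r = 0; r < rows; ++r) {
            total += grid[r * cols + c];  // strided addresses
        }
    }
    return total;
}
```

The only difference between the two is the loop order, which is exactly the kind of change that an instruction count will not reveal but a profiler that reports memory stalls or cache misses will.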