I just tried the -fprofile-generate/-fprofile-use feature of gcc, and I must say I was more than surprised at the benefit. We had already switched our resolvers to different server software because BIND couldn't handle the load, and that in itself was a big improvement. Following the author's suggestion, I then tried this feature of the gcc compiler, and wow.
When I say wow, let me define that. With BIND the server was on its knees, begging for death and causing massive backups on all our email servers. With the switch to pdns-resolver, load went down to 30-50% CPU usage. With the recompile, it hangs out around 8-30%. The server is barely breaking a sweat now.
I had to cross-compile on a computer that had gcc, and then move the binary to the server. If you need to try something similar, just be aware that when the program exits, the profiling information will land in the exact same path as where you compiled the program. I had assumed it would land in the current working directory. There appears to have been a -fprofile-dir option at one time, but it doesn't seem to exist anymore. No big deal: the program just creates all the needed directories if they don't exist (which they didn't in my case). It is something to think about, though, if you have a similar directory hierarchy on the other server and don't want to lose the files in a mess of other stuff or overwrite things.
When you compile the binary this way and run it, it gathers data and dumps it in files when the program stops.
All the files have .gcda and .gcno extensions. They hold the profiling information that gcc uses on the recompile to decide where it should spend its time optimizing and where it shouldn't bother. Like I said, this appears to work really well.