- Replacing the system malloc on OS X is Really Hard(TM).
- jemalloc was doing way too much red-black tree manipulation.
OS X uses the Mach-O format, and in order to completely replace the system malloc, it would be necessary to compile a custom libSystem. As far as I know, that has not been possible outside the confines of Apple since version 10.3 (2+ years ago). Even if it were possible, there would be all sorts of undesirable aspects to shipping a custom libSystem with Firefox; libSystem is a huge library, and binary compatibility issues would be a constant problem. So, the only remaining viable option is to subvert the malloc zone machinery. There is no supported method for changing the default zone, and furthermore, CoreFoundation directly accesses the default zone. Enough about that though; suffice it to say that I did find ways to subvert the malloc zone machinery.
Once Firefox was successfully using jemalloc for all memory allocation, I started doing performance tests. Memory usage differences were minor, but jemalloc was consistently slower than OS X's allocator. It took a lot of profiling for me to finally accept the hard truth: jemalloc was spending way too much time manipulating red-black trees. My first experimental solution was to replace red-black trees with treaps. However, this made little overall difference. So, the problem was too many tree operations, not slow tree operations.
After a bit of code review, it became clear that when I fixed a page allocation bottleneck earlier this year, I was overzealous with the application of red-black trees. It is possible to use constant-time algorithms based on linear page map data structures for splitting/coalescing sequential runs of pages, but I had re-coded these operations entirely using red-black trees. So, I enhanced the page map data structures to support splitting/coalescing, and jemalloc became markedly faster. For example, Firefox sped up by as much as ~10% on JavaScript-heavy benchmarks. (As a side benefit, memory usage went down by 1-2%).
In essence, my initial failure was to disregard the difference between a O(1) algorithm and a O(lg n) algorithm. Intuitively, I think of logarithmic-time algorithms as fast, but constant factors and large n can conspire to make logarthmic time not nearly good enough.