Question
We are working on a moderately-sized C++ code base (10Mloc) which through our optimization efforts is becoming uniformly slow.
This code base is a set of libraries which we combine to put them to work. When the general framework for how these libraries communicate was developed, there was some emphasis on performance; later, as more parts were added, the general framework wasn't changed much. Optimization was done when needed and as our hardware evolved, which made expensive early decisions apparent only much later. We are now at a point where further optimizations are much more expensive, since they would require rewriting large parts of the code base. We find ourselves approaching an undesirable local minimum, since we know that in principle the code should be able to run much faster.
Are there any successful methodologies that help decide which turns to take over the evolution of a code base towards a globally optimal solution, and that aren't easily misled by cheap local optimization opportunities?
EDIT
To answer the question how we currently profile:
We really only have two different scenarios in which this code can be used, both embarrassingly parallel. Profiling is done both with wall-clock time averaged over a large sample of inputs and with more detailed runs (instruction costs, branch mispredictions and caching issues). This works well since we run exclusively on our extremely homogeneous machines (a cluster of a couple thousand identical machines). Since we usually keep all our machines busy most of the time, running faster means we can look at additional new stuff. The issue is, of course, that when new input variations show up, they might incur a late-comer penalty: since we removed the most obvious micro-inefficiencies for the other use cases, we have possibly narrowed down the number of "optimally running" scenarios. The current answers rightly suggest separating data and algorithms so this becomes easier to adjust.
Explanation / Answer
I do not know of a general-purpose approach to this problem, but two somewhat related approaches worked well for me in the past: for lack of better terms, I called them bunching and horizontal optimization.
The bunching approach replaces a large number of short, fast operations with a single, slower-running, highly specialized operation that ultimately produces the same result.
Example: After profiling one particularly slow operation of our visual rule editor we discovered no "low-hanging fruit": not a single operation was taking more than 2% of the execution time, yet the operation as a whole felt sluggish. However, we discovered that the editor was sending a large number of small requests to the server. Even though the editor was quickly processing the individual replies, the number of request/response interactions had a multiplicative effect, so the overall operation took several seconds. After carefully cataloging the interactions of the editor during that long-running operation, we added a new command to the server interface. The additional command was more specialized: it accepted the data required to perform a subset of short operations, explored data dependencies to figure out the final set of data to return, and provided a response containing the information needed to complete all the individual small operations in a single trip to the server. This did not reduce the processing time in our code, but it cut a very significant amount of latency by removing multiple expensive client-server round trips.
Horizontal optimization is a related technique in which you eliminate "slowness" that is thinly distributed among multiple components of your system, using a particular feature of your execution environment.
Example: After profiling a long-running operation, we discovered that we were making a lot of calls across the application domain boundary (this is highly specific to .NET). We could not eliminate any of the calls, and we could not bunch them together: they were coming at different times from widely different sections of our system, and the things they requested depended on the results returned from prior requests. Each call required serialization and deserialization of a relatively small amount of data. Again, the individual calls were short in duration, but very large in number. We ended up designing a scheme that avoided serialization almost entirely, replacing it with passing a pointer across the app domain boundary. This was a big win, because many requests from entirely unrelated classes instantly became much faster as a result of applying a single horizontal solution.