Profiling and tuning (Gauche Users’ Reference)

3.6 Profiling and tuning

If you find your script isn’t running fast enough, there are several possibilities to improve its performance.

It is always a good idea to begin with finding which part of the code is consuming the execution time. Gauche has a couple of basic tools to help it. A built-in sampling profiler, explained in the following subsection, can show how much time is spent in each procedure, and how many times it is called. The gauche.time module (gauche.time - Measure timings) provides APIs to measure execution time of specific parts of the code.

Optimization is specialization—you look for the most common patterns of execution, and put a special path to support those patterns efficiently. Gauche itself is no exception, so there are some patterns Gauche can handle very efficiently, while some patterns it cannot. The next subsection, Performance tips, will give you some tips of how to adapt your code to fit the patterns Gauche can execute well.

3.6.1 Using profiler

As of 0.8.4, Gauche has a built-in profiler. It is still experimental quality and only be tested on Linux. It isn’t available for all platforms. It works only in single-thread applications for now.

To use the profiler non-interactively, give -ptime command-line option to gosh.

% gosh -ptime your-script.scm

After the execution of your-script.scm is completed, Gauche prints out the table of functions with its call count and its consumed time, sorted by the total consumed time.

Profiler statistics (total 1457 samples, 14.57 seconds)
                                                    num    time/    total
Name                                                calls  call(ms) samples
---------------------------------------------------+------+-------+-----------
combinations*                                       237351  0.0142   337( 23%)
(lset-difference #f)                               1281837  0.0020   256( 17%)
(make-anchor make-anchor)                          3950793  0.0005   198( 13%)
member                                             4627246  0.0004   190( 13%)
filter                                              273238  0.0030    81(  5%)
every                                              1315131  0.0004    59(  4%)
(lset-difference #f #f)                            1281837  0.0004    54(  3%)
(make-entry make-entry)                             730916  0.0005    40(  2%)
(clear? #f)                                         730884  0.0005    33(  2%)
(initialize #f)                                     599292  0.0005    32(  2%)
fold                                                237307  0.0013    30(  2%)
acons                                               806406  0.0004    29(  1%)
clear?                                               33294  0.0084    28(  1%)
(combinations* #f)                                  805504  0.0002    15(  1%)
(make-exit make-exit)                               730884  0.0002    15(  1%)
lset-difference                                     237318  0.0006    15(  1%)
reverse!                                            475900  0.0001     6(  0%)
(fold <top> <top> <list>)                           237323  0.0003     6(  0%)
procedure?                                          238723  0.0002     4(  0%)
pair?                                               237307  0.0001     3(  0%)
 :
 :

Note that the time profiler uses statistic sampling. Every 10ms the profiler interrupts the process and records the function that is executed then. Compared to the individual execution time per function call, which is the order of nanoseconds, this sampling rate is very sparse. However, if we run the program long enough, we can expect the distribution of samples per each function approximately reflects the distribution of time spent in each function.

Keep in mind that the number is only approximation; the number of sample counts for a function may easily vary if the program deals with different data sets. It should also be noted that, for now, GC time is included in the function in which GC is triggered. This sometimes causes a less important function to "float up" to near-top of the list. To know the general pattern, it is a good custom to run the program with several different data sets.

On the other hand, the call count is accurate since Gauche actually counts each call.

Because all functions are basically anonymous in Scheme, the ’name’ field of the profiler result is only a hint. The functions bound at toplevel is generally printed with the global variable name it is bound at the first time. Internal functions are printed as a list of names, reflecting the nesting of functions. Methods are also printed as a list of the name and specializers.

The profiler has its own overhead; generally the total process time will increase 20-30%. If you want to turn on the profiler selectively, or you’re running a non-stop server program and want to obtain the statistics without exiting the server, you can call the profiler API from your program; see Profiler API, for the details.

3.6.2 Performance tips

Don’t guess, just benchmark. It is the first rule of performance tuning. Especially for the higher-level languages like Scheme, what impacts on performance greatly depends on the implementation. Certain operations that are very cheap on an implementation may be costly on others. Gauche has such implementation-specific characteristics, and to know some of them would help to see what to look out in the benchmark results.

"80% of execution time is spent in 20% of the code" is another old saying. Don’t obscure your code by "potential" optimization that has little impact on the actual execution. We describe some tips below, but it doesn’t mean you need to watch them all the time. It is better to keep most of the code clean and easy to understand, and only do tricks on the innermost loop.

Ports: To satisfy the specification of SRFI-18 (Threading), every call to I/O primitives of Gauche locks the port. This overhead may be visible if the application does a lot of I/O with smaller units (e.g. every bytes). The primitives that deals with larger unit, such as read and read-uvector, are less problematic, since usually they just lock the port once per call and do all the low-level I/O without the lock overhead. (Note: this doesn’t mean those primitives guarantee to lock the port throughout the execution of the function; furthermore, the port locking feature is optimized for the case that port accesses rarely collide. If you know it is possible that more than one threads read from or write to the same port, it is your responsibility to use mutex explicitly to avoid the collision.)

If you find out the locking is indeed a bottleneck, there are couple of things you can consider: (1) Try using the larger-unit primitives, instead of calling the smaller-unit ones. (2) Use with-port-locking (see Port and threads) to lock the port in larger context.

Strings: Because of the multibyte strings, two operations are particularly heavy in Gauche: string mutation and indexed string access. It is a design choice; we encourage the programming style that avoids those operations. When you sequentially access the string, string ports (see String ports) provide a cleaner and more efficient way. When you search and retrieve a substring, there are various higher-level primitives are provided (see String utilities, Regular expressions, and srfi.13 - String library, for example). If you’re using strings to represent an octet sequence, use uniform vectors (see Uniform vectors) instead.

Deep recursion: Gauche’s VM uses a stack for efficient local frame allocation. If recursion goes very deep (depending on the code, but usually several hundreds to a thousand), the stack overflows and Gauche moves the content of the stack into the heap. This incurs some overhead. If you observe a performance degradation beyond a certain amount of data, check out this possibility.

Generic functions: Because of its dynamic nature, generic function calls are slower than procedure calls. Not only because of the runtime dispatch overhead, but also because Gauche’s compile-time optimizer can’t do much optimization for generic function calls. You don’t need to avoid generic functions because of performance reasons in general, but if you do find single function call consuming a large part of execution time and it is calling a generic function in its inner loop—then it may be worth to modify it.

Redefining builtin functions: Gauche inlines some builtin functions if they are not redefined. Although sometimes it is useful to redefine basic functions, you may want to limit the effect. For example, put redefined functions in a separate module and use the module in the code that absolutely needs those functions replaced.

Closure creation: When you create a closure, its closing environment is copied to the heap. This overhead is small, but it still may be visible when a closure is created within an innermost loop that is called millions of times. If you suspect this is a problem, try disassemble the function. Gauche’s compiler uses some basic techniques of closure analysis to avoid creating closures for typical cases, in which case you see the local function’s bodies are inlined. If you see a CLOSURE instruction, though, it means a closure is created.

This list isn’t complete, and may change when Gauche’s implementation is improved, so don’t take this as fixed features. We’ll adjust it occasionally.