Saturday, 15 September 2007

Spectral norm

The Computer Language Shootout is a well-known set of benchmarks that is still gradually evolving. Although the shootout is infamous because most of its benchmark tasks are trivially reducible (e.g. computing constants like pi and prime numbers) and that creates a strong subjective bias because submissions must be rejected or selectively crippled, a few contributors have created better benchmark tasks such as k-nucleotide and spectral-norm.

Our primary machines for high performance computing using OCaml are based upon Athlon 64 CPUs. At least on these machines, the current OCaml implementation of the spectral-norm benchmark can be made significantly faster. This post describes how.

The core of the spectral-norm benchmark is a pair of nested loops. For no particular reason, the shootout requires that this functionality be replicated in two separate functions. Let us focus on the first such function as the other is virtually identical:

let eval_A_times_u u v =
let n = Array.length v - 1 in
for i = 0 to n do
v.(i) <- 0.0;
for j = 0 to n do
v.(i) <- v.(i) + eval_A i j * u.(j)
done
done
This implementation, currently on the shootout, takes 14.4s to run with n=5500 on this machine (4400+ Athlon64).

Most of the time is, of course, spent in the inner loop over j. However, the current implementation on the shootout does not hoist the subexpression v.(i) from the inner loop. This can be done by accumulating the result of the inner loop in a variable and storing it in v.(i) only once, after the inner loop completes:

let eval_A_times_u u v =
let n = Array.length v - 1 in
for i = 0 to n do
let vi = ref 0.0 in
for j = 0 to n do
vi := !vi +. eval_A i j *. u.(j)
done;
v.(i) <- !vi
done
This brings the time taken down to 12.0s.

The next most important optimization in the OCaml is to remove bounds checking. Although this can be done globally using the -unsafe flag as the shootout does, this is very bad practice. Specifically, many OCaml programs assume that out of bounds array accesses incur an appropriate exception and compiling them with -unsafe renders such code invalid, essentially changing the semantics of the language.

Using explicitly unsafe array accesses on minimal performance-critical portions of code is much more preferable and this benchmark is a good illustration of how this can be done. Specifically, there is only one array access in the inner loop, to u.(j), and this can be made unsafe. To ensure correctness, we can simply derive n from the length of u rather than v:

let eval_A_times_u u v =
let n = Array.length u - 1 in
for i = 0 to n do
let vi = ref 0.0 in
for j = 0 to n do
vi := !vi +. eval_A i j *. Array.unsafe_get u j
done;
v.(i) <- !vi
done
This brings the time down to 9.4s, over 50% faster than the original.

The forms of optimization described here are generally-applicable and may be used to improve the performance of a wide variety of OCaml programs. For detailed and comprehensive information on high-performance OCaml, read the book OCaml for Scientists. For general information on the OCaml programming language and tutorial articles, read the OCaml Journal.

No comments: