Following an interesting thread in comp.lang.python, I spent some time implementing convolution and liftingscheme based implementations of the discrete wavelet transform in C++, OCaml and F#.
The original Python implementation used the lifting scheme algorithm and was implemented using slicing:
max_level = int(math.floor(math.log(x.shape[axis],2)))
if level is None:
level = max_level
else:
assert level <= max_level, "Level param too high"
coeffs = []
cA = x.swapaxes(0, axis)
while level > 0:
level = 1
even = cA[::2]
odd = cA[1::2]
even += C1*odd
odd[0] = C2*even[0] + C3*even[1]
odd[1:] = C2*even[1:] + C3*even[:1]
even[:1] = odd[1:]
even[1] = odd[0]
even *= C4
odd *= C5
cA, cD = even, odd
coeffs.append(cD.swapaxes(axis, 0))
coeffs.append(cA.swapaxes(axis, 0))
coeffs.reverse()
return coeffs
The Python implementation eagerly allocates intermediate arrays that are trivial combinations of small numbers of elements from existing arrays.
There are several different ways to translate this into OCaml. If we start from scratch with the task of computing this wavelet transform then we are most likely to write in a C style using convolution:
let rec d4_aux tmp a n =
let n2 = n lsr 1 in
for i=0 to n22 do
tmp.[i] < a.[i*2] *. h0 +. a.[i*2+1] *. h1 +. a.[i*2+2] *. h2 +. a.[i*2+3] *. h3;
tmp.[i+n2] < a.[i*2] *. g0 +. a.[i*2+1] *. g1 +. a.[i*2+2] *. g2 +. a.[i*2+3] *. g3;
done;
tmp.[n21] < a.[n2] *. h0 +. a.[n1] *. h1 +. a.[0] *. h2 +. a.[1] *. h3;
tmp.[2*n21] < a.[n1] *. g0 +. a.[0] *. g1 +. a.[1] *. g2 +. a.[2] *. g3;
Array.blit tmp 0 a 0 n;
if n > 4 then d4_aux tmp a (n lsr 1)
let d4 a =
let tmp = Array.create (Array.length a) 0. in
d4_aux tmp a (Array.length a)
This convolution based approach is 5x faster than the Python. However, some people argued that this is an unfair comparison because the implementations in different languages are using different algorithms.
We can also mimic the slicing approach used by Python. A naive translation generates many intermediate arrays (7 array allocations per iteration) but this is easily reduced by composing closures rather than allocating new arrays (a deforesting optimisation):
let rec d4_aux cA = function
1 > [cA]
n >
let n = n/2 in
let even i = cA.(2*i) in
let odd i = cA.(2*i + 1) in
let even i = even i +. c1 *. odd i in
let odd = function
0 > odd 0 . c2 *. even 0 . c3 *. even (n1)
i > odd i . c2 *. even i . c3 *. even (i1) in
let even = function
i when i = n1 > even i . odd 0
i > even i . odd (i+1) in
let even = init n (fun i > c4 *. even i) in
let odd = init n (fun i > c5 *. odd i) in
odd :: d4_aux even n
let d4 a = d4_aux a (Array.length a)
This implementation allocates only two arrays and fills them from a composition of closures. This is slower than the cachefriendly convolutionbased approach but is still 60% faster than the Python.
Cheers,
Jon.
Xavier Leroy's "standard lecture on threads"

Xavier Leroy’s standard lecture on threads post from the camllist in 2002
seems to have disappeared from the INRIA archives so I am reproducing it
here fo...
1 week ago