`Python`

has become my preferred language when it comes to programing. It is so
simple to import from its vast library of packages and then build upon them to
get things working. Yet, at some point, when all the design phase is over, and
the software has to perform it is so slow.

In this case, I’m referring to numerical calculations. Indeed the numpy
and
scipy
modules are of great help, letting all the numerical calculations to run
faster in their `C`

or `Fortran`

coded engines. But what of the algorithm I
develop in `Python`

? It is going to be the slow part of my work. Thus I have to
avoid writing too much `Python`

and prefer calling all this `C`

coded modules.
**BUT** is not enough, because `numpy`

and `scipy`

only provide very general
functions and it is up to the user to make a use out of them.

Once the user has its algorithm working, he needs to speed it up. Coding it all
into `C`

might be too much of a hassle and error prone. He also might loose many
of the `Python`

packages he depended on in the first place and the code will
become less readable. The alternatives for now are to use Cython
or the not so
new project numba
. These two are designed to establish the bridge between
`Python`

’s ease of use and `C`

’s performance. My problem is that all the simple
examples and tutorials you get from them are to highlight the huge performance
boost you can get, yet they don’t really match my real life situations. Their
examples are based on situations where the complete algorithm is stand alone and
the `Python`

interpreter is slowing everything down. Thus a quick call to
`numba`

or `Cython`

’s static typesetting will just solve the problem and give
you 3 orders of magnitude performance boost.

My story begins because I was already calling the fast `C`

functions
of `numpy`

and `scipy`

, and those, should be the most time consuming
functions of my algorithm. But they weren’t, there was enough `Python`

code
to slow it down. So my first hope was to call `numba`

, but there was no
performance gain as it can’t deal with my calls to `scipy`

functions. So
`Cython`

was the way to go, I’m certainly not aware of the many possibilities
`Cython`

permits to speed up my code, I still believe there is room for
improvement, and I would be happy to hear from them in the comments.

So I now present you how I solved my problem and gained a performance
boost of 4x. Indeed it is not so glamorous or amazing as 1000x you get
from the advertised `Cython`

version of the code, but it is a new start
in performance gains over my python code.

So before I start I will stress: **HAVE TESTS FOR YOUR CODE!**
It doesn’t help at all to get faster execution times just to arrive to
the wrong results faster. Debugging is a pain, and it is more painful
every time your new modifications break old code. Testing lets you test
over smaller pieces of your code and make it easier to track what broke
your code. I love `Python`

so much because there are very simple frameworks
to write your tests, so just use them. When is a good time to write your
test? If you can, before every function you develop. After you wrote your package
is also good. Even if it produces the correct results from the beginning
you have to keep track it continues to do so as your code evolves.

The goal of this work is to implement a Quantum Monte Carlo algorithm
to solve the Anderson impurity model and use it in the context of DMFT.
There are already many implementations of this code spread around the
world, many of them in `Fortran`

and `C/C++`

so why not use them? Well, first
they are more complicated to understand, modify, and some times to big
packages. Second, I needed to learn and modify it on my own style.

The complete software is freely available on github at learn-dmft
, matching
the first release of the code(because the code itself will evolve as new ideas
come in, bugs appear and get fixed. I will focus in this post on the local
changes to speed up the code. These code snippets are not aimed to run
independently but are give in the aim to provide some insight to the work
done.

Lets start by profiling the code. Here you see only the relevant parts
I want to focus on for this short example. Here one sees that most of time
is spent inside the `imp_solver`

which performs the Monte Carlo `update`

,
that needs to update a dense matrix each time, `gnew`

. Almost all of the time
would need to be spend in the matrix update, but because I’m using `scipy`

’s
interface to `BLAS`

it just doesn’t take that much time and it is the Metropolis
update scheme which is manually implemented in `Python`

that takes all the time.

```
1 1214757 function calls (1212402 primitive calls) in 3.903 seconds
2
3 Ordered by: cumulative time
4
5 ncalls tottime percall cumtime percall filename:lineno(function)
6 1 0.029 0.029 3.591 3.591 hirschfye.py:52(imp_solver)
7 9000 1.603 0.000 3.557 0.000 hirschfye.py:83(update)
8 260114 1.635 0.000 1.879 0.000 hirschfye.py:142(gnew)
9 520228 0.244 0.000 0.244 0.000 {method 'copy' of 'numpy.ndarray' objects
```

Based on this example for the Ising model
I wanted to test `numba`

, but it did
not help, because `numba`

can’t, to my knowledge, compile the calls I do to
`scipy`

. Static typing in `Cython`

didn’t work either as the main bottleneck
still originates from calling the `scipy`

modules of `BLAS`

. So I need to use
`Cython`

to perform the direct call into `BLAS`

. It was thanks to tokyo
, a
`Cython`

wrapper to `BLAS`

that I could learn this. Although `scipy`

seems to be
inserting in its development branch the `Cython`

wrappers to `BLAS`

, that will
only become usable at a later time.

```
1from scipy.linalg.blas import dger
2def gnew(g, v, k, sign):
3 """Quick update of the interacting Green function matrix after a single
4 spin flip of the auxiliary field. It calculates
5
6 .. math:: \alpha = \frac{\exp(2\sigma v_j) - 1}
7 {1 + (1 - G_{jj})(\exp(2\sigma v_j) - 1)}
8 .. math:: G'_{ij} = G_{ij} + \alpha (G_{ik} - \delta_{ik})G_{kj}
9
10 no sumation in the indexes"""
11 dv = sign*v*2
12 ee = np.exp(dv)-1.
13 a = ee/(1. + (1.-g[k, k])*ee)
14 x = g[:, k].copy()
15 x[k] -= 1
16 y = g[k, :].copy()
17 g = dger(a, x, y, 1, 1, g, 1, 1, 1)
```

Recoding this into Cython syntax it becomes

```
1cdef extern from "cblas.h":
2 enum CBLAS_ORDER: CblasRowMajor, CblasColMajor
3 void lib_dger "cblas_dger"(CBLAS_ORDER Order, int M, int N, double alpha,
4 double *x, int dx, double *y, int dy,
5 double *A, int lda)
6cimport numpy as np
7import cython
8from libc.math cimport exp
9
10@cython.boundscheck(False)
11@cython.wraparound(False)
12@cython.cdivision(True)
13cpdef gnew(np.ndarray[np.float64_t, ndim=2] g, double v, int k, double sign):
14 cdef double dv, ee, alpha
15 cdef int N = g.shape[0]
16 dv = sign*v*2
17 ee = exp(dv)-1.
18 alpha = ee/(1. + (1.-g[k, k])*ee)
19 cdef np.ndarray[np.float64_t, ndim=1] x = g[:, k].copy()
20 cdef np.ndarray[np.float64_t, ndim=1] y = g[k, :].copy()
21
22 x[k] -= 1.
23 lib_dger(CblasColMajor, N, N, alpha,
24 &x[0], 1, &y[0], 1, &g[0,0], N)
```

It is important to keep in mind that `numpy`

normally uses **C** ordered arrays,
but if you happen to use `scipy.linalg.blas`

functions in some steps, your
`numpy`

arrays will become **Fortran** ordered. Then it is important to keep that
ordering when you call `BLAS`

. (This was my big bug, and why you need tests). So
pay special attention of the keyword of `CblasColMajor`

to keep the `Fortran`

Column ordering provided by the `numpy`

arrays. The copy instruction is also
necessary, and there should be better ways of data copying and not creating new
`numpy`

arrays.

```
1 695841 function calls (693486 primitive calls) in 2.704 seconds
2
3 Ordered by: cumulative time
4
5 ncalls tottime percall cumtime percall filename:lineno(function)
6 1 0.028 0.028 2.389 2.389 hirschfye.py:52(imp_solver)
7 9000 1.273 0.000 2.357 0.000 hirschfye.py:83(update)
8 261426 1.026 0.000 1.026 0.000 {hffast.gnew}
```

This first action dropped the amount of python calls almost by half. The new
`Cython`

function is faster by almost half. The changing number of calls is
because I’m a bad profiler and I’m not using the same seed for the random
number generator, but in this case is has quite negligible effects. The total
execution time dropped by a bit more than a second.

Now it’s time to redo the `update`

function that is in charge of the Metropolis
algorithm. It has a very simple structure. It uses most of its time calling
the random number generator and doing the algebra operations over simple floats

```
1def update(gup, gdw, v):
2 for j in range(v.size):
3 dv = 2.*v[j]
4 ratup = 1. + (1. - gup[j, j])*(np.exp(-dv)-1.)
5 ratdw = 1. + (1. - gdw[j, j])*(np.exp( dv)-1.)
6 rat = ratup * ratdw
7 rat = rat/(1.+rat)
8 if rat > np.random.rand():
9 v[j] *= -1.
10 gnew(gup, v[j], j, 1.)
11 gnew(gdw, v[j], j, -1.)
```

But once rewritten into `Cython`

it becomes:

```
1cdef extern from "gsl/gsl_rng.h":
2 ctypedef struct gsl_rng_type:
3 pass
4 ctypedef struct gsl_rng:
5 pass
6 gsl_rng_type *gsl_rng_mt19937
7 gsl_rng *gsl_rng_alloc(gsl_rng_type * T)
8 double uniform "gsl_rng_uniform"(gsl_rng *r)
9
10cdef gsl_rng *r = gsl_rng_alloc(gsl_rng_mt19937)
11
12@cython.boundscheck(False)
13@cython.wraparound(False)
14@cython.cdivision(True)
15cpdef update(np.ndarray[np.float64_t, ndim=2] gup,
16 np.ndarray[np.float64_t, ndim=2] gdw,
17 np.ndarray[np.float64_t, ndim=1] v):
18 cdef double dv, ratup, ratdw, rat
19 cdef int j, i, up, dw, pair, N=v.shape[0]
20 for j in range(N):
21 dv = 2.*v[j]
22 ratup = 1. + (1. - gup[j, j])*(exp(-dv)-1.)
23 ratdw = 1. + (1. - gdw[j, j])*(exp( dv)-1.)
24 rat = ratup * ratdw
25 rat = rat/(1.+rat)
26 if rat > uniform(r):
27 v[j] *= -1.
28 gnew(gup, v[j], j, 1.)
29 gnew(gdw, v[j], j, -1.)
```

Notice that in this case I’m again calling external functions available in `C`

like the ones of the GSL
libraries. After these changes the end result
is a completely cythonized Metropolis Monte Carlo update function that
runs entirely on `C`

. As it is shown by the profiler.

```
1 137415 function calls (135060 primitive calls) in 1.182 seconds
2
3 Ordered by: cumulative time
4
5 ncalls tottime percall cumtime percall filename:lineno(function)
6 1 0.024 0.024 0.870 0.870 hirschfye.py:52(imp_solver)
7 9000 0.841 0.000 0.841 0.000 {hffast.update}
```

The amount of `Python`

function calls dropped again this time by 5 times. And the
`update`

function internally calling `gnew`

requires in total `0.841s`

. Dropping
the total time used by the impurity solver from `3.591s`

to `0.841s`

that is
a `4.12x`

speed gain. And most of the time would be spend as expected in the
matrix update function `gnew`

.