Python has become my preferred language when it comes to programing. It is so simple to import from its vast library of packages and then build upon them to get things working. Yet, at some point, when all the design phase is over, and the software has to perform it is so slow.

In this case, I’m referring to numerical calculations. Indeed the numpy and scipy modules are of great help, letting all the numerical calculations to run faster in their C or Fortran coded engines. But what of the algorithm I develop in Python? It is going to be the slow part of my work. Thus I have to avoid writing too much Python and prefer calling all this C coded modules. BUT is not enough, because numpy and scipy only provide very general functions and it is up to the user to make a use out of them.

Once the user has its algorithm working, he needs to speed it up. Coding it all into C might be too much of a hassle and error prone. He also might loose many of the Python packages he depended on in the first place and the code will become less readable. The alternatives for now are to use Cython or the not so new project numba . These two are designed to establish the bridge between Python’s ease of use and C’s performance. My problem is that all the simple examples and tutorials you get from them are to highlight the huge performance boost you can get, yet they don’t really match my real life situations. Their examples are based on situations where the complete algorithm is stand alone and the Python interpreter is slowing everything down. Thus a quick call to numba or Cython’s static typesetting will just solve the problem and give you 3 orders of magnitude performance boost.

My story begins because I was already calling the fast C functions of numpy and scipy, and those, should be the most time consuming functions of my algorithm. But they weren’t, there was enough Python code to slow it down. So my first hope was to call numba, but there was no performance gain as it can’t deal with my calls to scipy functions. So Cython was the way to go, I’m certainly not aware of the many possibilities Cython permits to speed up my code, I still believe there is room for improvement, and I would be happy to hear from them in the comments.

So I now present you how I solved my problem and gained a performance boost of 4x. Indeed it is not so glamorous or amazing as 1000x you get from the advertised Cython version of the code, but it is a new start in performance gains over my python code.

So before I start I will stress: HAVE TESTS FOR YOUR CODE! It doesn’t help at all to get faster execution times just to arrive to the wrong results faster. Debugging is a pain, and it is more painful every time your new modifications break old code. Testing lets you test over smaller pieces of your code and make it easier to track what broke your code. I love Python so much because there are very simple frameworks to write your tests, so just use them. When is a good time to write your test? If you can, before every function you develop. After you wrote your package is also good. Even if it produces the correct results from the beginning you have to keep track it continues to do so as your code evolves.

The goal of this work is to implement a Quantum Monte Carlo algorithm to solve the Anderson impurity model and use it in the context of DMFT. There are already many implementations of this code spread around the world, many of them in Fortran and C/C++ so why not use them? Well, first they are more complicated to understand, modify, and some times to big packages. Second, I needed to learn and modify it on my own style.

The complete software is freely available on github at learn-dmft , matching the first release of the code(because the code itself will evolve as new ideas come in, bugs appear and get fixed. I will focus in this post on the local changes to speed up the code. These code snippets are not aimed to run independently but are give in the aim to provide some insight to the work done.

Lets start by profiling the code. Here you see only the relevant parts I want to focus on for this short example. Here one sees that most of time is spent inside the imp_solver which performs the Monte Carlo update, that needs to update a dense matrix each time, gnew. Almost all of the time would need to be spend in the matrix update, but because I’m using scipy’s interface to BLAS it just doesn’t take that much time and it is the Metropolis update scheme which is manually implemented in Python that takes all the time.

1         1214757 function calls (1212402 primitive calls) in 3.903 seconds
3   Ordered by: cumulative time
5   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
6        1    0.029    0.029    3.591    3.591
7     9000    1.603    0.000    3.557    0.000
8   260114    1.635    0.000    1.879    0.000
9   520228    0.244    0.000    0.244    0.000 {method 'copy' of 'numpy.ndarray' objects

Based on this example for the Ising model I wanted to test numba, but it did not help, because numba can’t, to my knowledge, compile the calls I do to scipy. Static typing in Cython didn’t work either as the main bottleneck still originates from calling the scipy modules of BLAS. So I need to use Cython to perform the direct call into BLAS. It was thanks to tokyo , a Cython wrapper to BLAS that I could learn this. Although scipy seems to be inserting in its development branch the Cython wrappers to BLAS, that will only become usable at a later time.

 1from scipy.linalg.blas import dger
 2def gnew(g, v, k, sign):
 3    """Quick update of the interacting Green function matrix after a single
 4    spin flip of the auxiliary field. It calculates
 6    .. math:: \alpha = \frac{\exp(2\sigma v_j) - 1}
 7                        {1 + (1 - G_{jj})(\exp(2\sigma v_j) - 1)}
 8    .. math:: G'_{ij} = G_{ij} + \alpha (G_{ik} - \delta_{ik})G_{kj}
10    no sumation in the indexes"""
11    dv = sign*v*2
12    ee = np.exp(dv)-1.
13    a = ee/(1. + (1.-g[k, k])*ee)
14    x = g[:, k].copy()
15    x[k] -= 1
16    y = g[k, :].copy()
17    g = dger(a, x, y, 1, 1, g, 1, 1, 1)

Recoding this into Cython syntax it becomes

 1cdef extern from "cblas.h":
 2    enum CBLAS_ORDER: CblasRowMajor, CblasColMajor
 3    void lib_dger "cblas_dger"(CBLAS_ORDER Order, int M, int N, double alpha,
 4                                double *x, int dx, double *y, int dy,
 5                                double *A, int lda)
 6cimport numpy as np
 7import cython
 8from libc.math cimport exp
13cpdef gnew(np.ndarray[np.float64_t, ndim=2] g, double v, int k, double sign):
14    cdef double dv, ee, alpha
15    cdef int N = g.shape[0]
16    dv = sign*v*2
17    ee = exp(dv)-1.
18    alpha = ee/(1. + (1.-g[k, k])*ee)
19    cdef np.ndarray[np.float64_t, ndim=1] x = g[:, k].copy()
20    cdef np.ndarray[np.float64_t, ndim=1] y = g[k, :].copy()
22    x[k] -= 1.
23    lib_dger(CblasColMajor, N, N, alpha,
24            &x[0], 1, &y[0], 1, &g[0,0], N)

It is important to keep in mind that numpy normally uses C ordered arrays, but if you happen to use scipy.linalg.blas functions in some steps, your numpy arrays will become Fortran ordered. Then it is important to keep that ordering when you call BLAS. (This was my big bug, and why you need tests). So pay special attention of the keyword of CblasColMajor to keep the Fortran Column ordering provided by the numpy arrays. The copy instruction is also necessary, and there should be better ways of data copying and not creating new numpy arrays.

1         695841 function calls (693486 primitive calls) in 2.704 seconds
3   Ordered by: cumulative time
5   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
6        1    0.028    0.028    2.389    2.389
7     9000    1.273    0.000    2.357    0.000
8   261426    1.026    0.000    1.026    0.000 {hffast.gnew}

This first action dropped the amount of python calls almost by half. The new Cython function is faster by almost half. The changing number of calls is because I’m a bad profiler and I’m not using the same seed for the random number generator, but in this case is has quite negligible effects. The total execution time dropped by a bit more than a second.

Now it’s time to redo the update function that is in charge of the Metropolis algorithm. It has a very simple structure. It uses most of its time calling the random number generator and doing the algebra operations over simple floats

 1def update(gup, gdw, v):
 2    for j in range(v.size):
 3        dv = 2.*v[j]
 4        ratup = 1. + (1. - gup[j, j])*(np.exp(-dv)-1.)
 5        ratdw = 1. + (1. - gdw[j, j])*(np.exp( dv)-1.)
 6        rat = ratup * ratdw
 7        rat = rat/(1.+rat)
 8        if rat > np.random.rand():
 9            v[j] *= -1.
10            gnew(gup, v[j], j, 1.)
11            gnew(gdw, v[j], j, -1.)

But once rewritten into Cython it becomes:

 1cdef extern from "gsl/gsl_rng.h":
 2    ctypedef struct gsl_rng_type:
 3        pass
 4    ctypedef struct gsl_rng:
 5        pass
 6    gsl_rng_type *gsl_rng_mt19937
 7    gsl_rng *gsl_rng_alloc(gsl_rng_type * T)
 8    double uniform "gsl_rng_uniform"(gsl_rng *r)
10cdef gsl_rng *r = gsl_rng_alloc(gsl_rng_mt19937)
15cpdef update(np.ndarray[np.float64_t, ndim=2] gup,
16             np.ndarray[np.float64_t, ndim=2] gdw,
17             np.ndarray[np.float64_t, ndim=1] v):
18    cdef double dv, ratup, ratdw, rat
19    cdef int j, i, up, dw, pair, N=v.shape[0]
20    for j in range(N):
21        dv = 2.*v[j]
22        ratup = 1. + (1. - gup[j, j])*(exp(-dv)-1.)
23        ratdw = 1. + (1. - gdw[j, j])*(exp( dv)-1.)
24        rat = ratup * ratdw
25        rat = rat/(1.+rat)
26        if rat > uniform(r):
27            v[j] *= -1.
28            gnew(gup, v[j], j, 1.)
29            gnew(gdw, v[j], j, -1.)

Notice that in this case I’m again calling external functions available in C like the ones of the GSL libraries. After these changes the end result is a completely cythonized Metropolis Monte Carlo update function that runs entirely on C. As it is shown by the profiler.

1         137415 function calls (135060 primitive calls) in 1.182 seconds
3   Ordered by: cumulative time
5   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
6        1    0.024    0.024    0.870    0.870
7     9000    0.841    0.000    0.841    0.000 {hffast.update}

The amount of Python function calls dropped again this time by 5 times. And the update function internally calling gnew requires in total 0.841s. Dropping the total time used by the impurity solver from 3.591s to 0.841s that is a 4.12x speed gain. And most of the time would be spend as expected in the matrix update function gnew.