@Article{delaCruz:2014:ASS,
  author =       "Ra\'{u}l {de la Cruz} and Mauricio Araya-Polo",
  title =        "Algorithm 942: {Semi-stencil}",
  journal =      "{ACM} Transactions on Mathematical Software",
  volume =       40,
  number =       3,
  year =         2014,
  month =        apr,
  pages =        "23:1--23:39",
  url =          "http://doi.acm.org/10.1145/2591006",
  accepted =     "23 October 2013",
  abstract =     "
                  Finite Difference (FD) is a widely used method to
                  solve Partial Differential Equations (PDE). PDEs are
                  the core of many simulations in different scientific
                  fields, such as geophysics and astrophysics. The
                  typical FD solver performs stencil computations for
                  the entire computational domain, thus solving the
                  differential operators. In general terms, the stencil
                  computation consists of a weighted accumulation of the
                  contribution of neighbor points along the Cartesian
                  axes. Therefore, optimizing stencil computations is
                  crucial in reducing the application execution time.

                  Stencil computation performance is bounded by two main
                  factors: the memory access pattern and the inefficient
                  reuse of the accessed data. We propose a novel algorithm,
                  named Semi-stencil, that tackles these two problems. The
                  main idea behind this algorithm is to change the way
                  in which the stencil computation progresses within the
                  computational domain. Instead of accessing all required
                  neighbors and adding all their contributions at once,
                  the Semi-stencil algorithm divides the computation into
                  several updates. Then, each update gathers half of the
                  axis neighbors, partially computing at the same time
                  the stencil in a set of closely located points. As
                  the Semi-stencil progresses through the domain, the
                  stencil computations are completed on precomputed
                  points. This computation strategy improves the memory
                  access pattern and efficiently reuses the accessed data.

                  Our initial target architecture was the Cell/B.E., where
                  the Semi-stencil on an SPE was 44 per cent faster than the naive
                  stencil implementation. Since then, we have continued
                  our research on emerging multi-core architectures in
                  order to assess and extend this work on homogeneous
                  architectures. The experiments presented combine the
                  Semi-stencil strategy with space- and time-blocking
                  algorithms used in hierarchical memory architectures. Two
                  x86 (Intel Nehalem and AMD Opteron) and two POWER (IBM
                  POWER6 and IBM BG/P) platforms are used as testbeds,
                  where the best improvements for a 25-point stencil range
                  from 1.27 to 1.76 times faster. The results show that this
                  novel strategy is a feasible optimization method which
                  may be integrated into auto-tuning frameworks. Also,
                  since all current architectures are multi-core based,
                  we have introduced a brief section where scalability
                  results on IBM POWER7, Intel Xeon and MIC-based systems
                  are presented. In a nutshell, the algorithm scales
                  as well as or better than other stencil techniques.
                  For instance, the scalability of the Semi-stencil
                  on MIC for a certain test case reached 93.8x over 244
                  threads.",
}