@Article{Nelson:2015:RGH,
  author =       "Thomas Nelson and Geoffrey Belter and Jeremy G. Siek and
                 Elizabeth Jessup and Boyana Norris",
  title =        "Reliable Generation of High-Performance Matrix Algebra",
  journal =      "{ACM} Transactions on Mathematical Software",
  volume =       "41",
  number =       "3",
  year =         "2015",
  accepted =     "4 May 2014",
  upcoming =     "true",
  abstract =     "Scientific programmers often turn to vendor-tuned Basic
                 Linear Algebra Subprograms (BLAS) to obtain portable high
                 performance. However, many numerical algorithms require
                 several BLAS calls in sequence, and those successive calls
                 do not achieve optimal performance. The entire sequence
                 needs to be optimized in concert. Instead of vendor-tuned
                 BLAS, a programmer could start with source code in Fortran
                 or C (e.g., based on the Netlib BLAS) and use a
                 state-of-the-art optimizing compiler. However, our
                 experiments show that optimizing compilers often attain
                 only one-quarter of the performance of hand-optimized
                 code. In this paper, we present a domain-specific compiler
                 for matrix kernels, the Build to Order BLAS (BTO), that
                 reliably achieves high performance using a scalable search
                 algorithm for choosing the best combination of loop
                 fusion, array contraction, and multithreading for data
                 parallelism. The BTO compiler generates code that is
                 between 16\% faster than hand-optimized code.",
}