We present the design of specialized compute fabrics
that maintain the efficiency of full custom hardware while
providing enough flexibility to execute a whole class of coarse-
grain linear algebra operations. The broad vision of this project
is to develop integrated and specialized hardware/software
solutions that are co-optimized and co-designed across all layers
ranging from the basic hardware foundations all the way to
the application through standard linear algebra packages.
We have designed a specialized linear algebra processor
(LAP) that can perform level-3 BLAS and more complex
LAPACK-level operations such as Cholesky, LU (with partial
pivoting), and QR factorizations. We present a power-performance
model that compares state-of-the-art CPUs and
GPUs with our design. Our power model reveals sources
of inefficiencies in CPUs and GPUs, and our LAP design
demonstrates how to overcome them. Compared to
conventional architectures for linear algebra applications, LAP
is orders of magnitude more power efficient. Based on our
estimates, single- and double-precision efficiencies of up to 55 and
25 GFLOPS/W, respectively, are achievable on a single chip in standard
45nm technology.