We show the design of specialized compute fabrics that maintain the efficiency of full custom hardware while providing enough flexibility to execute a whole class of coarse-grain linear algebra operations. The broad vision of this project is to develop integrated and specialized hardware/software solutions that are co-optimized and co-designed across all layers, ranging from the basic hardware foundations all the way to the application through standard linear algebra packages. We have designed a specialized linear algebra processor (LAP) that can perform level-3 BLAS and more complex LAPACK-level operations such as Cholesky, LU (with partial pivoting), and QR factorizations. We present a power-performance model that compares state-of-the-art CPUs and GPUs with our design. Our power model reveals sources of inefficiency in CPUs and GPUs, and our LAP design demonstrates how to overcome them. Compared to conventional architectures for linear algebra applications, LAP is orders of magnitude more power efficient. Based on our estimates, efficiencies of up to 55 GFLOPS/W in single precision and 25 GFLOPS/W in double precision are achievable on a single chip in standard 45nm technology.
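The efficiency figures above are quoted in GFLOPS/W. As a minimal sketch of how such a figure is derived for a level-3 BLAS matrix multiply, the Python below uses the standard 2mnk operation count for GEMM and divides the resulting throughput by the power drawn. The function names `gemm_flops` and `gflops_per_watt`, and the runtime and power values in the usage section, are hypothetical placeholders for illustration and are not taken from the LAP power-performance model.

```python
def gemm_flops(m: int, n: int, k: int) -> int:
    """Floating-point operations for C += A * B, with A (m x k) and B (k x n).

    Each of the m*n output elements needs k multiplies and k adds.
    """
    return 2 * m * n * k


def gflops_per_watt(flops: float, seconds: float, watts: float) -> float:
    """Power efficiency: achieved GFLOPS divided by average power in watts."""
    if seconds <= 0 or watts <= 0:
        raise ValueError("seconds and watts must be positive")
    gflops = flops / seconds / 1e9
    return gflops / watts


if __name__ == "__main__":
    # Hypothetical measurement: a 1000 x 1000 x 1000 GEMM that takes
    # 0.05 s at an average of 2.0 W. These numbers are illustrative only.
    flops = gemm_flops(1000, 1000, 1000)
    print(gflops_per_watt(flops, seconds=0.05, watts=2.0))
```

The same metric applies to any architecture being compared, as long as the runtime and the power are measured over the same operation.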