I am following an interesting class on Languages for Scientific Computing (http://www.cs.utexas.edu/users/pauldj/lectures.html) taught by Prof. Bientinesi (http://www.cs.utexas.edu/users/pauldj/). Just a week ago he explained that the theoretical peak performance of a processor is simply n_cores * frequency * ops_per_cycle_per_cpu. That, however, is not the performance you get in practice.
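To make the formula concrete, here is a minimal sketch in Python. The function name and the sample numbers (4 cores, 3.0 GHz, 16 floating-point operations per cycle) are my own assumptions for illustration, not figures from the lecture or from any particular CPU.

```python
def theoretical_peak_gflops(n_cores, frequency_ghz, ops_per_cycle_per_cpu):
    """Theoretical peak in GFLOP/s: n_cores * frequency * ops_per_cycle_per_cpu.

    The frequency is given in GHz so that the product comes out in GFLOP/s.
    """
    return n_cores * frequency_ghz * ops_per_cycle_per_cpu


# Hypothetical machine, chosen only to illustrate the arithmetic.
peak = theoretical_peak_gflops(n_cores=4, frequency_ghz=3.0, ops_per_cycle_per_cpu=16)
print(f"Theoretical peak: {peak} GFLOP/s")  # 4 * 3.0 * 16 = 192.0
```

Plug in the core count, clock rate, and operations per cycle of your own processor to get its number.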
A processor often sits idle waiting for data to arrive: no data, no processing. So, to get high performance, the data must always be available when the processor needs them. Data, however, are stored in memory, and the type of memory that the processor can access directly holds only a very small amount of data because that memory is expensive. One of the ABCs of HPC (High-Performance Computing) is the following pyramid, where the top is the fastest, most expensive, and scarcest type of memory while the bottom is the cheapest and most plentiful: http://www.instructables.com/files/orig/FI3/L4R9/FPQLJSGU/FI3L4R9FPQLJSG.... Therefore, the performance a processor actually achieves depends on the nature of the algorithm it is running.
As the professor said, DGEMM (Double-precision General Matrix Multiply) of BLAS (Basic Linear Algebra Subprograms) is a very commonly used routine that keeps its data close to the processor very well. DGEMM computes C = C + A x B, where A, B, and C are elements of R^(n x n). This requires n^3 multiplications and n^3 additions (n^2 * (n - 1) for the product A x B, plus n^2 to accumulate the result into C), as well as 3 * n^2 memory reads and n^2 memory writes. The ratio of operations to memory accesses is 2n^3 / 4n^2 = n/2, which means that the bigger the matrices are, the busier the processor is: the number of operations performed compensates for the cost of pulling the data from the lower levels of memory. Therefore, DGEMM is among the routines that come closest to the theoretical peak in practice.
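The counts above can be checked with a small sketch. This is a naive triple loop in pure Python, not how BLAS implements DGEMM; the function name is hypothetical, and the memory counts assume the ideal case where every element of A, B, and C is loaded once and every element of C is stored once.

```python
def gemm_with_counts(a, b, c):
    """Compute C = C + A x B in place for n x n lists of lists and count the work.

    Returns (multiplications, additions, memory_reads, memory_writes).
    Reads and writes follow the idealized model: each element of A, B, and C
    is read once, and each element of C is written once.
    """
    n = len(a)
    mults = 0
    adds = 0
    for i in range(n):
        for j in range(n):
            acc = c[i][j]
            for k in range(n):
                acc += a[i][k] * b[k][j]  # one multiplication, one addition
                mults += 1
                adds += 1
            c[i][j] = acc
    reads = 3 * n * n
    writes = n * n
    return mults, adds, reads, writes


n = 8
a = [[1.0] * n for _ in range(n)]
b = [[1.0] * n for _ in range(n)]
c = [[0.0] * n for _ in range(n)]
mults, adds, reads, writes = gemm_with_counts(a, b, c)
ratio = (mults + adds) / (reads + writes)
print(mults, adds, reads, writes, ratio)  # 512 512 192 64 4.0, and n / 2 = 4.0
```

Doubling n doubles the ratio, which is the whole point: the work grows as n^3 while the data grow only as n^2.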
As a consequence, DGEMM is used in the benchmark that ranks the world's supercomputers (http://www.top500.org/), and it is therefore very important to optimize the machine instructions that implement DGEMM for each specific architecture. For example, Kazushige Goto is famous for hand-optimizing those machine instructions (http://www.nytimes.com/2005/11/28/technology/28super.html?scp=1&sq=Kazus...). Of course, one must also take the cache size into account when crafting DGEMM code.
To conclude, if you want to know the peak performance your CPU can reach in practice, just run the DGEMM of BLAS.
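As a rough sketch of how such a measurement could look, the harness below times any matrix-multiply function and converts the time into GFLOP/s using the 2n^3 operation count. The names `measure_gflops` and `naive_matmul` are hypothetical, and the pure-Python multiply is only a stand-in so that the example is self-contained: it is not DGEMM and will land far below the peak of any CPU.

```python
import time


def measure_gflops(matmul, a, b, n, repeats=3):
    """Time matmul(a, b) on n x n operands and return the best rate in GFLOP/s.

    The rate assumes 2 * n^3 floating-point operations per multiplication.
    """
    flops = 2.0 * n ** 3
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        matmul(a, b)
        elapsed = time.perf_counter() - start
        best = min(best, elapsed)
    best = max(best, 1e-12)  # guard against a zero reading from the timer
    return flops / best / 1e9


def naive_matmul(a, b):
    """Stand-in multiply on lists of lists; NOT an optimized BLAS DGEMM."""
    columns = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in columns] for row in a]


n = 64
a = [[float(i + j) for j in range(n)] for i in range(n)]
b = [[float(i - j) for j in range(n)] for i in range(n)]
rate = measure_gflops(naive_matmul, a, b, n)
print(f"{rate:.4f} GFLOP/s")
```

To measure a real DGEMM, pass in a function that calls an optimized BLAS on double-precision matrices prepared in that library's own format, use a much larger n, and compare the result with the theoretical peak.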