Four years ago, Google realized the real potential of using neural networks in its applications and began deploying them everywhere: in text translation, voice search with speech recognition, and so on. But it quickly became clear that neural networks dramatically increase the load on Google's servers. Roughly speaking, if every Android user performed a voice search (or dictated text via speech recognition) for just three minutes a day, Google would have to double the number of its data centers just so the neural networks could handle that much voice traffic.
Something had to be done, and Google found a solution. In 2015 it developed its own hardware architecture for machine learning (the Tensor Processing Unit, TPU), which outperforms traditional GPUs and CPUs by a factor of 70, and by up to 196 times in the number of calculations per watt. "Traditional GPU/CPU" here refers to the general-purpose Xeon E5 v3 (Haswell) processors and Nvidia Tesla K80 graphics processors.
The TPU architecture is described for the first time in a scientific paper (pdf) released this week, which will be presented at the 44th International Symposium on Computer Architecture (ISCA) on June 26, 2017 in Toronto. The lead author among the paper's more than 70 authors, the distinguished engineer Norman Jouppi, known as one of the creators of the MIPS processor, explained the features of the unique TPU architecture in an interview with The Next Platform. The TPU is essentially an ASIC, that is, an application-specific integrated circuit.
Unlike conventional FPGAs or highly specialized ASICs, the TPU is programmed in much the same way as a GPU or CPU; it is not a narrow-purpose device locked to a single neural network. Norman Jouppi says that the TPU supports CISC instructions for different types of neural networks: convolutional networks, LSTM models, and large, fully connected models. So it remains programmable, but it uses a matrix as its primitive rather than vector or scalar primitives.
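As a rough illustration of what "matrix as a primitive" means, the NumPy sketch below contrasts issuing many per-row (vector-style) operations with a single whole-matrix multiply-accumulate. The shapes and the int8-to-int32 accumulation are illustrative assumptions, not the actual TPU instruction set.

```python
import numpy as np

# Illustrative shapes and dtypes, not the real TPU ISA.
activations = np.random.randint(-128, 128, size=(256, 256), dtype=np.int8)
weights     = np.random.randint(-128, 128, size=(256, 256), dtype=np.int8)
w32 = weights.astype(np.int32)

# Scalar/vector view: a loop of dot products, i.e. many separate instructions.
row_by_row = np.stack([activations[i].astype(np.int32) @ w32 for i in range(256)])

# Matrix view: one "instruction" performs the whole 256x256 multiply-accumulate.
one_shot = activations.astype(np.int32) @ w32

assert np.array_equal(row_by_row, one_shot)
```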
Google emphasizes that while other developers optimize their chips for convolutional neural networks, such networks account for only 5% of the load in Google's data centers. Most Google applications use multilayer (Rumelhart-style) perceptrons, so it was important to create a more universal architecture that is not "sharpened" only for convolutional networks.
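For readers unfamiliar with the term, a multilayer perceptron is simply a stack of fully connected layers. The TensorFlow sketch below, with made-up layer sizes, shows the kind of "large, fully connected model" meant here.

```python
import tensorflow as tf

# A minimal fully connected (multilayer perceptron) model of the kind the paper
# says dominates Google's inference workload; layer sizes here are invented.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(2048,)),
    tf.keras.layers.Dense(4096, activation="relu"),
    tf.keras.layers.Dense(4096, activation="relu"),
    tf.keras.layers.Dense(1000, activation="softmax"),
])
model.summary()
```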
One element of the architecture is a systolic data flow engine, a 256 × 256 array that receives activations from the neurons on the left; step by step they are shifted through the array and multiplied by the weights stored in each cell. The systolic matrix thus performs 65,536 multiply-accumulate operations per cycle. This architecture is ideal for neural networks.
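A toy, cycle-by-cycle Python model of a weight-stationary systolic array may make the data flow clearer. The 4 × 4 size, variable names, and timing details below are illustrative assumptions rather than details taken from the TPU paper.

```python
import numpy as np

# Toy weight-stationary systolic array (4x4 here, 256x256 in the TPU).
# Each cell holds one weight; activations enter from the left and move right,
# partial sums move down and drop out of the bottom row.
K, N = 4, 4
rng = np.random.default_rng(0)
W = rng.integers(-3, 4, size=(K, N))      # weights held in place by the cells
x = rng.integers(-3, 4, size=K)           # one activation vector streaming in

act = np.zeros((K, N), dtype=int)         # activation currently inside each cell
psum = np.zeros((K, N), dtype=int)        # partial sum currently inside each cell
out = np.zeros(N, dtype=int)              # results leaving the bottom row

total_cycles = K + N - 1 + 1              # enough cycles to drain the pipeline
for cycle in range(total_cycles):
    # Column j finishes at cycle j + K; read its result before shifting.
    for j in range(N):
        if cycle == j + K:
            out[j] = psum[K - 1, j]

    # Shift: activations move one cell right, partial sums one cell down.
    new_act = np.zeros_like(act)
    new_psum = np.zeros_like(psum)
    new_act[:, 1:] = act[:, :-1]
    new_psum[1:, :] = psum[:-1, :]

    # Feed row i with x[i], skewed by i cycles so the waves line up.
    for i in range(K):
        if cycle == i:
            new_act[i, 0] = x[i]

    # Every cell does one multiply-accumulate per cycle: 65,536 of them on a TPU.
    psum = new_psum + new_act * W
    act = new_act

assert np.array_equal(out, x @ W)         # matches an ordinary matrix product
```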
According to Jouppi, the TPU architecture is more like an FPU coprocessor than a conventional GPU, although its numerous matrix-multiplication units do not store any program themselves; they simply execute instructions received from the host.
Figure: the overall TPU architecture (everything except the DDR3 memory). Instructions are sent from the host (on the left) into an instruction queue; the control logic can then, depending on the instruction, execute each of them many times.
It is not yet known how well this architecture scales. Jouppi says that in a system with this kind of host there will always be some sort of bottleneck.
Compared to conventional CPUs and GPUs, Google's architecture is dozens of times faster. For example, the 18-core Haswell Xeon E5-2699 v3 processor running at 2.3 GHz delivers 1.3 tera-operations per second (TOPS) in 64-bit floating point and has a memory bandwidth of 51 GB/s. The chip itself consumes 145 W, and the whole system built on it, with 256 GB of memory, consumes 455 W.
For comparison, a TPU running 8-bit operations, with 256 GB of external memory and 32 GB of internal memory, has a memory bandwidth of only 34 GB/s, yet the card delivers 92 TOPS, roughly 71 times more than the Haswell processor. The TPU server consumes 384 W.
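A quick back-of-the-envelope check using just the server-level figures quoted above (keeping in mind that the workloads are not directly comparable, since the CPU numbers are for 64-bit floating point and the TPU numbers for 8-bit integers):

```python
# Server-level numbers from the text above; ratios are rough by design.
cpu_tops, cpu_server_watts = 1.3, 455    # Haswell Xeon E5-2699 v3 system
tpu_tops, tpu_server_watts = 92.0, 384   # TPU server

print(f"raw throughput ratio: {tpu_tops / cpu_tops:.0f}x")  # ~71x
print(f"throughput per watt:  "
      f"{(tpu_tops / tpu_server_watts) / (cpu_tops / cpu_server_watts):.0f}x")  # ~84x
```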
The following chart compares the relative performance per watt of the GPU server (blue bar) and the TPU server (red) against the CPU server. It also shows the relative performance per watt of the TPU server compared with the GPU server (orange), and of an improved TPU compared with the CPU server (green) and with the GPU server (lilac).
It should be noted that Google ran its comparisons on TensorFlow applications against the relatively old Haswell Xeon. In the newer Broadwell Xeon E5 v4, the number of instructions per cycle grew by 5% thanks to architectural improvements, and in the Skylake Xeon E5 v5, expected this summer, instructions per cycle may increase by another 9-10%. Together with the growth from 18 to 28 cores in Skylake, the overall performance of Intel processors in Google's tests could improve by 80%. Even so, a huge performance gap with the TPU would remain. In the 32-bit floating-point version of the test, the TPU's advantage over the CPU shrinks to about 3.5 times, but most models quantize perfectly well to 8 bits.
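As a sketch of what such 8-bit quantization involves, the example below applies symmetric, per-tensor linear quantization to random float weights and measures the resulting error. This is one common scheme, assumed here for illustration, not necessarily the exact one Google uses.

```python
import numpy as np

# Random float weights stand in for a trained model's parameters.
weights_fp32 = np.random.randn(256, 256).astype(np.float32)

# Map the observed float range onto the int8 range [-128, 127].
scale = np.abs(weights_fp32).max() / 127.0
weights_int8 = np.clip(np.round(weights_fp32 / scale), -128, 127).astype(np.int8)

# Dequantize to see how much precision the 8-bit representation loses.
restored = weights_int8.astype(np.float32) * scale
print("max abs error:", np.abs(weights_fp32 - restored).max())
```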
Google had been considering GPUs, FPGAs, and ASICs for its data centers since 2006, but found no real use for them until recently, when it put machine learning to work on a number of practical tasks and the load from those neural networks began to grow with billions of user requests. Now the company has no choice but to move away from traditional CPUs.
The company does not plan to sell its processors to anyone, but it hopes that the paper describing the 2015 ASIC will allow others to improve the architecture and create better ASICs that "will raise the bar even higher." Google itself is probably already working on a new version of the ASIC.