Energy Efficiency in Microprocessor Platforms
November 26, 2007
The Imsys IM3000 family of components sets new benchmarks in power efficiency and speed for specific applications. The low number of gates is complemented by a large base of microcode that optimizes the use of its transistors, reducing overhead activity and thereby power consumption.
Think of the revolution in buses and cables
Wide buses are expensive. They take up silicon area, and if they go off-chip they increase the number of pins and the size of packages and circuit boards. Reducing bus width can therefore save money. Furthermore, narrower buses make it easier to fulfill requirements on reliability and electrical interference, which also saves money.
Serial transfer is therefore replacing the old parallel bus standards. This trend started with cable interfaces such as USB and it continues, via board-level buses (Fibre Channel, PCI-E, Serial ATA, InfiniBand, etc.), to on-chip buses in complex integrated circuits.
A processor core
What about the internal datapath of a processor, i.e. its ALU and registers? This path consists of a number of parallel bit lanes, which are similar to each other. If it is 32 bits wide it can add an integer to another in one cycle. If it is 8 bits wide this takes 4 cycles, but the energy consumed is the same.
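The four-cycle addition can be sketched as follows. This is a minimal illustration in Python, not Imsys microcode; the function name add32_bytewise is hypothetical. It models an 8-bit adder with a carry flag that is stepped once per byte lane, starting with the least significant byte.

```python
def add32_bytewise(a: int, b: int):
    """Add two 32-bit unsigned integers one byte at a time.

    Returns (sum modulo 2**32, number of 8-bit adder steps used).
    """
    result = 0
    carry = 0
    steps = 0
    for lane in range(4):                      # least significant byte first
        byte_a = (a >> (8 * lane)) & 0xFF
        byte_b = (b >> (8 * lane)) & 0xFF
        total = byte_a + byte_b + carry        # one pass through the 8-bit adder
        result |= (total & 0xFF) << (8 * lane)
        carry = total >> 8                     # carry into the next byte lane
        steps += 1
    return result, steps
```

Each pass through the loop corresponds to one cycle of the narrow datapath; the same adder is reused four times instead of being replicated four times side by side.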
If the speed is sufficient, then the 8-bit datapath could be preferable due to its lower cost, higher reliability, and lower interference, just as USB is preferable to the old parallel printer cables. Note that the speed of transistors increases with every CMOS generation.
An 8-bit datapath could also be more efficiently utilized, since the 8-bit byte is the building block for most data formats. A 32-bit machine has to do extra work to access a byte within a 32-bit word, and when it operates on 8-bit, 16-bit, or 24-bit data, part of its datapath is idle.
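The extra work of reaching a byte inside a word can be illustrated with a small sketch. It assumes little-endian byte order within the word (an assumption; the text does not specify one), and the function names are hypothetical. On a real 32-bit machine the shift and mask may be explicit instructions or may be done by alignment hardware; either way it is work that a byte-wide datapath does not need.

```python
def byte_from_word(word: int, index: int) -> int:
    """Extract byte `index` (0 = least significant) from a 32-bit word."""
    if not 0 <= index <= 3:
        raise ValueError("index must be in the range 0..3")
    return (word >> (8 * index)) & 0xFF    # shift, then mask


def byte_from_memory(memory: bytes, address: int) -> int:
    """A byte-wide datapath reads the byte directly, with no shift or mask."""
    return memory[address]


word = 0x11223344
memory = word.to_bytes(4, "little")        # the same value laid out as bytes
```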
A machine with an 8-bit datapath must have multicycle instructions and thus more complex control than that of a 32-bit RISC. This calls for microprogrammed control, which in turn enables the use of much more complex control than that needed to make up for the narrower datapath. The instructions produced by the compiler can then be defined on a higher level of abstraction and adapted for the compiler instead of for the hardware - leading to more efficient utilization of instruction bits. This leads to a denser and smaller runtime program and thus lower memory bandwidth and lower power consumption in the memory.
Making the microprogram writable also enables soft optimization by dedicated microcode for special functions, upgradeable like software.
The potential benefits of a microprogrammed 8-bit machine are thus:
- lower cost
- higher reliability
- lower electrical interference
- higher energy efficiency
- lower memory power consumption
- reduced width of memory interface
- higher speed and efficiency for special functions
- field configurability of accelerated functions.
Why has this not been done before?
If it is possible, by more advanced control as described above, to make an 8-bit datapath do more than four times as much per cycle as the fastest ordinary 8-bit processors do, then such a processor should be able to compete with the much more expensive and power-hungry 32-bit machines. Imsys has proven that it is possible to increase performance by much more than a factor of four. Then why hasn't this been done before?
The CPU architectures most commonly used in embedded systems developed today are old architectures with either 8-bit or 32-bit wide datapath. They were once optimized for requirements that are no longer valid. None of these architectures can develop by evolution into something like the Imsys processor.
Launching an entirely new CPU architecture has been very difficult and costly due to the need for new software tools, training, etc. When RISC became popular in the eighties, it became somewhat easier, since assembly programming decreased in importance and it was relatively easy to adjust an open source C compiler for the differences between these reduced instruction sets and then compile a Unix (or Linux) operating system.
Another development that can enable the launch of a new architecture - and now one with complex control - is the standardization of a "virtual machine" for high-level-language compilers. Such "machines" were defined long ago for Pascal and Modula-2, and more recently for Java.
Java is a modern, object-oriented language, and the VM (virtual machine) for it is suitable also for other high-level languages (although small additions need to be made for C). Excellent tools are freely available, and the language is now the most popular of all. The VM completely hides the underlying CPU architecture and the operating system - these can be proprietary and be further improved without compatibility restrictions as long as the API (Application Programming Interface) follows the specification.
This is a new situation, one that has never existed before, and this is why machines like the Imsys processor have not appeared as general CPUs for customer software before.
The Imsys Processor
The conceptual design of the Imsys processor and its microcode and software tools have a long history of development and refinement, but its predecessors were used in applications where programming would not be done by customers or users. The first incarnation in the eighties was, however, technically a minicomputer (and terminal cluster controller); it had a multi-user operating system and it was programmed in a high-level language. The hardware (built in TTL) had more raw speed than the software needed and was therefore designed to be extremely general and flexible, so that it could also take care of some non-CPU functions and thereby save cost.
The new Imsys processor has a completely different instruction set architecture, defined on a higher level of abstraction. The specification is further from the hardware; in fact nothing in the hardware was designed specifically to support this particular architecture, but the hardware is general and flexible enough to implement it efficiently. The instruction repertoire, thus entirely defined in the microprogram, was built on the Java VM, i.e. a higher-level instruction repertoire optimized to suit the compiler and to have high code density, and intended to be interpreted (i.e. executed indirectly) by software. The Imsys machine does the interpretation in microcode, inside the processor.
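What interpretation of a stack-oriented instruction repertoire involves can be sketched in software as a fetch, decode, and dispatch loop. The following is a minimal illustration, not the Imsys microprogram: it handles only four Java VM opcodes (bipush, iadd, imul, ireturn) and ignores 32-bit overflow, stack limits, and everything else in the Java VM specification. The function name interpret is hypothetical.

```python
BIPUSH, IADD, IMUL, IRETURN = 0x10, 0x60, 0x68, 0xAC   # Java VM opcode values


def interpret(code: bytes) -> int:
    """Run a tiny subset of Java bytecode and return the ireturn value."""
    stack = []
    pc = 0
    while pc < len(code):
        opcode = code[pc]                  # fetch
        pc += 1
        if opcode == BIPUSH:               # push a signed 8-bit immediate
            value = code[pc]
            pc += 1
            if value >= 0x80:              # sign-extend the byte
                value -= 0x100
            stack.append(value)
        elif opcode == IADD:
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif opcode == IMUL:
            b, a = stack.pop(), stack.pop()
            stack.append(a * b)
        elif opcode == IRETURN:
            return stack.pop()
        else:
            raise ValueError(f"unsupported opcode 0x{opcode:02x}")
    raise ValueError("code ended without ireturn")


# (2 + 3) * 4 encoded as six instructions in nine bytes
program = bytes([BIPUSH, 2, BIPUSH, 3, IADD, BIPUSH, 4, IMUL, IRETURN])
```

In a software interpreter every pass through this loop costs several host instructions for fetching, decoding, and branching. Performing the same dispatch in microcode is what the text refers to as doing the interpretation inside the processor.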
A compiler for C/C++ has been developed for this architecture. It translates C source code to runtime code with the high density characteristic of Java bytecode (75-80% code size reduction compared to RISC, according to Sun).
Most of the microprogram resides in ROM (only 1 transistor/cell, vs. 6 transistors for RAM). The size and power consumption of this ROM are very small compared to those of an instruction cache memory for a RISC (where interpretation software would go).
The microinstructions are much wider than RISC instructions and have many more degrees of freedom in what to do in each cycle. Branches can have several alternative destinations, and be combined with memory access, operations on data, and loop counting, in the same cycle. Thus, interpretation and all other special sequences that are important enough to be microprogrammed use fewer cycles.
Measurements show that execution of the four common Java bytecode instructions for floating-point numbers (a "float" consists of a 1-bit sign, an 8-bit exponent, and a 23-bit mantissa) requires 5.6 - 7.7 times as many cycles on an ARM920T as microinstruction cycles on the Imsys processor. Copying arrays in memory requires 16 - 23 times as many cycles on the ARM as on the Imsys processor. The energy consumption factor is at least as high as this, since datasheets for ARM9-based chips and measurements on the Imsys processor show that each ARM cycle consumes at least as much energy as an Imsys cycle.
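The float layout mentioned above is the IEEE 754 single-precision format that Java uses. As a minimal sketch of how the three fields sit in the 32-bit word, the following uses Python's standard struct module; the function name float_fields is hypothetical and the code is not related to the Imsys microcode.

```python
import struct


def float_fields(x: float):
    """Split a 32-bit float into (sign, biased exponent, mantissa bits)."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))   # raw 32-bit pattern
    sign = bits >> 31                  # 1 bit
    exponent = (bits >> 23) & 0xFF     # 8 bits, biased by 127
    mantissa = bits & 0x7FFFFF         # 23 bits, implicit leading 1 not stored
    return sign, exponent, mantissa
```

The exponent and mantissa fields do not fall on byte boundaries, so floating-point arithmetic involves unpacking, aligning, and repacking steps on any integer datapath, whatever its width.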
Benchmark measurements on processors with cache memory are problematic, since it is difficult to define a benchmark that will be representative of a real application (where cache misses occur). A comparison of the Imsys processor with a cache-less 8-bit microcontroller (in similar CMOS technology) could therefore also be of interest. A set of ten different C language benchmark programs from an independent source, Texas Instruments, was used for comparing the Imsys processor with the Rabbit 3000, which uses the Z80 architecture. The Imsys processor was on average 13.5 times faster. In addition to these tests, the time for an algorithm that is part of the RSA crypto algorithm was also measured. In pure C code it was 19 times faster on the Imsys processor. Microcoded, it was 348 times faster, and the cost (silicon area of added microcode) is minimal.
Thus, with advanced control (which doesn't consume much power) it is indeed possible for a processor with 75% smaller arithmetic unit to compete with the much more expensive and power-hungry 32-bit machines.

