IM3000 Speed and Efficiency for Digital Signal Processing
The results below come from benchmarks performed by and for a customer as part of a prestudy for an ASIC development. The customer defined the benchmark programs, wrote them in C (except for the FFT), and then optimized them by rewriting them in assembly code for three different processors. Imsys optimized the same benchmark programs for the IM3000 using microcode, which is not possible on the other processors.
Benchmark overview
1. Array copy – One-time copy of 1024 short (16-bit) values from one buffer to another.
2. Vector product – Five times dot product of two 1024-value short arrays.
3. Product of conjugate – Five times cosine computation of two 512-value short arrays.
4. Atan2 computation – One-time product of conjugate over 512 complex (2 x 16-bit) values.
5. Cosin computation – Five times by 512 samples.
6. CosinSin computation – 512 complex numbers.
7. FFT computation – 1024 points.
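As an illustration of what a benchmark such as the vector product measures, here is a minimal Python sketch of a dot product over two 1024-value arrays of 16-bit values, repeated five times. The function name `dot_product` and the test data are hypothetical; this is not the customer's C or assembly code.

```python
from array import array


def dot_product(a, b):
    """Return the dot product of two equal-length sequences of integers."""
    if len(a) != len(b):
        raise ValueError("arrays must have the same length")
    acc = 0  # Python ints are unbounded; a C version would need a wide accumulator
    for x, y in zip(a, b):
        acc += x * y
    return acc


N = 1024
# array type code "h" stores signed short values, matching the 16-bit inputs
a = array("h", [(i % 7) - 3 for i in range(N)])
b = array("h", [(i % 5) - 2 for i in range(N)])

# The benchmark description repeats the computation five times
results = [dot_product(a, b) for _ in range(5)]
```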
FFT implementations
No C code existed for the FFT benchmark, so Imsys developed its own reference model in C. It was implemented to operate in place, i.e. the result replaces the input data in memory. The assembly code implementations for STM32 and dsPIC operate in place and out of place, respectively. The customer provided a result only for dsPIC, presumably the faster of the two.
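To make the term "in place" concrete, the following is a minimal sketch of an iterative radix-2 FFT that overwrites its input buffer. It is written in floating-point Python for readability and is not the Imsys reference model, whose structure is not described here; the name `fft_in_place` is hypothetical.

```python
import cmath


def fft_in_place(data):
    """Radix-2 decimation-in-time FFT that overwrites the list `data`."""
    n = len(data)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")

    # Reorder the input into bit-reversed index order
    j = 0
    for i in range(1, n):
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j ^= bit
        if i < j:
            data[i], data[j] = data[j], data[i]

    # Combine sub-transforms of doubling size; every result is written
    # back into the same buffer, so no second array is needed
    size = 2
    while size <= n:
        half = size // 2
        w_step = cmath.exp(-2j * cmath.pi / size)
        for start in range(0, n, size):
            w = 1 + 0j
            for k in range(half):
                u = data[start + k]
                v = data[start + k + half] * w
                data[start + k] = u + v
                data[start + k + half] = u - v
                w *= w_step
        size *= 2
```

An out-of-place implementation would instead write its results to a separate output buffer and leave the input untouched, at the cost of additional memory.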
Microcode
Microcode optimization for IM3000 means that critical parts of the algorithm have been transformed into special opcodes, which are executed by microcode in the writable part of the control store of the Imsys processor. In the case of the Array copy benchmark, this had already been done, i.e. a suitable opcode already existed in the standard assembly instruction repertoire.
FFT computation on IM3000
Microcode was developed for three operations:
The first instruction takes x, y on the stack and replaces them with x+y, x-y. The second is a variant that produces i(x-y) instead of x-y.
The third instruction is complex multiplication, where one factor is viewed as a fixed-point number, with the range 0x8000 to 0x7FFF representing the interval -1.0 to +1.0. It takes two complex numbers x, y on the stack and replaces them with the complex number:
(Re(x) * Re(y) – Im(x) * Im(y)) >> 15
+ i (Re(x) * Im(y) + Im(x) * Re(y)) >> 15
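The three operations can be modelled in a few lines of Python. This is only a sketch of the arithmetic, not the microcode itself: it reads the second output of the first instruction as the difference x - y (the usual FFT butterfly), represents a complex number as a hypothetical `(re, im)` tuple of integers, and assumes that the shift by 15 is an arithmetic shift with no saturation or overflow handling. The function names are illustrative only.

```python
def butterfly(x, y):
    """Model of the first instruction: returns (x + y, x - y)."""
    (xr, xi), (yr, yi) = x, y
    return (xr + yr, xi + yi), (xr - yr, xi - yi)


def butterfly_i(x, y):
    """Model of the second instruction: returns (x + y, i * (x - y))."""
    (xr, xi), (yr, yi) = x, y
    dr, di = xr - yr, xi - yi
    # Multiplying a + ib by i gives -b + ia
    return (xr + yr, xi + yi), (-di, dr)


def cmul_q15(x, y):
    """Model of the third instruction: complex product scaled by 2**-15.

    One factor is treated as fixed point, with 0x7FFF just under +1.0.
    Python's >> on integers is an arithmetic shift (rounds toward
    negative infinity); the actual hardware rounding is an assumption.
    """
    (xr, xi), (yr, yi) = x, y
    re = (xr * yr - xi * yi) >> 15
    im = (xr * yi + xi * yr) >> 15
    return re, im
```

For example, multiplying 1000 + 2000i by the fixed-point factor (0, 0x4000), which represents 0.5i, gives (-1000, 500), i.e. -1000 + 500i.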
Results of speed measurements
Execution time (µs)

Function                dsPIC   STM32   PIC32   IM3000
Array copy                 26     n/a     n/a       33
Vector dot product        132     260    1336      431
Product of conjugate      921     n/a     n/a      651
Atan2 computation         422     n/a     n/a      293
Cosin computation         488     557     925      359
CosinSin computation      206     n/a     n/a      161
FFT (1024 points)        2780     n/a     n/a     2040
Results for the two 32-bit RISC processors were available for only two benchmarks. On the cosine computation the Imsys IM3000 was considerably faster than both; on the vector dot product it was faster than the PIC32 but slower than the STM32. Compared with the dsPIC digital signal processor, the Imsys processor was faster on five benchmarks and slower on two.
Energy consumption
The following results were measured by the customer:
Processor                Power consumption (mW)
dsPIC                    295
STM32 (ARM Cortex-M3)    118
PIC32 (MIPS)             128
Imsys IM3000              40
When these values are multiplied by the execution times for the respective benchmarks, the following results are obtained for the energy consumed by each benchmark execution:
Energy per benchmark (µWs)

Function                dsPIC   STM32   PIC32   IM3000
Array copy                  8     n/a     n/a        1
Vector dot product         39      31     171       17
Product of conjugate      272     n/a     n/a       26
Atan2 computation         124     n/a     n/a       12
Cosin computation         144      66     118       14
CosinSin computation       61     n/a     n/a        6
FFT (1024 points)         820     n/a     n/a       86
As can be seen here, the Imsys processor consumes much less energy when executing the benchmarks, in several cases an order of magnitude less.
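The energy figures follow from multiplying power by execution time. As a worked example, the short sketch below recomputes the Cosin computation row from the measured power and time values given above; the dictionary and function names are hypothetical.

```python
# Measured power consumption in mW (from the power table above)
POWER_MW = {"dsPIC": 295, "STM32": 118, "PIC32": 128, "IM3000": 40}

# Execution times in µs for the Cosin computation benchmark
COSIN_TIME_US = {"dsPIC": 488, "STM32": 557, "PIC32": 925, "IM3000": 359}


def energy_uws(power_mw, time_us):
    """Energy in µWs: mW times µs gives nWs, so divide by 1000."""
    return power_mw * time_us / 1000.0


energies = {
    cpu: round(energy_uws(POWER_MW[cpu], t)) for cpu, t in COSIN_TIME_US.items()
}
print(energies)  # {'dsPIC': 144, 'STM32': 66, 'PIC32': 118, 'IM3000': 14}
```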