#### Next generation GRAPE: GRAPE-DR

#### Jun Makino

# Center for Computational Astrophysics, National Astronomical Observatory of Japan

## Intel P4 and GRAPE-6





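|              | Intel Prescott (2004) | GRAPE-6 chip (2000) |
|--------------|-----------------------|---------------------|
| FP ops/clock | 2                     | $\sim 400$          |
| Process      | 90 nm                 | $0.25 \mu m$        |
| Peak speed   | 7.6 GF                | 31 GF               |
| Power        | > 100 W?              | 10 W                |

A GRAPE processor is $\sim 100$ times more efficient.

Next generation will be similarly good.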

## Problem with GRAPE approach

• Chip development cost becomes too high.

| Year | Machine         | Chip Initial Cost |
|------|-----------------|-------------------|
| 1992 | GRAPE-4         | 200K\$            |
| 1997 | <b>GRAPE-6</b>  | 1M\$              |
| 2004 | <b>GRAPE-DR</b> | 4M\$              |

Initial cost should be 1/4 or less of the total budget. How can we continue?

# Next-Generation GRAPE — GRAPE-DR

- Planned peak speed: 2 Pflops
- New architecture: wider application range than previous GRAPEs
  - primarily to get funded
- Planned completion year: FY 2008 (early 2009)

## **GRAPE-DR** processor structure



(Figure: processor block diagram, with the result output port labeled.)

A collection of small processors, each with an ALU and a register file (local memory).

One chip will integrate 512 processors. Each processor will run at a 500 MHz clock (2 operations/cycle).

Peak speed of one chip: 0.5 Tflops (20 times faster than GRAPE-6).
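The peak figure follows directly from the numbers on this slide. The short sketch below redoes the arithmetic; the constant names are illustrative, the 31 Gflops GRAPE-6 value is the one quoted in the earlier comparison, and the chip count is only a lower bound implied by the 2 Pflops target, not a stated system design.

```python
import math

# Figures taken from the slides; the names are illustrative only.
PES_PER_CHIP = 512          # processors integrated on one chip
CLOCK_HZ = 500e6            # 500 MHz clock
OPS_PER_CYCLE = 2           # operations per processor per cycle
GRAPE6_CHIP_FLOPS = 31e9    # GRAPE-6 chip peak quoted earlier (31 GF)
TARGET_FLOPS = 2e15         # planned system peak (2 Pflops)

# Peak speed of one chip = processors x clock x operations per cycle.
chip_peak_flops = PES_PER_CHIP * CLOCK_HZ * OPS_PER_CYCLE

# Ratio to the GRAPE-6 chip, and the minimum chip count for the target.
speedup_vs_grape6 = chip_peak_flops / GRAPE6_CHIP_FLOPS
min_chips_for_target = math.ceil(TARGET_FLOPS / chip_peak_flops)

print(f"chip peak: {chip_peak_flops / 1e12:.3f} Tflops")
print(f"ratio to GRAPE-6 chip: {speedup_vs_grape6:.1f}")
print(f"chips needed for 2 Pflops (lower bound): {min_chips_for_target}")
```

With these inputs the chip peak is 0.512 Tflops, and the ratio to the 31 Gflops figure comes out closer to 16.5 than to 20, so "20 times" is best read as a round number.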

# **PE** architecture



- Float Mult (24 bit mantissa, with full 49 bit output)
- Float add/sub (60 bit mantissa)
- Integer ALU (72 bit)
- 32-word (72 bit) general-purpose register file
- 256-word (72 bit) memory
- ports to shared memory (shared by 32 processors)
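To give a feel for how small each processing element is, the sketch below converts the word counts above into bytes and scales them to a full chip. It assumes the 512-processor chip described earlier; the variable names are illustrative, and the count of shared memories is just 512 divided by 32, not a figure stated on the slides.

```python
# Figures from the slides; names are illustrative only.
WORD_BITS = 72                # width of a register or memory word
REGISTER_WORDS = 32           # general-purpose register file
MEMORY_WORDS = 256            # local memory per processor
PES_PER_CHIP = 512            # processors per chip
PES_PER_SHARED_MEMORY = 32    # processors sharing one shared memory

# Storage per processing element, in bytes (72 bits = 9 bytes per word).
register_bytes_per_pe = REGISTER_WORDS * WORD_BITS // 8
memory_bytes_per_pe = MEMORY_WORDS * WORD_BITS // 8

# Totals for one chip.
memory_bytes_per_chip = memory_bytes_per_pe * PES_PER_CHIP
shared_memories_per_chip = PES_PER_CHIP // PES_PER_SHARED_MEMORY

print(f"register file per PE: {register_bytes_per_pe} bytes")
print(f"local memory per PE:  {memory_bytes_per_pe} bytes")
print(f"local memory per chip: {memory_bytes_per_chip / 2**20:.3f} MiB")
print(f"shared memories per chip: {shared_memories_per_chip}")
```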

#### How do you use it?

- GRAPE: We'll write the necessary software. The move from GRAPE-6 will be less painful than the move from GRAPE-4 to GRAPE-6.
- Matrix etc ... RIKEN/NAOJ will do something
- New applications:
  - Primitive Compiler available
  - For high performance, you need to write the kernel code in assembly language

# Primitive compiler

#### (Nakasato 2006)

```
/VARI xi, yi, zi, e2;
/VARJ xj, yj, zj, mj;
/VARF fx, fy, fz;
dx = xi - xj;
dy = yi - yj;
dz = zi - zj;
r2 = dx * dx + dy * dy + dz * dz + e2;
r3i= powm32(r2);
ff = mj*r3i;
fx += ff*dx;
fy += ff*dy;
fz += ff*dz;
```
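For readers unfamiliar with the notation, here is a plain Python sketch of what the kernel above appears to compute: a softened pairwise interaction accumulated over all j-particles for each i-particle. It rests on assumptions not stated on the slide: that `/VARI` and `/VARJ` declare per-i and per-j inputs, that `/VARF` declares accumulated results, and that `powm32(x)` means x raised to the power -3/2. The function `accumulate_forces` is a hypothetical name used only for illustration. This is a host-side reference, not the code the compiler generates.

```python
def powm32(x):
    """Assumed meaning of the kernel's powm32: x to the power -3/2."""
    return x ** -1.5


def accumulate_forces(i_particles, j_particles):
    """Reference double loop mirroring the kernel statement by statement.

    i_particles: iterable of (xi, yi, zi, e2)  -- the /VARI variables
    j_particles: iterable of (xj, yj, zj, mj)  -- the /VARJ variables
    Returns a list of (fx, fy, fz), one per i-particle (the /VARF variables).
    """
    results = []
    for xi, yi, zi, e2 in i_particles:
        fx = fy = fz = 0.0
        for xj, yj, zj, mj in j_particles:
            dx = xi - xj
            dy = yi - yj
            dz = zi - zj
            r2 = dx * dx + dy * dy + dz * dz + e2  # softened squared distance
            r3i = powm32(r2)
            ff = mj * r3i
            fx += ff * dx
            fy += ff * dy
            fz += ff * dz
        results.append((fx, fy, fz))
    return results


# One i-particle at x = 2 and one unit-mass j-particle at the origin, no softening.
print(accumulate_forces([(2.0, 0.0, 0.0, 0.0)], [(0.0, 0.0, 0.0, 1.0)]))
```

With a nonzero `e2`, an i-particle that also appears in the j-list contributes nothing to its own sum, because `dx`, `dy`, and `dz` are all zero while `r2` stays positive. With `e2 = 0` that same case would divide by zero in this sketch.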

#### **Development status**



1st prototype board (designed by Toshi Fukushige).

- Confirmed successful operation at 500 MHz clock
- Currently working on software and FPGA design to run real applications

# Summary

- GRAPE-DR, with programmable processors, will have a wider application range than traditional GRAPEs
- Assembly language defined.
- Primitive compiler is ready.
- Processor chip is completed and no problems found (so far).