



# dCF4/dt - The First Derivatives of the Version 4 *ColdFire*® Integrated Core

**Joe Circello**

Chief ColdFire Architect

Microprocessor Forum

October 11, 2000



Motorola General Business Use

 **Digital DNA**<sup>™</sup>  
from Motorola  
THE HEART OF SMART.

# ColdFire® Core Performance Roadmap



# V4 ColdFire® Core Microarchitecture

- Independent, decoupled pipelines
  - 4-stage Instruction Fetch Pipeline (IFP)
  - 5-stage Operand Execution Pipeline (OEP)
  - FIFO I-Buffer
- Limited superscalar execution via instruction folding
  - Cost-effective dual instruction per cycle implementation



# V4 ColdFire® Core Microarchitecture

- Harvard memory architecture
- Outstanding performance/size
  - Most instructions execute in 1 cycle
  - CPI performance = 1.35 cycles/inst
  - Dhrystone 2.1 = 1.54 MIPS/MHz
  - 350 MIPS @ 225 MHz, 4.0 mm<sup>2</sup> in 0.18 µm
  - 510 MIPS @ 333 MHz, 2.1 mm<sup>2</sup> in 0.13 µm
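
The throughput figures above follow approximately from the Dhrystone rating multiplied by the clock frequency. A minimal sketch of that arithmetic, assuming the 1.54 MIPS/MHz rating applies unchanged at both clock rates (the function names are illustrative and not from the presentation):

```python
def dhrystone_mips(mips_per_mhz: float, clock_mhz: float) -> float:
    """Dhrystone MIPS implied by a MIPS/MHz rating at a given clock."""
    return mips_per_mhz * clock_mhz


def mips_per_mm2(mips: float, area_mm2: float) -> float:
    """Performance density: quoted MIPS divided by quoted core area."""
    return mips / area_mm2


RATING = 1.54  # Dhrystone 2.1 MIPS/MHz, taken from the slide

# (clock in MHz, MIPS quoted on the slide, core area in mm^2)
for clock_mhz, quoted_mips, area_mm2 in [(225, 350, 4.0), (333, 510, 2.1)]:
    implied = dhrystone_mips(RATING, clock_mhz)
    density = mips_per_mm2(quoted_mips, area_mm2)
    print(f"{clock_mhz} MHz: implied {implied:.1f} MIPS "
          f"(slide quotes {quoted_mips}), {density:.1f} MIPS/mm^2")
```

The implied values (about 346 and 513 MIPS) agree with the rounded slide figures to within roughly 1%.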



# Increasing System Demands Drive New Requirements

**Increasingly complex embedded 32-bit applications demand higher system performance:**

- Process isolation for better reliability and security; expanded use of protected-mode RTOS, such as Linux
  - Response = Memory Management Unit
- Much higher performance levels on complex applications
  - Response = Floating Point Unit



# Increasing System Demands Drive New Requirements

**Increasingly complex embedded 32-bit applications demand higher system performance:**

- DSP functionality on an MPU with a single, unified code stream
  - Response = Enhanced Multiply-Accumulate Unit, Dual-Ported Processor RAMs and User-Defined Address Permutation
- Numerically-intensive algorithms as well as general-purpose control processing
  - Response = On-Chip Multiprocessing



# What's New on the V4e

- Virtual memory management unit (MMU)
  - Address translation inside the core complex
  - Process partitioning
  - Expanded debug capabilities
  - Harvard, dual 32-entry, fully-associative TLBs
- Floating point unit (FPU)
  - Double-precision implementation of the MC68060 FP ISA
  - Concurrent execution between Operand Execution Pipeline & FPU
  - IEEE-754 compliant



# What's New on the V4e

- Enhanced multiply accumulate unit (MAC)
  - Single-cycle issue, optimized for 32x32 MACs
  - Four 48-bit accumulators
  - Expanded programming model
- Dual-port RAM with user-defined address permutation
- Hardware support for on-chip multiprocessing



# V4e ColdFire® Core



# ColdFire® Virtual MMU

- Virtual-to-physical address translation inside the core complex
- Software-managed translation look-aside buffer (TLB) with hardware address translation acceleration
  - Support for 1 KB, 4 KB, 8 KB, and 1 MB page sizes
  - Hardware assists for determining entry to be replaced
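
To make the page sizes concrete, the sketch below splits a 32-bit virtual address into a virtual page number and a page offset for each supported size. It assumes the sizes are in bytes; `PAGE_SIZES` and `split_virtual_address` are hypothetical names for illustration, not part of any ColdFire interface.

```python
from typing import Tuple

# Page sizes listed on the slide, assumed here to be in bytes.
PAGE_SIZES = {"1K": 1 << 10, "4K": 1 << 12, "8K": 1 << 13, "1M": 1 << 20}


def split_virtual_address(vaddr: int, page_size: int) -> Tuple[int, int]:
    """Return (virtual page number, offset within page) for a 32-bit address."""
    if page_size <= 0 or page_size & (page_size - 1):
        raise ValueError("page size must be a power of two")
    vaddr &= 0xFFFF_FFFF                     # keep only 32 address bits
    offset_bits = page_size.bit_length() - 1  # log2 of the page size
    return vaddr >> offset_bits, vaddr & (page_size - 1)


for name, size in PAGE_SIZES.items():
    vpn, offset = split_virtual_address(0x1234_5678, size)
    print(f"{name}: vpn={vpn:#x} offset={offset:#x}")
```

Larger pages leave fewer page-number bits to match, so one TLB entry covers more of the address space.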



# ColdFire® Virtual MMU

- 8-bit Address Space Identification register (ASID)
  - Expand virtual address to 40 bits: ASID + 32-bit address
  - ASID allows partitioning of user processes
- Expanded debug capabilities
  - ASID added to the ownership trace display and included in breakpoints
  - Complete visibility of user processes in debug



# ColdFire® Virtual MMU

- Initial implementation
  - Harvard, dual 32-entry, fully-associative TLBs
  - 65K gates
- Optional use for V4 and beyond



# ColdFire® Floating-Point Unit

- Tightly coupled execution unit within OEP
- 64-bit implementation of the MC68060 FP ISA
  - Operand formats: byte, word, or longword integer, or single- or double-precision floating point; all internal calculations are done in double precision
  - Concurrent execution between Operand Execution Pipeline & FPU
- IEEE-754 compliant
  - Denormalized numbers not supported in hardware
  - Full IEEE compliance with software assist



# ColdFire® Floating-Point Unit

- Analysis demonstrates a potentially large performance impact
  - Large image-processing execution time estimated using data from an MC68060 with one OEP enabled (similar to the V4 microarchitecture)
  - 1.4x to 1.9x improvement, depending on the exact image being processed
- 80K gates
- Optional FPU for V4 and beyond



# ColdFire® Enhanced Multiply-Accumulate

- 4-stage execution pipeline optimized for 32x32 MACs
  - Single-cycle issue
  - Word/longword, signed/unsigned, integer/fractional operands
  - Accumulator results stored back into integer register file
  - Four 48-bit accumulators



# ColdFire® Enhanced Multiply-Accumulate

- Expanded programming model
  - Load/store/copy accumulator instructions
  - MAC, MSAC opcodes with optional shift (<<1, >>1) on integer products
  - Single-cycle execution of M{S}AC + 32-bit LOAD instruction
  - Independent control of product rounding, rounding on store operations
  - Programmable control of saturation arithmetic
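
A simplified model of the MAC/MSAC operations described above, accumulating a signed 32x32 integer product into a 48-bit accumulator with an optional one-bit shift and optional saturation. It is a sketch under assumptions: the real unit's product width, rounding, and fractional modes are not modeled, and the `mac` function and its parameters are hypothetical names.

```python
ACC_BITS = 48
ACC_MAX = (1 << (ACC_BITS - 1)) - 1
ACC_MIN = -(1 << (ACC_BITS - 1))


def to_signed32(x: int) -> int:
    """Interpret the low 32 bits of x as a two's-complement value."""
    x &= 0xFFFF_FFFF
    return x - (1 << 32) if x & 0x8000_0000 else x


def mac(acc: int, a: int, b: int, shift: int = 0,
        subtract: bool = False, saturate: bool = True) -> int:
    """acc +/- (a * b), with shift = +1 (<<1), -1 (>>1), or 0 applied to the product."""
    product = to_signed32(a) * to_signed32(b)
    if shift == 1:
        product <<= 1
    elif shift == -1:
        product >>= 1
    elif shift != 0:
        raise ValueError("shift must be -1, 0, or +1")
    result = acc - product if subtract else acc + product
    if saturate:
        return max(ACC_MIN, min(ACC_MAX, result))  # clamp to the 48-bit range
    result &= (1 << ACC_BITS) - 1                  # otherwise wrap to 48 bits
    return result - (1 << ACC_BITS) if result >> (ACC_BITS - 1) else result


# Dot product of two short vectors, one MAC per element pair.
acc = 0
for x, y in zip([1, 2, 3], [4, 5, 6]):
    acc = mac(acc, x, y)
print(acc)
```

With saturation enabled, an overflowing sum clamps to the largest or smallest 48-bit value instead of wrapping around.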



# ColdFire® Enhanced Multiply-Accumulate

- Performance improvement: 1.5x on JPEG, 2.3x on 128-pt complex FFT
- 22K gates
- Optional acceleration module for all ColdFire cores
  - Deployed in mid-1999 with V2 application-specific device



# ColdFire® Dual-Ported RAMs

- Back-door port into processor-local RAM memories
  - DMA transfers directly into RAM
  - 2 halves with full concurrent accessibility
  - User-defined arbitration priority: CPU or DMA
  - Ideal for double-buffer schemes
    - Overlap CPU processing with DMA data movement
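
The double-buffer scheme can be sketched as a ping-pong loop over the two RAM halves. This model runs sequentially, so it shows only the role swapping, not real concurrency; `process_stream` and its arguments are hypothetical names, and a block value of `None` is reserved to mean "no more data".

```python
from typing import Callable, Iterable, List, Optional, Sequence


def process_stream(blocks: Iterable[Sequence[int]],
                   process: Callable[[Sequence[int]], int]) -> List[int]:
    """Alternate two buffers: one is being filled while the other is processed."""
    halves: List[Optional[Sequence[int]]] = [None, None]  # the two RAM halves
    results: List[int] = []
    pending = iter(blocks)
    fill = 0                                  # half currently owned by the "DMA"
    halves[fill] = next(pending, None)        # prime the first half
    while halves[fill] is not None:
        work, fill = fill, 1 - fill           # swap roles of the two halves
        halves[fill] = next(pending, None)    # "DMA" loads the next block
        results.append(process(halves[work])) # "CPU" works on the other half
    return results


print(process_stream([[1, 2], [3, 4], [5]], sum))
```

On the hardware described above, the load and the processing steps inside the loop would proceed at the same time, since each half can be accessed independently.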



# ColdFire® Dual-Ported RAMs

- User-defined address permutation
  - Maximize performance of DSP functions needing non-sequential complex addressing, without modifications to the existing pipeline or addressing modes
  - Three user-defined address permutation functions create multiple RAM address space maps
    - Can be used on addresses from both CPU and DMA
    - Allows any address bit “x” to be specified as: 0, 1, or any address bit “y”
    - Creates a 1-cycle pipeline stall on permuted address
      - Stall only on 1st access of a string of permuted references
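
The per-bit rule above (each address bit "x" is set to 0, 1, or a copy of address bit "y") can be modeled in a few lines. The sketch below is illustrative only: the `spec` encoding and function names are hypothetical, and bit-reversed addressing for an FFT is used as an assumed example of the non-sequential addressing the slide refers to.

```python
from typing import Dict, Union

# For each output bit x: an int y copies input bit y; "0" or "1" forces a constant.
BitSource = Union[int, str]


def permute_address(addr: int, spec: Dict[int, BitSource]) -> int:
    """Apply a permutation spec; bits not listed in spec pass through unchanged."""
    out = addr
    for x, src in spec.items():
        if src == "0":
            bit = 0
        elif src == "1":
            bit = 1
        elif isinstance(src, int):
            bit = (addr >> src) & 1  # always read from the original address
        else:
            raise ValueError(f"bad source for bit {x}: {src!r}")
        out = (out & ~(1 << x)) | (bit << x)
    return out


def bit_reverse_spec(n_bits: int) -> Dict[int, BitSource]:
    """Spec that reverses the low n_bits of an address."""
    return {x: n_bits - 1 - x for x in range(n_bits)}


spec = bit_reverse_spec(3)
print([permute_address(a, spec) for a in range(8)])
```

With such a mapping, a loop can step through addresses sequentially while the RAM is accessed in permuted order, which matches the stated goal of leaving the pipeline and addressing modes unchanged.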



# ColdFire® On-Chip Multiprocessing (CMP)

- Architectural Enhancements to Support CMP
  - Hardware extensions to support basic multiprocessing needs
    - CPU run/halt control
    - CPU-to-CPU communication mechanism
    - Interrupt steering, debug control, etc.
    - User-mode processor instruction to read CPU number
  - Memory coherency maintained by explicit software control



# ColdFire® On-Chip Multiprocessing (CMP)

- Application-specific processor for consumer electronics
  - Dual V4 cores, each with EMAC + dual-ported RAMs
  - Software architecture is a master/slave configuration
    - Master controls system tasks; slave is algorithm engine
  - Implemented in 0.18 µm process
    - > 600 aggregate MIPS @ 200 MHz



# Version 4e ColdFire Core Summary

- Innovative architectural solutions meeting varied customer demands and achieving superior system performance through optional processor integration
- Continues to reap the benefits of 100% synthesizability, easily moving to new, higher-performance technologies
- Builds on the ColdFire Family legacy of low-cost, high-performance solutions, with first silicon in 2Q01


