

### **Module Introduction**

#### **Purpose**

- This training module covers the 68K/ColdFire architecture.

#### **Objectives**

- Explain the features of the V2, V3, V4, V4e, V5, and V5e ColdFire cores.
- Explain the features and functionality of the Floating Point Unit (FPU), Memory Management Unit (MMU), Multiply-Accumulate Unit (MAC), and the Enhanced Multiply-Accumulate Unit (eMAC)
- Describe the features of the Hardware Divide module.

#### Content

- 17 pages
- 3 questions

#### **Learning Time**

- 30 minutes

This module introduces you to the variable-length RISC ColdFire architecture, which gives customers greater flexibility to lower memory and system costs. Because instructions can be 16, 32, or 48 bits long, code is packed more tightly in memory, resulting in better code density than traditional 32- and 64-bit RISC machines. More efficient use of on-chip memory reduces the bus bandwidth and external memory required, which results in lower system cost.

In this module, we will discuss the features of the V2, V3, V4, V4e, V5, and V5e ColdFire cores, the features and functionality of the Floating Point Unit (FPU), Memory Management Unit (MMU), Multiply-Accumulate Unit (MAC), and the Enhanced Multiply-Accumulate Unit (eMAC). We will also explore the Hardware Divide module.







- Generations of ColdFire® cores V2, V3, V4, and V5
  - Variable Length Encoding (VLE) instruction set, with most instructions just 16 bits wide, resulting in very compact code
  - Hardware Divide
  - Debug module
- Optional Modules
  - Multiply/Accumulate (MAC) module
  - Enhanced Multiply/Accumulate (eMAC) module
  - Memory Management Unit (MMU)
  - Floating Point Unit (FPU)

The programming model register set is derived from, and directly compatible with, the 68K family of processors.

There are various versions in the ColdFire family, including V2, V3, V4 and V4e, and V5 and V5e. The core uses a variable-length RISC architecture that allows instructions to be 16, 32, or 48 bits in length. The result is more efficiently packed code in memory, which reduces memory requirements and lowers overall system cost. Each core includes a Hardware Divide unit and a Debug module.
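To make the code-density argument concrete, here is a minimal sketch that compares the memory footprint of a short instruction sequence under variable-length and fixed 32-bit encodings. The instruction mix is invented for illustration and is not measured ColdFire data; the comparison also assumes both machines need the same number of instructions, which will not hold in general.

```python
# Illustrative only: the instruction mix below is made up, not measured data.
def code_size_bytes(lengths_bits):
    """Total size in bytes of a sequence of instructions given their lengths in bits."""
    return sum(bits // 8 for bits in lengths_bits)

# Hypothetical 10-instruction sequence: mostly 16-bit, with a few longer encodings.
variable_mix = [16, 16, 16, 32, 16, 48, 16, 16, 32, 16]

# The same number of instructions on a fixed-length 32-bit machine (an assumption).
fixed_32 = [32] * len(variable_mix)

variable_bytes = code_size_bytes(variable_mix)   # 28 bytes
fixed_bytes = code_size_bytes(fixed_32)          # 40 bytes
print(variable_bytes, fixed_bytes)
```

With this particular mix, the variable-length sequence occupies 28 bytes against 40 bytes for the fixed-length one. The real saving depends entirely on the instruction mix the compiler produces.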

Here are the ColdFire optional modules.

These include the Multiply-Accumulate (MAC) module, which provides high-speed, complex arithmetic processing for simple signal processing applications.

The Enhanced Multiply-Accumulate (eMAC) module is based on the original MAC but is optimized for 32x32-bit operations. It supports the execution of DSP operations within the context of a single processor at minimal hardware cost.

The Memory Management Unit (MMU) provides virtual-to-physical address translation and memory access control.

The Floating-Point Unit (FPU) implements floating-point instructions designed for use with the ColdFire family of microprocessors. These modules will be covered in more detail later in the training.





Next, let's take a closer look at each ColdFire version, beginning with V2. Even though this is the second generation ColdFire core, it is the first "true" ColdFire core.

The V2 ColdFire core is implemented using a semi-custom, standard cell-based design methodology. The resulting integrated processor allows significant reductions in component count, power consumption, board space, and cost, resulting in higher system reliability and performance. The ColdFire microarchitecture and design offer exceptional integration capabilities through a 100% synthesized design, compiled memories, a hierarchical on-chip bus structure, and industry-leading debug modules.

The processor core is comprised of two separate pipelines that are decoupled by an instruction buffer. The two-stage Instruction Fetch Pipeline (IFP) is responsible for instruction-address generation and instruction fetch. The instruction buffer is a first-in-first-out (FIFO) buffer that holds prefetched instructions awaiting execution in the Operand Execution Pipeline (OEP). The OEP includes two pipeline stages. The first stage decodes instructions and selects operands; the second stage performs instruction execution and calculates operand effective addresses, if needed.





- V3 ColdFire Core
  - Third generation ColdFire core
  - Enhanced design & high core frequency result in fast instruction execution
  - Enhanced Version 2 core design with a 2-stage pipelined bus for improved MHz
    - Four stage instruction fetch pipeline
      - Instruction address generation
      - Two stage instruction fetch
      - Instruction Early Decode
    - Eight entry instruction buffer decouples IFP/OEP
    - Two stage operand execution pipeline
      - Decode & select/operand cycle
      - Address generation/execute cycle

The third generation, the V3 ColdFire core, delivers several enhancements. These include a refined instruction prefetch pipeline, branch prediction capabilities, and higher frequencies of operation. These improvements allow the V3 ColdFire core to provide more performance than the V2 ColdFire core for a given technology process, making it an attractive solution for new designs or upgrading existing systems.

As with all ColdFire cores, the V3 ColdFire core is comprised of two separate pipelines that are decoupled by an instruction buffer.

The instruction fetch pipeline (IFP) is a four-stage pipeline for prefetching instructions. The prefetched instruction stream is then gated via an eight-entry instruction buffer into the two-stage operand execution pipeline (OEP), which decodes the instruction, fetches the required operands, and then executes the required function. Because the IFP and OEP are decoupled by an instruction buffer that serves as a FIFO queue, the IFP can prefetch instructions in advance of their actual use by the OEP, minimizing the time the OEP spends stalled waiting for instructions.
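The benefit of decoupling the two pipelines with a FIFO can be sketched with a toy cycle model. Everything below is a simplification made up for illustration: the function name `simulate`, the one-instruction-per-cycle fetch rate, and the stall pattern are assumptions, and the model ignores branches and the internal pipeline stages entirely.

```python
from collections import deque

def simulate(fetch_ready, exec_cycles, depth):
    """Count the cycles in which the OEP is idle because no instruction is buffered.

    fetch_ready: per-cycle booleans, True when the IFP can deliver an instruction
    exec_cycles: cycles each instruction occupies the OEP, in program order
    depth:       capacity of the FIFO instruction buffer
    """
    fifo = deque()
    next_fetch = 0   # index of the next instruction the IFP will deliver
    busy = 0         # cycles remaining for the instruction currently in the OEP
    done = 0
    starved = 0
    cycle = 0
    total = len(exec_cycles)
    while done < total:
        # OEP side: start a new instruction if idle, otherwise keep executing.
        if busy == 0:
            if fifo:
                busy = exec_cycles[fifo.popleft()]
            else:
                starved += 1
        if busy > 0:
            busy -= 1
            if busy == 0:
                done += 1
        # IFP side: prefetch ahead whenever memory is ready and the FIFO has room.
        ready = fetch_ready[cycle] if cycle < len(fetch_ready) else True
        if ready and next_fetch < total and len(fifo) < depth:
            fifo.append(next_fetch)
            next_fetch += 1
        cycle += 1
    return starved

# Hypothetical workload: two slow instructions, then a four-cycle fetch stall.
exec_cycles = [3, 3, 1, 1, 1, 1, 1, 1]
fetch_ready = [True] * 4 + [False] * 4

deep = simulate(fetch_ready, exec_cycles, depth=8)      # eight-entry buffer
shallow = simulate(fetch_ready, exec_cycles, depth=1)   # almost no buffering
print(deep, shallow)
```

In this scenario the eight-entry buffer lets the IFP run ahead while the OEP is busy with the slow instructions, so the later fetch stall is hidden. With a one-entry buffer the OEP runs dry during the stall.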



# V4 & V4e ColdFire Core




- Fourth generation ColdFire core
- Independent, decoupled pipelines
  - 4-stage instruction fetch pipeline (IFP)
  - 5-stage operand execution pipeline
  - FIFO I-buffer is the decoupling mechanism
- Limited superscalar execution through the use of instruction folding
  - Approaches dual-issue performance but at a much lower silicon cost
  - Most instructions execute in one clock cycle
- 32-bit Harvard memory architecture
  - The Harvard architecture requires dual local busses; one for instructions and one for data
  - Doubles available bus bandwidth
- Sophisticated 2-level branch acceleration mechanisms in the IFP minimize execution times of change-of-flow instructions
- V4e core adds Memory Management Unit (MMU) and double-precision Floating Point Unit (FPU)

The V4 ColdFire core allows for greater performance (over 300 Dhrystone 2.1 MIPS) than the V2 and V3 ColdFire cores. The independent, decoupled pipelines consist of a four-stage IFP and a five-stage OEP, with a 10-instruction FIFO as the decoupling mechanism.

The V4 core is a limited superscalar design that approaches dual-issue performance with much lower silicon cost. Instruction folding is used to approach the dual-issue performance. This results in the V4 core executing most instructions in one cycle.

The V4 core also uses a 32-bit Harvard architecture. The Harvard memory architecture requires dual busses; one for instructions and one for data. This memory architecture doubles available bus bandwidth.

To maximize the performance of conditional branch instructions, the IFP implements a sophisticated two-level acceleration mechanism. The first level is an 8-entry branch cache. In the event of a branch cache hit, if the branch is predicted as taken, the branch cache sources the target address. For conditional branches that miss in the branch cache, a second-level prediction table is accessed.
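The two-level lookup described above can be sketched as a small software model. Only the 8-entry branch cache and the existence of a second-level prediction table come from this module; the class name `BranchAccelerator`, the table size, the 1-bit taken/not-taken state, the indexing by address, and the replacement policy are all assumptions chosen to keep the sketch short.

```python
from collections import OrderedDict

class BranchAccelerator:
    """Toy two-level predictor: a small branch cache backed by a prediction table."""

    def __init__(self, cache_entries=8, table_size=64):
        self.cache = OrderedDict()          # pc -> (predicted_taken, target)
        self.cache_entries = cache_entries
        self.table = [False] * table_size   # hypothetical 1-bit entries

    def predict(self, pc):
        """Return (level_used, predicted_taken, target_or_None)."""
        if pc in self.cache:
            taken, target = self.cache[pc]
            # On a hit predicted as taken, the cache itself supplies the target.
            return ("branch-cache", taken, target if taken else None)
        # Miss in the branch cache: fall back to the second-level table.
        return ("prediction-table", self.table[pc % len(self.table)], None)

    def update(self, pc, taken, target):
        """Record the resolved outcome of a branch (hypothetical update policy)."""
        self.table[pc % len(self.table)] = taken
        self.cache[pc] = (taken, target)
        self.cache.move_to_end(pc)
        if len(self.cache) > self.cache_entries:
            self.cache.popitem(last=False)  # evict the oldest entry

predictor = BranchAccelerator()
first = predictor.predict(0x1000)           # nothing known yet
predictor.update(0x1000, True, 0x2000)
second = predictor.predict(0x1000)          # now served by the branch cache
```

The point of the model is the ordering: the small, fast cache is consulted first because it can supply a target address, and the larger table only supplies a direction.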

If the optional FPU and memory management units are included in the design, the resulting core is named V4e.





The V5 core currently provides leading-edge performance for the ColdFire family of embedded processors. Its microarchitecture leverages the basic pipeline organization of the V4 core but adds a second operand execution pipeline to fully support sustained 2-instruction per cycle execution rates.

The V5 core also supports the next generation of the ColdFire instruction set architecture, ISA\_C, and the next revision of BDM, rev E; yet the V5 core retains socket compatibility with the version 4 core.

To keep the dual operand execution pipelines operating at maximum throughput, the width of the internal Harvard memory architecture is increased to 64 bits.

The V5 core also has a much larger and more sophisticated branch cache than the V4 core.

In addition to the basic ALU capabilities provided in the dual operand execution pipelines, the V5 core includes a superscalar eMAC, capable of 2 MAC operations per cycle.

If the optional FPU and memory management units are included in the design, the resulting core is named V5e.



### Question

Which version of the ColdFire processor uses a four-stage IFP and a five-stage OEP? Click on your choice.

- A. V2
- B. V3
- C. V4
- D. V5

Can you identify the different versions of the ColdFire cores?

#### Correct!

The ColdFire V4 core has a four-stage IFP and a five-stage OEP. The V2 core has a two-stage IFP and a two-stage OEP, while the V3 core has a four-stage IFP and a two-stage OEP. The V5 core has a four-stage IFP and dual five-stage OEPs.







- Leverages the 68K programming model with 8 general-purpose FPn registers plus 3 control registers
- General-purpose registers are 64 bits in width supporting double-precision FPU operands
- Conforms to ANSI/IEEE standard 754
- Optimized for real-time execution with exceptions disabled and default results provided for specific operations, operands, and number types
- Also supports single precision and signed integer (byte, word, long) operands
- Similar to the FPU on the 68040 and 68060

The ColdFire floating-point unit is a 64-bit implementation of the programming model originally defined by the 68K processor family. The FPU register file includes 8 general-purpose registers, each 64 bits in size, along with 3 control registers.

It conforms to the IEEE 754 standard and has been optimized for maximum performance with exceptions disabled and default results generated for numeric boundary conditions. The FPU operates on a variety of data types, including single- and double-precision FP data and signed byte-, word- or longword integers, using an ISA similar to that of the MC68040/MC68060 processors.
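As a rough illustration of the data formats and default results mentioned above, the following sketch uses the host's IEEE 754 arithmetic. It runs on whatever machine executes Python, not on a ColdFire FPU, so it shows the standard's behavior rather than anything specific to the ColdFire implementation.

```python
import math
import struct

# Operand widths: double precision fills a 64-bit register, single precision is 32 bits.
double_bits = len(struct.pack(">d", 1.0)) * 8
single_bits = len(struct.pack(">f", 1.0)) * 8

# Rounding a value to single precision and back loses low-order bits.
as_single = struct.unpack(">f", struct.pack(">f", 0.1))[0]

# With exceptions disabled, IEEE 754 defines default results for boundary conditions.
overflow_result = 1e308 * 10           # overflow produces infinity
invalid_result = math.inf - math.inf   # invalid operation produces NaN

print(double_bits, single_bits, as_single, overflow_result, invalid_result)
```

Returning a default result instead of trapping is what lets real-time code keep running at full speed and check for infinities or NaNs only where it matters.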



### **Memory Management Unit (MMU)**

- Flexible, software-defined virtual environment
- Unlike 68K family, no support for hardware tablewalk
- Fault status and recovery information functions
- 32-entry fully associative instruction and data TLBs (Translation Look-aside Buffer)
- Support for 4-Kbyte, 8-Kbyte, 1-Mbyte, and 16-Mbyte page sizes concurrently
- No performance penalty on TLB hits
- Instruction restart exception model implemented in processor to support access faults caused by TLB misses

The ColdFire memory management unit provides a flexible, software-defined virtual environment.

Unlike its 68K predecessors, the ColdFire MMU does not support hardware tablewalking of memory-resident page tables. Rather, the ColdFire solution provides complete flexibility for software to define and manage the system address space.

Information for access error fault processing is stored in the MMU. A precise fault (transfer error acknowledge) signals the core on translation (TLB miss) and access faults. The core supports an instruction restart model for this fault class. Note that this structure uses the existing ColdFire access fault vector.

Available as an option on the V4 and V5 cores, the MMU implementation includes two fully associative 32-entry Translation Lookaside Buffers (TLBs), one for instructions and one for data, to map the most recently referenced address regions.

The MMU architecture defines support for page sizes of 4 Kbytes, 8 Kbytes, 1 Mbyte, and 16 Mbytes, and the fully associative TLBs allow entries mapping different page sizes to be freely mixed.

The MMU functionality is fully integrated into the existing pipeline structure, so there is no performance degradation on TLB hits. In the event of a TLB miss, the processor provides an instruction restart exception model: once the miss has been serviced by software and the new entry loaded into the TLB, the process is restarted simply by executing an RTE (return from exception) to the instruction that originally caused the page fault.
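The software-managed flow (miss, software refill, restart) can be sketched as follows. The 32-entry capacity and the four page sizes come from this module; the class and function names, the FIFO replacement policy, and the flat list used as a page table are hypothetical stand-ins, since the real data structures are whatever the operating system chooses.

```python
PAGE_SIZES = (4 * 1024, 8 * 1024, 1024 * 1024, 16 * 1024 * 1024)

class TLBMiss(Exception):
    """Stands in for the access-fault exception taken on a TLB miss."""

class SoftwareTLB:
    def __init__(self, entries=32):
        self.entries = entries
        self.slots = []   # (virt_base, phys_base, page_size) tuples

    def translate(self, vaddr):
        # Fully associative: every entry is checked, whatever its page size.
        for virt_base, phys_base, size in self.slots:
            if virt_base <= vaddr < virt_base + size:
                return phys_base + (vaddr - virt_base)
        raise TLBMiss(vaddr)

    def load(self, virt_base, phys_base, size):
        if size not in PAGE_SIZES or virt_base % size or phys_base % size:
            raise ValueError("unsupported page size or misaligned base")
        if len(self.slots) == self.entries:
            self.slots.pop(0)   # hypothetical FIFO replacement policy
        self.slots.append((virt_base, phys_base, size))

def access(tlb, page_table, vaddr):
    """Translate vaddr, refilling the TLB in software and retrying on a miss."""
    while True:
        try:
            return tlb.translate(vaddr)
        except TLBMiss:
            # Software-defined lookup; no hardware tablewalk is involved.
            for virt_base, phys_base, size in page_table:
                if virt_base <= vaddr < virt_base + size:
                    tlb.load(virt_base, phys_base, size)
                    break        # loop back and restart the access
            else:
                raise            # no mapping exists: a genuine fault

# Hypothetical mappings mixing a 4-Kbyte page and a 16-Mbyte page.
page_table = [
    (0x0000_0000, 0x8000_0000, 4 * 1024),
    (0x0100_0000, 0x9000_0000, 16 * 1024 * 1024),
]
tlb = SoftwareTLB()
small = access(tlb, page_table, 0x0000_0123)
large = access(tlb, page_table, 0x0123_4567)
```

The retry loop plays the role of the RTE: after the handler loads the entry, the same access is simply attempted again and now hits in the TLB.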



### Multiply-Accumulate (MAC) Unit

#### Features:

- Integrated into the Operand Execution Pipeline
- Implements a 3 stage arithmetic pipeline optimized for 16x16 multiplies
- Provides hardware support for a limited number of DSP operations used in embedded code
- Provides signal processing capabilities in applications such as digital audio and servo control
- This MAC is featured on the V2, V3 and V4 ColdFire Cores

#### **Functionality:**

Functionality is provided in three related areas:

- Signed and unsigned integer multiplies
- Multiply-accumulate operations supporting:
  - 16x16 [un]signed multiply-accumulate in 1 clock cycle
  - 32x32 [un]signed multiply-accumulate in 3 clock cycles
  - Signed fixed-point fractional operands
  - Product may be shifted once right or left prior to accumulation
- Register-based arithmetic operations



To begin, let's look at the multiply-accumulate (MAC) unit. This unit provides hardware support for a limited set of digital signal processing (DSP) operations used in embedded code. The MAC supports the integer multiply instructions in the ColdFire microprocessor family.

The MAC unit is integrated into the Operand Execution Pipeline (OEP). This unit implements a three-stage arithmetic pipeline optimized for 16x16 multiplies. Both 16- and 32-bit operands are supported by this design, along with a full set of extensions for signed and unsigned integers. Signed, fixed-point fractional input operands are also supported. The MAC provides signal processing capabilities for ColdFire in a variety of applications, including digital audio and servo control. Note that this MAC is featured on the V2, V3, and V4 ColdFire cores.

The functionality of the MAC is provided in three related areas: signed and unsigned integer multiplies; multiply-accumulate operations supporting signed, unsigned, and signed fractional operands; and miscellaneous register operations. The logic that supports this functionality is contained in the MAC module, as shown in the figure.
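A multiply-accumulate step is easy to express in software, which helps show what the hardware does in a single operation. The sketch below models a signed 16x16 multiply whose product is optionally shifted by one bit and then added to a 32-bit accumulator. The function names are hypothetical, and the model assumes the accumulator wraps on overflow rather than saturating.

```python
def to_signed(value, bits):
    """Interpret the low `bits` of value as a two's-complement signed integer."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value & (1 << (bits - 1)) else value

def mac16(acc, a, b, shift=0):
    """Return acc + (a * b) for signed 16-bit a and b.

    shift: +1 shifts the product left one bit, -1 shifts it right one bit,
           0 leaves it unchanged, before it is accumulated.
    The result wraps to 32 bits (an assumption; saturation is not modeled).
    """
    product = to_signed(a, 16) * to_signed(b, 16)
    if shift == 1:
        product <<= 1
    elif shift == -1:
        product >>= 1
    return to_signed(acc + product, 32)

def dot_product(xs, ys):
    """A typical DSP inner loop: one multiply-accumulate per sample pair."""
    acc = 0
    for x, y in zip(xs, ys):
        acc = mac16(acc, x, y)
    return acc

result = dot_product([1, 2, 3], [4, 5, 6])   # 1*4 + 2*5 + 3*6 = 32
```

Filters and correlations reduce to loops like `dot_product`, which is why performing each iteration's multiply and add as one hardware operation pays off.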



## MAC Programming Model



- The MAC unit provides a common set of simple DSP operations
  - Also speeds integer multiply within ColdFire core
- 16x16 and 32x32 multiplies with 32-bit accumulates
- Input operands are contained in two data registers
- ACC is a 32-bit register used to accumulate the results of MAC operations
- MASK is a 16-bit register useful in implementing circular queues in memory
- MACSR is an 8-bit register which defines the operation of the MAC and contains flags indicating results from the MAC

Here is the MAC programming model. It provides a common set of simple DSP operations, while speeding up integer multiplies within the ColdFire core. It supports 16x16 and 32x32 multiplies with 32-bit accumulates. The input operands are contained in two data registers.

The Accumulator (ACC) is a 32-bit register that is used to accumulate the results of MAC operations.

The Mask register (MASK) is a 16-bit register that is useful in implementing circular queues in memory.

The MAC Status Register (MACSR) is an 8-bit register that defines the operation of the MAC and contains flags indicating results from the MAC.
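To see why a mask register helps with circular queues, consider the sketch below. It assumes the operand address is ANDed with a 32-bit value formed from 0xFFFF in the upper half and the 16-bit MASK in the lower half; that detail is not stated in this module and should be checked against the reference manual. The function names and the buffer address are hypothetical.

```python
def masked_address(address, mask):
    """AND a 32-bit address with {0xFFFF, MASK[15:0]} (assumed combination)."""
    return address & (0xFFFF0000 | (mask & 0xFFFF))

def circular_walk(base, step, count, mask):
    """Generate `count` operand addresses, wrapping the pointer through the mask."""
    addresses = []
    pointer = base
    for _ in range(count):
        addresses.append(pointer)
        pointer = masked_address(pointer + step, mask)
    return addresses

# Hypothetical 16-byte buffer of four longwords whose base has its low 16 bits clear.
walk = circular_walk(base=0x0020_0000, step=4, count=6, mask=0x000F)
print([hex(a) for a in walk])
```

After the fourth longword the pointer wraps back to the buffer base without any compare-and-branch in the loop, which is the saving a hardware mask provides.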



### **Enhanced MAC (eMAC)**

- Four 48-bit accumulators
- Multiplication capabilities
  - 40-Bit Products
    - 16x16
    - 32x32

#### Additional functions

- Signed and unsigned integer multiplies
- Multiply-accumulate operations supporting signed and unsigned integer operands as well as signed, fixed-point fractional operands
- Miscellaneous register operations

#### Additional operands

- Signed Integers
- Unsigned Integers
- Signed, fixed-point fractions



The Enhanced Multiply-Accumulate (eMAC) unit can be found on most V2 cores and on V3 and V4 cores. The eMAC is faster and more accurate for math-intensive software algorithms, such as MP3 audio decoding and encoding. The eMAC features four 48-bit accumulators and allows for 16x16 and 32x32 multiplies with a 40-bit product.

The eMAC unit provides functionality in three related areas. First, it provides signed and unsigned integer multiplies. It also provides miscellaneous register operations. Lastly, it provides multiply-accumulate operations supporting signed and unsigned integer operands as well as signed, fixed-point fractional operands.



### **Enhanced MAC (eMAC)**

#### Comparison of the eMAC versus MAC:

#### **eMAC**

- Four stage execution pipeline
- Optimized for 32-bit operands
- 32x32 multiply array
- Four 48-bit accumulators
- 40-bit products

#### MAC

- Three stage execution pipeline
- Optimized for 16-bit operands
- 16x16 multiply array
- Single 32-bit accumulator
- 32-bit products

Here is a comparison of the eMAC unit versus earlier ColdFire parts with the MAC unit. Notice that the eMAC unit provides a four-stage execution pipeline optimized for 32-bit operands, a fully pipelined 32x32 multiply array, and four 48-bit accumulators. Finally, it implements a 48-bit data path to allow the use of a 40-bit product plus the addition of eight extension bits.
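The practical value of the wider accumulator is headroom. The following sketch sums the same products into a 32-bit and a 48-bit accumulator; it assumes both wrap on overflow (saturation is not modeled), and the helper names are hypothetical.

```python
def wrap_signed(value, bits):
    """Wrap value into a two's-complement signed integer of the given width."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value & (1 << (bits - 1)) else value

def accumulate(products, acc_bits):
    """Sum products into an accumulator of acc_bits, wrapping on overflow."""
    acc = 0
    for product in products:
        acc = wrap_signed(acc + product, acc_bits)
    return acc

# Four maximum positive signed 16x16 products.
products = [0x7FFF * 0x7FFF] * 4
true_sum = sum(products)

mac_result = accumulate(products, 32)    # single 32-bit accumulator overflows
emac_result = accumulate(products, 48)   # 48-bit accumulator keeps the exact sum
print(true_sum, mac_result, emac_result)
```

Only four worst-case products are enough to overflow 32 bits, whereas the extension bits of the wider accumulator absorb long sums before any overflow handling is needed.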



### Question

Which of the following are features of the MAC unit? Select all that apply and then click Done.

- A. Integrated into the Operand Execution Pipeline
- B. Three-stage arithmetic pipeline
- C. Signal processing capabilities
- D. A hardware divider

Take a minute now to identify the features of the MAC unit.

#### Correct!

Integration into the operand execution pipeline, a three-stage arithmetic pipeline, and signal processing capabilities are all features of the MAC unit.



### **Hardware Divide Module**

- A hardware divider that performs a number of integer divide operations:
  - 32/16 producing a 16-bit quotient and 16-bit remainder
  - 32/32 producing a 32-bit quotient
  - 32/32 producing a 32-bit remainder
- Included in all current ColdFire devices

Now, let's look at the hardware divide module. Like the MAC unit, it is coupled to the core's operand execution pipeline. It allows processors to support signed divide, unsigned divide, and remainder instructions. With this module, divide and remainder instructions can take up to 38 clocks to execute. The actual execution time may be less depending on the addressing mode, operand size, and operand values. The hardware divide module is included in all current ColdFire devices.
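The three divide forms listed on the slide can be modeled in a few lines. This sketch covers only the signed variants and assumes the quotient truncates toward zero and the remainder takes the sign of the dividend, as on the 68K family; the function names are hypothetical, and condition codes and the unsigned forms are not modeled.

```python
def trunc_div(dividend, divisor):
    """Quotient truncated toward zero, remainder with the sign of the dividend."""
    if divisor == 0:
        raise ZeroDivisionError("divide by zero")
    quotient = abs(dividend) // abs(divisor)
    if (dividend < 0) != (divisor < 0):
        quotient = -quotient
    return quotient, dividend - quotient * divisor

def divide_32_by_16(dividend, divisor):
    """32/16 signed divide producing a 16-bit quotient and a 16-bit remainder."""
    quotient, remainder = trunc_div(dividend, divisor)
    if not -0x8000 <= quotient <= 0x7FFF:
        # The quotient must fit in 16 bits; otherwise the operation overflows.
        raise OverflowError("quotient does not fit in 16 bits")
    return quotient, remainder

def divide_32_by_32(dividend, divisor):
    """32/32 signed divide producing a 32-bit quotient."""
    return trunc_div(dividend, divisor)[0]

def remainder_32_by_32(dividend, divisor):
    """32/32 signed divide producing a 32-bit remainder."""
    return trunc_div(dividend, divisor)[1]

word_result = divide_32_by_16(100000, 7)     # (14285, 5)
long_quotient = divide_32_by_32(-100, 7)     # -14
long_remainder = remainder_32_by_32(-100, 7) # -2
```

The 32/16 form is the one that can overflow for ordinary inputs, because a 32-bit dividend can easily produce a quotient wider than 16 bits.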



### Question

**True or False:** The V4 core provides limited superscalar execution through the use of instruction folding, which approaches dual-issue performance but at a much lower silicon cost.

- A. True
- B. False

Take a moment now to answer this question.

#### Correct!

There is limited superscalar execution through the use of instruction folding, which approaches dual-issue performance but at a much lower silicon cost.



### **Module Summary**

- ColdFire Cores V2, V3, V4, V4e, V5
- FPU
- MMU
- MAC
- eMAC
- Hardware Divide

In this module, you learned about the features of the ColdFire cores and the features and functionality of the optional modules.

The ColdFire core versions were discussed, from V2 up to the leading-edge V5 core architecture.

You also learned about the optional modules, which enhance ColdFire performance and features. The Floating-Point Unit is designed to carry out operations on floating-point numbers efficiently.

The Memory Management Unit provides complete flexibility for software to define and manage the system address space.

The Multiply-Accumulate unit offers hardware support for a limited set of digital signal processing (DSP) operations. It is optimized for 16-bit operands and has a single 32-bit accumulator.

The Enhanced Multiply-Accumulate unit extends the features of the original MAC. It is optimized for 32-bit operands and has four 48-bit accumulators.

Finally, the Hardware Divide module allows the processor to support signed and unsigned divide and remainder instructions.