

June, 2010

# Leveraging the QorIQ Data Path Acceleration Architecture (DPAA) for Wireless Applications

FTF-NET-F0704



**Stephen Cole** – Senior Systems Architect

**Jonas Svennebring** – Senior Software FAE





# Agenda

- Challenges in Wireless Industry
- QorIQ Application Mapping
- Data Path Acceleration: Quick Recap
- Packet I/O
- PME: Traffic Monitoring and Management
- SEC: Ciphering and Packet Formatting
- Use Case: Voice Packet Processing
- Use Case: HSPA Mobile Broadband





# **Challenges in Wireless Industry**





## **Trends / Challenges**

- Mobile broadband has taken off.
  - High bandwidths
  - High user peak rates
  - Low network latency
- Smartphones (iPhone etc.) require intensive signaling, i.e. control processing.
- WCDMA follows the LTE track in development
  - Competitive performance with HSPA+
  - Support for flat architecture
- Home base station: will 2010/11 be the breakthrough?
- All IP? Investments in ATM need a smooth migration, i.e. IP over ATM.
- Content caches in the field to offload the main pipe to the Internet?





# **Challenge: Mobile Broadband**

Uplink peak rates:

- R99: 384 kbps
- R06, HSUPA "basic": 1.4 Mbps
- R06, HSUPA 2 ms TTI: 5.8 Mbps
- R07, 16QAM: 12 Mbps
- R09-11, multicarrier: 20-40 Mbps

Downlink peak rates:

- R99: 384 kbps
- R05, HSDPA "basic": 3.6 Mbps
- R05, 15 codes: 14.4 Mbps
- R07, 64QAM / 2x2 MIMO: 21 Mbps / 28 Mbps
- R08, MIMO + 64QAM: 42 Mbps
- R09-11, multicarrier + high-order MIMO: 80-160 Mbps

R12 and later: 400+ Mbps

Very high peak rates! **Huge total bandwidth!** How can this be handled within a device?





## **WCDMA** and LTE Core Networks







# **QorIQ Application Mapping**



# QorIQ™ Investment Continues Our Embedded Leadership Tradition

## Our 15+ year Heritage:

### 3<sup>rd</sup> Generation Data Path

- Gen-1: CPM MPC8260
- Gen-2: QUICC Engine™ MPC8360
- Gen-3: DPAA QorIQ P4080

## Accelerating Connectivity

- eTSEC
- SEC 4.0
- PME 2.0
- PCIe, Serial RapidIO<sup>®</sup>, XAUI

#### Power Architecture<sup>™</sup> ISA

- e500 PowerQUICC® III
- e500 QorIQ P1, P2 platforms
- e500mc QorIQ P3, P4 platforms
- Long life cycle management for industry







# **QorIQ<sup>™</sup> Platform Levels matching Wireless Applications**

#### 45nm PLATFORMS / PRODUCTS







## **QorIQ P4 Series P4080 Block Diagram**








## **QorIQ P5 Series P5020 Block Diagram**



- Pin Compatible to P4080, P4040 & P3041
- 45nm SOI Process

#### 2x e500mc-64 core, built on Power Architecture® technology

- 2x 64-bit cores (up to 2+ GHz) with 512 KB backside L2 cache
- Dual 1MB Shared L3 cache w/ECC

#### Memory Controller

- Dual 32/64-bit DDR3/3L w/ECC up to 1.3 GHz

#### Ethernet

- 5 x 10/100/1000 Ethernet controllers
- 1 x 10GE controller (XAUI)

#### High Speed Interconnect

- 4 PCI Express® 2.0 controllers
- 2 Serial RapidIO<sup>®</sup> 1.3 + 2.0 controllers
- 2 SATA 3Gb/s
- 2 USB 2.0 with PHY

#### CoreNet Switch Fabric

#### Trusted Architecture

#### Data Path Acceleration Architecture

- Security Engine (SEC)
- Pattern Matching Engine (PME)
- RAID 5/6 Engine
- Enhanced RapidIO Messaging (RMan)






## **QorIQ P3 Series P3041 Block Diagram**



- Pin Compatible to P4080, P4040, P5020 & P5010
- 45nm SOI Process

#### Quad e500mc Power Architecture®

- 4 cores (up to 1.5 GHz) with 128KB backside L2 cache
- 1MB Shared L3 Cache w/ECC

#### **Memory Controller**

- 32/64-bit DDR3/3L w/ECC up to 1.3 GHz

#### **Ethernet**

- 5 x 10/100/1000 Ethernet Controllers
- 1 x 10GE Controller

#### **High Speed Interconnect**

- 4 PCIe 2.0 Controllers
- 2 SRIO 1.3 + 2.0 Controllers
- 2 SATA 2.0
- 2 USB 2.0 w/PHY

#### **CoreNet Switch Fabric**

#### **Trusted Architecture**

#### **Datapath Acceleration Architecture**

- Security Engine (SEC)
- Pattern Matching Engine (PME)
- Enhanced RapidIO Messaging (RMan)





# MPC8569 PowerQUICC III Bridging the Gap to the All-IP Network



#### e500 PowerPC from 800 MHz to 1.33 GHz

- 512KB L2 Cache w/ ECC
- 36-bit physical addressing
- Double Precision Floating Point

#### System Interfaces

- 64-bit or 2x 32-bit DDR2/3 w/ ECC
- 800 Mbps/pin data rate
- 16-bit Local Bus for SRAM/Flash
- Timers, DUART, 2x I<sup>2</sup>C, GPIO, SPI
- USB 2.0 Full Speed

#### High Speed Serial Interfaces

- Dual SGMII
- Dual x1 Serial RapidIO or PCI Express

#### QUICC Engine

- 4 RISCs up to 667 MHz
- Maximum of 8 Ethernet interfaces, one per UCC:
  - 4 x Gigabit Eth (up to 2 w/SGMII)
  - Up to 8 x 10/100 Ethernet
- Multi-PHY UTOPIA/POS-PHY L2 (16-bit)
- IEEE 1588 v2 support
- 16 x T1/E1 (512 x 64 kbps channels)

#### Security Engine (SEC 3.0)

- ARC4, 3DES, AES, RSA/ECC, RNG, XOR, Single pass SSL/TLS, Kasumi, SNOW

#### Four-channel DMA

#### 45nm SOI process technology

#### Target <7W Power (@ 800 MHz e500)





# **Data Path Acceleration: Quick Recap**





### **DPAA – Data Path Acceleration Architecture**

- Frame Manager:
  - Parse header information
  - Classify the packet by its destination, e.g. MAC, IP or UDP destination
  - Distribute packets to allow load balancing between cores
  - Police flows to avoid DoS attacks
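
The distribute and police steps listed above can be illustrated with a small software model. This is only a sketch under stated assumptions: the Frame Manager performs these steps in hardware, and the `Flow` fields, the CRC32-based hash and the token-bucket parameters are hypothetical choices made for this example, not a description of the actual hardware.

```python
import zlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Flow:
    """Header fields a parser might extract (hypothetical subset)."""
    src_ip: str
    dst_ip: str
    proto: int
    src_port: int
    dst_port: int


def distribute(flow: Flow, num_cores: int) -> int:
    """Pick a core by hashing the flow key, so one flow always lands on one core."""
    key = f"{flow.src_ip}|{flow.dst_ip}|{flow.proto}|{flow.src_port}|{flow.dst_port}"
    return zlib.crc32(key.encode()) % num_cores


class TokenBucket:
    """Simple policer: admits `rate` bytes per second with bursts up to `burst` bytes."""

    def __init__(self, rate: float, burst: float):
        self.rate, self.burst = rate, burst
        self.tokens, self.last = burst, 0.0

    def admit(self, size: int, now: float) -> bool:
        # Refill tokens for the elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if size <= self.tokens:
            self.tokens -= size
            return True
        return False  # over the configured rate: drop or mark the frame
```

Hashing on the flow key keeps packets of one flow in order on one core while spreading different flows across cores; the policer caps how much any single flow can inject.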







## **QMan Software Portals**







# **QoS supported through Channel Scheduler**







Figure: Queue Manager block diagram. Labels: memory-mapped portals to CoreNet (cores, sRIO, PCIe); internal queue/descriptor memory; queuing engines; hardware portals to SEC 4.0, PME and FM.

## **Queue Manager (QMan)**

- QMan provides a set of building blocks
  - Frame queues (FQ) which are enqueued onto
    - ...Work queues (WQ) which are organized into
    - ...Channels with prioritized class scheduling between WQs
    - ... which can be used to build connections between blocks (cores, network I/Fs, HW accelerators)
- Frame queues are "logical" queues
  - Actual data is stored in memory buffers, QMan queues "frame descriptors"
- QMan also supports active queue management:
  - Tail drop on FQs
  - WRED on groups of FQs
  - Tail drop on groups of FQs
- Channels can be shared between consumers which facilitates load spreading
- Channels may also be dedicated to a single consumer
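
The relationship between frame queues, work queues and channels can be sketched as a toy model. The class names, the strict-priority policy and the default of eight work queues per channel are assumptions for illustration; the hardware scheduler is richer than this, and the "frame descriptors" here are plain Python values standing in for pointers to buffers in memory.

```python
from collections import deque


class FrameQueue:
    """Logical queue of frame descriptors with a tail-drop limit."""

    def __init__(self, fqid: int, limit: int):
        self.fqid, self.limit = fqid, limit
        self.fds = deque()

    def enqueue(self, fd) -> bool:
        if len(self.fds) >= self.limit:  # tail drop: reject when the FQ is full
            return False
        self.fds.append(fd)
        return True


class Channel:
    """Work queues indexed by priority; index 0 is served first (strict priority)."""

    def __init__(self, num_wqs: int = 8):
        self.wqs = [deque() for _ in range(num_wqs)]

    def schedule(self, fq: FrameQueue, wq: int) -> None:
        """Place a non-empty FQ on a work queue, at most once."""
        if fq.fds and fq not in self.wqs[wq]:
            self.wqs[wq].append(fq)

    def dequeue(self):
        """Return (fqid, fd) from the highest-priority non-empty WQ, or None."""
        for wq in self.wqs:
            if wq:
                fq = wq.popleft()
                fd = fq.fds.popleft()
                if fq.fds:  # FQ still holds frames: move it to the back of its WQ
                    wq.append(fq)
                return fq.fqid, fd
        return None
```

A channel shared by several consumers would simply have each of them call `dequeue()`, which is how load spreading falls out of the model.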



### **Unified DPAA Frame Formats**







# Packet I/O





# FMan/QMan Ingress Packet Processing







## **Example Location of NAPT Functionality**

### computational cluster in RNC / SGSN / GGSN / ...







# **NAPT Basic Principle**







# System Configuration: Linux® OS and Bare-metal

- Demo implementation based on separated slow path and fast path
- Linux runs on 1 core with control software on top
- A bare-metal application on 7 cores processes packets
- Hypervisor protects the partitions from each other







### **Core / Accelerator Partition**

#### Frame Manager Ingress (accelerator)

1. Buffer allocation in DDR
2. Parser / classification / policing
3. IP checksum
4. UDP checksum
5. Enqueue to core

#### Software (core)

6. Dequeue from Frame Manager
7. Translation lookup
8. **Enqueue to Frame Manager**

#### Frame Manager Egress (accelerator)

9. Dequeue from core
10. Add UDP checksum
11. Add IP checksum
12. Buffer deallocation from DDR
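
The translation lookup performed in software can be sketched as a pair of dictionaries. This is a minimal illustration, assuming a single public address and sequential port allocation; the class and method names are hypothetical, and port exhaustion and entry timeouts are not handled. Because the Frame Manager adds the UDP and IP checksums on egress in this partition, the software only has to rewrite the address and port.

```python
from itertools import count


class Napt:
    """Maps (private IP, private port) to a port on one public IP and back."""

    def __init__(self, public_ip: str, first_port: int = 20000):
        self.public_ip = public_ip
        self.next_port = count(first_port)
        self.out = {}   # (private_ip, private_port) -> public_port
        self.back = {}  # public_port -> (private_ip, private_port)

    def outbound(self, src_ip: str, src_port: int):
        """Translate the source of a packet leaving the private network."""
        key = (src_ip, src_port)
        if key not in self.out:
            port = next(self.next_port)
            self.out[key] = port
            self.back[port] = key
        return self.public_ip, self.out[key]

    def inbound(self, dst_port: int):
        """Translate the destination of a returning packet; None if unknown."""
        return self.back.get(dst_port)
```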



# **RMan Datapath**







# RMan for QorIQ: Greater Performance and Functionality

- Many queues allow multiple inbound/outbound queues per core
  - Hardware queue management via the QorIQ Data Path Acceleration Architecture (DPAA)
- Supports all messaging-style transaction types
  - Type 11 Messaging
  - Type 10 Doorbells
  - Type 9 Data Streaming
- Enables low overhead direct core-to-core communication

Device-to-Device Transport







## RMan Enables New Zero-CPU Overhead Use Models

# Scalable Multicore System



### **Ethernet Bridging**







# **PME: Traffic Monitoring and Management**





# **Deploy Deep Packet Inspection and Policy Control**

- DPI enables companies to:
  - Understand network traffic and its patterns
  - Gather business intelligence
  - Identify trends and adapt to those trends
- Policy Control enables:
  - Fine-tuned network controls, as a smarter pipe requires
  - Control and management of growing usage
  - Fair usage for all network users







# **Policy Control: DPI with Pattern Matcher**

- Regex support plus significant extensions:
  - Patterns can be split into 256 sets each of which can contain 16 subsets
  - 32K patterns of up to 128B length
  - 9.6 Gbps raw performance
- Combined hash/NFA technology
  - No "explosion" in number of patterns due to wildcards
  - Low system memory utilization
  - Fast pattern database compiles and incremental updates
- Stateful rules operate on a per session basis
  - User-defined logic reacts to pattern matches detected by the DXE (Data Examination Engine)
  - Can be used to further qualify the pattern match
  - Protocol state tracking (e.g. track the "normal" transitions of SMTP)
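
The idea of a stateful rule that qualifies pattern matches against per-session protocol state can be sketched in software. This is not how the PME is programmed; the pattern set, the state names and the simplified view of SMTP below are assumptions chosen to show the principle with Python's `re` module.

```python
import re

# Patterns a scanner might report, keyed by an event name (illustrative only).
PATTERNS = {
    "helo": re.compile(rb"^(HELO|EHLO)\b", re.I),
    "mail": re.compile(rb"^MAIL FROM:", re.I),
    "rcpt": re.compile(rb"^RCPT TO:", re.I),
    "data": re.compile(rb"^DATA\b", re.I),
}

# "Normal" transitions: state -> {event: next state}.
NORMAL = {
    "start": {"helo": "greeted"},
    "greeted": {"mail": "mail"},
    "mail": {"rcpt": "rcpt"},
    "rcpt": {"rcpt": "rcpt", "data": "data"},
}


class Session:
    """Per-session state, as a stateful rule would keep it."""

    def __init__(self):
        self.state = "start"

    def feed(self, line: bytes) -> str:
        """Return 'ok', 'anomaly' or 'ignored' for one client line."""
        for event, pattern in PATTERNS.items():
            if pattern.match(line):
                nxt = NORMAL.get(self.state, {}).get(event)
                if nxt is None:
                    return "anomaly"  # pattern matched, but not expected in this state
                self.state = nxt
                return "ok"
        return "ignored"
```

The same pattern match (`DATA`) is benign after `RCPT TO:` and suspicious at the start of a session, which is the kind of qualification the stateful rules provide.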







# **SEC: Ciphering and Packet Formatting**





## P4080 RLC AM Downlink Processing







## **SEC 4.0 for P4080 rev2**







### **DECO Commands**

- Load Commands:
  - **[SEQ] KEY**: Load cipher keys
  - **[SEQ] LOAD**: Load register
  - **[SEQ] FIFO LOAD**: Load In/Out FIFO
- Store Commands:
  - **[SEQ] STORE**: Store register
  - **[SEQ] FIFO STORE**: Store In/Out FIFO
- **MOVE**: Move data between SEC registers
- **MATH**: Perform arithmetic operation (sum, and, or, xor, bit shift, ...)
- **OPERATION**: Execute cipher operation (Kasumi, SNOW, AES, ...)
- **JUMP**: Branch in the descriptor
- **SEQ IN/OUT PTR**: Set pointer for in/out sequence data
- **JD/SD HEADER**: First word in the descriptor
- **SIGNATURE**: Last word in trusted descriptor





# **P4080 RLC AM Downlink Processing**

The SEC sends out an output frame descriptor, which in turn points to the generated output data. The input data is either left in memory or freed, with its buffer returned to BMan.







# **P4080 RLC AM Downlink Processing**

The frame descriptor is enqueued onto a work queue / channel back to a core for further processing.







# **Use Case: Voice Packet Processing**





#### **Challenge: Voice Packets**

- Voice packets are ciphered/formatted in the RNC (3G) and the (e)NB (LTE and 3G flat architecture).
- Voice packets are large in number but small in size (~20 bytes).
- Typically each active flow has one packet per ~20 ms, so packets of the same flow cannot be grouped: latency would increase and the phone conversation would lag severely.
- Hardware acceleration is a big benefit, but the per-byte overhead of sending to an accelerator has been a big problem since the packets are so small.
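
A quick calculation makes the problem concrete. The sketch below uses the ~20 byte and ~20 ms figures from the bullets above; the number of flows and the cost units in `overhead_fraction` are arbitrary, hypothetical inputs and not measurements.

```python
def voice_load(flows: int, payload_bytes: int = 20, interval_ms: int = 20):
    """Return (packets per second, payload bits per second) for `flows` active calls."""
    pps = flows * 1000 // interval_ms
    return pps, pps * payload_bytes * 8


def batching_delay_ms(packets_per_batch: int, interval_ms: int = 20) -> int:
    """Extra wait for the first packet if packets of ONE flow are grouped."""
    return (packets_per_batch - 1) * interval_ms


def overhead_fraction(fixed_cost: float, per_byte_cost: float, size: int) -> float:
    """Share of the total cost that is fixed per-packet overhead."""
    return fixed_cost / (fixed_cost + per_byte_cost * size)
```

For example, `voice_load(10000)` gives 500,000 packets per second but only 80 Mbps of payload, and grouping four packets of one flow would add 60 ms of delay. With any fixed per-packet cost, `overhead_fraction` is far higher for a 20-byte packet than for a 1400-byte one, which is why per-packet overhead dominates for voice.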



# Per Flow Ciphering

Figure labels: Key N, Context N; Frame Descriptor; S/G output (buffers at random locations).







### **Use Case: HSPA Mobile Broadband**





### **Challenge: Data Packets**

- Huge total bandwidth!
- Very high individual user peak rate.
- Complex protocol processing.









### Frame Type 1 with MAC and RLC PDU – R7

Gray = Depends on configuration and version

Blue = Empty

Examples of special cases:

- E bit can have an "alternative interpretation"
- Valid HE values differ between revisions
- Length field can be 7 or 15 bits





## Frame Type 2 with MAC and RLC PDU – R7



| Block N MAC-d length     |                |
|--------------------------|----------------|
| Block N length em        | Block N #MAC-d |
| Block N Log Ch id        | Padding        |
| DRT                      |                |
| DRT                      |                |
| H-RNTI                   |                |
| H-RNTI                   |                |
| RACH Measurement Results |                |
| MAC-d A 1                |                |
| MAC-d A 2                |                |



Payload



#### **HSPA: AM RLC/MAC/FP Downlink**



**Bold frame indicates intensive work.** Blue = Core, Green = Accelerator.

#### Features for mobile broadband:

- Cipher blocks of packets
- Auto-update sequence number
- Bit-shifting
- CRC-7 / CRC-11 / CRC-16
- Dual CRC (header/payload)
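
To show what the CRC features compute, here is a bit-serial software sketch. The generator polynomials are an assumption: they are the ones commonly catalogued for UMTS frame protocols (MSB first, zero initial value, no final XOR) and are not taken from the slides. Which bits of a frame each CRC covers is defined by the frame protocol specification and is not modeled here.

```python
def crc_bits(bits, width: int, poly: int) -> int:
    """Bit-serial CRC, MSB first, zero initial value, no final XOR."""
    mask = (1 << width) - 1
    reg = 0
    for bit in bits:
        top = (reg >> (width - 1)) & 1
        reg = (reg << 1) & mask
        if top ^ bit:
            reg ^= poly
    return reg


def to_bits(data: bytes):
    """Expand bytes into a list of bits, most significant bit first."""
    return [(byte >> i) & 1 for byte in data for i in range(7, -1, -1)]


# Assumed parameters: (width, generator polynomial without its top term).
CRC7 = (7, 0x45)      # x^7 + x^6 + x^2 + 1
CRC11 = (11, 0x307)   # x^11 + x^9 + x^8 + x^2 + x + 1
CRC16 = (16, 0x8005)  # x^16 + x^15 + x^2 + 1


def dual_crc(header: bytes, payload: bytes):
    """Separate CRCs over header and payload, as in the 'dual CRC' feature."""
    return crc_bits(to_bits(header), *CRC7), crc_bits(to_bits(payload), *CRC16)
```

Working on bits rather than bytes matters here because header CRCs such as CRC-7 do not align to byte boundaries, which is also why bit-shifting appears in the feature list.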





### RLC / MAC / FP flow on P4080: Example A

- User data assembled in memory, from GTP-U.
- When MAC flow control indicates that we can send data:

#### Core:

1. Checks retransmission buffer.
2. Checks status transmission buffer.
3. Creates RLC headers with status.
4. Checks if C/T MUX is used.
5. Adds pointers to segments and RLC headers.
6. Concatenation/piggyback/padding of last user-data packet.
7. Saves pointers for retransmission buffer.
8. Creates FP header/tail.

**SEC:** segments, adds headers, bit-shifts and ciphers. Adds FP header/tail and computes CRC-7 and CRC-16.
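
The core-side part of this flow (creating headers, adding pointers to segments, and saving pointers for retransmission) can be sketched as follows. This is a simplified illustration, not the demo code: the 2-byte header layout is hypothetical, length indicators, padding and status PDUs are omitted, and ciphering is left to the SEC as described above.

```python
def segment_sdu(sdu: bytes, payload_size: int, first_sn: int, retx: dict):
    """Split one SDU into (header, payload view) pairs without copying the data."""
    view = memoryview(sdu)
    pdus = []
    sn = first_sn
    for off in range(0, len(sdu), payload_size):
        # Hypothetical 2-byte header: 1 flag bit set, 12-bit sequence number, 3 spare bits.
        header = ((1 << 15) | (sn << 3)).to_bytes(2, "big")
        payload = view[off:off + payload_size]  # a "pointer" into the SDU, no copy
        retx[sn] = (header, payload)            # kept until the PDU is acknowledged
        pdus.append((header, payload))
        sn = (sn + 1) % 4096                    # 12-bit sequence number wraps
    return pdus, sn
```

Using `memoryview` slices mirrors the pointer-based approach: the core only builds small headers and a list of references, and the payload bytes themselves are touched once, by the accelerator.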



Frame Descriptor

# **Type-1 MAC-d processing**





# **DEMO**





