Ultra-high performance SIMD processor for digital communications

Reconfigurable, massively parallel processor array leads the way to SDR


Philips' SIMD (Single Instruction Multiple Data) processor offers manufacturers an easy, low-cost, future-proof path for several applications within digital communications. Available today, it meets current and future market needs for powerful digital processing in everything from Base Transceiver Stations (BTSs) to broadband wireless and optical networking. The device's massively parallel, reconfigurable processor array is 100% software programmable. This offers unique flexibility, scalability and upgradeability as wireless standards and signal processing techniques continue to evolve. And its software reconfigurability is a vital element for building Software Defined Radio (SDR) solutions.

Overview

Most industry forecasts predict explosive growth in the communications industry in the coming years. This will result in a low-cost global wireless and wired broadband communications infrastructure enabling limitless business opportunities and services. This wireless and networking infrastructure will deliver the required bandwidth in a variety of media and protocols, serving broad and specialized markets. But to realize this, low-cost communications processing solutions are required. Philips' SIMD processing technology provides the solution - it replaces high-end FPGAs, DSPs and ASIC/SoC solutions for digital processing by taking applications to the software domain. As a cost reducer, this technology not only meets the demands of today's communications infrastructure but also addresses tomorrow's expectations of operators and consumers of communication services.


Philips believes that fully-software programmable silicon architectures are the key enabling technologies in developing communications processors to support the development and deployment of communication infrastructure. These processors will enable the infrastructure to support very high user densities, new and enhanced services at a cost advantage that will make high bandwidth communications a commodity. Philips' SIMD processor is such a device and is built on a proven parallel computer architecture, developed from years of R&D. It provides the high performance, scalability, fully-software programmability and the flexibility required in current and next generation communication applications. Based on Aspex Semiconductor's industry-leading Linedancer™ architecture, the processor offers:

  • implementation simplicity
  • unmatched scalability
  • very high performance (the SIMD model of computation)
  • associative processing - a unique and natural manner to manipulate data in a variety of abstract ways.

This unique feature combination enables it to execute all the mathematical functions associated with standards implementation and provide the flexibility required for customizing communications services. The current Philips processor delivers higher peak performance and higher sustained application performance than the very latest available Digital Signal Processors (DSPs) and microprocessors. It also comes with programmable support for the wireless and wired broadband communications markets. In addition, several SIMD processors can be combined (or embedded) in various parallel processing models to create a MIMD (Multiple Instruction Multiple Data) processor.


The architecture and high level of integration provide the tremendous power and flexibility that simplify the hardware design process and lower the overall cost of new communications equipment. Moreover, this processor is a key element in building Software Defined Radio (SDR) solutions, where powerful processing is combined with software-reconfigurable hardware to build multimode, multiband and multifunctional wireless devices. Philips' worldwide support for the SIMD processor includes standard and application-specific software libraries, as well as development tools, giving customers a huge time-to-market and validation advantage over competing solutions.


Modular Massively Parallel Computing (MPC)
Advanced communications applications process a variety (size and structure) of data and impose a huge requirement for scalable systems. Scalability defines the ability to independently vary the number and size of resources (memory, processing, and I/O bandwidth) used to support required functionality. For example, some applications may require large memory structures and low external I/O bandwidth, or vice versa. Scalable high performance is an intrinsic characteristic of communications processing applications. Philips uses the Modular-MPC architecture to address scalability. This approach (see Fig.1) comprises a number of identical processing channels. Each of these supports its own external I/O, which can be implemented to support any standardized or customized external interface.


Processing channels
The modular approach addresses scalable I/O bandwidth requirements. If a single channel interface can cope with the external data bandwidth required by the communications processing application, then there is no requirement for additional processing channels. However, if a single interface is not adequate, then an appropriate number of processing channels can be included in the system to help balance the data bandwidth by evenly distributing the data stream among the channels.
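The sizing rule described above can be sketched in a few lines. This is an illustration only: the function names and the bandwidth figures in the example are assumptions, and a round-robin deal is just one simple way to distribute a stream evenly among channels.

```python
import math

def channels_needed(required_mbytes_per_s, per_channel_mbytes_per_s):
    """Smallest number of identical channels whose combined I/O covers the requirement."""
    if per_channel_mbytes_per_s <= 0:
        raise ValueError("per-channel bandwidth must be positive")
    return max(1, math.ceil(required_mbytes_per_s / per_channel_mbytes_per_s))

def distribute(stream, n_channels):
    """Deal the items of a data stream round-robin across the channels."""
    channels = [[] for _ in range(n_channels)]
    for i, item in enumerate(stream):
        channels[i % n_channels].append(item)
    return channels

# Hypothetical figures: a 1500 Mbytes/sec stream and 528 Mbytes/sec per channel.
n = channels_needed(1500, 528)
print(n)                          # 3
print(distribute(range(10), n))   # [[0, 3, 6, 9], [1, 4, 7], [2, 5, 8]]
```

If one channel already covers the requirement, the first function returns 1 and no additional channels are included.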

Fig.1 Modular-MPC architecture

Each channel combines storage and processing power in the form of a storage module and an ASP module. The ASP modules (see next section) support high-performance parallel processing. The storage modules hold communications data awaiting processing, or the results of that processing. Each is implemented as a collection of memory modules, which can be scaled according to the application and system requirements, independent of the size of the ASP module. The storage modules can be accessed from the ASP in two different modes:

  • Distributed Data Mode - the ASP module only has access to the storage module in the same processing channel
  • Shared Data Mode - allows ASP modules to access the storage modules of other processing channels. This can be accomplished with the help of a routing network (not shown in Fig.1). In this mode the collection of storage modules is seen as a single (shared) memory.

The storage does not need to be “on-chip” with the ASP module to achieve the high performance required for communications applications. In operation, data is transferred via the external I/O interface and stored in the storage module of each channel. Each ASP module accesses data in a storage module, in either distributed or shared mode, for processing. Data transferred from the storage modules is buffered locally within the ASP modules, so that data transfers overlap with high-performance processing in the ASP modules. In addition, data I/O through the external I/O interfaces can continue while data is being processed.


The highly modular and scalable concept of Modular-MPC enables the development of cost-effective, communications processing systems by satisfying application requirements and exploiting the advantages and constraints of board and system-on-silicon designs. For example, the simplicity of the design allows board-level implementations to migrate into system-on-silicon designs as the level of microelectronics integration increases.

Associative String Processor (ASP)

ASProCore
An ASP module comprises an ASProCore (Associative String Processor Core) - a programmable, homogeneous and fault-tolerant SIMD parallel processor incorporating a string of identical processing units, software-programmable Intercommunication Network, and Vector Data Buffer for fully overlapped data I/O (see Fig.2). The number of processing units incorporated in a device may vary. At the logical level, the ASProCore constitutes a high-performance cellular string associative processor. At the physical level, the ASP is implemented as a bit-serial, word-parallel associative parallel processor; i.e. all processing units simultaneously perform the same arithmetic, logical or relational operation in a bit-serial manner. These units in the ASP architecture are called Associative Processing Elements (APEs).
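To make "bit-serial, word-parallel" concrete, the sketch below adds two operand fields across a handful of APEs, one bit position per step. It is a software model, not a description of the device's circuitry: the bit-plane representation (one Python integer per bit position, with bit i belonging to APE i) and all function names are assumptions chosen for illustration.

```python
def to_bit_planes(values, width):
    """Transpose per-APE words into bit-planes: bit i of planes[k] is bit k of values[i]."""
    planes = [0] * width
    for ape, v in enumerate(values):
        for k in range(width):
            if (v >> k) & 1:
                planes[k] |= 1 << ape
    return planes

def from_bit_planes(planes, n_apes):
    """Inverse of to_bit_planes: rebuild one word per APE."""
    return [sum(((planes[k] >> ape) & 1) << k for k in range(len(planes)))
            for ape in range(n_apes)]

def bit_serial_add(a_planes, b_planes):
    """Add two fields LSB first; each loop iteration handles one bit in every APE at once."""
    carry = 0
    out = []
    for a, b in zip(a_planes, b_planes):
        out.append(a ^ b ^ carry)                 # sum bit for all APEs
        carry = (a & b) | (carry & (a ^ b))       # carry into the next bit position
    return out, len(out)                          # result modulo 2**width, and step count

a = [3, 200, 17, 255]
b = [5, 100, 1, 1]
planes, steps = bit_serial_add(to_bit_planes(a, 8), to_bit_planes(b, 8))
print(from_bit_planes(planes, len(a)), steps)     # [8, 44, 18, 0] 8
```

The step count depends on the field width (8 steps for 8-bit fields), not on the number of APEs, which is why adding APEs adds throughput in this model.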

Fig.2 ASProCore architecture

APEs
Each processing unit or APE incorporates a Data Register and bit-serial ALU (see Fig.2). The size of the Data Register can vary from implementation to implementation (see “Philips' SIMD processor”). The Data Register, in addition to storing data for arithmetic operations involving the local ALU, also supports associative processing operations; i.e., direct support for logical and relational operations. The Data Register is dynamically (under program control) partitioned into fields that store processing operands. The partitioning is arbitrary and not necessarily at byte boundaries. To use an analogy from traditional processor architectures, the Data Register can be seen as a pool of registers of varying length (to suit the precision requirements of the application) that store the operands processed in the local APE.
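The "pool of registers of varying length" analogy can be illustrated with a small model in which one integer stands for an APE's Data Register and fields are arbitrary bit ranges within it. The `Field` type, the helper names and the 11/5/21-bit layout are hypothetical; the actual register size and field mechanism are implementation details not given here.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Field:
    offset: int  # bit position of the field's least significant bit
    width: int   # field length in bits; need not be a multiple of 8

def read_field(register, f):
    """Extract the operand stored in field f."""
    return (register >> f.offset) & ((1 << f.width) - 1)

def write_field(register, f, value):
    """Return the register with field f replaced by value (truncated to the field width)."""
    mask = ((1 << f.width) - 1) << f.offset
    return (register & ~mask) | ((value << f.offset) & mask)

# Hypothetical layout: an 11-bit sample, a 5-bit gain and a 21-bit accumulator.
sample, gain, acc = Field(0, 11), Field(11, 5), Field(16, 21)

reg = 0
reg = write_field(reg, sample, 1234)
reg = write_field(reg, gain, 19)
reg = write_field(reg, acc, 1_000_000)
reg = write_field(reg, gain, 7)      # rewriting one field leaves its neighbours untouched
print(read_field(reg, sample), read_field(reg, gain), read_field(reg, acc))  # 1234 7 1000000
```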


Intercommunication network
The APEs are connected via the Intercommunication Network. This is a flexible network that supports data transfers and navigation of data structures. It can be dynamically reconfigured, in a programmable and user-transparent way that provides a cost-effective emulation of common network topologies. The Intercommunication Network implements a simple, scalable, fault-tolerant and fully-software programmable, tightly-coupled processing element interconnection strategy, supporting two modes of interprocessing communication:

  • Irregular - Bi-directional, single-bit communication that connects source processing units to their corresponding destination processing units with high-speed activation signals. It implements a fully-connected permutation and broadcast network, configured dynamically and transparently to the programmer, for processing element selection and inter-processing element routing functions
  • Regular - Bi-directional, multi-bit communication via a high-speed, bit-serial shift register for data/message transfers between processing unit groups

Thus, the interconnection strategy for ASProCore supports a high degree of parallelism for local communication and progressively lower degrees of parallelism for longer distance communication. The topology of the network is derived from a shift register and chordal ring. The latter enables the network to be implemented as a hierarchy of processing groups. Communication times are significantly reduced by automatically bypassing those processing element groups that do not include destination APEs. In a similar way, bypassing faulty groups of APEs gives the ASProCore architecture its fault tolerance.


Vector data buffer
While being served with control and sequential data via the Instruction and Data Interface, the ASProCore supports parallel data I/O via the Vector Data Buffer. Data is loaded into the Vector Data Buffer in a word-sequential, bit-parallel manner, overlapped with SIMD parallel processing. It is then exchanged with the data stored in the Data Register of the local processing unit in a word-parallel, bit-sequential manner. Because of the massive bandwidth of this transfer, the exchange time, during which high-performance parallel processing has to be stopped, is kept short.
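A rough cycle-count model shows why this arrangement pays off. The model is an assumption made for illustration, not a statement of device timing: it supposes that loading takes one cycle per word, that the exchange takes one cycle per bit of the word, and that only input data is considered. The compute time per block is also a made-up figure.

```python
def run_time(n_blocks, n_apes, word_bits, compute_cycles, overlapped=True):
    """Total cycles to process n_blocks of data, one word per APE per block."""
    if n_blocks < 1:
        raise ValueError("need at least one block")
    load = n_apes          # assumed: word-sequential load, one word per cycle
    exchange = word_bits   # assumed: word-parallel exchange, one bit per cycle in every APE
    if not overlapped:
        # Load, exchange and compute strictly one after another.
        return n_blocks * (load + exchange + compute_cycles)
    # The first load cannot be hidden. Each later load runs while the previous
    # block is being processed, so only the longer of the two costs time.
    return (load
            + n_blocks * exchange
            + (n_blocks - 1) * max(compute_cycles, load)
            + compute_cycles)

# Hypothetical workload: 4 blocks, 4096 APEs, 32-bit words, 10000 compute cycles per block.
print(run_time(4, 4096, 32, 10_000, overlapped=False))  # 56512
print(run_time(4, 4096, 32, 10_000, overlapped=True))   # 44224
```

In this model processing stops only for the 32-cycle exchange per block, while the much longer sequential load disappears behind computation.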


Data register
For data-parallel operations, data is distributed over the processing units and stored in their local Data Registers. Successive computational tasks are performed on the stored data and the results are output. The ASProCore supports a form of associative processing in which a subset of active APEs (those which associatively match broadcast scalar information) supports scalar-vector (between a scalar and Data Registers) and vector-vector (within Data Registers) operations. Matching APEs are either activated directly or act as sources of inter-processing element communications that indirectly activate other APEs. The control interface provides feedback on whether none or some processing units match. The instruction set for the ASP is based on four basic operations: match, add, read, and write. More complex functions are built by combining these operations.
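The four basic operations can be mimicked with a toy model in which APEs are selected by content rather than by address. The class below is a sketch under stated assumptions: `TinyASP`, its method signatures and the choice that `read` returns the first active APE's content are all invented for illustration and do not reflect the actual instruction set encoding.

```python
class TinyASP:
    """Toy associative string: one data word and one activity flag per APE."""

    def __init__(self, data):
        self.data = list(data)
        self.active = [False] * len(self.data)

    def match(self, value, mask=~0):
        """Broadcast a scalar; APEs whose masked content equals it become active.
        Returns True if at least one APE matched (the 'none or some' feedback)."""
        self.active = [(d & mask) == (value & mask) for d in self.data]
        return any(self.active)

    def add(self, scalar):
        """Scalar-vector add, performed only in the active APEs."""
        self.data = [d + scalar if a else d for d, a in zip(self.data, self.active)]

    def write(self, value):
        """Store a broadcast value in every active APE."""
        self.data = [value if a else d for d, a in zip(self.data, self.active)]

    def read(self):
        """Return the content of the first active APE, or None if none is active."""
        for d, a in zip(self.data, self.active):
            if a:
                return d
        return None

asp = TinyASP([7, 3, 7, 9])
print(asp.match(7))   # True: two APEs hold 7
asp.add(10)           # only the matching APEs are updated
print(asp.data)       # [17, 3, 17, 9]
print(asp.match(4))   # False: no APE holds 4
```

Note that no APE index appears in the calling code: selection is driven entirely by the broadcast value.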

Generic processor architecture

The simple yet powerful and flexible architecture of ASProCore leads to an equally simple generic processor architecture, UI-ASP (Ultra-Integrated ASP - see Fig.3). This architecture relies on ASProCore to deliver the performance required for communications applications. It comprises the ASProCore unit, RISC controller, data memory, and set of interfaces to off-chip peripherals. The generic nature of UI-ASP allows the rapid and easy implementation of general purpose and Application Specific Standard Processors (ASSPs) for communications processing applications.


RISC controller
The RISC controller is a traditional microprocessor design, augmented with units capable of delivering SIMD instructions to the ASProCore at high rates. This task is also helped by the existence of library memory where sequences of often-used SIMD instructions are stored. Instructions from the library memory can be issued with minimum intervention from the RISC controller.


ASProCore unit
The ASProCore is attached to the RISC controller as a loosely coupled coprocessor. It executes a set of instructions, as an extension to the microprocessor's basic instruction set, that implement the scalar-vector and vector-vector functionality supported by the ASProCore. To enhance performance, the RISC operations and ASProCore do not operate in lockstep. The vector unit includes instruction buffers that allow the scalar core to run ahead. Both the ASProCore and RISC are synchronized with explicit synchronization instructions.
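The decoupling described above can be pictured as a queue between the two units. The following model is a simplification with assumed names (`DecoupledPair`, `issue`, `sync`) and an assumed buffer depth; in particular, the vector unit here only makes progress when the scalar side stalls or synchronizes, whereas real hardware would drain the buffer concurrently.

```python
from collections import deque

class DecoupledPair:
    """Scalar controller feeding a vector unit through a bounded instruction buffer."""

    def __init__(self, buffer_depth=4):
        self.queue = deque()
        self.depth = buffer_depth
        self.executed = []       # vector instructions completed, in order
        self.scalar_stalls = 0   # times the scalar core had to wait for buffer space

    def _vector_step(self):
        if self.queue:
            self.executed.append(self.queue.popleft())

    def issue(self, vector_op):
        """Enqueue a vector instruction; the scalar core waits only if the buffer is full."""
        if len(self.queue) == self.depth:
            self.scalar_stalls += 1
            self._vector_step()
        self.queue.append(vector_op)

    def sync(self):
        """Explicit synchronization: block until the vector unit has drained the buffer."""
        while self.queue:
            self._vector_step()

pair = DecoupledPair(buffer_depth=2)
for op in ["vload", "vadd", "vmul", "vstore"]:
    pair.issue(op)
print(pair.executed, pair.scalar_stalls)  # ['vload', 'vadd'] 2
pair.sync()
print(pair.executed)                      # ['vload', 'vadd', 'vmul', 'vstore']
```

A deeper buffer lets the scalar core run further ahead before it stalls, at the cost of more work outstanding at each synchronization point.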


UI-ASP processor
The programming style of the UI-ASP architecture and asynchronous operation of the RISC controller allow data I/O and program execution to overlap, as discussed earlier for the Modular-MPC and ASProCore architectures. A UI-ASP processor implements part of the storage unit and the ASP module of a Modular-MPC processing channel. If additional scalability is required for higher performance, more than one UI-ASP processor can be cascaded to increase the application performance linearly. If additional storage is required, external memory is used to store the data processed by a UI-ASP processor. This basic architecture can be tuned by adding co-processors.

Fig.3 UI-ASP architecture principle

Designing UI-ASP communications applications in software

Key features of the UI-ASP architecture are its performance and high-level programming capability. Associative SIMD techniques are used to maximize parallelism for communications applications and minimize overhead and implementation complexity, which in turn maximize the performance capabilities and cost-efficiency of the architecture.


ASProCore code intrinsics
The UI-ASP architecture was tuned to support application development in high-level programming languages. The UI-ASP programming tools allow ASProCore intrinsics to be embedded in C source code. ASProCore intrinsics represent an abstract view of the ASProCore architecture and do not reflect the details of individual implementations. However, these abstractions do not compromise the performance of the compiled code, since the programming tools optimize the code according to the implementation constraints.


The efficient optimizers in the programming tools schedule code to match the intricacies of the parallel ASProCore architecture without requiring the programmer to understand the fine details of the hardware architecture. Furthermore, the ASProCore intrinsics take account of the scalability of the architecture: any C code with ASProCore intrinsics will execute on any UI-ASP implementation, independent of the number of APEs in a UI-ASP processor or the number of UI-ASP processors in a system. The only noticeable difference is that the fewer the APEs in a system, the lower the application performance achieved.
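The idea that the same source runs on any number of APEs, only faster or slower, can be sketched as follows. The function is a stand-in written in Python for brevity and is not an ASProCore intrinsic; its name, its `n_apes` parameter and the strip-mining strategy are assumptions about how a tool chain might hide the device size from the application.

```python
def vector_add(a, b, n_apes):
    """Element-wise add, processed in passes of at most n_apes elements each."""
    result = []
    passes = 0
    for start in range(0, len(a), n_apes):
        chunk_a = a[start:start + n_apes]
        chunk_b = b[start:start + n_apes]
        result.extend(x + y for x, y in zip(chunk_a, chunk_b))  # one parallel step
        passes += 1
    return result, passes

a = list(range(10_000))
b = list(range(10_000))
small, small_passes = vector_add(a, b, n_apes=1024)
large, large_passes = vector_add(a, b, n_apes=4096)
print(small == large)              # True: identical results on both configurations
print(small_passes, large_passes)  # 10 3: fewer APEs means more passes
```

The calling code never changes; only the pass count, and therefore the run time, depends on the number of APEs available.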


Compilers for the majority of high-performance DSPs provide a mechanism called asm statements that allow direct embedding of assembly language instructions in a C program and in-line expansion to insert low-level code. However, asm statements are not an efficient approach for optimization due to lack of portability from generation-to-generation. The asm statements require the programmer to manually allocate registers and schedule instructions that are specific to a device implementation. The UI-ASP programming tools accept the ASProCore processing intrinsics just like any other C operators (e.g. addition) and the compiler generates code according to the target UI-ASP device. As a result, coded programs can be recompiled for new generations of UI-ASP processors and exploit additional features of the device without the need of rewriting. All the advanced features of code optimization are also performed on the ASProCore processing intrinsics just like all the usual C operators. Therefore, the communications intrinsics can be considered an integrated extension of the C language for the purpose of UI-ASP programming tools.


Even on the simplest of processor architectures, high-performance applications need carefully designed code that makes best use of the available software components. ASProCore implementations of the ASP architecture ease the task of developing new algorithms and applications, or of porting them from other high-performance architectures that cannot provide the scalability and flexibility of UI-ASP.

Fig.4 Software environment

High-level language software programming
In the design phase, developers can use C to design the code, allowing them to work in a more abstract and powerful language than assembly language. Developers can focus on product functions and algorithms instead of tuning algorithms to the hardware platform. The only guideline in developing the code is to use scalar-vector and vector-vector operations for the data intensive parts of the application. Next, the programmer will need to use the ASProCore intrinsics and existing libraries to express the data intensive operations. Since this approach leads to creating shorter application code than assembly for the same capability, less work is done, reducing application development cycle time and effort. And not just code, but requirements and designs can be reused on other designs like SoC, eliminating work and reducing product development time. Code developed is reusable with future generations of ASProCore architecture variants; i.e. algorithm and application code are verified before samples of the next generation UI-ASP processor implementation are available.

ASProCore technology advantage

The advantages of the ASP over other high-performance processing architectures are the direct results of its support of associative processing and its SIMD architecture principle. They are:


Architectural

  • In-situ processing of data in the APEs, for all non-arithmetic operations - reduces the effect of the sequential fetch-process-store bottleneck (in communications processing applications where locality of data access is limited, effective overlapping is not always possible)
  • Processor address elimination - the selection of the APEs is based on their contents rather than their addresses within the parallel structure
  • Unlimited scalability - the result of elimination of APE addressing
  • Fault-tolerance - data can be allocated to any of the working APEs and ignore faulty ones
  • Application flexibility - all data structures to be processed are mapped onto a string, which through associative processing constitutes a fundamental and naturally parallel computation form.

Implementation

  • Low-power consumption - only active APEs dissipate power
  • Ease of integration in systems - the elimination of APE addressing simplifies the bus requirements and enables future proofing (ASProCore implementations can be used in successive generations of products since the interface to other parts of the system remains identical)
  • Ideal for microelectronics - the elimination of APE addressing allows ASProCore implementations to exploit large-chip areas, finer geometries, and very dense packing densities to increase the processing power in a device better than any other computational architecture. This achieves a performance leap of at least 8x (see Fig.5) for each new implementation, by combining architectural gains with higher process technology density.
  • High-packing density - the pin-out of the ASProCore implementations is independent of the number of APEs integrated in a single device.

Fig.5 Linearly scalable performance with architecture & technology evolution

ASProCore implementations are more technology independent than conventional processors. In short, the architectural advantages offer high performance, processing flexibility and true scalability, whereas the implementation advantages offer very efficient technology implementation and future proofing.

Why it works

An example of a next-generation communications processing environment, which migrates from an integrated set of fixed-function processors to a software implementation, is illustrated in Fig.6.

Fig.6 Communications processing environment

The communications processing environment depicts an information gateway capable of supporting a variety of interfaces, a control processor to handle user interface and system management functions, and a high-performance area that supports Standards functions (e.g. real-time communication protocols) and Customized functions. The latter are user-specified functions associated with intelligent analysis and interpretation of communications information that is received or produced.


Traditional solutions
The majority of high-performance architectures for communications-related processing were introduced with the common belief that the “speed-up” of certain operations was all that was needed. The “speed-up” of processing associated with the Standards functions was the main objective. However, this belief may be dangerously misleading. Processing of image parts, for example, may lead to combinatorial explosions of derived parts and of relations among collections of parts. The computational bottleneck in communications-related processing is not entirely, or even primarily, at the early processing stages where information is communicated. Treating the customized communications-processing problem as a fixed sequence of stages, or as one that requires only low-performance processing, is in itself a simplification. In reality, the processing for the Standards and Customized stages is very closely integrated, with the results at a given stage providing information to modify the techniques used in previous or subsequent stages.


The traditional approach to meeting the requirements for Standards and Customized functions is to specify configurations that include a general-purpose microprocessor as well as a bank of high-performance DSPs dedicated to signal processing. This calls for two hardware types requiring two separate software code bases. Upgrading such systems requires new hardware with at least one major software change.


Associative SIMD approach
Philips' associative SIMD approach is the easiest, simplest and most cost-effective option for building and developing the next generation of fully-software programmable communications processors. SIMD is the most natural and simple way to build a fully-software programmable processor with parallel operations for data intensive applications. ASProCore's architecture simplifies the complexities of application development by using simple vector processing models (scalar-vector and vector-vector operations) without the need to support code analysis and scheduling optimization in hardware. The result is much simpler hardware for any degree of performance. Superscalar, VLIW, and advanced RISC processor architectures, by contrast, bear the cost of control complexity in die area, power and cost performance.


ASProCore's support for high-performance number crunching and abstract data processing is the most silicon efficient approach to building next generation processor architectures that combine general purpose computing and DSP processing requirements. New wireless and broadband applications over the next five years will place heavy pressure on processor designers to raise achieved performance levels by 5x, or more, per year without increasing cost (parallel design and high-level software tools will be critical in meeting this challenge).


Likewise, new processors must easily be able to grow in functionality and performance. Scalable, by definition, means simple implementations of both hardware and software to migrate applications in terms of schedule, cost, and complexity towards future, more powerful implementations. A design is not scalable if the code must be significantly rewritten to support future generations of new products. Assembly level programming techniques are simply not scalable when compared with the implementation done in the framework of a high-level language such as C.

Energy efficiency

Low energy consumption is important for embedded communications processing, particularly applications where power consumption is a concern. ASProCore's associative SIMD architecture has several inherent features that lead to lower energy requirements compared to other high-performance architectures such as superscalar, VLIW, and enhanced RISC architectures.


Instruction fetch, decode, and dispatch is performed once for a vector of operations, significantly reducing control logic energy requirements. Vector instructions provide the hardware with explicit dependence information about element operations, hence no complex speculation, prediction, or re-order structure is needed to discover parallelism.
By accessing main memory directly, no energy is wasted in caching data that has only spatial locality, as is common in data-streaming communications applications. Vector instructions also access memory and vector register banks in a regular pattern, which avoids energy consumption in bank arbitration circuitry and enables power optimizations such as selective bank activation.


ASProCore implementations support highly parallel execution with low overhead, allowing voltage scaling to be employed to reduce energy per operation. Clock frequency can be kept low and still deliver higher performance than implementations of other architectures operating at clock frequencies more than 3x higher. The lower clock frequency allows a considerable decrease in the power supply voltage.
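The voltage-scaling argument rests on the standard first-order CMOS relation that dynamic power is proportional to switched capacitance times supply voltage squared times frequency, so energy per operation is proportional to capacitance times voltage squared. The comparison below is a sketch with invented numbers: the capacitance, both supply voltages, the unit counts and the assumption of one operation per unit per cycle are illustrative and are not device data.

```python
def dynamic_power(units, cap_farads, volts, freq_hz, activity=1.0):
    """First-order CMOS dynamic power: units * activity * C * V^2 * f."""
    return units * activity * cap_farads * volts ** 2 * freq_hz

def energy_per_op(cap_farads, volts, activity=1.0):
    """Energy of one operation, assuming one operation per unit per cycle."""
    return activity * cap_farads * volts ** 2

CAP = 1e-9  # hypothetical switched capacitance per unit, in farads

# Hypothetical designs: one fast unit versus four units at a third of the clock.
fast = {"units": 1, "volts": 1.8, "freq_hz": 900e6}
wide = {"units": 4, "volts": 1.2, "freq_hz": 300e6}

for name, d in (("fast", fast), ("wide", wide)):
    throughput = d["units"] * d["freq_hz"]
    power = dynamic_power(d["units"], CAP, d["volts"], d["freq_hz"])
    print(f"{name}: {throughput / 1e6:.0f} Mops/s at {power:.3f} W")

ratio = energy_per_op(CAP, wide["volts"]) / energy_per_op(CAP, fast["volts"])
print(f"energy per operation, wide relative to fast: {ratio:.2f}")
```

Under these assumptions the parallel design delivers more operations per second at lower power, because the quadratic saving from the lower voltage outweighs the extra units.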

Design scalability

Scalability is vital for communications processor architectures as transistor budgets, the speed of integrated circuits and requirements for performance improvements increase according to Moore's law. There are two equally important sides to scaling architecture: scaling performance and scaling design complexity. Vector architectures have several advantages for both:

  • Scaling performance
    Conventional architectures scale performance by increasing the operation frequency and number of arithmetic units in the design. The simplicity of the SIMD hardware (simple issue logic, increased locality within the structure) makes it easy to scale clock frequency. But for an energy-conscious design, adding more APEs allows the execution of more operations in parallel. In Modular-MPC implementations, adding more processing channels can readily scale up the overall system performance, transparently to compiled code. The balance between storage and processing power is maintained when scaling the number of processing channels, as each channel contains both processing data paths and memory ports.
  • Scaling design
    ASProCore architectures can be scaled into future fabrication technologies, which will be much more sensitive to interconnect delays. The majority of communication in ASProCore implementations is held local to each processing unit; hence cycle time should not be affected when scaling up the number of such units. The rapidly increasing complexity of superscalar and VLIW processor design is a disadvantage when considering using future high-density fabrication processes. In contrast, ASProCore has a modular design both in the APE and memory interface that can be readily scaled up. Control logic complexity remains constant with system size, whereas the design complexity of superscalar and VLIW designs is a superlinear function of issue width.

Philips' SIMD processor

The current processor (see Fig.7) is the first off-the-shelf processor device from Philips that integrates the ASProCore, control, storage, DMA and memory interface logic into a single device with two PCI interfaces. Fabricated in CMOS12 technology, it is an implementation of the UI-ASP architecture. Using the unique level of parallelism that ASProCore implementations can achieve, the device brings the highest level of performance for processing data in this era of data convergence. It samples at a clock rate of more than 300 MHz, integrates 4096 APEs, and delivers high performance in terms of GOPS and GMACS (see table in this section).

Fig.7 Philips' SIMD processor block diagram

The processor's major functional units include the ASProCore, RISC controller (an implementation of the SPARC architecture), 128 Kbytes of Data Memory, and I/O interfaces. The ASProCore is designed to operate at 266 MHz with program instructions issued by the RISC controller (the controller block includes a 128 Kbytes program memory). The device integrates a 64-bit synchronous main memory interface operating at 133 MHz (double-data rate) with two 64-bit, 66 MHz PCI interfaces.


Data transfer
The Data Transfer block provides the application programmer the support required to develop software that overlaps application processing with data transfers. In inner loops operating over large sets of data, the programmer can program it to move strips, blocks or patches of data on and off the chip. The Data Transfer block includes logic that accepts instructions from the RISC Controller and executes these in a separate thread of control with minimal CPU support to stream data to and from the chip using external SDRAM memory. The design of the ASProCore block, as previously explained, allows full overlap of the data movement and SIMD processing.


Load and store operations
The conventional microprocessor and DSP approach to processing multi-dimensional communications data stored in memory is to access the data using load/store operations. Such load and store operations systematically miss in the data cache and result in very low processor efficiency. First, to reduce the penalty of cache misses, data can be pre-fetched via pre-fetch operations prior to use. However, pre-fetching instructions themselves consume execution cycles that could have been used for actual computation. Second, locality of the pre-fetched data is usually low since the original data is multi-dimensional. The convention of loading a cache line at a time loads significant unused data, causing pollution of the data cache and wasting external memory bandwidth.


A key insight in many of these data intensive computations is that address patterns are very predictable by the programmer. The Data Transfer block in Philips' processor can execute these address patterns directly and transfer exactly the desired two-dimensional block of image data, arranging the data into a pattern that the ASProCore can access efficiently. This high degree of efficiency allows systems employing the processor to rely on inexpensive external memory, while at the same time executing high-performance complex application routines in real time.
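The predictable address pattern of a two-dimensional block is easy to write down. The generator below is a sketch only: its name and parameters, the row-major layout, the 640-element row pitch and the 64-byte cache line used for comparison are all assumptions, and it says nothing about how the Data Transfer block is actually programmed.

```python
def block_addresses(base, pitch, x0, y0, width, height, elem_size=1):
    """Yield the byte address of every element in a width x height block,
    row by row, for a row-major image with 'pitch' elements per row."""
    for row in range(y0, y0 + height):
        row_start = base + (row * pitch + x0) * elem_size
        for col in range(width):
            yield row_start + col * elem_size

# Hypothetical image: base address 0x1000, 640 one-byte elements per row.
addrs = list(block_addresses(base=0x1000, pitch=640, x0=8, y0=2, width=4, height=2))
print(addrs)  # [5384, 5385, 5386, 5387, 6024, 6025, 6026, 6027]

# For comparison: bytes a cache with (assumed) 64-byte lines would fetch for the same block.
CACHE_LINE = 64
lines_touched = {addr // CACHE_LINE for addr in addrs}
print(len(addrs), len(lines_touched) * CACHE_LINE)  # 8 128
```

In this example a line-based cache would move 128 bytes to deliver 8 useful ones, while a transfer that follows the address pattern moves only the 8.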

Philips' current SIMD processor - main features

  Clock                                             > 300 MHz
  Data Transfer - Synchronous Memory Interface      Up to 2,100 Mbytes/sec
  Program Transfer - Synchronous Memory Interface   Up to 2,100 Mbytes/sec
  PCI Interface                                     8 bytes @ 66 MHz (each)
  Performance                                       153.6 GOPS (8-bit add)
                                                    25.0 GOPS (8-bit multiply)
                                                    7.6 GOPS (16-bit multiply)
                                                    6.1 GMACS (16-bit)
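The figures listed above can be cross-checked with simple arithmetic. The calculation below assumes that the peak rates correspond to all 4096 APEs working at exactly 300 MHz (the text gives "more than 300 MHz") and that the memory interface is the 64-bit, 133 MHz double-data-rate interface described earlier; the cycle counts it prints are implied values under those assumptions, not published specifications.

```python
APES = 4096
CLOCK_HZ = 300e6  # assumption: the "> 300 MHz" figure taken at exactly 300 MHz

peak_rates = {
    "8-bit add": 153.6e9,
    "8-bit multiply": 25.0e9,
    "16-bit multiply": 7.6e9,
    "16-bit MAC": 6.1e9,
}

def implied_cycles(ops_per_s):
    """Cycles each APE would spend per operation if every APE ran at CLOCK_HZ."""
    return APES * CLOCK_HZ / ops_per_s

for name, rate in peak_rates.items():
    print(f"{name}: about {implied_cycles(rate):.1f} cycles per operation")

pci_mbytes = 8 * 66        # 8 bytes per transfer at 66 MHz, per PCI interface
mem_mbytes = 8 * 133 * 2   # 64 bits = 8 bytes, 133 MHz, two transfers per clock
print(pci_mbytes, mem_mbytes)  # 528 2128
```

Under these assumptions the 8-bit add works out to 8 cycles, which is what a bit-serial ALU handling one bit per cycle would need, and the memory figure of 2128 Mbytes/sec is consistent with the "up to 2,100 Mbytes/sec" quoted.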


Application domains for Philips' SIMD processor

Video
  Areas:
    · Broadcast video
    · Video post production
    · Broadband internet
    · Cable and satellite
    · HDTV and Interactive DTV
  Functions:
    · Format conversion
    · Encoding
    · Compression (MPEG, and video streaming)
    · Noise reduction
    · Image enhancement

Data processing
  Areas:
    · Databases
    · Search engines
  Functions:
    Parallel processing capabilities and associative processing are fundamentally well suited to this environment

Image processing
  Areas:
    · Scientific: Industrial and Medical
    · Security: Fingerprint and iris recognition
    · Avionics and Military
  Functions:
    · Image enhancement
    · Image analysis
    · Pattern recognition
    · Compression

Neural networks
  Areas: many areas (mainly research):
    · Artificial Intelligence
    · Scientific
    · Military
    · Robotics
    · Avionics
  Functions: many kinds of functions using:
    · Parallel processing
    · Associativity
    · Cylindric CAM

Infrastructure communication
  Areas:
    · 3G base station
  Functions:
    · Rake receiver
    · Path tracking

About Royal Philips Electronics

Royal Philips Electronics of the Netherlands is one of the world's biggest electronics companies and Europe's largest, with sales of EUR 29 billion in 2003. It is a global leader in color television sets, lighting, electric shavers, medical diagnostic imaging and patient monitoring, and one-chip TV products. Its 164,500 employees in more than 60 countries are active in the areas of lighting, consumer electronics, domestic appliances, semiconductors, and medical systems. Philips is quoted on the NYSE (symbol: PHG), Frankfurt, Amsterdam and other stock exchanges. News from Philips is located at: www.nxp.com.