MPC860 Performance Checklist
Revision 2, February 11, 1997

When using an MPC8xx family member in an application, it may be helpful to go through the following performance checklist, which lists a number of possible performance optimizations.

1. Make sure the memory controller programming is tuned, via the UPM RAM array, to the specific DRAMs used in your application. Also make sure the memory access timing is correct for the bus frequency of the final production product. For example, UPM code designed for a 50 MHz system may program wait states that are unnecessary in a 20 MHz system and therefore cause excess DRAM access time. A freeware tool called UPM860, available on our web site at http://www.motorola.com/netcomm, may help with UPM programming. The 860ADS board initialization code provides other examples for study.

2. The MMU should be enabled, for several reasons. On the MPC8xx chips, the standard TLB has been replaced by a FastTLB, which uses the I-bit to determine when code is staying on the same page. If code stays on the same page, enabling the MMU decreases access time into the cache. Enabling the MMU also makes the next performance enhancement possible:

3. Use "guarded" storage only for FIFOs or similar peripherals whose data is destroyed upon access. The MMU provides the capability to designate memory locations this way. For example, if a group of instructions enters the queue, one of them accesses a guarded location, and an interrupt occurs, the interrupt will not be serviced until the guarded access completes, no matter how many instructions are ahead of it. If the location were not guarded, the interrupt would be serviced as soon as it arrived: all instructions in the queue would be flushed and refilled after the interrupt was serviced.

4. Make sure the caches are enabled.
When the caches are enabled, make sure the Burst Inhibit (BI) pin is pulled high as well, since the caches require burst accesses. The caches are there to reduce access time, but there are other things to take into consideration, including the following:

5. Make sure all relevant pages are cache enabled.

6. Use the copy-back policy on the D-cache for all "software only" data. Copy-back writes from the CPU to the cache immediately and writes to external memory later, saving external bus cycles until a more convenient time for the system. "Software only" means data that only the CPU will ever need: typically the user's stack, software scratch variables, data tables, and any other data that will never be passed to the I/O. You cannot use this mode for anything the CPM will use: since the MPC8xx does not incorporate bus snooping, the CPM could get stale data. The following three rules therefore help optimize cache usage with the CPM, since the CPM cannot access the cache:

7. Use the D-cache enabled with the write-through policy for transmit data. Write-through writes immediately both to the cache and to external memory.

8. Use D-cache inhibit for receive data. A good rule of thumb is to think of this as anything that is DMA'd.

9. Use D-cache inhibit for data used only once.

10. Emphasize cache usage optimization over core optimization techniques. For example, loop unrolling (repeating a loop body two or more times to reduce the iteration count) can sometimes cause a higher I-cache miss ratio, because the additional instructions no longer fit in the cache. If you are optimizing very tightly, this will have to be tried and evaluated for your particular application; overall, however, loop unrolling has been shown to hamper performance, especially since the MPC8xx core performs branch prediction anyway.

11. Check how many interrupts per second are activated.
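The trade-off in item 10 can be illustrated with a minimal sketch (the function names and the unroll factor of four are our own choices). Both functions compute the same sum; the unrolled version has fewer branches per element but roughly four times the loop-body code, which is exactly what can push a hot loop out of the I-cache.

```c
/* Straightforward loop: small code footprint, one branch per element. */
int sum_rolled(const int *a, int n)
{
    int s = 0;
    for (int i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Unrolled by four: fewer branches, but roughly 4x the loop-body code,
 * which can raise the I-cache miss ratio as item 10 warns. */
int sum_unrolled(const int *a, int n)
{
    int s = 0, i = 0;
    for (; i + 3 < n; i += 4)
        s += a[i] + a[i + 1] + a[i + 2] + a[i + 3];
    for (; i < n; i++)          /* remainder elements */
        s += a[i];
    return s;
}
```

Since both versions are functionally identical, the only way to pick between them is to measure cache behavior in your application, as item 10 advises.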
Interrupts affect performance in three ways:

- Interrupt overhead. Does the interrupt handler save and restore only the necessary state? That is, how many registers are saved at the beginning of the handler and restored at the end? When writing in assembly, the programmer can determine which registers need to be saved and restored by knowing which ones will be corrupted. In C there may be no such choice, so this may be a moot point; but a choice can be made to write some interrupt routines in assembly so that they require minimal register saving and restoration.

- Interrupt execution can significantly slow or even thrash the caches -- especially the I-cache. Watch where interrupt routines are placed in memory, and therefore where they will be cached. The cache is indexed predominantly by 7 address bits just above the line offset, effectively defining a 2K block size. If your code is placed at the same offset within this 2K block as an interrupt routine that fires frequently at the same time, there is significant risk that the cache will encounter exorbitant misses and be thrashed by repeated cache misses as execution goes back and forth between the two. The solution is to relocate either the code or the interrupt routine in memory.

- If the interrupt's save area is the user's stack, it should be cache enabled with the copy-back policy, consistent with item 6. If it is in a different location, i.e. the supervisor's stack, it should be analyzed relative to the frequency of interrupt events in the system, just as the interrupt code itself is analyzed.

Also note that the caches are two-way set associative, so each memory location can be cached in just TWO places. You can analyze your code by blocks. Example: typical code is composed of a task calling C library routines, interrupted by an interrupt handler. If the library routines and the interrupt handler use the same cache sets, the task will see a larger-than-necessary miss ratio.
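The 2K-block collision check described above can be sketched as a small helper (the function names are our own; the 16-byte line size and 128-set index are our reading of the MPC860 cache organization and should be verified against the User's Manual). Two routines whose addresses map to the same set compete for that set's two ways.

```c
/* ASSUMED geometry: 16-byte cache lines, 128 sets (2-way), so the set
 * index is 7 address bits above the 4-bit line offset. Addresses that
 * are a multiple of 2K apart therefore land in the same set. Verify
 * these parameters against the MPC860 User's Manual. */
unsigned cache_set(unsigned long addr)
{
    return (unsigned)((addr >> 4) & 0x7F);   /* 128 sets */
}

/* Two code addresses collide when they map to the same set. */
int collides(unsigned long a, unsigned long b)
{
    return cache_set(a) == cache_set(b);
}
```

Running frequently executed task code and a frequent interrupt handler through a check like this (using their link-time addresses) flags the collisions that the relocation advice above is meant to remove.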
Locating the frequently used C library routines and a frequently activated interrupt handler in different sets always leaves one way free for the task. The key is to avoid cache collisions as much as possible.

12. For frequently activated interrupt handlers, consider locking the interrupt routine in the cache (if the same code runs on subsequent interrupts) or cache-inhibiting the routine (if it is typically different code on subsequent interrupts).

13. Make sure not to work in "serialized mode" (controlled in the ICTRL register). This is part of the debug support. While it is very helpful when debugging code initially, leaving serialized mode turned on will cripple a production product's performance.

14. Make sure to avoid "show cycles" if not debugging.

15. Please read the application note in Appendix C.2 of the MPC860 User's Manual, entitled "PowerPC Performance Impact". It contains additional information, especially on data misalignment and compiler switches, not covered here.

16. When intending to use the caches, ensure that the Cache Inhibit Default (CIdef) bit is cleared in both the Instruction MMU Control Register (MI_CTR) and the Data MMU Control Register. The MPC860UM/AD revision of the User's Manual erroneously states that the cache is disabled when this bit is cleared. The correct interpretation of the CIdef bit is: 0 = cache enabled (the default value); 1 = cache disabled (inhibited).

17. Two other items to be aware of when working with the caches:

- The data cache does not support a bus error that occurs on the 2nd or 3rd data beat of a burst. Refrain from using bus error in these cases.

- The last instruction executed from a page may get the cache-inhibit attribute of the next page if the page change occurs between the time the fetch request is issued to the instruction cache and the time the instruction cache delivers the instruction to the sequencer.
Since instruction cache inhibit is used only for performance reasons (mostly to avoid caching very fast memories, or pages that contain non-real-time programs), the effect of this bug is negligible.
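Returning to item 16, the CIdef fix is a simple read-modify-write. On hardware the MMU control registers are special-purpose registers accessed with mfspr/mtspr; the sketch below operates on a plain value to show only the mask logic, and the bit position is an assumption (our reading of the manual's register layout) that must be verified against the MPC860 User's Manual before use.

```c
#include <stdint.h>

/* ASSUMPTION: CIdef occupies bit 2 in big-endian PowerPC bit numbering,
 * i.e. mask 0x20000000 -- verify against the MPC860 User's Manual.
 * On hardware the value would be read from and written back to
 * MI_CTR (or the data-side register) with mfspr/mtspr; here we show
 * only the mask arithmetic. */
#define CIDEF_MASK 0x20000000u

uint32_t clear_cidef(uint32_t ctr)
{
    return ctr & ~CIDEF_MASK;    /* 0 = caching enabled by default */
}
```

Clearing rather than setting the bit is the point of item 16: despite the manual's wording, CIdef = 0 is the cache-enabled state.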