Developer’s Manual March, 2003 B-39

Intel® 80200 Processor based on Intel® XScale™ Microarchitecture

Optimization Guide

B.5.2 Scheduling Data Processing Instructions

Most Intel® 80200 processor data processing instructions have a result latency of 1 cycle. This
means that an instruction can use the result of the immediately preceding data processing
instruction without stalling. However, the result latency is 2 cycles if the current instruction
uses that result as the register being shifted in a shift by immediate. As a result, the mov
instruction in the following code segment incurs a 1 cycle stall:

sub r6, r7, r8

add r1, r2, r3

mov r4, r1, LSL #2

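The code above can be rearranged as follows to remove the 1 cycle stall. The sub instruction does not depend on r1, so placing it between the add and the mov covers the extra cycle of result latency: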

add r1, r2, r3

sub r6, r7, r8

mov r4, r1, LSL #2
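
The same reordering applies whenever the register being shifted by an immediate is written by the instruction directly before it, not only when the consumer is a mov. The sketch below is illustrative only: the register choices and the orr/and instructions are assumptions made for this example and are not taken from the original listing. It assumes the and instruction does not depend on r0, so it can be moved freely. Comments use the armasm-style `;` prefix; GNU as expects `@` instead.

```asm
        ; Stalls 1 cycle: r0 is shifted by an immediate in the very next instruction
        add     r0, r1, r2
        orr     r3, r4, r0, LSL #4
        and     r5, r6, r7

        ; Reordered: the independent and separates the producer of r0 from the shift
        add     r0, r1, r2
        and     r5, r6, r7
        orr     r3, r4, r0, LSL #4
```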

All data processing instructions have a 2 cycle issue latency and a 2 cycle result latency when the
shifter operand is a shift/rotate by a register or is RRX. Because the next instruction is always
delayed by the issue latency, whether or not it depends on the result, the stall cannot be hidden
by reordering; it can only be avoided by rewriting the assembler instruction. Consider the
following segment of code:

mov r3, #10

mul r4, r2, r3

add r5, r6, r2, LSL r3

sub r7, r8, r2

The subtract instruction would incur a 1 cycle stall due to the issue latency of the add instruction,
because the shifter operand of the add is a shift by a register. Since r3 is known to hold the
constant 10, the stall can be avoided by using a shift by immediate instead:

mov r3, #10

mul r4, r2, r3

add r5, r6, r2, LSL #10

sub r7, r8, r2
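
The same substitution works for other data processing instructions whenever the shift count is a constant known when the code is written. As a further sketch (the registers and the multiply-by-7 idiom are chosen for illustration and are not part of the original example), a constant shift count held in a register can be folded into the instruction. Comments use the armasm-style `;` prefix; GNU as expects `@` instead.

```asm
        ; Shift by register: 2 cycle issue latency, so the following eor is delayed
        mov     r1, #3
        rsb     r0, r2, r2, LSL r1      ; r0 = (r2 << 3) - r2 = r2 * 7
        eor     r4, r5, r6

        ; Shift by immediate: no shift-by-register issue latency; r1 is not needed
        rsb     r0, r2, r2, LSL #3      ; r0 = (r2 << 3) - r2 = r2 * 7
        eor     r4, r5, r6
```

When the shift count is only known at run time, this substitution is not available.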
