Caches, Write Buffer, and Read Buffer

6.3.2.2 Writes to a Bufferable and Noncacheable Location (B=1, C=0)

If the write buffer is enabled and the processor writes to a bufferable but noncacheable location that misses in the Dcaches, the data is placed in the write buffer and the CPU continues execution. As with the cacheable case, merging is allowed only on store multiples. The write buffer performs the external write sometime later.

6.3.2.3 Unbufferable Writes (B=0)

If the write buffer is disabled or the CPU performs a write to an unbufferable area, the processor is stalled until the write buffer empties and the write completes externally. This requires several external clock cycles.
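The two write cases described in this section can be summarized as a small decision function. This is a minimal sketch for illustration only: the names classify_write, write_outcome, wb_enabled, and b_bit are hypothetical, and the model assumes a write to a noncacheable location that misses in the Dcaches.

```c
#include <stdbool.h>

/* Hypothetical outcome labels, for illustration only. */
typedef enum {
    WRITE_BUFFERED,   /* data is placed in the write buffer; CPU continues */
    WRITE_STALLS_CPU  /* CPU waits until the buffer empties and the write completes */
} write_outcome;

/* wb_enabled mirrors control register bit 3; b_bit is the bufferable (B)
 * attribute of the location being written. Cacheable writes are not modeled. */
write_outcome classify_write(bool wb_enabled, bool b_bit)
{
    if (!wb_enabled || !b_bit)
        return WRITE_STALLS_CPU;
    return WRITE_BUFFERED;
}
```

Under these assumptions, only the combination of an enabled write buffer and B=1 lets the CPU continue without waiting for the external write.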

6.3.3 Enabling the Write Buffer

To enable the write buffer, ensure that the MMU is enabled by setting bit 0 in the control register, then enable the write buffer by setting bit 3 in the control register. The MMU and write buffer can be enabled simultaneously with a single write to the control register.
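As a sketch of the bit manipulation involved, the following C fragment computes a control register value with both enable bits set. The macro and function names are hypothetical, and the coprocessor access that reads and writes the control register is deliberately left out; only the value computation is shown.

```c
#include <stdint.h>

#define CTRL_MMU_ENABLE (1u << 0)  /* control register bit 0: MMU enable */
#define CTRL_WB_ENABLE  (1u << 3)  /* control register bit 3: write buffer enable */

/* Returns the value to write back so that the MMU and the write buffer
 * are enabled together; all other bits of ctrl are preserved. */
uint32_t ctrl_enable_mmu_and_wb(uint32_t ctrl)
{
    return ctrl | CTRL_MMU_ENABLE | CTRL_WB_ENABLE;
}
```

Setting both bits in one value corresponds to the single-write option described above.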

6.3.3.1 Disabling the Write Buffer

To disable the write buffer, clear bit 3 in the control register. Any writes already in the write buffer will complete normally, but a drain write buffer operation must be performed to force all pending writes out to memory.
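The distinction between disabling the write buffer and draining it can be illustrated with a toy software model. This is a sketch under stated assumptions, not a description of the hardware: the wb_model structure, its pending_writes counter, and both function names are hypothetical.

```c
#include <stdint.h>

#define CTRL_WB_ENABLE (1u << 3)  /* control register bit 3 */

/* Toy model: a control register value plus a count of buffered writes. */
typedef struct {
    uint32_t ctrl;
    unsigned pending_writes;
} wb_model;

/* Clearing bit 3 stops new writes from being buffered, but entries that
 * are already queued remain pending until they complete. */
void wb_disable(wb_model *m)
{
    m->ctrl &= ~CTRL_WB_ENABLE;
}

/* A drain forces every pending write out to memory. */
void wb_drain(wb_model *m)
{
    m->pending_writes = 0;
}
```

In this model, calling wb_disable alone leaves pending_writes unchanged, which is why a drain is still needed when software depends on the data having reached memory.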

Note: The write buffer is used for copy-backs from the Dcaches even when they are disabled.

6.4 Read Buffer (RB)

The SA-1100 contains a software-programmable read buffer that can increase the performance of critical loop code by prefetching data. The RB enables the preallocation of read-only data into one of four 32-byte buffers without stalling the pipe. For subsequent loads that hit in the RB, data is sourced from the buffer instead of the Dcaches at a rate of 1 word per core clock. Also, because the programmer specifies which entry of the RB is used, critical data can be “locked” in to eliminate bus latency.

The RB is controlled using coprocessor 15, register 9, and provides the capability to allocate 1 word, a half-line (4 words), or a full line (8 words) into one of four entries of the RB. (See Chapter 5, “Coprocessors” for a detailed RB coprocessor description.) Half-line loads are automatically aligned onto half-block boundaries (the lower four address bits are ignored). Full-line loads are automatically aligned onto line boundaries (the lower five address bits are ignored). For partial cache line RB loads, only the words actually fetched are marked valid and can be sourced from the buffer. A small queue is used to ensure that subsequent RB load instructions go out in order.
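The alignment rules for half-line and full-line loads can be expressed as address masks. The sketch below is illustrative only: the type and function names are hypothetical, the word-aligned treatment of single-word loads is an assumption, and the valid mask assumes that each RB entry tracks its eight words by their position within a 32-byte line.

```c
#include <stdint.h>

/* Allocation sizes in words (hypothetical enum for this sketch). */
typedef enum { RB_WORD = 1, RB_HALF_LINE = 4, RB_FULL_LINE = 8 } rb_size;

/* Address actually used for the fetch after automatic alignment. */
uint32_t rb_aligned_address(uint32_t addr, rb_size size)
{
    switch (size) {
    case RB_HALF_LINE: return addr & ~0xFu;   /* lower four bits ignored */
    case RB_FULL_LINE: return addr & ~0x1Fu;  /* lower five bits ignored */
    default:           return addr & ~0x3u;   /* assumption: word aligned */
    }
}

/* Bit i of the result is set if word i of the 32-byte entry is valid. */
uint8_t rb_valid_mask(uint32_t addr, rb_size size)
{
    unsigned word = (addr >> 2) & 0x7u;       /* word index within the line */
    switch (size) {
    case RB_HALF_LINE: return (uint8_t)(0x0Fu << (word & 0x4u));
    case RB_FULL_LINE: return 0xFFu;
    default:           return (uint8_t)(1u << word);
    }
}
```

For instance, a half-line load at address 0x1234 is fetched from 0x1230 and, in this model, marks only the upper four words of the entry valid.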

When an RB allocate instruction is executed, the virtual address is looked up in the TB to check for a translation hit and possible access violations. If the access misses in the TB, the pipe is stalled until the page is fetched through the normal hardware tablewalk mechanism. If an access violation occurs, the RB load is NOP'd; that is, an RB allocate instruction cannot generate a data abort. Once the RB allocate has received a TB hit with no access violations, a bus access is requested that fills the appropriate buffer without stalling the core pipeline. Subsequent load instructions to this virtual address result in an RB hit, and the data is sourced to the core from the appropriate entry.
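The sequence of checks for an RB allocate can be sketched as a small function. This is a simplified model with hypothetical names (rb_lookup, rb_alloc_result, rb_allocate); it treats the TB lookup results as inputs and does not model the tablewalk or the bus access themselves.

```c
#include <stdbool.h>

typedef enum {
    RB_ALLOC_NOPPED,          /* access violation: the RB load does nothing */
    RB_ALLOC_FILL_REQUESTED   /* bus access requested to fill the buffer */
} rb_alloc_result;

typedef struct {
    bool tb_hit;              /* translation is already present in the TB */
    bool access_violation;    /* permission check on the translation fails */
} rb_lookup;

/* *pipe_stalled reports whether the pipe had to wait for a hardware
 * tablewalk (TB miss). The fill itself never stalls the pipe here. */
rb_alloc_result rb_allocate(rb_lookup lk, bool *pipe_stalled)
{
    *pipe_stalled = !lk.tb_hit;
    if (lk.access_violation)
        return RB_ALLOC_NOPPED;
    return RB_ALLOC_FILL_REQUESTED;
}
```

In this model the only stall comes from a TB miss; a violation simply turns the allocate into a no-op, and a clean hit leads to a buffer fill in the background.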

SA-1100 Developer’s Manual
