Stefan123 wrote: ↑
Tue Jul 18, 2017 6:21 pm
Is intrinsic_outi() faster than z80_otir() when, for example, transferring 256 bytes? Is it faster even if z80_otir() would be inlined? Will using intrinsic_outi() make the code size larger?
Yes, it's faster. intrinsic_outi() approaches 16 cycles per output byte, whereas the best an otir instruction can do is 21. Options in the library will also allow z80_otir() to use something like intrinsic_outi(), and this will likely get it to 17/18/19 cycles per out. It's a bit slower because z80_otir() will not know ahead of time how many outs are being done, so it will have looping considerations to take care of.
And, yes, it will also make the output binary larger, but not by much. For now, it causes a block of 64 outi instructions to be incorporated in the binary, which amounts to 128 bytes (each outi is two bytes).
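To put those figures in perspective for the 256-byte case from the question, here is a rough back-of-the-envelope calculation in C. It uses only the per-byte cycle counts and the 64 x outi block size quoted above. The helper names are made up for illustration and are not part of z88dk, and the model deliberately ignores call overhead, register setup and the slightly cheaper final iteration of otir.

```c
#include <stdio.h>

/* Per-byte costs quoted in the post, in Z80 cycles (T-states). */
#define OUTI_CYCLES 16u
#define OTIR_CYCLES 21u

/* outi is an ED-prefixed instruction, so it occupies two bytes. */
#define OUTI_OPCODE_BYTES 2u

/* Hypothetical helper: cycles to move n bytes at a flat per-byte cost. */
unsigned long transfer_cycles(unsigned long n, unsigned cycles_per_byte)
{
    return n * cycles_per_byte;
}

/* Hypothetical helper: code size of an unrolled block of `count` outi. */
unsigned unrolled_block_bytes(unsigned count)
{
    return count * OUTI_OPCODE_BYTES;
}

/* Print the comparison for a transfer of n bytes. */
void print_outi_vs_otir(unsigned long n)
{
    printf("%lu bytes: outi block ~%lu cycles, otir ~%lu cycles\n",
           n,
           transfer_cycles(n, OUTI_CYCLES),
           transfer_cycles(n, OTIR_CYCLES));
    printf("64 x outi block costs %u bytes of code\n",
           unrolled_block_bytes(64));
}
```

Calling print_outi_vs_otir(256) reports roughly 4096 cycles for the outi block against 5376 for otir, at a code cost of 128 bytes.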
I assume that intrinsic_ldi() would be a good fit for implementing a blit function on the layer 2 screen, at least for some edge cases like copying whole 256-pixel lines. What is the performance of using intrinsic_ldi() for copying data compared to memcpy()? I guess that memcpy() is also quite optimized?
intrinsic_ldi() is also faster, again approaching 16 cycles per copied byte. This is as fast as you can get without resorting to stack tricks or the dma to move bytes around. By default, the compiler inlines memcpy as an ldir instruction, which is 21 cycles per byte. You will be able to turn off this inlining and instead get memcpy to jump into the ldi block, where it should see 17/18/19 cycles per byte, with the slowdown compared to intrinsic_ldi() due to looping considerations.
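As a rough illustration of what that means for one 256-pixel layer 2 line, the C sketch below compares the per-byte figures from this post (16 for the ldi block, 17 to 19 for memcpy jumping into it, 21 for ldir). The function names are hypothetical and not part of the library, and the flat per-byte model leaves out setup and call overhead.

```c
/* Per-byte costs quoted in the post, in Z80 cycles (T-states). */
#define LDI_CYCLES  16u
#define LDIR_CYCLES 21u

/* Hypothetical helper: cycles to copy n bytes at a flat per-byte cost. */
unsigned long copy_cycles(unsigned long n, unsigned cycles_per_byte)
{
    return n * cycles_per_byte;
}

/* Hypothetical helper: cycles saved over an inlined ldir for n bytes.
   Returns 0 if the given per-byte cost is not actually cheaper. */
unsigned long cycles_saved_vs_ldir(unsigned long n, unsigned cycles_per_byte)
{
    if (cycles_per_byte >= LDIR_CYCLES) {
        return 0;
    }
    return copy_cycles(n, LDIR_CYCLES) - copy_cycles(n, cycles_per_byte);
}

/* Hypothetical helper: saving as a whole-number percentage (rounded down). */
unsigned percent_saved_vs_ldir(unsigned cycles_per_byte)
{
    if (cycles_per_byte >= LDIR_CYCLES) {
        return 0;
    }
    return (LDIR_CYCLES - cycles_per_byte) * 100u / LDIR_CYCLES;
}
```

For a 256-byte line this works out to 4096 cycles at the ideal 16 per byte against 5376 for ldir, with the memcpy-into-ldi-block path landing somewhere between 4352 and 4864.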
For layer 2 software sprites, what's being considered is customizations of ldi blocks to copy pixels and skip over pixels that are transparent in the sprite. So sprites may occupy three bytes per pixel (ldi = 2 bytes + colour) instead of one (colour). For more restricted copying, there will be dma task lists that can be fed to the dma to simulate copies of 2d shapes on screen. The dma cannot skip transparent areas though and neither can it perform logical operations on source or destination data.