HW Sprites and how to use them

Discuss ZX Spectrum Next Games, Tools and more.
Post Reply
User avatar
Maziac
Posts: 55
Joined: Sun Jul 09, 2017 5:56 am

HW Sprites and how to use them

Post by Maziac » Thu Jan 24, 2019 7:28 pm

Hi,

I'm currently looking into the problem to efficiently use the HW sprites from SW i.e. from Assembler.

But I ran into a few problems and I would like to hear your ideas or solutions.

The problem is simply put: How to deal with the sprite's position efficiently.

When working with sprites the usual loop you need to do is:

1. Moving sprites
2. Drawing sprites
3. Looking for collision
4. Remove sprites (e.g. because of collision)
5. Create new sprites (e.g. explosion or a shot has been created or what else)
6. Goto 1

For most of these operations the position of the sprite is required.
In the ZX Next the x and y position is 9 bit.
I.e. more than 1 byte which requires double register arithmetic.
Furthermore the 8th bit is shared in a byte with other bits which makes it difficult to retrieve and set.
Then the data need to be kept in a kind of shadow memory as the ZX Next Sprite HW registers are not readable.

So all in all this sums up to an algorithm like the following:

1. Move sprites in normal RAM (shadow memory)
2. Draw sprites: Copy this shadow memory to ZX Next HW
3. Go through all x/y positions in the shadow memory to look for collisions.
4. Remove sprites: Disable them in the shadow memory (on next copy to ZXNext HW the HW sprites are disabled)
5. Create new sprites: Enable them in the shadow memory and set their position (on next copy to ZXNext HW the HW sprites are disabled)
6. Goto 1

The shadow memory should use the the same bit packing as the ZX Next HW in order to move the complete block all at once from the memory to the ZX Next HW (I think this can be done also with DMA, but for DMA it is even more important that structures in memory are exactly the same as in HW).

That means I still need to implement an efficient way to retrieve the X/Y position.

Let's take the X-position as an example:
The lower 8 bits are at offset 0. The bit 8 is the first bit in the byte at offset 2.
The other bits are not 0 but contain other information that should not be changed and need to be bit-masked.

This would result in assembler in something like this:

Code: Select all

ld hl,shadow_memory + sprite_index*5    ; Load the address of some sprite 

; Load X 
ld e,(hl)   ; T7. Load low byte of X 
inc hl      ; T6.
inc hl      ; T6.
ld d,(hl)   ; T7. Load the 9th bit. de now contains X 
; T=26

; Add some delta 
ex de,hl    ; T4. hl contains X 
ld bc,x_add ; T10. The value to add to X 
add hl,bc   ; T11. x += x_add
; T=25

; Store new value
ex de,hl    ; T4. de = new X, hl = pointer to 9th bit 
ld a,(hl)   ; T7. Load the current flags 
rra        ; T4. Discard 9th bit 
rr d       ; T8. Move bit 9th bit to Carry
rla        ; T4. Set 9th bit from Carry 
ld (hl),a   ; T7. Store 9th bit without changing the other flags 
; Store low byte 
dec hl      ; T4
dec hl      ; T4 
ld (hl),e   ; T7 
; T=49

; Total T=100

To compare this with an ideal case where I just take a 2 byte value add another 2 byte value to it and save it, this is very much. Here is the ideal case:

Code: Select all

ld hl,(x)  ; T20. The X-position
ld de,x_add     ; T10. The value to add/subtract to X-position 
add hl,de       ; T11. x += x_add
ld (x),hl       ; T20. Store new value

; In total T=61
This is just 61% of the cycles required for the example above. Let alone the complexity and the additionally used double register BC.

The same has to be done similarly for the Y-value as well.

This was for the moving then there is the collision detection. Here we need to get the x and y values for comparison.
And here we additionally need to mask out the high bits when getting the values.
So the additionally code required would be similarly maybe 40 to 50% more.
And in collision detection it can also happen that the same sprite is compard several times against other sprites. Meaning: the getter algorithm has to be used a lot of times.

So all in all this is a huge area for optimization.

I would be happy to get any ideas for some better algorithm. Maybe there is also some HW support or some Z80N instruction that could help and that I have overseen.

Any help is appreciated.

Ped7g
Posts: 82
Joined: Mon Jul 16, 2018 7:11 pm

Re: HW Sprites and how to use them

Post by Ped7g » Thu Jan 24, 2019 9:11 pm

There's also one more aspect you didn't mention in your post, that is "z-index" of sprites, if the overdraw order of sprites is important for you, then you have to have way to insert-into the whole list, at particular position... (if you have somewhat fixed amount of particular sprite types, you can use fixed ranges, and insert/delete inside those, as the unused sprites can be set to "visible=off" meanwhile).

I can pretty much agree on majority of your post, so my further notes are not meant like refutation of something, but more like possible extension or slight modification... The major point is to have shadow copy in HW structured way for fast upload. While I believe you can probably go without that for simpler games like pac-man, for anything with larger amount of dynamic sprites like some bullet-hell shooter, it will be probably more sane to prepare one final buffer first, then DMA that in one go. But that introduces one more issue, you have to pick ahead which sprites will use 4 and/or 5 byte attribute structure, but you can overcome that easily by using either 4B or 5B only, it will in the end probably work quite naturally.

But I'm not sure if that does 100% imply you must use that shadow copy also as "live" storage, for some games/cases it may be viable to have your own internal structures (optimized for your code, i.e. DDD = data driven design), do all calculations over those, and produce the final structure into memory for DMA only, or even do upload while building the structure, not storing it. (oh, did I just contradict myself about "having shadow copy of HW structs is obvious"?).

Then some game types may lead to interesting consequences, like for example Warhawk with it's large score-board occupying right side of screen can pretty much operate with sprites using only 8bit X coordinates. Maybe storing half-res coordinates may be viable for some games (so things will be positioned by x*2, y*2 in the final step), maybe adding the +-1px movement only for specific sprites as "animation", for example chess pieces are usually positioned and aligned to "grid", so cutting coordinates resolution is viable.

Same goes with collisions, you may often get away with half-res or even less with "collision" coordinates, maybe even using some kind of hash buckets sort to first put all sprites into some larger squares depending on their pos/16 or /32, etc, and then do collisions only between sprites meeting in same bucket (+neighbours or inserting single sprite into multiple buckets if it overlaps).

Also moving sprites by +-1px and moving them by arbitrary amount can make difference, like using only inc/dec instead of 16b add.

I will probably end this generic/ideas post here, and write one more commenting directly your code with tiny code details.

Ped7g
Posts: 82
Joined: Mon Jul 16, 2018 7:11 pm

Re: HW Sprites and how to use them

Post by Ped7g » Thu Jan 24, 2019 9:54 pm

Maziac wrote:
Thu Jan 24, 2019 7:28 pm
....

Code: Select all

ld hl,shadow_memory + sprite_index*5    ; Load the address of some sprite 

...
In my recent code (working on Next tests) I fell into trap of preferring rather simple and short source over performance... with tests it's of course no issue, but while prototyping some game, I would maybe consider that too... simple code is easier to modify, and there's always possibility to optimize it later, if profiling shows the game is not fast enough, and the sprite handling is significant part of that.

And that leads me to.... consider IX/IY usage... it's very slow (`ld l,(ix+0)` `ld h,(ix+2)` is already 38T and we just fetched byte 1 + byte 3 sprite attributes, but you have all three 16b base pairs free for calculations, you can load/store directly into hl/de/bc saving something on `ex de,hl`, you have whole sprite structure accessible all the time (no need to `inc/dec hl` back and forth), so the performance loss cause by (ix+*) may be smaller than you would expect, plus it may keep the code simpler, easier to modify while prototyping, and you can still rewrite it for performance later.

But let's say your code was doing precisely what you want, and you want just few more final cycles squeezed out of it:

Code: Select all

ld hl,shadow_memory + sprite_index*5    ; Load the address of some sprite 

; Load X 
ld a,(hl)   ; T7. Load low byte of X 
add a,x_add&$FF ; T7 adjust low 8 bits, sets CF
ld (hl),a   ; T7 store it
inc hl      ; T6. (inc does not modify CF!)
inc hl      ; T6. (actually the 16b inc does nothing, but even 8b inc keeps CF)
ld a,x_add>>8   ; T7 load 9th bit of x_add
adc a,(hl)      ; T7 add all 9th bits together and carry
rra         ; T4 put final 9th bit into carry (!)
ld a,(hl)   ; T7. Load attribute "byte 3" for original flags
rra         ; T4 drop old X9 and load new X9 (but at wrong top pos)
rlca        ; T4 fix position of bits (with new X9 at bottom)
ld (hl),a   ; T7 store modified X9

; Total T=73
; (I didn't debug it, so let's hope I didn't do some stupid mistake there... :) )
Now if you can guarantee you are in aligned region of memory (wow, actually with 128 sprites and 5B structure the total buffer is 640B, well over 256, I didn't realize that before, felt so small, just 4-5 bytes, right? ... still if you will use fixed ranges for particular sprite types, you can probably guarantee blocks of them to be within `align 256`) => you can replace 2x"inc hl" by 2x"inc l", saving another 4T => total T=69.

And that's just +8T to your "simpler" proposal.. not that bad *patting myself on back*... :D

Also if "x_add" is some kind of constant used for several sprites, you can preload it into DE or BC (+10T), saving another 6T per sprite on those "add a,x_add&$FF" and "ld a,x_add>>8", i.e. for 10 sprites 10-60 = -50T. (if we ignore that +10T, then new total is 63T, now only +2T behind your faster variant :D )


One more idea... with relative sprites introduced, you may also in certain cases get away with some kind of "swarm" mechanism, where you have to deal with full 9b coordinates only on the anchor sprite (which may be actually fully transparent in worst case, holding top-left corner to make relative sprites operate in positive ranges of x/y values) and each anchor having some swarm of relative sprites operating relative to it. I can somewhat imagine it for some "wave of enemies"...

(one more edit: you can in that code also pre-load Y byte between those `inc hl` into some register, so you will not need to get back to it later for read, like `ld e,(hl)` ... may be worth it, maybe - it's always tricky with optimizations for me, until I have the final code, I have all sorts of those clever ideas back on mind, but after the dust settles, there's often some brain-dead simpler variant not looking clever at all, but "just working")

Ped7g
Posts: 82
Joined: Mon Jul 16, 2018 7:11 pm

Re: HW Sprites and how to use them

Post by Ped7g » Fri Jan 25, 2019 7:56 am

Hm, there's even some more space for optimization without too much of sacrifice... If you can guarantee to use "natural" form of either `add` or `sub` for movement right/left, i.e. `x_add` constant will be always positive, and you don't need bigger than 255px movement, you can use only 8 bit `x_add` (or let's call it `x_delta` rather), like this:

Code: Select all

ld hl,shadow_memory + sprite_index*5    ; Load the address of some sprite 

ld a,(hl)   ; T7 Load low byte of X 
add a,x_delta ; T7 adjust low 8 bits, sets CF
    ; or `sub x_delta` for left movement
ld (hl),a   ; T7 store it
inc hl      ; T6 (inc does not modify CF!)
inc hl      ; T6 (actually the 16b inc does nothing, but even 8b inc keeps CF)
jp nc,.Xdone; T10 if CF=0, the X coordinate is updated and HL is +=2 => done
ld a,(hl)   ; T7 carry was set, so 9th bit needs to invert
xor 1       ; T7 invert X9  (this can be T4, if can hold "1" in some base register)
ld (hl),a   ; T7 store modified X9
.Xdone:
; Total T=64 in 9bit case, 43T in 8 bit case
Again you can shave 4T of this in case you will stay within single 256B block by replacing 2x"inc hl" with 2x"inc l", and another 3T if you can hold value 1 in some base register for the `xor`, then the total T would be 57/39 depending if X coordinate did need full 9 bit update or 8 bit update was enough.

EDIT: and because the 8-bit overflow is reasonably rare case, if you have enough room to reorganize the code, you can even use `jr c,.resolveX9` to have only T7 for normal cases... thinking about it, you can even decide to end this block with HL pointing still at X normally, or let's bump it by +1 to point at Y at end:

Code: Select all

ld hl,shadow_memory + sprite_index*5    ; Load the address of some sprite 

ld a,(hl)   ; T7 Load low byte of X 
add a,x_add ; T7 adjust low 8 bits, sets CF
    ; or `sub x_add` for left movement, point is to have only positive "x_add"!
ld (hl),a   ; T7 store it
inc hl      ; T6 (inc does not modify CF!)
jr  c,.handleX9 ; T7 for CF=0, T12 for CF=1
.Xdone:

; total T=34T (or 32T with "inc l") for common case when only 8-bits of X need update
and somewhere near to be reachable by "jr"

Code: Select all

.handleX9:
inc hl      ; T6 update HL to point to X9 byte
ld a,(hl)   ; T7 carry was set, so 9th bit needs to invert
xor 1       ; T7 invert X9
ld (hl),a   ; T7 store modified X9
dec hl      ; T6 restore HL back to Y
jp .Xdone   ; T10 continue with common case code

; Total T=27+12+43=82T for rare case when X needs full 9 bit update

User avatar
Maziac
Posts: 55
Joined: Sun Jul 09, 2017 5:56 am

Re: HW Sprites and how to use them

Post by Maziac » Fri Jan 25, 2019 4:37 pm

Thanks very much for this valuable and elaborated input.
In the end it seems to drill down to bit fiddling and optimization depending on the available registers.

I must admit that I was silently hoping that there is some kind of HW support available that I've overlooked.

Ped7g
Posts: 82
Joined: Mon Jul 16, 2018 7:11 pm

Re: HW Sprites and how to use them

Post by Ped7g » Fri Jan 25, 2019 4:54 pm

I'm not aware of any, except the new `ADD HL/DE/BC,A` and `ADD HL/BC/DE,$nnnn` Z80N extensions, but I don't think those would seriously change the code above.

In principle, you have to upload new data to move sprite. To get new data, you have to either generate them by some formula, or have previous data (shadow copy) and adjust that.

Ped7g
Posts: 82
Joined: Mon Jul 16, 2018 7:11 pm

Re: HW Sprites and how to use them

Post by Ped7g » Wed Feb 27, 2019 4:29 pm

Maziac wrote:
Fri Jan 25, 2019 4:37 pm
Thanks very much for this valuable and elaborated input.
In the end it seems to drill down to bit fiddling and optimization depending on the available registers.

I must admit that I was silently hoping that there is some kind of HW support available that I've overlooked.
With latest extensions of Sprite engine, the relative sprites have now only [int8_t, int8_t] coordinates, so if you can group your sprites reasonably by some 256x256 screen area and you have no better use for anchor+relative shenanigans, it's now real option to anchor transparent sprite (must be visible to make relatives visible, so using fully transparent pattern is probably easiest hack) somewhere in the middle and update multiple "relatives" just by adjusting two 8b [x,y] coordinates. The X8/Y8 bits in relatives allows you then to use relative palette offsets and relative pattern names, so you can in some lucky scenarios use that anchor even to "animate" between frames or tint whole group of sprites into particular section of palette.

Of course the anchor brings also many limits like 4/8 bit graphics selection done by anchor, etc, so the overall benefit will strongly depend on the game/sw design, it may help a lot in certain cases which will fit to this model perfectly, or it may be too limited for others.

Post Reply