Lesson
M11 - Simple RLE
There will certainly be times when memory is short - and one option
is compression!
RLE is one of the simplest compression techniques, and
is used by formats like GIF images...
Lets learn how to use a simple RLE decompressor to show an image
onscreen
CPC_RleByte.asm SMS_RleByte.asm
RLE - Run Length Encoding
Run Length encoding works when consecutive bytes are the same -
suppose we want to compress the string "111100000223335" (15
characters)
We could do this by storing a letter - and a count for that letter.
this would result in a string '1105223351' (10 characters)...we've
saved a few characters!
We'd also want a way of storing bytes that are not the same (we'll
call it linear data)... lets look at the real algorithm
111100000223335 1x40x52x23x35x1 1405223351
Our actual format is more advanced, well a bit!
We'll have a 'header byte' -
If the whole byte is zero - then we're reached the end of the
compressed sequence
If the top bit is 1, the next sequence is RLE compressed... the
remaining bits are a bytecount (1-127)
If the top bit is 0, the next sequence is LINEAR uncompressed
data... the remaining bits are a bytecount (1-127)
Compressed data:
Uncompressed Data:
If you want to RLE compress a file in this format, you can use
AkuSprite Editor, the FileProcessor option
will compress in RLE format
The RLE
compression alrorithm here certainly isn't the best! it was
shamelessly stolen *ahem* borrowed from Phantasy Star!...
At the request of a patreon, the author of these tutorials did a
disassembly of the routines here
... This tutorial is based on what
was learned!
Z80 Decompressor
When we want to decompress data, we need
to specify a source in HL... and a destination in DE
The RLE decompressor is split into 3 parts... the Header
byte processor detects if the next block is Linear, Rle, or if
the stream had ended...
We use JP M, This is 'Jump if Minus' - effectively this
tests the top bit of the registers, which is what we need to detect
the RLE block
The Linear section handles uncompressed data.
The RLE section handles the compressed data
Of course we'll need our Compressed RLE data, and some storage
space for the destination after it's decompressed.
Here is the decompressed data.
Of course we can compress anything!... here's a simple CPC screen!
This screen is 16k... but compressed it's just 2.7k!
Of course that's only because it's so basic, but you get the point!
Some
Data will RLE compress well - others will end up bigger than they
started!
You need to pick what data to compress, and what to leave alone,
if possible, design your graphics to be simple if you need to RLE
compress them in this way
SMS Bitmap data - splitting bitplanes for better RLE
Lets use our RLE compressor to compress the bitmap data of the
Chibiko mascot we used in a previous example...
Unfortunately, the raw format of the file is 'problematic'
Looking ad the data, it looks a bit complicated, and maybe like it
won't compress so well (We would save just 3 bytes)... but we need
to think of what the data represents - and use our decompressor
accordingly!
Our graphics is 16 color - and is split into 4 bitplanes... Bitplane
0, Bitplane 1,Bitplane 2,Bitplane 3
Because Chibiko is only 4 color, bitplanes 3 and 4 are totally
unused... If we split up our file, and store each bitplane in a
separate RLE we'll get far more efficiency.
Splitting the file in this way reduces the file size by over 50%!
We can create 4 files split in this way with the 'Split
Four Bytes' option in Akusprite Editor
Filling One Bitplane of the SMS with RLE
Again The RLE decompressor is split into 3 parts... the Header
byte processor detects if the next block is Linear, Rle, or if
the stream had ended...
We use JP M, This is 'Jump if Minus' - effectively this
tests the top bit of the registers, which is what we need to detect
the RLE block
The Linear section handles uncompressed data.
The RLE section handles the compressed data
Note that we're OUTing the decompressed data to the vdp, and first
we're selecting the VRAM address in DE ... DE must contain VRAM
address +&4000 (&40000 denotes WRITE)...
We add 4 to DE each time, so we're sending the decompressed data to
just one bitplane
We need to include all 4 bitplane files in our source code.
When we want to stream the data to the 4 bitplanes, we load the
starting VRAM address in DE, and the RLE data in HL
We repeat, backing up the destination address before each loop, and
INCing the Vram address for each consecutive bitplane
HL will point to the start of the next bank of RLE data after each
execution, so we don't need to change it between bitplanes.
The image will look the same, but it took less memory!
We've not saved
much memory, but it's only an example of how you can use RLE
compression, and change the way you're writing (OUTing) data to
VRAM to optimize the data you're sending.
If you're doing a small game it's not going to be a problem, but
if you're making a bigger game, you may need to keep things as
small as possible.
Lesson
M12 - Stack Tricks!
We've looked at the stack before and discussed some tricks we can do
with it...
Lets look in more detail, and learn how we can manipulate the stack
for parameter passing, allocating temporary ram and more!
Test_StackFun.asm
Passing parameters to Subroutines
If we want to save program code and we want to pass parameters to
a function, rather than loading them into registers, we can store
them in the code after the sub with DB and DW statements...
We can read these in from within our sub, and modify the return
address to skip over the parameters so code execution resumes after
the parameters
To get the address of the parameters, we transfer the top of the
stack into IX...
We then use IX to read in the parameters.
Finally we alter IX to skip the parameters,
adding enough bytes to jump over the parameters, and put the new
return address back on the stack
A and HL have been loaded in by the sub
Another way of passing parameters to a function is to push
parameters on to the stack... this is handy if we need the
registers for other purposes.
We get the stack pointer into IX and use it to read
in the parameters.
Once we've taken the parameters, We need to move the return address
to the top of the stack with the parameters removed
Here are the results.
Returning parameters on the stack
Of course, we can reverse the procedure, taking the return
address off the stack, putting some
parameters on the stack, then putting the return
address back on the stack
We can then read the parameters back after
the sub finishes
Here are the results
Allocating Temp Ram on the stack
If we need a small amount of temporary ram for our subroutine,
rather than having a permanently allocated bank, it may make more
sense to allocate it on the stack, and set IX or IY to this
allocated data.
In this example, we've allocated six bytes...
when we're done, we reset the stack pointer
The data will be on the stack... above the
return address... so make sure you fix the stack
before returning!
Allocating a
few bytes of temp ram with a label will of course be much easier,
but what about if you're subroutine will need to be nested?...
you'd need to allocate those bytes for each iteration of the
function
By using the stack, once the function completes, all the bytes are
freed - just make sure your stack isn't too small - or something
may get overwritten!
Return as a Jump!
Lets take a look at a simple test program...
This program has a large number of branches, all of which end up
with a JumP to CaseThanks
a JP command takes 3 bytes each... and if space is super tight this
may be a problem if there are a lot of them... but a RET command is
just one byte
If we push an address onto the stack - then
any RETurn command will effectively jump to
that address - if there are a lot of Jumps we can save a few
bytes...
If there are some cases we don't want that jump to occur we can just
POP the return address off the stack.
This is just a
simple example of what you can do with the stack and RETurn... you
can of course do much more... like simulating a JMP (BC)
command - jump to a numbered entry in a vector table, or
whatever you can figure out the code for!
Of course, these are all advanced 'tricky' things, so don't worry
about them if you're starting out.
Jump with RST (Like making our own command!)
It's not really related to the
stack... but if we're looking for ways to save a byte or too, don't
forget RST!
There are 8 RST's, which effetively call to low addresses... on the
CPC RST6 is reserved for the user to use... that means we can do
with it what we want,without harming the firmware.
RST6 (RST &30) is a 1 byte command... and in this case we're
setting it to JP RstTest... As RST is a
call, this means RST6 is equivalent to CALL RstTest
Our 'RstTest' function does the same as our earlier example - taking
a word from the following bytes as a string to show to the screen...
Effectively we've made our RST6 into a
'PrintString' command.. it prints the following word- and this saves
3 bytes compared to 'LD HL,STRING / CALL RstTest'
Here is the result
Stack Misuse (For fast Read/Write)
We've covered it before - but
lets look again at Stack Misuse... this is where we use the stack
for bulk reading or writing...
We just point the stack pointer to the bottom of
the area we want to read, or top of the area we want to write,
and PUSH or POP away.
Of course we can't use the stack for anything else when we're doing
this, so no CALLs!
We need to restore the stack pointer once
we're done so we can CALL again!
This example copies the screen, scrolling the screen up 8 lines
The routine will cause a vertical scroll of 8 lines when we run
it!
We used the stack to copy the whole screen quickly
If
you're misusing the stac kin this way, you'll probably want to
disable interrupts first!.
Because interrupts are effectively CALLs, and calls use the stack,
they may draw corruption to your screen, and make quite a mess!
If your super clever you may be able to work around this - but
it's very tough so it's best to disable interrupts and make your
life easy!
Lesson
M13 - Fast Multiplication and Division.
We've looked at some very simple Multiplication and Division before,
which just repeatedly added or subtracted a value... that's one
solution, but it's pretty slow.
Lets look at some better ways to do things!
MultDivTest.asm
Multiplication
Lets look at
Multiplication!....
If we want to multiply 102 by 10 by hand, we would multiply 10 by
each of the digits in 102 (or the other way round!)
We can do the same in assembly...however instead we'll work in bits
- bit-shifting the value we want to multiply, and adding the amount
we want to multiply by each time.
This function will effectively set HL = HL*A
Through each iteration of the program, we rotate the bits in A to
the right... causing one bit from the right to be copied into the
Carry
If the bit is 1, we add DE to HL
Whether it was or not, we double DE - setting it to the correct
value to add for the next bit.
Here we can see what happens to the registers... on the right is
the bits in A and the Carry (C)
When the Carry reaches 1 - we add the current value of DE (1020) to
HL... this effectively multiplies the value that WAS in HL at the
start of the function into by the bit position (now it's shifted
into the Carry)
We multiplied &10 by &0102...
the result was &1020
Division
Lets Divide 1023 by 10
When we do division by hand, we need to work out how many times we
can take 10 from each digit - starting from the left moving to the
right - any remaining value is the remainder.
Of course, we'll have to do things in bits when it comes to our ASM
version.
In our example we'll perform HL=HL/A... where HL=1023 and A=10
We actually work sort-of backwards with the division - rather than
moving A right through the bits of HL, we'll move HL left (by adding
HL to itself - doubling it).
We'll move HL LEFT - popping the leftmost (top) bit into A... with
each shift we'll also double A - when A is greater than, or equal to
the amount we wanted to divide by (10) we subtract that amount - and
add 1 to HL
That '1' that was shifted into HL will continue shifting with HL -
until we've rotated HL round a full 16 bit 'loop'...
At the end HL will have been divided by the value (10)... and A will
now contain any remainder!
Here's the registers... and a bit representation of HL on the
right.
We move the bits left in HL (Blue highlighted bits) -and when a bit
gets pushed out, we move it into A (Yellow Highlighted)
A also shifts left each iteration, and once this exceeds the divisor
we add 1 to to HL (shown Yellow)
We repeat this loop 16 times - HL ends up containing the whole
number part... and A is the remainder.
We divided HL by A
The result is in HL... the remainder is in A
This is just
two examples of a dividing a 16 bit number by an 8 bit one - if
you need more, check out Grauw's
siteWhich has many other z80
algorithms for multiplication and division with great
documentation!
Lesson
M14 - SpeedTile software Tilemap Engine (1/3) - (Basic Tilemap)
It's time to start a new series... We're going to look at a
multiplatform software Tile/Sprite Engine
This allows for scrolling tilemaps, and even sprites with clipping
(meaning they can be partially offscreen) of pretty much any size.
Lets Learn how!
SpeedTile is a software tilemap... it uses 2 frame buffers to
facilitate software scrolling, and works at 256x192 on most systems.
It breaks up images into 8x8 tiles, and has optimized drawing
routines which are designed to work 'around' the screen co-ordinate
difficulties of each system.
The tile code has the potential to use multiple drawing engines for
flexibility - Normal (PSET tiles) Fill (solid colors for fast tiles)
Zero Transparent (byte zero not drawn) and Transparent (Tile
completely unused)
Tilemaps can be any size, and small 'tilemaps' can be used as
sprites - as the code allows tile reuse (compression) and flood fill
for solid color tiles (speed) and flexibility (small areas
transparent only when needed) using the tile code for sprites can be
extremely efficient.
Software
scrolling means we'll use two screen buffers, and redraw the
entire screen every frame - this is slower than Hardware
scrolling, but it's much easier and is less platform specific.
Many systems cannot hardware scroll (SAM/Speccy) and systems like
the CPC have limited ability - plus Hardware scroll is difficult,
as memory addresses change every time the screen moves, and if a
lot of stuff moves onscreen, it could even end up slower than soft
scroll as we'd end up redrawing everything anyway.
And beware! If you're struggling to understand the software scroll
tutorial here... you'll really find hardware scroll too hard!
The Logical Tilemap
There are 3 kinds of tilemap
we need to consider.
The Visible tilemap.... What we see onscreen
The Logical tilemap... this is a set of pointers to the data used to
draw the screen - it's slightly bigger than the screen to allow for
scrolling
The Supertile map... this is the 'compressed' levelmap... it's many
time bigger than the screen.
The logical tilemap is pretty simple... each 8x8 tile is represented
by 1 word in the format:
%AAAAAAAA AAAAAATF
Where A is the address of the bitmap data in ram for the 8x8 tile
(with the bottom 2 bits as 00)... eg &FF84
F=1 denotes a filled tile that uses 2 bytes to fill the entire tile
(Faster)... eg &FF84+1
F=1 and T=1 denotes a transparent tile ... eg &FF84+3
A=00000000 + F=1 denotes an empty tile - the tile is not drawn -
skipping this 8x8 block... eg &0+1
We use symbols to simplify defining these.
The Tilemap is rotated 90 degrees clockwise... so the XY
co-ordinates of data in the Logical tilemap go DOWN then ACROSS...eg
DW [Y0 X0],[Y1 X0],[Y 2X0]
DW [Y0 X1],[Y1 X1],[Y2 X1]
DW [Y0 X2],[Y1 X2],[Y2 X2]
This is because the drawing order of the screen goes down then
across.
Because
speed is paramount the logical tilemap code is platform
specific,however they're all pretty similar.
We'll take a look at the CPC version here, but the SAM and Speccy
version are pretty similar, it's just 'newline calculations'
differ
Drawing the Logical Tilemap
We need to draw the tilemap... we set DE to the first byte of the
logical tilemap to draw.
A/C are the XY offset (in 1/4 tiles)... to move a full tile we need
to alter the DE offset.
The CPC can only move the tilemap in 4 pixel chunks... ***
We need to calculate the base VRAM address of the top-left corner of
the tilemap - we use IX to calculate the start address in VRAM to
draw.
Our Logical tilemap will usually be taller than our visible
screen... Our Screen is 24 tiles tall - but what if our Logical
tilemap is 32 tiles tall???
After drawing each vertical strip to screen, there will be 8 tiles
unused... we'll need to skip over them... we calculate the amount to
skip here for later.
We set IYH and IYL for the end of the tilemap sizes.
This is consistent for the tilemap... but will be different when we
use the tilemap code for sprite drawing.
*** NOTE: The 4
pixel vertical shift is actually handled by an alternate version
of this module "CPC_V1_SpeedTile_ShiftedTile.asm"... this is
because the routine is optimized for 8x8 blocks, and the
optimizations are different for an effective 4 pixel vertical
offset... using extra ram for a second version makes more sense
than trying to write a single version that works for both
positions and lose speed we don't have.
Starting The Tile Draw
We use the Stack for fast reading during the main loop, so we back
up the stack pointer at the start.
shadow C is used for fast calculations on the CPC to move down a
line (add 8 to Vram H)
Note we're using shadow registers to allow less use of ram to store
paramters
At the start of a line, we reset the screen memory position to
draw the next vertical strip
We need to draw a tile... we load in the address of the bitmap
data for the tile from HL
we check the bottom bit of the address... if it's 1 this is not a
'normal' tile - it's either Filled or Transparent
Drawing a bitmap tile
we're drawing a normal tile.
We set the stack pointer to the start of the Tile bitmap data and
POP a pair of bytes from the data (one lines worth on the CPC)
We write the first byte to the screen and move right (INC L) then we
write the second byte.
we then move down a line ... we add 8 (from C) to the accumulator
(Which contains H) - we then update H
We then repeat the process... this time we write the bytes in
reverse to save unnecessary moves.
We repeat this for all 8 lines of the tile.
Finishing the tile
When a tile is complete we need to fix up HL... HL will now point
one tile below the last one - ready for the next tile.
We check B and repeat until the strip is done.
Once a strip is done, we skip any unused tiles (The tilemap will
usually be taller than the visible screen)
We need to move 8 pixels to the right... we INC C twice... moving
across the screen two bytes (8 pixels) and repeat to do the next
vertical strip
When there are no strips left, we restore SP and return.
The Solid tile
The Solid tile is basically the same, however alternating lines
are filled with just 1 byte - this gives the tile a filled color
(with checkerboard pattern color)
Because there's just one bytepair read It's faster than than the
regular draw
The Empty tile
Empty tiles have no drawn pixels - we just change HL to move down
to the next tile.
Transparent tiles (Zero Byte)
The 'Transparent' tile version is for sprite drawing.
It's basically the same as the regular tile routine, but before each
draw we test the byte, if it's zero, we don't put it onscreen.
this is used for the sprite drawing routines we'll look at in a
later episode.
Lesson
M15 - SpeedTile software Tilemap Engine (2/3) - (SuperTile map)
The logical tilemap is quick to draw, but using 2 bytes per tile is
a problem... if our game world used 256x128 tiles then the total
tilemap would be 64k - our entire memory!
To solve this, we use 'Supertiles' - each one contains 4x4 sprites
(16) ... making our 256x128 tilemap out of 1 byte supertiles takes
just 2048 bytes!
SpeedTileSprite.asm
V1_SpeedTile_SuperTileMap.asm
Supertile definitions and SuperTile Map
The SuperTile map uses 1 byte per supertile defining the tile
number
The tilemap is organized horizontally, so each byte has the
position:
db [X0 Y0],[X1 Y0],[X2 Y0]
db [X0 Y1],[X1 Y1],[X2 Y1]
db [X0 Y2],[X1 Y2],[X2 Y2]
Each Supertile defines a 4x4 square of tiles, so it's definition
is 16 tiles / 32 bytes in total
Using the SuperTileMap code
We're going to update the Visible/Logical tilemap based on a
'World position'
This uses a 16 bit X and Y co-ordinate whic splits into 3 parts....
%--------SSSSOOHQ
The Top byte -------- is unused (our world is
only 16 supertiles wide)
The SSSS bits are the position in the
SuperTilemap for the top corner of the Logical Tilemap OO is the tile offset of the Logical tilemap in
whole tiles HQis the offset of the
tile in Half/Quarter tiles (2 pixel blocks)... The SAM can use
quarter shifts... the CPC can only do half shifts... the SPECCY can
only do full block shifts
We have a single routine 'CalculateAndDrawTilemap' which will
update the Logical tilemap as required, and draw a full screen
tilemap.
IY needs to point to the variables defining the current position,
and last calculated position
Filling the Logical Tilemap from the Supertile Map
The Logical tilemap is 4 tiles (1 supertile) larger than the
visible tilemap on all sides, this means we can scroll up to 4 tiles
in any direction before updating the logical tilemap.
First we check the current position within the Supertilemap... if
the position of the currently calculated logical tilemap has not
changed we don't need to do anything.
If it HAS , we need to refill the Logical tilemap from the super
tile map
We need to convert the XY position within the supertile map into a
byte position within the supertile map.
The sample tile map is 16 tiles wide... so the offset address is:
SuperTileMap+(Ypos*16)+(Xpos)
We're going to transfer supertiles into the logical tilemap... we
load in a supertile from IX... and get the supertile definition from
"SuperTileDefs"
Each supertile is 4x4 tiles - 2 bytes each (32 bytes)
We need to transfer all all 16 bytes of each of the 4 lines of the
supertile definition into the tilemap.
We read data from the supertile definition by POPing it out
we repeat for all 4 lines
We repeat moving along the logical tilemap
We've done one 4 tile tall row of the Logical tilemap, we need to
move down a row and repeat.
Once the Logical tilemap fill is complete we restore the stack
pointer
You'll notice in
some parts of the code there are two versions - one that uses self
modifying code, and the other that uses symbols for other memory
addresses.
Self-Modifying is faster, but doesn't work if we're working from
ROM - we want both options so we can best server RAM systems like
the CPC, and ROM systems like the SMS
Calculating the Logical Tilemap area to draw.
We need to calculate the 'whole tile' offset of the logical
tilemap (Defined by bits 2,3 of our XY tilemap pos)
Each Tile contains two bytes, so to move down 1 tile, we multiply
the offset by two
We add this to the address of the start of the logical tilemap
Because the tiles are organized vertically in the logical tilemap,
To move across one tile, we need to add the height of the tilemap
(32) * the number of bytes per tile (2)
So we multiply the Xpos by 64, and add this to the address of the
start of the logical tilemap
We get the X 'Quarter tile' offset into A... and the Y offset into
C (for systems that can use it)
Due to the screen layout on systems like the Amstrad CPC it's not
easy to vertically shift the screen - to work around this, we have
an alternative draw routine to do a 4 pixel vertical offset.
Lesson
M16 - SpeedTile software Tilemap Engine (3/3) - (Software Sprites
& Sprite Clipping)
The Tilemap engine can be used to draw areas of any size, we can
therefore use it to create small 'Sprites' using the same graphics
code.
This will draw a small graphic, but we'll need to deal with the
instances where the sprite is partially onscreen - this is known as
'Clipping'
SpeedTileSprite.asm
V1_SpeedTile_SuperTileMap.asm
Drawing a sprite
Here's a 'Chibiko' sprite drawn with the Sprite/Tile engine...
because the graphics is being done by the tile code, we have all the
Tile options - Transparency, 8x8 fill areas and the possibility to
'reuse' 8x8 tiles without increasing the size of the memory
footprint.
The sprite can be any size, but it must be a multiple of 8x8 (of
course part can be transparent.
The DrawSprite routine is called to show the sprite to screen...
We pass a pointer to the 'Sprite Tilemap' in DE... a set of pointers
in the same format as the logical tilemap - one word per 8x8 tile.
We pass the X,Y pos in B,C
We specify the Width,Height (in Tiles) in H,L
This example automatically makes the Chibiko character move towards
the bottom right of the screen.
The Sprite needs to be defined in the same format as
the 'Logical Tilemap'... a grid of pointers to the tile data
Drawing a sprite - Cropping Check
The X,Y pos is NOT specified
in tiles - the co-ordinates are specified in pairs of pixels
'quarter tiles'... therefore our screen is 128x96 in logical units
So we can have clipping 0,0 is NOT the top left visible pixel of the
screen - there is an 'offscreen' area around the visible screen...
the first visible unit is 64,80 and the last is 192,176
We need to look at the position of our sprite, and see if it needs
any of its sides cropping
Sprite clipping takes time, so we'll use a marker (Shadow reg D) to
note if anything needs doing...
IX/IY registers will store the amount of cropping needed.
IXL - is the pixels to remove from Left
IYL - is the pixels to remove From Top
IXH - is the pixels to remove From Right
IYH - is the pixels to remove from Bottom
We can only crop full tiles ... there is a 1/2 or 3/4 tile border
around the tilemap so this is not noticeable!
For each direction we check if any tiles need to be cropped, and
store the number of tiles to remove - we'll calculate how to do this
later.
If the result of the 'crop' is no tiles are onscreen then we return
without drawing anything.
We check the Shadow D register, and run the crop routine if
required.
We also set the 'Y unused tiles' to zero - this will need changing
later if cropping occurs.
Drawing a sprite - Cropping
Because the Logical Tilemap uses 2 bytes per tile and goes down
the tilemap, to crop the left we have to remove an entire column of
tiles (2 * L)
To crop a row from the top, we move down the tilemap by 2 bytes per
tile... but we then need to skip those (in TileY unused)
To crop a row from the bottom, we reduce L... but we then need to
skip those (in TileY unused)
To crop the right hand side, we just decrease H
Hardware Sprite Support
If we're on a system like the
SMS we could use hardware sprites to draw the graphic - provided
we've not yet used all the hardware sprites this frame... we check
and branch to a special 'Hardware sprite' routine if possible
Drawing the sprite
We need to calculate the VRAM address of the first tile to draw.
This is Platform specific, there is a version for each system.
We've calculated the area to draw, and the address to draw to.
We run DrawTiles to do the job...
There's also an 'alternate' version for the CPC - which needs a
special version to draw with a vertical 4 pixel offset (due to
screen ram layout)
Using the
Tilemap code for sprites works surprisingly well!... Firstly
there's only one piece of code to port to new systems, and
splitting a sprite up into 8x8 blocks gives us opportunities for
saving speed and space... by making only small areas transparent -
reusing repeating parts, and solid filling blocks where possible.
It's win freaking win!
Lesson
M17 - Disassembly of a speech synthesizer! (Part 1/4)
I recently disassembled and optimized a software speech synth for
the CPC, and converted it for many other Z80 systems!
Lets take a look at the process!
This time, I'll share some tips for Disassembly!
Speech_PartDisassembled.asm
Speech_Final.asm
Generally The
author of these tutorials does not recommend Disassembly on the
basis that if you can't write something yourself from scratch, you
probably won't be able to understand it in disassembled source!
This was an exception though, as while the author knows very
little about speech synthesis, he did know Z80 and CPC wave
playback!
Warning! This
software has been distributed as 'Speech' (C) 1986 Superior
software (On Amstrad Action Covertape AA103) and the exact same
program was distributed as The
Speaker by Xavier GROSSETETE 1991
While it has been redistributed, and the final version here has
been heavily rewritten, The author of these tutorials cannot claim
ownership of the code, as it is a derivative of the fore mentioned
work.
As far as the author of these tutorials is concerned You are
welcome to use his code, however please do so at your own risk, as
he is not sure of the copyright implications of the rewrite.
Getting started with disassembly
First of all we need to identify what we're going to disassemble
(and check it actually works)
Here we can see our test program - conveniently in basic!
We already know a few things!
1. Our program code is loaded to &8690
2. There is some kind of INIT routine at &98FE
3. There is another INIT routine at &9907 - which took the
parameter 7
4. The String we want to say is passed to the routine at &998B
Taking a look at SAY.BIN we also learn our entire program is 8302
bytes.
If we change the value 7 on line 40, we will learn this is the
'pitch/playback speed' of the speech.
We've already learned lots of useful things,without even touching a
disassembler!
If we open the winape debugger, the Disassembler is available on
the right click menu!
Winape has disassembled the bytes into their equivalent source.
Referenced addresses by the code have been given labels with the
format .l{address}
But there's a problem! Some lines appear as ??? and this won't
re-assemble!
Why? Well, those lines must be BYTE DATA, that has no equivalent OP
code!
We'll need to fix these before we disassemble.
We will need to mark the problem areas as 'Data'
This will cause them to disassemble to DB statements rather than ASM
source.
It may take a few attempts to get things right. If we mark too
little as data, we will end up with an ASM source which won't
assemble, or has nonsense commands which may confuse us!
If we mark too much as data, we may find part of the functional code
has been stored as a DB statement - that's not so bad though, as
once we realize it (we'll see a JP or CALL to an address in that
data block) we can disassemble just that part of the source and
'Patch it into' our dissasembly
Detecting Data!
If the two are 'mixed' We'll want to try to work out what is data, and
what is source! Here are a few things to keep an eye out for!
This is very obviously data! We can see parts of words in it!
But there's something else we can learn from this... it looks like
parts of words, but they look kind of weird... 'OH3W' isn't a 'word'
but maybe it's the phonetic pronunciation of one?... It looks like
this is a table of data used to convert the speech we want to say
into phonetics!
This is something I would expect to see in a speech synthesizer. As
well as the fact this area is data, we've now learned it's purpose,
and if we see references to these addresses in the disassembled code
we'll know they are doing some kind of phonetic conversion.
Here is another block of very obvious data!
We have lots of repetitions of the same values over and over again!
It's very unlikely these are code - we can check that with the
disassembly view to be sure.
These 'Could' be bitmap data, but our program doesn't draw anything
so it's not that!
It does play sound samples though - so they could be some kind of
wave data? (Spoiler... they are!)
Here is the disassembled code!
we see "ADC B ADC B" - this is plausable, as adding a value twice to
the accumulator could be a crude multiplication.
But we also see "XOR D XOR D" ... The second XOR would undo the
first, so it's not likely this is code!
We also see "LD D,L LD D,L"... again, why would anyone do this, the
second command has no effect!
It seems almost certain this is data!
We asked the program to say "I am a fish"
We now see that converted to uppercase around area &9B00
We also see it's been converted phonetically around area &8710
We've learned some more useful things!
1. These areas contain data
2. The program has a 'preprosesssor routine that converts case'
3. The program stores the phonetically converted string in memory
before speaking it
Oh no! Our disassembly contains a call to an area we marked as
data!
We know this because we're calling to a # address (# means hex like
&)
If we'd specified this address as code there would be a label and
we'd see "call L8A32"
We will need to 'redisassemble' that part of the code, and paste the
disassembly of that part into our work in progress ASM file.
Here we've found a command that writes &0000 to address l883e
We thought that was ASM code, but maybe it's data?
Writing &0000 certainly says data, but It could be tricky
though... as our program may use Self Modifying code! (Spoiler: it
does not)
For now I've marked it as "Data?" as I'm not sure.
Looking just a few lines down I see LD B,&F4 with an OUT,
followed by LD B,&F6 and a OUT.
We may think this is just 'data mis-interpreted as code' but I know
it is not!... how do I know? because I recognize it as the command
sequence used to set an AY sound register on the CPC.... We NEED to
know what our code is likely to do if we are going to disassemble
it's source.
Disassembly
Once we have a disassembly we can reassemble and execute we need
to start going through the code.
We already know the INIT routine is at &98FE, so lets go label
that up!
Hmm... it writes the value 9 to L8843... no idea yet what that means
- we'll worry about it later!
We're pretty sure that address &9907 sets the speed of
playback. so lets label up that routine.
I know the CPC uses IX to point to the parameter passed on a Call,
and there's some 'checking and conversion' going on, but the final
value is stored to L1883 - so that seems to be the 'Speed setting'
Once we're sure (if we see it again later used for this purpose we
should do a search and replace and give it a proper name!
Once we know for certain what an address is we should use Search
and Replace to replace the L1234 type label with something
meaningful.
I've used the string "RawStringPos' for one label, as I can see the
words I typed in the memdump at this address, so I'm certain what it
does!
Ok, lets look at the speech routine.
We need to go through each line and try to work out what we think it
does.
We know we're processing text, so we need to lookup any written and
compared values in an ASCII table to see how they relate to the
character set.
Here we see a common trick!
We've subtracted 'A' from the value in the accumulator, doubled it,
and added it to an address (I've already labeled it as
'AlphabetPointers')... We then load two bytes from that address!
What's this do? We'll it's clear to me simply because I've written
the same kind of code many times before.
This is loading an address from a table of pointers - the first
entry in the table is the pointer for the letter 'A'.
We can confirm this!
Here is the labelled up table, and the memory!... we've seen it
before, it's the Phonetic table we saw before!
Each entry A-Z in the pointer table points to the first conversion
starting with that letter.
Here's an interesting one!
I've already worked out ConvAddr is used to point to an entry in the
Phonetic table used to perform a conversion.
This code uses CPIR twice!... This isn't a command I've ever really
used, but checking my documentation, it will scan from HL up to BC
bytes until it finds A
BC is set to &0000 - so it will scan up to 65536 bytes over both
CPIRs
XOR A is a quick way of setting A=0, so both CPIRs will search for
0.
So this routine will update HL, looking for two concecutive 0 bytes.
Looking at the phonetic table, each entry is separated by a 0, and
it now seems that maybe the first is a 'before conversion' string,
and the second is an 'after conversion' string.... So it seems this
routine skips Before-After pair in the conversion table.
That's
all we're looking at for disassembly!
What you want more?.... MORE??? Sadly there really isn't
much point, If you can't decipher the code by looking at it, then
however 'part disassembled' examples like this you're shown
won't help much.
You're best bet is to practice writing little examples yourself,
and looking at documented disassmblies... We'll look at the fully
finished disassembly in a later lesson.
Lesson
M18 - Z80 Optimization - speech synthesizer! (Part 2/4)
As the speech synthesizer had been distributed as a type in, I was
rather surprised by how un-optimized for size it was.
I decided to try to make it smaller! Lets take a look at some of the
savings we can make.
Speech_Unoptimized.asm
Speech_Final.asm
There are two
kinds of optimization... optimizing for Size, or for Speed.
Usually we can only have one, as we can make our code faster by
using unwrapped loops and more specialized code, or smaller by
using more subroutines and reusing code where we can.
It
may be worth doing light optimization as we go along, but heavy
optimization is probably best done at the end of our project.
Optimization is likely to make our code harder to read, and
there's always the risk of bugs creeping in.
Also it's not going to be possible to identify all the best
optimizations to make before we know everything the code will
finally contain.
Easy Wins!
Here are some simple optimizations that it's always good to look out for,
and can be made with little or no drawbacks! They don't really make the
code much harder to read, and they are 'self contained' so it won't matter
if our code changes later on!
Before
After
Always look out for JPs!... if the destination is nearby we can
change these to JRs and save a byte.
Here's two that didn't actually appear in todays source code, but
it's always good to look out for!
LD A,0 is a two byte command - XOR A also sets A=0, but does so in a
single byte.
CP 0 is a two byte command setting the Zero flag if A=0, but OR A
does the same thing in a single byte
LD A,0
CP 0
XOR A
OR A
Here is a common one we see in this code.
Here we have a CALL, followed by a RET
We can replace these with a JP (or JR) - This will save a byte of
code, and result in a faster program, as calls use the stack, but
jumps do not!
This code calls "PlayWavHL_Atimes" as a sub... then returns!
This wastes time and a byte, as "JP PlayWavHL_Atimes" would be
smaller and faster, but in fact "PlayWavHL_Atimes" is right next to
this code!
Lets just remove the call altogether, and let execution 'Flow into'
PlayWavHL_Atimes.
Here's a strange one!
It's common to see XOR A, SBC HL,BC ... This is because the Z80 has
a 16 bit subtract with carry, but no 16 bit subtract.
Here we see the same format, but with ADC... it's not clear why ADC
is being used instead of ADD, so we'll replace the command, and save
the 1 byte XOR.
There are several locations that call "PlayWavHL_Atimes_Alt" after
loading A with 5
Lets create a routine that sets A=5 before running
PlayWavHL_Atimes_Alt. We can the Call or jump to this routine, and
save all those extra two byte "LD A,5" commands!
Here we're loading A with 13 (&0D) then transferring it to
(HL)
But HL (and HL alone) can do this in one command, saving a byte.
Here we've done the same!... by using HL instead of BC, we can
load the space into (HL) without an extra command
When optimizing your code, Test often!... and
backup too!!!
You may think your changes have had no effect, but maybe you made a
typo, or maybe a value in an apparently unused register causes a
problem later in the code.
Somewhat Harder Wins!
Some changes are worth doing at the end, but may not make sense if we
change our code later on.
For Example: Suppose we have a block of code that's used twice, we could
split out that code into a Subroutine, but what if we replace one of those
peices of code. Now that sub makes no sense, as a subroutine that's only
called in one place does not save any code.
Before
After
The code had a subroutine to detect if HL was at the end of a
string (called AtEndOfString in my disassembly)
However this sub was only called once!... as there was no benefit to
it being a sub, we've merged it into the original code to save space
and speed.
Here we load HL with &0000, then D with 0.
We've checked our code, and E is unused... so Lets use DE for both
jobs!
Here we have two pieces of code that do almost the same thing!
the C register is spare in this code, so lets set that to 1 or 2, so
we can use the same block of code for both jobs.
There are two wave playing routines. "PlayWavHL_Atimes" and
"PlayWavHL_Atimes_Alt'
The Alt version changes the pitch, but 80% of the code is the same.
Lets break that code out into subroutines that are shared by both
versions.
This will also help us port the code to other systems!
Many parts of this code end with "ld
(HLSwitchToggle),a pop af ret"
This is a 5 byte sequence... A JR would only be 2 bytes, so lets
'share' this sequence between all these routines.
The Soundinit routine used to store the address and speed in
memory.
With a bit of stack use, we don't need these memory variables, and
can just store them in HL and A... saving variable ram and a few
bytes of code
After taking a look at the code, It's clear that the sample
playing routines always play 63 bytes.
Once all the WAV samples have been labeled up there is some 'Extra
Data' that is never used! It does nothing so lets get rid of it!
In fact, if we're really sneaky, we can organize the samples, and
remove the odd byte, as if the last byte or two of one sample, are
the same as the next, they can 'share' those bytes!
There are two WAV samples that are just a single byte repeated 63
times!
This an OUTRAGEOUS waste of 126 bytes - and cannot be tolerated!
BUG ALERT!
If the character in the source string is a Space or number a special
phonetic conversion is used.
It would be expected the code would load the address of the
conversion table, however it actually loads a pointer to the address
of the conversion table... which is never de-referenced.
There is also a 'limiter' on the conversion attempts (INC D) - a
maximum of 255 bytes, but this makes no sense, as the last
conversion is a 'catch all' which just keeps the original character.
I suspect this 'catch all' was to work around the bug. We've fixed
the bug, and removed the limiter.
Other optimizations not shown here!
1. There was a block of space in the middle of the code - this has
been removed.
2. The Uppercase version of the string, and phonetic version were
stored at two different fixed addressees. The phonetic one is now
stored immediately after the converted one, using ram more
efficiently.
3. There were 3 prep routines: Get String Length, Convert to
Uppercase,Get Last Char.... these have been merged into a single
routine.
Win hard 2 - win harder!
Before
After
It's time for the hardest of optimization wins!
The sample playing code decides which sample, or samples to play
based on the characters in the phonetic string.
Sadly in the un-optimized code this is a massive set of 'CP JP'
Combos, followed by a series of calls
We could easilly save a few bytes by changing a few JPs to JRs and
CALL+RETs to JPs... and maybe making a "PlayWavHL_6times" sub which
sets A to 6 before calling PlayWavHL_Atimes
But there's a more extreme and powerful solution!
Lets create a table of conditions and actions, and a 'Sequence
player'... kind of like a virtual machine, which will perform the
compare, and perform one of various actions depending on the matched
character....
The format for the comarison table is:
1. A comparison byte
2. An action byte (the delay value for A, and player to use, or
defining the address as a jump/sequence)
3. An address (A jump, Sequence or single sample)
The sequences also use a similar command byte. and an address to a
wave sample.
The code is long, but gives a significant saving in space!
The results:
Original example 8302 bytes...
Optimized version 6453 bytes... a 22% reduction
With no reduction in functionality!
As
a modern hobbyist it may seem easy to be critical of the
'inefficiency' of the original code, but it's important to
remember two things.
1. We have much better tools now, we can assemble and test in
seconds with modern emulators. On a real machine if a change
caused a crash, the time taken to assemble test reset and reload
could be ten minutes!
2. Projects are always built to requirements, If a commercial
project's goal was to be completed by monday so it could be sold
for $15 a copy in a magazine ad, the ONLY priority is it works
well by monday, not how big or 'well written' it is... If the code
works already, and spending more time on it will earn no more
money, then it doesn't make sense to do so!
Lesson
M19 - speech synthesizer! (Part 3/4)
Lets take a look at the source for the speech synth. This time we'll
look at the phonetic conversion routines.
Speech_Final.asm
Globals
We need some bytes of ram for various parameters,
We also need enough 'temp space' around 2x the length of the string
we want to say at 'TxtRawString'
One is the 'cleaned string' - uppercase with a few char tweaks.
The other is the 'phonetic string' which is used to play the sound
samples
When we want to speak, we use SayStringFromHL
We load A with the speed
We load HL with the zero terminated string we want to say.
The pre-converter
First we convert the speed in A into something we can use.
it's used as a 'delay' so needs to be in the correct range for the
sample playback code
We're going to copy the passed string into 'TxtRawString'
This first scan copies the string to our temp area, it also ensures
the string starts with as 'space' and ends with a chr 13
We record the 'LastChar' - we tweak the pitch of sentences ending ?
or .
We need to clean up the string.
We remove / - this has a special meaning to the phonetic converter.
we convert lower case to upper case.
Next we convert the string to phonetics that our wave player can
play
'SpeakPart' Plays all the samples to form the speech
The phonetic converter
We need to convert our string
from:
I AM A CPC, YOU ARE A WORM.YOU WILL BOW TO MY Z80
SUPERIORITY.
into phonetics like:
IY4 AEM AE SEE PEE SEE, YUW AA3R AE WER3M.YUW WIHL BOW TUX3 MIY ZAYD
AY3T IY SUX2PEE3RIHAORIYTEE.
We init some of our vars , and run TranslateToPhonetics
This converts the string
We then tag on a CHR 13 to the end of the string.
TranslateToPhonetics repeats until it comes across a CHR 13
We're going to try to find a conversion.
We have a lookup table (LUT) by letter to point to the entries in
the phonetic conversion table
Entries are in pairs, separated by zeros.
We load in a symbol from our string,
if it's a letter we convert it from A-Z into an offset 0-25. We
multiply it by two and load the address we want to start scanning
for a phonetic match.
We have two special sections of the table if we find a space (for
whole words) or a number
PhoneticMatchTest will see if the next entry is ok, it
returns C if the match is right NC if it's no good
We skip over a no good match by using CPIR twice, this skips 2 zero
bytes ahead.
We read in the next character in the conversion.
If we get a space we test the source string
if we get a ? we use char from the source string
we repeat through the characters of the string
Here are the handlers for '?' and ' '
If no match is found we point BC to the next entry in the lookup to
consider
If we've found a valid match, we update BC and HL to point to the
part of the string we want to phonetically transfer
ValidPhenTest will thest the character in the source string, and
see if it's a valid for phonetic conversion
C= no good
NC=ok
If we've found a good conversion, we transfer the converted string
? is a special case, it copies the character 'as is' from the
sourcestring
Vowels are processed specially - we need to track the last two for
the pitch change of a string ending ? or .
VowelSwitcher Checks if the letter is a vowel, if it is it keeps
track of the last two addresses of vowels, so when the speech is
playing back the pitch can be changed if required
Lesson
M20 - speech synthesizer: Sample playback! (Part 4/4)
Lets take a look at the source for the speech synth. This time we'll
look at the Sample selection and playback routines.
Speech_Final.asm
Our source data
We have already converted our text into a
'phonetic representation' in memory, as shown below:
IY4 AEM AE SEE PEE SEE, YUW AA3R AE WER3M.YUW WIHL BOW TUX3 MIY ZAYD
AY3T IY SUX2PEE3RIHAORIYTEE.
The speech needs to be made up of 'wave samples'
These are 4 bit per sample uncompressed PCM samples which make the
sounds which will build up the speech
Platform specific playback
Most of the code is multiplatform, however there are some small
platform specific parts to handle the specific hardware
We have an INIT routine to set up the hardware to be able to play
the samples.
On the SMS we set the tone pitch to LOW
We also have to transfer A to D - this is the loopcount for the
sample used by the player.
SoundPlaySample will play a single byte (2 samples)
The first sample is in the high nibble, the second sample is in the
low nibble.
These are effectively the volume we need to set to play the sample.
on the SMS we send these to port &74 after setting the top 4
bits to %1101 - this selects the 'volume setting'
Speech Routine
SpeakPart Will repeatedly 'speak' the text in the phonetic string,
until a newline is found
We're going to try to find a sample to play for the next letter in
the string, we use 'TableScanSimple' to do this.
If we can't find a sample we'll give up and play the generic 'silent
sample'
We have 3 possible sample routines.
PlayWavHL_Atimes_OneByte uses just one byte at address HL - it's
used for TTTT and silent sounds
PlayWavHL_Atimes is the 'normal' routine, it plays 63 bytes of
samples from HL
PlayWavHL_Atimes is a bit odd, it uses part of a sample with a
'random offset' it makes, it makes a kind of 'distorted' sound and
is used for a few sounds like 'SH'
The comparison table
Originally the code contained a huge case statement of CP and JR Z
statements. This wasted a lot of code memory, so has been replaced
with a loop and a matrix of characters and actions.
Each entry is 2 bytes and 1 word.
The First byte is the Letter Comparison, if this matches the other 3
bytes are used.
The 2nd byte is the Repeat count , or command
The 3rd/4th bytes are a 16 bit address
The table ends with a 0 letter comparison byte.
There are 2 types of command
0 performs a branch to the address in bytes 3/4
255 plays a 'sequence' at the address in bytes 3,4
Any other command byte is a loop count
1-15 uses the normal sample player
+16 uses the 'Alt sample playback.
The tablescan routine performs the comparison to B
if no match is found it will return to the caller - if a match IS
found it does NOT.
if a match is found we won't return, we also consume the return
address of the sub which called us (as it's job is done). We play
the sample if the delay/command is not 0 or 255
if it's 255 we play a Sequence at the address parameter
if it's 0 we do a jump to the address parameter
To make the TableScan easy to use we have some 'helper subs'
We also have 'ProcessDigitDelays' - When a number is found in the
string it alters the playback pitch by -4 to +5
ProcessDigitDelays also executes SubtractSwtichers
This checks if the current character is one of the last two vowels.
if the last character is a ? or . we change the pitch of these
vowels.
We detect if we're at one of these chars by subtacting the address
of the vowel from the current address, if the result is zero, we'lre
on the vowel... we do this twice.
We use the Tablescan routine to select samples based on the next
character.
The scan will return if nothing was found, meaning we can perform
'default actions'
Remember, this
lesson is based on Disassembled code, not something the author
wrote themselves.
It's questionable if PlayWavHL_Atimes is worth using, but it's
here simply because that's how the original program worked!
Optimizations and changes have been made, but only to reduce size,
not where it would reduce the function of the program.
Sequence player
Many times we need to play a
combination of sounds to make the sound of a phonetic.
To reduce the number of Calls and Loads, we define a sequence table.
The First byte is the 'Loop Count' (0=end of sequence)
1-15 uses the normal sample player
+16 uses the 'Alt sample playback.
+32 use the 'onebyte playback'
The 2nd and 3rd bytes are the address of the sound sample.
The PlaySequence routine works through the table, calling the
correct play routine for each sample