Text Compression

By Chris Dewhurst

Originally published in EUG #53

Given the memory constraints of the BBC, text compression can often come in handy when writing programs. If you were designing an adventure game, imagine how many more locations you could fit in if you could reduce the space taken up by the description of each. In fact, using the techniques discussed below, it's possible to achieve size reductions of 50 to 60 percent on reasonably large amounts of text.

What we do is to convert the text into four-bit blocks, or nibbles; each of which is a commonly used character, or a code giving information about the nibble that follows:

Nibble value      Meaning
------------      -------
0                 Next nibble is a less commonly used character.
1                 Next nibble is a token*.
2                 End of text.
3-15              The 13 most commonly used characters.
*See the following documentation.

When we speak of "most" or "less commonly used characters", we are referring to character frequencies. The order of frequency of alphabetic characters in the English language is:

      e t a o n r i s h d l f c m u g y p w b v k x j q z

Including the space, we would say that the 13 most commonly used characters are:

      e t a o n r i s h d l f [space]

and the remaining fourteen letters, plus a couple of punctuation marks, make up 16 less common characters:

      c m u g y p w b v k x i q z , .

Now, if we had a string like "hello you", before compaction it would look like this:

      byte   1  2  3  4  5  6  7  8  9  10
             h  e  l  l  o     y  o  u  .

Applying the rules above, it would be compressed to:

      byte   1  2  3  4  5  6  7
      value &B3 DD 6F 04 60 20 F2
             he ll o   y o  u  .

The six nibbles in bytes 1-3 index into a table of most commonly used letters, spelling "hello ". Remember that 0, 1 and 2 have special meanings, which is why the letter e has the index value of 3. The 0 in byte 4 indicates that the next nibble indexes into a table of less commonly used letters. And so on, until the 2 at the end flags the end of the text.

Now we'll put this into practice with program COMPRESS. After the machine code has been assembled, you'll be asked to type some text. At the moment, only lower-case characters, spaces, commas and full stops are allowed. If you try anything else you'll get an error message:

       Illegal character

The input string is held at 'intext%' and the resultant compressed string is placed at 'outtext%'. These locations are indirectly addressed via 'input' and 'output'.

The machine code routine 'compress' compares each character of the input string to those in 'commtab' and 'raretab', tables of more and less commonly-used characters. If a match is found, the index value in the Y register is written to the output string, preceded by a zero nibble if the matched character was a less commonly used one.

Note how 'mask' is used by 'wrtnib' to determine the position at which the nibble is to be written. If 'mask' is 0, the nibble is multiplied by 16 and stored in the output string at the current offset 'outptr'. If it's &FF, the nibble is ORed into the right hand nibble of the byte at offset 'outptr', and 'offset' advanced by one to point to the next byte in the output string.

If the character in the input string is a carriage return, a corresponding two nibble is stored in the output string to signify the end of the text.

'expand' is just the reverse of the compression process. The input string must be a previously compressed one. The nibbles are inspected in turn to see if a more or less common character should be printed on the screen.

Try typing in "she sells sea shells on the sea shore". The program works out the compression factor, and in this case it's about 48%! In the next article however we'll be looking at how tokens can achieve a factor of up to 60%!

Christopher Dewhurst, EUG #53