Optimizing Performance and Memory Usage of Assembly Code
Introduction
In my previous post, I covered how to analyze the performance and memory usage of assembly code. Here, I will look at how to optimize that code, using the 6502 Reference and some programming logic.
Then, I will modify the code to produce different results: random colors, different sectioning, and more.
Optimization
The goal is to optimize the old “coloring the entire screen yellow” implementation to improve the runtime and, hopefully, the memory usage. A known solution achieves the same result in roughly half the runtime while also using noticeably less memory.
Previous Code
lda #$00 ; set ptr in adrs $40, point to $0200
sta $40 ; ... low byte ($00) goes in adrs $40
lda #$02
sta $41 ; ... high byte ($02) goes in adrs $41
lda #$07 ; colour number
ldy #$00 ; set index to 0
loop: sta ($40),y ; set pixel colour at the adrs (ptr)+Y
iny ; increment index
bne loop ; continue until done page (256 px)
inc $41 ; increment the page
ldx $41 ; get the current page number
cpx #$06 ; compare with 6
bne loop ; continue until done all pages
Previous implementation: 11.325 ms (11,325 cycles at 1 MHz), 27 bytes
I noticed that the inner loop, which colors each pixel in the page, doesn’t look like it can be optimized much further. However, the functionality of the outer loop, which executes 4 times (once for each page of the bitmap screen), seems like it can be optimized in some way.
Attempt 1: Expand Outer Loop
The first approach I can see here is to “expand” (unroll) the outer loop and repeat the code 4 times, once for each page. This might improve the runtime a little, but it will definitely increase the memory usage of the program.
lda #$07 ; set color
ldy #$00 ; reset index
PAGE1:
sta $0200,y ; set pixel color
iny ; increment index
bne PAGE1 ; continue until done page (256 px)
ldy #$00
PAGE2: sta $0300,y
iny
bne PAGE2
ldy #$00
PAGE3: sta $0400,y
iny
bne PAGE3
ldy #$00
PAGE4: sta $0500,y
iny
bne PAGE4
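Now, let’s analyze the runtime and memory usage of this approach. Since the same chunk repeats 4 times (marked “x 4” in the table), there’s no need to analyze each copy separately. Note that bne costs 3 cycles each of the 255 times the branch is taken and 2 cycles the one time it falls through, so 255 x 3 + 2 = 767 cycles per loop.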
Instruction | Cycles | Count | Alt Cycles | Alt Count | Total Cycles | Memory (bytes)
---|---|---|---|---|---|---
lda #$07 | 2 | x 1 | | | 2 | 2
ldy #$00 | 2 | x 1 | | | 2 (x 4) | 2 (x 4)
PAGE1: sta $0200,y | 5 | x 256 | | | 1280 (x 4) | 3 (x 4)
iny | 2 | x 256 | | | 512 (x 4) | 1 (x 4)
bne PAGE1 | 2 | x 1 | 3 | x 255 | 767 (x 4) | 2 (x 4)
Total Cycles = 10,246 cycles
Execution Time: 10.246 ms (1 MHz Clock Speed)
Total Memory: 34 bytes
Well, this barely improved the runtime (10,246 vs. 11,325 cycles, about 10% faster), and it made the memory usage worse (34 vs. 27 bytes). So this isn’t the way to go.
But let’s look closer at the code. Each page loop runs the same number of times, colors the same location relative to its page, and increments Y by 1. What if we combine all the loops into one?
Attempt 2: Combine into Single Loop

lda #$07 ; set color
ldy #$00 ; reset index
loop:
sta $0200,y
sta $0300,y
sta $0400,y
sta $0500,y
iny
bne loop
Instruction | Cycles | Count | Alt Cycles | Alt Count | Total Cycles | Memory (bytes)
---|---|---|---|---|---|---
lda #$07 | 2 | x 1 | | | 2 | 2
ldy #$00 | 2 | x 1 | | | 2 | 2
loop: sta $0200,y | 5 | x 256 | | | 1280 | 3
sta $0300,y | 5 | x 256 | | | 1280 | 3
sta $0400,y | 5 | x 256 | | | 1280 | 3
sta $0500,y | 5 | x 256 | | | 1280 | 3
iny | 2 | x 256 | | | 512 | 1
bne loop | 2 | x 1 | 3 | x 255 | 767 | 2
Total Cycles = 6403 cycles
Execution Time: 6.403 ms (1 MHz Clock Speed)
Total Memory: 19 bytes
Okay! That looks a lot better. As you can see from the little gif, each iteration of the loop colors 1 pixel in each of the 4 pages. But most importantly, it does so in only 6,403 cycles, a significant improvement over the previous 11,325 cycles (about 43% fewer). Similarly, the 27 bytes used in the old implementation are reduced to only 19 bytes. This is because we used fewer instructions and completely eliminated the pointer.
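If speed matters more than size, the same idea could be pushed one step further by partially unrolling the loop. The sketch below is an illustrative variation, not a measured result: it assumes each page can be treated as two 128-byte halves, so Y only counts from $00 to $7f, and it swaps bne for bpl so the loop ends when Y reaches $80. By the same counting method as above, this works out to 2 + 2 + 128 x (8 x 5 + 2) + (127 x 3 + 2) = 5,763 cycles and 31 bytes, assuming the loop does not straddle a page boundary.

```asm
lda #$07        ; set color (yellow)
ldy #$00        ; index runs from $00 to $7f
loop:
sta $0200,y     ; page 1, first half
sta $0280,y     ; page 1, second half
sta $0300,y     ; page 2, first half
sta $0380,y     ; page 2, second half
sta $0400,y     ; page 3, first half
sta $0480,y     ; page 3, second half
sta $0500,y     ; page 4, first half
sta $0580,y     ; page 4, second half
iny             ; increment index
bpl loop        ; repeat while Y < $80 (negative flag still clear)
```

The trade-off is the reverse of the one above: about 10% fewer cycles than the single loop in exchange for 12 more bytes.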
Modifications
Now that we have such a short, maintainable and optimized version of our code, we can easily make modifications to it.
Modify Fill Color
The next task is simply to change the color that the pixels are set to, from yellow to light blue. Using this 6502 Emulator reference, we see that $e corresponds to light blue.

lda #$0e ; set color (light blue)
ldy #$00 ; reset index
loop:
sta $0200,y
sta $0300,y
sta $0400,y
sta $0500,y
iny
bne loop
Differently-Colored Pages
The goal is to color each page in the display in a different color. Given we already have an optimized loop that colors each page, the only component left is changing the color from page to page.
The selected approach below keeps the same loop structure and holds the current color in the accumulator. In each iteration, the color is incremented before each subsequent page is written, and then reset to the starting color to prepare for the next iteration.

define ST_CLR $02 ; define macro for starting color
ldy #$00 ; set y index
lda #ST_CLR ; load starting color
loop:
clc ; clear carry for addition
sta $0200,y ; color pixel on 1st page
adc #$01 ; increment color
sta $0300,y ; color pixel on 2nd page
adc #$01 ; increment color
sta $0400,y ; color 3rd page
adc #$01 ; increment color
sta $0500,y ; color 4th page
lda #ST_CLR ; Reset color to starting value
iny ; increment y index
bne loop
Instruction | Cycles | Count | Alt Cycles | Alt Count | Total Cycles | Memory (bytes)
---|---|---|---|---|---|---
define ST_CLR $02 | | | | | |
ldy #$00 | 2 | x 1 | | | 2 | 2
lda #ST_CLR | 2 | x 1 | | | 2 | 2
loop: clc | 2 | x 256 | | | 512 | 1
sta $0200,y | 5 | x 256 | | | 1280 | 3
adc #$01 | 2 | x 256 | | | 512 | 2
sta $0300,y | 5 | x 256 | | | 1280 | 3
adc #$01 | 2 | x 256 | | | 512 | 2
sta $0400,y | 5 | x 256 | | | 1280 | 3
adc #$01 | 2 | x 256 | | | 512 | 2
sta $0500,y | 5 | x 256 | | | 1280 | 3
lda #ST_CLR | 2 | x 256 | | | 512 | 2
iny | 2 | x 256 | | | 512 | 1
bne loop | 2 | x 1 | 3 | x 255 | 767 | 2
Total Cycles = 8963 cycles
Execution Time: 8.963 ms (1 MHz Clock Speed)
Total Memory: 28 bytes
This seems to be a correct and reasonably compact solution for the required modification. The runtime rose from 6,403 to 8,963 cycles (about 40%) and the memory usage grew from 19 to 28 bytes, but other implementations, like coloring each page in a separate loop, would likely have had an even greater impact on memory.
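One further saving may be possible here. None of sta, lda, iny, or bne changes the carry flag, and adc #$01 only sets it when the accumulator wraps past $ff, which cannot happen while the starting color is small. Under that assumption, clc could be moved out of the loop, as in this sketch:

```asm
define ST_CLR $02 ; starting color (assumes ST_CLR + 3 never exceeds $ff)
ldy #$00          ; set y index
lda #ST_CLR       ; load starting color
clc               ; clear carry once, before the loop
loop:
sta $0200,y       ; color pixel on 1st page
adc #$01          ; increment color (leaves carry clear)
sta $0300,y       ; color pixel on 2nd page
adc #$01
sta $0400,y       ; color pixel on 3rd page
adc #$01
sta $0500,y       ; color pixel on 4th page
lda #ST_CLR       ; reset color to starting value
iny               ; increment y index
bne loop
```

By the same counting, clc would then cost 2 cycles instead of 512, giving 8,963 - 512 + 2 = 8,453 cycles, with the memory usage unchanged at 28 bytes.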
Random Pixel Color
A random generator in assembly? Well, our 6502 emulator actually provides a pseudo-random number generator at $fe. Reading this memory location returns a new random byte each time.
The code below reads a fresh random byte before each store and then writes it to the pixel. The result is a completely random image. Technically, we could read just a single random value each iteration, but the same byte would then land at the same offset in every page, producing a repeating rather than fully random image.

ldy #$00 ; set y index
loop:
lda $fe ; load random byte
sta $0200,y ; color pixel on page
lda $fe
sta $0300,y
lda $fe
sta $0400,y
lda $fe
sta $0500,y
iny ; increment Y
bne loop
Instruction | Cycles | Count | Alt Cycles | Alt Count | Total Cycles | Memory (bytes)
---|---|---|---|---|---|---
ldy #$00 | 2 | x 1 | | | 2 | 2
loop: lda $fe | 3 | x 256 | | | 768 | 2
sta $0200,y | 5 | x 256 | | | 1280 | 3
lda $fe | 3 | x 256 | | | 768 | 2
sta $0300,y | 5 | x 256 | | | 1280 | 3
lda $fe | 3 | x 256 | | | 768 | 2
sta $0400,y | 5 | x 256 | | | 1280 | 3
lda $fe | 3 | x 256 | | | 768 | 2
sta $0500,y | 5 | x 256 | | | 1280 | 3
iny | 2 | x 256 | | | 512 | 1
bne loop | 2 | x 1 | 3 | x 255 | 767 | 2
Total Cycles = 9473 cycles
Execution Time: 9.473 ms (1 MHz Clock Speed)
Total Memory: 25 bytes
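For comparison, the single-read variant mentioned above would look roughly like the sketch below. Because the same byte is written at the same offset in all four pages, the four quarters of the display should come out identical, which is the repetition that the four separate reads avoid.

```asm
ldy #$00        ; set y index
loop:
lda $fe         ; load one random byte per iteration
sta $0200,y     ; write the same byte to all four pages
sta $0300,y
sta $0400,y
sta $0500,y
iny             ; increment Y
bne loop
```

By the same counting, this comes to 2 + 768 + 4 x 1280 + 512 + 767 = 7,169 cycles and 19 bytes, so the fully random image costs about 2,300 extra cycles and 6 extra bytes.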
Conclusion
Now that I’ve optimized my code, and played around with some simple modifications, it’s time to move on to experimenting!