What's the deal with windowed mode? Why is it so much slower than fullscreen? Can I make it faster?
In Mac OS X, all windows are double-buffered. The application (in this case, SDL's blitting engine) draws into the back buffer. The front buffer is the actual display video memory. When the application is finished drawing, it tells the window server to display the back buffer on the screen. The window server examines the back buffers of any other on-screen windows that can affect the appearance of the window you are drawing into, composites your back buffer with them to create the final image, and writes that image to visible video memory, where you see the result. This is how Quartz achieves all those fancy alpha effects, and gets tear-free window dragging and animation.
The downside is that even with no compositing to do, you have to process twice as many pixels as you normally would, and more still if there is a lot of compositing. In fullscreen mode, we bypass the window server and draw directly to video memory, which is why it is so much faster. We can't get away with that in windowed mode.
You may be able to make things faster by calling SDL_UpdateRects() instead of SDL_UpdateRect() or SDL_Flip(). With SDL_UpdateRects() you can tell SDL exactly which rectangles of the surface you painted, and they will all be handed to the window server at once. That way, the window server does the least amount of work possible. Try using the Quartz Debug tool to see which areas of the screen you are telling the window server to redraw. If large areas are being updated even though they aren't changing, this optimization will really help you.
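To make the idea concrete, here is a minimal sketch of a per-frame dirty-rectangle list. It is only an illustration, not part of SDL: DirtyRect is a hypothetical stand-in that mirrors the fields of SDL_Rect so the sketch compiles without the SDL headers, and DirtyList, MAX_DIRTY and the dirty_list_* functions are made-up names. Rectangles are clipped to the surface as a precaution, since SDL 1.2 is not guaranteed to clip them for you.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for SDL_Rect (same field order: x, y, w, h). */
typedef struct {
    short x, y;
    unsigned short w, h;
} DirtyRect;

#define MAX_DIRTY 64   /* arbitrary capacity chosen for this sketch */

typedef struct {
    DirtyRect rects[MAX_DIRTY];
    int count;
    int screen_w, screen_h;
} DirtyList;

/* Start a new frame with an empty list. */
void dirty_list_reset(DirtyList *list, int screen_w, int screen_h)
{
    assert(list != NULL);
    list->count = 0;
    list->screen_w = screen_w;
    list->screen_h = screen_h;
}

/* Record one painted rectangle, clipped to the surface bounds. */
void dirty_list_add(DirtyList *list, int x, int y, int w, int h)
{
    int x2 = x + w, y2 = y + h;
    assert(list != NULL);

    if (x < 0) x = 0;
    if (y < 0) y = 0;
    if (x2 > list->screen_w) x2 = list->screen_w;
    if (y2 > list->screen_h) y2 = list->screen_h;
    if (x2 <= x || y2 <= y) return;          /* entirely off-screen */

    if (list->count == MAX_DIRTY) {
        /* Too many rectangles: fall back to a single full-surface update. */
        x = 0; y = 0; x2 = list->screen_w; y2 = list->screen_h;
        list->count = 0;
    }
    list->rects[list->count].x = (short)x;
    list->rects[list->count].y = (short)y;
    list->rects[list->count].w = (unsigned short)(x2 - x);
    list->rects[list->count].h = (unsigned short)(y2 - y);
    list->count++;
}

/* Total pixels the list asks the window server to redraw. */
long dirty_list_pixels(const DirtyList *list)
{
    long total = 0;
    int i;
    assert(list != NULL);
    for (i = 0; i < list->count; i++)
        total += (long)list->rects[i].w * list->rects[i].h;
    return total;
}
```

In a real SDL 1.2 program you would store SDL_Rect values directly, call the add function after every blit to the screen surface, and finish the frame with a single SDL_UpdateRects(screen, list.count, list.rects) call. Comparing dirty_list_pixels() against the full window area gives a rough idea of how much work you are saving.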
If you must scroll/redraw the entire window every frame, try OpenGL. OpenGL will use graphics hardware in windowed mode, thus bypassing the window server's compositing engine. Just about every system that can run Mac OS X has hardware OpenGL support.
Mac OS X 10.2 (Jaguar) changes things a bit with Quartz Extreme. Quartz Extreme offloads compositing to the graphics hardware, so it can be much faster at some tasks. In addition, it uses bus-master DMA to transfer the back buffer as a texture to the OpenGL compositor. This second point is of interest, since the back buffer can be transferred while leaving the CPU free for other tasks. To get the best performance on 10.2, you have to take advantage of those idle CPU cycles.
Since the DMA transfer occurs in SDL_UpdateRects() (or SDL_Flip() or SDL_UpdateRect()), try to delay all drawing to the screen surface until just before SDL_UpdateRects() is called, so you can overlap the DMA transfer with other tasks. You can also use a double-buffered SDL surface (which translates to triple buffering). That way, you can overlap screen blitting operations with the DMA transfer. Remember that the optimizations described above for 10.1 still apply (though to a lesser extent). In 10.2, you are optimizing the amount of data being DMA'd, not the amount of compositing done in the window server.
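As a rough back-of-the-envelope aid, the sketch below estimates how many bytes one update rectangle represents. It assumes the transfer size is simply width times height times bytes per pixel, and ignores row padding and any pixel format conversion. transfer_bytes is a hypothetical helper, not an SDL function.

```c
#include <assert.h>

/* Approximate bytes moved for one w x h rectangle at the given depth.
   Assumption: depth is a whole number of bytes (8, 16, 24 or 32 bits). */
unsigned long transfer_bytes(unsigned w, unsigned h, unsigned bits_per_pixel)
{
    assert(bits_per_pixel % 8 == 0);
    return (unsigned long)w * h * (bits_per_pixel / 8);
}
```

Under these assumptions, a full 640x480 window at 32 bits comes to 1,228,800 bytes per frame, while a single 64x64 sprite rectangle comes to 16,384 bytes, which is why updating only what changed still matters on 10.2.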
Also, in Quartz Extreme the OpenGL compositor always uses 32-bit textures. So depending on the application, you may see better performance by using a 32-bit SDL surface, since you avoid pixel format conversion. Note that you might not see a speedup, since DMA overlap may be decreased if you're not careful. Remember that you can't draw into a surface while it is being DMA'd to video memory. If you try to, you'll just stall your application as it waits for the DMA transfer to complete, wasting valuable CPU cycles.