SSE Optimized Compositing

As part of a graphics API I've been working on (for my own use; it's hardly ready for production), I decided to try learning SSE optimization by making the compositing routine faster. I came up with an implementation which, according to my tests, is 3 to 5 times faster than the non-optimized version. The optimized version below composites a single color over a run of pixels of a specified length, starting at a given position in a destination pixel buffer. It uses source-over compositing on an RGBA, 8bpc, integer
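For reference, here is a minimal scalar sketch of the operation described above: one color composited source-over onto a run of RGBA, 8-bit-per-channel pixels. This is not the SSE routine from the post. The names `composite_run` and `mul_div255`, the R,G,B,A byte order, and the use of premultiplied alpha are all assumptions made for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Rounded (a * b) / 255 for a, b in 0..255, without a division. */
static uint8_t mul_div255(uint32_t a, uint32_t b)
{
    uint32_t t = a * b + 128;
    return (uint8_t)((t + (t >> 8)) >> 8);
}

/*
 * Composite one color over `count` pixels starting at `dst`.
 * Assumptions (not taken from the post): pixels are 4 bytes each in
 * R,G,B,A order, and both source and destination are premultiplied
 * by alpha, so each source channel must be <= the source alpha.
 *
 * Source-over with premultiplied alpha:
 *     result = src + dst * (1 - src_alpha)
 */
void composite_run(uint8_t *dst, const uint8_t src[4], size_t count)
{
    uint32_t inv_alpha = 255u - src[3];

    for (size_t i = 0; i < count; ++i) {
        for (int c = 0; c < 4; ++c) {
            dst[c] = (uint8_t)(src[c] + mul_div255(dst[c], inv_alpha));
        }
        dst += 4; /* advance to the next pixel */
    }
}
```

Because every pixel in the run goes through the same arithmetic with the same source color and inverse alpha, the inner loop is a natural candidate for processing several pixels at a time with SIMD instructions.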

Transparency Layer Slowdowns

As the documentation mentions, a transparency layer allows subsequent drawing to be rendered in a separate, fully transparent buffer before being composited to the destination context. In the absence of a clipping region, this buffer is the same size as the destination context, regardless of the actual drawing bounds. For example, creating a transparency layer for a small section of content and then drawing that layer in a window results in a window-sized buffer for the layer.
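To get a feel for the cost, here is a rough sketch of the arithmetic. The `PixelSize` type and both functions are hypothetical, the 4 bytes per pixel figure is an assumption (RGBA at 8 bits per channel), and sizing the layer to the clip bounds when a clip is present is inferred from the description above rather than confirmed.

```c
#include <stddef.h>

/* Hypothetical size type, in pixels. */
typedef struct {
    size_t width;
    size_t height;
} PixelSize;

/*
 * Size of the layer buffer: the clip bounds if a clipping region is
 * set (never larger than the context), otherwise the whole context.
 */
PixelSize layer_size(PixelSize context, const PixelSize *clip)
{
    if (clip == NULL) {
        return context;
    }
    PixelSize size = *clip;
    if (size.width > context.width)   size.width = context.width;
    if (size.height > context.height) size.height = context.height;
    return size;
}

/* Bytes needed for the buffer, assuming 4 bytes per pixel. */
size_t layer_buffer_bytes(PixelSize size)
{
    return size.width * size.height * 4u;
}
```

Under these assumptions, a 100 x 100 pixel piece of content drawn in an unclipped 1920 x 1080 window gets a layer of 8,294,400 bytes, while the same content with a matching clip would need only 40,000 bytes.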

