shIsValid* functions do a search through an array!
This is required because user handles are actually pointers and have to be looked up in the array of valid paths. A solution would be to use indices into an array of pointers for the internal handle-to-pointer conversion. When a path is deleted, its slot in the array is left empty and is reused when the next path is created.
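
The index-based scheme described above could look roughly like the sketch below. All names (HandleTable, handleCreate, handleResolve, handleDestroy, MAX_SLOTS) are hypothetical and not taken from the existing code, and the fixed capacity is an assumption made only to keep the example short. Handles are stored as index + 1 so that 0 can remain the invalid handle.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_SLOTS 64               /* hypothetical fixed capacity */

typedef struct {
    void  *ptr[MAX_SLOTS];         /* NULL marks an empty slot */
    size_t freeList[MAX_SLOTS];    /* indices of slots emptied by a delete */
    size_t freeCount;
    size_t used;                   /* slots handed out at least once */
} HandleTable;

/* Returns a handle (slot index + 1), or 0 when the table is full. */
static unsigned handleCreate(HandleTable *t, void *object)
{
    size_t slot;
    assert(object != NULL);
    if (t->freeCount > 0)
        slot = t->freeList[--t->freeCount];   /* reuse a deleted slot */
    else if (t->used < MAX_SLOTS)
        slot = t->used++;
    else
        return 0;
    t->ptr[slot] = object;
    return (unsigned)(slot + 1);
}

/* Constant-time validity check and lookup; NULL for invalid handles. */
static void *handleResolve(const HandleTable *t, unsigned h)
{
    if (h == 0 || h > t->used)
        return NULL;
    return t->ptr[h - 1];
}

/* Empties the slot and remembers it for the next handleCreate. */
static void handleDestroy(HandleTable *t, unsigned h)
{
    if (handleResolve(t, h) == NULL)
        return;
    t->ptr[h - 1] = NULL;
    t->freeList[t->freeCount++] = h - 1;
}
```

One caveat of this sketch: once a slot is reused, a stale handle to the deleted object resolves to the new object. If that matters, a generation counter per slot would be one way to detect it.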

How to speed up image upload / manipulation
=============================================

1) Make shCopyPixels use memcpy

First, manipulation can be sped up by modifying shCopyPixels
to copy lines using memcpy directly when the source and target
formats are equal. If the stride is the same too and whole rows
are copied, then we can memcpy the whole block of memory.
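
A minimal sketch of that fast path, assuming both images already share the same pixel format. The function name copyPixelsSameFormat and its parameters are hypothetical; the real shCopyPixels signature is not shown here. Strides are taken in bytes and assumed non-negative. The single block copy is only used when rows are tightly packed on both sides, since otherwise it would also overwrite whatever lies between the rows.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copies `height` rows of `rowBytes` bytes each between two buffers
   that hold pixels in the same format. */
static void copyPixelsSameFormat(unsigned char *dst, size_t dstStride,
                                 const unsigned char *src, size_t srcStride,
                                 size_t rowBytes, size_t height)
{
    size_t y;
    assert(rowBytes <= dstStride && rowBytes <= srcStride);

    if (dstStride == srcStride && rowBytes == srcStride) {
        /* Rows are contiguous on both sides: one block copy. */
        memcpy(dst, src, rowBytes * height);
        return;
    }

    /* Otherwise copy line by line, skipping the padding. */
    for (y = 0; y < height; ++y)
        memcpy(dst + y * dstStride, src + y * srcStride, rowBytes);
}
```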

2) What about mapping image manipulation directly to OpenGL
texture manipulation calls? Which formats could support this?

PROBLEM: if NOPS textures are not supported, then writing
the image data and reading it back results in a precision loss!
Even if PBOs are available, we'd need to gluScaleImage into them.

--> this means: without NOPS we need an intermediate buffer anyway

=== Solution 1: PBOs are available ===

Extension required: EXT_pixel_buffer_object (ARB_pixel_buffer_object ?)

Complexity of implementation: really easy - PBO simply
replaces the buffer that would be used if NOPS were not there

We cannot just glBindBuffer(GL_PIXEL_PACK_BUFFER) and then
glReadPixels straight into the user's row layout, because
glPixelStore doesn't allow for an arbitrary row byte size (the
row length is given in pixels, so the "stride" must be a
multiple of the pixel byte size).
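
The stride restriction can be made concrete with a small check. This is only a sketch: rowLengthForStride is a hypothetical helper, and it assumes the pack/unpack alignment has been set to 1 so that the row stride in bytes equals row length times pixel size. It returns the value that could be passed as GL_PACK_ROW_LENGTH (or GL_UNPACK_ROW_LENGTH), or -1 when the user's stride cannot be expressed and an intermediate copy is needed.

```c
#include <assert.h>
#include <stddef.h>

/* Row length in pixels for a given byte stride, or -1 if the stride
   is not a positive multiple of the pixel size. A row length of 0 is
   not returned because OpenGL treats 0 as "use the image width". */
static long rowLengthForStride(size_t strideBytes, size_t bytesPerPixel)
{
    assert(bytesPerPixel > 0);
    if (strideBytes == 0 || strideBytes % bytesPerPixel != 0)
        return -1;
    return (long)(strideBytes / bytesPerPixel);
}
```

For example, a 4-byte-per-pixel image with a stride of 400 bytes maps to a row length of 100, while a stride of 402 bytes cannot be described to OpenGL at all.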

We can safely glMapBuffer and copy from it whatever we want
however we want, and do any kind of conversion in between.
Is glMapBuffer + memcpy into user memory faster than just
glGetTexImage? Probably yes, since glGetTexImage most likely
downloads the data from the GPU first anyway.

glMapBuffer is better anyway, because we can directly do the format
conversions unsupported by OpenGL (premultiplied to unpremultiplied,
grayscale conversion with different per-component coefficients
instead of simple averaging etc.). We use the exact same code
as when NOPS is not supported.
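
As an illustration of the conversions mentioned above, here is a sketch that works on any CPU-visible pixel pointer, such as the one returned by glMapBuffer. The function names are hypothetical, the pixel layout is assumed to be 8-bit RGBA with alpha last, and the grayscale weights are the Rec. 709 luminance coefficients, which is an assumption about which coefficients are wanted.

```c
#include <assert.h>
#include <stddef.h>

/* In-place conversion of `count` RGBA8888 pixels from premultiplied
   to non-premultiplied colour. Pixels with zero alpha are left as
   they are, since their colour is undefined. */
static void unpremultiplyRGBA8(unsigned char *px, size_t count)
{
    size_t i;
    int c;
    assert(px != NULL || count == 0);
    for (i = 0; i < count; ++i, px += 4) {
        unsigned a = px[3];
        if (a == 0)
            continue;
        for (c = 0; c < 3; ++c) {
            unsigned v = (px[c] * 255u + a / 2u) / a;   /* rounded divide */
            px[c] = (unsigned char)(v > 255u ? 255u : v);
        }
    }
}

/* Weighted grayscale instead of simple averaging. The weights are
   0.2126, 0.7152 and 0.0722 scaled by 10000. */
static unsigned char lumaRGB8(unsigned char r, unsigned char g,
                              unsigned char b)
{
    unsigned long sum = 2126ul * r + 7152ul * g + 722ul * b;
    return (unsigned char)((sum + 5000ul) / 10000ul);
}
```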


=== Solution 2: no PBOs ===

- vgImageSubData     => glTexSubImage2D
- vgGetImageSubData  => glGetTexImage
- vgCopyImage        => glGetTexImage, glTexSubImage2D
- vgSetPixels        => glGetTexImage, glDrawPixels

  (PROBLEM: for glGetTexImage, the row stride set through
   glPixelStore must be a multiple of the pixel byte size!)

- when copying pixels to/from the texture, we still need to
  manually clip the working pixel region to the intersection
  of the source and destination rectangles, since the OpenGL spec
  says a GL_INVALID_VALUE error is generated for invalid regions
  (e.g. dstX + copyW > dstW)
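
The manual clipping could be sketched as follows. CopyRegion and clipCopyRegion are hypothetical names, coordinates are in pixels, and integer overflow of the sums is ignored for brevity. The region is clipped first against the source image and then against the destination, shifting the other origin by the same amount whenever one edge is cut.

```c
#include <assert.h>

typedef struct {
    int sx, sy;   /* origin in the source image */
    int dx, dy;   /* origin in the destination image */
    int w, h;     /* size of the region to copy */
} CopyRegion;

/* Clips `r` in place; returns 0 when nothing is left to copy. */
static int clipCopyRegion(CopyRegion *r, int srcW, int srcH,
                          int dstW, int dstH)
{
    assert(srcW >= 0 && srcH >= 0 && dstW >= 0 && dstH >= 0);

    /* Clip against the source rectangle. */
    if (r->sx < 0) { r->dx -= r->sx; r->w += r->sx; r->sx = 0; }
    if (r->sy < 0) { r->dy -= r->sy; r->h += r->sy; r->sy = 0; }
    if (r->sx + r->w > srcW) r->w = srcW - r->sx;
    if (r->sy + r->h > srcH) r->h = srcH - r->sy;

    /* Clip against the destination rectangle. */
    if (r->dx < 0) { r->sx -= r->dx; r->w += r->dx; r->dx = 0; }
    if (r->dy < 0) { r->sy -= r->dy; r->h += r->dy; r->dy = 0; }
    if (r->dx + r->w > dstW) r->w = dstW - r->dx;
    if (r->dy + r->h > dstH) r->h = dstH - r->dy;

    return r->w > 0 && r->h > 0;
}
```

Only when this returns non-zero would the clipped values be handed to calls such as glTexSubImage2D.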


How to solve the great slow-down when scaled up?
=============================================

Reasons:
- the CPU is subdividing a very long path
- fill rate becomes a bottleneck

1. By writing gradient shaders, there would be no need to
draw into the stencil first and then fill the whole area where the
stencil is odd - at least not when drawing a stroke (optimizes half
of the pipeline)

2. Real tessellation would reduce fill rate for filled paths, but does
the CPU bottleneck outweigh the gain?

3. Early path discarding (transformed bounds outside surface? maybe
early convex-hull rule removal?)
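
The bounds test from point 3 could be sketched like this. The Affine type and boundsOutsideSurface are hypothetical names, the transform is assumed to be affine (no perspective), and the surface is assumed to span (0,0) to (surfW, surfH). The four corners of the path's bounding box are transformed and their axis-aligned extent is compared against the surface.

```c
#include <assert.h>

/* x' = a*x + c*y + tx,  y' = b*x + d*y + ty */
typedef struct { float a, b, c, d, tx, ty; } Affine;

/* Returns 1 when the transformed bounding box lies completely
   outside the surface, so the path can be skipped. */
static int boundsOutsideSurface(const Affine *m,
                                float minX, float minY,
                                float maxX, float maxY,
                                float surfW, float surfH)
{
    float xs[4], ys[4];
    float loX, hiX, loY, hiY;
    int i;
    assert(minX <= maxX && minY <= maxY);

    xs[0] = minX; ys[0] = minY;
    xs[1] = maxX; ys[1] = minY;
    xs[2] = maxX; ys[2] = maxY;
    xs[3] = minX; ys[3] = maxY;

    /* Start the extent from the first transformed corner. */
    loX = hiX = m->a * xs[0] + m->c * ys[0] + m->tx;
    loY = hiY = m->b * xs[0] + m->d * ys[0] + m->ty;

    for (i = 1; i < 4; ++i) {
        float x = m->a * xs[i] + m->c * ys[i] + m->tx;
        float y = m->b * xs[i] + m->d * ys[i] + m->ty;
        if (x < loX) loX = x;
        if (x > hiX) hiX = x;
        if (y < loY) loY = y;
        if (y > hiY) hiY = y;
    }

    return hiX < 0.0f || loX > surfW || hiY < 0.0f || loY > surfH;
}
```

For stroked paths the bounding box would first have to be grown by the stroke width before the test, otherwise visible strokes near the edge could be discarded.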