Very interesting (and instructive). Our code is almost identical, except that I wasn't using the "parallelization" trick and was using a slightly different "clock" for timing. I like your clock -- where does it come from? I'm using code I found several years ago on an NI site that uses the CPU's system clock. Maybe that is showing its age?
Anyway, on my machine, with your code and your clock but with parallelism turned off (which sped things up on my dual-core machine), the Reshape version runs about twice as fast as the Concatenate version. Time to get out my watchmaker's loupe and figure out why my clock is doing so poorly.
Thanks to both of you for your comments.
BS