(It is probably a misconception that shift registers will slow you down. They are very fast. What is slow is constantly resizing arrays.)
For your problem, first pad the input array so that it contains a multiple of 32 elements, then use "reshape array" to turn it into an X-by-32 array.
Feed this 2-D array into an autoindexing "for loop", so that each iteration gets one 32-element chunk, and "split array" that chunk at index 16. Autoindex each of the two outputs at the loop exit, which gives two X-by-16 arrays, then reshape each back to a 1-D array.
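Since a block diagram cannot be pasted as text, here is a rough Python analogue of the same data flow, only as a sketch. The function name `split_halves`, its parameters, and the choice to pad with zeros are my assumptions and not part of the LabVIEW solution; pick whatever pad value suits your data.

```python
def split_halves(data, chunk=32, split_at=16, pad_value=0):
    """Hypothetical helper mirroring the pad / reshape / split / reshape steps."""
    # Pad so the length is a multiple of `chunk` (pad value is an assumption).
    padding = (chunk - len(data) % chunk) % chunk
    padded = list(data) + [pad_value] * padding

    # "Reshape array": view the padded data as X rows of `chunk` elements.
    rows = [padded[i:i + chunk] for i in range(0, len(padded), chunk)]

    # Autoindexing loop: one row per iteration, "split array" at `split_at`,
    # and collect both halves at the loop exit.
    first_halves = []
    second_halves = []
    for row in rows:
        first_halves.append(row[:split_at])
        second_halves.append(row[split_at:])

    # Reshape each collected 2-D result back to a 1-D array.
    first = [x for part in first_halves for x in part]
    second = [x for part in second_halves for x in part]
    return first, second


if __name__ == "__main__":
    first, second = split_halves(list(range(40)))
    print(len(first), len(second))
```

Note that the Python lists here grow inside the loop, which is exactly what the autoindexing output tunnels avoid in LabVIEW: the loop count is known up front, so the output arrays can be allocated once.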