Adding a "transpose" before the tunnel is very cheap, because it only changes a flag and does not touch the array in memory, so it is not a big deal.
For the idea to be complete, you would also need the same functionality at the output tunnels.
The transpose might need to make a new copy for various reasons (e.g. branching), but it might just flag the array as "transposed" instead of shuffling all the elements around. I don't know the details. Autoindexing on columns might also be less efficient, because column elements are not adjacent in memory.
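
To illustrate the general idea in text form, here is a minimal Python sketch of a 2D array stored row-major in a flat buffer, with a "transposed" flag. This is not how LabVIEW implements it internally (I don't know that); the class `Array2D` and all of its methods are made up for the illustration. It only shows why flipping a flag is cheap, why a real copy has to touch every element, and why walking along a column means jumping through memory.

```python
class Array2D:
    """Hypothetical row-major 2D array with a 'transposed' flag."""

    def __init__(self, data, rows, cols, transposed=False):
        assert len(data) == rows * cols
        self._data = data              # flat buffer, shared between views
        self._rows = rows              # layout as stored in memory
        self._cols = cols
        self._transposed = transposed

    @property
    def shape(self):
        # Logical shape as seen by the user of this view.
        if self._transposed:
            return (self._cols, self._rows)
        return (self._rows, self._cols)

    def transpose(self):
        # Cheap: same buffer, only the flag is flipped.
        return Array2D(self._data, self._rows, self._cols,
                       not self._transposed)

    def offset(self, i, j):
        # Position of logical element (i, j) in the flat buffer.
        if self._transposed:
            i, j = j, i
        return i * self._cols + j

    def __getitem__(self, index):
        i, j = index
        return self._data[self.offset(i, j)]

    def row_offsets(self, i):
        # Buffer positions visited when reading logical row i.
        return [self.offset(i, j) for j in range(self.shape[1])]

    def copy(self):
        # Expensive: every element is moved into its new position.
        rows, cols = self.shape
        flat = [self[i, j] for i in range(rows) for j in range(cols)]
        return Array2D(flat, rows, cols)


a = Array2D(list(range(6)), 2, 3)   # [[0, 1, 2], [3, 4, 5]]
t = a.transpose()                   # no elements are moved
print(t.shape)                      # (3, 2)
print(a.row_offsets(0))             # [0, 1, 2]  adjacent in memory
print(t.row_offsets(0))             # [0, 3]     strided access
```

Reading a row of the flagged view `t` is the same as reading a column of `a`, so the accesses are no longer adjacent in the buffer. Calling `t.copy()` would restore adjacency, at the cost of rearranging all elements.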