Thank you CoastalMaineBird.
I did what you said before.
But after testing many times, I realized that the chunk size is the only way to reduce my program's run time. That's why I insist on finding the best chunk size.
I mentioned this before, but perhaps I wasn't clear. Chunks must execute as, well, chunks. Using your example: the loop iterates 10k times. Iterations 0 through 5999 must complete as the first chunk before processing can begin on the second chunk (iterations 6000 through 8999), and so on. Iterations within a chunk can execute in any order and in parallel (debugging disabled only); chunks execute sequentially.
While processing a chunk of iterations, the OS schedules each iteration to any available core. With the P terminal unwired, all cores will be used. You can restrict the number of cores by wiring a value less than the number of cores to P.
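Since LabVIEW diagrams can't be pasted as text, here is a rough sketch of the scheduling behavior described above in Python. The function name `run_chunked` and its parameters are my own invention, not anything from LabVIEW; it only mimics the semantics: iterations within a chunk run in any order and in parallel, and each chunk finishes before the next one starts.

```python
from concurrent.futures import ThreadPoolExecutor

def run_chunked(work, n, chunk_size, max_workers=None):
    """Run work(i) for i in 0..n-1. Iterations inside a chunk may run
    in parallel and in any order, but each chunk completes before the
    next chunk begins (chunks are strictly sequential)."""
    results = [None] * n
    # max_workers=None uses all cores' worth of workers, like an
    # unwired P terminal; a smaller value restricts parallelism.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for start in range(0, n, chunk_size):
            chunk = range(start, min(start + chunk_size, n))
            # pool.map() only returns once every iteration in this
            # chunk has finished, which enforces the chunk barrier.
            for i, r in zip(chunk, pool.map(work, chunk)):
                results[i] = r
    return results

# Toy example: 10 iterations, chunks of 4, trivial work of multiplying by 5.
print(run_chunked(lambda i: i * 5, 10, 4))
```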
By the way, the run time is vital for me, and if a change to the code can reduce it by 10-20%, that would be very helpful. But I'm sorry that I cannot share the code; my company forbids it.
An improvement of 10-20% is insignificant in the grand scheme of things. You can get the same by buying a computer that is 6 months newer. I agree with the others and strongly suggest leaving things at the defaults instead of micromanaging every configurable aspect of the parallel FOR loop. In order to fully test this, your benchmarking code needs to be perfect. Is it? Can you show us your test harness code?
As a first step (and as a baseline) you should try not parallelizing at all. If the loop code is very simple, it might be faster without the parallelization overhead (i.e., the cost of splitting the problem and reassembling the results at the end). In your trivial example of simply multiplying by 5, I would leave out the loop entirely; maybe SIMD instructions can speed things up (not sure).
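To illustrate that baseline idea in a text language (Python here, purely as an analogy to the LabVIEW case): for trivial per-element work like multiplying by 5, a single array operation avoids both the loop and any parallelization overhead, and the library can use SIMD internally.

```python
import numpy as np

data = np.arange(10_000)

# Per-element loop: roughly what each iteration of the FOR loop does.
looped = [x * 5 for x in data.tolist()]

# One vectorized multiply: no explicit loop, no chunking, no
# parallelization overhead; NumPy may use SIMD under the hood.
vectorized = data * 5

print(vectorized[:5])
```

The point is not that NumPy applies to LabVIEW, but that for cheap loop bodies the win usually comes from removing overhead, not from tuning how the overhead is scheduled.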
Can you tell us a little bit more about the real loop code itself? How long does an iteration take? Do all iterations take the same amount of time? Does the loop code itself contain parts that can execute in parallel? How big are the data structures? How much else do the CPU cores need to do (complex FP updates, background processes, UI thread, etc.)?
Are your CPU cores hyperthreaded? What is the exact model of your CPU? Is the code supposed to be optimized for exactly one particular PC or should it run well on any possible current or future machine?
I think your fixation on the chunk size is ill-advised.
Or, if your algorithms require massive parallelism, you might forget about using a PC CPU entirely and look for solutions using FPGA or GPU computing. Hundreds or thousands of processors can make a huge difference...