04-02-2007 03:51 PM
126.96 sounds about right, so yes, subtract this from the samples before doing windowing or FFT. Be sure the raw sample values are expressed as floating-point values (not as unsigned integers) going into the subtraction. So,
A = raw 8-bit unsigned integer values from hijacked.wav
B = values from hijacked.wav converted to 8-byte floating-point DBLs
C = result after subtracting the average DC offset level, 126.96, from B
Now, perform windowing and FFT on result C. Evaluate as before to create something like pp_array.vi containing the 97 values. Save as defaults and repost. (Thanks).
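In numpy-ish pseudocode (the real work is in LabVIEW, and the Hann window and use of scipy here are just my assumptions for illustration), the A/B/C steps and the per-chunk processing would look roughly like this:

import numpy as np
from scipy.io import wavfile

rate, A = wavfile.read("hijacked.wav")   # A: raw 8-bit unsigned integers (uint8)
B = A.astype(np.float64)                 # B: same values as 8-byte floats (DBL)
C = B - 126.96                           # C: average DC offset removed

chunk = C[0:512]                         # one 512-sample chunk
windowed = chunk * np.hanning(512)       # window first (Hann assumed here)...
spectrum = np.fft.rfft(windowed)         # ...then the FFT
magnitudes = np.abs(spectrum)            # magnitude spectrum for the pitch evaluation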
Now then, most of the voiced chunks will want to be sub-divided into lengths that don't divide evenly into 512. Suppose one comes up with a desired length of 29.1, for example. What exactly should be done in such cases? Either you can perform this step for me, describe it clearly, or I can take my best guess. FYI, here's my best guess:
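For concreteness, here's the flavor of thing I mean, purely as a sketch (the integer_subchunk_sizes helper below is made up, and this may not be the right approach at all): round each sub-chunk to an integer length and carry the fractional remainder forward so the running total stays on track.

def integer_subchunk_sizes(desired, total):
    # e.g. desired = 29.1, total = 512: round each sub-chunk, carrying the
    # fractional error forward so the running total tracks desired * count
    sizes, target = [], 0.0
    while sum(sizes) + desired <= total + 0.5:
        target += desired
        sizes.append(int(round(target)) - sum(sizes))
    return sizes

print(integer_subchunk_sizes(29.1, 512))   # mostly 29's with an occasional 30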
It may be a few days from whenever you re-post the array of sub-chunk lengths until I get a chance to noodle around with this stuff. With limited time, I won't be taking great pains to optimize the solution for memory or speed, so you can expect some opportunities there.
It's a neat app, and hopefully I'll remember to go back and learn more about your pitch-detection algorithm from your early postings and the links. Or maybe you can post the code that does the Windowing/FFT/characterization algorithm? Fairly long ago I tried very naively and very unsuccessfully to invent my own method for doing things like pitch-shifting and time-shifting. It'd be neat to finally bring it all full circle.
-Kevin P.
04-04-2007 04:06 PM
B) I'm reluctant to get too opinionated about what's the "best" way to handle things. I know LabVIEW pretty well, but I don't know speech-processing theory. Still, for what it's worth, I'll give you my thoughts as a "man on the street."
I agree with you that zero-padding just *feels* wrong. Your idea to save the extra samples and treat them as part of the next chunk seems intuitively sensible to me, though with a couple of possible subtle issues.
1. Passing 30 or so voiced samples into the next unvoiced chunk unchanged. All the other voiced samples from that chunk were pitch-shifted, but these aren't. What will be the audible effect? [Offhand guess -- probably not significant. The original breakdown into chunks with a constant size of 512 samples was somewhat arbitrary and chosen for FFT calculation efficiency. There's no special reason that time-domain processing must proceed in such constant-sized chunks.]
2. Passing 30 or so voiced samples into the next voiced chunk, where the dominant frequency of each is different. Offhand guess on the effect: very similar to the previous case.
I think the main downside of what I proposed is that it may slightly restrict the available range of compression / expansion. Probably only a few percent in theory, and that may come out in the wash due to the inherent integer nature and implied rounding involved in some of these steps.
One last key question for clarification. I referred back to "Pitch and Time scale.png" that you posted early in this thread. That appears to show that the original division into sub-chunks should account for 50% overlap. Based on the formula I proposed recently, if the sub-chunks were supposed to have size = 60, I would define 16 regions which overlap one another by 50%, i.e., 30 samples. Your posting just now referred to 8 sub-chunks of size 60 that span 480 samples, thus implying *no overlap*.
Isn't it necessary to start from 50% overlap? Then sub-chunk ranges can be either deleted or duplicated, causing as little as 0% or as much as 75% overlap?
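Just to show the arithmetic behind those two numbers (plain Python, nothing LabVIEW-specific; the floor((N - size)/hop) + 1 counting is the relationship I'm assuming):

# 50%-overlapped sub-chunks of size 60 within one 512-sample chunk
size, hop, N = 60, 30, 512
n_overlapped = (N - size) // hop + 1            # 16 regions
last_end = (n_overlapped - 1) * hop + size      # last region ends at sample 510
print(n_overlapped, last_end)                   # 16 510

# versus non-overlapping sub-chunks of size 60
n_plain = N // size                             # 8 regions
print(n_plain, n_plain * size)                  # 8 480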
-Kevin P.
P.S. It doesn't really matter much when you subtract the DC offset. I'd probably just do it once on the entire array of samples before breaking anything down, but the end result will be the same either way.
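A quick numpy-style check of that claim (whole-array vs. per-chunk subtraction; the random data here is just a stand-in for the real samples):

import numpy as np
x = np.random.randint(0, 256, size=2048).astype(np.float64)   # fake 8-bit samples
whole_first = (x - 126.96).reshape(-1, 512)    # subtract the offset, then split into chunks
chunk_first = x.reshape(-1, 512) - 126.96      # split into chunks, then subtract
print(np.allclose(whole_first, chunk_first))   # True -- order doesn't matter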