Update: This article is due for a re-write; I have split off the basic audio discussion to a new article, which is currently in the process of being written. Once written, this article will be modified to focus on the conversion process, and explain the tradeoffs in quality vs. performance. In the meantime, I did update the visual aids a bit, now that I know how to use Sox, Gnuplot, and Inkscape to good effect. 😉
I don’t intend to go through the math, but knowing how sound and audio conversion work can help you understand some of the settings and programs we’ll be using, so I’ll cover the basics in this post. If you want deeper information, such as the mathematical formulas behind anything here, feel free to research it; for my purposes, the basics are all we really need to get stuff done.
Sound and Measurement
Sound is a form of sequential compression, meaning that air alternately compresses and expands in response to the air around it. A measurement of sound compression looks like a wave, such as the following sine wave:
Now, the top of the waveform identifies when the air is the most compressed, while the bottom identifies when the air is the most expanded. As sound is produced, the air keeps swinging back and forth until it settles into a non-compressed, non-expanded state, which is the center line of the waveform. Keep in mind that sound is actually three-dimensional, meaning that it compresses and expands in spheres radiating from the sound source; the wave above is just what a measuring device records at the receiver.
There are two major measurements when dealing with sound: the amplitude and the frequency of a wave. The wave above has a specific frequency, because the crests and dips happen on a regular basis. The following waveforms show a higher and a lower frequency than the wave above:
In this example, the left image shows a higher frequency, because the sound wave dips and peaks more times over the same span, while the right image shows a lower frequency, because fewer dips and peaks happen over that span. Frequency is usually perceived as pitch; higher frequencies mean higher pitches.
Frequency is usually measured in hertz (Hz), which identifies how many cycles occur in one second; the higher the number, the more compressions happen in a single second. The human ear can perceive frequencies between 20 Hz, or 20 cycles per second, and 20 kHz, or 20,000 cycles per second.
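To make that concrete, here's a quick sketch in Python with NumPy (my own illustration; the article's plots were made with SoX and Gnuplot, and the 440 Hz value is an arbitrary choice) that generates one second of a sine wave and counts its cycles:

```python
import numpy as np

# One second of a 440 Hz sine wave, evaluated at 48,000 points per second.
# 440 Hz is an arbitrary choice; anything between 20 Hz and 20 kHz falls
# in the audible range described above.
sample_rate = 48_000
frequency = 440.0                                  # cycles per second (Hz)
t = np.arange(sample_rate) / sample_rate           # time points covering 1 second
wave = np.sin(2 * np.pi * frequency * t)           # air pressure at each point

# A sine wave crosses zero twice per cycle, so counting sign changes and
# halving them recovers the frequency.
crossings = np.count_nonzero(np.diff(np.sign(wave)))
print(crossings // 2)                              # ~440 cycles in one second
```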
Amplitude, or degree of compression, determines how strongly the air is compressed or expanded. The stronger the compression or expansion, the faster the air snaps back to normal, which increases the strength of the sound. This comes across as loudness, or “volume” as it’s commonly called.
Amplitude is measured in decibels (dB), describing the pressure of the compressed air at its maximum level. The human ear can safely listen at up to somewhere around 70 dB over a long period of time, but can withstand up to 110 dB for a very short period; any more than that can mean the vibrations in the air are too strong for the eardrum to withstand, causing a rupture or, at the very least, damage that will result in scarring.
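Because the decibel is a logarithmic ratio rather than a linear unit, a couple of lines of Python can make the scale easier to feel. This is my own sketch, not something from the article, and it assumes the conventional 20 µPa reference pressure for sound in air:

```python
import math

P_REF = 20e-6   # reference pressure: 20 micropascals, roughly the threshold of hearing

def spl_db(pressure_pa: float) -> float:
    """Sound pressure level in dB, relative to the 20 µPa reference."""
    return 20 * math.log10(pressure_pa / P_REF)

print(round(spl_db(0.02), 1))   # 0.02 Pa -> 60.0 dB, about conversation level
print(round(spl_db(0.04), 1))   # 0.04 Pa -> 66.0 dB: doubling the pressure adds ~6 dB
```

Notice that doubling the air pressure only adds about 6 dB, which is why the jump from 70 dB to 110 dB is far more violent than the numbers suggest.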
The range between the lowest and highest amplitude is referred to as the dynamic range, and it is often manipulated as part of the audio production process. All the samples above show a minuscule dynamic range, since the waves all have the exact same amplitude. Real-world waveforms usually vary in amplitude from cycle to cycle, such as the following:
What we’ve been referring to so far is sound in its natural form, commonly referred to as analog audio. However, in order to use the sound with a computer, it has to be converted into numbers.
This is done by an Analog-to-Digital Converter, or ADC for short. What this tool does is take an amplitude measurement at a precisely metered rate. The usually-recommended rate is at least twice the highest frequency being recorded (the Nyquist rate). Most people will record at 44.1 kHz or 48 kHz, in order to be sure they cover the entire audible human spectrum.
As you can see, instead of the steady line, you can see the amplitude samples in the audio wave where the audio was recorded (each red plus is a single sample). But the recorded samples still come together in the shape of the audio wave, which is the result we want. If we record at a higher sample rate (the example is a 5 Hz sine wave recorded at a sample rate of 200 Hz), there will be more of those samples for each wave, increasing its resemblance to the original wave and resulting in a clearer sound. However, it also means more measurements, increasing the amount of data stored for the same sound.
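If you'd like to experiment with this yourself, here's a rough Python/NumPy sketch of the same setup as the figure, a 5 Hz sine wave sampled at 200 Hz (the code is my own illustration; the article's figures were produced with SoX and Gnuplot):

```python
import numpy as np

# The figure's example: a 5 Hz sine wave sampled at 200 Hz. That rate is
# well above twice the highest frequency (2 x 5 Hz = 10 Hz), so the
# samples trace the original wave closely.
sample_rate = 200                         # samples per second
t = np.arange(sample_rate) / sample_rate  # one second of sample times
samples = np.sin(2 * np.pi * 5 * t)       # the numbers an ADC hands the computer

print(len(samples))         # 200 stored values for one second of sound
print(len(samples) // 5)    # 40 samples per cycle of the 5 Hz wave
```

Doubling the sample rate would double the stored values for the same second of sound, which is exactly the quality-versus-size tradeoff described above.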
While this particular explanation was a little dry, I hope this helps you understand how sound is interpreted by a computer, which should make the effects of other tools a little easier to understand.
Have fun, and make something good!