Teaching computers to hear like humans do.
Along with vision, hearing is one of the most beloved and useful senses we have, but what would it take to grant this sense to a computer or AI program? While we already have plenty of ways to interact with sound on computers, I am more interested in the natural way we perceive sound and how that could be written as a software program. It's just a fun topic.
Game plan: we will first talk about sound in general so we have something to work with, briefly delve into the biology of hearing as it relates to us humans, and then implement a basic digital version of whatever we find in a second part.
Sound & Hearing
Sound, in its simplest form, is what happens when energy travels through a physical medium like air or water, creating acoustic vibrations that we later perceive as sound.
Sound is usually described in wave notation, which makes it easier to understand and communicate the underlying concepts. Let's examine a simple beep sound and how it is represented:
(1) Let's say there is a box that emits a beep (by rapidly vibrating a speaker cone). (2) Those vibrations travel in all directions from the box. (3) Picking just one along the horizontal axis (front view), we can analyze it further. Auditory vibrations (and waves in general) can be described via their cyclical components, their amplitude/volume, and their shape (here a sine wave). We can then describe a sound by its number of cycles per second, in this case 20, which equates to a 20 Hz (hertz) one-second tone. Note: a 20 Hz tone is normally out of our hearing range; I am using it here for simplicity.
Going back to how we perceive sound: after entering your ear, sound waves are amplified and eventually picked up by hair cells inside the cochlea (at the organ of Corti, on the basilar membrane). These sound receptors are organized by frequency in an orderly way, forming a tonotopic map, which we will eventually try to emulate.
Focusing on a few key components: (1) Sound enters the ear. (2) It is captured as vibrations and sent into the cochlea, where they are converted into neural code and relayed through the auditory nerve to the brain for further processing. (3) The hair cells responsible for coding frequency are located along the basilar membrane (in red); note that our hearing range, from around 20,000 Hz (20 kHz) down to 20 Hz, is laid out across this membrane.
Some hairy stuff: a cross section of the basilar membrane, which is covered in rows of hair cells: three outer rows (outer hair cells) and one inner row (inner hair cells). To simplify things we can focus on the inner hair cells for now, since the outer hair cells are thought to provide amplification and gain control; both respond to sound frequencies, though. The bottom illustration is the unfurled basilar membrane with a top view of the hair cells. Note that there is a range of response, and the sub-ranges are not linear: more cells are employed in the first 2,000 Hz than in the last 18,000 Hz, and this is where most of the interesting sounds, like speech, happen. Speaking of numbers: roughly 4,000 hair cells per ear is all you get to hear with (if you ignore the outer hairs), and unfortunately they don't regenerate, which is the main cause of hearing loss.
What the hertz?
All this is a bit vague without some real examples we can hear and code we can play with, so let's start with a simple tone:
And here’s the python code to make it:
If you are on a Mac or Linux, here's an alternative, cross-platform (but longer) method:
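A sketch of that longer approach, using only the standard library: synthesize the sine samples yourself, write them to a WAV file, and hand the file to the system's command-line player (afplay on macOS, aplay on most Linux distributions):

```python
import math
import struct
import subprocess
import sys
import wave

SAMPLE_RATE = 44100   # samples per second (CD quality)
FREQ_HZ = 440         # tone frequency
DURATION_S = 1.0      # tone length in seconds
AMPLITUDE = 16000     # out of 32767 for 16-bit audio

# Build the sine wave one sample at a time.
n_samples = int(SAMPLE_RATE * DURATION_S)
samples = [int(AMPLITUDE * math.sin(2 * math.pi * FREQ_HZ * t / SAMPLE_RATE))
           for t in range(n_samples)]

# Write the samples out as a 16-bit mono WAV file.
with wave.open("tone.wav", "wb") as wav:
    wav.setnchannels(1)           # mono
    wav.setsampwidth(2)           # 2 bytes = 16 bits per sample
    wav.setframerate(SAMPLE_RATE)
    wav.writeframes(struct.pack(f"<{n_samples}h", *samples))

# Play it back with the OS's built-in command-line player.
player = "afplay" if sys.platform == "darwin" else "aplay"
subprocess.run([player, "tone.wav"])
```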
As mentioned, we can only hear sound within a certain range. For instance, here are a number of tones whose frequency doubles at every step:
The first and last frequencies are really hard to hear even if you have good hearing:
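A sketch of how such a sweep could be generated, again with just the standard library; eleven doublings take us from 20 Hz up to 20,480 Hz, just past the nominal edge of hearing:

```python
import math
import struct
import wave

SAMPLE_RATE = 44100

# Frequencies that double at every step: 20, 40, 80, ..., 20480 Hz.
freqs = [20 * 2 ** i for i in range(11)]

samples = []
for freq in freqs:
    n = int(SAMPLE_RATE * 0.5)   # half a second per tone
    samples += [int(16000 * math.sin(2 * math.pi * freq * t / SAMPLE_RATE))
                for t in range(n)]

with wave.open("sweep.wav", "wb") as wav:
    wav.setnchannels(1)
    wav.setsampwidth(2)
    wav.setframerate(SAMPLE_RATE)
    wav.writeframes(struct.pack(f"<{len(samples)}h", *samples))
```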
The following example plays a common musical scale, still using our simple sine wave. We'll get to complex sounds later, but for now just notice that these notes sit within the preferred zone of the hair cells:
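A sketch of the scale, identical to the sweep above except for the frequency list; these are the standard equal-temperament frequencies for C4 through C5 (swap in any notes you like):

```python
import math
import struct
import wave

SAMPLE_RATE = 44100

# Equal-temperament C major scale, C4 to C5, in Hz.
scale_hz = [261.63, 293.66, 329.63, 349.23, 392.00, 440.00, 493.88, 523.25]

samples = []
for freq in scale_hz:
    n = int(SAMPLE_RATE * 0.4)   # 0.4 seconds per note
    samples += [int(16000 * math.sin(2 * math.pi * freq * t / SAMPLE_RATE))
                for t in range(n)]

with wave.open("scale.wav", "wb") as wav:
    wav.setnchannels(1)
    wav.setsampwidth(2)
    wav.setframerate(SAMPLE_RATE)
    wav.writeframes(struct.pack(f"<{len(samples)}h", *samples))
```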
What about volume?
Hair cells are sensitive to specific frequencies; you can hopefully see now how they map onto the cochlea/basilar membrane. When they are active they respond with action potentials, which can be coded* as 0's and 1's. Volume is believed to be coded* predominantly as the firing rate of these cells:
* When modeling something you usually need to simplify the real thing to some extent while preserving key behavior. In this case we are making the following simplifications:
First, action potentials can also be represented at a finer range, closer to the real response of a hair cell: instead of 0 or 1 you would get 0.1 for a low-volume stimulus, 0.9 for a louder one, and a run of repeated 1's if the sound goes beyond the cell's preferred range. These would correspond to a whisper, regular talk, and a loud horn, for instance.
Second, outer hair cells (and to some extent inner hair cells) modify how sensitive a hair cell is. When you are told to open your ears, or are attentive to a certain sound, you are increasing the gain at certain frequencies. This behavior, while very cool, is probably best left for later.
And third, hair cells respond to a range of frequencies, usually centered around a preferred one, so volume and response would really be distributed across many cells.
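Here's the rate-coding idea as a toy sketch (all the numbers are made up for illustration): the louder the stimulus, the more 1's a model "hair cell" emits per unit of time:

```python
def spike_train(volume, n_steps=20):
    """Toy rate coding: a volume of 0..1 becomes the fraction of
    time steps on which this model 'hair cell' fires (emits a 1)."""
    n_spikes = round(volume * n_steps)
    if n_spikes == 0:
        return [0] * n_steps
    interval = n_steps / n_spikes          # spread spikes evenly
    spikes = {int(i * interval) for i in range(n_spikes)}
    return [1 if t in spikes else 0 for t in range(n_steps)]

print(spike_train(0.1))  # whisper: sparse firing
print(spike_train(0.5))  # regular talk: moderate firing rate
print(spike_train(1.0))  # loud horn: firing on every step
```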
Very few sounds we encounter are as simple as a sine-wave tone. A complex sound can be understood as a combination of individual frequencies; think about the word "hello":
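Before looking at real speech, here's that "combination" idea as a sketch (assuming numpy is installed): adding a few sine waves together sample by sample yields one complex waveform that still contains each component frequency:

```python
import numpy as np

SAMPLE_RATE = 44100
t = np.arange(SAMPLE_RATE) / SAMPLE_RATE   # one second of timestamps

# Three arbitrary component frequencies, picked just for illustration.
complex_wave = sum(np.sin(2 * np.pi * f * t) for f in (220, 440, 660))
complex_wave /= np.abs(complex_wave).max() # normalize to the -1..1 range
```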
You might be used to seeing these sounds represented as waveforms:
A waveform does not tell the whole story, though: it only shows you how loud a sound is over time, not which frequencies are in it. For that we need something called a spectrogram:
A lot is going on here, so let's pick it apart: (1) This amplified spectrogram shows a number of frequencies on the vertical axis; here we are using a scale called mel, which better approximates the human hearing range (i.e., it's not linear). (2) Volume is coded as color intensity: green or light colors are mostly silence or noise, while darker colors signify higher volume and activity at those frequencies. Note that there are some low-dB sections (white) surrounded by higher-dB values (dark blue); color schemes vary depending on the specific spectrogram software and other factors. (3) Let's take the first sound of the word "hello" (He, or Je): the red region signifies the unfurled cochlea/basilar membrane, and whatever is in blue represents the activity of the hair cells over that slice of time.
If we plot our simple tones from before, we can better see the dramatic difference between complex and simple sounds:
The color scales here are different because these sounds are not amplified, but the idea is the same: intensity goes from white to red to magenta to blue to dark blue. The important thing to note is how a single-frequency tone is represented versus complex sounds.
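If you want to plot spectrograms of your own, here's a minimal sketch using matplotlib (assuming numpy and matplotlib are installed; note it uses a linear frequency axis rather than the mel scale of the figures above, for which you would typically reach for a library like librosa):

```python
import numpy as np
import matplotlib.pyplot as plt

SAMPLE_RATE = 44100
t = np.arange(SAMPLE_RATE) / SAMPLE_RATE

# A single 440 Hz tone vs. a crude "complex" sound (three mixed tones).
simple_tone = np.sin(2 * np.pi * 440 * t)
complex_sound = sum(np.sin(2 * np.pi * f * t) for f in (220, 440, 660))

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
for ax, sound, title in ((ax1, simple_tone, "simple tone"),
                         (ax2, complex_sound, "complex sound")):
    ax.specgram(sound, NFFT=1024, Fs=SAMPLE_RATE)  # FFT-based spectrogram
    ax.set_title(title)
    ax.set_xlabel("time (s)")
    ax.set_ylabel("frequency (Hz)")
plt.tight_layout()
plt.show()
```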
Perception of complex sounds in the brain
We now have all the elements necessary to tell the story of how we perceive complex sounds:
The individual frequencies of complex sounds (1) affect hair cells across the cochlea/basilar membrane, (2) the individual frequencies generate action potentials, (3) and together these are perceived as complex sounds. Remember that action potentials can be coded as repeating 0's and 1's, so parallel streams of 0's and 1's seem like a good starting point for emulating this early stage of sound perception in a computer.
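As a first pass at that starting point, here's a toy sketch (assuming numpy; the band layout and threshold are made up, and unlike the real cochlea the bands are linearly spaced): split the sound into frequency bands with an FFT, then turn each band's energy into its own stream of 0's and 1's, one stream per model "hair cell":

```python
import numpy as np

SAMPLE_RATE = 44100
N_CELLS = 16    # a toy cochlea: 16 "hair cells" instead of ~4,000
FRAME = 1024    # samples per analysis window

def encode(sound):
    """Turn a mono signal into N_CELLS parallel binary spike streams."""
    streams = [[] for _ in range(N_CELLS)]
    for start in range(0, len(sound) - FRAME + 1, FRAME):
        spectrum = np.abs(np.fft.rfft(sound[start:start + FRAME]))
        threshold = spectrum.mean()        # crude adaptive threshold
        # Each "cell" watches one band of the spectrum and fires (1)
        # when its band's average energy rises above the threshold.
        for cell, band in enumerate(np.array_split(spectrum, N_CELLS)):
            streams[cell].append(1 if band.mean() > threshold else 0)
    return streams

# A complex sound made of a 440 Hz and a 2,000 Hz tone.
t = np.arange(SAMPLE_RATE) / SAMPLE_RATE
sound = np.sin(2 * np.pi * 440 * t) + np.sin(2 * np.pi * 2000 * t)

for cell, stream in enumerate(encode(sound)):
    print(f"cell {cell:2d}: {''.join(map(str, stream[:40]))}")
```

Only the cells whose bands contain 440 Hz and 2,000 Hz should fire steadily; the rest stay mostly silent.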
Where do we perceive sound?
Have you ever had a catchy tune "playing" in your mind, or conversations with yourself (aka inner or covert speech)? Both are perfectly normal, and it is also fairly common not to have them.
Well, in essence the question here is what, and where, is producing these sounds; a similar question is why do we hear at all? This last one is related to the even thornier concept of "qualia", but it can also be demystified by a course in the cognitive neuroscience of language.
Unfortunately we don’t have all the answers but we do know that without these early stages of hearing both are severely impaired or absent all together.
Another clue comes in the form of the tonotopic map we have been studying: after leaving the ear, sound information is processed in different parts of the brain while preserving the same organization, so perhaps it is here.
And the mixed clues go on: information flows from the brain and other cortical areas down into the ear and earlier layers as much as it flows from the ear up to them, so perhaps we replay sounds by re-activating hair cells in the ear, or some related network somewhere else.
For now we will leave the biology, the unknowns, and higher-order functions like speech perception and language, so we can focus on the basics and how to translate them into digital counterparts, which I will hopefully cover in a future post.
I hope this has helped you with some sound basics, and I hope to see you next time.
Thanks for reading!
References & further reading:
Kandel, Eric R., and Sarah Mack. Principles of Neural Science. McGraw-Hill Medical, 2014.
Purves, Dale, et al. Neuroscience. Sinauer Associates, 2018.
Kemmerer, David L. Cognitive Neuroscience of Language. Psychology Press, 2015.