On Combining Image And Sound In Film

Many screenwriters and filmmakers find it very difficult to come up with ways of using sound more cinematically and creatively. Whenever they try, all they get is silence in their heads. It’s almost as if the muses themselves panicked and stampeded out of their minds the moment they heard the word “sound”.

This is not because sound doesn’t have much cinematic potential – it does. The problem lies in the way film sound is thought about. So that’ll be the subject of today’s post: the wrong way of thinking about sound, and the alternative that can break the impasse.

Movies are so compelling because they’re made of the same stuff as reality – electromagnetic waves the eyes can see and mechanical waves the ears can hear. It is a mistake, though, to think that faithfully imitating perceptual reality will more effectively draw audiences into the story world. It is in fact the main cause of the creative impasse, like trying to complete a shape in a sliding tile puzzle that has no empty space to move the tiles around.

First, combining image and sound according to ready-made perceptual configurations amounts to mere mechanical reproduction, a big no-no in any form of representation.

Second, real does not necessarily mean authentic.

But most importantly, mechanically adding sound to images for the sake of veridicality can have the undesired effect of creating an attention deficit in the audience.

Attention is what makes a film possible. If a movie fails to engage the audience’s attention, then it is not a journey through a fictional world but a mere pile of celluloid or pixels.

The ability to get the sustained and undivided attention of an audience is what sets a good filmmaker apart from the rest. Filmmaking is in fact all about designing the cognitive processes required to create the illusion of reality in the audience’s mind. It’s the art of guiding attention through the skilful use of cinematic techniques such as camera movement, composition, editing, and sound.

But anyone who’s had a go knows that this is not easy. Why? Because attention is metabolically expensive. It requires a lot of energy, of which the brain has only so much at its disposal. If it’s to make it through the day, it needs to find ways of making ends meet. In the case of perception, which is the outcome of attention, the brain goes about saving energy by storing and organising the sensory information it finds in the environment by means of two systems known as event files and schemas.

Event files are where all the relevant sensory information about people, places, events and objects is stored, so that the next time the brain encounters them, it can retrieve the necessary information – characteristic visual features, sounds, and so on – quickly and efficiently.

Schemas are mental structures for organising more complex types of information: general knowledge about sequences of events, rules, norms, procedures, and social situations that have been acquired through experience. Film is a good example of a schema. It contains all the norms and conventions required to make sense of a movie, all of which we have acquired through experience, i.e., repeated exposure.

This system makes sense: why waste energy reinventing the wheel each time? By using event files and schemas, the brain can quickly and efficiently form a percept whenever it detects a familiar cue.

The brain also has a surveillance system which constantly monitors the environment by scanning every one of the 11 million bits of data that the sensory organs send each second. The brain constantly compares all this data against its existing event files and schemas. If everything is as expected, then it can carry on running in economy mode. Only if anything deviates from the expectations or if anything suddenly changes will the brain ‘wake up’ and deploy its attentional resources to find out whether these deviations or changes represent a threat, an opportunity, or nothing worth investing energy in.

And this is why it is risky to think that combining image and sound realistically will draw audiences deeper into the story world. Ready-made perceptual configurations can have the opposite effect of telling the audience’s brain that there’s no need to waste attentional resources, as everything is as it should be.

This instinct towards naturalness is understandable. From a survival point of view, one major advantage of being able to acquire information about the same event through multiple sensory channels is that it allows the brain to verify the truthfulness of our perceptions. After all, despite their complexity, our senses are still prone to misperceptions and illusions which could potentially be deadly. Therefore, having senses that carry information along separate pathways allows the brain to cross-check our perceptions and confirm that they’re accurate. So it’s little surprise that many feel realistic combinations of image and sound in film are the best approach, lest the brain detect the perceptual fallacy that a film is.

But the truth is that we don’t need to worry about veridicality when it comes to film. First, for reasons we’ll see later, the brain is hardwired to automatically accept images and sounds that happen simultaneously in time and space as belonging to the same object, person, or event. In normal circumstances it may immediately carry out a reappraisal and either confirm that it was right or realise that it was wrong. But in the case of film, because the brain knows film is a schema where everything happens for a reason, it would never discard the image-sound connection as wrong.

Also, film is a form of pretend play. Pretend play allows us to modify representations of reality in our heads. It is good in that it opens up a world of new possibilities for exploring different options. But for it to be effective, the brain needs a way of making sure that real and imagined don’t get mixed up – it would be disastrous if we took a fire or a lion to be imagined. The brain gets around this problem by creating a copy of the original percept and then activating a decoupling mechanism that dissociates the copy from reality. That way we can modify the copy as much as we like without jeopardising the truth value of the real percept. So, our safety being secured, the brain is happy to suspend disbelief and go along with whatever comes its way, no matter how out of this world and improbable it may be.

Veridicality, therefore, is not something a filmmaker should be worrying about when it comes to combining image and sound in film.

Does all this mean we should avoid perceptual realism? Not at all, so long as it serves a specific cinematic purpose. It should be a deliberate choice aimed at having some kind of effect on how the audience will perceive the scene. It’s a bit like deciding whether to use a standard medium shot that feels natural and safe instead of a dramatic low-angle, or mid-key lighting as opposed to low-key.

Nor does it mean that we can combine image and sound willy-nilly. The combination of the two must meet one fundamental requirement: it must be done in a way that the brain can make sense of. And what is that way?

Since filmmaking is all about hacking the perceptual and cognitive processes of the brain, it’s simply a matter of finding the right process to hack. This shouldn’t be too difficult, since the brain too has to combine images and sounds that have been captured separately into a coherent and meaningful whole.

The brain receives information about the environment through five different sensory organs. Each captures a different spectrum of physical reality. The eyes pick up electromagnetic waves, the ears mechanical waves, the nose chemical substances in the air, and so on. Also, the signals from each sensory organ are processed, for the most part, in a different region of the brain.

To integrate these multiple sources of information into a unified and meaningful percept, the brain has to solve what is commonly known as the binding problem: it must determine which features belong to the same object or event. And it has to do so at two different levels – physical and semantic.

At the physical level, the brain has to determine which multisensory stimuli belong together. It does so by searching for patterns of neural activity across the different cerebral regions.

The senses send signals to the brain, which activate the cerebral regions responsible for processing each sense. Other aspects of reality, such as time and space, also have their own dedicated regions that sensory stimuli activate. So what happens every time we perceive something is that the regions responsible for processing vision, sound, time, and space all fire up at the same time, and this is how the brain determines which stimuli belong together – by detecting overlapping patterns of neural activity across its different regions.

At the semantic level, the brain solves the problem of integration also by searching for overlapping patterns of information. The only difference is that the information is of a semantic nature instead of physical.

Here the problem is not to determine what belongs together but how it belongs together. If the brain grouped multisensory stimuli physically but not semantically, we would experience the world incoherently. We would be able to recognise the different sounds and images but they would not be grouped meaningfully. On a busy road, while talking to someone, we might perceive the noise of car engines as coming out of their mouth and their words as coming out of car engines. Everything would be random and incoherent and we would not be able to use that information to guide our actions and decisions effectively.

You can get a sense of the nature of the binding problem at the semantic level by trying to answer the following question about a collection of triangles, squares, and circles of various sizes and colours: which three figures are alike?

Any luck? Are you maybe thinking, “It depends…?”

What makes it impossible to answer this question is that there are three dimensions of information – shape, size, and colour – spread over three different layers – triangles, squares, and circles – but no way to link them. In order to answer the question, you’d need to be given the specific dimension you must use as a reference point or unifying value for making meaningful associations across the different layers. If the question were “Which three figures are alike in the dimension of size?”, then your brain would be able to look for patterns of the “size” cue across the three layers, and thus connect the layers meaningfully. If it were colour, then it would look for overlapping patterns of this cue instead.
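
If it helps to see the logic spelled out, here’s a minimal sketch in Python – the figures and their attributes are invented for illustration – showing that a grouping only exists once a unifying dimension has been chosen:

```python
# Toy version of the puzzle: three layers (shapes), each figure also
# carrying values along the other two dimensions (size, colour).
figures = [
    {"shape": "triangle", "size": "small", "colour": "red"},
    {"shape": "triangle", "size": "large", "colour": "blue"},
    {"shape": "square",   "size": "small", "colour": "blue"},
    {"shape": "square",   "size": "large", "colour": "red"},
    {"shape": "circle",   "size": "small", "colour": "red"},
    {"shape": "circle",   "size": "large", "colour": "blue"},
]

def alike(figures, dimension):
    """Group the figures by the chosen unifying dimension."""
    groups = {}
    for figure in figures:
        groups.setdefault(figure[dimension], []).append(figure)
    return groups

# "Which three figures are alike?" has no answer until a dimension is chosen:
print(alike(figures, "size"))    # three small figures vs three large ones
print(alike(figures, "colour"))  # three red figures vs three blue ones
```

Ask for “size” and you get one answer; ask for “colour” and you get another. Without a dimension, “alike” means nothing.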

This is how it would work in a simple real-life audiovisual situation. Let’s say the brain wants to determine what gender a person is. In that case, the physical elements are the face in the visual channel and speech in the auditory channel. The semantic element is “gender”. “Gender” then will be the linking value or dimension the brain will use to integrate audiovisual information meaningfully.

The brain will then start looking for overlapping patterns across the two channels. At the visual level it will find things like skin texture and bone structure. At the auditory level, it will find things like pitch and sound power. It will then fuse them and the result will be a coherent percept that communicates something meaningful, i.e., the gender of the person.

We could change this parameter to “truthfulness” and the final percept would be different. The brain would be searching for different types of cues in each sensory channel that normally indicate whether a person is telling the truth or not – sweat on the face and stress in the voice, for example – and the final percept would have a different meaning.
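
To make the mechanism concrete, here’s a toy sketch in Python – the cue names are invented for illustration, and it is obviously not a model of how the brain actually works – showing how the unifying value selects which cues each channel contributes to the fused percept:

```python
# Made-up cue tables: which features each channel offers per unifying value.
visual_cues = {
    "gender":       ["skin texture", "bone structure"],
    "truthfulness": ["facial sweat", "gaze aversion"],
}
auditory_cues = {
    "gender":       ["pitch", "sound power"],
    "truthfulness": ["vocal stress", "hesitations"],
}

def bind(unifying_value):
    """Fuse the visual and auditory cues that overlap on the unifying value."""
    return {
        "unifying_value": unifying_value,
        "visual": visual_cues[unifying_value],
        "auditory": auditory_cues[unifying_value],
    }

# The same face and voice yield different percepts depending on the value:
print(bind("gender"))
print(bind("truthfulness"))
```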

For the record, in perceptual terms, dimensions are all the different things going on in the environment on which we could potentially focus our attention, and the layers are the senses.

So how does all this apply to film?

You may recall what I said earlier about the brain being hardwired to automatically accept images and sounds that happen simultaneously in time and space as being causally connected. This is because of physical binding. It is so common that the auditory, visual, temporal, and spatial processors fire up simultaneously in the brain that evolution has concocted a rule that goes something like, “If auditory and visual stimuli happen simultaneously in time and space, then automatically synchronise them”. That’s just what we get at the movie theatre.

As for semantic binding, this is the principle we need to follow to combine image and sound in a way that the brain can make sense of, whether the combination is faithful to our everyday perceptual reality or not. It is the process a filmmaker can exploit to construct auditory and visual elements in ways that serve the dramatic, narrative, and cinematic needs of the story, and not just the brain’s demand for veridical perception.

The process is the same: to select the auditory and visual elements that will overlap to form an audiovisual pattern, and to manipulate them so that they are congruent with each other by way of sharing a common unifying dimension or value.

So first you need to select a unifying value to integrate image and sound meaningfully. Then you need to include features in both the image and sound channels that a) are semantically related to that unifying value and b) have a counterpart in the other channel that the brain can associate. Or put more simply, you need features in both image and sound that are related to a unifying value and that, combined, form an audiovisual pattern that the brain can detect and make sense of within the context of the story.

One phenomenon that clearly demonstrates how semantic binding works in film is the ability to successfully use different types of music with the same set of images. Say we have as the visual setting a couple by the beach at sunset. If we add a romantic melody, it will work perfectly well, but so will a suspenseful tune. Why?

In the first case, the visuals contain elements that, at least in our culture, are perceived as romantic: a secluded natural setting and dim light that inclines couples to more freely express their feelings. The music, too, contains elements that are perceived as romantic: simple chord progressions, a predictable linear melody, use of a major key, and so on. In this case, the brain has no problem finding overlapping patterns across image and sound that it can integrate meaningfully, i.e., that the couple are about to consummate their love for each other.

But the visuals also contain dimensions that we tend to associate with danger. In the darkness our sexual inhibitions may decrease, but our darker side may also feel freer to come out. In the darkness we also feel more vulnerable. Therefore, the elements that characterise suspenseful music – dissonant chords, eerie intervals, non-linear sounds, a minor key and so on – will also align well with the visual elements of danger.

In short, a lonely beach at sunset is as good a setting for romance as it is for murder. Therefore, whether we use a romantic melody or a suspenseful tune, the brain will find corresponding patterns in both image and sound and will align them in our minds either way. Each alignment, though, will produce a very different meaning.

As we saw earlier, in everyday perception, our needs and intentions dictate the unifying value we choose to focus on, which in turn dictates what aspects of the environment make it into our perceptual field.

And as we’ve just seen, in film, the cinematic needs of the story (romance, suspense, and so on) dictate the unifying value, and the filmmaker determines what auditory and visual elements will be selected to align with this value, so that both image and sound work harmoniously with each other to create the desired effect and meaning.

But there’s more to it. This unifying value will not just determine what auditory and visual elements go in but also how they will be manipulated to make the overlapping pattern work. In the beach example, we would use different camera angles, framing, and editing for the romantic option than we would use for the suspenseful one.

In summary, in film, the unifying value is determined by the cinematic needs of the story and the scene in question. The unifying value in turn determines what auditory and visual elements will be required and how they need to be manipulated so as to create correspondences or patterns between image and sound, so that the brain can make the right associations.

Two scenes from two different films that fabulously illustrate this principle are the chopper scene in Predator (1987) and the chopper scene in Apocalypse Now (1979). They are ideal because both films share the same subject matter, war, and both scenes share the same setting and many elements: a chopper, soldiers, and a stereo playing music. But each has a different purpose, and therefore requires a very different set of choices regarding the selection and manipulation of auditory and visual elements to create an effective audiovisual pattern.

Predator. This is one of those films where the audience must be made to care for the team as a whole and not just for the main character. It is in fact the group’s bantering, their comradeship, and how well they work as a team that makes the film so enjoyable to watch.

The chopper scene plays a key role in that respect. Its aim is to establish and build that sense of comradeship among the team and their knack for teasing each other. Comradeship, therefore, is the unifying value that drives the interaction between image and sound. And this is how the auditory and visual elements were selected, manipulated, and combined to serve that purpose:

On the visual side of things, the inside of the helicopter is lit with a dim red light to create a sense of warmth and intimacy, perfect for creating an atmosphere conducive to bonding. Framing consists of medium close-ups, and editing mostly of action and reaction shots that not only create a stronger connection with the characters but also show the nuances of their interactions and the sense of camaraderie emerging between them.

Sound-wise, the dialogue consists mostly of their bantering. “Long Tall Sally”, a 1956 rock and roll song, is playing in the background, with a small portable stereo player as its source. This is an interesting choice of music, since rock and roll has traditionally been used for male bonding purposes. And that’s just what the song is doing here: helping them bond and putting them in the right frame of mind for bantering and developing that sense of care for each other that is so essential to the plot, and to the audience getting to like them and wanting to spend time with them. As for the manipulation of the music, because it is there for the benefit of the characters, it has to be diegetic, so the frequency range of the song had to be adjusted to make it sound like it is coming from a small stereo, and also so as not to interfere with the dialogue.
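
For the technically curious, this kind of frequency-range reduction – sound editors often call it “futzing” – essentially boils down to band-passing the track. Here’s a minimal sketch in Python using scipy; the file name is hypothetical and the cutoff values are rough guesses (in practice you’d EQ by ear):

```python
import numpy as np
from scipy.io import wavfile
from scipy.signal import butter, sosfilt

# Hypothetical mono source file; any WAV would do.
rate, audio = wavfile.read("long_tall_sally.wav")
audio = audio.astype(np.float64)

# Keep roughly 300 Hz - 3 kHz: small speakers reproduce little outside
# this band, and trimming it also leaves room for the dialogue.
sos = butter(4, [300, 3000], btype="bandpass", fs=rate, output="sos")
futzed = sosfilt(sos, audio)

# Write the result back out, clipped to the 16-bit range.
wavfile.write("long_tall_sally_futzed.wav", rate,
              np.clip(futzed, -32768, 32767).astype(np.int16))
```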

The correspondences between image and sound that form the audiovisual pattern, then, are: the dim red light, which works well with the rock and roll to create an atmosphere conducive to bonding; the reduced frequency range of the music, which gives a natural sense of space and allows the dialogue to be clear; and the clarity of the dialogue, which in turn plays well with the medium close-ups, which in turn serve the purpose of conveying the sense of emerging camaraderie.

Apocalypse Now. This film is about the moral ambiguity of war, particularly of the Vietnam War. It reveals this theme through the actions of US Army soldiers whose moral values rapidly disintegrate as a result of their participation in a futile, morally unjustified war. Most notorious is their use of Western cultural artefacts (Wagner, T. S. Eliot…) as ‘weapons’ intended to represent a greater “civilised” power that can easily subjugate the indigenous peoples of Vietnam.

The “Ride of the Valkyries” scene captures these thematic elements very skilfully. Its aim is to display the scale and might of the US Army and the way they exploit Western cultural artefacts to tyrannise the invaded. The unifying value that will drive the interaction between image and sound, then, is scale (of superiority).

Visually, the scene consists of a large number of spectacular extreme wide shots of the fleet getting ready to attack and then charging at the inhabitants of the village.

Aurally, we have Wagner’s “Ride of the Valkyries” (from the opera Die Walküre) being played at full blast by a soldier from a stereo inside the helicopter because “it scares the hell out of the slopes”, along with the sound of large explosions.

If in Predator images and sounds were warm and intimate, here they are large, distant, and even intimidating. Most interesting is the manipulation of the music. In both scenes the source of the music is a stereo player. But in Apocalypse Now the music had to be manipulated very differently to serve the unifying value of scale and superiority. Even though the music is technically diegetic, it had to be made to work ambi-diegetically, since reducing the frequency range the way Predator does would defeat the purpose of the scene: to display scale and might. It wouldn’t match the extreme wide shots, and it would not sound large, threatening, and imposing. It would also be asking too much of the audience to believe that the “slopes” would be able to hear a thin-sounding tune, let alone be intimidated by it.

The final point I’d like to make is: why bother? After all, most films seem to be doing just fine with the naturalistic approach.

The reason is simple: the audience. When people invest money, energy and two hours of their time, they want to get a return for it. And what would that return be? Pleasure.

As I mentioned earlier, film is a form of cognitive pretend play. Pretend play is a behaviour that has survival value. Anything that is good for our survival comes with a ‘thank you’ gift from our genes – a shot of dopamine and other feel-good chemicals. And this is ultimately what the audience are after and pay for.

Films are like a gym for the mind. They allow us to hone one of the most fundamental skills we humans need to survive our environment: pattern recognition. That’s what films are, a system of interconnected patterns. And that’s great, because the brain gets a kick out of completing patterns. It loves to impose order on an otherwise highly chaotic environment. It constantly looks for coincidences that alert it to possible causal relationships between events. And when it makes the right connections, that’s when dopamine flows into the bloodstream.

The most common types of patterns used in film tend to be patterns of shapes, light, colour, sound, movement, cause and effect, time, space, behaviour, character, and action. But the relationship between image and sound itself is another rich source of patterns, one that filmmakers seldom exploit. It offers the opportunity to create patterns that convey meaning in a non-linear, more interesting way, and it offers the opportunity to take the brain out of its perceptual slumber. So what is there to lose by bothering? And what is there to gain? The bottom line is that, when it comes to film, the brain wants to be engaged, and the more layers of engagement, the better. The more patterns to solve, the bigger the fix of dopamine the audience will get.

What I’ve been talking about in this post is far from everything there is to know about the relationship between image and sound. You can think of the organising principle I’ve described as the overall strategy. Then there are the tactics, which offer myriad ways of putting image and sound together. Ultimately, it’s all about creating a whole that is greater than the mere sum of the parts, and that requires a dynamic process.

Exploring what dynamic means requires a different approach to film sound than the one I’ve been taking so far. So shortly I’ll be putting evolution and cognition aside for a while and instead explore the relationship through the lens of semiotics, which deals with human-made meaning.

But before jumping into this fascinating world of meaning making, I’d like to pause and take stock of some of the things I’ve been talking about so far. There’s nothing like seeing things in action, so in my next post I’ll be discussing sound in Lars von Trier’s Dancer in the Dark and Breaking the Waves.

I hope you’ll be joining me. Till then, have a great time.