Dialog Editing Workflow

Yanni Caldas
Oct 13, 2023

Updated: May 24, 2024

Dialog editing is an art in itself, and I believe everyone is constantly learning and adapting their approach for every project. Streamlining the editing process can make the whole pipeline run smoother so that the recorded dialog is heard in your project as quickly as possible. Editing dialog generally has two phases: cleanup, and then enhancement. Disclaimer: this is based on game audio workflows that I have found work for me, and I am continuously looking for ways to modify and improve them. There are many ways to get the job done.

The first thing I do before editing the dialog is set up and organize my project. The setup involves arranging the microphones on two separate tracks (motion capture and studio recording sessions generally have a primary microphone and a secondary safety microphone in case the primary clips or picks up an unwanted noise), splitting the recording by character onto separate tracks, creating empty "dummy" tracks beneath the main edit tracks, and creating a grouped track that holds a duplicate of the raw audio. This setup allows any non-destructive edits I do (cuts, trimming items, etc.) to be mirrored across the raw audio and the main track. At some point in the process, I usually have to do destructive editing to the audio. Using a setup like this allows me to quickly reference or revert to the raw audio and bring a segment to the main edit track without having to recreate any non-destructive edits or search for the raw audio files. At this point, it can be helpful to create your own reusable piece of room tone by editing together segments of clean room tone from the recording - 15-30 seconds should be plenty.

Once the initial setup is finished, I like to start with two basic filters: a high pass and a low pass. These will get set depending on the voice and the overall style that the dialog should be in. The high pass filter is set up to clean up any low rumble, noise, or anything that is below the fundamental of the spoken dialog so that no unwanted low frequencies are causing potential problems. The low pass filter is something that I will typically set at different spots depending on how “airy” the dialog should be. It might be as low as 12 kHz if I want a more focused sounding line, or it might be set at 17 kHz or higher to preserve some of the high air so the dialog sounds like it is coming from higher up. The filters will generally look similar to the ones below.
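To put rough numbers on what those two filters do, here is a minimal sketch using first-order (6 dB per octave) filters in plain Python. This is not what an EQ plugin does internally, and real dialog filters are usually steeper. The 80 Hz high pass cutoff and all function names are assumptions for illustration only; the 12 kHz low pass setting comes from the paragraph above.

```python
import math

def one_pole_lowpass(samples, cutoff_hz, sample_rate):
    """First-order low pass: gently attenuates content above cutoff_hz."""
    rc = 1.0 / (2.0 * math.pi * cutoff_hz)
    dt = 1.0 / sample_rate
    alpha = dt / (rc + dt)
    out, prev_out = [], 0.0
    for x in samples:
        prev_out = prev_out + alpha * (x - prev_out)
        out.append(prev_out)
    return out

def one_pole_highpass(samples, cutoff_hz, sample_rate):
    """First-order high pass: gently attenuates content below cutoff_hz."""
    rc = 1.0 / (2.0 * math.pi * cutoff_hz)
    dt = 1.0 / sample_rate
    alpha = rc / (rc + dt)
    out, prev_in, prev_out = [], 0.0, 0.0
    for x in samples:
        prev_out = alpha * (prev_out + x - prev_in)
        prev_in = x
        out.append(prev_out)
    return out

def rms(samples):
    """Root-mean-square level of a list of samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def sine(freq_hz, sample_rate, seconds=1.0):
    """Test tone generator (hypothetical helper for this sketch)."""
    n = int(sample_rate * seconds)
    return [math.sin(2.0 * math.pi * freq_hz * i / sample_rate) for i in range(n)]

if __name__ == "__main__":
    sr = 48000
    rumble = sine(20, sr)    # stands in for low rumble below the voice
    voice = sine(1000, sr)   # stands in for the body of the voice
    # Assumed 80 Hz high pass: rumble drops a lot, the 1 kHz tone barely changes.
    print(rms(one_pole_highpass(rumble, 80, sr)) / rms(rumble))
    print(rms(one_pole_highpass(voice, 80, sr)) / rms(voice))
```

The point is only the behavior: content well below the high pass cutoff is reduced while the voice range passes nearly untouched, and the low pass does the mirror image at the top end.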

After the filters, I begin the first phase of the edit - the non-destructive portion. It involves going through the audio and cutting out unwanted noises, removing long gaps between lines, assembling alt takes on the dummy tracks (also called stacking), and trimming the starts and ends of the lines so each one begins exactly when the line (or breath) starts. In game audio, the dialog is usually implemented as separate lines (cinematics are the usual exception) and is triggered programmatically. Keeping the edits tight allows lines to trigger exactly when you want them to, without any unwanted space before the line begins.
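The idea of a tight trim can be sketched as a simple threshold search. This is only an illustration, not how the trimming is done in practice (that is done by hand and by ear in the DAW). The function name, the threshold value, and the `pad` parameter are all hypothetical, and a fixed threshold can easily miss a quiet breath, so a listen is still needed.

```python
def trim_to_content(samples, threshold=0.01, pad=0):
    """Return (start, end) so that samples[start:end] runs from the first
    to the last sample whose absolute value reaches the threshold.

    pad keeps a few extra samples on each side, e.g. to leave room for a
    short fade. Returns (0, 0) if nothing reaches the threshold.
    """
    loud = [i for i, s in enumerate(samples) if abs(s) >= threshold]
    if not loud:
        return (0, 0)
    start = max(0, loud[0] - pad)
    end = min(len(samples), loud[-1] + 1 + pad)
    return (start, end)

if __name__ == "__main__":
    # Toy "recording": near-silence, a short burst of content, near-silence.
    line = [0.0, 0.001, 0.0, 0.2, -0.5, 0.3, 0.002, 0.0]
    start, end = trim_to_content(line)
    print(line[start:end])
```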

There will generally be a take sheet from the recording session that notes which take to use for each line. If the main take has an unwanted characteristic on a single word, you can often edit in that word from one of the alt takes, which are conveniently on the dummy tracks below the main edit track.

While doing this initial edit, there can be a handful of various issues that would need to be addressed. Some of these include distortion, unwanted noises between words, or even a word that is not intelligible enough within the take to use. The best approach to this is often to sneak in a word from one of the alt takes (usually it will sound more natural). Other approaches to this are editing in the backup microphone for that word or line (be sure to level and tone match whenever you are doing this to prevent any jarring changes), replacing the noises between the words with the clean room tone that was created initially, or as a last resort, using a destructive process to fix that word.
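For the level half of "level and tone match", one rough way to get a starting point is to compare RMS levels of the original word and its replacement. This is a sketch under assumptions: the function names are hypothetical, RMS is only one possible measure of level, and tone matching (EQ) is not covered at all. The final call is still made by ear.

```python
import math

def rms(samples):
    """Root-mean-square level of a list of samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def match_gain_db(reference, replacement):
    """Gain in dB to apply to `replacement` so its RMS equals `reference`'s."""
    ref_level = rms(reference)
    rep_level = rms(replacement)
    if ref_level == 0.0 or rep_level == 0.0:
        raise ValueError("cannot level match against digital silence")
    return 20.0 * math.log10(ref_level / rep_level)

def apply_gain_db(samples, gain_db):
    """Scale samples by a gain given in dB."""
    gain = 10.0 ** (gain_db / 20.0)
    return [s * gain for s in samples]

if __name__ == "__main__":
    primary_word = [0.5, -0.5] * 100    # word on the primary microphone
    backup_word = [0.25, -0.25] * 100   # same word, quieter backup microphone
    gain_db = match_gain_db(primary_word, backup_word)
    matched = apply_gain_db(backup_word, gain_db)
    print(round(gain_db, 2), round(rms(matched), 3))
```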

When switching between takes or the backup microphone, the listener is more likely to hear an abrupt start over an abrupt stop. With this in mind, using a long fade-in at the start of an item and a shorter fade-out to the next item you're blending into allows for a natural-sounding edit that will usually be transparent.
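One way to read that advice is as an asymmetric crossfade: the incoming item fades in over the whole overlap, while the outgoing item only fades out over its last few samples. The sketch below uses linear fades and is just one interpretation. The function name and parameters are hypothetical, and a DAW will offer other fade shapes.

```python
def asymmetric_crossfade(outgoing, incoming, fade_in_len, fade_out_len):
    """Overlap `incoming` onto the end of `outgoing`.

    The overlap is fade_in_len samples long. The incoming item fades in
    linearly across the whole overlap, and the outgoing item fades out
    linearly over only its final fade_out_len samples.
    """
    if not (0 < fade_out_len <= fade_in_len <= min(len(outgoing), len(incoming))):
        raise ValueError("need 0 < fade_out_len <= fade_in_len <= item lengths")
    overlap_start = len(outgoing) - fade_in_len
    fade_out_start = len(outgoing) - fade_out_len
    mixed = [0.0] * (overlap_start + len(incoming))
    for i, s in enumerate(outgoing):
        if i >= fade_out_start:
            gain = 1.0 - (i - fade_out_start + 1) / fade_out_len  # reaches 0.0
        else:
            gain = 1.0
        mixed[i] += s * gain
    for j, s in enumerate(incoming):
        gain = j / fade_in_len if j < fade_in_len else 1.0  # starts at 0.0
        mixed[overlap_start + j] += s * gain
    return mixed

if __name__ == "__main__":
    take_a = [1.0] * 10
    take_b = [1.0] * 10
    # Long fade-in (6 samples) on the new take, short fade-out (2) on the old.
    print(asymmetric_crossfade(take_a, take_b, fade_in_len=6, fade_out_len=2))
```

With constant test signals like these the overlap briefly sums above 1.0; real takes are not identical signals, so how the overlap sums depends on the material.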

Once the non-destructive portion is complete, the next step in how I approach dialog is to begin the cleanup phase, which tends to be more destructive. There are a handful of different tools that can be used for this. I use iZotope RX, but any option that can get the same result will work. Once the line is in RX, I’ll start by cleaning up any unwanted mouth noises. There can be times when mouth noises are part of the performance, so it is best to use your judgment on how many noises to remove to best fit the performance. For the cleanup process itself, I will generally listen through line by line, highlight the noise like the one below, and then use Mouth-DeClick to remove the click. When timelines are tight, I will sometimes use a tame setting that catches the largest clicks and run all of the dialog through it. RX comes with many modules that can be used to solve various issues. It is best to use your instinct and ear to decide what to do on a case-by-case basis.

Another consideration to keep in mind when editing mouth noises is to listen for any nose knocks or nose whistles. To remove these, I find using spectral recovery works best. Austin Mullen also has a fantastic mini-video series highlighting how to remove these, and it is worth a watch.

Often people will want to know about noise reduction at this point. In my experience, it is something that should be reserved for when the audio was recorded in less-than-ideal conditions. If you do have to do noise reduction, using spectral repair and delinking the voice and tonal sensitivity tends to work best. Too much reduction can suck the life out of a performance and leave it in an unnatural void of silence, so it’s best to listen closely for any unnatural artifacts (especially in higher frequencies).

Once the dialog is cleaned up in RX, it then transitions from the cleanup phase to the enhancement phase. The goal here is to control the dialog a bit more, bring out more of the performance, and make it sound as good as possible no matter where it is being heard. The first thing I will go and do is perform item/clip gain on the dialog to turn down the sibilance by a little bit so a later de-esser does not have to work as hard. There is a handy script for Reaper that can do this quickly called: Script: gen_Envelope-based Deesser.eel

While turning down the sibilance, I will also be turning up or down the quiet/loud bits of dialog so a compressor does not have to work as hard. This is also a moment where, if you want to make a line sound bigger or softer, changing the volume of those lines can have a very dramatic impact on the emotional delivery. For example, having the line slowly come up in volume as it's being performed (by only a few dB) can make the line come across as more powerful. At the same time, the opposite can make the delivery sound weaker. After doing the item automation, I’ll then set a gentle compressor to “glue” the dialog together and control the peaks. Often I will have different attack and release settings for different performances to change how punchy or aggressive the line comes across. For example, a slower attack will make the line sound more “in your face” whereas a faster attack will make it sound smoother. The goal here is around 2-3 dB of reduction on average so that it is transparent - unless you want to hear the compression effect.
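The numbers involved here are small, and a little arithmetic shows why. The sketch below covers two things: a volume ramp of a few dB expressed as linear gain, and the steady-state gain reduction of a simple hard-knee compressor. Everything in it is an assumption for illustration (the function names, the -18 dBFS threshold, and the 2:1 ratio are not from any particular plugin), and attack and release behavior is not modeled at all.

```python
def db_to_gain(db):
    """Convert a level change in dB to a linear amplitude multiplier."""
    return 10.0 ** (db / 20.0)

def gain_ramp(start_db, end_db, num_samples):
    """Per-sample linear gains for a ramp that moves evenly in dB."""
    if num_samples < 2:
        return [db_to_gain(end_db)] * num_samples
    step = (end_db - start_db) / (num_samples - 1)
    return [db_to_gain(start_db + step * i) for i in range(num_samples)]

def gain_reduction_db(level_db, threshold_db, ratio):
    """Steady-state gain reduction (in dB) of a hard-knee compressor."""
    over = level_db - threshold_db
    if over <= 0:
        return 0.0
    return over * (1.0 - 1.0 / ratio)

if __name__ == "__main__":
    # A line that rises by 3 dB from start to end (only 4 points shown).
    print(gain_ramp(0.0, 3.0, 4))
    # A peak 6 dB over an assumed -18 dBFS threshold at an assumed 2:1 ratio.
    print(gain_reduction_db(-12.0, -18.0, 2.0))
```

In this example, a peak 6 dB over the threshold at 2:1 gives 3 dB of reduction, which sits in the 2-3 dB range mentioned above. Evening out levels with item gain first keeps the peaks from pushing much further over the threshold than that.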

Since compression tends to bring out the subtle tonal details of a line, I like to do a final EQ pass for any extra “corrections” that are needed after the compression. Slower attacks tend to make the voice brighter since the compressor is not latching onto higher frequencies as much. This is also where I will do some dynamic EQ to control the lows and low mids of the voice if the performer is going up and down their register a lot. This allows control over the amount of bass in their voice, or over the amount of proximity effect from the microphone if they were moving around a lot in front of the mic, and it can correct any “boxiness” in the recording. I’ll often also turn up or down somewhere between 2-5 kHz depending on whether the line needs to sound more aggressive or softer to fit the context. Turning up or down ~1 kHz can change how focused the line sounds as well.

Once the edit for that line is complete, I will create a render region with the correct filename and then move on to the next line.

All in all, this changes from project to project, performer to performer, and even recording to recording. It’s best to use your judgment on whether something needs to be done to each line to bring out the best of each performance. I hope all of this is helpful or inspires you to try something new - happy editing!