It is common in audio (and other kinds of signal processing) to

- convert a time-domain signal to a spectrogram via the short-time Fourier transform (STFT)
- modify the spectrogram to get
- convert the modified spectrogram back to a time-domain signal via the inverse short-time Fourier transform (ISTFT)

An example of such processing is spectral subtraction for noise reduction.

When creating the spectrogram in Step 1, a window
is usually applied to the samples in each frame, and the frames overlap. Because of this, it is not completely obvious if and how it is possible to construct a time domain signal from the modified spectrogram.

In fact, depending on the processing that is done, there may not *be* a time-domain signal whose STFT is exactly the output from Step 2. In this note, we are going to ignore this issue and consider only the simple case where we don’t do anything in Step 2:
. That is, we want to recover the original signal
and understand under what conditions it is possible to do so. The good news in this case is that, given conditions that are easily achieved in practice, it is possible to invert the STFT and exactly reproduce
.

It is less common, but sometimes done, to carry out similar processing in the time domain. The steps are

- convert a time-domain signal to a series of overlapping, windowed frames
- modify the frames to get
- convert the modified frames back to a time-domain signal

An example of such processing is to remove segments of an audio signal whose energy is below a threshold and are regarded as silent.

Again, we will focus on the case where no modification is done
and ask when is it possible to recover
from the
It is expected that if we can do that, similar techniques will be valuable for the case where modifications are done.

The first condition for invertibility is that there is no gap between the frames. This makes intuitive sense because otherwise we are leaving out some samples when doing the windowing, so there is no way to reconstruct them.

The second condition is related to the specific technique used, which is called *overlap-add* and is explained below. It is common in the literature to find mention of the *constant overlap-add* (COLA) constraint. As we will see, this is a sufficient condition to achieve invertibility, but it is stronger than necessary and somewhat difficult to satisfy. We will show that the necessary condition, *nonzero overlap-add* (NOLA) constraint, is much weaker and is easily achieved.

Recent versions of SciPy include functions for the STFT and ISTFT. This note was prompted by the implementation of the `istft` function in `scipy.signal` (v1.0.0) which will check if COLA is satisfied and will abort if not. We want to make the case for removing that check.

There are lots of equations in what follows, but the mathematics is elementary and we have taken pains to be clear, perhaps to the point of being pedantic. A short Python script which illustrates the points made in this note is available here.

Let’s suppose we have a sequence
of length
that we want to analyse in successive frames of length
with a hop size of
from one frame to the next. We will only consider the case where there are no gaps between frames, which means that
In other words, successive frames overlap by
samples.

We’ll use
(for time) as the index of the frames and
as the index of the frequencies in a given frame. The samples in frame
are

Let
be a window function
So far we have regarded
and
as finite sequences of length
and
respectively. To simplify the notation in the following, we will assume that they have both been extended to be defined for all
by zero padding. The samples in the windowed frame
can then be conveniently described as

Again,
is defined for all
but only a finite number of samples are nonzero.

We start with the easier case, which is time-domain processing.

We will arrive at the *overlap-add* technique by starting with the following sum. Note that we use infinite limits in the following, but only a finite number of terms are non-zero. For any
(unless some of the
, in which case we must have
consider

We can then recover
from

as long as the denominator is not zero. We call this the *nonzero overlap-add *(NOLA) constraint

A stronger constraint is *constant overlap-add* (COLA), defined by

for some constant
An even more restrictive condition is that

If
and the window
satisfies the strong COLA constraint then we have a particularly simple inversion formula

However, while the COLA constraint is *sufficient*, it is not *necessary* for exact invertibility to be possible.

Although COLA is often introduced for the value
the most common choice is
for reasons that we will explain below.

To get ready to talk about inverting the STFT, we review the definitions of the forward and inverse Fourier transform for finite sampled sequences.

Given a sequence
the discrete Fourier transform of
is given by

We can exactly recover
from
via the inverse Fourier transform

See Lemma 2 in the Appendix if you want a refresher on why this is true.

Again we start with windowed frames

where all sequences are extended to
by zero padding.

The Fourier transform of frame
is

We define a sequence
which is the inverse Fourier transform of
in frame
and zero-padded for other

Because we did not modify
we have
But we’ll keep the notation
to look forward to the situation where we do have a modified

By analogy with overlap-add in the time domain we consider the following sum. The equations look horrendous, but the calculation is straightforward.

In the line marked (*), we used Lemma 1 (below). In the next line, we used the fact that
and
are nonzero for the same values of

We conclude that

which is valid when the denominator is not zero. The inversion formula is valid under any of the following conditions, listed in order of increasing strictness

- NOLA: for all
- for all (see Lemma 3 below)
- COLA: for all and some constant
- Strong COLA: for all

As mentioned above, it is most common to choose
The reason is that in the case where we did modify the STFT, there may not be a time-domain signal whose STFT matches our modified version. Choosing
gives the Griffin-Lim optimal estimate (optimal in a least-squares sense) for a time-domain signal from a modified STFT. SciPy implements
for the signal reconstruction in the `istft` routine.

We have shown that the commonly-used COLA constraint is sufficient but not necessary for the windowed time domain or STFT to be invertible. The weaker NOLA condition is necessary and sufficient. Almost all commonly-used window/overlap combinations satisfy NOLA.

We were motivated do this by the implementation of the SciPy library function `istft`. We recommend that the COLA constraint imposed by that function should be removed or at least made optional. Update: There is now a pull request for this in the SciPy repo.

where

The case
is obvious because in this case each term in the sum is equal to
. QED

Given a sequence
its sampled discrete Fourier transform is

We can recover
from
by the equation

QED

Let
be a window with strictly positive values:
for all
Extend
to infinity at both ends by padding with zeros. Then if

for any

We’ll pick a
that works. Let

By definition of the floor function,

Flipping this around and using the fact that
, we have

which is what we needed to show. QED