Introduction

To begin with we consider a calculus problem that you may have seen in your exam:

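Let $f$ be a continuous function on $[0,\infty)$ such that $\lim_{x \to \infty} f(x)=l$, and let $0<a<b$. Prove that

$$\int_0^\infty \frac{f(ax)-f(bx)}{x}\,dx=\bigl(f(0)-l\bigr)\ln\frac{b}{a}.$$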

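And we solve this problem as follows. Put $g(x)=f(x)-l$, then $\lim_{x \to \infty}g(x)=0$. Consider the two-variable function $F(x,y)=-g'(xy)$ and the region $D=\{(x,y):x \ge 0, a \le y \le b\}$; we have this result:

$$\int_0^\infty \frac{g(ax)-g(bx)}{x}\,dx=\int_0^\infty\!\!\int_a^b -g'(xy)\,dy\,dx=\int_a^b\!\!\int_0^\infty -g'(xy)\,dx\,dy=\int_a^b\frac{g(0)}{y}\,dy=g(0)\ln\frac{b}{a}.$$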

Substituting $g(x)$ with $f(x)-l$ gives exactly what we want, doesn’t it? Well, the more analysis you learn, the more absurd you will find this proof. If you write this in an exam you will get $0$ marks no matter what. There are two major mistakes:

  1. Can we change the order of integration? We have no idea. What is certain is that the order cannot be changed for free: there are counterexamples.
  2. Is this function even differentiable? We also have no idea. A continuous $f$ is almost certainly not (in a suitable sense, the probability that $f$ is differentiable is $0$); see this post to learn why if you have some background in functional analysis.

For a good proof, please turn to math.stackexchange. This is not easy at all.
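Even though the proof above is broken, the statement itself can be sanity-checked numerically. The sketch below assumes the problem is the classical Frullani-type identity $\int_0^\infty \frac{f(ax)-f(bx)}{x}dx=(f(0)-l)\ln\frac{b}{a}$ for $0<a<b$, and picks $f(x)=e^{-x}$ (so $l=0$) purely for illustration. The helper name `frullani_integral`, the truncation point and the step count are my own choices, not part of the problem.

```python
import math

def frullani_integral(f, a, b, upper=40.0, n=200_000):
    """Midpoint-rule approximation of the integral of (f(a x) - f(b x)) / x over (0, upper).

    The midpoint rule never evaluates the integrand at x = 0, and for a rapidly
    decaying f the tail beyond `upper` is negligible (an assumption of this sketch).
    """
    h = upper / n
    total = 0.0
    for i in range(n):
        x = (i + 0.5) * h
        total += (f(a * x) - f(b * x)) / x
    return total * h

f = lambda x: math.exp(-x)        # illustrative choice: f(0) = 1 and l = 0
a, b = 1.0, 2.0
approx = frullani_integral(f, a, b)
exact = (f(0.0) - 0.0) * math.log(b / a)
print(approx, exact)
```

This only checks one smooth example; it says nothing about a general continuous $f$, which is exactly why a real proof is needed.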

The problem is, it is really unfair that in some circumstances we have to give up all properties of differentiation. If you are studying differential equations and a non-differentiable function pops up, you are stuck. Sometimes you do not even know whether a function is differentiable.

So this post is written. We introduce the concept of (Schwartz) distributions (a.k.a. generalised functions), where differentiation is significantly extended, to obtain derivatives in a generalised sense. Roughly speaking, once distributions are introduced, differentiation can be done with absolute ease.

A function of bad blood among physicists

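In fact, physicists had been using distributions long before mathematicians established a formal theory. For example, the $\delta$ function introduced by Dirac that you may have met in the Fourier transform:

$$\delta(x)=\begin{cases}+\infty, & x=0,\\ 0, & x\ne 0.\end{cases}$$

And it is required that

$$\int_{-\infty}^{\infty}\delta(x)\,dx=1.$$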

But this does not make any sense in calculus. Von Neumann, in his book on quantum physics, warned against the theory using this function, and dismissed this function because this was a “fiction”. Not so pleasant. He tried with a lot of effort to demonstrate that, quantum physics could live without such a “fiction”. As you can imagine, this function may have created some bad blood between von Neumann and Dirac.

Laurent Schwartz, however, managed to be a peacemaker. He developed the theory of distributions (which is exactly what we are talking about in this post), and the “fiction” became an easy “fact”. Years later, he became a 1950 Fields Medalist (one of the most prestigious awards in mathematics) at the age of 35, with the citation

Developed the theory of distributions, a new notion of generalized function motivated by the Dirac delta-function of theoretical physics. (Source)

As you will see later, thanks to Schwartz, the twisted $\delta$ function is well-defined and is really plain and elegant. So von Neumann didn’t need to be angry any more.

On backgrounds

By “concept” I mean that I will try to include the basic ideas (without many proofs, though they can be supplied), so that serious study of the subject becomes simpler (it can be really tough!). You should not expect to be able to solve problems on distributions after reading this post.

There will be two parts. Part one focuses on motivation and what is going on. I will try to make it readable to anyone who has finished calculus or, more ideally, undergraduate analysis and linear algebra, though rigour is not always guaranteed. It would be better if you know some differential equation theory, but that’s not a must. If you already have the background to read part 2, then part 1 is much easier for you and can serve as a good source of intuition and motivation.

If you still need to understand differentiation in single-variable calculus, then there is no need to struggle with generalised differentiation this early. It does not help. The linear algebra required is vector spaces, subspaces and linear maps. You should know that integration and differentiation are linear maps. This is a graduate course topic, so it is not realistic to assume the reader has no idea about calculus and linear algebra.

The second part will be much more advanced, and you are expected to have some background in topological vector spaces (functional analysis). Neither part should be considered lecture notes, but they may help you find where you are when you study this concept seriously.

Part I - Integration by Parts

Throughout, we consider real-valued functions on $\mathbb{R}$. The theory can be generalised to complex-valued functions on $\mathbb{R}^n$, where partial derivatives take part, but we are not doing that here. At the end of the day, this extra work would not be a big deal.

Motivation: the vector space of distributions

In calculus, a lot of functions we study are smooth (for example, $y=\sin{x}$), and we write $C^\infty$ for the set of them, as they are infinitely differentiable. This is a vector space, and in this vector space differentiation can be done with absolute ease. For given $f \in C^\infty$, we have $f',f'',\cdots,f^{(k)}$ well defined for all $k = 1,2,\cdots$. But in vector spaces like $C^2$, $C^1$, or even $C$, differentiation can only be done with caution: we may only have $f''$ and no $f^{(3)}$, or even $f'$ may not exist. We don’t like this kind of caution. Hence we introduce the concept of distributions, also known as generalised functions. We want a space where we can still do differentiation with absolute ease. We may need to modify our definition of differentiation so that it works on every continuous function (but it shall not lose its meaning within $C^\infty$). Bearing these in mind, we have several settings or expectations for distributions:

  1. Every continuous function should be (considered as) a distribution. (So we can take derivatives of all continuous functions without too much worry, unlike in the calculus problem at the beginning.)
  2. The “modified differentiation” should make sure that the “modified derivative” of a distribution is still a distribution. In other words, distributions are “infinitely differentiable” (which makes differential equation theory much easier). In the language of algebra, the “modified derivative” should be an endomorphism.
  3. The usual formal rules of calculus should hold. For example in the new sense we should still have $(fg)'=f'g+g'f$. (Our modified differentiation should not go too far.)
  4. Convergence properties should also be available. (Validating this requires more theory so this can only be mentioned in part 2.)

Let’s write our desired space of distributions as $\mathscr{D}'$, and the space of all continuous functions as $C$. All of $C,C^\infty,\mathscr{D}'$ are considered as real vector spaces and we should have

$$C^\infty \subset C \subset \mathscr{D}'$$

in the sense of subspaces.

What is distribution and extended differentiation exactly

Here is a breakdown of these concepts. You will see terminologies and definitions later.

  • A smooth, continuous or, more generally, locally integrable function gives rise to a bounded linear functional. The converse is not guaranteed to be true, but we pretend it is, so all bounded linear functionals are called distributions, a.k.a. generalised functions (this name is nice because we pretend the converse to be true). Whenever you are asked what a generalised function is, you can say: it is a linear map, and sometimes it is determined by a normal function.
  • For these distributions or generalised functions, we modify the derivative according to integration by parts. The modified derivative cannot always be written down explicitly as a function, but we don’t care, because integration by parts doesn’t give us many problems. Whenever you are asked how the derivative of a non-differentiable function is given, you can say: it is given by pretending nothing is wrong in integration by parts.

We now try to understand what we really want from distributions. We start our study through integration, because differentiation does not work. Given $f \in C \subset \mathscr{D}'$, we first need to make sure $\int f\phi$ is well-defined for suitable $\phi\in C^\infty$, because we want to do integration by parts, which involves some differentiation, and we may make use of it.

If $f$​ is not even a continuous function, we still need to consider some $\phi$ in the same manner, or our extension would be abrupt.

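Let’s talk about these $\phi$ a little bit, with respect to integration by parts. Consider the bump function

$$\phi(x)=\begin{cases}\exp\left(-\dfrac{1}{(x-a)(b-x)}\right), & a<x<b,\\[2mm] 0, & \text{otherwise}.\end{cases}$$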

On $(a,b)$, we have $ \phi\ne 0$. On the boundary $a$ and $b$ we have $\phi(x)=0$ but that shouldn’t be a problem, because they are the alpha and omega. Points outside $[a,b]$ have no contribution to the value of this function. For some obvious reason we call $[a,b]$ the closure of $(a,b)$. In general, given a real-valued function $f$, we call the closure of the set of points where $f(x) \ne 0$ the support of $f$. As you can tell, the support of $\phi$ is $[a,b]$.

If $\phi$ has unbounded support, then we may need to discuss limits at infinity. But we don’t want improper integrals at all. Hence the support of $\phi$ is always assumed to be a closed and bounded subset of $\mathbb{R}$. It is closed because it is defined to be a closure. These closed and bounded sets are called compact sets. If you are not familiar with topology, it is OK at this moment to think of compact sets as bounded closed intervals $[a,b]$.

The test function space $\mathscr{D}$ is defined to be the set of all $C^\infty$ functions with compact support. This is indeed a vector space and the verification is a good exercise in both linear algebra and calculus. What about $\mathscr{D}'$? Here we demonstrate how things are extended.
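To make test functions concrete, here is a small sketch. It assumes the standard bump $\exp(-1/((x-a)(b-x)))$ on $(a,b)$ as the model test function (one common choice among many), and the helper names `bump`, `add` and `scale` are mine. It shows that the bump vanishes at and outside the endpoints, and that a linear combination of two bumps still vanishes outside a bounded set, which is the part of the vector space check that concerns supports.

```python
import math

def bump(x, a=0.0, b=1.0):
    """Smooth bump: exp(-1/((x-a)(b-x))) inside (a, b), zero elsewhere."""
    if x <= a or x >= b:
        return 0.0
    return math.exp(-1.0 / ((x - a) * (b - x)))

def add(u, v):
    """Pointwise sum of two functions."""
    return lambda x: u(x) + v(x)

def scale(c, u):
    """Pointwise scalar multiple of a function."""
    return lambda x: c * u(x)

phi = lambda x: bump(x, 0.0, 1.0)               # support [0, 1]
psi = lambda x: bump(x, 2.0, 5.0)               # support [2, 5]
chi = add(scale(3.0, phi), scale(-2.0, psi))    # support inside [0, 1] U [2, 5]

for x in (-1.0, 0.0, 0.5, 1.0, 1.5, 3.5, 5.0, 6.0):
    print(x, phi(x), psi(x), chi(x))
```

Smoothness at the endpoints is the delicate part and is not checked here; it is the usual calculus exercise about $e^{-1/t}$.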

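For each $f \in C$ (which contains $C^\infty$), we have a functional (a functional is a linear map from a vector space to its base field, which here is $\mathbb{R}$. Nothing special, just a different name that has been used by mathematicians for decades!)

$$\Lambda_f:\mathscr{D}\to\mathbb{R},\quad \Lambda_f(\phi)=\int_{-\infty}^{\infty}f(x)\phi(x)\,dx.$$

This functional is bounded for all $\phi \in \mathscr{D}$ because if $\phi$ has support $K$, then

$$|\Lambda_f(\phi)|=\left|\int_K f(x)\phi(x)\,dx\right|\le \sup_{x\in K}|\phi(x)|\int_K |f(x)|\,dx.$$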

A continuous function on a compact set is always bounded (proof), hence the integral on the right hand side is always finite. If it could be infinite, a lot of problems would follow.

In general, a bounded linear functional $\Lambda:\mathscr{D} \to \mathbb{R}$ is called a distribution, and these functionals form $\mathscr{D}'$ exactly. Since every continuous function $f$ gives rise to a unique bounded functional $\Lambda_f$, we consider $C$ as a subspace of $\mathscr{D}'$. The converse is not generally true: a distribution need not come from a function. But we pretend it does (we pretend the functional gives rise to a function anyway), which makes our study easier, hence the name generalised function is well-deserved.

Distribution derivative

The differential operator $D$ on $C^\infty$ should be extended to $\mathscr{D}'$ naturally. There are many ways to extend a linear map. For example, the identity map $i:\mathbb{R} \to \mathbb{R}$ can be extended to $\mathbb{R}^2$ in at least two ways:

  1. $I:\mathbb{R}^2 \to \mathbb{R}^2$​ by $(x,y) \mapsto (x,y)$​.
  2. $\pi:\mathbb{R}^2 \to \mathbb{R}$​ by $(x,y) \mapsto x$​.

The restriction of these two maps on $\mathbb{R}$ is the same as $i$.

But if we extend $D$​​ in several ways, things would be messy. Originally derivative is defined in the sense of limit, but for a non-differentiable function, we cannot do that. We need an extension that makes most sense: it is by validating integration by parts. It seems like we are developing some advanced concepts, but still we need to make use of elementary ones.

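For $f(x)=\sin{x}$ and $\phi \in \mathscr{D}$, we have

$$\int_{-\infty}^{\infty}f'(x)\phi(x)\,dx=\int_{-\infty}^{\infty}\cos{x}\,\phi(x)\,dx=\Bigl[\sin{x}\,\phi(x)\Bigr]_{-\infty}^{\infty}-\int_{-\infty}^{\infty}\sin{x}\,\phi'(x)\,dx=-\int_{-\infty}^{\infty}\sin{x}\,\phi'(x)\,dx.$$

The boundary term vanishes because $\phi$ has compact support.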

The derivative has been moved from $f$ onto $\phi$. Again we are using integration by parts. If $f$ is not assumed to be differentiable, we pretend it is, skip the body and jump to the result immediately. For example, $f(x)=|x|$ is not differentiable at $0$, but we do that anyway:

$$\int_{-\infty}^{\infty}f'(x)\phi(x)\,dx:=-\int_{-\infty}^{\infty}|x|\,\phi'(x)\,dx.$$
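For $f(x)=|x|$ the "pretend" derivative can be tested numerically. Splitting the integral at $0$ and integrating by parts on each half shows that $-\int |x|\phi'(x)dx=\int \operatorname{sgn}(x)\phi(x)dx$, so the generalised derivative of $|x|$ acts like the sign function. The sketch below checks this for one sample test function; the bump on $(-1,2)$, the midpoint rule and all helper names are assumptions of the sketch.

```python
import math

A, B = -1.0, 2.0   # support of the sample test function (illustrative choice)

def phi(x):
    """Bump exp(-1/((x-A)(B-x))) on (A, B), zero elsewhere."""
    if x <= A or x >= B:
        return 0.0
    return math.exp(-1.0 / ((x - A) * (B - x)))

def dphi(x):
    """Exact derivative of phi, obtained from the chain rule."""
    if x <= A or x >= B:
        return 0.0
    q = (x - A) * (B - x)
    return phi(x) * (A + B - 2.0 * x) / q**2

def integrate(g, n=30_000):
    """Midpoint rule on [A, B]; x = 0 is a cell edge, never a sample point."""
    h = (B - A) / n
    return h * sum(g(A + (i + 0.5) * h) for i in range(n))

sign = lambda x: 1.0 if x > 0 else -1.0

lhs = -integrate(lambda x: abs(x) * dphi(x))   # "derivative of |x|" paired with phi
rhs = integrate(lambda x: sign(x) * phi(x))    # sgn paired with phi
print(lhs, rhs)
```

The two printed numbers should agree up to quadrature error, even though $|x|$ has no classical derivative at the origin.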

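In general for $f \in C^\infty$, we have (this can be verified by some computation)

$$\int_{-\infty}^{\infty}f^{(k)}(x)\phi(x)\,dx=(-1)^k\int_{-\infty}^{\infty}f(x)\phi^{(k)}(x)\,dx.$$

Differentiation for distributions (on top of $C^\infty$ functions) should be in the same shape, hence we define the $k$-th distribution derivative of a distribution $\Lambda$ by

$$(D^k\Lambda)(\phi)=(-1)^k\Lambda(\phi^{(k)}).$$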

Since all $\phi$ are assumed to be $C^\infty$, there is no problem with this formula and this differentiation is defined for all $\Lambda$. We don’t care about the limit of difference quotients of a continuous but not differentiable function. What matters here is the differentiation of test functions.
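The definition translates almost literally into code if a distribution is modelled as a function that eats a test function and returns a number. The following is a rough sketch under several assumptions of mine: test functions are plain Python callables supported in $[-1,2]$, their derivatives are approximated by central differences, and integrals by the midpoint rule. The names `from_function`, `derivative_of` and `D` are hypothetical helpers, not standard API.

```python
import math
from typing import Callable

TestFn = Callable[[float], float]
Distribution = Callable[[TestFn], float]

A, B = -1.0, 2.0   # all sample test functions are supported in [A, B]

def phi(x: float) -> float:
    """Sample test function: a bump on (A, B)."""
    if x <= A or x >= B:
        return 0.0
    return math.exp(-1.0 / ((x - A) * (B - x)))

def integrate(g: TestFn, n: int = 20_000) -> float:
    """Midpoint rule on [A, B]."""
    h = (B - A) / n
    return h * sum(g(A + (i + 0.5) * h) for i in range(n))

def from_function(f: TestFn) -> Distribution:
    """Lambda_f: test -> integral of f * test."""
    return lambda test: integrate(lambda x: f(x) * test(x))

def derivative_of(test: TestFn, h: float = 1e-4) -> TestFn:
    """Central-difference stand-in for the exact derivative of a test function."""
    return lambda x: (test(x + h) - test(x - h)) / (2.0 * h)

def D(Lam: Distribution, k: int = 1) -> Distribution:
    """k-th distribution derivative: (D^k Lam)(test) = (-1)^k Lam(test^(k))."""
    def DkLam(test: TestFn) -> float:
        t = test
        for _ in range(k):
            t = derivative_of(t)
        return (-1) ** k * Lam(t)
    return DkLam

sin_dist = from_function(math.sin)
cos_dist = from_function(math.cos)
print(D(sin_dist)(phi), cos_dist(phi))   # both approximate the pairing of cos with phi
```

Note that `D` never differentiates the distribution itself; it only ever touches the test function, which is the whole point of the definition.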

Why integration by parts

Try to recall what you have learnt about integration by parts. We have

$$\int u\,dv=uv-\int v\,du$$

because

$$(uv)'=u'v+uv'.$$

Therefore, if our generalisation of differentiation (though we do not know how to do it yet) respects integration by parts, then we can still work with the product rule of differentiation, hence the usual formal rules of calculus survive. If our extension conflicts with integration by parts, then the ordinary meaning of differentiation is damaged.


Let’s sum up what has happened. We have obtained an inclusion

$$C^\infty \subset C \subset \mathscr{D}'.$$

Every distribution is infinitely differentiable because functions in $\mathscr{D}$ are. If $f \in C^\infty$, then the $k$-th derivative can be understood in both the sense of ordinary differentiation and the sense of distributions because it is given by

$$(D^k\Lambda_f)(\phi)=(-1)^k\int f\phi^{(k)}=\int f^{(k)}\phi=\Lambda_{f^{(k)}}(\phi).$$

This holds for every choice of $\phi$, and it pins the derivative down: if $h$ is a continuous function such that $\int h\phi = \int f^{(k)}\phi$ for all $\phi \in \mathscr{D}$, then $h=f^{(k)}$.

If $f$ is merely continuous, still we can write the $k$-th derivative as

$$(D^k\Lambda_f)(\phi)=(-1)^k\int f\phi^{(k)}.$$

At this point, whether $f$ is differentiable or not is not of our concern. Since $\phi$ is smooth, the formula above is well-defined. In general we don’t even care whether $f$ is continuous or integrable on all of $\mathbb{R}$, as long as it gives rise to a bounded linear functional, which is guaranteed when $f$ is locally integrable. A function is locally integrable if $\int_K |f|<\infty$ for all compact $K \subset \mathbb{R}$. In particular, $K$ can be taken to be any bounded closed interval. As long as $f$ is locally integrable (for example, differentiable, continuous, or simply bounded and measurable), we can assign a derivative in the new sense (integration by parts).

Product rule of differentiation

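We want something like $(fg)'=f'g+fg'$. To avoid confusion we use $D$ to denote the derivative of a distribution and $f'$ to denote the derivative in the ordinary sense. In full generality this is pretty hard, but for the product of a $C^\infty$ function and a distribution it is not that hard. Suppose $\Lambda \in \mathscr{D}'$ and $f \in C^\infty$. We define their ‘product’ by

$$(f\Lambda)(\phi)=\Lambda(f\phi).$$

We have another distribution and its derivative follows in a natural way:

$$D(f\Lambda)(\phi)=-(f\Lambda)(\phi')=-\Lambda(f\phi').$$

Meanwhile

$$(f'\Lambda+fD\Lambda)(\phi)=\Lambda(f'\phi)-\Lambda\bigl((f\phi)'\bigr)=\Lambda(f'\phi)-\Lambda(f'\phi+f\phi')=-\Lambda(f\phi').$$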

Things still work in this aspect.
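The product rule $D(f\Lambda)=f'\Lambda+fD\Lambda$ can be checked on a concrete pair. The sketch below takes $\Lambda$ to be evaluation at $0$ (the Dirac distribution, formally introduced in the next section) and $f=\exp$; both choices, the central-difference derivative and the helper names `times`, `D` and `deriv` are assumptions made for illustration.

```python
import math

def phi(x):
    """Sample test function: a bump on (-1, 2)."""
    if x <= -1.0 or x >= 2.0:
        return 0.0
    return math.exp(-1.0 / ((x + 1.0) * (2.0 - x)))

def deriv(g, h=1e-4):
    """Central-difference approximation of g'."""
    return lambda x: (g(x + h) - g(x - h)) / (2.0 * h)

def D(Lam):
    """Distribution derivative: (D Lam)(test) = -Lam(test')."""
    return lambda test: -Lam(deriv(test))

def times(f, Lam):
    """Product of a smooth function and a distribution: (f Lam)(test) = Lam(f * test)."""
    return lambda test: Lam(lambda x: f(x) * test(x))

evaluate_at_zero = lambda test: test(0.0)   # the sample distribution
f, df = math.exp, math.exp                  # f and its ordinary derivative

lhs = D(times(f, evaluate_at_zero))(phi)
rhs = times(df, evaluate_at_zero)(phi) + times(f, D(evaluate_at_zero))(phi)
print(lhs, rhs)
```

Both sides reduce to $-f(0)\phi'(0)$, matching the computation above with $\Lambda(\psi)=\psi(0)$.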

We haven’t verified convergence yet, but that requires much more knowledge of functional analysis, so we don’t do that here but in part 2. Fortunately, things go in an intuitive way.

Dirac $\delta$ distribution

Consider the linear functional on $\mathscr{D}$ given by

$$\delta(\phi)=\phi(0).$$

This is bounded and is in fact our rigorous definition of the Dirac $\delta$ function (von Neumann can relax then!). It does have the required property. Say, if we realise this functional as an integral (informally) as

$$\int_{-\infty}^{\infty}\delta(x)\phi(x)\,dx=\phi(0),$$

then $\delta$ can indeed be considered as a function whose support is the origin, and the integral over $\mathbb{R}$ is $1$.
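One way to make the informal integral feel less mysterious is to replace $\delta$ by honest functions that concentrate at the origin. The sketch below uses Gaussians of shrinking width $\varepsilon$ (a standard choice that I am assuming here, not something from the definition) and watches $\int g_\varepsilon\phi$ approach $\phi(0)$ for one sample test function.

```python
import math

def phi(x):
    """Sample test function: a bump on (-1, 2)."""
    if x <= -1.0 or x >= 2.0:
        return 0.0
    return math.exp(-1.0 / ((x + 1.0) * (2.0 - x)))

def gaussian(x, eps):
    """Normal density with standard deviation eps; it has total integral 1."""
    return math.exp(-0.5 * (x / eps) ** 2) / (eps * math.sqrt(2.0 * math.pi))

def pair(eps, n=60_000):
    """Midpoint-rule value of the integral of gaussian(., eps) * phi over [-1, 2]."""
    h = 3.0 / n
    return h * sum(gaussian(-1.0 + (i + 0.5) * h, eps) * phi(-1.0 + (i + 0.5) * h)
                   for i in range(n))

target = phi(0.0)   # what delta(phi) should return
for eps in (0.5, 0.1, 0.02):
    print(eps, pair(eps), target)
```

The Gaussians themselves do not converge to any function, but their pairings with test functions do converge, which is all the distribution $\delta$ records.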

The derivative of $\delta$ is well-presented as well. Note $\delta'(\phi)=-\delta(\phi')$, hence we have

$$\int_{-\infty}^{\infty}\delta'(x)\phi(x)\,dx=-\phi'(0).$$


So much for part 1. If you don’t have much background in functional analysis, then part 2 is not recommended, as you will have no idea what is going on at all. It is not feasible to make part 2 readable to more people.

Part II - Topology and Calculus - an Overview

Here we provide some basic facts about test functions and distributions, assuming the reader has some background in functional analysis. No proof is delivered because if I did, this post could be as long as I want. I hope that by organising facts here I can help you realise what is going on before you drown yourself in the details of a proof.

Topology

In brief, test functions are smooth functions with compact support. By the support of a function we mean the closure of the set $\{x:f(x) \ne 0\}$. Let $K$ be a compact set in $\mathbb{R}$, then $\mathscr{D}_K$ denotes the subspace of $C^\infty$ consisting of functions whose support lies in $K$. Since a closed subset of a compact set is itself compact, we see all functions in $\mathscr{D}_K$ have compact support.

Test function space is defined by

$$\mathscr{D}=\bigcup_{K}\mathscr{D}_K,\quad K\subset\mathbb{R}\ \text{compact}.$$

And the distribution space $\mathscr{D}'$ is defined to be the dual space of $\mathscr{D}$, i.e. the space of continuous linear functionals on $\mathscr{D}$. But if we don’t know the topology of $\mathscr{D}$, we cannot proceed. Here is how we attempt to establish the topology.

Topology attempt 1 - norm

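Consider the norms on $\mathscr{D}$ defined for $N=0,1,2,\cdots$ by

$$\|\phi\|_N=\max\{|D^n\phi(x)|:x\in\mathbb{R},\ 0\le n\le N\}.$$

These induce a local base

$$V_N=\left\{\phi\in\mathscr{D}:\|\phi\|_N<\frac{1}{N}\right\},\quad N=1,2,\cdots.$$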

And we get a locally convex metrisable topology on $\mathscr{D}$.

If this topology made $\mathscr{D}$ complete (that is, a Fréchet space), it would be fantastic: a lot of Banach space techniques would carry over. However, this topology is not complete. One simply needs to consider this sequence:

$$\psi_m(x)=\phi(x-1)+\frac{1}{2}\phi(x-2)+\cdots+\frac{1}{m}\phi(x-m),$$

where $\phi \in \mathscr{D}_{[0,1]}$ and $\phi>0$ on $(0,1)$. This sequence is Cauchy but the limit has no bounded support hence does not lie in $\mathscr{D}$.
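Here is a small numerical look at this phenomenon. I am assuming the sequence is the standard one, $\psi_m(x)=\sum_{k=1}^{m}\frac{1}{k}\phi(x-k)$, with $\phi$ the bump $\exp(-1/(x(1-x)))$ on $(0,1)$; both are choices of this sketch. Since the translates have essentially disjoint supports, the sup-norm distance between $\psi_m$ and $\psi_n$ (for $m>n$) should be $\sup\phi/(n+1)$, which tends to $0$, while the support of $\psi_m$ keeps growing.

```python
import math

def phi(x):
    """Bump supported in [0, 1], positive on (0, 1)."""
    if x <= 0.0 or x >= 1.0:
        return 0.0
    return math.exp(-1.0 / (x * (1.0 - x)))

def psi(m, x):
    """psi_m(x) = sum over k = 1..m of phi(x - k) / k."""
    return sum(phi(x - k) / k for k in range(1, m + 1))

def sup_distance(m, n, step=0.001):
    """Grid estimate of sup |psi_m - psi_n| over [0, max(m, n) + 1]."""
    top = max(m, n) + 1
    points = int(round(top / step)) + 1
    return max(abs(psi(m, i * step) - psi(n, i * step)) for i in range(points))

sup_phi = phi(0.5)   # the bump peaks at the midpoint of (0, 1)
for n in (1, 5, 10):
    print(n, sup_distance(2 * n, n), sup_phi / (n + 1))

# psi_m is nonzero near x = m + 0.5, so no bounded set contains every support.
print(psi(10, 10.5) > 0.0, psi(10, 11.5) == 0.0)
```

Only the $N=0$ norm is examined; the same disjoint-support argument handles every $\|\cdot\|_N$.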

Topology attempt 2 - enhancement

This time we enhance the previous topology to make $\mathscr{D}$ a locally convex topological vector space which is complete and has the Heine-Borel property (every closed and bounded set is compact, and vice versa). We still need the topology defined in our first attempt. The construction is broken into three steps:

  1. For each compact set $K$, let $\tau_K$ denote the subspace topology on $\mathscr{D}_K$ inherited from the topology defined in attempt 1.
  2. Let $\beta$ be the collection of all convex balanced sets $W \subset \mathscr{D}$ such that $\mathscr{D}_K \cap W \in \tau_K$ for all compact $K$. (A set $W$ is balanced if $\alpha{W} \subset W$ for all $|\alpha| \le 1$.)
  3. The new topology $\tau$ is defined to be the collection of all unions of sets of the form $\phi + W$ with $\phi \in \mathscr{D}$ and $W \in \beta$.

This is the topology we want, and one can indeed verify that $\tau$ is a topology, with local base $\beta$. This topology has the following properties:

  1. $\tau$ makes $\mathscr{D}$ a locally convex topological vector space.
  2. $\mathscr{D}$ has the Heine-Borel property.
  3. In $\mathscr{D}$, every Cauchy sequence converges.

Locally, the topology that $\mathscr{D}_K$ inherits from $\tau$ is the same as $\tau_K$. Hence we can still use properties of these norms if we want. In fact, this $\tau_K$ makes $\mathscr{D}_K$ a Fréchet space, i.e. a locally convex space whose topology is induced by a complete translation-invariant metric.

Continuity and category

We cannot discuss continuity without topology. But still continuity has to be treated carefully. For example the space $L^p([0,1])$ with $0<p<1$ is weird: its dual space is trivial, due to its topology: the only open convex sets are the empty set and the space itself. Fortunately we have the following, which is quite intuitive.

Suppose $\Lambda$ is a linear mapping of $\mathscr{D}$ into a locally convex space $Y$ (which can be $\mathbb{R}$, $\mathbb{C}$ or $\mathscr{D}$ itself). Then the following are equivalent:

  1. $\Lambda$ is continuous. (We care about the behaviour of $\mathscr{D}'$)
  2. $\Lambda$ is bounded. (You must have learnt the equivalence of 1 and 2 already)
  3. $\phi_i \to 0$ in $\mathscr{D}$ implies $\Lambda\phi_i \to 0$ in $Y$.
  4. The restriction of $\Lambda$ to every $\mathscr{D}_K$ is continuous.

In particular, it follows that the differential operator $D^n$ is continuous for all $n$. We also have some knowledge of the behaviour of $\mathscr{D}'$ now:

If $\Lambda$ is a linear functional on $\mathscr{D}$, then the following are equivalent:

  1. $\Lambda \in \mathscr{D}'$.
  2. To every compact set $K$ there corresponds a nonnegative integer $N$ and a constant $C<\infty$ such that the inequality

$$|\Lambda\phi|\le C\|\phi\|_N$$

holds for every $\phi \in \mathscr{D}_K$.


Consider the Dirac distribution at $x$ given by

$$\delta_x(\phi)=\phi(x).$$

This is indeed a distribution. The case when $x=0$ gives us the Dirac function in physics. Note

$$\mathscr{D}_K=\bigcap_{x\notin K}\ker\delta_x.$$

Hence each $\mathscr{D}_K$ is a closed subspace of $\mathscr{D}$. Each $\mathscr{D}_K$ is also nowhere dense, and there is a countable collection of compact sets $K_i \subset \mathbb{R}$ (for example $K_i=[-i,i]$) such that $\mathscr{D} = \bigcup_i \mathscr{D}_{K_i}$, so $\mathscr{D}$ is of the first category in itself. Since $\mathscr{D}$ is complete, Baire’s Category Theorem shows that $\mathscr{D}$ is not metrisable. This is a flaw of the topology of $\mathscr{D}$, though not a troublesome one.

Calculus of distributions

We have shown that every continuous function (in particular every $C^\infty$ function) can be considered as a distribution. In general, for a function $f$ one only needs to require that $f$ is locally integrable, i.e. for every compact set $K$ we have

$$\int_K|f|<\infty.$$

If we define $\Lambda_f:\phi \mapsto \int f\phi$, we see

$$|\Lambda_f(\phi)|\le\left(\int_K|f|\right)\|\phi\|_0\quad\text{for all }\phi\in\mathscr{D}_K.$$

In particular, at the very least, all $L^1$ functions can be considered as distributions.

On the other hand, if $\mu$ is a positive measure on $\mathbb{R}$ with $\mu(K)<\infty$ for all compact $K$, then

$$\Lambda_\mu(\phi)=\int\phi\,d\mu$$

also defines a distribution.

Absolute continuity

We know the fundamental theorem of calculus in $L^1$ only holds when the function $f$ is absolutely continuous. The Cantor function $f$ is differentiable almost everywhere on $[0,1]$ but

$$\int_0^1 f'(x)\,dx=0\ne 1=f(1)-f(0).$$

This restriction still makes sense here. Pick $f$ to be a left-continuous function of bounded variation. Then it can be shown that

$$D\Lambda_f=\Lambda_\mu$$

where $\mu([a,b))=f(b)-f(a)$. Hence $D\Lambda_f=\Lambda_{Df}$ if and only if $f$ is absolutely continuous.
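The simplest instance of this is a single jump. Take $f(x)=0$ for $x\le 0$ and $f(x)=1$ for $x>0$ (a step function I am adding purely as an illustration). It is left-continuous and of bounded variation, and $\mu([a,b))=f(b)-f(a)$ is then the unit point mass at $0$, so $D\Lambda_f$ should be evaluation at $0$. The almost-everywhere derivative of $f$ is $0$, so $\Lambda_{Df}=0\ne D\Lambda_f$, consistent with $f$ not being absolutely continuous. The sketch below checks $-\int f\phi'=\phi(0)$ numerically; helper names and the sample bump are assumptions.

```python
import math

A, B = -1.0, 2.0

def phi(x):
    """Sample test function: a bump on (A, B)."""
    if x <= A or x >= B:
        return 0.0
    return math.exp(-1.0 / ((x - A) * (B - x)))

def dphi(x):
    """Exact derivative of phi."""
    if x <= A or x >= B:
        return 0.0
    q = (x - A) * (B - x)
    return phi(x) * (A + B - 2.0 * x) / q**2

def step(x):
    """Left-continuous unit step: 0 for x <= 0, 1 for x > 0."""
    return 1.0 if x > 0.0 else 0.0

def integrate(g, n=30_000):
    """Midpoint rule on [A, B]; the jump at 0 sits on a cell edge."""
    h = (B - A) / n
    return h * sum(g(A + (i + 0.5) * h) for i in range(n))

distribution_derivative = -integrate(lambda x: step(x) * dphi(x))   # (D Lambda_f)(phi)
point_mass = phi(0.0)                                               # Lambda_mu(phi)
print(distribution_derivative, point_mass)
```

So the jump of size $1$ shows up in the distribution derivative as a Dirac mass, which the pointwise derivative cannot see.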

Convergence (uniform?)

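We consider the weak*-topology of $\mathscr{D}'$, in which

$$\Lambda_i\to\Lambda\ \text{in}\ \mathscr{D}'\iff\lim_{i\to\infty}\Lambda_i\phi=\Lambda\phi\ \text{for all}\ \phi\in\mathscr{D}.$$

Then fortunately this limit operator commutes with the differential operator in a natural way, which may remind you of uniform convergence. In fact, if $\Lambda\phi=\lim_{i\to\infty}\Lambda_i\phi$ exists for every $\phi\in\mathscr{D}$, then $\Lambda\in\mathscr{D}'$ and

$$D^k\Lambda_i\to D^k\Lambda\ \text{in}\ \mathscr{D}'\quad\text{for all }k.$$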

To prove this one needs the Banach-Steinhaus theorem. This completes our four requirements for distributions.
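A classical example (my choice, not from the text above) shows how different this is from pointwise convergence. The functions $f_n(x)=\sin(nx)/n$ converge to $0$ uniformly, so $\Lambda_{f_n}\to 0$ in $\mathscr{D}'$. Their derivatives $\cos(nx)$ do not converge pointwise, yet $D\Lambda_{f_n}=\Lambda_{\cos(nx)}\to 0$ in $\mathscr{D}'$, as the statement predicts. The sketch pairs both sequences with one sample test function.

```python
import math

A, B = -1.0, 2.0

def phi(x):
    """Sample test function: a bump on (A, B)."""
    if x <= A or x >= B:
        return 0.0
    return math.exp(-1.0 / ((x - A) * (B - x)))

def pair(f, n=30_000):
    """Midpoint-rule value of the integral of f * phi over [A, B]."""
    h = (B - A) / n
    return h * sum(f(A + (i + 0.5) * h) * phi(A + (i + 0.5) * h) for i in range(n))

def pair_f(k):
    """Lambda_{f_k}(phi) with f_k(x) = sin(k x) / k."""
    return pair(lambda x: math.sin(k * x) / k)

def pair_df(k):
    """Lambda_{f_k'}(phi) with f_k'(x) = cos(k x)."""
    return pair(lambda x: math.cos(k * x))

for k in (1, 10, 50, 200):
    print(k, pair_f(k), pair_df(k))
```

The oscillation of $\cos(nx)$ averages out against a smooth $\phi$, which is why the pairings shrink even though the functions themselves never settle down.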

Convolution

Convolution plays an important role in Fourier analysis, and here is how to invite distribution to the party.

Normally for two $L^1$ functions $f,g$ we define

$$(f\ast g)(x)=\int_{-\infty}^{\infty}f(y)g(x-y)\,dy.$$

We can create more symbols to make life easier:

  1. $\tau_xu(y)=u(y-x)$.
  2. $\check{u}(y)=u(-y)$.

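It follows that $(\tau_x\check{u})(y)=\check{u}(y-x)=u(x-y)$. Hence

$$(f\ast g)(x)=\int_{-\infty}^{\infty}f(y)(\tau_x\check{g})(y)\,dy=\Lambda_f(\tau_x\check{g}).$$

It shows that $g \mapsto (f \ast g)(x)$ is actually the composition of $g \mapsto \check{g}$, $\tau_x$ and $\Lambda_f$, hence a linear functional. But $\Lambda_f$ can be replaced by an arbitrary distribution, hence we define the convolution of a distribution $\Lambda \in \mathscr{D}'$ and a smooth function $\phi \in \mathscr{D}$ by

$$(\Lambda\ast\phi)(x)=\Lambda(\tau_x\check{\phi}).$$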
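The recipe "reflect, translate, then apply the distribution" is short enough to code directly. The sketch below assumes the definition $(\Lambda\ast\phi)(x)=\Lambda(\tau_x\check{\phi})$ and tries it on the Dirac distribution and its derivative, where one expects $\delta\ast\phi=\phi$ and $(D\delta)\ast\phi=\phi'$. The helper names `tau`, `check`, `convolve` and `D`, the sample bump and the central-difference derivative are all assumptions of the sketch.

```python
import math

def phi(x):
    """Sample test function: a bump on (-1, 2)."""
    if x <= -1.0 or x >= 2.0:
        return 0.0
    return math.exp(-1.0 / ((x + 1.0) * (2.0 - x)))

def dphi(x):
    """Exact derivative of phi, used only for comparison."""
    if x <= -1.0 or x >= 2.0:
        return 0.0
    q = (x + 1.0) * (2.0 - x)
    return phi(x) * (1.0 - 2.0 * x) / q**2

def tau(x, u):
    """Translation: (tau_x u)(y) = u(y - x)."""
    return lambda y: u(y - x)

def check(u):
    """Reflection: (check u)(y) = u(-y)."""
    return lambda y: u(-y)

def convolve(Lam, test):
    """(Lam * test)(x) = Lam(tau_x(check(test)))."""
    return lambda x: Lam(tau(x, check(test)))

def D(Lam, h=1e-4):
    """Distribution derivative with a central-difference derivative on the test side."""
    return lambda t: -Lam(lambda y: (t(y + h) - t(y - h)) / (2.0 * h))

delta = lambda t: t(0.0)

x0 = 0.3
print(convolve(delta, phi)(x0), phi(x0))        # delta * phi = phi
print(convolve(D(delta), phi)(x0), dphi(x0))    # (D delta) * phi = phi'
```

In this sense $\delta$ is the identity for convolution, and differentiating a test function is the same as convolving it with $D\delta$.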

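Convolution can be characterised in a natural way. In fact, for any continuous linear map $T:\mathscr{D} \to C^\infty$, if

$$\tau_xT=T\tau_x\quad\text{for all }x\in\mathbb{R},$$

then there is a unique $L \in \mathscr{D}'$ such that

$$T\phi=L\ast\phi\quad\text{for all }\phi\in\mathscr{D}.$$

As you can imagine, this setting creates a lot of potential for the Fourier transform.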

References and Further Reading