Bob's Favourite Papers
On this page I have collected my favourite papers and given an
explanation of
their key ideas and significance. I have arranged the papers under the
headings
of Machine Learning, Signal Processing
and Miscellaneous.
Machine Learning
My work in machine learning has covered learning theory,
online learning algorithms and kernel
machines.
Learning Theory
Peter L.
Bartlett, Philip M. Long and Robert C. Williamson, ``Fat-Shattering
and the Learnability of Real-Valued Functions'' Journal of
Computer and System Sciences, 52(3), 434-452, (1996).
This paper provided the first necessary and sufficient conditions for
learnability of real-valued functions. Previously it was known that
finiteness of the Pollard pseudo-dimension was sufficient, but it was
also known not to be necessary. The key idea of the proof of the main
result was inspired by ideas of dithering in analog-to-digital
converters used in audio. Fat-shattering was introduced to the machine
learning community by Alon et al., but it actually shows up in work by
Tihomirov in the early sixties. A distribution-dependent version was
also used by Talagrand around the time the present paper was written.
Wee Sun Lee, Peter L. Bartlett and Robert C. Williamson,
``Efficient
Agnostic Learning of Neural Networks with Bounded Fan-in'' IEEE
Transactions on Information Theory, 42(6), 2118-2132,
(1996).
This was the first (I believe) paper to give a COLT-like proof of the
efficient learnability of a certain class of neural networks. As with
most COLT-style proofs, the constants are ridiculous and useless...
Wee Sun Lee, Peter L. Bartlett and Robert C. Williamson, ``The
Importance of Convexity in Learning with Squared Loss''
IEEE Transactions on Information Theory 44(5),
1974-1980, 1998.
The usual bounds of learning theory provide a convergence rate
dominated by 1/sqrt{n} in the agnostic setting (whereas one can obtain
1/n in the realisable case). Somewhat amazingly, when using squared
loss for regression, if the class of functions is convex, then one can
obtain a faster rate. Yet another example of the general rule: geometry
matters! The following paper extends the idea in a more esoteric manner
and, more importantly, points out that the lower bound claimed in the
present paper is false.
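The rate contrast can be stated schematically as follows (constants and log factors omitted; this is a summary of the standard picture, not a statement of the paper's precise theorem):

```latex
% Excess risk of an empirical minimiser \hat f_n over a class \mathcal{F}:
\[
  \mathbb{E}\,L(\hat f_n) - \inf_{f\in\mathcal{F}} L(f) =
  \begin{cases}
    O\!\left(n^{-1/2}\right) & \text{agnostic setting, general } \mathcal{F},\\[2pt]
    O\!\left(n^{-1}\right)   & \text{realisable case, or convex } \mathcal{F}
                               \text{ with squared loss.}
  \end{cases}
\]
```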
Shahar Mendelson and Robert C. Williamson, Agnostic
Learning Nonconvex Function Classes, pages 1-13, in J. Kivinen
and R.H. Sloan (Eds.), Computational Learning Theory: 15th Annual
Conference on Computational Learning Theory (COLT 2002), Sydney,
Australia, July 8-10, 2002, Proceedings, Lecture Notes in Artificial
Intelligence 2375.
This builds on the convexity result of the previous paper and shows
that what really matters is how far (in a certain special sense) one's
target function is from the set of non-uniqueness points of the function
class. Convex function classes have no non-uniqueness points (every
function has a unique best approximation in the class). It also shows
that the sort of lower bound claimed in the previous paper cannot be
true.
Robert C. Williamson, John Shawe-Taylor, Bernhard Schölkopf and
Alex J. Smola, Sample
Based Generalization Bounds, accepted subject to revision to IEEE
Transactions on Information Theory, November 1999.
Shows how empirical covering numbers can be used to bound the
performance of learning machines. The significance is twofold: the
empirical covering numbers are easier to compute, and they can reflect
the complexity of the task actually being considered (rather than an a
priori bound, which is inevitably worst case).
This paper (which was accepted subject to changes) will probably
never be formally published. The changes requested would have been a
lot of work, but none of the authors felt they would actually add much
value. A shorter conference version was published (see P112 in
publications list).
Erik Weyer, Iven M.Y. Mareels, and Robert C. Williamson, ``Sample
Complexity of Stochastic Least Squares System Identification'', IEEE
Transactions on Automatic Control, 44(7), 1370-1383 (1999)
This was the first work to study the finite sample complexity of linear
system identification. Historically only asymptotic results were
obtained. (The paper was submitted in 1995 and was actually based in part
on results in a CDC paper from 1992. Thus from the time of doing the work
(1992) to publication was 7 years!)
John Shawe-Taylor, Peter L. Bartlett, Robert C. Williamson and
Martin Anthony, ``Structural
Risk Minimization over Data-Dependent Hierarchies'', IEEE
Transactions on Information Theory, 44(5), 1926-1940
(1998).
This paper (which to my surprise is one of my most highly cited) was
the first COLT-like analysis of the performance of the maximum margin
algorithm. The reason for the difficulty in analysing such algorithms
is that the usual machinery presumes there is a fixed class
of functions from which the algorithm draws an hypothesis. In the MM
case (and others), the function class actually depends on the data. Now
you simply cannot sweep this under the carpet: you have to use the
information in the data only once. The paper introduced the notion of luckiness,
which is a way of thinking about how good your sample is. The following
paper provides a more powerful and general framework.
Ralf
Herbrich and Robert C. Williamson, Algorithmic
Luckiness, Journal of Machine
Learning Research 3, 175-212 (2002).
The traditional methods of statistical learning theory analyse learning
algorithms only in terms of the class of functions from
which the
algorithm draws its hypotheses. The luckiness framework (previous
paper)
introduced a method by which performance on the sample received
could
be captured, but it still essentially used the device of considering
the
class of functions from which the (data-dependent)
hypotheses were
drawn. In contrast, the present paper, for the first time, allowed the
use of
the traditional machinery of statistical learning theory in a manner
that
allowed the use of properties of the algorithm beyond merely the class
of
functions from which the hypothesis is drawn. The paper formally
subsumes the
previous luckiness framework, and in addition allows the analysis of
sparsity
and algorithmic stability. Finally, the new machinery is applied to the
analysis of the maximum margin algorithm to obtain a new performance
bound
which interestingly combines three key factors: margin, sparsity, and
degree of
clustering of the data.
Kim L. Blackmore, Robert C. Williamson and Iven M.Y. Mareels, ``Decision
Region Approximation'', IEEE Transactions on Information Theory,
43(3), 903-907, 1997.
Most work in Learning Theory focusses on the sample error in the
learning process. One also has to deal with the approximation error.
The present paper is one of the very few that deals with
approximation by the level sets of a function, which is what is
usually done for classification learning.
Online Learning Algorithms
Online learning algorithms are perhaps the most widely used since they
can deal with arbitrarily large amounts of data and can be used in real
time applications. The first three papers below focus on their use for
nonlinearly parametrized classes of functions. The last two papers
focus on the geometry of online learning.
Kim L. Blackmore, Iven M.Y. Mareels, and Robert C.
Williamson, ``Learning
Nonlinearly Parametrized Decision Regions''
Summary in Journal of Mathematical Systems, Estimation and Control
, 6(1), 129-132 (1996). Full version published electronically.
This paper presents a careful mathematical analysis of the behaviour of
stochastic gradient descent for classification learning when the
function class is nonlinearly parametrised.
This is a good example of the problems of picking the right
journal! The paper was submitted in 1993 but was not published until
1996. And the journal became defunct in 1998...
Kim L. Blackmore, Robert C. Williamson and Iven M.Y.
Mareels, ``Local
Minima and Attractors at Infinity of Gradient Descent Learning
Algorithms,'' Summary in Journal of Mathematical Systems,
Estimation and Control, 6(2), pages 231-234, (1996). Full
version was published electronically.
Many things can go wrong with online learning when the functions are
nonlinearly parametrised. The algorithms can get "stuck" in local
minima. But they can also fail even when they are not stuck by heading
towards an attractor at infinity.
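A toy one-dimensional illustration of the second failure mode (my own hedged sketch, not the decision-region setting of the paper): gradient descent on logistic loss with separable data has no finite minimiser, so the iterate heads off to an "attractor at infinity" while the loss keeps decreasing.

```python
import numpy as np

# Toy illustration: minimise logistic loss on two separable points.
# The loss has no finite minimiser; gradient descent drives w towards
# an "attractor at infinity" (w -> +inf) while the loss keeps shrinking.
X = np.array([-1.0, 1.0])   # inputs
y = np.array([-1.0, 1.0])   # labels; data is separable by any w > 0

def loss(w):
    return np.mean(np.log1p(np.exp(-y * w * X)))

def grad(w):
    s = 1.0 / (1.0 + np.exp(y * w * X))   # sigmoid(-y * w * x)
    return np.mean(-y * X * s)

w = 0.1
for _ in range(5000):
    w -= 0.5 * grad(w)
# w is now large and still growing; the loss is near zero but no
# finite stationary point exists.
```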
Kim L. Blackmore, Robert C. Williamson, Iven M.Y. Mareels,
William A. Sethares, ``On-line
Learning via Congregational Gradient Descent'', Mathematics of
Control, Signals and Systems, 10(4), 331-363, 1997.
Genetic algorithms were often held up as superior to gradient
descent because they do not get stuck in local minima. This
comparison always struck me as unfair, as the gradient descent
algorithms only maintain a single hypothesis. A fairer comparison is
with a variant of gradient descent which maintains several candidate
hypotheses and regularly culls the weak ones. The intricacy of the
analysis of the present paper arises from the use of stochastic
gradient descent algorithms and the subtlety of figuring out when to
cull.
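The basic idea can be sketched as follows (a toy illustration on a nonconvex one-dimensional fitting problem of my own devising, not the paper's algorithm or analysis): run gradient descent on several hypotheses in parallel and periodically cull the weakest, respawning perturbed copies of the best.

```python
import numpy as np

# "Congregational" gradient descent sketch: a population of hypotheses,
# each following gradient descent, with periodic culling of the weak.
rng = np.random.default_rng(0)
xs = np.linspace(-5, 5, 201)             # fixed evaluation grid

def model(w, x):
    return np.exp(-0.5 * (x - w) ** 2)   # bump centred at w

target = model(2.0, xs)                  # true parameter w* = 2

def loss(w):
    return np.mean((model(w, xs) - target) ** 2)

def grad(w):
    r = model(w, xs) - target
    return np.mean(2 * r * model(w, xs) * (xs - w))

pop = rng.uniform(-4.0, 4.0, size=8)     # congregation of hypotheses
for step in range(400):
    pop = pop - 1.0 * np.array([grad(w) for w in pop])
    if step % 50 == 49:                  # cull the weak, clone the strong
        order = np.argsort([loss(w) for w in pop])
        best = pop[order[:4]]
        pop = np.concatenate([best, best[0] + rng.normal(0, 0.5, 4)])

best_w = min(pop, key=loss)
```

A single gradient-descent run started far from w* = 2 stalls on the flat part of the loss surface; the congregation survives because at least one member starts (or is respawned) inside the basin of the global minimum.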
A followup paper was prepared but
never published. One result in it that might be of broader interest is
in section 7 and concerns the density of learning problems.
Robert E. Mahony and Robert C. Williamson, "Prior
Knowledge and Preferential Structures in Learning Algorithms," Journal
of Machine Learning Research, 1, 311-355, 2001. (See the
final version on the JMLR web page.)
Studies the geometry of prior knowledge for online learning algorithms.
Shows how one can define a prior for stochastic gradient descent. One
can derive the Exponentiated Gradient descent algorithm from this
framework. The framework is being extended; see P174 in publication
list.
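The Exponentiated Gradient update itself (shown here in the standard Kivinen-Warmuth form for online linear regression on the simplex, not via the paper's preferential-structure derivation) is a one-line multiplicative step:

```python
import numpy as np

# Exponentiated Gradient (EG) for online linear regression with the
# weight vector confined to the probability simplex: a multiplicative
# update followed by renormalisation, in contrast with the additive
# update of plain stochastic gradient descent.
rng = np.random.default_rng(1)
d, eta = 5, 0.5
u = np.array([0.6, 0.3, 0.1, 0.0, 0.0])   # sparse comparator on the simplex
w = np.full(d, 1.0 / d)                    # uniform starting point

for _ in range(2000):
    x = rng.uniform(0, 1, d)
    y = u @ x                              # noiseless target
    yhat = w @ x
    g = 2 * (yhat - y) * x                 # gradient of squared loss
    w = w * np.exp(-eta * g)               # multiplicative EG step
    w = w / w.sum()                        # renormalise onto the simplex
```

The multiplicative form is what makes EG well suited to sparse comparators: weights of irrelevant components decay geometrically towards zero.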
Kernel Machines
Bernhard Schölkopf, Alex Smola, Robert Williamson and Peter
Bartlett, ``New
Support Vector Algorithms,''
Neural Computation 12(5), 1207-1245, May 2000.
This paper introduced the nu-trick (hence "new"; geddit?), which is a
nicer parameterization to use for SVMs. The nu parametrisation is now
widely used, and this paper is highly cited (117 citations according to
Google Scholar in March 2005).
Bernhard Schölkopf, John C. Platt, John Shawe-Taylor, Robert C.
Williamson and Alex J. Smola, Estimating
the Support of a High-Dimensional Distribution, Microsoft
technical report MSR-TR-99-87. Slightly abridged version in Neural
Computation 13(7), 1443-1471, 2001.
Showed how to use kernel machines for novelty detection or one-class
classification. Previous algorithms in the statistical literature are
all limited to low dimensional input spaces. By utilising the nu-trick
(see previous paper) a very nice knob to control the percentage of
outliers is provided.
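This algorithm is available as `OneClassSVM` in scikit-learn; a minimal sketch of the nu knob in action (the data and parameter values here are my own illustrative choices):

```python
import numpy as np
from sklearn.svm import OneClassSVM

# One-class SVM novelty detection: nu upper-bounds the fraction of
# training points treated as outliers (and lower-bounds the fraction
# of support vectors).
rng = np.random.default_rng(0)
X = rng.normal(0, 1, size=(500, 2))            # "normal" training data

clf = OneClassSVM(kernel="rbf", gamma=0.5, nu=0.05).fit(X)
labels = clf.predict(X)                        # +1 inlier, -1 outlier
outlier_frac = np.mean(labels == -1)           # roughly at most nu

novel = clf.predict(np.array([[6.0, 6.0]]))    # point far from the data
```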
Robert C. Williamson, Alex Smola and Bernhard Schölkopf, ``Generalization
Performance of Regularization Networks and Support Vector Machines via
Entropy Numbers of Compact Operators,'' IEEE Transactions on
Information Theory, 47(6), 2516-2532, 2001.
This paper showed (in a COLT setting) how the kernel of a SVM affects
its generalisation performance. The paper used the machinery of entropy
numbers of linear operators. A follow-up
paper provides a more explicit calculation of the bounds which is
easier to interpret.
Signal Processing
Brian C. Lovell and Robert C. Williamson, The
Statistical Performance of Some Instantaneous Frequency Estimators, IEEE
Transactions on Signal Processing, 40, pages 1708-1723,
(July 1992).
Brian C. Lovell, Robert C. Williamson and Boualem Boashash, The
Relationship Between Instantaneous Frequency and Time Frequency
Representations, IEEE Transactions on Signal Processing,
41, pages 1458-1461 (1993).
These two papers were written whilst I was a student along with Brian
Lovell who was also a student at the time. They show that the complex
and intricate schemes for estimating instantaneous frequency using the
Wigner-Ville distribution are, when done properly, nothing more than
filtered finite differences of the phase of the analytic signal. "Doing
it properly" simply means using the right formula for the moments of a
random variable defined on a circle.
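The estimator described above reduces, in its rawest form, to a finite difference of the unwrapped phase of the analytic signal (a minimal sketch; the papers smooth/filter these raw differences):

```python
import numpy as np
from scipy.signal import hilbert

# Instantaneous frequency as finite differences of the phase of the
# analytic signal. For a pure tone the estimate should sit at the
# tone frequency (away from the edge effects of the Hilbert transform).
fs = 1000.0                        # sample rate (Hz)
t = np.arange(0, 1, 1 / fs)
f0 = 50.0
x = np.cos(2 * np.pi * f0 * t)     # pure tone: true IF is 50 Hz

z = hilbert(x)                                  # analytic signal
phase = np.unwrap(np.angle(z))
inst_freq = np.diff(phase) * fs / (2 * np.pi)   # phase differences -> Hz
```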
Darren Ward, Rodney A. Kennedy and Robert C. Williamson, "The Theory
of Broadband Sensor Arrays with Frequency Invariant Far-Field Beam
Patterns" Journal of the Acoustical Society of America 97(2),
1023-1034, (February 1995)
Shows how to design a beamformer (e.g. a microphone array) with a
beampattern that is invariant with frequency (over wide frequency
ranges).
Simon I. Hill and Robert C. Williamson, Convergence
of Exponentiated Gradient Algorithms, IEEE Transactions on
Signal Processing, 49(6), 1208-1215, June 2001.
This paper analysed the Exponentiated Gradient Descent Algorithm,
familiar in the Machine Learning community, using the standard
machinery of signal processing.
Paul D. Teal, Robert C. Williamson and Rodney A. Kennedy, "Error
Performance of a Channel of Known Impulse Response", Acoustics,
Speech, and Signal Processing, 2000. ICASSP '00. Proceedings. 2000 IEEE
International Conference on , Volume: 5 , 2000 Page(s): 2733 -2736.
Classical textbooks on communication theory separately consider the
error performance of a channel and its equalisation. It turns out that
for a fixed signal-to-noise ratio, some channels are much harder than
others. This has significant implications for mobile communications when
the channel is rapidly varying.
Biljana Radlovic, Robert C. Williamson and Rodney A.
Kennedy, "Equalization
in an Acoustic Reverberant Environment: Robustness Results," IEEE
Transactions on Speech and Audio Processing, 8(3),
311-319, May 2000.
Reverberation is a standard problem in audio transduction. (Humans
remove reverberation remarkably well; see why.) The present
paper shows that linear equalisation is intrinsically
non-robust. Even supposing you perfectly removed the reverberation by
equalisation, if the source just moves a few centimetres, you will be
worse off than if you had not equalised at all. Interestingly, the
cocktail party effect (more specifically, demixing convolutive mixtures)
is a harder problem than equalisation, so it too will be very
non-robust, but that does not stop lots of people writing papers and
doing experiments on simulated data!
Richard K. Martin, William A. Sethares, and Robert C. Williamson, "Exploiting
Sparsity in Adaptive Filters," IEEE transactions on Signal
Processing, 50(8), 1883-1893, August 2002.
By exploiting and simplifying work with Mahony on prior knowledge in
online learning, we showed how to tune adaptive filtering algorithms to
exploit sparsity of the target in a very simple but effective way.
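The flavour of such sparsity-aware adaptation can be sketched with a proportionate-style NLMS update, where each tap's step size grows with its current magnitude so adaptation energy concentrates on the few active taps (an illustration of the general idea under my own parameter choices, not the paper's exact algorithm):

```python
import numpy as np

# Proportionate-style NLMS: per-tap gains proportional to |w_i| speed
# up identification of a sparse target filter h.
rng = np.random.default_rng(0)
N = 64
h = np.zeros(N)
h[3], h[17] = 1.0, -0.5                       # sparse target filter
w = np.zeros(N)                               # adaptive filter
mu, eps = 0.5, 1e-3

x = rng.normal(size=5000)                     # white input signal
for n in range(N, len(x)):
    u = x[n - N:n][::-1]                      # regressor vector
    e = h @ u - w @ u                         # a-priori error (noiseless)
    g = np.abs(w) + eps                       # per-tap proportionate gains
    g = g / g.sum()
    w += mu * e * (g * u) / (u @ (g * u) + eps)

misalignment = np.linalg.norm(w - h) / np.linalg.norm(h)
```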
Darren B. Ward, Eric A. Lehmann and Robert C. Williamson,
Particle
Filtering Algorithms for Tracking an Acoustic Source in a Reverberant
Environment, IEEE Transactions on Speech and Audio Processing,
11(6), 826-836, November 2003.
Shows how to use particle filtering combined with beamforming to track
an acoustic source in a real reverberant room. As far as I am aware
this was the first published work demonstrating reliable tracking in a
typical reverberant office.
Miscellaneous
Robert C. Williamson and Tom Downs, ``Probabilistic Arithmetic:
Numerical Methods for Calculating Convolutions and Dependency Bounds,''
International Journal of Approximate Reasoning, 4(2),
pages 89-158 (1990).
The main contribution of my PhD thesis, and my only paper that filled an
entire issue of a journal! It presented numerical algorithms for
computing bounds on the probability distribution of elementary
functions of random variables when only their marginal distributions
are known.
The algorithms developed are now available in commercial software
from RAMAS Software
(see Risk Calc, Probability Bounds Analysis) and seem to be widely cited.
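The core computation can be sketched directly: the best-possible pointwise bounds on the distribution of a sum given only the marginals (the Frank-Nelsen-Schweizer/Makarov dependency bounds), evaluated here by brute-force grid search rather than the paper's efficient numerical algorithms.

```python
import numpy as np
from scipy.special import ndtr   # vectorised standard normal CDF

# Bounds on the CDF of Z = X + Y when only the marginal CDFs F and G
# are known, valid over all possible joint distributions:
#   lower(z) = sup_x max(F(x) + G(z - x) - 1, 0)
#   upper(z) = inf_x min(F(x) + G(z - x), 1)
xs = np.linspace(-8.0, 8.0, 2001)            # search grid over x

def sum_cdf_bounds(z, F, G):
    s = F(xs) + G(z - xs)
    lower = max(float(np.max(s)) - 1.0, 0.0)
    upper = min(float(np.min(s)), 1.0)
    return lower, upper

# X, Y standard normal with unknown dependence; bound P(X + Y <= 1).
lo, hi = sum_cdf_bounds(1.0, ndtr, ndtr)
indep = float(ndtr(1.0 / np.sqrt(2.0)))      # the independent case
```

The independent case (Z ~ N(0, 2)) necessarily lies between the bounds; the bounds themselves are attained by suitably chosen joint distributions, which is why they cannot be tightened without dependence information.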
I do not have an electronic version of the paper, but you can
look at my
thesis (without pictures). (Here is a scanned copy of my thesis
with pictures (12M)) Note that I never published part 2 of the
paper (bad idea to call a paper part 1 without a definite plan to make
part 2!). Chapter 4 of my thesis (connection with other ideas) was what
I had in mind for part 2...
Robert C. Williamson, ``The Law of Large Numbers for Fuzzy
Variables under a General Triangular Norm Extension Principle,'' Fuzzy
Sets and Systems, 41, 55-81, (1991).
A paper arising from my PhD thesis. Apart from the technical
contribution (which incidentally contains an error), there is a remark,
which I believe was novel at the time, explaining how the calculus of
operations on fuzzy numbers can strictly be interpreted within
the theory of probability. The key is that one needs to consider the
marginals only, and compute a resultant distribution which is the worst
case over all possible joint distributions. It is very often said that
fuzzy numbers are different to random variables because the manner by
which they are combined is so different. The truth of the matter is
that one can fully recover fuzzy sets from probability (!). A (very
long!) argument along these lines comprises section 4.5 of my PhD
thesis.
I do not have an electronic version of the paper, but you can
look at my
thesis (without pictures) which contains everything in the paper.
Here is a scanned copy of
my thesis with pictures (12M)