have long been used to solve optimization problems in closed form, but gradient
descent was not introduced as a technique for iteratively approximating the solution
to optimization problems until the nineteenth century (Cauchy, 1847).
Beginning in the 1940s, these function approximation techniques were used to
motivate machine learning models such as the perceptron. However, the earliest models were linear. Critics including Marvin Minsky pointed out
several of the flaws of the linear model family, such as its inability to learn the
XOR function, which led to a backlash against the entire neural network approach.
Learning nonlinear functions required the development of a multilayer perceptron and a means of computing the gradient through such a model. Efficient
applications of the chain rule based on dynamic programming began to appear
in the 1960s and 1970s, mostly for control applications (Kelley, 1960; Bryson and
Denham, 1961; Dreyfus, 1962; Bryson and Ho, 1969; Dreyfus, 1973) but also for
sensitivity analysis (Linnainmaa, 1976). Werbos (1981) proposed applying these
techniques to training artificial neural networks. The idea was finally developed
in practice after being independently rediscovered in different ways (LeCun, 1985;
Parker, 1985; Rumelhart et al., 1986a). The book Parallel Distributed Processing presented the results of some of the first successful experiments with
back-propagation in a chapter (Rumelhart et al., 1986b) that contributed greatly
to the popularization of back-propagation and initiated a very active period of research in multilayer neural networks. The ideas put forward by the authors of that book, particularly by Rumelhart and Hinton, go well beyond back-propagation.
They include crucial ideas about the possible computational implementation of
several central aspects of cognition and learning, which came under the name
“connectionism” because of the importance this school of thought places on the
connections between neurons as the locus of learning and memory. In particular,
these ideas include the notion of distributed representation (Hinton et al., 1986).
Following the success of back-propagation, neural network research gained pop-
ularity and reached a peak in the early 1990s. Afterwards, other machine learning
techniques became more popular until the modern deep learning renaissance that
began in 2006.
The core ideas behind modern feedforward networks have not changed substantially since the 1980s. The same back-propagation algorithm and the same
approaches to gradient descent are still in use. Most of the improvement in neural
network performance from 1986 to 2015 can be attributed to two factors. First,
larger datasets have reduced the degree to which statistical generalization is a
challenge for neural networks. Second, neural networks have become much larger,
because of more powerful computers and better software infrastructure. A small number of algorithmic changes have also improved the performance of neural networks noticeably.