Building and Training Neural Networks in Emacs Lisp
Yes, you read that correctly. This blog post showcases nn.el, my
(zero dependency) emacs lisp package that allows you to build and
train neural networks. To answer the obvious question of “why on
God’s green earth would you write a machine learning framework in
emacs lisp?”, the answer is really just for fun (and to cause a bit
of a splash; I mean, emacs lisp is the last thing you’d associate
with AI right now). This particular flavour of lisp is mostly used
to configure emacs, but you can actually do much more, and that
includes training neural networks. If you’re unfamiliar with neural
networks, the section Primer on Neural Networks below gives a quick
overview of neural networks and how they’re trained; there are also
vast resources available online on the topic.
Introducing nn.el
nn.el makes use of dynamic modules, that is, compiled libraries that
expose extra functionality to emacs lisp programs. I’ve written some
dynamic modules in C that provide basic functions to manipulate
matrices, e.g. matrix multiplication, as well as some of the
mathematical functions typically used by neural networks, e.g.
softmax and ReLU.
The package consists of the following libraries (a short usage sketch follows the list):
matrix.c
- This library (i.e. a dynamic emacs module) provides some basic functions to manipulate matrices (vectors of vectors in emacs lisp). This includes addition, subtraction, transpose, scalar multiplication, matrix multiplication, etc.
ops.c
- This library provides some mathematical functions and operations typically used in machine learning. This includes operations like ReLU, softmax, etc.
nn.el
- This is the main neural network library, written in emacs lisp. It uses the previous libraries to provide functionality to build neural networks, compute the output of a neural network for a given input, and train a neural network.
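To give a feel for these primitives, here is a small sketch that only uses functions appearing later in this post; the commented results indicate the intended shapes and values rather than the exact printed output, so treat it as illustrative:

;; Matrices are vectors of vectors; a column vector is an n x 1 matrix.
(matrix-transpose [[1 2 3]])                  ; a 3 x 1 column vector [[1] [2] [3]]
(matrix-random 3 1)                           ; a random 3 x 1 matrix
(matrix-scalar-mul 2.0 [[1.0] [2.0]])         ; [[2.0] [4.0]]
(matrix-subtract [[3.0] [3.0]] [[1.0] [2.0]]) ; [[2.0] [1.0]]
(ops-relu [[-1.0] [2.0]])                     ; elementwise max(0, x): [[0.0] [2.0]]
(ops-softmax [[1.0] [2.0] [3.0]])             ; a 3 x 1 probability distribution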
The package has zero dependencies besides emacs. The libraries
matrix.c and ops.c are written in C so that we can calculate
outputs and gradients reasonably fast. They have no external
dependencies (e.g. no python process is ever invoked!).
The code is available here: https://github.com/namilus/nn.
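If you want to try it yourself, the setup is roughly the following, assuming you have cloned the repository and compiled matrix.c and ops.c into dynamic modules that emacs can find (the path below is hypothetical, and the repository may ship its own build setup):

;; Hypothetical path to a local clone of the nn repository; the
;; compiled dynamic modules are assumed to sit next to the .el files.
(add-to-list 'load-path "~/src/nn")
(require 'nn)      ; the emacs lisp front end
(require 'matrix)  ; dynamic module built from matrix.c
(require 'ops)     ; dynamic module built from ops.c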
Building Neural Networks with nn.el
nn.el has a simple API for building neural networks, based around
TensorFlow’s Sequential API. Below is an example:
(require 'nn)
(require 'matrix)
(require 'ops)

(setq model `(,(nn-layer 100 80 'ops-relu)
              ,(nn-layer 80 8 'ops-relu)
              ,(nn-layer 8 3)))
The above creates a 3 layer neural network that takes an input
\(\textbf{x} \in \mathbb{R}^{100}\) and outputs a vector in
\(\mathbb{R}^3\). The first layer has 80 neurons, the second has 8,
and the final layer outputs just 3. In total, this particular model
has 8755 parameters, but nn.el allows you to build networks of
arbitrary depth and width. The weights and biases of the model are
initialised randomly from the uniform distribution \(U(-1,1)\).
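As a sanity check on that number: a layer going from \(d_{i-1}\) inputs to \(d_i\) outputs has \(d_{i-1} \times d_i\) weights plus \(d_i\) biases, so we can tally the total with a throwaway bit of plain emacs lisp (nothing from nn.el is needed here):

;; (inputs . outputs) for each of the three layers above.
(let ((dims '((100 . 80) (80 . 8) (8 . 3))))
  (apply #'+ (mapcar (lambda (d) (+ (* (car d) (cdr d)) (cdr d))) dims)))
;; => 8755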
Next, let’s see how to calculate the output of the model on an input:
(setq x (matrix-random 100 1))
(setq y (matrix-transpose [[1 0 0]]))
(ops-softmax (nn-forward-layers x model))
[[0.2600744539110965] [7.306195822330744e-07] [0.7399248154693213]]
Here our \(\textbf{x}\) is a random vector, and the label is \([1 \ 0 \ 0]^T\). The output of our model on \(\textbf{x}\), given above, is not the same as our label; in fact, it is quite a bit off. Let’s calculate the loss between the model’s output and the label to see just how far off it is:
(let* ((s (ops-softmax (nn-forward-layers x model)))
       (loss (nn-crossentropy y s)))
  (message "Initial loss: %s" loss))
Initial loss: [[1.346787327763626]]
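This value can be checked by hand if we assume the usual cross entropy definition (a sum of negative log probabilities weighted by the label; see the primer section below). Since \(\textbf{y}\) is one-hot with the 1 in the first position, the loss reduces to the negative log of the first softmax output:

\[\ell(\textbf{x}, \textbf{y}) = -\log(0.2600744\ldots) \approx 1.346787,\]

which matches the value printed above.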
Training Neural Networks with nn.el
The initial loss value is high, so we’d like to train our model to
better predict the label for our input. nn.el provides functionality
to calculate a model’s gradients for a given batch of training
examples. This is done with the nn-gradient function. We can use
this to implement gradient descent and train our model:
(defun nn--apply-gradient-sgd-layer (grad layer)
  "Apply GRAD to LAYER using gradient descent."
  (let ((w (nth 0 layer)) (b (nth 1 layer))
        (wg (nth 0 grad)) (bg (nth 1 grad)))
    ;; W := W - 0.01 * dW and b := b - 0.01 * db, i.e. a gradient
    ;; descent update with learning rate 0.01.
    (setf (nth 0 layer) (matrix-subtract w (matrix-scalar-mul 0.01 wg)))
    (setf (nth 1 layer) (matrix-subtract b (matrix-scalar-mul 0.01 bg)))))

(dotimes (counter 100)
  ;; This is the train step; repeat it as many times as needed.
  (let ((grads (nn-gradient x y model)))
    (seq-mapn #'nn--apply-gradient-sgd-layer grads model)))
The above code performs 100 gradient descent steps, tweaking the
parameters of the model each time in order to improve its output on
input x. Now let’s see what the model outputs:
(ops-softmax (nn-forward-layers x model))
[[0.9990508146033623] [1.9576943624708923e-08] [0.00094916581969408]]
Wow! That’s much closer to our desired label than before! And it was reasonably fast too! Let’s see what the loss function returns:
(let* ((s (ops-softmax (nn-forward-layers x model)))
       (loss (nn-crossentropy y s)))
  (message "After training loss: %s" loss))
After training loss: [[0.0009496361583564819]]
Almost zero! Exactly what we wanted.
Conclusion
We’ve now successfully built and trained a neural network with nn.el
on some (albeit random) data. There’s nothing stopping us now from
training on real, much larger datasets (e.g. the MNIST database of
handwritten digits). I’ll save that for a future blog post, but the
main functionality of nn.el has been showcased. You probably won’t be
able to run an AI startup using it, but it is funny to say that your
text editor can train neural networks. A longer blog post will follow
that will explain in some more detail how nn.el works. The code is
up on GitHub for anyone who is interested.
Primer on Neural Networks
What follows is a high level overview of neural networks and how they’re trained. If you’re already familiar, feel free to skim this section or skip ahead.
Neural networks consist of several layers of fully connected neurons. The output of an \(L\)-layer fully connected neural network with parameters consisting of weights and biases \(\textbf{W}_1, \textbf{b}_1, \textbf{W}_2, \textbf{b}_2, ..., \textbf{W}_L, \textbf{b}_L\) (each \(\textbf{W}_i \in \mathbb{R}^{d_{i-1} \times d_i}\), \(\textbf{b}_i \in \mathbb{R}^{d_i}\)) is given by \( \textbf{z}_L = \textbf{W}_L^T \textbf{a}_{L-1} + \textbf{b}_L\), where \(\textbf{a}_i = h(\textbf{z}_i)\), \(\textbf{z}_i = \textbf{W}_i^T\textbf{a}_{i-1} + \textbf{b}_i\), and \(h(x) = \mbox{max}(0,x)\) is the activation function (ReLU in this case).
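As a concrete instance, the 3 layer model built earlier with nn.el has \(d_0 = 100\), \(d_1 = 80\), \(d_2 = 8\), and \(d_3 = 3\), so its parameters have the shapes

\[\textbf{W}_1 \in \mathbb{R}^{100 \times 80}, \ \textbf{b}_1 \in \mathbb{R}^{80}, \quad \textbf{W}_2 \in \mathbb{R}^{80 \times 8}, \ \textbf{b}_2 \in \mathbb{R}^{8}, \quad \textbf{W}_3 \in \mathbb{R}^{8 \times 3}, \ \textbf{b}_3 \in \mathbb{R}^{3},\]

and the output \(\textbf{z}_3 = \textbf{W}_3^T \textbf{a}_2 + \textbf{b}_3\) is a vector in \(\mathbb{R}^3\), which is exactly the shape of the model outputs we saw above.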
Let \(n = d_L\) denote the number of classes, and \(\textbf{a}_0 = \textbf{x} \in \mathbb{R}^{d_0}\) i.e., the input to the neural network. A training example for a neural network is a pair of input and label \((\textbf{x}, \textbf{y})\), where the input has \(d_0\) features i.e., \(\textbf{x} \in \mathbb{R}^{d_0}\), and we want the neural network to classify this input as one of \(n\) classes. Typically, the label \(\textbf{y}\) is a vector of \(n\) elements, all zeros except for the true class, which is a one. These types of vectors are called one-hot vectors because they have a 1 at the index which encodes the label. For example, the label \(\textbf{y} = [0 \ 1 \ 0 \ 0]^T \in \mathbb{R}^4\) says that the corresponding input is labelled as the second class. These classes can be anything; e.g., they can encode what is in an image \(\textbf{x}\), where \(\textbf{x}\) is a vector of the pixel values of the image. In next word prediction models like ChatGPT, these vectors encode what the next word should be, given the previous words, encoded as a vector \(\textbf{x}\).
Training Neural Networks
Training a neural network on a training example \((\textbf{x}, \textbf{y})\) involves tweaking all the weights and biases \(\textbf{W}_i, \textbf{b}_i\) so that the output of the model on the input \(\textbf{x}\) matches the label \(\textbf{y}\). In order to do this, we first need a measure of how far off the model output is from the label. This is called the loss function, and cross entropy is typically used. Cross entropy loss measures how different two probability distributions are. In our case, the two distributions are the true label and the softmaxed model output. Softmax is a function that converts a vector into a probability distribution. More formally, the per-example cross entropy loss is written \(\ell(\textbf{x}, \textbf{y})\), and given a batch of training examples \( B = \{(\textbf{x}^{(1)},\textbf{y}^{(1)}), ..., (\textbf{x}^{(b)},\textbf{y}^{(b)})\}\) of size \(b\), the average loss is given by
\begin{equation} \mathcal{L}(B) = \frac{1}{b} \sum_{(\textbf{x},\textbf{y}) \in B} \ell(\textbf{x}, \textbf{y}) \end{equation}

Training neural networks is done using gradient descent. This is an algorithm that allows us to find what values of \(\textbf{W}_i\) and \(\textbf{b}_i\) minimise the loss function i.e., which values for the parameters result in an output that matches the label. Gradient descent works as follows:
- Calculate the derivative of the loss with respect to the parameters (i.e., the gradients) \(\nabla_{\textbf{W}_i} \mathcal{L}(B)\) and \(\nabla_{\textbf{b}_i} \mathcal{L}(B)\).
- Update the parameters using the gradients, \[\textbf{W}_i := \textbf{W}_i - \eta\nabla_{\textbf{W}_i} \mathcal{L}(B)\] \[\textbf{b}_i := \textbf{b}_i - \eta\nabla_{\textbf{b}_i} \mathcal{L}(B)\]
The variable \(\eta\) is the learning rate, and dictates how fast we want to descend down the gradient. Typically, a value less than 1 is chosen; this is exactly the update that nn--apply-gradient-sgd-layer performs above, with \(\eta = 0.01\). Gradient descent repeats steps 1 and 2 iteratively until the loss function returns a value that is sufficiently small, i.e. close to zero.
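For reference, the softmax and per-example cross entropy loss used above are usually defined as follows (written here with natural logarithms, which is consistent with the loss values reported earlier):

\[\mbox{softmax}(\textbf{z})_j = \frac{e^{z_j}}{\sum_{k=1}^{n} e^{z_k}}, \qquad \ell(\textbf{x}, \textbf{y}) = -\sum_{j=1}^{n} y_j \log\left(\mbox{softmax}(\textbf{z}_L)_j\right),\]

so for a one-hot label the loss is simply the negative log of the probability the model assigns to the true class.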