The ghosts in the machine

October 28th, 2020 Written by Matt Wilson

Take a close look at the image above, if you dare. You might think it’s a group of Infinity Works engineers dressed up for Halloween, but it’s not. The people in the photo aren’t actually real. That’s right, they’ve been ‘imagined’ by a machine learning model: extracted from a neural network that has been trained to contain a near-infinite array of Halloween faces.

The model was trained using a generative adversarial network (GAN). Legend has it that Ian Goodfellow, the inventor of GANs, came up with the idea while having a beer with friends in 2014. It’s based on a ‘zero-sum game’, which is a game where if one side wins, the other loses proportionately. In the case of GANs, the game is played by two neural networks. When repeated millions of times, fascinating results emerge.

A GAN consists of two networks, a generator and a discriminator, plus a set of training data. Starting with random noise, the generator creates a set of images, and the discriminator is then shown an image drawn at random from either the generator or the training set. The discriminator makes a guess as to whether that image is real (from the training set) or not (from the generator). The accuracy of the guesses is then used in a feedback loop to improve both the discriminator’s guessing skills and the generator’s image creation skills. Nobody really knows what the generator and discriminator are looking for in a GAN; it’s something of a “black box” scenario. The transformations that can be observed mid-process are often rather disturbing!
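The feedback loop described above can be sketched with a deliberately tiny example. This is a minimal illustration under stated assumptions, not how StyleGAN2 is implemented: the ‘images’ are single numbers drawn from a bell curve centred on 4, the generator and discriminator have just two parameters each, and the discriminator is shown one real and one generated sample on every step rather than a single sample chosen at random. The names train_toy_gan and sigmoid are made up for this sketch.

```python
import math
import random


def sigmoid(t):
    # Numerically stable logistic function: maps any number into (0, 1).
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def train_toy_gan(steps=1000, lr=0.05, seed=0):
    """Play the two-network game on 1-D data and return the final parameters."""
    rng = random.Random(seed)
    a, b = 1.0, 0.0   # generator:      G(z) = a * z + b
    w, c = 0.1, 0.0   # discriminator:  D(x) = sigmoid(w * x + c), "probability x is real"

    for _ in range(steps):
        real = rng.gauss(4.0, 1.0)   # a sample from the toy "training set"
        z = rng.gauss(0.0, 1.0)      # random noise fed to the generator
        fake = a * z + b

        # Discriminator step: gradient ascent on log D(real) + log(1 - D(fake)).
        d_real = sigmoid(w * real + c)
        d_fake = sigmoid(w * fake + c)
        w += lr * ((1.0 - d_real) * real - d_fake * fake)
        c += lr * ((1.0 - d_real) - d_fake)

        # Generator step: gradient ascent on log D(fake), i.e. try to fool
        # the discriminator that has just been updated.
        d_fake = sigmoid(w * fake + c)
        grad_fake = (1.0 - d_fake) * w   # derivative of log D(fake) w.r.t. fake
        a += lr * grad_fake * z
        b += lr * grad_fake

    return a, b, w, c
```

Calling train_toy_gan() and comparing b, the generator’s offset, with the real mean of 4 gives a feel for how the generator is pulled towards the training data, although with a model this crude the values may oscillate rather than settle.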

Convergence

Over many iterations, as the generator gets better at creating images, the discriminator becomes more discerning. This in turn pushes the generator to create even more life-like images when compared to the training set. Eventually, the training may trend towards convergence, an ideal state where every guess the discriminator makes has exactly a 50% chance of being correct. In other words, it can’t tell the difference between real training images and generated ones.
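To put a number on that ideal state, here is a small sketch of the score the discriminator is trying to maximise in the original GAN formulation: the average of log D(x) over real images plus the average of log(1 - D(G(z))) over generated ones. The function name discriminator_value is an assumption for illustration. When the discriminator can do no better than output 0.5 for everything, the score is -log 4.

```python
import math


def discriminator_value(d_real, d_fake):
    """Score for a discriminator given its outputs (probabilities of 'real').

    d_real: outputs for images from the training set
    d_fake: outputs for images from the generator
    """
    real_term = sum(math.log(d) for d in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - d) for d in d_fake) / len(d_fake)
    return real_term + fake_term


# A confident, correct discriminator early in training scores close to 0.
early = discriminator_value([0.9, 0.95], [0.1, 0.05])

# At convergence every output is 0.5, whatever the image.
converged = discriminator_value([0.5, 0.5], [0.5, 0.5])
```

Here converged equals -log 4 (about -1.386), and any discriminator that can still tell the two sets apart, like early, scores higher than that.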

I say ‘may’ because training GANs can be an esoteric art and there are many pitfalls along the way. These include mode collapse, where the generator ends up producing the same image (or a handful of near-identical ones) every time, and non-convergence, where the two networks never settle and the generator’s output can degrade towards random noise. Central to avoiding these are an understanding of hyperparameters, careful preparation of training data, and a lot of trial and error.

The progression of GANs since 2014 (Source: @goodfellow_ian)

The good news, though, is that training GANs is getting easier. Back in 2014, you’d be lucky to generate blurry black-and-white images 128 pixels square. Today, with the rise of Nvidia’s StyleGAN2, it’s possible to generate limitless photorealistic images 1024 pixels square. Bigger images are currently out of reach, but it’s only a matter of time. The images don’t have to be of faces either. Real-world applications range from book illustrations to stock photography. GANs can even be used to produce synthetic medical images for training diagnostic models, since real images can be expensive to obtain and may present ethical or legal challenges.

Augmentation 

The Halloween models used for this blog were generated using StyleGAN2-ada, which was released on 9 October. ADA stands for adaptive discriminator augmentation. Augmentation is an approach also used in image classification networks, and here it solves a specific problem. Previously, GANs such as StyleGAN2 required tens or hundreds of thousands of images to produce high-quality fakes. Compiling these data sets, cropping them accurately to squares (a prerequisite of StyleGAN) and training on them was a time-consuming and expensive pursuit, leaving GANs out of reach for most non-researchers.

StyleGAN2-ada works with as few as one thousand images, and sometimes even fewer. It does this by augmenting, or automatically enlarging, the training image set. In effect, each image is duplicated multiple times, with each duplicate then passed through a pipeline of randomised transformations including horizontal flips, rotations, blends and colour transformations. Amazingly, over time, these augmentations balance each other out, so the final output from the generator contains only images which resemble the original training set, even though many of the images the discriminator has seen are of a different colour, rotation or composition.
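A randomised transformation pipeline of this kind can be sketched in a few lines. This is an illustration only, not StyleGAN2-ada’s own code: the image is assumed to be a small square greyscale grid stored as a list of rows, only three transformations are shown, and the function names are made up for the example. Each transformation is applied with probability p. As the name suggests, the ‘adaptive’ part of ADA is that this probability is adjusted while training runs, which the sketch leaves out.

```python
import random


def flip_horizontal(img):
    # Mirror every row left-to-right.
    return [row[::-1] for row in img]


def rotate_90(img):
    # Rotate the grid a quarter turn clockwise.
    return [list(row) for row in zip(*img[::-1])]


def shift_brightness(img, delta):
    # Add delta to every pixel, keeping values inside 0..255.
    return [[min(255, max(0, px + delta)) for px in row] for row in img]


def augment(img, p, rng):
    """Apply each transformation independently with probability p."""
    if rng.random() < p:
        img = flip_horizontal(img)
    if rng.random() < p:
        for _ in range(rng.randint(1, 3)):   # one, two or three quarter turns
            img = rotate_90(img)
    if rng.random() < p:
        img = shift_brightness(img, rng.randint(-40, 40))
    return img


rng = random.Random(0)
original = [[0, 50], [100, 255]]
# Five differently transformed views of the same training image.
variants = [augment(original, p=0.5, rng=rng) for _ in range(5)]
```

With p set to 0 the image passes through untouched, and with p set to 1 every transformation is always applied, so p controls how heavily the training set is disguised.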

To train a GAN to create Halloween pictures, I used instagram-scraper to scrape certain tags. I then hand-selected my favourite thousand images out of more than ten thousand before aligning and cropping most of them around the face using align_images. The remainder were cropped by hand in Adobe Photoshop. Training was done using a forked StyleGAN2-ada on my local PC with an Nvidia 2080 Ti. Paperspace and Colab are also popular options, although the recommended Colab Pro is only available in the US at the time of writing. Transfer learning was used to base the initial model on the published StyleGAN2 FFHQ model. Note that although the finished images have the general style of the training set, they are unique images and do not feature faces of real people from that data.

Latent space

Once you have a trained model, there are other things you can do besides generating random images, by using the model’s latent space. The input to the model is a list of 512 numbers, which can be treated as a point in 512-dimensional space; this space is the latent space. This is useful as it means we can carry out vector operations. For example, interpolation takes the points which generated two fake images and plots a line between them. Points along that line can be used to generate images which are somewhere between the two original fake images. As an animation, it looks as though one image morphs into the other as we move along the line.
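Interpolation along a straight line is simple enough to sketch directly. The function names lerp and interpolation_path are made up for this example, and the two starting points are just random 512-number lists standing in for the inputs behind two fake images. Feeding each point on the path to a trained generator, which is not shown here, would produce the frames of the morphing animation.

```python
import random


def lerp(z0, z1, t):
    """Point a fraction t of the way along the line from z0 to z1."""
    return [(1.0 - t) * a + t * b for a, b in zip(z0, z1)]


def interpolation_path(z0, z1, frames):
    """Evenly spaced points from z0 to z1, including both ends."""
    if frames < 2:
        raise ValueError("need at least two frames")
    return [lerp(z0, z1, i / (frames - 1)) for i in range(frames)]


rng = random.Random(0)
z_start = [rng.gauss(0.0, 1.0) for _ in range(512)]   # input behind fake image A
z_end = [rng.gauss(0.0, 1.0) for _ in range(512)]     # input behind fake image B

# 30 latent points; each one would be passed to the generator in turn.
path = interpolation_path(z_start, z_end, frames=30)
```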

Projection is like a Google Image search for a trained model, where it’s possible to find an image inside the model that most closely resembles a provided real-world photo. This can have some surprising results (see below). Finally, feature vectors are discoverable directions in the latent space which can change one aspect of an image, such as a person’s expression or whether or not they are wearing glasses. Exploring the possibilities of a trained model in this way is an amazing source of creativity and ideas.
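One simple way to find such a direction, sketched here as an assumption rather than as the method used for this blog, is to take the average latent point of images that show a feature and subtract the average of those that don’t. Adding a multiple of the result to any other latent point then nudges its image towards the feature. All function names are made up for the example, and the vectors are shortened to two numbers for readability.

```python
def mean_vector(vectors):
    """Component-wise average of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def feature_direction(with_feature, without_feature):
    """Direction pointing from the 'without' group towards the 'with' group."""
    pos = mean_vector(with_feature)
    neg = mean_vector(without_feature)
    return [p - q for p, q in zip(pos, neg)]


def move_along(latent, direction, strength):
    """Shift a latent point along a direction; negative strength moves away."""
    return [x + strength * d for x, d in zip(latent, direction)]


# Toy 2-number latents, labelled by hand as 'glasses' or 'no glasses'.
glasses = [[1.0, 0.0], [3.0, 0.0]]
no_glasses = [[0.0, 0.0], [0.0, 0.0]]

direction = feature_direction(glasses, no_glasses)
edited = move_along([1.0, 1.0], direction, strength=0.5)
```

In practice the labelled examples would be full 512-number latents, and the strength would be tuned by eye while watching the generated image change.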

Projection

For the above image, I used a model trained on a dataset which was predominantly female, so it’s not surprising that the latent space only contains very rough approximations of (from left) Boris Johnson, Joe Biden and a somewhat Bill-Clinton-looking Donald Trump. More successfully, a projection of the Joker returned a great female version, and the returned image of Daenerys Targaryen is only slightly off, even though none of these people were in the training data.
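Projection can be thought of as a search: start from some latent point, generate an image, measure how far it is from the target photo, and nudge the point to shrink that distance. The sketch below shows the idea with a hypothetical stand-in generator that turns two numbers into three ‘pixels’, and a plain squared pixel difference as the distance. Real projectors work on a full trained generator and typically use a perceptual measure of similarity instead.

```python
def toy_generator(z):
    # Hypothetical stand-in for a trained generator: 2 latent numbers -> 3 "pixels".
    return [z[0], z[1], z[0] + z[1]]


def squared_error(a, b):
    """Sum of squared pixel differences between two images."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def project(target, steps=200, lr=0.1):
    """Gradient descent on the latent point to match the target image."""
    z = [0.0, 0.0]
    for _ in range(steps):
        out = toy_generator(z)
        r = [o - t for o, t in zip(out, target)]   # per-pixel error
        # Gradient of the squared error with respect to z[0] and z[1].
        grad = [2.0 * (r[0] + r[2]), 2.0 * (r[1] + r[2])]
        z = [zi - lr * g for zi, g in zip(z, grad)]
    return z


# A target the generator can produce exactly is recovered almost perfectly.
inside = toy_generator([0.5, -1.0])
z_inside = project(inside)

# A target outside the generator's range only gets a rough approximation.
outside = [1.0, 1.0, 0.0]
z_outside = project(outside)
```

The second case mirrors the results above: when the target is unlike anything the model has learned to produce, the closest match it can offer is only an approximation.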

Ethics

Of course, as with much artificial intelligence technology, there is an ethical angle to this, too. Scraping someone’s artwork, for instance, and generating images in their style is an interesting experiment, but not if you start selling it as your own. In addition, the capability of GANs to ‘deepfake’ real photographs should raise concerns. Work is under way on using machine learning to detect such fakes, but it’s likely to remain a Red Queen’s race. These are important questions and could be just the beginning of the impact of Ian Goodfellow’s beer-fuelled moment of inspiration.

Putting that aside, GANs are an incredible tool for anyone seeking to generate visual ideas. The technology is already being inserted into popular software like Photoshop. Soon, the power and capability to generate infinitely novel photorealistic images will be available to us all. And each image can express an idea or concept, from architectural plans to abstract art and photography. Your next inspiration may already be hiding somewhere in latent space, waiting to be generated, like a ghost in the machine. 

Happy Halloween from everyone at Infinity Works!
