I am currently on a short break before starting a fellowship at the Center on Long Term Risk, where I’ll be investigating how work in Machine Learning interpretability can help produce safe, beneficial AI in both the short and long term.
I was most recently a Research Assistant in the Department of Computer Science at the University of Oxford, supervised by Daniel Kroening and Tom Melham. My research interests broadly revolve around making sure Machine Learning systems work well. In particular:
Given the enormous potential AI has to transform our society, I’m broadly interested in its safe development and use, in a way that benefits everyone. While my focus so far has been on the technical side of the issue, I also enjoy looking into AI policy and governance.
Master of Computer Science, 2019
1st Class Honours
University of Oxford
BA Computer Science, 2018
1st Class Honours
University of Oxford
We cannot guarantee that training datasets are representative of the distribution of inputs encountered during deployment, so we must be confident that our models do not over-rely on this assumption. To this end, we introduce a new method that identifies context-sensitive feature perturbations (e.g. shape, location, texture, colour) to the inputs of image classifiers. We produce these changes by making small adjustments to the activation values at different layers of a trained generative neural network. Perturbations at earlier layers of the generator cause coarser-grained feature changes; perturbations at later layers cause finer-grained changes. Unsurprisingly, we find that state-of-the-art classifiers are not robust to any such changes. More surprisingly, for coarse-grained feature changes we find that adversarial training against pixel-space perturbations is not just unhelpful: it is counterproductive.
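To give a concrete (and deliberately simplified) picture of the mechanism, the sketch below perturbs the activations of one generator layer via a forward hook and checks whether a classifier's prediction changes. The generator and classifier here are tiny stand-in networks, not the trained models from the paper, and the perturbation is a random shift rather than the method's context-sensitive search.

```python
import torch
import torch.nn as nn

# Toy stand-ins for a trained generator and classifier; the actual method
# uses real pre-trained models and searches for perturbations rather than
# sampling them at random.
generator = nn.Sequential(               # latent vector -> flattened "image"
    nn.Linear(8, 32), nn.ReLU(),         # earlier layer: coarser features
    nn.Linear(32, 64), nn.ReLU(),        # later layer: finer features
    nn.Linear(64, 3 * 8 * 8), nn.Tanh(),
)
classifier = nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 8, 10))

def perturb_layer(module, scale):
    """Add a small random shift to a layer's activations via a forward hook."""
    def hook(_module, _inputs, output):
        return output + scale * torch.randn_like(output)
    return module.register_forward_hook(hook)

z = torch.randn(1, 8)                    # fixed latent code
with torch.no_grad():
    clean_pred = classifier(generator(z).view(1, 3, 8, 8)).argmax(dim=1)

    # Perturb an early generator layer (index 0); in the paper's setting this
    # corresponds to coarse-grained changes such as shape or location.
    handle = perturb_layer(generator[0], scale=0.5)
    perturbed_pred = classifier(generator(z).view(1, 3, 8, 8)).argmax(dim=1)
    handle.remove()

print("prediction changed:", bool((clean_pred != perturbed_pred).item()))
```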
Policies trained via Reinforcement Learning (RL) are often needlessly complex, making them more difficult to analyse and interpret. In a run of n time steps, a policy makes n decisions about which action to take, even when only a tiny subset of these decisions delivers value over simply selecting a default action. Given a pre-trained policy, we propose a black-box method based on statistical fault localisation that ranks the states of the environment according to the importance of the decisions made in those states. We evaluate our ranking method by creating new, simpler policies that prune decisions identified as unimportant, and measuring the impact on performance. Our experimental results on a diverse set of standard benchmarks (gridworld, CartPole, Atari games) show that in some cases fewer than half of the decisions made contribute to the expected reward. We furthermore show that the decisions made in the most frequently visited states are not the most important for the expected reward.
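As an illustration of the fault-localisation flavour of this ranking, the sketch below mutates executions by overriding random subsets of decisions with the default action, records whether each run still succeeds, and scores every state with an Ochiai-style suspiciousness measure. The chain environment, the always-move-right policy, and the pass/fail criterion are hypothetical simplifications, not the benchmarks or exact procedure from the paper.

```python
import math
import random
from collections import defaultdict

# Hypothetical toy setting: a short 1-D chain the agent must traverse to a goal.
N_STATES, GOAL, DEFAULT_ACTION = 8, 7, 0      # actions: 0 = stay, 1 = move right

def policy(state):
    return 1                                   # "pre-trained" policy: always move right

def run_episode(pruned, max_steps=20):
    """Follow the policy, but take the default action in pruned states."""
    state = 0
    for _ in range(max_steps):
        action = DEFAULT_ACTION if state in pruned else policy(state)
        state = min(state + action, GOAL)
        if state == GOAL:
            return True                        # "passing" run: goal reached
    return False                               # "failing" run: goal missed

# Mutate executions: prune random subsets of decisions and record pass/fail.
stats = defaultdict(lambda: {"ep": 0, "ef": 0, "np": 0, "nf": 0})
for _ in range(2000):
    pruned = {s for s in range(N_STATES) if random.random() < 0.15}
    outcome = "p" if run_episode(pruned) else "f"
    for s in range(N_STATES):
        stats[s][("e" if s in pruned else "n") + outcome] += 1

def ochiai(c):
    """High when pruning this state's decision is correlated with failing runs."""
    denom = math.sqrt((c["ef"] + c["nf"]) * (c["ef"] + c["ep"]))
    return c["ef"] / denom if denom else 0.0

ranking = sorted(range(N_STATES), key=lambda s: ochiai(stats[s]), reverse=True)
print("states ranked by decision importance:", ranking)
```

Pruning the decisions at the bottom of such a ranking and re-measuring the reward is then the evaluation step described above.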