Colin Johnson: Ideas for Research Students/MSc Projects/Postdocs

General Information

Here is a list of research ideas that I am keen to develop. Most of them are fairly broad, and projects could be found at a number of levels, from large projects for PhD students or postdoctoral research assistants to smaller projects that could be done as a summer project by MSc students on one of our courses.

I have listed a few "wild ideas" at the end of the document. These are very unformed projects that I wouldn't be comfortable supervising immediately, but which might be workable up into concrete projects.< If you have a specific idea for how to do these, or have the right kind of background knowledge, then I'm happy to talk about these./p>

I'm very happy to talk to potential research students and postdocs about either these research projects or your own ideas in these sort of areas. Please email me on if you are interested. There are sometimes funded places available for PhD study or postdoctoral research assistants, and I am always interested in working with potential PhD students and postdocs on applications for funding. I am, of course, also happy to hear from students with their own ideas in related areas or any areas related to my previous publications.

The University runs an annual PhD scholarship competition.. Postdoctoral positions are advertised as they become available on, and the University has a useful list of sources of funding for postdoctoral fellowships.

Project List

Semantics in Program Synthesis

Genetic Programming (GP) is a method of problem solving that takes its inspiration from biological evolution. Specifically, GP is concerned with evolving a population of programs (or other executable structures) to solve a specific problem. GP has been successful in a number of areas, e.g. data mining, bioinformatics, game AI, circuit design, symbolic regression, and many others.

In GP the semantics of programs are important, i.e. what programs actually do, as opposed to the syntax of programs, e.g. their text. A lot of work in GP has focused on these syntactic aspects of GP and neglected the semantic aspects. In the last few years a decent amount of work has been carried out on semantic aspects of GP, both at Kent and elsewhere. For example, the semantic effects of evolutionary operators can be studied (e.g., does the exchange of text between two programs result in a program that has aspects of the behaviour of the two parent programs?), studying the semantic diversity of the population (e.g., is the initial population semantically diverse? How does this semantic diversity change over an evolutionary run?), aspects concerned with fitness (e.g., can we define a fitness function in terms of the desired behaviour of programs and measure the distance of a program from that ideal semantics?).

There is lots of scope for exploring different aspects of this. A good starting point would be to look at the papers that I have written with my research student Lawrence Beadle (e.g. on semantic initialisation, crossover and mutation), and the work by groups in Dublin and Poznan.

Another interesting area is alternatives to bio-inspired ways of doing automated programming. Biological metaphors work up to a point, but it would be interesting to explore ideas around information-theoretic ways of program understanding.

Entropy and Mutual Information in Program Synthesis

Genetic Programming (GP) is a well known technique for the induction of computer programs and other executable structures from training dat (see the GP Field Guide for an introduction).

One problem with GP is the there is a need to measure the success of whole programs on solving complex problems. This is fine for simple programs, as even a in a randomly generated set of programs there will be some which solve a small number more training cases than others. However, this breaks down as problems become more complex the chance of a random program having some "leverage" on the whole problem gets smaller and smaller.

One potential approach to this is to redefine the idea of fitness in a new way, using ideas of entropy to capture patterns in the solution, and then to gradually "work backwards" from the desired solution to the input space.

Code Debugging via Hypothesis Formation

When we code, we debug in a certain way, which can be characterised as hypothesis-driven. That is, we realise (e.g. through testing or informal observation) that our code is not working, and then we look at outputs to hypothesise some reason why it is not working, or at least to find a pattern in the wrong outputs or unusual behaviour. We then go back into the code to find code that is likely to generate those outputs.

This offers an interesting approach for automated code fixing. Existing bug fixing algorithms, such as GenProg or ClearView work by a two stage process. Firstly, bugs are localised (i.e. an estimate of their location found), using one of many fauly localization methods. Then, some kind of machine learning method is used to manipulate the code until the current unit (in the unit testing sense) is working. This latter stage is vulnerable to multiple bugs and to the complexity of finding an effective piece of code to catch the bug.

By contrast, this project aims to use machine learning at an earlier stage, to find patterns in the faulty outputs. These will then be matched, using approximate matching algorithms, to behaviours in the code, to identify areas of the code that are generating the same kind of faulty behaviour that is found in the fault, and then to use techniques similar to GenProg etc. to fix it.

Computer Language Translation using Machine Learning

There exist many different computer languages, and it is often useful to be able to translate a program between them. For example, different mobile devices need programs in different languages, and it is easier to integrate systems together if they are written in a common language. At present, translators are written by hand, by writing a series of rules to specify how one language could be translated into another; a number of frameworks, such as Haxe, exist for this.

The aim of this project is to use machine learning methods to do this translation automatically. The system would work run the program on a large set of test data, and obtain a large number of traces through the program, using coverage methods to ensure that all of the program is covered by this set of traces. This would mean that for each statement in the original program, there would be a large number of data-pairs saying what the state of the program was before that statement was executed and after it was executed. The system would then run a machine learning method (genetic programming, the new accumulative programming method, or a "software transplant" approach) to learn appropriate code to replace that statement. This would continue through the program, translating it line-by-line. Clearly this could not make large-scale changes between programming paradigms, but it would be good for making simple translations between similar languages.

This would be evaluated by a thorough experimental study, looking at a large number of such translations of various kinds of complexity, and asssessing the quality of them in a statistically rigourous way. To do this project you would need strong programming skills, and some knowledge of machine learning, software testing and programming language theory, or a willingness to learn aspects of these areas.

A FlashFill for Database Queries: Programming by Example

Recent versions of Microsoft Excel have a feature called FlashFill, which is an example of a programming by example system. FlashFill takes a few examples of the kind of calculation that you want to do in your spreadsheet, and generalises from those examples to produce a formula that can then be applied to future examples. This is a great feature for people with little experience of programming/maths, because it is often much easier to give a few examples than to write a formula in some formal language. Of course, there are tradeoffs in terms of how confident we can be of the final result; but, for many applications, the benefits outweigh the potential problems. There are some details of FlashFill on Sumit Gulwani's page; in particular, the article Spreadsheet data Manipulation by Examples is a good overview, and there is also a nice videoshowing it working.

The aim of this project would be to build an equivalent system for database queries. Despite query languages such as SQL having been around for decades, even minimal expertise in these is unusual except for a small number of specialists. Instead, it would be good if the user could type in a small number of examples that they are familiar with, and the system would create a SELECT query (or whatever) to generalise. Consider the example of trying to find, say, all final year PhD students studying computer science from a database of all students in the university. The user would start typing the names of the students that they could remember, and after a while it would generalise and present a complete list; behind the scenes it would be constructing a SELECT statement that matched the characteristics that were common to those students and not others. We would use a similar set of abstraction and generalisation techniques to those found in FlashFill to realise this. We could even use some system that translates SQL queries into natural language explanations to explain to the user what the system is doing.

Machine Learning for Improving Numerical Analysis Code

In recent years, a number of machine learning techniques have been developed for improving code by machine learning. By improving we mean improving measurabble non-functional properties, such as execution time or energy consumption, whilst not changing the semantics of the program. Examples include the application of Genetic Programming methods (Genetic Improvement—see the GISMO project for an example) and the idea of software transplants, that is, mining a codebase for similar code and replacing and adapting that code in its new context. These methods have been very successful in a number of areas, for example improving bioinformatics software.

The aim of this project is to apply these approaches to numerical analysis software, for example the LAPACK linear algebra package. To do this project you would need excellent programming skills and a decent mathematicaly background, ideally with some knowledge of numerical analysis.

More generally, there seems to be a lot of scope for meta-learning, learning-based preprocessing, etc. in numerical analysis.

Monte-Carlo Tree Search and Genetic Algorithms

Monte Carlo Tree Search is a method for searching game-trees in computer game playing, which has currently the state-of-the-art method for evaluating move choices in programs that play games such as Go. The idea is that rather than playing forward a small number of games intelligently, the program plays a large number of games randomly to completion, and then takes an average of the performance of each set of random games that start with a particular move as a measure of the quality of that move.

A not entirely unrelated notion is that of fitness inheritance and other fitness approximation techniques in genetic algorithms. This is the idea that rather than always allocating fitness to individuals by calculating the fitness directly, we assign a fitness based on the relationship of the individual to others whose fitness has already been calculated. For example, we might use the mean fitness of the two parents in crossover as an approximation to the fitness of the child. Recent surveys survey paper by Lin and Shi and Rasheed give a general introductions to this area, and a nice paper by Barbour, Corne and McCall shows how these ideas can be applied.

One interesting idea would be to bring these together in a particular way, that is to use a form of Monte Carlo tree search as a means of doing fitness approximation. That is, we take the current population, and run a number of runs to convergence by taking random crossovers and mutations (this is computationally cheap, as in most individuals it is fitness evaluation that is most costly). The fitness of the individuals is then the fitness of the end-point of these random runs. The question that we would like to answer is whether the fitness assigned at the end of this process is a more valuable guide for selection than selection based on direct calculation of the fitness of the individuals.

Similar ideas could be applied to other methods, such as other applications of genetic algorithms, or swarm intelligence methods.

Throw and Catch as an HCI Paradigm

In recent years technologies such as the Microsoft Kinect have allowed developers to readily create gesture-based interfaces for computer systems. Some of the systems that have been created to date have been based around reproducing traditional interfaces with the mouse/keyboard replaced with some gestures. However, there are alternatives; for example, the notion of throwing/catching is an appealing paradigm for transferring information between computers in a gestural fashion.

Consider the scenario of a classroom, where students are working on laptops or tablets. A teacher asks the students to carry out a task, and then to display their work using the data projector. This is complicated at present, as the students would have to email the solution to the teacher, or to plug their own device into the projector. The alternative being presented here is that the various devices used run a gesture based system, and the student just makes some gesture at their device to "grab" the currently focused document, and then "throws" it at the screen.

The idea of this project would be to design a computer interface based on throwing and catching, and investigate how effective this is.

Search-based Mutation in Genetic Algorithms

Genetic Algorithms are a problem-solving method that takes its inspiration from biological evolution. A population of potential solutions is created, and this population is modified by processes that are similar to evolutionary change over generations, in particular mutation of solutions and crossover of information between solutions.

Traditionally, mutation has been carried out by having some representation and making a small change within that representation. Consider a problem like evolving the image of a face to match a remembered face: this is a real problem that is needed for "photofit" problems used by the police to identify criminals seen by witnesses to a crime (see e.g. the E-Fit system). The standard approach would be to have some kind of parameterised model of faces - similar to that used in character creation in computer-based role playing games - and adjust the parameters to match the memory of the witness.

An alternative would be to use the vast amount of data that is available on the web. Again, consider our example: there are huge numbers of photographs of faces on the web, could an similarity-based image-recognition system such as GazoPa be used to present "mutations" of a facial image based on real photographs rather than an artificial face?

The aim of this project would be to develop and evaluate some search-based mutation systems. This could be based on the face-based system discussed above, or another visual system, or a language-based problem, depending on the interests of the student. There is also scope for extending these ideas to crossover, though this is harder.

On-the-fly Parameter Adjustment in Genetic Algorithms and Related Methods

Genetic algorithms are a technique for finding good solutions to optimization and other problems which work by simulating the evolutionary process found in natural biological systems. One problem with these techniques is that the population can rapidly converge to a weak solution, or at the opposite end of the spectrum can fail to converge to a solution at all. This depends largely on the choice of parameters such as rates of mutation, population size, recombination rates, et cetera.

Work has been carried out at finding a priori optimal parameter settings for simple problems, however these results do not extend easily to real problems. One promising approach would be to look at on-the-fly parameter adjustment. As the algorithm runs it would accumulate statistical information about current performance, and use this information to do parameter-estimation for models of algorithm behaviour. This can then be used to adjust the parameters of the algorithm itself to control e.g. convergence rates. Another approach would be to use ideas from control theory.

Similar techniques could be applied to other meta-heuristics for optimization and related problems.

Idioms in Programming Languages

One thing that we learn when we learn to program is the idioms of a programming language. We do not use the full expressivity of a language: most of the time, when we use a particular construct in a language, we use it in one of a very restricted number of ways. Indeed, learning that we do not need to feel guilty at not using the full range of expressivity in a language is one of the marks of maturity as a programmer: your programs start to look like the sort of stuff that real programmers write, rather than just a random spray of syntactically-correct material.

One example of programming language idioms is the theory of roles of variables. This is an attempt to classify the use of variables into a small number of groups: most-recent holder, follower, gatherer et cetera. Empirical research by Sajaniemi's group in Joensuu has shown that a dozen roles can capture the vast majority of variable usage in programs. Tools have been created that, either through program analysis or machine learning that can test whether a variable is being used in a particular role, or identify the most likely role that a variable is playing.

Idioms exist at various levels. The canonical notion of a programming idiom is the for-loop in languages such as C and Java. The canonical for loop is of the form for(int i=0;i<n;i++). Whilst in theory the language allows complex expressions to be placed in the "parameters" of the loop, in practice the vast majority of loops (it would be interesting to do a proper study of this) are of this form or a minor variant.

There are a number of possible projects in this area. One area is in programming education. There is a vast amount of research about beginning to program, but there is very little about the following stage, i.e. progressing from a competence in basic programming skills to being an experienced programmer. Idioms provide one way for thinking about this and for providing tools to support more advanced programming education, for example by producing idiom-aware programming tools. There is a substantial project to be done based on creating such a system in a grounded way and evaluating it.

A second area is in the provision of debugging tools based on idioms. There are a number of tools that attempt to do generic debugging - that is, not just to identify errors with regard to a specific problem specification, but with regard to "code smells" that hint at problems with the program based on commonly-made error types. One example of a system that uses such methods is FindBugs. It would be interesting to develop such a system not based on finding negative aspects in the code but on finding deviations from positive aspects. This would involve the development and evaluation of a debugging system based on these ideas.

In a very different direction there is a lot of scope for applying ideas about idioms to program synthesis. For example, a system could be devised that uses mutations that are biassed towards idioms.

Exploring Real Genetics using the Geometric Framework

(joint project with Alberto Moraglio, University of Exeter)

The geometric framework (GF) is a formal theory of search algorithms that links solution representations, search operators, search spaces and fitness landscapes. It defines search operators using geometric shapes such as circles, segments and convex polygons specified as functions of the underlying distance of the search space. The search operators are then defined using these geometric shapes to relate the location and probability of the offspring with respect to the location of their parents. For example, uniform geometric crossover is a geometric search operator that, given two parents corresponding to the end-points of a segment, returns uniformly points in the segment as offspring. This framework reveals an important duality: the same search operator can be defined both on the representation as a procedure that manipulates parent configurations to produce offspring configurations as well as in a geometric declarative form as spatial relationship by specifying the locations and probability of the offspring with respect to the locations of their parents. For example, the uniform crossover for binary strings, which is defined on the representation, dually, it can be defined as uniform geometric crossover on the Hamming space. This duality has surprising consequences: (i) it allows us to use geometry on the search space as a formal language for a general theory of representations that encompasses the most-frequently used mutation and crossover operators across representations; (ii) it allows us to associate complex search operators with classic fitness landscapes; and (iii) it allows us to do principled design of search operators for new representations and problems that embed problem-specific knowledge in the search.

The aim of this project is to explore whether the GF can give insights to make discoveries in real genetics. This line of research is highly speculative. However, this might be possible because homologous biological recombination that aligns DNA strands on contents before genetic material exchange was shown to be a geometric crossover. Using the GF, this may allow us to determine the biological fitness landscape exactly and not to use this concept only heuristically or metaphorically as it is currently done in biology. Studying the characteristic of smoothness of this landscape might then explain, or contribute to an explanation on, why the rate of the biological adaptation due to Darwinian evolution is surprisingly high (which is an important open question). A second line of enquiry building upon the GF that can give interesting results in real genetics is about genes discovery (which is an important open problem). This is because the notion of gene can be naturally defined in geometric terms in a representation-independent way and can be formally specialized to any solution representation, including the biological one based on DNA strands. When specialized to binary stings, geometric genes correspond to schemata. It would be then interesting to see whether the genes predicted by the GF for DNA strands correspond to known real genes and if so whether they can predict unknown real genes.

Dynamic-baseline Symbolic Regression

Techniques such as genetic algorithms have some notion of move operators at their core, i.e. the idea of moving from one point in the search space to another via a mutation of the solution. One well-known problem in this area is setting the amount of mutation: how often should this operator be applied?

It could be argued that this is asking the wrong question. Rather than applying the same mutation operator more-or-less often, we should perhaps be thinking about how we can apply operators that have a variable amount of impact, depending on the fit that has been achieved so far. Basically, we don't want the mutation to make things worse than the current state - and we want to make small, focused improvements where this is appropriate.

One approach to this is specific to symbolic regression, i.e. fitting a curve to data. There are a number of genetic programming-based approaches to this. I would like to consider a new approach; rather than running a long GP algorithm to solve the problem, run a large number of small (short, small-population) GP runs, and combine the result of each with the results from the previous GP runs (by addition, multiplication or function composition). That is, each run trys to fit the difference between the current solution and the optimal solution, rather than making a complete fit.

There is a very loose analogy here with ideas of series expansion in topics like Fourier or Taylor series. The idea is that the first pass of the optimisation fits an approximate curve to the whole data set; the next pass fits some features that are missed out by the first curve; and so on, until later stages of the fit would be only affecting small local details.

One critique of this is that the resulting expressions would be very complicated. However, existing GP systems produce complicated expressions anyway; and, there might be scope for symbolic simplification of the resulting expressions.

This is particularly interesting the context of dynamic symbolic regression, where the data is changing with time and where the aim is to maintain a good model in the face of this changing data. It would be possible to use this approach to adjust the fit at various levels: e.g. to adjust the details if the changes were at the level of fine detail (we can think of this as adjusting the "higher harmonics" of the model), or to make large-scale changes if needed.

Automatically Contextualising Webpages

We sometimes want to contextualise the information in a web search. An example of context is the age of the information on a page. For example, we might want to get articles that were written at the beginning of some phenomenon. The canonical information retrieval/semantic web way to do this would be through the use of metadata. However, this isn't very useful for this, as web-pages are rarely tagged with such meta-data. However, there is a lot of information available that is tagged, e.g. newspaper articles. The aim would be use some idea of similarity between tagged and untagged pages to add this contextual information to a page. For example, we might search for the use of particular names or slang terms that were prominent during a particular span of months or years to add time-context to a page. The techniques used for this would primarily be machine learning methods.

A second broad set of ideas in a similar direction would be adding affecive tagging to webpages, to allow a search that is contextualised by the "attitude" of the web-page writer. For example, we might want to search for a negative review of some product, or to find pages that are "informative" rather than "gossipy" about a celebrity. Again, machine learning and text mining approaches would be the main technique that would be used for this.

Breaking Audio CAPTCHAs by Learning Voice-Like-Ness

A CAPTCHA is a test to tell humans and computers apart by presenting a task that is easy for humans to do but hard for computers. The canonical example of such a task is reading a string of letters that have been distorted in shape and presented against a complex background.

Another form of CAPTCHA is the audio CAPTCHA, where a sequence of letters or words is read out and distorted by the use of audio filters and background noise. One way to break such a CAPTCHA would be to learn a sequence of filters that remove the noise and reverse the effects of the audio filters. But, to learn such a sequence we need to objective function to decide when one sequence of filters is better than another.

One approach to this would be to learn a measure of "voice-like-ness" from data. A large number of audio samples, distorted and noised using various different techniques and to various extents, would be created. Then, these would be rated by the idea of how close they are to clear speech. A learning method would be used to learn a regression model that can then be used to take a sound sample and predict how "voice-like" it is (essentially, how undistorted it is). This would then be used as the objective function in learning the sequence of filters; the more "voice-like" the current example is, the higher that sequence would be scored. Once a very "voice-like" example was created, a voice recognition algorithm could be applied to extract the text content.

Similar approaches could be taken to text CAPTCHAs with the idea of "print-like-ness".

Coverage Algorithms

Metaheuristics are general computational problem-solving methods. Examples would be search techniques such as genetic algoithms and tabu search. To apply a metaheuristic, the problem at hand needs to be phrased in terms of a general problem type. The canonical example of a problem type is optimization, i.e. defining an objective function and then applying the search technique to find a good value for that objective function.

Research in metaheuristics has focused on exploring new problem solving methods; a neglected area is the definition of new generic problem types. One example with a lot of promise is the idea of coverage, which is the idea of finding one example for each class in some classification.

An example helps to explain this. Consider web search or a similar information retrieval problem. One problem with such a search is that a query that references several topics will often produce a search result that is dominated, in some cases completely, by one particular topic. It would be interesting to have a search that grouped the results, showing a few examples of each of the topics referred to by that query.

This problem of finding a diverse set of examples relative to a qualitative classification has the potential to be a good example of a new generic problem type for metaheuristics. Examples of its application can readily be found in information retrieval, security, mathematics, computational biology, and design.

There are some techniques that are similar to this. For example, there are some similarities with unsupervised clustering, multimodal optimization, machine learning of classifications, and some approaches in computational creativity. However, none of these quite capture the same idea.

There is lots of scope for the development of an interesting tool and applying it to a number of topics. In particular, a demonstration that a single tools could work fairly well on a number of problems might be more interesting than applying these ideas to a single problem. That said, specific examples in e.g. web search and computer-aided drug discovery are probably worthy substantial projects in their own right. These ideas are unpacked a little more in a paper that I wrote a few years ago, a more recent paper is less focused on this specific problem but is aimed at giving an introduction to the general idea of metaheuristic design.

Wild Idea: Affect-inspired Computation

As humans living and acting in the world, we have a number of moods and emotional states that influence how we think and reason about the world. Some of these are purely side-effects of other processes---such as the lack of ability to focus when we are tired---but, others have a positive effect on a particular kind of cognitive task, for example the focus of attention when we are in a frightened situation. It is possible that these effects of mood on thinking might have evolved so as to generate the right sort of "cognitive mood" for a particular situation.

I wonder if we could exploit this to create an "affect-inspired" reasoning system. That is, we would have some kind of reasoning system that had a number of different modes of reasoning, and that these states were switched between depending on the current "emotion" of the system, which would depend on certain criteria.

There are a a couple of areas that have some reltationship to this: affective computing, which attempts to create systems that can work in a way that is aware of the the emotional state of the users; and, the various attempts that have be made to simulate emotions and the physical correlates thereof in systems such as neural networks, with the aim of understanding human emotion. However, this is a different proposal: here we would aim to use broad caricatures of these emotional states as a way of providing the inspiration for a better computational system for some problem. The role of affect in such a system would be similar to the rle played by biology in bio-inspired computation.