Here is a list of research ideas that I am keen to develop. Most of them are fairly broad, and projects could be found at a number of levels, from large projects for PhD students or postdoctoral research assistants to smaller projects that could be done as a summer project by MSc students on one of our courses.
I have listed a few "wild ideas" at the end of the document. These are very unformed projects that I wouldn't be comfortable supervising immediately, but which might be workable up into concrete projects.
I'm very happy to talk to potential research students and postdocs about either these research projects or your own ideas in these sort of areas. Please email me on C.G.Johnson@kent.ac.uk if you are interested. There are sometimes funded places available for PhD study or postdoctoral research assistants, and I am always interested in working with potential PhD students and postdocs on applications for funding. I am, of course, also happy to hear from students with their own ideas in related areas or any areas related to my previous publications.
Genetic Programming (GP) is a method of problem solving that takes its inspiration from biological evolution. Specifically, GP is concerned with evolving a population of programs (or other executable structures) to solve a specific problem. GP has been successful in a number of areas, e.g. data mining, bioinformatics, game AI, circuit design, symbolic regression, and many others.
In GP the semantics of programs are important, i.e. what programs actually do, as opposed to the syntax of programs, e.g. their text. A lot of work in GP has focused on these syntactic aspects of GP and neglected the semantic aspects. In the last few years a decent amount of work has been carried out on semantic aspects of GP, both at Kent and elsewhere. For example, the semantic effects of evolutionary operators can be studied (e.g., does the exchange of text between two programs result in a program that has aspects of the behaviour of the two parent programs?), studying the semantic diversity of the population (e.g., is the initial population semantically diverse? How does this semantic diversity change over an evolutionary run?), aspects concerned with fitness (e.g., can we define a fitness function in terms of the desired behaviour of programs and measure the distance of a program from that ideal semantics?).
There is lots of scope for exploring different aspects of this. A good starting point would be to look at the papers that I have written with my research student Lawrence Beadle (e.g. on semantic initialisation, crossover and mutation), and the work by groups in Dublin and Poznan.
Monte Carlo Tree Search is a method for searching game-trees in computer game playing, which has currently the state-of-the-art method for evaluating move choices in programs that play games such as Go. The idea is that rather than playing forward a small number of games intelligently, the program plays a large number of games randomly to completion, and then takes an average of the performance of each set of random games that start with a particular move as a measure of the quality of that move.
A not entirely unrelated notion is that of fitness inheritance and other fitness approximation techniques in genetic algorithms. This is the idea that rather than always allocating fitness to individuals by calculating the fitness directly, we assign a fitness based on the relationship of the individual to others whose fitness has already been calculated. For example, we might use the mean fitness of the two parents in crossover as an approximation to the fitness of the child. Recent surveys survey paper by Lin and Shi and Rasheed give a general introductions to this area, and a nice paper by Barbour, Corne and McCall shows how these ideas can be applied.
One interesting idea would be to bring these together in a particular way, that is to use a form of Monte Carlo tree search as a means of doing fitness approximation. That is, we take the current population, and run a number of runs to convergence by taking random crossovers and mutations (this is computationally cheap, as in most individuals it is fitness evaluation that is most costly). The fitness of the individuals is then the fitness of the end-point of these random runs. The question that we would like to answer is whether the fitness assigned at the end of this process is a more valuable guide for selection than selection based on direct calculation of the fitness of the individuals.
Similar ideas could be applied to other methods, such as other applications of genetic algorithms, or swarm intelligence methods.
Genetic Algorithms are a problem-solving method that takes its inspiration from biological evolution. A population of potential solutions is created, and this population is modified by processes that are similar to evolutionary change over generations, in particular mutation of solutions and crossover of information between solutions.
Traditionally, mutation has been carried out by having some representation and making a small change within that representation. Consider a problem like evolving the image of a face to match a remembered face: this is a real problem that is needed for "photofit" problems used by the police to identify criminals seen by witnesses to a crime (see e.g. the E-Fit system). The standard approach would be to have some kind of parameterised model of faces - similar to that used in character creation in computer-based role playing games - and adjust the parameters to match the memory of the witness.
An alternative would be to use the vast amount of data that is available on the web. Again, consider our example: there are huge numbers of photographs of faces on the web, could an similarity-based image-recognition system such as GazoPa be used to present "mutations" of a facial image based on real photographs rather than an artificial face?
The aim of this project would be to develop and evaluate some search-based mutation systems. This could be based on the face-based system discussed above, or another visual system, or a language-based problem, depending on the interests of the student. There is also scope for extending these ideas to crossover, though this is harder.
Genetic algorithms are a technique for finding good solutions to optimization and other problems which work by simulating the evolutionary process found in natural biological systems. One problem with these techniques is that the population can rapidly converge to a weak solution, or at the opposite end of the spectrum can fail to converge to a solution at all. This depends largely on the choice of parameters such as rates of mutation, population size, recombination rates, et cetera.
Work has been carried out at finding a priori optimal parameter settings for simple problems, however these results do not extend easily to real problems. One promising approach would be to look at on-the-fly parameter adjustment. As the algorithm runs it would accumulate statistical information about current performance, and use this information to do parameter-estimation for models of algorithm behaviour. This can then be used to adjust the parameters of the algorithm itself to control e.g. convergence rates. Another approach would be to use ideas from control theory.
Similar techniques could be applied to other meta-heuristics for optimization and related problems.
One thing that we learn when we learn to program is the idioms of a programming language. We do not use the full expressivity of a language: most of the time, when we use a particular construct in a language, we use it in one of a very restricted number of ways. Indeed, learning that we do not need to feel guilty at not using the full range of expressivity in a language is one of the marks of maturity as a programmer: your programs start to look like the sort of stuff that real programmers write, rather than just a random spray of syntactically-correct material.
One example of programming language idioms is the theory of roles of variables. This is an attempt to classify the use of variables into a small number of groups: most-recent holder, follower, gatherer et cetera. Empirical research by Sajaniemi's group in Joensuu has shown that a dozen roles can capture the vast majority of variable usage in programs. Tools have been created that, either through program analysis or machine learning that can test whether a variable is being used in a particular role, or identify the most likely role that a variable is playing.
Idioms exist at various levels. The canonical notion of a programming idiom is the for-loop in languages such as C and Java. The canonical for loop is of the form for(int i=0;i<n;i++). Whilst in theory the language allows complex expressions to be placed in the "parameters" of the loop, in practice the vast majority of loops (it would be interesting to do a proper study of this) are of this form or a minor variant.
There are a number of possible projects in this area. One area is in programming education. There is a vast amount of research about beginning to program, but there is very little about the following stage, i.e. progressing from a competence in basic programming skills to being an experienced programmer. Idioms provide one way for thinking about this and for providing tools to support more advanced programming education, for example by producing idiom-aware programming tools. There is a substantial project to be done based on creating such a system in a grounded way and evaluating it.
A second area is in the provision of debugging tools based on idioms. There are a number of tools that attempt to do generic debugging - that is, not just to identify errors with regard to a specific problem specification, but with regard to "code smells" that hint at problems with the program based on commonly-made error types. One example of a system that uses such methods is FindBugs. It would be interesting to develop such a system not based on finding negative aspects in the code but on finding deviations from positive aspects. This would involve the development and evaluation of a debugging system based on these ideas.
In a very different direction there is a lot of scope for applying ideas about idioms to genetic programming. This would involve biassing the search towards mutations that are based on idioms.
(joint project with Alberto Moraglio)
The geometric framework (GF) is a formal theory of search algorithms that links solution representations, search operators, search spaces and fitness landscapes. It defines search operators using geometric shapes such as circles, segments and convex polygons specified as functions of the underlying distance of the search space. The search operators are then defined using these geometric shapes to relate the location and probability of the offspring with respect to the location of their parents. For example, uniform geometric crossover is a geometric search operator that, given two parents corresponding to the end-points of a segment, returns uniformly points in the segment as offspring. This framework reveals an important duality: the same search operator can be defined both on the representation as a procedure that manipulates parent configurations to produce offspring configurations as well as in a geometric declarative form as spatial relationship by specifying the locations and probability of the offspring with respect to the locations of their parents. For example, the uniform crossover for binary strings, which is defined on the representation, dually, it can be defined as uniform geometric crossover on the Hamming space. This duality has surprising consequences: (i) it allows us to use geometry on the search space as a formal language for a general theory of representations that encompasses the most-frequently used mutation and crossover operators across representations; (ii) it allows us to associate complex search operators with classic fitness landscapes; and (iii) it allows us to do principled design of search operators for new representations and problems that embed problem-specific knowledge in the search.
The aim of this project is to explore whether the GF can give insights to make discoveries in real genetics. This line of research is highly speculative. However, this might be possible because homologous biological recombination that aligns DNA strands on contents before genetic material exchange was shown to be a geometric crossover. Using the GF, this may allow us to determine the biological fitness landscape exactly and not to use this concept only heuristically or metaphorically as it is currently done in biology. Studying the characteristic of smoothness of this landscape might then explain, or contribute to an explanation on, why the rate of the biological adaptation due to Darwinian evolution is surprisingly high (which is an important open question). A second line of enquiry building upon the GF that can give interesting results in real genetics is about genes discovery (which is an important open problem). This is because the notion of gene can be naturally defined in geometric terms in a representation-independent way and can be formally specialized to any solution representation, including the biological one based on DNA strands. When specialized to binary stings, geometric genes correspond to schemata. It would be then interesting to see whether the genes predicted by the GF for DNA strands correspond to known real genes and if so whether they can predict unknown real genes.
Techniques such as genetic algorithms have some notion of move operators at their core, i.e. the idea of moving from one point in the search space to another via a mutation of the solution. One well-known problem in this area is setting the amount of mutation: how often should this operator be applied?
It could be argued that this is asking the wrong question. Rather than applying the same mutation operator more-or-less often, we should perhaps be thinking about how we can apply operators that have a variable amount of impact, depending on the fit that has been achieved so far. Basically, we don't want the mutation to make things worse than the current state - and we want to make small, focused improvements where this is appropriate.
One approach to this is specific to symbolic regression, i.e. fitting a curve to data. There are a number of genetic programming-based approaches to this. I would like to consider a new approach; rather than running a long GP algorithm to solve the problem, run a large number of small (short, small-population) GP runs, and combine the result of each with the results from the previous GP runs (by addition, multiplication or function composition). That is, each run trys to fit the difference between the current solution and the optimal solution, rather than making a complete fit.
There is a very loose analogy here with ideas of series expansion in topics like Fourier or Taylor series. The idea is that the first pass of the optimisation fits an approximate curve to the whole data set; the next pass fits some features that are missed out by the first curve; and so on, until later stages of the fit would be only affecting small local details.
One critique of this is that the resulting expressions would be very complicated. However, existing GP systems produce complicated expressions anyway; and, there might be scope for symbolic simplification of the resulting expressions.
This is particularly interesting the context of dynamic symbolic regression, where the data is changing with time and where the aim is to maintain a good model in the face of this changing data. It would be possible to use this approach to adjust the fit at various levels: e.g. to adjust the details if the changes were at the level of fine detail (we can think of this as adjusting the "higher harmonics" of the model), or to make large-scale changes if needed.
We sometimes want to contextualise the information in a web search. An example of context is the age of the information on a page. For example, we might want to get articles that were written at the beginning of some phenomenon. The canonical information retrieval/semantic web way to do this would be through the use of metadata. However, this isn't very useful for this, as web-pages are rarely tagged with such meta-data. However, there is a lot of information available that is tagged, e.g. newspaper articles. The aim would be use some idea of similarity between tagged and untagged pages to add this contextual information to a page. For example, we might search for the use of particular names or slang terms that were prominent during a particular span of months or years to add time-context to a page. The techniques used for this would primarily be machine learning methods.
A second broad set of ideas in a similar direction would be adding affecive tagging to webpages, to allow a search that is contextualised by the "attitude" of the web-page writer. For example, we might want to search for a negative review of some product, or to find pages that are "informative" rather than "gossipy" about a celebrity. Again, machine learning and text mining approaches would be the main technique that would be used for this.
Metaheuristics are general computational problem-solving methods. Examples would be search techniques such as genetic algoithms and tabu search. To apply a metaheuristic, the problem at hand needs to be phrased in terms of a general problem type. The canonical example of a problem type is optimization, i.e. defining an objective function and then applying the search technique to find a good value for that objective function.
Research in metaheuristics has focused on exploring new problem solving methods; a neglected area is the definition of new generic problem types. One example with a lot of promise is the idea of coverage, which is the idea of finding one example for each class in some classification.
An example helps to explain this. Consider web search or a similar information retrieval problem. One problem with such a search is that a query that references several topics will often produce a search result that is dominated, in some cases completely, by one particular topic. It would be interesting to have a search that grouped the results, showing a few examples of each of the topics referred to by that query.
This problem of finding a diverse set of examples relative to a qualitative classification has the potential to be a good example of a new generic problem type for metaheuristics. Examples of its application can readily be found in information retrieval, security, mathematics, computational biology, and design.
There are some techniques that are similar to this. For example, there are some similarities with unsupervised clustering, multimodal optimization, machine learning of classifications, and some approaches in computational creativity. However, none of these quite capture the same idea.
There is lots of scope for the development of an interesting tool and applying it to a number of topics. In particular, a demonstration that a single tools could work fairly well on a number of problems might be more interesting than applying these ideas to a single problem. That said, specific examples in e.g. web search and computer-aided drug discovery are probably worthy substantial projects in their own right. These ideas are unpacked a little more in a paper that I wrote a few years ago, a more recent paper is less focused on this specific problem but is aimed at giving an introduction to the general idea of metaheuristic design.
As humans living and acting in the world, we have a number of moods and emotional states that influence how we think and reason about the world. Some of these are purely side-effects of other processes---such as the lack of ability to focus when we are tired---but, others have a positive effect on a particular kind of cognitive task, for example the focus of attention when we are in a frightened situation. It is possible that these effects of mood on thinking might have evolved so as to generate the right sort of "cognitive mood" for a particular situation.
I wonder if we could exploit this to create an "affect-inspired" reasoning system. That is, we would have some kind of reasoning system that had a number of different modes of reasoning, and that these states were switched between depending on the current "emotion" of the system, which would depend on certain criteria.
There are a a couple of areas that have some reltationship to this: affective computing, which attempts to create systems that can work in a way that is aware of the the emotional state of the users; and, the various attempts that have be made to simulate emotions and the physical correlates thereof in systems such as neural networks, with the aim of understanding human emotion. However, this is a different proposal: here we would aim to use broad caricatures of these emotional states as a way of providing the inspiration for a better computational system for some problem. The role of affect in such a system would be similar to the rle played by biology in bio-inspired computation.