Dear All, WHY DO STUDENTS BREAK THINGS? ============================= A final-year student, working on a project with our CSP-for-Java (JCSP) library, broke it a few weeks ago :-( ... The good news is ... we fixed it ;-) ... but it was agonisingly hard to do. Or have we really fixed it? After all, we thought it was OK before. It makes an interesting case study, so I'm setting it as a challenge for your idle hours. The student (Neil Fuller) came up with a deadlock in his application that he claimed was not his fault. He was right. Distilling his system down reduced it to a simple stress-test on the JCSP Alternative (i.e. ALTing) that deadlocked. I went back to the original Java algorithms for the ALT: http://www.hensa.ac.uk/parallel/groups/wotug/java/discussion/4.html This was published in October, 1996, following up discussions arising from the Java Threads Workshop the month before. Mapping the stress-test to use this code (i.e. just using Java Threads directly - no JCSP packages), it also failed :-( The stress-test is simple: two threads (A and B) bombard two channels with ints in a tight loop (the Channel in the above web page just carried ints). There is a receiving thread (R) that `fair'-ALTs on the two channels, also in a tight loop: ______________ ______________ ______________ | | | | | | | | a | | b | | | A |--->---| R |---<---| B | | | | | | | | | | | | | ---------------- ---------------- ---------------- The select method in the Alternative class (from the above web page) is supposed to be a PRI ALT. To get a `fair'-ALT, unroll the body of the loop in R so that it does two selects (PRI ALTs): the first gives priority to channel `a' and the second to channel `b'. What should *not* happen is that the sytem deadlocks. Well, it runs for a while ... sometimes several minutes ... and then it deadlocks ... WHY DIDN'T WE STRESS TEST IT ORIGINALLY? ======================================== I can't remember, but I think we must have done. But in 1996, we weren't using a JIT so a stress test wouldn't really have stressed it and the race hazard wouldn't have shown up. The JCSP/CTJ libraries then got built and are certainly used with JITs - but till now, none of our JCSP applications really stressed it. WHAT WENT WRONG? ================ The web page was published long ago (2nd. October, 1996). The algorithm tries to be a Java rendering of the algorithm implemented by the transputer. It can't be exactly the same, because of certain aspects of Java, but was as close to it as could be made (or, at least, that's what I tried to do). Prior to appearing on that web page, it was posted to java-threads and occam-com. The algorithm has been presented at seminars around the world and in private conversations with all sorts of clever people. I never had any problem in explaining how it was a watertight implementation of the ALT. The algorithm was incorporated into the JCSP library and has worked successfully for a couple of years ... until a couple of weeks back! The algorithm was also (I think) incorporated into Twente's CJT library, where it's also been used without apparent failure (Andy/Gerald: look out!). But the algorithm is wrong. It has a race hazard. Stress it and it will deadlock. Oh, dear! The algorithm involves two monitors (classes): Alternative and Channel. The Alternative has synchronized methods: select (public) and schedule (of package visibility). The Channel has four methods: read and write (both public) and enable and disable (both package visibility). The read, write and enable methods are synchronized - the disable one is not synchronized. The reason given on the web page for not synchronising the disable is valid (it would *quickly* deadlock otherwise). There are several waits and notifies around the code. The two classes interact with each other in all sorts of interesting ways. At WoTUG-21, I gave an informal presentation of a Hoare monitor alogorithm that had just 4 (synchronized) methods plus various waits and signals. That was pretty hard to follow! I think these Alternative/Channel interactions may be a tad harder. MORAL: ====== Programming with synchronized, wait and notify is to be avoided at all costs! Above an almost trivial level of complexity, we just plain get them wrong!! Even when the code is a key element implementing a major product and we stare at it and talk about it for years, we still get it wrong. What if we were building something safty-critical (like air traffic control)? The hope is that we've got it right now for our CSP primitives and we will never be bothered by them again. We can use the CSP stuff instead - which is something about which we *can* reason and which remains reasonable with increasing complexity of application. SO WHAT WENT WRONG? =================== I'm not going to tell you ;-) At least, not until some (probably) late night session at WoTUG-22 at Keele next month ... have *you* booked your place yet?!! Meanwhile, it is an exercise for the reader. Print out the web page. Stare at the Alternative and Channel classes. Spot the race hazard! Hint: without formal methods, I bet you won't spot it! Instead, build the stress-test outlined above. Run it and watch it deadlock. Get your JVM to dump a stacktrace for all its (3) threads (CTRL-\ for JDK on Solaris or CTRL-Break on Windows may do the trick ...). Look at the stack traces. The deadlock is obvious, but how did those threads get into those states? The Alternative/Channel methods were not supposed to let that happen! [Note: this took us an hour or so.] Once you've found the deadlock and worked out the race hazard that lets it happen, fix the algorithm so that it doesn't. [Note: this took us about 5 days ...] HOW DO WE KNOW WE'VE GOT IT RIGHT NOW? ====================================== We don't. I'll publish the new algorithm at WoTUG-22. Please would someone then prove it correct!!! Or prove it wrong, correct it and prove that one correct! A problem lies in that there are no formal semantics for Java's synchronized keyword, nor for its wait and notify methods. They must have some semantics, because we don't seem to have any problem informally explaining what they do. Indeed, a problem with these things is that it's *easy* to explain what they do. Great, threads seem so simple in Java - at least, the primitives seem simple enough. It's when we use them in combination that we lose all sense of what we are doing. To formalise these things, build CSP models for them. Get some consensus that these models do reflect the informal Java semantics. Then, we can start proving things (with respect to the assumption that the CSP models are good ones). [I can see(ish) how to map sync and wait into CSP ... the notify beats me though ... ???] Then, to prove the revised Alternative/Channel implementations correct, turn them back into CSP (assuming we have CSP models for sync/wait/notify). We would now have an implementation in terms of CSP primitives. Prove that that implementation equals an ALT. Have a good weekend, Peter.