A common occurrence in my student supervision is students getting stuck. Sometimes they get stuck trying to install an obscure library, perhaps while writing a new routine, or, more conventionally, they suffer an acute attack of classic Writer's Block. Being stuck for a short time can be very productive; being stuck for a long time is usually mainly frustrating. Aside from interacting with students regularly and allowing them to interact with each other, I don't quite possess the magic potion that makes students stuck less often, or for shorter periods. However, I did realize that some of the "stuck" points are very similar for different students.
In an initial effort to address this, I have created a public GitHub repository with resources for all my 3rd year and MSc project students. You can find it here, and everyone is free to contribute, criticize and comment. Now all I need to do is ensure that this thing doesn't become derelict or, perhaps even worse, turn into some opaque spaghetti that eventually collapses under its own weight. Let's try to remember to revisit this one in a year or two.
Published on July 27th, 2016.
Written for the Software Sustainability Institute Blog, reposted here.
This blog is already chock-full of useful tips for software development, and much of it applies to sustaining software on supercomputers as well. Here are a few tips on developing sustainable software for supercomputer environments.
Yes, that’s what the locals call it in Europe. The locals in the UK call it High-End Computing infrastructure and the locals in the US prefer Cyberinfrastructure. Familiarise yourself with at least your local supercomputer before you start. Check the technical specifications, read the user guide, arrange access and preferably examine how the machine performs with existing codes. And last but not least, try to find relevant libraries that are already installed there. That way you don’t have to waste time writing bespoke code or installing libraries of your own.
Supercomputers tend to have a more complicated architecture than your local machines, so installing software on them is going to be more difficult. Intuitively you may feel inclined to write more complicated code than usual, so that your program nicely wraps around all the intricacies of the machine. However, you may well find yourself sinking in a swamp of errors and incompatibilities, especially when the administrators have decided to update their operating system or, worse still, upgrade to a new supercomputer!
Complex software doesn’t work well with complex hardware, because complex hardware tends to make the software installed on it more difficult to maintain. If you’re not part of a huge company, and would still like to be able to use your software five years from now, it’s best to keep your code structured simply with a constrained set of dependencies.
Supercomputer compilers tend to be optimised for performance, and are frequently a little unstable on the fringes of your programming language. When you write supercomputing code, try to avoid implementing that ten-level dynamically typed object hierarchy in C++, or basing 50% of your communication routines on a feature that emerged in the MPI standard only a year ago.
Similarly, supercomputers behave a little differently when things go wrong. When you test your program, you’ll save a lot of time and frustration by assuming that your program will almost surely crash. Prepare for crashes and hangs by capping your jobs tightly in wall-clock time, and enabling a reasonable level of verbosity. If crashes do occur, dig into the data. Don’t just look for the cause of the current crash, but inspect your output data for other inconsistencies and errors. This may save you a handful of additional crashes further down the line.
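As a minimal sketch of the wall-clock cap and verbosity advice, here is what a batch script might look like on a machine managed by the SLURM scheduler (the job name, the 30-minute limit, and the `./my_solver --verbose` invocation are placeholders; your site's scheduler and its directives may well differ):

```shell
#!/bin/bash
# Hypothetical SLURM batch script: cap the job tightly in wall-clock time
# so a hung run gets killed instead of silently burning your allocation.
#SBATCH --job-name=solver-test
#SBATCH --time=00:30:00            # tight wall-clock cap: 30 minutes
#SBATCH --output=solver-%j.log     # keep stdout for post-crash digging
#SBATCH --error=solver-%j.err      # keep stderr separately

# Run with a reasonable level of verbosity so crashes leave a trail.
srun ./my_solver --verbose
```

The tight time limit doubles as a safety net during testing: if the run hangs, the scheduler ends it for you, and the saved log files are exactly the data you want to dig into afterwards.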
Helpdesks at supercomputing centres are responsive and useful. They are staffed by specialists who, in many cases, are expected to resolve issues within a very limited timeframe. So... if you get stuck with an issue, you'll help yourself greatly by contacting them. However, when it comes to installing new system-level software, such as your favorite flavor of MPI or a set of Java web service libraries, don't expect the helpdesk team to be as enthusiastic.
Resource providers tend to regard installed software as a perpetual sink of energy and money. You'll find yourself making much faster progress if you either install it in your local home directory (when possible) or avoid using the software altogether.
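For the home-directory route, the classic recipe looks something like the sketch below (assuming the library ships with a standard `./configure` script; the library name and the `$HOME/.local` prefix are just examples):

```shell
# Hypothetical example: install a library under your own home directory
# instead of asking the admins to install it system-wide.
tar xzf somelib-1.0.tar.gz && cd somelib-1.0
./configure --prefix=$HOME/.local   # install into your home directory
make && make install

# Make the locally installed library visible to your builds and jobs:
export PATH=$HOME/.local/bin:$PATH
export LD_LIBRARY_PATH=$HOME/.local/lib:$LD_LIBRARY_PATH
```

The `export` lines usually belong in your shell profile (and in your job scripts), since batch jobs don't always inherit your interactive environment.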
Suppose you've decided to develop that next-generation science solver by combining six existing programs: one runs on a desktop and is interactive, the second consists of millions of tiny and independent tasks, the third is optimised to run on the latest NVidia/ATI graphics cards, the fourth is actually a straightforward parallel code, the fifth is a straightforward parallel code that uses privacy-sensitive data, and the sixth simply handles much more data than any of the others.
Now, what kind of computer should you pick to run this super-program? Clearly, there is no one-size-fits-all solution, which is exactly why we have so many different architectures out there to begin with. Fortunately, distributed computing is very much alive, and workflow tools such as Taverna and coupling tools such as MUSCLE make it easier to just link resources together to use them for a common purpose.