Before starting my PhD in October 2016, I helped the FFEA team at Leeds polish up their software and release it to the public.
FFEA (more info here) is a biophysics simulation package designed to simulate large, complex biomolecules. It had been in internal use at Leeds for a few years (mostly for simulations of the motor protein Dynein), but the team wanted to release it as free software, in the hope that it would be used for other molecules as well.
The road to release was a long and fraught with danger, but since my undergraduate project the software had clearly been shown some love, as it now had some documentation and a number of unit tests. However, there was still a lot to do!
The software lacked a proper readme file and any kind of installation instructions. Scientific software (and biophysics software in particular) can be quite difficult to install, so one of my first tasks was to test the install process on a range of different operating systems, and compile a troubleshooting guide that showed each installation step in detail, along with the potential gotchas. I also wrote a readme file that explained the purpose of FFEA, along with some images and eventually a video (link).
Additionally, I created a basic website for the project, and started hosting our documentation online – previously, users would have had to compile the software in order to get access to it, which creates this horrible chicken-and-egg installation situation. For fun, I reskinned doxygen to make it a bit easier to read, and in keeping with the look of the rest of the FFEA website.
I also wrote documentation for the FFEA viewer and FFEA analysis API, both of which I’ll discuss in more detail later on.
FFEA is a multi-language project, comprising two big slices of C++ and Python, and a small slither of C. Most of the C++ code just fed into one monolithic executable with some fairly simple command-line arguments. From a user’s perspective, nothing could be easier. However, the FFEA tools, a suite of initialisation and analysis tools written in a mixture of Python, C++ and C, were in worse shape. They mostly existed as a collection of disparate scripts, scattered around the install directory. It was possible to run them using a ‘launcher’ script, which ended up being on the system PATH, but this was a bit hack-y (I think we had 2 simultaneous command-line wrappers going at once) and not really a good user experience.
I decided that I would unify these scripts, along with some of the self-contained C++ executables, into a more consistent API. This would mean less duplication of code between scripts (there was a lot!), it would make the FFEA tools easier to extend and enhance, and it would make it easier for users to install and use them – ideally, they should be usable in a python interpreter just by typing ‘import ffeatools’.
I also strove to make the existing command-line interfaces more consistent. I stripped out the custom sys.argv sniffing code (again, duplicated but slightly different in each file) and replaced it with a more standard use of the python argparse module. However, these scripts also needed to be imported as a python package, so they gained the ability to detect whether they were being used on a terminal or in a package, and adapt themselves to either. Furthermore, the most important C++ and C scripts gained python wrappers, so that they could all be used from one, unified interface.
These improvements also permitted the creation of the ‘FFEA meta’ object. An instance of the FFEA meta class is automatically created when the FFEA tools are loaded, and can record the parameters being given to all of the different FFEA scripts using a few simple hooks. It will also automatically write that information to a file, and name that file in line with the other FFEA files. The end result is that the user doesn’t have to worry about manually writing down or losing their initialisation parameters. This object would not have been possible without packaging!
I also rewrote the FFEA error handling code in its entirety. I work with the philosophy ‘It is better for the user to see an error and understand it, than for that error to be hidden from them, and then come back to bite them in the ass’. I removed a number of instances of ‘error handling as control flow’ and replaced them with verbose errors with the correct exception type. That said, there are a few cases of ‘non-pythonic’ error code left, notably in the file IO scripts, as asking for forgiveness (as in ‘whoops, I tried to read a line out of a file that didn’t exist, guess I’m done’) is a lot faster than asking for permission.
Finally, I fixed a number of bugs, most of which were related to the visualisation tools, file IO (particularly parsing PDB files), deprecated code, and the installation process.
Since the end of my work on my BSc project [link], the homegrown FFEA trajectory\molecule viewer had been deprecated in favour of a plugin for PyMOL, the python molecular dynamics viewer. This viewer was brand new, and so was missing a number of useful features.
First, as I mentioned earlier, the viewer was very buggy. When it didn’t core dump due to a thread safety issue, it would throw up exceptions due changes in the FFEA python modules. The former issue was actually quite easy to fix – the viewer relied on loading the FFEA trajectory information on a separate thread to the renderer for the main window (in order to keep the viewer responsive). The fix was to use a drop-in replacement for the Tkinter library, called mtTkinter, which automatically handled threading, and stopped the majority of crashes (why this is not shipped as a drop-in replacement for Tkinter, I have no idea). As for the latter, hitting this moving target eventually drove me to write unit tests for the core FFEA modules, something that is still ongoing.
After that, I decided to tidy up the user interface for the PyMOL plugin. At the time, this consisted of a collection of radio buttons with not-very-descriptive labels, thrown into a window seemingly at random (they weren’t even lined up!). This is fine if the software is only for internal use, but now that it was going public, I made an effort to group the controls in a structured and logical fashion, and label each group.
I also added a few features. The first was the ‘CGO’ trajectory loader option. PyMOL accepts input in the CGO format, which lets you feed vertex data into it as a series of huge strings of numbers. These strings are formatted as a series of control codes (indicating whether the data forms a triangle, line, cylinder etc) followed by the positions of the vertices, and the vertex normal vector (needed for lighting).
Here is where the problem lies: the FFEA data files are not structured in this way. The nodes are saved in one file, and are loaded into quite a complex, hierarchical object. In the file, and the object, there is no way of knowing which vertex each node belongs to. That information is given in the surface file\object (or the topology, as interior nodes form tetrahedrons, not triangles). Each vertex contains 3 nodes, but each node belongs to 3 different vertices, meaning that each node has be loaded three times. This has to happen for each frame in the trajectory. So, 100 frames of 1000 nodes doesn’t sound like much, but it means that 300,000 lookups have to be performed, along with a few tens of thousands of vertex normal calculations, which can take quite some time.
I wish I could say I’d written some incredibly clever algorithm that does all of this much faster, but the reality is that no amount of clever algorithms can ever match up to that time-honoured tradition of cheating and making it look faster than it really is.
I rewrote the trajectory loader for special cases like this. Rather than loading into a complex object, it wrote all the data into a monolothic n-dimensional array. This meant that reading in the data was only bottlenecked by the speed at which the program could iterate over the lines in the file. Next, the array was cached to the hard drive in a binary blob (normally making it orders of magnitude smaller than the trajectory file itself, which is a text file). Then, the surface normals and calls to CGO were pre-calculated and also cached to the hard drive. Because they’re now accessing this monolothic array object, everything is a little faster. And because they’re cached, loading the file is instantaneous (after the initial generation).
Whilst this method isn’t perceptibly faster at first, unless you’re working with a really big trajectory (most of the small gains in loading the file are thrown away when writing cached objects to the disk), it does have one major advantage – it lets the user load in and out these files in real-time, once they’re generated. This tackles another big issue of the FFEA viewer – the user sets their parameters and pushes the ‘load file’ button. After that, their parameters are immutable and unchangeable. In order to switch from looking at the faces to looking at the topology, they’d have to re-load the entire thing. Now, they just have to load in one cached file. It also makes the files much smaller (the binary blob format is 1/5th of the size of a regular trajectory).
Now that the way was paved to make the viewer more ‘interactive’, I looked at trying to add useful features (both for potential users of the software post-release, but also for the rest of my PhD!). First on the list was tracking element inversions. The simulation will often crash when a certain element ‘inverts’. This happens when a tetrahedron undergoes enough force to literally turn it inside-out. The result is that the element has a negative volume, and this cause the numerical integration function (which uses the Euler method) to become unstable. Luckily, when this happens, the software will tell you which element inverted. The problem is that there was no way of visualizing that element in the PyMOL plugin. Element indices could only be displayed as 3D models, and even then, you’re going literal needle-in-a-haystack mode.
I designed two features to completely address this. First, the user can specify a number of nodes in the user interface to be highlighted. The software will then generate specific CGO ‘objects for those nodes, and put them in a separate PyMOL object. Then, the user can set the colour of that pymol object to whatever they like.
Second, the user can push another button (‘Add node indices’) at any time to have ‘psuedo-atoms’ generated at the location of each node. While you can draw objects into PyMOL quite easily, none of PyMOL’s measurement techniques actually work on them (e.g. if you want to measure the distance between two nodes). Generating psuedo-atoms fixes this – and if you set the name attribute to the index of the atom, then you can view their names or target them using a few commands in the pymol console.
I feel the need to point out that if you’ve used any of these tools whilst using the FFEA viewer, you should definitely also thank Albert Solernou, who took a lot of time to upgrade and improve them before the official launch of the software.
I hope that the changes I’ve made to the software will enable people to easily run cool simulations, to build their own tools on top of the FFEAtools API, and to more easily visualize the results of their simulations. If you are at all interested in biological simulations, please feel free to check out FFEA! It’s easy to get up and running, even if you don’t know anything about biophysics!
Last updated December 20, 2016.