Reaction-Diffusion by the Gray-Scott Model: Pearson's Parameterization
Instructions: A click anywhere in the crescent-shaped complex region will take you to a page with images, a movie and a specific description. Each grid square leads to a different page. I have special pages for the uskate-world and certain other exotic patterns.
This web page serves several purposes:
What Is It?
All of the images and animations were created by a computer calculation using the formula (two equations) shown below. In Roy Williams' words, it is a "relatively simple parabolic partial differential equation". Details are below; the essence of it is that it simulates the interaction of two chemicals that diffuse, react, and are replenished at specific rates given by some numerical quantities. By varying these numerical quantities we obtain many different patterns and types of behavior.
The patterns created by this equation, and other very similar equations, seem to closely resemble many patterns seen in living things. Such connections have been suggested by:
- Turing, 1952  (dappling; embryo gastrulation; multiple spots in a 1-D system)
- Bard and Lauder, 1974 (leaves, hair follicles)
- Bard, 1981  (spots on deer and giraffe)
- Murray, 1981 (butterfly wings)
- Meinhardt, 1982  (stripes, veins on leaf)
- Young, 1984 (development of eye)
- Meinhardt and Klinger, 1987 (mollusc shells)
- Turk, 1991 (leopard, jaguar, zebra)
In his original 1994 exhibit3, Roy Williams presented grayscale images, showing a "histogram-equalized view of the U component". He suggested that the images, after modification by "a little playing with the color map" would be "quite attractive as gift-wrapping". My images, like Williams', meet their own edges seamlessly, so they would look good if you used them as wallpaper on your computer monitor, for example.
The organic appearance and great diversity of patterns makes Gray-Scott patterns ideal for purely artistic applications, such as my own screen saver:
The Xmorphia PDE5 screen saver
Even for fairly non-artistic purposes such as a scientific publication, much attention should be given to the visual presentation of the simulation data.
Some of the color maps I tried
For this website, I wanted to express more than one dimension of information (specifically: u and ∂u/∂t) and I wanted the color mapping to be the same everywhere in the system, so that any two images could be compared directly with the knowledge that identical colors always represent identical values of u and ∂u/∂t. This prevented the approach of using "histogram equalization" or anything that changes the mapping from data value to color.
In addition to meeting those basic requirements, the color scheme I ended up with meets all of the following objectives:
- Low u values, including the large plain area near the left, are blue, while high u values and the area on the right are "red" this is to match the colors used by Pearson in his paper which in turn match the color of the pH indicator bromothymol blue used in actual experiments,
- The colors make fairly full use of the entire HSV (hue-saturation-value) color space, in an aesthetically pleasing way,
- With practice the observer can distinguish the level of u as well as the rate of change of u, and neither one obstructs the other,
- Resulting images compress well into JPG and H264 video without producing distracting artifacts or exceeding the color gamut of those standards.
Another 4 trial color maps
The reaction-diffusion system described here involves two generic chemical species U and V, whose concentration at a given point in space is referred to by variables u and v. As the term implies, they react with each other, and they diffuse through the medium. Therefore the concentration of U and V at any given location changes with time and can differ from that at other locations.
There are two reactions which occur at different rates throughout the space according to the relative concentrations at each point:
U + 2V → 3V
V → P
P is an inert product. It is assumed for simplicity that the reverse reactions do not occur (this is a useful simplification when a constant supply of reagents8 prevents the attainment of equilibrium). Because V appears on both sides of the first reaction, it acts as a catalyst for its own production.
The overall behavior of the system is described by the following formula, two equations which describe three sources of increase and decrease for each of the two chemicals:
The Reaction-Diffusion System Formula
u=[U], the concentration of U, and v=[V]. For the sake of simplicity we can consider Du, Dv, F and k to be constants. In computer simulations there are also quantization constants for time and space (Δt and Δx) that are used to break ∂t and ∇2 into discrete intervals.
The first equation tells how quickly the quantity u increases. There are three terms. The first term, Du∇2u, is the diffusion term. It specifies that u will increase in proportion to the Laplacian (a sort of multidimensional second derivative giving the amount of local variation in the gradient) of U. When the quantity of U is higher in neighboring areas, u will increase. ∇2u will be negative when the surrounding regions have lower concentrations of U, and in such cases the diffusion term is negative and u decreases. If we made an equation for u with only the first term, we would have ∂u/∂t = Du∇2u, which is a diffusion-only system equivalent to the heat equation.
The second term is -uv2. This is the reaction rate. The first reaction shown above requires one U and two V; such a reaction takes place at a rate proportional to the concentration of U times the square of the concentration of V6. Also, it converts U into V: the increase in v is equal to the decrease in u (as shown by the positive uv2 in the second equation). There is no constant on the reaction terms, but the relative strength of the other terms can be adjusted through the constants Du, Dv, F and k, and the choice of the arbitrary time unit implicit in ∂t.
The third term, F(1-u), is the replenishment term. Since the reaction uses up U and generates V, all of the chemical U will eventually get used up unless there is a way to replenish it. The replenishment term says that u will be increased at a rate proportional to the difference between its current level and 1. As a result, even if the other two terms had no effect, 1 would be the maximum value for u. The constant F is the feed rate and represents the rate of replenishment.
In the systems this equation is modeling, the area where the reaction occurs is physically adjacent to a large supply of U and separated by something that limits its flow, such as a semi-permeable membrane; replenishment takes place via diffusion across the membrane, at a rate proportional to the concentration difference Δ[U] across the membrane. The value 1 represents the concentration of U in this supply area, and F corresponds to the permeability of the membrane.
It is useful to note here that in biological systems, such as the skin of a developing embryo, this "supply" could be from the bloodstream, or from cells in an adjacent layer of tissue that continuously generate the needed chemical(s), with rates regulated by enzymes.
The only significant difference in the v equation is in its third term. The third term in the v equation is the dimishment term. Without the diminishment term, the concentration of V could increase without limit. In practice, V could be allowed to accumulate for a long time without interfering with further production of more V, but it naturally diffuses out of the system through the same (or a similar) process as that which introduces the new supply of U. The diminishment term is proportional to the concentration of V that is currently present, and also to the sum of two constants F and k. F, as above, represents the permeability of the membrane to U, and k represents the difference between this rate and that for V.
(In the original stirred tank model, F is the flow rate at which a pure-U supply is pumped into a tank, and k is the rate at which the reaction V→P takes place. The tank does not fill or empty; there is an exit pipe that carries liquid out at the same rate as the inflow. If the current concentration of U and V in the tank are u and v respectively, one can see that during the time it takes for 1 unit of U to be pumped in, the outflow is a mixture of U and V in the ratio u/v. That puts the minus F times u in the first equation and the minus F times v in the second equation. The k times v part (also being subtracted in the second equation) is the result of the V→P reaction converting (and therefore "removing") an additional amount V from the system.)
There is nothing in the equations that states whether the system exists in a two-dimensional space (like a Petri dish) or in three dimensions, or even some other number of dimensions. In fact, any number of dimensions is possible, and the resulting behavior is fairly similar. The only significant difference is that in higher dimensions, there are more directions for diffusion to happen in and the first term of the equation becomes relatively stronger. It is for this reason that phenomena depending on diffusion for their action (such as gradient-sustained stable "spots") occur at higher k values in the 2-D system as compared to the 1-D system, and at yet higher values for the 3-D system (this relationship can be seen in Lidbeck's Java simulator1). The use of a uniform grid is not essential either, for example see the "amorphous layout" of the simulations by Abelson et al.2
The original Gray-Scott model, with only a single quantity for the concentration of U and V that is equal throughout the tank, can effectively be thought of as the zero-dimensional case. In the 2-D simulations shown here, the "zero-dimensional" behavior is seen in any oscillatory or dynamic phenomena that have a long wavelength in space. For example, at F=0.026, k=0.049, the U changes to V pretty quickly, but there is a sort of damped oscillation as the system approaches the steady-state equilibrium concentrations; and the same is seen in the original Gray-Scott system.
Classification of Patterns
Various combinations of F and k produce widely different types of patterns, ranging from nothing at all (a completely blank screen or "homogeneous state") to stripes, spots, stripes and/or spots that split and/or "replicate", and several more complicated things. A preliminary classification scheme was begin by Pearson , and I have expanded on it; a catalog of the pattern classes is given here.
There is also a more general system for classifying pattern phenomena in complex systems that was established by Wolfram . This system was originally designed for discrete, deterministic cellular automata, but the principles can be expanded to continuous real-valued systems; the beginnings of such an expanded system are described here:
- Universality-Complexity Classes for Partial Differential Equation Systems.
- Pearson's Classification (Extended) of Gray-Scott System Parameter Values
Gray-Scott patterns can also be pretty interesting in just one dimension. A few examples are shown here: Gray-Scott reaction-diffusion system in one dimension.
In three dimensions, there are many more types of patterns, and the possibilities for complexity like that in uskate-world are much higher. However, they are a lot harder to simulate and display.
How I Simulate the Gray-Scott System
The reaction-diffusion hacker emblem
The partial-differential equations are fairly easy to translate into computer code, although there are pitfalls and tradeoffs to consider in calculating the gradients (Du∇2u and Dv∇2v terms). All projects discussed here have employed the explicit Euler method, in which ∇2s of the scalar quantity s is just the sum of the differences between the value of s at the point in question and the neighboring points, multiplied by the square of the grid spacing.
The original project, by Roy Williams, involved a number of calculation runs on a grid of 256×256 points, for 200,000 iteration steps (which I call "generations" in analogy to cellular automata). This took 30 minutes on a set of 17 Intel i860 processors, part of the Intel Paragon at Caltech, a rate of 7.3 million cell-generations per second (7.3 Mcgs), or 233 MFLOPs out of a peak speed (for 17 nodes) of 1.275 GFLOPs.
In 1994 I discovered the Caltech exhibit (web page) at this address (now no longer active) through my research on the Paragon and related supercomputers. I began doing simulations on various size grids, usually around 300×300 or somewhat larger, on a 60 MHz PowerPC 601, achieving rates of about 0.58 Mcgs.
The original implementation, July 1994
Over the following years I continued to maintain the project and implement improvements (described below) to deal with memory bandwidth limitations and take advantage of new SIMD instructions. By 2004, 10 years after I began, it was running at 36.1 Mcgs using the vector instruction set on an 800 MHz PowerPC 7445. At 6.7 Mcgs, the scalar code was roughly on par with the 17-node i860 implementation at Caltech. (The vector-to-scalar speedup factor, 36.1/6.7 = 5.38, seems impossible but was the result of the fact that the vector code also included optimizations that deal with the cache-to-memory bandwidth bottleneck.)
PDE4 on Core 2 Duo, with LogCPU load monitor
Most recently, in 2009 I revisited the project (mainly to set up this web site) and created the multi-threaded implementation for SMP (shared-memory parallel) systems. On a single node of a 2.33-GHz Intel Core 2 system it runs at 151 Mcgs; on an 8-core Xeon E5520 system it achieves just over 1800 Mcgs. (Perhaps even more impressively, a single thread running alone on the same Nehalem system achieves 275 Mcgs, almost double the Core 2's single-threaded speed). Memory speed does not factor into the comparison; the same ratios are seen when the dataset far exceeds the level-2 cache, as when it fits easily. (This is due to special care I have given to address the memory bandwidth bottleneck, see below). From the original PowerPC to a present-day machine of equal cost, I have seen a speed improvement of nearly 3000 over 15 years, a doubling in speed every 15.6 months. Double-precision performance is about half that of the figures quoted here, and this ratio has been about the same throughout the project's 15-year history.
I currently do most of my work on 512×512 grids (4× the area of the original Caltech grids) because it shows a greater variety of patterns and allows me to view a million tu in just a few minutes. Occasionally (such as when the parameter F is greater than about 0.05) I will let it run as much as an hour or two, or overnight when generating timelapse movies or a view of the full parameter map like that at the top of this page. Some tests have been run for as long as 4 months.
Some instructions for my program are here: PDE4 User Manual. The program is for MacOS only. Here is a recent version: PDE4 2011 Oct 5. If you have trouble running the program or have questions, contact me.
In the fall of 2010 I created a screensaver using OpenGL kernels to perform the reaction-diffusion calculations on the GPU. See my screen savers page for more details, screen shots, and to download the screensaver itself.
my face rendered in Gray-Scott
Optimizing Reaction-Diffusion Simulations
My reaction-diffusion simulation work has been a long journey (18 years at this writing) or progressive optimization to adapt to new hardware. In the first 15 years, my program's performance increased by a factor of more than 3000.
1994: The Caterpillar Scan
I began this project in 1994, just after the original Xmorphia website appeared. The standard algorithm at the time was forward Euler integration, (and my later studies showed that this is more than adequate for even the most complex Gray-Scott phenomena). Using the standard "ping-pong" method, one uses two grids (arrays) of data. With the gridsize that Pearson and Williams used, each grid would be 256×256×2×4 bytes, or 512 KiB, which amounts to 1 MiB total for the two grids.
In 1994 I had a 60 MHz PowerPC 601. This chip has a 32K unified L1 cache, and I had not installed the optional external 256K L2 cache. I discovered that with the 256×256 grids things ran at a speed that seemed reasonable, but with the larger grids I was trying to use for parameter space plots, performance fell sharply.
Since each grid cell is two single-precision floating point numbers, a row of 1024 cells consists of 8192 bytes of data, which is typically stored as two separate blocks of 4096 bytes. The Euler calculation requires reading three rows of input and writing one row of output; all of these rows get read into the cache on the first row scanned. For the cache to be useful, two of those input rows need to remain in the cache for when we calculate the second row. That means that four rows of data need to fit in the cache. So a grid width of 1024 cells would be trying to put 4×8 KiB = 32 KiB of grid data in the cache, and that won't fit because it's a unified cache: some of the cache is used up by the code (program instructions). A grid width of 1024 was extremely slow, and in fact I didn't get much over 512 before things got slow.
So the first optimization I used was what I called the "caterpillar scan", so-called because the data end up creeping along like the feet on a caterpillar that is walking forwards. We use just one grid, but we start by adding an extra row to the top and bottom of the grid and on each row, the output (new values for that row) get written over the previous row. The processing of a single iteration of the whole grid goes like this:
Notice that the output is offset by one row from the input: that's because we always need the "old values" of the two neighboring rows when computing each new row: so when the new value of row N is written, it must be written over row N-1 so that the old value of row N is still available when working on row N+1.
Using the caterpillar algorithm the cache impact of the grid width was half as great, because there's only one grid: the row being overwritten is already in the cache. Here are my original PowerPC 601 benchmarks:
I was perfectly happy with this (the theoretical peak MFLOPs of this CPU is 120 MFLOPs, but that's only achievable without loads and stores. I did achieve 92 MFLOPs in my Power MANDELZOOM but only by keeping everything in registers.)
2004: GCC Compiler and Register Variables
By 2004 I had an 800 MHz PowerPC 7445 with 64K of L1 cache and 256K of L2 cache. My old PDE1 program had been built with MetroWerks CodeWarrior C++, and I created a new version (called "PDE2") in Xcode. But PDE2 failed to match the performance of PDE1. After some investigation I discovered that the new compiler (a version of GCC) was no longer putting local variables in registers, something CodeWarrior has been doing quite well. There are 32 registers of each type (float and integer) and at least half are available for use within a subroutine.
After some digging around and experimentation, I discovered that the solution for the GCC compiler was to declare local variables like this:register real_type F ASM("f1"); register real_type dF ASM("f2"); register real_type u_u ASM("f3"); ...
There was also a new way to tell it to generate FMA instructions (the -mfused-madd flag). With these changes, the newly compiled PDE2 matched the performance of PDE1:
2004: 128-bit Vector SIMD
The main reason for moving PDE1 to the new compiler was so that I could utilize the PowerPC 7445's "AltiVec" unit, a 128-bit-wide SIMD vector unit. It had the potential of delivering up to 4 times as many MFLOPs, although the need for additional data movement typically reduces the efficiency a bit.
I worked out a way to scan across a grid row 4 cells at a time, while keeping some of the data in AltiVec registers so that the next 4 cells don't have to re-load some of the needed input values. The result was a quite satisfactory improvement of about 3×:
2004: Cache Line Alignment Investigation
In the above two tables, I have highlighted the column with the highest performance. You might nitice that the results for a row length of 1024 cells is suspiciously low in the single threaded scalar results, and suspiciously high in the single threaded vector results.
I suspected this was because of "cache lines", and wrote my own version of the STREAM benchmark to investigate. I spent quite a while looking at the memory system performance of two 800 MHz PowerPC 7445 (with different types of memory) and worked out exactly how many cycles it takes to load each vector (128 bits or 16 bytes) of data. I eventually figured out three things:
- Due to the organization of the cache (8-way set-associative, 32-byte cache lines) certain row lengths will make the L1 and L2 caches behave as if they are much smaller than 64K and 256K respectively.
- In many cases performance is improved by interleaving the two floating-point values of each grid point so that all the data is in a single big array.
- Most grid sizes benefit from using the vec_dst (data stream touch) instruction, which tells the CPU to pre-load data that will be used soon.
Combining these techniques led to a substantial improvement, and with a more consistent performance for different row lengths:
2004: Millipede Scan
As part of the investigation of cache performance I worked out that I could achieve a greater cache hit rate by performing more than one "caterpillar scan" in parallel. This is easiest to visualize by imagining that there are two CPUs going therough the grid in parallel, with one CPU "in the lead" and the other following just a few rows behind.
This requires adding more extra rows to both the top and bottom of the grid, because if we're going to compute 2 new generations, we need 2 extra rows on each side.
If you watch a real millipede walking, it's actually a lot like this: there are several places where the feet are off the ground, and these places travel like waves in the "backwards" direction while the while millipede moves forward. Our data move like this: we scan from the top down, but the result (generation 2 output) ends up two rows higher.
The practical benefit of this is greater cache locality: to compute two generations we only need to scan the whole grid once. As long as 6 rows of data can fit in the cache, the cache will only "miss" once per row.
I call this the zig-zag scan because of the order in which memory is being accessed. The above diagram represents a "double-interleaved" zig-zag scan, but it can be more than that. The optimum amount of "interleaving" depends on the size of the processor caches and the size of the grid being simulated.
As before, continued operation requires using modular coordinates and having the grid data "wrap around" to continually re-use a fixed amount of memory. Here are the results I got in 2004. For each different grid size, I use boldface to highlight which overscan value ran fastest:
2009: Multi-Threaded Implementation
I revisited the project in 2009, using a dual-CPU PowerPC "G5" system. To use multiple threads, the grid is divided into "stripes" with each stripe being computed by one thread. Unlike the normal approach, the "stripes" do not remain fixed, but move in lock-step. Each thread is performing a millipede scan, and the millipedes march together around the (cylindrical) grid. This reduces the amount of data transfer that needs to be performed between threads, and reduces the total memory needed to hold the grid data. Both make it run faster on a variety of CPU, cache and memory configurations.
2009: Intel Conversion
I then moved to an Intel Core 2 Duo. This required two major changes. I added a set of macros to hide the use of compiler intrinsics for vector operations. With my macros, which are available here, the same C source will work on PowerPC or Intel vector intrinsics and even in cases where neither are available.
Then I made additional changes to handle the "little-endian" storage of data in memory (which affects how vector loads and stores get 4 consecutive array elements, and matters if the array is also being accesses by non-vector code e.g. for drawing). Most of this was handled by the vector macros, which hide the fact that array element 0 is loaded into the lowest vector element on an Intel system, and into the highest element on a PowerPC system.
As compared to the dual 2.0 GHz PowerPC G5, he 2.33 GHz Core 2 Duo does about 40% better.
2009: 8-core Nehalem System
On my 8-core system I have 16 MB of L3 cache, which is enough to hold all of the grid data for grids up to about 1024x1024 pixels (with room to spare for everything else that needs to be cached). I created many long-term time-lapse videos, most on a 480x336 grid; it was easy to run two or three of these at the same time, using 4 or 8 threads per each simulation.
This is a dual Xeon E5520 system. Each CPU has 4 cores and runs at speeds from 2.27 GHz to 2.53 GHz depending on how many cores are active.
Using just 2 threads should give more "performance per thread" than with 4 threads or more, both due to Amdahl's Law and because each CPU will have just one active core, allowing the clock rate to go up to the maximim 2.53 GHz. Based on clock speed alone I would expect to get about 324 Mcgs at the large grid size. In contrast to the Core 2 Duo, on this dual Xeon system data needs to be transferred between the two CPUs throughout the test. However, any slowdown from that is greatly outweighed by the DDR memory controller being right on the CPU die, and other Nehalem microarchitecture improvements.
For these tests I changed the 300×300 and 600×600 grids to 296×304 and 592×608 respectively, because this makes the number of rows a multiple of 16, needed for the 16-thread benchmarks below.
To measure the impact of Amdahl's Law, I repeated almost every test using 4 threads, then again with 8 threads and finally with 16 threads (except for the smallest gridsize).
In 2009 I prepared a scientific paper describing some of the more exotic patterns and my work to confirm or disprove their existence; see my formal scientific research page, and the figures and animations
In 2010 I gave a talk in the Rutgers Mathematical Physics Seminar Series; the slides, links to videos, and associated notes are here: 2010 Rutgers talk
 A. M. Turing, The chemical basis of morphogenesis, Philosophical Transactions of the Royal Society of London, series B (Biological Sciences), 237 No 641 (1952) 37-72. (PDF at cba.media.mit.edu: Turing 1952)
 K. J. Lee, W. D. McCormick, H. L. Swinney, and J. E. Pearson, Experimental observation of self-replicating spots in a reaction-diffusion system, Nature 369 (1994) 215-218. (PDF at chaos.ph.utexas.edu: Lee et al.)
The best program for getting started with reaction-diffusion is Ready, a fast, easy to use and versatile reaction-diffusion explorer for Windows, Mac and Linux. There are downloadable working versions, and it's an open-source project.
U-skates in Ready
If you have a Mac (running "Snow Leopard" or later) you'll probably also appreciate my xmorphia PDE5 screensaver, which turns the webcam image (i.e. you!) into a continually evolving Gray-Scott pattern:
The Abelson et al. applet 2 is a quick and easy way to experiment with Reaction-diffusion in your browser. It is a bit difficult to draw a pattern manually, but it is the only web-based simulator that supports different types of grids:
U-skates on a hex grid
Other Online Resources
1 : Jonathan Lidbeck has created this Java applet with extensive presets, options and instructions. He also provides a 3-D version. (Old URLs were http://ix.cs.uoregon.edu/~jlidbeck/java/rd/ and .../java/rd/3d.php)
2 : A group (Abelson, Adams, Coore, Hanson, Nagpal, Sussman) at MIT have created this website showing Gray-Scott pattern phenomena in grids of points that are connected "amorphously", more closely modeling how things would happen in a biological system of living cells.
6 : Why UV2? : This is due to the so-called law of mass action (in its kinetic aspect), the fundamental relationship between stochiometry and reaction rate that generally applies when all reagents 8 are free to move around and participate in any available reaction. The value UV2 results from the facts that the odds of finding a molecule of U at any given place are proportional to the concentration u=[U], and in order for the reaction to take place you need to have one U and two V's in the same place at the same time. This ignores the rate constant, reverse reaction, and many other details: the reaction-diffusion model above has been simplified by converting to dimensionless units, ignoring any heat, ionization, association, etc. considerations, leaving just UV2. The law of mass action is a fairly rare example of a place in applied mathematics where addition and multiplication become multiplication and exponentiation respectively ("U plus V times 2" becomes "u times v to the power of 2") 7. It is because of this transformation, particularly the exponentiation part, that certain substances such as pure oxygen are so dangeous: when you increase the concentration from, say, 5% to 50%, the reaction rate of an otherwise rare reaction of the form "4O2 + A ... → B + ..." increases by a factor of 10,000.
7 : Multiplication to exponentiation : I love stuff like this, see my large numbers pages if you have any doubt.
This page was last updated on 2015 May 17. s.13