/*~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 |  luams-3.c
 |
 | A summary of this and other HPC projects is in high-performance.rhtf

19961015.xxxx Create LUAMS-2 to answer a new thread on sci.fractals about
the Mandelbrot area. It seems they want to do a new GIPPP on our newer,
faster machines.

19961017.2xxx Find my postings to sci.fractals from 1993. Notable points:
- I was about to run a GIPPP myself but called it off in order to
  accommodate the distance-estimate GIPPP, which at the time was more
  mathematically interesting.
- The 1.506595 result was calculated with a dwell limit of 512*1024 and
  period detection. It was an average of 40 runs, half of which straddled
  Re=-0.75 (Seahorse Valley) and the other half of which ran right down it.
- The error term 0.000002 came from the standard deviation for the 40 runs
  divided by sqrt(40).
- An additional 0.0000056 should have been subtracted to account for the
  dwell limit 524288 that was used in the 40 runs.
- It would be nice to calculate the total number of iterations iterated in
  the course of the area estimate.
- It takes 7 years for computing power to grow enough to use a 2x2 larger
  grid.
And other things I thought of while reading:
- Given the trend in stderr, it takes a little under 20 years for computing
  power to grow enough to gain one decimal digit.
- It would be nice to compute the center-of-gravity of the Mandelbrot Set
  at the same time as the area.

19961018.14xx. Add period detection.

  grid  itmax    area      stderr    time
    32    256    1.509723  0.039114  1.60
    32    512    1.500947  0.038372  0.63
    64    512    1.516968  0.015629  2.00
    64   1024    1.512485  0.016269  2.30
   128   1024    1.510655  0.007460  7.92
   128   2048    1.508534  0.007052  8.87
   256   2048    1.508178  0.002845  33.2
   256   4096    1.507292  0.002688  36.6
   512   4096    1.507517  0.001122  141
   512   8192    1.507116  0.001097  155
  1024   8192    1.506971  0.000489  607
  1024  16384    1.506779  0.000501  661

19961018.16xx. Add test for period-2 mu-atoms, but it doesn't work. Add
skewing along Im axis and code to deal with the first row differently.

    32    256    1.530539  0.039807  1.45
    32    512    1.524781  0.038548  0.63
    64    512    1.508679  0.017331  2.00
    64   1024    1.504898  0.019480  2.32
   128   1024    1.509324  0.005810  7.90
   128   2048    1.507078  0.005430  8.88
   256   2048    1.508248  0.002059  33.2
   256   4096    1.507333  0.002191  36.6
   512   4096    1.507676  0.001369  149
   512   8192    1.507299  0.001376  166
  1024   8192    1.507195  0.000489  675

19961018.1635. Fix period-2 test. Increase number of iterations greatly to
reduce that source of error.

19961018.1717. Add printout of total aggregate iterations.

    32     4096   1.512 842 730  0.035 980 710          771,615    2.07
    64     8192   1.502 917 751  0.019 617 425        2,714,431    4.40
   128    16384   1.505 115 841  0.005 550 857       10,725,998    12.3
   256    32768   1.506 413 466  0.002 277 286       45,943,157    49.6
   512    65536   1.506 898 665  0.001 368 183      200,294,650     218
  1024   131072   1.506 801 245  0.000 503 518      877,059,533     895
  2048   262144   1.506 675 767  0.000 151 678    3,830,269,312    3737
  4096   524288   1.506 588 196  0.000 074 581   16,898,067,806   15840
  8192  1048576   1.506 595 617  0.000 027 484   74,884,551,084   67570

19961021.1248. Make m_area save to a file periodically, with filename based
on gridsize and n_max so (potentially) different projects can be started
and stopped at different times. It also has the nice side-effect that if
you run the same gridsize again after recompiling (or whatever) it returns
the area for that gridsize immediately.

  1024   262144   1.506 622 314   46,910,573
  2048   524288   1.506 286 621  213,925,295
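(Aside: the 0.000002 error term in the 19961017 notes above is the standard
error of the mean, i.e. the standard deviation of the 40 runs divided by
sqrt(40). A minimal sketch of that calculation -- not the code LUAMS uses,
which keeps running sums instead:)

  #include <math.h>
  // Mean, sample standard deviation and standard error of the mean for
  // n independent area measurements (e.g. the 40 runs from 1993).
  double std_error(const double *x, int n, double *mean_out)
  {
    double sum = 0.0, ssq = 0.0, mean;
    int i;
    for (i = 0; i < n; i++) { sum += x[i]; }
    mean = sum / n;
    for (i = 0; i < n; i++) { ssq += (x[i] - mean) * (x[i] - mean); }
    if (mean_out) { *mean_out = mean; }
    return sqrt(ssq / (n - 1)) / sqrt((double) n);  // sigma / sqrt(n)
  }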
19961021.1315. Go back to m_area_stat, which now automatically benefits
from the changes to m_area. Increase itmax again and begin a new set of
runs (Also, the skew parameter has become an integer, so two things have
changed):

     32     8192   1.511 458 647  0.037 221 915            904,123
     64    16384   1.502 640 743  0.019 431 398          3,133,053
    128    32768   1.504 840 699  0.005 686 364         11,973,129
    256    65536   1.506 341 352  0.002 261 414         51,835,404
    512   131072   1.506 886 924  0.001 371 587        225,038,584
   1024   262144   1.506 783 392  0.000 493 761        971,556,504
   2048   524288   1.506 674 263  0.000 152 765      4,248,279,832
19961022
   4096  1048576   1.506 585 067  0.000 074 146     18,687,285,458
19961023
   8192  2097152   1.506 593 810  0.000 027 584     83,170,102,744
19961028
  16384  4194304   1.506 588 027  0.000 012 606    368,748,991,052
(19961107)
  32768  8388608   1.506 591 544  0.000 004 484  1,416,385,072,113
(19961209)
  65536 16777216   1.506 592 300  0.000 001 900  3,728,239,546,503

19961028.17xx. Add nice() routine, so that it will avoid slowing down the
Mac when the mouse is being moved. This does a pretty good job of keeping
LUAMS out of the way while I'm doing development.

19961029.1550. Small changes to nice().

19961104.2xxx. rebuild Power MANDELZOOM projects, preparing to merge
pipelined iteration code into LUAMS.

19961105.1458. Finally add Jay Hill's tests for period-1 and -2 mu-atoms
(replacing my old test for the period-2 mu-atom). The iterations column
shows how much this speeds up the program:

    32    8192   1.511 458 647  0.037 221 915     276,466
    64   16384   1.502 640 743  0.019 431 398     993,155
   128   32768   1.504 840 699  0.005 686 364   4,144,440
   256   65536   1.506 341 352  0.002 261 414  17,877,456
   512  131072   1.506 886 924  0.001 371 587  79,693,035

Immediately replace the ongoing background task so it can benefit from the
speedup.

19961105.1650. Add knowledge of exact value (7 pi / 16) for period 1 and 2
mu-atoms; pixel counting only used to estimate area of the higher mu-atoms.
As expected, this lowers the standard deviation a lot at first, then
substantially less and eventually not at all:

    32    8192   1.510 347 371  0.028 337 143      276,466
    64   16384   1.501 618 093  0.013 830 156      993,155
   128   32768   1.506 875 646  0.005 529 593    4,144,440
   256   65536   1.506 404 500  0.001 983 789   17,877,456
   512  131072   1.506 870 205  0.001 167 508   79,693,035
  1024  262144   1.506 775 316  0.000 508 381  361,842,027

(This happens because the standard error has two components: that which
comes from the period-1 and -2 mu-atoms and that which comes from the rest
of the Set. The component of error from the two mu-atoms goes down as
(grid size ^ 1.0)/(grid size^2) because the mu-atoms' boundaries have
dimension 1.0; the other error term goes down slower and as a result it
dominates when the grid size is arbitrarily large.)
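(For reference, the well-known closed forms behind Jay Hill's tests: with
c = x+iy and q = (x - 1/4)^2 + y^2, the point c is inside the period-1
cardioid iff q*(q + x - 1/4) < y^2/4, and inside the period-2 disk iff
(x + 1)^2 + y^2 < 1/16. Their exact areas are 3 pi/8 and pi/16, which sum
to the 7 pi/16 used above. A minimal sketch; dwell2 in the code at the end
of this file uses a rescaled but algebraically equivalent pair of
inequalities:)

  // Nonzero if c = x+iy lies in the period-1 cardioid or period-2 disk.
  int in_period_1_or_2(double x, double y)
  {
    double q = (x - 0.25)*(x - 0.25) + y*y;
    if (q * (q + (x - 0.25)) < 0.25*y*y)    { return 1; }  // main cardioid
    if ((x + 1.0)*(x + 1.0) + y*y < 0.0625) { return 1; }  // period-2 disk
    return 0;
  }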
19961107.xxxx. Merge in pipelined iteration code and (almost) verify
accuracy. It seems to differ on a few points, probably because of the
differences in the iteration algorithms. Also, it doesn't provide a speed
improvement (yet) because it doesn't have orbit detection. (In other words,
the simple iterator with orbit detection is faster than the pipelined
iterator without orbit detection.)

19961107.17xx. Two sources of discrepancy were found: One was a bug (not
flushing the pipeline before doing the first row adjustment) and the other
results from the differences in the iteration formulas (such as using
zi2 = 2*(zr*zi + ci/2) rather than zi2 = zr*zi + ci).

The following are the new standard results to compare to (times are
approximate):

    32    8192   1.511 458 647  0.037 221 915     275,459   0.267
    64   16384   1.502 640 743  0.019 431 398     967,521   0.867
   128   32768   1.504 914 742  0.005 692 132   4,188,562   4.533
   256   65536   1.506 386 129  0.002 312 484  17,981,647   15.87
   512  131072   1.506 869 368  0.001 349 102  80,190,202   64.72

{For the next 15 months the program's performance is hindered because I
don't realize that the loop unrolling makes the period detection so
inefficient as to cancel out the benefit of iterating two points at once.
-19990520}

19961107.2039. Pipeline now includes orbit detection, but isn't any faster
than single iteration. Don't really know why...

Subject: Mandelbrot Set Area, new result
From: Munafo@prepress.pps.com
Date: 1996/12/23
Message-Id: <32BED4D5.BF8@prepress.pps.com>
Newsgroups: sci.fractals

I have continued to compute the area of the Mandelbrot Set using my method
of averaging 20 runs by the pixel-counting method, with slightly different
grid placements and grid spacings. The latest result (after about 3.7
trillion Mandelbrot iterations) is 1.50659230 +- 0.0000006.

Here are the results of a set of such averaged runs. For each grid size, 20
runs were computed and averaged together.

  grid      dwell     average        standard
  size      limit     area           deviation
  ----     -------    -------------  -------------
    32        8192    1.511 4        0.037 2
    64       16384    1.502 6        0.019 4
   128       32768    1.504 84       0.005 68
   256       65536    1.506 34       0.002 26
   512      131072    1.506 88       0.001 37
  1024      262144    1.506 783      0.000 493
  2048      524288    1.506 674      0.000 152
  4096     1048576    1.506 585 0    0.000 074 1
  8192     2097152    1.506 593 8    0.000 027 5
 16384     4194304    1.506 588 0    0.000 012 6
 32768     8388608    1.506 591 54   0.000 004 48
 65536    16777216    1.506 592 30   0.000 001 90

The standard deviation measures how much (on average) each run differs from
the average. This means that the average itself almost certainly varies
even less from the true area. For 20 samples, the standard error of the
mean is about 0.23 of the standard deviation. A safe estimate of the error
would be about 1/3 of the standard deviation.

(My old estimate, from April 13th 1993, was 1.506595 +- 0.000002. That
estimate was based on an average of 40 runs, and used a dwell limit of
524288. I should have subtracted 0.000005 to account for the error in using
such a low dwell limit; I also was a little too "liberal" in my calculation
of the expected error.)

 - Robert Munafo

{Big hiatus here -19990520}
---------------------------------------------

19971010.0055. Resurrect the project. I have just gotten it to run again
after updating to the new CodeWarrior. It was crashing because g_dp was
being set to dwell2, which caused an infinite loop (the fix was to set g_dp
to dwell3). Can't tell how fast it's running because the iteration counts
are way off due to orbit detection -- it's adding up the dwell values
returned by the pipeline, not the actual number of iterations performed.
With this new build on the 6100/80 I am starting a new set of runs:

    32    8192   1.511 458 647  0.037 221 916           1,427,000  0.367
    64   16384   1.502 640 744  0.019 431 398           7,705,483  0.983
   128   32768   1.504 914 742  0.005 692 132          53,725,852  3.667
   256   65536   1.506 386 130  0.002 312 485         389,356,427  13.72
   512  131072   1.506 869 368  0.001 349 103       2,983,154,869  56.55
  1024  262144   1.506 781 916  0.000 492 708      23,278,493,347  262.9
  2048  524288   1.506 673 891  0.000 151 484     183,890,967,883   1405
  4096 1048576   1.506 583 929  0.000 074 489   1,461,504,579,703   5106
19971011
  8192 2097152   1.506 593 644  0.000 027 695  11,658,311,598,595  59242
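(The pipelining idea, reduced to its essence: iterate two independent
points in lockstep so the FPU can overlap the two dependency chains. A
minimal sketch using the standard iteration formula; the real unrolled
loop, with orbit detection and pipeline refill, is much hairier:)

  // Iterate two points at once; d0/d1 receive the dwell, or nmax+1 if the
  // point never escaped.  (A real version would retire an escaped point
  // and feed in a fresh one rather than continuing to square it.)
  void dwell_pair(double cr0, double ci0, double cr1, double ci1,
                  long nmax, long *d0, long *d1)
  {
    double zr0 = cr0, zi0 = ci0, zr1 = cr1, zi1 = ci1, t0, t1;
    long n;
    *d0 = *d1 = nmax + 1;
    for (n = 1; n <= nmax; n++) {
      // The two updates are independent, so a pipelined FPU can overlap
      // them; that is the whole point of pairing the points.
      t0 = zr0*zr0 - zi0*zi0 + cr0;
      t1 = zr1*zr1 - zi1*zi1 + cr1;
      zi0 = 2.0*zr0*zi0 + ci0;  zr0 = t0;
      zi1 = 2.0*zr1*zi1 + ci1;  zr1 = t1;
      if ((*d0 > nmax) && (zr0*zr0 + zi0*zi0 > 4.0)) { *d0 = n; }
      if ((*d1 > nmax) && (zr1*zr1 + zi1*zi1 > 4.0)) { *d1 = n; }
    }
  }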
19971012.1454. Make it measure ticks only while it's actually calculating,
and make it record ticks in the files. This requires trashing all the files
so far, so I have to start over. While I'm at it, I also make it record the
actual number of iterations and the "simulated" number of iterations
separately, which had stopped being done right after loop detection was
added. It is somewhat disturbing to notice that the actual area values are
changing too. Probably something to do with the pipeline restart after
loading from a file -- I think perhaps it isn't flushing the pipeline
before doing saves.

    32  1.511 458 647  0.037 221 916          13,698,616     8192          460,236    0.55
    64  1.502 640 744  0.019 431 398          84,071,307    16384        1,452,625  1.1667
   128  1.504 914 742  0.005 692 132         585,681,564    32768        6,091,786  3.5833
   256  1.506 386 130  0.002 312 485       4,358,544,267    65536       30,530,867   13.05
   512  1.506 751 662  0.001 348 801      33,606,713,474   131072      165,141,228  54.817
  1024  1.506 771 394  0.000 495 651     263,814,226,416   262144    1,048,193,486   251.7
  2048  1.506 673 891  0.000 151 484   2,090,491,542,859   524288    7,455,656,801  1349.6
  4096  1.506 584 024  0.000 074 387  16,643,863,788,711  1048576   55,389,875,749  8218.1

19971012.2347. Yep, it wasn't flushing the pipeline before doing file
operations. Change format of output and begin a whole new set of runs.

    32  1.511 458 647  0.037 221 916          13,698,616     8192          460,236    0.55  5.8575 MFLOPs
    64  1.502 640 744  0.019 431 398          84,071,307    16384        1,452,625  1.1667  8.7157 MFLOPs
   128  1.504 914 742  0.005 692 132         585,681,564    32768        6,091,786    3.55  12.012 MFLOPs
   256  1.506 386 130  0.002 312 485       4,358,495,089    65536       30,543,955   13.05  16.384 MFLOPs
   512  1.506 869 368  0.001 349 103      33,607,097,921   131072      165,722,959  55.183  21.022 MFLOPs
  1024  1.506 781 916  0.000 492 708     263,815,177,892   262144    1,050,371,926  252.75  29.090 MFLOPs
  2048  1.506 673 891  0.000 151 484   2,090,497,201,424   524288    7,464,872,712    1359  38.451 MFLOPs
19971013:
  4096  1.506 584 024  0.000 074 387  16,643,919,148,551  1048576   55,472,219,207  8321.5  46.663 MFLOPs
  8192  1.506 593 644  0.000 027 695  132,836,944,645,477  2097152  428,665,503,505  57369  52.304 MFLOPs
19971018:
 16384  1.506 588 076  0.000 012 669  1,061,430,262,321,482  4194304  3,365,249,417,036  4.0937e5  57.544 MFLOPs

19971014.14xx Look over the code thoroughly, trying to figure out why the
pipeline is running only half as fast as theory indicates it should. I
don't see anything, so I'll have to do some profiling. Given the magnitude
of the discrepancy, just breaking into Macsbug a bunch of times should be
sufficient to find where it's spending half its time.

19971014.235x I check this and find that it's actually spending almost all
of its time in the unrolled loop. However, there are a lot of pipeline
stalls because the compiler reordered the instructions. After I turn off
optimization and recompile it is in fact a bit faster, but not nearly as
fast as I thought it would be. It looks like the asymptote is a little over
60 MFLOPs. Then I remember that it's double precision, and the next day I
check Power MANDELZOOM and see that it only performs at 64 MFLOPs double
anyway.

19971018.23xx. Read the notes from 19961017 and realize I still haven't
done the center-of-gravity thing. Oops.
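(The center-of-gravity measurement costs almost nothing once you are
pixel-counting anyway: accumulate the real coordinate of every in-set
pixel, and divide by the count at the end; the imaginary coordinate of the
centroid is 0 by symmetry. A sketch of the bookkeeping, which is
essentially what the g_mass / g_row_mass / g_inset variables do in the code
at the end of this file:)

  // Finish an area/center-of-gravity tally: inset pixels were found in the
  // Set, mass is the sum of their real coordinates, pix_area is the area
  // of one grid square.  Only the Im >= 0 half-plane is scanned, so the
  // area is doubled and the centroid's imaginary part is 0 by symmetry.
  void finish_tally(double mass, double inset, double pix_area,
                    double *area, double *cog_re)
  {
    *area   = inset * pix_area * 2.0;
    *cog_re = mass / inset;
  }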
    32    8192  1.511 458 647  0.037 221 916          13,698,616  -0.292 565 786  0.014 640 563          460,236   0.517   6.236 MFLOPs
    64   16384  1.502 640 744  0.019 431 398          84,071,307  -0.287 474 451  0.004 811 761        1,452,625    1.78   5.702 MFLOPs
   128   32768  1.504 914 742  0.005 692 132         585,639,290  -0.286 495 853  0.002 352 688        6,094,438    3.58  11.905 MFLOPs
   256   65536  1.506 386 130  0.002 312 485       4,358,537,348  -0.286 741 015  0.000 912 340       30,588,764    12.8  16.750 MFLOPs
   512  131072  1.506 869 368  0.001 349 103      33,607,168,662  -0.286 880 847  0.000 416 546      165,498,826    54.5  21.263 MFLOPs
  1024  262144  1.506 781 916  0.000 492 708     263,814,920,625  -0.286 801 539  0.000 173 164    1,049,624,871     246  29.910 MFLOPs
  2048  524288  1.506 673 891  0.000 151 484   2,090,496,819,135  -0.286 811 014  0.000 074 787    7,461,538,307   1,310  40.011 MFLOPs
  4096 1048576  1.506 584 024  0.000 074 387  16,643,914,697,009  -0.286 765 808  0.000 024 839   55,470,040,257   7,970  48.708 MFLOPs
  8192 2097152  1.506 593 644  0.000 027 695 132,836,936,354,151  -0.286 769 509  0.000 009 733  428,624,195,429  54,700  54.845 MFLOPs

19971021.xxx.x I get ready to post a new message to sci.fractals:

Recently I have decided to restart the computing effort I undertook about a
year ago at this time to compute the area of the Mandelbrot Set. After
reviewing the notes from that time I saw that there had been some interest
in the past in computing the Mandelbrot Set's "center of gravity" as well.
I have also done some optimizations of the code in the mean time and fixed
a few bugs, and am ready to do some long runs.

The method I use involves counting pixels on a grid. There is error in the
estimate resulting from grid placement and the fact that the grid squares
on the boundary get counted as completely in or completely out when in fact
only part of their area is in. It is hard to estimate the magnitude of this
error, so I use statistical methods to measure it. I use a bunch of grids
of slightly different grid spacings with slight vertical and horizontal
offsets, and take the mean and standard deviation of all the runs.

These are the figures I've compiled so far. Each run is the average of 20
grids for which the standard deviation is shown.

  grid   iteration   area and                         std deviation
  size   limit       center-of-gravity                (20 runs)

   256      65536    1.506 39      0.002 31      -0.286 741     0.000 912
   512     131072    1.506 87      0.001 35      -0.286 881     0.000 417
  1024     262144    1.506 782     0.000 493     -0.286 802     0.000 173
  2048     524288    1.506 674     0.000 151     -0.286 811 0   0.000 074 8
  4096    1048576    1.506 584 0   0.000 074 4   -0.286 765 8   0.000 024 8
  8192    2097152    1.506 593 6   0.000 027 7   -0.286 769 51  0.000 009 73

For each average, the standard deviation gives the expected error in each
of the 20 individual grids. The expected error of the mean is 1/sqrt(19) of
the standard deviation (this is called "the standard error of the mean"). I
use 1/3 as a conservative estimate. So, my best estimate of the
center-of-gravity is -0.286769 +- 0.000003.

A few other observations:
- My program currently runs at 8 million Mandelbrot iterations per second.
  The current model Macs are about 4 times faster. Intel machines get
  nowhere close because they don't have enough floating-point registers to
  keep the pipeline full.
- Each time I double the grid size it takes almost 8 times as long. It
  takes 7 years for the industry to increase computing power of personal
  computers by 8.
- Each time the gridsize is doubled the error goes down to about 0.4 of the
  previous value.
  It would take about 20 years for personal computers to improve enough to
  give us one more digit in the area estimate (given the same run time).
- My best area estimate is still the figure from last year:
  1.5065923 +- 0.0000006.

and then I notice that the current version is much slower than the version
I was running when I posted the results a year ago. The number of
iterations is going up by about 7.5 each time, and last year's version was
more like 4.5. I guess almost right away that it's because of the new
pipelined routines' rather poor ability to detect limit cycles.

19971021.23xx. I do some runs with the old iterator routine just to verify
this.

    32    8192  1.511 458 647  0.037 221 916          13,648,582  -0.292 565 786  0.014 640 563        275,459  0.383  5.0301 MFLOPs
    64   16384  1.502 640 744  0.019 431 398          83,832,862  -0.287 474 451  0.004 811 761        967,521  0.867  7.8146 MFLOPs
   128   32768  1.504 914 742  0.005 692 132         584,564,367  -0.286 495 853  0.002 352 688      4,188,562   2.97  9.8831 MFLOPs
   256   65536  1.506 386 130  0.002 312 485       4,354,401,151  -0.286 741 015  0.000 912 340     17,981,647   11.9  10.533 MFLOPs
   512  131072  1.506 869 368  0.001 349 103      33,588,336,186  -0.286 880 847  0.000 416 546     80,190,202   51.1  10.989 MFLOPs
  1024  262144  1.506 781 916  0.000 492 708     263,748,416,082  -0.286 801 539  0.000 173 164    362,847,780    224  11.347 MFLOPs
  2048  524288  1.506 673 891  0.000 151 484   2,090,234,952,071  -0.286 811 014  0.000 074 787  1,678,084,515  1,010  11.608 MFLOPs
  4096 1048576  1.506 584 024  0.000 074 387  16,642,875,255,072  -0.286 765 808  0.000 024 839  7,780,618,202  4,600  11.846 MFLOPs

and I quickly decide that I should try to figure out a way I can write some
code to *guess* whether each point will be in the set or not (e.g., based
on the result of the most recently completed point). It can use the double
pipeline for points guessed outside the set and the single pipeline for
points guessed in the set.
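(The guessing scheme in miniature: in-set and out-of-set points cluster, so
predict each point's fate from the previous point's outcome and route it to
whichever iterator handles that case best. A hypothetical sketch --
iterate_single and iterate_dual here stand in for the real routines:)

  extern long iterate_single(double cr, double ci); // has loop detection
  extern long iterate_dual(double cr, double ci);   // paired-point pipeline
  extern unsigned int g_itmax;

  static int guess_inside = 0;   // updated after every completed point
  long dispatch_point(double cr, double ci)
  {
    long dwell;
    if (guess_inside) {
      dwell = iterate_single(cr, ci);  // loop detection pays off inside
    } else {
      dwell = iterate_dual(cr, ci);    // dual pipeline pays off outside
    }
    guess_inside = (dwell > (long) g_itmax);
    return dwell;
  }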
19971022.01xx. Optimize the single iterator routine a bit (the rad2 test in
particular could use some work). This improves the times substantially.

    32    8192  1.511 458 647  0.037 221 916          13,664,870  -0.292 565 786  0.014 640 563        291,747  0.183  11.139 MFLOPs
    64   16384  1.502 640 744  0.019 431 398          83,882,823  -0.287 474 451  0.004 811 761      1,017,482  0.600  11.871 MFLOPs
   128   32768  1.504 914 742  0.005 692 132         584,737,298  -0.286 495 853  0.002 352 688      4,361,493   2.23  13.670 MFLOPs
   256   65536  1.506 386 130  0.002 312 485       4,355,042,622  -0.286 741 015  0.000 912 340     18,623,118   9.37  13.918 MFLOPs
   512  131072  1.506 869 368  0.001 349 103      33,590,805,167  -0.286 880 847  0.000 416 546     82,659,183   40.2  14.376 MFLOPs
  1024  262144  1.506 781 916  0.000 492 708     263,758,102,020  -0.286 801 539  0.000 173 164    372,533,718    176  14.839 MFLOPs
  2048  524288  1.506 673 891  0.000 151 484   2,090,273,319,600  -0.286 811 014  0.000 074 787  1,716,452,044    788  15.246 MFLOPs

I try making it call the dual pipeline whenever the dwell is greater than
50 but the point is not in the set; the results are encouraging at first,
and then much, much worse at higher grid sizes:

   256    65536       22,167,716     9.2 sec  16.867 MFLOPs
   512   131072      109,955,129    38.9 sec  19.769 MFLOPs
  1024   262144      623,655,741        176   24.776
  2048   524288    4,006,852,998        888   31.588
  4096  1048576   27,710,036,024      4,860   39.904
  8192  2097152  177,665,018,000     27,300   45.543

so I go back to the all-dwell1 version:

  4096  1048576  1.506 584 024  0.000 074 387      16,643,027,976,653  -0.286 765 808  0.000 024 839      7,933,339,783      3,580  15.502 MFLOPs
  8192  2097152  1.506 593 644  0.000 027 695     132,833,077,382,170  -0.286 769 509  0.000 009 733     36,987,339,092     16,300  15.852 MFLOPs
 16384  4194304  1.506 588 076  0.000 012 669   1,061,413,532,170,594  -0.286 767 267  0.000 003 204    171,493,547,507     74,400  16.138 MFLOPs
 32768  8388608  1.506 591 501  0.000 004 512   8,486,342,263,388,365  -0.286 768 373  0.000 001 774    814,136,316,927    347,000  16.442 MFLOPs
 65536 16777216  1.506 592 311  0.000 001 904  67,870,833,876,143,188  -0.286 768 423  0.000 000 510  3,844,750,451,387  1,630,000  16.529 MFLOPs

19980218 compose a new article for sci.fractals:

  Mandelbrot Set center-of-gravity is -0.28676842 +- 0.00000017

Recently I restarted the computing effort I undertook about a year ago at
this time to compute the area of the Mandelbrot Set. After reviewing the
notes from that time I saw that there had been some interest in the past in
computing the Mandelbrot Set's "center of gravity" as well. I have also
done some optimizations of the code in the mean time and fixed a few bugs,
and have started doing a long run.

The method I use involves counting pixels on a grid. There is error in the
estimate resulting from grid placement and the fact that the grid squares
on the boundary get counted as completely in or completely out when in fact
only part of their area is in. It is hard to estimate the magnitude of this
error, so I use statistical methods to measure it. I use a bunch of grids
of slightly different grid spacings with slight vertical and horizontal
offsets, and take the mean and standard deviation of all the runs.

These are the figures I've compiled so far. Each run is the average of 20
grids for which the standard deviations are shown.
  grid    iter       area and            standard     iterations
  size    limit      center-of-          deviations   time (seconds)
                     gravity             (20 runs)    performance (MFLOPs)

   256      65536    1.506 38       0.002 31              18,623,118
                    -0.286 741      0.000 912         9.4     13.918
   512     131072    1.506 87       0.001 35              82,659,183
                    -0.286 881      0.000 417        40.2     14.376
  1024     262144    1.506 782      0.000 493            372,533,718
                    -0.286 802      0.000 173         176     14.839
  2048     524288    1.506 674      0.000 151          1,716,452,044
                    -0.286 811 0    0.000 074 8       788     15.246
  4096    1048576    1.506 584 0    0.000 074 4        7,933,339,783
                    -0.286 765 8    0.000 024 8     3,580     15.502
  8192    2097152    1.506 593 6    0.000 027 7       36,987,339,092
                    -0.286 769 51   0.000 009 73   16,300     15.852
 16384    4194304    1.506 588 1    0.000 012 7      171,493,547,507
                    -0.286 767 27   0.000 003 20   74,400     16.138
 32768    8388608    1.506 591 50   0.000 004 51     814,136,316,927
                    -0.286 768 37   0.000 001 77  347,000     16.442
 65536   16777216    1.506 592 31   0.000 001 90   3,844,750,451,387
                    -0.286 768 423  0.000 000 510  1,630,000  16.529

For each average, the standard deviation gives the expected error in each
of the 20 individual grids. The expected error of the mean is 1/sqrt(19) of
the standard deviation (this is called "the standard error of the mean"). I
use 1/3 as a conservative estimate. So, my best estimate of the
center-of-gravity is -0.28676842 +- 0.00000017.

A few other observations:
- My program currently runs at about 2.35 million Mandelbrot iterations per
  second. The latest model Macs are about 4 times faster. Intel machines
  get nowhere close because they don't have enough floating-point registers
  to keep the pipeline full.
- Each time I double the grid size it takes almost 4.7 times as long. The
  current rate of progress in computer performance is about 1.56 per year;
  at that rate it takes about 3.5 years to increase computing power by 4.7.
- Each time the gridsize is doubled the error goes down to about 0.4 of the
  previous value (but it varies from 0.3 to 0.55). It would take about 8 to
  10 years for personal computers to improve enough to give us one more
  digit in the area estimate (given the same run time).
- My new program reproduces my best area estimate from late 1996:
  1.5065923 +- 0.0000006.

http://www.cecm.sfu.ca/projects/ISC/ISCmain.html gives the following for
0.286...
(and way too much output for 1.506...):

  2867682063800133 = (0250) F(44/59;39/56;1)
  2867682149714066 = (0250) F(19/21;11/13;1)
  2867682521500307 = (0404) Psi(5/12)+Psi(3/4)-Psi(19/20)
  2867682633829350 = (0001) (1/3-ln(3))^Feig1
  2867682783580957 = (0010) sum((11/3*n^3-20*n^2+157/3*n-25)/(n!+1),n=1..inf)
  2867682869758240 = (0064) sum((7/3*n^3-13*n^2+83/3*n+3)/(Fibo(n)+1),n=1..inf)
  2867682944514431 = (0326) 4^(3/4)*(12-12^(1/4))
  2867682960509936 = (0001) Catalan+2/3+GAM(17/24)
  2867682970307599 = (0192) (GAM(5/6)+1/3)/(ln(3)+4)
  2867683491264107 = (0400) sum(1/(n!+7/6*n^3-5*n^2+131/6*n-14),n=1..inf)
  2867683638369271 = (0395) sum(1/C(2*n,n)/(17/6*n^3-13*n^2+133/6*n-10),n=1..inf)
  2867683747960658 = (0013) sum((8/3*n^3-27/2*n^2+167/6*n-3)/n^(n-1),n=1..inf)
  2867683843843252 = (0259) F(3/11,9/10;4/9,4/5,1/2;1)

  Artin    = 0.3739558136192022
  gamma    = 0.57721566490153286060651209008240
  TwinPrim = 0.6601618158468695
  Catalan  = 0.9159655941772190150546035149324
  Feig2    = 2.502907875095892822283902873218215786381271376727149977336192056
  Feig1    = 4.6692016091029906718532038
  Gibbs + BesI(1,1) = 2.417096155974951

  5606298864626305 = (0001) Gibbs+3/2*Feig2
  5606312839756551 = (0001) Artin+Gibbs^Khint
  5606382942821305 = (0001) Gibbs-ln(3)^E
  8606371538195862 = (0405) Cahen/Parking
  5606401453580847 = (0192) (sin(1)+3)/(Gibbs+5)

19980302 get the new result:

  131072  33554432  1.506 591 734  0.000 000 624  18,173,551,931,685
                   -0.286 768 317  0.000 000 295  7,590,000  16.754

19980304 It seems clear that the Mandelbrot Set center of gravity is
(ln(3)-1/3)^Feig1, which comes out to about 0.2867682633829350268529586

{Another big hiatus here -19990520}
-------------------------------------

19990518.1xxx. Port the program to gcc under Linux, and call it luams-3.c.
It seems to produce fairly similar results, and runs at about 32 MFLOPs on
a 233-MHz Pentium MMX.

19990518.1820. I've gotten it up to 38 MFLOPs by changing the optimization
flags and rearranging the inner loop a bit. For some reason, optimization
makes it print different values of g_act_its. I need to figure out why in
case this represents a bug.

  Gridsize 32, itmax 8192  Area 1.5114586 +- 0.0372219  Its: 13664870
    Center of gravity -0.29256579 +- 0.01464056, Actual: 291766
  Gridsize 64, itmax 16384  Area 1.5026407 +- 0.0194314  Its: 83882823
    Center of gravity -0.28747445 +- 0.00481176, Actual: 1011791
  Gridsize 128, itmax 32768  Area 1.5049147 +- 0.0056921  Its: 584735910
    Center of gravity -0.28649585 +- 0.00235269, Actual: 4367224
  Gridsize 256, itmax 65536  Area 1.5063889 +- 0.0022895  Its: 4355035389
    Center of gravity -0.28675522 +- 0.00091069, Actual: 18674354
  Gridsize 512, itmax 131072  Area 1.5068630 +- 0.0013543  Its: 33590760535
    Center of gravity -0.28687915 +- 0.00041832, Actual: 82279430
  Gridsize 1024, itmax 262144  Area 1.5067819 +- 0.0004926  Its: 263758161877
    Center of gravity -0.28680108 +- 0.00017393, Actual: 372254099

19990519.1xxx. Add im_sqrt parameters so the inner loop doesn't have to
test for iterations overflow. This doesn't seem to speed it up as much as I
was hoping it would. I would guess this results from speculative execution
of the floating-point operations.
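(What the im_sqrt change does: the inner loop no longer compares an
iteration counter against itmax; instead the iterations are organized into
blocks of growing length, re-saving the reference point for the period test
at the top of each block (Brent's cycle-detection idea). The total trip
count of the doubly-nested loop is 2+3+...+im_sqrt, about im_sqrt^2/2, so
m_area_stat picks im_sqrt ~ sqrt(2*itmax) once per itmax and the inner loop
needs no overflow test at all. A minimal sketch of the structure:)

  // Returns 1 if c is (apparently) in the Set: either a limit cycle was
  // detected or about itmax iterations passed without escape.
  int blocked_dwell(double cr, double ci, long im_sqrt)
  {
    double zr = cr, zi = ci, sr, si, t;
    long n1, n2;
    for (n1 = 2; n1 <= im_sqrt; n1++) {
      sr = zr; si = zi;                 // reference point for this block
      for (n2 = 0; n2 < n1; n2++) {
        t  = zr*zr - zi*zi + cr;
        zi = 2.0*zr*zi + ci;
        zr = t;
        if (zr*zr + zi*zi > 4.0)      { return 0; }  // escaped
        if ((zr == sr) && (zi == si)) { return 1; }  // cycle detected
      }
    }
    return 1;  // ran out of iterations without escaping
  }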
19990519.2356. Today I make it catch the SIGTERM signal so it will
automatically stop when the system is rebooted normally, make .bashrc
launch luams at startup if it's not already running, and write a script to
find and kill the luams process to use when I need to. I also restore the
l_ticks and g_ticks variables and related benchmarking calculations, which
I removed during yesterday's port because I didn't yet know how to measure
user CPU time under Linux. All of these changes bring LUAMS back up to the
level of usability it had in the Power Macintosh incarnation, and now it's
on a more stable OS and significantly faster hardware.

19990520.xxxx Add iter_type typedef to make it easy to change the
floating-point type for benchmarking purposes. float isn't any faster than
double on a Celeron. Also, I discover that the 333-MHz Celeron is in fact a
lot faster than the 233-MHz Pentium MMX at work, despite the Linux
"BogoMIPS" values.

19990603.2430 Change prototype of main() so gcc doesn't complain.

19990628.2422 Check current state of luams-3 output (it has been running as
a background task for about 5 weeks now) and find that it has finished the
131072 gridsize and one iteration (out of 20) of the next gridsize:

  Gridsize 131072, itmax 67108864
  Area 1.506591696 +- 0.000000627  Its: 1085772250631444522
  Center of gravity -0.2867683146 +- 0.0000002961  Actual: 22749243215253
  Total computation time: 2100630400 msec, 75808 KFLOPs

It is notable that the area estimate is a little lower (by 0.000 000 030)
which I attribute to the higher itmax.

19991021 It has finished its 262144 grid:

  Gridsize 262144 (34359738368 pixels)
  ....................
  Gridsize 262144, itmax 134217728
  Area 1.506591772 +- 0.000000242  Its: 8685544402515235059
  Center of gravity -0.2867684377 +- 0.0000000748  Actual: 107475449884576
  Total computation time: 1303136528 msec, 76045 KFLOPs

The total computation time is wrong (because of overflow). The actual time
was about 9.9 million seconds or about 115 days; in real-time the program
was running continuously since 19990519. From this I derive my estimates:

  area = 1.50659177  +- 0.00000008
  c-g  = -0.28676844 +- 0.000000025

After a few weeks I add these results to my web pages but I don't post them
to a newsgroup until 2000.

{Another big hiatus here}
--------------------------------------------------

20030925 Make luams-3 a standard background task on S3. Replace
WRITE_INTERVAL with a variable, to make it cache a little more on each of
the already-finished cases each time it restarts. Increase max gridsize to
1048576 so it can be left running for a very long time without running out
of things to do. Port it to MacOS X 10.2.x (mainly involves using getrusage
instead of clock() and changes to how time is counted and printed out)

2003121x Divide std deviation by 3 when printing to more accurately reflect
the error in the area estimate.

20040903 gridsize 524288 has finished, but the work was wasted because
g_inset overflowed. Change it to 64-bit.

20040906 Fix overflow bug in g_inset that caused loss of entire 524288-size
grid results.

{Another big hiatus here}
--------------------------------------------------

20101017 Get it running again (mainly just updating the makefile after an
upgrade to MacOS 10.6 caused me to audit all of my makefiles)

20101019 After about 1.5 days as a single thread, it has completed all
gridsizes up to 131072. Add COMPILE_DATE, usage() and command-line argument
parsing. Add -slice option to permit use of multiple copies of #luams# for
concurrency (which can be run on separate machines and then merged by
copying the cache files to one location, or can all be run on the same
machine). Compute KFLOPs with floating-point math. Add check_sizes() and
update the definitions of signed16, etc. for 64-bit compilers.
(The existing cache files have to be trashed because some of the variables
were the wrong size. They are now archived as "cache-20101019.1400.zip")
I do some sanity testing with gridsize 4096, then start a set of 4 slices,
now starting over from scratch.

20101023: Change R_LO, R_HI, and I_HI; add standard deviation to output
Calculate standard error based on actual number of samples being averaged
(rather than hard-coded to 3.0 as before), and multiply by 1.96 to get the
95% confidence interval; these combine to make the new std error estimate
more conservative than before (because 1.96/sqrt(20) is greater than 1/3.0).
Std dev and std err are both now returned explicitly from m_area_stat, and
both are printed
Do not print center of gravity average if PERFECT_CARDIOID is defined
Keep track of total number of pixels tested (g_tested variable) which is
now also saved in the cache files.
This version is in "old-versions/luams-20101023.txt"

20101024 I have made several important changes over the past 24 hours, and
now have a sync issue: the cache files from the 20101019 and earlier
versions will not work with the new changes, but I want to "salvage" as
much of that cached work as I can to get some results for future
comparison. So I am going to re-do the recent changes in an order that
allows me to keep the old cache structure for as long as possible. First
few changes:
 * Add SKEW_INC_MAX define, use it in the one place where STAT_NSAMP was
   being used to compute skew_inc, so that I can change the number of
   samples without affecting the skews (and therefore the cache file names)
 * Do not print center of gravity average if PERFECT_CARDIOID is defined
 * Display PERFECT_CARDIOID and USE_FILES options after COMPILE_DATE
 * Calculate standard error based on actual number of samples being
   averaged (rather than hard-coded to 3.0 as before), and multiply by 1.96
   to get the 95% confidence interval; these combine to make the new std
   error estimate more conservative than before (because 1.96/sqrt(20) is
   greater than 1/3.0).
 * Std dev and std err are both now returned explicitly from m_area_stat,
   and both are printed
This version of the code is in "old-versions/luams-20101024.txt", and the
cache files are saved as "cache-20101024.1930.zip". I continue to make
changes, now diverging from the old cache file format:
 * Store R_LO, R_HI and I_HI in each cache file and abort with an error if
   an existing cache file has different values
 * Change R_LO, R_HI, and I_HI (as was done yesterday); the ranges are now
   (-2.02 .. 0.49) and (0.00 .. 1.15)
 * Disable PERFECT_CARDIOID (probably permanently)
 * Add -gmin and -gmax options
 * Add g_row_mass, used to accumulate the mass total one row at a time, to
   minimize the cumulative effect of roundoff error that would otherwise
   influence the precision of the answer at the larger gridsizes.
 * DEFAULT_GS_MIN is now 25; gridsizes are now centered on the nominal size
   (for example, for gridsize 128 it used to go from 128 to 147, but now it
   will go from 118 to 137). Add save_or_reload().

20101028 Finish gridsize groups from 25 up to 204800 on MP16. Stats for the
last few gridwidths are:

  Gridwidth  51200: Area 1.506592211(521),  CoG -0.286768176(129);  8223 sec.
  Gridwidth 102400: Area 1.506591964(223),  CoG -0.2867683969(465); 8912 sec.
  Gridwidth 204800: Area 1.5065919790(936), CoG -0.2867683744(266); 8912 sec.

20111228 In response to email from Warren Smith, I decide to continue with
the next gridwidth 409600 (using first 1 task, then 4) on MP16.
This should take about 7 or 8 days to complete (each task is taking just a
little over 30 hours to complete a grid, and each task is doing 5 grids).

20120105 Add a few formatting routines (f64_expon, snfu_tag, and
snformat_unc), not yet complete.

20120107 At the end of the gridwidth 409600 run started on 1228, I have
these results:

  Gridwidths 409590-409609 at Nmax 209715200....................
  Gridsize 409600, itmax 209715200
  Area 1.506591856 +- 2.54e-08 (n=20, sigma 5.80e-08)
  Center of gravity -0.2867684229 +- 1.11e-08 (n=20, sigma 2.54e-08)
  Its: -1618709677335174482; Actual: 140360269258091
  pixels in first sample: 20059349338 of 76864256651
  Total computation time: 691255 sec, 1.42 GFLOPs

This result shows a bug: the "Its:" number (effective number of iterations)
overflowed its range (which was 64-bit integer). This needs to be changed
to 128-bit or double. In addition, in order for the numbers to be right
when used with -slice, I need to make each cache file store only its own
g_agg_its and g_act_its values, not the running totals for the entire set
of grids.

20120208 Report error on fopen(filename, "w"). Eliminate use of ugly
"goto"s.

20120209 Replace g_agg_its and g_act_its with g_this_its, g_effc_its,
g_ttl_its and g_t_efc_its, all of which are now 128-bit integers. This
makes the cache files incompatible with anything I did prior to this point.
I also work on snformat_unc; get it mostly working. Change use of l_ticks
and g_ticks; add f_ticks (this parallels the changes made to the g_xxx_its
variables so that the benchmark stats work regardless of the -slice
option).

20120210 Add s128_str

20120222 Report ns_done rather than STAT_NSAMP for (n=XX) value (only
affects slice mode)

20120224 Add g_dots and display current imaginary coordinate as a sign of
progress.

FUTURE CHANGES:
 - Area of cubic Mandelbrot set (Z^3+C) and higher orders.
 - Add an argument option to override STAT_NSAMP

 | end of revision history
 |____________________________________________________________________________*/

/* #include "sys-includes.h" */
#ifdef MACOSX
# include <sys/time.h>      /* for getrusage() and struct rusage */
# include <sys/resource.h>
#else
# include <sys/time.h>
# include <sys/resource.h>
# include <unistd.h>
#endif

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <math.h>
#include <signal.h>
#include <time.h>

// Parameters that control the amount and precision of area measurement.
// Because of the way the cache files are named, any of these parameters can
// be changed and the program can be recompiled without losing earlier work
// and without clearing the cache.

// Gridsizes to calculate
#define DEFAULT_GS_MIN 25
#define DEFAULT_GS_MAX 2000000

// Iterations limit multipliers to try.
// The 19980218 results used a multiplier of 256.
#define IM_MIN 512
#define IM_MAX 512

// Number of runs to average together to determine standard deviation.
// The standard error goes down by the square root of this value, and
// goes down just a little faster when increasing the gridsize.
#define STAT_NSAMP 20

#define SKEW_INC_MAX 20

/* The real coordinates in the Mandelbrot set vary from -2.0 to
   +0.471185... and the imaginary coordinates range from +-1.122757...
   See: mrob.com/pub/muency/easternmostpoint.html
        mrob.com/pub/muency/northernmostpoint.html
   We conservatively add about 0.02 in all directions. */
#define R_LO -2.02
#define R_HI 0.49
#define I_LO 0.00
#define I_HI 1.15

typedef unsigned short unsigned16;
typedef signed int signed32;
typedef unsigned int unsigned32;
typedef signed long signed64;
typedef unsigned long unsigned64;

/* For 128-bit I probably should be converting LUAMS to C++ so I can use my
   own int128 library. However, this works in the short-term.
   According to some on the internet, something like "int128_t" or
   "__int128_t" might be provided by compilers. */
typedef signed   __attribute__((__mode__(__TI__))) signed128;
typedef unsigned __attribute__((__mode__(__TI__))) unsigned128;

typedef double iter_type;

typedef struct {
  double ar_area;
  double ar_gravity;
} area_result;

// dwell completion procedure pointer
typedef void(*dwell_cpp)(iter_type, long, long);

typedef void(*dwell_proc)(iter_type r, iter_type i, signed32 itmax,
  signed32 im_sqrt, dwell_cpp cp);

signed64 rpm_clock(void);
void dcp1(iter_type r, long dwell, long its);
void dwell1(iter_type r, iter_type i, signed32 itmax, signed32 im_sqrt,
  dwell_cpp cp);
void dwell2(iter_type r, iter_type i, signed32 itmax, signed32 im_sqrt,
  dwell_cpp cp);
area_result m_area(signed32 gridsize, signed32 itmax, signed32 im_sqrt,
  signed32 skew_v);
area_result m_area_stat(signed32 gridsize, signed32 itmax, signed32 nsamp,
  area_result *std_dev, area_result *std_err, int * ns_done);
void loop1(void);
int main(int, char **);

signed32 g_gs_min;
signed32 g_gs_max;
signed128 g_this_its;
signed128 g_effc_its;
signed128 g_ttl_its;
signed128 g_t_efc_its;
unsigned32 g_itmax;
unsigned64 g_inset;
unsigned64 g_tested;
unsigned64 g_1st_inset;
unsigned64 g_1st_tested;
double g_row_mass;
double g_mass;

// l_ticks counts the total number of ticks (microseconds on Linux) spent on
// the current grid. f_ticks is the number of ticks that have elapsed since
// the last time we wrote out to a cache file. Every time we write a file we
// transfer f_ticks into l_ticks and zero out f_ticks, so that f_ticks
// doesn't overflow.
signed64 g_ticks;
signed64 l_ticks;
signed64 f_ticks;

unsigned16 g_quit;
dwell_proc g_dp;

const double k256 = 256.0;
const double k96 = 96.0;
const double k32 = 32.0;
const double k16 = 16.0;
const double k7 = 7.0;
const iter_type k4 = 4.0;
const double k3 = 3.0;
const iter_type k2 = 2.0;
const double k1 = 1.0;
const double kpi = 3.1415926535897932;

void s128_str(char *s, signed128 x);
void s128_str(char *s, signed128 x)
{
  int d, nd;
  signed128 t, k10;

  if (x == 0) {
    *s = '0'; s++; *s = 0;
    return;
  }
  if (x < 0) {
    *s = '-'; s++;
    x = - x;
  }
  k10 = 10;
  // Extract each digit (in reverse order)
  nd = 0;
  while(x > 0) {
    t = x / k10;
    d = x - (t * k10);
    s[nd++] = ('0' + d);
    x = t;
  }
  s[nd] = 0;
  /* Un-reverse the output string */
  int i; char tc;
  for (i=0; i*2 < nd; i++) {
    tc = s[nd-1-i]; s[nd-1-i] = s[i]; s[i] = tc;
  }
}

// Define PERFECT_CARDIOID to add the exact value of the period-1 and -2
// components' areas rather than pixel-counting them. This reduces the
// computation time a little (less than you'd think!) and also reduces the
// standard error for small grid sizes, but not for big grid sizes.
//#define PERFECT_CARDIOID

#define USE_FILES // To save work in disk files, to recover from power failures

/* The write interval constants are in the same units as rpm_clock() */
#define INIT_WRITE_INTERVAL 500000LL
#define MAX_WRITE_INTERVAL 5000000LL
signed64 write_interval;

//#define DEBUG_WRITES

int g_slice;
int g_of;

/* Measure elapsed runtime in microseconds */
signed64 rpm_clock(void)
{
  signed64 rv;
  struct rusage ru;
  int err;

  err = getrusage(RUSAGE_SELF, &ru);
  if (err) {
    printf("rpm_clock err %d\n", err);
    exit(1);
  }
  rv = (signed64) ru.ru_utime.tv_sec;
  rv = (1000000LL * rv) + (signed64) ru.ru_utime.tv_usec;
  return(rv);
}

void check_sizes(void)
{
  int s;

  s = (int) sizeof(unsigned16);
  if (s != 2) {
    printf("sizeof(u16) == %d, expected size was 2\n", s);
    exit(-1);
  }
  s = (int) sizeof(unsigned32);
  if (s != 4) {
    printf("sizeof(u32) == %d, expected size was 4\n", s);
    exit(-1);
  }
  s = (int) sizeof(unsigned64);
  if (s != 8) {
    printf("sizeof(u64) == %d, expected size was 8\n", s);
    exit(-1);
  }
  s = (int) sizeof(unsigned128);
  if (s != 16) {
    printf("sizeof(u128) == %d, expected size was 16\n", s);
    exit(-1);
  }

  if (0) {
    /* Test signed128 type. Note that printf has no format selector for
       this type. */
    signed128 x;
    x = 1;
    for(s=0; s<128; s++) {
      /* Note that printf has no format selector for 128-bit integers. */
      printf("%d %10g\n", s, ((double) x));
      x += x;
    }
    exit(0);
  }
}

void fmt_1_999(char * s, double x);
void fmt_1_999(char * s, double x)
{
  strcpy(s, "");
  if (x >= 100.0) {
    sprintf(s, "%3.0f", x);
    if (strlen(s) > 3) { strcpy(s, "999"); }
  } else if (x >= 10.0) {
    sprintf(s, "%3.1f", x);
    if (strlen(s) > 4) { strcpy(s, "99.9"); }
  } else if (x >= 1.0) {
    sprintf(s, "%3.2f", x);
    if (strlen(s) > 4) { strcpy(s, "9.99"); }
  } else {
    strcpy(s, "1.00");
  }
}

void format_si(char * s, double x);
void format_si(char * s, double x)
{
  if (x < 0.0) { strcpy(s, "neg err"); }
  else if (x == 0) { strcpy(s, " 0 "); }
  else if (x >= 1.0e18) { sprintf(s, "%2.1e", x); }
  else if (x >= 1.0e15) { fmt_1_999(s, x/1.0e15); strcat(s, " P"); }
  else if (x >= 1.0e12) { fmt_1_999(s, x/1.0e12); strcat(s, " T"); }
  else if (x >= 1.0e9)  { fmt_1_999(s, x/1.0e9);  strcat(s, " G"); }
  else if (x >= 1.0e6)  { fmt_1_999(s, x/1.0e6);  strcat(s, " M"); }
  else if (x >= 1.0e3)  { fmt_1_999(s, x/1.0e3);  strcat(s, " K"); }
  else if (x >= 1.0)    { fmt_1_999(s, x);        strcat(s, " ");  }
  else { sprintf(s, "%2.1e", x); }
}

int f64_expon(double x, signed32 * expon);
int f64_expon(double x, signed32 * expon)
{
  if ((x == 0) || isnan(x) || isinf(x)) { return -1; }
  signed32 e = 0;
  x = fabs(x);
  while(x < 1.0)   { e--; x *= 10.0; }
  while(x >= 10.0) { e++; x /= 10.0; }
  if (expon) { *expon = e; }
  return 0;
}

char snfu_scratch[32];

int snfu_tag(char * s, long n, char * fmt, double x,
  char **lead, char **digits, char **trail);
/* Perform an snprintf with the given format string, and "tag" the result
   with pointers to the "leader", digits, and "trailer". */
int snfu_tag(char * s, long n, char * fmt, double x,
  char **lead, char **digits, char **trail)
{
  int l1;

  if (n <= 0) { return -1; }
  if ((lead == 0) || (digits == 0) || (trail == 0)) { return -1; }
  *lead = 0; *digits = 0; *trail = 0;

  /* %%% Here we should make sure fmt fits within the constraints we're
     willing to deal with: contains exactly one '%', has an 'e', 'f' or 'g'
     at the end, etc.
   */

  /* Get the digits, and also find out how much snprintf wants */
  l1 = snprintf(s, n, fmt, x);
  if (l1+1 > n) { return -1; }
  *lead = s;

  /* Find lead digit */
  char * ld = s;
  int gg = 1;
  while(gg) {
    if ((*ld >= '1') && (*ld <= '9')) { gg = 0; }
    else if (*ld == 0) { gg = 0; }
    else { ld++; }
  }
  if (*ld == 0) { return 0; }
  *digits = ld;

  /* Skip digits */
  while((*ld == '.') || ((*ld >= '0') && (*ld <= '9'))) {
    ld++; /* Skip a character */
  }
  *trail = ld;
  return l1;
}

int snformat_unc(char * s, long n, double x, double unc);
int snformat_unc(char * s, long n, double x, double unc)
{
  int l1, l2, i;
  const char * spec1 = "(Uncertain)";
  char ts[30];
  char * lead; char * digits; char * trail;
  char * s_start = s;

  if (fabs(unc) > fabs(x)) {
    /* The uncertainty is bigger than the value */
    if (strlen(spec1)+1 > n) { return -1; }
    strncpy(s, spec1, n);
    return strlen(s);
  }

  /* Check for 0 */
  if (x == 0.0) { s[0] = '0'; s[1] = 0; return 1; }

  /* Deal with negatives */
  if (x < 0) { *s++ = '-'; n--; x = - x; }
  if (unc < 0) { unc = - unc; }

  /* Get the digits, and also find out how much snprintf wrote */
  l1 = snfu_tag(ts, sizeof(ts), "%.17g", x, &lead, &digits, &trail);
  if (digits == 0) { s[0] = 0; return 0; }

  char t2[30];
  int gotdot = 0;
  for(i=0; (i+1 < sizeof(t2)) && (digits + i < trail); i++) {
    t2[i] = digits[i];
    if (t2[i] == '.') { gotdot = 1; }
  }
  if (gotdot == 0) { t2[i++] = '.'; }
  while (i+1 < sizeof(t2)) { t2[i++] = '0'; }
  t2[i] = 0;

  /* We want to add three uncertainty digits inside '()' plus the trailing
     null */
  l2 = l1+5;
  if (l2+1 > n) { s[0] = 0; return 0; }

  /* Determine how many lead digits are in the clear */
  int ex = 0; int eu = 0;
  f64_expon(x, &ex);
  f64_expon(unc, &eu);
  int need = ex - eu;
  /* Put at most 15 digits in the lead */
  if (need > 15) { need = 15; }

  /* Copy everything up to (and not including) the lead digit */
  int len = digits - lead;
  for(i=0; i<len; i++) { *s++ = lead[i]; }

  /* Copy the digits that are "in the clear" */
  i = 0;
  while (need > 0) {
    /* If this character is a digit, count it */
    if ((t2[i] >= '0') && (t2[i] <= '9')) { need--; }
    *s++ = t2[i];
    i++; /* Next digit */
  }
  /* Then 3 more value digits */
  need = 3;
  while (need > 0) {
    /* If this character is a digit, count it */
    if ((t2[i] >= '0') && (t2[i] <= '9')) { need--; }
    *s++ = t2[i];
    i++; /* Next digit */
  }
  *s++ = '(';

  /* Then 3 digits of the uncertainty */
  char us[30];
  char * ulead; char * udig; char * utr;
  /* Get the digits, and also find out how much snprintf wrote */
  l1 = snfu_tag(us, sizeof(us), "%.3e", unc, &ulead, &udig, &utr);
  need = 3;
  int j = 0;
  while ((need > 0) && (udig[j])) {
    /* If this character is a digit, count it */
    if ((udig[j] >= '0') && (udig[j] <= '9')) { need--; *s++ = udig[j]; }
    j++; /* Next digit */
  }
  while (need > 0) { *s++ = '0'; need--; }
  *s++ = ')';

  /* Add trail */
  j = 0;
  while(trail[j]) { *s++ = trail[j++]; }

  /* trailing null, and we be done. */
  *s = 0;
  return (int) (s - s_start);
}

void my_handler(int signal);
void my_handler(int s)
{
  g_quit = 1;
  printf("\n\nGot signal %d, exiting!", s);
}

void dcp1(iter_type r, long dwell, long act_its)
{
  g_tested++;
  if (dwell > g_itmax) {
    g_inset++;
    g_row_mass += r;
    g_dp = &dwell1; // dwell1 or dwell3
  } else if (dwell > 50) {
    g_dp = &dwell1; // dwell1 or dwell3
  } else {
    g_dp = &dwell1; // dwell1 or dwell3
  }
  g_effc_its += dwell;
  g_this_its += act_its;
}
// This is a relatively straightforward iterate routine, except that it
// detects loops.
void dwell1(iter_type r, iter_type i, signed32 itmax, signed32 im_sqrt,
  dwell_cpp d1_cp)
{
  iter_type cr, ci;
  register iter_type zr, zi;
  register iter_type sr, si;
  register iter_type zr2, zi2, rad2;
  signed32 rv_its;
  signed32 act_its;
  signed32 n1, n2;
  int escape_flag = 0;
  int cycle_flag = 0;

  rv_its = 1;
  act_its = 1;
  zr = r; zi = i;
  cr = r; ci = i / k2;
  sr = zr; si = zi;
  for (n1=2; n1<=im_sqrt; n1++) {
    sr = zr; si = zi;
    for (n2=0; n2<n1; n2++) {
      /* One step of z -> z^2 + c.  (ci holds half the imaginary part of c,
         so the imaginary update is 2*(zr*zi + ci).) */
      zr2 = zr * zr;
      zi2 = zi * zi;
      rad2 = zr2 + zi2;
      zi = k2 * (zr * zi + ci);
      zr = zr2 - zi2 + cr;
      act_its++;
      if (rad2 > k4) {
        escape_flag = 1;
        n1 = n2 = im_sqrt + 1; /* Make both loops exit */
      } else if ((zr == sr) && (zi == si)) {
        rv_its = itmax+1;
        cycle_flag = 1;
        n1 = n2 = im_sqrt + 1; /* Make both loops exit */
      }
    }
  }
  if ((cycle_flag == 0) && (escape_flag == 0)) {
    /* We did not cycle or escape, and we ran out of iterations. Return
       act_its+1 just to make sure it meets the criteria. */
    act_its++;
  }
  if (cycle_flag == 0) {
    /* We did not cycle, but might have escaped. Here I want to return the
       actual iteration count. */
    rv_its = act_its;
  }
  /* Now call the completion routine, to update the statistics for this
     grid (total number of pixels, area total, etc.) */
  (*d1_cp)(r, rv_its, act_its);
}

// dwell2 checks to see if the point is in R2a or R2.1/2a, and
// calls (*g_dp) if it isn't. (*g_dp) should point to dwell1 or
// another dwell routine.
void dwell2(iter_type r, iter_type i, signed32 itmax, signed32 im_sqrt,
  dwell_cpp cp)
{
  iter_type i2, s, s2;
  signed32 dwell, act_its;

  // Test for the period-1 cardioid (Jay Hill's test)
  i2 = i * i;
  s = r * r + i2;
  s2 = (k256*s - k96)*s + k32*r;
  if (s2 < k3) {
    /* We do not need to call the dwell routine */
  } else {
    // Test for the period-2 component
    s2 = k16*s + k32*r + k16;
    if (s2 < k1) {
      /* We do not need to call the dwell routine */
    } else {
      /* Yes we do need to call the dwell routine */
      (*g_dp)(r, i, itmax, im_sqrt, cp); // dwell1 or dwell3
      /* dwell routine calls *cp, so we don't need to do that. */
      return;
    }
  }
  /* If we get here, it means the point is either in the cardioid or the
     period-2 component (largest mu-atom). */
  act_its = 1;
  // If we're using the PERFECT_CARDIOID method, we report this
  // point as being *outside* the set because the exact areas of
  // R2a and R2.1/2a are added by m_area
#ifdef PERFECT_CARDIOID
  dwell = 1;
#else
  dwell = itmax+1;
#endif
  (*cp)(r, dwell, act_its); // dcp1
  return;
}
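/* For reference, each cache file that save_or_reload() reads and writes is
   a sequence of raw binary fields in this order (they are written one
   fwrite() at a time below, not as a struct; layout shown assuming
   iter_type == double):

     iter_type  i;            // current imaginary coordinate (resume point)
     unsigned64 g_inset;      // pixels found inside the Set so far
     unsigned64 g_tested;     // pixels tested so far
     double     g_mass;       // accumulated mass, for center of gravity
     int        firstrow;     // nonzero if still on the (special) first row
     signed128  g_this_its;   // actual iterations for this grid
     signed128  g_effc_its;   // effective iterations for this grid
     signed64   l_ticks;      // CPU microseconds spent on this grid
     iter_type  R_LO, R_HI, I_HI;  // grid ranges, checked on reload
*/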
void save_or_reload(int * loaded, signed64 * wrote_time, signed64 * now_time,
  iter_type * i, int * firstrow, char * filename);
void save_or_reload(int * loaded, signed64 * wrote_time, signed64 * now_time,
  iter_type * i, int * firstrow, char * filename)
{
  FILE *f;

  if (*loaded) {
    // Time to write?
    *now_time = rpm_clock();
    *now_time = *now_time - *wrote_time;
#ifdef USE_FILES
    if (*now_time > write_interval) {
      // Only write if we're still running "normally". This is
      // mainly so I don't have to worry too much about making
      // my loops exit gracefully.
      if (g_quit == 0) {
        iter_type tfloat;
        f_ticks += rpm_clock(); // stop timer to write value to file
        // printf("about to write; f_ticks == %ld\n", f_ticks);
        l_ticks += f_ticks;
        // printf("l_ticks now %ld\n", l_ticks);

        // write
        f = fopen(filename, "w");
        if (f == 0) {
          fprintf(stderr, "Could not open '%s' for writing.\n", filename);
          exit(-1);
        }
        fwrite(i, sizeof(*i), 1, f);
        fwrite(&g_inset, sizeof(g_inset), 1, f);
        fwrite(&g_tested, sizeof(g_tested), 1, f);
        fwrite(&g_mass, sizeof(g_mass), 1, f);
        fwrite(firstrow, sizeof(*firstrow), 1, f);
        fwrite(&g_this_its, sizeof(g_this_its), 1, f);
        fwrite(&g_effc_its, sizeof(g_effc_its), 1, f);
        fwrite(&l_ticks, sizeof(l_ticks), 1, f);
        tfloat = R_LO; fwrite(&tfloat, sizeof(tfloat), 1, f);
        tfloat = R_HI; fwrite(&tfloat, sizeof(tfloat), 1, f);
        tfloat = I_HI; fwrite(&tfloat, sizeof(tfloat), 1, f);
        fclose(f);
        *wrote_time = rpm_clock();
#ifdef DEBUG_WRITES
        printf("Wrote i = %10.7f\n", *i);
#endif
        // printf("restart f_ticks at 0\n");
        f_ticks = 0;            // zero timer
        f_ticks -= rpm_clock(); // start timer
      }
      if (write_interval < MAX_WRITE_INTERVAL) {
        write_interval += INIT_WRITE_INTERVAL;
      }
    }
#endif // USE_FILES
  } else {
    /* Not yet loaded cache file */
#ifdef USE_FILES
    f = fopen(filename, "r");
    if (f) {
      iter_type t_rlo, t_rhi, t_ihi;
      // Reload from file
      fread(i, sizeof(*i), 1, f);
      fread(&g_inset, sizeof(g_inset), 1, f);
      fread(&g_tested, sizeof(g_tested), 1, f);
      fread(&g_mass, sizeof(g_mass), 1, f);
      fread(firstrow, sizeof(*firstrow), 1, f);
      fread(&g_this_its, sizeof(g_this_its), 1, f);
      fread(&g_effc_its, sizeof(g_effc_its), 1, f);
      fread(&l_ticks, sizeof(l_ticks), 1, f);
      // printf("read l_ticks == %ld\n", l_ticks);
      fread(&t_rlo, sizeof(t_rlo), 1, f);
      fread(&t_rhi, sizeof(t_rhi), 1, f);
      fread(&t_ihi, sizeof(t_ihi), 1, f);
      fclose(f);
      /* The following checks also help ensure endianness and alignment of
         the preceding data fields */
      if (t_rlo != R_LO) {
        printf("\n");
        printf("Cannot use cache file '%s'\n", filename);
        printf("  because its R_LO value is %10g and we expect %10g\n",
          t_rlo, R_LO);
        exit(-1);
      }
      if (t_rhi != R_HI) {
        printf("\n");
        printf("Cannot use cache file '%s'\n", filename);
        printf("  because its R_HI value is %10g and we expect %10g\n",
          t_rhi, R_HI);
        exit(-1);
      }
      if (t_ihi != I_HI) {
        printf("\n");
        printf("Cannot use cache file '%s'\n", filename);
        printf("  because its I_HI value is %10g and we expect %10g\n",
          t_ihi, I_HI);
        exit(-1);
      }
#ifdef DEBUG_WRITES
      printf("Continuing from i = %10.7f\n", *i);
#endif
      f_ticks = 0;
      // Since any calculation up to this point has been replaced
      // with the cache values, we need to restart the timer at 0
      f_ticks -= rpm_clock();
    } else {
      // No cache file. Schedule a cache write 1 second from now.
      *wrote_time = rpm_clock() - write_interval + 1000000L;
    }
#endif // USE_FILES
    *loaded = 1;
  }
}

int g_num_dots;
char g_dots[STAT_NSAMP+1];

area_result m_area(signed32 gridsize, signed32 itmax, signed32 im_sqrt,
  signed32 skew_v)
{
  area_result rv;
  iter_type r, i2;
  iter_type ri_inc;
  iter_type row0_adjust;
  double pix_area;
  double result;
  double skew_d;
  iter_type i;
  int firstrow;
  char filename[200];
  signed64 wrote_time, now_time;
  int loaded;

  wrote_time = rpm_clock();
  loaded = 0;
  sprintf(filename, "cache/%ld,%ld@%ld",
    (long) gridsize, (long) skew_v, (long) itmax);
  // printf("filename: %s", filename);
  ri_inc = (R_HI - R_LO) / gridsize;
  skew_d = skew_v;
  skew_d /= 65536.0;
  skew_d *= ri_inc;
  pix_area = ri_inc*ri_inc;
  g_inset = 0;
  g_tested = 0;
  g_mass = 0;
  g_itmax = itmax;
  firstrow = 1;
  for (i = I_LO + skew_d; (i < I_HI) && (g_quit == 0); i += ri_inc) {
    save_or_reload(&loaded, &wrote_time, &now_time, &i, &firstrow, filename);
    i2 = i;
    if (firstrow) {
      i2 = (i+(ri_inc/2.0))/2.0;
    }
#ifndef DEBUG_WRITES
    if (g_quit == 0) {
      printf("\r%s%9.6f ", g_dots, i2);
      fflush(stdout);
    }
#endif
    g_row_mass = 0;
    for (r = R_LO - skew_d; (r < R_HI) && (g_quit == 0); r += ri_inc) {
      dwell2(r, i2, itmax, im_sqrt, &dcp1);
    }
    if (firstrow) {
      // printf("%qd->", g_inset);
      i2 = g_inset;
      row0_adjust = (i+(ri_inc/2.0))/ri_inc;
      i2 *= row0_adjust;
      g_inset = i2;
      g_row_mass *= row0_adjust;
      // printf("%qd ", g_inset);
      firstrow = 0;
    }
    g_mass = g_mass + g_row_mass;
  }
  result = g_inset;
  result *= pix_area;
  result *= k2;
#ifdef PERFECT_CARDIOID
  // Add the exact area of the cardioid and the period-2 component.
  // %%% this works for the area but not for g_mass, so we get the wrong
  // answer for the center of gravity
  result += (k7 * kpi)/k16;
#endif
  rv.ar_area = result;
  rv.ar_gravity = g_mass / g_inset;
  return rv;
}

area_result m_area_stat(
  signed32 gridsize,     // Initial gridsize
  signed32 itmax,        // Itmax for all samples
  signed32 nsamp,        // Number of samples to take
  area_result *std_dev,  // Sample standard deviation results will go here
  area_result *std_err,  // Standard error results will go here
  int * ns_done)         // Number of samples taken will go here
{
  double a_sum, g_sum;
  double a_sum_sqr, g_sum_sqr;
  signed32 i, skew_v, skew_inc;
  double area, gravity;
  double a_mean, g_mean;
  area_result mar;
  area_result rv;
  signed32 its, n1, n2, im_sqrt;
  signed32 gs_start, gs_end;

  gs_start = gridsize - (nsamp/2);
  gs_end = gs_start + nsamp;
  printf("Gridwidths %ld-%ld at Nmax %ld\n",
    (long) gs_start, (long) (gs_end-1), (long) itmax);
  fflush(stdout);
  g_num_dots = 0;
  *ns_done = 0;
  (*std_dev).ar_area = 0.0;
  (*std_dev).ar_gravity = 0.0;
  (*std_err).ar_area = 0.0;
  (*std_err).ar_gravity = 0.0;
  rv.ar_area = 0.0;
  rv.ar_gravity = 0.0;

  /* Compute im_sqrt.
     The algorithm is dumb but doesn't matter because we only do it once
     per unique value of itmax */
  its = 0;
  im_sqrt = 0;
  for (n1=2; n1<itmax; n1++) {
    for (n2=0; n2<n1; n2++) {
      if (its > itmax) {
        im_sqrt = n1;
        n1 = n2 = itmax; /* Make both loops end now */
      }
      its++;
    }
  }
  if (im_sqrt == 0) {
    fprintf(stderr, "Error: did not detect im_sqrt\n");
    exit(-1);
  }

  g_ttl_its = 0;
  g_t_efc_its = 0;
  a_sum = a_sum_sqr = 0;
  g_sum = g_sum_sqr = 0;
  skew_inc = 65536 / SKEW_INC_MAX; // was "nsamp"
  skew_v = skew_inc / 2;
  g_ticks = 0;
  g_1st_inset = g_1st_tested = 0;
  gridsize = gs_start;
  for (i=0; (i<nsamp) && (g_quit == 0); i++) {
    /* In slice mode, only compute the samples assigned to this slice */
    if ((g_of <= 1) || ((i % g_of) == g_slice)) {
      g_this_its = 0;
      g_effc_its = 0;
      g_dp = &dwell1;
      mar = m_area(gridsize, itmax, im_sqrt, skew_v);
      if (g_quit == 0) {
        area = mar.ar_area;
        gravity = mar.ar_gravity;
        a_sum += area;  a_sum_sqr += area * area;
        g_sum += gravity;  g_sum_sqr += gravity * gravity;
        if (*ns_done == 0) {
          g_1st_inset = g_inset;
          g_1st_tested = g_tested;
        }
        g_ttl_its += g_this_its;
        g_t_efc_its += g_effc_its;
        g_ticks += l_ticks + f_ticks;
        (*ns_done)++;
        g_dots[g_num_dots++] = '.';
        g_dots[g_num_dots] = 0;
      }
    }
    gridsize++;
    skew_v += skew_inc;
  }
  if (*ns_done > 0) {
    a_mean = a_sum / ((double) *ns_done);
    g_mean = g_sum / ((double) *ns_done);
    rv.ar_area = a_mean;
    rv.ar_gravity = g_mean;
    if (*ns_done > 1) {
      (*std_dev).ar_area =
        sqrt((a_sum_sqr - a_sum*a_mean) / ((double) (*ns_done - 1)));
      (*std_dev).ar_gravity =
        sqrt((g_sum_sqr - g_sum*g_mean) / ((double) (*ns_done - 1)));
      (*std_err).ar_area = (*std_dev).ar_area / sqrt((double) *ns_done);
      (*std_err).ar_gravity = (*std_dev).ar_gravity / sqrt((double) *ns_done);
    }
  }
  printf("\n");
  return rv;
}

void loop1(void)
{
  signed32 gs, im, itmax;
  int ns_done;
  double perf;
  area_result mar, std_dev, std_err;

  for (gs = g_gs_min; (gs <= g_gs_max) && (g_quit == 0); gs += gs) {
    for (im = IM_MIN; (im <= IM_MAX) && (g_quit == 0); im += im) {
      itmax = gs * im;
      mar = m_area_stat(gs, itmax, STAT_NSAMP, &std_dev, &std_err, &ns_done);
      if (g_of > 1) {
        printf("RUNNING AS SLICE %d of %d SLICES\n", g_slice, g_of);
      }
      if (g_of > 4) {
        /* More than 4 slices means less than 5 samples per slice */
        printf("  Not enough samples for meaningful averages.\n");
      } else if (g_of > 1) {
        printf(
          "  (These stats reflect %d samples out of the full set of %d)\n",
          ns_done, STAT_NSAMP);
        printf("\n");
      }
      if (g_of <= 4) {
        char s[30];
        /* 4 or fewer slices means we have enough samples to calculate the
           average and standard error, etc. */
        printf("  Gridsize %ld, itmax %ld\n", (long) gs, (long) itmax);
        snformat_unc(s, sizeof(s), mar.ar_area, 1.96 * std_err.ar_area);
        printf("  Area %s (n=%d, sigma %7.2e)\n", s, ns_done,
          std_dev.ar_area);
        // printf(
        //   "  Area %11.9f +- %7.2e (n=%d, sigma %7.2e)\n",
        //   mar.ar_area, 1.96 * std_err.ar_area, ns_done, std_dev.ar_area);
#ifndef PERFECT_CARDIOID
        snformat_unc(s, sizeof(s), mar.ar_gravity, 1.96*std_err.ar_gravity);
        printf("  Center of gravity %s (n=%d, sigma %7.2e)\n", s, ns_done,
          std_dev.ar_gravity);
        // printf(
        //   "  Center of gravity %13.10f +- %7.2e (n=%d, sigma %7.2e)\n",
        //   mar.ar_gravity, 1.96 * std_err.ar_gravity, ns_done,
        //   std_dev.ar_gravity);
#endif
        s128_str(s, g_t_efc_its);
        printf("  Its: %20s", s);
        s128_str(s, g_ttl_its);
        printf("; Actual: %s\n", s);
        // printf("  Its: %10.05g; Actual: %10.05g\n",
        //   ((double) g_t_efc_its), ((double) g_ttl_its));
        printf("  pixels in first sample: %lu of %lu\n",
          g_1st_inset, g_1st_tested);
        perf = ((double) g_ticks)/1.0e6; /* Seconds */
        printf("  Total computation time: %g sec", perf);
        if (perf > 0.01) {
          char s[20];
          /* FLOPs per millisecond equals KFLOPs per second.
           */
          perf = ((double) g_ttl_its) * 7.0 / perf;
          format_si(s, perf);
          printf(", %sFLOPs", s);
        } else {
          /* Ran for less than 10 milliseconds */
          printf(" (too short for KFLOPs measure)");
        }
        printf("\n");
      }
    }
    printf("\n");
  }
}

void usage(void);
void usage()
{
  fprintf(stderr, "%s",
    "NAME\n"
    "  luams - measure area of Mandelbrot Set\n"
    "\n"
    "SYNOPSIS\n"
    "  luams [-slice N/M] [-gmin A] [-gmax B]\n"
    "\n"
    "EXAMPLES\n"
    "  luams -slice 3/4     (slice number is 0-based)\n"
    "\n"
  );
}

int main(int nargs, char **argv)
{
  char ** av;
  char *arg;
  int args;

  printf("LUAMS-3, %s", COMPILE_DATE);
#ifdef PERFECT_CARDIOID
  printf(", PERFECT_CARDIOID");
#endif
#ifdef USE_FILES
  printf(", USE_FILES");
#endif
  printf("\n\n");

  check_sizes();

  /* Set defaults */
  write_interval = INIT_WRITE_INTERVAL;
  g_slice = 0;
  g_of = 1;
  g_gs_min = DEFAULT_GS_MIN;
  g_gs_max = DEFAULT_GS_MAX;

  if (nargs > 1) {
    av = argv; args = nargs;
    av++; /* skip our program name/path */
    args--;
    while(args) {
      arg = *av; av++; args--;
      if (   (strcmp(arg, "-help") == 0) || (strcmp(arg, "-h") == 0)
          || (strcmp(arg, "--help") == 0) || (strcmp(arg, "--h") == 0)) {
        usage();
        exit(0);
      } else if (strcmp(arg, "-gmin") == 0) {
        char * s;
        int na, g;
        s = *av; av++; args--;
        if (s == 0) {
          fprintf(stderr, "-gmin requires an argument\n");
          exit(-1);
        }
        na = sscanf(s, "%d", &g);
        if (na != 1) {
          fprintf(stderr, "-gmin requires an argument\n");
          exit(-1);
        }
        if (g < 32) {
          fprintf(stderr, "-gmin argument must be at least 32.\n");
          exit(-1);
        }
        g_gs_min = g;
        printf("Min gridsize %d\n", g_gs_min);
      } else if (strcmp(arg, "-gmax") == 0) {
        char * s;
        int na, g;
        s = *av; av++; args--;
        if (s == 0) {
          fprintf(stderr, "-gmax requires an argument\n");
          exit(-1);
        }
        na = sscanf(s, "%d", &g);
        if (na != 1) {
          fprintf(stderr, "-gmax requires an argument\n");
          exit(-1);
        }
        if (g < 32) {
          fprintf(stderr, "-gmax argument must be at least 32.\n");
          exit(-1);
        }
        g_gs_max = g;
        printf("Max gridsize %d\n", g_gs_max);
      } else if (strcmp(arg, "-slice") == 0) {
        char * s;
        int na, slice, of;
        s = *av; av++; args--;
        if (s == 0) {
          fprintf(stderr, "-slice requires an argument like '3/7'\n");
          exit(-1);
        }
        na = sscanf(s, "%d/%d", &slice, &of);
        if (na != 2) {
          fprintf(stderr, "-slice requires an argument like '3/7'\n");
          exit(-1);
        }
        if (of <= 0) {
          fprintf(stderr,
            "second part of -slice argument must be a positive integer.\n");
          exit(-1);
        }
        if ((slice < 0) || (slice >= of)) {
          fprintf(stderr,
            "first part of -slice argument must be in the range 0 .. %d inclusive\n",
            (int) (of-1));
          exit(-1);
        }
        if (of > STAT_NSAMP) {
          fprintf(stderr, "number of slices should not be more than %d\n",
            (int) STAT_NSAMP);
          exit(-1);
        }
        g_slice = slice;
        g_of = of;
        printf("Will run as slice %d of %d\n", g_slice, g_of);
      }
    }
  }
  if (g_gs_max < g_gs_min) {
    fprintf(stderr, "-gmax must be at least as large as -gmin.\n");
    exit(-1);
  }

  signal(SIGINT, &my_handler);
  signal(SIGTERM, &my_handler);

  loop1();
  return 0;
}