Turing-Completeness of the Zuse Z3

The Zuse Z3 was a general-purpose automatic calculating machine designed and built by Konrad Zuse, a German civil engineer, from 1939 to 1941.

It read its instructions from a punched tape (using 35-mm film stock) and had the following instruction set. (To give an idea for the opcode format, a few known opcodes from the model Z4 are shown)

opcode	mnemonic	cycles	description
3 h l	Pr A	1	Load from memory at address A into register R1 or R2
2 h l	Ps A	0-1	Store register R1 into memory address A (0 cycles if prev op was a calculation)
1 0 3	Ls1	3	Calculate R1+R2 and store into R1
1 ? ?	Ls2	3	Calculate R1-R2 and store into R1
1 0 4	Lm	16	Calculate R1×R2 and store into R1
1 0 5	Li	20	Calculate R1/R2 and store into R1
1 ? ?	Lw	20	Calculate sqrt(R1) and store into R1
? ? ?	Lu	9-41	Wait for user input, convert to binary (and load into R1?)
? ? ?	Ld	9-41	Convert (R1?) into decimal, display output, wait for user

The ALU was able to perform each of the operations described above (add, subtract, multiply, divide, and square root) plus several more operations used specifcally for binary-decimal conversion (multiply or divide by 10, multiply or divide by 0.1, multiply or divide by 2, and negate)

What's missing in all this is any type of conditional branch. Zuse was aware of the possibility that such a thing might be useful, and wrote about it (using the term freien Rechenplanen, "free computing-plan", to refer to a program that used conditional calculations) but did not need it for any of his own work.

Since the Z3 is programmed by a punched tape, it is capable of looping unconditionally, just by connecting the beginning and end of the program together to make a continuous loop of tape. This allows a specified calculation to be performed iteratively.

Turing Completeness

To prove Turing-completeness, Raul Rojas shows how a set of expressions can be combined together into a larger expression whose value depends on a number of flag variables that are either 0 or 1. He goes to great length to explain how this can be done, and how that, in turn, can be used to access arbitrary-sized "arrays" of variables, and other similar things. The description is fairly complex.

However, all we need to do to prove Turing completeness is the following:

Let each variable have a value of either 0 or 1.

Define two boolean operations:

NOT: Negates a value. If the value is in variable x, then y = NOT x is computed by y = 1 - x.

AND: Combines two values together by a boolean "and". If the values to be combined are x and y, then z = x AND y is computed by z = x y.

Recall that NOT(x AND y) = x NAND y, and that any desired set of logic gates can be built out of NAND gates.

To emulate a given computer design, designate one variable for each gate. Write a program that emulates one "clock cycle", by performing the boolean operation for each gate. Loop that program indefinitely.

Details and Refinements

The universality of NAND gates is outlined here. For our purposes, a number of simplifications are possible using other arithmetic operations. For example, z = x OR y can be computed by z = x + y - x y, and x XOR y is z = x + y - 2 x y = x - (2 x y - y).

Memory cells, registers and other persistent states can be implemented by gates, using flop-flops and latches of various kinds. The common element in all is that a gate's output feeds back into its input, possibly through another gate. Examples are described here) and here. However, it is usually not necessary to propagate every signal simultaneously. You can get a "latch" in a single variable just by leaving a value alone and not re-calculating it.

Combining these techniques, a 3-bit binary counter (with its state held in variables b2, b1 and b0) can be incremented with five boolean opration steps (here showing how many arithmetic calculations and Zuse Z3 instructions each takes):

carry1 = b0	0 calculations, 1 instruction
b0 = NOT b0	1 calculation, 4 instructions
carry2 = b1 AND carry1	1 calculation, 4 instructions
b1 = b1 XOR carry1	4 calcultions, 12 instructions
b2 = b2 XOR carry2	4 calcultions, 12 instructions

This is far fewer steps than what it would take to implement the counter as a cascade of three flip-flops, each built entirely from NAND gates.

How Efficient Is This?

It should be fairly clear that we suffer a very large amount of slowdown in order to gain Turing-completeness. Let's make a rough calculation of how much slower it gets:

Given a number of gates, we can estimate the size of a Zuse Z3 program neccesary to emulate a real-world system. Note that memory cells amount to gates: although actually storing the value persistently takes no calculation at all, reading and writing takes some conditional tests to determine if it is time to change this memory location, and if so what to set it to. This is called "addressing"; in real memories there are two "select lines" called the row and column, each is decoded from half of the binary digits of the memory cell's address). The per-cell logic overhead is one two-input AND for writing, and one two-input OR for reading, per cell. Other parts of the system (CPU, bus interfaces, etc) have more inputs per gate, or their gates' outputs affect more gates (these are actually equivalent, and collectively called fan-out). We can conservatively estimate that it will take 10 arithmetic operations to emulate each memory gate (counting memory cells as gates).

For a concrete example, an Apple II has 64K bytes of memory (48K RAM and 16K ROM), a CPU with about 5000 transistors, and miscellaneous other sections amounting to no more than 5000 more transistors. Each byte is 8 bits, so the memory has 2¹⁹ cells. As just described, it takes two gates for each memory cell. The total gate count for the whole Apple II comes to just a bit over a 10⁶ gates (and the same number of Zuse Z3 memory locations to hold their values). If each gate takes an average of 10 Zuse Z3 instructions to emulate, then our emulation program would have 10⁷ steps, which is 10 million. So it takes 10⁶ variables and 10⁷ instructions to emulate the Apple II.

Remember, it would have to execute all 10 million steps just to emulate a single clock cycle of the Apple II's operation. If the Apple II takes 10,000 clock cycles to perform one floating-point operation, then the emulation program would take 10¹¹ steps to accomplish the same calculation.

Remember that the Zuse Z3 is designed to do one floating-point operation in a single instruction. Thus we get a roughly 10¹¹ slowdown by emulating an Apple II.

This page was written in the "embarrassingly readable" markup language RHTF, and was last updated on 2008 Dec 21.

s.27