TuringCompleteness of the Zuse Z3
The Zuse Z3 was a generalpurpose automatic calculating machine designed and built by Konrad Zuse, a German civil engineer, from 1939 to 1941.
It read its instructions from a punched tape (using 35mm film stock) and had the following instruction set. (To give an idea for the opcode format, a few known opcodes from the model Z4 are shown)
opcode  mnemonic  cycles  description 
3 h l  Pr A  1  Load from memory at address A into register R1 or R2 
2 h l  Ps A  01  Store register R1 into memory address A (0 cycles if prev op was a calculation) 
1 0 3  Ls1  3  Calculate R1+R2 and store into R1 
1 ? ?  Ls2  3  Calculate R1R2 and store into R1 
1 0 4  Lm  16  Calculate R1×R2 and store into R1 
1 0 5  Li  20  Calculate R1/R2 and store into R1 
1 ? ?  Lw  20  Calculate sqrt(R1) and store into R1 
? ? ?  Lu  941  Wait for user input, convert to binary (and load into R1?) 
? ? ?  Ld  941  Convert (R1?) into decimal, display output, wait for user 
The ALU was able to perform each of the operations described above (add, subtract, multiply, divide, and square root) plus several more operations used specifcally for binarydecimal conversion (multiply or divide by 10, multiply or divide by 0.1, multiply or divide by 2, and negate)
What's missing in all this is any type of conditional branch. Zuse was aware of the possibility that such a thing might be useful, and wrote about it (using the term freien Rechenplanen, "free computingplan", to refer to a program that used conditional calculations) but did not need it for any of his own work.
Since the Z3 is programmed by a punched tape, it is capable of looping unconditionally, just by connecting the beginning and end of the program together to make a continuous loop of tape. This allows a specified calculation to be performed iteratively.
Turing Completeness
To prove Turingcompleteness, Raul Rojas shows how a set of expressions can be combined together into a larger expression whose value depends on a number of flag variables that are either 0 or 1. He goes to great length to explain how this can be done, and how that, in turn, can be used to access arbitrarysized "arrays" of variables, and other similar things. The description is fairly complex.
However, all we need to do to prove Turing completeness is the following:
Let each variable have a value of either 0 or 1.
Define two boolean operations:
NOT: Negates a value. If the value is in variable x, then y = NOT x is computed by y = 1  x.
AND: Combines two values together by a boolean "and". If the values to be combined are x and y, then z = x AND y is computed by z = x y.
Recall that NOT(x AND y) = x NAND y, and that any desired set of logic gates can be built out of NAND gates.
To emulate a given computer design, designate one variable for each gate. Write a program that emulates one "clock cycle", by performing the boolean operation for each gate. Loop that program indefinitely.
Details and Refinements
The universality of NAND gates is outlined here. For our purposes, a number of simplifications are possible using other arithmetic operations. For example, z = x OR y can be computed by z = x + y  x y, and x XOR y is z = x + y  2 x y = x  (2 x y  y).
Memory cells, registers and other persistent states can be implemented by gates, using flopflops and latches of various kinds. The common element in all is that a gate's output feeds back into its input, possibly through another gate. Examples are described here) and here. However, it is usually not necessary to propagate every signal simultaneously. You can get a "latch" in a single variable just by leaving a value alone and not recalculating it.
Combining these techniques, a 3bit binary counter (with its state held in variables b2, b1 and b0) can be incremented with five boolean opration steps (here showing how many arithmetic calculations and Zuse Z3 instructions each takes):

This is far fewer steps than what it would take to implement the counter as a cascade of three flipflops, each built entirely from NAND gates.
How Efficient Is This?
It should be fairly clear that we suffer a very large amount of slowdown in order to gain Turingcompleteness. Let's make a rough calculation of how much slower it gets:
Given a number of gates, we can estimate the size of a Zuse Z3 program neccesary to emulate a realworld system. Note that memory cells amount to gates: although actually storing the value persistently takes no calculation at all, reading and writing takes some conditional tests to determine if it is time to change this memory location, and if so what to set it to. This is called "addressing"; in real memories there are two "select lines" called the row and column, each is decoded from half of the binary digits of the memory cell's address). The percell logic overhead is one twoinput AND for writing, and one twoinput OR for reading, per cell. Other parts of the system (CPU, bus interfaces, etc) have more inputs per gate, or their gates' outputs affect more gates (these are actually equivalent, and collectively called fanout). We can conservatively estimate that it will take 10 arithmetic operations to emulate each memory gate (counting memory cells as gates).
For a concrete example, an Apple II has 64K bytes of memory (48K RAM and 16K ROM), a CPU with about 5000 transistors, and miscellaneous other sections amounting to no more than 5000 more transistors. Each byte is 8 bits, so the memory has 2^{19} cells. As just described, it takes two gates for each memory cell. The total gate count for the whole Apple II comes to just a bit over a 10^{6} gates (and the same number of Zuse Z3 memory locations to hold their values). If each gate takes an average of 10 Zuse Z3 instructions to emulate, then our emulation program would have 10^{7} steps, which is 10 million. So it takes 10^{6} variables and 10^{7} instructions to emulate the Apple II.
Remember, it would have to execute all 10 million steps just to emulate a single clock cycle of the Apple II's operation. If the Apple II takes 10,000 clock cycles to perform one floatingpoint operation, then the emulation program would take 10^{11} steps to accomplish the same calculation.
Remember that the Zuse Z3 is designed to do one floatingpoint operation in a single instruction. Thus we get a roughly 10^{11} slowdown by emulating an Apple II.
This page was written in the "embarrassingly readable" markup language RHTF, and was last updated on 2018 Feb 04. s.11