Computer History
Overview
I define three classes of computers based on size and the wealth of the user (all prices adjusted to late-1990s US dollars):
A. High-end scientific and military computers costing around $20 million. Usually called "large mainframe" or "supercomputer".
B. Mainstream business (or engineering) computers costing around $200,000. Common names include "MIS server", "blade server", "high-end workstation".
C. "Consumer" machines costing around $2000. Usually called "home computer" or "personal computer".
If we look at the history of each of these three classes of computers we see that computer history has been repeating itself remarkably well. Here are the major phases in the evolution of the computer. Each phase represents about a factor of 100 in speed improvement. For each class, the first significant representative system from that phase is listed:
Phase | Class A | Class B | Class C | Developments that define this phase |
0 | 1941 | year | year | Special-purpose or experimental systems, never brought to a large customer base. Includes specialized CPU designs (no conditional branch, etc.) |
1 | year | year | 1976 | Smallest CPU design that can do any useful work, with whatever switching technology and clock speed allow for cheapest manufacture. Simple multistate sequential operation: fetch, decode, fetch operands, execute, write back, increment PC, repeat. |
2 | 1955 | year | 1984 | Simple improvements: Push the clock speed, vary the voltages and fabrication techniques, widen the ALU and opcode, more registers and memory, more instructions and addressing modes, instruction prefetch. |
3 | 1964 | year | 1991 | Pipeline the 5 major parts of the CPU (fetch, decode, operands, execute, write back). CPU becomes faster than main memory; add a cache. Add floating-point and memory management hardware. |
4 | 1976 | year | 1994 | Multiple ALUs, register scoreboarding, out-of-order execution, multiple sets of condition codes, speculative execution, floating-point pipeline, etc. Instruction set and programming model are redefined to fit the architecture. |
5 | 1988 | 1997 | 2001 | Market splits between modestly parallel designs (2 to 8 CPUs; suitable for multitasking and business, database, expert systems) and vector processor designs (suitable for scientific, graphics, visualization, imaging). Algorithms become obsolete; applications are rewritten to fit the architecture. |
6 | 1995 | year | year | Market converges again, on massively parallel systems (32 or more CPU cores, same program running on all CPUs). 30-50% of the hardware cost is dedicated to inter-CPU communication. All old programming techniques become obsolete; programmers re-learn how to program. |
7 | 2002? | year | year | Development focused on inter-CPU communication. As the number of nodes increases, the vast majority of hardware cost is dedicated to communication. Bisection bandwidth grows faster than aggregate processing power. |
8 | year | year | year | Development focused on physical size and cooling. Speed of light limits the number of nodes that can work on a single problem. 3-D fabrication (stacked wafers); all air space is replaced with heat fins, coolant pipes, communication lines, power wires and thermally conductive glue. |
9 | year | year | year | (It's difficult to see what happens next. By this time I expect that the fundamental switching elements will have switched from transistors to some kind of optical, chemical or biological phenomenon.) |
10 | year | year | year | The plateau. Computers stop growing exponentially as the laws of physics are finally reached. Architecture stops changing and is perfected. Hardware stops changing and is perfected. Existing software stops changing and is perfected. New software continues to be written as always. |
Amdahl's Law
If N is the number of processors, s is the fraction of time spent (by a serial processor) on the serial parts of a program, and p is the fraction of time spent (by a serial processor) on the parts of the program that can be done in parallel (with s + p = 1), then Amdahl's law says that the speedup is given by
speedup = 1 / (s + p/N)
This would seem to limit the benefit of parallel processing to 1/s. However, it has been found that when faster hardware becomes available users run bigger problems on the faster hardware. It has also been found that for all problems that need to be run on "big" computers, the value of s goes down as the problem size increases.
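To make the formula and the 1/s limit concrete, here is a minimal Python sketch (the function name and the example values of s and N are mine, chosen only for illustration):

    def amdahl_speedup(s, N):
        # Amdahl's law: speedup on N processors of a program whose
        # serial fraction is s (so its parallel fraction is p = 1 - s).
        p = 1.0 - s
        return 1.0 / (s + p / N)

    # A program that is 5% serial: the speedup approaches 1/s = 20
    # no matter how many processors are added.
    for N in (1, 8, 64, 1024):
        print(N, round(amdahl_speedup(0.05, N), 2))    # 1.0, 5.93, 15.42, 19.64

    # If a bigger problem drops the serial fraction to 1%, the same
    # 1024 processors get much closer to their full potential.
    print(round(amdahl_speedup(0.01, 1024), 2))        # about 91; the limit is 100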
Moore's Law
The term "Moore's Law" originated with Carver Mead in 1970, who was referring to Gordon Moore, a co-founder of Intel. The idea was now new with Moore: Douglas Engelbart predicted something similar in a 1960 lecture1, and Alan Turing wrote in a 1950 paper that computers would have enough memory by the year 2000 to rival the higher-thinking capacity of a human mind, and then estimates this at about 109 or 1010 bits.[3]
In 1965, Moore predicted that the logic density of computer circuitry would follow an exponential rate of growth. Moore's data concerned the number of transistors that could be placed on an integrated circuit at a "minimum cost per component" point. The minimum cost per component occurs at a certain point in the tradeoff between circuit complexity and yield, and as technology improves it moves toward higher component density and lower cost per transistor.
Similar measures of logic density, such as the total "gate count" or "transistor count" on a chip (ignoring its cost or yield), are more common today. Logic density can be approximated by:
bits per square inch = 2^(T-1962)
where T is the current year. By this formula, in 1965 the logic density would be 2^3 = 8 bits per square inch (which was true), and in 1997 it would be 2^35, about 34 billion (which is an overestimate; in fact we're only up to about 1 billion).
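As a quick check of this approximation, here is a small Python sketch (the function name is mine; the years and values are the ones quoted above):

    def logic_density(T):
        # Approximate logic density, in bits per square inch, for year T.
        return 2 ** (T - 1962)

    print(logic_density(1965))   # 8
    print(logic_density(1997))   # 34359738368, i.e. about 34 billion (an overestimate)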
"Moore's Law" (speed variant)
Moore didn't actually address this, but it is the version most people have heard — the statement that computer speed increases exponentially. To make it precise, we'll relate the speed in floating-point operations per second to the price P in dollars and the year T:
speed = P * 1.53^(T-1973)
Including the price as a linear term in the formula is deceiving: usually you cannot count on getting twice as much power for twice as much money. There are actually many different factors at work. Some of them work against each other:
high-end technology: Higher-cost machines have a smaller market, so the development costs can't be spread out as much, and faster designs usually require more expensive technology (like multi-port caches for shared-memory designs). This effect causes bigger machines to cost more per megaflop than they otherwise would.
adjusting job size to match the machine: Programs designed to run efficiently on small machines don't run efficiently on large machines. If you buy a faster machine and run the same program on it, expecting performance proportional to your machine's rated performance, you'll be disappointed. But if you adjust (recompile, optimize, etc.) the program and its data for the larger machine you can keep up.
software optimization effort: Higher-priced machines get utilized more efficiently. A lot of money was spent on a fast machine and a lot of effort will get spent on optimizing the software to run fast. This effect causes bigger machines to cost less per megaflop than they otherwise would.
commodity hardware: In some cases, notably the ASCI projects, very large supercomputers are built from ordinary low-end technology. This effect counteracts the high-end technology effect, causing bigger machines to cost less per megaflop than they otherwise would.
inflation: The formula was calculated to ignore inflation. If you account for inflation by converting all prices to 1973 dollars, then you would need to change 1.53 in the formula to around 1.58. The difference of about 3% is the average inflation rate during the whole period. That higher figure is the actual rate of computer speed improvement.
recent trends not fully explained: In the last few years there has been a noticeable change in the rate of Moore's Law as compared to the 1970s and 80s. From Scientific American, September 1995:
"The rate of improvement in microprocessor technology has risen from 35 percent a year only a decade ago to its current high of approximately 55 percent a year, or almost 4 percent each month. Processors are now three times faster than had been predicted in the early 1980's; it is as if our wish was granted, and we now have machines from the year 2000."
Theories explaining this include a possible greater push for development as the computer market becomes bigger (and thus more lucrative), and/or the spinoff of US government-funded projects such as the space program and military projects like ARPANET and ASCI.
To illustrate the competing effects, here is a comparison across the three classes: In the beginning of 1997, the Intel ASCI Option Red machine (a "class A" machine) had just been completed at a cost of $46 million, and its rated speed was about 1.34 TFLOPs. A 16-CPU Sun HPC 10000 (a "class B" machine) would cost about $750,000 with a rated speed of 7.0 GFLOPs. A Gateway 2000 P5-200 with a 200 MHz Pentium Pro (a "class C" machine) cost about $2500 and was rated at 63 MFLOPs. All three performance ratings come from the LINPACK benchmark, with array sizes chosen to get the most performance out of each machine. Let's see what the formula says:
machine | performance based on formula | actual performance |
Option Red | 46000000 * 1.53^(1997-1973) = 1.24 TFLOPs | 1.34 TFLOPs |
Sun HPC 10000 | 750000 * 1.53^(1997-1973) = 20.3 GFLOPs | 7.0 GFLOPs |
Gateway 2000 PPro-200 | 2500 * 1.53^(1997-1973) = 67.7 MFLOPs | 63 MFLOPs |
Several of the effects can be seen at work. The high-end technology effect drives up the cost of the Sun HPC 10000. The commodity hardware effect is evident in the Option Red, and a lot of software optimization was done there as well. (The Pentium Pro 200 in the Gateway can actually do over 150 MFLOPs if you optimize the code to use the pipeline and cache well.)
Applying the formula to a more recent machine: my 2009 Apple Mac Pro gives me 30.5 GFLOPs (out of a theoretical 72) in double precision when running the program I wrote for the Xmorphia project. It is the 8-core 2.26-GHz model, with a list price of $3300; the formula gives 3300 * 1.53^(2009-1973) = 14.7 GFLOPs.
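For convenience, here is a short Python sketch of the speed formula (the function name is mine; the prices, years and expected outputs are taken from the table and paragraphs above; pass rate=1.58 for the inflation-adjusted variant mentioned earlier):

    def predicted_speed(price, year, rate=1.53):
        # "Moore's Law" (speed variant): FLOPs = price * rate^(year - 1973)
        return price * rate ** (year - 1973)

    print(predicted_speed(46000000, 1997))   # ~1.2e12 FLOPs (1.24 TFLOPs in the table: Option Red)
    print(predicted_speed(750000, 1997))     # ~2.03e10 FLOPs, i.e. 20.3 GFLOPs (Sun HPC 10000)
    print(predicted_speed(2500, 1997))       # ~6.77e7 FLOPs, i.e. 67.7 MFLOPs (Gateway PPro-200)
    print(predicted_speed(3300, 2009))       # ~1.47e10 FLOPs, i.e. 14.7 GFLOPs (2009 Mac Pro)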
Temporary Plateaus and Architecture Shifts
At some point around 2000-2005 a clear discrepancy appeared between the state of the art in classes A, B and C as defined above. For example, in 2008 one could buy a machine (e.g. from Dell or Apple) with 8 processor cores running at 2.8 GHz (a two-socket system, each socket with a quad-core Intel Xeon) for only slightly over the defining class C price point of $2000. At the same time, it was hard to find anything near the defining class B price point with more than about twice the CPU performance (an example would be the HP ProLiant DL585 G5 with 4 AMD chips, 16 cores at 2.3 GHz, which is more than twice the defining price). Meanwhile, a $20 million supercomputer cluster could easily be built with 1000 processor cores or more (using, say, 128 8-core blade systems linked by Fibre Channel, or some such thing — many examples exist in Jack Dongarra's database).
The reason for the discrepancy is that one of the computer classes (in this case class B) was undergoing a "temporary plateau". The plateau is evident in a graph of "performance/price" plotted over time. The graph shows a steady increase for a while, then enters a period of much slower increase, then resumes the older (fast) rate of increase. In 2008, class A systems had already gone through the plateau, and class C had not yet entered it.
Plateaus happen for different reasons, and this plateau in class B provides a useful case study. The cause is easy to see if you look at class A and class C: class A has already gone through the same architectural transition, and class C has not gotten there yet. Class B was, at the time, going through a transition that supercomputers went through in the 1980s: pushing the limits of the shared-memory parallel architecture. A good example is the IBM x3950 M2, a machine that uses Intel Xeon processors (designed for use on 4-socket motherboards) with a custom IBM-designed memory controller system enabling 8 processors to share the same memory.
Sources:
A lot of class B systems from the period 2001-2008 can be found at www.tpc.org. Unlike class A and class C, the market for class B frequently changes. For example, for a long period in the 1980's, class B systems were high-end graphics workstations.
More Data
Here is some additional information about a few machines, organized by phase (as in the above table), plus a long list of links to places where I found the information I used to compile the figures.
Here is a list of price data I collected while trying to establish the validity of the idea that the three classes shown in the table above have followed the same track of evolution over roughly the same time scale.
Semiconductor Chip Manufacturing Process History
This table is my attempt to identify the first example of a microprocessor using each industry-standard process size. In each case I count it only when Intel (or IBM) has shipped it in a "volume quantity" to customers, and only if it uses the new process size (or smaller) for the entire (or almost the entire) die.
process length² | transistors | when | speed and name |
10 u | 2300 | 1971.11 | 108 KHz Intel 4004 |
6 u | 6000 | 1974.04 | 2 MHz Intel 8080 |
3 u | 6500 | 1976.03 | 5 MHz Intel 8085 |
3 u | 29K | 1978.06 | 5 MHz Intel 8086 |
1.5 u | 134K | 1982.02 | 6 MHz 80286 |
1 u | ? | 1988.04 | 25 MHz 386 DX |
.8 u | 1.2 M | 1991.06 | 50 MHz 486 DX |
.6 u | 3.2 M | 1994.03 | 90 MHz Pentium (P54C) |
.35 u | 3.2 M | 1995.06 | 133 MHz Pentium (a mixed .6/.35 process was used in 1995.03 for the 120-MHz Pentium) |
.25 u (P856) | 4.5 M | 1997.09 | 200 MHz Pentium MMX |
.18 u (P858) | 27.4 M | 1999.06 | 400 MHz Pentium II |
130 nm (Px60) | 28 M | 2002.02 | Pentium III-M "Tualatin". First process to use copper, rather than aluminum, interconnects; first to use 300-mm wafers (but 200-mm wafers are also used with this process) |
90 nm (P1262) | >= 58 M | 2004.02 | 2 GHz IBM PowerPC 970FX, "CMOS 10S" process: first process to use strained silicon for the channel (semiconducting layer). Now using 300-mm wafers exclusively. |
65 nm (P1264) | - | 2007.06 | prediction from 2003 ITRS: completion of first three-year retooling cycle |
45 nm (P1266) | x | 2010.06 | prediction from 2003 ITRS: metal gates; the plan is to use a high-K dielectric rather than SiO2, and a metal gate electrode rather than polysilicon |
32 nm (P1268) | x | 2013.06 | prediction from 2003 ITRS: extreme ultraviolet (EUV) lithography |
22 nm (P1270) | x | 2016.06 | prediction from 2003 ITRS: will probably require an interconnect material superior to copper |
16 nm | x | 2020.06 | At this point quantum tunneling will be a problem |
Footnotes
1 : Wikipedia, Moore's law, encyclopedia article (retrieved 2010 March 16)
2 : "process length" is the DRAM 1/2 pitch, a dimension relevant to memory chips. For CPU cores, the relevant dimension is "physical gate length", and is about 45% of the process length. For example, when the 90nm process is used, the gate length is 37nm.
Bibliography
[3] Alan Turing, "Computing Machinery and Intelligence", Mind LIX(236), pp. 433-460 (1950).
Other Sources
http://www.informit.com/articles/article.asp?p=130978&seqNum=3
http://www.architosh.com/news/2003-12/2003c-1203-tb-moore.phtml
http://www.willus.com/archive/timeline.shtml
http://public.itrs.net/Files/2003ITRS/ExecSum2003.pdf
http://news.com.com/2100-73373-5112061.html?tag=stpop
http://www.intel.com/research/silicon/itroadmap.htm
Acknowledgments:
Most benchmark numbers are from old LINPACK reports.
This page was written in the "embarrassingly readable" markup language RHTF, and was last updated on 2013 Jan 21.