Geoff Forden

Why do we test things?

I remember having a conversation with a missile engineer some time ago about the North Korean Nodong missile; he said, “no one in their right mind would field a missile that has only been successfully tested once!” At the time, that made a lot of sense to me. But exactly how many tests do you need? And, more importantly, how do you decide how many tests you need? The answers should follow from the reason tests are performed in the first place.

I think I understand bullet testing. When developing a new bullet, you test millions of rounds to make sure they work right in all imaginable situations and to build a high degree of confidence that they will work. But, of course, bullets only cost a dollar or two each, so there is little problem with running a standard quality-control test program that lets you achieve real confidence they are going to work. National missile defense tests cost about $100 million each, so we are never going to have the “95% confidence that the system works 95% of the time” that some critics of missile defense have been advocating. (I’m not against that in principle; I’m just saying it’s never going to happen. Any missile defense development program has to be adjusted to that reality.)
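To see why that level of demonstration is out of reach, here is a minimal sketch (my own arithmetic, not from any official test plan) of the standard one-sided binomial bound: after n tests with some number of failures, your confidence that true reliability is at least p is one minus the chance of doing that well if reliability were exactly p.

```python
from math import comb

def confidence(p, n, failures):
    """One-sided confidence that true reliability >= p, given
    `failures` failures in n trials: one minus the probability of
    seeing this few failures (or fewer) if reliability were exactly p."""
    tail = sum(comb(n, k) * (1 - p) ** k * p ** (n - k)
               for k in range(failures + 1))
    return 1 - tail

# Smallest all-success test series demonstrating "95% confidence
# that the system works 95% of the time":
n = 1
while confidence(0.95, n, 0) < 0.95:
    n += 1
print(n)  # 59 consecutive successes
```

Fifty-nine consecutive successful intercept tests, at roughly $100 million each, is on the order of $6 billion spent purely on demonstrating the requirement, before a single failure is allowed for.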

I was reminded of my conversation with the missile engineer when NASA announced it was awarding SpaceX part of a $3.5 billion contract to deliver supplies to the ISS based on a single successful test flight. SpaceX is quickly becoming my favorite sociological experiment in missile development. Touted as a more cost-effective way of getting into space, SpaceX has hired former NASA engineers and uses government facilities without, I’m sure, contributing to paying off their development costs; just some sort of use fee. But now, it seems, the real way they are going to save money is by not having the sort of expensive testing program we might expect from a government development program. This isn’t going to be a rant against SpaceX, which, as I say, is one of my favorite sociological experiments (which also doesn’t imply that I think they are doing the right things!). The problem is, I’m not sure what the government would use that testing program for anyway. If we are not using flight tests to determine statistical reliability, perhaps only one successful test really is all that is needed. If so, what does that tell us about countries just starting to develop their missiles?

Tests Associated with Various Development Programs

Program                  No. of Tests   No. of Successful Tests
Falcon-1 (SpaceX)              4                  1
W-76                           4                  2*
RS-24                          3                  3
Al Samoud I (Iraq)            37                 33
Al Samoud II (Iraq)           24                 22
Nodong (DPRK only)             2                  1
Taepo’dong I (DPRK)            1                  0**
Taepo’dong II (DPRK)           1                  0

*I have arbitrarily dropped the two tests with anomalous results from the successful column.
**First 2 stages successful.

Integration Tests

One reason I like the Falcon-1 test series so much is illustrated by the reason the third flight test failed. Developed by engineers and scientists with plenty of experience on other missiles, this missile failed, I believe, because its designers were concentrating so much on economic factors, namely the reuse of the first-stage engine. If you want to reuse an engine, you don’t want to go firing pyrotechnics that blow holes in the nozzle to quickly drain the fuel. On the other hand, if you don’t quickly and reliably shut the engine down, the remaining fuel might cause the first stage to continue producing a little bit of thrust, risking a bump into the second-stage engine that breaks it as the stages separate. That is exactly what happened. Could SpaceX have caught this error if it had run more ground checks? If so, were those checks cut to reduce development costs? I hope you see why I like it so much.

The RS-24 is another interesting case that seems to be devoted to testing an integrated system. Pavel Podvig has made a very convincing argument that the RS-24 is a Topol-M missile with more than one warhead uploaded onto its bus. In that case, perhaps it shouldn’t need very many flight tests to get it up to speed. In fact, one might think that only the post-boost bus needs testing. But perhaps even that doesn’t need much testing since some claim that the Topol-M’s bus was tested for more than one warhead without loading any more on it by simply maneuvering as if it did have the warheads. (Some Russians claim exactly the same thing for some US post-boost buses. The US responds to those charges by claiming that additional maneuvers were needed for range safety reasons. And so it goes.) On the other hand, as the Falcon-1 test series shows, integrating different components does introduce new modes of failure. Were three tests enough? Apparently so, since Russia has said they will now introduce the RS-24 into their arsenal.

Statistical Uncertainty

The US philosophy of testing nuclear weapons is perhaps the hardest to understand, not least because so much is buried in secrecy. One could imagine that, since the US performed 1,054 nuclear explosion tests (and it appears that some tests fired more than one explosive device at a time) and “developed” a total of 112 nuclear weapons, it could have used these tests to establish a reasonable statistical reliability for each weapon. After all, this corresponds to roughly nine tests per bomb design, with a significant number left over for testing one-point safety, which would be reassuring. Except that the US testing philosophy was never to test to this level.

Instead, our nuclear tests were supposed to develop weapon designers’ expertise; an expertise from which they could judge the reliability of a nuclear design without further testing. This must rely on two assumptions that are probably true most of the time: 1) that the non-nuclear components are tested, individually and as a whole, enough times to establish a statistical reliability for the non-nuclear functioning of the design, and 2) that the nuclear process involves so many “particles” that statistical fluctuations cannot have a significant effect on the design’s function.

Some doubt that the latter is true for the W-76, a mainstay of the submarine leg of our nuclear triad. Critics have suggested the possibility that a macroscopic instability exists that violates the second assumption. It is also one of the few warheads for which the US has released information on its testing. It had a total of four tests during its development, and apparently two of them had “anomalies.” They could have had anomalously high yields, anomalously low yields, or anomalies that didn’t affect the yield; the open literature doesn’t say. However, we know that one anomaly resulted in a retest and the other in a change to a component (but no retest). Fortunately, there have probably been enough tests of the W-76, counting the few stockpile surveillance tests done in the later years of testing, to establish a reasonable statistical reliability, especially when more than one warhead is devoted to each target.

Other Countries

Given these examples of developed countries’ R&D programs, Iraq’s development of the Al Samoud I and II is very reassuring. Not only did Iraq use flight tests to iron out the bugs, it went on to what we would call an extensive operational test and evaluation series. The last 11 Al Samoud II flight tests were for verification of the “firing table”: determining the range under various conditions, such as changes to the pitch program, etc. (One of these failed, so the operational failure rate of the Al Samoud II was probably around 10%.) Still, I cannot help suspecting that somebody in a powerful position might have made a lot of money for each test flight flown; hence the large numbers. In any case, if other countries followed this sort of testing program, we would never miss their development of an ICBM.

North Korea, on the other hand, doesn’t seem to need nearly as many flight tests. Apparently only one successful test was needed for DPRK to start selling its Nodong missile abroad. Various analysts have come up with ingenious reasons for this and they could very well be right. But, on the other hand, do we really understand why and how we test complex systems well enough to claim to understand North Korea’s? I am full of doubt.

Note added: Just to be clear, when I say I think there have probably been enough stockpile surveillance tests of the W-76 to give a reasonable statistical confidence to the W-76’s reliability, that was not the intention of the surveillance tests. In fact, this statement is based only on my estimates of the numbers tested that I derived from a correlation analysis and published in Jane’s Intelligence Review in July 2005. As I hope I made clear, the reliability of nuclear weapons is officially based on the judgment of the designers and not on tests. Perhaps not surprisingly, that is probably the case with all the other tests considered here.

Comments

  1. Mark Gubrud

    Coupla points/thoughts/speculations (correct me if I’m wrong):

    I thought the Falcon stage separation bump only occurred once, and the fix was a simple software correction (wait a little longer for 1st stage shutdown before second stage ignition and separation). It was apparently a complete oversight, almost as bad as NASA’s famous cm vs. inches screwup.

    Maybe the reason the al Samouds appear to have been so thoroughly tested is that Iraq was laboring under UNSC orders not to develop missiles over 150 km range, so there was no rush to go larger and farther; also, Iraq took the opportunity to move to solid fuel. We may further speculate that, with larger and longer-range missile projects proscribed, it made sense to keep testing, developing a more thorough understanding of the technology and keeping people busy and trained.

    In the case of nuclear weapons, if by “large number of particles” you are referring to atoms and neutrons, obviously the numbers are so huge that no significant statistical fluctuations are expected. However, the more relevant “particles” are the detonation and initiation components, material and form factors of nuclear components, etc., which are all subject to statistical variations in manufacture even when these are kept within specified tolerance bands. These factors can of course be assessed by inspection and testing of the isolated components without explosion tests.

This last point generalizes to all kinds of complex technical systems. Given a robust and proven design with good safety margins built in, one can have confidence that the system will perform if all its components perform as specified. If the individual components can be separately tested and shown to be highly reliable, so that the overall system reliability should be very high, say 95%, then one can have confidence in the system. But if the result of such analyses is not high reliability, i.e. if the cumulative uncertainties place the overall performance at something like 90% or lower, then it would be very hard to know, without a substantial number of full-system tests, whether the actual failure rate would be around 10% or much higher.

    For basic purposes of deterrence, when nuclear weapons and ICBMs are involved, high proven reliability seems unnecessary. Still, I doubt that we would miss any ICBM test by a nation that might want to develop a weapon to deter us, and it does seem that before placing any reliance on their deterrent they would want at least one successful test.

  2. D (History)

    You know how to tell the difference between an RS-24 and a Topol-M, right? The Topol-M has a UID.

    I kid, I kid—of course there’s more to it than that. For example, part of the RS-24 canister is painted red. I mean, that’s a huge difference right there.

  3. David Wright (History)

Interesting post. I wanted to comment on the statement that missile defense critics are calling for 95% confidence in 95% effectiveness of the system. That criterion actually came from the Pentagon, not from critics. The Operational Requirements Document (ORD) for the missile defense system being developed under Clinton reportedly stated that the user must be 95% confident that the system will be 95% effective against a limited attack. To do this, the system would be expected to achieve an 85% single-shot kill probability and to assign four kill vehicles to each target. (See Michael Dornheim, “Missile Defense Design Juggles Complex Factors,” Aviation Week and Space Technology, 24 February 1997, p. 54.) This was apparently intended for the case of very simple or no countermeasures.

    You are right that demonstrating this through testing is not likely to happen since it requires a very large number of tests. For a series of 20 tests, all 20 would have to be successful to provide 95% confidence that the kill probability was 85% or greater. On the other hand, if there were three failures in a test series, a total of 50 tests (with the other 47 successful) would need to be conducted to provide 95% confidence that the single-shot kill probability was 85% or greater.

For more information on this, see the testing section (Chapter 10) of Countermeasures, http://www.ucsusa.org/assets/documents/nwgs/cm_all.pdf

  4. Geoff Forden (History)

    Thanks for the correction, David!