Chip Errors Are Becoming More Common and Harder to Track Down


Imagine for a moment that the millions of computer chips inside the servers that power the largest data centers in the world had rare, almost undetectable flaws. And the only way to find the flaws was to throw those chips at giant computing problems that would have been unthinkable just a decade ago.

As the tiny switches in computer chips have shrunk to the width of a few atoms, chip reliability has become another worry for the people who run the world's biggest networks. Amazon, Facebook, Twitter and many other major sites have experienced surprising outages over the past year.

The outages have had several causes, like programming mistakes and congestion on the networks. But there is growing anxiety that as cloud-computing networks have become larger and more complex, they are still dependent, at the most basic level, on computer chips that are now less reliable and, in some cases, less predictable.

In the past year, researchers at both Facebook and Google have published studies describing computer hardware failures whose causes have not been easy to identify. The problem, they argued, was not in the software — it was somewhere in the computer hardware made by various companies. Google declined to comment on its study, while Facebook, now known as Meta, did not return requests for comment on its study.

“They’re seeing these silent errors, essentially coming from the underlying hardware,” said Subhasish Mitra, a Stanford University electrical engineer who specializes in testing computer hardware. Increasingly, Dr. Mitra said, people believe that manufacturing defects are tied to these so-called silent errors that cannot be easily caught.

Researchers worry that they are finding rare defects because they are trying to solve bigger and bigger computing problems, which stresses their systems in unexpected ways.

Companies that run large data centers began reporting systematic problems more than a decade ago. In 2015, in the engineering publication IEEE Spectrum, a group of computer scientists who study hardware reliability at the University of Toronto reported that each year as many as 4 percent of Google’s millions of computers had encountered errors that couldn’t be detected and that caused them to shut down unexpectedly.

In a microprocessor that has billions of transistors — or a computer memory board composed of trillions of the tiny switches that can each store a 1 or 0 — even the smallest error can disrupt systems that now routinely perform billions of calculations each second.

At the beginning of the semiconductor era, engineers worried about the possibility of cosmic rays occasionally flipping a single transistor and changing the outcome of a computation. Now they are worried that the switches themselves are increasingly becoming less reliable. The Facebook researchers even argue that the switches are becoming more prone to wearing out and that the life span of computer memories or processors may be shorter than previously believed.
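The effect of a single flipped transistor is easy to simulate in software. The sketch below is purely illustrative (it is not from the article, and real faults occur in hardware, not in Python); it shows how one flipped bit silently turns a stored value into a very different one:

```python
# Illustrative sketch: a single flipped bit can drastically and
# silently change a stored value.

def flip_bit(value: int, bit: int) -> int:
    """Flip one bit of an integer, simulating a transient hardware fault."""
    return value ^ (1 << bit)

stored = 1_000_000                 # the value a program believes it wrote
corrupted = flip_bit(stored, 20)   # a "cosmic ray" flips bit 20

print(stored, corrupted)           # 1000000 vs. 2048576
```

No exception is raised and no log entry is made; unless some checking circuit notices, the computation simply continues with the wrong number, which is what makes such errors "silent."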

There is growing evidence that the problem is worsening with each new generation of chips. A report published in 2020 by the chip maker Advanced Micro Devices found that the most advanced computer memory chips at the time failed at roughly 5.5 times the rate of the previous generation. AMD did not respond to requests for comment on the report.

Tracking down these errors is challenging, said David Ditzel, a veteran hardware engineer who is the chairman and founder of Esperanto Technologies, a maker of a new type of processor designed for artificial intelligence applications in Mountain View, Calif. He said his company’s new chip, which is just reaching the market, had 1,000 processors made from 28 billion transistors.

He likens the chip to an apartment building that would span the surface of the entire United States. Using Mr. Ditzel’s metaphor, Dr. Mitra said that finding new errors was a little like searching for a single running faucet, in one apartment in that building, that malfunctions only when a bedroom light is on and the apartment door is open.

Until now, computer designers have tried to deal with hardware flaws by adding special circuits to chips that automatically detect and correct bad data. Silent corruption was once considered an exceedingly rare problem. But several years ago, Google production teams began to report errors that were maddeningly difficult to diagnose. Calculation errors would happen intermittently and were difficult to reproduce, according to their report.
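The error-correcting circuits described above are built around codes that add redundant parity bits to each stored word. The toy Hamming(7,4) code below is an illustration of the principle only, not the actual hardware design used by any of these companies; it protects four data bits with three parity bits and can repair any single flipped bit:

```python
# Toy Hamming(7,4) code: the kind of detect-and-correct logic that
# ECC memory implements in hardware (real ECC protects wider words,
# but the principle is the same).

def encode(d):
    """Encode 4 data bits into a 7-bit codeword with 3 parity bits."""
    p1 = d[0] ^ d[1] ^ d[3]   # covers codeword positions 1, 3, 5, 7
    p2 = d[0] ^ d[2] ^ d[3]   # covers codeword positions 2, 3, 6, 7
    p3 = d[1] ^ d[2] ^ d[3]   # covers codeword positions 4, 5, 6, 7
    return [p1, p2, d[0], p3, d[1], d[2], d[3]]

def correct(c):
    """Repair up to one flipped bit in a codeword and return the data bits."""
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s3   # 1-based position of the bad bit, or 0
    if syndrome:
        c = c[:]
        c[syndrome - 1] ^= 1          # flip the faulty bit back
    return [c[2], c[4], c[5], c[6]]   # recovered data bits

data = [1, 0, 1, 1]
word = encode(data)
word[4] ^= 1                          # simulate a single-bit memory fault
assert correct(word) == data          # the error is silently repaired
```

Codes like this handle the occasional, isolated bit flip well; the difficulty the article describes arises when errors fall outside what such circuits were designed to catch.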

A team of researchers attempted to track down the problem, and last year they published their findings. They concluded that the company’s vast data centers, composed of computer systems based upon millions of processor “cores,” were experiencing new errors that were probably a combination of a couple of factors: smaller transistors that were nearing physical limits and inadequate testing.

In their paper “Cores That…



