Internet Routing Architectures (CISCO): Designing Stable Internets

Network Upgrades

Networks are dynamic. Performance improvement, site consolidation, and support expansion all require changes and adaptations. Changes might include upgrades to new versions of software or hardware, additions of more links, additions of more bandwidth, or reconfiguration of a network's layout.

For obvious reasons, administrators prefer to bring a system down for upgrading during a period when usage is normally minimal. For some networks, downtime cannot exceed an hour, even at night, because of time zone differences. Despite these difficulties, the upgrade window itself is not usually when errors do the most damage, because administrators typically keep a backup plan and can revert to the old setup if the new one does not work. Configuration or software/hardware problems tend to surface as network instabilities the next day, when everybody is back online. At that point, reverting to the old setup is not likely to be a viable option. Unfortunately, to rectify the situation, administrators sometimes start adding or changing configuration on-the-fly, making the situation even worse.

To reduce the likelihood of causing disruptions, network changes should, if possible, be simulated in nonproduction environments. In addition, multiple major changes should not be deployed at the same time. It is, for example, unwise for a provider to perform major router software upgrades, switch hardware, and change cabling all at once. Good planning and network simulation are the keys to successful network upgrades.

Human Error

Most of the network instabilities caused by human error occur because an administrator circumvents an administration policy or makes a change without knowledge of possible effects. It is easy to make mistakes in complex network configurations. One wrong filter, and an entire AS can be isolated. Administrators should anticipate problems before they occur.

Here's an example of the kinds of errors that can happen: any router can send the default route 0.0.0.0 via BGP to its neighbors. If you are not careful, traffic will follow that default and take the wrong route. As much as it is somebody else's responsibility to send appropriate default routes, it is your responsibility to protect yourself by making sure that you filter any unwanted routes, default or otherwise, that come your way. The list of possible human errors is long: someone might advertise somebody else's networks; a provider might stop advertising your networks; somebody might summarize the wrong networks. The point is, don't expect everyone else to play by your rules. Other administrators can (usually inadvertently) deploy rules that directly conflict with your rules, which can lead to serious performance and connectivity degradation.
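The idea of filtering unwanted routes can be illustrated with a small sketch. This is not router configuration and not how any particular BGP implementation works; it only models the decision in Python. The function name `filter_inbound` and the two rejection rules (the default route, and prefixes that fall inside your own address space) are assumptions chosen for illustration.

```python
import ipaddress

# The IPv4 default route, which we assume should never be accepted here.
DEFAULT_ROUTE = ipaddress.ip_network("0.0.0.0/0")


def filter_inbound(prefixes, own_networks):
    """Split received prefixes into (accepted, rejected) lists.

    A prefix is rejected if it is the default route, or if it equals or
    falls inside one of our own networks (someone else advertising our
    address space). Both rules are illustrative assumptions.
    """
    own = [ipaddress.ip_network(n) for n in own_networks]
    accepted, rejected = [], []
    for text in prefixes:
        net = ipaddress.ip_network(text)
        is_default = net == DEFAULT_ROUTE
        inside_own = any(net.subnet_of(o) for o in own)
        if is_default or inside_own:
            rejected.append(text)
        else:
            accepted.append(text)
    return accepted, rejected


if __name__ == "__main__":
    received = ["0.0.0.0/0", "192.0.2.0/24", "198.51.100.0/25"]
    ok, dropped = filter_inbound(received, own_networks=["198.51.100.0/24"])
    print("accepted:", ok)
    print("rejected:", dropped)
```

A real deployment would express the same policy in the router's own filtering mechanisms; the sketch only shows that the check belongs on the receiving side, regardless of what the neighbor intended to send.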

Backup Link Overloads

In some cases, a link failure will cause a backup link to be overloaded with traffic. This occurs because the backup link must carry all the traffic rerouted from the failed link on top of its normal traffic. Even if the link can handle the load, a router might not be able to handle the additional load, depending on its horsepower. This can cause major performance degradation for the end user.
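The arithmetic behind a backup link overload is simple enough to check ahead of time. The sketch below is a hypothetical planning helper, not a measurement tool: the function names, the traffic figures, and the 80 percent warning threshold are all assumptions. It covers link bandwidth only; whether the router itself can process the extra load is a separate question.

```python
def backup_utilization(backup_capacity_mbps, backup_load_mbps, failed_link_load_mbps):
    """Fraction of the backup link's capacity in use after failover.

    The backup link carries its normal load plus everything that was
    on the failed link.
    """
    if backup_capacity_mbps <= 0:
        raise ValueError("capacity must be positive")
    return (backup_load_mbps + failed_link_load_mbps) / backup_capacity_mbps


def is_overloaded(utilization, threshold=0.8):
    """True if utilization exceeds the (assumed) planning threshold."""
    return utilization > threshold


if __name__ == "__main__":
    # Illustrative numbers: a 45 Mbps backup already carrying 20 Mbps
    # must absorb 30 Mbps from the failed primary link.
    u = backup_utilization(45, 20, 30)
    print(f"utilization after failover: {u:.0%}, overloaded: {is_overloaded(u)}")
```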

In the process of trying to get a handle on network instabilities, BGP implementations have introduced several helpful features. Although these features do not provide a complete solution, they are significant preventive measures against route instability.

BGP Stability Features

Of course, developing effective routing policies and configuring them correctly is at the core of building stability. BGP's attribute selections, as discussed throughout this book, are tools for building that core stability. In addition, BGP functions that can help provide a buffer against route instability effects include:

• Controlling Route and Cache Invalidation

• Route Dampening

Controlling Route and Cache Invalidation

The basis of any BGP conversation is the TCP session that takes place between two neighbors. The neighbor connection itself is based on the OPEN message, which contains parameters such as the BGP version number. In addition, exchanged routing updates carry different attributes such as the metric, communities, and AS_Path. Anytime an administrator changes attributes or policies, BGP implementations require that the BGP TCP session with the affected neighbor be reset (broken and restarted) for the modified routing behavior to take effect.

Troubleshooting:
Ch. 11, pp. 433-437. Controlling Route and Cache Invalidation

Unfortunately, every time the TCP session is reset, routing is interrupted. When a session is reset, the routing cache gets invalidated, routes disappear, and route instability cascades throughout the Internet. By the time the session is brought back up and the routes and cache are re-established, real damage may already have been done.

Cisco Systems introduced a mechanism called soft reconfiguration that enables administrators to reconfigure attributes on-the-fly without killing an already established TCP session. As a result, the routing cache is not cleared, and the impact on routing is minimal.
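The contrast between a session reset and soft reconfiguration can be sketched with a toy model. This is not a BGP implementation and does not reflect Cisco's internals; the class `Neighbor` and its methods are hypothetical. It assumes that the router keeps an unmodified copy of everything a neighbor has sent, so that a changed policy can be re-applied to that copy while the session stays up. Keeping that copy costs memory, which is the trade-off for avoiding the reset.

```python
class Neighbor:
    """Toy model of one BGP neighbor; names and structure are illustrative."""

    def __init__(self, policy):
        self.policy = policy      # callable(prefix, attrs) -> bool (accept?)
        self.received = {}        # unmodified copy of what the neighbor sent
        self.table = {}           # routes that passed the current policy
        self.session_resets = 0

    def receive(self, prefix, attrs):
        """Store the raw update, then install it only if the policy accepts it."""
        self.received[prefix] = attrs
        if self.policy(prefix, attrs):
            self.table[prefix] = attrs

    def hard_reset(self, new_policy):
        """Reset the session: all learned routes vanish until re-sent."""
        self.policy = new_policy
        self.received.clear()
        self.table.clear()
        self.session_resets += 1

    def soft_reconfigure(self, new_policy):
        """Re-apply policy to the stored routes; the session is untouched."""
        self.policy = new_policy
        self.table = {p: a for p, a in self.received.items() if new_policy(p, a)}


if __name__ == "__main__":
    n = Neighbor(policy=lambda prefix, attrs: True)
    n.receive("192.0.2.0/24", {"as_path": [65001]})
    n.receive("203.0.113.0/24", {"as_path": [65001, 65002]})

    # New policy: accept only routes with an AS path shorter than two hops.
    n.soft_reconfigure(lambda prefix, attrs: len(attrs["as_path"]) < 2)
    print("after soft reconfiguration:", list(n.table), "resets:", n.session_resets)
```

In the model, `hard_reset` empties the table until the neighbor sends its routes again, whereas `soft_reconfigure` changes which routes are installed without any gap in the session.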
