Starting with some general information about BGP: it's a path vector protocol that uses several different "metrics" to decide which path to use. It is a massively scalable protocol and is the core of the internet. It's also widely used in large company networks, and datacenter usage in particular is what I'm most interested in. It's also the only EGP in use today. There is a great base of information at the RFC page.
There are two versions: BGP and MP-BGP. Multiprotocol BGP is, as the name states, BGP with support for multiple protocols: IPv6, other address families, and both unicast and multicast. That is a topic for another post.
BGP comes in 3 different site configurations:

- Single home: Peering one router with one ISP
- Dual home: Peering with two connections to the same ISP
- Multi home: Peering with multiple ISPs at the same time
There are devices called BGP route servers, which anyone can reach from a terminal. They are BGP routers from different vendors on which people can run various BGP show commands. One of their purposes is to let people see whether their BGP routing policies have propagated to the internet.
Network Layer Reachability Information & Next-Hop
BGP is sometimes described not as a routing protocol, but as an application exchanging NLRI. The distinction is that BGP knows which routes other peers advertise, but it can't get there on its own. BGP doesn't know the full path to the destination either; it only knows the next-hop, and it trusts the information received from its peers. The router will still choose the best next-hop if there is more than one, and it does so using attributes.
NLRI travels in BGP Update messages, which contain prefixes, attributes and the next-hop.
Attributes, which are sort of metrics, decide which next-hop to use for a network. They include values such as MED, Local Preference and Weight, whose influence is limited in scope: Weight is Cisco-specific and local to the router, Local Preference stays inside the AS, and MED is only sent to the neighboring AS. An update also carries AS_PATH, which lists every AS that has to be traversed to reach the prefix. Complete coverage of attributes is for another post.
The next-hop is the peer configured in the BGP process, either statically or dynamically. The only condition for establishing a neighbor relationship is reachability to the neighbor's IP: the address configured in the neighbor statement needs to be covered by a route in the routing table. That route can come from an IGP, static routing or a directly connected interface.
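To make the idea of attributes acting as "sort of metrics" concrete, here is a minimal Python sketch of comparing two paths to the same prefix. It is a simplified assumption on my part: it only looks at Weight, Local Preference, AS_PATH length and MED in that order, and skips several steps and conditions a real router applies. The dictionary layout, addresses and AS numbers are hypothetical.

```python
def best_path(candidates):
    """Pick one path using a simplified subset of the usual comparison order."""
    return min(
        candidates,
        key=lambda p: (
            -p["weight"],        # higher Weight wins (local to this router)
            -p["local_pref"],    # then higher Local Preference
            len(p["as_path"]),   # then the shorter AS_PATH
            p["med"],            # then the lower MED
        ),
    )


# Hypothetical paths to the same prefix, learned via two different next-hops.
paths = [
    {"next_hop": "10.0.12.1", "weight": 0, "local_pref": 100,
     "as_path": [65001], "med": 0},
    {"next_hop": "10.0.23.3", "weight": 0, "local_pref": 200,
     "as_path": [65002, 65001], "med": 0},
]

chosen = best_path(paths)
print(chosen["next_hop"])
```

In this example the second path is chosen even though its AS_PATH is longer, because Local Preference is compared before AS_PATH length.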
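The reachability condition can be sketched as a longest-prefix lookup. This is only an illustration of the idea, not how a router implements it; the routing table entries and addresses below are hypothetical.

```python
import ipaddress


def neighbor_route(neighbor_ip, routing_table):
    """Return the most specific (network, source) covering neighbor_ip, or None."""
    addr = ipaddress.ip_address(neighbor_ip)
    matches = [
        (ipaddress.ip_network(prefix), source)
        for prefix, source in routing_table
        if addr in ipaddress.ip_network(prefix)
    ]
    if not matches:
        return None  # no route: the BGP session cannot be established
    return max(matches, key=lambda m: m[0].prefixlen)


# Hypothetical routing table: (prefix, how the route was learned).
routing_table = [
    ("10.0.12.0/24", "connected"),
    ("10.0.0.0/8", "static"),
]

print(neighbor_route("10.0.12.2", routing_table))  # covered by the connected route
print(neighbor_route("192.0.2.1", routing_table))  # not covered at all
```

It doesn't matter to BGP whether the matching route is connected, static or learned from an IGP, only that one exists.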
The topology and configuration below should help illustrate.
Starting with peering on the physical interface
With this I haven't configured any routes, but because I'm peering on the physical interfaces, the routing information is already there as directly connected routes. The peers and routing information:

![R2 BGP peer](/content/images/2016/07/R2-bgp-peer.PNG)

![R2 and R3 IP route](/content/images/2016/07/R2-R3-ip-route.png)

There are some things to note here. The peering works like this, but if I use loopback interfaces for neighbors, I'll need to have an IGP or static routes.
I put R3 in a different AS than R1 and R2, otherwise it wouldn't receive the 192.168.1.0/24 advertisement. This is because of BGP split horizon, a rule for iBGP that states a speaker cannot advertise a prefix it has received from an iBGP peer to anything other than an eBGP peer. Basically, iBGP needs a full mesh of peerings, but that's a side track here.
As shown in the routing tables of R2 and R3, the 192.168.1.0/24 network from R1 is being properly advertised. If I ping from R2 I can reach it, but from R3 I can't. The reason is simple: R3 sources its traffic to R1 from 188.8.131.52, and R1 has no route back to that address in its routing table. To fix this, I can add a static route on R1 that points towards the interface R3 uses as source.
This illustrates that BGP can advertise networks that can't necessarily be reached end to end. In fact, more often than not there will be no ping connectivity between BGP routers and endpoint networks. The interfaces used in the BGP process for neighbors and next-hop can be in a completely different scope; they just have to provide passage for the prefixes.
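The split horizon rule boils down to one check on the two session types involved. The sketch below is my own reduction of it, with a hypothetical function name, and it ignores anything that relaxes the rule.

```python
def may_advertise(learned_from, send_to):
    """Decide whether a received prefix may be passed on.

    Both arguments are the session type, either "ibgp" or "ebgp".
    """
    # A prefix learned from an iBGP peer is never passed to another iBGP peer.
    if learned_from == "ibgp" and send_to == "ibgp":
        return False
    return True


# R2 learns 192.168.1.0/24 from R1 over iBGP.
print(may_advertise("ibgp", "ebgp"))  # R3 in a different AS
print(may_advertise("ibgp", "ibgp"))  # R3 in the same AS as R1 and R2
```

With R3 in the same AS, R2 would keep the prefix to itself, which is why R3 would need its own session with R1.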
We just need next-hop reachability in the routing tables and BGP can advertise networks.
AS numbers play a much greater role in BGP than in other protocols. Because BGP is used by the ISPs, most of these AS numbers are public, meaning an AS number belongs to one organization and can't be used by anyone else. It also means you can't just put BGP on a router and start peering with service providers. You have to request a public AS number from your regional internet registry (RIR). To get one, you need to state why your routing policy differs from that of your ISP, or have a multihomed site configuration.
For 16-bit AS numbers, 1 - 64495 are public and the rest are used for experimental, documentation and private purposes. AS numbers aren't scarce anymore. Originally the AS number was a 16-bit value, but that proved insufficient. A 32-bit value was developed, supported from IOS version 12.4.24T. The notation used here is 16bit.16bit instead of a single 32-bit number. There are 2 reasons for this notation: it's easier to read in the output of various BGP commands, and it's backwards compatible.
A notation example would be 65000.65000 or 0.65000. For the last one it's easy to see why it's backwards compatible, since the number can just be sent in the old 16-bit format. But how does an old BGP speaker understand the 32-bit format? The short answer is that 32-bit BGP routers hide 4-byte AS numbers behind the substitute AS_TRANS (23456) and use the new attributes AS4_PATH and AS4_AGGREGATOR to carry the 32-bit information across speakers that don't support it. It's defined in RFC 4893. I want to do a separate post and get into proper detail about the 32-bit AS, because it's a lengthy subject that involves a lot of parts of the BGP process.
On the configuration surface, a single command can be enough to establish a neighbor. However, depending on how and through what the neighbors peer, a lot of extra configuration may be needed. There's also a lot of theory behind the neighbor relationship process.
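The dotted notation is just the high and low 16 bits of the 32-bit number written side by side. A small sketch of the arithmetic, including what a 16-bit-only speaker would see in place of a 4-byte AS; the function names are my own.

```python
AS_TRANS = 23456  # placeholder AS shown to speakers that only understand 16 bits


def dotted_to_plain(text):
    """Convert "high.low" (or a plain number) to a single integer AS number."""
    high, sep, low = text.partition(".")
    if not sep:
        return int(high)
    high, low = int(high), int(low)
    if not (0 <= high <= 0xFFFF and 0 <= low <= 0xFFFF):
        raise ValueError("each half must fit in 16 bits")
    return (high << 16) | low


def plain_to_dotted(asn):
    """Convert an integer AS number to "high.low"."""
    return f"{asn >> 16}.{asn & 0xFFFF}"


def seen_by_16bit_speaker(asn):
    """A 16-bit AS passes through unchanged; anything larger is hidden."""
    return asn if asn <= 0xFFFF else AS_TRANS


print(dotted_to_plain("0.65000"))      # 65000
print(dotted_to_plain("65000.65000"))  # 4259905000
print(seen_by_16bit_speaker(dotted_to_plain("65000.65000")))  # 23456
```

The high half of 0.65000 is zero, so the value fits the old 16-bit field as it is, which is the backwards compatibility mentioned above.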
Unlike EIGRP and OSPF, which have their own neighbor process built into the protocol, BGP uses TCP for reliable transmission. I'm going by standard TCP here and not taking any vendor specifics into account. TCP is a layer 4 transport protocol that uses SYN and ACK messages to establish a connection confirmed by both ends, the 3-way handshake. One speaker sends a SYN to the configured neighbor on port 179. The speaker sending the first SYN is the TCP client and sources the request from a random port (within a range). The speaker receiving the request is the TCP server. If the server accepts the connection, it responds with an ACK and a SYN of its own, sent together in one segment. The client receives it and sends the final ACK: 3 steps. Now the connection is open and will stay open for as long as BGP keepalive messages are exchanged. Most documentation writes this as SYN>SYN,ACK>ACK; since the server's SYN and ACK are flags in the same segment, it doesn't matter in which order they're written.
The fact that TCP is used, with a random port on one end, has implications when using ACLs. There is actually BGP-specific ACL syntax, and it is something to keep in mind when configuring the routers.
With TCP out of the way, there is still a fairly big part of the neighbor process left. BGP neighbor establishment has 6 states, and the first 3 are where TCP operates. These states work as a flow diagram, which is further down the page.
This process is the BGP finite state machine. It includes other parts that help determine how the states progress. The "Start Event" is a term specifying that the BGP speaker is initialized and router resources have been allocated. The ConnectRetry timer is a local matter and just has to be big enough for the TCP connection to establish. This timer is part of the first 3 states. The BGP Identifier is the BGP router-id.
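The client and server roles can be seen with two plain sockets on the loopback interface. This is a sketch and not BGP itself: the listener asks the OS for any free port because binding to 179 normally needs elevated privileges, and the function name is hypothetical.

```python
import socket


def handshake_demo():
    """Open one TCP connection on loopback and report the ports involved."""
    # Server role: the speaker that receives the first SYN.
    # Port 0 asks the OS for a free port; a BGP speaker would listen on 179.
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))
    server.listen(1)
    listen_port = server.getsockname()[1]

    # Client role: connecting triggers the 3-way handshake, and the OS
    # picks the source port for us.
    client = socket.create_connection(("127.0.0.1", listen_port))
    conn, peer_addr = server.accept()

    result = {
        "server_port": listen_port,
        "client_source_port": client.getsockname()[1],
        "port_seen_by_server": peer_addr[1],
    }
    conn.close()
    client.close()
    server.close()
    return result


print(handshake_demo())
```

The source port changes from run to run, while the listening side stays on its well-known port.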
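The implication for filtering is that port 179 can show up as either the destination or the source port, depending on which speaker ended up as the TCP client. A rough sketch of that check follows; it models a hypothetical packet filter in Python rather than any vendor's ACL syntax, and the addresses are made up.

```python
BGP_PORT = 179


def is_bgp_session_traffic(src_ip, src_port, dst_ip, dst_port, local_ip, peer_ip):
    """True if the packet belongs to a BGP session between local_ip and peer_ip."""
    # The packet has to be between exactly these two speakers.
    if {src_ip, dst_ip} != {local_ip, peer_ip}:
        return False
    # Whichever side acted as TCP server uses 179; the other side uses a random port.
    return src_port == BGP_PORT or dst_port == BGP_PORT


# The peer initiated the session: 179 is the destination port.
print(is_bgp_session_traffic("10.0.12.2", 50123, "10.0.12.1", 179,
                             "10.0.12.1", "10.0.12.2"))
# The peer is replying on a session we initiated: 179 is the source port.
print(is_bgp_session_traffic("10.0.12.2", 179, "10.0.12.1", 50123,
                             "10.0.12.1", "10.0.12.2"))
```

A filter that only permits 179 as the destination port would break the session whenever the roles are the other way around.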
- Idle: The speaker is not yet trying to connect. Once it is configured with neighbors and ready to initiate a connection, it starts the ConnectRetry timer, initiates a TCP connection while listening for one from its peer, and transitions to the Connect state. A speaker can go back to Idle for many reasons: a configuration change, the BGP process resetting and so on. The RFC is a bit political in the definition, stating "In response to any other event (initiated by either system or operator), the local system releases all BGP resources associated with this connection and changes its state to Idle", which refers to the transition from the other states to Idle (RFC 1771).
- Connect: The speakers are trying to complete the TCP handshake. If the TCP connection fails, the speaker goes into the Active state and the ConnectRetry timer is reset. If the connection succeeds, an Open message is sent and the speaker transitions to OpenSent.
- Active: In this state the speaker is still trying to establish the TCP connection and will keep trying for as long as the ConnectRetry timer runs. There are 3 conditions in the Active state. If the ConnectRetry timer expires, the speaker returns to the Connect state. If the speaker receives a connection attempt from another speaker whose IP doesn't match the configuration, the speaker resets the ConnectRetry timer and stays in Active. If the connection succeeds, an Open message is sent and the speaker transitions to OpenSent.
- OpenSent: The TCP connection is established, an Open message has been sent and the speaker is waiting for an Open message from its peer. The Open message contains the following information, which has to be accepted by the speaker: Version number, AS, Hold Time, BGP Identifier and optional parameters (authentication). If any of these values are in error, a Notification message is sent, with a different error code depending on which parameter is incorrect, and the speaker returns to the Idle state. If the values are accepted, the speaker sends a Keepalive message and starts its KeepAlive and Hold timers. The Hold Timer is negotiated between the speakers and lands on the lowest value proposed. The default KeepAlive is 60 seconds and the Hold Timer is 3 times that (180 seconds). Now the speaker is in OpenConfirm.
- OpenConfirm: The KeepAlive has been sent and the speaker is just waiting for its peer to send a KeepAlive. Once it receives this KeepAlive, it moves to Established. If the Hold Timer expires, the speaker sends a Notification message and goes into Idle.
- Established: We're good to go. BGP now exchanges KeepAlive, Update and Notification messages. Both KeepAlives and Updates reset the Hold Timer. The Hold Timer can be negotiated to zero, which means it can't time out, but the speakers can still lose their neighbor relationship through other error handling.
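The Hold Timer negotiation mentioned under OpenSent is small enough to write down. A minimal sketch, assuming the KeepAlive interval is simply one third of the agreed Hold Timer, as in the defaults above; the function name is my own.

```python
def negotiate_timers(local_hold, peer_hold):
    """Return (hold_time, keepalive_interval) in seconds for a session."""
    for value in (local_hold, peer_hold):
        # A Hold Time must be zero (disabled) or at least 3 seconds.
        if value != 0 and value < 3:
            raise ValueError("hold time must be 0 or at least 3 seconds")
    hold = min(local_hold, peer_hold)  # the lowest proposal wins
    keepalive = hold // 3              # zero means no KeepAlives are sent
    return hold, keepalive


print(negotiate_timers(180, 180))  # (180, 60)
print(negotiate_timers(180, 90))   # (90, 30)
print(negotiate_timers(180, 0))    # (0, 0)
```

Because the lowest value wins, a single speaker proposing zero is enough to disable the Hold Timer for the session.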
There are some unmentioned conditions in the finite state machine flow, but their significance is mostly in troubleshooting. How the speakers close or fail their connection may say a lot about what is wrong, but it isn't needed to understand this flow.
With both speakers trying to open a TCP connection, it's easy to imagine the attempts happening so fast that they overlap. BGP has something called Connection Collision Detection, which is exactly what handles this. It allows parallel TCP connections (one from each side) and takes both of them all the way into the OpenConfirm state. In this state it decides which connection gets to remain. If BGP closes a parallel connection, it sends a Notification with error code "Cease".
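How the surviving connection is picked isn't covered above. As I read RFC 4271 section 6.8, the BGP Identifiers are compared as unsigned 4-byte numbers and the connection initiated by the speaker with the higher identifier is kept. A sketch under that assumption, with hypothetical names and router-ids:

```python
import ipaddress


def connection_to_keep(local_id, remote_id):
    """Return which of two colliding connections survives.

    Identifiers are router-ids in dotted form and are assumed to differ.
    """
    local = int(ipaddress.IPv4Address(local_id))
    remote = int(ipaddress.IPv4Address(remote_id))
    if local == remote:
        raise ValueError("identical BGP Identifiers are outside this sketch")
    # The connection opened by the numerically higher identifier stays.
    return "initiated_by_local" if local > remote else "initiated_by_remote"


print(connection_to_keep("10.0.0.2", "10.0.0.1"))
print(connection_to_keep("2.0.0.1", "10.0.0.1"))
```

The second call shows why the comparison is numeric: 10.0.0.1 is the higher identifier, even though a string comparison would put "2.0.0.1" first.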
Now the BGP Finite Flow. Bear with me, I am not a grand flow diagram creator.
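The same flow can also be written as a small transition table. This is a simplified sketch built only from the six state descriptions above; the event names are my own labels, not terms from the RFC, and timers and message contents are left out.

```python
# (current state, event) -> next state
TRANSITIONS = {
    ("Idle", "start"): "Connect",
    ("Connect", "tcp_established"): "OpenSent",
    ("Connect", "tcp_failed"): "Active",
    ("Active", "connect_retry_expired"): "Connect",
    ("Active", "unknown_peer"): "Active",
    ("Active", "tcp_established"): "OpenSent",
    ("OpenSent", "open_ok"): "OpenConfirm",
    ("OpenConfirm", "keepalive_received"): "Established",
    ("Established", "keepalive_received"): "Established",
    ("Established", "update_received"): "Established",
}


def next_state(state, event):
    # Anything not listed (errors, Notifications, Hold Timer expiry, resets)
    # falls back to Idle, mirroring the RFC sentence quoted under Idle.
    return TRANSITIONS.get((state, event), "Idle")


def run(events, state="Idle"):
    """Feed a sequence of events through the table and return the final state."""
    for event in events:
        state = next_state(state, event)
    return state


print(run(["start", "tcp_established", "open_ok", "keepalive_received"]))
print(run(["start", "tcp_failed", "connect_retry_expired", "tcp_established"]))
```

The first run is the clean path to Established; the second takes the detour through Active and ends in OpenSent, waiting for the peer's Open message.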
I hope I haven't forgotten anything, but this is also just the introduction to BGP.