Recently, a friend asked me to share some experience and tips from my work at KiwiICT, so I decided to start writing here and show real cases from that work.
Last week we received a service request with the subject “Fortigate OSPF issue”. The customer had the following infrastructure:
In the diagram above, the physical links towards the standby Fortigate are omitted for simplicity.
The Cisco switches were not in our scope for this customer, so the complete topology above was put together during the investigation. The customer hosted multiple tenants inside the data center. Traffic between tenants was isolated by separate VLANs, VRFs with dedicated OSPF peering, and a dedicated VDOM on the Fortigate firewall. The port-channels between the Fortigate appliances and the Cat3850X switches were configured as 802.1Q trunks with multiple subinterfaces for L3 peering and OSPF adjacencies. The rest of the links shown were set up the same way: all of them were trunk interfaces carrying multiple VLANs/L3 subinterfaces, with OSPF peering for each tenant. As a result, two ECMP routes for each network behind a VDOM were available in the customer's VRF on the DC core switches, and vice versa.
Problem Description: The customer saw multiple errors on the physical link between Cat3850X-1 (Gi1/0/1) and the active Fortigate 3000D member (Port1). Since these errors affected the part of the customer's traffic that was load-balanced onto this link by the port-channel and CEF algorithms, the local NOC team decided to disable it. This did not fix the issue; on the contrary, it made things worse. The local NOC team concluded that the problem was with stateful inspection on the Fortigate and sent a service request to us.
We started our investigation with the customer's first tenant. With the affected interface disabled, the routing table had only one route available instead of the two ECMP routes it had before:
Before disabling the link:
At the same time, both OSPF adjacencies were up, and the interfaces had the following link costs:
As can be seen, the Fortigate increased the OSPF cost of the affected port-channel because its total bandwidth dropped to 10 Gbps. The path through it was no longer equal in cost to the other one, so this route disappeared from the forwarding table and the traffic processing modules.
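To make the cost change concrete, here is a minimal Python sketch of the usual OSPF auto-cost rule (reference bandwidth divided by interface bandwidth, never below 1). The 100 Gbps reference bandwidth and the 20 Gbps starting bandwidth of the port-channel are assumptions for illustration only; the real values depend on the customer's configuration.

```python
def ospf_cost(reference_bw_mbps: int, interface_bw_mbps: int) -> int:
    """Auto-cost rule: reference bandwidth / interface bandwidth, minimum 1."""
    return max(1, reference_bw_mbps // interface_bw_mbps)


# Assumed reference bandwidth (100 Gbps); not taken from the customer's config.
REFERENCE_BW_MBPS = 100_000

# Assumed: the port-channel had two 10 Gbps members before the link was disabled.
cost_before = ospf_cost(REFERENCE_BW_MBPS, 20_000)
# After one member is shut down, only 10 Gbps of aggregate bandwidth remains.
cost_after = ospf_cost(REFERENCE_BW_MBPS, 10_000)

print("cost with both members up:", cost_before)  # 5
print("cost with one member down:", cost_after)   # 10
```

Under these assumptions, halving the aggregate bandwidth doubles the cost, which is enough to break the tie between two otherwise equal paths.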
We asked the NOC team to show us what the OSPF topology looked like on the network switches.
Nothing had changed there: both routes still had equal OSPF cost. The DC core switch kept its ECMP routes and continued to load-balance traffic towards both Cat3850X switches. Therefore, all traffic directed to Cat3850X-1 was blackholed on the Fortigate, because the route through this link was no longer in its forwarding table.
This behavior of the Cisco switch was caused by its SVI interfaces. By default, an SVI has a bandwidth of 1 Gbps and does not take the current bandwidth of the underlying physical (or aggregated) interface into account. This does not limit the amount of traffic through the port-channel interfaces, but it does affect the OSPF cost calculation: the cost on the switch side stayed the same after the member link was disabled.
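The mismatch between the two sides can be illustrated with a small Python sketch of ECMP selection: a router installs every next hop whose total cost equals the minimum. The cost values below are hypothetical and only mirror the pattern described in this case, not the customer's real numbers.

```python
def best_next_hops(path_costs: dict[str, int]) -> list[str]:
    """Return all next hops that share the lowest total cost (the ECMP set)."""
    best = min(path_costs.values())
    return sorted(hop for hop, cost in path_costs.items() if cost == best)


# Hypothetical costs as seen by the Fortigate: the port-channel cost follows
# the aggregate bandwidth, so it rises when a member link is disabled.
fortigate_before = {"Cat3850X-1": 5, "Cat3850X-2": 5}
fortigate_after = {"Cat3850X-1": 10, "Cat3850X-2": 5}

# Hypothetical costs as seen by the DC core: the SVI bandwidth is fixed,
# so the cost does not react to the disabled member link.
core_after = {"Cat3850X-1": 100, "Cat3850X-2": 100}

print("Fortigate before:", best_next_hops(fortigate_before))
print("Fortigate after: ", best_next_hops(fortigate_after))
print("DC core after:   ", best_next_hops(core_after))
```

In this sketch the Fortigate ends up with a single next hop while the core still uses both, which is the asymmetry that caused the traffic loss.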
As a workaround, we suggested having both port-channel member interfaces monitored by the HA process. If one of the links fails, the Fortigate cluster fails over to the standby node. That’s it. Simple and clear work.
How would you deal with such a case? Could you have solved it more simply and quickly?