Ethernet switch

SONiC:

xiru2ly3rd88
10 min read · Apr 26, 2022

An open source network operating system based on Linux that runs on switches from multiple vendors and ASICs.

SAI (Switch Abstraction Interface)

SAI was accepted by the Open Compute Project (OCP) as a standardized C API for programming ASICs.


It allows network hardware vendors to develop innovative hardware architectures to achieve great speeds while keeping the programming interface consistent.

ref:

https://www.opencompute.org/documents/switch-abstraction-interface-ocp-specification-v0-2-pdf

FASTPATH

Broadcom software used to develop Ethernet products,
e.g. stacking, switching, routing, multicast, QoS

ref:

https://docs.broadcom.com/doc/FASTPATH-Networking-Software-Release-8-7-PB

Basic knowledge:

  • MAC Address Type:
  • Unicast

The least significant bit of the first octet of an address is set to 0.

E.g. 0001.4455.6677

  • Multicast

The least significant bit of the first octet is set to 1.

e.g. 0100.CCCC.DDDD

  • Broadcast

In hexadecimal the broadcast address would be FF:FF:FF:FF:FF:FF.
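The three address types above can be told apart with a few lines of Python. This is a minimal sketch; the function name `mac_type` is made up for illustration, and it accepts both the dotted and the colon-separated notation used in these notes.

```python
def mac_type(mac: str) -> str:
    """Classify a MAC address as unicast, multicast, or broadcast."""
    # Strip the separators so 0001.4455.6677 and 00:01:44:55:66:77 both work
    hex_digits = "".join(ch for ch in mac if ch not in ".:-")
    octets = bytes.fromhex(hex_digits)
    if len(octets) != 6:
        raise ValueError(f"expected 6 octets, got {len(octets)}")
    if octets == b"\xff" * 6:
        return "broadcast"
    # The least significant bit of the first octet separates the other two
    return "multicast" if octets[0] & 0x01 else "unicast"


print(mac_type("0001.4455.6677"))     # unicast
print(mac_type("0100.CCCC.DDDD"))     # multicast
print(mac_type("FF:FF:FF:FF:FF:FF"))  # broadcast
```

Broadcast is checked first because FF:FF:FF:FF:FF:FF also has the multicast bit set.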

Virtual LAN(VLAN)

Each VLAN is a separate broadcast domain.

  • VLAN ID

A number ranging from 1 to 4094.

[Ingress] VLAN Member

The physical interface members of the VLAN.

[Ingress] PVID (Port VLAN ID)

The VLAN ID (native VID) assigned to untagged frames arriving at the port.

PVID can be changed.

[Egress] VLAN Tag Member

When a frame is sent through this port, the device inserts a VLAN tag between the source MAC and the Ethernet type/length.

[Egress] VLAN Un-tag Member

When a frame is sent through this port, the device does not add a VLAN tag.

  • VLAN Tag:

Based on IEEE 802.1Q, a VLAN tag is a 4-byte field added between the source MAC address and the Ethernet Type/Length field of the original frame.

TPID = Tag Protocol Identifier (0x8100 is used for 802.1Q tagged frame)

PCP = Priority Code Point (from 0(lowest) to 7 (highest))

CFI = Canonical Format Indicator (always set to zero for Ethernet)

VID = VLAN Identifier (0x000 and 0xFFF are reserved)
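To make the 4-byte layout concrete, here is a small sketch that packs and unpacks a tag with Python's `struct` module. The helper names (`build_vlan_tag`, `parse_vlan_tag`, `insert_tag`) are my own, not from any library.

```python
import struct

TPID_8021Q = 0x8100  # Tag Protocol Identifier for 802.1Q


def build_vlan_tag(vid: int, pcp: int = 0, cfi: int = 0) -> bytes:
    """Pack TPID (16 bits) + PCP (3 bits) + CFI (1 bit) + VID (12 bits)."""
    if not 1 <= vid <= 4094:
        raise ValueError("VID must be 1..4094 (0x000 and 0xFFF are reserved)")
    if not 0 <= pcp <= 7:
        raise ValueError("PCP must be 0..7")
    tci = (pcp << 13) | ((cfi & 1) << 12) | vid
    return struct.pack("!HH", TPID_8021Q, tci)


def parse_vlan_tag(tag: bytes) -> dict:
    """Split a 4-byte tag back into its fields."""
    tpid, tci = struct.unpack("!HH", tag)
    return {"tpid": tpid, "pcp": tci >> 13, "cfi": (tci >> 12) & 1, "vid": tci & 0x0FFF}


def insert_tag(frame: bytes, tag: bytes) -> bytes:
    """Insert the tag after DMAC (6 bytes) + SMAC (6 bytes)."""
    return frame[:12] + tag + frame[12:]


tag = build_vlan_tag(vid=10, pcp=5)
print(tag.hex())            # 8100a00a
print(parse_vlan_tag(tag))
```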

IPv4

Classless Inter-Domain Routing (CIDR) is a method for allocating IP addresses and routing IP packets.

A CIDR Block: 10.10.1.32/27

IP Range -> 10.10.1.32~10.10.1.63

The usable host IP address range is 10.10.1.33~10.10.1.62

The all-ones host address of each subnet is that subnet’s broadcast address.

The all-zero host address of each subnet is that subnet’s network ID.

Broadcast Address:

172.16.0.0/12, which has the subnet mask 255.240.0.0,

the broadcast address is 172.16.0.0 | 0.15.255.255 = 172.31.255.255
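Both examples above can be checked with Python's standard `ipaddress` module. This is just a sketch; `describe` is a made-up helper name, and the broadcast address is computed with the same OR operation as in the text.

```python
import ipaddress


def describe(cidr: str) -> dict:
    """Return the network ID, broadcast address, and usable host range."""
    net = ipaddress.ip_network(cidr)
    # broadcast = network address OR host mask (all host bits set to 1)
    broadcast = ipaddress.ip_address(int(net.network_address) | int(net.hostmask))
    return {
        "netmask": str(net.netmask),
        "network": str(net.network_address),
        "broadcast": str(broadcast),
        "first_host": str(net.network_address + 1),
        "last_host": str(broadcast - 1),
    }


print(describe("10.10.1.32/27"))
print(describe("172.16.0.0/12"))
```

The first-host/last-host arithmetic only makes sense for ordinary subnets; /31 and /32 prefixes are special cases this sketch does not handle.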

Multicast Address:

IPv4: 224.0.0.0/4

IPv6: ff00::/8

IPv6

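128-bit addresses, which solve the IPv4 address shortage

Commonly a 64-bit prefix + a 64-bit interface ID

EUI-64: generates the interface ID from the MAC address

Insert the 16 bits FFFE into the middle of the 48-bit MAC -> flip the 7th bit of the first octet (here from 0 to 1)

ex:

MAC: 00:50:56:C1:A0:E8

-> 00:50:56:FF:FE:C1:A0:E8 -> 02:50:56:FF:FE:C1:A0:E8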
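The two EUI-64 steps can be sketched in Python as below. The function name `eui64_interface_id` is made up for illustration, and it assumes a colon-separated MAC.

```python
def eui64_interface_id(mac: str) -> str:
    """Build the 64-bit interface ID from a 48-bit MAC (EUI-64 style)."""
    octets = bytearray(int(part, 16) for part in mac.split(":"))
    if len(octets) != 6:
        raise ValueError("expected a MAC with 6 octets")
    # Step 1: insert FF:FE between the two halves of the MAC
    eui = octets[:3] + bytearray(b"\xff\xfe") + octets[3:]
    # Step 2: flip the 7th bit (counting from the most significant bit)
    # of the first octet, i.e. the 0x02 bit
    eui[0] ^= 0x02
    return ":".join(f"{b:02X}" for b in eui)


print(eui64_interface_id("00:50:56:C1:A0:E8"))  # 02:50:56:FF:FE:C1:A0:E8
```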

Open Systems Interconnection (OSI) 7-layer model

Unicast Packet Forwarding

In IP unicast packet forwarding, the host table is searched first. If there is no matching destination, the switch performs a Longest Prefix Match (LPM) in the unicast routing table to select the destination. If no matching prefix is found, the IP packet is dropped.
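The lookup order described above (host table first, then LPM, otherwise drop) can be sketched in software like this. Real switches do this in ASIC tables; the dictionaries, the `lookup` function, and the sample entries here are assumptions for illustration only.

```python
import ipaddress


def lookup(dip: str, host_table: dict, lpm_table: dict):
    """Return the forwarding result for dip, or None if the packet is dropped."""
    addr = ipaddress.ip_address(dip)
    # 1. Exact match in the host table
    if str(addr) in host_table:
        return host_table[str(addr)]
    # 2. Longest Prefix Match in the routing table
    best_net, best_result = None, None
    for prefix, result in lpm_table.items():
        net = ipaddress.ip_network(prefix)
        if addr in net and (best_net is None or net.prefixlen > best_net.prefixlen):
            best_net, best_result = net, result
    # 3. No matching prefix: best_result is still None, so the packet is dropped
    return best_result


# Hypothetical tables
host_table = {"192.168.1.10": "directly attached host"}
lpm_table = {
    "192.168.1.0/24": "next hop 192.168.12.2",
    "192.168.1.32/29": "next hop 192.168.12.12",
}
print(lookup("192.168.1.10", host_table, lpm_table))   # host table hit
print(lookup("192.168.1.33", host_table, lpm_table))   # /29 wins over /24
print(lookup("10.0.0.1", host_table, lpm_table))       # None: dropped
```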

L2

FDB Table

  • MAC learning is the process of obtaining the MAC addresses of all the nodes on a network.

The Forwarding DataBase (FDB) table is used by a Layer 2 device to store the MAC addresses and VLAN IDs that have been learned, together with the port each MAC address was learned on.

The switch uses the forwarding database to forward packets to the appropriate bridge in the bridge group.
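A toy model of MAC learning and FDB-based forwarding is shown below. The class and function names are made up. Flooding unknown destinations to the other VLAN members is standard switch behaviour, but it is my addition and is not spelled out in the notes above.

```python
class Fdb:
    """Forwarding database keyed by (VLAN ID, MAC address)."""

    def __init__(self):
        self.table = {}  # (vlan, mac) -> port

    def learn(self, vlan, smac, port):
        # Record (or refresh) the port where this source MAC was seen
        self.table[(vlan, smac)] = port

    def lookup(self, vlan, dmac):
        return self.table.get((vlan, dmac))


def handle_frame(fdb, vlan, smac, dmac, in_port, vlan_ports):
    """Learn the source, then return the list of egress ports."""
    fdb.learn(vlan, smac, in_port)
    out_port = fdb.lookup(vlan, dmac)
    if out_port is None:
        # Unknown destination: flood to every other member of the VLAN
        return sorted(p for p in vlan_ports if p != in_port)
    # Never send a frame back out of the port it arrived on
    return [] if out_port == in_port else [out_port]


fdb = Fdb()
ports = {1, 2, 3}
print(handle_frame(fdb, 10, "aa", "bb", 1, ports))  # [2, 3] (flooded)
print(handle_frame(fdb, 10, "bb", "aa", 2, ports))  # [1]
print(handle_frame(fdb, 10, "aa", "bb", 1, ports))  # [2]
```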

STP(spanning tree protocol):

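Prevents broadcast storms by deciding which ports forward data and which do not.

Flow:

  1. Elect the root switch: lowest switch priority first, then lowest MAC address

  2. Each non-root switch selects a root port: the port with the lowest cost (cost depends on bandwidth and is configurable)

  3. Every network segment selects one designated port: lowest switch priority first, then lowest MAC address

  4. A port that is neither a root port nor a designated port becomes a non-designated port and goes into blocking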

Election rule:

Bridge Protocol Data Unit (BPDU): includes the bridge ID, cost, and port ID
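The root switch election rule (lowest priority, then lowest MAC) amounts to comparing bridge IDs as tuples. A minimal sketch, with made-up switch names and values:

```python
def bridge_id(priority: int, mac: str):
    """Bridge ID = (priority, MAC); tuples compare priority first, then MAC."""
    return (priority, bytes.fromhex(mac.replace(":", "")))


def elect_root(switches: dict) -> str:
    """switches maps a name to (priority, MAC); the lowest bridge ID wins."""
    return min(switches, key=lambda name: bridge_id(*switches[name]))


# Hypothetical topology
switches = {
    "SW1": (32768, "00:00:00:00:00:03"),
    "SW2": (32768, "00:00:00:00:00:01"),
    "SW3": (4096, "00:00:00:00:00:09"),
}
print(elect_root(switches))  # SW3: lowest priority wins despite the highest MAC

del switches["SW3"]
print(elect_root(switches))  # SW2: priorities tie, so the lowest MAC wins
```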

Port state

  • Disabled: the port is shut down
  • Blocking:

1. does not tx/rx data

2. rx BPDUs, because it needs to know the STP state

3. when the switch starts up, all ports are blocking

  • Listening:

1. does not tx/rx data

2. tx/rx BPDUs

3. participates in the root switch/root port/designated port election

  • Learning:

1. does not tx/rx data

2. tx/rx BPDUs

3. learns MAC addresses into the FDB table

  • Forwarding:

1. tx/rx data

2. tx/rx BPDUs

Topology change: a switch is added or removed

When a switch discovers a topology change, it sends topology change notification (TCN) BPDUs toward the root switch

  • A topology change is a port transition of either kind:

1. listening/blocking/disabled -> forwarding
2. listening/forwarding -> blocking/disabled

  • PortFast:

Plugging a PC into the topology, or turning a PC on/off, does not change the STP topology

Normally the port connected to the PC still has to pass through: blocking -> listening -> learning -> forwarding

Setting this port as PortFast lets it skip listening and learning

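Rapid Spanning Tree Protocol (RSTP)

  1. Solves STP's slow convergence (the listening and learning delays are 15s each, 30s in total)

A switch does not passively wait for BPDUs; it actively negotiates with its neighboring switches

  2. Backward compatible with STP

  • Port role: adds the alternate and backup ports

1. Root port

2. Designated port

3. Alternate port:

A port that is neither root nor designated and receives a superior BPDU from a designated port on another switch becomes an alternate port, serving as a backup path to the root switch

4. Backup port

A port that is neither root nor designated and receives a superior BPDU from a designated port on its own switch becomes a backup port, serving as a backup for that designated port on the same segment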

  • Port state: discarding, learning, forwarding
  • Link type: new in RSTP

Edge port:
1. Inherits PortFast behavior: discarding -> forwarding
2. Does not generate a TCN as long as it cannot cause a loop

Point-to-point non-edge port: full duplex, communicates using RSTP

Shared non-edge port: half duplex, communicates using STP

  • Sync process: a newly added switch sends a proposal to elect the root switch

case 1: the new switch does not become the root switch

case 2: the new switch becomes the root switch

  • Topology change


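Used on the edge switches (switch C, D): a tunnel is built so that STP/CDP/VTP packets are not affected by the core network, and switches A, B, and E behave as if they were on the same LAN.

Encapsulate: when an STP packet enters the core network (switch A -> switch C), its MAC is replaced

Decapsulate: when an STP packet leaves the core network (switch C -> switch B), the original MAC is restored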

Port Channel

Two or more physical Ethernet ports -> 1 logical port

Advantages:

1. Higher bandwidth
2. Redundancy: if one link fails, another one is still available

The member ports must have the same settings: speed, duplex, STP, VLAN…

Mode:

static: shut the ports down first (to avoid a loop) -> set static mode on -> no shutdown on the ports

dynamic:

active
passive

Link Aggregation Control Protocol (LACP):

At least one side must be set to active
1. Hot standby: when an active port goes down, a hot-standby port takes over
2. Max 16 ports in one channel (8 active + 8 hot standby)
Active/hot-standby election: port priority first, then port ID
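The active/hot-standby split can be sketched as a sort by (port priority, port ID). I am assuming the usual LACP convention that a lower number means a higher priority; the function name and the sample ports are made up.

```python
def select_active(ports, max_active=8):
    """Split ports into (active, hot_standby) by port priority, then port ID."""
    # Assumption: a lower numeric value means a higher priority
    ordered = sorted(ports, key=lambda p: (p["priority"], p["port_id"]))
    return ordered[:max_active], ordered[max_active:]


# Ten hypothetical member ports; port 10 has a better (lower) priority
ports = [{"port_id": i, "priority": 32768} for i in range(1, 10)]
ports.append({"port_id": 10, "priority": 100})

active, standby = select_active(ports)
print([p["port_id"] for p in active])   # [10, 1, 2, 3, 4, 5, 6, 7]
print([p["port_id"] for p in standby])  # [8, 9]

# When active port 3 goes down, re-running the election promotes port 8
remaining = [p for p in ports if p["port_id"] != 3]
active, standby = select_active(remaining)
print([p["port_id"] for p in active])   # [10, 1, 2, 4, 5, 6, 7, 8]
```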

  • Load balancing:

IP Multicast

Reduces bandwidth usage: the source sends one stream that is delivered to many receivers.

https://www.cisco.com/c/en/us/td/docs/ios/solutions_docs/ip_multicast/White_papers/mcst_ovr.html

Traditional TV vs IPTV:

  • Traditional TV:

Antenna connection modes:

  1. Direct
  2. Indirect

Users who want voice or internet service need separate subscriptions from a telco or an ISP, respectively.

  • IPTV:

IPTV offers video, data, and audio over a single connection (from a telephone company or internet service provider).

  • Traditional TV vs IPTV (figure)

Multicast VLAN:

Use case: hosts in different VLANs want the same multicast stream

Benefit: reduces the bandwidth used by the mcast stream source

Flow:

  1. Create a multicast VLAN; it becomes the only VLAN that carries the mcast traffic
  2. Enable IGMP snooping
  3. The switch forwards the mcast traffic from the source interface to the receiver interfaces that the hosts are connected to, even though they are not members of the mcast VLAN; each host remains in its own VLAN for bandwidth and security reasons

ex: host A (VLAN 10) and host B (VLAN 20) both want the mcast stream
1. Add VLAN 10 and 20 to mcast VLAN 1100

2. Set the ports that host A/B connect to as receiver ports, and the port that receives the mcast stream as the source port


IGMP:

Serves as the transport for several related multicast protocols (e.g. DVMRP, PIM)

IGMP is an integral part of IP and must be enabled on all routing devices and hosts that need to receive IP multicast traffic.

L3

Address Resolution Protocol (ARP): finds the MAC address for an IP address

  • Proxy ARP:

Flow:

rx ARP request -> look up the ARP table -> if found, reply to the ARP request

Routing

Static route: configured by the user

Dynamic route: learned through a routing protocol:

Two routers are configured with OSPF, EIGRP…, and exchange routing information

  • Next hop:

ex: pkt DIP 192.168.2.154 -> next hop 192.168.12.2 -> second lookup -> 192.168.12.0/24, sent out of Ethernet0/0

  • Route election: administrative distance (AD) first, then metric

Only one route is kept for the same network

  • Subnet route & longest match

ex: pkt DIP 192.168.1.33 -> 192.168.1.32/29 -> next hop 192.168.12.12

Static route setting next hop

  • IP

ex: ip route 192.168.101.0 255.255.255.0 192.168.12.2

show ip route

C 192.168.12.0/24, e0/0

S 192.168.101.0 [1/0] via 192.168.12.2

pkt DIP 192.168.101.1 -> next hop 192.168.12.2 -> second lookup -> e0/0

-> e0/0 sends an ARP request (asking for the MAC of 192.168.12.2) -> R2 replies -> R1 records it in its ARP table

  • Interface

ex: ip route 192.168.101.0 255.255.255.0 Ethernet0/0

C 192.168.12.0/24, e0/0

S 192.168.101.0 [1/0] via e0/0

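pkt DIP 192.168.101.1 -> next hop e0/0 -> e0/0 sends an ARP request (asking for the MAC of 192.168.101.1)

-> if Proxy ARP is enabled on R2's E0/0 and R2's routing table has an entry covering 192.168.101.1, R2 replies to the ARP request

-> R1 records it in its ARP table.

R1 ends up recording one ARP entry per DIP:
ex: pkt DIP 192.168.101.1
pkt DIP 192.168.101.2

  • Interface + IP

ex: ip route 192.168.101.0 255.255.255.0 Ethernet0/0 192.168.12.2

This avoids recording multiple DIPs in the ARP table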

Policy based routing

Flow:

  1. Set an access list

  2. Set a route map

  3. Apply the route map to an interface

Administrative Distance(AD):

When a router can reach a destination through more than one routing protocol, the AD decides which routing protocol's route is chosen.
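Route election by AD and then metric can be sketched as a tuple comparison. The AD values below are the commonly cited Cisco defaults and are my assumption, since the notes do not list them; the candidate routes are made up.

```python
# Assumption: commonly cited Cisco default administrative distances
DEFAULT_AD = {"connected": 0, "static": 1, "eigrp": 90, "ospf": 110, "rip": 120}


def best_route(candidates):
    """Pick one route for a prefix: lowest AD first, then lowest metric."""
    return min(candidates, key=lambda r: (DEFAULT_AD[r["protocol"]], r["metric"]))


# Hypothetical candidates, all for the same destination network
candidates = [
    {"protocol": "rip", "metric": 2, "next_hop": "192.168.12.2"},
    {"protocol": "ospf", "metric": 20, "next_hop": "192.168.13.3"},
    {"protocol": "ospf", "metric": 10, "next_hop": "192.168.14.4"},
]
print(best_route(candidates))  # the OSPF route with metric 10
```

RIP loses even with the smallest metric, because metrics from different protocols are never compared directly; the AD decides between protocols first.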

Routing Table Object

The unicast routing table consists of routing entries managed by Routing Table Object (RTO), also known as Forwarding Information Base (FIB), which is used for IP packet routing.

The RTO records the following information:

1. route prefix / mask length

2. next-hop

3. cost

4. weight

5. route-type

RTO will select the best routing path for each routing prefix and update it into the LPM table.

Routing Table Learning

The routing table learns routing entries from static configuration, from dynamic routing protocols, and from the IP addresses assigned to local interfaces.

Routing Flow

1. Lookup LPM table by DIP address to get next-hop’s IP.

2. Lookup ARP table by next-hop’s IP to find MAC address.

3. Lookup FDB table by MAC to get Egress port and VLAN.

4. Lookup VLAN table by (Egress port+VLAN) to decide VLAN tag.

5. Change SMAC to outgoing IP interface’s MAC.

6. Change DMAC to next-hop’s MAC.

7. Change VID based on FDB and VLAN information.

8. Send the packet to the egress port based on FDB information.

Packet Routing Example:

  • Environment:

Routing Table in DUT-A:

10.1.1.0/24, local route

20.2.2.0/24, local route

30.3.3.0/24, NH=20.2.2.2

ARP Table in DUT-A: 20.2.2.2, 00-00-00-00-00-04

FDB Table in DUT-A: 00-00-00-00-00-01, vlan10, eth1/0/1

00-00-00-00-00-02, vlan10, L3 lookup

00-00-00-00-00-03, vlan20, L3 lookup

00-00-00-00-00-04, vlan20, eth1/0/2

VLAN Table in DUT-A:

VLAN10, untag member: eth1/0/1

VLAN20, tag member: eth1/0/2

----------------------------------------

Host X sends an IP packet to Host Y,

SMAC: 00-00-00-00-00-01, DMAC: 00-00-00-00-00-02, Untag

SIP: 10.1.1.99, DIP: 30.3.3.99
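To walk this example through the routing flow, here is a rough Python sketch using the DUT-A tables. One assumption of mine: the FDB entry marked "L3 lookup" on VLAN 20 is treated as DUT-A's own IP interface MAC on that VLAN, which is what the new SMAC is set to. The function name is made up.

```python
import ipaddress

# Tables copied from the DUT-A example
ROUTES = {"10.1.1.0/24": None, "20.2.2.0/24": None, "30.3.3.0/24": "20.2.2.2"}
ARP = {"20.2.2.2": "00-00-00-00-00-04"}
FDB = {  # MAC -> (VLAN, port); "L3" marks the "L3 lookup" entries
    "00-00-00-00-00-01": (10, "eth1/0/1"),
    "00-00-00-00-00-02": (10, "L3"),
    "00-00-00-00-00-03": (20, "L3"),
    "00-00-00-00-00-04": (20, "eth1/0/2"),
}
VLAN_TAGGED = {(10, "eth1/0/1"): False, (20, "eth1/0/2"): True}
# Assumption: each "L3 lookup" MAC is DUT-A's IP interface MAC on that VLAN
INTERFACE_MAC = {vlan: mac for mac, (vlan, port) in FDB.items() if port == "L3"}


def route_packet(dip: str):
    addr = ipaddress.ip_address(dip)
    # 1. LPM lookup by DIP to get the next hop
    matches = [ipaddress.ip_network(p) for p in ROUTES if addr in ipaddress.ip_network(p)]
    if not matches:
        return None  # no route: drop
    best = max(matches, key=lambda net: net.prefixlen)
    next_hop = ROUTES[str(best)] or dip  # local route: the DIP is the next hop
    # 2. ARP lookup by next-hop IP to get the MAC
    dmac = ARP.get(next_hop)
    if dmac is None:
        return None  # a real device would send an ARP request here
    # 3. FDB lookup by MAC to get the egress port and VLAN
    vlan, port = FDB[dmac]
    # 4. VLAN table lookup to decide the VLAN tag
    tagged = VLAN_TAGGED[(vlan, port)]
    # 5-8. Rewrite SMAC, DMAC, VID and pick the egress port
    return {"smac": INTERFACE_MAC[vlan], "dmac": dmac,
            "vid": vlan if tagged else None, "port": port}


print(route_packet("30.3.3.99"))
```

Under these assumptions, Host X's packet leaves eth1/0/2 tagged with VID 20, with SMAC 00-00-00-00-00-03 and DMAC 00-00-00-00-00-04.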

Protocol Independent Multicast(PIM):

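Uses reverse path forwarding (RPF) checks: look up the source in the routing table -> build the shortest path tree (SPT)

PIM-DM:

Traffic is flooded; routers that do not want it prune themselves

Traffic is re-flooded every 3 min

Assert:

For the same segment, the same mcast group, and the same source, only the assert winner interface may forward traffic

Election: administrative distance first, then metric, then interface ID

PIM-SM

Traffic is not flooded; receivers that want it join explicitly

Adds a Rendezvous Point (RP)

Flow:

1. R1 rx mcast 224.1.1.1 -> registers with the RP

2. PC1 joins 224.1.1.1 on R2

mcast 224.1.1.1 path: R5 -> R1 -> R3 -> R4 -> R2

The SPT optimizes the path to: R5 -> R1 -> R2

  • Designated Router: only the Designated Router may communicate with the RP

Bootstrap Router Protocol:

BSR: similar to a mapping agent (it distributes the RP information to everyone)

Candidate RP:

With only one RP, traffic is no longer forwarded once the RP goes down,

so configure multiple candidate RPs; when the active RP goes down, a backup RP takes over

RP election: priority first, then hash value

Number of load-balanced RPs = 2^(32 - hash mask)

ex: 2 RPs -> hash mask = 31

Bidirectional PIM:

A node is both an mcast source and an mcast receiver at the same time.

With the RP as the root, the tree is built toward both sources and receivers

ex: a conferencing system: speaking (src), receiving audio/video (dst)

Multicast Boundary:

Restricts mcast traffic from being forwarded into another domain

Done by applying an access list on the port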

Routing Information Protocol(RIP):

A distance-vector routing protocol.

  • Passive Interface:

R4 does not run RIP -> R3's e0/1 can be set as a passive interface

Auth:

MD5
Text

Metric: computed from hop count only; the maximum is 15, and 16 or more means unreachable
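The hop-count metric can be illustrated with a simplified distance-vector update. This is only a sketch of the idea (no timers, split horizon, or triggered updates); the function name and router names are made up.

```python
INFINITY = 16  # 16 hops or more means unreachable


def rip_update(table: dict, neighbor: str, advertised: dict) -> None:
    """Merge a neighbor's advertisement into table: prefix -> (metric, next_hop)."""
    for prefix, metric in advertised.items():
        new_metric = min(metric + 1, INFINITY)  # one more hop via the neighbor
        current = table.get(prefix)
        if current is None:
            if new_metric < INFINITY:
                table[prefix] = (new_metric, neighbor)
        elif new_metric < current[0] or current[1] == neighbor:
            # Better path, or fresh news from the next hop already in use
            table[prefix] = (new_metric, neighbor)


table = {}
rip_update(table, "R2", {"10.0.0.0/8": 1})
print(table)  # {'10.0.0.0/8': (2, 'R2')}
rip_update(table, "R3", {"10.0.0.0/8": 3})   # 4 hops: worse, ignored
rip_update(table, "R2", {"10.0.0.0/8": 15})  # 16 hops via R2: now unreachable
print(table)  # {'10.0.0.0/8': (16, 'R2')}
```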

To be continued…

ref: to be organized

BCM ASIC

Hash table


These study notes are not guaranteed to be 100% correct; they are only meant for quick review. Contact: xiru2ly3rd88@gmail.com
