Ethernet switch
SONiC:
open source network operating system based on Linux that runs on switches from multiple vendors and ASICs
SAI (Switch Abstraction Interface)
SAI was accepted by the Open Compute Project (OCP) as a standardized C API for programming ASICs.
ref:
It allows network hardware vendors to develop innovative hardware architectures that achieve high speeds while keeping the programming interface consistent.
ref:
https://www.opencompute.org/documents/switch-abstraction-interface-ocp-specification-v0-2-pdf
FASTPATH
Broadcom's software for developing Ethernet products,
e.g. stacking, switching, routing, multicast, QoS
ref:
https://docs.broadcom.com/doc/FASTPATH-Networking-Software-Release-8-7-PB
Basic knowledge:
- MAC Address Type:
- Unicast
The least significant bit of the first octet of an address is set to 0.
E.g. 0001.4455.6677
- Multicast
The least significant bit of the first octet is set to 1.
e.g. 0100.CCCC.DDDD
- Broadcast
In hexadecimal the broadcast address would be FF:FF:FF:FF:FF:FF.
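The three address types above can be told apart by looking only at the first octet (plus an all-ones check for broadcast). A minimal Python sketch; the helper names mac_to_bytes and mac_type are made up for illustration and are not from any library:

```python
def mac_to_bytes(mac: str) -> bytes:
    """Parse a MAC written as 0001.4455.6677, 00:01:44:55:66:77 or 00-01-44-55-66-77."""
    hex_digits = mac.replace(".", "").replace(":", "").replace("-", "")
    if len(hex_digits) != 12:
        raise ValueError(f"not a 48-bit MAC address: {mac!r}")
    return bytes.fromhex(hex_digits)


def mac_type(mac: str) -> str:
    """Classify a MAC address as unicast, multicast or broadcast."""
    octets = mac_to_bytes(mac)
    if octets == b"\xff" * 6:
        return "broadcast"
    # The least significant bit of the first octet decides unicast vs multicast.
    return "multicast" if octets[0] & 0x01 else "unicast"


for example in ("0001.4455.6677", "0100.CCCC.DDDD", "FF:FF:FF:FF:FF:FF"):
    print(example, "->", mac_type(example))
```

Broadcast is checked first because FF:FF:FF:FF:FF:FF also has the multicast bit set.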
Virtual LAN(VLAN)
Separates a network into distinct broadcast domains.
- VLAN ID
A number ranging from 1 to 4094.
[Ingress] VLAN Member
The physical interface members of the VLAN.
[Ingress] PVID (Port VLAN ID)
The VLAN ID (native VID) assigned to untagged frames arriving at the port.
PVID can be changed.
[Egress] VLAN Tag Member
When a frame is sent through this port, the device inserts a VLAN tag between the source MAC and the Ethernet type/length.
[Egress] VLAN Un-tag Member
When a frame is sent through this port, the device does not add the VLAN tag.
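The ingress and egress rules above can be sketched as two small lookups. This is only an illustration: the PORTS table, the port names and the choice to drop frames whose VLAN the port is not a member of (ingress filtering) are assumptions, not something the notes above specify.

```python
# Hypothetical per-port VLAN configuration, for illustration only.
PORTS = {
    "eth1": {"pvid": 10, "members": {10}, "tagged": set()},
    "eth2": {"pvid": 1, "members": {1, 10, 20}, "tagged": {10, 20}},
}


def classify_ingress(port, frame_vid):
    """Return the VLAN of a received frame, or None if the frame is dropped.

    frame_vid is None for an untagged frame.
    """
    cfg = PORTS[port]
    vid = cfg["pvid"] if frame_vid is None else frame_vid  # untagged -> PVID
    # Assumption: frames for a VLAN the port does not belong to are dropped.
    return vid if vid in cfg["members"] else None


def egress_action(port, vid):
    """Decide whether a frame of VLAN vid leaves the port tagged or untagged."""
    cfg = PORTS[port]
    if vid not in cfg["members"]:
        return "drop"
    return "tag" if vid in cfg["tagged"] else "untag"


print(classify_ingress("eth1", None))  # untagged frame gets the PVID
print(egress_action("eth2", 10))       # eth2 is a tag member of VLAN 10
```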
- VLAN Tag:
Based on IEEE 802.1Q, a VLAN tag is a 4-byte field inserted between the source MAC address and the Ethernet Type/Size field of the original frame.
TPID = Tag Protocol Identifier (0x8100 is used for 802.1Q tagged frames)
PCP = Priority Code Point (from 0 (lowest) to 7 (highest))
CFI = Canonical Format Indicator (always set to zero for Ethernet)
VID = VLAN Identifier (0x000 and 0xFFF are reserved)
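To make the field layout concrete, here is a minimal Python sketch that packs and unpacks the 4-byte tag (16-bit TPID, then 3-bit PCP, 1-bit CFI, 12-bit VID). The function names are illustrative only, and the sample frame is made up.

```python
import struct

TPID_8021Q = 0x8100


def build_vlan_tag(vid: int, pcp: int = 0, cfi: int = 0) -> bytes:
    """Pack TPID + (PCP | CFI | VID) into the 4-byte 802.1Q tag."""
    if not 1 <= vid <= 4094:
        raise ValueError("VID must be 1..4094 (0x000 and 0xFFF are reserved)")
    if not 0 <= pcp <= 7 or cfi not in (0, 1):
        raise ValueError("PCP must be 0..7 and CFI must be 0 or 1")
    tci = (pcp << 13) | (cfi << 12) | vid
    return struct.pack("!HH", TPID_8021Q, tci)


def parse_vlan_tag(tag: bytes) -> dict:
    """Split a 4-byte tag back into its fields."""
    tpid, tci = struct.unpack("!HH", tag)
    return {"tpid": tpid, "pcp": tci >> 13, "cfi": (tci >> 12) & 1, "vid": tci & 0x0FFF}


def insert_tag(frame: bytes, tag: bytes) -> bytes:
    """Insert the tag after DMAC (6 bytes) + SMAC (6 bytes), before type/length."""
    return frame[:12] + tag + frame[12:]


tag = build_vlan_tag(vid=10, pcp=5)
print(tag.hex(), parse_vlan_tag(tag))
```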
IPv4
Classless Inter-Domain Routing (CIDR) is a method for allocating IP addresses and routing IP packets.
A CIDR Block: 10.10.1.32/27
IP Range -> 10.10.1.32~10.10.1.63
Host usable IP address range is 10.10.1.33~10.10.1.62
The all-ones host address of each subnet is that subnet’s broadcast address.
The all-zero host address of each subnet is that subnet’s network ID.
Broadcast Address:
172.16.0.0/12, which has the subnet mask 255.240.0.0,
the broadcast address is 172.16.0.0 | 0.15.255.255 = 172.31.255.255
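Both calculations above can be double-checked with Python's standard ipaddress module. A short sketch; broadcast_by_or is a made-up helper name that mirrors the "network OR host mask" arithmetic:

```python
import ipaddress

net = ipaddress.ip_network("10.10.1.32/27")
hosts = list(net.hosts())
print(net.network_address, "~", net.broadcast_address)  # full IP range
print(hosts[0], "~", hosts[-1])                         # usable host range


def broadcast_by_or(cidr: str) -> str:
    """Broadcast address = network address OR inverted subnet mask."""
    network = ipaddress.ip_network(cidr)
    value = int(network.network_address) | int(network.hostmask)
    return str(ipaddress.IPv4Address(value))


print(broadcast_by_or("172.16.0.0/12"))
```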
Multicast Address:
IPv4: 224.0.0.0/4
IPv6: ff00::/8
IPv6
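128 bits; solves the IPv4 address shortage
Commonly split into a 64-bit prefix + 64-bit interface ID
EUI-64: derives the interface ID from the MAC address
Insert the 16 bits FFFE into the middle of the 48-bit MAC -> flip the 7th bit of the first octet from 0 to 1
ex:
MAC: 00:50:56:C1:A0:E8
-> 00:50:56:FF:FE:C1:A0:E8 -> 02:50:56:FF:FE:C1:A0:E8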
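The EUI-64 steps can be written out as a few lines of Python. This is a sketch under assumptions: the function names are invented for illustration, and 2001:db8::/64 is just the documentation prefix standing in for a real 64-bit prefix.

```python
import ipaddress


def eui64_interface_id(mac: str) -> bytes:
    """Build the 64-bit interface ID from a colon-separated 48-bit MAC."""
    octets = bytes(int(part, 16) for part in mac.split(":"))
    if len(octets) != 6:
        raise ValueError(f"not a 48-bit MAC address: {mac!r}")
    eui = bytearray(octets[:3] + b"\xff\xfe" + octets[3:])  # insert FFFE in the middle
    eui[0] ^= 0x02  # flip the 7th bit of the first octet
    return bytes(eui)


def format_octets(data: bytes) -> str:
    return ":".join(f"{b:02X}" for b in data)


def address_from_prefix(prefix: str, mac: str) -> ipaddress.IPv6Address:
    """Combine a /64 prefix with the EUI-64 interface ID."""
    net = ipaddress.ip_network(prefix)
    if net.version != 6 or net.prefixlen != 64:
        raise ValueError("expected an IPv6 /64 prefix")
    interface_id = int.from_bytes(eui64_interface_id(mac), "big")
    return ipaddress.IPv6Address(int(net.network_address) | interface_id)


print(format_octets(eui64_interface_id("00:50:56:C1:A0:E8")))
print(address_from_prefix("2001:db8::/64", "00:50:56:C1:A0:E8"))
```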
Open Systems Interconnection (OSI) 7-layer model
Unicast Packet Forwarding
In IP unicast packet forwarding, the host table is searched first. If there is no matching destination, the switch performs a Longest Prefix Match (LPM) on the unicast routing table to select the destination. If no matching prefix is found, the IP packet is dropped.
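The lookup order described above (exact host match first, then LPM, otherwise drop) can be sketched in Python. The table contents and next-hop labels below are hypothetical, and a real switch does this in hardware rather than with a linear scan.

```python
import ipaddress

# Hypothetical tables, for illustration only.
HOST_TABLE = {"10.0.0.5": "port1"}
ROUTE_TABLE = {"10.0.0.0/8": "nh-A", "10.1.0.0/16": "nh-B", "10.1.2.0/24": "nh-C"}


def forward(dip, host_table=HOST_TABLE, route_table=ROUTE_TABLE):
    """Return the destination for dip, or None when the packet is dropped."""
    if dip in host_table:                 # 1. exact match in the host table
        return host_table[dip]
    addr = ipaddress.ip_address(dip)
    best_len, best_dest = -1, None
    for prefix, dest in route_table.items():  # 2. longest prefix match
        net = ipaddress.ip_network(prefix)
        if addr in net and net.prefixlen > best_len:
            best_len, best_dest = net.prefixlen, dest
    return best_dest                      # 3. None means no prefix matched: drop


for dip in ("10.0.0.5", "10.1.2.3", "10.1.9.9", "192.0.2.1"):
    print(dip, "->", forward(dip))
```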
L2
FDB Table
- MAC learning is the process of obtaining the MAC addresses of all the nodes on a network.
The Forwarding DataBase (FDB) table is used by a Layer 2 device to store the MAC addresses and VLAN IDs that have been learned, together with the port on which each MAC address was learned.
The switch uses the forwarding database to forward packets to the appropriate bridge in the bridge group.
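A toy model of MAC learning and FDB lookup, as a rough sketch. The class name and port names are invented, and flooding to the other VLAN members when the destination is unknown is standard switch behavior that is assumed here rather than stated in the notes above.

```python
class LearningSwitch:
    """Minimal FDB keyed by (VLAN ID, MAC) -> port."""

    def __init__(self, vlan_members):
        self.vlan_members = vlan_members  # {vlan: set of ports}
        self.fdb = {}

    def receive(self, in_port, vlan, smac, dmac):
        """Learn the source MAC, then return the list of egress ports."""
        self.fdb[(vlan, smac)] = in_port          # MAC learning
        out_port = self.fdb.get((vlan, dmac))
        if out_port is None:
            # Unknown destination: flood to every other member of the VLAN.
            return sorted(self.vlan_members[vlan] - {in_port})
        if out_port == in_port:
            return []                             # destination is on the ingress port
        return [out_port]


sw = LearningSwitch({10: {"p1", "p2", "p3"}})
print(sw.receive("p1", 10, "aa", "bb"))  # bb unknown -> flooded
print(sw.receive("p2", 10, "bb", "aa"))  # aa already learned on p1
```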
STP (Spanning Tree Protocol):
Prevents broadcast storms by deciding which ports forward data and which do not
Flow:
1. Elect the root switch: lowest switch priority -> lowest MAC
2. Each non-root switch elects a root port: lowest cost (cost depends on bandwidth and is configurable)
3. Each network segment elects one designated port: lowest switch priority -> lowest MAC
4. Ports that are neither root nor designated -> non-designated port -> blocking
Election rule:
Bridge protocol data unit (BPDU): includes bridge ID, cost, port ID
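The election rules amount to "smallest tuple wins". A minimal Python sketch with made-up switches and costs; the root port helper is simplified and ignores the tie-breakers (sender bridge ID, port ID) that a real implementation applies when costs are equal.

```python
def bridge_id(priority, mac):
    """Bridge ID compared as (priority, MAC); the lower value wins."""
    return (priority, int(mac.replace(":", ""), 16))


def elect_root(bridges):
    """bridges: {name: (priority, mac)} -> name of the root switch."""
    return min(bridges, key=lambda name: bridge_id(*bridges[name]))


def elect_root_port(path_costs):
    """path_costs: {port: total cost to the root} -> port with the lowest cost."""
    return min(path_costs, key=path_costs.get)


# Hypothetical topology values, for illustration only.
bridges = {
    "SW1": (32768, "00:00:00:00:00:0C"),
    "SW2": (32768, "00:00:00:00:00:0A"),
    "SW3": (4096, "00:00:00:00:00:0F"),
}
print(elect_root(bridges))                        # lowest priority wins
print(elect_root_port({"e0/0": 19, "e0/1": 4}))   # lowest cost wins
```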
Port state
- Disabled: port is shut down
- Blocking:
1. does not tx/rx data
2. rx BPDUs, to keep track of the STP state
3. when the switch starts up, all ports are blocking
- Listening:
1. does not tx/rx data
2. tx/rx BPDUs
3. participates in the root switch/root port/designated port election
- Forwarding:
1. tx/rx data
2. tx/rx BPDUs
Topology change: a switch is added or removed
When a switch discovers a topology change -> it sends topology change notification (TCN) BPDUs toward the root switch
- Topology change:
1. listening/blocking/disabled -> forwarding
2. listening/forwarding -> blocking/disabled
- PortFast:
Plugging a PC into the topology, or powering a PC on/off, does not change the STP topology,
but the port connected to the PC still has to pass through: blocking -> listening -> learning -> forwarding
Configuring the port as PortFast skips listening and learning
Rapid Spanning Tree Protocol (RSTP)
1. Solves STP's slow convergence (listening and learning each delay 15 s, 30 s in total):
a switch does not passively wait for BPDUs but actively negotiates with its neighboring switches
2. Backward compatible with STP
- Port role: adds alternate and backup ports
1. Root port
2. Designated port
3. Alternate port:
a non-root/non-designated port that rx a superior BPDU from a DP on another switch -> becomes an alternate port, serving as a backup path to the root switch
4. Backup port:
a non-root/non-designated port that rx a superior BPDU from a DP on the same switch -> becomes a backup port, serving as a backup for that designated port on the segment
- Port state:
- Link type: new in RSTP
Edge port:
1. Inherits PortFast behavior: discarding -> forwarding directly
2. Since it cannot create a loop, it does not generate TCNs
Point-to-point non-edge port: full duplex, communicates using RSTP
Shared non-edge port: half duplex, communicates using STP
- Sync process: a newly added switch sends a proposal to take part in the root switch election
case 1: the new switch does not become the root switch
case 2: the new switch becomes the root switch
- Topology change
ref:
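Used on edge switches (switch C, D): builds a tunnel so that STP/CDP/VTP packets are not affected by the core network, and switches A, B, E behave as if they were on the same LAN.
Encapsulate: when STP enters the core network (switch A -> switch C), the MAC is replaced
Decapsulate: when STP leaves the core network (switch C -> switch B), the MAC is restored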
Port Channel
Two or more physical Ethernet ports -> 1 logical port
Advantages:
1. Higher bandwidth
2. Redundancy: if one link fails, another is still available
Port settings must match: speed, duplex, STP, VLAN…
Mode:
static: shut down the ports first (to avoid loops) -> set static mode on -> no shutdown on the ports
dynamic:
active
passive
Link Aggregation Control Protocol (LACP):
At least one side must be set to active
1. Hot standby: when an active port goes down -> a hot standby port takes over
2. max 16 ports in one channel (8 active + 8 hot standby)
Active/hot standby election: port priority > port ID
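The active/hot standby split can be sketched as a sort followed by two slices. Assumptions in this sketch: a numerically lower port priority (and then lower port ID) is preferred, ports are modeled as (priority, port ID) tuples, and the function names are invented for illustration.

```python
MAX_ACTIVE = 8
MAX_STANDBY = 8


def elect(ports):
    """ports: list of (port_priority, port_id). Returns (active, hot_standby)."""
    ranked = sorted(ports)  # compare priority first, then port ID
    return ranked[:MAX_ACTIVE], ranked[MAX_ACTIVE:MAX_ACTIVE + MAX_STANDBY]


def link_down(active, standby, failed):
    """Remove a failed active port and promote the best hot standby port."""
    active = [p for p in active if p != failed]
    standby = list(standby)
    if standby and len(active) < MAX_ACTIVE:
        active.append(standby.pop(0))
    return sorted(active), standby


# Eleven hypothetical ports: port 11 has a better (lower) priority value.
ports = [(32768, i) for i in range(1, 11)] + [(100, 11)]
active, standby = elect(ports)
print(active)
print(standby)
print(link_down(active, standby, (32768, 3)))
```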
- Load balancing:
IP Multicast
Reduces bandwidth consumption by delivering a single stream to many receivers
https://www.cisco.com/c/en/us/td/docs/ios/solutions_docs/ip_multicast/White_papers/mcst_ovr.html
Traditional TV vs IPTV:
- Traditional TV:
Antenna connection mode:
- direct
- indirect
If users need voice or internet services, they need separate subscriptions from a telco or ISP respectively.
- IPTV:
IPTV offers video, data and audio over a single connection (telephone company, internet service provider).
- Traditional TV vs IPTV:
Multicast VLAN:
Use case: hosts in different VLANs want the same multicast stream
Benefit: reduces the bandwidth used by the mcast stream source
Flow:
- create a multicast VLAN; it becomes the only VLAN that carries the mcast traffic
- enable IGMP snooping
- the switch forwards the mcast traffic from the src intf to the receiver intfs that the hosts are connected to, even though they are not members of the mcast VLAN; each host remains in its own VLAN for bandwidth and security
ex: host A: vlan 10, host B: vlan 20 both want the mcast stream
1. Add vlan 10, 20 to mcast vlan 1100
2. Set the ports connected to host A/B as receiver ports, and the port receiving the mcast stream as the src port
ref:
IGMP:
Also serves as the transport for several related multicast protocols (e.g. DVMRP, PIM).
IGMP is an integral part of IP and must be enabled on all routing devices and hosts that need to receive IP multicast traffic.
L3
Address Resolution Protocol (ARP): finds the MAC address for an IP address
- Proxy ARP:
Flow:
rx ARP -> look up the ARP table -> if found, reply to the ARP
Routing
Static route: configured by the user
Dynamic: routing protocol:
Two routers configured with OSPF, EIGRP… exchange routing information
- Next hop:
ex: pkt DIP 192.168.2.154 -> next hop 192.168.12.2 -> second lookup -> 192.168.12.0/24, leaves via Ethernet0/0
- Route election: Administrative distance (AD) > metric
Only one route is kept for the same network
- Subnet route & longest match
Ex: pkt DIP 192.168.1.33 -> 192.168.1.32/29 -> next hop 192.168.12.12
Static route: ways to set the next hop
- IP
ex: ip route 192.168.101.0 255.255.255.0 192.168.12.2
show ip route
C 192.168.12.0/24, e0/0
S 192.168.101.0 [1/0] via 192.168.12.2
pkt DIP 192.168.101.1 -> next hop 192.168.12.2 -> second lookup -> e0/0
-> e0/0 sends an ARP (asking for the MAC of 192.168.12.2) -> R2 replies -> R1 records it in the ARP table
- Interface
ex: ip route 192.168.101.0 255.255.255.0 Ethernet0/0
C 192.168.12.0/24, e0/0
S 192.168.101.0 [1/0] via e0/0
pkt DIP 192.168.101.1 -> next hop e0/0 -> e0/0 sends an ARP (asking for the MAC of 192.168.101.1)
-> if Proxy ARP is enabled on R2 E0/0 & its routing table has an entry for 192.168.101.1, it replies to the ARP
-> R1 records it in the ARP table,
so multiple DIPs get recorded
ex: pkt DIP 192.168.101.1
pkt DIP 192.168.101.2
- Interface + IP
ex: ip route 192.168.101.0 255.255.255.0 Ethernet0/0 192.168.12.2
Avoids recording multiple DIPs in the ARP table
Policy based routing
Flow:
1. set access-list
2. set route-map
3. put route-map on intf
Administrative Distance(AD):
When a router can reach a dst through more than one routing protocol, the AD decides which routing protocol's route is chosen
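The "AD first, then metric" rule can be sketched as a tuple comparison. The candidate routes and next hops below are hypothetical; the AD values used (static 1, OSPF 110, RIP 120) are the usual Cisco defaults, and other platforms may differ.

```python
def best_route(candidates):
    """Pick one route for a network: lowest AD first, then lowest metric."""
    return min(candidates, key=lambda r: (r["ad"], r["metric"]))


# Hypothetical candidates for the same destination network.
candidates = [
    {"proto": "rip", "ad": 120, "metric": 2, "next_hop": "192.168.12.3"},
    {"proto": "ospf", "ad": 110, "metric": 20, "next_hop": "192.168.12.2"},
    {"proto": "ospf", "ad": 110, "metric": 30, "next_hop": "192.168.12.4"},
]
print(best_route(candidates))

# A static route has a lower AD, so it wins regardless of metric.
static = {"proto": "static", "ad": 1, "metric": 0, "next_hop": "192.168.12.9"}
print(best_route(candidates + [static]))
```

Note that the RIP route loses even though its metric is the smallest, because metrics are only compared between routes with the same AD.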
Routing Table Object
The unicast routing table consists of routing entries managed by Routing Table Object (RTO), also known as Forwarding Information Base (FIB), which is used for IP packet routing.
The RTO records the following information:
1. route prefix / mask length
2. next-hop
3. cost
4. weight
5. route-type
RTO will select the best routing path for each routing prefix and update it into the LPM table.
Routing Table Learning
The routing table can learn routing entries from static configuration, from dynamic routing protocols, and from the addresses assigned to local IP interfaces.
Routing Flow
1. Look up the LPM table by DIP address to get the next hop's IP.
2. Look up the ARP table by the next hop's IP to find its MAC address.
3. Look up the FDB table by MAC to get the egress port and VLAN.
4. Look up the VLAN table by (egress port + VLAN) to decide the VLAN tag.
5. Change SMAC to the outgoing IP interface's MAC.
6. Change DMAC to the next hop's MAC.
7. Change VID based on FDB and VLAN information.
8. Send the packet to the egress port based on FDB information.
Packet Routing Example:
- Environment:
Routing Table in DUT-A:
10.1.1.0/24, local route
20.2.2.0/24, local route
30.3.3.0/24, NH=20.2.2.2
ARP Table in DUT-A: 20.2.2.2, 00-00-00-00-00-04
FDB Table in DUT-A: 00-00-00-00-00-01, vlan10, eth1/0/1
00-00-00-00-00-02, vlan10, L3 lookup
00-00-00-00-00-03, vlan20, L3 lookup
00-00-00-00-00-04, vlan20, eth1/0/2
VLAN Table in DUT-A:
VLAN10, untag member: eth1/0/1
VLAN20, tag member: eth1/0/2
----------------------------------
Host X sends an IP packet to Host Y,
SMAC: 00-00-00-00-00-01, DMAC: 00-00-00-00-00-02, Untag
SIP: 10.1.1.99 , DIP: 30.3.3.99
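The example can be traced through the routing flow steps with a short Python sketch that uses the DUT-A tables listed above. Two things are assumptions of this sketch, not statements from the example: the FDB entries marked "L3 lookup" are treated as DUT-A's own IP interface MAC for that VLAN, and a missing ARP entry simply returns None instead of triggering an ARP request.

```python
import ipaddress

ROUTES = {"10.1.1.0/24": None, "20.2.2.0/24": None, "30.3.3.0/24": "20.2.2.2"}  # None = local route
ARP = {"20.2.2.2": "00-00-00-00-00-04"}
FDB = {  # MAC -> (VLAN, port); "L3" stands for the "L3 lookup" entries
    "00-00-00-00-00-01": (10, "eth1/0/1"),
    "00-00-00-00-00-02": (10, "L3"),
    "00-00-00-00-00-03": (20, "L3"),
    "00-00-00-00-00-04": (20, "eth1/0/2"),
}
VLAN_TAGGED = {(10, "eth1/0/1"): False, (20, "eth1/0/2"): True}


def interface_mac(vlan):
    # Assumption: the "L3 lookup" entry of a VLAN is DUT-A's interface MAC.
    for mac, (vid, port) in FDB.items():
        if vid == vlan and port == "L3":
            return mac
    raise LookupError(f"no IP interface in VLAN {vlan}")


def route(dip):
    addr = ipaddress.ip_address(dip)
    matches = [n for n in map(ipaddress.ip_network, ROUTES) if addr in n]
    if not matches:
        return None                                    # no prefix matched: drop
    best = max(matches, key=lambda n: n.prefixlen)     # step 1: LPM
    next_hop = ROUTES[str(best)] or dip                # local route: DIP is the next hop
    dmac = ARP.get(next_hop)                           # step 2: ARP
    if dmac is None:
        return None                                    # unresolved next hop
    vlan, port = FDB[dmac]                             # step 3: FDB
    tagged = VLAN_TAGGED[(vlan, port)]                 # step 4: VLAN table
    return {"smac": interface_mac(vlan), "dmac": dmac,  # steps 5 and 6
            "vid": vlan, "tagged": tagged, "port": port}  # steps 7 and 8


print(route("30.3.3.99"))
```

Under these assumptions the packet to 30.3.3.99 leaves eth1/0/2 tagged with VLAN 20, with the MAC addresses rewritten for the next hop 20.2.2.2.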
Protocol Independent Multicast (PIM):
Uses reverse path forwarding (RPF) checking: looks up the src in the routing table -> builds the shortest path tree (SPT)
PIM-DM:
Floods traffic; routers that do not want it prune themselves
Re-floods traffic every 3 min
Assert:
On the same segment, for the same mcast grp and src, only the assert winner intf can forward traffic
election: administrative distance > metric > interface IP address
PIM-SM
Traffic is not flooded; receivers that want it join
Adds a Rendezvous Point (RP)
Flow:
1. R1 rx mcast 224.1.1.1 -> registers with the RP
2. PC1 joins 224.1.1.1 on R2
mcast 224.1.1.1 path: R5 -> R1 -> R3 -> R4 -> R2
The SPT then optimizes the path: R5 -> R1 -> R2
- Designated Router: only the Designated Router can communicate with the RP
Bootstrap Router Protocol:
BSR: similar to a mapping agent (distributes the RP information to everyone)
Candidate RP:
With only one RP, traffic stops being forwarded if the RP fails,
so configure multiple candidate RPs; when the active RP fails, a backup RP takes over
RP election: priority > hash value
Number of load-balanced RPs = 2^(32 - hash mask)
ex: 2 RPs -> hash mask = 31
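The hash mask arithmetic above is easy to check in Python. A small sketch; both function names are made up, and hash_mask_for simply inverts the formula by rounding the RP count up to the next power of two.

```python
def rp_count(hash_mask_len: int) -> int:
    """Number of RPs the groups can be spread over: 2^(32 - hash mask)."""
    if not 0 <= hash_mask_len <= 32:
        raise ValueError("hash mask length must be 0..32")
    return 2 ** (32 - hash_mask_len)


def hash_mask_for(wanted_rps: int) -> int:
    """Longest hash mask that still spreads groups over at least wanted_rps RPs."""
    if wanted_rps < 1:
        raise ValueError("need at least one RP")
    return 32 - (wanted_rps - 1).bit_length()


print(rp_count(31))      # hash mask 31
print(hash_mask_for(2))  # two RPs
```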
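Bidirectional PIM:
A node is both mcast src and mcast dst at the same time.
With the RP as the starting point, the SPT is built to the src and dst
ex: conferencing system: speaking (src), receiving audio/video (dst)
Multicast Boundary:
Restricts mcast traffic from being forwarded to another domain
Applied with an access list on the port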
Routing Information Protocol(RIP):
A distance-vector routing protocol,
- Passive Interface:
R4 does not run RIP -> e0/1 on R3 can be set as a passive interface
Auth:
MD5
Text
Metric: computed from hop count only; maximum 15, 16 or more is unreachable
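How the hop-count metric and the 16 = unreachable limit interact can be shown with a simplified distance-vector update. This is a sketch only: it leaves out timers, split horizon and triggered updates, and the router names and prefixes are hypothetical.

```python
INFINITY = 16  # a metric of 16 means unreachable


def process_update(table, neighbor, advertised):
    """Merge a neighbor's advertisement into table: {prefix: (metric, next_hop)}."""
    for prefix, metric in advertised.items():
        new_metric = min(metric + 1, INFINITY)  # one more hop to go through the neighbor
        current = table.get(prefix)
        if current is None:
            if new_metric < INFINITY:           # never install an unreachable route
                table[prefix] = (new_metric, neighbor)
        elif current[1] == neighbor or new_metric < current[0]:
            # Always believe the current next hop; otherwise only accept a better metric.
            table[prefix] = (new_metric, neighbor)
    return table


table = {}
process_update(table, "R2", {"192.168.1.0/24": 1, "192.168.9.0/24": 15})
print(table)  # the 15-hop route becomes 16 and is not installed
process_update(table, "R2", {"192.168.1.0/24": 16})
print(table)  # R2 withdrew the route, so it is now marked unreachable
```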
To be continued…
ref: to be organized