

# What You Make Possible











### Exploring the Engineering Behind the Making of a Switch BRKARC-3466












### Agenda

- Overview
- Concept
- System Design
- Mechanical / Physical Design
- Buffer Design
- Forwarding Design
- ASIC Engineering
- Hardware Engineering
- Software Engineering



## Overview











### Timeline



### **Nexus 7000 and F2/F2E Modules**







# Concept









### Concept

What customer problem will the product solve?

- Vision
- Market
- Cost
- Time to Market
- Differentiation
- Innovation
- Technology
- Life Cycle
- How Big?
- How many ports?
- Fixed vs Modular
- Backward Compatibility



### Nexus 7000 Vision (circa 2007)

Cisco's end-to-end Data Centre switching platform, providing solutions for 10G, 40G, and 100G for Access, Aggregation, and Core.

Consolidate IP, Storage, and IPC networks onto a single Ethernet fabric and deliver innovative features and services that provide value to our customers.





### **DC Evolutionary Innovation**

- **FCS** (DC3): 80G slot
- **2009 Phase 1**: 1/4 terabit slot
- **2011 Phase 2**: 1/2 terabit slot
- **2013 Phase 3**

Target roles shown across the phases:

- 10G Aggregation
- 10GbE Access, 10GbE Aggregation, Unified Fabric
- 10G Access, 40 / 100G Aggregation, Unified Fabric

© 2013 Cisco and/or its affiliates. All rights reserved.





#### **F2 Series High Level Goals**

- 48 ports of 1/10G at line rate with 64-byte packets
- Low Latency
- L2MP, TRILL, FEX, FCoE, L3 Forwarding
- Optimise for Data Centre
- IPv4 & IPv6 **Equal Performance**
- Cost target







# System Design









#### Many Factors to Weigh: Applicable to Any Switch / Router Design

- Standards requirements
- Market requirements
- Designability
- Silicon technology
- Processor technology

- Manufacturability
- Time to market
- Flexibility
- Budget
- Modular / Fixed



#### Many Factors to Weigh: Baseline Data Centre Switch Requirements

### **Data Plane**

- Buffering
- No packet drop
- Throughput
- Port count
- Modular
- No single point of failure
- In-order delivery
- Future protocol compatibility

### **Control Plane**

- Modular
- Restartable (including active-active state handling)
- Non-disruptive code load & activation
- No single point of failure
- Scalable
- Unit testable
- Future protocol compatibility



## Mechanical / Physical Design





### **Mechanical Design**



- Nexus 7010 Front
- Nexus 7010 Rear
- Power Supply: N7K-AC-6.0kW
- Fabric: N7K-C7010-FAB-1
- 48 x 1G BaseT: N7K-M148GT-11
- Supervisor: N7K-SUP1



Cisco Public

### Industrial Design / Usability
















# Buffer Design









### Memory Technology

- DRAM: Dynamic Random Access Memory
- SDRAM: Synchronous Dynamic Random Access Memory
- eDRAM: embedded Dynamic Random Access Memory
- SRAM: Static Random Access Memory
- SSRAM: Synchronous Static Random Access Memory
- DDR: Double Data Rate, transfers on the rising and falling edges of the clock
- QDR: Quad Data Rate, transfers on the rising and falling edges and at 2 intermediate points between them

#### DDR3: latency ~10 ns, 1 transistor + 1 capacitor per bit, requires refresh

#### SRAM: latency 1 cycle, 6 transistors per bit



### **Packet Buffer Design**



#### Fixed

- Memory segmented into a fixed cell size, such as 128, 384, or 512 bytes
- If a packet is smaller than the cell size, the leftover space in the cell is unused
- Easier to map to different memory banks

#### Packed

- Packets are placed head to tail into the packet memory
- More efficient utilisation of the memory
- More complex free space management and bank mapping
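The trade-off between the two layouts can be put into numbers. The following is a minimal sketch, not a description of any real buffer manager: the packet lengths are made up for illustration, and the 128-byte cell is simply one of the example sizes listed above.

```python
def fixed_cell_usage(packet_len, cell_size):
    """Return (cells used, bytes wasted) for one packet in a fixed-cell buffer."""
    cells = (packet_len + cell_size - 1) // cell_size  # round up to whole cells
    return cells, cells * cell_size - packet_len


def packed_usage(packet_lens):
    """Bytes consumed when packets are stored head to tail with no padding."""
    return sum(packet_lens)


packets = [64, 129, 1500]  # illustrative packet lengths in bytes (assumption)
cell_size = 128            # one of the example cell sizes

fixed_bytes = sum(fixed_cell_usage(p, cell_size)[0] * cell_size for p in packets)
print("fixed cells:", fixed_bytes, "bytes; packed:", packed_usage(packets), "bytes")
```

A 129-byte packet is the worst case here: it spills one byte into a second cell and leaves 127 bytes unused, which is the cost paid for simpler bank mapping.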



### One Plus One Does Not Equal Two: Flow Balancing

- 10 links @ 1 Gbps each: bandwidth = 10 Gbps, flow bandwidth = 1 Gbps, serialisation delay = 20 µs
- 1 link @ 10 Gbps: bandwidth = 10 Gbps, flow bandwidth = 10 Gbps, serialisation delay = 2 µs
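The serialisation delays quoted above follow directly from link speed. The sketch below reproduces them; the 2,500-byte transfer size is an inference from the 20 µs and 2 µs figures, not something stated on the slide, and the function names are illustrative.

```python
def serialisation_delay_us(size_bytes, link_gbps):
    """Time to clock size_bytes onto one link, in microseconds."""
    bits = size_bytes * 8
    bits_per_us = link_gbps * 1000  # 1 Gbps moves 1000 bits per microsecond
    return bits / bits_per_us


def flow_bandwidth_gbps(link_gbps):
    """With flow balancing, a single flow stays on one link, so it is capped
    at the speed of that link regardless of how many links are bundled."""
    return link_gbps


size = 2500  # bytes; assumed, chosen to match the slide's figures
print(serialisation_delay_us(size, 1), "us on a 1 Gbps member link")
print(serialisation_delay_us(size, 10), "us on a single 10 Gbps link")
```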

### **One Plus One Equals Two: Word Spraying**

Creates a link of N x link speed, with a small increase in latency to absorb the difference in delay between physical lanes. Example: 40/100G Ethernet, 180 ns maximum skew between lanes.
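Word spraying can be illustrated as a round-robin distribution of words across lanes, followed by reassembly in the same order. This is a simplified sketch under the assumption that lanes are already aligned; real hardware has to deskew the lanes first, which is where the added latency comes from. The function names are hypothetical.

```python
def spray(words, lanes):
    """Distribute words round-robin: word i goes to lane i % lanes."""
    return [words[i::lanes] for i in range(lanes)]


def reassemble(lane_data):
    """Read one word from each lane in turn to restore the original order."""
    out = []
    depth = max(len(lane) for lane in lane_data)
    for i in range(depth):
        for lane in lane_data:
            if i < len(lane):
                out.append(lane[i])
    return out


words = list(range(10))
lanes = spray(words, 4)
print(lanes)
print(reassemble(lanes) == words)
```

Because every flow uses all lanes, a single flow can reach the full aggregate bandwidth, unlike the flow-balanced case.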



### **Scrambling / Encoding**

|                      | SerDes rate<br>(Gbps) | Encoding  | Bandwidth<br>after encoding (Gbps) |
|----------------------|-----------------------|-----------|------------------------------------|
| PCI Express v1       | 2.5                   | 8b/10b    | 2                                  |
| PCI Express v2       | 5                     | 8b/10b    | 4                                  |
| PCI Express v3       | 8                     | 128b/130b | 7.877                              |
| 10G Ethernet XAUI    | 4 x 3.125             | 8b/10b    | 10                                 |
| 10G Ethernet XFI     | 10.3125               | 64b/66b   | 10                                 |
| 40G Ethernet 4x XFI  | 4 x 10.3125           | 64b/66b   | 40                                 |
| 100G Ethernet        | 10 x 10.3125          | 64b/66b   | 100                                |
| 100G Ethernet 4x 25G | 4 x 25.78125          | 64b/66b   | 100                                |
|                      |                       |           |                                    |
| 2G FC                | 2.125                 | 8b/10b    | 1.7                                |
| 4G FC                | 4.250                 | 8b/10b    | 3.4                                |
| 8G FC                | 8.5                   | 8b/10b    | 6.8                                |
| 16G FC               | 14.025                | 64b/66b   | 13.6                               |
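The last column of the table is the SerDes rate multiplied by the coding efficiency (data bits over coded bits). A small sketch of that arithmetic follows, using exact fractions to avoid rounding noise; the function name is illustrative.

```python
from fractions import Fraction


def payload_rate_gbps(lanes, serdes_gbps, data_bits, coded_bits):
    """Usable bandwidth after line coding: lanes x SerDes rate x efficiency.

    serdes_gbps is passed as a decimal string so the result stays exact.
    """
    return lanes * Fraction(serdes_gbps) * Fraction(data_bits, coded_bits)


print(float(payload_rate_gbps(1, "10.3125", 64, 66)))   # 10G Ethernet XFI
print(float(payload_rate_gbps(4, "25.78125", 64, 66)))  # 100G Ethernet 4x 25G
print(float(payload_rate_gbps(1, "8.5", 8, 10)))        # 8G FC
print(float(payload_rate_gbps(1, "14.025", 64, 66)))    # 16G FC
```

The move from 8b/10b (80% efficient) to 64b/66b (about 97% efficient) is why 16G FC needs far less than twice the SerDes rate of 8G FC.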




### Single ASIC

- Scalability limited by memory bandwidth/size
- Typically optimised for fixed configuration
- Cost effective with small port counts
- Often used as building block





### **Switch Architecture**



- Mesh
- Crossbar
- Clos / Fat Tree

### **Complete System – Pull Fabric**





# Forwarding Design









### **High Level View of Forwarding**







#### Forwarding Decision



### **10G Ethernet Forwarding Rate**

#### 1x10G Ethernet Forwarding Rate





- 10G Ethernet = 14.88 Mpps @ 64 bytes: 67.2 ns to receive a packet
- 100G Ethernet = 148.8 Mpps @ 64 bytes: 6.72 ns to receive a packet
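These packet rates can be derived from the frame size plus the fixed per-frame overhead on the wire. The sketch below assumes standard Ethernet framing (8 bytes of preamble and start-of-frame delimiter, 12 bytes of inter-frame gap); the function names are illustrative.

```python
PREAMBLE_SFD_BYTES = 8      # 7-byte preamble + 1-byte start-of-frame delimiter
INTER_FRAME_GAP_BYTES = 12  # minimum idle time between frames, in byte times


def wire_bits(frame_bytes=64):
    """Bits occupied on the wire by one frame, including overhead."""
    return (frame_bytes + PREAMBLE_SFD_BYTES + INTER_FRAME_GAP_BYTES) * 8


def packets_per_second(link_bps, frame_bytes=64):
    return link_bps / wire_bits(frame_bytes)


def ns_per_packet(link_bps, frame_bytes=64):
    """Time budget available to forward each packet at line rate."""
    return wire_bits(frame_bytes) * 1e9 / link_bps


for name, bps in (("10G", 10**10), ("100G", 10**11)):
    print(name, round(packets_per_second(bps) / 1e6, 2), "Mpps,",
          round(ns_per_packet(bps), 2), "ns per packet")
```

The per-packet time is the budget within which every lookup in the forwarding pipeline has to complete, or be pipelined, to sustain line rate.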









#### **Content Addressable** Memory





#### Storing 1 bit in TCAM takes 10-12 transistors

#### **Ternary Content Addressable Memory**

- Lookup keys (Lkup #1 to #3) shown: .001000, .001110
- TCAM entries, in order: .001010, .0010XX, .001XX0, .001XXX (X matches either bit value)
- Each lookup hits one entry (Hit #1 to #3) and returns the associated result (Result #1 to #3)
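Ternary matching can be modelled in a few lines. This is a software sketch only: a real TCAM compares the key against every entry in parallel and a priority encoder picks the first hit, whereas the loop below checks entries one by one. The entry values are taken from the figure; the function names are hypothetical.

```python
def ternary_match(key, pattern):
    """True if every pattern bit equals the key bit or is a don't-care 'X'."""
    return len(key) == len(pattern) and all(
        p in ("X", k) for k, p in zip(key, pattern)
    )


def tcam_lookup(key, entries):
    """Return the index of the first (highest-priority) matching entry."""
    for index, pattern in enumerate(entries):
        if ternary_match(key, pattern):
            return index
    return None


entries = ["001010", "0010XX", "001XX0", "001XXX"]
for key in ("001000", "001110", "001010"):
    print(key, "-> entry", tcam_lookup(key, entries))
```

The returned index would normally address a result memory holding the action or adjacency for that entry. Entry order matters: more specific patterns are placed before more general ones.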

### **Hash Tables**

Input MAC Address 0000.c000.0001



**Mathematical function** produces a value between 0 and the page size



#### Compare if value in each page matches input value
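A paged hash table of this kind can be sketched as follows. The hash function (CRC-32 from `zlib`), the page size, and the number of pages are all assumptions chosen for illustration; the slide does not say which function or dimensions the hardware uses.

```python
import zlib

PAGE_SIZE = 1024  # rows per page (assumed)
NUM_PAGES = 4     # pages searched per lookup (assumed)


def row_index(mac):
    """Hash the MAC address to a row number in the range 0..PAGE_SIZE-1."""
    return zlib.crc32(mac.encode("ascii")) % PAGE_SIZE


class MacTable:
    def __init__(self):
        self.pages = [[None] * PAGE_SIZE for _ in range(NUM_PAGES)]

    def insert(self, mac, port):
        """Store mac -> port; returns False if the row is full in every page."""
        row = row_index(mac)
        free_page = None
        for page in self.pages:
            entry = page[row]
            if entry is not None and entry[0] == mac:
                page[row] = (mac, port)  # refresh an existing entry
                return True
            if entry is None and free_page is None:
                free_page = page
        if free_page is None:
            return False
        free_page[row] = (mac, port)
        return True

    def lookup(self, mac):
        """Compare the stored key at the hashed row of each page."""
        row = row_index(mac)
        for page in self.pages:
            entry = page[row]
            if entry is not None and entry[0] == mac:
                return entry[1]
        return None


table = MacTable()
table.insert("0000.c000.0001", "Eth1/1")
print(table.lookup("0000.c000.0001"))
```

Multiple pages give colliding addresses several places to live, so the table can run at a high fill level before an insert fails.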







### **Tries**

- Many different types of tries
  - Bitwise Trie
  - Balanced Trie
  - Patricia Trie
  - Fixed or Variable Stride Tries
- Store information in each leaf or pointer to table with information in it
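As an illustration of the simplest variant, the sketch below implements a bitwise trie with longest-prefix match, storing the information directly in the node where a prefix ends. The class and method names are hypothetical; production designs typically use multi-bit strides to cut the number of memory accesses.

```python
class TrieNode:
    def __init__(self):
        self.children = {}  # '0' or '1' -> TrieNode
        self.value = None   # information stored where a prefix ends


class BitwiseTrie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, prefix_bits, value):
        """Walk or create one node per prefix bit, then store the value."""
        node = self.root
        for bit in prefix_bits:
            node = node.children.setdefault(bit, TrieNode())
        node.value = value

    def longest_prefix_match(self, key_bits):
        """Follow the key bit by bit, remembering the deepest value seen."""
        node, best = self.root, self.root.value
        for bit in key_bits:
            node = node.children.get(bit)
            if node is None:
                break
            if node.value is not None:
                best = node.value
        return best


trie = BitwiseTrie()
trie.insert("001", "route B")
trie.insert("0010", "route A")
print(trie.longest_prefix_match("001000"))
print(trie.longest_prefix_match("001110"))
```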





### **Algorithmic TCAMs**

- From a software point of view looks like TCAM
  - May be all algorithmic or combination of TCAM and algorithmic
  - Software Driver takes TCAM representation and compiles the table to optimally utilise the underlying device
- Why? Algorithmic approaches allow the tables to scale with less than linear power increase

#### **High Level View** Algorithmic TCAM



#### Implementation Algorithmic TCAM





### L3 Table: Design 1



#### **Rewrite Information**

- ADJ 1 Rewrite SRC A+DST A MAC
- ADJ 2 Rewrite SRC A+DST B MAC
- ADJ 3 Rewrite SRC A+DST C MAC
- ADJ 4 Rewrite SRC A+DST D MAC
- ADJ 5 Rewrite SRC A+DST E MAC
- ADJ 6 Rewrite SRC A+DST F MAC
- ADJ 7 Rewrite SRC A+DST G MAC
- ADJ 8 Rewrite SRC A+DST H MAC
- ADJ 9 Rewrite SRC A+DST I MAC
- ADJ 10 Rewrite SRC A+DST J MAC



### L3 Table: Design 2



#### **Rewrite Information**

- ADJ 1 Rewrite SRC A+DST A MAC
- ADJ 2 Rewrite SRC A+DST B MAC
- ADJ 3 Rewrite SRC A+DST C MAC
- ADJ 4 Rewrite SRC A+DST D MAC
- ADJ 5 Rewrite SRC A+DST E MAC
- ADJ 6 Rewrite SRC A+DST F MAC
- ADJ 7 Rewrite SRC A+DST G MAC
- ADJ 8 Rewrite SRC A+DST H MAC
- ADJ 9 Rewrite SRC A+DST I MAC
- ADJ 10 Rewrite SRC A+DST J MAC



### L2 Table / Host Table / FIB **Common Optimisation**

- Hash tables take less space than TCAMs and Tries
- Instead of placing /32 or /128 entries for host entries into the FIB, place them into the hash table
- Common for the L2 table and the Host table to share the same memory
- Allows for the FIB Table to be smaller since it does not need to contain single path /32 and /128 entries
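The split between an exact-match host table and a prefix FIB can be sketched like this. It is a software model under stated assumptions: a Python dict stands in for the hash table, a linear scan stands in for the TCAM or trie, every /32 and /128 is treated as single path, and the class name is hypothetical.

```python
import ipaddress


class ForwardingTables:
    def __init__(self):
        self.host_table = {}  # exact match: /32 and /128 host entries
        self.fib = []         # (network, next hop) for all shorter prefixes

    def add_route(self, prefix, next_hop):
        net = ipaddress.ip_network(prefix)
        if net.prefixlen == net.max_prefixlen:
            # Host route: goes to the hash table, not the FIB
            self.host_table[net.network_address] = next_hop
        else:
            self.fib.append((net, next_hop))

    def lookup(self, address):
        addr = ipaddress.ip_address(address)
        if addr in self.host_table:
            return self.host_table[addr]
        matches = [
            (net.prefixlen, next_hop)
            for net, next_hop in self.fib
            if net.version == addr.version and addr in net
        ]
        if not matches:
            return None
        return max(matches, key=lambda m: m[0])[1]  # longest prefix wins


tables = ForwardingTables()
tables.add_route("10.0.0.0/8", "ADJ 1")
tables.add_route("10.1.0.0/16", "ADJ 2")
tables.add_route("10.1.1.1/32", "ADJ 3")
print(tables.lookup("10.1.1.1"), tables.lookup("10.1.2.3"), len(tables.fib))
```

Only two of the three routes occupy FIB space in this example, which is the saving the optimisation is after.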



## **Forwarding Design**
















### References

- Network Algorithmics: An Interdisciplinary Approach to Designing Fast Networked Devices, George Varghese
- The Art of Computer Programming, Vol 1-4, Donald E. Knuth
- Introduction to Algorithms, Third Edition, Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest and Clifford Stein
- IEEE SIGCOMM Papers



# ASIC Engineering









## **ASICs vs FPGAs**

### ASIC: Application Specific Integrated Circuit

- A finished IC which is built to the exact specification & functionality of the customer
- Can make optimal use of the underlying silicon circuits
- Low part cost, high upfront investment
- Significant development time

### FPGA (EPLD): Field Programmable Gate Array

- An IC that can be configured with the required functionality <u>after</u> it is installed into a target system
- Flexibility vs. sub-optimal use of underlying silicon circuits
- Higher part cost
- Shorter development time
- Main players: Xilinx, Altera



### CMOS












module nand2(input a, input b, output c); assign c = ~(a & b); endmodule







## Why is Die Size Important?



Silicon Wafer

### With same number of defects per wafer, smaller die size results in higher yield per wafer
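A rough calculation shows the effect. The sketch below rests on deliberately crude assumptions that are not from the slide: a 300 mm wafer, every defect ruining exactly one distinct die, and no loss of partial dies at the wafer edge. The die dimensions and defect count are illustrative only.

```python
import math


def good_dies(wafer_diameter_mm, die_w_mm, die_h_mm, defects):
    """Return (total dies, good dies, yield) under the simple model above."""
    wafer_area = math.pi * (wafer_diameter_mm / 2) ** 2
    total = int(wafer_area // (die_w_mm * die_h_mm))
    good = max(total - defects, 0)  # each defect is assumed to kill one die
    return total, good, good / total


defects = 20  # same number of defects on both wafers (assumed)
for w, h in ((18, 18), (12, 12)):
    total, good, yield_ = good_dies(300, w, h, defects)
    print(f"{w}x{h} mm die: {total} dies, {good} good, yield {yield_:.1%}")
```

The smaller die wins twice: more candidates fit on the wafer, and each defect destroys a smaller share of them.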





## **Integrated Circuit Production**






### Packaging

- Cut wafer into dies
- Dies into IC packages



## **Integrated Circuit Production**





Silicon Foundry

## **ASIC Design Process**



## F2/F2E ASIC - Clipper



|                | F2          | F2E           |
|----------------|-------------|---------------|
| Technology     | IBM Cu-65   | IBM Cu-45     |
| Die Size       | 18.0x18.3mm | 12.28x12.28mm |
| Total SRAM     | 33.3Mb      | 33.3Mb        |
| Total eDRAM    | 134Mb       | 134Mb         |
| Total TCAM     | 2.94Mb      | 2.94Mb        |
| Register Array | 1.34Mb      | 1.34Mb        |
| Logic Gates    | 45M         | 45M           |
| Signal Pin     | 186         | 186           |
| Package IO     | 840         | 840           |





## **Memory and Packet Corruption** Protection

- Without ECC or parity, there is no way to determine whether a fault is a software or a hardware problem
- Parity detects single-bit errors
- ECC detects 2-bit errors and corrects single-bit errors
- Parity and ECC apply to a word (32 or 64 bits)
- CRC detects whether a set of bytes (normally a packet) has been corrupted
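The difference between per-word parity and a per-packet CRC can be demonstrated briefly. This sketch uses even parity over a 32-bit word and the CRC-32 from Python's `zlib`; the word value and packet contents are arbitrary, and ECC is not modelled here.

```python
import zlib


def parity_bit(word):
    """Even parity: 1 if the word has an odd number of set bits."""
    return bin(word).count("1") % 2


word = 0x12345678
stored_parity = parity_bit(word)

single_flip = word ^ 0b01  # one bit corrupted
double_flip = word ^ 0b11  # two bits corrupted

print("single-bit error detected:", parity_bit(single_flip) != stored_parity)
print("double-bit error detected:", parity_bit(double_flip) != stored_parity)

packet = bytes(range(64))  # arbitrary 64-byte payload
corrupted = bytes([packet[0] ^ 0x01]) + packet[1:]
print("CRC mismatch on corrupted packet:",
      zlib.crc32(packet) != zlib.crc32(corrupted))
```

Parity misses the double-bit flip because the count of set bits keeps the same parity, which is the gap that ECC closes.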





## **ASIC** Packaging

- Electrical parasitics of the chip package are critical
- Impacts electrical properties of high speed signals
- Manufacturing tolerances constrain minimum ball pitch
- Limit to number of available signal I/O pins





### References

Indistinguishable From Magic: Manufacturing Modern Computer Chips

http://www.youtube.com/watch?v=NGFhc8R\_uO4



# Hardware Engineering







### F2 Block Diagram



## **Thermal Modeling**

### **Component Case Temperatures**

Figure: thermal model of the board annotated with component case temperatures, ranging from roughly 68 °C to 109 °C.



### **Temperature Contours**

## **Electrical / Mechanical Layout**







### 20 Layers



### **EDVT** Electronic Design Validation Test

- All tests performed using offline diagnostics and again with NXOS
- On-board power supplies have voltages margined to +5% & -5%
- Temperature testing occurs while
  - Soaking for 12 hours at 55 °C and -5 °C
  - Ramping between extremes at 1 °C per minute
- Power cycle testing occurs during 12-hour soak



### RDT **Reliability Demonstration Test**

- The Reliability Demonstration Test (RDT) is Cisco's approach to verifying the stated reliability of a product prior to production release.
- The reliability to be demonstrated is the product's MTBF (Mean Time Between Failure).
- RDT replicates the end user operating environment and application through accelerated test time. It is expected that all hardware features are exercised in RDT.
- All new products including systems and boards are subject to RDT.






## **Power Consumption**



- Data Sheet
  - Typical 340W
  - Maximum 450W



## **Generic Online Diagnostics**

Generic Online Diagnostics provide a diagnostic framework for detecting hardware faults and verifying the health of hardware components throughout the chassis.

Diagnostics run during system Boot-Up, after OIR, On-Demand using the CLI, or as Health Checks in the background.

### **Problem Areas:**

- Hardware Components (ASICs)
- Interfaces (Ethernet, SFP+, etc...)
- Connectors (loose connectors, bent pins, etc...)
- Memory Failure (Failure over time)
- Solder Joints



# Software Engineering









## **NXOS Architecture**





### **Multi-threaded**

- Scalability with SMP and multi-core CPUs
- Faster route re-convergence
- Lower mean-time-to-recovery

### **Modularity**

- Most of the features are conditional
- Can be enabled/disabled independently
- Maximises efficiency
- Minimises resource utilisation

### **Separation of Control Plane and Data Plane**

- No "software forwarding feature"
- Fully distributed hardware forwarding

### Line Card Offloading

- Offload to line card CPUs
- Scales with # of line cards

### **Real-Time**

- Real-time preemptive scheduling
- System operational when CPU is 100%



- Optimal hardware programming

## **Software Engineering**

- SW Functional Spec
- SW Design Spec
- Unit Test Plan
- Unit Integration Plan

## **Design Review**




## **Development Test**



### Regression

- Testing of completed integrated feature
- Test for interactions with other features
- Test for interoperability with Cisco and
- Build scripts to automate testing so is

## **First Customer Ship**

**Product Requirements Document** 



ASIC

# Q & A









## **Complete Your Online Session Evaluation**

### Give us your feedback and receive a Cisco Live 2013 Polo Shirt!

Complete your Overall Event Survey and 5 Session Evaluations.

- Directly from your mobile device on the **Cisco Live Mobile App**
- By visiting the Cisco Live Mobile Site www.ciscoliveaustralia.com/mobile
- Visit any Cisco Live Internet Station located throughout the venue

Polo Shirts can be collected in the World of Solutions on Friday 8 March 12:00pm-2:00pm





Don't forget to activate your Cisco Live 365 account for access to all session material, communities, and on-demand and live activities throughout the year. Log into your Cisco Live portal and click the "Enter Cisco Live 365" button. www.ciscoliveaustralia.com/portal/login.ww




