Mellanox Technologies
=====================

===============================================================================
Ethernet over IB (EoIB) for Linux README
August 2010
Document No. 3289
===============================================================================

Contents:
=========
1. Overview
    1.1 General
    1.2 EoIB Topology
        1.2.1 External ports (eports) and GW
        1.2.2 Virtual Hubs (vHubs)
        1.2.3 Virtual NIC (vNic)
2. EoIB Configuration
    2.1 EoIB Host Administered vNic
        2.1.1 Central Configuration File - mlx4_vnic.conf
        2.1.2 vNic Specific Configuration Files - ifcfg-ethX
        2.1.3 mlx4_vnic_confd
    2.2 EoIB Network Administered vNic
    2.3 VLAN Configuration
    2.4 EoIB Multicast Configuration
    2.5 EoIB and QoS
    2.6 IP Configuration Based on DHCP
        2.6.1 DHCP Server
    2.7 Static EoIB Configuration
    2.8 Sub Interfaces (VLAN)
3. Retrieving EoIB Information
    3.1 mlx4_vnic_info
    3.2 ethtool
    3.3 Link State
    3.4 Bonding Driver
    3.5 Jumbo Frames
4. Advanced EoIB settings:
    4.1 Module Parameters
    4.2 vNic Interface Naming

1 Overview
==========

1.1 General
-----------
The Ethernet over IB (EoIB) mlx4_vnic module is a network interface
implementation over InfiniBand. EoIB encapsulates Layer 2 datagrams over an
InfiniBand Unreliable Datagram (UD) transport service. The InfiniBand UD
datagrams encapsulate the entire Ethernet L2 datagram and its payload.

To perform this operation the module performs an address translation from
Ethernet layer 2 MAC addresses (48 bits long) to InfiniBand layer 2 addresses
made of LID/GID and QPN. This translation is totally invisible to the OS and
user. This differentiates EoIB from IPoIB, which exposes a 20-byte HW address
to the OS.

The mlx4_vnic module is designed for Mellanox's ConnectX family of HCAs and
intended to be used with Mellanox's BridgeX gateway family. Having a BridgeX
gateway is a requirement for using EoIB. The BridgeX gateway performs the
following operations:
* Enables the layer 2 address translation required by the mlx4_vnic module.
* Enables routing of packets from the InfiniBand fabric to a 1 or 10 GigE
  Ethernet subnet.

1.2 EoIB Topology
-----------------
EoIB is designed to work over an InfiniBand fabric and requires the presence
of two entities:
* Subnet Manager (SM)
* BridgeX gateway

The required subnet manager configuration is similar to that of other
InfiniBand applications and ULPs and is not unique to EoIB.

The BridgeX gateway is at the heart of EoIB. On one side, usually referred to
as the "internal" side, it is connected to the InfiniBand fabric by one or
more links. On the other side, usually referred to as the "external" side, it
is connected to the Ethernet subnet by one or more ports. The Ethernet
connections on the BridgeX's external side are called external ports or
eports. Every BridgeX that is in use with EoIB needs to have one or more
eports connected.

1.2.1 External Ports (eports) and GW

The combination of a specific BridgeX box and a specific eport is referred to
as a gateway (GW). The GW is an entity that is visible to the EoIB host driver
and is used in the configuration of the network interfaces on the host side.
For example, with host administered vNics the user requests to open an
interface on a specific GW, identifying it by the BridgeX box and eport name.

Distinguishing between GWs is important because they determine the network
topology and affect the path that a packet traverses between hosts. A packet
that is sent from the host on a specific EoIB interface will be routed to the
Ethernet subnet through a specific external port connection on the BridgeX
box.

1.2.2 Virtual Hubs (vHubs)

A virtual hub (vHub) connects zero or more EoIB interfaces (on internal hosts)
and an eport. Each vHub has a unique virtual LAN (VLAN) ID. Virtual hub
participants can send packets to one another directly without the assistance
of the Ethernet subnet (external side) routing.
This means that two EoIB interfaces on the same vHub will communicate solely
using the InfiniBand fabric. EoIB interfaces residing on two different vHubs
(whether on the same GW or not) cannot communicate directly.

There are two types of vHubs:
- a default vHub (one per GW) without a VLAN ID
- vHubs with unique VLAN IDs

Each vHub belongs to a specific GW (BridgeX + eport), and each GW has one
default vHub and zero or more VLAN-associated vHubs. A specific GW can have
multiple vHubs distinguishable by their unique VLAN ID. Traffic coming from
the Ethernet side on a specific eport will be routed to the relevant vHub
group based on its VLAN tag (or to the default vHub for that GW if no VLAN ID
is present).

1.2.3 Virtual NIC (vNic)

A virtual NIC is a network interface instance on the host side which belongs
to a single vHub on a specific GW. The vNic behaves like any regular hardware
network interface. The host can have multiple interfaces that belong to the
same vHub.

2 EoIB Configuration
====================
The mlx4_vnic module supports two different modes of configuration:
- host administration, where the vNic is configured on the host side
- network administration, where the configuration is done by the BridgeX and
  passed to the host mlx4_vnic driver using the EoIB protocol

Both modes of operation require the presence of a BridgeX gateway in order to
work properly. The EoIB driver supports a mixture of host and network
administered vNics.

2.1 EoIB Host Administered vNic
-------------------------------
In the host administered mode, vNics are configured using static configuration
files located on the host side. These configuration files define the number of
vNics and the vHub that each host administered vNic will belong to (i.e., the
vNic's BridgeX box, eport and VLAN ID properties). The mlx4_vnic_confd service
is used to read these configuration files and pass the relevant data to the
mlx4_vnic module.
EoIB Host Administered vNic supports two forms of configuration files:
- A central configuration file (mlx4_vnic.conf)
- vNic-specific configuration files (ifcfg-ethX)

Both forms of configuration supply the same functionality. If both forms of
configuration files exist, the central configuration file has precedence and
only this file will be used.

2.1.1 Central Configuration File - /etc/infiniband/mlx4_vnic.conf

The mlx4_vnic.conf file consists of lines, each describing one vNic. The
following file format is used:

    name=eth44 mac=00:25:8B:27:14:78 ib_port=mlx4_0:1 vid=3 vnic_id=5 bx=00:00:00:00:00:00:04:B2 eport=A10
    name=eth45 mac=00:25:8B:27:15:78 ib_port=mlx4_0:1 vnic_id=6 bx=00:00:00:00:00:00:05:B2 eport=A10
    name=eth47 mac=00:25:8B:27:16:84 ib_port=mlx4_0:1 vid=2 vnic_id=7 bx=BX001 eport=A11
    name=eth40 mac=02:AA:8B:27:17:93 ib_port=mlx4_0:2 vnic_id=8 bx=BX001 eport=A12

The fields used in the file have the following meaning:

name    - The name of the interface that is displayed when running ifconfig.
mac     - The MAC address to assign to the vNic.
ib_port - The device name and port number in the form
          [device name]:[port number]. The device name can be retrieved by
          running ibv_devinfo and using the output of the hca_id field. The
          port number can have a value of 1 or 2.
vid     - VLAN ID (an optional field). If it exists, the vNic will be
          assigned the VLAN ID specified. This value must be between 0 and
          4095. If no vid is specified or the value -1 is set, the vNic will
          be assigned to the default vHub associated with the GW.
vnic_id - A unique number per vNic between 0 and 32K.
bx      - The BridgeX box system GUID or system name string.
eport   - The string describing the eport name.

2.1.2 vNic Specific Configuration Files - ifcfg-ethX

EoIB configuration can use the ifcfg-ethX files used by the network service to
derive the needed configuration. In this case, a separate file is required per
vNic. Additionally, you need to update the ifcfg-ethX file and add some new
attributes to it.
On Red Hat the new file will be of the form:

    DEVICE=eth2
    HWADDR=00:30:48:7d:de:e4
    BOOTPROTO=dhcp
    ONBOOT=yes
    BXADDR=BX001
    BXEPORT=A10
    VNICIBPORT=mlx4_0:1
    VNICVLAN=3    (Optional field)

The fields used in the file have the following meaning:

DEVICE     - An optional field. The name of the interface that is displayed
             when running ifconfig. If it is not present, the trailer of the
             configuration file name (e.g. ifcfg-eth47 => "eth47") is used
             instead.
BXADDR     - The BridgeX box system GUID or system name string.
BXEPORT    - The string describing the eport name.
VNICVLAN   - An optional field. If it exists, the vNic will be assigned the
             VLAN ID specified. This value must be between 0 and 4095.
VNICIBPORT - The device name and port number in the form
             [device name]:[port number]. The device name can be retrieved by
             running ibv_devinfo and using the output of the hca_id field.
             The port number can have a value of 1 or 2.
HWADDR     - The MAC address to assign to the vNic.

Other fields available for regular eth interfaces in the ifcfg-ethX files may
also be used.

2.1.3 mlx4_vnic_confd

After updating the configuration files you are ready to create the host
administered vNics.

Usage:
    /etc/init.d/mlx4_vnic_confd {start|stop|restart|reload|status}

Note: This script manages host administered vNics only. To retrieve general
      information on the vNics on the system, including network administered
      vNics, refer to mlx4_vnic_info in section 3.1.

2.2 EoIB Network Administered vNic
----------------------------------
In network administered mode, the configuration of the vNic is done by the
BridgeX. If a vNic is configured for a specific host, it will appear on that
host once a connection is established between the BridgeX and the mlx4_vnic
module. This connection between the mlx4_vnic modules and all available
BridgeX boxes is established automatically when the mlx4_vnic module is
loaded.
If the BridgeX is configured to remove the vNic, or if the connection between
the host and BridgeX is lost, the vNic interface will disappear (running
ifconfig will not display the interface). Similar to host administered vNics,
a network administered vNic resides on a specific vHub.

See BridgeX documentation on how to configure a network administered vNic.

To disable network administered vNics on the host side, load the mlx4_vnic
module with the net_admin module parameter set to 0.

2.3 VLAN configuration
----------------------
As explained in the topology section, a vNic instance is associated with a
specific vHub group. This vHub group is connected to a BridgeX external port
and has a VLAN tag attribute. When creating/configuring a vNic you define the
VLAN tag it will use via the vid or the VNICVLAN fields (if these fields are
absent, the vNic will not have a VLAN tag). The vNic's VLAN tag will be
present in all EoIB packets sent by the vNic and will be verified on all
packets received on the vNic. When a packet passes from the InfiniBand fabric
to Ethernet, the EoIB encapsulation is removed but the VLAN tag remains.

For example, if the vNic "eth23" is associated with a vHub that uses BridgeX
"bridge01", eport "A10" and VLAN tag 8, all incoming and outgoing traffic on
eth23 will use a VLAN tag of 8. This will be enforced by both BridgeX and
destination hosts. When a packet is passed from the internal fabric to the
Ethernet subnet through the BridgeX it will have a "true" Ethernet VLAN tag
of 8.

The VLAN implementation used by EoIB uses OS-unaware VLANs. This is in many
ways similar to switch tagging, in which an external Ethernet switch
adds/strips tags on traffic, removing the need for OS intervention. EoIB does
not support OS-aware VLANs in the form of vconfig.
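
As a rough illustration of the vid rules given for the central configuration
file (a value between 0 and 4095, with an absent field or -1 selecting the
default vHub), the following shell sketch classifies the vid field of one
mlx4_vnic.conf-style line. The function name classify_vid is hypothetical and
is not part of the EoIB package. The driver and the BridgeX remain the
authority on what is actually accepted.

```shell
#!/bin/sh
# Hypothetical helper, not shipped with mlx4_vnic: classify the "vid" field
# of one mlx4_vnic.conf-style line according to the rules in this README.
classify_vid() {
    line="$1"
    vid=""
    # Walk the space-separated key=value pairs and pick out vid, if present.
    for pair in $line; do
        case "$pair" in
            vid=*) vid="${pair#vid=}" ;;
        esac
    done
    # No vid, or vid=-1, means the default vHub of the GW.
    if [ -z "$vid" ] || [ "$vid" = "-1" ]; then
        echo "default-vhub"
        return 0
    fi
    # Reject non-decimal values and anything longer than four digits.
    case "$vid" in
        *[!0-9]*|?????*) echo "invalid"; return 1 ;;
    esac
    if [ "$vid" -le 4095 ]; then
        echo "vlan $vid"
    else
        echo "invalid"
        return 1
    fi
}

# Sample lines modeled on the file format shown in section 2.1.1.
classify_vid "name=eth47 mac=00:25:8B:27:16:84 ib_port=mlx4_0:1 vid=2 vnic_id=7 bx=BX001 eport=A11"
classify_vid "name=eth40 mac=02:AA:8B:27:17:93 ib_port=mlx4_0:2 vnic_id=8 bx=BX001 eport=A12"
```

A check of this kind only catches malformed values on the host side. It
cannot tell whether a vHub with the requested VLAN tag exists on the BridgeX.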

2.3.1 Configuring VLANs

To configure a VLAN tag for a vNic, add the VLAN tag property to the
configuration file in host administered mode, or configure the vNic on the
appropriate vHub in network administered mode. In the host administered mode,
when a vHub with the requested VLAN tag is not available, the vNic's login
request will be rejected.

Host administered VLAN configuration in the centralized configuration file:
    Add "vid=<VLAN tag>" or remove the vid property for no VLAN

Host administered VLAN configuration with ifcfg-ethX configuration files:
    Add "VNICVLAN=<VLAN tag>" or remove the VNICVLAN property for no VLAN

Notes:
 o Using a VLAN tag value of 0 is not recommended because the traffic using
   it would not be separated from non-VLAN traffic.
 o For host administered vNics, the VLAN entry must be set in the BridgeX
   first. Refer to BridgeX documentation for more information.

2.4 EoIB Multicast Configuration
--------------------------------
Configuring multicast for EoIB interfaces is identical to multicast
configuration for native Ethernet interfaces.

Note: EoIB maps Ethernet multicast addresses to InfiniBand MGIDs (Multicast
      GIDs). It ensures that different vHubs use mutually exclusive MGIDs,
      which prevents vNics on different vHubs from communicating with one
      another.

2.5 EoIB and QoS
----------------
EoIB enables the use of InfiniBand service levels (SLs). The configuration of
the SL is performed through the BridgeX and enables setting different
data/control service level values per BridgeX box. Please refer to BridgeX
documentation for the use of a non-default SL.

2.6 IP Configuration Based on DHCP
----------------------------------
Setting an EoIB interface configuration based on DHCP (v3.1.2 which is
available via www.isc.org) is performed similarly to the configuration of
Ethernet interfaces.
When setting the EoIB configuration files, verify that they include the
following line:
    For RedHat: BOOTPROTO=dhcp
    For SLES:   BOOTPROTO='dhcp'

Note: If EoIB configuration files are included, ifcfg-ethX files will be
      installed under:
      /etc/sysconfig/network-scripts/ on a RedHat machine
      /etc/sysconfig/network/ on a SuSE machine

2.6.1 DHCP Server

Using a DHCP server with EoIB does not require special configuration. The
DHCP server can run on a server located on the Ethernet side (using any
Ethernet HW) or on a server located on the InfiniBand side and running the
EoIB module.

2.7 Static EoIB Configuration
-----------------------------
To configure a static EoIB interface, use an EoIB configuration that is not
based on DHCP. Static configuration is similar to a typical Ethernet device
configuration. See your Linux distribution documentation for additional
information about configuring IP addresses.

Note: Ethernet configuration files are located at:
      /etc/sysconfig/network-scripts/ on a RedHat machine
      /etc/sysconfig/network/ on a SuSE machine

2.8 Sub Interfaces (VLAN)
-------------------------
EoIB interfaces do not support creating sub interfaces via the vconfig
command. To create interfaces with VLAN, refer to the VLAN section 2.3.1.

3. Retrieving EoIB Information
==============================

3.1 mlx4_vnic_info
------------------
To retrieve information regarding EoIB interfaces, use the script
mlx4_vnic_info. This script provides detailed information about a specific
vNic or all EoIB vNic interfaces, such as: BX info, IOA info, SL, PKEY, link
state and interface features. If network administered vNics are enabled, this
script can also be used to discover the available BridgeXs from the host
side.

To discover the available BridgeXs, run:
    # mlx4_vnic_info -s | grep BX

To receive the full vNic information of eth10, run:
    # mlx4_vnic_info -i eth10

For help and usage, run:
    # mlx4_vnic_info --help

3.2 ethtool
-----------
The ethtool application is another method to retrieve interface information
and change its configuration. EoIB interfaces support ethtool similarly to HW
Ethernet interfaces. The supported ethtool options include:

    -c, -C - Show and update interrupt coalesce options
    -g     - Query RX/TX ring parameters
    -k, -K - Show and update protocol offloads
    -i     - Show driver information
    -S     - Show adapter statistics

For more information on ethtool, run:
    ethtool -h

3.3 Link State
--------------
An EoIB interface can report two different link states:
- The physical link state of the interface, which is made up of the actual
  HCA port link state and the status of the vNic's connection with the
  BridgeX. If the HCA port link state is down or the EoIB connection with the
  BridgeX has failed, the link will be reported as down, because without the
  connection to the BridgeX the EoIB protocol cannot work and no data can be
  sent on the wire.
- The link state of the external port associated with the vNic interface.

The mlx4_vnic driver can also report the state of the external BridgeX port
through the mlx4_vnic_info script. If the eport_state_enforce module
parameter is set, the external port state will be reported as the vNic
interface link state. If the connection between the vNic and the BridgeX is
broken (hence the external port state is unknown), the link will be reported
as down.

Note: If the link state is down on a host administered vNic while the BridgeX
      is connected and the InfiniBand fabric appears to be functional, the
      issue might result from a misconfiguration of BXADDR and/or BXEPORT in
      the configuration file.

To query the link state, run ifconfig and check for "RUNNING" in the result
text.
Example:
    # ifconfig eth2
    eth2      Link encap:Ethernet  HWaddr 00:25:8B:00:04:00
              inet6 addr: fe80::225:8bff:fe00:400/64 Scope:Link
              UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
              RX packets:49 errors:0 dropped:11 overruns:0 frame:0
              TX packets:25 errors:0 dropped:0 overruns:0 carrier:0
              collisions:0 txqueuelen:1000
              RX bytes:11278 (11.0 KiB)  TX bytes:5821 (5.6 KiB)

An alternative is to use ethtool and test for "Link detected".

Example:
    # ethtool eth2
    Settings for eth2:
            Supported ports: [ ]
            Supported link modes:
            Supports auto-negotiation: No
            Advertised link modes:  Not reported
            Advertised auto-negotiation: No
            Speed: Unknown! (10000)
            Duplex: Full
            Port: Twisted Pair
            PHYAD: 0
            Transceiver: internal
            Auto-negotiation: off
            Supports Wake-on: d
            Wake-on: d
            Current message level: 0x00000000 (0)
            Link detected: yes

3.4 Bonding Driver
------------------
EoIB uses the standard Linux bonding driver. For more information on the
Linux bonding driver, refer to Documentation/networking/bonding.txt in the
Linux kernel source tree.

Currently not all bonding modes are supported (e.g., LACP is not supported).

3.5 Jumbo Frames
----------------
EoIB supports jumbo frames up to the InfiniBand limit of 4K bytes. To
configure EoIB to work with jumbo frames you need to configure the entire
InfiniBand fabric to use 4K MTU. This includes configuring the SM, the
InfiniBand switches and the ConnectX HCA. To configure the HCA port to work
with 4K MTU, enable the set_4k_mtu module parameter of mlx4_core. For how to
configure the SM and switches, refer to their corresponding documentation.

4. Advanced EoIB Settings
=========================

4.1 Module Parameters
---------------------
The mlx4_vnic driver supports the following module parameters. These
parameters are intended to enable more specific configuration of the
mlx4_vnic driver to customer needs. The mlx4_vnic driver is also affected by
module parameters of other modules, such as set_4k_mtu of mlx4_core. These
modules are not addressed in this document.

The available module parameters include:
* tx_rings_num: Number of TX rings used per vNic, use 0 for #cores
  [default 0].
* rx_rings_num: Number of RX rings, use 0 for #cores [default 0]. The
  receive rings service all vNics that use the HCA port.
* eport_state_enforce: Bring vNic link indication up only when the
  corresponding external port is up [default 0].
* lro_num: Number of LRO sessions per ring or disable=0 [default 32].
* napi_weight: NAPI weight [default 32].
* max_tx_outs: Max outstanding TX packets [default 16].
* vnic_net_admin: Network administration enabled [default 1]. If disabled,
  no network administered interfaces will be opened.

For the full list and description of module parameters, run:
    # modinfo mlx4_vnic

4.2 vNic Interface Naming
-------------------------
The mlx4_vnic driver enables the kernel to determine the name of the
registered vNic. By default, the Linux kernel assigns each vNic interface the
name ethN, where N is an incremental number that keeps the interface name
unique in the system. The vNic interface name may not remain consistent
across host or BridgeX reboots, as the vNic creation can happen in a
different order each time. Therefore, the interface name may change because
of a "first-come-first-served" kernel policy. In automatic network
administered mode, the vNic MAC address may also change, which makes it
difficult to keep the interface configuration persistent.

To control the interface name, you can use standard Linux utilities such as
IFRENAME(8), IP(8) or UDEV(7). For example, to change the interface eth2 name
to eth.bx01.a10, run:
    # ifrename -i eth2 -n eth.bx01.a10

To generate a unique vNic interface name, use the mlx4_vnic_info script with
the '-u' flag.
The script will generate a new name based on the scheme:
    eth<pci-id>.<ib-port-num>.<gw_port_id>.[vlan-id]

For example, if vNic eth2 resides on an InfiniBand card on the PCI BUS ID
0a:00.0 PORT #1, and is connected to the GW PORT ID #3 without VLAN, its
unique name will be:
    # mlx4_vnic_info -u eth2
    eth2 eth10.1.3

You can add your own custom udev rule to use the output of the script and to
rename the vNic interfaces automatically. To do so, create a new udev rule
file, /etc/udev/rules.d/61-vnic-net.rules, that includes the line:
    SUBSYSTEM=="net", PROGRAM=="/sbin/mlx4_vnic_info -u %k", NAME="%c{2+}"

Notes:
- The UDEV service is active by default. If it is not active, run:
      # /sbin/udevd -d
- When the vNic MAC address is consistent, you can statically name each
  interface using the following UDEV rule:
      SUBSYSTEM=="net", SYSFS{address}=="aa:bb:cc:dd:ee:ff", NAME="ethX"

Refer to the udev man pages for more details on UDEV rules syntax.
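
The PROGRAM-based udev rule in this section takes the interface name from the
second word onward of the mlx4_vnic_info -u output (the %c{2+} substitution).
The following shell sketch mimics that selection so a name can be checked by
hand before a rule is installed. The helper name unique_name_from_output is
hypothetical, and the sample string is the example output shown in this
section, not live driver output.

```shell
#!/bin/sh
# Hypothetical helper: given one line of "mlx4_vnic_info -u <ifname>" output
# (current name followed by the generated name), print everything from the
# second word on, which is what %c{2+} selects in the udev rule.
unique_name_from_output() {
    # Split the argument on whitespace into positional parameters.
    set -- $1
    # Fewer than two words means there is no generated name to use.
    [ "$#" -ge 2 ] || return 1
    # Drop the first word (the current interface name), keep the rest.
    shift
    echo "$*"
}

# Sample taken from the example in this section.
unique_name_from_output "eth2 eth10.1.3"
```

On a host with the driver loaded, the same helper could be fed the real
output, for example unique_name_from_output "$(mlx4_vnic_info -u eth2)".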