Internet-Draft MCPHINT May 2022
Robinson Expires 27 November 2022
Workgroup:
Network Working Group
Internet-Draft:
draft-robinson-intarea-mcphint-00
Published:
Intended Status:
Standards Track
Expires:
Author:
H. Robinson
Stratus Technologies, Inc.

Multiple Core Performance Hint Option

Abstract

This standard defines a method for differentiating between unrelated data streams when the source and destination ports are encrypted. This method MAY be used by hardware or software to evenly distribute incoming workload between multiple CPU cores and/or other processing elements.

Status of This Memo

This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

This Internet-Draft will expire on 27 November 2022.

1. Introduction

The Internet Protocol allows datagrams to be re-ordered. Protocols which require datagrams to be ordered must retain out-of-order datagrams until preceding datagrams have been received. While this works, the effect of out-of-order datagrams on network performance is highly detrimental: out-of-order packets at first appear to be packet loss from the receiver's point of view. The perceived packet loss can trigger unneeded retransmission and delays from TCP and any other protocol which uses packet loss to implement congestion control.

With the advent of 10Gbit transmission speeds, it is not possible for a single CPU core to keep up with the incoming data running at full line speed. Hardware vendors have implemented mechanisms to distribute incoming datagrams to multiple CPU cores. If they did this on a random or round-robin basis, the different latencies between the multiple cores would result in datagram re-ordering, which can severely impact performance. Hardware solves this problem by distributing the data deterministically between CPU cores: this is done using a hash of the source and destination IP addresses and the source and destination port numbers. Using just the source and destination IP addresses is not sufficient, because the resulting traffic will often go to a single CPU core.
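
The deterministic distribution described above can be sketched as follows. This is only an illustration: the function name select_core is hypothetical, and CRC-32 stands in for whatever hash a given piece of hardware actually uses. The point is that every datagram of one flow lands on the same core.

```python
import ipaddress
import struct
import zlib


def select_core(src_ip: str, dst_ip: str, src_port: int, dst_port: int,
                num_cores: int) -> int:
    """Map a flow's addresses and ports to a core index in [0, num_cores)."""
    # Concatenate the four fields in network byte order to form the hash key.
    key = (ipaddress.ip_address(src_ip).packed
           + ipaddress.ip_address(dst_ip).packed
           + struct.pack("!HH", src_port, dst_port))
    # Assumption: CRC-32 is a stand-in for the real hardware hash.
    return zlib.crc32(key) % num_cores
```

When the port numbers cannot be read, a receiver has only the two addresses to hash, which is the situation the following paragraphs describe.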

A performance problem arises when handling IPSec traffic: The port numbers are encrypted and can no longer be read by the hardware.

The performance problem also occurs with fragmented datagrams: The port numbers are only in the first fragment.

This standard defines IPv4 and IPv6 options to provide differentiation that can be used to distribute incoming datagrams to multiple CPU cores.

1.1. Requirements Language

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in BCP 14, RFC 2119 [RFC2119].

2. IPv4 Option Format

A host transmitting an IPv4 datagram MAY add an MCPHINT option to the IPv4 header under any of the following circumstances:

The MCPHINT option provides 2 bytes of differentiation data. If present, the MCPHINT option MUST occur first - at offset 20 from the beginning of the IPv4 header.

The MCPHINT option MUST NOT be used with upper layer protocols which do not have unique identifiers beyond the IPv4 source and destination address.

The datagram MUST NOT be for the ICMP protocol.

The format of the IPv4 MCPHINT option is:

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |   Type        |  Length = 4   |  Differentiation Data         |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    Type = TBD_IP4OPT_MCPHINT

Refer to RFC0791 [RFC0791] for more information about IP options.
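
A minimal sketch of encoding the four option bytes shown in the figure above. The value 0x9E is a hypothetical placeholder for TBD_IP4OPT_MCPHINT, chosen only so that the Copy bit is 1 and the class bits are 00; the real value is not assigned. Placing the differentiation data in network byte order is also an assumption, and build_ipv4_mcphint is an illustrative name.

```python
import struct

# Hypothetical placeholder: the real option type is still TBD.
TBD_IP4OPT_MCPHINT = 0x9E


def build_ipv4_mcphint(diff_data: int,
                       opt_type: int = TBD_IP4OPT_MCPHINT) -> bytes:
    """Return the 4-byte option: Type, Length = 4, differentiation data."""
    if not 1 <= diff_data <= 0xFFFF:
        raise ValueError("differentiation data must be a non-zero 16-bit value")
    # Length counts the whole option, including the Type and Length bytes.
    return struct.pack("!BBH", opt_type, 4, diff_data)
```

Because the option is exactly four bytes, adding it as the only option raises IHL from 5 to 6 without any padding.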

If there is a mechanism by which an application can provide IPv4 options for transmission and that mechanism is used to provide an MCPHINT option, the value provided by the application MUST be used.

The macro OPT_MCPHINT MAY be added to netinet/in.h defined as TBD_IP4OPT_MCPHINT.

3. IPv6 Option Format

A host transmitting an IPv6 datagram MAY add an MCPHINT option under any of the following circumstances:

The MCPHINT option MUST be added to a destination options header. The MCPHINT option provides 2 bytes of differentiation data. The Destination options header is defined in section 4.6 of RFC8200 [RFC8200].

If present, the MCPHINT option MUST occur first in the first destination options header - normally at offset 42 from the beginning of the IPv6 header.

Note that RFC8200 [RFC8200] requires per fragment destination options headers to be followed by a routing header. If one applies this hint to a packet containing an IPv6 fragmentation header, a routing header must be included. RFC8200 [RFC8200] explicitly states that a routing header with zero "Segments Left" is always ignored, so this is possible.
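
A rough sketch of such a routing header, assuming the minimum 8-byte layout from section 4.4 of RFC8200. This document does not say which Routing Type to use, so routing_type is left as a caller-supplied parameter, and null_routing_header is a hypothetical helper name.

```python
import struct


def null_routing_header(next_header: int, routing_type: int) -> bytes:
    """Build an 8-byte routing header with Segments Left set to zero."""
    hdr_ext_len = 0     # length in 8-octet units, not counting the first 8
    segments_left = 0   # zero means receivers skip over this header
    # The trailing "4x" emits four zero bytes of type-specific data.
    return struct.pack("!BBBB4x", next_header, hdr_ext_len,
                       routing_type, segments_left)
```

For example, null_routing_header(44, routing_type) would precede a fragment header, since 44 is the Next Header value for fragments.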

The MCPHINT option MUST NOT be used with upper layer protocols which do not have unique identifiers beyond the IPv6 source and destination address.

The datagram MUST NOT be for the ICMPv6 protocol.

The format of the IPv6 MCPHINT option is:

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |   Type        |  Data Len = 2 |   Differentiation Data        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    Type = TBD_IP6OPT_MCPHINT
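
As an illustration, a destination options header carrying only this option might be built as below. The value 0x1E is a hypothetical placeholder for TBD_IP6OPT_MCPHINT (its top three bits are zero), network byte order for the data is an assumption, and the trailing PadN option is added because RFC8200 requires extension headers to be a multiple of 8 octets long.

```python
import struct

# Hypothetical placeholder: the real option type is still TBD.
TBD_IP6OPT_MCPHINT = 0x1E


def build_dest_opts_with_mcphint(next_header: int, diff_data: int,
                                 opt_type: int = TBD_IP6OPT_MCPHINT) -> bytes:
    """Return an 8-byte destination options header with MCPHINT first."""
    if not 1 <= diff_data <= 0xFFFF:
        raise ValueError("differentiation data must be a non-zero 16-bit value")
    hdr_ext_len = 0  # 8 octets total: length excludes the first 8 octets
    mcphint = struct.pack("!BBH", opt_type, 2, diff_data)
    padn = bytes([1, 0])  # PadN option with no data: two octets of padding
    return struct.pack("!BB", next_header, hdr_ext_len) + mcphint + padn
```

With this layout the option type sits at offset 2 of the extension header, which is offset 42 of the packet when the header directly follows the 40-byte IPv6 header.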

If there is a mechanism by which an application can provide destination options for transmission and that mechanism is used to provide an MCPHINT option, the value provided by the application MUST be used.

The macro IP6OPT_MCPHINT MAY be added to netinet/ip6.h defined as TBD_IP6OPT_MCPHINT.

4. Differentiation Data

For both IPv4 and IPv6, there are two bytes of differentiation data. The differentiation data MUST NOT be zero. The differentiation data MUST be the same for all datagrams in a logical stream. The actual value chosen for differentiation data is left to the implementation.

A preferable mechanism would be to generate two bytes of random data when a socket is created and to use that data for the life of the socket. The random data could be updated every time a connection is specified.

Alternatively, exclusive or'ing the source and destination ports is an acceptable method for generating the differentiation data.
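
Both methods can be sketched in a few lines. The function names are hypothetical. One detail the text leaves open is what to do when the two ports are equal, since their exclusive or is then zero and zero is not allowed; the fallback value used here is purely an assumption.

```python
import secrets


def random_diff_data() -> int:
    """Pick a random non-zero 16-bit value, e.g. once per socket."""
    # randbelow(0xFFFF) yields 0..0xFFFE, so the result is 1..0xFFFF.
    return secrets.randbelow(0xFFFF) + 1


def xor_diff_data(src_port: int, dst_port: int, fallback: int = 0xFFFF) -> int:
    """Derive the value from the ports, avoiding the forbidden zero."""
    value = (src_port ^ dst_port) & 0xFFFF
    # Assumption: substitute a fixed non-zero value when the ports match.
    return value if value != 0 else fallback
```

Whichever method is used, the value would be computed once and reused so that all datagrams in a logical stream carry the same data.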

5. Forwarding

Forwarding is already defined to pass through unknown options.

6. Tunneling

Tunneling implementations MAY copy the MCPHINT option from the datagrams being tunneled to the outer headers.

7. Parsing Input Datagrams

7.1. IPv4

Refer to section 3.1 in RFC0791 [RFC0791]. The input parsing algorithm for detecting the presence of differentiation data is:

 o IHL MUST be greater than or equal to 6
 o The byte at offset 20 MUST be TBD_IP4OPT_MCPHINT

If those checks pass, then the differentiation data can be found at offset 22.
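
The two checks above translate directly into code. In this sketch, 0x9E is a hypothetical placeholder for TBD_IP4OPT_MCPHINT, the function name is illustrative, reading the data in network byte order is an assumption, and the length guard is added only so the sketch never reads past the end of a short buffer.

```python
import struct
from typing import Optional

# Hypothetical placeholder: the real option type is still TBD.
TBD_IP4OPT_MCPHINT = 0x9E


def ipv4_mcphint(packet: bytes,
                 opt_type: int = TBD_IP4OPT_MCPHINT) -> Optional[int]:
    """Return the differentiation data, or None if the option is absent."""
    if len(packet) < 24:
        return None                  # too short to hold header plus option
    if (packet[0] & 0x0F) < 6:
        return None                  # IHL below 6: no room for the option
    if packet[20] != opt_type:
        return None                  # first option is not MCPHINT
    (diff_data,) = struct.unpack_from("!H", packet, 22)
    return diff_data
```

Because the option is required to come first, the receiver needs only fixed-offset comparisons and never has to walk the option list.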

7.2. IPv6

Refer to sections 3, 4.2 and 4.6 in RFC8200 [RFC8200]. The input parsing algorithm for detecting the presence of differentiation data is:

 o Next Header (offset 6) MUST be 60 (for destination options).
 o The byte at offset 42 MUST be TBD_IP6OPT_MCPHINT

If those checks pass, then the differentiation data can be found at offset 44.
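
A matching sketch for IPv6, under the same caveats: 0x1E is a hypothetical placeholder for TBD_IP6OPT_MCPHINT, the function name is illustrative, network byte order is assumed, and the length guard exists only to keep the sketch from reading past the buffer.

```python
import struct
from typing import Optional

# Hypothetical placeholder: the real option type is still TBD.
TBD_IP6OPT_MCPHINT = 0x1E

NEXT_HEADER_DEST_OPTS = 60


def ipv6_mcphint(packet: bytes,
                 opt_type: int = TBD_IP6OPT_MCPHINT) -> Optional[int]:
    """Return the differentiation data, or None if the option is absent."""
    if len(packet) < 46:
        return None                  # too short to hold header plus option
    if packet[6] != NEXT_HEADER_DEST_OPTS:
        return None                  # first extension header is something else
    if packet[42] != opt_type:
        return None                  # first option is not MCPHINT
    (diff_data,) = struct.unpack_from("!H", packet, 44)
    return diff_data
```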

8. Future Considerations

A future revision of this standard could allow the differentiation data to be longer as long as the first two bytes are generated the same way.

A future revision of this standard could add fields to this option.

9. Security Considerations

The MCPHINT option provides some minimal insight to internal network configurations that wouldn't otherwise be discernible for IPSec tunnels.

Xor'ing the port numbers to obtain differentiation data provides slightly more information than using random data.

The implementation MUST provide an administrative mechanism to disable the use of MCPHINT options.

If the implementation implements both random generation of differentiation data AND uses the Xor'ing ports method, there MUST be separate administrative mechanisms for each method.

10. IANA Considerations

IANA is asked to assign a value for TBD_IP4OPT_MCPHINT under "Internet Protocol Version 4 (IPv4) Parameters", "IP Option Numbers". Refer to RFC2780 [RFC2780] and RFC0791 [RFC0791].

The Copy bit MUST be 1 and the class bits MUST be 00.

IANA is asked to assign a value for TBD_IP6OPT_MCPHINT under "Internet Protocol Version 6 (IPv6) Parameters", "Destination Options and Hop-by-Hop Options". Refer to RFC2780 [RFC2780] and RFC8200 [RFC8200].

The act bits MUST be 00 and the chg bit MUST be 0.
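
The bit constraints in this section can be checked mechanically against any candidate value. The sketch below assumes the option type layouts from RFC0791 (copied flag, two class bits, five number bits) and RFC8200 (two act bits, one chg bit, five remaining bits); the function names are hypothetical.

```python
def ipv4_type_ok(opt_type: int) -> bool:
    """True if the Copy bit is 1 and the class bits are 00."""
    copy_bit = (opt_type >> 7) & 0x1
    class_bits = (opt_type >> 5) & 0x3
    return copy_bit == 1 and class_bits == 0


def ipv6_type_ok(opt_type: int) -> bool:
    """True if the act bits are 00 and the chg bit is 0."""
    act_bits = (opt_type >> 6) & 0x3
    chg_bit = (opt_type >> 5) & 0x1
    return act_bits == 0 and chg_bit == 0
```

In effect, an acceptable IPv4 value falls in the range 0x80 to 0x9F and an acceptable IPv6 value in the range 0x00 to 0x1F.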

11. Appendix A - Design Considerations

This is done as an option so it may be added without affecting implementations that don't implement it.

Use with ICMP and ICMPv6 is prohibited because there is no reason to optimize them and, given that correct IP layer behavior depends on their transmission, it is best to avoid anything that might interfere with correct operation.

One should note that when using this option with IPSec, the same security association is likely to be processed on multiple CPU cores. This requires a good locking design to achieve the desired performance improvement. It also requires much larger replay windows.

11.1. IP Notification

Stratus has applied for a patent on this. Stratus intends to allow use of the patent free of charge. I will be filing the appropriate formal notification as soon as I figure out what it is and get it signed by the appropriate management.

11.2. Issues To Resolve

My original writeup of this put the new IPv6 option in the Hop-by-Hop header, because that is always guaranteed to be a per fragment header. The option was moved to the destination options header given the advice in section 4.8 of RFC8200 [RFC8200]. Section 4.5 of RFC8200 [RFC8200] explicitly states that there are only the following combinations of per fragment headers:

 IPv6 Header
 IPv6 Header, Hop-by-Hop Header
 IPv6 Header, Destination Options Header, Routing Header
 IPv6 Header, Hop-by-Hop Header, Dest Options Header, Routing Header

This implies that getting MCPHINTs into a fragmented header will require the insertion of a null routing header if one isn't present (which is the normal case).

So, I am wondering if I was misled by section 4.8 in RFC8200 [RFC8200] and this option really belongs in the hop-by-hop header?

I see that some other drafts have picked new values for option numbers and instructed the IANA to allocate specific numbers. I like this idea. Can anyone recommend deprecated values which could be assigned without getting into trouble?

12. Normative References

[RFC0791]
Postel, J., "Internet Protocol", STD 5, RFC 791, DOI 10.17487/RFC0791, September 1981, <https://www.rfc-editor.org/info/rfc791>.
[RFC8200]
Deering, S. and R. Hinden, "Internet Protocol, Version 6 (IPv6) Specification", STD 86, RFC 8200, DOI 10.17487/RFC8200, July 2017, <https://www.rfc-editor.org/info/rfc8200>.
[RFC2119]
Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, DOI 10.17487/RFC2119, March 1997, <https://www.rfc-editor.org/info/rfc2119>.
[RFC2780]
Bradner, S. and V. Paxson, "IANA Allocation Guidelines For Values In the Internet Protocol and Related Headers", BCP 37, RFC 2780, DOI 10.17487/RFC2780, March 2000, <https://www.rfc-editor.org/info/rfc2780>.

Author's Address

Herb Robinson
Stratus Technologies, Inc.
5 Mill & Main Place, Suite 500
Maynard, Massachusetts 1004
United States of America
