IEN: 168

                VAX-UNIX Networking Support Project
                    Implementation Description

                         Robert F. Gurwitz
                     Computer Systems Division
                   Bolt Beranek and Newman, Inc.
                        Cambridge, MA 02138

                           January, 1981
VAX-UNIX Networking                                 January, 1981
Support Project                                           IEN 168

1 Introduction

The purpose of this report is to describe the implementation of
network software for the VAX-11/780 * running UNIX. ** This is
being done as part of the VAX-UNIX Networking Support Project.
The overall purpose of this effort is to provide the capability
for the VAX to communicate with other computers via
packet-switching networks, such as the ARPANET. Specifically, the
project centers around an implementation of the DoD standard
host-host protocol, the Transmission Control Protocol (TCP) [4].
TCP allows communication with ARPANET hosts, as well as hosts on
networks outside the ARPANET, by its use of the DoD standard
Internet Protocol (IP) [3]. The implementation is designed for
the VAX, running VM/UNIX, the modified version of UNIX 32/V
developed at the University of California, Berkeley [1]. This
version of UNIX includes virtual paging capabilities.

In the following paragraphs, we will discuss some features and
design goals of the implementation, and its organization.

2 Features of the Implementation

2.1 Protocol Dependent Features

2.1.1 Separation of Protocol Layers

The TCP software that we are developing for the VAX incorporates
several important features. First, the implementation provides
for separation of the various protocol layers so that they can be
accessed independently by various applications. (1) Thus, there
is a capability for access to the TCP level, which will provide
complete, reliable, multiplexed, host-host communications
connections. In addition, the IP level is also accessible for
applications other than TCP, which require its internet
addressing and data fragmentation/reassembly services. Finally,
the implementation also allows independent access to the local
network interface (in this case, to the ARPANET, whose host
interface is defined in BBN Report No. 1822 [2]) in a "raw"
fashion, for software which wishes to communicate with hosts on
the local network and do its own higher level protocol
processing.

_______________
* VAX is a trademark of Digital Equipment Corporation.
** UNIX is a trademark of Bell Laboratories.
(1) In this context, the terms application and user refer to any
software that is a user of lower level networking services.
Thus, programs such as FTP and TELNET can be considered
applications when viewed from the TCP level, and TCP itself may
be viewed as an application from the IP level.

2.1.2 Protocol Functions

Another feature of the implementation is to provide the full
functionality of each level of protocol (TCP and IP), as
described in their specifications [3,4]. Thus, on the TCP level,
features such as the flow control mechanism (windows),
precedence, and security levels will be supported. On the IP
level, datagram fragmentation and reassembly will be supported,
as well as IP option processing, gateway-host flow control
(source-quenching) and routing updates. However, it is
anticipated that some of these features (such as handling IP
gateway-host routing updates, and IP option processing) will be
implemented in later stages of development, after more basic
features (such as TCP flow control and IP
fragmentation/reassembly) are debugged.

2.2 Operating System Dependent Features

2.2.1 Kernel Resident Networking Software

There are several features of the implementation which are
operating system dependent. The most important of these is the
fact that the networking software is being implemented in the
UNIX kernel as a permanently resident system process, rather than
a swappable user level process. This organization has several
implications which bear on performance. The most obvious effect
is that since the networking software is always resident, it can
more efficiently respond to network and user initiated events, as
it is always available to service such events and need not be
swapped in. In addition, residence in the kernel removes the
burden of the use of potentially inefficient interprocess
communication mechanisms, such as pipes and ports, since simpler
data structures, such as globally available queues, can be used
to transmit data between the network and user processes. Kernel
provided services (e.g., timers and memory allocation) also
become much easier and more efficient to use.
The large address space of the VAX makes this organization
practical and avoids expedients such as the split kernel/user
process implementation of NCP, which were necessary in previous
UNIX networking software on machines with limited address space,
such as the PDP-11/70. It is hoped that the kernel resident
approach will contribute to the speed and efficiency of this TCP.
2.2.2 User Interface
Use of the "traditional" UNIX file oriented user interface
is another operating system dependent feature of this
implementation. The user will access the network software by
means of standard system file I/O calls: open, close, read, and
write. This entails modification of certain of these calls to
accommodate the extra information needed to open and maintain a
connection. In addition, the communication of exceptional
conditions to the user (such as the foreign host going down) must
also be accommodated by extension of the standard system calls.
In the case of open, for example, use of the call's mode field
will be extended to accommodate a pointer to a parameter
structure. In the case of exceptional conditions, the return
code for reads and writes will be used to signal the presence of
exceptional conditions, much like an error. An additional status
call (ioctl) will be provided for the user to determine detailed
information about the nature of the condition, and the general
status of the connection.
In this way, the necessary additional information needed to
maintain network communications will be supported, while still
allowing the use of the functionality that the UNIX file
interface provides, such as the pipe mechanism.
In the initial versions, this interface will be the standard
UNIX blocking I/O mechanism. Thus, outstanding reads for data
which has not been accepted from the foreign host, and writes
which exceed the buffering resources of a connection will block.
It is expected that the await/capacity mechanism, currently
available for Version 6 systems, will be added to the VM/UNIX
kernel in the near future. These non-blocking I/O modifications
will be supported by the network software, relieving the blocking
restriction.
3 Design Goals

Several design goals have been formulated for this
implementation. Among these goals are efficiency and low
operating system overhead, promoted by a kernel resident network
process, which allows for reduced process and interprocess
communication overhead.

Another goal of the implementation is to reduce the amount of
extraneous data copying in handling network traffic. To achieve
this, a buffer data structure has been adopted which has the
following characteristics: intermediate size (128 bytes); low
overhead (10 bytes of control information per buffer); and
flexibility in data handling through the use of data offset and
length fields, which reduce the amount of data copying required
for operations like IP fragment reassembly and TCP sequence space
manipulations.

The use of queueing between the various software levels has been
limited in the implementation by processing incoming network data
to the highest level possible as soon as possible. Thus, an
unfragmented message coming from the network is passed to the IP
and TCP levels, with queueing taking place at the device driver
only until the message has been fully read from the network.
Similarly, on the output side, data transmission is only
attempted when the software is reasonably certain that the data
will be accepted by the network.

Finally, it is planned that the inclusion of the network software
will entail relatively little modification of the basic kernel
code beyond that provided by Berkeley. The only modifications to
kernel code outside the network software will be slight changes
to the file I/O system calls to support the user interface
described above. In addition, an extension to the virtual page
map data structure in low core will be necessary to support the
memory allocation scheme, which makes use of the kernel's page
frame allocation mechanisms.

4 Organization

4.1 Control Flow
4.1.1 Local Network Interface
The network software can be viewed as a kernel resident
system process, much like the scheduler and page daemon of
Berkeley VM/UNIX. This process is initiated as part of network
initialization. A diagram of its control and data flow is shown
in Figure 1.
  |  |-----|           |-----|  |-----|  |-----|                  |
  |  |LOCAL|  |-----|  |LOCAL|  |     |  |     |                  |
  |  | NET |  |input|  | NET |  | IP  |  | TCP |                  |
  |->|INPUT|->|queue|->|INPUT|->|INPUT|->|INPUT|                  |
  |  | I/F |  |-----|  |     |  |     |  |     |                  |
N |  |-----|==========>|-----|  |-----|  |-----|                  |
  |     ^    (wakeup)              ^          \         (timer)   |
E |     |                          |           \       /          | U
  |  (input)                       V            \     /           |
T |  ( int )                    |-----|          \   /            | S
  |                             |frag |         |-----|  |-----|  |
W |                             |queue|         |     |=>|     |->| E
  |                             |-----|         | TCP |  |USER |  |
O |                         |-----||-----|      |MACH |  | I/F |  | R
  |                         |unack|| snd |<---->|     |<=|     |<-|
R |  (outpt)                |queue||queue|      |-----|  |-----|  |
  |  ( int )                |-----||-----|     /   \         /    |
K |     |                                     /     \       /     |
  |     V                                    /       \     /      |
  |  |-----|           |-----|  |-----|  |-----|      \   /       |
  |  |LOCAL|  |-----|  |LOCAL|  |     |  |     |     |-----|      |
  |  | NET |  |outpt|  | NET |  | IP  |  | TCP |     | rcv |      |
  |<-|OUTPT|<-|queue|<-|OUTPT|<-|OUTPT|<-|OUTPT|     |queue|      |
  |  | I/F |  |-----|  |     |  |     |  |     |     |-----|      |
  |  |-----|<----------|-----|  |-----|  |-----|                  |
                     |                                   |
                     |                                   |
                     |<----------TCP PROCESS------------>|
                     |                                   |

             Figure 1.  Network Software Organization

Its main flow of control is an input loop which is activated (via
wakeup) by the network interface device driver when an incoming
message has been completely read from the network. (It can also
be awakened by TCP user or timer events, described below.) The
message is then taken from an input queue and dispatched on the
basis of local network format (e.g., 1822 leader link number).
ARPANET imp-host messages (RFNMs, incompletes, imp/host status)
are handled at this level. For other types of messages, the
local network level input handler calls higher level "message
handlers." The "standard message handler" is the IP input
routine.

Handlers for other protocols at this level (such as UNIX NCP) may
be accommodated in either of two ways. First, a "raw message"
service is available which simply queues data on specified links
to/from the local network. By reading or writing on a connection
opened for this service, a user process may handle its own higher
level protocol communication. Alternatively, for frequently used
protocols, a new handler may be defined in the kernel and called
directly.

4.1.2 Internet Protocol

At the IP level, the fragment reassembly algorithm is executed.
Unfragmented messages with valid IP leaders are passed to the
higher level protocol handler in a manner similar to the lower
level dispatch, but on the basis of IP protocol number. The
"standard handler" is TCP. Another protocol handler interprets
IP gateway-host flow control and routing update messages.

Fragmented messages are placed on a fragment reassembly queue,
where incoming fragments are separated by source and destination
address, protocol number, and IP identification field. For each
"connection" (as defined by these fields), a linked list of
fragments is maintained, tagged by fragment offset start and end
byte numbers. As fragments are received, the proper list is
found (or a new one is created), and the new fragment is merged
in by comparing start and end byte numbers with those of
fragments already on the list. Duplicate data is thrown away. A
timer is associated with this queue, and incomplete messages
which remain after timeout are dropped and their storage is
freed. Completed messages are passed to the next level.

4.1.3 TCP Level

At the TCP level, incoming datagrams are processed via calls to a
"TCP machine." This is the TCP itself, which is organized as a
finite state machine whose states are roughly the various states
of the protocol as defined in [4], and whose inputs include
incoming data from the network, user open/close/read/write
requests, and timer events. Input from the network is handled
directly, passing through the above described levels. User
requests and timer events are handled through a work queue.

When a user process executes a network request via system call,
the relevant data (on a read or write) is copied from user to
kernel space (or vice versa), a work entry is enqueued, and the
network process is awakened. Similarly, when timers associated
with TCP (such as the retransmission timer) go off, timer work
requests are enqueued and the network input process is awakened.
Once awakened, it checks for the presence of completed messages
from the network interface and processes them. After these
inputs are processed, the TCP machine is called to handle any
outstanding requests on the work queue. The network process then
sleeps, waiting for more network input or work requests. Thus,
the TCP machine may be called directly with network input, or
awakened indirectly to check its work queue for user and timer
requests.

After reset processing and sequence and acknowledgement number
validation, acceptable received data is sequenced and placed on
the receive queue. This sequencing process is similar to the IP
fragment reassembly algorithm described above. Data placed on
this queue is acknowledged to the foreign host. Received data
whose sequence numbers lie outside the current receive window are
not processed, but are placed on an unacknowledged message queue.
The advertised receive window is determined on the basis of the
remaining amount of buffering allocated to the connection (see
below). When buffering becomes available, data on the
unacknowledged message queue is then processed and placed on the
receive data queue.

On the output side, TCP requests for data transmission result in
calls to the IP level output routine. This routine does
fragmentation, if necessary, and makes calls on the local network
output routine. Outgoing messages are then placed on a buffering
queue, for transmission to the network interface by the device
driver.

In data transmission, an attempt is made to ensure that data
moving from the highest level (TCP), will not be sent unless
there is reasonable certainty that the lower levels will have the
necessary resources to accept the message for transmission to the
network. All data to be sent is maintained on a single send
queue, where data is added on user writes, and removed when
proper acknowledgement is received. Whenever the TCP machine
sends data, a retransmission timer is set, and the sequence
number of the first data byte on the queue is saved. After
initial transmission the sequence number of the next data to send
is advanced beyond what was first sent. If the retransmission
timer
goes off before that data is acknowledged, the sequence number of
the next data to send is backed up, and the contents of the send
buffer (for the length determined by the current send window) are
retransmitted, with the ACK and window fields set appropriately.
The retransmission timer is set with increasingly higher values
from 3 to 30 seconds, if the saved sequence number does not
advance.
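
The timer escalation described above can be sketched in C. The
text gives only the 3 and 30 second bounds and the rule that the
value rises while the saved sequence number does not advance; the
doubling step, the struct rexmt, and the function rexmt_timeout
are assumptions made for illustration and are not taken from the
implementation.

```c
#include <assert.h>
#include <stddef.h>

#define REXMT_MIN 3    /* seconds, lower bound given in the text */
#define REXMT_MAX 30   /* seconds, upper bound given in the text */

/* Hypothetical per-connection retransmission state. */
struct rexmt {
    unsigned long saved_seq;  /* first unacknowledged byte when the timer was set */
    int interval;             /* current timer value, in seconds */
};

/*
 * Called when the retransmission timer goes off. snd_una is the
 * sequence number of the first data byte still on the send queue.
 * Returns the value to use when the timer is set again.
 */
static int rexmt_timeout(struct rexmt *r, unsigned long snd_una)
{
    assert(r != NULL);
    if (snd_una != r->saved_seq) {
        /* The saved sequence number advanced: start over at the minimum. */
        r->saved_seq = snd_una;
        r->interval = REXMT_MIN;
    } else {
        /* No progress: raise the value (doubling is an assumption). */
        r->interval *= 2;
        if (r->interval > REXMT_MAX)
            r->interval = REXMT_MAX;
    }
    return r->interval;
}
```

Under these assumptions a connection that makes no progress would
see the timer values 3, 6, 12, 24, 30, 30, and so on, returning
to 3 once acknowledged data moves the start of the send queue.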
A persistence timer is also set when data is sent. This
allows communication to be maintained if the foreign process
advertises a zero length window. When the persistence timer goes
off, one byte of data is forced out of the TCP.
4.2 Buffering Strategy
As mentioned earlier, all data is passed from the network to
the various protocol software layers in intermediate sized (128
byte) buffers. The buffers have two chain pointers, a data
offset, and a data length field (see Figure 2). As data is read
from the network or copied from the user, multiple buffers are
chained together. Protocol headers are also held in these
buffers. As messages are passed between the various software
levels, the offset is modified to point at the appropriate
header. The length field gives the end of data in a particular
buffer. This offset/length pair facilitates merging of messages
in IP fragment reassembly and TCP sequencing.
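
The offset/length handling can be sketched in C as follows. Only
the 128 byte buffer size, the 10 bytes of control information,
and the presence of two chain pointers, an offset, and a length
come from the text. The field names, the one byte field widths,
and the helpers nb_trim_front, nb_trim_back, and nb_data are
hypothetical. With 4 byte pointers the header below would occupy
10 bytes; with wider pointers the struct exceeds 128 bytes, so it
should be read as a model of the idea rather than an exact layout.

```c
#include <assert.h>
#include <stddef.h>

#define NB_SIZE 128                 /* buffer size given in the text */
#define NB_HDR  10                  /* control bytes per buffer, from the text */
#define NB_DATA (NB_SIZE - NB_HDR)  /* room left for data: 118 bytes */

/* Hypothetical buffer layout; names and widths are assumptions. */
struct netbuf {
    struct netbuf *next;    /* next buffer of the same message */
    struct netbuf *qlink;   /* queue link */
    unsigned char  off;     /* offset of the first valid data byte */
    unsigned char  len;     /* number of valid data bytes */
    unsigned char  data[NB_DATA];
};

/* Pointer to the first valid data byte. */
static unsigned char *nb_data(struct netbuf *b)
{
    return b->data + b->off;
}

/* Drop n leading bytes (a header, or duplicate data) without copying. */
static void nb_trim_front(struct netbuf *b, unsigned n)
{
    assert(n <= b->len);
    b->off = (unsigned char)(b->off + n);
    b->len = (unsigned char)(b->len - n);
}

/* Drop n trailing bytes by shortening the length field. */
static void nb_trim_back(struct netbuf *b, unsigned n)
{
    assert(n <= b->len);
    b->len = (unsigned char)(b->len - n);
}
```

For example, if a fragment holding bytes 0 to 59 arrives when
bytes 0 to 19 are already on the list, nb_trim_front(b, 20)
leaves the buffer describing bytes 20 to 59 with no data moved.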
The allocation of these buffers is handled by the network
software. Buffers are obtained by "stealing" page frames from
the kernel's free memory map (CMAP). In VM/UNIX, these page
frames are 1024 bytes long, and thus have room for eight 128 byte
buffers. The advantage of using kernel paging memory as a source
of network buffers is that their allocation can be done totally
dynamically, with little effect on the operation of the overall
system. Buffers are allocated from a cache of free page frames,
maintained on a circular free list by the network memory
allocator. As the demand for buffers increases, new page frames
are stolen from the paging freelist and added to the network
buffer cache. Similarly, as the need for pages decreases, free
pages are returned to the system. To minimize fragmentation in
buffer allocation within the page frames, the free list is
sorted. When no more pages are available for allocation, data on
the IP reassembly and TCP unacknowledged data queues are dropped,
and their buffers are recycled.
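
As a rough illustration of the arithmetic above, the following C
sketch carves one 1024 byte page frame into eight 128 byte
buffers and links them in ascending address order. It is a
simplification: the text describes a circular free list of page
frames kept by the network memory allocator, while this sketch
uses a plain singly linked list of buffers. The names
carve_page, buf_alloc, and struct freebuf are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define NET_PAGE_SIZE 1024   /* VM/UNIX page frame size, from the text */
#define NET_BUF_SIZE  128    /* network buffer size, from the text */
#define BUFS_PER_PAGE (NET_PAGE_SIZE / NET_BUF_SIZE)   /* eight */

/* A free buffer holds only a link to the next free buffer. */
struct freebuf {
    struct freebuf *next;
};

/*
 * Split one page frame into buffers and push them on the free list.
 * Pushing from the highest address down leaves the list sorted by
 * ascending address. Returns the number of buffers added.
 */
static int carve_page(void *page, struct freebuf **freelist)
{
    unsigned char *p = page;
    int i;

    assert(page != NULL && freelist != NULL);
    for (i = BUFS_PER_PAGE - 1; i >= 0; i--) {
        struct freebuf *b =
            (struct freebuf *)(void *)(p + (size_t)i * NET_BUF_SIZE);
        b->next = *freelist;
        *freelist = b;
    }
    return BUFS_PER_PAGE;
}

/* Take the first free buffer, or return NULL when the list is empty. */
static void *buf_alloc(struct freebuf **freelist)
{
    struct freebuf *b = *freelist;

    if (b != NULL)
        *freelist = b->next;
    return b;
}
```

In this sketch an empty list is the point at which a real
allocator would either steal another page frame or begin
recycling buffers from the IP reassembly and TCP unacknowledged
data queues.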
        ^    |------------------------|   ^
        |    |     -> NEXT BUFFER     |   |
       10    |------------------------|   |
      BYTES  |       QUEUE LINK       |   |
        |    |-----------|------------|   |
        V    |  OFFSET   |   LENGTH   |   |
             |-----------|------------|   |
             |                        |  128
             |                        | BYTES
             |                        |   |
             |        D A T A         |   |
             |                        |   |
             |                        |   |
             |                        |   |
             |                        |   |
             |------------------------|   V

              Figure 2.  Layout of a Network Buffer

The number of pages that can be stolen from the system is limited
to a moderate number (in practice 64-256, depending on network
utilization in a particular system). To enforce fairness of
network resource utilization between connections, the number of
buffers that can be dedicated to a particular connection at any
time is limited. This limit can be varied to some small degree
by the user when a connection is opened. Thus, a TELNET user may
open a connection with the minimum 1K bytes of send and receive
buffering; while an FTP user, anticipating larger transfers,
might desire up to 4K of buffering.

The effect of this connection buffering allocation is to place a
limit on the amount of data that the TCP may accept from the user
for sending before blocking, and the amount of input from the
network that the TCP may acknowledge. Note that in receiving,
the network software may allocate available buffers beyond the
user's connection limit for incoming data. However, this data is
considered volatile, and may be dropped when buffer demands go
higher. Incoming data is acknowledged by TCP only until the
user's connection buffer limit is exhausted. The advertised TCP
flow control window for a connection is set on the basis of the
remaining amount of this buffering.

Thus, the network software must ensure that it has enough
buffering for 1) its own internal use in processing data on the
IP and local network levels; 2) retaining acknowledged TCP data
that have not been copied to user space; and 3) retaining data
accepted by the TCP for transmission which have not yet been
acknowledged by the foreign host TCP. Other data, such as
unacknowledged TCP input from the network and fragments on the IP
reassembly queue are vulnerable to being dropped when demand for
more buffers makes necessary the recycling of buffers on these
queues. Since there is an absolute limit on the number of page
frames that may be stolen from the paging system, and hence the
total number of buffers available, there is a resultant limit on
the total number of simultaneous connections.
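
The relation between the per connection buffer limit, the data
that may be acknowledged, and the advertised window can be
modeled in a few lines of C. This is a sketch under stated
assumptions: accounting is done in bytes rather than buffers, and
struct rcvacct, adv_window, rcv_accept, and rcv_copied_to_user
are hypothetical names that do not appear in the implementation.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical receive accounting for one connection, in bytes. */
struct rcvacct {
    unsigned limit;   /* receive buffering chosen at open (1K to 4K in the text) */
    unsigned held;    /* acknowledged data not yet copied to user space */
};

/* Window to advertise: whatever remains of the connection's limit. */
static unsigned adv_window(const struct rcvacct *a)
{
    return a->held >= a->limit ? 0 : a->limit - a->held;
}

/*
 * Accept and acknowledge as much of nbytes as the limit allows.
 * Returns the number of bytes acknowledged; anything beyond that
 * stays unacknowledged and may be dropped under buffer pressure.
 */
static unsigned rcv_accept(struct rcvacct *a, unsigned nbytes)
{
    unsigned win = adv_window(a);
    unsigned take = nbytes < win ? nbytes : win;

    a->held += take;
    return take;
}

/* The user read nbytes, so that much buffering is available again. */
static void rcv_copied_to_user(struct rcvacct *a, unsigned nbytes)
{
    assert(a != NULL && nbytes <= a->held);
    a->held -= nbytes;
}
```

With a 1024 byte limit, accepting 600 bytes leaves a window of
424; a further 600 byte arrival is acknowledged only up to 424
bytes, and the window reopens as the user reads.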
Several data structures are required for stealing page
frames from the kernel and maintaining the buffer free list.
These include enough page table entries for mapping the maximum
number of page frames which can be stolen from the system, an
allocation map for allocating these page table entries, and the
free page list itself. For a 256 page maximum, this requires 2K
bytes of page tables, 1K bytes for page frame allocation mapping,
and another 1K bytes for the network freelist. The maximum page
parameter and others, including the minimum and maximum amount of
buffering that the user may specify, are modifiable constants of
the implementation.
4.3 Data Structures
Along with the data structures needed to support the buffer
management system, there are several others used in the network
software (see Figure 3). The focus of activity is the user
connection block (UCB) and the TCP control block (TCB). The UCB
is allocated from a table on a per connection basis. It holds
non-protocol specific information to maintain a connection. This
includes a pointer to the UNIX process structure of the opener of a
connection, (2) a pointer to the foreign host entry for the peer
process's host, a pointer to the protocol-specific connection
control block (for TCP, the TCB), pointers to the user's send and
receive data buffer chain, and miscellaneous flags and status
information. When a network connection is opened, an entry in
the user's open file table is allocated, which holds a pointer to
the UCB.
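
A C sketch of a UCB and its table allocation, based only on the
fields listed above, might look like this. The field names, the
table size NUCB, the UC_INUSE flag, and the functions ucb_alloc
and ucb_free are all hypothetical, and the other structures are
left opaque.

```c
#include <assert.h>
#include <stddef.h>

#define NUCB     4      /* table size: arbitrary for this sketch */
#define UC_INUSE 0x01   /* hypothetical flag: slot is allocated */

/* Opaque stand-ins for structures described elsewhere. */
struct proc;     /* UNIX process structure */
struct host;     /* foreign host table entry */
struct tcb;      /* TCP control block */
struct netbuf;   /* network buffer */

/* Hypothetical user connection block; names are assumptions. */
struct ucb {
    struct proc   *uc_proc;   /* opener of the connection */
    struct host   *uc_host;   /* foreign host entry for the peer's host */
    struct tcb    *uc_tcb;    /* protocol-specific block (TCB for TCP) */
    struct netbuf *uc_sbuf;   /* user's send data buffer chain */
    struct netbuf *uc_rbuf;   /* user's receive data buffer chain */
    unsigned short uc_flags;  /* miscellaneous flags and status */
};

static struct ucb ucb_table[NUCB];

/* Claim the first free table slot, or return NULL if all are in use. */
static struct ucb *ucb_alloc(struct proc *opener)
{
    int i;

    for (i = 0; i < NUCB; i++) {
        struct ucb *u = &ucb_table[i];
        if ((u->uc_flags & UC_INUSE) == 0) {
            u->uc_proc = opener;
            u->uc_host = NULL;
            u->uc_tcb = NULL;
            u->uc_sbuf = NULL;
            u->uc_rbuf = NULL;
            u->uc_flags = UC_INUSE;
            return u;
        }
    }
    return NULL;
}

/* Release a slot so a later open can reuse it. */
static void ucb_free(struct ucb *u)
{
    assert(u != NULL && (u->uc_flags & UC_INUSE) != 0);
    u->uc_flags = 0;
}
```

In this model the entry in the user's open file table would hold
the pointer returned by ucb_alloc, and a "raw" connection would
simply leave uc_tcb as NULL.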
For TCP connections, a TCB is allocated. All TCBs are chained
together to facilitate buffer recycling. The TCB contains a
pointer to the corresponding UCB, a block of sequence number
variables and state flags used by the TCP finite state machine,
pointers to the various TCP data queues, and flags and state
variables. Protocols other than TCP would have their own control
blocks instead of the TCB. For the "raw" local network and IP
handlers, all necessary information is kept in the UCB.

Finally, there is a foreign host table, where entries are
allocated for each host that is part of a connection. The entry
contains the foreign host's internet address, the number of
outstanding RFNM's for 1822 level host-imp communication, and the
status of the foreign host. Entries in this table are hashed on
the foreign host address.

                                   Foreign Host Table
                                       |--------|
                   Network     |------>|Host Adr|
                  Conn Table   |       |--------|
                  |--------|   |       | #RFNM  |
             |--->|->Proc  |<--+--|    |--------|
             |    |--------|   |  |    | Status |
             |    |->Host  |---|  |    |--------|
 Per User    |    |--------|      |
File Table   |    | ->TCB  |---|  |       TCB
|--------|   |    |--------|   |  |    |--------|
| Flags  |   |    |->S Buf |   |--+--->| ->next |
|--------|   |    |--------|      |    |--------|
| ->UCB  |---|    |->R Buf |      |----| ->UCB  |
|--------|        |--------|           |--------|
                  | Flags  |           |  FSM   |
                  |  and   |           |Sequence|
                  | Status |           |  Vars  |
                  |--------|           |--------|
                                       |->Snd Q |
                                       |--------|
                                       |->Rcv Q |
                                       |--------|
                                       |->UnackQ|
                                       |--------|
                                       | Flags  |
                                       |  and   |
                                       | Status |
                                       |--------|

               Figure 3.  Network Data Structures

_______________
(2) For details on data structures specific to UNIX, see [5].

5 References
[1] Babaoglu, O., W. Joy, and J. Porcar, "Design and
Implementation of the Berkeley Virtual Memory Extensions to
the UNIX Operating System," Computer Science Division, Dept.
of Electrical Engineering and Computer Science, University
of California, Berkeley, December, 1979.
[2] Bolt Beranek and Newman, "Specification for the
Interconnection of a Host and an IMP," Bolt Beranek and
Newman Inc., Report No. 1822, May 1978 (Revised).
[3] Postel, J. (ed.), "DoD Standard Internet Protocol," Defense
Advanced Research Projects Agency, Information Processing
Techniques Office, RFC 760, IEN 128, January, 1980.
[4] Postel, J. (ed.), "DoD Standard Transmission Control
Protocol," Defense Advanced Research Projects Agency,
Information Processing Techniques Office, RFC 761, IEN 129,
January, 1980.
[5] Thompson, K., "UNIX Implementation," The Bell System
Technical Journal, 57 (6), July-August, 1978, pp. 1931-1946.

                         Table of Contents

1       Introduction
2       Features of the Implementation
2.1     Protocol Dependent Features
2.1.1   Separation of Protocol Layers
2.1.2   Protocol Functions
2.2     Operating System Dependent Features
2.2.1   Kernel Resident Networking Software
2.2.2   User Interface
3       Design Goals
4       Organization
4.1     Control Flow
4.1.1   Local Network Interface
4.1.2   Internet Protocol
4.1.3   TCP Level
4.2     Buffering Strategy
4.3     Data Structures
5       References