The Packet-Based Protocol
HyperTransport employs a packet-based protocol in which all information (address, commands, and data) travels in packets that are multiples of four bytes each. Packets are used in link management (e.g. flow control and error reporting) and as building blocks in constructing more complex transactions such as read and write data transfers.
It should be noted that, while packet descriptions in this chapter are in terms of bytes, the link's bidirectional interface width (2, 4, 8, 16, or 32 bits) ultimately determines the amount of packet information sent during each bit time on HyperTransport links. There are two bit times per clock period.
Before looking at packet function and use, the following sections describe the mechanics of packet delivery over 2-, 4-, 8-, 16-, and 32-bit scalable link interfaces.
8 Bit Interfaces
For 8-bit interfaces, one byte of packet information may be sent in each bit time. For example, a 4-byte request packet would be sent by the transmitter during four adjacent bit times, least significant byte first as shown in Figure 4-1 on page 61. Total time to complete a four-byte packet is two clock periods.
Figure 4-1. Four Byte Packet On An 8-Bit Interface
Interfaces Narrower Than 8 Bits
For link interfaces which are narrower than 8 bits, the first byte of packet information is shifted out over multiple bit times, least significant bits first. Referring to Figure 4-2 on page 62, a 2-bit interface would require four bit times to transmit each byte of information. After the first byte is sent, subsequent bytes in the packet are shifted out in the same manner. Total time to complete a four-byte packet is 16 bit times, or eight clock periods.
Figure 4-2. Four Byte Packet On A 2-Bit Interface
Interfaces Wider Than 8 Bits
For 16 or 32 bit interfaces, packet delivery is accelerated by sending multiple bytes of packet information in parallel with each other.
16 Bit Interfaces
On 16-bit interfaces, two bytes of information may be sent in each bit time. Referring to Figure 4-3 on page 63, note that even numbered bytes travel on the lower portion of the 16 bit interface, odd numbered bytes on the upper portion.
Figure 4-3. Four Byte Packet On A 16-Bit Interface
32 Bit Interfaces
Similarly, four bytes of information may be sent in each bit time on a 32-bit interface. This is shown in Figure 4-4 on page 64. Note that Byte 0 travels on the low portion of the interface (CAD0-7), Byte 1 on the second byte lane (CAD8-15), etc.
Figure 4-4. Four Byte Packet On A 32-Bit Interface
A reminder: Because all HyperTransport packets are multiples of 4 bytes, bits of packet information always divide evenly into the available bus width. There never is a need to "pad" unused bit lanes.
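The delivery mechanics described above can be sketched in a few lines of Python. This is only an illustrative model, not part of the specification: the function names bit_times and serialize are hypothetical, and the packet contents are arbitrary sample bytes.

```python
def bit_times(packet_bytes: int, width_bits: int) -> int:
    """Bit times needed to move a packet across a link of the given width."""
    if width_bits not in (2, 4, 8, 16, 32):
        raise ValueError("unsupported link width")
    if packet_bytes % 4:
        raise ValueError("packets are multiples of four bytes")
    return packet_bytes * 8 // width_bits


def serialize(packet: bytes, width_bits: int) -> list[int]:
    """Return the CAD value driven in each bit time, least significant bits first."""
    value = int.from_bytes(packet, "little")  # byte 0 occupies the low bits
    lane_mask = (1 << width_bits) - 1
    return [(value >> (i * width_bits)) & lane_mask
            for i in range(bit_times(len(packet), width_bits))]


packet = bytes([0x11, 0x22, 0x33, 0x44])  # arbitrary four-byte packet
for width in (2, 8, 16, 32):
    n = bit_times(len(packet), width)
    # Two bit times per clock period
    print(f"{width:2d}-bit link: {n} bit times, {n / 2} clocks")
```

On a 16-bit link the model places Byte 0 in the low half and Byte 1 in the high half of the first bit time, matching the lane assignment described for Figure 4-3.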
The Two Packet Types: Control And Data
Packets moving across links fall into two groups: control packets and data packets. Control packet types are further divided into three additional classes: Information, Request, and Response.
Control Packet Purpose
The three classes of control packets serve the following purposes on a HyperTransport link:
Information packets
Information packets are always 4 bytes each. They are used for nearest neighbor communication between the transmitter-receiver pairs on each link; communication between these nodes is necessary for dynamic flow control updates and other miscellaneous functions. Information packets are not buffered internally or subject to flow control; when sent by a transmitter they must be accepted by the receiver.
Request packets
Requests are 4 bytes in length if there is no address field, or 8 bytes if the packet does include an address field. They may be either posted or non-posted, and the basic job of a request is to define a pending data or message transaction, or to help bridges manage posted write transactions (through the use of Flush and Fence commands). These packets originate at a source device and are accepted by a target device.
Devices in the path between the source and target forward requests along, subject to HyperTransport rules for ordering.
Response packets
Responses are always 4 bytes each. They are returned by the target after it has serviced a non-posted request. Devices in the path between the response sender and the original requester forward responses along, subject to HyperTransport rules for ordering.
When associated with a non-posted write or flush request, the target done response packet acts as a confirmation (returned to the source device) that the operation has completed. In the event of a problem delivering non-posted write data or completing the flush, the response packet will contain an error flag and a bit indicating whether the target done response is being returned by the intended target OR by another device acting on its behalf (e.g. end-of-chain device).
For read transactions, which are always split in HyperTransport, the read response packet precedes the returning data and identifies the specific read request being serviced. In the event of an error when fetching the data, the read response will contain an error flag and a bit indicating whether the problem occurred at the intended target or at an end-of-chain device acting on its behalf. If there is an error, all data is driven back as FFh by either the target or the end-of-chain device.
Data Packets
While there is only one type of data packet, consisting of 1-16 dwords, the payload of valid information within a data packet ranges from 0-64 valid bytes, depending on the attributes of the request that caused it. The appropriate time to send a data packet also depends on the request/response associated with it:
For write requests, the data packet is sent immediately after the request. Because there is no routing information in a data packet, the request is used to deliver the data to the intended target.
For read requests, the data packet immediately follows the read response. Key fields in the response are filled in with requester transaction stream information provided in the read request (e.g. UnitID and Source Tag). The response is then used to route the read data packet back to the original requester.
Atomic read-modify-write requests are a hybrid. A data packet is sent with the request (as in a write transaction) and another data packet is returned following the read response (as in a read transaction).
Finally, some requests don't have data packets at all (e.g. Flush and Fence).
The Need To Interleave Control And Data Packets
An important feature of HyperTransport packet management is that a transmitter may interleave control packets with data packets associated with earlier requests. Interleaving control packets with data helps mitigate "stalls" in sending new control packets on the multiplexed CAD bus when large data transfers are in progress. An example of such a stall is as follows:
A transmitter starts a Sized Dword Write of 64 bytes (16 dwords) on a 2-bit HyperTransport link interface.
After the write transaction commences, the transmitter realizes it needs to send a read request or NOP information packet over the bus.
Without the ability to interleave control packets, the transmitter would have to send the entire data payload first (64 bytes x 4 bit times/byte = 256 bit times). This represents a worst-case latency of 128 clocks to start sending the new control packet.
To avoid such situations, HyperTransport allows a transmitter to insert new control packets into a data payload on four byte boundaries, as long as the control packets do not have any immediate data of their own. For example, read requests and NOP flow control packets are candidates for interleaving; write requests would not be candidates for interleaving because they are accompanied by immediate data.
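The latency arithmetic in this example can be checked with a short calculation. The sketch below assumes a simplified model in which a new control packet waits, at most, for the bytes that must finish before the next permitted insertion point; the function name max_wait_bit_times is hypothetical.

```python
def max_wait_bit_times(payload_bytes: int, width_bits: int, interleaving: bool) -> int:
    """Upper bound on bit times a new control packet waits behind a data packet."""
    # With interleaving, insertion is allowed on every four-byte boundary,
    # so at most one four-byte unit has to finish first.
    blocking_bytes = 4 if interleaving else payload_bytes
    return blocking_bytes * 8 // width_bits


without = max_wait_bit_times(64, 2, interleaving=False)  # 256 bit times
with_il = max_wait_bit_times(64, 2, interleaving=True)   # 16 bit times

# Two bit times per clock period
print(f"no interleaving: {without} bit times = {without // 2} clocks")
print(f"interleaving:    {with_il} bit times = {with_il // 2} clocks")
```

For the 64-byte write on a 2-bit link, the bound drops from 128 clocks to 8 clocks under this model.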
The CTL Signal Indicates Packet Type
A transmitter uses the CTL signal on a HyperTransport link interface to indicate the presence of control vs. data packets it is sending concurrently on the CAD bus. When CTL is asserted (high), a control packet is in transit on the CAD bus; when CTL is deasserted (low), a data packet is being sent. During idle periods, CTL is asserted and control information NOP packets are sent.
When interleaving control packets and data packets on a link, the transmitter is required to observe the following rules as it asserts and deasserts the CTL signal:
CTL is always asserted and deasserted on four byte boundaries.
The only time CTL is deasserted is when a data packet associated with an earlier control packet (e.g., request or response packet) is being sent.
CTL is asserted during all bit times of a control packet; for control packets which are either 4 or 8 bytes, the packet must be sent in its entirety without deasserting CTL (there is no interleaving within control packets). This also means that flow control must assure that transmitters never start sending a control packet if the receiver lacks sufficient buffer space to accept all bytes at full speed. Changes in flow control buffer availability are reported by means of NOP packets.
CTL is deasserted through all bit times of data packets.
Re-assertion of CTL within data packets is permitted on four byte boundaries if the transmitter decides to interleave a new control packet, providing it does not have immediate data of its own. After the control packet is sent, CTL is again deasserted and the current data packet transfer resumes.
Only one data packet may be in progress at a time, although it may be paused for the interleaving of control packet(s).
Ordering of control packets is not affected by the fact that data packets may be paused to interleave them.
The bit time immediately following the end of a data packet is always the start of a control packet, and CTL must be asserted.
Figure 4-5 illustrates packet transmission basics and the use of the CTL signal by a transmitter in accordance with the above rules.
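As a rough illustration of these rules, the following Python sketch expands a stream of four-byte units into the CTL level seen in each bit time and checks that CTL only changes on four-byte boundaries. The string notation ("C" for a four-byte unit of a control packet, "D" for a four-byte unit of data) and both function names are assumptions made for this example only.

```python
CONTROL, DATA = "C", "D"


def ctl_trace(units: str, width_bits: int) -> list[int]:
    """Expand four-byte units ('C' or 'D') into the CTL level for each bit time."""
    per_unit = 32 // width_bits  # bit times needed for one four-byte unit
    trace = []
    for unit in units:
        trace.extend([1 if unit == CONTROL else 0] * per_unit)
    return trace


def changes_on_dword_boundaries(trace: list[int], width_bits: int) -> bool:
    """True if CTL holds one level throughout every four-byte unit."""
    per_unit = 32 // width_bits
    return all(len(set(trace[i:i + per_unit])) == 1
               for i in range(0, len(trace), per_unit))


# 8-byte write request, two dwords of data, an interleaved 8-byte read
# request, the remaining two dwords of data, then an idle NOP.
stream = "CCDDCCDDC"
trace = ctl_trace(stream, 8)
print(trace)
print(changes_on_dword_boundaries(trace, 8))
```

The sketch only models the boundary rule; it does not check that an interleaved control packet has no immediate data of its own.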
Packet Format: Control Packets
Table 4-1 on page 69 summarizes the HyperTransport Information, Request, and Response control packet types and the command names associated with them. Some things to note in the table:
For each packet variant, the virtual channel (VChan) is indicated in the second column: posted, non-posted, or response. Note: information packets do not travel in any of the virtual channels and are not subject to flow control.
The first byte in each control packet type contains a 6-bit Command (CMD) Code. By sending this information at the beginning of a control packet, the receiver is informed immediately of the type of packet being transferred, the number of bytes to expect, and the format of the bit fields contained within. The Command Codes are shown in the left column of Table 4-1.
In some Command Codes, a number of bits are variables (indicated by ".xxx") which are used to select transaction options: dword vs. byte transfer count, isochronous flag, coherency requirement, etc.; refer to the Comments column of Table 4-1 for usage of each optional bit.
Table 4-1. Control Packets And The HyperTransport Command Types

| CMD Code | VChan | Command Name | Packet Type | Comments |
|---|---|---|---|---|
| 000000 | ----- | NOP | Info | Used by each receiver to report flow-control information to its transmitter. |
| 111111 | ----- | Sync/Error | Info | Similar to PCI SERR#, indicates need for link reset and re-synchronization. |
| 101xxx | Posted | Sized Write (Posted) | Request | Usage of three least-significant bits: [2] Dword/Byte (1 = dword; 0 = byte); [1] Isoc request (1 = Isoc; 0 = std.); [0] Coherency (1 = req'd; 0 = not). |
| 001xxx | Non-Posted | Sized Write (Non-Posted) | Request | Usage of three least-significant bits: [2] Dword/Byte (1 = dword; 0 = byte); [1] Isoc request (1 = Isoc; 0 = std.); [0] Coherency (1 = req'd; 0 = not). |
| 111010 | Posted | Broadcast Message | Request | Broadcast messages originate at host bridge, and are accepted and propagated downstream by all devices which see them. |
| 01xxxx | Non-Posted | Sized Read (all reads are non-posted) | Request | Usage of four least-significant bits: [3] Response may pass posted requests (1 = OK; 0 = Do not pass); [2] Dword/Byte (1 = dword; 0 = byte); [1] Isoc (1 = Isochronous; 0 = std.); [0] Coherency (1 = req'd; 0 = not). |
| 000010 | Non-Posted | Flush | Request | Forces all preceding posted writes in same transaction stream to destination (within host). |
| 111100 | Posted | Fence | Request | Forces all preceding posted writes to destination (all virtual channels). |
| 111101 | Non-Posted | Atomic RMW | Request | A non-posted write transaction with a read response. Two variants: Fetch and Add, Compare and Swap. Both variants allow reading, modification, and write back of a "locked" memory location semaphore. |
| 110000 | Resp | Read Response | Response | On read and Atomic RMW transactions, read response precedes the data being returned by target. In the event of a failure in completing the read, error bits in the response indicate the nature of the problem. |
| 110011 | Resp | Target Done | Response | On non-posted write or flush transactions, target done response confirms completion. In the event of a failure, error bits in the response indicate the nature of the problem. |
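To show how a receiver might use the Command Code in the first byte, here is a minimal Python sketch that decodes a 6-bit CMD value using only the encodings listed in Table 4-1. The function name decode_cmd, the dictionary layout, and the option key names are hypothetical; codes not listed in the table are simply reported as unlisted.

```python
FIXED_CODES = {
    0b000000: ("NOP", "Info", None),
    0b111111: ("Sync/Error", "Info", None),
    0b111010: ("Broadcast Message", "Request", "Posted"),
    0b000010: ("Flush", "Request", "Non-Posted"),
    0b111100: ("Fence", "Request", "Posted"),
    0b111101: ("Atomic RMW", "Request", "Non-Posted"),
    0b110000: ("Read Response", "Response", "Response"),
    0b110011: ("Target Done", "Response", "Response"),
}


def decode_cmd(cmd: int) -> dict:
    """Decode a 6-bit command code according to Table 4-1."""
    if not 0 <= cmd <= 0b111111:
        raise ValueError("CMD is a 6-bit field")
    if cmd in FIXED_CODES:
        name, packet_type, vchan = FIXED_CODES[cmd]
        return {"name": name, "type": packet_type, "vchan": vchan}
    options = {
        "dword": bool(cmd & 0b100),     # bit 2: 1 = dword, 0 = byte
        "isoc": bool(cmd & 0b010),      # bit 1: isochronous request
        "coherent": bool(cmd & 0b001),  # bit 0: coherency required
    }
    if (cmd >> 4) == 0b01:              # 01xxxx: Sized Read
        options["response_may_pass_posted"] = bool(cmd & 0b1000)
        return {"name": "Sized Read", "type": "Request",
                "vchan": "Non-Posted", **options}
    if ((cmd >> 3) & 0b011) == 0b001:   # x01xxx: Sized Write, bit 5 = posted
        vchan = "Posted" if cmd & 0b100000 else "Non-Posted"
        return {"name": "Sized Write", "type": "Request",
                "vchan": vchan, **options}
    return {"name": "not listed in Table 4-1", "type": None, "vchan": None}


print(decode_cmd(0b101100))  # posted dword write
print(decode_cmd(0b011101))  # dword read with coherency required
```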
Control Packets: Information
There are two types of Information control packets, NOP and Sync/Error. These four-byte packets are exchanged between the transmitter-receiver pairs on a single link. Unlike request and response packets, information packets are not flow controlled; when one is sent by a transmitter to its corresponding receiver, it must be accepted.
NOP Packet
The NOP (No Operation) command indicates an idle condition on the link. After the link is initialized, each transmitter issues NOP commands continuously unless another command type is required. In addition to indicating the idle condition, these packets inform the device receiving them about changes in the status of flow control buffers and other miscellaneous information concerning link management and diagnostics. Figure 4-6 on page 71 depicts the various fields of the four-byte NOP packet. Table 4-2 immediately following summarizes the usage of each bit field.
Figure 4-6. Control Packets: NOP Information
Table 4-2. HyperTransport NOP Packet Bit Assignments

| Byte | Bit | Function |
|---|---|---|
| 0 | 5:0 | NOP Command Code. This is the six bit command code for a NOP information packet. Value = 000000b. |
| 0 | 6 | DisCon. When this bit is set to a one, the transmitter is indicating that it is starting a LDTSTOP# disconnect sequence. All six buffer release fields must be = 0 when this bit is set (see next two bytes in packet format). |
| 0 | 7 | Reserved. Tie to low level. |
| 1 | 1:0 | PostCmd[1:0]. Number of posted command buffer entries released since last NOP. Two bit field is coded as: 00 = 0, 01 = 1, 10 = 2, 11 = 3 posted command buffer entries released since last NOP. |
| 1 | 3:2 | PostData[1:0]. Number of posted data buffer entries released since last NOP. Two bit field is coded as: 00 = 0, 01 = 1, 10 = 2, 11 = 3 posted data buffer entries released since last NOP. |
| 1 | 5:4 | Response[1:0]. Number of response command buffer entries released since last NOP. Two bit field is coded as: 00 = 0, 01 = 1, 10 = 2, 11 = 3 response buffer entries released since last NOP. |
| 1 | 7:6 | ResponseData[1:0]. Number of response data buffer entries released since last NOP. Two bit field is coded as: 00 = 0, 01 = 1, 10 = 2, 11 = 3 response data buffer entries released since last NOP. |
| 2 | 1:0 | NonPostCmd[1:0]. Number of non-posted command buffer entries released since last NOP. Two bit field is coded as: 00 = 0, 01 = 1, 10 = 2, 11 = 3 non-posted command buffer entries released since last NOP. |
| 2 | 3:2 | NonPostData[1:0]. Number of non-posted data buffer entries released since last NOP. Two bit field is coded as: 00 = 0, 01 = 1, 10 = 2, 11 = 3 non-posted data buffer entries released since last NOP. |
| 2 | 4 | Reserved. Tie to low level. |
| 2 | 5 | Isoc. When set, this bit indicates that flow-control information being sent in this NOP applies to the isochronous virtual channels. Isochronous operation is optional; unless it has been enabled on the link, no isochronous flow-control information should be sent. If this bit is = 0, flow-control information being sent in bytes 1 and 2 applies to standard posted, non-posted, and response virtual channels. |
| 2 | 6 | Diag. (Optional Feature) Software enables CRC testing by writing the CRC Start Test bit in the Link Control Register. When the Diag bit is first detected set = 1, the CRC diagnostic testing phase commences: The receiver, seeing this NOP bit set, ignores its CAD and CTL signals for 512 bit times. Then the transmitter sends any test pattern on the CAD/CTL lines; CRC is checked by the receiver, and errors are logged. If enabled, sync flood will also be performed on a CRC test error. Aside from the CRC check, CAD bus data values are ignored during the test and not retransmitted. |
| 2 | 7 | Reserved. Tie to low level. |
| 3 | 7:0 | Reserved. Tie to low level. |
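The NOP bit assignments in Table 4-2 translate directly into shift-and-mask operations. The following Python sketch packs and unpacks the four-byte NOP packet under that layout; the function and parameter names are hypothetical, and the sketch is illustrative rather than a reference implementation.

```python
def pack_nop(post_cmd=0, post_data=0, response=0, response_data=0,
             nonpost_cmd=0, nonpost_data=0,
             isoc=False, diag=False, discon=False) -> bytes:
    """Build a four-byte NOP packet from the fields of Table 4-2."""
    releases = (post_cmd, post_data, response, response_data,
                nonpost_cmd, nonpost_data)
    if any(not 0 <= r <= 3 for r in releases):
        raise ValueError("each buffer release field is two bits (0-3)")
    if discon and any(releases):
        raise ValueError("buffer release fields must be 0 when DisCon is set")
    byte0 = 0b000000 | (int(discon) << 6)  # command code 000000b, DisCon in bit 6
    byte1 = post_cmd | (post_data << 2) | (response << 4) | (response_data << 6)
    byte2 = nonpost_cmd | (nonpost_data << 2) | (int(isoc) << 5) | (int(diag) << 6)
    return bytes([byte0, byte1, byte2, 0])  # byte 3 is reserved


def unpack_nop(packet: bytes) -> dict:
    """Recover the Table 4-2 fields from a four-byte NOP packet."""
    b0, b1, b2, _ = packet
    if b0 & 0x3F:
        raise ValueError("command code is not NOP")
    return {
        "discon": bool(b0 & 0x40),
        "post_cmd": b1 & 3, "post_data": (b1 >> 2) & 3,
        "response": (b1 >> 4) & 3, "response_data": (b1 >> 6) & 3,
        "nonpost_cmd": b2 & 3, "nonpost_data": (b2 >> 2) & 3,
        "isoc": bool(b2 & 0x20), "diag": bool(b2 & 0x40),
    }


nop = pack_nop(post_cmd=3, response=1, nonpost_data=2, isoc=True)
print(nop.hex(), unpack_nop(nop))
```

Because each field holds at most 3, a receiver that has freed more than three buffers of one type would need to report them over several NOP packets.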
Sync/Error Packet
If a reset or error condition occurs which requires a re-synchronization of HyperTransport devices, a "sync flood" pattern may be issued. All bit fields of a Sync/Error packet are 1's, allowing a device to detect and decode a Sync packet even if it has a corrupt sense of clock rate and link width. Each transmitter that drives the Sync pattern holds it until the link resets and re-synchronizes. Any receiver on an 8-, 16-, or 32-bit link assumes it has detected a Sync event if it decodes sync packets or if all 1's are received for 16 bit times on the lowest 8 bits of the link; this time is extended to 32 bit times on a 4-bit link interface and 64 bit times on a 2-bit link interface.
The Sync/Error information packet is illustrated in Figure 4-7 on page 74 using normal decode logic. Table 4-3 on page 74 defines the Sync packet bit fields.
Figure 4-7. Control Packets: Sync Information
Table 4-3. HyperTransport Sync Packet Bit Assignments

| Byte | Bit | Function |
|---|---|---|
| 0 | 5:0 | Sync Command Code. This is the six bit command code for a Sync information packet. Value = 111111b. |
| 0 | 7:6 | Reserved. Must be driven to 1's. |
| 3:1 | 7:0 | Reserved. Must be driven to 1's. |
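The all-1's detection rule for sync flood can be modeled as a simple run-length check. The Python sketch below covers only that rule (not the normal packet decode path) and assumes the receiver sees one CAD sample per bit time; the function names are hypothetical.

```python
def sync_threshold(width_bits: int) -> int:
    """Consecutive all-1's bit times that signal a Sync event."""
    if width_bits not in (2, 4, 8, 16, 32):
        raise ValueError("unsupported link width")
    return {2: 64, 4: 32}.get(width_bits, 16)


def detects_sync(samples: list[int], width_bits: int) -> bool:
    """True if the low lanes (at most 8 bits) stay all 1's long enough."""
    lanes = min(width_bits, 8)  # only the lowest 8 bits are examined
    all_ones = (1 << lanes) - 1
    needed = sync_threshold(width_bits)
    run = 0
    for sample in samples:
        run = run + 1 if (sample & all_ones) == all_ones else 0
        if run >= needed:
            return True
    return False


print(detects_sync([0xFF] * 16, 8))   # sixteen all-1's bit times on an 8-bit link
print(detects_sync([0x3] * 63, 2))    # one bit time short on a 2-bit link
```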
Control Packets: Requests
As shown previously in Table 4-1 on page 69, there are a number of different request types; each variant has a slightly different way of using the fields within its request packet. In this section, the basic packet format layout used by the principal request types is covered, including Sized Read (always non-posted), Sized Write (posted and non-posted), Broadcast Message (always posted), Flush (always non-posted), Fence (always posted), and Atomic Read-Modify-Write (always non-posted).
Sized Read And Sized Write Requests
The eight-byte sized read and sized write packets (abbreviated RdSized and WrSized in the Specification) are the mainstream commands used to perform most of the data transfers to either memory or I/O in HyperTransport. Some of the options available with sized read and write requests are:
Byte or dword read/write data transfers; valid data transferred ranges from 0 bytes to 64 bytes (16 dwords).
Posted or non-posted virtual channel for writes. Reads are always split transactions traveling in the non-posted virtual channel.
Isochronous posted or non-posted virtual channels for the request and any subsequent response. Isochronous flow control buffers are required to support this traffic.
Coherency option bit which indicates whether the transaction requires enforcement of host cache coherency. If the transaction does not target host memory, this feature does not apply.
Assignment of a non-zero Sequence ID attribute to requests forces other devices to maintain strict ordering for all requests from same source. A Sequence ID of 0 indicates that there is no strict ordering required.
Use of reserved ranges in RdSized and WrSized request packet address fields to support special-case transactions, including configuration cycles, interrupt requests, and End-Of-Interrupt (EOI) messages, etc.
Generic RdSized And WrSized Request Packet Format
Figure 4-8 on page 76 depicts the various fields of the eight-byte Sized Read or Sized Write packet. Table 4-4 on page 76 summarizes the usage of each bit field.
Figure 4-8. Control Packets: Generic Sized Read/Sized Write Requests
Table 4-4. HyperTransport Sized Read/Write Packet Bit Assignments

| Byte | Bit | Function |
|---|---|---|
| 0 | 5:0 | Command Code. This is the six bit command code for RdSized and WrSized requests. x01xxxb = WrSized Request; 01xxxxb = RdSized Request. Usage of bits marked "x": refer to Table 4-1 on page 69. |
| 0 | 7:6 | SeqID[3:2]. (also see Byte 1, bits 5,6). This field tags groups of requests that are part of a strongly ordered sequence. The SeqID value is assigned by the requestor; all transactions within the same transaction stream and virtual channel, and having the same non-zero SeqID value must have their ordering maintained. The SeqID value of 0 is reserved, and indicates a transaction is not part of an ordered sequence. |
| 1 | 4:0 | UnitID[4:0]. In a request, this field identifies the source of a transaction. UnitID of 0 is used by host bridges; non-zero UnitIDs are for interior devices. Because of this convention, requests with UnitID = 0 are moving downstream (from the bridge), and requests with UnitID > 0 are moving upstream (from an interior device). Physical devices are allowed to consume multiple UnitIDs. |
| 1 | 6:5 | SeqID[1:0]. (also see Byte 0, bits 6,7). This is the other half of the 4-bit field used to tag groups of requests that are part of a strongly ordered sequence. The SeqID value of 0 is reserved, and indicates a transaction is not part of an ordered sequence. |
| 1 | 7 | PassPW. When set, this bit indicates that this packet may pass packets in the posted request virtual channel of the same transaction stream. If the bit is clear, this packet must stay ordered behind them. |
| 2 | 4:0 | SrcTag[4:0]. This 5-bit field is used as a transaction tag that uniquely identifies all outstanding transactions sourced by the same UnitID. Each UnitID may have up to 32 outstanding transactions at a time. The UnitID and SrcTag values together uniquely identify non-posted requests in a particular transaction stream. The SrcTag field is reserved and not used for posted requests. |
| 2 | 5 | Compat. When set, this bit indicates that this request packet should only be claimed by the system subtractive decode device which is responsible for forwarding transactions to legacy devices (e.g. compatibility bridge). Requests with this bit set originate at the host bridge and travel downstream in the part of the topology called the "compatibility chain." |
| 2 | 7:6 | Mask/Count[1:0]. (also see Byte 3 bits 0,1). This is the lower half of the 4-bit field that defines dword transfer count or valid bytes in a dword transfer. The meaning of this field depends on whether a byte/dword read or write transfer is being done. For (Sized) Byte Read transfers: This field is a 4 bit mask indicating which of the four bytes within the target dword are valid (much like byte enables in PCI). Any mask pattern is valid. For (Sized) Byte Write transfers: This (n-1) field indicates the total number of dwords to be transferred, plus the required dword write mask that precedes data. Example: If 6 dwords containing bytes of interest are to be transferred, the count field would be ((6 + 1)-1) = 6. For (Sized) Dword Read or Write transfers: This field is an n-1 count indicating the total number of dwords to be transferred. Again, a count of 0 = 1 dword; a count of 15d = 16 dwords. |
| 3 | 1:0 | Mask/Count[3:2]. (also see Byte 2 bits 6,7). This is the upper half of the 4-bit field that defines which bytes are valid during a RdSized or WrSized transfer. The meaning of this field depends on whether a byte or dword transfer is being done. Refer to Byte 2, bits 7:6 above. |
| 3 | 7:2 | StartAddress[7:2]. (also see Bytes 4-7 bits 0-7). This field provides the lowest bits of the dword-aligned, 40 bit HyperTransport target start address. Refer to the HyperTransport address map for a detailed description of the address ranges set aside for memory, I/O, configuration cycles, broadcast messages, interrupts, etc. |
| 7:4 | 7:0 | StartAddress[39:8]. (also see Byte 3 bits 2-7). This field provides the upper bits of the 40 bit HyperTransport target start address. |
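As an illustration of how the Table 4-4 fields fit together, the following Python sketch assembles an eight-byte sized request. The function name pack_sized_request and its parameters are hypothetical. The table gives only the range StartAddress[39:8] for bytes 4-7; the sketch assumes that byte 4 carries the lowest of those address bits ([15:8]) and byte 7 the highest ([39:32]).

```python
def pack_sized_request(cmd: int, unit_id: int, src_tag: int, mask_count: int,
                       addr: int, seq_id: int = 0,
                       pass_pw: bool = False, compat: bool = False) -> bytes:
    """Assemble an eight-byte RdSized/WrSized request per Table 4-4."""
    if addr & 0x3 or not 0 <= addr < (1 << 40):
        raise ValueError("address must be dword aligned and fit in 40 bits")
    if not (0 <= unit_id < 32 and 0 <= src_tag < 32):
        raise ValueError("UnitID and SrcTag are 5-bit fields")
    if not (0 <= mask_count < 16 and 0 <= seq_id < 16):
        raise ValueError("Mask/Count and SeqID are 4-bit fields")
    byte0 = (cmd & 0x3F) | (((seq_id >> 2) & 0x3) << 6)   # CMD, SeqID[3:2]
    byte1 = unit_id | ((seq_id & 0x3) << 5) | (int(pass_pw) << 7)
    byte2 = src_tag | (int(compat) << 5) | ((mask_count & 0x3) << 6)
    byte3 = ((mask_count >> 2) & 0x3) | (addr & 0xFC)     # Mask/Count[3:2], Addr[7:2]
    # Assumption: bytes 4-7 carry Addr[39:8], low-order byte first.
    upper = (addr >> 8).to_bytes(4, "little")
    return bytes([byte0, byte1, byte2, byte3]) + upper


# Posted dword write (CMD 101100b) of 16 dwords from UnitID 3
request = pack_sized_request(cmd=0b101100, unit_id=3, src_tag=0,
                             mask_count=15, addr=0x12_3456_7840,
                             seq_id=0b1010)
print(request.hex(" "))
```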
RdSized And WrSized Requests: Transaction Limits
Using the various request packet option bits when constructing RdSized and WrSized transactions makes it possible to perform byte and dword read and write transfers in a number of variations. The following section describes some of the key limits associated with RdSized and WrSized requests.
RdSized And WrSized (Dword) Transactions
Sized dword read and write transactions can transfer any number of contiguous dwords within a 64 byte, address-aligned block. The request packet Mask/Count field provides the number of dwords to be transferred, beginning at the start address and indexing addresses sequentially upward until the limit defined by the Mask/Count field is reached. All bytes in the range are considered valid. Dword read and write start addresses must be dword aligned. If the start address is 64 byte aligned, the transfer may include the entire 64 byte (16 dword) region; if the start address is not 64 byte aligned, the transfer can only go to the end of the current 64-byte address-aligned block. Dword requests which would cross 64 byte address boundaries must be broken into multiple transactions.
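The 64-byte boundary rule can be expressed as a short splitting routine. This Python sketch is one possible way a requester might break up a longer dword transfer; the function name split_dword_transfer is hypothetical, and each result pairs a start address with the n-1 value that would go in the Mask/Count field.

```python
def split_dword_transfer(start_addr: int, dword_count: int) -> list[tuple[int, int]]:
    """Split a dword transfer into (start address, Mask/Count) pairs.

    No request produced crosses a 64-byte address-aligned boundary.
    """
    if start_addr % 4:
        raise ValueError("dword transfers must start on a dword boundary")
    requests = []
    addr, remaining = start_addr, dword_count
    while remaining > 0:
        room = (64 - addr % 64) // 4        # dwords left in the current block
        n = min(room, remaining)
        requests.append((addr, n - 1))      # count field is encoded as n-1
        addr += 4 * n
        remaining -= n
    return requests


print(split_dword_transfer(0x1000, 16))  # aligned: one request covers the block
print(split_dword_transfer(0x1030, 8))   # unaligned: split at 0x1040
```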
RdSized (Byte) Transactions
Sized byte read transactions can transfer any combination of bytes within one address-aligned dword; requests which would cross an aligned dword address boundary must be broken into multiple transactions. The request packet Mask/Count field provides the "byte enable" mask pattern, indicating which bytes are valid. Mask[0] qualifies byte 0, Mask[1] qualifies byte 1, etc. Any mask pattern is legal; mask bits can be ignored by targets reading from "pre-fetchable" locations (all four bytes in the target dword are always returned).
WrSized (Byte) Transactions
Sized byte write transactions can transfer any combination of bytes within a 32-byte address-aligned region. The request packet Mask/Count field provides the total number of dwords to be transferred including the required single dword "write mask" pattern. The mask itself is sent just ahead of the data byte payload, and indicates which of the data bytes that follow are valid. Mask bit [0] qualifies byte 0, Mask bit [31] qualifies byte 31, etc. The byte write start address must be dword aligned. If the start address is 32 byte aligned, the write transfer may be as large as the entire 32 byte (8 dword) region; if the start address is not 32 byte aligned, the transfer can only go to the end of the current 32 byte address-aligned block. Basically, start address bits [4:2] identify the first valid dword of data within the 32-byte region defined by start address bits [39:5]. Byte write requests which would cross 32 byte address boundaries must be broken into multiple transactions. A couple of subtle things about these transfers:
The entire dword (32 bit) mask is always sent ahead of the data payload, regardless of start address and number of bytes being transferred. Mask bit fields are cleared for all invalid bytes in the 32-byte region ahead of the start address, for all invalid bytes within the transfer range itself, and for all unsent bytes remaining in the 32-byte region beyond the transfer limit implied by the Mask/Count field.
While it isn't illegal to send invalid dwords at the front and back of a WrSized (Byte) transfer, it is more efficient to adjust the start address and Mask/Count field to trim off completely invalid dwords in front of the first and after the last dwords containing at least one valid byte in the 32 byte aligned region.
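The mask and count rules for byte writes can be sketched as follows. This Python example takes the set of byte addresses to be written, trims completely invalid dwords at the front and back as suggested above, and returns the start address, the Mask/Count field, and the 32-bit write mask. The function name byte_write_fields is hypothetical, and the sketch assumes mask bit i corresponds to byte i of the 32-byte aligned region.

```python
def byte_write_fields(byte_addrs: list[int]) -> tuple[int, int, int]:
    """Return (start address, Mask/Count field, 32-bit mask) for a byte write."""
    addrs = sorted(set(byte_addrs))
    if not addrs:
        raise ValueError("at least one byte address is required")
    base = addrs[0] & ~0x1F                    # 32-byte aligned region base
    if addrs[-1] - base >= 32:
        raise ValueError("crosses a 32-byte boundary; split the request")
    mask = 0
    for a in addrs:
        mask |= 1 << (a - base)                # mask bit i qualifies byte i
    first_dword = (addrs[0] - base) // 4       # first dword with a valid byte
    last_dword = (addrs[-1] - base) // 4       # last dword with a valid byte
    data_dwords = last_dword - first_dword + 1
    start_addr = base + 4 * first_dword
    count_field = (data_dwords + 1) - 1        # mask dword + data, encoded n-1
    return start_addr, count_field, mask


start, count, mask = byte_write_fields([0x2005, 0x2006, 0x200A])
print(hex(start), count, hex(mask))
```

In this example the valid bytes fall in dwords 1 and 2 of the region, so two data dwords follow the mask dword and the count field is 2.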
RdSized And WrSized Requests: Other Notes
Coherency
The coherency bit in the Command field of RdSized and WrSized request packets (Byte 0, bit 0) indicates whether host cache coherency is a concern when HyperTransport RdSized and WrSized requests target host memory. Some buses, such as PCI, require coherency enforcement any time a transaction originating in the I/O subsystem targets main memory. This can represent a serious performance hit as processors spend much of their time snooping internal caches for accesses which they may not cache anyway.
HyperTransport uses the coherency bit in the Command field of the request packet to inform the system whether coherency actions are required. If the coherency bit is set:
All HyperTransport writes targeting host memory result in the CPU updating or invalidating the relevant cache line.
All HyperTransport reads targeting main memory must result in the latest copy being returned to the requestor. If the CPU has a modified cache line, the system must assure that this is the one returned to the requestor.
If a device has no particular requirement for coherency, it may choose to keep the coherency bit cleared. In this case, the request will complete without any coherency events.
Special Case: Forcing A Coherency Event. A RdSized (byte) targeting host memory with all Mask/Count bits set = 0 (no valid bytes) and coherency bit set = 1 in the request packet Command field causes a host coherency action, using the address provided in the read. One dword of invalid data will be returned.
WrSized Requests And The Posted Bit
Sized write request packets may or may not set the posted bit (bit 5 of the CMD field). The implications of this bit are as follows:
If set, the bit indicates the write request will travel in the posted request virtual channel and that there will not be a response from the target. Each device in the transaction path may de-allocate its buffers as soon as the posted request is transmitted. This also means that the SrcTag field is not used (reserved) because posted writes have no outstanding responses to track. This is in contrast to non-posted requests which require a unique SrcTag field for each request issued.
If the posted bit is not set, the requestor expects a confirmation that the data written has reached the destination, and is willing to suffer the performance penalty and wait for it. Eventually, a Target Done response will be routed back to the original requestor. In HyperTransport, certain address ranges require non-posted writes; this includes configuration and I/O cycles.
Errors During RdSized Transactions
In the event of a read error (RdSized command), a response and all requested data is returned to the requestor, even though some or all of the data is not valid. Proceeding with a "dummy" read of invalid data is mainly for the benefit of devices in the transaction path that have already allocated flow control buffer space for the returning data. These devices use the return of each byte to simplify de-allocation of buffer space.
PassPW and Response May Pass Posted Requests bits
HyperTransport supports the strict producer-consumer ordering model found in PCI systems. There are occasions when strict producer/consumer ordering may not be required. In these cases, devices are allowed some flexibility in reordering of posted and non-posted request packets, as well as response packets. Ordering rules, including relaxed ordering, are described in more detail in the chapter entitled Ordering. Relaxing ordering rules is application-specific, and may provide better system performance in some cases.
The source of a transaction indicates whether or not relaxed ordering is permitted through the setting or clearing of two bits in a request:
PassPW bit. The PassPW request packet bit (Byte 1, bit 7) is programmed in the request packet and affects how ordering rules are applied to the request as it moves toward the target. If PassPW is set (= 1), relaxed ordering is enabled; if PassPW is clear, relaxed ordering is not allowed.
Response May Pass Posted Requests bit. For RdSized transactions, there is also a bit in the Command field of the RdSized request packet called Response May Pass Posted Requests (Byte 0, bit 3). The state of this bit is replicated in the PassPW bit of the returning response and affects how ordering rules are applied to the response as it moves back to the original source. The Response May Pass Posted Requests bit does not apply to commands other than RdSized. For reads, the bit should be cleared if the strict producer/consumer ordering model is required; otherwise this bit and the PassPW bit should both be set in the request.
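The two bit positions just described can be illustrated with a short sketch. The function names are hypothetical; only the bit positions (Byte 1, bit 7 and Byte 0, bit 3) come from the text.

```python
PASS_PW_MASK = 0x80        # Byte 1, bit 7 of a request packet
RESP_PASS_PW_MASK = 0x08   # Byte 0, bit 3 of a RdSized request packet


def request_ordering_bits(byte0: int, byte1: int) -> dict:
    """Extract the two relaxed-ordering bits from the first two request bytes."""
    return {
        "pass_pw": bool(byte1 & PASS_PW_MASK),
        "resp_pass_pw": bool(byte0 & RESP_PASS_PW_MASK),  # meaningful for RdSized only
    }


def response_pass_pw(rdsized_byte0: int) -> bool:
    """PassPW in the read response replicates the request's Byte 0, bit 3."""
    return bool(rdsized_byte0 & RESP_PASS_PW_MASK)
```

For a read that must follow strict producer/consumer ordering, both values would be expected to be false.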
Compatibility Bit
In keeping with PCI subtractive decoding, HyperTransport may use the Compat bit in RdSized and WrSized request packets (Byte 2, bit 5) to enable them to reach legacy hardware (e.g. boot firmware) behind the system subtractive decoder. When the Compat bit is set, all system devices should pass the request downstream through the "compatibility chain" to the subtractive decoder. Only the subtractive decoder may claim these transactions. The Compat bit is reserved and must not be set for upstream requests or configuration cycles.
Broadcast Message Requests
The eight-byte Broadcast Message request initiates a global message to all enabled HyperTransport devices. Broadcast Messages are issued by host bridges and travel only in the downstream direction. Implementation of Broadcast Message schemes is system-specific, so the use of the address and many other fields is left to designers. The basic format is shown in Figure 4-9 on page 82. Table 4-5 on page 83 summarizes the usage of each defined bit field.
Figure 4-9. Control Packets: Broadcast Message Request
Table 4-5. HyperTransport Broadcast Message Packet Bit Assignments
Byte | Bit | Function
0 | 5:0 | Broadcast Message Request Command Code. This is the six bit command code for a Broadcast Message request packet. Value = 111010b.
0 | 7:6 | SeqID[3:2]. (also see Byte 1, bits 5,6). This field tags groups of requests that are part of a strongly ordered sequence. The SeqID value is assigned by the requestor; all transactions within the same transaction stream and virtual channel, and having the same non-zero SeqID value, must have their ordering maintained. The SeqID value of 0 is reserved, and indicates a transaction is not part of an ordered sequence.
1 | 4:0 | UnitID[4:0]. Must be 0. In a request, this field identifies the source of a transaction. UnitID of 0 is used by host bridges; non-zero UnitIDs are for interior devices. Because of this convention, requests with UnitID = 0 (such as Broadcast Message) only move downstream.
1 | 6:5 | SeqID[1:0]. (also see Byte 0, bits 6,7). This is the other half of the 4-bit field used to tag groups of requests that are part of a strongly ordered sequence. The SeqID value of 0 is reserved, and indicates a transaction is not part of an ordered sequence.
1 | 7 | PassPW. Reserved because a Broadcast Message always travels in the posted virtual channel, so a response is not required.
2 | 7:0 | Reserved. These bits are reserved for a Broadcast Message because SrcTag isn't needed (posted request), Mask/Count isn't needed (no data packet), and the Compatibility bit is never set for these messages.
3 | 1:0 | Reserved. Mask/Count isn't needed for Broadcast Messages (no data packet).
3 | 7:2 | Start Address[7:2] (also see Bytes 4-7 bits 0-7). This field provides the lowest bits of the dword-aligned, 40 bit HyperTransport target start address. Broadcast Message usage of this field is system specific.
7:4 | 7:0 | Start Address[39:8] (also see Byte 3 bits 2-7). This field provides the upper bits of the 40 bit HyperTransport target start address. Broadcast Message usage of this field is system specific.
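To make the Broadcast Message layout concrete, here is a minimal packing sketch based on the bit assignments in Table 4-5. The function name is hypothetical, and placing Start Address[15:8] in Byte 4 (ascending order through Byte 7) is an assumption, since the table only states that Bytes 4-7 carry Start Address[39:8].

```python
BROADCAST_CMD = 0b111010  # command code from Table 4-5


def pack_broadcast(addr: int, seq_id: int = 0) -> bytes:
    """Build an 8-byte Broadcast Message request (sketch, not a reference encoder)."""
    if not 0 <= addr < (1 << 40) or addr & 0x3:
        raise ValueError("address must be a dword-aligned 40-bit value")
    if not 0 <= seq_id <= 0xF:
        raise ValueError("SeqID is a 4-bit field")
    byte0 = BROADCAST_CMD | (((seq_id >> 2) & 0x3) << 6)  # Cmd + SeqID[3:2]
    byte1 = (seq_id & 0x3) << 5   # UnitID = 0, SeqID[1:0], PassPW reserved (0)
    byte2 = 0                     # reserved for Broadcast Message
    byte3 = addr & 0xFC           # Start Address[7:2]; bits 1:0 reserved
    upper = (addr >> 8).to_bytes(4, "little")  # Start Address[39:8], assumed order
    return bytes([byte0, byte1, byte2, byte3]) + upper
```

Because the meaning of the address field is system-specific for Broadcast Messages, a real design would define its own interpretation of `addr`.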
Flush Requests
One of the hazards of posted write buffers is that there is no certainty about when the data actually arrives at the destination because no response is ever expected (or sent). The four-byte Flush request guarantees that all previous posted writes within the same transaction stream are "globally visible" in host memory. Flush behaves like a dummy read operation in that it is a non-posted request followed by a response (Target Done) which simply indicates that the Flush operation is complete all of the way to the host bridge.
The Flush request format is shown in Figure 4-10 on page 85. Table 4-6 immediately following summarizes the usage of each defined bit field.
Figure 4-10. Control Packets: Flush Request
Table 4-6. HyperTransport Flush Packet Bit Assignments
Byte | Bit | Function
0 | 5:0 | Flush Request Command Code. This is the six bit command code for a Flush request packet. Value = 000010b.
0 | 7:6 | SeqID[3:2]. (also see Byte 1, bits 5,6). Must be 0. This is half of the 4-bit field used to tag groups of requests that are part of an ordered sequence within a particular transaction stream and virtual channel. The SeqID value must be 0 for Flush requests because they are never part of an ordered sequence.
1 | 4:0 | UnitID[4:0]. This field identifies the source of the Flush request.
1 | 6:5 | SeqID[1:0]. (also see Byte 0, bits 6,7). Must be 0. This is the other half of the 4-bit field used to tag groups of requests that are part of an ordered sequence within a particular transaction stream and virtual channel. The SeqID value must be 0 for Flush requests because they are never part of an ordered sequence.
1 | 7 | PassPW. Must be 0 in a Flush operation in order for the Flush to accomplish its task of pushing posted writes ahead of it.
2 | 4:0 | SrcTag[4:0]. This 5-bit field is used as a transaction tag that uniquely identifies all transactions in progress by the same UnitID. Each UnitID may have up to 32 outstanding transactions at a time. The UnitID and SrcTag values together uniquely identify non-posted requests in a particular transaction stream, including Flush.
2 | 7:5 | Reserved. Mask/Count and Compat bits are reserved in Flush request packets because no data is returned with the Target Done response and these requests never target the compatibility bus.
3 | 7:0 | Reserved.
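The Flush layout in Table 4-6 is small enough to pack by hand. The sketch below is illustrative only; the function name is hypothetical, and the field positions and command code are taken from the table.

```python
FLUSH_CMD = 0b000010  # command code from Table 4-6


def pack_flush(unit_id: int, src_tag: int) -> bytes:
    """Build a 4-byte Flush request (sketch). SeqID and PassPW must be 0."""
    if not 0 <= unit_id <= 0x1F:
        raise ValueError("UnitID is a 5-bit field")
    if not 0 <= src_tag <= 0x1F:
        raise ValueError("SrcTag is a 5-bit field (up to 32 outstanding requests)")
    byte0 = FLUSH_CMD   # SeqID[3:2] = 0 in bits 7:6
    byte1 = unit_id     # SeqID[1:0] = 0, PassPW = 0
    byte2 = src_tag     # bits 7:5 reserved
    byte3 = 0           # reserved
    return bytes([byte0, byte1, byte2, byte3])
```

The requestor would keep the (UnitID, SrcTag) pair outstanding until the matching Target Done response returns.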
Flush Requests: Transaction Limits
The Flush request is a tool used to manage posted writes headed toward host memory. Two important limitations of the Flush request are:
If the posted writes target memory other than host memory (e.g. peer-to-peer transfers), then the flush request and response only guarantee that the posted writes have reached the destination host bridge, not the ultimate target. After the host bridge re-issues all peer-to-peer requests downstream towards the intended targets, it sends the target done response back to the original requestor; it is entirely possible the flush response (target done) will reach the original requestor before the request is seen at the target.
Flushes have no impact on the isochronous virtual channels. If isochronous flow control is not enabled on a link, then packets which do have the Isoc bit set actually travel in the normal virtual channels and will be affected by Flush requests.
Fence Requests
Another tool in the management of posted write transactions is the HyperTransport Fence command. The main features of the Fence request are:
A Fence request provides a barrier between posted writes which applies to all UnitIDs (transaction streams). This is different from the Flush, which is specific to the posted writes associated with a single transaction stream. When the Fence is decoded by the bridge, it sends any previously posted writes in its buffers toward memory. As always, ordering is maintained for posted writes within individual transaction streams, but no particular ordering is required between different streams.
The Fence request travels in the posted virtual channel, meaning that there is no response expected or sent.
The Fence request format is shown in Figure 4-11 on page 87. Table 4-7 immediately following summarizes the usage of each defined bit field.
Figure 4-11. Control Packets: Fence Request
Table 4-7. HyperTransport Fence Packet Bit Assignments
Byte | Bit | Function
0 | 5:0 | Fence Request Command Code. This is the six bit command code for a Fence request packet. Value = 111100b.
0 | 7:6 | SeqID[3:2]. (also see Byte 1, bits 5,6). Must be 0. This is half of the 4-bit field used to tag groups of requests that are part of an ordered sequence within a particular transaction stream and virtual channel. The SeqID value must be 0 for Fence requests because they are never part of an ordered sequence.
1 | 4:0 | UnitID[4:0]. This field identifies the source of the Fence request.
1 | 6:5 | SeqID[1:0]. (also see Byte 0, bits 6,7). Must be 0. This is the other half of the 4-bit field used to tag groups of requests that are part of an ordered sequence within a particular transaction stream and virtual channel. The SeqID value must be 0 for Fence requests because they are never part of an ordered sequence.
1 | 7 | PassPW. Must be 0 in a Fence operation in order for the Fence to accomplish its task of pushing all previously posted writes ahead of it.
2 | 7:0 | Reserved. SrcTag, Mask/Count and Compat bits are reserved in Fence request packets because posted requests don't use SrcTags, no data is associated with the Fence request, and these requests never target the compatibility bus.
3 | 7:0 | Reserved.
Fence Requests: Transaction Limits
The Fence request is a tool used to manage posted writes headed toward host memory from all transaction streams. Limitations of the Fence request include:
Fence requests are issued from a device to a host bridge, or from one host bridge to another. While a tunnel forwards fence requests it sees, tunnels and single-link cave devices are never the target of a fence request and are never required to perform the fence function internally.
Fences have no impact on the isochronous virtual channels. If isochronous flow control is not enabled, then other packets which do have the Isoc bit set actually travel in the normal virtual channels and will be affected by fence requests.
If a fence request is seen by an end-of-chain device, it decodes the transaction and drops it. It may optionally choose to log the event as an end-of-chain error.
Atomic Read-Modify-Write Requests
While sized read and sized write requests can handle most general purpose HyperTransport data transfers, there are times when a combined, or atomic, read/write command is needed.
Two Problems In Shared Memory Schemes
Two problems related to shared memory schemes include:
A memory location may be used for storing a "semaphore" to be checked by multiple devices (e.g. CPUs or I/O masters) before using a shared system resource. If the contents of the semaphore location indicate the resource is available, the device which reads it then over-writes the semaphore value to indicate the resource is now busy. If another agent reads the semaphore location and sees it is busy, it must wait until the agent using it clears the semaphore location, thus indicating it is again free. The problem arises when a sharing agent has read the semaphore and found the device is not busy. Before it over-writes the data value to claim the resource, another agent reads the semaphore location and also concludes the device is not busy. Now there is a race condition which can result in both devices attempting to over-write the semaphore and use the resource.
The second problem is simpler. If a shared memory location is being used as an accumulator, agents will periodically read the current value, add a constant to it, and write the result back. Again, there is a hazard that the location will be read by one agent and before it can modify it and write it back, another agent may read it with a similar intention. In this case, one of the addends may be lost from the sum.
Most modern bus protocols that support shared memory include a mechanism to avoid the conditions just described. HyperTransport uses the Atomic Read-Modify-Write request for this purpose. The purpose of the Atomic RMW is to force a one-qword (8 byte) memory location to remain "locked" for the duration of the read/modify/write operation required to check and change the targeted location. No other agent is allowed to access the address carried by the Atomic RMW request packet until the entire transaction completes. It is the responsibility of the bridge managing the memory to enforce the locking mechanism.
As a transaction, the Atomic RMW behaves like a non-posted write that generates a read response. The read response is accompanied by a single qword of data: the value read from the targeted memory location before any changes are made.
Atomic RMW Variants
The Atomic Read-Modify-Write request has two variants that are designed to address the two cases just described.
Compare And Swap
The Compare and Swap variant of the Atomic RMW sends two qwords of data with the request. One qword (the compare value) is to be checked against the current value in memory; the other qword (the input value) is the data to be written to the memory location if the compare value is equal to the current value. If the compare value is not equal to the current value, the input value is not written to memory. In either case, a read response will be returned accompanied by the original qword read from memory.
Fetch And Add
The Fetch and Add variant of Atomic RMW sends a single qword (the input value) of data with the request. When the Atomic RMW reaches the bridge to main memory, the bridge unconditionally reads the current value from memory, adds the input value to it, and writes the result back to memory. The memory location remains locked to other transactions while the read-modify-write is in progress. A read response is then returned to the requestor, accompanied by the original qword read from memory.
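The behavior of the two variants can be modeled in a few lines. This is a software sketch of the semantics only, not of the packet exchange: the class and method names are hypothetical, a lock stands in for the bridge's locking of the qword, uninitialized locations are assumed to read as 0, and wrapping the sum to 64 bits is an assumption.

```python
import threading

QWORD_MASK = (1 << 64) - 1


class AtomicMemory:
    """Toy model of a bridge that serializes atomic RMW operations on qwords."""

    def __init__(self):
        self._qwords = {}
        self._lock = threading.Lock()  # stands in for the bridge's lock

    def read(self, addr: int) -> int:
        with self._lock:
            return self._qwords.get(addr, 0)

    def compare_and_swap(self, addr: int, compare: int, new: int) -> int:
        """Write `new` only if memory equals `compare`; return the original qword."""
        self._check_aligned(addr)
        with self._lock:
            original = self._qwords.get(addr, 0)
            if original == compare:
                self._qwords[addr] = new & QWORD_MASK
            return original

    def fetch_and_add(self, addr: int, addend: int) -> int:
        """Unconditionally add `addend`; return the original qword."""
        self._check_aligned(addr)
        with self._lock:
            original = self._qwords.get(addr, 0)
            self._qwords[addr] = (original + addend) & QWORD_MASK  # assumed wrap
            return original

    @staticmethod
    def _check_aligned(addr: int) -> None:
        if addr % 8:
            raise ValueError("atomic RMW requires a qword-aligned address")
```

With this model, an agent claims a semaphore by calling `compare_and_swap(addr, 0, 1)` and checking that the returned original value was 0; a second agent making the same call sees 1 and knows the resource is busy.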
The Atomic RMW request format is shown in Figure 4-12 on page 91. Table 4-8 on page 91 summarizes the usage of each defined bit field.
Figure 4-12. Control Packets: Atomic Read-Modify-Write Request
Table 4-8. HyperTransport Atomic Read-Modify-Write Packet Bit Assignments
Byte | Bit | Function
0 | 5:0 | Atomic RMW Request Command Code. This is the six bit command code for an Atomic Read-Modify-Write request packet. Value = 111101b.
0 | 7:6 | SeqID[3:2]. (also see Byte 1, bits 5,6). This field tags groups of requests that are part of a strongly ordered sequence. The SeqID value is assigned by the requestor; all transactions within the same transaction stream and virtual channel, and having the same non-zero SeqID value, must have their ordering maintained. The SeqID value of 0 is reserved, and indicates a transaction is not part of an ordered sequence.
1 | 4:0 | UnitID[4:0]. This field identifies the source of the Atomic RMW request.
1 | 6:5 | SeqID[1:0]. (also see Byte 0, bits 6,7). This is the other half of the 4-bit field that tags groups of requests that are part of a strongly ordered sequence. The SeqID value is assigned by the requestor; all transactions within the same transaction stream and virtual channel and having the same SeqID value must have their ordering maintained.
1 | 7 | PassPW. Must be 0 in an Atomic RMW operation.
2 | 4:0 | SrcTag[4:0]. This 5-bit field is used as a transaction tag that uniquely identifies all transactions in progress by the same UnitID. Each UnitID may have up to 32 outstanding transactions at a time. The UnitID and SrcTag values together uniquely identify non-posted requests in a particular transaction stream, including Atomic RMW.
2 | 5 | Compat. Normally 0. When set, this bit indicates that this packet should only be claimed by the system subtractive decode device which is responsible for forwarding transactions to legacy devices (e.g. compatibility bridge). Atomic RMW transactions normally target host bridges, so this bit is clear.
2 | 7:6 | Mask/Count[1:0]. (also see Byte 3 bits 0,1). This is the lower half of the 4-bit field used to define which bytes are valid during a transfer. The value programmed in the count field depends on the variant of Atomic RMW request. For Fetch And Add RMW: Count field is set = 1, which indicates 2 dwords (1 qword of data sent with request). For Compare And Swap RMW: Count field is set = 3, which indicates 4 dwords (2 qwords of data sent with request).
3 | 1:0 | Mask/Count[3:2]. (also see Byte 2 bits 6,7). This is the upper half of the 4-bit field that defines which bytes are valid during a transfer. The value programmed in the count field depends on the variant of Atomic RMW request. For Fetch And Add RMW: Count field is set = 1, which indicates 2 dwords (1 qword of data sent with request). For Compare And Swap RMW: Count field is set = 3, which indicates 4 dwords (2 qwords of data sent with request).
3 | 7:3 | Start Address[7:3] (also see Bytes 4-7 bits 0-7). This field provides the lowest bits of the 40 bit HyperTransport target start address. For an Atomic RMW, a qword aligned start address must be provided.
7:4 | 7:0 | Start Address[39:8] (also see Byte 3 bits 3-7). This field provides the upper bits of the 40 bit HyperTransport target start address. (See previous field).
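Because the 4-bit Count field straddles two bytes, a small sketch may help show how it is split and rejoined. The helper names and the `ATOMIC_COUNT` dictionary are hypothetical; the bit positions and the two count values come from Table 4-8.

```python
# Count values per Atomic RMW variant, as listed in the table.
ATOMIC_COUNT = {"fetch_and_add": 1, "compare_and_swap": 3}


def split_count(count: int) -> tuple:
    """Return (byte2_bits, byte3_bits) holding Count[1:0] and Count[3:2]."""
    if not 0 <= count <= 0xF:
        raise ValueError("Count is a 4-bit field")
    byte2_bits = (count & 0x3) << 6    # Count[1:0] in Byte 2, bits 7:6
    byte3_bits = (count >> 2) & 0x3    # Count[3:2] in Byte 3, bits 1:0
    return byte2_bits, byte3_bits


def join_count(byte2: int, byte3: int) -> int:
    """Recover the 4-bit Count value from Byte 2 and Byte 3 of a packet."""
    return ((byte2 >> 6) & 0x3) | ((byte3 & 0x3) << 2)


def dwords_for(count: int) -> int:
    """Count encodes the number of dwords minus one (1 -> 2 dwords, 3 -> 4)."""
    return count + 1
```

The returned bit patterns would be ORed into the SrcTag/Compat bits of Byte 2 and the address bits of Byte 3 when assembling a full request.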
Atomic RMW Requests: Transaction Limits
The Atomic RMW request locks a qword memory address block while a read-modify-write operation is performed. Limitations of the Atomic RMW request include:
The request transfer size, as indicated in the Mask/Count field, is restricted to either one or two qwords. Following the request, a read response returns a single qword of data from memory.
These transactions are designed to be generated by I/O devices or bridges, and target system memory. Other than the host bridge, no HyperTransport devices are expected to support atomic operations. If a target detects an unsupported RMW, it may return a one qword read response with the error bit set or perform a non-atomic read-modify-write. The current HyperTransport Specification does not require peer-to-peer reflection of Atomic RMW.
Control Packets: Responses
There are two response types used in HyperTransport: Read Response and Target Done. Responses are returned by target devices following a non-posted request, and much of the response packet field information is extracted from the requests that caused them. Because responses are routed back to the original requestor either implicitly or based on UnitID, they don't require a 40 bit address field like requests do. All response packets are four bytes.
Read Responses
The four-byte read response is returned when data requests are made, including RdSized and Atomic RMW requests. All HyperTransport read transactions are non-posted and split; this means that data is never returned immediately as it generally is on buses such as PCI. The advantage of split reads is that the latency incurred while a target accesses its internal memory can be hidden: the requestor sends the request, releases the bus, and waits for the target to initiate the return of data when it is ready.
In HyperTransport, the read response is used by the target to indicate the return of previously requested data. The read response immediately precedes the data, and contains the following general information:
The response packet type.
Whether the response should travel in the standard or isochronous virtual channel.
UnitID which acts as an address for responses.
A direction bit indicating whether the response is moving upstream or downstream.
Whether relaxed ordering may be used for this response relative to posted writes moving in the same stream.
Error bits indicating whether or not the returning data can be considered valid; if it is invalid, error bits indicate whether the error occurred at the target or if the request inadvertently reached an end-of-chain device.
Figure 4-13 on page 95 depicts the various fields of the four-byte read response packet. Table 4-9 on page 95 summarizes the usage of each bit field.
Figure 4-13. Control Packets: Read Response
Table 4-9. HyperTransport Read Response Packet Bit Assignments
Byte | Bit | Function
0 | 5:0 | Command Code. This is the six bit command code for the Read Response packet. Value: 110000b
0 | 6 | Reserved.
0 | 7 | Isoc. If set = 1, this response should travel in the isochronous virtual channels for responses and response data. This bit is set in the target response if the Isoc bit was set in the request (Command field) that caused it. Note: The state of this bit should be preserved even when passing through tunnel devices with isochronous flow control disabled.
1 | 4:0 | UnitID[4:0]. (also see Bridge bit below). This field helps route the responses and is programmed in two different ways. For Upstream Responses (Bridge = 0): this field contains the UnitID of the node that generated the response (original target). For Downstream Responses (Bridge = 1): this field contains the UnitID of the original requestor.
1 | 6 | Bridge. This bit is set by host bridges to indicate responses which are traveling downstream. Interior devices use the Bridge bit and UnitID to claim returning responses. Upstream responses from interior devices have the Bridge bit cleared and carry the UnitID of the responder, meaning that they are routed implicitly to the host bridge based only on the fact that the Bridge bit = 0.
1 | 7 | PassPW. This bit will be set in the read response if the Response May Pass Posted Requests bit was set in the command field of the read request that caused it. If set, relaxed ordering may be applied.
2 | 4:0 | SrcTag[4:0]. This field is copied from the request packet.
2 | 5 | Error. When set, this bit indicates that an error occurred during the read transaction. All of the requested data is returned, even if there is an error.
2 | 7:6 | Count[1:0]. (also see Byte 3 bits 0,1). This is the lower half of the 4-bit field that indicates the quantity of returning data. For Dword Read transfers: this field is a copy of the count field in the request packet. For Byte Read transfers: Count field is always set = 0 (1 dword). For Atomic RMW transfers: Count field is always set = 1 (2 dwords = 1 qword).
3 | 1:0 | Count[3:2]. (also see Byte 2 bits 6,7). This is the upper half of the 4-bit field that indicates the quantity of returning data. For Dword Read transfers: this field is a copy of the count field in the request packet. For Byte Read transfers: Count field is always set = 0 (1 dword). For Atomic RMW transfers: Count field is always set = 1 (2 dwords = 1 qword).
3 | 4:2 | Reserved.
3 | 5 | NXA (Non-Existent Address). This bit is only valid if the Error bit (Byte 2, bit 5) is set. If NXA and Error are both set = 1, the error occurred at an end-of-chain device due to a non-existent address problem. If NXA = 0 and Error is set = 1, then the error occurred at the target.
3 | 7:6 | Reserved.
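A decoder for the read response fields in Table 4-9 might look like the following sketch. The dataclass and function names are hypothetical; field positions and the command code come from the table, and the `dwords` property relies on the table's examples (Count = 0 for 1 dword, Count = 1 for 2 dwords).

```python
from dataclasses import dataclass

READ_RESPONSE_CMD = 0b110000  # command code from Table 4-9


@dataclass(frozen=True)
class ReadResponse:
    isoc: bool
    unit_id: int
    bridge: bool
    pass_pw: bool
    src_tag: int
    error: bool
    nxa: bool
    count: int

    @property
    def dwords(self) -> int:
        """Number of dwords of data following the response (Count + 1)."""
        return self.count + 1


def parse_read_response(pkt: bytes) -> ReadResponse:
    """Decode a 4-byte read response packet (sketch)."""
    if len(pkt) != 4:
        raise ValueError("read responses are four bytes")
    b0, b1, b2, b3 = pkt
    if b0 & 0x3F != READ_RESPONSE_CMD:
        raise ValueError("not a read response command code")
    return ReadResponse(
        isoc=bool(b0 & 0x80),                          # Byte 0, bit 7
        unit_id=b1 & 0x1F,                             # Byte 1, bits 4:0
        bridge=bool(b1 & 0x40),                        # Byte 1, bit 6
        pass_pw=bool(b1 & 0x80),                       # Byte 1, bit 7
        src_tag=b2 & 0x1F,                             # Byte 2, bits 4:0
        error=bool(b2 & 0x20),                         # Byte 2, bit 5
        nxa=bool(b3 & 0x20),                           # Byte 3, bit 5
        count=((b2 >> 6) & 0x3) | ((b3 & 0x3) << 2),   # Count[1:0] + Count[3:2]
    )
```

A requestor would match `src_tag` against its outstanding requests and then consume `dwords` dwords of data, even when `error` is set.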
Target Done Responses
The four-byte target done response is returned when non-posted WrSized or Flush requests are made. As no data is returned with the target done response, it is routed back to the original requestor as a way to confirm the completion of a write transaction or a Flush operation. The contents of the target done response packet are very similar to the read response packet except that no mask/count information is required because there is no data to transfer.
Figure 4-14 on page 97 depicts the various fields of the four-byte target done response packet. Table 4-10 summarizes the usage of each bit field.
Figure 4-14. Control Packets: Target Done Response
Table 4-10. HyperTransport Target Done Response Packet Bit Assignments
Byte | Bit | Function
0 | 5:0 | Command Code. This is the six bit command code for the Target Done Response packet. Value: 110011b
0 | 6 | Reserved.
0 | 7 | Isoc. If set = 1, this response should travel in the isochronous virtual channels for responses and response data. This bit is set in the target done response if the Isoc bit was set in the request (Command field) that caused it. Note: The state of this bit should be preserved even when passing through tunnel devices with isochronous flow control disabled.
1 | 4:0 | UnitID[4:0]. (also see Bridge bit below). This field helps route the responses, and is programmed in two different ways. For Upstream Responses (Bridge = 0): this field contains the UnitID of the node which generated the response (original target). For Downstream Responses (Bridge = 1): this field contains the UnitID of the original requestor.
1 | 6 | Bridge. This bit is set by host bridges to indicate responses which are traveling downstream. Interior devices use the Bridge bit and UnitID to claim returning responses. Upstream responses from interior devices have the Bridge bit cleared and carry the UnitID of the responder, meaning that they are routed implicitly to the host bridge based only on the fact that the Bridge bit = 0.
1 | 7 | PassPW. This bit is set in the target done response if relaxed ordering of the target done response is permitted. As there is no Response May Pass Posted Requests bit in write requests, it is device-specific whether this response packet bit is set or not. Generally, it is expected to be set.
2 | 4:0 | SrcTag[4:0]. This field is copied from the request that caused this target done response.
2 | 5 | Error. When set, this bit indicates that an error occurred during the transaction.
2 | 7:6 | Reserved.
3 | 4:0 | Reserved.
3 | 5 | NXA (Non-Existent Address). This bit is only valid if the Error bit (Byte 2, bit 5) is set. If NXA and Error are both set = 1, the error occurred at an end-of-chain device due to a non-existent address problem. If NXA = 0 and Error is set = 1, then the error occurred at the target.
3 | 7:6 | Reserved.
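The interplay of the Error and NXA bits can be summarized in a short sketch. The function name and the returned strings are hypothetical; the rule that NXA is only meaningful when Error is set comes from the table.

```python
ERROR_MASK = 0x20  # Byte 2, bit 5
NXA_MASK = 0x20    # Byte 3, bit 5


def classify_response_error(byte2: int, byte3: int) -> str:
    """Interpret the Error and NXA bits of a response packet."""
    if not byte2 & ERROR_MASK:
        return "ok"  # NXA is not valid unless Error is set
    if byte3 & NXA_MASK:
        return "non-existent address (end of chain)"
    return "target error"
```

Checking Error first means a stray NXA bit in an otherwise clean response is ignored, as the table requires.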
Monday, December 10, 2007
HyperTransport Core Topics
The Signal Groups
As illustrated in Figure 3-1 on page 54, the high-speed HyperTransport signals on each link consist of an outbound (transmit) set of signals and an inbound (receive) set of signals for each device; these are routed point-to-point. Having two sets of uni-directional signals allows concurrent traffic. In addition, there is one set of low speed signals that may be bused to multiple devices.
The High Speed Signals (One Set In Each Direction)
Each high-speed signal is actually a differential signal pair. CAD (Command/Address/Data) information consists of the two basic types of HyperTransport packets: control and data. When a link transmitter sends packets on the CAD bus, the receive side of the interface uses the CLK and CTL signals, also supplied by the transmitter, to latch in packet information during each bit time. CTL distinguishes control packets from data packets.
The CAD Signal Group
The CAD bus is always driven by the transmitter side of a link, and is comprised of signal pairs that carry HyperTransport requests, responses, and data. Each CAD bus may consist of between 2 bits (two differential signal pairs) and 32 bits (thirty-two differential signal pairs). The HyperTransport specification permits the CAD bus width to be different (asymmetrical) for the two directions. To enable the corresponding receiver to make a distinction as to the type of information currently being sent over the CAD bus, the transmitter also drives the CTL signal (see the following description).
Control Signal (CTL)
This signal pair is driven by the transmitter to qualify the information being sent concurrently over the CAD signals. If this signal is asserted (high), the transmitter is indicating that it is sending a control packet; if deasserted, the transmitter is sending a data packet. The receiver uses this information when routing incoming CAD information to appropriate request queues, data buffers, etc. There is one (and only one) CTL signal for each link direction, regardless of the width of the CAD bus.
Clock Signal(s) (CLK)
As a source-synchronous connection, each HyperTransport transmitter sends a differential clock signal along with CAD and CTL signals to the receiver at the other end of the link. There is one CLK signal pair for each byte of CAD width. While the timing on each clock pair is the same, replicating the clock helps in routing CAD signal pairs with respect to their clock signals. The current HyperTransport specification allows clock speeds from 200MHz (default) to 800MHz.
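Combining the clock range above with the rule that there are two bit times per clock period, the raw transfer figures for a link can be worked out with a few lines of arithmetic. This is a sketch with hypothetical function names; it counts raw CAD bytes in one direction and ignores any overhead such as CRC or control packets.

```python
VALID_WIDTHS = (2, 4, 8, 16, 32)
BIT_TIMES_PER_CLOCK = 2  # data is transferred twice per clock period


def bit_times_per_packet(packet_bytes: int, link_width_bits: int) -> int:
    """Bit times needed to move a packet across a link of the given width."""
    if link_width_bits not in VALID_WIDTHS:
        raise ValueError("unsupported link width")
    if packet_bytes % 4:
        raise ValueError("packets are multiples of four bytes")
    return (packet_bytes * 8) // link_width_bits


def clocks_per_packet(packet_bytes: int, link_width_bits: int) -> float:
    """Clock periods needed for the packet (may be fractional on wide links)."""
    return bit_times_per_packet(packet_bytes, link_width_bits) / BIT_TIMES_PER_CLOCK


def raw_bytes_per_second(link_width_bits: int, clock_hz: int) -> int:
    """Raw one-direction byte rate: width x two bit times per clock x clock."""
    if link_width_bits not in VALID_WIDTHS:
        raise ValueError("unsupported link width")
    return link_width_bits * BIT_TIMES_PER_CLOCK * clock_hz // 8
```

For example, an 8-bit link at 800MHz moves one byte per bit time at 1600 million bit times per second, while a 2-bit link at the 200MHz default moves a quarter of a byte per bit time.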
Scaling Hazards: Burden Is On The Transmitter
It is a requirement in HyperTransport that the transmitter side of each link must be aware of the capabilities of its corresponding receiver and avoid the double hazard of a scalable bus: running at a faster clock rate than the receiver can handle or using a wider data path than the receiver supports. Because the link is not a shared bus, the transmitter side of each device is concerned with the capabilities of only one target. Refer to "Link Initialization" on page 282 for a description of how HyperTransport links are initialized and configured to avoid these problems.
The Low Speed Signals
Power OK (PWROK) And Reset (RESET#)
PWROK used with RESET# indicates to HyperTransport devices whether a Cold or Warm Reset is in progress. Which system logic component is responsible for managing the PWROK and RESET# signals is beyond the scope of the HyperTransport specification, but timing and use of the signals are defined. The basic use of the signals includes:
At power up, PWROK is asserted by system logic when it can be guaranteed that system power and clocks related to HyperTransport are within proper limits.
RESET# is asserted by system logic to indicate that a reset is required. The state of PWROK when RESET# is seen asserted indicates the type of reset to be performed. PWROK and RESET# both asserted is a warm reset; PWROK deasserted and RESET# asserted indicates cold reset.
After initial system power up, reset, and initialization, a cold or warm reset may also be generated under software control writing configuration registers in the host bridge.
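The reset decoding described above reduces to a two-input truth table. In this sketch the function name is hypothetical, and both arguments are logical assertion states (so `reset_asserted=True` means RESET# is being driven active, regardless of its electrical level).

```python
def reset_type(pwrok_asserted: bool, reset_asserted: bool) -> str:
    """Classify the reset in progress from the PWROK and RESET# states."""
    if not reset_asserted:
        return "none"
    # RESET# asserted: PWROK distinguishes warm from cold reset.
    return "warm" if pwrok_asserted else "cold"
```

A device model would sample both signals together, since the state of PWROK at the moment RESET# is seen asserted determines which reset actions apply.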
The HyperTransport specification describes the actions to be taken by devices during either type of reset event. Refer to Chapter 12, entitled "Reset & Initialization," on page 275 for a thorough discussion of how PWROK and RESET# are used during system power-up and initialization.
LDTSTOP#
(Note: the signal names LDTSTOP# and LDTREQ# were carried forward from the earlier name AMD assigned to HyperTransport technology — Lightning Data Transfer).
LDTSTOP# is an input to HyperTransport devices which is asserted by system logic to enable and disable link activity during power management state transitions. Support for this signal is optional for HyperTransport devices.
A transmitter which detects LDTSTOP# asserted finishes sending any control packet in progress, then commences a disconnect NOP sequence followed by disabling its output drivers (if so enabled in the transmitter's Configuration Space Tri-State Enable Bit). Upon receipt of the disconnect NOP sequence, the target also turns off its input receivers (if similarly enabled in its Configuration Space Tri-State Enable Bit).
Later, when the transmitter detects LDTSTOP# deasserted, it re-enables its drivers and begins the initialization sequence. A receiver that responds to LDTSTOP# deasserted turns its input receivers on.
LDTREQ#
LDTREQ# is a wire-or'd output from HyperTransport devices that is used to request system logic to re-enable links previously disabled using the LDTSTOP# mechanism. Upon receipt of the LDTREQ# signal from one or more HyperTransport devices, system logic (typically the South Bridge) deasserts LDTSTOP# which triggers the sequence described previously. Specifically, the LDTREQ# signal indicates that a HyperTransport transaction is required somewhere in a system that is currently in the ACPI C3 state; the system is required to transition to the C0 state. Support for this signal is optional for HyperTransport devices.
Where Are The Interrupt, Error, And Wait State Signals?
The HyperTransport specification eliminates a number of control signals that are commonly found on other buses. While devices are not prohibited from implementing signals beyond those defined in the specification, HyperTransport is a generic, simple interface and handles interrupts, errors, and data wait states in the following general way:
Interrupt Signaling
Interrupts are conveyed in HyperTransport as messages sent over the link in the posted request channel. This eliminates the need for dedicated interrupt signal traces. Depending on the architecture, it may also eliminate the need for a separate interrupt controller (e.g. IOAPIC). Refer to Chapter 8, entitled "HT Interrupts," on page 199 for a discussion of HyperTransport interrupt management.
Error Signaling
HyperTransport error handling employs CRC checking of bit traffic across each link interface. In the event of an error, there are several possible handling schemes. All of this is done without any dedicated error signals. Refer to Chapter 10, entitled "Error Detection And Handling," on page 229 for a discussion of HyperTransport error detection and handling.
Wait State Signaling
Wait states during transmission of data are a problem on any bus because they represent wasted time on the part of the devices performing the transfer and for other devices waiting to perform subsequent transfers. In HyperTransport, wait state, disconnect, and retry mechanisms used on other buses are eliminated. This is made possible through a coupon-based flow control scheme that guarantees that no transfer will be started by a transmitter which cannot be immediately accepted by the corresponding receiver on the other side of the link. Dynamic flow control information concerning buffer availability is embedded in NOP packets sent by each device — removing the need for dedicated transmitter and receiver ready signals. Refer to Chapter 5, entitled "Flow Control," on page 99 for a discussion of HyperTransport flow control.
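The coupon idea can be illustrated with a minimal sketch in which a transmitter may only send when it holds a coupon, and coupons arrive in NOP packets from the receiver. This sketch deliberately collapses everything into a single counter, which is a simplification; the class and method names are hypothetical, and the actual buffer accounting is covered in the flow control chapter.

```python
class CreditedTransmitter:
    """Toy coupon-based sender: one coupon equals one free receiver buffer."""

    def __init__(self) -> None:
        self.credits = 0          # no coupons until the receiver reports buffers
        self.sent = []

    def on_nop(self, buffers_freed: int) -> None:
        # A NOP packet from the other side reports newly available buffers.
        self.credits += buffers_freed

    def try_send(self, packet: str) -> bool:
        # Never start a transfer the receiver cannot accept immediately.
        if self.credits == 0:
            return False
        self.credits -= 1
        self.sent.append(packet)
        return True
```

Because the transmitter checks its coupon count before starting, the receiver never needs to insert wait states or issue a retry.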
No Arbitration Signals Either
Unlike a shared bus such as PCI or PCI-X with multiple masters, HyperTransport links are point-to-point connections with only one possible transmitter and one receiver for each direction. Because of this, arbitration signals and related arbitration latency are eliminated; as long as flow control for a link is not violated, a transmitter simply starts a transaction whenever it is required.
As illustrated in Figure 3-1 on page 54, the high-speed HyperTransport signals on each link consist of an outbound (transmit) set of signals and an inbound (receive) set of signals for each device; these are routed point-to-point. Having two sets of uni-directional signals allows concurrent traffic. In addition, there is one set of low speed signals that may be bused to multiple devices.
The High Speed Signals (One Set In Each Direction)
Each high-speed signal is actually a differential signal pair. CAD (Command/Address/Data) information consists of the two basic types of HyperTransport packets: control and data. When a link transmitter sends packets on the CAD bus, the receive side of the interface uses the CLK and CTL signals, also supplied by the transmitter, to latch in packet information during each bit time. CTL distinguishes control packets from data packets.
The CAD Signal Group
The CAD bus is always driven by the transmitter side of a link, and is comprised of signal pairs that carry HyperTransport requests, responses, and data. Each CAD bus may consist of between 2 bits (two differential signal pairs) and 32 bits (thirty-two differential signal pairs). The HyperTransport specification permits the CAD bus width to be different (asymmetrical) for the two directions. To enable the corresponding receiver to make a distinction as to the type of information currently being sent over the CAD bus, the transmitter also drives the CTL signal (see the following description).
Control Signal (CTL)
This signal pair is driven by the transmitter to qualify the information being sent concurrently over the CAD signals. If this signal is asserted (high), the transmitter is indicating that it is sending a control packet; if deasserted, the transmitter is sending a data packet. The receiver uses this information when routing incoming CAD information to appropriate request queues, data buffers, etc. There is one (and only one) CTL signal for each link direction, regardless of the width of the CAD bus.
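As a rough illustration of how a receiver might use CTL, the sketch below sorts the bytes sampled on an 8-bit link into control and data streams. It assumes one CAD byte per bit time (as on an 8-bit interface); the function name and the sample format are hypothetical.

```python
def split_by_ctl(samples):
    """Separate CAD bytes into control and data streams.

    samples is an iterable of (ctl, cad_byte) pairs, one per bit time,
    where ctl is True when the CTL signal is asserted.
    """
    control, data = bytearray(), bytearray()
    for ctl, cad_byte in samples:
        # CTL asserted marks control packet bytes; deasserted marks data.
        (control if ctl else data).append(cad_byte)
    return bytes(control), bytes(data)
```

For instance, four bit times with CTL asserted followed by four with CTL deasserted would be split into a 4-byte control packet and 4 bytes of data.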
Clock Signal(s) (CLK)
As a source-synchronous connection, each HyperTransport transmitter sends a differential clock signal along with CAD and CTL signals to the receiver at the other end of the link. There is one CLK signal pair for each byte of CAD width. While the timing on each clock pair is the same, replicating the clocks helps in routing CAD signal pairs with respect to their clock signals. The current HyperTransport specification allows clock speeds from 200MHz (default) to 800MHz.
Scaling Hazards: Burden Is On The Transmitter
It is a requirement in HyperTransport that the transmitter side of each link must be aware of the capabilities of its corresponding receiver and avoid the double hazard of a scalable bus: running at a faster clock rate than the receiver can handle or using a wider data path than the receiver supports. Because the link is not a shared bus, the transmitter side of each device is concerned with the capabilities of only one target. Refer to "Link Initialization" on page 282 for a description of how HyperTransport links are initialized and configured to avoid these problems.
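One way to picture the transmitter's obligation is as choosing the widest width and fastest clock that both ends support. The sketch below is only a conceptual model under that assumption; the function name and the idea of passing capability sets are hypothetical, and the real procedure is the one described under link initialization.

```python
def choose_link_settings(tx_widths, rx_widths, tx_clocks_mhz, rx_clocks_mhz):
    """Pick the widest CAD width and fastest clock common to both ends."""
    common_widths = set(tx_widths) & set(rx_widths)
    common_clocks = set(tx_clocks_mhz) & set(rx_clocks_mhz)
    if not common_widths or not common_clocks:
        raise ValueError("transmitter and receiver share no usable setting")
    # Never exceed what the receiver can handle in either dimension.
    return max(common_widths), max(common_clocks)
```

With a transmitter capable of 16 bits at 800MHz and a receiver limited to 8 bits at 400MHz, this returns `(8, 400)`, avoiding both hazards.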
The Low Speed Signals
Power OK (PWROK) And Reset (RESET#)
PWROK used with RESET# indicates to HyperTransport devices whether a Cold or Warm Reset is in progress. Which system logic component is responsible for managing the PWROK and RESET# signals is beyond the scope of the HyperTransport specification, but timing and use of the signals are defined. The basic use of the signals includes:
At power up, PWROK is asserted by system logic when it can be guaranteed that system power and clocks related to HyperTransport are within proper limits.
RESET# is asserted by system logic to indicate that a reset is required. The state of PWROK when RESET# is seen asserted indicates the type of reset to be performed. PWROK and RESET# both asserted is a warm reset; PWROK deasserted and RESET# asserted indicates cold reset.
HT Architectural Overview
The Previous Chapter
To understand why HT was developed, it is helpful to review the previous generation of I/O buses and interconnects. The previous chapter reviewed the factors that limit the ability of older generation buses to keep pace with the increasing demands of new applications, and discussed the key features of HT technology that provide its improved capability.
This Chapter
This chapter provides an overview of the HT architecture that defines the primary elements of HT technology and the relationship between these elements. This chapter summarizes the features, capabilities, and limitations of HT and provides the background information necessary for in-depth discussions of the various HT topics in later chapters.
The Next Chapter
The next chapter describes the function of each signal in the high- and low-speed HyperTransport signal groups.
General
HyperTransport provides a point-to-point interconnect that can be extended to support a wide range of devices. Figure 2-1 on page 21 illustrates a sample HT system with four internal links. HyperTransport provides a high-speed, high-performance, point-to-point dual simplex link for interconnecting IC components on a PCB. Data is transmitted from one device to another across the link.
Figure 2-1. Example HyperTransport System
The width of the link along with the clock frequency at which data is transferred are scalable:
Link width ranges from 2 bits to 32 bits
Clock Frequency ranges from 200MHz to 800MHz (and 1GHz in the future)
This scalability allows for a wide range of link performance and potential applications with bandwidths ranging from 200MB/s to 12.8GB/s.
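The quoted bandwidth range can be reproduced with simple arithmetic, given two bit times per clock period. The sketch below assumes the figures count both directions of the link together, an interpretation chosen because it matches the 200MB/s and 12.8GB/s endpoints; the function name is hypothetical.

```python
def link_bandwidth_mb_s(width_bits, clock_mhz, both_directions=True):
    """Peak link bandwidth in MB/s (1 MB = 10**6 bytes)."""
    transfers_per_us = clock_mhz * 2                    # two bit times per clock period
    per_direction = transfers_per_us * width_bits / 8   # bytes moved per microsecond
    return per_direction * (2 if both_directions else 1)
```

Under these assumptions, a 2-bit link at 200MHz gives 200 MB/s and a 32-bit link at 800MHz gives 12,800 MB/s (12.8GB/s), or 6.4GB/s in each direction.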
At the current revision of the spec, 1.04, there is no support for connectors, implying that all HyperTransport (HT) devices are soldered onto the motherboard. HyperTransport is technically an "inside-the-box" bus. In reality, connectors have been designed for systems that require board-to-board connections, and where analyzer interfaces are desired for debug.
Once again referring to Figure 2-1, the HT bus has been extended in the sample system via a series of devices known as tunnels. A tunnel is merely an HT device that performs some function, but in addition it contains a second HT interface that permits the connection of another HT device. In Figure 2-1, the tunnel devices provide connections to other I/O buses:
InfiniBand
PCI-X
Ethernet
The end device is termed a cave, which always represents the termination of a chain of devices that all reside on the same HT bus. Cave devices include a function, but no additional HT connection. The series of devices that comprise an HT bus is sometimes simply referred to as an HT chain.
Additional HT buses (i.e. chains) may be implemented in a given system by using an HT-to-HT bridge. In this way, a fabric of HT devices may be implemented. Refer to the section entitled "Extending the Topology" on page 33 for additional detail.
Transfer Types Supported
HT supports two types of addressing semantics:
legacy PC, address-based semantics
messaging semantics common to networking environments
The first part of this book discusses the address-based semantics common to compatible PC implementations. Message-passing semantics are discussed in Chapter 19, entitled "Networking Extensions Overview," on page 443.
Address-Based Semantics
The HT bus was initially implemented as a PC-compatible solution that by definition uses address-based semantics. This includes a 40-bit, or 1 Terabyte (TB), address space. Transactions specify locations within this address space that are to be read from or written to. The address space is divided into blocks that are allocated for particular functions, listed in Figure 2-2 on page 23.
Figure 2-2. HT Address Map
HyperTransport does not contain a dedicated I/O address space. Instead, CPU I/O space is mapped to a high memory address range (FD_FC00_0000h—FD_FDFF_FFFFh). Each HyperTransport device is configured at initialization time by the boot ROM configuration software to respond to a range of memory addresses. The devices are assigned addresses via the base address registers contained in the configuration register header. Note that these registers are based on the PCI Configuration registers, and are also mapped to memory space (FD_FE00_0000h—FD_FFFF_FFFFh). Unlike the PCI bus, there is no dedicated configuration address space.
Read and write request command packets carry bits Addr[39:2] of the 40-bit address; the two low-order bits are not sent because request addresses are doubleword aligned. Additional memory address ranges are used for interrupt signaling and system management messages. Details regarding the use of each range of address space are discussed in subsequent chapters that cover the related topics. For example, a detailed discussion of the configuration address space can be found in Chapter 13, entitled "Device Configuration," on page 305.
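The two ranges named above can be turned into a small address classifier, shown here as a sketch. Only the I/O and configuration ranges are taken from the text; the interrupt and system management ranges are not reproduced here, so anything else is simply reported as "other". The function names are hypothetical.

```python
# Ranges quoted in the text; bounds are inclusive.
IO_SPACE = (0xFD_FC00_0000, 0xFD_FDFF_FFFF)
CONFIG_SPACE = (0xFD_FE00_0000, 0xFD_FFFF_FFFF)


def classify_address(addr: int) -> str:
    """Report which of the quoted ranges a 40-bit address falls in."""
    if not 0 <= addr < (1 << 40):
        raise ValueError("address does not fit in 40 bits")
    if IO_SPACE[0] <= addr <= IO_SPACE[1]:
        return "I/O"
    if CONFIG_SPACE[0] <= addr <= CONFIG_SPACE[1]:
        return "configuration"
    return "other"  # memory, interrupt, system management, etc.


def request_address_field(addr: int) -> int:
    """Return Addr[39:2], the address bits carried in a request packet."""
    return addr >> 2
```

For example, `classify_address(0xFD_FE00_0000)` reports "configuration", and `request_address_field(0x1004)` drops the two low-order bits to give `0x401`.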
Data Transfer Type and Transaction Flow
The HT architecture supports several methods of data transfer between devices, including:
Programmed I/O
DMA
Peer-to-peer
Each method is illustrated and described below. An overview of packet types and transactions is discussed later in this chapter.
Programmed I/O Transfers
Transfers that originate as a result of executing code on the host CPU are called programmed I/O transfers. For example, a device driver for a given HT device might execute a read transaction to check its device status. Transactions initiated by the CPU are forwarded to the HT bus via the Host HT Bridge as illustrated in Figure 2-3. The example transaction is a write that is posted by the host bridge; thus no response is returned from the target device. Non-posted operations, of course, require a response.
Figure 2-3. Transaction Flow During Programmed I/O Operation
DMA Transfers
HT devices may wish to perform a direct memory access (DMA) by simply initiating a read or write transfer. Figure 2-4 illustrates a master performing a DMA read operation from main DRAM. In this example, a response is required to return data back to the source HT device.
Figure 2-4. Transaction Flow During DMA Operation
Peer-to-Peer Transfers
Figure 2-5 on page 26 illustrates the initial request to read data from the target device residing on the same bus. Note that even though the target device resides on the same bus, it ignores the request moving in the upstream direction (toward the host processor). When the request reaches the upstream bridge, it is turned around and sent in the downstream direction toward the target device. This time the target device detects the request and returns the requested data in a response packet.
Figure 2-5. Peer-to-Peer Transaction Flow
The peer-to-peer transfer does not occur directly between the requesting and responding devices as might be expected. Rather, the upstream bridge is involved in handling both the request and response to ensure that the transaction ordering requirements are managed correctly. This requirement exists to support PCI-compliant ordering. True, or direct, peer-to-peer transfers are supported when PCI ordering is not required, as defined by the networking extensions. See Chapter 19, entitled "Networking Extensions Overview," on page 443 for details.
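The turnaround at the upstream bridge can be traced with a short sketch that lists every hop a peer-to-peer request makes. It assumes a single chain described as a list ordered from the device nearest the host bridge to the farthest; the function name and the "host bridge" label are hypothetical.

```python
def peer_to_peer_path(chain, requester, target):
    """List the hops of a peer-to-peer request on one HT chain.

    chain is ordered from the device nearest the host bridge to the
    farthest. The request first travels upstream to the bridge, is
    turned around, and then travels downstream to the target.
    """
    r = chain.index(requester)
    t = chain.index(target)
    upstream = chain[r::-1]          # requester, then each device above it
    downstream = chain[:t + 1]       # from the bridge down to the target
    return upstream + ["host bridge"] + downstream
```

With `chain = ["A", "B", "C"]`, a request from C to A visits C, B, A, the host bridge, and then A again; A ignores the request on its first (upstream) pass and accepts it only on the downstream pass.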
HT Signals
The HT signals can be grouped into two broad categories (See Figure 2-6 on page 27):
The link signal group — used to transfer packets in both directions (High-Speed Signals).
The support signal group — provides required resources such as power and reset, as well as other signals to support optional features such as power management (Low-Speed Signals).
Figure 2-6. Primary HT Signal Groups
Link Packet Transfer Signals
The high-speed signals used for packet transfer in both directions across an HT link include:
CAD (command, address, data). Multiplexed signals that carry control packets (request, response, information) and data packets. Note that the width of the CAD bus is scalable from 2 bits to 32 bits. (See "Scalable Performance" on page 30.)
CLK (clock). Source-synchronous clock for CAD and CTL signals. A separate clock signal is required for each byte lane supported by the link. Thus, the number of CLK signals required is directly proportional to the number of bytes that can be transferred across the link at one time.
CTL (control). Indicates whether a control packet or data packet is currently being delivered via the CAD signals.
Figure 2-7 illustrates these signals and defines various widths of data bus supported. The variables "n" and "m" define the scaling option implemented. Refer to "Link Initialization" on page 282 for details regarding HT data width and clock speed scaling.
Figure 2-7. Link Signals Used to Transfer Packets
Link Support Signals
The low-speed link support signals consist of power- and initialization-related signals and power management signals. Power- and initialization-related signals include:
VLDT & Ground — The 1.2 volt supply that powers HT drivers and receivers
PWROK — Indicates to devices residing in the HT fabric that power and clock are stable.
RESET# — Used to reset and initialize the HT interface within devices and perhaps their internal logic (device specific).
Power management signals
LDTREQ# — Requests re-enabling links for normal operation.
LDTSTOP# — Enables and disables links during system state transitions.
Figure 2-8. Link Support Signals
Introduction to HyperTransport
Background: I/O Subsystem Bottlenecks
New I/O buses are typically developed in response to changing system requirements and to promote lower cost implementations. Current-generation I/O buses such as PCI are rapidly falling behind the capabilities of other system components such as processors and memory. Some of the reasons why the I/O bottlenecks are becoming more apparent are described below.
Server Or Desktop Computer: Three Subsystems
A server or desktop computer system is comprised of three major subsystems:
Processor (in servers, there may be more than one)
Main DRAM Memory. There are a number of different synchronous DRAM types, including SDRAM, DDR, and Rambus.
I/O (Input/Output devices). Generally, all components which are not processors or DRAM are lumped together in this subsystem group. This would include such things as graphics, mass storage, legacy hardware, and the buses required to support them: PCI, PCI-X, AGP, USB, IDE, etc.
CPU Speed Makes Other Subsystems Appear Slow
Because of improvements in CPU internal execution speed, processors are more demanding than ever when they access external resources such as memory and I/O. Each external read or write by the processor represents a huge performance hit compared to internal execution.
Multiple CPUs Aggravate The Problem
In systems with multiple CPUs, such as servers, the problem of accessing external devices becomes worse because of competition for access to system DRAM and the single set of I/O resources.
DRAM Memory Keeps Up Fairly Well
Although it is external to the processor(s), system DRAM memory keeps up fairly well with the increasing demands of CPUs for a couple of reasons. First, the performance penalty for accessing external memory is mitigated by the use of internal processor caches. Modern processors generally implement multiple levels of internal caches that run at the full CPU clock rate and are tuned for high "hit rates". Each fetch from an internal cache eliminates the need for an external bus cycle to memory.
In addition, in cases where an external memory fetch is required, DRAM technology and the use of synchronous bus interfaces to it (e.g. DDR, RAMBUS, etc.) have allowed it to maintain bandwidths comparable with the processor external bus rates.
I/O Bandwidth Has Not Kept Pace
While the processor internal speed has raced forward, and memory access speed has managed to follow along reasonably well with the help of caches, I/O subsystem evolution has not kept up.
This Slows Down The Processor
Although external DRAM accesses by processors can be minimized through the use of internal caches, there is no way to avoid external bus operations when accessing I/O devices. The processor must perform small, inefficient external transactions which then must find their way through the I/O subsystem to the bus hosting the device.
It Also Hurts Fast Peripherals
Similarly, bus master I/O devices that use PCI or other subsystem buses to reach main memory are hindered by the lack of bandwidth. Some modern peripheral devices (e.g. SCSI and IDE hard drives) are capable of running much faster than the buses they live on. This is another system bottleneck, and it is a particular problem for applications that emphasize time-critical movement of data through the I/O subsystem over CPU processing.
Reducing I/O Bottlenecks
Two important schemes have been used to connect I/O devices to main memory. The first is the shared bus approach, as used in PCI and PCI-X. The second involves point-to-point component interconnects, and includes some proprietary buses as well as open architectures such as HyperTransport. Both are described here, along with the advantages and disadvantages of each.
The Shared Bus Approach
Figure 1-1 on page 12 depicts the common "North-South" bridge PCI implementation. Note that the PCI bus acts as both an "add-in" bus for user peripheral cards and as an interconnect bus to memory for all devices residing on or below it. Even traffic to and from the USB and IDE controllers integrated in the South Bridge must cross the PCI bus to reach main memory.
Figure 1-1. Typical PCI North-South Bridge System
Until recently, the topology shown in Figure 1-1 on page 12 has been very popular in desktop systems for a number of reasons, including:
A shared bus reduces the number of traces on the motherboard to a single set.
All of the devices located on the PCI bus are only one bridge interface away from the principal target of their transactions — main DRAM memory.
A single, very popular protocol (PCI) can be used for all embedded devices, add-in cards, and chipset components attached to the bus.
Unfortunately, some of the things that made this topology so popular have also made it difficult to fix the I/O bandwidth problems, which have become more obvious as processors and memory have become faster.
A Shared Bus Runs At Limited Clock Speeds
Because multiple devices (including PCB connectors) attach to a shared bus, trace lengths and electrical complexity limit the maximum usable clock speed. For example, a generic PCI bus has a maximum clock speed of 33 MHz; the PCI Specification permits increasing the clock speed to 66 MHz, but the number of devices and connectors allowed on the bus is then very limited.
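To put these clock limits in perspective, the following minimal sketch computes the theoretical peak burst bandwidth of a PCI bus as clock rate times data path width, for the 32-bit and 64-bit data paths PCI defines. The function name is hypothetical, the clock values are the rounded 33 and 66 MHz figures quoted above, and the result assumes one data transfer on every clock with no arbitration, address phases, or wait states, so real sustained throughput would be lower.

```python
def pci_peak_bandwidth_mb_s(clock_mhz: int, bus_width_bits: int) -> int:
    """Theoretical peak burst bandwidth in MB/s, assuming one transfer per clock."""
    bytes_per_transfer = bus_width_bits // 8
    return clock_mhz * bytes_per_transfer


# Tabulate the four clock/width combinations discussed for PCI.
for clock_mhz in (33, 66):
    for width_bits in (32, 64):
        bandwidth = pci_peak_bandwidth_mb_s(clock_mhz, width_bits)
        print(f"{clock_mhz} MHz x {width_bits}-bit: {bandwidth} MB/s")
```

Even the fastest combination is a ceiling shared by every device on the bus, which is why adding devices does not add bandwidth.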
A Shared Bus May Be Host To Many Device Types
The requirements of devices on a shared bus may vary widely in terms of bandwidth needed, tolerance for bus access latency, typical data transfer size, etc. All of this complicates arbitration on the bus when multiple masters wish to initiate transactions.
Backward Compatibility Prevents Upgrading Performance
If a critical shared bus is based on an open architecture, especially one that defines user "add-in" connectors, then another problem in upgrading bus bandwidth is the need to maintain backward compatibility with all of the devices and cards already in existence. If the bus protocol is enhanced and a user installs an "older generation card", then the bus must either revert to the earlier protocol or lose its compatibility.
Special Problems If The Shared Bus Is PCI
As popular as it has been, PCI presents additional problems that contribute to performance limits:
PCI doesn't support split transactions, resulting in inefficient retries.
Transaction size isn't known in advance (there is no limit), which makes it difficult to size buffers and causes frequent disconnects by targets. Devices are also allowed to insert numerous wait states during each data phase.
All PCI transactions by I/O devices targeting main memory generally require a "snoop" cycle by CPUs to assure coherency with internal caches. This impacts both CPU and PCI performance.
Its data bus scalability is very limited (32-bit or 64-bit data).
Because of the PCI electrical specification (low-power, reflected wave signals), each PCI bus is physically limited in the number of ICs and connectors it can support at a given PCI clock speed.
PCI bus arbitration is vaguely specified. Access latencies can be long and difficult to quantify. If a second PCI bus is added (using a PCI-PCI bridge), arbitration for the secondary bus typically resides in the new bridge. This further complicates PCI arbitration for traffic moving vertically to memory.
A Note About PCI-X
Other than scalability and the number of devices possible on each bus, the PCI-X protocol has resolved many of the problems just described with PCI. For third-party manufacturers of high performance add-in cards and embedded devices, the shared bus PCI-X is a straightforward extension of PCI which yields huge bandwidth improvements (up to about 2 GB/s with PCI-X 2.0).
The Point-to-Point Interconnect Approach
An alternative to the shared I/O bus approach of PCI or PCI-X is having point-to-point links connecting devices. This method is being used in a number of new bus implementations, including HyperTransport technology. A common feature of point-to-point connections is much higher bandwidth capability; to achieve this, point-to-point protocols adopt some or all of the following characteristics:
only two devices per connection.
low voltage, differential signaling on the high speed data paths
source-synchronous clocks, sometimes using double data rate (DDR)
very tight control over PCB trace lengths and routing
integrated termination and/or compensation circuits embedded in the two devices which maintain signal integrity and account for voltage and temperature effects on timing.
dual simplex interfaces between the devices rather than one bi-directional bus; this enables duplex operations and eliminates "turn around" cycles.
sophisticated protocols that eliminate retries, disconnects, wait-states, etc.
A Note About Connectors
While connectors may or may not be defined in a point-to-point link specification, they may be designed into some implementations for board-to-board connections or for the attachment of diagnostic equipment. HyperTransport does not define a peripheral add-in card connector as PCI and PCI-X do.
What HT Brings
HyperTransport is a point-to-point, high-performance, "inside-the-box" motherboard interconnect bus. It targets IT, Telecom, and other applications requiring high bandwidth, scalability, and low latency access. Figure 1-2 on page 15 illustrates a single HT bus implementation with a variety of functional devices attached.
Figure 1-2. Sample HT-based System
Key Features Of HyperTransport Protocol
The key characteristics of the HT technology include:
Open architecture, non-proprietary bus
One or more fast, point-to-point links
Scaling of individual link width and clock speed to suit cost/performance targets
Split-transaction protocol eliminates retries, disconnects, and wait-states.
Standard and optional isochronous traffic support
PCI compatible; designed for minimal impact on OS and driver software
CRC error generation and checking
Programmable error handling strategy for CRC, protocol, and other errors
Message signalled interrupts
System Management features
Support for bridges to legacy busses
x86 compatibility features
Device types including tunnels, bridges, and end devices permit construction of a system fabric composed of independent, customized links.
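The effect of scaling link width can be illustrated with a short sketch. It assumes the link widths of 2, 4, 8, 16, and 32 bits, two bit times per clock period, and the lowest clock speed of 200 MHz cited in this chapter's cost comparison; the function and constant names are hypothetical, and the figures are theoretical peaks for one direction of a link rather than measured throughput.

```python
LINK_WIDTHS_BITS = (2, 4, 8, 16, 32)  # HyperTransport link widths
BIT_TIMES_PER_CLOCK = 2               # double data rate: two bit times per clock period


def ht_peak_bandwidth_mb_s(clock_mhz: float, width_bits: int) -> float:
    """Theoretical peak bandwidth in MB/s for one direction of a link."""
    bit_times_per_microsecond = clock_mhz * BIT_TIMES_PER_CLOCK
    return bit_times_per_microsecond * width_bits / 8


# Peak bandwidth at the lowest clock speed for each link width.
for width_bits in LINK_WIDTHS_BITS:
    bandwidth = ht_peak_bandwidth_mb_s(200, width_bits)
    print(f"{width_bits:2d}-bit link at 200 MHz: {bandwidth:.0f} MB/s per direction")
```

Because each link is sized independently, a designer could in principle pair a narrow link to a low-bandwidth device with a wide link to a demanding one in the same system.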
Formerly known as AMD's Lightning Data Transport (LDT), HyperTransport is backed by a consortium of developers. See www.hypertransport.org.
The Cost Factor
In addition to technology-related issues, there is always pressure on the platform designer to increase performance and other capabilities with each new generation, but to do so at a lower cost than the previous one. One popular method of measuring the success of this effort is to compare the bandwidth of one I/O bus to another, and the number of signals required to achieve it. This bandwidth-per-pin comparison works fairly well because I/O bus bandwidth is a critical factor in determining if system data bottlenecks exist, and a lower pin count translates directly into cost savings due to smaller IC packages, lower power, simplified motherboard routing, etc.
An example:
The bandwidth-per-pin for a generic 32-bit PCI bus during a burst transfer is approximately 3.5 MB/s (132 MB/s [33 MHz x 4 bytes]/38 pins [32 data signals + 5 control lines + 1 clock]). By comparison, a 32-bit HyperTransport interface running at the lowest clock speed of 200 MHz yields a per-pin burst bandwidth of approximately 22 MB/s (1600 MB/s [200 MHz x 2 DDR x 4 bytes]/74 pins [32 CAD signal pairs + 4 clock pairs + 1 CTL pair], counted for one direction of the link).
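The arithmetic in this example can be reproduced with a minimal sketch. The variable and function names are hypothetical; the clock rates, transfer sizes, and pin counts are the ones quoted in the example, with each HyperTransport differential pair counted as two pins.

```python
def bandwidth_per_pin(bandwidth_mb_s: float, pins: int) -> float:
    """Burst bandwidth divided by the number of pins needed to carry it."""
    return bandwidth_mb_s / pins


# Generic 32-bit PCI: 33 MHz x 4 bytes per clock.
pci_bandwidth = 33 * 4              # 132 MB/s
pci_pins = 32 + 5 + 1               # data + control + clock = 38 pins

# 32-bit HyperTransport at 200 MHz: two transfers per clock (DDR) x 4 bytes.
ht_bandwidth = 200 * 2 * 4          # 1600 MB/s
ht_pins = 2 * (32 + 4 + 1)          # CAD, clock, and CTL pairs = 74 pins

pci_per_pin = bandwidth_per_pin(pci_bandwidth, pci_pins)
ht_per_pin = bandwidth_per_pin(ht_bandwidth, ht_pins)

print(f"PCI: {pci_per_pin:.1f} MB/s per pin")
print(f"HyperTransport: {ht_per_pin:.1f} MB/s per pin")
print(f"Ratio: {ht_per_pin / pci_per_pin:.1f}x")
```

By this measure, the HyperTransport link delivers roughly six times the burst bandwidth per pin of PCI, even at its lowest clock speed.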
Networking Support
Finally, at the time of the writing of this book, the HyperTransport I/O Link Specification is at revision 1.04. This specification revision mainly targets I/O subsystem improvements in conventional desktop and server platforms.
A growing number of applications require architectures that integrate well with networking environments. In many of these systems, unlike desktops and servers, processing may be decentralized and features such as message streaming, peer-to-peer transfers, and assigned isochronous bandwidth become important. In addition, device types such as switches help in building topologies suited to communications networking. To accommodate networking applications, work is well underway on the 1.05 and 1.1 revisions of the HyperTransport I/O Link Specification. The 1.05 specification includes the HyperTransport switch specification and the 1.1 specification incorporates the networking extensions specification. See Chapter 19, entitled "Networking Extensions Overview," on page 443 for a summary of the major features expected to be included in the 1.05 and 1.1 specification revisions.
Visit www.hypertransport.org for up-to-date information on all ongoing specification revisions.
Also, visit MindShare's website at www.mindshare.com for updates to this book relating to this and other HyperTransport topics. Information will be available for free download when the new specification revisions are released and details become publicly available.