There is a very popular and important question about Windows Packet Filter: “Can I handle Gigabit traffic in a WinpkFilter user-mode application without noticeable performance degradation?”. I am asked this rather frequently, and usually my answer starts with “that depends…”, continues with various performance-related considerations, and ends with “if you need the maximum possible performance, then consider putting your packet filtering code directly into the WinpkFilter driver”. And yes, there are at least several important things to take into consideration:
- First, and the most important one, is CPU performance. The faster the CPU, the better, because fetching a packet from the kernel, processing it and re-injecting it back into the network stack costs extra CPU cycles.
- Application architecture. This is strongly connected to the previous point, but even with a high-end, incredibly fast CPU, a poorly designed application has a good chance of slowing down your network.
- Network latency. You may already be aware that TCP performance is very sensitive to network latency. The relevant parameter is the TCP round-trip time (RTT): the length of time it takes for a packet to be sent plus the length of time it takes for an acknowledgment of that packet to be received. By design, TCP limits the amount of data which can be sent without acknowledgment (the TCP window size), so the theoretical throughput of a single TCP session = TCP maximum receive window size / RTT (see the worked example below). This also highlights the importance of the previous points: fetching a network packet from the kernel and back not only takes CPU cycles, it also takes time and thus increases RTT. If your packet processing code is inefficient and slow, it adds even more. All of this can noticeably decrease single TCP session throughput.
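To put the formula into numbers (the figures below are illustrative assumptions, not measurements): with a 64 KB receive window and an RTT of 1 ms, a single session tops out at roughly 65,535 bytes × 8 / 0.001 s ≈ 524 Mbit/s. Double the RTT to 2 ms and the ceiling drops to about 262 Mbit/s, so every extra millisecond spent in the packet filtering path directly caps the achievable throughput.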
So it looks reasonable to run some tests and see what throughput can be achieved using user-mode packet filtering. As a side effect of this research, several performance optimizations were implemented in the WinpkFilter driver code (version 3.2.7).
Prerequisites
Hardware
The configuration is very common: two “far from new”, average-performance mobile workstations with built-in Gigabit LAN interfaces, connected by UTP CAT5e cables to a Gigabit home router:
- Core i7-2760QM 2.40 GHz 16GB RAM Windows 10 x64 Pro
- Core i7-Q720 1.60 GHz 16GB RAM Windows 10 x64 Pro
Software
First, we need a tool capable of transferring data between our hosts and measuring the effective throughput. I have two favorite utilities: NetCPS and iPerf3. The latter is more advanced and allows running more than one TCP session simultaneously. So iPerf3 (people like things starting with ‘i’) is the choice of the day.
We also need a simple WinpkFilter-based network filter application which passes network packets through and perhaps gathers some basic statistics. There are two samples we could in principle use for testing: PassThru and PackThru. However, they are designed to be as simple as possible to demonstrate basic API usage, and they heavily use console I/O to dump packet headers. Console I/O is expensive from the performance point of view, and we can’t afford it in a high-load test. So let’s code something similar, but optimized for performance.
#include "stdafx.h"
//
// Maximum number of packets to fetch from the driver per single operation
//
#define MAX_PACKET_CHUNK 256
//
// Some global definitions
//
TCP_AdapterList  g_AdList;              // List of adapters bound to TCP/IP
DWORD            g_iIndex;              // Zero-based index of the adapter to filter
CNdisApi         g_api;                 // WinpkFilter driver interface
HANDLE           g_hEvent;              // Packet arrival notification event
volatile BOOLEAN g_bIsRunning = TRUE;   // Cleared to stop the working threads
volatile LONG64  g_llPacketFiltered = 0; // Total packets processed (LONG64 for InterlockedIncrement64)
volatile LONG    g_dwReadOps = 0;       // Total successful ReadPackets calls
Instead of reading and forwarding packets in main(), we create a dedicated thread for reading and forwarding packets. Having this thread also allows us to experiment with creating multiple threads to filter a single network interface; soon we will see whether having more than one thread per interface makes any sense. MAX_PACKET_CHUNK defines the maximum number of network packets to read from the driver in a single API call. Please note that it is important to use ReadPackets/SendPacketsToAdapter/SendPacketsToMstcp to read/write packets in bulk in one API call: compared to the single-packet versions ReadPacket/SendPacketToAdapter/SendPacketToMstcp, you save many CPU cycles otherwise spent on user/kernel context switches. The value of 256 for MAX_PACKET_CHUNK was chosen experimentally; in a real application you could make this a dynamic parameter and adjust it according to the average number of packets you read/write.
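For comparison, a single-packet pass-through loop would look roughly like the fragment below (a minimal sketch reusing the globals defined above, error handling omitted). Every iteration pays a full user/kernel round trip for a single packet, which is exactly the overhead the bulk calls amortize:
// Single-packet variant (for comparison only): one kernel transition per packet
ETH_REQUEST Request;
INTERMEDIATE_BUFFER Buffer;
ZeroMemory(&Request, sizeof(ETH_REQUEST));
ZeroMemory(&Buffer, sizeof(INTERMEDIATE_BUFFER));
Request.hAdapterHandle = (HANDLE)g_AdList.m_nAdapterHandle[g_iIndex];
Request.EthPacket.Buffer = &Buffer;
while (g_bIsRunning)
{
    WaitForSingleObject(g_hEvent, INFINITE);
    ResetEvent(g_hEvent);
    while (g_api.ReadPacket(&Request))
    {
        if (Buffer.m_dwDeviceFlags == PACKET_FLAG_ON_SEND)
            g_api.SendPacketToAdapter(&Request); // outgoing: forward to the wire
        else
            g_api.SendPacketToMstcp(&Request);   // incoming: forward to the stack
    }
}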
//
// Working thread routine. One instance is started per requested worker thread
//
unsigned __stdcall WorkingThread(void * index)
{
    PETH_M_REQUEST ReadRequest;
    PETH_M_REQUEST ToMstcpRequest;
    PETH_M_REQUEST ToAdapterRequest;
    INTERMEDIATE_BUFFER PacketBuffer[MAX_PACKET_CHUNK];
    unsigned dwMaxRead = 0;
    ULONG_PTR dwThreadIndex = reinterpret_cast<ULONG_PTR>(index);

    printf("Thread: %I64d started\n", (ULONGLONG)dwThreadIndex);
    //
    // Initialize request structures
    //
    ReadRequest = (PETH_M_REQUEST)malloc(sizeof(ETH_M_REQUEST) +
        sizeof(NDISRD_ETH_Packet)*(MAX_PACKET_CHUNK - 1));
    ToMstcpRequest = (PETH_M_REQUEST)malloc(sizeof(ETH_M_REQUEST) +
        sizeof(NDISRD_ETH_Packet)*(MAX_PACKET_CHUNK - 1));
    ToAdapterRequest = (PETH_M_REQUEST)malloc(sizeof(ETH_M_REQUEST) +
        sizeof(NDISRD_ETH_Packet)*(MAX_PACKET_CHUNK - 1));

    ZeroMemory(ReadRequest, sizeof(ETH_M_REQUEST) +
        sizeof(NDISRD_ETH_Packet)*(MAX_PACKET_CHUNK - 1));
    ZeroMemory(ToMstcpRequest, sizeof(ETH_M_REQUEST) +
        sizeof(NDISRD_ETH_Packet)*(MAX_PACKET_CHUNK - 1));
    ZeroMemory(ToAdapterRequest, sizeof(ETH_M_REQUEST) +
        sizeof(NDISRD_ETH_Packet)*(MAX_PACKET_CHUNK - 1));
    ZeroMemory(PacketBuffer, sizeof(INTERMEDIATE_BUFFER)*MAX_PACKET_CHUNK);

    ReadRequest->hAdapterHandle = (HANDLE)g_AdList.m_nAdapterHandle[g_iIndex];
    ToMstcpRequest->hAdapterHandle = (HANDLE)g_AdList.m_nAdapterHandle[g_iIndex];
    ToAdapterRequest->hAdapterHandle = (HANDLE)g_AdList.m_nAdapterHandle[g_iIndex];

    ReadRequest->dwPacketsNumber = MAX_PACKET_CHUNK;

    // Attach an intermediate buffer to each packet slot of the read request
    for (unsigned i = 0; i < MAX_PACKET_CHUNK; ++i)
    {
        ReadRequest->EthPacket[i].Buffer = &PacketBuffer[i];
    }
    while (g_bIsRunning)
    {
        WaitForSingleObject(g_hEvent, INFINITE);

        // Reset the event, as we don't need to wake up all working threads at once
        if (g_bIsRunning)
            ResetEvent(g_hEvent);

        // Start reading packets from the driver
        while (g_api.ReadPackets(ReadRequest))
        {
            // Count successful ReadPackets calls to calculate the average later
            InterlockedIncrement(&g_dwReadOps);

            if (ReadRequest->dwPacketsSuccess > dwMaxRead)
            {
                dwMaxRead = ReadRequest->dwPacketsSuccess;
                printf(
                    "Thread: %I64d New read operation maximum %d packets from driver\n",
                    (ULONGLONG)dwThreadIndex,
                    dwMaxRead
                );
            }

            // Sort packets by direction: outgoing go back to the adapter,
            // incoming go up to the protocol stack (MSTCP)
            for (unsigned i = 0; i < ReadRequest->dwPacketsSuccess; ++i)
            {
                InterlockedIncrement64(&g_llPacketFiltered);

                if (PacketBuffer[i].m_dwDeviceFlags == PACKET_FLAG_ON_SEND)
                {
                    ToAdapterRequest->EthPacket[ToAdapterRequest->dwPacketsNumber].Buffer = &PacketBuffer[i];
                    ToAdapterRequest->dwPacketsNumber++;
                }
                else
                {
                    ToMstcpRequest->EthPacket[ToMstcpRequest->dwPacketsNumber].Buffer = &PacketBuffer[i];
                    ToMstcpRequest->dwPacketsNumber++;
                }
            }

            if (ToAdapterRequest->dwPacketsNumber)
            {
                g_api.SendPacketsToAdapter(ToAdapterRequest);
                ToAdapterRequest->dwPacketsNumber = 0;
            }

            if (ToMstcpRequest->dwPacketsNumber)
            {
                g_api.SendPacketsToMstcp(ToMstcpRequest);
                ToMstcpRequest->dwPacketsNumber = 0;
            }

            ReadRequest->dwPacketsSuccess = 0;
        }
    }

    free(ReadRequest);
    free(ToMstcpRequest);
    free(ToAdapterRequest);

    return 0;
}
The working thread is simple. It reads a bulk of packets from the driver, sorts them into incoming and outgoing, and re-injects them back into the network stack. It also counts the number of packets and of ReadPackets API calls in order to calculate an average.
int main(int argc, char* argv[])
{
    if (argc < 3)
    {
        printf("Command line syntax:\
            \n\tPerfTest.exe index threads\
            \n\tindex - network interface index.\
            \n\tthreads - number of working threads.\
            \n\tYou can use ListAdapters to determine correct index.\n");
        return 0;
    }

    g_iIndex = atoi(argv[1]) - 1;
    unsigned concurentThreadsSupported = atoi(argv[2]);

    if (!g_api.IsDriverLoaded())
    {
        printf("Driver not installed on this system or failed to load.\n");
        return 0;
    }
    g_api.GetTcpipBoundAdaptersInfo(&g_AdList);

    // g_iIndex is unsigned, so this also catches the wrap-around
    // when the index argument was 0 or not a number
    if (g_iIndex >= g_AdList.m_nAdapterCount)
    {
        printf("There is no network interface with such index on this system.\n");
        return 0;
    }

    ADAPTER_MODE Mode;
    Mode.dwFlags = MSTCP_FLAG_SENT_TUNNEL | MSTCP_FLAG_RECV_TUNNEL;
    Mode.hAdapterHandle = (HANDLE)g_AdList.m_nAdapterHandle[g_iIndex];

    // Create notification event
    g_hEvent = CreateEvent(NULL, TRUE, FALSE, NULL);

    // Set event for helper driver
    if ((!g_hEvent) ||
        (!g_api.SetPacketEvent((HANDLE)g_AdList.m_nAdapterHandle[g_iIndex], g_hEvent)))
    {
        printf("Failed to create notification event or set it for driver.\n");
        return 0;
    }

    g_api.SetAdapterMode(&Mode);
    HANDLE* phThreads = new HANDLE[concurentThreadsSupported];

    for (unsigned i = 0; i < concurentThreadsSupported; ++i)
    {
        phThreads[i] =
            (HANDLE)_beginthreadex(
                NULL,
                0,
                WorkingThread,
                reinterpret_cast<void*>((ULONG_PTR)i),
                0,
                NULL
            );
    }

    printf("Press any key to stop packet filtering... \n");
    _getch();

    g_bIsRunning = FALSE;
    SetEvent(g_hEvent);
    WaitForMultipleObjects(concurentThreadsSupported, phThreads, TRUE, INFINITE);

    for (unsigned i = 0; i < concurentThreadsSupported; ++i)
    {
        CloseHandle(phThreads[i]);
    }

    printf(
        "Filtered %I64d packets read in %d operations. Packets per read average: %I64d\n",
        g_llPacketFiltered,
        g_dwReadOps,
        g_dwReadOps ? g_llPacketFiltered / g_dwReadOps : 0 // avoid division by zero
    );
    //
    // Although we are exiting the application and all resources would be
    // cleaned up automatically, let's release everything explicitly
    //
    Mode.dwFlags = 0;
    Mode.hAdapterHandle = (HANDLE)g_AdList.m_nAdapterHandle[g_iIndex];

    // Set NULL event to release the previously set event object
    g_api.SetPacketEvent((HANDLE)g_AdList.m_nAdapterHandle[g_iIndex], NULL);

    // Close event
    if (g_hEvent)
        CloseHandle(g_hEvent);

    // Set default adapter mode
    g_api.SetAdapterMode(&Mode);

    // Empty the adapter packet queue
    g_api.FlushAdapterPacketQueue((HANDLE)g_AdList.m_nAdapterHandle[g_iIndex]);

    delete[] phThreads;

    return 0;
}
The resulting console application accepts the interface index and the number of working threads as parameters, starts the requested number of packet processing threads and waits for a keypress to terminate. Now it is time to run some tests. We will be using the latest available WinpkFilter 3.2.7 (single TCP session performance was significantly improved in this version).
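For example, to run it on the first network interface (indices as reported by the ListAdapters sample) with a single packet processing thread:
PerfTest.exe 1 1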
Performance tests
Both mobile workstations run iPerf3 in server mode to measure the throughput of incoming and outgoing TCP data streams. We then use iPerf3 in client mode on each workstation and measure the network throughput with a single TCP session and with 16 simultaneous TCP sessions, in the following software configurations (example iPerf3 command lines are shown after the list):
- WinpkFilter driver not installed
- WinpkFilter installed but not used
- WinpkFilter installed and perftest.exe (the utility we created above) started on the Ethernet interface with a single packet processing thread
- WinpkFilter installed and perftest.exe (the utility we created above) started on the Ethernet interface with two packet processing threads
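For reference, the test runs can be reproduced with commands along these lines (perf-server is a placeholder host name; -P sets the number of parallel streams and -R reverses the direction so that the server sends):
iperf3 -s                        # on the remote workstation
iperf3 -c perf-server            # send, single TCP stream
iperf3 -c perf-server -R         # receive, single TCP stream
iperf3 -c perf-server -P 16      # send, 16 parallel TCP streams
iperf3 -c perf-server -P 16 -R   # receive, 16 parallel TCP streams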
Configuration | Receive 1 TCP stream (Mbits/sec) | Send 1 TCP stream (Mbits/sec) | Receive 16 TCP streams (Mbits/sec) | Send 16 TCP streams (Mbits/sec)
---|---|---|---|---
NOT INSTALLED | 939 | 938 | 940 | 939
INSTALLED | 941 | 940 | 940 | 933
PERFTEST 1 THREAD | 937 | 935 | 934 | 939
PERFTEST 2 THREADS | 648 | 880 | 923 | 934
Conclusion
Well, as we can see from the test results, our simple WinpkFilter application handles 1 Gigabit for both a single and multiple TCP connections without network throughput degradation. The resulting CPU load with perftest.exe running is 17-20%, versus 7-8% without it. This is good news: for now, we can avoid moving our packet processing code into the kernel driver.
However, there is one more thing to note. What happens when we start a second working thread inside perftest.exe on the same network interface? Why does the throughput decrease so drastically? The reason is easy to find if you start a network sniffer and look at the order of network packets: the order of packets within a single TCP session is sometimes broken, causing duplicate ACKs and retransmissions, which seriously affects the TCP session throughput. So it is important to understand: if you use more than one thread to read packets from the same network interface, or have some other packet processing flow which may potentially break the original order of network packets, then you have to take care to re-inject the packets into the network stack in the correct order. One possible approach is sketched below.
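A common way to keep per-flow ordering with multiple workers (a hypothetical sketch of the general idea, not part of the PerfTest sample) is to dispatch every packet of a given flow to the same thread, for example by hashing the IP addresses:
#include <string.h>
// Hypothetical dispatch helper: all packets of one flow (same source and
// destination IPv4 addresses) always map to the same worker thread, so
// their relative order is preserved within that worker's queue.
unsigned SelectWorker(const unsigned char* ip_header, unsigned num_workers)
{
    unsigned long src = 0, dst = 0;

    // Source and destination IPv4 addresses live at offsets 12 and 16
    // of the IP header
    memcpy(&src, ip_header + 12, sizeof(src));
    memcpy(&dst, ip_header + 16, sizeof(dst));

    return (unsigned)((src ^ dst) % num_workers);
}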
PerfTest project source code is available on GitHub
Hi!
Will it have the same performance on Windows 7 x64?
The Windows network subsystem has not changed much since Windows Vista (when NDIS 6.0 was initially released), and for that reason WinpkFilter uses the same type of driver from Windows Vista through Windows 10. The driver binaries differ slightly because they are built targeting different minor versions of NDIS, which depend on the OS and are backward compatible: you can’t load the Windows 8 driver on Windows 7, but you can load the Windows 7 driver on Windows 8/10. So if Windows 7 x64 on particular hardware can handle Gigabit traffic without taking 100% CPU, then it will be able to do so with WinpkFilter as well.