I wanted to take a look at performance of sharing. I started from "the bottom", looking at TCP and SSL connection characteristics before looking at WebDAV usage patterns or performance within either the client or server codebase.
Investigation using 0.7Alpha3-rc3, creating brand-new share
Publishing a brand-new share took 9 connections, together with 9 SSL handshakes.
The 9 connections did the following work:
- OPTIONS /home/lisadu (200 OK), PROPFIND /home/lisadu/ (207 Multi-Status)
- PROPFIND /home/lisadu/ (207 Multi-Status)
- OPTIONS /home/lisadu/lisadu_s_calendar-1/ (200 OK), PROPFIND /home/lisadu/lisadu_s_calendar-1/ (501 Method Not Allowed), connection closed by server
- HEAD /home/lisadu/lisadu_s_calendar01/ (404 Not Found), MKCALENDAR /home/lisadu/lisadu_s_calendar-1/ (201 Created), PROPPATCH /home/lisadu/lisadu_s_calendar-1/ (207 Multi-Status)
- OPTIONS /home/lisdau/lisadu_s_calendar-1/.chandler/ (200 OK), PROPFIND /home/lisdau/lisadu_s_calendar-1/.chandler/ (501 Method Not Allowed), connection closed by server
- HEAD /home/lisadu/lisadu_s_calendar-1/.chandler/ (404 Not Found), MKCOL /home/lisadu/lisadu_s_calendar-1/.chandler/ (201 Created)
- PROPFIND /home/lisadu/lisadu_s_calendar-1/.chandler/ (207 Multi-Status), PUT *.xml (201 Created) -- one for each event
- PROPFIND /home/lisadu/lisadu_s_calendar-1/ (207 Multi-Status), PUT *.ics (201 Created) -- one for each event
- MKTICKET /home/lisadu/lisadu_s_calendar-1/ (200 OK), MKTICKET /home/lisadu/lisadu_s_calendar-1/ (200 OK)
This graph shows how much time was spent in each phase of the connection, for each of the 9 connections. The labels and the pale blue columns, show the total time spent in each phase.
This means that there's a lot going on besides simply using WebDAV (the part that goes in the data phase).
- To start with, the 2 seconds spent doing SSL handshake 9 times can be brought down to one-tenth the cost, simply by not starting new connections. Most of this time should be recoverable, that is, not simply replaced by time spent doing other things.
- The time spent after creating the connection and before starting the handshake (7.5 s total) is almost as large as the time spent sending data (10.9 s). It's possible that not all of this time is recoverable because it's hard to think why all that time is consumed. I don't know whether the time is consumed in Chandler, in Twisted, or in M2Crypto. Also note that this time is usually constant at about 0.6 seconds, but in the second connection 2 seconds are wasted with an established but unused connection.
- The time spent idling before starting a new connection totals 2.5 seconds. This time may be mostly unrecoverable if it's Chandler analysing data from the server or doing necessary preparation for the next connection.
- The time spent doing application data transfer is the largest chunk, but not the lowest-hanging fruit compared to the pre-handshake time.
A slightly different version of the graph shows average and a different scale overall.
And a completely different way of looking at the same data, shows graphically how nearly 10 seconds elapse before the client begins the connections where it does the actual publishing (connections 7 and 8). Some of that is necessary feature discovery, but certainly the handshakes and reconnects can be minimized.
Data Files
Investigation using May 21 version of Chandler, updating share with 22 events (not publishing new)
The publishing took five minutes. Ekr helped analyze the trace with
SSLDump (you can use it too, it's under BSD license). See below for the traces and logs.
The summary is that there are a bunch of TCP/SSL issues to investigate on the Chandler side first.
- There are 7 separate SSL connections during Publish. Only 1 should really be needed. An ignorant guess is we'd be using PyOpenSSL? classes in different places, instantiating new and separate objects rather than passing one around (or looking one up) to handle the sharing traffic.
- Even within one SSL connection, traffic is slow. Between establishing the TCP connection and starting the SSL handshake, the log shows a 0.5 second delay. The handshake takes another 0.2 second. The actual transaction takes only 0.1 seconds.
- There's something weird going on with the connection that starts at 9:35:15 in the trace. After opening the connection, it's two minutes before the client sends any data. This is rather aberrant.
- The client hung up on one of it's own connections, the same one where it delays 2 minutes. Why?
Details, see attached output files for corroboration:
- It's quite wasteful to terminate and restart SSL connections so often during sharing. I count 7 separate SSL connections (that's easiest to see in the SSLDump output), some of them going on concurrently. This isn't as bad as doing one SSL connection per PUT (I see 22 event files, plus one .chandler file and its 23 files), but it's still significant overhead. Because the connection is encrypted and we don't have the server key, we can't tell exactly what requests go in the 7 connections but I could redo this test without SSL on another server and find out.
- The SSL connection beginning at 9:38:06 in the TCPDump output shows the TCP to SSL delay: 9:38:06.23 timestamp followed by 9:38:06.74 timestamp. The 9:30:06.74 timestamp packet is the first one that has client data, e.g. it's not setting up the TCP connection, thus it's the one with the C>S handshake.
- (issues 3 and 4) This connection starts at 9:35:15 in the TCPDump output, with a SYN from the client and a server response SYN then an ACK. At 9:37 the client sends 72 bytes of data, then in the next packet it sends a FIN, closing the TCP connection. This might be client timeout (in fact I got a GUI error from Chandler with a "user timeout" error). Perhaps this is already fixed later version of Chandler isnce it shows up as a GUI error?
Files
--
LisaDusseault - 24 Jul 2006