Email from Menno Smits
From: menno@netbox.biz
Subject: [Email-SIG] Handling large emails: DiskMessage and DiskFeedParser
Date: May 24, 2004 6:15:41 PM PDT
To: email-sig@python.org
Hi all,
FeedParser is great because it doesn't load the entire message into memory during parsing (yes, I realise there are other
reasons for FeedParser existing too). However, once the message is parsed, the attachment bodies are still loaded entirely
into memory when Message instances are created and populated. This is a big problem for real-world environments where
large messages are possible: all available memory is consumed and the machine grinds to a halt. We see large (40MB+)
emails all the time, and problems start to occur when several of these are being processed simultaneously.
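To make the problem concrete, here is a minimal sketch using the email package as it ships with current Python 3 (not the 2004 code discussed in this message). The helper name parse_in_chunks and the sample message are illustrative assumptions. The input is fed to the parser in small chunks, yet the finished Message still holds the whole body as one in-memory string.

```python
import io
from email.feedparser import BytesFeedParser


def parse_in_chunks(fileobj, chunk_size=8192):
    """Feed a binary file object to the parser a chunk at a time."""
    parser = BytesFeedParser()
    while True:
        chunk = fileobj.read(chunk_size)
        if not chunk:
            break
        parser.feed(chunk)  # only this chunk is handed over per call
    return parser.close()   # returns the root Message object


# Illustrative message: small headers, 100,000-character body.
raw = (
    b"From: a@example.com\r\n"
    b"Subject: demo\r\n"
    b"Content-Type: text/plain\r\n"
    b"\r\n" + b"x" * 100_000 + b"\r\n"
)
msg = parse_in_chunks(io.BytesIO(raw))

# The payload is a single str holding the entire body in RAM.
payload = msg.get_payload()
print(type(payload).__name__, len(payload))
```

With a 40MB attachment the same thing happens at a much larger scale, once per message being processed.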
To cope with this problem I've created two classes, DiskMessage and DiskFeedParser (see http://oss.netboxblue.com).
DiskMessage is a simple subclass of Message that stores message payloads in temporary files instead of RAM. Its API
is compatible with the standard Message class, although to truly avoid loading the entire message into memory you need
to use some extra methods. See the source for details.
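The idea of a disk-backed Message subclass could be sketched roughly as follows. This is not the actual DiskMessage source: the class name DiskBackedMessage, the attribute _payload_file, and the extra methods append_payload_chunk and iter_payload are all hypothetical names chosen for illustration, written against the current Python 3 email package.

```python
import tempfile
from email.message import Message


class DiskBackedMessage(Message):
    """Hypothetical sketch: keep a leaf part's payload in a temp file."""

    def __init__(self):
        super().__init__()
        self._payload_file = None  # created lazily on the first write

    def append_payload_chunk(self, chunk):
        """Extra method (assumed name): write a piece of payload to disk."""
        if self._payload_file is None:
            # newline="" keeps \r\n line endings exactly as written.
            self._payload_file = tempfile.TemporaryFile(
                mode="w+", encoding="utf-8", newline="")
        self._payload_file.seek(0, 2)  # always append at the end
        self._payload_file.write(chunk)

    def iter_payload(self, chunk_size=64 * 1024):
        """Extra method (assumed name): read payload back in small chunks."""
        if self._payload_file is None:
            return
        self._payload_file.seek(0)
        while True:
            chunk = self._payload_file.read(chunk_size)
            if not chunk:
                break
            yield chunk

    def get_payload(self, i=None, decode=False):
        """Standard API: still works, but loads the whole payload."""
        if self._payload_file is None:
            return super().get_payload(i, decode)
        if i is not None or decode:
            # Transfer decoding is left out of this sketch.
            raise NotImplementedError("not handled in this sketch")
        return "".join(self.iter_payload())


m = DiskBackedMessage()
m["Subject"] = "big attachment"
m.append_payload_chunk("first half, ")
m.append_payload_chunk("second half\r\n")
for piece in m.iter_payload(chunk_size=8):
    pass  # process each piece without holding the whole payload
```

Callers that stick to get_payload() keep working but lose the memory benefit, which matches the trade-off described above: compatibility through the standard API, low memory use only through the extra methods.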
DiskFeedParser is a hack of the current FeedParser that uses the extra methods of DiskMessage to avoid ever loading
message payloads into memory. If anyone wants to try cleanly subclassing FeedParser for this purpose instead of
just hacking it I'd like to see the results.
Some informal tests of memory usage after parsing a 25MB email (2 large attachments), Python 2.3.3:
                                    VSZ      RSS
Parser with Message:              31840    25088
DiskFeedParser with DiskMessage:  12372     6128
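A comparable informal measurement can be sketched in current Python 3 with the standard tracemalloc module. Note the swap: this reports memory allocated by Python objects, not the process-level VSZ and RSS figures quoted above, so the numbers are not directly comparable. The function name retained_kib_after_parse and the sample message are illustrative assumptions.

```python
import tracemalloc
from email.parser import Parser


def retained_kib_after_parse(text):
    """Parse text and report Python-level memory still held afterwards."""
    tracemalloc.start()
    try:
        msg = Parser().parsestr(text)
        # current = bytes still allocated now; the peak value is ignored.
        current, _peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return msg, current / 1024


# Illustrative input: one header and a 200,000-character body.
sample = "Subject: test\n\n" + "x" * 200_000 + "\n"
parsed, kib = retained_kib_after_parse(sample)
print("retained after parsing: %.0f KiB" % kib)
```

Because the standard Message keeps the body as a string, the retained figure grows with the size of the message.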
Note that these classes haven't been tested extensively but seem to work. Any feedback would be
greatly appreciated.
Regards,
Menno
--
Menno Smits, Senior Development Engineer
NetBox http://netbox.biz | Voice +61 500 555 357
Oxcoda http://oxcoda.com | Fax +61 500 555 358
--
BrianKirsch - 25 May 2004