r2 - 08 Jul 2005 - 14:54:46 - LisaDusseault

Barry Warsaw Email

From:   barry@python.org
Subject: [Email-SIG] New FeedParser and other updates
Date: May 9, 2004 8:51:57 AM PDT
To:   email-sig@python.org
Last night I checked in a new FeedParser, along with tons of other
updates to rip out compatibility with older Pythons.  email 3.0 will
only support Python 2.3 and above.
I spent a lot of time poring over RFC 2046 and the BNF grammar and
tried to get our Parser, Message, and Generator classes to be more
compliant.  The trickiest bit of all this is caused by the RFC's
assertion that the newline preceding a multipart boundary actually
belongs to the boundary, not to the body.  This is tricky because the
FeedParser really wants a read-a-line-at-a-time abstraction, so by the
time you've seen the boundary, you've already consumed the preceding
newline.
The hack then is to try to track where that newline lives and clean it
up afterward.  I think it's mostly going to show up in the encapsulated
message body, or in the case where the inner message is a multipart, in
the epilogue.  For leading boundaries, you need to clean the newline out
of the preamble.
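
To illustrate the newline rule described above, here is a minimal sketch using the email package as it ships with current Python 3 (an assumption; the message describes email 3.0 on Python 2.3). The sample message and the boundary string "XXX" are made up for illustration. The point is that the newline just before each boundary line is not reported as part of the preamble or of the part body.

```python
import email

# Hypothetical sample message; the boundary "XXX" is invented for this sketch.
raw = (
    "MIME-Version: 1.0\n"
    'Content-Type: multipart/mixed; boundary="XXX"\n'
    "\n"
    "preamble text\n"
    "--XXX\n"
    "Content-Type: text/plain\n"
    "\n"
    "hello\n"
    "--XXX--\n"
    "epilogue text\n"
)

msg = email.message_from_string(raw)
part = msg.get_payload(0)

# The newline before "--XXX" belongs to the boundary, so it is stripped
# from the preamble; likewise the newline before "--XXX--" is stripped
# from the body of the enclosed part.
preamble = msg.preamble
body = part.get_payload()
epilogue = msg.epilogue

print(repr(preamble), repr(body), repr(epilogue))
```
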
Another big change is that the FeedParser will not throw parsing errors
any more.  Instead, if it finds a problem, it will populate a .defects
attribute on the current message.  This will be a list of instances of
subclasses of the new email.Errors.MessageDefect class.  The Generator
isn't currently set up to consult .defects, so that should be added.
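
A short sketch of what consulting .defects can look like, assuming the current Python 3 spelling of the module (email.errors rather than the email.Errors named above) and its default policy, which records defects instead of raising. The broken sample message is invented for illustration.

```python
import email
from email import errors

# Hypothetical broken message: the header promises a multipart with
# boundary "XXX", but no boundary line ever appears in the body.
raw = (
    'Content-Type: multipart/mixed; boundary="XXX"\n'
    "\n"
    "no boundary line ever appears in this body\n"
)

# Parsing does not raise; problems are recorded on the message instead.
msg = email.message_from_string(raw)

defect_names = [type(d).__name__ for d in msg.defects]
print(defect_names)

# With no start boundary found, the text is kept as a plain string payload.
payload = msg.get_payload()
```

Callers that care about well-formedness can check whether msg.defects is empty after parsing, rather than wrapping the parse in a try/except.
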
You should also check out the BufferedSubFile abstraction in
FeedParser.py  (this used to be called FeedableLumpOfText :).
In any event, the FeedParser passes all the test_email.py tests,
although some had to be modified.  I also added a bunch more tests to
flex the current semantics of the .preamble and .epilogue.
Everything's checked in now so please feel free to test it yourself. 
The old Parser class was rewritten to use the FeedParser, so now it's
basically just a backward compatible front-end.  I haven't thrown
Anthony's huge stress test at it yet, but I hope I'll find time soon to
do so.
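
As a rough illustration of the relationship between the two classes, the sketch below feeds the same text to a FeedParser in small chunks and to Parser in one call, then compares the results. It assumes the current Python 3 module names (email.feedparser and email.parser); the sample message and the chunk size of 7 are arbitrary choices for the example.

```python
from email.feedparser import FeedParser
from email.parser import Parser

# Hypothetical sample message for the comparison.
raw = "From: a@example.com\nSubject: hi\n\nbody line 1\nbody line 2\n"

# Incremental use: feed arbitrary chunks, then close() to get the message.
feed_parser = FeedParser()
for start in range(0, len(raw), 7):
    feed_parser.feed(raw[start:start + 7])
fed = feed_parser.close()

# One-shot use through the Parser front-end.
parsed = Parser().parsestr(raw)

same_subject = fed["Subject"] == parsed["Subject"]
same_body = fed.get_payload() == parsed.get_payload()
print(same_subject, same_body)
```
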
One thing I know won't 'work' is parsing of a nested multipart with the
same boundary on the inner and outer messages.  That's because of the
BufferedSubFile abstraction, since the outer boundary matching regexp
will cause it to return EOF on the first inner boundary.  The message
will get a StartBoundaryNotFound defect and the rest of the message will
be parsed as its body.
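
The following toy model may help show why a shared boundary is a problem. It is not the real BufferedSubFile: the class ToySubFile, the helper read_until_eof, and the boundary regexp are all hypothetical and written only for this sketch. It models a line source that reports EOF as soon as any pushed matcher recognizes a line.

```python
import re
from collections import deque


class ToySubFile:
    """Toy line source with a stack of EOF matchers (illustration only)."""

    def __init__(self, lines):
        self._lines = deque(lines)
        self._eof_stack = []

    def push_eof_matcher(self, matcher):
        self._eof_stack.append(matcher)

    def pop_eof_matcher(self):
        return self._eof_stack.pop()

    def readline(self):
        if not self._lines:
            return ""
        line = self._lines.popleft()
        # Any active matcher that recognizes the line makes it look like EOF.
        for matches in reversed(self._eof_stack):
            if matches(line):
                self._lines.appendleft(line)  # leave it for the outer reader
                return ""
        return line


def read_until_eof(source):
    """Collect lines until the source reports EOF (an empty string)."""
    collected = []
    while True:
        line = source.readline()
        if line == "":
            return collected
        collected.append(line)


# Lines of an inner multipart that reuses the outer boundary "XXX".
inner_lines = [
    'Content-Type: multipart/mixed; boundary="XXX"\n',
    "\n",
    "--XXX\n",        # meant as the inner start boundary
    "inner text\n",
    "--XXX--\n",      # meant as the inner close boundary
    "--XXX--\n",      # the outer close boundary
]

source = ToySubFile(inner_lines)
# The outer parser has pushed a matcher for its own boundary.
source.push_eof_matcher(re.compile(r"--XXX(--)?\s*$").match)

# The inner reader only ever sees the headers and the blank line: the
# first inner boundary already matches the outer matcher and reads as EOF.
seen_by_inner = read_until_eof(source)
print(seen_by_inner)
```

In this model the inner reader never gets to see its own start boundary, which corresponds to the missing-start-boundary situation described above.
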
I think a better solution can be found, along the lines of calling
unreadline() on what's been read up until then, pushing a different EOF
matcher onto the BufferedSubFile, and trying again.  You'd still want to
push a .defect
onto the message so that you knew the inner and outer messages had the
same boundary.  Also, the Generator would have to be modified to look
for that defect and calculate a different inner boundary for the
generated message (meaning it wouldn't be idempotent).
Enough babbling, enjoy.
-Barry

-- BrianKirsch - 10 May 2004

Open Source Applications Foundation
Except where otherwise noted, this site and its content are licensed by OSAF under a Creative Commons License, Attribution Only 3.0.
See list of page contributors for attributions.