Message encodings: "UnicodeEncodeError: 'ascii' codec can't encode..." #110
Reference: Disroot/gpg-lacre#110
At least one message has caused the following issue while being processed:
This means we sometimes try decoding a message (from bytes to str) using the wrong decoder. However, we never use ascii directly, so this might come from the locale settings (LANG=C) or from the Content-Type header, esp. its encoding part.
We probably should stop decoding to str.
This one is raised in smtplib, Python's built-in SMTP client module:
I'm still investigating, because it would be very surprising if a widely used library didn't support non-ASCII messages. However, the ASCII encoding is hard-coded there.
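A minimal sketch of how this error can arise, assuming the message reaches the SMTP client as a str rather than bytes. smtplib.SMTP.sendmail is documented to encode str messages with the ascii codec and to send bytes unchanged. The to_wire helper below is hypothetical and only mimics that conversion; it is not Lacre code.

```python
def to_wire(message):
    """Mimic the documented str handling of smtplib.SMTP.sendmail.

    Hypothetical helper: str is encoded with the ascii codec,
    bytes are passed through untouched.
    """
    if isinstance(message, str):
        return message.encode("ascii")
    return message


# Made-up message with non-ASCII characters in the body.
text = "Subject: test\r\n\r\nGrüße\r\n"

try:
    to_wire(text)
    error = None
except UnicodeEncodeError as exc:
    # Same family of error as in the issue title.
    error = str(exc)

print(error)

# Handing over bytes avoids the implicit ASCII step entirely.
wire = to_wire(text.encode("utf-8"))
```

If this is what happens, keeping the message as bytes all the way to the SMTP client would sidestep the codec choice.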
Title changed from "Character set mismatch" to "Message encodings: "UnicodeEncodeError: 'ascii' codec can't encode...""
I am getting this error as well, but with an email that should not even be touched by gpg-lacre as there is no public key for the receiving address in the server's keyring. It is a confirmation mail from keyserver@keys.openpgp.org which bounces and is not delivered. gpg-lacre really needs to fail softer, especially when there is nothing more for it to do than passing on the message.
It seems this only happens with emails not to be encrypted by the server. After setting up server side encryption for the account, the confirmation emails from keyserver@keys.openpgp.org arrive, although German umlauts (ä,ö,ü) are replaced with something displayed as non-printable characters.
Thank you very much @EmanuelLoos! I've just reproduced the issue, created a test message and will now try fixing it.
However, judging by the contents of your log I can tell you're running outdated code. Please consider updating:
This will set your repository to our current release, where several issues have already been fixed.
Thanks, I saw the update after commenting on this issue and have already installed it.
Modules useful while solving this issue and #111: email.message, email.generator and email.parser.
I suspect we'll need to get rid of all as_string calls and direct conversions from str to bytes and use generators and parsers provided by email submodules. This should make the code much more thorough, but will probably take some more time than initially expected.
Finally I've made some progress. A friend has provided a message that does lead to this issue in the Lacre test environment.
Unfortunately, my E2E tests produce different results, so I'm trying to figure out where the difference is.
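The idea of replacing as_string calls with the parsers and generators from the email package could look roughly like this. It is a sketch, not Lacre's actual code: the roundtrip helper, the sample message and the choice of policy.SMTP are assumptions made for illustration.

```python
from email import policy
from email.generator import BytesGenerator
from email.parser import BytesParser
from io import BytesIO


def roundtrip(raw: bytes) -> bytes:
    """Parse raw message bytes and serialize them again without using str."""
    msg = BytesParser(policy=policy.SMTP).parsebytes(raw)
    buf = BytesIO()
    BytesGenerator(buf, policy=policy.SMTP).flatten(msg)
    return buf.getvalue()


# Hypothetical 8bit UTF-8 message, similar in shape to the ones discussed here.
RAW = (
    b"From: alice@example.org\r\n"
    b"To: bob@example.org\r\n"
    b"Subject: test\r\n"
    b"MIME-Version: 1.0\r\n"
    b"Content-Type: text/plain; charset=UTF-8\r\n"
    b"Content-Transfer-Encoding: 8bit\r\n"
    b"\r\n"
    + "Grüße 😀\r\n".encode("utf-8")
)

out = roundtrip(RAW)
print(out.decode("utf-8"))
```

Staying in bytes from input to output means no step has to guess which codec to apply to the body.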
Still looking for the key difference between our test environment and the environment in which E2E tests run.
I've also spent a lot of time trying to figure out if there's something obviously wrong about my approach, but I haven't found anything.
The issue can be reproduced when:
I'll need to add a test with an expired key to make sure I'm not missing anything in e2e tests.
An E2E test with an expired key and the message known to cause the encoding issue on @lacre.io doesn't lead to the expected results, i.e. the encoding issue is not reproduced in the test environment.
It leaves the question open: what the heck is happening there?
I've got two messages: one is multipart/alternative with text/plain and text/html, and the other is just a text/plain. Both use Content-Transfer-Encoding: 8bit and the UTF-8 charset, yet the first doesn't cause the encoding error and the second does.
Reading Section 6 of RFC 2045 (MIME: Format of Internet Message Bodies), because I need to understand the basics first.
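When comparing two messages like these, it can help to list the content type, charset and transfer encoding of every MIME part side by side. A small sketch using the standard email package; the describe helper and the sample message are made up for illustration and are not part of Lacre.

```python
from email import message_from_bytes, policy


def describe(raw: bytes):
    """Return (content type, charset, transfer encoding) for each MIME part."""
    msg = message_from_bytes(raw, policy=policy.SMTP)
    rows = []
    for part in msg.walk():
        cte = part.get("Content-Transfer-Encoding")
        rows.append((
            part.get_content_type(),
            part.get_content_charset(),
            str(cte).strip().lower() if cte is not None else None,
        ))
    return rows


# Hypothetical sample resembling the plain text/plain message.
PLAIN = (
    b"Subject: test\r\n"
    b"MIME-Version: 1.0\r\n"
    b"Content-Type: text/plain; charset=UTF-8\r\n"
    b"Content-Transfer-Encoding: 8bit\r\n"
    b"\r\n"
    + "Grüße 😀\r\n".encode("utf-8")
)

for row in describe(PLAIN):
    print(row)
```

Running the same helper on the multipart/alternative message would show one row for the container and one for each alternative, which makes differences in headers easier to spot.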
Seems like the issue is fixed, but let's wait a day or two to verify it.
Along the way I've addressed #113 and #115.
Where's the fix?
Hi @EmanuelLoos, you can find it on post-test-fixes branch (recent commits are the most important).
However, please note there's a temporary addition of extended logging (opt-in, off by default). See example config in the repository to see the details.
I just updated. However, gpg-lacre still seems to have a hard time encrypting 8-bit UTF-8 messages sent from Thunderbird that contain emojis and umlauts.
Sent:
Received:
7bit emails sent from K-9 Mail work fine.
Sent:
Received:
After updating I still get the error:
Thanks for detailed reports @EmanuelLoos! I'll investigate some more.
@EmanuelLoos one more thing - you can remove screenshots now. They contain PII and email addresses, which you might want to keep private.
I'm investigating the issue. I know how and why it happens, but don't know how to solve it yet.
The email module expects the message to be encoded with ASCII, but it is not.
I'm mapping the flow of data in Lacre and how messages are handled in PGP/MIME and PGP/Inline modes to understand when and how data transcoding takes place and how to do it properly.
👉 Python Unicode HOWTO
I've pushed some more changes to post-test-fixes branch. I still need to deploy them to test environment (and the commit history is very messy because of trial-and-error approach), but I feel this time it might be fixed, or at least improved. 😅
Nope, still haven't fixed it.
Back to square one...
I've finally found something:
I've pushed some changes to the branch and deployed it to our test environment.
My test message has been processed properly. Let's test these changes some more.
Seems that the tests are going well.
@EmanuelLoos, do you also confirm that current HEAD of branch post-test-fixes solves encoding issues?
With this version emails sent from Thunderbird (with and without special characters) won't arrive at all for me:
Python 3.9 (Devuan 4 chimaera based on Debian 11 bullseye):
Python 3.11 (Devuan 5 daedalus based on Debian 12 bookworm):
Emails sent using K9-Mail don't cause any problems.
Thanks for reporting @EmanuelLoos, I've just pushed a change that'll fix the issue.
By the way -- please note you can use the advanced mail filter now. Basic documentation is in doc/adv-filt.md. It boils down to running Lacre as a daemon and telling Postfix to forward email to that daemon's port.
Our documentation includes a link to Postfix documentation on Advanced Filters too, for your convenience.
edit:
I suspect K9-Mail messages are multipart and therefore were not rewrapped. Rewrapping code is the only place where this error could be raised.
First of all, thanks for the fix and the information. Emails sent from Thunderbird arrive again now!
Sadly however, there is still something going wrong during the encryption of emails sent from Thunderbird containing special characters:
Thanks for the report @EmanuelLoos. Could you please send similar messages from both of these clients to me? You'll find my email in my profile.
Meanwhile, I'll try reproducing it myself.
Did you also get my two additional emails about encoding errors regarding messages sent from K9-Mail?
I'm asking because if you're using gpg-lacre already on your server they might not have arrived.
Hi @EmanuelLoos, yes, I've got them - thank you. I've been offline for a week and that's why I didn't reply to you.
Hi @EmanuelLoos,
you can try using the most recent code. I've made it slightly more reliable. However, I'll keep encouraging you to use the advanced filter if possible.
Hi @EmanuelLoos,
It looks like I've fixed the issue (or at least can no longer reproduce it after recent fixes). Please let me know if current code works for you.
Since we'd like to move on with the project, I'd like to close this ticket by the end of this week, unless of course there are reports of Lacre still messing with message contents encodings.
If we find anything else (not related to encodings), we'll create new tickets and leave this one.
I just updated to the current master branch to which the fix was merged.
Well, all emails do arrive, but it still messes up special characters in emails from Thunderbird for me.
This is from the source code of the original message. This is the way Thunderbird encodes special characters and emojis in UTF-8 that gpg-lacre seems not to get along with:
When saving the message as a .eml file and opening it with a text editor I get this:
Yahoo seems to have had an issue like that too for some time.
I think most likely gpg-lacre has trouble with the 8bit content transfer encoding (i.e. non-ASCII characters), ...
Content transfer encodings
... or maybe it might have some more specific problems with Thunderbird's utf-8 encoding.
Tools for fixing character encoding
Hi @EmanuelLoos,
This is very weird. Could you please execute make test in your working copy of the Lacre repository and share the output? The output should look like this:
These tests take some time to run, so be patient. They shouldn't take more than 1-2 minutes though, so if they do -- interrupt with Ctrl-C and execute the command again.
If you see any output other than the above, please paste it here.
We already have a test case for Content-Transfer-Encoding: 8bit, including an emoji, so I'm confident it's not the root cause. I've also double-checked and sent myself exactly the same message:
The result is as expected. Something else must be happening.
You can inspect Lacre logs, which can contain lots of diagnostic information when you enable debug-level logging. (Just set level=DEBUG in gpg-lacre-logging.conf in your handler.)
Hi @pfm,
is there any news on this?
Should this issue be opened again or should we open a new one?
Also, there is another weird thing going on, some emails look like this (see attached screenshot) after being encrypted by gpg-lacre:
They seem to be missing something like (copied from the source of some correct PGP message):
Hi @EmanuelLoos,
I'm so sorry, I must've missed a notification about your comment... 🙁
Regarding the previous issue -- could you please switch to Python 3.9 and try again? I can't support more Python versions at the moment, and I know that with Python 3.9 Lacre should work just fine.
This most recent report about missing MIME boundaries -- does it still happen? Do you see something in the logs? I'll think about a scenario that could lead to it, but an actual use case would be highly appreciated (please feel free to send it directly to my email to avoid attaching files with personal data to this issue).