I have a client who uses an external mail server to log and filter their inbound email before passing it on to their main mail system (which is Office365). The external server is running Sendmail, formerly on Debian, now on Ubuntu (Amazon's version).
The change of OS (and hosting) has also involved a bump in the version of Sendmail, from 8.14.4 on the older server (quite elderly, yes) to 8.15.2 on the newer one. Unfortunately it's also induced a slight change in behaviour and I am struggling to find whether this is controlled by a flag or other config setting.
The external server does some filtering of spam and other cruft and, up until the change of server, all of the remaining mail was successfully delivered to O365. (Some of it was marked as spam there, but it all got to their servers at least.) This is no longer true. Most of the mail which is being held and not delivered consists of bounce messages from their mailshots, but they do need to see those to clean their lists.
It seems that the mail which is not being delivered is mail which has names which are not fully qualified in the mail headers. They aren't from the hop where it arrives on the external server -- we would reject e.g. nonexistent domains at that stage -- but from earlier on in the mail's journey. So we might see headers like
Received: from EXTERNAL.com (EXTERNAL.com [NN.NN.NN.NN]) by OUR.SERVER (8.15.2/8.15.2/Debian-10) with ESMTP id xELIDED for <[email protected]>; Sat, 11 May 2019 10:49:17 GMT
Received: from EXTERNAL.com.local (EXTERNAL.com [NN.NN.NN.NN]) by EXTERNAL.com (18.104.22.168/22.214.171.124) with ESMTPS id xELIDED (version=TLSv1 cipher=ECDHE-RSA-AES256-SHA bits=256 verify=NOT) for <[email protected]>; Sat, 11 May 2019 18:49:15 +0800
-- it is the .local header which is causing the problems. But we aren't seeing these problems when the mail arrives at the server, only when it leaves the server to head to Office365. At that point, we see
EXTERNAL.com.local: Name server timeout
And the transaction concludes with
timeout writing message to OURDOMAIN-com.mail.protection.outlook.com.
<[email protected]>... Deferred: Name server: OURDOMAIN-com.mail.protection.outlook.com.: host name lookup failure
Closing connection to OURDOMAIN-com.mail.protection.outlook.com.
i.e. the eventual failure message suggests that it's failing to connect to O365. That isn't the case: the connection starts successfully, and other connections to the O365 servers are going through fine (this is a moderately high-traffic server).
Sendmail logging and tcpdump show that the initial part of the SMTP connection goes fine. In the DATA section, the headers are transmitted and then the connection terminates because Sendmail cannot look up a hostname which is not immediately relevant to the connection (it never gets as far as transmitting any of the body of the email).
As well as situations where there was a hostname within the headers which could not be resolved, I've also seen this happen on an email where the reply-to had been set to a domain which did not exist and was thus not resolvable. (Quite likely spam but that isn't really the issue here.)
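To find which queued messages are likely to be affected, a rough Python sketch along these lines can list the host names that appear in the Received and address headers and flag those that cannot exist in public DNS. This is only a diagnostic aid, not anything Sendmail itself does: the function names, the suffix list, the sample message, and the 192.0.2.1 placeholder address are all my own assumptions for illustration, and the script does no DNS lookups.

```python
import re
from email import message_from_string
from email.utils import getaddresses

# Suffixes treated as "not in public DNS"; this list is an illustrative assumption.
PRIVATE_SUFFIXES = (".local", ".lan", ".internal", ".localdomain", ".invalid")

# Host name following "from" or "by" in a Received header.
HOST_RE = re.compile(r"\b(?:from|by)\s+([A-Za-z0-9.-]+)")


def header_hostnames(raw_message):
    """Return the set of host names found in Received and address headers."""
    msg = message_from_string(raw_message)
    hosts = set()
    for value in msg.get_all("Received", []):
        hosts.update(h.rstrip(".").lower() for h in HOST_RE.findall(value))
    address_headers = []
    for name in ("From", "Reply-To", "Sender"):
        address_headers.extend(msg.get_all(name, []))
    for _display, addr in getaddresses(address_headers):
        if "@" in addr:
            hosts.add(addr.rsplit("@", 1)[1].lower())
    return hosts


def suspicious(hosts):
    """Return the sorted names that are unqualified or use a private suffix."""
    return sorted(h for h in hosts if "." not in h or h.endswith(PRIVATE_SUFFIXES))


# Hypothetical message modelled on the headers quoted above.
SAMPLE = (
    "Received: from EXTERNAL.com (EXTERNAL.com [192.0.2.1])\n"
    "\tby OUR.SERVER with ESMTP id x1; Sat, 11 May 2019 10:49:17 GMT\n"
    "Received: from EXTERNAL.com.local (EXTERNAL.com [192.0.2.1])\n"
    "\tby EXTERNAL.com with ESMTPS id x2; Sat, 11 May 2019 18:49:15 +0800\n"
    "From: bounce@EXTERNAL.com\n"
    "Reply-To: someone@no-such-domain.invalid\n"
    "\n"
    "body\n"
)

if __name__ == "__main__":
    for host in suspicious(header_hostnames(SAMPLE)):
        print(host)
```

Running it over the deferred messages should at least confirm whether every stuck mail carries a name of this kind in its headers.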
I've looked through the change logs for the relevant versions of Sendmail and haven't spotted anything obviously relevant; I've also spent a little bit of time poking through the source code. The logs from tcpdump show the connection being terminated on our side, not Microsoft's -- we have an engineer from their side attempting to help us but, because there is never a successful connection for these mails, they're struggling to see what is going on. DNS lookups for everything else seem to be working fine.
If anyone knows where to find the config which will say "don't try to look up irrelevant hostnames", I'm all ears. These aren't our misconfigurations!
Thanks in advance.