An organization experienced delays in receiving emails. This doesn’t seem critical until email is used for time-limited authentication, such as for Microsoft 365 or portal registrations. The transmission delay caused authentication timeouts, preventing users from accessing critical business applications and support portals.
The intermittent nature of the failures made troubleshooting the issue more difficult. This article shows how network data was captured and analyzed to identify the root cause of this problem.
Troubleshooting
The infrastructure consists of the sending client, the client-side mail server, the Internet edge firewalls of the affected organization, a Mail Transfer Agent (MTA), a DMZ firewall, and an internal Mail Exchange server.

Figure 1: Overview of the affected infrastructure. Redundancies removed to show only the relevant parts.
The first step in troubleshooting is to check what changed last before the problem occurred. In this situation, a new Mail Transfer Agent had been installed, and the Internet edge firewall had been updated to a new release. The administrators responsible for the infrastructure therefore began by reviewing the log files of both systems.
On the firewall, they saw no dropped packets, not even in the integrated IPS logs. The firewall’s integrated MTA functionality was disabled, and the Destination Network Address Translation (DNAT) rules matched correctly.
The next step was to review the log files of the Mail Transfer Agent in the DMZ. The MTA, a virtual appliance built on Postfix, reported only a timeout while waiting for the source. Several sources were involved, from Apple to Google to Microsoft, but only about half of the emails timed out.
Mar 03 08:14:21 mail postfix/smtpd[18452]: connect from client-4b176ee1cd.local [10.34.56.0]
Mar 03 08:14:21 mail postfix/smtpd[18452]: 220-client-266ac34bd7.local
Mar 03 08:15:21 mail postfix/smtpd[18452]: timeout after CONNECT from client-4b176ee1cd.local [10.34.56.0]
Mar 03 08:15:21 mail postfix/smtpd[18452]: disconnect from client-4b176ee1cd.local [10.34.56.0] commands=0/0
Listing 1: Postfix log file with connection timeout. IP addresses and hostnames sanitized.
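To find every affected session rather than a single example, the log pattern from Listing 1 can be searched automatically. The following Python snippet is a minimal sketch, not the tooling used in this case. It assumes that log lines look exactly like the excerpt above, and the function name find_connect_timeouts and the year parameter are illustrative choices (syslog timestamps carry no year).

```python
import re
from datetime import datetime

# Sample lines copied from Listing 1 (already sanitized).
LOG = """\
Mar 03 08:14:21 mail postfix/smtpd[18452]: connect from client-4b176ee1cd.local [10.34.56.0]
Mar 03 08:14:21 mail postfix/smtpd[18452]: 220-client-266ac34bd7.local
Mar 03 08:15:21 mail postfix/smtpd[18452]: timeout after CONNECT from client-4b176ee1cd.local [10.34.56.0]
Mar 03 08:15:21 mail postfix/smtpd[18452]: disconnect from client-4b176ee1cd.local [10.34.56.0] commands=0/0
"""

# Matches only "connect from" and "timeout after CONNECT from" lines.
LINE_RE = re.compile(
    r"^(?P<ts>\w{3} \d{2} \d{2}:\d{2}:\d{2}) \S+ postfix/smtpd\[(?P<pid>\d+)\]: "
    r"(?P<event>connect|timeout after CONNECT) from \S+ \[(?P<ip>[^\]]+)\]"
)


def find_connect_timeouts(log_text, year=2000):
    """Return (ip, seconds_waited) for each session that timed out after CONNECT."""
    connects = {}  # (pid, ip) -> timestamp of the matching connect line
    timeouts = []
    for line in log_text.splitlines():
        m = LINE_RE.match(line)
        if not m:
            continue
        # The year is an assumption; only the time difference matters here.
        ts = datetime.strptime(f"{year} {m['ts']}", "%Y %b %d %H:%M:%S")
        key = (m["pid"], m["ip"])
        if m["event"] == "connect":
            connects[key] = ts
        elif key in connects:
            waited = (ts - connects.pop(key)).total_seconds()
            timeouts.append((m["ip"], waited))
    return timeouts


print(find_connect_timeouts(LOG))
```

For the excerpt in Listing 1, this reports one session from 10.34.56.0 in which exactly 60 seconds passed between the connect and the timeout. The extracted addresses are the ones to look for in the packet captures.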
How to tackle the issue, and where and how to capture it?
The analysts asked themselves which timeout had been triggered, and why. They decided to look at the packet data. Because multiple components were involved, they set up a multipoint capture at the WAN interface of the Internet edge firewall and between the MTA and the firewall, to be running when the failure occurred.
No TAPs were installed in the data center for packet capture. Because of the high link load, they decided to capture in-line with ProfiShark at both locations, so that no packets would be lost to the capture limitations inherent in SPAN or host-based capturing.

Figure 2: Capture points for multi-point capture with ProfiShark.
The next step was to analyze the capture traces in Wireshark. Mail traffic uses SMTP, which runs on top of TCP. The analysts took the IP address of a failed connection from the log and applied a display filter on it (ip.addr == 10.34.56.0).
With this display filter applied, they isolated a single TCP stream by right-clicking a packet in the packet list and choosing "Follow" and then "TCP Stream". The first thing to check was the TCP handshake, which was successful.
Right after the handshake, something odd appeared. The greeting reply code 220 was followed directly by a "-", with the domain string attached to it. According to RFC 5321, a hyphen after the reply code is valid only on the non-final lines of a multiline reply; the last line must have a space after the reply code. Here the greeting consisted of a single line before "\r\n", so the character after the code had to be a space.
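When several failed connections from the log need to be checked, the display filter can be generated instead of typed by hand. This is a small sketch under stated assumptions: the helper name build_display_filter is hypothetical, the optional restriction to TCP port 25 assumes SMTP runs on its standard port (the case description does not state the port), and 2001:db8::1 is a documentation address used only to show the IPv6 form.

```python
import ipaddress


def build_display_filter(addresses, tcp_port=None):
    """Build a Wireshark display filter that matches any of the given IP addresses."""
    terms = []
    for raw in addresses:
        ip = ipaddress.ip_address(raw)  # raises ValueError on malformed input
        field = "ip.addr" if ip.version == 4 else "ipv6.addr"
        terms.append(f"{field} == {ip}")
    if not terms:
        raise ValueError("at least one address is required")
    expr = " || ".join(terms)
    if len(terms) > 1:
        expr = f"({expr})"
    if tcp_port is not None:
        # Optional narrowing to one TCP port, e.g. 25 for SMTP (assumption).
        expr = f"{expr} && tcp.port == {tcp_port}"
    return expr


print(build_display_filter(["10.34.56.0"]))
print(build_display_filter(["10.34.56.0", "2001:db8::1"], tcp_port=25))
```

The first call reproduces the filter used in this analysis. The resulting string can be pasted into the display filter bar in Wireshark.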
The format for multiline replies requires that every line, except the last, begin with the reply code, followed immediately by a hyphen, "-" (also known as minus), followed by text. The last line will begin with the reply code, followed immediately by <SP>, optionally some text, and <CRLF>. As noted above, servers SHOULD send the <SP> if subsequent text is not sent, but clients MUST be prepared for it to be omitted.
For example:
250-First line
250-Second line
250-234 Text beginning with numbers
250 The last line
Listing 2: Excerpt from RFC 5321 with clarification on SMTP Greeting format.
Source: https://datatracker.ietf.org/doc/html/rfc5321
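The rule quoted above can be turned into a small check for replies extracted from a capture. The following Python sketch is illustrative only and covers just the separator and reply code rules of RFC 5321; the function name check_reply and the wording of the messages are assumptions, not part of any tool used in this case.

```python
import re

# Three-digit reply code, then "-" (more lines follow) or " " (last line), then text.
REPLY_LINE = re.compile(r"^(?P<code>[2-5][0-9]{2})(?P<sep>[- ]?)(?P<text>.*)$")


def check_reply(raw):
    """Return a list of format problems found in one complete SMTP reply (bytes)."""
    text = raw.decode("ascii", errors="replace")
    problems = []
    if not text.endswith("\r\n"):
        problems.append("reply is not terminated by CRLF")
    lines = text.split("\r\n")
    if lines and lines[-1] == "":
        lines.pop()  # drop the empty element after the final CRLF
    if not lines:
        return problems + ["empty reply"]
    codes = set()
    for number, line in enumerate(lines, start=1):
        m = REPLY_LINE.match(line)
        if not m:
            problems.append(f"line {number}: no valid reply code")
            continue
        codes.add(m["code"])
        is_last = number == len(lines)
        if is_last and m["sep"] == "-":
            problems.append(f"line {number}: last line uses '-', so more lines are expected")
        elif is_last and m["sep"] == "" and m["text"]:
            problems.append(f"line {number}: text follows the code without a space")
        elif not is_last and m["sep"] != "-":
            problems.append(f"line {number}: continuation line must use '-' after the code")
    if len(codes) > 1:
        problems.append("reply code differs between lines")
    return problems


print(check_reply(b"220-client-266ac34bd7.local\r\n"))  # greeting seen in this case
print(check_reply(b"220 client-266ac34bd7.local\r\n"))  # same greeting with a space
```

The first call flags the greeting from the capture because its only line uses the hyphen. The second call, with a space after the code, and the four-line example from the RFC both pass without findings.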
After this invalid greeting, the sending client did not respond. It waited until a correct 220 SMTP greeting arrived in packet 1222 (which was incorrectly marked as a retransmission) and only then sent its SMTP EHLO message in packet 1224.
The analysts assumed that the EHLO was received too late, so the MTA closed the connection and logged a timeout. A support ticket was raised with the vendor of the MTA, which is built on the open source project Postfix. The vendor fixed the faulty greeting message with a patch.

Figure 3: WAN side capture in Wireshark.
Sometimes a second look is needed
After implementing the fix for the greeting message, the original issue persisted, so a second look was needed. New captures were made with ProfiShark at the external and DMZ interfaces of the Internet edge firewall.
At the external interface, EHLO messages were retransmitted, just as in packets 1224 to 1409 of the old capture file. At the DMZ interface, the EHLO packets were not visible, so the MTA kept retransmitting its 220 greeting there. The conclusion was that the firewall was dropping the packets in between. Because its logs didn’t report any dropped packets, the firewall vendor was brought in.
It turned out that the integrated MTA within the firewall inspected SMTP packets even when SMTP inspection was disabled, as was the case here. In the end, the firewall vendor delivered a patch that stopped the firewall from inspecting EHLO packets when all SMTP checks were disabled.
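The comparison of the two capture points can also be expressed as a simple set difference. The sketch below is one possible way to do it, not the method used by the analysts. It assumes that segments from both captures have already been exported into records, and that the firewall rewrites only the destination address (DNAT) while leaving source address, source port, sequence numbers, and payload untouched. The Segment type, the function name missing_downstream, and all sample values are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """Fields of a client-to-server TCP segment that survive DNAT (assumption)."""
    src_ip: str
    src_port: int
    seq: int
    payload: bytes


def missing_downstream(upstream, downstream):
    """Return payload-carrying upstream segments that never show up downstream."""
    seen = set(downstream)
    missing = [s for s in upstream if s.payload and s not in seen]
    # Retransmissions are identical records; collapse them but keep the order.
    return list(dict.fromkeys(missing))


# Made-up sample data: the EHLO is seen three times at the external interface
# (original plus retransmissions) and never at the DMZ interface.
ehlo = Segment("10.34.56.0", 40000, 1, b"EHLO client.example\r\n")
external_capture = [ehlo, ehlo, ehlo]
dmz_capture = []

for segment in missing_downstream(external_capture, dmz_capture):
    print(segment.src_ip, segment.seq, segment.payload)
```

Any segment reported by this comparison entered the firewall but did not leave it, which is the pattern that pointed to the firewall in this case. If the device in between also rewrites source ports or sequence numbers, the matching key has to be reduced accordingly, for example to the payload alone.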
How ProfiShark helped and future plans
ProfiShark fit this situation perfectly because the customer had not prepared their network for such a case by installing TAPs in their data center infrastructure. They needed a portable, lightweight solution for high-fidelity packet capture that could be quickly positioned at the relevant capture points.
For future cases, the organization is considering permanently tapping their network so that they won’t need to disconnect physical links in the data center to get full visibility with in-line capture. In the end, truth lies in the packets.
