Description
On SPARC platforms, Oracle Solaris ships an unbound binary built with ADI (Application Data Integrity) enabled. This SPARC hardware feature detects memory corruption and invalid memory accesses, and it is this feature that is triggering the unbound SEGVs.
NOTE: In the following examples, "XXX" is a redaction.
Analysis of the resulting core file points to a problem while attempting to log a SERVFAIL. The unbound configuration has:
log-servfail: yes
Disabling ADI (by using elfedit(1) on the unbound binary) instead results in the SERVFAIL messages being logged. A real world example:
Jul 8 05:15:01 XXX unbound: [ID 993594 daemon.error] [11684:0] error: SERVFAIL <XXX. A IN>: all the configured stub or forward servers failed, at zone . from (inet_ntop_error) upstream server timeout
The "inet_ntop_error" is the message of interest here.
This comes from addr_to_str(). We can see addr_to_str() in the stacktrace in the core file:
$ mdb core.unbound.26674.1682413768
Loading modules: [ libc.so.1 ld.so.1 ]
unbound:core> ::status
debugging core file of unbound (64-bit) from XXX
file: /usr/sbin/unbound
initial argv: /usr/sbin/unbound
status: process terminated by SIGSEGV (Segmentation Fault) code=5 (SEGV_ADIPERR), pc=7fff590d7b918c, ADI version c mismatch for VA 7fff590bfdfa28
unbound:core> $c
addr_to_str+8(7fff590bfdfa28?, 7fff59?, 7fff590bfdefd0?, 100?, 7fff590bfdedd0?, d00000890beaaf68?)
errinf_reply+0x12c(d00000890beaa2a0?, d00000890beaa6d8?, d00000890beaa870?, d00000890beaa870?, 1?, 7fff590bfdefd0?)
processQueryTargets+0x189c(d00000890beaa2a0?, d00000890beaa6d8?, b000008909ed9ee0?, 1?, 1000000000000?, d00000890beaa870?)
iter_handle+0x544(d00000890beaa2a0?, d00000890beaa6d8?, b000008909ed9ee0?, 1?, 7fff590d741e70?, 7fff590d77cf18?)
iter_operate+0x384(d00000890beaa2a0?, 3?, 1?, d00000890beaaef8?, d00000890beaa2a4?, d00000890beaa2a0?)
mesh_run+0x80(400000890b9772a0?, d00000890beaa250?, 3?, d00000890beaaef8?, 7fff590d744bb0?, d00000890beaa2a0?)
...
Running through the call sequence, errinf_reply() is attempting to "add response specific error information for log servfail". It calls addr_to_str() passing "fail_reply" (a copy of a pointer to a struct comm_reply). In turn, addr_to_str() calls inet_ntop(), which first validates the address family; failure to validate means inet_ntop() returns NULL, and it's this NULL that results in addr_to_str() producing the "(inet_ntop_error)" string.
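The fallback behaviour described above can be sketched in isolation. This is a minimal illustration, not unbound's actual addr_to_str(): the helper name format_addr() and its return convention are assumptions made for the example. It relies only on the documented behaviour of inet_ntop(), which returns NULL when given an address family it does not support.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>

/* Hypothetical helper, loosely modelled on the behaviour described above.
 * Formats the address held in `ss` into `buf`. If inet_ntop() rejects the
 * address family, a marker string is written instead.
 * Returns 1 on success, 0 when the fallback string was used. */
static int format_addr(const struct sockaddr_storage *ss, char *buf, size_t len)
{
    const void *src;

    if (ss->ss_family == AF_INET6)
        src = &((const struct sockaddr_in6 *)ss)->sin6_addr;
    else
        src = &((const struct sockaddr_in *)ss)->sin_addr;

    /* An unsupported family makes inet_ntop() return NULL. */
    if (inet_ntop(ss->ss_family, src, buf, (socklen_t)len) == NULL) {
        snprintf(buf, len, "(inet_ntop_error)");
        return 0;
    }
    return 1;
}
```

Passing a structure whose bytes have been overwritten with unrelated data (so the family field no longer holds AF_INET or AF_INET6) takes the fallback path, which matches the "(inet_ntop_error)" text seen in the log.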
So why does the address family validation fail?
Debug logging showed that lookups were failing with both THROWAWAY responses and timeouts. Code inspection led to the following few lines:
iterator/iterator.c
static int
processQueryResponse(struct module_qstate* qstate, struct iter_qstate* iq,
struct iter_env* ie, int id)
{
...
} else if(type == RESPONSE_TYPE_THROWAWAY) {
/* LAME and THROWAWAY responses are handled the same way.
* In this case, the event is just sent directly back to
* the QUERYTARGETS_STATE without resetting anything,
* because, clearly, the next target must be tried. */
verbose(VERB_DETAIL, "query response was THROWAWAY");
} else {
The "without resetting anything" comment is what stands out.
Rather than attempt to craft a DNS environment which results in response_type_from_server() returning RESPONSE_TYPE_THROWAWAY, response_type_from_server() was modified to always return RESPONSE_TYPE_THROWAWAY.
unbound.conf was then given two forward-addr settings: one for a working DNS server, the other for a machine with no DNS service.
Finally, a script which fires a number of dig(1) queries at unbound completes the test case.
Without ADI enabled, the log messages seen looked like this:
[1692363449] unbound[1500:0] error: SERVFAIL <XXX. A IN>: all the configured stub or forward servers failed, at zone . from (inet_ntop_error) upstream server timeout
This is very close to the original failure. The suspect code and comment suggest that the following is happening:
- query sent to working configured DNS server
- response passed to response_type_from_server(), which is artificially always returning RESPONSE_TYPE_THROWAWAY
- processQueryResponse() as cited above moves to try the next target
- no response from second target (it doesn't have a DNS service)
- logging of SERVFAIL ends up in errinf_reply() which finds a "fail_reply" pointer and attempts to use it
- however, this pointer was for the THROWAWAY response, and the underlying memory has since been reused
- without ADI, inet_ntop() fails with the described address family validation
- with ADI the HW SEGVs the process with "this isn't the memory you are looking for"
Changing processQueryResponse() to clear "fail_reply", i.e.:
* because, clearly, the next target must be tried. */
iq->fail_reply = NULL;
verbose(VERB_DETAIL, "query response was THROWAWAY");
} else {
gives us a quick fix, as we aren't leaving an old pointer lying around.
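The sequence and the effect of the quick fix can be modelled with a small stand-alone sketch. Everything here is hypothetical and only mirrors the shape of the problem: struct reply, struct query_state, rx_slot and the function names are illustrative stand-ins, not unbound's real types or functions. Storage that stays allocated but gets overwritten stands in for the recycled memory, so the sketch itself has defined behaviour.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>

/* Hypothetical stand-ins; these are not unbound's real structures. */
struct reply { struct sockaddr_in addr; };
struct query_state { const struct reply *fail_reply; };

/* Storage reused for each incoming reply, standing in for memory that is
 * recycled once a response has been processed. */
static struct reply rx_slot;

/* A reply from 192.0.2.1 arrives; the query remembers where it lives. */
static void receive_reply(struct query_state *q)
{
    memset(&rx_slot, 0, sizeof(rx_slot));
    rx_slot.addr.sin_family = AF_INET;
    inet_pton(AF_INET, "192.0.2.1", &rx_slot.addr.sin_addr);
    q->fail_reply = &rx_slot;
}

/* The reply is classified as THROWAWAY. With clear_stale set, the saved
 * pointer is dropped (the quick fix); otherwise it is left in place. */
static void handle_throwaway(struct query_state *q, int clear_stale)
{
    if (clear_stale)
        q->fail_reply = NULL;
}

/* The storage is recycled for unrelated data. */
static void reuse_storage(void)
{
    memset(&rx_slot, 0xEE, sizeof(rx_slot));
}

/* Build the "from ..." part of a SERVFAIL message.
 * Returns -1 if there is no reply to report, 0 if the address could not
 * be formatted, 1 on success. */
static int describe_failure(const struct query_state *q, char *buf, size_t len)
{
    if (q->fail_reply == NULL) {
        buf[0] = '\0';
        return -1;
    }
    if (inet_ntop(q->fail_reply->addr.sin_family,
                  &q->fail_reply->addr.sin_addr,
                  buf, (socklen_t)len) == NULL) {
        snprintf(buf, len, "(inet_ntop_error)");
        return 0;
    }
    return 1;
}
```

Calling receive_reply(), handle_throwaway(&q, 0), reuse_storage() and then describe_failure() reads the overwritten storage through the stale pointer and yields the fallback string. With handle_throwaway(&q, 1) the pointer is gone before the storage is reused, so describe_failure() has nothing stale to read.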
Note: the change to response_type_from_server() is an ugly hack that only makes the circumstances easier to reproduce. In the real world, the DNS environment was producing a RESPONSE_TYPE_THROWAWAY return value from time to time.
Finally, this is only a quick fix; I'm sure a more elegant and complete solution can be implemented.