We had a cluster fail in production. It turned out that a value in our key-value store was wrong, so our Python code was frequently exiting with an unhandled exception. This apparently caused a fast memory leak, and all the Unbound servers in the cluster started getting killed by the OOM killer.
I reproduced the problem in our test environment, and under valgrind I could clearly see Python leaking. We have had similarly high rates of unhandled exceptions during earlier incidents on older versions of Unbound, but the OOM killer never got involved. I wonder whether this is related to a change in the Unbound codebase (we recently upgraded to 1.13.2 from a very old version), though it looks like a Python bug to me.
Below is the valgrind stack trace of the biggest leak. We tried adding gc.collect() (manually triggering a garbage collection run) to our code, but it had no apparent effect.
==4241== 52,405,144 bytes in 595,513 blocks are possibly lost in loss record 3,991 of 3,996
==4241== at 0x4C28BE3: malloc (vg_replace_malloc.c:299)
==4241== by 0x512E593: PyObject_Malloc (in /usr/lib64/libpython2.7.so.1.0)
==4241== by 0x51B8D58: _PyObject_GC_Malloc (in /usr/lib64/libpython2.7.so.1.0)
==4241== by 0x51B8E95: _PyObject_GC_NewVar (in /usr/lib64/libpython2.7.so.1.0)
==4241== by 0x514012E: PyTuple_New (in /usr/lib64/libpython2.7.so.1.0)
==4241== by 0x518521F: PyEval_EvalFrameEx (in /usr/lib64/libpython2.7.so.1.0)
==4241== by 0x518B1FC: PyEval_EvalCodeEx (in /usr/lib64/libpython2.7.so.1.0)
==4241== by 0x5187609: PyEval_EvalFrameEx (in /usr/lib64/libpython2.7.so.1.0)
==4241== by 0x518B1FC: PyEval_EvalCodeEx (in /usr/lib64/libpython2.7.so.1.0)
==4241== by 0x511571F: ??? (in /usr/lib64/libpython2.7.so.1.0)
==4241== by 0x50F0FE2: PyObject_Call (in /usr/lib64/libpython2.7.so.1.0)
==4241== by 0x50F10C4: ??? (in /usr/lib64/libpython2.7.so.1.0)
==4241== by 0x50F119D: PyObject_CallFunction (in /usr/lib64/libpython2.7.so.1.0)
==4241== by 0x4B1D00: pythonmod_operate (pythonmod.c:530)
==4241== by 0x4655DF: mesh_run (mesh.c:1704)
==4241== by 0x42946B: worker_handle_request (worker.c:1573)
==4241== by 0x4AEB70: comm_point_udp_callback (netevent.c:769)
==4241== by 0x4E7C36: handle_select (mini_event.c:220)
==4241== by 0x4E7DEB: minievent_base_dispatch (mini_event.c:242)
==4241== by 0x4AED1B: comm_base_dispatch (netevent.c:246)
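
As a stopgap while the leak itself is investigated, one option is to make sure no exception ever escapes from the Python callback into `pythonmod_operate`. The sketch below is not a fix for the leak and assumes the leak is tied to the unhandled-exception path. The decorator `never_raise`, the handler `mark_failed`, and the logger name are hypothetical, and plain dicts stand in for the `qstate` and `qdata` objects that Unbound passes in.

```python
import functools
import logging

log = logging.getLogger("pythonmod_guard")  # hypothetical logger name


def never_raise(on_error):
    """Wrap a callback so that no exception propagates to the caller."""
    def decorate(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            try:
                return func(*args, **kwargs)
            except Exception:
                # Record the traceback, then hand over to the fallback handler.
                log.exception("unhandled exception in %s", func.__name__)
                return on_error(*args, **kwargs)
        return wrapper
    return decorate


def mark_failed(id, event, qstate, qdata):
    # Stand-in: in a real Unbound script this is where the query would be
    # flagged as failed (for example by setting the module's ext_state).
    qstate["failed"] = True
    return True


@never_raise(mark_failed)
def operate(id, event, qstate, qdata):
    # Simulates reading a value that may be malformed in the key-value store.
    qstate["ttl"] = int(qdata["ttl"])
    return True
```

With this in place, a malformed value is logged and the query is marked as failed, rather than the exception surfacing in the C caller on every request.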