How we found who was “poisoning” our memcached server

Memcached is a straightforward in-memory cache. You send it a key/value pair and it stores it in memory.

A while ago, one of our memcached servers was being populated with outdated data and, of course, returning outdated data.

We had to find out which server was doing that. But how? We decided it was time for some strace. We took the memcached server out of production so it would be easier to trace its system calls. After that, we attached strace to the process:

$ ssh myserver
$ pgrep -l memcached
13708 memcached
$ sudo strace -f -t -s1024 -p 13708
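A quick note on the flags, with a minimal sketch of a narrower variant. `-f` follows every thread of the process (which is why the output shows a pid other than 13708), `-t` prefixes each call with the time, and `-s1024` raises the printed string length from the default of 32 so payloads are not cut off. The wrapper name `trace_reads` is ours, for illustration only; it adds `-e trace=read`, on the assumption that only `read()` calls are of interest.

```shell
# Hypothetical wrapper around the strace call above, limited to read().
#   -f              follow every thread of the process
#   -t              prefix each call with the wall-clock time
#   -s 1024         print up to 1024 bytes of each string (default is 32)
#   -e trace=read   only show read() system calls
trace_reads() {
    # Refuse anything that is not a plain number before calling sudo.
    case "$1" in
        ''|*[!0-9]*) echo "usage: trace_reads PID" >&2; return 2 ;;
    esac
    sudo strace -f -t -s 1024 -e trace=read -p "$1"
}
```

Called as `trace_reads 13708`, it should produce the same kind of lines as the full trace, just without the other system calls.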

Then we found this very useful line:

[pid 13714] 10:32:57 read(38, "set 357933550488859b4caae308d73f2df7 2 6 252\r\n{\"status\": 401, \"container_count\": null, \"storage_policies\": {\"0\": {\"object_count\": 0, \"container_count\": 0, \"bytes\": 0}, \"1\": {\"object_count\": 0, \"container_count\": 0, \"bytes\": 0}}, \"bytes\": null, \"total_object_count\": null, \"meta\": {}, \"sysmeta\": {}}\r\n", 2048) = 300

This was the outdated data. The first number after “read” is the file descriptor from which memcached was receiving that data.
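The payload itself follows memcached's text protocol, `set <key> <flags> <exptime> <bytes>`, so this client was storing a 252-byte value with flags 2 and a 6-second expiry. The numbers add up: 46 bytes of command line, 252 bytes of data and a trailing 2-byte `\r\n` give the 300 that read() returned. As a minimal sketch, assuming the trace was saved to a file or piped in, the descriptor and those fields can be extracted with sed. The function name `parse_set_read` is hypothetical.

```shell
# Hypothetical helper: read strace output on stdin and print
# "<fd> <key> <flags> <exptime> <bytes>" for every read() call
# whose buffer starts with a memcached "set" command.
parse_set_read() {
    sed -n -E 's/.*read\(([0-9]+), "set ([^ ]+) ([0-9]+) ([0-9]+) ([0-9]+)\\r\\n.*/\1 \2 \3 \4 \5/p'
}

# Usage, with the JSON payload shortened to {...} for readability:
line='[pid 13714] 10:32:57 read(38, "set 357933550488859b4caae308d73f2df7 2 6 252\r\n{...}\r\n", 2048) = 300'
printf '%s\n' "$line" | parse_set_read
# prints: 38 357933550488859b4caae308d73f2df7 2 6 252
```

On a busy server this makes it easy to group suspicious writes by descriptor before going to look up who owns each one.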

OK, but who is “38”? Well, this is a good moment to call your good friend “lsof”. lsof can list all open files on your system, including TCP connections!

$ sudo lsof -n -p 13708|grep 38
memcached 13708 memcached 38u IPv4 2140063769 0t0 TCP 10.99.12.23:memcache->10.99.56.35:13607 (ESTABLISHED)
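One caveat: `grep 38` matches any line that contains “38” anywhere, so on a server with many connections it can return more than one line. lsof can do the filtering itself, and the remote address can be cut out of the NAME column with awk. This is only a sketch: the helper name `remote_peers` is hypothetical, and it assumes IPv4 addresses and the default lsof column layout shown above.

```shell
# Ask lsof for exactly one descriptor of one process:
# -a ANDs the selections together, -d selects the descriptor number.
#   sudo lsof -n -a -p 13708 -d 38

# Hypothetical helper: print the remote IP of every established
# TCP connection found in lsof output on stdin.
remote_peers() {
    awk '$8 == "TCP" && $10 == "(ESTABLISHED)" {
        split($9, ends, "->")          # local->remote
        sub(/:[^:]*$/, "", ends[2])    # drop the remote port
        print ends[2]
    }'
}
```

For example, `sudo lsof -n -a -p 13708 -d 38 | remote_peers` would print just the address of the client behind descriptor 38.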

GOTCHA!

Server 10.99.56.35 is connected to our memcached and sending outdated data!

After that discovery, we logged into server 10.99.56.35 and found out it had not been restarted after its last configuration change. We restarted it and everything went back to normal. Problem solved!

Next steps: we are working on improving our deployment process. We know Puppet, our configuration management tool, can automatically restart servers after a configuration file change, but we have some worries about this automation. Since we receive 1 billion requests per day, this service is quite important and we can’t risk an outage due to some bad automation.
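A cautious middle ground is to only detect the condition that bit us, a process that started before its configuration file was last modified, and leave the restart to a human. The sketch below is not what we run in production. It assumes Linux with procps-ng (`ps -o etimes=`) and GNU coreutils (`stat -c %Y`), and the function names are hypothetical.

```shell
# Pure comparison, kept separate so it is easy to test:
# succeeds when the config was modified after the process started.
is_stale() {  # is_stale START_EPOCH CONFIG_MTIME_EPOCH
    [ "$2" -gt "$1" ]
}

# Hypothetical check for one process and one config file.
# Returns 0 if a restart is pending, 1 if not, 2 on lookup errors.
needs_restart() {  # needs_restart PID CONFIG_FILE
    elapsed=$(ps -o etimes= -p "$1" | tr -d ' ')   # seconds since start
    [ -n "$elapsed" ] || return 2                  # no such process
    started=$(( $(date +%s) - elapsed ))
    mtime=$(stat -c %Y "$2") || return 2           # config mtime, epoch
    is_stale "$started" "$mtime"
}
```

Something like `needs_restart "$pid" /path/to/service.conf && echo "restart pending"` could then feed an alert instead of triggering an automatic restart.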

That's all, folks. I hope you enjoyed the post, and please feel free to comment with any tips or questions you have about it!

See you all in the next post 🙂
