And also why and when you shouldn't use memcacheq/memcachedb under high load.
So, one day I was given the task of developing a CDN system for our internal use.
The requirements: it should have easily scalable storage with replication (replication also takes some workload off the nodes), it should have an API for the frontends, and, since we have lots of them, it has to be fast. The thing is, files are stored on a frontend first, so the CDN needs to fetch them somehow and then clone them to all nodes.
For the first version I took memcachedb for the API queue (not really a queue, actually, just temporary storage while files are being fetched from the frontend) and memcacheq for the real queues on the storage nodes.
So it kinda works like this:
- frontend sends a request to the api
- api sends the request to one of the storage nodes and saves it in memcachedb
- the node accepts the request from the api and stores it in memcacheq
- a downloader takes the task from memcacheq and downloads the file
- the downloader sends a request back to the api
- api deletes the request from memcachedb
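The flow above can be sketched in a few lines of Python, with in-memory stand-ins for memcachedb and memcacheq (all names here are hypothetical, just to show where each copy of a request lives):

```python
from collections import deque

# Hypothetical in-memory stand-ins for memcachedb (key-value store)
# and memcacheq (queue) -- semantics only, not the real daemons.
memcachedb = {}       # api-side "backup" of in-flight requests
memcacheq = deque()   # job queue on a storage node

def api_accept(request_id, url):
    # api saves the request in memcachedb, then hands it to a storage node
    memcachedb[request_id] = url
    node_accept(request_id, url)

def node_accept(request_id, url):
    # the node stores the job in its memcacheq
    memcacheq.append((request_id, url))

def downloader_step(download):
    # a downloader takes a task from memcacheq, downloads the file,
    # then reports back so the api can drop its copy
    request_id, url = memcacheq.popleft()
    download(url)
    del memcachedb[request_id]   # api deletes the request

api_accept(1, "http://frontend/file.bin")
downloader_step(lambda url: None)  # pretend the download succeeded
print(memcachedb, list(memcacheq))
```

Note that between `popleft` and the final delete, the only surviving copy of the request is the one in memcachedb, which is exactly where the first problem below comes from.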
It worked fine while we tested it on a small number of files.
Problems arose when we started moving existing data to the CDN (tens of thousands of files).
First problem: if the downloader crashes, the only "backup" of the request lives in memcachedb on the api server, with no easy way to get it out (since neither memcached nor memcachedb has a "list keys" feature). The proper way would be for the downloader to push the job it is currently processing to a separate queue and delete it from there on success, while some other daemon re-queues failed jobs. But, damn, there's no such thing as "delete" in memcacheq.
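That "proper way" is the classic reliable-queue pattern. A minimal sketch, with plain Python lists standing in for the queues (all names hypothetical):

```python
pending = ["job-1", "job-2"]   # main queue
processing = []                # per-worker backup queue

def take_job():
    # move the job to a backup queue *before* working on it
    job = pending.pop(0)
    processing.append(job)
    return job

def ack(job):
    # on success, delete the job from the backup queue --
    # exactly the "delete" operation that memcacheq lacks
    processing.remove(job)

def requeue_stale():
    # a watchdog daemon puts crashed workers' jobs back on the main queue
    while processing:
        pending.append(processing.pop())

job = take_job()
ack(job)            # normal path: the job is gone from both queues
job = take_job()    # the worker "crashes" here, job-2 is stuck in processing
requeue_stale()
print(pending, processing)   # job-2 is back in pending
```

The crucial property is that at no moment does the job exist only in a worker's memory; it is always in at least one queue until it is explicitly acknowledged.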
Second problem: when writing/reading tons of data, memcacheq simply fails to store it properly (it uses BDB instead of memory), and some jobs end up concatenated together.
There were also other things, like proper logging with an interface to view the logs, and monitoring of the nodes with proper load balancing between them. For all of that I would, sooner or later, have had to reinvent some bicycle with square wheels, if the problems above hadn't stopped me first.
Well, yes, it was my fault, as the system architect, for choosing these solutions.
The solution came suddenly, while I was installing GitLab. GitLab uses Resque for its background jobs, which, in turn, uses Redis for the queue.
Wait a second, I thought, isn't Redis just another hipster NoSQL DB?
So I checked the docs, and oh god: it supports key-value storage, like memcached, which I can use for logging, but also lists, which I can use for queues, and sets (even sorted ones!), which I can use for node balancing. I can push to, pop from, and delete from lists by value! I can even atomically pop from one list and push onto another while still getting the value back. It supports async and blocking queries, and even pub/sub.
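For illustration, here's a toy in-memory class that mimics the semantics of the list commands involved (LPUSH, RPOP, LREM, and the atomic pop-and-push, RPOPLPUSH). This is a sketch of the command semantics only, not a Redis client; real code would talk to a server via hiredis or another client library:

```python
class ToyRedis:
    """In-memory stand-in mimicking a few Redis list commands (semantics only)."""
    def __init__(self):
        self.lists = {}

    def lpush(self, key, value):            # push onto the head of a list
        self.lists.setdefault(key, []).insert(0, value)

    def rpop(self, key):                    # pop from the tail of a list
        return self.lists[key].pop() if self.lists.get(key) else None

    def lrem(self, key, value):             # delete from a list by value
        self.lists.get(key, []).remove(value)  # simplified: real LREM also takes a count

    def rpoplpush(self, src, dst):          # atomically pop from one list,
        value = self.rpop(src)              # push onto another, and return the value
        if value is not None:
            self.lpush(dst, value)
        return value

r = ToyRedis()
r.lpush("jobs", "fetch:file-1")
job = r.rpoplpush("jobs", "processing")   # take a job, keeping a backup copy
# ...download the file...
r.lrem("processing", job)                 # done: delete the backup by value
print(job, r.lists)
```

This is exactly the reliable-queue pattern that was impossible with memcacheq, done in two commands.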
Also, it's event-loop based, written in C, and BLAZINGLY fast.
It was like a silver bullet for my task; it fit my infrastructure that well.
Since it supports authorization, I can push jobs directly from the api to Redis on a storage node (while, of course, maintaining a whitelist in iptables), instead of going through my poorly written daemon whose only job was to take a request and put it into memcacheq. I can push cloning jobs directly to the other nodes, and so on.
The number of network daemons needed went down from 7 to 4, which makes the whole infrastructure a lot easier to debug.
And it comes with a nice C binding called hiredis (and a C++ counterpart, hiredispp; both are somewhat abandoned and incomplete, but they still work and are easy to hack on).
Redis supports a lot more features; be sure to take a look at the docs at http://redis.io
So, in conclusion:
- Do not, I repeat, DO NOT use memcacheq/memcachedb in high-requests-per-second situations. They're called memcache-something only because they speak the memcached protocol. Under the hood they use BDB for persistence, which is nice, but not really reliable under heavy I/O.
- Redis IS cool. You can use it for queues, for storing logs (not sure about this one; an SQL solution might be better), for counters, and for other simple and not-so-simple data that changes rapidly.
But, as always, remember not to compare NoSQL to SQL. They don't compete; they complement each other. There are tasks for which SQL is best, tasks for which key-value stores are best, and tasks for which document-oriented DBs are best.