CouchDB monitoring: you're doing it wro...you can do it better!

Why do I need to care about monitoring?

The main purpose of application metrics is to answer one simple question: "what's going on there?" Why does our app work slowly? Or maybe it works fast enough? Will we face issues in the near future? Or are we experiencing them right now? All these questions matter if you're going to provide a high quality service, and yes, wise metrics analysis even gives you the ability to predict (a little) the future!

GET /_stats HTTP/1.1

First of all, we'll take a look at the CouchDB /_stats resource, which contains a lot of various CouchDB metrics: httpd requests and responses, open databases and auth cache, continuous changes feed listeners, view reads, and more. These metrics come with good self-descriptions, but let's focus on the most important ones.
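To get a feel for what's in there, the whole stats tree can be fetched with a few lines of Python - a minimal sketch, assuming a default local instance at 127.0.0.1:5984:

```python
import json
from urllib.request import urlopen

def fetch_stats(base_url="http://127.0.0.1:5984"):
    """Fetch the whole stats tree from a CouchDB instance."""
    with urlopen(base_url + "/_stats") as resp:
        return json.load(resp)

def metric(stats, group, name):
    """Pull a single metric object out of a parsed /_stats response."""
    return stats[group][name]

# Against a live instance:
#   stats = fetch_stats()
#   print(metric(stats, "couchdb", "request_time")["mean"])
```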

Request Methods

Every request to CouchDB starts with an HTTP method, which defines the operation's semantics and the behavior of its results. CouchDB operates with the usual HTTP methods - COPY, DELETE, GET, HEAD, POST and PUT - and tracks stats for each of them. How can this information help us?

As was said, the request method defines the operation's semantics. HEAD requests are used to check the existence of databases, documents or attachments; GET requests fetch data; a lot of PUT requests signals a large set of update operations. On the other hand, if you see a lot of DELETE requests, there is open room for optimization: mass deletes are better and more effective when made via bulk update requests.
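As an illustration of that optimization, here's a sketch of how N individual DELETEs can be collapsed into a single bulk request (the helper name and input shape are my own, for illustration only):

```python
def bulk_delete_payload(docs):
    """Build a /_bulk_docs body that deletes the given documents at once.

    Each input doc needs at least `_id` and `_rev` (e.g. taken from
    /db/_all_docs rows).
    """
    return {"docs": [{"_id": d["_id"], "_rev": d["_rev"], "_deleted": True}
                     for d in docs]}

# POST the result to /dbname/_bulk_docs with Content-Type: application/json:
# one HTTP request instead of one DELETE per document.
```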

Putting all these metrics together on a single graph easily answers the question of how your clients use your CouchDB instance.

Request methods

Request Times

Every request takes some amount of time to serve. The faster it's processed, the faster the client receives the response, and the happier your users are (and you too). Long request times lead to a "slow" service and request timeouts, which are bad.

CouchDB provides the request_time metric, which represents the length of a request inside CouchDB without MochiWeb - not the full picture, but still one that helps to estimate the quality of service of your CouchDB instance.

Unfortunately, CouchDB doesn't provide per-request statistics about the time spent, but it does provide other useful values: the stddev and mean of request times. So we can know the average time of our requests and their standard deviation too. The difference between them is in the details: the stddev value shows how far request times spread around the mean (roughly 68% of them fall within the +/- 1σ range, 34.1% on each side of the mean), which helps to spot the rare, very fast or very slow requests that a bare average would hide.

Looking at the following stats:

"request_time": {
    "description": "length of a request inside CouchDB without MochiWeb",
    "current": 409002.771,
    "sum": 409002.771,
    "mean": 51.687,
    "stddev": 287.39,
    "min": 0.6666666666666667,
    "max": 5014.25
}

we can say that the slowest request took about 5 seconds to serve, while an average request was processed by CouchDB in ~52ms - though with a standard deviation of ~287ms, the spread is quite wide.

Request times

Since the stddev and mean values are computed from all requests ever served, they aren't very accurate for a long-running service. To make them better, let's take a look at the samples configuration option. You've probably wondered why it exists and what it affects. Well, it allows getting statistics for a specified time frame (in seconds). The default value is [0, 60, 300, 900], which means we can receive metric values for all time (0), for the last minute, 5 minutes and 15 minutes. In other words, we can use it to measure stddev and mean within a certain time frame, without the side effect of all previous requests, and get the most accurate values that reflect reality.
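The stats API lets you request an individual metric, and in the 1.x line a sample frame can be passed via the range query parameter - treat the URL layout below as an assumption and check your version's documentation:

```python
def request_time_url(base_url="http://127.0.0.1:5984", seconds=60):
    """URL asking for request_time stats of the last `seconds` only.

    `seconds` must match one of the frames configured in the `samples`
    option, e.g. the default [0, 60, 300, 900].
    """
    return "%s/_stats/couchdb/request_time?range=%d" % (base_url, seconds)
```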

Requests by Type

There are some specific requests that require additional attention. CouchDB provides a special httpd metrics group for them:

Requests by type

How this information could be useful?

A bulk update operation costs a single POST request, but it may cause a lot of changes and significantly boost the database write rate (see the Databases I/O section below).

Temporary view reads almost always signal execution of a temporary view. That's a really bad operation, especially for huge databases with millions of documents. You'll be happy to keep this metric around 0.

View reads is a good metric to determine how often your views are used. Also, if a lot of view read requests come in while the view index isn't built yet, the resulting request time for them could be huge.

Continuous Changes Feed Listeners

Another special kind of request is subscribing to a continuous changes feed. Why is it so special that we're discussing it separately? Because unlike other requests, it follows not the request-response form, but request-response-response-response-...-response. During this communication session the client receives all database changes instantly.
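Such a listener can be sketched in a few lines of standard-library Python (the database name "db" and the local URL are placeholders):

```python
import json
from urllib.request import urlopen

def watch_changes(base_url="http://127.0.0.1:5984", db="db"):
    """Yield rows from a continuous _changes feed as they arrive."""
    url = "%s/%s/_changes?feed=continuous&heartbeat=10000" % (base_url, db)
    with urlopen(url) as resp:
        for raw in resp:          # one JSON object per line
            line = raw.strip()
            if line:              # heartbeats arrive as bare newlines
                yield json.loads(line)

# Against a live instance:
#   for change in watch_changes(db="mydb"):
#       print(change["seq"], change["id"])
```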

Continuous changes feeds listeners

This metric can also be used to approximate the number of pull replications served on other instances that use yours as a data source. But the key word is "approximately", since you (and CouchDB too) can't be sure about that - these listeners could just be clients monitoring database changes, nothing more.

Response Status Codes

Finally, when a request has been served, the response returns back to the client. What does every HTTP response start with? Right, an HTTP status code, which defines the nature of the response: was it completed successfully, did some error happen, or maybe the request wasn't even correct.

Response status codes

Keeping an eye on the amount of HTTP 4xx and 5xx responses helps you provide a quality service for your users. Normally, you want to see no 500 errors at all. A high amount of 401 errors may point to authentication problems, while 403 tells you that something or someone is actively doing things they shouldn't do. A notable amount of 409s signals high update concurrency for some resources.

Do you need to track the other codes? Yes, you do, since otherwise you'll never know what share of all responses the errors take. "2 HTTP 500 errors" by itself means nothing: it's bad if there were only 10 successful responses, but could be ignored (unless you're a perfectionist) if 100000 HTTP 200 OK were returned.
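That reasoning is easy to put into code - computing the share of error responses given per-status-code counters (the input shape here is my own, for illustration):

```python
def error_share(status_counts):
    """Fraction of responses with a 4xx/5xx status, given {code: count}."""
    total = sum(status_counts.values())
    errors = sum(n for code, n in status_counts.items() if code >= 400)
    return errors / total if total else 0.0
```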

Authentication Cache

CouchDB keeps some amount of user credentials in memory to speed up the authentication process by eliminating additional database lookups. The cache size is limited by the auth_cache_size configuration option. What does it affect? In short, when a user logs in, CouchDB first looks for the credentials associated with the provided login name in the auth cache, and if they are missing there, it reads the credentials from the auth database (in other words, from disk).

The auth_cache_miss metric is highly related to the HTTP 401 responses one, so there are three cases worth watching for:

Note that "high" and "low" in the metrics world should be read as "anomalously high" and "anomalously low" for your instance.

Authentication cache ratio

Whoa! Someone is really trying to do bad things on the graph above! And that doesn't look like an auth_cache_size issue.

Ok, but why do we need auth cache hits there? We need them as a baseline to compare the misses counter with. For instance, are 10 cache misses a high value? What about 100 or 1000? Having the cache hit rate at hand helps to answer this question.
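The ratio itself is trivial to compute from the auth_cache_hits and auth_cache_misses counters - a minimal sketch:

```python
def cache_hit_ratio(hits, misses):
    """Share of auth lookups answered from the cache, between 0.0 and 1.0."""
    total = hits + misses
    return hits / total if total else 1.0  # no lookups yet: nothing missed
```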

Databases I/O

The database_reads and database_writes metrics show overall I/O for your databases as documents are read and changed. Combining both metrics helps to define the nature of your instance: is it read-only, mostly reads, balanced between reads and writes, or write-heavy? The answer helps to determine your server optimization: for the write-heavy case you need good and fast disks, while for the read-heavy case SSDs would be the best solution. SSDs always make things better, but they live longer when there are fewer writes.

Databases I/O

You could try to get the same answer using the HTTP request method metrics, but the result wouldn't be as clear: a single POST request for a bulk update can cause a lot of writes, while the same POST request against a view causes only reads.

Open Databases

CouchDB only keeps open the databases that receive some activity: being requested or running compaction. The maximum number of databases open at the same moment is limited by the max_dbs_open configuration option. By default it is set to 100, so you're unlikely to hit this limit at startup, but it may strike later as your service grows. When CouchDB hits this limit, any request to a "closed" database generates an error response: {error, all_dbs_active}.

However, an opened database doesn't remain open forever: in case of inactivity CouchDB eventually closes it, making room for others. But sometimes such cleanup may not help due to high activity on your CouchDB instance, especially if it runs a lot of replication processes.

Open databases

The open_databases metric combined with the max_dbs_open configuration option helps to predict and eliminate the possibility of such errors, and also to configure CouchDB according to your needs.

Open Files

The open_files metric shows the number of file descriptors CouchDB holds open. Normally, CouchDB opens two file descriptors per database, plus one per view index, plus some for compaction files, plus a few for other files like the log file, etc.

Open files

By the way, on this graph you may see the face of a real issue: a file descriptor leak.

Unfortunately, this metric doesn't show real file descriptor usage, so its values aren't accurate enough for emfile error prediction (though as an approximation it's ok). See below for the fix.

Cool! But can I do more?

Yes! But from this moment on we need to elevate our privileges up to CouchDB server administrator. This opens more doors and stats sources for us.

Active Tasks

One day you're looking through the graphs of your monitoring system and see that your CouchDB server suffers from heavy disk I/O. Really heavy disk I/O. What could cause that? You take a look at the Databases I/O graph, but nothing seems wrong there. Another look at the request graphs - again, no more write requests than usual, bulk updates are also at their normal level. But diskstats is full of not just peaks, but really high "mountains". So what's going on there?

Before completely excluding CouchDB from the list of suspects, we need to check the /_active_tasks resource. Database compaction is an operation with heavy I/O (since it's about copying a lot of data from file A to file B), and if multiple compactions run for large databases, this could explain the situation. View indexing also matters, but not as much as compactions do.

Active tasks

Allowing your monitoring tools to track active tasks on your CouchDB instance helps to detect these operations - when they start and when they end - and, with the help of other metrics, to diagnose current disk I/O issues.
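Given the parsed /_active_tasks response (an array of task objects; admin credentials are required to fetch it), filtering out the heavy compaction tasks is a one-liner. The type names below follow CouchDB 1.x:

```python
def compactions(active_tasks):
    """Running compaction tasks from a parsed /_active_tasks response."""
    heavy = ("database_compaction", "view_compaction")
    return [t for t in active_tasks if t.get("type") in heavy]
```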

As a bonus, active tasks also show the number of active replications served by this instance. This could be useful for you.

Know Your Server Admins

CouchDB server administrators are gods and can do a lot of things, especially dangerous ones that require care and attention. Normally, you have set up a certain number of server admin users: for yourself and for the people you trust to administer your instance. But are you sure their number won't change over time?

By simply monitoring the number of server admins (the number of keys of the object in the /_config/admins response), you'll be notified when new server administrators are added, and if you don't know "who? and why?", it's a good time to find the answers to those questions. For security reasons.
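A sketch of the check itself, comparing the parsed /_config/admins object (a {name: password_hash} mapping; admin credentials needed to fetch it) against the set of names you expect:

```python
def admins_changed(admins, expected):
    """True when the set of server admin names differs from what you expect."""
    return set(admins) != set(expected)

# admins = parsed GET /_config/admins response, e.g.:
#   admins_changed({"root": "-hashed-..."}, ["root"])                 -> False
#   admins_changed({"root": "...", "eve": "..."}, ["root"])           -> True
```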

Know Your Users

Ok, you provide some public service based on CouchDB; let's say it's a blogging platform. Users come, register, generate content; the service lives an active life. But how many users do you have? How fast does your user database grow? Or maybe it has stalled? These questions are also important: a bigger user database requires bigger resources to run your service, plus reserves for the future.

The answer to these questions is pretty simple: you need to monitor the authentication database's document count. By requesting the database information you'll learn two things about your users: how many registered users you have and how many of them were deleted.
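Both numbers live in the database information object - a sketch, assuming the default authentication database name _users:

```python
def user_counts(db_info):
    """(registered, deleted) document counts from a database info object."""
    return (db_info["doc_count"], db_info["doc_del_count"])

# db_info is the parsed response of GET /_users (or whatever your
# authentication_db is configured to be).
```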

CouchDB users

This graph shows us that nothing changed in the observed time frame; it's a pity.

This says nothing about how many users are really active, nor how many of them use your service at the same time - CouchDB doesn't know that. But tracking the change in document count helps to understand the growth potential of your user database. That's really important for planning and marketing.

As for user activity, just place this graph next to the one with request methods and apply some basic math to estimate average activity.

And is that all?

Not yet. For monitoring the authentication database we used information about the document count and how fast it grows. Why not apply the same questions to other databases?

Graveyard Databases

You should know that when you delete a document from a database, it isn't actually physically removed. If you use the DELETE method, CouchDB removes all document fields (except the system _id and _rev) and adds a new special one: _deleted: true. You can also mark a document as deleted with a simple update via a PUT request by adding the same _deleted: true field without removing any other fields - the document will be "deleted", but all its content remains preserved.

This behavior leads to situations where long-lived databases contain a lot of deleted documents consuming a significant amount of disk space. A Graveyard Database is a database that has more deleted documents than existing ("alive") ones.
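Detecting a graveyard database from the database information object is then straightforward - a sketch:

```python
def is_graveyard(db_info):
    """True when a database holds more deleted documents than alive ones."""
    return db_info["doc_del_count"] > db_info["doc_count"]
```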

Database documents

The reason to track such databases is disk space. At some point you'll want to completely remove the deleted documents from them. That's easy to do via filtered replication. However, make sure there is no active replication using these databases as a target; otherwise all the deleted documents will be replicated back again.

Database Fragmentation

Another bit of your CouchDB knowledge reminds you that databases have to be compacted from time to time to free the disk space consumed by old document revisions.

But when should you run database compaction? And, since it's a heavy disk I/O operation, when is it reasonable to compact a database?

Great questions. And again, monitoring the database size helps to answer them.

Database disk usage

This database looks like a good candidate to be compacted RIGHT NOW!

The database information object contains two fields relevant to this problem: disk_size and data_size. data_size is the actual size of the stored data, while disk_size is the size the database file takes on disk. The difference between these values shows the storage overhead, which can be eliminated via database compaction.

At what point is it reasonable to start a database compaction? Well, the answer depends on the total amount of available disk space on the partition where your databases are stored. Personally, I prefer the 50% mark for the overhead of disk_size over data_size, but it all depends on your environment and the resources you have.
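The overhead computation itself can be sketched like this (the 50% threshold from above is shown as a default, not a rule):

```python
def fragmentation(db_info):
    """Percent of the database file taken by overhead (old revisions etc.)."""
    disk, data = db_info["disk_size"], db_info["data_size"]
    return 100.0 * (disk - data) / disk if disk else 0.0

def needs_compaction(db_info, threshold=50.0):
    """True when the overhead exceeds the chosen threshold (in percent)."""
    return fragmentation(db_info) > threshold
```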

Views Fragmentation

Just like databases, view indexes also need to be compacted. The strategy for them is the same, except that a single database can contain dozens of design documents.

Beyond of CouchDB API

Our monitoring powers can be extended if we have access to the environment of the host that serves the CouchDB instance. This gives us more important information about CouchDB.

Note, I won't talk here about disk space issues: that's a quite trivial case handled by every monitoring tool by default.

File Descriptors

Does {error, emfile} look familiar to you? Right, it's about exhausting the available file descriptors for the CouchDB user. This error always happens SUDDENLY and leads to denial of service. That's completely bad, but this situation can be predicted and fixed long before it happens by using monitoring tools.
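On Linux, the real usage can be read from /proc - a sketch, assuming you know CouchDB's pid (e.g. from its pid file) and its descriptor limit (e.g. from /proc/&lt;pid&gt;/limits or ulimit):

```python
import os

def fd_count(pid):
    """Number of file descriptors the process currently holds open (Linux)."""
    return len(os.listdir("/proc/%d/fd" % pid))

def fd_headroom(used, limit):
    """Fraction of the descriptor limit still free - alert when it gets low."""
    return 1.0 - float(used) / limit
```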

OS Daemons

CouchDB has an awesome feature called os_daemons: it turns CouchDB into a supervisor for third-party scripts or applications which are managed by CouchDB, respawned on failure, and able to communicate with CouchDB through a simple JSON-over-stdio protocol.

While this feature is awesome, CouchDB doesn't show os_daemons status in the list of active tasks, so you don't know whether they are alive or actually dead (CouchDB stops its attempts to respawn a process after 5 continuous failures). But we can fix that with monitoring tools.

The strategy is quite trivial: get the list of os_daemons from the configuration and iterate over CouchDB's child processes. If a process exists, it's alive. Those not found are marked as dead.
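The matching step can be sketched as a pure function; how you obtain the two inputs - the [os_daemons] config section and the command lines of CouchDB's child processes (e.g. via `ps -o args= --ppid <couchdb_pid>`) - is environment-specific:

```python
def dead_daemons(configured, child_cmdlines):
    """Names of configured os_daemons not found among CouchDB's children.

    `configured` maps daemon name -> command line (the [os_daemons] config
    section); `child_cmdlines` lists the command lines of CouchDB's child
    processes.
    """
    return [name for name, cmd in configured.items()
            if not any(cmd in line for line in child_cmdlines)]
```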

Summary

Metrics are an important part of any system since they can tell you what's going on, what's going wrong, and even what will go wrong in the near future.

Please note that while all these graphs are good for visualising the processes that go on on the server, they miss the concrete details of the situation. If you have a powerful system for managing application logs with flexible search, both solutions can be an awesome addition to each other.

Awesome! Can I haz all these NOW?

Most monitoring system plugins for CouchDB are unable to handle all the described cases since they work with just the /_stats resource - which is good, but, as you may have noted, not enough to see the full picture of your CouchDB.

However, at least for Munin there is one that handles almost all of this post's recommendations. The plugin was initially written by Gordon Stratton 5 years ago, and time has only made it better:

https://github.com/gws/munin-plugin-couchdb

As of the 0.6 release it handles all the described cases with a few exceptions, but the work isn't done yet.