Sunday, 20 April 2008

MySQL engines space usage comparison - continued

Since Oracle announced the InnoDB plugin which provides compressed row formats, I thought I'd do this comparison again.

Here are some more data that I've cobbled together.

I've created a 2M row table with a schema that we use for audit data. It has 47 columns and 17 indexes.

I copied the table into new empty tables using INSERT ... SELECT with each of these engines:
  • InnoDB row_format=compressed (from the InnoDB plugin)
  • InnoDB row_format=Compact
  • InnoDB row_format=Redundant
  • MyISAM pack_keys=1
  • MyISAM packed with myisampack (NOTE: these tables are readonly)
  • PBXT
  • Falcon
Here are the data:

EngineSpace (MB)
InnoDB compressed636
InnoDB compact1946
InnoDB redundant2253
MyISAM pack_keys=1738
MyISAM packed476

I am very impressed with how good Falcon is. I am very impressed with InnoDB's compressed row format's space usage.

I don't think it's very reasonable to compare the loading times for the different engines as they all work in different ways, but some were a LOT faster than others. Take this with a pinch of salt:

Fast: MyISAM, Falcon, InnoDB except compressed
Slow: InnoDB compressed, PBXT

InnoDB compressed and PBXT both took well over an hour to complete what the others did in a few minutes. I can't explain this - maybe it's a tuning issue.

In each case, the copy was done in an INSERT ... SELECT which uses a single transaction. I tried to tune the buffers as best as I could (the server has enough ram) but I'm no expert in PBXT and Falcon.

Sunday, 13 April 2008

Improve reliability and maintainability by using stateless connections

MySQL application developers have some problems with connection state. There is a lot of state which can be used associated with a single connection
  • Transactions (autocommit, uncommitted work, when a transaction was started, isolation level)
  • Temporary tables
  • Session variables which affect behaviour
  • Other session variables
If you can avoid all of these, then you can benefit from transparent auto-reconnection and transparent connection reuse / pooling etc.

If you can't, then you will have to deal with reconnection, retry at a higher level. This is complicated and difficult to test.

I'll discuss ways of avoiding these things one at a time:


I'm not advocating shunning transactions completely, just avoiding keeping any transaction state in the connection:
  • Keep autocommit on. Execute everything you want in a transaction in a single statement (in MySQL it is possible to delete from several tables in one statement, insert several rows etc).
  • Transactional engines will still use transactions if autocommit is on - a single statement becomes a single transaction - you will see a consistent view within that statement - no other thread will see parts of a batch insert, update or delete.
  • Non-transactional engines behave the same anyway.
  • Alternative: do everything that needs to be in a transaction inside a stored procedure - if these are retried, they will be retried as a whole.
Temporary tables

These are best avoided. They also cause problems for statement-level replication.
  • Use short-lived permanent tables with unique names instead (perhaps in a separate database which is automatically cleaned up when tables get old)
  • Wrap things which need temporary tables inside stored procedures
  • Use derived tables instead - in some cases this is sufficient.
Session variables which affect behaviour

Sometimes you can't get away from having to set these variables. Common examples are:
  • SET NAMES utf8
If an autoreconnect happens while these are in effect, you could accidentally get rubbish data inserted into the database if a connection reverts to latin1 unexpectedly or SQL_MODE reverts to its default "anything goes".

As these are unavoidable, they always have to be set - but there is a solution - the MYSQL_INIT_COMMAND option. As these commands won't actually have any effect on their own, we can have them automatically re-done when the library reconnects to the database.

The only problem with this approach is that not all client library interfaces allow the setting of this option - but in some cases it can be set in my.cnf (for client libraries which read this)

Other session variables

Quite simply - don't use them. There is no reason I know of for having another session variable.
  • Don't select @@LAST_INSERT_ID or last_insert_id() - use your API method to get it instead. An auto-reconnect happening between the insert and the select @@LAST_INSERT_ID will cause it to return zero, which is probably not what you wanted or expected.
  • Don't define your own variables - keep them on the client instead (or if you absolutely must store state server-side, put it in a table!)
  • It is possible to enable automatic reconnect for greater resilience at the MySQL API level
  • To do so safely you must avoid keeping session state in the connection
  • Variables which need to be reset on connect can be reset with mysql_init_command