I'm trying to diagnose the bottleneck behind high write I/O activity on a Percona master (replicating to a slave). The pattern of behaviour is roughly cyclic and regular: about 60 minutes of 75-90% write-based I/O utilisation followed by about 10 minutes at ~25%. Occasionally this spikes to the point that the fronting web service is unresponsive for a couple of minutes.
Running SHOW PROCESSLIST shows the replication thread and then a number of idle connections (though occasionally there are one or two COMMITs, which have completed by the time I reissue the command), but nothing indicating a specific active query.
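For reference, this is roughly how I've been watching the process list (a quick shell loop; credentials are assumed to be picked up from ~/.my.cnf):

# poll once a second, filter out sleeping connections and the slave's Binlog Dump
# thread, and timestamp the output so it can be lined up with the I/O graphs
while sleep 1; do
    date '+%H:%M:%S'
    mysql -e "SHOW FULL PROCESSLIST;" | grep -v -e 'Sleep' -e 'Binlog Dump'
done | tee processlist.log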
InnoDB buffer pool status variables:
mysql> SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_pages_%';
+-----------------------------------------+-----------+
| Variable_name                           | Value     |
+-----------------------------------------+-----------+
| Innodb_buffer_pool_pages_data           | 1678714   |
| Innodb_buffer_pool_pages_dirty          | 718       |
| Innodb_buffer_pool_pages_flushed        | 195847729 |
| Innodb_buffer_pool_pages_LRU_flushed    | 0         |
| Innodb_buffer_pool_pages_free           | 5299854   |
| Innodb_buffer_pool_pages_made_not_young | 0         |
| Innodb_buffer_pool_pages_made_young     | 59369     |
| Innodb_buffer_pool_pages_misc           | 99319     |
| Innodb_buffer_pool_pages_old            | 619661    |
| Innodb_buffer_pool_pages_total          | 7077887   |
+-----------------------------------------+-----------+
10 rows in set (0.00 sec)
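For scale, assuming the default 16 KB InnoDB page size, that works out to roughly 108 GB of total pool (7,077,887 pages), ~26 GB of it holding data (1,678,714 pages), only ~11 MB dirty (718 pages), and about 3 TB flushed in total since startup (195,847,729 pages).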
SHOW ENGINE INNODB STATUS gives
Buffer pool hit rate 1000 / 1000, young-making rate 0 / 1000 not 0 / 1000
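To see where the write volume is actually going (data pages, redo log, or the doublewrite buffer) I'm planning to sample the raw counters over an interval, something like this (a rough sketch; the 60-second interval is arbitrary):

# sample the InnoDB write counters every minute; the deltas between samples give
# bytes written to data files vs the redo log, plus doublewrite and flushed pages
while sleep 60; do
    date '+%F %T'
    mysql -e "SHOW GLOBAL STATUS WHERE Variable_name IN
        ('Innodb_data_written', 'Innodb_os_log_written',
         'Innodb_dblwr_pages_written', 'Innodb_buffer_pool_pages_flushed');"
done | tee innodb-write-counters.log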
The mysqld section of my.cnf is:
[mysqld]
user = mysql
default-storage-engine = InnoDB
socket = /var/run/mysqld/mysqld.sock
pid-file = /var/run/mysqld/mysqld.pid
key-buffer-size = 32M
myisam-recover = FORCE,BACKUP
max-allowed-packet = 16M
max-connect-errors = 1000000
innodb = FORCE
datadir = /srv/mysql/
server_id = 30
log-bin = /srv/mysql/mysql-bin
expire-logs-days = 14
sync-binlog = 1
tmp-table-size = 32M
max-heap-table-size = 32M
query-cache-type = 0
query-cache-size = 0
max-connections = 500
thread-cache-size = 50
open-files-limit = 65535
table-definition-cache = 1024
table-open-cache = 2048
tmpdir = /tmp
innodb-flush-method = O_DIRECT
innodb-log-files-in-group = 2
innodb-log-file-size = 512M
innodb-flush-log-at-trx-commit = 1
innodb-file-per-table = 1
innodb-buffer-pool-size = 108G
log-error = /var/log/mysql/mysql-error.log
log-queries-not-using-indexes = 1
slow-query-log = 1
slow-query-log-file = /var/log/mysql/mysql-slow.log
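Since sync-binlog = 1 and innodb-flush-log-at-trx-commit = 1 mean every commit hits both the binary log and the redo log on the same volume group, I also intend to check how much of the write volume is the binlog itself, roughly like this (paths as in the config above):

# compare binary log sizes a few minutes apart; the growth of the newest file
# (plus any rotations) gives the binlog write rate
mysql -e "SHOW BINARY LOGS;" | tail -n 3
sleep 300
mysql -e "SHOW BINARY LOGS;" | tail -n 3
# or just watch the files directly
ls -lt /srv/mysql/mysql-bin.[0-9]* | head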
The slow query log has a number of entries, but none of them are long-running. A sample extract:
use xxxxxxxxx;
SET timestamp=1411484104;
SELECT amnews_article.aid, amnews_article.sid_id, amnews_article.fid_id, amnews_article.url, amnews_article.url_hash, amnews_article.title, amnews_article.sha1, amnews_article.date_ts, amnews_article.summary, amnews_article.author_feed, amnews_article.author, amnews_article.language, amnews_article.fetched, amnews_article.processed, amnews_article.ngrams, amnews_article.indexed_ngram, amnews_article.topics_ngram, amnews_article.journals, amnews_article.persons, amnews_article.links, amnews_article.dois, amnews_article.verified, amnews_article.journals_version, amnews_article.duplicate_of_id, amnews_article.external, amnews_article.blog, amnews_article.last_updated, amnews_article.grouped FROM amnews_article ORDER BY amnews_article.aid DESC LIMIT 1;
As an aside, I'm not sure why that query makes it into the slow query log, because it runs quickly and uses an index, as shown by an EXPLAIN:
+----+-------------+----------------+-------+---------------+---------+---------+------+------+-------+
| id | select_type | table          | type  | possible_keys | key     | key_len | ref  | rows | Extra |
+----+-------------+----------------+-------+---------------+---------+---------+------+------+-------+
|  1 | SIMPLE      | amnews_article | index | NULL          | PRIMARY | 4       | NULL |    1 |       |
+----+-------------+----------------+-------+---------------+---------+---------+------+------+-------+
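To double-check that it really does read only one row via the primary key, I can run something like this from a single session (the column list is trimmed for brevity; the real query selects all the columns shown above, and xxxxxxxxx is the schema name from the slow log):

# reset the session counters, run the query, then see how many rows the handler
# actually read (the SHOW itself adds a little temp-table handler traffic, but a
# full scan of amnews_article would dwarf it)
mysql xxxxxxxxx -e "
    FLUSH STATUS;
    SELECT amnews_article.aid FROM amnews_article
        ORDER BY amnews_article.aid DESC LIMIT 1;
    SHOW SESSION STATUS LIKE 'Handler_read%';"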
Underneath, the storage is spinning disks in RAID1 carrying an ext4 filesystem on LVM, and the machine is physical (albeit managed) rather than a VM. The data partition is on the same disks (same LVM volume group) as the log partition. RAM usage is ~90% of 128 GB, and swap is available but not being used.
$ free -m
             total       used       free     shared    buffers     cached
Mem:        128912     117370      11542          0        516      36092
-/+ buffers/cache:      80761      48151
Swap:         2045          0       2045
The server version is 5.5.36-34.2-648.precise-log.
Edit to add info asked for by Vérace:
Is your web server on the same box? The webserver is on a different box from the db.
Are there cron jobs? There are db-related cronjobs but they run from various different hosts. (But with queries not showing in the process list I don't know where to direct attention).
How are you determining that there's an I/O bottleneck? We use New Relic for server and service monitoring. It produces graphs that show the long periods of high I/O usage, flatlining at 100 MB/s.
Running iotop -ao over ~10 minutes shows that the processes responsible are mysqld. (Well, jbd2 is at the very top, but presumably that's expected given the ext4 journalling.)
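If it's useful I can also break the writes down per mysqld thread with pidstat from the sysstat package (a sketch; the 5-second interval is arbitrary), to see whether the volume comes from a handful of background threads or from many client threads:

# -d = disk I/O stats, -t = per-thread, -p = restrict to the mysqld process,
# sampled every 5 seconds
pidstat -d -t -p "$(pgrep -x mysqld)" 5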
My questions are twofold: what is MySQL doing that writes so much data when nothing shows in the process list (i.e. how can I find out which queries are responsible)? And are there any configuration changes I can make to alleviate the almost-perpetual high I/O?