I have a Galera MariaDB cluster on 3 CentOS 8 Linux servers. Each server crashes several times a week. One server has 16 GB of RAM and the other two have 60 GB each. Each wsrep recovery takes about 12 hours. My database holds about 1 billion rows. Please help! Thanks a lot!

Here is the MariaDB journalctl log from one of the crashes:

May 25 17:01:04 server2.net rsyncd[12826]: connect from server2.net (192.168.0.1)
May 25 17:01:04 server2.net rsyncd[12826]: rsync to rsync_sst/ from server2.net (192.168.0.1)
May 25 17:01:04 server2.net rsyncd[12826]: receiving file list
May 25 17:01:04 server2.net mariadbd[6340]: 2021-05-25 17:01:04 0 [Note] WSREP: 0.0 (server1): State transfer to 2.0 (server2) complete.
May 25 17:01:05 server2.net mariadbd[6340]: 2021-05-25 17:01:05 0 [Note] WSREP: Member 0.0 (server1) synced with group.
May 25 17:01:05 server2.net mariadbd[6340]: WSREP_SST: [INFO] Joiner cleanup. rsync PID: 6410 (20210525 17:01:05.697)
May 25 17:01:06 server2.net mariadbd[6340]: WSREP_SST: [INFO] Joiner cleanup done. (20210525 17:01:06.213)
May 25 17:01:06 server2.net mariadbd[6340]: 2021-05-25 17:01:06 3 [Note] WSREP: SST received
May 25 17:01:06 server2.net mariadbd[6340]: 2021-05-25 17:01:06 3 [Note] WSREP: Server status change joiner -> initializing
May 25 17:01:06 server2.net mariadbd[6340]: 2021-05-25 17:01:06 3 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
May 25 17:01:06 server2.net mariadbd[6340]: 2021-05-25 17:01:06 0 [Note] InnoDB: Uses event mutexes
May 25 17:01:06 server2.net mariadbd[6340]: 2021-05-25 17:01:06 0 [Note] InnoDB: Compressed tables use zlib 1.2.7
May 25 17:01:06 server2.net mariadbd[6340]: 2021-05-25 17:01:06 0 [Note] InnoDB: Number of pools: 1
May 25 17:01:06 server2.net mariadbd[6340]: 2021-05-25 17:01:06 0 [Note] InnoDB: Using crc32 + pclmulqdq instructions
May 25 17:01:06 server2.net mariadbd[6340]: 2021-05-25 17:01:06 0 [Note] InnoDB: Using Linux native AIO
May 25 17:01:06 server2.net mariadbd[6340]: 2021-05-25 17:01:06 0 [Note] InnoDB: Initializing buffer pool, total size = 8589934592, chunk size = 134217728
May 25 17:01:06 server2.net mariadbd[6340]: 2021-05-25 17:01:06 0 [Note] InnoDB: Completed initialization of buffer pool
May 25 17:01:06 server2.net mariadbd[6340]: 2021-05-25 17:01:06 0 [Note] InnoDB: If the mysqld execution user is authorized, page cleaner thread priority can be changed. See the man page of
May 25 17:01:06 server2.net mariadbd[6340]: 2021-05-25 17:01:06 0 [Note] InnoDB: Starting crash recovery from checkpoint LSN=1134219835368,1134226613416
May 25 17:01:06 server2.net mariadbd[6340]: 2021-05-25 17:01:06 0 [Note] InnoDB: 14 transaction(s) which must be rolled back or cleaned up in total 32 row operations to undo
May 25 17:01:06 server2.net mariadbd[6340]: 2021-05-25 17:01:06 0 [Note] InnoDB: Trx id counter is 458025768
May 25 17:01:06 server2.net mariadbd[6340]: 2021-05-25 17:01:06 0 [Note] InnoDB: Starting final batch to recover 22131 pages from redo log.
May 25 17:01:21 server2.net mariadbd[6340]: 2021-05-25 17:01:21 0 [Note] InnoDB: To recover: 13362 pages from log
May 25 17:01:36 server2.net mariadbd[6340]: 2021-05-25 17:01:36 0 [Note] InnoDB: To recover: 2461 pages from log
May 25 17:01:44 server2.net mariadbd[6340]: 2021-05-25 17:01:44 0 [Note] InnoDB: 128 rollback segments are active.
May 25 17:01:44 server2.net mariadbd[6340]: 2021-05-25 17:01:44 0 [Note] InnoDB: Starting in background the rollback of recovered transactions
May 25 17:01:44 server2.net mariadbd[6340]: 2021-05-25 17:01:44 0 [Note] InnoDB: Removed temporary tablespace data file: "ibtmp1"
May 25 17:01:44 server2.net mariadbd[6340]: 2021-05-25 17:01:44 0 [Note] InnoDB: Creating shared tablespace for temporary tables
May 25 17:01:44 server2.net mariadbd[6340]: 2021-05-25 17:01:44 0 [Note] InnoDB: Setting file './ibtmp1' size to 12 MB. Physically writing the file full; Please wait ...
May 25 17:01:44 server2.net mariadbd[6340]: 2021-05-25 17:01:44 0 [Note] InnoDB: File './ibtmp1' size is now 12 MB.
May 25 17:01:44 server2.net mariadbd[6340]: 2021-05-25 17:01:44 0 [Note] InnoDB: 10.5.9 started; log sequence number 1134227259746; transaction id 458025769
May 25 17:01:44 server2.net mariadbd[6340]: 2021-05-25 17:01:44 0 [Note] InnoDB: Rolled back recovered transaction 458025603
May 25 17:01:44 server2.net mariadbd[6340]: 2021-05-25 17:01:44 0 [Note] InnoDB: Loading buffer pool(s) from /var/lib/mysql/ib_buffer_pool
May 25 17:01:44 server2.net mariadbd[6340]: 2021-05-25 17:01:44 0 [Note] Plugin 'FEEDBACK' is disabled.
May 25 17:01:44 server2.net mariadbd[6340]: 2021-05-25 17:01:44 0 [Note] Recovering after a crash using tc.log
May 25 17:01:44 server2.net mariadbd[6340]: 2021-05-25 17:01:44 0 [Note] Starting crash recovery...
May 25 17:01:44 server2.net mariadbd[6340]: 2021-05-25 17:01:44 0 [Note] Crash recovery finished.
May 25 17:01:44 server2.net mariadbd[6340]: 2021-05-25 17:01:44 0 [Note] Server socket created on IP: '0.0.0.0'.
May 25 17:01:44 server2.net mariadbd[6340]: 2021-05-25 17:01:44 0 [Note] InnoDB: Rolled back recovered transaction 458025201
May 25 17:01:44 server2.net mariadbd[6340]: 2021-05-25 17:01:44 0 [Note] InnoDB: Rolled back recovered transaction 458025717
May 25 17:01:44 server2.net mariadbd[6340]: 2021-05-25 17:01:44 0 [Note] InnoDB: Rolled back recovered transaction 458022731
May 25 17:01:44 server2.net mariadbd[6340]: 2021-05-25 17:01:44 0 [Note] InnoDB: Rolled back recovered transaction 458025212
May 25 17:01:44 server2.net mariadbd[6340]: 2021-05-25 17:01:44 0 [Note] WSREP: wsrep_init_schema_and_SR 0x0
May 25 17:01:44 server2.net mariadbd[6340]: 2021-05-25 17:01:44 0 [Note] WSREP: Server initialized
May 25 17:01:44 server2.net mariadbd[6340]: 2021-05-25 17:01:44 0 [Note] WSREP: Server status change initializing -> initialized
May 25 17:01:44 server2.net mariadbd[6340]: 2021-05-25 17:01:44 0 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
May 25 17:01:44 server2.net mariadbd[6340]: 2021-05-25 17:01:44 3 [Note] WSREP: Server status change initialized -> joined
May 25 17:01:44 server2.net mariadbd[6340]: 2021-05-25 17:01:44 3 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
May 25 17:01:44 server2.net mariadbd[6340]: 2021-05-25 17:01:44 3 [Note] WSREP: Recovered position from storage: de1727aa-bad3-11eb-80f5-9e4d94951082:3223152
May 25 17:01:44 server2.net mariadbd[6340]: 2021-05-25 17:01:44 3 [Note] WSREP: Recovered view from SST:
May 25 17:01:44 server2.net mariadbd[6340]: id: de1727aa-bad3-11eb-80f5-9e4d94951082:3223133
May 25 17:01:44 server2.net mariadbd[6340]: status: primary
May 25 17:01:44 server2.net mariadbd[6340]: protocol_version: 4
May 25 17:01:44 server2.net mariadbd[6340]: capabilities: MULTI-MASTER, CERTIFICATION, PARALLEL_APPLYING, REPLAY, ISOLATION, PAUSE, CAUSAL_READ, INCREMENTAL_WS, UNORDERED, PREORDERED, STREAM
May 25 17:01:44 server2.net mariadbd[6340]: final: no
May 25 17:01:44 server2.net mariadbd[6340]: own_index: 2
May 25 17:01:44 server2.net mariadbd[6340]: members(3):
May 25 17:01:44 server2.net mariadbd[6340]: 0: 733e0788-b3ce-11eb-8484-7287f059debd, server1
May 25 17:01:44 server2.net mariadbd[6340]: 1: 74afdfe1-9b9e-11eb-9a0e-2bb2d25ae0ab, server3
May 25 17:01:44 server2.net mariadbd[6340]: 2: 942adf56-86f7-11eb-a7de-7bdeacbb6480, server2
May 25 17:01:44 server2.net mariadbd[6340]: 2021-05-25 17:01:44 3 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
May 25 17:01:44 server2.net mariadbd[6340]: 2021-05-25 17:01:44 6 [Note] WSREP: Recovered cluster id de1727aa-bad3-11eb-80f5-9e4d94951082
May 25 17:01:44 server2.net mariadbd[6340]: 2021-05-25 17:01:44 3 [Note] WSREP: SST received: de1727aa-bad3-11eb-80f5-9e4d94951082:3223152
May 25 17:01:44 server2.net mariadbd[6340]: 2021-05-25 17:01:44 3 [Note] WSREP: SST succeeded for position de1727aa-bad3-11eb-80f5-9e4d94951082:3223152
May 25 17:01:44 server2.net mariadbd[6340]: 2021-05-25 17:01:44 0 [Note] WSREP: Joiner monitor thread ended with total time 11483 sec
May 25 17:01:44 server2.net mariadbd[6340]: 2021-05-25 17:01:44 0 [Note] Reading of all Master_info entries succeeded
May 25 17:01:44 server2.net mariadbd[6340]: 2021-05-25 17:01:44 0 [Note] Added new Master_info '' to hash table
May 25 17:01:44 server2.net mariadbd[6340]: 2021-05-25 17:01:44 2 [Note] WSREP: Installed new state from SST: de1727aa-bad3-11eb-80f5-9e4d94951082:3223152
May 25 17:01:44 server2.net mariadbd[6340]: 2021-05-25 17:01:44 0 [Note] /usr/sbin/mariadbd: ready for connections.
May 25 17:01:44 server2.net mariadbd[6340]: Version: '10.5.9-MariaDB'  socket: '/var/lib/mysql/mysql.sock'  port: 3306  MariaDB Server
May 25 17:01:44 server2.net mariadbd[6340]: 2021-05-25 17:01:44 2 [Note] WSREP: Cert. index preload up to 3223152
May 25 17:01:44 server2.net mariadbd[6340]: 2021-05-25 17:01:44 0 [Note] WSREP: ####### IST applying starts with 3223153
May 25 17:01:44 server2.net mariadbd[6340]: 2021-05-25 17:01:44 0 [Note] WSREP: ####### IST current seqno initialized to 3223069
May 25 17:01:44 server2.net mariadbd[6340]: 2021-05-25 17:01:44 0 [Note] WSREP: Receiving IST...  0.0% ( 0/65 events) complete.
May 25 17:01:44 server2.net mariadbd[6340]: 2021-05-25 17:01:44 0 [Note] WSREP: IST preload starting at 3223069
May 25 17:01:44 server2.net mariadbd[6340]: 2021-05-25 17:01:44 0 [Note] WSREP: Service thread queue flushed.
May 25 17:01:44 server2.net mariadbd[6340]: 2021-05-25 17:01:44 0 [Note] WSREP: ####### Assign initial position for certification: 00000000-0000-0000-0000-000000000000:3223068, protocol vers
May 25 17:01:44 server2.net mariadbd[6340]: 2021-05-25 17:01:44 0 [Note] WSREP: REPL Protocols: 10 (5)
May 25 17:01:44 server2.net mariadbd[6340]: 2021-05-25 17:01:44 0 [Note] WSREP: ####### Adjusting cert position: 3223132 -> 3223133
May 25 17:01:44 server2.net mariadbd[6340]: 2021-05-25 17:01:44 0 [Note] WSREP: Service thread queue flushed.
May 25 17:01:44 server2.net mariadbd[6340]: 2021-05-25 17:01:44 0 [Note] WSREP: Lowest cert index boundary for CC from preload: 3223069
May 25 17:01:44 server2.net mariadbd[6340]: 2021-05-25 17:01:44 0 [Note] WSREP: Min available from gcache for CC from preload: 3223069
May 25 17:01:44 server2.net mariadbd[6340]: 2021-05-25 17:01:44 0 [Note] WSREP: Receiving IST...100.0% (65/65 events) complete.
May 25 17:01:44 server2.net mariadbd[6340]: 2021-05-25 17:01:44 2 [Note] WSREP: Cert. index preloaded up to 3223133
May 25 17:01:44 server2.net mariadbd[6340]: 2021-05-25 17:01:44 2 [Note] WSREP: Lowest cert index boundary for CC from sst: 3223069
May 25 17:01:44 server2.net mariadbd[6340]: 2021-05-25 17:01:44 2 [Note] WSREP: Min available from gcache for CC from sst: 3223069
May 25 17:01:44 server2.net mariadbd[6340]: 210525 17:01:44 [ERROR] mysqld got signal 11 ;
May 25 17:01:44 server2.net mariadbd[6340]: This could be because you hit a bug. It is also possible that this binary
May 25 17:01:44 server2.net mariadbd[6340]: or one of the libraries it was linked against is corrupt, improperly built,
May 25 17:01:44 server2.net mariadbd[6340]: or misconfigured. This error can also be caused by malfunctioning hardware.
May 25 17:01:44 server2.net mariadbd[6340]: To report this bug, see https://mariadb.com/kb/en/reporting-bugs
May 25 17:01:44 server2.net mariadbd[6340]: We will try our best to scrape up some info that will hopefully help
May 25 17:01:44 server2.net mariadbd[6340]: diagnose the problem, but since we have already crashed,
May 25 17:01:44 server2.net mariadbd[6340]: something is definitely wrong and this may fail.
May 25 17:01:44 server2.net mariadbd[6340]: Server version: 10.5.9-MariaDB
May 25 17:01:44 server2.net mariadbd[6340]: key_buffer_size=134217728
May 25 17:01:44 server2.net mariadbd[6340]: read_buffer_size=131072
May 25 17:01:44 server2.net mariadbd[6340]: max_used_connections=0
May 25 17:01:44 server2.net mariadbd[6340]: max_threads=153
May 25 17:01:44 server2.net mariadbd[6340]: thread_count=2
May 25 17:01:44 server2.net mariadbd[6340]: It is possible that mysqld could use up to
May 25 17:01:44 server2.net mariadbd[6340]: key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 467864 K  bytes of memory
May 25 17:01:44 server2.net mariadbd[6340]: Hope that's ok; if not, decrease some variables in the equation.
May 25 17:01:44 server2.net mariadbd[6340]: Thread pointer: 0x7f4c640009b8
May 25 17:01:44 server2.net mariadbd[6340]: Attempting backtrace. You can use the following information to find out
May 25 17:01:44 server2.net mariadbd[6340]: where mysqld died. If you see no messages after this, something went
May 25 17:01:44 server2.net mariadbd[6340]: terribly wrong...
May 25 17:01:44 server2.net systemd[1]: Started MariaDB 10.5.9 database server.
May 25 17:01:44 server2.net mariadbd[6340]: 2021-05-25 17:01:44 8 [Warning] Aborted connection 8 to db: 'crmdb' user: 'user1' host: 'localhost' (Got an error reading communication pac
May 25 17:01:44 server2.net mariadbd[6340]: 2021-05-25 17:01:44 9 [Warning] Aborted connection 9 to db: 'crmdb' user: 'user1' host: 'localhost' (Got an error reading communication pac
May 25 17:01:44 server2.net mariadbd[6340]: stack_bottom = 0x7f4c7c173cb0 thread_stack 0x49000
May 25 17:01:44 server2.net mariadbd[6340]: 2021-05-25 17:01:44 0 [Note] WSREP: 2.0 (server2): State transfer from 0.0 (server1) complete.
May 25 17:01:44 server2.net mariadbd[6340]: 2021-05-25 17:01:44 0 [Note] WSREP: Shifting JOINER -> JOINED (TO: 3223702)
May 25 17:01:44 server2.net mariadbd[6340]: ??:0(my_print_stacktrace)[0x56239dfbb7fe]
May 25 17:01:44 server2.net mariadbd[6340]: ??:0(handle_fatal_signal)[0x56239d9bfa37]
May 25 17:01:44 server2.net mariadbd[6340]: sigaction.c:0(__restore_rt)[0x7f4c83c3d630]
May 25 17:01:44 server2.net mariadbd[6340]: ??:0(thd_get_thread_id)[0x56239d76db31]
May 25 17:01:44 server2.net mariadbd[6340]: ??:0(wsrep_notify_status(wsrep::server_state::state, wsrep::view const*))[0x56239dcadb32]
May 25 17:01:44 server2.net mariadbd[6340]: /usr/sbin/mariadbd(+0x642a4c)[0x56239d691a4c]
May 25 17:01:44 server2.net mariadbd[6340]: ??:0(wsrep_notify_status(wsrep::server_state::state, wsrep::view const*))[0x56239dcf69e2]
May 25 17:01:44 server2.net mariadbd[6340]: ??:0(wsrep_notify_status(wsrep::server_state::state, wsrep::view const*))[0x56239dcfd217]
May 25 17:01:44 server2.net mariadbd[6340]: /usr/sbin/mariadbd(+0x64a816)[0x56239d699816]
May 25 17:01:44 server2.net mariadbd[6340]: ??:0(void std::__introsort_loop<unsigned char**, long>(unsigned char**, unsigned char**, long))[0x56239dda4663]
May 25 17:01:44 server2.net mariadbd[6340]: ??:0(wsrep_notify_status(wsrep::server_state::state, wsrep::view const*))[0x56239dcb7f04]
May 25 17:01:44 server2.net mariadbd[6340]: ??:0(handler::ha_rnd_pos(unsigned char*, unsigned char*))[0x56239d9c4d82]
May 25 17:01:44 server2.net mariadbd[6340]: ??:0(wsrep::client_state::ordered_commit())[0x56239d9d13a3]
May 25 17:01:44 server2.net mariadbd[6340]: ??:0(Rows_log_event::find_row(rpl_group_info*))[0x56239dad9a39]
May 25 17:01:44 server2.net mariadbd[6340]: ??:0(Update_rows_log_event::do_exec_row(rpl_group_info*))[0x56239dada1ae]
May 25 17:01:44 server2.net mariadbd[6340]: ??:0(Rows_log_event::do_apply_event(rpl_group_info*))[0x56239dacee84]
May 25 17:01:44 server2.net mariadbd[6340]: ??:0(wsrep_apply_events(THD*, Relay_log_info*, void const*, unsigned long))[0x56239dc820bd]
May 25 17:01:44 server2.net mariadbd[6340]: ??:0(Wsrep_high_priority_service::reset_globals())[0x56239dc6bd00]
May 25 17:01:44 server2.net mariadbd[6340]: ??:0(Wsrep_applier_service::apply_write_set(wsrep::ws_meta const&, wsrep::const_buffer const&, wsrep::mutable_buffer&))[0x56239dc6bdf3]
May 25 17:01:44 server2.net mariadbd[6340]: ??:0(wsrep::server_state::start_streaming_applier(wsrep::id const&, wsrep::transaction_id const&, wsrep::high_priority_service*))[0x56239e03cb7f]
May 25 17:01:44 server2.net mariadbd[6340]: ??:0(wsrep::server_state::on_apply(wsrep::high_priority_service&, wsrep::ws_handle const&, wsrep::ws_meta const&, wsrep::const_buffer const&))[0x5
May 25 17:01:44 server2.net mariadbd[6340]: ??:0(wsrep::wsrep_provider_v26::last_committed_gtid() const)[0x56239e04e0a8]
May 25 17:01:44 server2.net mariadbd[6340]: 2021-05-25 17:01:44 10 [Warning] Aborted connection 10 to db: 'crmdb' user: 'user1' host: 'localhost' (Got an error reading communication p
May 25 17:01:44 server2.net mariadbd[6340]: 2021-05-25 17:01:44 11 [Warning] Aborted connection 11 to db: 'crmdb' user: 'user1' host: 'localhost' (Got an error reading communication p
May 25 17:01:44 server2.net mariadbd[6340]: src/trx_handle.cpp:387(galera::TrxHandleSlave::apply(void*, wsrep_cb_status (*)(void*, wsrep_ws_handle const*, unsigned int, wsrep_buf const*, wsr
May 25 17:01:44 server2.net mariadbd[6340]: src/replicator_smm.cpp:504(galera::ReplicatorSMM::apply_trx(void*, galera::TrxHandleSlave&))[0x7f4c7ec6549a]
May 25 17:01:44 server2.net mariadbd[6340]: src/replicator_smm.cpp:2145(galera::ReplicatorSMM::process_trx(void*, boost::shared_ptr<galera::TrxHandleSlave> const&))[0x7f4c7ec6afd1]
May 25 17:01:44 server2.net mariadbd[6340]: src/gcs_action_source.cpp:63(galera::GcsActionSource::process_writeset(void*, gcs_action const&, bool&))[0x7f4c7ec445f9]
May 25 17:01:44 server2.net mariadbd[6340]: src/gcs_action_source.cpp:110(galera::GcsActionSource::dispatch(void*, gcs_action const&, bool&))[0x7f4c7ec44c3a]
May 25 17:01:44 server2.net mariadbd[6340]: src/gcs_action_source.cpp:29(~Release)[0x7f4c7ec44e6e]
May 25 17:01:44 server2.net mariadbd[6340]: src/replicator_smm.cpp:390(galera::ReplicatorSMM::async_recv(void*))[0x7f4c7ec6b4bb]
May 25 17:01:44 server2.net mariadbd[6340]: src/wsrep_provider.cpp:263(galera_recv)[0x7f4c7ec7df38]
May 25 17:01:44 server2.net mariadbd[6340]: ??:0(wsrep::wsrep_provider_v26::run_applier(wsrep::high_priority_service*))[0x56239e04e7de]
May 25 17:01:44 server2.net mariadbd[6340]: ??:0(wsrep_fire_rollbacker)[0x56239dc83d78]
May 25 17:01:44 server2.net mariadbd[6340]: ??:0(start_wsrep_THD(void*))[0x56239dc77243]
May 25 17:01:44 server2.net mariadbd[6340]: ??:0(MyCTX_nopad::finish(unsigned char*, unsigned int*))[0x56239dc0d11d]
May 25 17:01:44 server2.net mariadbd[6340]: pthread_create.c:0(start_thread)[0x7f4c83c35ea5]
May 25 17:01:44 server2.net mariadbd[6340]: ??:0(__clone)[0x7f4c81d5e8dd]
May 25 17:01:44 server2.net mariadbd[6340]: Trying to get some variables.
May 25 17:01:44 server2.net mariadbd[6340]: Some pointers may be invalid and cause the dump to abort.
May 25 17:01:44 server2.net mariadbd[6340]: Query (0x7f4c72ade49b): update crm_users set us_title_clean='\n              Data Scientist'  where us_id = 44378796
May 25 17:01:44 server2.net mariadbd[6340]: Connection ID (thread ID): 2
May 25 17:01:44 server2.net mariadbd[6340]: Status: NOT_KILLED
May 25 17:01:44 server2.net mariadbd[6340]: Optimizer switch: index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_merge_sort_intersection=off,engi
May 25 17:01:44 server2.net mariadbd[6340]: The manual page at https://mariadb.com/kb/en/how-to-produce-a-full-stack-trace-for-mysqld/ contains
May 25 17:01:44 server2.net mariadbd[6340]: information that should help you find out what is causing the crash.
May 25 17:01:44 server2.net mariadbd[6340]: Writing a core file...
May 25 17:01:44 server2.net mariadbd[6340]: Working directory at /var/lib/mysql
May 25 17:01:44 server2.net mariadbd[6340]: Resource Limits:
May 25 17:01:44 server2.net mariadbd[6340]: Limit                     Soft Limit           Hard Limit           Units
May 25 17:01:44 server2.net mariadbd[6340]: Max cpu time              unlimited            unlimited            seconds
May 25 17:01:44 server2.net mariadbd[6340]: Max file size             unlimited            unlimited            bytes
May 25 17:01:44 server2.net mariadbd[6340]: Max data size             unlimited            unlimited            bytes
May 25 17:01:44 server2.net mariadbd[6340]: Max stack size            8388608              unlimited            bytes
May 25 17:01:44 server2.net mariadbd[6340]: Max core file size        0                    unlimited            bytes
May 25 17:01:44 server2.net mariadbd[6340]: Max resident set          unlimited            unlimited            bytes
May 25 17:01:44 server2.net mariadbd[6340]: Max processes             63441                63441                processes
May 25 17:01:44 server2.net mariadbd[6340]: Max open files            100000               100000               files
May 25 17:01:44 server2.net mariadbd[6340]: Max locked memory         65536                65536                bytes
May 25 17:01:44 server2.net mariadbd[6340]: Max address space         unlimited            unlimited            bytes
May 25 17:01:44 server2.net mariadbd[6340]: Max file locks            unlimited            unlimited            locks
May 25 17:01:44 server2.net mariadbd[6340]: Max pending signals       63441                63441                signals
May 25 17:01:44 server2.net mariadbd[6340]: Max msgqueue size         819200               819200               bytes
May 25 17:01:44 server2.net mariadbd[6340]: Max nice priority         0                    0
May 25 17:01:44 server2.net mariadbd[6340]: Max realtime priority     0                    0
May 25 17:01:44 server2.net mariadbd[6340]: Max realtime timeout      unlimited            unlimited            us
May 25 17:01:44 server2.net mariadbd[6340]: Core pattern: core
May 25 17:01:45 server2.net systemd[1]: mariadb.service: main process exited, code=killed, status=11/SEGV
May 25 17:01:45 server2.net systemd[1]: Unit mariadb.service entered failed state.
May 25 17:01:45 server2.net systemd[1]: mariadb.service failed.
May 25 17:01:50 server2.net systemd[1]: mariadb.service holdoff time over, scheduling restart.
May 25 17:01:50 server2.net systemd[1]: Stopped MariaDB 10.5.9 database server.
May 25 17:01:50 server2.net systemd[1]: Starting MariaDB 10.5.9 database server...

my.cnf file:

[server]

# this is only for the mysqld standalone daemon
[mysqld]
innodb_purge_threads=1

innodb_buffer_pool_size = 8G
max_allowed_packet= 573741824
character-set-client-handshake = FALSE
character-set-server = utf8mb4
collation-server = utf8mb4_unicode_ci

#this increases perf see https://dba.stackexchange.com/questions/187458/100x-less-insert-performance-with-galera-multi-master
innodb_flush_log_at_trx_commit = 2
sync_binlog = 0

# * Galera-related settings
[galera]
# Mandatory settings
wsrep_on=On
wsrep_provider=/usr/lib64/galera-4/libgalera_smm.so
wsrep_cluster_name="cluster"
wsrep_cluster_address="gcomm://server1,server2,server3"
binlog_format=row
default_storage_engine=InnoDB
innodb_autoinc_lock_mode=2

wsrep_sst_method=rsync
wsrep_node_address="server2"
wsrep_node_name="server2"

# Allow server to accept connections on all interfaces.

Patrice G

1 Answer

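As danblack helpfully pointed out in the comments below, if we assume that MariaDB adheres to the POSIX.1-1990 (P1990) signal definitions, then "mysqld got signal 11" is a SIGSEGV, which indicates an invalid memory access. The systemd line "status=11/SEGV" in your log is consistent with this. That is a very broad class of bugs, many of them obscure.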

So, you have two options:

  1. Try to capture all the specifics of your edge case and submit a bug report. That will require a full stack trace, possibly installing a MariaDB build with debug symbols, and then waiting in the hope that someone on the MariaDB team responds sooner rather than later. Sometimes they respond very quickly; other times, including on a bug report for a very similar Galera-based crash with signal 11, not for months. Response time is likely affected by the quality of the bug report and the number of users affected. So if you have a really obscure bug that is hyper-specific to your setup, and the context is not fully known or is difficult to articulate and capture, you should expect to wait a while.

  2. Chart a course around the bug. There are a number of posts and reports around the web, for example, that associate "signal 11" with data corruption. If you're lucky enough to fall into that sub-class of bugs, then you might be able to resolve the problem yourself by finding the data that MariaDB doesn't like on your own and eliminating it from your database (i.e. stop being an edge case).

Below is a successful example of how to proceed with #2:

You might get lucky and find something easily with mysqlcheck, or you might know your database well enough to have a good idea which table or tables are structured quite differently from the rest; in that case, try a backup that excludes those tables and see what happens. In the worst case, search by process of elimination.

In my case, I had a database with 206 tables, several with more than 100 million rows, and mysqlcheck found no problems. But I knew that if only 1 table was the problem, I'd need at most 2 * log2(206) tries to find it (one try to confirm a crashing half, another try to confirm a non-crashing half, repeat). So I did backups of half the tables, watched to see which ones crashed and which ones didn't, and 16 tries later had my answer.
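
The halving search described above can be sketched in Python. This is only an illustration under stated assumptions, not the exact procedure I ran: `crashes` is a hypothetical callback that would back up and restore only the given tables and report whether the server crashed, and the sketch assumes exactly one table is at fault. It tests one half per round; confirming both halves each round doubles the count, which is where the bound of 16 tries for 206 tables comes from.

```python
import math
from typing import Callable, List, Sequence


def find_bad_table(tables: Sequence[str],
                   crashes: Callable[[List[str]], bool]) -> str:
    """Narrow a crash down to one table by halving the candidate set.

    `crashes(subset)` is a hypothetical callback: it should back up and
    restore only `subset` and return True if the server crashed.
    Assumes exactly one table is responsible.
    """
    candidates = list(tables)
    while len(candidates) > 1:
        mid = len(candidates) // 2
        first_half = candidates[:mid]
        if crashes(first_half):
            # The culprit is somewhere in the half that crashed.
            candidates = first_half
        else:
            # The first half is clean, so the culprit is in the rest.
            candidates = candidates[mid:]
    return candidates[0]


def worst_case_tries(n_tables: int) -> int:
    """Upper bound on tries when both halves are confirmed every round."""
    return 2 * math.ceil(math.log2(n_tables))
```

With 206 tables the loop runs at most 8 rounds, and `worst_case_tries(206)` gives the two-tests-per-round bound.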

Ultimately, there was a single table causing all the problems, and it had its ROW_FORMAT set to COMPRESSED. Apparently MariaDB is getting ready to deprecate this feature, so I guess they're not attending to its nuances very carefully over in mariabackup development, hence the obscure bug. As soon as I changed that table with:

ALTER TABLE my_table ROW_FORMAT=default;

the crashes stopped happening during restores from incremental backups. Keep in mind that this assumes your default row format is DYNAMIC, which has been the default since MariaDB 10.2.2. If you need to find which tables are set to compressed, try:

SELECT table_name FROM information_schema.tables WHERE create_options LIKE '%COMPRESSED%';
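
If that query returns more than a handful of tables, the ALTER statements can be generated rather than typed by hand. A minimal Python sketch, assuming the query is extended to also select table_schema so that each row is a (schema, table) pair; the function names and the sample row are hypothetical. Review the generated statements before running them, since changing the row format rebuilds the table.

```python
from typing import Iterable, List, Tuple


def quote_identifier(name: str) -> str:
    """Quote a MariaDB identifier with backticks, doubling embedded backticks."""
    return "`" + name.replace("`", "``") + "`"


def alter_statements(rows: Iterable[Tuple[str, str]]) -> List[str]:
    """Build one ALTER TABLE statement per (schema, table) pair."""
    return [
        f"ALTER TABLE {quote_identifier(schema)}.{quote_identifier(table)} "
        "ROW_FORMAT=DEFAULT;"
        for schema, table in rows
    ]


if __name__ == "__main__":
    # Hypothetical sample row; real rows would come from the query above.
    for statement in alter_statements([("crmdb", "crm_users")]):
        print(statement)
```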
s0kr8s