@mattab opened this issue on April 13th 2015

During core:archive cron execution, it can happen that the MySQL server fails, and the core:archive script itself then dies because its connection to the database was cut. When this happens, the script has to be restarted manually, or the work waits until the next cron run. This causes complications for Piwik administrators, and we want to make the script smarter, so that it tries to re-establish the connection to MySQL in case the DB server had only a temporary problem.

#### Issue context

Imagine a long-running core:archive script run. It triggers a lot of requests to the API to pre-process websites. Occasionally the MySQL server may fail and return an error such as "MySQL server has gone away". When this happens, the API request fails, the failure is displayed in the core:archive script output, and the script proceeds with the next request. This works fine and the script continues with the remaining websites.

The problem occurs when the console command's own connection to MySQL has been broken: core:archive then dies, and any remaining websites and segments will not be pre-processed.

#### Reproduce

To reproduce: during a core:archive run, restart the MySQL server.

The core:archive script will then fail with an error:

```
INFO CoreConsole[2015-04-13 04:28:00] Archived website id = 83, period = month, 0 visits in last last52 months, 0 visits this month, Time elapsed: 0.447s
INFO CoreConsole[2015-04-13 04:28:00] Will pre-process for website id = 83, year period, the following 1 segments: { pageUrl!=xx }
ERROR CoreConsole[2015-04-13 04:28:01] Error unserializing the following response from http://localhost/piwik-master/index.php?module=API&method=API.get&idSite=83&period=year&date=last7&format=php&token_auth=removed&trigger=archivephp: SQLSTATE[HY000] [2003] Can't connect to MySQL server on '127.0.0.1' (111)
INFO CoreConsole[2015-04-13 04:28:01] Archived website id = 83, period = year, 0 visits in last last7 years, 0 visits this year, Time elapsed: 0.469s
ERROR CoreConsole[2015-04-13 04:28:01] SQLSTATE[70100]: <<Unknown error>>: 1317 Query execution was interrupted

  [Exception]
  SQLSTATE[70100]: <<Unknown error>>: 1317 Query execution was interrupted
```

Here is an error that happened on a production Piwik:

```
2015-03-21 07:27:54 INFO CoreConsole[2015-03-21 07:27:54] [4d9c4] Archived website id = 9, period = week, 40731912 visits in last last4 weeks, 7821120 visits this week, Time elapsed: 195.531s
2015-03-21 16:06:26 INFO CoreConsole[2015-03-21 16:06:26] [4d9c4] Archived website id = 9, period = month, 157882146 visits in last last4 months, 31272773 visits this month, Time elapsed: 31111.368s
2015-03-21 16:06:26 ERROR CoreConsole[2015-03-21 16:06:26] [4d9c4] SQLSTATE[HY000]: General error: 2006 MySQL server has gone away
2015-03-21 16:06:26 ERROR CoreConsole[2015-03-21 16:06:26] [4d9c4] SQLSTATE[HY000]: General error: 2006 MySQL server has gone away
```

#### Proposed solution

The goal is to prevent such errors and proceed with archiving the remaining websites:

- When a request has failed with an error, do not re-try that same request: it is important to skip it. For example, if archiving failed for a website because of memory exhaustion, it would fail again the next time; by skipping it we ensure the remaining sites will be processed.
- When the DB breaks temporarily, the script should be smart and wait a bit before retrying.
- Could a retry mechanism be implemented in the command that re-connects to the database after 1s and retries N times, logging a warning or so? See the sketch below this list.
- How do we detect such errors? Error codes 70100 and 2006 are a starting point.
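A minimal sketch of such a retry wrapper, in plain PHP with PDO. `runWithDbRetry` and `reconnectDatabase` are hypothetical names for illustration, not Piwik's actual API:

```php
<?php
// Hypothetical sketch, not Piwik's actual implementation: run a DB operation
// and, on a transient MySQL failure, reconnect and retry a bounded number of times.
function runWithDbRetry(callable $callback, $maxRetries = 3, $delaySeconds = 1)
{
    // MySQL driver error codes from this issue that look temporary enough to retry.
    $retryableErrors = array(
        2003, // Can't connect to MySQL server
        2006, // MySQL server has gone away
        1317, // Query execution was interrupted (surfaces as SQLSTATE 70100)
    );

    for ($attempt = 0; ; $attempt++) {
        try {
            return $callback();
        } catch (PDOException $e) {
            $driverCode = isset($e->errorInfo[1]) ? (int) $e->errorInfo[1] : 0;

            // Permanent errors (e.g. memory exhaustion during one request) and
            // exhausted retries are re-thrown, so the caller can skip that
            // request and continue with the remaining websites.
            if ($attempt >= $maxRetries || !in_array($driverCode, $retryableErrors)) {
                throw $e;
            }

            error_log("MySQL error $driverCode, reconnecting and retrying "
                . "(attempt " . ($attempt + 1) . " of $maxRetries)");
            sleep($delaySeconds);
            reconnectDatabase(); // hypothetical helper: drop the dead handle, open a fresh one
        }
    }
}
```

Each archiving query would then be wrapped, e.g. `runWithDbRetry(function () use ($db, $sql) { return $db->query($sql); });`.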

@SR-mkuhn commented on July 10th 2015

It seems like two MySQL connections are established for every site. The first one sits idle while the other one does the work. If the first one reaches wait_timeout, everything fails.

One solution could be eliminating the idle connection.

@nekohayo commented on September 7th 2015

So as I understand it, the error is "harmless" in the sense that the user can just repeatedly re-run the console archive command until the thing succeeds? There will be no incomplete/corrupt data?

As it stands, the official technique involving "console core:archive" does not work at all on shared hosts like Hostgator's; on mine, I am unable to process piwik tracking logs on the command line through SSH, even for some fairly low/medium traffic websites (the piwik instance I'm currently looking at has only 38 thousand rows in its tracking log tables, weighing about 6.4 MB). From what I understood of my discussions with HG's support staff, MySQL processes get killed when they use too much CPU time for too long.

@nekohayo commented on September 8th 2015

FWIW, I received this additional piece of info/suggestion from Hostgator today:

> Since the script is being killed due to CPU limitations on our shared and reseller servers, the suggested solution would be to configure the script to perform its operations in batches instead of attempting to perform all of the work during a single process and MySQL connection. If the script can be configured to spawn separate processes to handle the work in batches, you can avoid the issue of a single process hitting the CPU limits and being killed.

I don't know if that is a valid suggestion, and if it is, if it should be filed as a separate issue?

@mattab commented on September 8th 2015

@nekohayo core:archive already triggers a separate process for each of the archiving requests, so this should already be implemented.

@mattab commented on September 18th 2015

As @SR-mkuhn pointed out, the common workaround that solves this issue (in most cases) is to increase wait_timeout in the MySQL configuration.
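For reference, a sketch of that workaround; the value is illustrative (28800 seconds, i.e. 8 hours, is MySQL's stock default, but shared hosts often lower it) and should exceed the duration of your longest archiving request:

```sql
-- Illustrative: raise the idle-connection timeout at runtime (requires the SUPER privilege)
SET GLOBAL wait_timeout = 28800;

-- Or persistently in my.cnf, under the [mysqld] section:
--   wait_timeout = 28800
```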

@nekohayo commented on September 18th 2015

You can't actually do that as a user of a shared host, no?

@SR-mkuhn commented on September 18th 2015

@mattab, but wouldn't it be better to work with one MySQL connection instead of two, rather than requiring an increase of wait_timeout? One connection seems to gather data, and at the end the other, idle one writes something. If the first connection has a lot to do, the first command sent by the other connection exceeds the timeout limit. If those database connections were merged, an increase of wait_timeout should be absolutely unnecessary.

@bartek85 commented on October 10th 2015

Maybe I can give you some more data to help? I have a problem after upgrading from 2.12.0 to 2.14.3; it goes away after archiving one website and switching to another.

- wait_timeout=28800 (not the cause)
- max_allowed_packet_size=512MB (not the cause, changing it doesn't make any difference; on 2.12.0 it was 128MB with the same data)
- connection_timeout=28800: that one helped a bit, but on 2.12.0 it's set to "10" and everything is working.

It's a new archive (dropped archive tables) with 4.4k websites and a 40+GB piwik_log_visit table.

@monkz commented on January 9th 2016

Hi, I think I might have found a solution to this problem, although I have no patch.

As described by @SR-mkuhn, core:archive seems to use two MySQL connections. This is because the singleton in https://github.com/piwik/piwik/blob/master/core/Db/Adapter/Pdo/Mysql.php loses its context. CliMulti is the reason: it spawns a separate process for each request, and that process has no knowledge of the MySQL connection in the parent process. While the parent waits for those worker processes to finish, there is no activity on its MySQL connection, so we run into the wait_timeout threshold and everything fails.

My recommendation is: while waiting on child processes in https://github.com/piwik/piwik/blob/master/core/CliMulti/Process.php , execute a NOOP query once or twice within each wait_timeout period inside the parent. SELECT 1;, SET @ignore_me = 0; or DO 0; should be sufficient (see http://lists.mysql.com/mysql/142004). For example, every PDO adapter could get a noop function that CliMulti calls, as sketched below.
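A minimal sketch of that idea, assuming a hypothetical keep-alive helper and a simplified parent wait loop; names and structure do not match Piwik's actual CliMulti code:

```php
<?php
// Hypothetical sketch: keep the parent's MySQL connection alive while child
// archiving processes do the real work, so wait_timeout is never reached.
class DbKeepAlive
{
    /** @var PDO */
    private $connection;

    /** @var int Unix timestamp of the last query on this connection */
    private $lastActivity;

    /** @var int ping at most every N seconds; keep well below wait_timeout */
    private $pingInterval;

    public function __construct(PDO $connection, $pingInterval = 60)
    {
        $this->connection   = $connection;
        $this->pingInterval = $pingInterval;
        $this->lastActivity = time();
    }

    // NOOP query: costs nothing server-side but resets the idle timer.
    public function pingIfIdle()
    {
        if (time() - $this->lastActivity >= $this->pingInterval) {
            $this->connection->query('DO 0');
            $this->lastActivity = time();
        }
    }
}

// Simplified, hypothetical parent wait loop:
// $keepAlive = new DbKeepAlive($pdo);
// while (!$allChildrenFinished) {
//     $keepAlive->pingIfIdle(); // prevents "MySQL server has gone away"
//     usleep(100000);           // check the children every 100 ms
// }
```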

Greetings MonkZ

@SR-mkuhn commented on March 10th 2016

Looks good.

@twanmeijerink commented on April 7th 2016

Hi @monkz, we seem to have the same problem as you described. However, we do not really understand your solution. Can you describe in more detail where exactly you added the NOOP query?

Thanks in advance, Twan

@monkz commented on April 7th 2016

Hi @twanmeijerink, so far the noop concept is not implemented, and I'm not a PHP developer. It seems that if a child process takes too long, the parent dies due to inactivity. A NOOP command might be introduced here: https://github.com/piwik/piwik/blob/master/core/CliMulti/Process.php#L113

My solution is just a recommendation/suggestion for @mattab and @tsteur .
