The goal of this issue is to make the query in purgeInvalidatedArchiveTable faster.
In Piwik 2.8.0 we're seeing that the nice SQL query introduced in #6116 is sometimes a bit slow. This query is triggered for example when log import is used. Maybe we could make this query faster?
Mysql slow query log report:
Count: 13 Time=342.83s (4456s) Lock=0.00s (0s) Rows=0.0 (0), 13users@2hosts SELECT t1.idarchive FROM `piwik_archive_blob_2014_06` t1 INNER JOIN `piwik_archive_blob_2014_06` t2 ON t1.name = t2.name AND t1.idsite=t2.idsite AND t1.date1=t2.date1 AND t1.date2=t2.date2 AND t1.period=t2.period WHERE t1.value = N AND t2.value IN(N, N) AND t1.ts_archived < t2.ts_archived AND t1.name LIKE 'S' Count: 16 Time=253.72s (4059s) Lock=0.00s (0s) Rows=0.0 (0), 16users@2hosts SELECT t1.idarchive FROM `piwik_archive_blob_2014_07` t1 INNER JOIN `piwik_archive_blob_2014_07` t2 ON t1.name = t2.name AND t1.idsite=t2.idsite AND t1.date1=t2.date1 AND t1.date2=t2.date2 AND t1.period=t2.period WHERE t1.value = N AND t2.value IN(N, N) AND t1.ts_archived < t2.ts_archived AND t1.name LIKE 'S'
Currently we purge invalidated data by looking over each archive table: https://github.com/piwik/piwik/blob/master/core/DataAccess/ArchivePurger.php#L28-29
Instead maybe we could keep track of which dates were invalidated, and only run the query on tables for the months where some data was invalidated?
Do we have a sample SQL to actually test this? Maybe it can be split in two queries to make it faster? Not sure how many archives are usually invalidated... I think this is only executed daily so 3-10min sounds actually ok maybe. All this 'done' stuff seems to be a bit weird anyway. Maybe it makes sense to refactor it at some point.
The reason it's so slow is that it takes 3-10min per monthly archive table. So if a user has 5 years with PIwik, it will take 60 * 5min = 5 hours. I'd suggest to keep track of which months were invalidated (eg. in option table) and only run the query on tables for the months where some data was invalidated?
different question: In this query why is there a check if there is also a OK or temporary archive? Invalidated sounds to me as in "it is invalid" meaning it can be removed no matter what. If needed it will be reprocessed.
I presume there is a reason for this as it wouldn't be there otherwise. Maybe we can at least fix the naming as invalid sounds confusing in case an invalid archive is still used.
Invalidated sounds to me as in "it is invalid" meaning it can be removed no matter what.
Ok sorry the code does not make this clear enough. basically users will import logs and at the end of log import it marks archives as invalidated. Now imagine the scheduled task runs after log import and before the next
core:archive run (
core:archive could run next only 10-20 hours afterwards). Then for those 10-20 hours, if we deleted all that was invalidated, we would have "no data" gaps. That's why we only delete data after it has been already re-processed (the check for OK/temp archive). This was done in #5932
Now I'm wondering how to best solve this issue... thinking.
The purge query is here: https://github.com/piwik/piwik/blob/master/core/DataAccess/Model.php#L25-25
- the scheduled task can run many times (and/or for days) before
core:archive next really re-processes the website(s) that was invalidated.
- there can be many log imports at the same time each invalidating their own months and websites
core:archive may run several times before it archives all the websites that have been invalidated.
- there is an index on
INDEX index_idsite_dates_period(idsite, date1, date2, period, ts_archived), that we should use to only invalidate data for websites that were imported. This is missing for now from the logic.
I started doing refactoring and I think I know how to fix it. I've self assigned issue!