@mattab opened this Issue on October 23rd 2014 Owner

The goal of this issue is to make the query in purgeInvalidatedArchiveTable faster.

In Piwik 2.8.0 we're seeing that the nice SQL query introduced in #6116 is sometimes a bit slow.
This query is triggered for example when log import is used.
Maybe we could make this query faster?

MySQL slow query log report:


Count: 13  Time=342.83s (4456s)  Lock=0.00s (0s)  Rows=0.0 (0), 13users@2hosts
  SELECT t1.idarchive FROM `piwik_archive_blob_2014_06` t1
  INNER JOIN `piwik_archive_blob_2014_06` t2
  ON t1.name = t2.name AND t1.idsite=t2.idsite
  AND t1.date1=t2.date1 AND t1.date2=t2.date2 AND t1.period=t2.period
  WHERE t1.value = N
  AND t2.value IN(N, N)
  AND t1.ts_archived < t2.ts_archived AND t1.name LIKE 'S'

Count: 16  Time=253.72s (4059s)  Lock=0.00s (0s)  Rows=0.0 (0), 16users@2hosts
  SELECT t1.idarchive FROM `piwik_archive_blob_2014_07` t1
  INNER JOIN `piwik_archive_blob_2014_07` t2
  ON t1.name = t2.name AND t1.idsite=t2.idsite
  AND t1.date1=t2.date1 AND t1.date2=t2.date2 AND t1.period=t2.period
  WHERE t1.value = N
  AND t2.value IN(N, N)
  AND t1.ts_archived < t2.ts_archived AND t1.name LIKE 'S'
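To see what the self-join in the slow query actually selects, here is a minimal sketch using Python's sqlite3 on a toy table. The status codes are assumptions modelled on Piwik's `ArchiveWriter` constants (1 = DONE_OK, 3 = DONE_OK_TEMPORARY, 4 = DONE_INVALIDATED); the real tables and values may differ.

```python
# Minimal sqlite3 sketch of the purge self-join from the slow query log.
# Status codes are assumptions: 1 = DONE_OK, 3 = DONE_OK_TEMPORARY, 4 = DONE_INVALIDATED.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE archive_blob_2014_06 (
    idarchive INTEGER, idsite INTEGER, name TEXT,
    date1 TEXT, date2 TEXT, period INTEGER,
    ts_archived TEXT, value INTEGER
);
-- idarchive 1: invalidated archive, later re-processed as idarchive 2
-- idarchive 3: invalidated but NOT yet re-processed, so it must be kept
INSERT INTO archive_blob_2014_06 VALUES
 (1, 1, 'done', '2014-06-01', '2014-06-01', 1, '2014-06-02 00:00', 4),
 (2, 1, 'done', '2014-06-01', '2014-06-01', 1, '2014-06-03 00:00', 1),
 (3, 2, 'done', '2014-06-01', '2014-06-01', 1, '2014-06-02 00:00', 4);
""")

# An invalidated archive (t1) is purgeable only if a newer OK or
# temporary archive (t2) exists for the same site/dates/period/name.
purgeable = conn.execute("""
    SELECT t1.idarchive FROM archive_blob_2014_06 t1
    INNER JOIN archive_blob_2014_06 t2
      ON t1.name = t2.name AND t1.idsite = t2.idsite
     AND t1.date1 = t2.date1 AND t1.date2 = t2.date2
     AND t1.period = t2.period
    WHERE t1.value = 4            -- DONE_INVALIDATED
      AND t2.value IN (1, 3)     -- DONE_OK, DONE_OK_TEMPORARY
      AND t1.ts_archived < t2.ts_archived
      AND t1.name LIKE 'done%'
""").fetchall()
print(purgeable)  # only idarchive 1 is safe to purge
```

On a large archive table this self-join touches every invalidated row, which is why it shows up in the slow query log.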
@mattab commented on October 28th 2014 Owner

Currently we purge invalidated data by looping over every archive table: https://github.com/piwik/piwik/blob/master/core/DataAccess/ArchivePurger.php#L28-29

Instead maybe we could keep track of which dates were invalidated, and only run the query on tables for the months where some data was invalidated?
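The idea above could look something like this rough sketch (the option key name and the dict standing in for Piwik's option table are hypothetical, not Piwik's actual API):

```python
# Hedged sketch: remember which archive months were invalidated, so the
# purge only runs the expensive query on those tables instead of all of them.
import json

OPTION_KEY = "InvalidatedOldReports_MonthsToPurge"  # hypothetical option name
options = {}  # stands in for Piwik's option table

def remember_invalidated_months(months):
    """Called at the end of log import for each affected month."""
    known = set(json.loads(options.get(OPTION_KEY, "[]")))
    options[OPTION_KEY] = json.dumps(sorted(known | set(months)))

def months_to_purge():
    """Called by the scheduled purge task."""
    return json.loads(options.get(OPTION_KEY, "[]"))

# A log import invalidates June and July 2014; the purge task would
# then only query archive_blob_2014_06 and archive_blob_2014_07.
remember_invalidated_months(["2014_06", "2014_07"])
print(months_to_purge())
```

With 5 years of data this reduces the work from 60 monthly tables to only the one or two months a log import actually touched.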

@tsteur commented on November 4th 2014 Owner

Do we have a sample SQL query to actually test this? Maybe it can be split into two queries to make it faster? I'm not sure how many archives are usually invalidated... I think this is only executed daily, so 3-10 min actually sounds OK maybe. All this 'done' stuff seems a bit weird anyway. Maybe it makes sense to refactor it at some point.

@mattab commented on November 4th 2014 Owner

The reason it's so slow is that it takes 3-10 min per monthly archive table. So if a user has 5 years of data in Piwik, it will take 60 tables * 5 min = 5 hours. I'd suggest keeping track of which months were invalidated (eg. in the option table) and only running the query on tables for the months where some data was invalidated?

@tsteur commented on November 4th 2014 Owner

Different question: in this query, why is there a check whether there is also an OK or temporary archive? Invalidated sounds to me as in "it is invalid", meaning it can be removed no matter what. If needed, it will be reprocessed.

I presume there is a reason for this, as it wouldn't be there otherwise. Maybe we can at least fix the naming, as "invalidated" sounds confusing when an invalidated archive is still being used.

@mattab commented on November 4th 2014 Owner

> Invalidated sounds to me as in "it is invalid" meaning it can be removed no matter what.

Ok, sorry, the code does not make this clear enough. Basically, users will import logs, and at the end of the log import it marks archives as invalidated. Now imagine the scheduled task runs after log import and before the next core:archive run (core:archive could next run only 10-20 hours afterwards). Then, for those 10-20 hours, if we deleted everything that was invalidated, we would have "no data" gaps. That's why we only delete data after it has already been re-processed (the check for an OK/temp archive). This was done in #5932

Now I'm wondering how to best solve this issue... thinking.

@mattab commented on November 5th 2014 Owner

The purge query is here: https://github.com/piwik/piwik/blob/master/core/DataAccess/Model.php#L25-25

Assumptions

  • the scheduled task can run many times (and/or for days) before core:archive next re-processes the website(s) that were invalidated.
  • there can be many log imports at the same time each invalidating their own months and websites
  • core:archive may run several times before it archives all the websites that have been invalidated.

Notes

  • there is an index on the archive_ tables, INDEX index_idsite_dates_period(idsite, date1, date2, period, ts_archived), which we should use so that we only purge data for the websites that were imported. This filter is currently missing from the logic.
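Making use of that index could look like the following sketch: add an idsite restriction to the purge query so MySQL can narrow the scan via the leading idsite column. The helper name and status codes (1 = DONE_OK, 3 = DONE_OK_TEMPORARY, 4 = DONE_INVALIDATED) are assumptions for illustration, not Piwik's actual code.

```python
# Hedged sketch: restrict the purge query to the sites that were actually
# invalidated, so MySQL can use index_idsite_dates_period(idsite, ...).
def build_purge_query(table, invalidated_site_ids):
    # int() guards against injection for this illustrative helper
    ids = ", ".join(str(int(i)) for i in invalidated_site_ids)
    return (
        f"SELECT t1.idarchive FROM `{table}` t1 "
        f"INNER JOIN `{table}` t2 "
        "ON t1.name = t2.name AND t1.idsite = t2.idsite "
        "AND t1.date1 = t2.date1 AND t1.date2 = t2.date2 "
        "AND t1.period = t2.period "
        "WHERE t1.value = 4 AND t2.value IN (1, 3) "
        "AND t1.ts_archived < t2.ts_archived "
        "AND t1.name LIKE 'done%' "
        f"AND t1.idsite IN ({ids})"
    )

print(build_purge_query("piwik_archive_blob_2014_06", [5, 7]))
```

Combined with tracking the invalidated months, this would mean each purge run scans only the affected (month, site) combinations rather than whole tables.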

I started refactoring and I think I know how to fix it. I've self-assigned the issue!

This Issue was closed on November 5th 2014