@theyosh opened this Pull Request on August 30th 2017

Changed only the function insertActionsToKeep

Speed improvements:

  • Fetch all the needed ID field values in a single pass per table instead of one pass per ID field (3-5 times faster)
  • Start with the lowest ID instead of zero. With auto-increment IDs and regular cleanup, the first valid ID keeps growing, so there is no point starting at zero when the first ID is already around 106644353. This saves a lot of empty selects (sketched below)
  • Use extra temporary tables for data pivoting
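
A minimal sketch of what the second point amounts to (table and column names are taken from the discussion below; the window size of 100000 is only illustrative):

    -- Anchor the scan at the smallest existing ID instead of 0,
    -- so ID ranges below the first real row are never queried.
    SELECT MIN(idlink_va) FROM piwik_log_link_visit_action;

    -- Then walk the table in fixed-size ID windows from that value:
    -- WHERE idlink_va BETWEEN @min_id AND @min_id + 100000, and so on.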
@sgiehl commented on August 30th 2017 Member

Note: Test failures are unrelated to these changes

@tsteur commented on September 13th 2017 Owner

We will need to have a good look at it and test it etc. I noticed the current version is super slow and takes ages to purge data even for a fairly small site. cc @mattab

@theyosh commented on September 14th 2017

If you need more info, I am happy to provide.

@tsteur commented on September 22nd 2017 Owner

Code looks good, but we would need to test it and then also compare performance. The biggest performance problem is the start from min(id), I reckon 👍

Did you test whether the insert you did is faster than the insert ...select?

@theyosh commented on September 22nd 2017

@tsteur Selecting min(id) by itself takes multiple minutes. I do not know why; somehow it is hard to get the minimum value from a table. I thought there was an index on it, so it should be fast. I have not investigated yet, as I do not see it as an issue so far. What I have seen is that selecting the minimum value took 12 minutes on production, which is a long time. But when the minimum value is something like 1200000, it still saves about 12 × (number of ID fields) queries that would run through the whole table without finding anything. So the min(id) is there to reduce the number of 'empty' result queries caused by the ever-growing auto-increment field idlink_va

But let's take the following example. We have the table piwik_log_link_visit_action, where all the ID fields are queried. The code loops over the fields idsite, idvisitor, idvisit, idaction_url_ref, etc., requesting only one field at a time, and then walks through the complete table (which for us has about 556970018 records). So the code runs (number of fields) × (total records in table / query limit) queries, which would take at least one week, almost two weeks, to process.
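
Roughly, that old per-field pattern looks like this (a simplified sketch, not the literal Piwik code; the temporary table name is hypothetical):

    -- One full pass over the table per ID column: the same rows are
    -- scanned once for idsite, once for idaction_url_ref, and so on.
    INSERT IGNORE INTO tmp_ids_to_keep
        SELECT idsite FROM piwik_log_link_visit_action
        WHERE idlink_va BETWEEN 0 AND 100000;

    INSERT IGNORE INTO tmp_ids_to_keep
        SELECT idaction_url_ref FROM piwik_log_link_visit_action
        WHERE idlink_va BETWEEN 0 AND 100000;
    -- ...repeated for every ID column and for every 100000-row window.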

So what I did was remove the loop over the fields idsite, idvisitor, idvisit, idaction_url_ref, etc. and create a single query that requests all those fields at once. As the data is (big)int values, the result set cannot be that big and should not cause memory errors. Now there are only 1 × (total records in table / query limit) passes. This is the big time saver.
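
Schematically, the combined query looks like this (again a sketch; the exact column list follows whatever insertActionsToKeep needs):

    -- One pass per ID window fetches every needed column at once,
    -- so the per-column loop over the same rows disappears.
    SELECT idsite, idvisitor, idvisit, idaction_url_ref
    FROM piwik_log_link_visit_action
    WHERE idlink_va BETWEEN @start AND @start + 100000;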

So I get a result set back that contains all the data that should be saved in the temporary table. To store it, a query of the form 'INSERT IGNORE INTO [TEMPTABLE] VALUES (...),(...),...' is built, so that a single query inserts 1000 records at once. The 1000 is just a number; 10000 might also be possible, depending on the amount of available memory.
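
Spelled out, the batched insert looks something like this (the temporary table name and the literal values are hypothetical; each fetched ID value becomes its own single-column row, which is the pivoting step):

    -- 1000 value groups per statement; INSERT IGNORE silently drops
    -- IDs that are already present in the temporary table.
    INSERT IGNORE INTO tmp_ids_to_keep VALUES
        (106644353),
        (106644354),
        (106644360);
        -- ...up to 1000 rows per statement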

My insert query is not faster than insert ... select, but that is not the point here. The point is that I have reduced the total number of queries run during the cleanup. Because insert ... select basically forces you to select only one field at a time, it forces multiple loops over the same data set, which is the killer.

We have noticed that this fix uses a lot of disk. Our SSD showed 70% utilization while running the new cleanup code. So the cleanup takes more resources, but it now takes only about 2 days instead of 2 weeks, so we are happy with it.

@tsteur commented on September 24th 2017 Owner

Selecting min(id) by itself takes multiple minutes. I do not know why; somehow it is hard to get the minimum value from a table. I thought there was an index on it, so it should be fast. I have not investigated yet, as I do not see it as an issue so far. What I have seen is that selecting the minimum value took 12 minutes on production.

Interesting, as there is an index on this field (it is the first column of an index, and on the log_visit table even the primary key), I would also have expected this query to be quite fast.
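
One quick way to check whether MySQL can answer this from the index alone (assuming idlink_va is the indexed leading column, as described above):

    EXPLAIN SELECT MIN(idlink_va) FROM piwik_log_link_visit_action;
    -- When the index is usable, MySQL reports 'Select tables
    -- optimized away' and the query returns instantly; a 12-minute
    -- MIN() suggests the index is not being used for this query.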

@theyosh commented on November 12th 2017

Ping? Any updates on this?
