@mattab opened this Issue on August 4th 2014 Owner

Let's discuss ideas to make archiving faster when there are thousands of websites with low or no traffic.

@mattab commented on August 4th 2014 Owner
  • Instead of archiving each website separately and issuing SQL queries for each plugin of each website, each plugin would archive all websites with one set of SQL queries.
    • We could change the Archiver to issue a GROUP BY idsite statement to process all websites at once.
    • It will only work if the dataset fits in memory (it may not work well for very large numbers of websites). Maybe we could process in batches of 1,000 websites?
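
The batching idea above can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not Piwik's Archiver: the helper name `buildBatchedVisitQueries` is hypothetical, and the choice of the `log_visit` table, the `visit_last_action_time` column and the `nb_visits` aggregate is only for illustration.

```php
<?php
// Hypothetical helper, not part of Piwik: builds one aggregate query per batch
// of site IDs, so each query covers many sites via GROUP BY idsite.
function buildBatchedVisitQueries(array $idSites, int $batchSize = 1000, string $table = 'log_visit'): array
{
    $queries = [];
    foreach (array_chunk($idSites, $batchSize) as $batch) {
        // Cast to int so only numeric IDs end up in the IN (...) list.
        $inList = implode(',', array_map('intval', $batch));
        // Assumed columns: idsite and visit_last_action_time.
        $queries[] = "SELECT idsite, COUNT(*) AS nb_visits FROM $table"
            . " WHERE idsite IN ($inList)"
            . " AND visit_last_action_time >= ? AND visit_last_action_time < ?"
            . " GROUP BY idsite";
    }
    return $queries;
}

// 2,500 sites in batches of 1,000 give three queries instead of 2,500.
$queries = buildBatchedVisitQueries(range(1, 2500));
echo count($queries), " queries\n";
```

Sites without any matching rows do not appear in a GROUP BY result at all, so the caller would have to treat a missing idsite as zero visits.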
@mattab commented on August 4th 2014 Owner
  • Archive the websites that have more traffic first, to provide a better user experience.
  • Avoid re-processing the period archive (which is slow, as we select all sub-periods, sum them, and write the data). To avoid re-processing, we could update the existing archive for this period and set its date2 column to today. This way we avoid lots of writing, and we also avoid having to delete these outdated "current period" archives. From #4940.
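
Ordering sites by traffic could look roughly like this. It is a sketch only: `orderSitesByTraffic` is a hypothetical name, and the visit counts are assumed to come from some earlier, cheap lookup that is not shown here.

```php
<?php
// Hypothetical input: site ID => visit count.
// Returns the site IDs with the busiest sites first.
function orderSitesByTraffic(array $visitsBySite): array
{
    arsort($visitsBySite); // sort by value, descending, preserving the site ID keys
    return array_keys($visitsBySite);
}

$order = orderSitesByTraffic([14 => 0, 27 => 120, 3 => 15]);
echo implode(', ', $order), "\n";
```

With thousands of empty sites, this mainly moves the sites that people actually look at to the front of the queue; it does not reduce the total work.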
@czolnowski commented on August 4th 2014 Contributor
  • Use the tracker cache/DB to store the time of the last visit, so that archiving can check whether a site has had any visits.
@mattab commented on August 4th 2014 Owner

If anyone has an archive.log for this use case of thousands of websites with low traffic, please attach it to this ticket, thanks!

@bjornhij commented on August 21st 2014

The following URL has the log file from archiving about 8000 small websites (I can only attach images here, so I have to link to an external URL):

http://stats.exto.nl/5922/archive.log

Hope this helps. Since archiving was moved to a CLI process, it has become too slow to archive every day.

@mattab commented on December 15th 2014 Owner

Right now I'm archiving some empty websites and it takes a long time:

INFO CoreConsole[2014-12-15 23:18:14] [9369d] Archived website id = 14, period = day, 0 visits in last last52 days, 0 visits today, Time elapsed: 1.235s
INFO CoreConsole[2014-12-15 23:18:29] [9369d] Archived website id = 14, period = week, 0 visits in last last260 weeks, 0 visits this week, Time elapsed: 15.551s
INFO CoreConsole[2014-12-15 23:18:35] [9369d] Archived website id = 14, period = month, 0 visits in last last52 months, 0 visits this month, Time elapsed: 5.111s
INFO CoreConsole[2014-12-15 23:18:35] [9369d] Archived website id = 14, period = year, 0 visits in last last7 years, 0 visits this year, Time elapsed: 0.738s
INFO CoreConsole[2014-12-15 23:18:35] [9369d] Archived website id = 14, 4 API requests, Time elapsed: 22.638s [1/1365 done]

Note that this was the first time those websites were being archived, which explains some of the slowness, but still: something should be done so that archiving an empty website takes less than 22 seconds.

@tsteur commented on April 7th 2015 Owner

As @czolnowski suggested, something like a flag for each site should help, I reckon. Basically, we wanna know whether there was at least one tracking request since the last archiving. Or to describe it differently: We wanna only trigger the archiving for sites that had at least one tracking request since the last archiving run. This will only help if one has many sites with 0 visits. It does not help for many sites with low traffic.

At first I thought we could just query the log_visit table or the log_link_visit_action table to check whether there was any record since the last archiving, but that's probably not doable, since one might track something for a previous day, which invalidates the archives. I'm not sure if we update the last archiving date in this case. Maybe it is doable.

It is probably not worth storing a flag in the option table or similar, as it would make Tracking slower. A cache file etc. cannot be used either, as we might clear the cache. If there's a solution, it should probably be based on the log_ tables.
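
The check described above can be sketched like this. It is only an illustration: `sitesNeedingArchiving` is a hypothetical name, and where the two timestamps per site would come from (a log_ table, the tracker cache, or elsewhere) is exactly the open question, so they are plain arrays here.

```php
<?php
// Hypothetical inputs, both keyed by site ID and holding Unix timestamps:
//   $lastTrackedAt  - when the site last received a tracking request
//   $lastArchivedAt - when the site's last archiving run started
function sitesNeedingArchiving(array $lastTrackedAt, array $lastArchivedAt): array
{
    $idSites = [];
    foreach ($lastTrackedAt as $idSite => $trackedAt) {
        // A site that was never archived always qualifies.
        $archivedAt = $lastArchivedAt[$idSite] ?? null;
        if ($archivedAt === null || $trackedAt >= $archivedAt) {
            $idSites[] = $idSite;
        }
    }
    return $idSites;
}

// Site 1 was archived after its last request, site 2 was tracked since, site 3 was never archived.
$todo = sitesNeedingArchiving([1 => 100, 2 => 500, 3 => 50], [1 => 200, 2 => 200]);
echo implode(', ', $todo), "\n";
```

Comparing on the time the request was received, rather than on the date of the visit, is what would let a request tracked for a previous day still trigger a re-archive.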

@mattab commented on April 9th 2015 Owner

> We wanna only trigger the archiving for sites that had at least one tracking request since the last archiving run.

If you don't trigger archiving when there are no new visits, then we would have missing daily archives and missing week/month/year archives, leading to "no data" in some reports. We'd need to change more code (e.g. the code that deletes out-of-date archives).

Maybe there is instead some room to decrease the CPU and wall-clock time of archiving requests for "no traffic" websites, to make them very fast?

@tsteur commented on April 9th 2015 Owner

When there is already an archive, and there were no visits, we don't need to re-archive, do we? Of course we might need to change some code, but that's normal, isn't it?

@mattab commented on April 9th 2015 Owner

> When there is already an archive, and there were no visits, we don't need to re-archive, do we?
> Of course we might need to change some code, but that's normal, isn't it?

Yes, we'd need to change code (the archive selector would need to allow reading old archives, and we could not purge outdated archives, as we may still read them if we don't re-archive every day, e.g. 5-day-old archives)... It's possible, but maybe error-prone.

I was hoping we could make the "pre-archiving a site when there is no visit" request so fast that we wouldn't need to be clever about reading old archives, etc. Not sure if archiving those "low / no traffic" days very fast is really possible, though?

@tsteur commented on April 10th 2015 Owner

FYI: A quick profile of archiving one day for one site: 60-80% of the time is spent outside archiving, on bootstrapping Piwik, loading all reports, all segments, ... It might be possible to make a faster version for the CLI that doesn't bootstrap a lot and directly calls something like new Piwik\ArchiveProcessor\Loader()->prepareArchive()
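
To put the 60-80% figure in perspective, here is a back-of-the-envelope estimate. It assumes, as an idealization, that the bootstrap share of each request could be removed entirely while the archiving work itself stays unchanged; `maxSpeedup` is a hypothetical helper name.

```php
<?php
// Upper bound on the speedup if a given share of each request's time
// (the bootstrap share) disappeared and the rest stayed the same.
function maxSpeedup(float $bootstrapShare): float
{
    return 1.0 / (1.0 - $bootstrapShare);
}

printf("60%% bootstrap: up to %.1fx\n", maxSpeedup(0.60));
printf("80%% bootstrap: up to %.1fx\n", maxSpeedup(0.80));
```

Under that assumption the ceiling is roughly 2.5x to 5x per request, so any gain beyond that would have to come from the archiving queries themselves.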

@diosmosis commented on April 11th 2015 Member

Would it be possible to publish these profiles? It would be interesting to be able to examine them.

@tsteur commented on April 12th 2015 Owner

I didn't have it anymore, so I quickly ran the archiver again. Attached is a screenshot (archiver_profile), and I'm sending you the actual profile via message as I cannot attach it here.

I wrote another version of the archiver where I did just the following in a simple command:

use Piwik\ArchiveProcessor\Loader;
use Piwik\ArchiveProcessor\Parameters;
use Piwik\Period\Factory;
use Piwik\Plugin;
use Piwik\Segment;
use Piwik\Site;
$pluginNames = Plugin\Manager::getInstance()->getAllPluginsNames(); // we should only get plugins with archiver
$params = new Parameters(new Site($idSite), Factory::build($period, $date), new Segment('', array($idSite)));
$loader = new Loader($params);
foreach ($pluginNames as $pluginName) {
    $loader->prepareArchive($pluginName);
}

This way I archived upwards of 300-400 sites per minute (period = day), each with 2-6 visits, instead of only 50 sites per minute.

@mattab commented on April 27th 2015 Owner

Running core:archive for a website that has no data is still slow. It takes about 20 seconds on my laptop to process the day/week/month/year/range reports (5 API requests). Here is a typical output:

INFO [2015-04-27 05:11:16] Starting Piwik reports archiving...
INFO [2015-04-27 05:11:16] Will pre-process for website id = 27, day period
INFO [2015-04-27 05:11:16] - pre-processing all visits
INFO [2015-04-27 05:11:17] Archived website id = 27, period = day, 0 segments, 0 visits in last last9 days, 0 visits today, Time elapsed: 0.023s
INFO [2015-04-27 05:11:17] Will pre-process for website id = 27, week period
INFO [2015-04-27 05:11:17] - pre-processing all visits
INFO [2015-04-27 05:11:19] Archived website id = 27, period = week, 0 segments, 0 visits in last last9 weeks, 0 visits this week, Time elapsed: 1.569s
INFO [2015-04-27 05:11:19] Will pre-process for website id = 27, month period
INFO [2015-04-27 05:11:19] - pre-processing all visits
INFO [2015-04-27 05:11:23] Archived website id = 27, period = month, 0 segments, 0 visits in last last9 months, 0 visits this month, Time elapsed: 3.811s
INFO [2015-04-27 05:11:23] Will pre-process for website id = 27, year period
INFO [2015-04-27 05:11:23] - pre-processing all visits
INFO [2015-04-27 05:11:37] Archived website id = 27, period = year, 0 segments, 0 visits in last last7 years, 0 visits this year, Time elapsed: 14.471s
INFO [2015-04-27 05:11:37] Will pre-process for website id = 27, range period
INFO [2015-04-27 05:11:37] - pre-processing all visits
INFO [2015-04-27 05:11:38] Archived website id = 27, period = range, 0 segments, 0 visits in last previous30 ranges, 0 visits this range, Time elapsed: 0.497s
INFO [2015-04-27 05:11:38] Archived website id = 27, 5 API requests, Time elapsed: 21.300s [1/74 done]

The problem gets N times worse when you add N segments...

If we can improve this performance in future Piwik versions, it would for sure help a lot!
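
For anyone comparing their own archive.log, a small script along these lines can show which period dominates. It is a sketch: `elapsedByPeriod` is a hypothetical helper, and the regular expression only assumes the "period = ..., ... Time elapsed: ...s" shape of the lines shown above.

```php
<?php
// Sums the "Time elapsed" values per period from "Archived website ..." lines.
// Lines without a "period = ..." part (such as the per-site summary) are skipped.
function elapsedByPeriod(array $logLines): array
{
    $totals = [];
    foreach ($logLines as $line) {
        if (preg_match('/period = (\w+),.*Time elapsed: ([\d.]+)s/', $line, $m)) {
            $totals[$m[1]] = ($totals[$m[1]] ?? 0.0) + (float) $m[2];
        }
    }
    arsort($totals); // slowest period first
    return $totals;
}

$totals = elapsedByPeriod([
    'INFO [2015-04-27 05:11:17] Archived website id = 27, period = day, 0 segments, 0 visits in last last9 days, 0 visits today, Time elapsed: 0.023s',
    'INFO [2015-04-27 05:11:37] Archived website id = 27, period = year, 0 segments, 0 visits in last last7 years, 0 visits this year, Time elapsed: 14.471s',
    'INFO [2015-04-27 05:11:38] Archived website id = 27, 5 API requests, Time elapsed: 21.300s [1/74 done]',
]);
foreach ($totals as $period => $seconds) {
    printf("%s: %.3fs\n", $period, $seconds);
}
```

In the output above, the year period alone accounts for 14.471s of the 21.300s total.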

@mattab commented on November 25th 2015 Owner

Closing this issue as we made heaps of progress recently and this issue's scope is too wide.

Well done to the team for all the improvements made in the last few months!

This Issue was closed on November 25th 2015