@Synchro opened this issue on June 22nd 2016

I've just started using Piwik to analyse some web server logs (vanilla nginx access logs) for 2 sites on the same server, one with around 9 million log lines, the other with 24million, combined compressed log size around 1Gb, covering 1 year of traffic. I'm only interested in file-based analysis, so I'm not using the javascript tracker at all.

The import worked, but it was very, very slow, taking over 2 days for the import. The server has 5 CPUs and 8G of RAM and is running PHP 5.4 and MySQL 5.6, and is generally lightly loaded. Here are the results of the second site's import using 3 importer processes:

Logs import summary
-------------------

    24628478 requests imported successfully
    153417 requests were downloads
    1 requests ignored:
        0 HTTP errors
        0 HTTP redirects
        1 invalid log lines
        0 requests did not match any known site
        0 requests did not match any --hostname
        0 requests done by bots, search engines...
        0 requests to static resources (css, js, images, ico, ttf...)
        0 requests to file downloads did not match any --download-extensions

Website import summary
----------------------

    24628478 requests imported to 1 sites
        1 sites already existed
        0 sites were created:

    0 distinct hostnames did not match any existing site:



Performance summary
-------------------

    Total time: 94279 seconds
    Requests imported per second: 261.23 requests per second

I don't have the stats from the first import, but it was also very long.

The resulting MySQL database is 6.4Gb on disk. I have started the Piwik archiving process to process these logs, but there is no indication of how long it might take - is there any way to find out? So far I've got feedback like this:

INFO [2016-06-22 22:09:02] Starting Piwik reports archiving...
INFO [2016-06-22 22:09:03] Will pre-process for website id = 1, period = day, date = last1000
INFO [2016-06-22 22:09:03] - pre-processing all visits
INFO [2016-06-22 22:58:54] Archived website id = 1, period = day, 0 segments, 0 visits in last 1000 days, 0 visits today, Time elapsed: 2991.650s
INFO [2016-06-22 22:58:54] Will pre-process for website id = 1, period = week, date = last1000
INFO [2016-06-22 22:58:54] - pre-processing all visits

So to pre-process all daily stats for 1 site took about 50 mins. I'm concerned that the amount of CPU and database activity it's using may affect the site.

Is this kind of performance normal for Piwik? I've previously used AWStats to process similar-sized logs and had it take only a few minutes.

@Synchro commented on June 23rd 2016

The archiving phase took about 2 hours.

@tsteur commented on June 23rd 2016

In case the "Provider" plugin is enabled, try to disable it. This can make it much faster

@mattab commented on July 8th 2016

In case the "Provider" plugin is enabled, try to disable it. This can make it much faster

fyi @tsteur the Provider plugin is automatically disabled by the log importer, unless --enable-reverse-dns is specified

Requests imported per second: 261.23 requests per second

It's decent number, but to improve performance of the log import process, one can try to increase the imported requests per second. We have before got 600 or even 800 requests per second using log analytics. It will help performance if you use the QueuedTracking plugin: http://plugins.piwik.org/QueuedTracking

@Synchro commented on July 12th 2016

Since this import I set up Piwik to do an hourly import/analyse run as per the docs. What I'm seeing is spectacularly bad performance that's interfering with normal server operation. It's clear from my activity monitoring that Piwik restarts its analysis from the begining of the log file every time it's run, and it hits the database for every single record, every time - so as the current log file gets bigger, Piwik's processing time gets longer and longer, until the log file is rotated and it drops back down again.

You can see this happening in this graph: the little peaks get bigger through the day:

Over a week it looks like this:

This doest look like much, but it's actually saturating disk access. Here's the database POV:

This is all due to Piwik activity because the app on this server only uses MyISAM, and this graph only shows InnoDB, which Piwik uses. You can see here the InnoDB activity since installing Piwik: the initial import (that I originally opened this ticket for) followed by 2 weeks of hourly imports:

What I gather from this is that Piwik does not know how to skip records that have been seen before - so rather than looking up the latest timestamp in the DB and then scanning the log file and ignoring all lines prior to that, or storing a log file inode and byte ofset so that it can jump straight to where it left off, it starts from the beginning every time and absolutely hammers the database and iowait.

I'm not using reverse DNS lookups, I don't need them for this case.

Is this normal?

@mattab commented on July 12th 2016

@Synchro Thanks for the feedback. it is "normal" that Piwik works in this way, but we are aware this is not good enough and we hope to do better in the future :+1: I've created a new issue to discuss this particular problem: Log Analytics could detect log lines that were already imported and skip them automatically #144 I hope the description helps. for now you may have to workaround this issue by importing each log file only once, eg. once a day.

This issue was closed on July 15th 2016
Powered by GitHub Issue Mirror