@alexkuk opened this Pull Request on March 7th 2016 Contributor

Implements #9389 - a UserId module that adds a new Users report showing unique user IDs and some aggregated data. Includes reindexing of the raw visitor log into an aggregated user ids table.

@sgiehl commented on March 7th 2016 Member

Is there any reason for not having that as a separate plugin, not bundled with core but available on the marketplace instead?
If any changes to core are required to make such a plugin possible, we could add just those.

@alexkuk commented on March 7th 2016 Contributor

@sgiehl The issue description makes it look like it's needed in core. If it's not, I will go with the marketplace plus small changes in core.

@sgiehl commented on March 7th 2016 Member

Not sure about it. @tsteur @mattab what do you think?

@tsteur commented on March 7th 2016 Owner

Yes, it would be helpful to have this in core, I'd say. I'm excited about this one, I'll have a look :)

@tsteur commented on March 7th 2016 Owner

First :+1: for working on this. It'll be a useful feature and it'll be good to have all UserId related features in a separate plugin. Some day we will also move more UserId specific things that currently live in other plugins into it. For people not knowing what it does I took a screenshot:

[screenshot of the new Users report]

Even the visitor profile row action works :+1:

Re visualization:

Re indexer:
What you built with the indexer is basically exactly what an Archiver does in Piwik. It'll "re-index / re-archive" periodically, delete old archives that are no longer needed, etc., out of the box. Plus it allows using segments on top, it'll work for date ranges, etc. However, the Archiver is the most complicated part of Piwik, and even for us it often takes a very long time to understand it and make it work each time again for a new plugin. We will start refactoring the archiver in Piwik 3 in 1-2 months (it'll take 3-6 months approx) so that nobody ever has to create an archiver manually again.

I've been thinking for a while about what a good example Archiver would be, but first: you can create a new archiver for this plugin by executing ./console generate:archiver. A good inspiration for an archiver that you could adapt might be https://github.com/piwik/piwik/blob/2.16.0/plugins/DevicePlugins/Archiver.php and https://github.com/piwik/piwik/blob/2.16.0/plugins/Resolution/Archiver.php

It's quite complex but an example could look like this:

<?php
/**
 * Piwik - free/libre analytics platform
 *
 * @link http://piwik.org
 * @license http://www.gnu.org/licenses/gpl-3.0.html GPL v3 or later
 */

namespace Piwik\Plugins\UserId;
use Piwik\DataArray;
use Piwik\DataTable;
use Piwik\Metrics;

/**
 * Class Archiver
 * @package Piwik\Plugins\UserId
 *
 * The Archiver is the class that processes raw data into ready-to-read reports.
 * It must implement two methods: aggregateDayReport() for aggregating daily
 * reports, and aggregateMultipleReports() for summing daily reports into
 * periods like week, month, year or a custom range.
 *
 * For more detailed information about Archiver please visit Piwik developer guide
 * http://developer.piwik.org/api-reference/Piwik/Plugin/Archiver
 */
class Archiver extends \Piwik\Plugin\Archiver
{
    /**
     * It is a good practice to store your archive names (reports stored in database)
     * in Archiver class constants. You can define as many record names as you want
     * for your plugin.
     *
     * It is also important that the record name is prefixed with the plugin name.
     *
     * This is only an example record name, so feel free to change it to suit your needs.
     */
    const USERID_ARCHIVE_RECORD = "UserId_archive_record";

    /**
     * @var DataArray[]
     */
    protected $arrays   = array();
    protected $metadata = array();

    public function aggregateDayReport()
    {
        $this->aggregateUsers();
    }
    /**
     * Period archiving: simply sums up daily archives
     */
    public function aggregateMultipleReports()
    {
        $dataTableRecords = array(
            self::USERID_ARCHIVE_RECORD,
        );
        $columnsAggregationOperation = null;
        $this->getProcessor()->aggregateDataTableRecords(
            $dataTableRecords,
            $this->maximumRows,
            $maximumRowsInSubDataTable = null,
            $columnToSortByBeforeTruncation = null,
            $columnsAggregationOperation,
            $columnsToRenameAfterAggregation = null,
            $countRowsRecursive = array());
    }

    protected function aggregateUsers()
    {
        $dimensions = array('user_id');

        $where = "user_id is not null and user_id != ''";
        $additionalSelects = array('sum(visit_total_actions) as nb_actions');
        // might need to add more metrics, also we could eg show number of searches per user etc but not needed for now
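        // (note: the API example further down reads an 'idvisitor' column from each row;
        // if that is kept, something like 'MAX(log_visit.idvisitor) AS idvisitor' would
        // also need to be selected here - untested)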

        $table = $this->getLogAggregator()
                      ->queryVisitsByDimension($dimensions, $where, $additionalSelects)
                      ->asDataTable();
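        // ColumnCallbackDeleteRow deletes every row for which the callback returns true,
        // i.e. here any row whose label (the user_id) is 5 characters or shorter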
        $table->filter('ColumnCallbackDeleteRow', array('label', function ($value) {
            return strlen($value) <= 5;
        }));
        $this->insertTable(self::USERID_ARCHIVE_RECORD, $table);
        return $table;
    }

    protected function insertTable($recordName, DataTable $table)
    {
        $report = $table->getSerialized($this->maximumRows, null, Metrics::INDEX_NB_VISITS);
        $this->getProcessor()->insertBlobRecord($recordName, $report);
    }

}

This will also make your API much easier as pretty much all you need to do is:

    public function getUsers($idSite, $period, $date, $segment, $expanded, $flat)
    {
        Piwik::checkUserHasViewAccess($idSite);

        $dataTable = Archive::createDataTableFromArchive(Archiver::USERID_ARCHIVE_RECORD, $idSite, $period, $date, $segment, $expanded, $flat);

        foreach ($dataTable->getRows() as $row) {
            /*
             * Get idvisitor and remove it from DataTable. We need it to form a URL that is used in
             * JS to show visitor details popover. See rowaction.js
             */
            $idVisitor = $row->getColumn('idvisitor');
            $row->deleteColumn('idvisitor');
            $row->setMetadata(
                'url', 'module=Live&action=getVisitorProfilePopup&visitorId=' . urlencode(bin2hex($idVisitor))
            );
        }

        return $dataTable;
    }

Afterwards you can remove pretty much everything related to the indexer, and it will have more features. I haven't tested anything but wanted to point you in the right direction. Really excited to see this report in Piwik :)

@alexkuk commented on March 8th 2016 Contributor

@tsteur I've updated the PR with 3 more columns and date formatting.
Regarding the re-indexer: I considered using the existing archiving subsystem, but I don't think it's the right fit here, so I went with a custom indexer. I think archives are meant to hold data aggregated by periods and are optimized for short-period reports.
What we have here is different: we add an additional entity, userId, and this entity should be extracted from the visit logs. The Users report is not actually a report by period but rather a list of ALL users plus some aggregated information about them. We'd also like to be able to search for a particular user among all users. Doing that with the existing archiving means loading all report records from multiple DB tables and creating many objects in memory with PHP on every Users page request. Sorting, pagination, search and even plain listing would become very resource greedy. On systems with 1b+ visits this may become a problem.
Doing the same thing with a custom indexer lets us create a DB structure that completely fits our needs: the reindex process is fast and the reads are fast.
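To illustrate what I mean, here is a rough sketch (table and column names are only examples, not necessarily what this PR ends up doing):

    <?php
    use Piwik\Common;
    use Piwik\Db;

    // Example only: a dedicated, indexed table with one row per user_id per site,
    // so listing, searching and paginating users is a plain indexed SQL query.
    Db::exec("CREATE TABLE IF NOT EXISTS " . Common::prefixTable('example_user_ids') . " (
        idsite          INT UNSIGNED NOT NULL,
        user_id         VARCHAR(200) NOT NULL,
        total_visits    INT UNSIGNED NOT NULL DEFAULT 0,
        total_actions   INT UNSIGNED NOT NULL DEFAULT 0,
        last_visit_time DATETIME NOT NULL,
        PRIMARY KEY (idsite, user_id)
    )");

    // Example only: the (re-)index step aggregates the raw visit log into that table.
    Db::query("REPLACE INTO " . Common::prefixTable('example_user_ids') . "
               SELECT idsite, user_id, COUNT(*), SUM(visit_total_actions), MAX(visit_last_action_time)
                 FROM " . Common::prefixTable('log_visit') . "
                WHERE user_id IS NOT NULL AND user_id != ''
                GROUP BY idsite, user_id");
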
Please correct me if I'm wrong.

@tsteur commented on March 9th 2016 Owner

I think that archives are meant to hold data aggregated by periods and are optimized for short period reports.

They are also optimized for holding data for long periods, such as month, year, etc.

What we have here is different - we add an additional entity userId, and this entity should be extracted from visit logs. The Users report is not actually a report by period but just a list of ALL users and some aggregated information about them.

That's not really how Piwik works. All reports in Piwik are by period. This way users can decide whether they want to see all users (eg by selecting a year or a long date range) or only recent users (by selecting a short date range, day period, week period). IMHO it's quite important to give users that freedom, as not everyone is interested in seeing all users; many care more about recently active users.

And we'd also like to search a certain user between all users. Doing so using existing archiving means we need to load all report records from multiple DB tables, create many objects in memory with PHP on every Users page request.

With archiving the search comes out of the box, and we actually load only one record from the database (the archived report record). It might fetch multiple archives from the database if a date range is selected or if the aggregated report is very big, but it's still very fast. The query itself is very fast, and unserializing + searching the existing archives is quite fast as well. Plus we save some time by not having to aggregate any data.
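Untested, but roughly what that looks like in code (reusing the record name from the Archiver sketch above):

    <?php
    use Piwik\Archive;
    use Piwik\Plugins\UserId\Archiver;

    // Read the already archived record back - no log tables are touched here.
    $table = Archive::createDataTableFromArchive(
        Archiver::USERID_ARCHIVE_RECORD,
        $idSite = 1, $period = 'month', $date = '2016-03-01',
        $segment = false, $expanded = false, $flat = false);

    // Searching is then just a filter over the unserialized rows of that record;
    // this is also what the generic filter_pattern API parameter does.
    $table->filter('Pattern', array('label', 'some-user-id'));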

Sorting, pagination, search and even listing would become very resource greedy.

It's true it could be faster if the aggregated reports were stored differently, but it's actually not too greedy, as the select for an archive is very fast (instead of having to fetch many aggregated entries and aggregate them again). The problem is that we might unserialize (and do some other work on) all rows within the DataTable even though we may need only one, eg when filter_offset=10000, filter_limit=1. However, it should still be quite fast (just not as fast as it could be).

If you support systems with 1b+ visits this may become a problem.

Reading the archives with 1b+ visits is still very fast I would say. Problem is only the archiving process itself: meaning querying the log tables and aggregating the data for the new archive. Even with 1b+ visits it's actually not a big deal to archive the reports, mainly the Action URLs report is becoming slow with 1b+ visits but that's a different topic :)

There are a couple of problems with the indexer.

  • For example, it requires schema changes when adding new metrics etc. For people having 1b+ visits and many users (or, thinking of other reports, many URLs, many referrer websites, many countries) changing the DB schema would become a big problem.
  • You cannot get data for date ranges, specific days, weeks, etc. Users want to select the data by dates.
  • Re-indexing the whole data set could be very, very slow, as you would have to iterate over all log_visit entries. There could be hundreds of millions of entries.
  • Re-indexing only a specific part of the data would not be possible. For example, sometimes old log data is corrected afterwards, or data from another Piwik system is imported (see eg http://piwik.org/faq/how-to/faq_155/). It would force a re-index of all data.
  • Segmented reports for users cannot be pre-indexed, so it would not be possible at all to apply a segment to this report.

There are likely many more problems with it, as the archiver is a "beast" in terms of features and logic and stuff :) In general, yes, the archives could be a bit smarter and less resource intensive. On the other hand, reading the archives is always quite fast, no matter whether there are 1b visits or 100 visits. Generating the aggregated archives is a bit of a problem and could be faster, but solving this properly will take many months of work.

To have this report in Piwik we would kinda need to use the archiver here as it's just "how Piwik works". Users want to apply different dates and segments to their reports etc.

@alexkuk commented on March 10th 2016 Contributor

@tsteur Thanks for the detailed answer :+1: I'll rework it to use archiver and see how it works

@alexkuk commented on March 10th 2016 Contributor

@tsteur BTW, an example of why I'm concerned about high resource consumption by archive reads. I took the Cities report as an example because it's closest in numbers to the Users report on my local instance: I have around 300k users and there are 262587 cities. Total visits from 2013-06-06 till 2016-03-10 are around 1.7M. From the issue description I understood that searching for a certain user ID among all users may be a common usage pattern for the Users report. So I'm trying to load the lifetime Cities report and search through it:

  1. Loading the whole Cities report, showing the first 5 rows (loading time: 503s):
    index.php?date=2013-06-06,2016-03-10&viewDataTable=table&module=UserCountry&action=getCity&period=range&idSite=2&filter_limit=5&filter_sort_column=nb_visits&filter_sort_order=desc&filter_column=label
  2. Searching for "Germany" (Loading time: 261s):
    index.php?date=2013-06-06,2016-03-10&viewDataTable=table&module=UserCountry&action=getCity&period=range&idSite=2&filter_limit=5&filter_sort_column=nb_visits&filter_sort_order=desc&filter_column=label&filter_pattern=Germany

Paginating it is the same pain.

Also, it started working only after I changed the memory limit from 1GB to 2GB. Archiving is configured to run via cron.

With such numbers the report is unusable. I'm afraid we'll see the same performance with user IDs.

@tsteur commented on March 10th 2016 Owner

When requesting 2013-06-06,2016-03-10 we actually do not query the log tables. Instead it will load one archive for 2014, one for 2015, one for 2016, plus approx 10 more archives for July 2013, August 2013, ... and some week reports etc. This can be slow but shouldn't be "too slow" and should still generate a report in say 300ms, although it depends on the amount of data in the reports and on the server. It could be faster, it could be slower. Getting those approx 14 archives out of the database is very fast; the slowest part is likely unserializing all the rows of all 14 archives, but that depends on how many rows there are in the data tables.
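Roughly (only an illustration, not code from this PR), the range gets split into sub-periods and one archive is read per sub-period:

    <?php
    use Piwik\Period\Factory;

    // Illustration only: list the sub-periods a custom range is split into;
    // one archived report record is read per sub-period.
    $range = Factory::build('range', '2013-06-06,2016-03-10');
    foreach ($range->getSubperiods() as $period) {
        echo $period->getLabel() . ': ' . $period->getPrettyString() . "\n";
    }
    // prints a mix of day/week/month/year sub-periods, e.g. "year: 2014", "year: 2015", ...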

A loading time of 500s (not ms) suggests it had to create lots of archives for some reason. If the archives already existed (eg created via the regular cronjob ./console core:archive) this should be really fast.

@tsteur commented on March 10th 2016 Owner

Do you maybe have the settings

[Debug]
always_archive_data_period = 1
always_archive_data_day = 1
always_archive_data_range = 1
always_archive_data_year = 1
always_archive_data_week = 1
aalways_archive_data_month = 1

enabled?

@alexkuk commented on March 25th 2016 Contributor

@tsteur I reworked it to use core archiver.

Concerning the performance questions: the [Debug] options you mentioned are not active on my local instance.

On my test data set I have 298567 user IDs. If I load a lifetime report, it takes around 60s (PHP 5.5) and 20s (PHP 7.0). Profiling shows that the main bottlenecks are operations on DataTable, mainly deserializing, sorting etc. Disabling just the "Sort" filter makes the Users report twice as fast on PHP 5 and 5x faster on PHP 7. You may take a look at the xdebug profiles:
Users: [xdebug profile screenshot]

Cities: [xdebug profile screenshot]

On short periods or small data sets it works just like other reports. The problem is that Piwik archiver-based reports are slow and consume a lot of resources when they contain this many rows. Cities is one example; user IDs is another. I think having > 100000 users is not something extraordinary for an average business.

@tsteur commented on March 28th 2016 Owner

It's mainly deserializing, sorting etc.

These are our known bottlenecks so far, and as mentioned in previous posts we cannot do too much about them. The problem is that we unserialize etc. all rows, even the ones that are not needed. The Users report might have 300k rows, and Piwik does not perform very well with that many rows.

BTW: In Piwik 3 sorting will be faster as we're using native sort methods there instead of custom sort.

A workaround could be to disable sort when there are eg > 300k users. This will let most Piwik users still use sort but big sites would not be able to use it until it performs faster. Code would be like this:

        if ($dataTable->getRowsCount() > 300000) {
            $dataTable->disableFilter('Sort');
        }

Haven't tested it though. If it's 20 seconds with PHP 7 I think it would still be acceptable for now to leave Sort always enabled. Have you profiled this on a fast server or more on an average or local development server? Also be aware that XHProf adds quite a bit of overhead when a method is called many times, contrary to xdebug, see eg https://tideways.io/profiler/blog/profiling-overhead-and-php-7 . It would be good if you could test the actual speed of the API call without Xdebug and without xhprof enabled. I'd expect it to be at least twice as fast just by disabling xhprof; if xdebug was enabled as well, disabling it should make it even much faster.

@alexkuk commented on March 29th 2016 Contributor

@tsteur 19-20s is the time without xdebug or xhprof; I enable the profiler only when I need to debug. It's my local dev machine: MacBook 2014, 2GHz Core i7, 16GB DDR3, SSD. I haven't tried it on a dedicated server; I think it should be faster there. It would be nice if you could try it on a more or less real environment.

@tsteur commented on March 29th 2016 Owner

I can't try it on a more or less real environment soon, but my suggestion would be to just merge it (I still need to review the PR) and see how it performs. Most of the time users rather view a small date range of a single day or week, and most instances will have far fewer than 300k users. Nonetheless it should also be fast when having eg 300k users for a single day, but we'd improve that afterwards. I'll try to review today or in the next days and do some tests.

@tsteur commented on March 30th 2016 Owner

So far I've had a look and the PR looks good apart from a few things. Really good work so far :+1:

@alexkuk commented on March 31st 2016 Contributor

Implemented the small changes you requested.

@alexkuk commented on April 1st 2016 Contributor

@tsteur added the data-row-metadata attribute. It allowed me to remove a lot of code, so I like it more.

@tsteur commented on April 3rd 2016 Owner

Awesome work :+1: We will merge for Piwik 2.16.2 as we just released the RC for 2.16.1 last Friday. Looking forward to having this feature in Piwik.

@tsteur commented on April 19th 2016 Owner

Thanks for the PR @alexkuk I will merge now 👍 👍

This Pull Request was closed on April 19th 2016