@mattab opened this issue on May 13th 2009

Proposal spec for the new Actions schema

There are various tickets related to improvements in the "Actions" reporting in Piwik (pages, outlinks, downloads). Some of these require database schema updates to accodomate the new funtionnality. Here is a proposal spec. This spec was written with the following tickets in mind: #306, #530, #556, #707 ## Requirements

We want to report: - top pages by URL, classified by category (current behavior) - top pages by flattened URL (see #707) - report entry / exit / time per page by URL (see #5726); we already have this data. - top pages by page title. In the top pages by title, titles don't link to the URLs by default, because it is a lot of computation to keep the relationship page title<> (URL1, URL2, ..) as a same title can have many URLs. Also, we don't report best entry page, top exit page, or time per page for each title. These analysis are only done for the URLs, to minimize overhead during archiving. (see #530) However we now need two different information about a pageview: URL, and a Name.

We do not consider an event tracking feature (see #472) as it is out of scope for Piwik 1.0, and would anyway require a different more complicated and modular data structure.

This change is also welcome at this time considering the new Javascript API that has the following related methods:

piwik.setDocumentTitle( customTitle )
piwik.trackLink( url, type, customVars)

Users can customize the document title for a given pageview. This would only customize the "Top pages by title report" and would not affect the "top pages URL" and "top pages flattened URLs" reports. ## Implications on Tracking (piwik.php, core/Tracker/*)

There is a currently a performance issue when querying piwik_log_action that we hope to resolve with this schema update. Indeed, when querying this table for a given URL, eg./test/test2/test3/ if most entries in this table have a similar structure, because the index on name is limited to a few characters (15) the index lookup badly fails and we end up doing a full table scan at each page view. To fix this issue we are going to add a new column in this table, "hash", that will be the hash of the "name" column, using the fast CRC32 algorithm. This will speed up the SELECT idaction in piwik.php.

We also now need to save two information per page view: the name of the page, and the URL. For clicks and downloads, we will by default only save the URL, but plugins could override this and also save a name, to allow advanced reporting on downloads and outlinks. We want to keep things simple and will save all names and all URLs in this same table, piwik_log_action.

Schema updates for piwik_log_action

Current schema:

  name VARCHAR(255) NOT NULL,
  PRIMARY KEY(idaction),
  INDEX index_type_name (type, name(15))

Proposed schema:

  name VARCHAR(255) NOT NULL,
  PRIMARY KEY(idaction),
  INDEX index_type_name (type, hash)

Note: this is explained in the book "high performance mysql 2nd edition" page 104;

Basically instead of

SELECT WHERE name = '/path1/path2/path3' 

you do

SELECT WHERE hash = CRC32('/path1/path2/path3') AND name = '/path1/path2/path3'

you need theAND name = '/path1/path2/path3' in case two urls have the same hash (which happens with 1% probability after 100k results, cf birthday paradox).

CRC32 doc is on http://dev.mysql.com/doc/refman/5.1/en/mathematical-functions.html#function_crc32.

The hash field should be set when INSERTing new page names/ urls in this table, using the mysql CRC32 function; we don't want to use the php crc at any time.

Currently Actions->getIdAction returns the idaction for the given name. Instead, the function would now set actionIdName and actionIdUrl in the object. For downloads and outlinks, actionIdName would be empty;

In the case of pages, we would like to select these two IDs in only one SQL select for efficiency. For example, it would look like:

SELECT idaction 
FROM piwik_log_action 

    (hash = CRC32('/path1/path2/path3') AND name = '/path1/path2/path3' AND type = $this->getActionType())
    (hash = CRC32('Welcome to URL') AND name = 'Welcome to URL' AND type = Piwik_Tracker_Action_Interface::TYPE_ACTION_NAME)

For downloads/outlinks there wouldn't be the part after the OR, as by default we only track URLs for downloads/outlinks;


SELECT idaction 
FROM piwik_log_action 
WHERE (hash = CRC32('http://piwik.org/latest.zip') AND name = 'http://piwik.org/latest.zip' AND type = $this->getActionType()

Schema update for piwik_log_link_visit_action

  `idlink_va` int(11) NOT NULL auto_increment,
  `idvisit` int(10) unsigned NOT NULL,
  `idaction` ---> renamed in idaction_url
  NEW idaction_name 
  `idaction_ref` --> renamed in idaction_url_ref
  `time_spent_ref_action` int(10) unsigned NOT NULL,
  PRIMARY KEY  (`idlink_va`),
  KEY `visit` (`idvisit`)

For pages, we would record both idaction_name and idaction_url fetched from piwik_log_action as specified above. For download and outlinks, we would only set idaction_url, and idaction_name would be null. Plugins could set this value if necessary

Action->record() will have to be updated accordingly to now record these two idaction_*;

Note: Action object already has a getActionName and getActionUrl method which matches nicely; we would maybe add a getActionNameId and getActionUrlId that would be set by the current getActionId (that could be rename loadActionNameAndUrl)?

Schema update for piwik_log_visit

visit_exit_idaction is renamed in visit_exit_idaction_url visit_entry_idaction is renamed in visit_entry_idaction_url

Several lines have to be updated in the code to reflect this rename but there are no logic changes.

Schema update for piwik_log_conversion

idaction is renamed in idaction_url

GoalManager->recordGoals() query must be updated accordingly.

As shown, all existing idaction reference the entry in piwik_log_action for the URL of the concerned pageview; the URL is now officially the main information about a pageview, and the "name" (eg. the html document title) is one more piece of information about each page view. However conversions, and visits entry/exit pages, only care about the pageview URL, mostly for simplicity.

Note: I would expect to have more small modifications not explained here, in the php code of Tracker/*, especially Tracker/Action.php ## Implications on Archiving (plugins/Actions/Actions.php)

Changes in the schema should only affect archiving for plugins/Actions/Actions.php.

To do #530, the first query at line 124 would have to ran twice, once joining on idaction_url for processing reports by page title (existing query), and one more by joining on idaction_name to process the new report "top pages by title".

Other queries would have to be slightly updated to reflect the field's name change. ## User Interface

To do #530 we would add a new submenu: Actions> Page titles to show best pages by title, similar to existing Actions>Pages. There would be a new API method that would return the top pages by title. ## Migration

There are several required schema updates which may take minutes to execute on the piwik_log_* tables; we might want to update the message before the update to make clear that this will take a while and user should be patient.

The schema updates must be done in the CREATE TABLE statements in core/Piwik.php, in all queries querying these tables (mentionned above) and the update script can be written in core/Updates/X.php ## Other - The data generator in core/Tracker/Generator and misc/generateVisits.php will have to be updated to reflect the new DB URL structure and ensure that the generator generates all the title + URL data as expected. - Also this new schema change will invalidate the TrackerSecondaryDb which will have to be slightly updated to reflect the new structure. - Some of this code is covered by unit tests which will have to be updated (tests/core/Tracker/Action) with new logic ## Conclusion

This is not a small change, but we are blocked on several fronts because of the current DB schema; implementing this change will make a lot of other tickets possible and add new features in Piwik in the near future. Also, it makes possible for plugins to custom name outlinks and downloads.

Any questions, feedback, ideas, please let me know: this is a proposal!

@zawadzinski commented on October 16th 2009

Attachment: 708.patch

@robocoder commented on October 16th 2009

Attachment: revised update script 0.5.php

@anonymous-piwik-user commented on June 17th 2009

Seems a good proposal to me! I have a note on saving the page Title.

I've been playing around with Piwik to make it log both Title and url. In PIWIK 0.4 it is possible to log the Title using the following function in the trackingcode:


But in the (utf-8) database this results in problems with both utf-8 encoding and html_entity encoding. I had to put in this code to resolve this problem:

if($actionType == 1){
   $actionName = utf8_decode($actionName);   
   $actionName = html_entity_decode($actionName);

I think this is something that needs to be addressed when implementing this proposal.

@mattab commented on June 19th 2009

rc_amsterdam, can you please submit steps to reproduce the error? what exactly happens if you submit utf7 strings?

@anonymous-piwik-user commented on June 23rd 2009

Replying to matt:

rc_amsterdam, can you please submit steps to reproduce the error? what exactly happens if you submit utf7 strings?

I don't know if the problem relates to utf7 strings.

As far as I remember these steps: 1. add


in trackingcode. 1. Look in de database (I use phpMyAdmin) in the table "log_action" and I see a lot of messed up data in the "name" column. But the data only gets messed up for pages with special characters. Characters like "", "" etc. But also special characters in HTML like "'", "&" etc.

does this help?

@mattab commented on June 23rd 2009

do you then have wrong data in the "Actions" report as well?

@robocoder commented on June 24th 2009

In this forum post, r.guggeis suggests changing the index on the log_action table:

ALTER TABLE `log_action` DROP INDEX `index_type_name`
ALTER TABLE `log_action` ADD INDEX `name` (`name`);
@mattab commented on June 24th 2009

Replying to vipsoft:

In this forum post, r.guggeis suggests changing the index on the log_action table:

ALTER TABLE `log_action` DROP INDEX `index_type_name`
ALTER TABLE `log_action` ADD INDEX `name` (`name`);

yes this would make sense for the current schema, but the proposal here is an even better solution (we index the hash of the name as opposed to indexing the name which has an unknown length - this could be very costly for websites having very long names)

@zawadzinski commented on August 11th 2009

new version of patch (compatible with current release of Piwik 0.4.3) added.

@zawadzinski commented on October 16th 2009

Current patch in the attachment. One important note:

Page Titles are tracked only when user sets in Javascript action_name. If action_name is not given, given page view is not tracked in Page Titles view. As long as user does not set action_name Piwik behaves the same way as before (view Actions->Pages display action URL's).

Backward incompability occurs only when user used action_name's because they will appear in Actions->Pages, and after the update they will go into Actions->Page titles (data in the database won't be updated because we do not have information about which Action was URL and which was not). This is important to note that users have to be aware of before updating to 0.5.

@mattab commented on October 23rd 2009

see commit in [1530]

TODO before closing the ticket: - Anthon: apply small modifs to the JS tracker (setCustomUrl() and action_name defaulting to document.title) - Matt: improve/fix unit tests around the expected behavior of Action naming - Maciej/Matt: load testing to ensure this is working as fast as we expect under heavy load, with a huge log_action table

@robocoder commented on October 23rd 2009

In [1531] - refs #708 - add setCustomUrl(url) to override document.location.href (default) - set action_name to document.title (default) - add alias for window.location.hostname (for yuicompressor) - rename private method, getWebBug(), to logPageView() for consistency - jslint 2009-10-04 ("The Good Parts")

@mattab commented on November 2nd 2009

marking as fixed for now - thanks all for your work

@robocoder commented on November 2nd 2009

what about the two unit tests failing in tests/core/Tracker/Action.test.php (empty request) and plugins/Actions/tests/Actions.test.php ?

@robocoder commented on February 1st 2010

[1550], refs #708

This issue was closed on February 1st 2010
Powered by GitHub Issue Mirror