@hpvd opened this Issue on January 30th 2016

Sometimes more than one data source is available for description/documentation of the same activity.
In most cases the data sources have different strengths and weaknesses.
But by combining them, the picture of reality is always better
than with only one source.

To give an example:
there is a place with two different cameras looking at it from two different directions.
One of the cameras is an HD color camera, mounted at a height of 10m, and the other is a black&white model with lower resolution, mounted at a height of 2m, but it can also take pictures in the dark.

Neither of them on its own can document everything happening at the place all day long in perfect quality.
But together they don't miss anything.

The same situation exists when trying to track activities using Piwik:

In the future, when Piwik becomes a "universal activity tracker" with v3,
but also today when tracking "only" websites.

With Piwik's JavaScript tracking you can track many, many details.
But there are things that may block Piwik's JS: browser settings, browser add-ons etc.

In this case these visits are not tracked. And what is even worse from a statistics point of view:
one does not only not know what these visitors have done, one also does not know how many visits were missed.
With this, some numbers in the statistics, like the total number of visitors, are badly broken.
This may affect other things, e.g. conversion rates, not only in ecommerce (number of visitors / reached goals), impression counting for advertisements, etc.

With Piwik's analyses of server log files, all visitors are tracked -always.
But not with the level of detail that JS tracking can provide.

=> So why not making it possible to use data from different source and combine
the best of both worlds to build a perfect image of reality?

When starting structural work on the core of Piwik for v3.0, it is a perfect point to think of these possibilities.

@tsteur commented on January 31st 2016 Owner

It's a great idea and would be an awesome feature indeed. However, technically it is probably quite difficult. I presume we won't find time to work on this soon as we can maybe provide more value by spending the time on some other features. It would be really cool if someone could put some thought into how it could work technically. Eg how can we 100% correctly match a user tracked with JS with some logs from the webserver. Not sure if it's possible, especially when requests are coming from the same IP / company.

@hpvd commented on February 1st 2016

thanks for the positive answer!

Well of course great things are not always easy and often cannot be fully achieved in a first attempt ;-)

But there are things that could be done relatively easily:

e.g.
ensuring that all sources always use the exact same time base is a good start to make sync and combination possible
or
accepting and allowing some kind of actions that are not assigned to a user (e.g. page visits from the same IP but different visitors) is another one.
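
A minimal sketch of the "same time base" point, assuming the server logs carry Apache-style timestamps with a UTC offset. The helper name `logTimeToUnix` is made up for illustration and is not part of Piwik. It shows that the same instant written with two different offsets ends up as one Unix timestamp, which is what a later sync would need:

```php
<?php
// Hypothetical helper (not a Piwik API): parse an Apache-style access log
// timestamp such as "10/Oct/2016:13:55:36 +0200" into a Unix timestamp.
function logTimeToUnix(string $logTime): int
{
    // 'O' parses a numeric UTC offset like +0200.
    $dt = DateTimeImmutable::createFromFormat('d/M/Y:H:i:s O', $logTime);
    if ($dt === false) {
        throw new InvalidArgumentException("Unparseable log time: $logTime");
    }
    return $dt->getTimestamp();
}

// The same instant, logged by two servers in different time zones.
$a = logTimeToUnix('10/Oct/2016:13:55:36 +0200');
$b = logTimeToUnix('10/Oct/2016:11:55:36 +0000');
var_dump($a === $b);
```

Once every source is reduced to Unix timestamps like this, time differences between a JS hit and a log line can be compared directly.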

=> With this one can already optimize statistics on all fields where url + counting is enough:

  • number of page visits per period (pp)
  • conversion rates per pageview pp (e.g. how often is a product page visited and how often is the product put into the basket?)
  • bounce rates pp
  • number of downloads pp
  • number of searches pp
  • search keywords pp
  • number of incoming campaigns
  • ...

probably there are some more
- especially if one assumes that a visitor in most cases (99%) would not change their "I'm trackable with JS / I'm not trackable with JS" status during a visit to the website...

@hpvd commented on February 1st 2016

when doing this kind of statistics quality check "manually"
one would e.g.
set up two websites within Piwik for the same website and let

  • one use data from js tracking and
  • import logfiles from the server for the other one.

After that one can look at the data above for a given period on both websites
and compare them to

  • get the answer to how reliable the data from JS tracking is (how big is the share of pageviews one misses?)
    "quality assurance"
  • determine corrected values for the metrics listed in the post above, using "data of type counting" from the log import
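
The manual comparison could be sketched roughly like this. The function name `missedShare`, the input arrays and the example numbers are all invented for illustration; in practice the counts would come from the two Piwik sites for the same period, and the log-import counts would need bots filtered out first:

```php
<?php
// Hypothetical helper: for each URL, compute the share of pageviews that
// appear in the log-import site but are missing from the JS-tracked site.
function missedShare(array $jsCounts, array $logCounts): array
{
    $result = [];
    foreach ($logCounts as $url => $logCount) {
        $jsCount = $jsCounts[$url] ?? 0; // URL never seen by JS tracking
        $result[$url] = $logCount > 0
            ? max(0, $logCount - $jsCount) / $logCount
            : 0.0;
    }
    return $result;
}

// Made-up example counts for one period.
$js  = ['/product' => 80, '/basket' => 18];
$log = ['/product' => 100, '/basket' => 20, '/download.pdf' => 5];
print_r(missedShare($js, $log));
```

A high share for a URL would point to hits that JS tracking cannot see at all (such as direct file downloads) or to visitors blocking the tracker.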

@tsteur commented on February 1st 2016 Owner

"But there are things that could be done relatively easily:"

Good point and very true. Interesting approach of using it in 2 sites and comparing. Didn't even think of this initially. It could kinda work like tracking into 2 sites separately, then we check visits / actions against each other and merge them eg into a third site. It's still not super trivial but a simple proof of concept could be maybe made. There are still challenges eg when IP is anonymized it will be probably impossible to know if an individual was already tracked or not. This applies especially to German users.

A nice thing is the new kind of reports it would allow. For example we could have a site tracked with JavaScript, but still have bandwidth reports that are usually only available with log importer. We would maybe know how many resources were loaded etc. Still, merging this data won't be easy (eg when dealing with dates/times to find matching user etc there are always problems :) )

@hpvd commented on February 1st 2016

"merge them eg into a third site"

perfect idea!
...good discussions bring one further than one can go alone.

=> regarding the other points you are mentioning: looks like you got hooked on this idea :-)

@hpvd commented on February 2nd 2016

For doing a combination like that, it would help very much to keep as much raw tracking data as possible
and process and filter it later if needed (hide bots, spam, deleted visits)

hmmm the more I think about it, keeping raw data is not only helpful but essential
to have a chance to do combinations in an efficient way
(and many other things)
See +1 keeping raw data https://github.com/piwik/piwik/issues/8955#issuecomment-178479720

(and storage is becoming cheaper and faster every day, but visitor count (data production) on websites tracked with Piwik is not growing at the same speed)

@gaumondp commented on February 2nd 2016

Another use case for "more than just JS tracking" : External File download.

If someone links to a file on their website, using Piwik's JS tracking alone will not be enough, since those downloads will not trigger Piwik at all.

I have exactly that request right now: "more precise" (external) download numbers, which is only possible through Apache log files...

@tsteur commented on February 2nd 2016 Owner

that's a pretty good use case!

@hpvd commented on February 4th 2016

to make this usable (and in general)
Log analytics should be easier to use / accessible to more users
opened a new ticket for this: #9711

@hpvd commented on February 4th 2016

having the possibility to easily compare JS tracking results with log import tracking results
would make it easier to notice and identify problems and implausible values in one of them.
So the quality of the resulting data would rise further.

@masteranalyze commented on March 29th 2016

:+1: for the idea, this is exactly what I had in mind: "Interesting approach of using it in 2 sites and comparing. Didn't even think of this initially. It could kinda work like tracking into 2 sites separately, then we check visits / actions against each other and merge them eg into a third site. It's still not super trivial but a simple proof of concept could be maybe made."

2 sites, one tracked with JavaScript, one with server-side tracking, and a 3rd website matching the data.

On the server side, to get a real picture: are we currently able to filter GOOD BOTS + BAD BOTS?

If we can filter :+1: GOOD BOTS :+1: BAD BOTS :+1: Real Humans, we can practically get a real picture.

Most of the good bots can of course be easily identified, because they follow good practice, like having the word "bot" in their name: googlebot, bingbot, adsensemediabot, etc.
Or some use the word "crawler" or "robot".
The problem, I think, is identifying bad bots... maybe somebody has an idea how to filter today's bad bots, which use neither "bot" nor "crawler" nor "robot", etc.
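
As a rough sketch of the easy half of this (self-declared good bots only), assuming the user agent string from the server log is available. The function name `looksLikeDeclaredBot` is invented here, and the check does nothing against bad bots that pretend to be a normal browser:

```php
<?php
// Hypothetical helper: flag user agents that announce themselves as bots.
// "robot" is already covered by "bot", so two keywords are enough.
function looksLikeDeclaredBot(string $userAgent): bool
{
    return preg_match('/bot|crawler/i', $userAgent) === 1;
}

$agents = [
    'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)',
    'Mozilla/5.0 (Windows NT 10.0; rv:45.0) Gecko/20100101 Firefox/45.0',
];
foreach ($agents as $ua) {
    echo (looksLikeDeclaredBot($ua) ? 'bot   ' : 'other ') . $ua . PHP_EOL;
}
```

Anything that falls into "other" is still a mix of real humans and bad bots, so this alone does not give the real picture.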

On Joomla, for example, there is the Eorisis Piwik plugin, and there is another one, if I remember correctly from yoat or something like that, which is only for server-side tracking.

Eorisis Piwik can track on Joomla with: JavaScript, JavaScript + image, server side.
The other plugin can track only server side.

I actually tried to run both on the same website, Eorisis with JavaScript and the other with server-side tracking. The problem is that if you enable both plugins, Joomla crashes, so it's not working: they conflict, so you can't compare data.

Anyhow this should be done as @hpvd noted here : https://github.com/piwik/piwik/issues/9711

And things like the "great details", such as screen resolution and plugins used, can be solved if we implement a misc tracker like AWStats does. I can go into detail on this, as it's documented and can be done for Piwik as well.

https://github.com/piwik/piwik/issues/9963#issuecomment-202895669

Like @hpvd said: "With Piwik's analyses of server log files, all visitors are tracked -always." The server logs are the only thing that you, as a website owner, can be certain to have control over.

Maybe we can set this up as a milestone for Piwik 3.

@masteranalyze commented on March 29th 2016

@tsteur about : "There are still challenges eg when IP is anonymized it will be probably impossible to know if an individual was already tracked or not. This applies especially to German users."

Can you detail this? Maybe I can help. Give a more precise example of what you mean, and which IPs you are talking about.

@tsteur about : " A nice thing is the new kind of reports it would allow. For example we could have a site tracked with JavaScript, but still have bandwidth reports that are usually only available with log importer. We would maybe know how many resources were loaded etc. Still, merging this data won't be easy (eg when dealing with dates/times to find matching user etc there are always problems :) ) "

Why not just have 2 websites so we can compare, if anyone wishes that, and maybe implement a misc tracker, as AWStats does, to get details about users such as their resolution, plugins used and so on?

That way, with server logs, the nice data tracked with JavaScript won't be missed, and users with JavaScript enabled can be tracked directly into just 1 website.

And the real picture of the data can only be achieved if we can filter: REAL HUMANS, GOOD BOTS + BAD BOTS (I think filtering the bad bots is harder), and if we implement what @hpvd said in this topic: #9711, Piwik will be the only analytics tool with real data.
If we can filter that, people will be able to use either JavaScript, or server-side tracking, or both for comparison on the same website.

@nicolasbadia commented on September 21st 2017

Being able to combine server logs and JavaScript tracking logs is also one of the first things I thought about when I saw the log analytics features.

I don't really know how Piwik works internally, but what seems feasible and really reliable to me would be to use an iterative process to merge JavaScript tracking logs into server logs.
Server logs would be our reference as we are 100% sure they are correct. Then we would try to find a matching JS log for each of them. For this, we could do several loops which become less and less restrictive to merge the data. Here is an example of the conditions we could use:

  • (JS log time - Server log time) < 1s && URL && IP
  • (JS log time - Server log time) < 2s && URL && IP
    ...
  • (JS log time - Server log time) < 1s && URL
  • (JS log time - Server log time) < 2s && URL
    ...

If a JS log ends up without a matching server log, we leave it out of the merge and add it to a no_matching_server_log.log file (which we might use to improve our process).

I believe this would avoid the need for 2 sites, which I do not find really practical from a user perspective.

Here is a basic PHP implementation of what I am thinking of:

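// Assumes each log entry is an array with 'time' (Unix timestamp), 'url' and 'ip' keys.
foreach ($serverLogs as &$sl) {
  $match = null; // null instead of false, so that array key 0 counts as a match

  // Pass 1 requires the IP to match, pass 2 falls back to URL + time only.
  foreach ([true, false] as $requireIp) {
    // Widen the time window step by step until a match is found.
    for ($sec = .5; $match === null && $sec < 60; $sec *= 1.5) {
      foreach ($jsLogs as $jk => $jl) {
        if (abs($jl['time'] - $sl['time']) < $sec
            && $jl['url'] === $sl['url']
            && (!$requireIp || $jl['ip'] === $sl['ip'])) {
          $match = $jk;
          break;
        }
      }
    }
  }

  if ($match !== null) {
    // Attach the JS log to the server log and remove it from the pool.
    $sl['jsLog'] = $jsLogs[$match];
    unset($jsLogs[$match]);
  }
}
unset($sl); // break the reference left over from the foreach

// Whatever remains in $jsLogs had no matching server log.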

Any thoughts on this?

@mattab commented on September 21st 2017 Owner

Hi @nicolasbadia
Yes that's the general idea (didn't look at the pseudo code). We'd need to make it really efficient and directly implement this feature, not in the log importer script, but in the Piwik Tracking API somehow
