@anonymous-piwik-user opened this Issue on November 27th 2013

We have lots of log lines with invalid requests, for example:

www.example.com 70.117.169.113 - - [26/Nov/2013:01:41:01 -0500] "\x80w\x01\x03\x01" 400 226 "-" "-"

(these are drive-by exploit attempts).

It looks like the script wants both the request method and the HTTP version, since if I modify the line above it works:

www.example.com 70.117.169.113 - - [26/Nov/2013:01:41:01 -0500] "GET \x80w\x01\x03\x01 HTTP/1.1" 400 226 "-" "-"

We can't stop these requests, so I guess the parser should at least parse any old garbage that shows up in the path?

@mattab commented on November 28th 2013 Owner

Thanks for the report. We'll try to reproduce and add a test case showing the failure and that it's fixed.

@anonymous-piwik-user commented on November 28th 2013

Thanks. It looks like the problem is that the path regex expected two spaces that weren't present: one between the method (e.g. "GET") and the path, and one between the path and the protocol (e.g. "HTTP/1.1"). I came up with a patch that works locally and made a pull request:

https://github.com/piwik/piwik/pull/159

I confess I couldn't get the test suite to run, so it might need some work still.
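To illustrate the idea (this is not the actual patch from the pull request), here is a minimal sketch of a strict request pattern versus a lenient one. The names PREFIX, SUFFIX, STRICT and LENIENT and the regexes themselves are assumptions made up for this example; the real patterns in import_logs.py may look quite different.

```python
import re

# Hypothetical, simplified patterns for a vhost-prefixed combined log line.
PREFIX = r'(?P<host>\S+) (?P<ip>\S+) \S+ \S+ \[(?P<date>[^\]]+)\] '
SUFFIX = r' (?P<status>\d+) (?P<length>\S+) "(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)"'

# Strict: requires "METHOD path PROTOCOL", i.e. two spaces inside the quotes.
STRICT = re.compile(
    PREFIX + r'"(?P<method>[^" ]+) (?P<path>[^"]*?) (?P<protocol>[^" ]+)"' + SUFFIX
)

# Lenient: method and protocol are optional, so any garbage between the
# quotes is still captured as the path.
LENIENT = re.compile(
    PREFIX
    + r'"(?:(?P<method>[A-Z]+) )?(?P<path>[^"]*?)(?: (?P<protocol>HTTP/[^" ]+))?"'
    + SUFFIX
)

# Raw strings: the log contains literal backslash sequences, not raw bytes.
bad = r'www.example.com 70.117.169.113 - - [26/Nov/2013:01:41:01 -0500] "\x80w\x01\x03\x01" 400 226 "-" "-"'
good = r'www.example.com 70.117.169.113 - - [26/Nov/2013:01:41:01 -0500] "GET \x80w\x01\x03\x01 HTTP/1.1" 400 226 "-" "-"'

for name, regex in (("strict", STRICT), ("lenient", LENIENT)):
    for label, line in (("bad", bad), ("good", good)):
        print(name, label, "matches" if regex.match(line) else "no match")
```

With the lenient pattern the garbage request still yields a status code and a path, and the method and protocol groups are simply empty, so the importer could go on to treat the hit as invalid instead of failing to recognize the format.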

@mattab commented on December 8th 2013 Owner

(these are drive-by exploit attempts).

Piwik marks them as "invalid", and I think they really are invalid requests. Marking as wontfix. If you think it really should be fixed, let us know why (since they appear to be invalid requests).

@anonymous-piwik-user commented on December 9th 2013

Replying to matt:

(these are drive-by exploit attempts).

Piwik puts them as "invalid"

It looks like I left out something important -- these are often the first requests in the logs!

$ git clone -q https://github.com/piwik/piwik.git
$ cat bad.log 
www.example.com 70.117.169.113 - - [26/Nov/2013:01:41:01 -0500] "\x80w\x01\x03\x01" 400 226 "-" "-"
$ cp $MYCFG piwik/config/config.ini.php
$ python2.7 ./piwik/misc/log-analytics/import_logs.py --url $MYURL bad.log 
0 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
Parsing log bad.log...
Fatal error: cannot determine the log format using the first line of the log file. Try removing it or specifying the format with the --log-format-name command line argument.

You could search through the entire file looking for a line that matches the regex, but that can waste tons of time on log files with an unknown format. The fix in the pull request instead allows the regex to match even when the request is garbage.

I am planning to add a test case to the pull request, but will be super busy for another 1.5 weeks.

@mattab commented on December 9th 2013 Owner

Sorry I missed the "Crashed" keyword!

@diosmosis commented on December 9th 2013 Member

Fixed in 3572ef7f261d36171f62b91d8de8b91dc300aa25.

@anonymous-piwik-user commented on December 9th 2013

Replying to capedfuzz:

Fixed in 3572ef7f261d36171f62b91d8de8b91dc300aa25.

Thanks for the quick fix! I just ran it on last night's logs and it made it all the way through.

I would beware of a potential Heisenbug, though: we have a ton of sites that basically no real human ever visits. It's totally possible that one day we could have a log containing nothing but invalid requests (e.g. the bad.log case above). In that case, the entire import will still crash.

For a workaround in a similar vein, maybe if you run through the entire file and find no valid log lines, it shouldn't error out? I suppose to be reliable, the number of log lines (1000) would need to be adjustable. I would personally let it check the whole file for a match.

Does Piwik record the invalid requests that it finds once it knows the log format? I ask because you should get the same results from one file containing,

invalid
invalid
invalid
invalid
valid

as you do if you split that into two files,

invalid
invalid
invalid
invalid
valid

Only one invalid line would be parsed with the files split, since no format would be found in the first file. But if the invalid lines are dropped anyway, it's a moot point.

Thanks again.

@mattab commented on December 9th 2013 Owner

In bb477d77f70ec4d04be8a470e9572cbcf7715cee: Refs #4352 add comments

@mattab commented on December 9th 2013 Owner

In d549d2cbbda6ab1e15ad60c95369cbd15adf245f: Refs #4352 increase to 100,000, it takes ~ 6 seconds on my box on a file with 100k lines, so still fast enough to fail. If a file has more than 100k wrong lines there's definitely something wrong with it.
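For readers following along, the bounded detection discussed above could be sketched roughly as follows. This is not the code from the commits: FORMATS, detect_format and max_lines are hypothetical names, and the two regexes are deliberately simplified stand-ins for the importer's real format definitions.

```python
import re

# Hypothetical, simplified candidate formats.
FORMATS = {
    "common": re.compile(r'\S+ \S+ \S+ \[[^\]]+\] "[^"]*" \d+ \S+$'),
    "vhost_combined": re.compile(
        r'\S+ \S+ \S+ \S+ \[[^\]]+\] "[^"]*" \d+ \S+ "[^"]*" "[^"]*"$'
    ),
}


def detect_format(lines, max_lines=100000):
    """Return the name of the first format that matches one of the first
    max_lines lines, or None if none of them match."""
    for index, line in enumerate(lines):
        if index >= max_lines:
            break  # give up early instead of scanning a huge unknown file
        stripped = line.rstrip("\n")
        for name, regex in FORMATS.items():
            if regex.match(stripped):
                return name
    return None


sample = [
    "not a log line\n",
    "also garbage\n",
    '127.0.0.1 - - [26/Nov/2013:01:41:01 -0500] "GET / HTTP/1.1" 200 512\n',
]
print(detect_format(sample))               # scans past the two bad lines
print(detect_format(sample, max_lines=2))  # limit reached before a match
```

Returning None rather than raising leaves the decision to the caller, so a file that contains nothing but unrecognizable lines could be skipped with a warning instead of aborting the whole import.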

@anonymous-piwik-user commented on December 9th 2013

Agreed, thanks a million guys.

This Issue was closed on December 9th 2013