@anonymous-piwik-user opened this Issue on January 3rd 2014

In my setup I use log import to load data into piwik. Everything seems to go well except for the fact that sometimes (and it doesn't seem to be repeatable) one of recorder threads hangs and the rest of the threads wait on futex.

For example - main process:

strace -p 24643

Process 24643 attached - interrupt to quit
select(0, NULL, NULL, NULL, {0, 819826}) = 0 (Timeout)
gettimeofday({1388734405, 12820}, NULL) = 0
select(0, NULL, NULL, NULL, {1, 0}^C <unfinished ...>
Process 24643 detached

Hung thread:

strace -p 24646

Process 24646 attached - interrupt to quit
recvmsg(6, ^C <unfinished ...>
Process 24646 detached

Rest of the threads (all look alike):

strace -p 24645

Process 24645 attached - interrupt to quit
futex(0x7f8028001540, FUTEX_WAIT_PRIVATE, 0, NULL^C <unfinished ...>
Process 24645 detached

I tried to add a thread dump capability to the import_logs.py and so I got:
One of waiting threads (all are identical):

Thread: Thread-13(139637655840512)

File: "/usr/lib64/python2.6/threading.py", line 504, in bootstrap
self.
bootstrap_inner()
File: "/usr/lib64/python2.6/threading.py", line 532, in bootstrap_inner
self.run()
File: "/usr/lib64/python2.6/threading.py", line 484, in run
self.__target(_self.
args, *_self.__kwargs)
File: "/SVN/scripts/piwik/import_logs.py", line 1167, in _run_bulk
hits = self.queue.get()
File: "/usr/lib64/python2.6/Queue.py", line 168, in get
self.not_empty.wait()
File: "/usr/lib64/python2.6/threading.py", line 239, in wait
waiter.acquire()

Hung thread (I assume, that's the only one different):

Thread: Thread-7(139638058493696)

File: "/usr/lib64/python2.6/threading.py", line 504, in bootstrap
self.
bootstrap_inner()
File: "/usr/lib64/python2.6/threading.py", line 532, in bootstrap_inner
self.run()
File: "/usr/lib64/python2.6/threading.py", line 484, in run
self.__target(_self.
args, *_self.__kwargs)
File: "/SVN/scripts/piwik/import_logs.py", line 1170, in _run_bulk
self._record_hits(hits)

Main process:

Thread: Thread-1(139638191130368)

File: "/usr/lib64/python2.6/threading.py", line 504, in bootstrap
self.
bootstrap_inner()
File: "/usr/lib64/python2.6/threading.py", line 532, in bootstrap_inner
self.run()
File: "/usr/lib64/python2.6/threading.py", line 484, in run
self.__target(_self.
args, *_self.__kwargs)
File: "/SVN/scripts/piwik/import_logs.py", line 820, in _monitor
time.sleep(config.options.show_progress_delay)

As far as I know it used to occur in 1.12 (although it used to be less frequent, I think) and occurs in 2.0.2 as well.

My installation works on CentOS 6.4

rpm -qa | grep python

python-ethtool-0.6-3.el6.x86_64
python-libs-2.6.6-37.el6_4.x86_64
python-setuptools-0.6.10-3.el6.noarch
python-devel-2.6.6-37.el6_4.x86_64
python-iniparse-0.3.1-2.1.el6.noarch
python-dateutil-1.4.1-6.el6.noarch
python-urlgrabber-3.9.1-8.el6.noarch
python-pycurl-7.19.0-8.el6.x86_64
rpm-python-4.8.0-32.el6.x86_64
python-2.6.6-37.el6_4.x86_64
python-pip-1.3.1-4.el6.noarch
newt-python-0.52.11-3.el6.x86_64
libxml2-python-2.7.6-12.el6_4.1.x86_64
libproxy-python-0.3.0-4.el6_3.x86_64

Keywords: log import

@anonymous-piwik-user commented on January 3rd 2014

One more thing - it seems that the hang occurs always after all the file has already been read (but not whole of it has been recorded) or is at the very close to the end.

@mattab commented on January 5th 2014 Owner

Thanks for the detailed report.. Maybe other users experience the issue as well. If you find the bug in the code, please comment here or pull request. Happy new year!

@anonymous-piwik-user commented on January 10th 2014

After few more occurrences it seems that I was wrong that it happened only at the end of the input file - it is not the case. It hangs as well in the middle of the log.
It seems also that lowering concurrency from 12 recorders too 6 lowers significantly probability of hanging. I have a daily intake of slightly over 4.5G into various files. I split the big ones into 50000-lines chunks, which gives me about 300-400 files to import.
With 12 recorders I had about 10 hangs per such process, maybe more. Now I get 1 or 2.
But it is still annoying since it stops the whole batch import process.

@anonymous-piwik-user commented on February 7th 2014

I think this is more related to the API rather than the import process (Or may be the import script itself). But when the import process hangs, restarting apache resolves the issue.

Also, whenever import script hangs the website also goes unavailable. So it seems more like an API + web issue rather than the import process issue itself.

@anonymous-piwik-user commented on February 7th 2014

Replying to shabeepk:

I think this is more related to the API rather than the import process (Or may be the import script itself). But when the import process hangs, restarting apache resolves the issue.

Also, whenever import script hangs the website also goes unavailable. So it seems more like an API + web issue rather than the import process issue itself.

This issue seems to occur when apache is not configured with mpm_prefork (which is recommended for php).

@mattab commented on February 8th 2014 Owner

Maybe check your server error logs for any relevant error?

@anonymous-piwik-user commented on March 3rd 2014

I can't confirm whether it stops website as well (lowered worker numbers in my scripts and now I just launch parsing few logs in parallel and since I have 3 workers I only had one hang so far, but the import time is significantly longer because it's determined by the longest file size).
I'll try to raise workers number again and see if it starts hanging.
But I'm pretty sure I'm using apache with MPM prefork.

$ curl http://piwik/server-info 2>/dev/null | grep MPM | sed -s 's/<[^>]\+>//g'
Timeouts: connection: 60 &nbsp;&nbsp; keep-alive: 15MPM Name: Prefork
MPM Information: Max Daemons: 256 Threaded: no Forked: yes
 -D APACHE_MPM_DIR="server/mpm/prefork"
@mattab commented on March 4th 2014 Owner

the MPM error message is a server config issue, not Piwik bug. All the best to find solution for this one!

@anonymous-piwik-user commented on March 5th 2014

Come on. This is not "MPM error". It's just an excerpt from server-info page showing which MPM I'm using. It's configured well with prefork MPM.
And It's working OK.
And, additionaly, after I raised worker limit again - my import job hang while website remains avaliable.
There is nothing in server error logs.
Restarting apache doesn't change anything.
It seems that one of the threads hangs on (as I shown earlier):

# strace -p 15712
Process 15712 attached - interrupt to quit
recvmsg(4,

All other threads are just waiting on futexes.

@mattab commented on March 6th 2014 Owner

What is happening AFAIK is that the request to your web server times out. It is not a Piwik issue but unless this is Piwik crashing in some ways. If you don't have anything in your error log, then likely it is server configuration issue.

what exact command do you use?

@anonymous-piwik-user commented on March 6th 2014

I'm not saying it's a piwik issue as the server part. I'm thinking of it more as an issue within the log importing script.

Notice that the process hangs on recvmsg(), not on select() or similar call. It means that if there is no more data to be sent by server, the receiving process (importing script) will hang forever. I suppose someone is trying to read from a socket without first checking whether there is any data available for reading. There is no timeout because the socket is still open.

The actual command I use is generated by a script:

MERGER=merge_files.py
IMPORTER=import_logs.py

eval find $SQUIDS -type f -name \*-access.log-$CURRENT_DATE -printf \'%f\\n\' |
    sort | uniq | xargs -P $CONCURRENCY_LIMIT -I {} bash -c "
      logfile={}
      LOGHOSTNAME=\${logfile%-access.log-*}
      eval $MERGER $SQUIDS/\$logfile |  
      $PYTHON_CMD $IMPORTER  \
        --token-auth=$AUTH_TOKEN \
        --add-sites-new-hosts \
        --recorders=$THREAD_LIMIT \
        --recorder-max-payload=1000 \
        --url=https://piwik  \
        --log-format-regex='(?P<ip>\S+) \- \- \[(?P<date>\S+)\s?\S+?\] \"\S+ http\:\/\/(?P<unusedhost>[\w\-\.]*)(?P<path>.*?) \S+\" (?P<status>\S+) (?P<length>\S+) \"(?P<referrer>.*?)\" \"(?P<user_agent>.*?)\" \S+'  \
        --enable-http-errors \
        --enable-http-redirects \
        --log-hostname=\$LOGHOSTNAME \
        --enable-static - "
@mattab commented on March 7th 2014 Owner

Ok this may indeed be a bug. Maybe try with a payload of 100 instead of 1000 and see if the problem still occurs?

@anonymous-piwik-user commented on March 7th 2014

OK. I'll try to lower the payload then and see for few days if the problem persists (it occurs very unpredictively).

@anonymous-piwik-user commented on March 8th 2014

I lowered the payload to 100. Didn't help.
Got thread hung at recvmgsg().
Makes you wonder tho - if the import is hanging with many recorders and doesn't hang with few (or at least not as often), is the library you're using for querying the server thread-safe? I read that urllib2 is supposed to be behaving ok if you're not doing specific things with its object, but it was not written with thread safety in mind, so might be misbehaving in unpredictable way.

@mattab commented on March 8th 2014 Owner

We still cannot replicate the hanging process issue. Likely there is a bug or misconfiguration but until we can replicate or if you can troubleshoot to find out what is causing it, we cannot really help yet. Maybe someone else will report the issue here? or maybe you find out how to reproduce...

@anonymous-piwik-user commented on March 10th 2014

I understand. The main problem is that I can't reproduce it myself in a deterministic way. As I said before - I can't predict whether it happens or not. It just sometimes does.

@anonymous-piwik-user commented on March 24th 2014

I managed to run my import script using strace. It seems that the issue arises when two threads are trying to query DNS servers at the same time. While one thread succeeds, another one hangs as I described earlier.
But the strange thing is that the socket (fd=12 in this case) is getting closed by this very thread and after that the import script still tries to use it.

08:37:19.920001 open("/etc/hosts", O_RDONLY|O_CLOEXEC) = 12
08:37:19.920253 fstat(12, {st_mode=S_IFREG|0644, st_size=83, ...}) = 0
08:37:19.920480 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fbf7ff6e000
08:37:19.920661 read(12, "127.0.0.1   localhost.localdomai"..., 4096) = 83
08:37:19.921097 read(12, "", 4096) = 0
08:37:19.921548 close(12) = 0
08:37:19.922040 munmap(0x7fbf7ff6e000, 4096)      = 0
08:37:19.922591 futex(0x7fbf747bd240, FUTEX_WAKE_PRIVATE, 1)           = 0
08:37:19.923433 socket(PF_INET, SOCK_DGRAM|SOCK_NONBLOCK, IPPROTO_IP)                = 12
08:37:19.924190 connect(12, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("8.8.8.8")}, 16) = 0
08:37:19.924692 gettimeofday({1395646639, 924719}, NULL) = 0
08:37:19.925232 poll([{fd=12, events=POLLOUT}], 1, 0)          = 1 ([{fd=12, revents=POLLOUT}])
08:37:19.925636 sendto(12, "\302\240\1\0\0\1\0\0\0\0\0\0\5piwik\2eo\2pl\0\0\1\0\1", 29, MSG_NOSIGNAL, NULL, 0) = 29
08:37:19.926164 poll([{fd=12, events=POLLIN}], 1, 5000) = 1 ([{fd=12, revents=POLLIN}])
08:37:19.927042 ioctl(12, FIONREAD, [147]) = 0
08:37:19.927876 recvfrom(12, "\302\240\205\200\0\1\0\1\0\3\0\3\5piwik\2ab\2ab\0\0\1\0\1\300\f\0"..., 2048, 0, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("8.8.8.8")}, [16]) = 147
08:37:19.928919 close(12) = 0
08:37:19.929849 socket(PF_INET, SOCK_DGRAM|SOCK_NONBLOCK, IPPROTO_IP) = 4
08:37:19.931274 connect(4, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("8.8.8.8")}, 16) = 0
08:37:19.933639 gettimeofday({1395646639, 933684}, NULL) = 0
08:37:19.935177 poll([{fd=12, events=POLLOUT}], 1, 4991) = 1 ([{fd=12, revents=POLLOUT}])
08:37:19.939369 sendto(12, ",W\1\0\0\1\0\0\0\0\0\0\5piwik\2ab\2ab\0\0\34\0\1", 29, MSG_NOSIGNAL, NULL, 0) = 29
08:37:19.939872 gettimeofday({1395646639, 939872}, NULL) = 0
08:37:19.940067 poll([{fd=12, events=POLLIN}], 1, 4984) = 1 ([{fd=12, revents=POLLIN}])
08:37:19.940524 ioctl(12, FIONREAD, [0]) = -1 EOPNOTSUPP (Operation not supported)
08:37:19.941066 recvfrom(12,  
@anchalaggarwal commented on September 11th 2014

Hi All,
Is there a timeline for this bug resolution? Till when can we expect a solution?

@mattab commented on December 1st 2014 Owner

@anchalaggarwal unfortunately we don't know how to solve this problem. in our tests or when we run in production we haven't yet AFAIK had this issue. Maybe you or someone else has some idea?

@mattab commented on March 12th 2015 Owner

Issue was moved to the new repository for Piwik Log Analytics: https://github.com/piwik/piwik-log-analytics/issues

refs #7163

This Issue was closed on March 12th 2015
Powered by GitHub Issue Mirror