Message boards :
Number crunching :
Long-running work unit
| Author | Message |
|---|---|
|
Josef W. Segur Joined: 30 Oct 99 Posts: 4504 Credit: 1,414,761 RAC: 0
|
On Windows 2000 and later, the timing values are based on the QueryPerformanceCounter function. If that isn't available, there's a fallback using GetSystemTimeAsFileTime, which is less precise. In non-Windows builds, the primary timing looks to be inline assembly with the rdtsc instruction, with the gettimeofday function as the fallback. There are other options for both depending on build configuration. If you run the test BilBg suggested, we'll at least know for sure whether the symptoms on your Linux system really match. There might be other parts of the initialization which could hang. Joe |
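The primary-plus-fallback arrangement Joe describes can be sketched as follows. This is only an illustration, not the application's code: Python cannot issue rdtsc itself, so `time.perf_counter` stands in for the high-resolution primary timer and `time.time` for the coarser fallback. The names `pick_timer` and `time_call` are made up for this example.

```python
import time


def pick_timer():
    """Return (name, callable) for the best clock available.

    Primary: time.perf_counter (high resolution, monotonic).
    Fallback: time.time (wall clock, usually coarser and adjustable).
    """
    try:
        info = time.get_clock_info("perf_counter")
        if info.monotonic and info.resolution > 0:
            return "perf_counter", time.perf_counter
    except (ValueError, AttributeError):
        pass  # primary clock is not usable here, fall through
    return "time", time.time


def time_call(func, *args):
    """Call func(*args) once and return (result, elapsed_seconds)."""
    _, now = pick_timer()
    start = now()
    result = func(*args)
    return result, now() - start
```

The choice is made once per call here for simplicity; a real benchmark loop would probably pick the clock once at startup and reuse it.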
BilBg Joined: 27 May 07 Posts: 3720 Credit: 9,385,827 RAC: 0
|
I did a test today (15 runs) - stderr.txt (K6-2+ / Windows 2000): http://pastebin.com/NLKpNxiJ
Some observations:
- It is known, but let's make it clear: the app never hangs (at least for me) in the middle of computing; it may only hang at startup, during 'Optimal function choice'.
- wisdom.sah is 'updated' first (a fake update; only the order of the lines changes). The file's timestamp is set to 'now' 13-15 s after app start. Since this happens before 'Optimal function choice', the file is updated even when the app hangs.
- The 'Test duration' is another wrong value:
Test duration 0.36 seconds
Test duration 380.57 seconds
In reality I measured 120-140 s from app start until the 'Test duration' line appeared (obviously only for runs that did not hang; the time is not exactly the same on every run, but it is in the range of ~2 minutes on this system).
- After the app hangs, most of the CPU time/load (~70%) is in kernel mode (SIV, Process Explorer).
- ALF - "Find out what you don't do well ..... then don't do it!" :) |
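One way to catch a 'Test duration' that disagrees with reality, as in the observations above, is to measure the same interval with two independent clocks and flag large differences. The sketch below is a hypothetical check, not something the application does; `disagree`, `timed_with_crosscheck` and the 50% tolerance are assumptions chosen for illustration.

```python
import time


def disagree(a, b, tolerance=0.5):
    """True if a and b differ by more than `tolerance` of the larger one."""
    larger = max(abs(a), abs(b))
    if larger == 0:
        return False
    return abs(a - b) / larger > tolerance


def timed_with_crosscheck(func, tolerance=0.5):
    """Run func and measure it with two clocks.

    Returns (fast_elapsed, wall_elapsed, suspicious), where `suspicious`
    is True when the two measurements disagree badly.
    """
    fast_start = time.perf_counter()
    wall_start = time.time()
    func()
    fast_elapsed = time.perf_counter() - fast_start
    wall_elapsed = time.time() - wall_start
    return fast_elapsed, wall_elapsed, disagree(fast_elapsed, wall_elapsed, tolerance)
```

With the figures reported above, both `disagree(0.36, 130.0)` and `disagree(380.57, 130.0)` are flagged, taking 130 s as a stand-in for the measured 120-140 s.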
|
Graeme Hewson Joined: 14 Jun 99 Posts: 19 Credit: 242,802 RAC: 0
|
I ran some tests, but the results were unremarkable. I've also been running SETI@Home normally without any problems, until today. Workunits 1612718929 and 1612022476 both stalled with the same symptoms as before. Unfortunately, I'm unable to reproduce the problem. For instance, stderr.txt for 1612022476 (27mr08ah.20785.20522.438086664204.12.8) is:
setiathome_v7 7.00 Revision: 1772 g++ (GCC) 4.4.6 20110731 (Red Hat 4.4.6-3)
libboinc: BOINC 7.1.0
Work Unit Info:
...............
WU true angle range is : 0.432149
Optimal function choices:
--------------------------------------------------------
name timing error
--------------------------------------------------------
v_BaseLineSmooth (no other)
v_vGetPowerSpectrum2 0.000204 0.00000
But when I test the WU with -verbose, it looks normal:
11:46:43 (12672): Can't open init data file - running in standalone mode
11:46:43 (12672): Can't open init data file - running in standalone mode
setiathome_v7 7.00 Revision: 1772 g++ (GCC) 4.4.6 20110731 (Red Hat 4.4.6-3)
libboinc: BOINC 7.1.0
Work Unit Info:
...............
WU true angle range is : 0.432149
Optimal function choices:
--------------------------------------------------------
name timing error
--------------------------------------------------------
v_BaseLineSmooth (no other)
v_GetPowerSpectrum 0.000253 0.00000 test
v_vGetPowerSpectrum 0.000172 0.00000 test
v_vGetPowerSpectrum2 0.000193 0.00000 test
v_vGetPowerSpectrumUnrolled 0.000177 0.00000 test
v_vGetPowerSpectrumUnrolled2 0.000198 0.00000 test
v_avxGetPowerSpectrum faulted
v_vGetPowerSpectrum 0.000172 0.00000 choice
v_ChirpData 0.010563 0.00000 test
fpu_ChirpData 0.017693 0.00000 test
fpu_opt_ChirpData 0.010357 0.00000 test
v_vChirpData_x86_64 0.073799 0.01993 test
sse1_ChirpData_ak 0.009647 0.00000 test
sse1_ChirpData_ak8e 0.007629 0.00000 test
sse1_ChirpData_ak8h 0.007942 0.00000 test
sse2_ChirpData_ak 0.010725 0.00000 test
sse2_ChirpData_ak8 0.006080 0.00000 test
sse3_ChirpData_ak 0.010540 0.00000 test
sse3_ChirpData_ak8 0.006001 0.00000 test
avx_ChirpData_a faulted
avx_ChirpData_b faulted
avx_ChirpData_c faulted
avx_ChirpData_d faulted
sse3_ChirpData_ak8 0.006001 0.00000 choice
v_Transpose 0.013878 0.00000 test
v_Transpose2 0.007828 0.00000 test
v_Transpose4 0.006351 0.00000 test
v_Transpose8 0.011334 0.00000 test
v_pfTranspose2 0.008176 0.00000 test
v_pfTranspose4 0.005297 0.00000 test
v_pfTranspose8 0.009363 0.00000 test
v_vTranspose4 0.004491 0.00000 test
v_vTranspose4np 0.004173 0.00000 test
v_vTranspose4ntw 0.004860 0.00000 test
v_vTranspose4x8ntw 0.002933 0.00000 test
v_vTranspose4x16ntw 0.002436 0.00000 test
v_vpfTranspose8x4ntw 0.004869 0.00000 test
v_avxTranspose4x8ntw faulted
v_avxTranspose4x16ntw faulted
v_avxTranspose8x4ntw faulted
v_avxTranspose8x8ntw_a faulted
v_avxTranspose8x8ntw_b faulted
v_vTranspose4x16ntw 0.002436 0.00000 choice
FPU opt folding 0.000967 0.00000 test
ben SSE folding 0.000801 0.00000 test
AK SSE folding 0.000651 0.00000 test
BH SSE folding 0.000723 0.00000 test
JS AVX_a folding faulted
JS AVX_c folding faulted
AK SSE folding 0.000651 0.00000 choice
Test duration 4.93 seconds
When I ran the tests in September, it was coincidentally just after I'd rebooted the machine following a kernel upgrade. I ran for several days without sleeping the machine overnight, but the first time I do sleep the machine after a reboot, I get messages like the following in syslog when I wake it:
[44821.501321] TSC synchronization [CPU#0 -> CPU#1]:
[44821.501322] Measured 5056157788 cycles TSC warp between CPUs, turning off TSC clock.
[ 0.008000] tsc: Marking TSC unstable due to check_tsc_sync_source failed
[ 0.008000] process: Switch to broadcast mode on CPU1
[44811.682905] CPU1 is up
The BIOS is up-to-date (or as up-to-date as it's ever going to be, given the age of the motherboard). /proc/cpuinfo is unchanged after these messages, and in particular the flags tsc, rdtscp, constant_tsc and nonstop_tsc are present before and after sleeping. I have a suspicion this has something to do with the problem, but as I say, I'm unable to reproduce it. I've also tried testing with taskset(1) to set the CPU affinity. I'm going to try restarting the WUs. I suspect they'll complete normally. |
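Since the /proc/cpuinfo flags did not change across the sleep, it may also be worth looking at the kernel's active clocksource, which is separate state exposed under sysfs on Linux. The sketch below checks both; it is a diagnostic aid under those assumptions, not a confirmed explanation of the stall. The helper names are made up for this example.

```python
from pathlib import Path

# The four flags mentioned in the post above.
TSC_FLAGS = ("tsc", "rdtscp", "constant_tsc", "nonstop_tsc")
# Standard Linux sysfs location of the active clocksource name.
CLOCKSOURCE = Path("/sys/devices/system/clocksource/clocksource0/current_clocksource")


def tsc_flags(cpuinfo_text):
    """Return {flag: present?} using the first 'flags' line of cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            present = set(value.split())
            return {flag: flag in present for flag in TSC_FLAGS}
    return {flag: False for flag in TSC_FLAGS}


def current_clocksource(path=CLOCKSOURCE):
    """Return the active clocksource name, or None if it cannot be read."""
    try:
        return Path(path).read_text().strip()
    except OSError:
        return None
```

On a Linux machine, `tsc_flags(Path("/proc/cpuinfo").read_text())` and `current_clocksource()` can be compared before and after a suspend/resume cycle.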
|
Graeme Hewson Joined: 14 Jun 99 Posts: 19 Credit: 242,802 RAC: 0
|
And indeed, both WUs have finished normally now. |
BilBg Joined: 27 May 07 Posts: 3720 Credit: 9,385,827 RAC: 0
|
"But when I test the WU with -verbose, it looks normal"
Did you try 10-20 times or only once? It is known that the hang does not happen every time.
- ALF - "Find out what you don't do well ..... then don't do it!" :) |
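Because the hang is intermittent, repeated runs are easier to do with a small harness that kills any run exceeding a time limit and counts it as a hang. This is a minimal sketch; the function name `count_hangs` and the default timeout are assumptions, and the command to run is whatever standalone test invocation is being used.

```python
import subprocess
import sys


def count_hangs(cmd, runs=10, timeout=600.0):
    """Run cmd `runs` times and return (completed, hung).

    A run counts as hung when it exceeds `timeout` seconds; in that case
    subprocess.run kills the child process and raises TimeoutExpired.
    """
    completed = hung = 0
    for _ in range(runs):
        try:
            subprocess.run(cmd, stdout=subprocess.DEVNULL,
                           stderr=subprocess.DEVNULL, timeout=timeout)
            completed += 1
        except subprocess.TimeoutExpired:
            hung += 1
    return completed, hung
```

For example, `count_hangs(["./app_binary", "-verbose"], runs=20)` would cover the 10-20 runs suggested above, where `./app_binary` is a placeholder for the real executable path.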
|
Graeme Hewson Joined: 14 Jun 99 Posts: 19 Credit: 242,802 RAC: 0
|
I didn't count, but I tested both stalled WUs about 10 times each, and ran similar tests in September. |
©2025 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.