Strange lag, goes away after moving windows (Win 8.1, nvidia)

leeta wrote on Thursday, September 11, 2014:

This question is about some lag spikes that I can’t currently explain.

I have a simple interactive GLFW GUI application, using the regular system cursor for point-and-click, and rendering everything in the window using simple OpenGL 2D sprites. I use mouse event callbacks for capturing UI input, and then actually “act” on that input within my main game loop, turning the evented delivery into polled game logic.
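
The callback-to-polling pattern described above can be sketched roughly as follows. This is a minimal illustration, not the poster's actual code: `MouseEvent`, `InputQueue`, `on_mouse_button` and `drain` are hypothetical names, and the recorded fields only mirror the arguments of a GLFW mouse button callback (button, action, mods) without linking against GLFW. GLFW runs callbacks on the thread that calls glfwPollEvents() or glfwWaitEvents(), so the sketch assumes a plain queue without locking is enough.

```cpp
#include <deque>
#include <vector>

// Hypothetical record of one mouse button event; the fields mirror the
// arguments GLFW passes to a mouse button callback.
struct MouseEvent {
    int button;
    int action;
    int mods;
};

// Hypothetical queue that decouples event delivery from game logic.
class InputQueue {
public:
    // Called from the event callback: only record the event, never act on it.
    void on_mouse_button(int button, int action, int mods) {
        pending_.push_back(MouseEvent{button, action, mods});
    }

    // Called once per frame from the main loop: hand over everything
    // recorded since the previous call, in arrival order.
    std::vector<MouseEvent> drain() {
        std::vector<MouseEvent> out(pending_.begin(), pending_.end());
        pending_.clear();
        return out;
    }

private:
    std::deque<MouseEvent> pending_;
};
```

In a real application, the function registered with glfwSetMouseButtonCallback() would forward to `on_mouse_button()`, and the game loop would call `drain()` right after glfwPollEvents().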

This is in Windows 8.1 on a 64 bit system with GeForce GTX 870M graphics and Intel HD4600 frame buffer. The CPU is a Core i7 with four cores, eight threads. The application also spawns two threads for sound playback and background loading, but communication with those threads uses a non-blocking FIFO so they shouldn’t be able to block the main thread (and I’ve verified that that’s not where the time is being spent.)
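
One common way to build a non-blocking FIFO like the one mentioned above is a single-producer, single-consumer ring buffer. The sketch below is an assumption about the general technique, not the poster's implementation; `SpscQueue`, `try_push` and `try_pop` are hypothetical names. Both operations return immediately instead of waiting, so neither side can hold up the other.

```cpp
#include <array>
#include <atomic>
#include <cstddef>

// Hypothetical fixed-size FIFO for exactly one producer thread and one
// consumer thread. One slot is kept empty, so it holds at most N - 1 items.
template <typename T, std::size_t N>
class SpscQueue {
public:
    // Producer side: returns false right away if the queue is full.
    bool try_push(const T& value) {
        const std::size_t tail = tail_.load(std::memory_order_relaxed);
        const std::size_t next = (tail + 1) % N;
        if (next == head_.load(std::memory_order_acquire)) {
            return false;  // full
        }
        buffer_[tail] = value;
        tail_.store(next, std::memory_order_release);
        return true;
    }

    // Consumer side: returns false right away if the queue is empty.
    bool try_pop(T& out) {
        const std::size_t head = head_.load(std::memory_order_relaxed);
        if (head == tail_.load(std::memory_order_acquire)) {
            return false;  // empty
        }
        out = buffer_[head];
        head_.store((head + 1) % N, std::memory_order_release);
        return true;
    }

private:
    std::array<T, N> buffer_{};
    std::atomic<std::size_t> head_{0};  // next slot to read
    std::atomic<std::size_t> tail_{0};  // next slot to write
};
```

With this shape, the main thread would call `try_push()` to hand work to the loader or sound thread and `try_pop()` on a second queue to collect results, and would simply retry on a later frame when either call returns false.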

I’m using glfw3.dll and glfw3d.dll (for the release/debug builds) from version 3.0.4.

I also use a repaint callback that calls the same function as my once-per-frame re-paint function.

After opening the window, I’m having all kinds of weird pauses and stutters in responsiveness. This always happens in the GL render function, and the actual time is spent somewhere in GL, not in the swapbuffers call. Sometimes the pauses are a few hundred milliseconds, but I’ve measured them going up to 5 seconds.

I’ve tried running with glfwPollEvents() and glfwWaitEvents() and the behavior is the same in both cases. My internal watchdog timers tell me it’s always in the GL rendering function somewhere, when it happens.

I further traced this problem to a call to glGetError(). Note that glGetError() only returns the state of the client commands issued – it should not need to synchronize with the GPU like a frame buffer read-back would. However, even if it does, 5 seconds seems excessive – the game runs at 60 Hz.

Thus this post: Has anyone else seen this? What could possibly cause this? How do I cure this?

dougbinks wrote on Thursday, September 11, 2014:

If you’re using a once per frame window rendering approach, you shouldn’t
need to re-render the scene with a repaint callback. I would remove the
repaint callback and see what happens.

If you have further issues then I would advise debugging by breaking the
execution (Debug -> Break all in Visual Studio) and seeing what’s being
called, or profiling with the free AMD CodeXL. This will work for basic CPU
profiling even on Intel CPUs. Alternatively you could use a trial version
of Intel VTune or the free open source Sleepy profiler:
http://www.codersnotes.com/sleepy

http://developer.amd.com/tools-and-sdks/opencl-zone/codexl/

leeta wrote on Friday, September 12, 2014:

Thanks for the suggestions!

If I don’t have a refresh handler, then the window doesn’t repaint while being moved by the user and covered/uncovered. However, I have since found that moving the window was a false positive – it still happens.

CPU profiling doesn’t help because the time is spent outside my binary. I tried it with the built-in Visual Studio profiler tools.

I narrowed it down by inserting timer bracketings across sections of the code, and it came down to a call to glGetError(). When I removed that, it came down to a glReadPixels(). When I removed that, it came down to glfwSwapBuffers(). And, before the answer “read-backs are bad!” is given, I’d like to suggest that I’m totally OK with stalling the GPU for each frame to run synchronously between CPU and GPU, at 60 Hz. That’s the explicit intention of the ReadPixels. And it kind-of works, except once in a while, the driver decides to stall for a second, or two, or five. Those kinds of stalls are not caused by pipeline flushing, because flushing the pipeline happens every frame, at 60 Hz!

I also enabled debug output feedback using the 4.3 core functions, and that prints nothing.

As far as I can tell, this is an NVIDIA driver bug :(

dougbinks wrote on Friday, September 12, 2014:

Finding driver bugs is rare these days; the better assumption is normally
that something is wrong in your own code.

Your approach seems unusual, in that most applications choose to either
regularly render the entire frame or update portions based on changes,
including updating sections due to paint messages. I would advise
proceeding with the first approach with hardware acceleration, and only
moving to an irregular render if performance or power consumption really
proved to be an issue.

Note that if you’re performing irregular rendering on update then you would
expect to see apparent stalls in rendering when no paint messages were
being processed. It’s also possible that you’re processing multiple paint
callbacks before a swapbuffers, thus building up rendering calls and
causing a stall.

A second odd thing you mention is “64 bit system with GeForce GTX 870M
graphics and Intel HD4600 frame buffer”. I’m presuming you’re on a mobile
system with NVIDIA Optimus technology? You might want to test selecting
which GPU your application runs on and seeing if you have any difference
when running on one versus the other.
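
Besides picking the GPU per application in the NVIDIA control panel, an Optimus system can be asked for the NVIDIA GPU from inside the program by exporting a global named NvOptimusEnablement from the executable. NVIDIA's Optimus guidance describes this export; whether it changes the stall discussed in this thread is untested here. The `EXPORT_SYMBOL` macro is a hypothetical helper so the snippet also compiles with non-Windows compilers, where the export has no effect.

```cpp
// Hypothetical portability macro: the export only matters on Windows.
#if defined(_WIN32)
#define EXPORT_SYMBOL __declspec(dllexport)
#else
#define EXPORT_SYMBOL
#endif

extern "C" {
// A value of 1 asks the Optimus driver to prefer the discrete NVIDIA GPU
// for this process. Place this in the application's own source, at global
// scope, so that it is exported from the executable.
EXPORT_SYMBOL unsigned long NvOptimusEnablement = 0x00000001;
}
```

Building once with this export and once without it gives a quick A/B comparison between the two GPUs without touching driver settings.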

Instead of using glReadPixels for synchronizing the CPU and GPU, you might
want to try using glFenceSync with GL_SYNC_GPU_COMMANDS_COMPLETE and
glClientWaitSync:
https://www.opengl.org/sdk/docs/man/docbook4/xhtml/glFenceSync.xml

I would also remove the sync for now until you resolve the stall - cutting
down functionality is often a good approach to debugging these issues.

miran46 wrote on Tuesday, December 09, 2014:

NVIDIA is doing some weird stuff with the default settings. Try setting Threaded optimization to Off under Manage 3D settings in the NVIDIA control panel.