Today I’m writing about the Producteev Windows Desktop application for task management. Producteev for Windows allows you to bring the web-based task management software to your Windows PC.
The current release has issues for some users. In this post I'll walk through those issues, their symptoms, their causes, and the architectural changes that were required to overcome them.
If you came here looking for an update to Producteev for Windows: don't worry, it will be out soon! The remainder of this post is more of a technical paper for those interested.
This post might benefit a few people, especially if you’re having performance trouble with your C# application, seeing slowdowns in GDI+ or WinForms, utilizing multithreading, synchronizing with an online (API) service, or are considering a rewrite because of C# performance issues.
Back when the Producteev Windows application was in the design phase, I knew right away it would require at least two threads to run smoothly. Producteev for Windows needs to synchronize with a web API, and making these slow API calls from the main thread would cause the interface to become unresponsive. These calls may also happen when the internet connection is unreliable, when the API service is offline, or when the user is not connected at all. Therefore the synchronization engine must operate on a separate thread and handle all of these conditions reliably.
“That’s a start,” I thought, and I ran with it.
The central idea was that all of the web calls would be serialized since order of operations is very important. For instance, consider the following user actions:
- Create a label
- Create a task
- Assign the label to a task
If the actions are sent to the web controller in reverse order, the instructions would be incoherent; you’d immediately trigger an error and not everything would be completed:
- Assign the label to a task (Error: No such task!)
- Create a task (OK)
- Create a label (OK)
Any change to model data needs to happen in a serialized fashion (i.e., in order, one at a time). This is true both within the synchronization engine and locally in the program's own data model. The latter would happen organically through user actions, while the former would have to use a dynamic circular buffer of sorts to line up HTTPS calls. This was the core requirement of the synchronization engine.
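The ordering requirement is language-independent, so here is a minimal Python sketch of the idea (the class and method names are my own illustration, not Producteev's actual C# code): a single worker thread drains a FIFO queue, so "create label" always reaches the server before "assign label to task."

```python
import queue
import threading

class SyncEngine:
    """Serializes API actions: one worker drains a FIFO queue,
    so actions reach the server in the order the user performed them."""

    def __init__(self, send):
        self._queue = queue.Queue()   # FIFO: preserves submission order
        self._send = send             # callable that performs the slow HTTPS call
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def enqueue(self, action):
        self._queue.put(action)       # cheap; called from the main (UI) thread

    def _run(self):
        while True:
            action = self._queue.get()
            if action is None:        # sentinel: shut down
                break
            self._send(action)        # slow network call, off the main thread

    def stop(self):
        self._queue.put(None)
        self._worker.join()

# Usage: the actions from the example above arrive in exactly the
# order the user performed them, never reversed.
log = []
engine = SyncEngine(send=log.append)
for a in ["create label", "create task", "assign label to task"]:
    engine.enqueue(a)
engine.stop()
print(log)  # ['create label', 'create task', 'assign label to task']
```

The main thread only ever pays the cost of an `enqueue`, which is why the interface stays free while the network calls grind along in order.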
Read more about the Producteev Desktop synchronization engine design.
It was not immediately obvious why, but this design led to some unexpected behavior for some users, including:
- Very slow application performance
- Consistently slow bulk actions
- Inconsistent state between the desktop and web applications
- A hanging interface
The resulting behavior was contradictory to the intended design.
Astute readers may notice that, with this design, local model data might have to be accessed by more than one thread simultaneously. This is true; both threads did access the local model data, and that could cause symptoms similar to the ones listed above. However, this was a red herring: all model data was modified and read with C#'s AsParallel() PLINQ methods, and there were no noticeable race conditions or deadlocks as a result.
The problem was deeper
The reported issues were just the symptoms. Some testing revealed even more:
- The interface would refresh multiple times after any update to the data model (e.g., after something as simple as renaming a task or a workspace, 100+ paint calls were being made)
- The interface could theoretically refresh data that the user was actively modifying (such as a task title), which was rare but is an aggravating UX no-no
- The program was slightly unresponsive, despite low resource usage
I fixed some of these directly by filtering out the specific conditions and releasing hotfixes. Patching over user experience issues like these is called symptomatic treatment, since it doesn't get to the root of the issue. The original problems remained.
My first insight
It began with investigating the performance issues. I said earlier that CPU and memory usage stayed low despite the application's lack of responsiveness. I had initially attributed this to the C# language and its notoriously slow managed libraries. The entire interface is composed of controls that are hand-drawn in GDI+ onto WinForms. I made hefty optimizations to these, and conceptually they were as fast as I could make them, using some of the same drawing techniques found in game development. The optimizations yielded marginal performance improvements, but I moved on because the reliability issues were more disconcerting.
Before I continue, I want to note that it's really easy to blame the operating system, the language, or the libraries for all of the above issues. Especially in this case, heavy drawing code and having to iterate through thousands of objects in a language that doesn't approach the speed of C++ can easily explain away performance issues. But that's not why the program was slow in my case. Let this be a lesson: never assume that right off the bat. Running into these issues and expecting a new language, tool, or a rewrite to fix them is often misguided, so Blame Yourself First.
It took a while, but I tracked the problem down to a central architectural issue.
Sometimes you will find that you made an invalid assumption when designing an application. Building layer upon layer on top of an invalid assumption will usually cause problems down the road, and those problems will usually be hard to fix. They may require only some quick refactoring, or they may be a killer that destroys a project's chances of success. If the root of your problem lies in the core architecture, it can spell disaster, so take the time to carefully dissect the problem, consider its context, and look for neat, simple solutions.
The assumptions I made in this case were that changes to the local data model would all be made serially, would be computationally trivial, and wouldn't cause any noticeable delay in the interface. Here's a simplification of the projected workflow:
1. The user types in a new title for a task and hits Enter
2. The application directly updates the local data model with the new title
3. The application refreshes all views that contain the title to reflect the change
4. The application tells the API controller to add a new action to the synchronization engine to update the title
5. The task's title is now updated on the website in the background by the synchronization engine, leaving the main thread free again to handle additional input
6. Once the web action is complete, the server returns the complete task object in a callback, and I verify the change was completed successfully
7. The returned task object is parsed and put into the local model by tying back into the main thread and updating the local model once more
8. The application's view is refreshed once more
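The projected workflow can be sketched as an optimistic update followed by background reconciliation. This is a hypothetical Python illustration (all names are mine, a plain callable stands in for marshaling back to the UI thread, and a stand-in function plays the web API), not the application's actual C# code:

```python
import threading

class TaskStore:
    """Sketch of the optimistic-update workflow: apply locally first,
    sync in the background, then reconcile with the server's copy."""

    def __init__(self, api_call, post_to_main):
        self.titles = {}
        self._api_call = api_call          # slow; runs on a worker thread
        self._post_to_main = post_to_main  # marshals work back to the UI thread
        self.refreshes = 0                 # counts view refreshes

    def rename(self, task_id, title):
        # Steps 2-3: update the local model and refresh views immediately
        self.titles[task_id] = title
        self.refreshes += 1
        # Steps 4-5: hand the API call off to a background thread
        worker = threading.Thread(target=self._sync, args=(task_id, title))
        worker.start()
        return worker

    def _sync(self, task_id, title):
        # Step 6: the slow web call completes and returns the full task object
        server_task = self._api_call(task_id, title)
        # Steps 7-8: re-apply the server's copy back on the main thread
        self._post_to_main(lambda: self._apply(server_task))

    def _apply(self, server_task):
        self.titles[server_task["id"]] = server_task["title"]
        self.refreshes += 1

# Usage with a stand-in API: the local title appears instantly, then the
# server's (normalized) copy replaces it once the round-trip completes.
def fake_api(task_id, title):
    return {"id": task_id, "title": title.strip()}  # pretend the server trims whitespace

store = TaskStore(fake_api, post_to_main=lambda fn: fn())
store.rename(1, " Buy milk ").join()
print(store.titles[1], store.refreshes)  # Buy milk 2
```

Note that even this single rename triggers two view refreshes, one optimistic and one after the server round-trip, which hints at how quickly refreshes multiply per action.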
Steps 1-5, 7, and 8 are done on the main thread, meaning during these steps the application cannot handle any input. Even in step 3, the calls to invalidate the main view are lined up in the Windows message queue, immediately deferring the WM_PAINT calls. (Quite a few of these can pile up quickly.) Bulk actions such as deleting twenty tasks will hold up the main thread while 20 local model updates and 20 API controller calls are fired.
After some testing, I found that each of these steps is computationally trivial on its own, but together they add up to around 100ms of expensive interface lag per action. This means updating 20 tasks might take a full two seconds for the interface to catch up, even though we're not waiting for the API calls to complete. This is well beyond acceptable, so it was taken into consideration when attempting a fix.
Resolving all of the issues
I chose a twofold approach that not only dealt with the performance issues but also, more importantly, dealt with synchronization reliability. This might not have been the ideal solution from the start, but it was the most elegant and cheapest solution I could implement:
- I created a local synchronization engine that works a lot like the remote one. It has its own separate thread and lines up any local model updates in order. This also means all local model updates are now piped through a single interface, clearing up the inconsistencies end users were seeing.
- Drawing now happens on a separate thread rather than through Invalidate(). Typically with GDI+ you use Control.Invalidate() to redraw parts of a window. This queues a new WM_PAINT message for the control, which is processed on the main thread in the order it was spawned. If dozens or hundreds of these line up, as when you resize a window, they can quickly flood the Windows message queue and make the application less responsive.
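A dedicated render thread can also coalesce invalidations: however many requests arrive while a frame is being drawn, at most one further repaint follows. Here is a minimal Python sketch of that coalescing idea (the names are hypothetical; the real code draws with GDI+, but the pattern is language-independent):

```python
import threading
import time

class RenderThread:
    """Coalesces invalidation requests: a burst of invalidate() calls
    collapses into at most one additional repaint, instead of one
    queued paint message per call."""

    def __init__(self, paint):
        self._dirty = threading.Event()   # "something changed" flag
        self._stop = threading.Event()
        self._paint = paint               # the expensive drawing routine
        self.frames = 0                   # frames actually drawn
        self._thread = threading.Thread(target=self._loop, daemon=True)
        self._thread.start()

    def invalidate(self):
        self._dirty.set()                 # cheap; safe to call from any thread

    def _loop(self):
        while not self._stop.is_set():
            if self._dirty.wait(timeout=0.05):
                self._dirty.clear()       # absorb the entire burst at once
                self._paint()             # draw one frame for the whole burst
                self.frames += 1

    def stop(self):
        self._stop.set()
        self._thread.join()

# Usage: 100 rapid-fire invalidations collapse into a handful of frames,
# where queuing 100 WM_PAINT messages would mean 100 repaints.
renderer = RenderThread(paint=lambda: time.sleep(0.01))
for _ in range(100):
    renderer.invalidate()
time.sleep(0.3)
renderer.stop()
print(renderer.frames)  # far fewer than 100
```

The main thread's only cost is flipping a flag, which is what keeps the message queue from flooding during bursts like window resizes.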
I recognize I could have also opted for WPF from the beginning. That would have enabled hardware rendering, taking the stress off the CPU and preventing many of the issues with handling model data. Unfortunately, there are by now so many WinForms and custom controls that a port would never have been completed in a reasonable amount of time. Or ever.
I will reiterate my earlier point: this was about resolving the issue in the best way possible given the circumstances. For me that excludes rewriting 10,000 lines of drawing code. The inconsistencies between local and remote (API) state are now resolved because all updates to the local data model are done atomically and in a centralized way. And finally, threading the drawing code is a great way to keep from clogging up the main thread and the Windows message queue. The application is solid and responsive, and it didn't take a whole rewrite to get there.