User Evaluation – System Usability Scale
In my last post I stated that I would attempt to gather more SUS forms before interpreting the results. Due to exams, I haven’t gathered as many as I would have liked, but I do have enough to draw some interesting conclusions. In this post I will discuss the results and findings, and see what I can do to correct where I’ve gone wrong or build on the points that seem to have been received well.
Most of the students I have been able to add to the list of testers are a few Computer Science people I know from abroad. Skype offered the tools needed to run these paper prototype sessions despite the distance. With this in mind, most of the results reflect the wishes and preferences of people with a computer science background (apart from the history and drama students). Since the application is tied to computer science only as an example domain, I certainly won’t leave out the valuable feedback from those two students.

Before going over the results, I will explain how the answers on the SUS questionnaire are interpreted and turned into a single number that is indicative of the application’s usability.
As I described in a previous post, the SUS consists of 10 statements that users rate from 1 to 5, with 1 meaning ‘strongly disagree’ and 5 meaning ‘strongly agree’. The questions are very general and aimed entirely at the usability, practicality and ease of use of the application being tested. Once the results have been gathered, the raw answers need to be converted into a single number that says something about the usability of the system. This is done in the following way:
- For odd items: Subtract one from the result.
- For even items: Subtract the result from 5.
- The converted values now all lie between 0 and 4.
- Add up the converted responses and multiply the sum by 2.5. The range of possible values now lies between 0 and 100 (see the sketch below).
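For concreteness, here is a minimal Python sketch of that conversion. The function name and the example answers are purely illustrative; they are not taken from my actual questionnaires.

```python
def sus_score(responses):
    """Convert ten raw SUS answers (each 1-5, in question order) to a 0-100 score."""
    if len(responses) != 10:
        raise ValueError("SUS has exactly 10 items")
    total = 0
    for item, answer in enumerate(responses, start=1):
        if item % 2 == 1:            # odd items (1, 3, 5, 7, 9): answer - 1
            total += answer - 1
        else:                        # even items (2, 4, 6, 8, 10): 5 - answer
            total += 5 - answer
    return total * 2.5               # scale the 0-40 sum to 0-100

# Hypothetical example: a fairly positive set of answers
print(sus_score([4, 2, 4, 1, 4, 2, 5, 2, 4, 2]))  # 80.0
```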
It is important to note, however, that these numbers are not percentages. A result of 70 does not correspond to 70%, even though it is 70% of the maximum value. The average SUS score reported across studies is 68, so a score of 68 means the application is about average, and anything higher means it is above average. A result of 80 or more therefore represents an application that is well above average and has few problems left to tackle. Now that we know this, what was the resulting score for my application so far?
The Results
For each user separately:
- 75
- 75
- 77.5
- 65
- 72.5
- 77.5
- 80
- 75
Which gives an average score of 74.69.
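As a quick sanity check, the average can be reproduced from the individual scores listed above (plain Python, nothing specific to my prototype):

```python
scores = [75, 75, 77.5, 65, 72.5, 77.5, 80, 75]
average = sum(scores) / len(scores)   # 74.6875
print(round(average, 2))              # 74.69
```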
All in all, this is a good result. It signifies that most of the people who tested the prototype found it both useful and easily accessible. However, this wouldn’t be a proper analysis if I didn’t address some things that clearly showed in the results. Also note that this score is still fairly close to the average of 68 and not close enough to the 80.3 mark, which is generally interpreted as the point where people start recommending an application to other users.
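To make that reading explicit, here is a small helper that classifies a score against the two benchmarks mentioned in this post (68 and 80.3); the labels are my own wording, not part of any official SUS guidance.

```python
def interpret(score):
    """Rough reading of a SUS score against the commonly cited benchmarks."""
    if score < 68:
        return "below average usability"
    if score < 80.3:
        return "above average, but users are unlikely to actively recommend it yet"
    return "excellent; users tend to recommend the application to others"

print(interpret(74.69))  # above average, but users are unlikely to actively recommend it yet
```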
Overall, the question that received the lowest marks was whether or not the functions of the system were well integrated. This is along the same lines as the feedback I had addressed previously. It seems that some functionality, primarily the Challenge Tracks, feel out of place or are simply not integrated well enough. Since I had also deliberately not adapted the paper prototype after the first results, so as not to pollute later results with improvements that could already have been made, one of the other main concerns remained how to create a new iteration, which was far too cumbersome and not easy to access. This led to lower scores on questions 8 and 9: the former asks whether or not the system was cumbersome, while the latter asks whether the tester felt confident using the system. The confusion caused by the misplacement of some of the UI elements, as well as the at times overly complicated layout, made many testers feel less confident and made the application feel sluggish. These matters will definitely need to be addressed in one of the next iterations of the application. A gamified application that feels sluggish and cumbersome will fall flat, as the game elements cannot do what they are supposed to when attached to such a crippled application.
So, as I stated before, the results are positive, but they did reveal a few issues that need urgent attention. The main surprise came in the form of the answers to the very first question: some of the users said that they would indeed consider using the system frequently, most notably the two who weren’t Computer Science students. This leads me to believe, tentatively, that the application could serve its purpose of motivating students and see actual use.
In the next post I will discuss the implementation decisions I have made and will be making, and, after my exams have concluded, I will finally begin implementing the application’s first iteration.
Comments

We are using the standard SUS test for our thesis as well, but we are unsure about how exactly to interpret the results. You seem very confident in drawing specific conclusions after performing the test. For example, you say:
“It seems that some functionality, primarily the Challenge Tracks, feel out of place or are simply not integrated well enough.”
How do you know this? Did you ask the users to explain their answers? Or did you combine the results from a think-aloud session during paper prototyping with the results from the SUS questionnaire?
Either way, one of the main critiques/suggestions we got on our first evaluation methodology was that SUS is too vague as an evaluation tool. We had to come up with some form of evaluation that is less general and more focused on what we were doing. We would strongly advise you to do the same. In your case, that means finding an evaluation method that ties in closely with gamification. The best way to find something like that would be to look at other gamification projects and see how they evaluate their applications. We are aware of how tight schedules can be, and doing some literature study (on evaluation methods) now might seem like a step backwards, but it really isn’t.
We’re also not implying you haven’t thought of this before, but the current state of your blog does not reflect your thoughts on this.
To summarize: testing for usability with SUS is great, but not enough. We (and I guess everybody from the HCI group) would be very interested in seeing a blog post about the other evaluation methods you want to use (specifically to test some aspects of the gamification), and what results they yield. Good luck :)!