In this section, we provide the motivation for the user study, its goals, and the methods used. We end the section with the results obtained.
6.2.1 Motivation
DifFuzzAR tries to repair timing side-channel vulnerabilities by automatically refactoring the Java code. To understand the impact that changes proposed by DifFuzzAR may have on users, we reviewed previous work on tools that automatically change code (i.e. refactoring tools).
Murphy-Hill et al. (2011) studied refactoring usage from Eclipse's user data. Their findings seem to suggest that users refrain from using refactoring tools because of three main factors:
1. lack of awareness of their existence;
2. lack of opportunity to use refactoring;
3. lack of trust in refactoring.
Users refrain from using refactoring tools in part because of a lack of trust in the tools' full impact on the code. Murphy-Hill et al. (2011) reported that several developers said they would avoid using a refactoring tool out of worry about introducing errors or unintended side effects.
Another study, by Eilertsen (2012), investigated the usability of refactoring tools through interviews and a usability study with 17 developers. They concluded that users of refactoring tools often complain about a lack of control and usability (Eilertsen 2012): refactoring tools may change a program in unpredictable ways, and users often want to review the code.
6.2.3 Design and methods used
In order to answer our RQs, we designed a survey study. We followed best practices from Redmiles et al. (2017) and methods similar to Eilertsen (2012). We recruited participants through our network (e.g., past students and colleagues). Participants did not receive payment upon survey completion. Participants were shown a consent form before filling in the survey and could withdraw their consent at any point without justification. We did not collect any personal data.
As we wanted to study potential users of our tool, we recruited participants with Java programming experience. We pre-screened participants and only accepted those who had worked with Java for at least two of the previous ten years. To characterize our sample, we also asked participants to rate their expertise in Java and whether they knew what timing side-channel vulnerabilities were before our study.
We then asked all participants to go through a brief explanation of timing side-channel vulnerabilities with examples, and another explanation about DifFuzzAR and about what the tool does.
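To make the notion of a timing side channel concrete, the sketch below shows the kind of early-exit vulnerability and constant-time repair that such explanations typically use. It is an illustration in the spirit of the study materials, not the actual example shown to participants; the class and method names are ours.

```java
public class TimingExample {

    // Vulnerable: the loop returns at the first mismatch, so the running
    // time reveals how many leading bytes of the guess are correct. An
    // attacker can recover the secret byte by byte by timing many guesses.
    static boolean earlyExitEquals(byte[] secret, byte[] guess) {
        if (secret.length != guess.length) {
            return false;
        }
        for (int i = 0; i < secret.length; i++) {
            if (secret[i] != guess[i]) {
                return false; // early exit leaks the mismatch position
            }
        }
        return true;
    }

    // Repaired: the loop always scans the whole array and accumulates the
    // differences, so the timing no longer depends on where the first
    // mismatch occurs.
    static boolean constantTimeEquals(byte[] secret, byte[] guess) {
        if (secret.length != guess.length) {
            return false;
        }
        int diff = 0;
        for (int i = 0; i < secret.length; i++) {
            diff |= secret[i] ^ guess[i]; // no data-dependent branch
        }
        return diff == 0;
    }
}
```

Both methods compute the same boolean result; only the repaired version decouples execution time from the secret, which is the property the tool's refactorings aim to establish while preserving behavior.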
To understand whether users would trust the results of our tool, we provided four vignette scenarios. A vignette, as described in Lavrakas's encyclopedia (Lavrakas 2008), describes a protagonist (or group of protagonists) faced with a realistic situation pertaining to the construct under consideration. The respondent is asked to make a judgment about the protagonist, the situation, or the correct course of action, using some form of closed-ended response. In our vignettes, we presented the following scenario to participants:
Imagine you apply DifFuzzAR to automatically repair timing side-channel vulnerabilities in Java code.
We then provided an example of code before and after a vulnerability is fixed by the tool. We used one scenario for control-flow vulnerabilities, one for early-exit, and two for mixed vulnerabilities (one simple and one more complex). For each scenario, we asked participants to indicate, on a 5-point Likert agreement scale, whether they would trust that refactoring, and we also asked them for feedback. The code snippets used in these scenarios are taken from the examples used in the first part of the evaluation (Sect. 6.1) and can be seen in Appendix A. It is important to mention that in the early-exit example we modified the "before" code to remove the control-flow vulnerability, so as to provide an example with only an early-exit vulnerability. While the "before" code was adapted, the correction of the early-exit vulnerability shown to participants is a direct output of DifFuzzAR.
DifFuzzAR differs from usual refactoring tools in that it aims to produce more secure code (instead of "cleaner" code). So, we also wanted to gauge whether different situations impact users' willingness to use a tool like DifFuzzAR. With this goal in mind, we provided users with two scenarios of DifFuzzAR's usage. One scenario described a programmer coding a sensitive part of a program (e.g., authentication) and the other described a less critical situation (e.g., programming the GUI of an application). We call them the critical and non-critical scenarios, respectively; they can be reviewed in more detail in Appendix B. For each, we asked users to indicate their willingness to use DifFuzzAR on a 5-point Likert agreement scale, and we also asked them for feedback. It should be noted that the non-critical scenario (focused on a dark-mode change) can also leak information through observable timing differences, since switching to dark mode might take slightly longer or shorter than switching to light mode. Despite the absence of secrets in such a scenario, DifFuzzAR could still suggest a refactoring that ensures constant-time execution. Our goal in including this scenario was to understand whether developers see value in using DifFuzzAR in situations where no sensitive information or secrets can be retrieved. This can provide insights into how best to integrate tools such as DifFuzzAR into developers' workflows.
We finished our survey by asking participants for suggestions to improve DifFuzzAR and by asking demographic questions.
User studies can suffer from bias due to the sample, how the survey questions are worded, and even the response options (Redmiles et al. 2017). To mitigate this problem, we conducted two cognitive interviews, followed best practices by offering "don't know" and "prefer not to answer" responses (Redmiles et al. 2017), and used methods previously employed in studies on related subjects (Eilertsen 2012).
6.2.4 Results
Our survey was answered by 20 users, but only 11 met our requirements and passed the pre-screening. To pass this pre-screening, they needed to answer the question "How many of the last ten years (2012-2022) have you spent developing or maintaining software in Java?" with at least two years. We chose to do this because we wanted participants who could understand Java code well. Our sample of 11 participants has, on average, four years of experience with Java.
We followed the same methods as Eilertsen (2012) and asked participants to report their self-described proficiency with changing Java code. Respondents rated their proficiency on a scale from 1 to 5 (where 1 indicates no proficiency and 5 indicates expert proficiency). Most respondents (10 participants) reported a proficiency of at least 3 (the scale midpoint or higher). Our sample also reported good familiarity with timing side-channel vulnerabilities, as almost half of the respondents (45.45%) self-reported that they knew what they were; 27.27% of participants were not familiar with the concept and 27.27% were unsure. Most participants had a Master's degree (55%, i.e., 6 participants), with the remaining ones having a Bachelor's degree (45%). All participants were male except for one female participant. The most common age group was 18-24 (63.6%), followed by 25-34 (18.2%), 35-44 (9.1%), and 45-54 (9.1%).
DifFuzzAR Scenarios
Respondents went through four scenarios. For each scenario, they were presented with a concrete example of DifFuzzAR's usage, where we provided Java code before and after applying the tool. After each scenario, we asked participants to use a 5-point Likert scale to indicate whether they trusted the refactorings produced by DifFuzzAR. When analyzing Likert-scale data, it is recommended to use a non-parametric statistical test, as the answers are not normally distributed (Lazar et al. 2017). Therefore, we analyzed the data using the Wilcoxon signed-rank non-parametric test with continuity correction. This test's null hypothesis is that there is no difference between the paired samples. To reject this hypothesis, the resulting p-value should be less than 0.05 (\(p < 0.05\)); when this happens, we conclude that there is a significant difference between the samples (Lazar et al. 2017).
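For readers unfamiliar with the test, the following is a minimal sketch of the Wilcoxon signed-rank statistic under the normal approximation with continuity correction; \(|z| > 1.96\) corresponds to \(p < 0.05\) (two-sided). The class and method names are ours, and the actual analysis presumably used a statistics package rather than hand-rolled code.

```java
import java.util.Arrays;

public class WilcoxonSketch {

    // Returns the z statistic of the Wilcoxon signed-rank test (normal
    // approximation with continuity correction) for paired samples x, y.
    static double signedRankZ(double[] x, double[] y) {
        // 1. Paired differences, dropping zeros (standard practice).
        double[] d = new double[x.length];
        int n = 0;
        for (int i = 0; i < x.length; i++) {
            double di = x[i] - y[i];
            if (di != 0) d[n++] = di;
        }
        if (n == 0) return 0; // identical samples: no evidence of a difference
        d = Arrays.copyOf(d, n);

        // 2. Rank |d|, assigning average ranks to tied values.
        double[] abs = new double[n];
        for (int i = 0; i < n; i++) abs[i] = Math.abs(d[i]);
        double[] sorted = abs.clone();
        Arrays.sort(sorted);
        double[] rank = new double[n];
        for (int i = 0; i < n; i++) {
            int below = 0, upTo = 0; // counts of values < and <= abs[i]
            for (double v : sorted) {
                if (v < abs[i]) below++;
                if (v <= abs[i]) upTo++;
            }
            rank[i] = (below + 1 + upTo) / 2.0; // average rank of the tie group
        }

        // 3. Sum of the ranks of the positive differences (W+).
        double wPlus = 0;
        for (int i = 0; i < n; i++) {
            if (d[i] > 0) wPlus += rank[i];
        }

        // 4. Normal approximation with continuity correction.
        double mean = n * (n + 1) / 4.0;
        double sd = Math.sqrt(n * (n + 1) * (2 * n + 1) / 24.0);
        double cc = wPlus > mean ? -0.5 : 0.5; // shrink toward the mean
        return (wPlus - mean + cc) / sd;
    }
}
```

For example, for paired samples `{10, 20, 30, 40, 50, 60}` and `{12, 18, 33, 38, 55, 58}`, the sketch yields \(z \approx -0.52\), i.e., no significant difference, which mirrors the kind of comparison reported below for the scenario ratings.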
Participants found the first two scenarios (control-flow and early-exit) very trustworthy, i.e., most respondents "agreed" or "strongly agreed" that they trusted the refactorings. Only one participant did not trust the early-exit scenario and, judging from their feedback, this was due to the mistaken belief that the refactoring did not preserve the behavior of the original code. Moreover, the Wilcoxon signed-rank test did not find a statistical difference between the first two scenarios (\(p > 0.05\)).
The remaining two scenarios (mixed vulnerabilities, one simple and one more complex), while generally trusted (55% "agreed" or "strongly agreed" that they trusted the refactorings), drew a more mixed response from participants. Two out of 11 participants stated they did not trust the refactoring done in scenarios 3 and 4. Their answers seem to indicate two reasons: (a) the participant was unable to understand the changes that had been made to the code (e.g., the participant did not understand the final version of the code); (b) the scenario was complex (e.g., the participant associated complexity with distrust). This lack of trust in more complex refactorings is consistent with previous literature on refactoring (Eilertsen 2012).
While these scenarios (3 and 4) were not found to be as trustworthy as the first ones (1 and 2), there was no statistical difference between any of them (\(p > 0.05\)).
Critical and non-critical scenarios
Having established that most participants trust DifFuzzAR's refactorings, we turn our attention to the critical and non-critical scenarios. The critical scenario describes programming authentication code and the non-critical one describes programming an application's GUI. All users found DifFuzzAR useful in the critical scenario (100% answered "agree" or "strongly agree") but less useful in the non-critical scenario (27% "agree" or "strongly agree"). These results are statistically different, with \(p = 0.00072\).
After analyzing participants' answers, our results suggest that users value DifFuzzAR's usage more in critical use cases and less in day-to-day coding, as one participant stated: "In the second (scenario), there is no risk associated with the functionality, no secrets are involved." The results also seem to indicate that preventing timing side-channel vulnerabilities is considered more important when their exploitation can lead to the loss of valuable secrets (as in the critical scenario).
Further improvements
After the previous questions, we asked participants to suggest improvements and future features that DifFuzzAR could implement. We coded their answers with an emergent coding scheme (Lazar et al. 2017), as we had no previous insights about what their answers could be. The frequency of their answers and codes can be seen in Table 2. It is important to note that respondents' answers may be coded with more than one code and some users may not have answered at all, so the total frequency of the codes does not necessarily match the number of participants in the study.
Table 2
Coded answers to "What functionalities/improvements would you like to see in this tool?" and frequency of answers

Code | Frequency
Correct other common vulnerabilities | 3
Provide more information about the changes | 3
I don't know | 2
Add analysis of nested functions | 1
Less complex changes | 1
Apply the tool to interpreted languages (e.g., Python) | 1
Improve variable names | 1
Some participants stated that they would like to see additional features in DifFuzzAR, such as the correction of other common vulnerabilities and support for other languages. One participant even went as far as suggesting they would like "to do this (use DifFuzzAR) in an interpreted language (e.g., Python) instead of a compiled language". Another common theme in users' suggestions was that DifFuzzAR could provide more information about the changes it makes, e.g., by adding comments explaining the changes, by using better variable names, or, as one user mentioned, with a "detailed profiling of the modified code, before and after it was modified (...)". Overall, users' comments were positive and their feedback suggested useful future features for DifFuzzAR.
6.2.5 Discussion
We were able to successfully answer the proposed research questions (see Sect. 6.2.2), and the insights gathered seem to confirm the usefulness of DifFuzzAR. We now briefly address each of our research questions.
RQ1. Do users trust DifFuzzAR’s refactoring?
Our results suggest that users trust DifFuzzAR's code transformations. They also seem to indicate that the complexity of the transformations affects trust, as simpler code transformations were seen as more trustworthy by participants.
RQ2. What would users like to see in a tool such as DifFuzzAR?
This study's results seem to indicate that our participants value a thorough tool that also corrects other common vulnerabilities and supports other programming languages. Users also seem to value a tool that is transparent about its functioning, that is, a tool that informs the user about the changes it makes to the code.
RQ3. Do users value the use of DifFuzzAR differently in critical and non-critical applications?
Our results seem to suggest that users value the use of DifFuzzAR differently in distinct situations. From the data gathered, participants appear to be more willing to use DifFuzzAR on more critical Java code. If this is the case, the potential impact of DifFuzzAR is greater, as this type of code directly affects the security of a product.
While this user study yielded statistically significant results, our sample size is relatively small, so the findings should be interpreted with care. Nevertheless, the study provided valuable insights that can inform future large-scale user studies. It also seems to confirm that users generally trust the refactorings produced by DifFuzzAR and that they see value in such a tool, particularly for more critical Java code. Finally, it provided suggestions for improving DifFuzzAR that can also be useful for other similar tools.