The Next Phase — Chapter 7
The summer semester starts with a bang for the Super Cool Scholars. Our team put a whole slate of changes and improvements into action. As a team, we took a careful look at our previous successes and failures. One change was defining clearer responsibilities for roles, and for people with mixed roles, so that everyone can use their strengths and have their weaknesses covered. For example, Ugochi started seeking advice from faculty with strong research backgrounds, supplementing the strong planning skills of her co-lead, Lia, and covering for Lia’s limited time. We also set aside time during our meetings for Yuwen, our technical lead, to debrief ML research in a digestible way so that the team stays updated and aligned on technical details. In addition, each team member now contributes their check-ins to the daily agenda to increase task transparency, encourage thinking ahead, and help our PM, Emily, distribute work. Another major change was sharing sketches every day to improve internal and external communication and offload some simple design tasks from Anna, our design lead.
Modified GV Two-Week model
Building on our strengths in organization and planning, we created a sprint plan and retrospective that are more visual and easier to understand. More specifically, we are using a Gantt chart to track hard and soft deadlines so we can reshuffle tasks without worrying about losing sight of what we should concentrate on. We also adopted a modified two-week version of the Google Ventures (GV) design sprint to prototype and test quickly.
Narrowing Design Directions
Based on our Spring semester’s research insights and client feedback, we narrowed our focus to three design directions we wanted to explore: onboarding, data manipulation, and note-taking. For each direction, we summarized what we had found and what we want to explore further. We also visualized the results from the ‘What’s On Your Radar’ activity that we did with our clients in the Summer Planning Meeting last semester, which helped us evaluate the three directions in terms of impact and risk.
Furthering Our Understanding
We also identified four main personas that are impacted by the ApplyGrad system: admins, reviewers, new reviewers, and admission heads. We will use them as references to keep our stakeholders’ needs and goals in mind during future design iterations.
In our previous research, we found that many users had a hard time navigating ApplyGrad’s website and understanding the system’s prompts. Solving these usability problems could significantly improve the user experience. To take a deeper look at the usability of the interface, we conducted another round of heuristic evaluation. We also summarized common features that users found missing in ApplyGrad and currently rely on external tools for.
One of the team’s biggest obstacles last semester was recruiting participants for research. One way we decided to tackle the issue this semester was by recruiting non-reviewers to test prototypes that reviewers might not need to test. For example, one of the directions we are exploring is onboarding, which is more suitable for individuals who have never encountered ApplyGrad than for reviewers who are already used to the system. Testing with non-reviewers will give us an idea of how new users navigate ApplyGrad, helping us design an efficient and helpful system for both new and experienced reviewers.
For reviewers, one of our goals this semester was to recruit participants from a larger number of programs than we reached during the spring semester. We therefore focused on recruiting reviewers from programs we had not yet contacted, in addition to individuals we had previously involved in our research. For non-reviewers, we drew on our personal and professional networks, as well as help from faculty members, to distribute our testing information. As a result, we gathered over 40 participants, and we plan to continue recruiting both non-reviewers and reviewers.
Sketching and Voting
The first stage of the sprint is to brainstorm potential solutions. Drawing on our spring research, we asked everyone to identify a major problem within the three main areas we wanted to improve: onboarding, data manipulation, and note-taking. We sketched in parallel and then came together as a group to decide what we wanted to do. We voted on the designs that best expressed our ideas and then designed a lo-fi prototype.
UX Target Tables
While adding more details to our lo-fi prototype, we developed a UX target table based on a user flow; it lists users’ main tasks, the rationale for each, and metrics we can use to benchmark improvements. Our objective for this sprint is to capture baseline metrics for how long it takes people to complete tasks in ApplyGrad, along with comparable timing data for a new version of ApplyGrad. We did this because we want to track performance and success against our goal of being “efficient and effective.” We want to compare the before (the current interface, our control) with the after (our new UI, based on the designers’ mental model).
For our lo-fi tests, we decided to use A/B testing with the original ApplyGrad interface and our new lo-fi prototypes. We used the Maze platform to set up our prototypes and tasks, which are based on our user flow and illustrate the main tasks that reviewers perform in the system. Maze also gives us the option of running unmoderated tests in addition to live tests.
For both kinds of prototypes, we tested with reviewers as well as participants who had never encountered the ApplyGrad system. Our participants were divided into four main groups: reviewers and non-reviewers testing the original ApplyGrad prototypes, and reviewers and non-reviewers testing the new lo-fi prototypes. For the live tests, we met participants on Zoom and asked them questions along the way to gather more feedback on their experience with the prototypes.
Moving forward, we are planning to analyze and synthesize the quantitative and qualitative data we have generated and use the findings to make changes to our prototype.
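To give a flavor of the quantitative side of this synthesis, the sketch below compares task completion times across our four testing groups. All group names, task times, and numbers here are made up for illustration; they are not our real study data or the actual Maze export format.

```python
from statistics import mean, stdev

# Hypothetical task-completion times in seconds for one task,
# keyed by (interface, participant group). Invented values only.
times = {
    ("original", "reviewer"):     [95, 110, 102, 88],
    ("original", "non-reviewer"): [140, 155, 132, 149],
    ("lo-fi",    "reviewer"):     [70, 82, 77, 69],
    ("lo-fi",    "non-reviewer"): [90, 101, 95, 87],
}

def summarize(samples):
    """Mean and standard deviation of one group's completion times."""
    return mean(samples), stdev(samples)

for (ui, group), samples in times.items():
    m, s = summarize(samples)
    print(f"{ui:>8} / {group:<12} mean={m:6.1f}s  sd={s:5.1f}s")

# Relative speed-up of the new UI for each participant group
for group in ("reviewer", "non-reviewer"):
    before = mean(times[("original", group)])
    after = mean(times[("lo-fi", group)])
    print(f"{group}: {100 * (before - after) / before:.0f}% faster on the lo-fi prototype")
```

A summary like this would sit alongside the qualitative feedback, with the “original” groups serving as the control in the before/after comparison described above.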
Conceptual Prototype Testing
During the spring semester, we tested a conceptual note-taking system with three reviewers in the Human-Computer Interaction Institute. While we received generally positive feedback about this possible design direction, we wanted to continue testing the prototype with reviewers from other programs. Doing so would not only validate or refute our earlier hypothesis, but could also help us decide which design directions to pursue. Replicating the note-taking experience in low-fidelity prototypes can be tricky, so the conceptual prototype complements our lo-fi testing efforts as well.
Continued testing of our conceptual prototype also gave us an opportunity to modify it for a better experience. Originally, we had users review three applications: one as they normally would, one with the ability to add comments and highlights, and one with simulated machine learning annotations based on their previous review behavior. We found, however, that testing all three at once often exceeded our allotted time with participants. We were also most interested in our riskiest assumption, the one about ML solutions. For this round of testing, we elected to have reviewers complete one review normally and one with the simulated ML annotations.
Over the course of the week, we tested our conceptual prototype with four reviewers from the MSE, MSAS, and METALS programs. Our findings from these sessions were vastly different from our earlier sessions with HCII reviewers. The most important difference was the variance in comfort with and use of ML solutions. Generally, these reviewers did not use any of the available annotation features. Many commented that the ML annotations were distracting, or that they would not trust any sort of algorithm to highlight or tag information. In contrast, however, some reviewers preferred the idea of using an ML algorithm because they felt it minimized the potential for bias.
Additionally, in the course of testing, we generated ideas for potential note-taking solutions for our mid-fidelity prototype. While discussing potential note-taking systems, one reviewer mentioned that they have trouble quickly finding exact passages or information during admissions committee meetings. We discussed the potential of a tagging system reviewers could use to highlight key information, which would then automatically populate a review form. Each tag in the review form would link back to the application materials. The reviewer noted that this feature would be useful for quick reference in admissions committee meetings. We plan to further explore this possible feature in our mid-fidelity prototypes!
During this sprint, we focused on exploring machine learning opportunities in our note-taking and annotation design direction. In our conceptual prototype testing, we tried to understand reviewers’ attitudes towards an automated highlighting feature that pulls out relevant information from application materials, and we had received positive responses from reviewers about it. While we continue testing this conceptual prototype with more reviewers, this sprint we also searched for proof of the feature’s technical feasibility.
We found one article in the domain of electronic medical record systems (King et al., 2019) that is closely related to our idea. King et al. developed a learning electronic medical record (LEMR) system that uses ML to help clinicians by auto-highlighting sections of a patient’s clinical record, making it easier to summarize the patient’s condition.
This paper inspired us to frame highlighting as building a predictive model for each word in a document, which converts the highlighting problem into a traditional ML prediction problem. However, the paper’s results also raised some concerns about our direction: there was no significant difference in the time clinicians took to process a patient’s case with the highlighted record compared to the un-highlighted one. In addition, in a highlight-only version of their interface, clinicians identified 18% of the data items not shown as having a major clinical impact on the diagnosis of patients. This means that 18% of relevant data, which might affect a patient’s health and life, was not highlighted by their model. In our case, such inaccuracy might significantly impact the fairness of reviewers’ work.
That said, we need to keep in mind that our domain differs from the LEMR case, and we will continue exploring papers on similar application scenarios in the upcoming sprints.
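To make the per-word framing concrete, here is a minimal toy sketch of treating “should this word be highlighted?” as a binary prediction per token, estimated from past reviewer highlights. This is not King et al.’s actual model (they used richer features and real clinical data); the documents, highlights, and smoothing scheme below are all invented for illustration.

```python
from collections import Counter

# Toy training data: (document text, set of words the reviewer highlighted).
# These examples are entirely made up.
past_docs = [
    ("strong research background in machine learning", {"research", "machine", "learning"}),
    ("led a research project on learning systems", {"research", "learning"}),
    ("enjoys hiking and photography", set()),
]

# Count, for each token, how often it appeared and how often it was highlighted.
highlighted = Counter()
total = Counter()
for text, highlights in past_docs:
    for token in text.split():
        total[token] += 1
        if token in highlights:
            highlighted[token] += 1

def highlight_probability(token, smoothing=1.0):
    """Smoothed estimate of P(highlight | token) from past behavior."""
    return (highlighted[token] + smoothing) / (total[token] + 2 * smoothing)

def predict_highlights(text, threshold=0.5):
    """Return the tokens in a new document the model would highlight."""
    return [t for t in text.split() if highlight_probability(t) > threshold]

print(predict_highlights("research experience in machine learning"))
# → ['research', 'machine', 'learning']
```

A real version would need context-aware features rather than isolated token counts, but even this toy model shows where the 18% problem comes from: any word the threshold misses simply never gets surfaced to the reviewer.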
Synthesis of low-fi prototype testing findings
After our low-fi prototype testing, we are starting to synthesize the findings we gathered during these sessions. We will bring together the qualitative and quantitative data we collected throughout the process and use the insights to inform our iterative design process.
Drawing on the findings and insights from our low-fi prototype, we will continue fleshing out the details and build a mid-fi prototype during the next sprint. Given that our low-fi prototype already has relatively high fidelity, for the next iteration we will mostly consider iterating on the following aspects:
- Making the user flow in the system more natural for reviewers
- Bridging the gap between mental models of actual reviewers and designers
- Prototyping and testing more innovative ideas that would benefit reviewers, committees, and applicants
Emotional & business benefits of our system
While we have been trying to integrate more emotional aspects of reviewers’ experiences into ApplyGrad, starting from the next sprint we will take a more holistic approach to understanding and improving the emotional side of review work in the system. Specifically, we seek to make review work less stressful while maintaining the same level of seriousness, since it impacts the lives of thousands of students every year. In addition, we want to help reviewers gain a sense of achievement from using the system, as their work is truly impactful and necessary.
Meanwhile, our team will start to focus more on the business value ApplyGrad can bring to the School of Computer Science at CMU. Financial gain has never been the primary goal of ApplyGrad; instead, we will use design to help relieve reviewers’ workload and to help ApplyGrad gather more valuable data for CMU.
Andrew J. King, Gregory F. Cooper, Gilles Clermont, Harry Hochheiser, Milos Hauskrecht, Dean F. Sittig, Shyam Visweswaran. Using machine learning to selectively highlight patient information. Journal of Biomedical Informatics, Volume 100, 2019, 103327. https://doi.org/10.1016/j.jbi.2019.103327