Chapter 4 — Diving Deep Into Synthesis and Ideation

Image by UX Indonesia on Medium

Over the past two weeks, Team Super Cool Scholars has been synthesizing the data from our primary and secondary research to generate insights that can guide us moving forward. Alongside that synthesis and our ongoing research, we have also been generating ideas and developing a conceptual prototype that we plan to test and validate with our end users.

Contextual Inquiry With Reviewers

We recently started conducting contextual inquiries with reviewers to understand their thought process when reviewing applications and how they use the ApplyGrad system. We are also looking to identify pain points in their process and discover areas that could benefit from machine learning. Our plan is to conduct contextual inquiries with reviewers from three programs of varying sizes (small, medium, and large), since the size of a program influences the issues it experiences.

Using the knowledge we have, we developed sample application materials representing a middle-of-the-pack applicant for each program. Creating applications for a top, middle, and bottom applicant would require resources we may not be able to access in the time we have, so we decided to focus on middle applicants for the contextual inquiry. Our findings also indicate that most programs spend more time on middle applicants than on top or bottom applicants.

Sample application materials that we created for a program

During these contextual inquiry sessions, reviewers access the fake application materials by using Zoom’s remote control feature to control our screen and navigate through our ApplyGrad sandbox account. So far, we have conducted one session with a reviewer from a large program, in which we asked them to go through the fake application materials we had placed into the sandbox account for review. We encouraged the reviewer to think aloud as they looked through the materials and assessed the applicant’s fit for their particular program.

As we finish up our remaining contextual inquiry sessions, we plan to synthesize the data we collect to identify themes and derive insights that will continue to inform idea generation.

Synthesizing Machine Learning Findings

During our last sprint, we started interviewing ML experts at CMU to get their input on using ML in the admissions process: the technical details of different ML models, the potential biases ML might introduce to the admissions process, and how we can design an ML system and reason about its impact without implementing it or having access to real application data (due to FERPA regulations).

In this sprint, we followed up with some of the experts we had previously interviewed and gathered our findings from the interviews. We then ran an affinity diagramming session to synthesize those findings into insights.

We used affinity diagramming to synthesize ML findings

After analyzing the links and relationships between the sticky notes containing our interview findings, we arrived at the following insights:

  1. Effort needs to be put into evaluating ML designs using simulated data
  2. Biases are common in review processes and generally hard to eliminate
  3. Reviewers take time to thoroughly evaluate applicants and prioritize effectiveness over efficiency
  4. We need to combine ML and human intelligence to assist with task automation and facilitate decision making
  5. Summarization & pulling out sensitive information can help present application data better
  6. We can draw inspiration from existing algorithms when designing the automation of review processes

In particular, for insight 2, one expert suggested that we focus on the specific steps in the review where biases usually occur and use machine learning to mitigate them. For example, ML could extract the gendered pronouns in letters of recommendation and substitute them with gender-neutral pronouns. Reviewers would then be less likely to be influenced by gender, which has long been a source of bias in recruitment, admissions, and similar settings.
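
To make this idea concrete, here is a minimal sketch of the kind of substitution the expert described. It uses simple regular expressions on a made-up sentence rather than a trained model; a real system would need context-aware handling (for example, deciding whether “her” should become “them” or “their”) and careful evaluation before touching actual letters.

```python
import re

# Gendered pronoun -> gender-neutral replacement. This toy mapping sidesteps
# hard cases (e.g., "her" as an object vs. a possessive) and sentence-initial
# capitalization; a real system would need coreference resolution or a model.
PRONOUN_MAP = [
    (r"\b[Hh]e\b", "they"),
    (r"\b[Ss]he\b", "they"),
    (r"\b[Hh]im\b", "them"),
    (r"\b[Hh]is\b", "their"),
    (r"\b[Hh]er\b", "their"),    # assumes possessive use
    (r"\b[Hh]ers\b", "theirs"),
]

def neutralize_pronouns(text):
    """Replace gendered pronouns in a recommendation letter with neutral ones."""
    for pattern, replacement in PRONOUN_MAP:
        text = re.sub(pattern, replacement, text)
    return text

letter = "She excelled in her coursework and defended her thesis a year early."
print(neutralize_pronouns(letter))
```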

In addition, since it’s almost impossible for ML to completely remove biases from the admissions process, one expert suggested that we use current admission review practices as a baseline for mitigating bias. “As long as your system can reduce the biases in current practices, it will be valuable,” said the expert.

We also identified specific algorithms and adjacent domains that we might draw inspiration from, including pairwise comparison, hierarchical clustering for grouping, and peer review in academic conferences and journals. We will refer to work in these areas as we progress into the next stages of our synthesis and ideation.
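
As a rough illustration of how hierarchical clustering might group similar applicants for review, here is a small sketch using SciPy. The applicant features, their values, and the number of groups are all invented, since we do not have access to real application data.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

# Hypothetical, already-normalized applicant features:
# [GPA, quantitative test score, years of research experience].
applicants = np.array([
    [0.92, 0.88, 0.50],
    [0.90, 0.85, 0.40],
    [0.60, 0.70, 0.10],
    [0.62, 0.68, 0.00],
    [0.75, 0.95, 0.80],
])

# Pairwise distances between applicants feed the hierarchical clustering.
distances = pdist(applicants, metric="euclidean")
tree = linkage(distances, method="average")

# Cut the tree into two groups that reviewers could compare within.
groups = fcluster(tree, t=2, criterion="maxclust")
print(groups)  # applicants sharing a label would be reviewed side by side
```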

Synthesizing Speed Dating Findings

We wrapped up our pretotype testing and competitive analysis this week and held two separate meetings to synthesize the research results.

Pretotype Testing

We tested our pretotype with four reviewers from programs of varying sizes. All four participants provided valuable ideas on the opportunities for improvement in ApplyGrad. We found that reviewers’ needs mainly lie in three areas:

  1. Information extraction: Reading application materials is a manual and time-consuming process. Programs receive applications from all over the world, and their transcripts use different grading criteria and course names. Reviewers would like something to assist them with extracting valuable information and standardizing grades & test scores.
  2. Better note-taking system: Three out of four reviewers wished there were a better way to take notes in ApplyGrad. Most of them used outside tools such as Excel sheets and notepads to organize their notes. Clean and structured notes not only help them write evaluations but also assist with discussions during Admission Committee meetings.
  3. Customizable grouping: Three out of four reviewers mentioned that they group similar applicants together to compare them and to ensure that they apply consistent standards. Different programs use different grouping mechanisms: for example, one program groups applicants by letter writer, while another groups applicants by academic background. A quick sketch of this idea follows below.
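
As a sketch of how customizable grouping could work, the snippet below groups hypothetical applicant records by whichever key a program chooses; the column names and values are invented for illustration.

```python
import pandas as pd

# Hypothetical applicant metadata; real data is protected by FERPA.
applicants = pd.DataFrame({
    "applicant": ["A1", "A2", "A3", "A4", "A5"],
    "letter_writer": ["Prof. X", "Prof. Y", "Prof. X", "Prof. Z", "Prof. Y"],
    "background": ["CS", "Design", "CS", "HCI", "Design"],
})

# Each program picks its own grouping key (letter writer, academic
# background, region, ...) and reviews applicants within those groups.
group_by = "letter_writer"
for key, group in applicants.groupby(group_by):
    print(key, "->", list(group["applicant"]))
```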

We also collected reviewers’ opinions on the use of automation/ML in the review process. We found that most reviewers prefer ML to play a facilitating role rather than make predictions. Reviewers were concerned that an ML system might perpetuate biases present in historical data and influence their own judgments.

“I would not feel comfortable recommending anything be implemented without input from someone who studies fairness and can see the issues.” — reviewer

“I will be horrified if it directly makes a decision or recommendation on the admission decisions.” — reviewer

“When machines affect humans, we have to be extremely careful.” — reviewer

In addition, many of the solutions reviewers brought up focused on using automation/ML to improve the fairness of admission decisions, e.g. detecting implicit bias in past and current admissions data and drawing attention to information people may have missed.
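
As one example of what such a bias check could look like, the sketch below compares admit rates across a demographic attribute on hypothetical historical data and flags large gaps for the committee’s attention rather than acting on them automatically. The data, attribute, and threshold are all invented.

```python
import pandas as pd

# Hypothetical historical decisions; real admissions data is protected by FERPA.
history = pd.DataFrame({
    "gender":   ["F", "M", "F", "M", "F", "M", "M", "F"],
    "admitted": [1,    1,   0,   1,   0,   1,   0,   1],
})

# Admit rate per group; a large gap is surfaced to humans, not acted on.
rates = history.groupby("gender")["admitted"].mean()
gap = rates.max() - rates.min()
print(rates)
if gap > 0.15:  # arbitrary threshold, for illustration only
    print(f"Flag for committee attention: admit-rate gap of {gap:.0%} across groups")
```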

Competitive Analysis

To understand how other platforms approach similar problems, we performed a competitive analysis of systems similar to ApplyGrad. These platforms include Interfolio and Slate, which some SCS departments use for their admissions process, as well as the recruiting platform Element451. Our goal was to understand how other systems handle and display applicant information and what features they offer to facilitate review. We found that many of these platforms integrate spreadsheet functions such as sorting and grouping, which improve efficiency and eliminate the need for external tools. Some of them also support more customization than ApplyGrad, letting users tailor the review process to their own program’s needs.

Grouping feature in Element 451’s Decision451 module
Slate interface showing application review progress

One direction we are exploring is an intervention that helps reviewers keep track of their progress in committee meetings, or reminds them if there is an important component of an application they have not yet addressed. We therefore also looked at speech analysis platforms and speech-to-text AI systems such as Otter.ai, Primer.ai, Abridge, and MedRespond. We found that these technologies improve communication between groups of people through summaries and debriefs, so that people don’t forget or miss key information. We also found that timelines are helpful for presenting the whole narrative of an individual or event; perhaps we could use this to help reviewers remember the stories of applicants.

Design Opportunities

Based on the insights gathered from the speed dating interviews, our ML research, and the competitive analysis, we constructed a reviewer journey map visualizing the pain points and design opportunities at different stages of the review process.

Joint journey map highlighting pain points and opportunities

Before Review

Journey before reviewing application materials

Most of the work before the review is handled by administrators; reviewers simply receive the list of applicants assigned to them through ApplyGrad. We found that many programs divide the work by applicants’ geographical location or academic background, assigning the reviewers who know that place or area best. Thus, one opportunity we’ve identified is to provide predefined grouping criteria to help administrators and reviewers group applications.

During Review

Journey while reviewing application materials

This is the stage that reviewers spend the most time on and where most of the design opportunities lie.

  1. Reviewing Transcripts: Programs receive transcripts from all over the world, and different schools use different grading criteria. In addition, many programs only look at relevant courses on transcripts, and identifying those is a manual, time-consuming process. One opportunity here is a mechanism (for example, keyword extraction on course names) to pull relevant data out of transcripts and standardize grades & test scores; see the sketch after this list.
  2. Reviewing Statement of Purpose (SOP) & Letters of Recommendation (LOR): The SOP and LORs are the two places where many reviewers spend the most time. Both are textual data and thus take more time for reviewers to digest and extract information from. During pretotype testing, some reviewers mentioned that they want a ‘Google Docs comment system’ where they can highlight important information and leave comments alongside it. Building on this idea, another opportunity is to use ML to learn reviewers’ reviewing patterns and highlight similar information in other materials or applications to reduce their workload.
    In addition, a potential issue with LORs is that they are subject to the letter writers’ biases. Different letter writers also have different writing styles and standards, which makes calibration difficult. To mitigate this, we could allow reviewers to pull up all letters submitted by the same letter writer at once, helping them develop a better sense of each writer’s style.
  3. Comparing Applicants & Calibrating Review: Some reviewers mentioned that they constantly go back to previously reviewed applications to check whether they are applying the same standards throughout the process. However, this is hard to do in ApplyGrad now, as there is no easy way to group applications and make comparisons. One opportunity is to let reviewers tag applications and easily pull up all applications under the same tag. We could also implement an audit system that presents statistics and visualizations to help reviewers check their scoring trends and identify implicit bias.
  4. Cross-referencing: We found that besides viewing each material separately, reviewers also cross-reference materials with one another. For example, if they find an interesting internship experience on an applicant’s resume, they will look at the SOP to learn more about it. Inspired by our competitive analysis, another opportunity is to use ML/AI to consolidate all of this information and present each application in a timeline format, helping reviewers process and remember the applicant’s whole narrative.
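
The sketch below illustrates the transcript-processing idea from opportunity 1: keep only the courses whose names match program-relevant keywords and map grades from different scales onto a common 4.0 scale. The keyword list, scale conversions, and sample transcript are all invented for illustration.

```python
RELEVANT_KEYWORDS = {"machine learning", "statistics", "human-computer", "design"}

# Very rough conversions from a few grading systems to a 4.0-style scale.
SCALE_TO_4 = {
    "percent": lambda g: round(g / 100 * 4.0, 2),    # e.g., 87 -> 3.48
    "ten_point": lambda g: round(g / 10 * 4.0, 2),   # e.g., 8.2 -> 3.28
    "four_point": lambda g: g,
}

def relevant_courses(transcript, keywords=RELEVANT_KEYWORDS):
    """Keep only courses whose names mention a program-relevant keyword."""
    return [c for c in transcript
            if any(k in c["course"].lower() for k in keywords)]

transcript = [
    {"course": "Intro to Machine Learning", "grade": 87, "scale": "percent"},
    {"course": "Art History II", "grade": 9.1, "scale": "ten_point"},
    {"course": "Applied Statistics", "grade": 8.2, "scale": "ten_point"},
]

for course in relevant_courses(transcript):
    standardized = SCALE_TO_4[course["scale"]](course["grade"])
    print(course["course"], "->", standardized)
```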

After Review

Journey after reviewing application materials

For most programs, after all the reviewers finish their reviews, the admission committee holds an Admission Committee (AC) meeting to discuss applicants and make final admission decisions.

One pain point reviewers have is that they sometimes struggle to remember the details of applications, especially ones they reviewed weeks before the AC meeting. We think the note-taking system we proposed earlier could mitigate this issue: with it, reviewers could easily reference direct quotes from the application materials and any notes or comments they made during their review.

Another opportunity here is to use speech-to-text technology to track and record discussions so that reviewers can easily go back and see what has or hasn’t been addressed. Reviewers would also be able to edit, organize, comment on, and highlight the notes as they wish. However, there are privacy concerns with this solution, as people might not want everything they say to be recorded.
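
Assuming a transcript of the discussion has already been produced by a tool like Otter.ai, the core of this idea could be as simple as checking which application components have not yet been mentioned. The component-to-keyword mapping and the sample discussion below are invented for illustration.

```python
# Application components mapped to words that would signal they were discussed.
COMPONENTS = {
    "transcript / grades": ["gpa", "transcript", "grades"],
    "statement of purpose": ["statement of purpose", "sop"],
    "letters of recommendation": ["letter", "recommender"],
    "research experience": ["research", "publication", "internship"],
}

def unaddressed_components(meeting_text):
    """Return application components never mentioned in the discussion so far."""
    text = meeting_text.lower()
    return [name for name, keywords in COMPONENTS.items()
            if not any(k in text for k in keywords)]

discussion = "Her GPA is strong and the letters are glowing, but what about fit?"
print(unaddressed_components(discussion))
# -> ['statement of purpose', 'research experience']
```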

Conceptual Prototypes

As we continue to conduct research with our primary users, we are uncovering ways in which the review process could be made more efficient for them. To test whether our ideas will be useful to end users, we developed a conceptual prototype. Drawing on the insights derived from both primary and secondary research, the prototype focuses on assisting reviewers with identifying important information in application materials.

Because reviewers spend the most time on application materials, we directed the focus of our prototype to that stage of the review process. Our main objective is to test the need for a better review system in ApplyGrad. Given that our research has shown reviewers are hesitant about integrating machine learning into the review process, we also want to gauge their level of comfort with ML and see whether it could aid them during this stage. Overall, we want to observe how end users interact with the prototype and whether they believe it would be valuable to their review process.

During our conceptual prototype testing, we will first ask reviewers a few overview-level questions about how they review applicants and the kinds of tools they use in the process; they will then begin looking through the application materials. Reviewers will assess three fake applications that we created for their program, hosted in our Google Drive.

Application review #1:

We plan to use a ‘Wizard-of-Oz’ technique for our conceptual prototype. As reviewers look through the first application, they will be given a spreadsheet where they can enter notes during their review, and we will ask them to think aloud. Our research has shown that most reviewers use an external spreadsheet to take notes during their review, so we want to simulate a similar experience for the first application review. We used Google Sheets to build a form that mimics the ApplyGrad interface, which reviewers can use to assess and comment on application materials as they go through the applications.

As the reviewer goes through the first application, the session’s notetaker will open the third application and highlight the information that the reviewer has indicated is important to them when reviewing an application (e.g. if the reviewer indicates that they usually scan for the name of the program, that information will be highlighted in the third application).
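
This Wizard-of-Oz step has a human simulating the behavior we imagine an eventual system could provide: scoring passages in a new application against what the reviewer highlighted earlier and surfacing the closest matches. A minimal sketch of that behavior using TF-IDF similarity from scikit-learn is below; the snippets and the highlight threshold are invented.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Passages the reviewer highlighted in earlier applications (invented examples).
highlighted = [
    "led an independent research project on accessibility",
    "completed a machine learning internship at a startup",
]

# Sentences from the next application to be reviewed (also invented).
candidates = [
    "I spent last summer on a machine learning internship.",
    "I enjoy hiking and photography on weekends.",
    "My senior research project focused on accessible interfaces.",
]

# Score each candidate sentence by its best match against earlier highlights.
vectorizer = TfidfVectorizer().fit(highlighted + candidates)
scores = cosine_similarity(
    vectorizer.transform(candidates),
    vectorizer.transform(highlighted),
).max(axis=1)

for sentence, score in zip(candidates, scores):
    marker = "HIGHLIGHT" if score > 0.3 else "         "  # arbitrary threshold
    print(f"{marker}  {score:.2f}  {sentence}")
```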

Application review #2:

Once reviewers have completed the review of the first application, they will move on to the second. For this review, the reviewer will highlight and comment on the document using Google Docs features as they see fit. The notetaker will continue to mark up the third application with any additional information learned during this review.

Application review #3:

Once they have completed the review of the second application, reviewers will review the third application, which the notetaker has highlighted and marked up. For this final application, as the reviewer comments on the application materials, the notetaker will add those comments into the spreadsheet review form on the reviewer’s behalf. The reviewer may remove, alter, or add information in the review form as they see fit.

Using these methods, we want to test our hypothesis that machine learning systems that assist reviewers through the process, rather than making judgments for them, can expedite the review without introducing new or additional bias into the admissions process. We will also measure how long it takes to go through each application in order to evaluate efficiency. In addition, we have included questions to gauge reviewers’ comfort level with ML and how they believe the intervention would affect fairness and efficiency.

Conceptual Prototype Feedback:

During our class critique, we received valuable feedback on how to handle ML and fairness issues and on the quantitative metrics we could collect. Instead of concentrating only on our own domain, we can loop in our competitive analysis and dig deeper into how those products handle relevant laws and regulations. In addition, we could break down our observations into quantitative measures such as the number of clicks, notes, and highlights.

Next Steps

We are planning to conduct three more contextual inquiries with reviewers this week, after which we will synthesize the results and develop insights. We are also scheduling conceptual prototype testing sessions and preparing application materials to use during them. Currently, we have a set of three applications for one program. Due to FERPA regulations, our client was not able to share application materials from their programs with us; however, they did give us suggestions on where to find templates and sample statements of purpose, resumes, and letters of recommendation, as well as how to recruit students. We plan to conduct 4–5 sessions with reviewers, each from a different program.

In addition, we will continue our machine learning research and dive deeper into natural language processing/text analysis systems (such as Watson Natural Language Understanding) by trying them out on sample application materials. Our goal is to better understand what current technologies support so we can ensure the feasibility of our proposed solution.
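
We expect these experiments to look roughly like the sketch below, which sends a fake statement-of-purpose excerpt to Watson Natural Language Understanding through its Python SDK (ibm-watson) and prints the extracted keywords. The credentials, service URL, and version string are placeholders, and the exact options may differ from what we end up using.

```python
from ibm_watson import NaturalLanguageUnderstandingV1
from ibm_watson.natural_language_understanding_v1 import (
    Features, EntitiesOptions, KeywordsOptions,
)
from ibm_cloud_sdk_core.authenticators import IAMAuthenticator

# Placeholder credentials and endpoint for the Watson NLU service.
authenticator = IAMAuthenticator("YOUR_API_KEY")
nlu = NaturalLanguageUnderstandingV1(version="2021-08-01", authenticator=authenticator)
nlu.set_service_url("YOUR_SERVICE_URL")

# A fake statement-of-purpose excerpt, similar to our sample materials.
sop_excerpt = (
    "During my internship at Acme Robotics I built a recommendation system, "
    "which motivated me to pursue graduate study in human-computer interaction."
)

# Ask the service for the entities and keywords it finds in the excerpt.
response = nlu.analyze(
    text=sop_excerpt,
    features=Features(
        entities=EntitiesOptions(limit=5),
        keywords=KeywordsOptions(limit=5),
    ),
).get_result()

for keyword in response.get("keywords", []):
    print(keyword["text"], keyword["relevance"])
```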
