Abstract
Objective. To evaluate the validity and reliability evidence of the preceptor assessment of student tool (PAST), which was designed to assess doctor of pharmacy (PharmD) student rotation performance.
Methods. Evaluation data were loaded into WINSTEPS software to conduct a Rasch rating scale analysis. Validity evidence was examined from construct and content validity perspectives, and reliability was assessed via the student and item separation indices and reliability coefficients. Data from 435 observations were included in the analysis.
Results. All 19 items measured the same construct of interest, and the five-point rating scale functioned appropriately and differentiated students’ ability. However, the item/person map indicated an absence of items at the extremes of the measurement continuum.
Conclusion. Although adding items at the extremes of the measurement continuum may be beneficial, PAST showed good validity and reliability evidence when used to evaluate PharmD student rotations and is suitable for assessing mastery learning.
INTRODUCTION
All doctor of pharmacy (PharmD) programs in the United States require students to gain experience in pharmacy practice through introductory pharmacy practice experiences (IPPE) or advanced pharmacy practice experiences (APPE) before graduation.1,2 The overall goal of these practice experiences is to educate students to be competent health care professionals who provide patients with optimal pharmaceutical care. This goal reflects the recommendations of multiple organizations and reports. According to the Institute of Medicine (IOM) report on health professions education, health care providers should focus on five competencies: delivering patient-centered care, using evidence-based practice, improving care quality, employing information technology, and working in interdisciplinary teams.3 Additionally, the American Association of Colleges of Pharmacy (AACP) released the Center for the Advancement of Pharmacy Education (CAPE) 2013 Educational Outcomes, a list of terminal educational outcomes for PharmD students comprising four main domains (ie, Foundational Knowledge, Essentials for Practice and Care, Approach to Practice and Care, and Personal and Professional Development) with 15 subcategories.4 To ensure that students and colleges are meeting the CAPE outcomes and Accreditation Council for Pharmacy Education (ACPE) accreditation standards, frequent assessment of student knowledge, skills, and abilities is required, and the evaluation instrument used to assess students’ performance should be reliable and valid.5
The College of Pharmacy at the University of Arizona developed the preceptor assessment of student tool (PAST), based on the CAPE 2013 competencies, to assess students’ performance on rotations and measure their mastery learning of the CAPE outcomes. Volunteer faculty, known as preceptors, used the instrument to evaluate students’ performance and assign a final grade for each rotation. PAST comprises 19 items covering the CAPE competencies (Appendix 1) and was administered via an online platform, CORE Elms (RxInsider, West Warwick, RI). The rating scale contained five hierarchical anchors: beginner, intermediate, proficient, advanced, and distinguished. PAST was implemented in Fall 2015, and both student and preceptor feedback has been positive, with many describing it as easy to use and helpful. Its validity and reliability, however, had not been systematically evaluated. Several studies6-9 have described tools used to evaluate student IPPE and APPE performance, but, to our knowledge, no paper has been published that evaluates the performance of an instrument using the CAPE 2013 competencies as its core elements. This study was the first attempt to evaluate the validity and reliability of PAST, which was developed to assess PharmD student rotation performance, and was intended to lay the groundwork for future research.
METHODS
Data from August 2015 to March 2016, covering five rotation blocks, were downloaded from CORE Elms and transformed into a Microsoft Excel (Microsoft Corp, Redmond, WA) file. The data were then loaded into WINSTEPS, version 3.92.0 (SWREG, Minnetonka, MN), to conduct a Rasch analysis. The analysis was performed on PAST forms completed by preceptors as evaluations of student rotation performance. In this study, the construct of interest refers to PharmD students’ rotation performance in mastering the CAPE outcome competencies. Item difficulty is the degree of difficulty of each mastery learning goal, and student ability represents student competency in mastering the learning goals as rated by preceptors.
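To illustrate the kind of data preparation involved, the following is a minimal sketch rather than the authors’ actual pipeline; the file names, column names, and anchor-to-score mapping are hypothetical and would need to match the actual CORE Elms export.

```python
# Minimal sketch (hypothetical paths and column names): reshape long-format PAST
# ratings exported from CORE Elms into one row per student rotation with 19 item
# columns, the wide layout expected by a Rasch rating scale analysis.
import pandas as pd

raw = pd.read_excel("past_export.xlsx")  # assumed columns: student_id, rotation_block, item, rating

# Map the five anchors to ordered integer categories.
anchor_codes = {"beginner": 1, "intermediate": 2, "proficient": 3,
                "advanced": 4, "distinguished": 5}
raw["score"] = raw["rating"].map(anchor_codes)

# One row per evaluation, one column per item (1-19).
wide = raw.pivot_table(index=["student_id", "rotation_block"],
                       columns="item", values="score")

# Keep only fully completed evaluations, mirroring the exclusion of incomplete forms.
wide = wide.dropna().astype(int)
wide.to_csv("past_wide.csv")
```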
The Rasch rating scale model, a one-parameter item response theory (IRT) approach that accommodates polytomous data, was used to assess the validity and reliability of PAST in this study. The one-parameter IRT model assumes that the only factor differentiating the item characteristic curves (ICCs) of the various items is item difficulty, and that all items have equal discriminating ability, as reflected by the parallel slopes of the ICCs.10 We hypothesized that our data fit the one-parameter IRT model and tested this assumption by inspecting the ICCs.
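For reference, a standard formulation of the rating scale model (a general statement of the model, not reproduced from this analysis) gives the probability that the preceptor rating of student $n$ on item $i$ falls in category $k$ as

$$
P_{nik} = \frac{\exp\left(\sum_{j=0}^{k}\left(B_n - D_i - F_j\right)\right)}{\sum_{m=0}^{M}\exp\left(\sum_{j=0}^{m}\left(B_n - D_i - F_j\right)\right)}, \qquad F_0 \equiv 0,
$$

where $B_n$ is student ability, $D_i$ is item difficulty, $F_j$ are the category thresholds shared by all items, and $M$ is the highest category (here, $M = 4$ for the five anchors from beginner to distinguished). No item-specific discrimination parameter appears, which is what yields the parallel ICCs of the one-parameter model.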
Validity evidence was examined from construct and content validity perspectives, and reliability evidence was assessed by examining the student and item separation indices and reliability coefficients.11,12 Specifically, construct validity was evaluated from item goodness-of-fit statistics, principal component analysis (PCA) of residuals, and rating scale functioning.13 Content validity was assessed via the item calibration distribution, gaps in the measurement continuum, and population targeting.11
Item goodness-of-fit statistics and the PCA of residuals were used to test the unidimensionality of the instrument, that is, to confirm that all items measured the same underlying construct of interest. Item goodness-of-fit statistics were based on discrepancies between observed and model-predicted values. Accepted Infit and Outfit mean square (MNSQ) values should fall between .5 and 1.5.10,14-16 Items with MNSQ values <.5 were considered to overfit the model; they might lead to a misleadingly high reliability estimate but would not degrade the results. Items with MNSQ values between .5 and 1.5 were deemed to contribute sufficiently to measuring the construct. Items with MNSQ values between 1.5 and 2.0 were treated as underfitting the model; they did not weaken the scale but did not contribute to measuring the construct. Items with MNSQ values >2.0 distorted or degraded the results. Thus, items falling outside the range of .5 to 1.5 were to be discarded or modified to fit the model. Misfit indicated that an item measured a construct other than the underlying trait of interest. In the PCA of residuals, common criteria were used to assess dimensionality: the principal component should account for >60% of the total variance, and the eigenvalue of the first contrast component should be <2 or the unexplained variance in the first contrast component should be <5%.10,13,16,17
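For clarity, the fit statistics referenced above follow the standard Rasch definitions (stated here generically, not taken from the article): with observed rating $x_{ni}$, model-expected rating $E_{ni}$, and model variance $W_{ni}$ for student $n$ on item $i$,

$$
z_{ni} = \frac{x_{ni} - E_{ni}}{\sqrt{W_{ni}}}, \qquad
\text{Outfit MNSQ}_i = \frac{1}{N}\sum_{n=1}^{N} z_{ni}^{2}, \qquad
\text{Infit MNSQ}_i = \frac{\sum_{n=1}^{N} \left(x_{ni}-E_{ni}\right)^{2}}{\sum_{n=1}^{N} W_{ni}}.
$$

Both statistics have an expected value near 1.0 under good data-model fit; Outfit is sensitive to unexpected ratings on items far from a student’s ability level, whereas Infit is information-weighted and more sensitive to unexpected patterns near it.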
Rating scale functioning was examined using the following criteria: the number of observations in each category should be greater than 10; the average category measures should increase with the rating scale categories; Infit and Outfit MNSQ values for the categories should be less than 2.0; category thresholds should increase with the rating scale categories; step calibrations should be at least 1.4 logits apart but no more than 5.0 logits apart; and the probability curve for each rating scale category should show a distinct peak, and the peaks should not overlap.11,15
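These category diagnostics lend themselves to a simple programmatic screen. The following is a minimal sketch, not the authors’ code; the category summary values and column names are hypothetical and, in practice, would come from the category structure output of the Rasch software.

```python
# Minimal sketch (hypothetical values): screen rating scale category diagnostics
# against the functioning criteria described above.
import pandas as pd

categories = pd.DataFrame({
    "category": ["beginner", "intermediate", "proficient", "advanced", "distinguished"],
    "count": [42, 55, 97, 131, 110],                   # hypothetical observation counts
    "avg_measure": [-3.1, -0.9, 1.2, 3.4, 6.0],        # hypothetical average measures (logits)
    "outfit_mnsq": [1.1, 0.9, 1.0, 0.9, 1.2],
    "step_calibration": [None, -4.2, -1.3, 1.9, 3.6],  # lowest category has no step
})

steps = categories["step_calibration"].dropna()
checks = {
    "at least 10 observations per category": (categories["count"] >= 10).all(),
    "average measures advance monotonically": categories["avg_measure"].is_monotonic_increasing,
    "Outfit MNSQ below 2.0": (categories["outfit_mnsq"] < 2.0).all(),
    "step calibrations advance": steps.is_monotonic_increasing,
    "step advances between 1.4 and 5.0 logits": steps.diff().dropna().between(1.4, 5.0).all(),
}

for criterion, passed in checks.items():
    print(f"{criterion}: {'PASS' if passed else 'FLAG'}")
```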
Content validity was evaluated from three aspects: the item calibration distribution, gaps in the measurement continuum, and population targeting.11 These criteria were examined using an item/person map, in which each item’s difficulty level and each student’s ability (as assessed by the preceptor) were placed on the same measurement continuum, combined with an expected score map. If a gap occurred between items on the item/person map, a z-test was used to test its significance. The expected score map reflected the item difficulty hierarchy, the distribution of student measurements, and ceiling/floor effects. When the average student measurement was ≥1.0 logit or ≤−1.0 logit, mistargeting and ceiling/floor effects could be assumed to be present.18
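The gap test can be made explicit with the usual comparison of two independent Rasch calibrations (a standard formulation, not taken from the article): for the two items bounding a gap, with calibrations $d_1$ and $d_2$ and standard errors $SE_1$ and $SE_2$,

$$
z = \frac{d_1 - d_2}{\sqrt{SE_1^{2} + SE_2^{2}}},
$$

and $|z| \geq 1.96$ would flag the gap as statistically significant at the .05 level.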
Reliability evidence was reported using the student and item separation indices and reliability coefficients from the Rasch model. The student and item separation indices represent the extent to which students and items spread out to define distinct levels of ability and difficulty; each can be translated into a reliability value equivalent to a traditional test reliability index such as Cronbach’s alpha. Student reliability indicates the range of student abilities being measured and the discriminating ability of the test. For example, a reliability coefficient ≥.9 suggests that student ability can be discriminated into three or four levels.19 Because the PAST rating scale comprises five categories (ie, beginner, intermediate, proficient, advanced, and distinguished), we expected a reliability ≥.9, which would indicate sufficient use of the rating scale, high internal consistency of the students being measured, and good discriminating ability of the instrument. We also examined the item reliability coefficient obtained from the item measurements. To illustrate the components evaluated in the Rasch analysis, a flowchart is provided (Figure 1).
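The relationship between separation and reliability referenced above follows the standard Rasch definitions (stated generically, not reproduced from the article):

$$
G = \frac{SD_{\text{true}}}{RMSE} = \sqrt{\frac{R}{1 - R}}, \qquad
R = \frac{G^{2}}{1 + G^{2}}, \qquad
H = \frac{4G + 1}{3},
$$

where $G$ is the separation index, $RMSE$ is the root mean square standard error of the measures, $R$ is the Rasch reliability coefficient, and $H$ estimates the number of statistically distinct strata. For example, $R = .9$ corresponds to $G = 3$ and $H \approx 4.3$, consistent with discriminating three to four levels of student ability.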
Figure 1. Flowchart of Components Evaluated in the Rasch Analysis.
RESULTS
PAST was administered to preceptors to evaluate 488 PharmD student rotations at the University of Arizona College of Pharmacy from August 2015 to March 2016. Of those, 435 observations were included in the final analysis; the remaining 53 were excluded because of incomplete evaluations. Students completed rotations in five categories: 68 in adult acute care, 93 in ambulatory care, 83 in community pharmacy, 71 in hospital/health systems, and 120 in electives. Each rotation was six weeks long, and the adult acute care, ambulatory care, community pharmacy, and hospital/health systems rotations were mandatory.
As shown in the ICCs (Appendix 2), the curves for the various items were parallel to one another, illustrating that all items had equivalent discriminating ability and differed only in difficulty level, which supported the appropriateness of using the one-parameter IRT model in this study.
Item goodness-of-fit statistics showed that all item Infit and Outfit MNSQs fell within the recommended fit criteria (ie, .5 to 1.5), indicating that the items measured the same construct and worked in unison with one another. Several articles have employed more stringent fit criteria, requiring Infit and Outfit MNSQs between .6 and 1.4 or between .7 and 1.3.20-27 To be comparable with these studies, the stricter criterion was also applied and was met, with all item MNSQ values falling between .7 and 1.3 (Appendix 3). This demonstrated that all items contributed to the same construct of interest (ie, rotation performance as measured by achieving mastery of the CAPE competencies). In addition, the PCA results revealed that 68.6% of the total variance was explained by the underlying construct of interest, which was good; however, the second major component (the first contrast) had an eigenvalue of 3.9, which did not meet the criterion for unidimensionality and suggested that another construct might exist in PAST. To test this, we divided the instrument into four domains based on its subcomponents, but we were unable to reach a satisfactory eigenvalue or unexplained variance for the second component. Because the item goodness-of-fit statistics showed that all items worked together, and because unidimensionality is not an all-or-nothing property, we assumed that all items measured the same construct of interest in this study.
Rating scale functioning met the aforementioned criteria, with more than 40 observations in each category, non-overlapping peaks in the category distributions, and acceptable MNSQ values and step calibration advances between categories. One concern was the step calibration distance between the “proficient” and “advanced” categories, which was 5.5 logits, slightly larger than the accepted maximum of 5.0 logits. Overall, however, the rating scale functioning results indicated that the five response categories (ie, beginner, intermediate, proficient, advanced, and distinguished) functioned appropriately in differentiating students’ rotation performance and thus did not need to be collapsed or reworded.
The item/person map (Figure 2) shows the hierarchical order of item difficulty and student ability (as assessed by preceptors) and provides an overall visual representation of content coverage, the distribution of item calibration estimates, gaps in the measurement continuum, and population targeting. Item 6 (student effectively solves problems) and item 7 (student effectively assures accurate preparation, labeling, dispensing and distribution of prescriptions and medication orders) were more difficult for students to achieve (as rated by preceptors), in part because preceptors had fewer opportunities to observe students performing portions of these tasks. Item 17 (student consistently behaves in an ethical manner as would be expected of a pharmacist) and item 18 (student demonstrates professionalism at all times) were easier or more likely for students to achieve (as rated by preceptors). In addition, all items were clustered in the middle of the continuum and fell within two standard deviations of the mean, with no significant gaps between them. There were, however, no items at the extremes of the measurement continuum, and some items were located at the same position on the map, indicating that the items in each cluster had the same difficulty level (ie, the pair of items 6 and 7; the quartet of items 4, 5, 9, and 12; and the quintet of items 1, 2, 3, 10, and 11) (Figure 2). Meanwhile, the item/person map showed that the student measurements were mostly clustered at the top of the continuum, with few at the low end and some at the high end of the scale.
Figure 2. Item and Person Map of the Preceptor Assessment of Student Tool.
The right side of the expected score map (Figure 3) shows the hierarchical ordering of item difficulty, with the easiest items at the bottom and the most difficult at the top. The curve at the top represents the distribution of student ability as assessed by preceptors, with a mean estimate of 5.75 logits and a standard deviation of 3.76 logits.
Figure 3. Preceptor Evaluations of Students Using the Preceptor Assessment of Student Tool.
Rasch analysis showed that the student separation index was 3.44 with a reliability coefficient of .92, and the item separation index was 6.50 with a reliability coefficient of .98. Both the student and item measurements indicated high internal consistency.
DISCUSSION
This study evaluated data from 435 PharmD student rotations at the University of Arizona College of Pharmacy, which can be considered a large sample for a Rasch analysis. Empirical experience generally suggests that more than 100 observations are needed for an exploratory Rasch analysis to produce reliable results. One study reported that results from small samples (≤50) might lead to the opposite conclusion from those based on larger samples (≥100) and warned that caution is needed when interpreting such results.28 Because this study was based on 435 observations, it can be assumed to have had sufficient power to reach accurate results.
Our study used the rating scale model to determine how well PAST functioned in assessing students’ rotation performance. Based on the results, PAST is generally a well-designed instrument. All 19 items measured the same construct of interest and worked unidimensionally, with local independence of items and monotonicity of scaling. The five response categories showed appropriate discriminating ability in evaluating students’ rotation performance. In addition, PAST showed high reliability (student reliability coefficient of .92) and good internal consistency.
It is worth noting, however, that there were some concerns regarding step calibration distance and item measurement. Scale functioning showed that the step calibration distance between the “proficient” and “advanced” categories was 5.5 logits, slightly larger than the accepted maximum of 5.0 logits, which might suggest that the content coverage of these two response categories was broader than it should be. In this case, narrowing the content coverage, either by adding another response option between the two categories or by rewording the two categories to reduce the distance between them, might help decrease the step calibration distance. Further work is needed to test scale functioning if categories are added or the current response options are reworded, as true information might be lost by doing so.
The item/person map showed an absence of items at the extremes of the measurement continuum, multiple overlapping items, and potential mistargeting. Item measurement indicated that all items were in the middle of the continuum (ie, of moderate difficulty) and absent at the extremes (ie, both the easiest and most difficult items were missing). In this case, PAST might perform poorly in differentiating among the least and most able students, jeopardizing content validity. For example, if PAST were used to evaluate relatively less able students (ie, students rated at the beginner and intermediate levels), the results would place those students at the bottom of the map but might not differentiate their abilities because items at the corresponding difficulty levels were missing. The same could occur with more able students (ie, students evaluated at the advanced and distinguished levels). One possible solution would be to add appropriate items to cover the empty areas of the continuum. However, this would require a systematic approach and a specific rationale to avoid adding redundant items and increasing the response burden. Another issue was that some items were placed at the same difficulty level on the continuum, including the pairs of items 6 and 7, items 8 and 13, and items 17 and 18; the quartet of items 4, 5, 9, and 12; and the quintet of items 1, 2, 3, 10, and 11. From the Rasch model perspective, eliminating items that are redundant in difficulty level (ie, one item from each pair, three from the quartet, and four from the quintet) could result in similar instrument functioning. However, because PAST reflects the CAPE outcomes, essential information regarding students’ rotation performance and the expected outcomes would be lost by doing so. A possible solution would be to modify the degree of the trait that these overlapping items measure so that they spread out along the measurement continuum. The last issue with the item/person map was the mismatch between the student and item measurements on the continuum. As shown in the map, items were clustered in the middle of the continuum, whereas student measurements were mostly placed in the upper portion of the scale, with few at the low end and some at the high end. This skewed distribution of student measurements suggests either that students performed well on rotations (as perceived by preceptors) or that there were not enough difficult items.
It is important to note, however, that interpretation of content validity (ie, the coverage of the underlying construct of interest) should also take into account the goal of the instrument. PAST was developed based on the CAPE educational outcomes and covers all the learning objectives (ie, terminal outcomes) that PharmD students must achieve before graduation. Specifically, PAST aimed to assess student performance on rotations and measure their mastery learning of the CAPE outcomes. Because all evaluated students were in the final year of the PharmD program, they should, by definition, be able to perform well on most, if not all, PAST items (ie, students were expected to be rated as proficient, advanced, or distinguished). By the time they graduate, students are expected to have mastered each item. Thus, from a mastery learning perspective, additional items at the extremes are not needed, and the content validity of PAST is defensible.
There are limitations to this study. First, sample independence is one of the assumptions of IRT and can be tested by examining differential item functioning (DIF). We did not explore this because of limited data (ie, grouping variables were missing from the dataset). We suggest that future studies examine DIF to determine whether the extent of the latent trait measured by each item differs among subgroups of the sample. Second, we did not take into account the impact of preceptors, again because of limited data. Future studies could employ the many-facet Rasch model to incorporate multiple facets, such as items, students, and raters, into the analysis. Third, this was a cross-sectional study, with all students assessed only once, which might introduce random error into the data and distort or degrade the assessment of the instrument. Future studies could focus on longitudinal evaluation data to reach more reliable results. Finally, the students in this study were all from one university and therefore might not be representative of the entire population of PharmD students in the United States. Future studies could collect data from other universities to further confirm the validity and reliability of PAST.
CONCLUSION
Although adding items at the extremes of the measurement continuum may be beneficial, PAST showed good validity and reliability evidence in evaluating PharmD student rotation performance. The current evidence supports the conclusion that PAST provides valid and reliable information for evaluating PharmD student rotations and captures the essence of the CAPE outcomes.
Appendix 1. Items of Preceptor Assessment of Student Tool

Appendix 2. Item Characteristic Curves for 19 Items with Equal Discrimination but with Different Levels of Difficulty
Each item (1-19) is represented as an individual curve, illustrating that each item has equivalent discriminating ability and the only differentiating factor for each item was attributable to the difficulty levels of various items, confirming the appropriateness of using the one-parameter IRT model.
Appendix 3. Measures and Fit Statistics of Items for Preceptor Assessment of Student Tool

Received July 1, 2016. Accepted September 21, 2016.
© 2017 American Association of Colleges of Pharmacy