Abstract
Required discussions of course readings provide motivation for students to learn course content and can be used to validate students' comprehension of course content and processes. A tool for grading the students' weekly discussions of course work was tested to evaluate the interrater reliability of grading by two faculty members. The purpose of this article is to describe psychometric testing of the interrater reliability of this grading method. Using the grading tool, independent ratings of five students' online discussion postings were recorded by both faculty members over a 5-week period, which provided the data for this study. Data were analyzed using Spearman ρ and Kendall τ-b statistics. The findings revealed that the overall correlations of rater scores were satisfactory and indicated that an acceptable level of interrater reliability was obtained through use of the grading tool. Reliable tools for evaluation of students' online discussions contribute to the knowledge needed for the implementation of online courses.
Online programs and courses have been implemented in many universities and colleges, with high percentages of students taking their course work online.1 For example, "in 2000-2001, 90 percent of public 2-year and 89 percent of public 4-year institutions offered distance education courses."1(piii) With online courses, the focus differs from traditional teaching methods.2 Instead of attending teacher-conducted lectures, students must learn course content and processes from required readings, materials such as slide presentations on course Web sites, content available through the Internet, and a variety of assignments. Because students are required to read about course content and discuss it with their peers to show that they understand the readings, specific methods for grading discussion content are needed as part of the assessment process.
Since 2001, the first author has used a method for grading online discussions with three criteria: (1) frequency of postings, (2) the extent to which students' postings reflect comprehension of the required readings, and (3) students' comments on other students' postings (see Table 1). Over the past 5 years, specific aspects of the grading method have been revised each year in response to students' reactions to its use. These three criteria are used to set the expectations for weekly participation in course work, which constitutes 25% of course grades. Discussion questions based on the required readings are provided for each week (see example in Table 2). Each week also includes application of the weekly content to a specific task. For example, clinically related content is applied to an assigned case study, and knowledge related to nursing research is applied to critique an assigned research study report (see question 13, Table 2). At the end of each week's discussion, each student is required to send an e-mail message to the instructor with a self-evaluation that gives three grades, one for each of the three scoring criteria. The instructor's grades are returned to the students as soon as possible after this and may be the same as, higher than, or lower than the students' self-assigned grades. When the instructor's grades differ from the student's grades, the reasons are explained using the grading criteria as the basis for explanation. The success of this scoring method was anecdotally determined by a generally high percentage of agreement between grades submitted by students and grades submitted by the faculty members. For students whose grades differ from the instructor's, however, it sometimes takes 2 to 5 weeks before the two sets of grades are the same; that is, before the students accept the standards set by the instructor and meet the standards at the desired level. For example, some students aim to achieve a 95 every week; other students seem to aspire to achieve grades of 80, 85, or 90.
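The agreement between students' self-assigned grades and the instructor's grades, described above only anecdotally, could be tallied explicitly. The following is a minimal sketch under stated assumptions: the grade values are invented for illustration, and the function name `agreement_rate` is hypothetical rather than part of the grading method itself.

```python
from typing import Sequence, Tuple

# One weekly record: three grades, one per criterion
# (frequency of postings, comprehension of readings, comments on peers).
Grades = Tuple[int, int, int]


def agreement_rate(self_grades: Sequence[Grades],
                   instructor_grades: Sequence[Grades]) -> float:
    """Fraction of criterion-level grades on which student and instructor match exactly."""
    if len(self_grades) != len(instructor_grades):
        raise ValueError("each self-evaluation needs a matching instructor record")
    pairs = [
        (s, i)
        for s_week, i_week in zip(self_grades, instructor_grades)
        for s, i in zip(s_week, i_week)
    ]
    if not pairs:
        raise ValueError("no grades to compare")
    matches = sum(1 for s, i in pairs if s == i)
    return matches / len(pairs)


# Invented example: one student's self-evaluations over three weeks.
student = [(95, 90, 85), (95, 85, 85), (95, 90, 90)]
instructor = [(95, 85, 85), (95, 85, 85), (95, 90, 90)]
print(f"{agreement_rate(student, instructor):.2f}")
```

Exact matching is a strict choice; a tolerance of one five-point step could be substituted if near agreement is of interest.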
Table 1. Criteria for Grading Online Discussions
Table 2. Examples of Weekly Discussion Questions
In the 6 years that these grading criteria were used in two different undergraduate nursing courses, there were no student complaints about the grading criteria or grades. However, the interrater reliability of two faculty members using the grading criteria had never been tested. In fall of 2006, with the advantage of having two instructors assigned to teach the same course, the authors conducted a study to evaluate the interrater reliability of grading by two instructors. The purpose of this article is to describe psychometric testing of the interrater reliability of this grading method.
GRADING: A DIFFICULT AND COMPLEX TASK
The grading of students' assignments is an important and difficult role for instructors.3 The importance of grading is derived from students' grades being used as the criteria for graduating, being accepted into higher level educational programs, and obtaining jobs. The difficulties of grading are that the behaviors or performances of students are complex, and grading has a significant subjective component; for example, it is not clear what constitutes the difference between excellent and good.3 Students may perceive grading as unfair; that is, they may be concerned that biases influence how their work is graded relative to the work of other students, or that the amount of work needed to achieve a grade is more than they expected. In a study of educators' recollections of their experiences as students with grading, Guskey4 found that 70% of 320 elementary, middle, and secondary educators perceived unfair grading in college courses.
Scoring rubrics are a means of clarifying the standards for grading to improve validity and reliability, given the inherent difficulties.5 By using scoring rubrics, "students gain a clear understanding of what is expected of them to attain success."5(p22) Student understanding of what is expected is especially challenging with online courses, so rubrics serve an important role in online pedagogy.
Scoring rubrics for grading students' participation in online discussions are needed to meet at least four of the seven principles of good practice in undergraduate education, that is, the principles promoted by the American Association for Higher Education (AAHE).6 The four principles that are served by regular use of scoring rubrics are as follows: (1) encourage contact between students and faculty, (2) encourage active learning, (3) give prompt feedback, and (4) communicate high expectations. In asynchronous online courses, there are no regular face-to-face meetings between faculty and students, so contact must be achieved through e-mail and on the course site. Requiring students to conduct self-evaluations each week using a scoring rubric and to communicate the results to the teacher through e-mail achieves the goal of regular contact. The second principle, encouraging active learning, is attained through the methods of course participation, that is, reading the assigned content and discussing the readings with expectations set by the instructor through the grading criteria. These expectations include critical thinking for the application of the weekly content to an assigned task. The third principle, giving prompt feedback, is accomplished each week by letting students know how well they met the standards for weekly discussions of the readings. Another AAHE document, the nine principles of good practice for assessing students' learning, emphasizes the need to use ongoing rather than episodic assessment strategies.7 The fourth principle, setting high expectations, is achieved through directions in the scoring rubric of how to achieve high grades each week for class participation.
The use of a scoring rubric for course-related discussions motivates students to achieve active and meaningful participation.2,8,9 In a study of student participation in 18 graduate-level courses delivered at a distance, there were significantly more discussions in the courses in which discussions were graded than in those that were not graded.9 Each scoring rubric should contain two essential components, the evaluation criteria and the quality rating.10 A variety of types of quality ratings can be used, for example, words or phrases such as poor or excellent, numbers8 such as 1 to 4, 1 for "disastrous" to 7 for "excellent,"11 or the traditional 0 to 100 ratings that are familiar to students. The only criterion for choice of quality rating seems to be that the quality rating is clear enough so that students understand the meaning of the rating, that is, their work is excellent, good, fair, poor, or unsatisfactory, and how it relates to the overall course grade. In the scoring rubric used in this study, the traditional percentages are used because the grade is averaged to serve as a percentage of the course grade. The lowest grade is 0, which is assigned when students are absent, and the highest grade is 95. Grades are usually assigned at five-point increments. The rationale for 95 being the highest grade was that all students in the class should be sharing their thoughts about the course readings, and no one student should be so important to the discussion that he/she receives a perfect score.
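The arithmetic of the grading scale described above can be sketched as follows. This is an illustration only: the text states that grades run from 0 to 95, that the grades are averaged, and that participation makes up 25% of the course grade, but equal weighting of the three criteria and of the weeks is an assumption of this sketch, and the function names are hypothetical.

```python
from statistics import mean
from typing import Iterable, Tuple

MAX_GRADE = 95               # highest grade in the rubric
PARTICIPATION_WEIGHT = 0.25  # participation is 25% of the course grade


def weekly_grade(criteria: Tuple[int, int, int]) -> float:
    """Average the three criterion grades for one week.

    Equal weighting of the criteria is an assumption of this sketch.
    """
    for g in criteria:
        if not 0 <= g <= MAX_GRADE:
            raise ValueError(f"grade {g} is outside 0-{MAX_GRADE}")
    return mean(criteria)


def participation_points(weeks: Iterable[Tuple[int, int, int]]) -> float:
    """Points contributed to a 100-point course grade by weekly participation."""
    weekly = [weekly_grade(w) for w in weeks]
    if not weekly:
        raise ValueError("at least one week of grades is required")
    return PARTICIPATION_WEIGHT * mean(weekly)


# Invented grades; (0, 0, 0) represents a week in which the student was absent.
example = [(95, 90, 85), (0, 0, 0), (90, 90, 90)]
print(participation_points(example))
```

Because grades are only usually assigned at five-point increments, the sketch checks the range but does not reject intermediate values.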
METHOD
This was a psychometric study of the interrater reliability of a three-part instrument to evaluate students' online discussion of weekly readings and application of the readings to specific tasks. The online discussions took place in a baccalaureate-level nursing research course at the College of Staten Island, The City University of New York. No data were collected from any human subjects. The data were online weekly text postings of students that were graded as part of the course grade. Students did the same work as usual with no disruption in usual course routines. The study was approved by the institutional review board of the College of Staten Island.
For the first 2 weeks that involved online discussion, each faculty member graded one-half of the class and compared their grades to determine whether there was consistency in grading. In grading the students' online discussion of the first week's course readings, for example, the two instructors differed on one of the three grades for four of five students. This prompted faculty discussions about application of the grading criterion and establishment of interrater reliability. For the next 3 weeks of students' discussions, each faculty member used the instrument to independently grade the postings of one-half of the class (n = 11), changing the students they graded each week so that both of them would get to know all of the students' work. Five of these students were selected each week to be graded by both instructors. An average of 191 postings was evaluated each week. At the end of each class week of this 3-week period, the two faculty members discussed the grades of the five students that they both graded to clarify use of the scoring rubric. The data from weeks 9 to 13 were used to estimate interrater reliability. In each of these weeks, a different group of five students was selected for both faculty members to grade. The grades were assigned to the students by the responsible faculty member but recorded by both faculty members and not discussed until completion of the course. These independent ratings were used as the basis for this study.
The instrument, Criteria for Grading Online Discussion, is based on three criteria: frequency of postings, quality of postings, and responses to other students' postings (see Table 1). For the first criterion, the faculty member sums the number of postings for each student. In Blackboard (Blackboard, Inc, Washington, DC), this is easily done by changing the default display option in the discussion forum to "author." The postings are then organized by authors of postings in alphabetical order. To verify that all the postings actually contained nonrepeated content, students' postings are collected and read, one student at a time.
The second criterion, discussion of the required readings, is more difficult to determine, requiring that each student's postings be read and considered in comparison to the course readings and discussion questions. The highest level of quality is considered to be the demonstration, through the postings, of both general and specific knowledge of that week's content. The two faculty members agreed that the presence of statements in the students' postings that reflected learning in the cognitive domain of Bloom's Taxonomy12 would be used to evaluate the level of quality in this criterion. For example, the quality of the postings was evaluated for the extent to which the students defined key concepts, explained content, used examples, and applied content to specific case studies.
The third criterion, quality of discussion commentary, relates to commenting on other students' postings. A rationale for including this criterion is that students should read each other's work and incorporate other students' postings into their own. For this criterion, students are told that they must use the other students' names in their postings. The instructor can then count the number of students mentioned and make a judgment about the substantive nature of the comments regarding other students' work. Students are informed through course directions, for example, that just writing "I agree with Mary" does not involve critical thinking and that they are therefore required to add more substantive commentary, observations, and remarks to their discussion commentary.
Data were analyzed using SPSS version 11.0 (SPSS, Chicago, IL). Because this is a criterion-referenced tool with ordinal data,13 bivariate correlations were estimated using Spearman ρ and Kendall τ-b statistics. The Kendall τ statistic is reported because a large proportion of the ranks (56%) were tied.14 Based on differences in the three criteria, the two-rater correlations were also determined for each criterion.
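The two statistics named above can be illustrated with a short, dependency-free sketch. It is not the SPSS procedure used in the study; it implements the standard definitions of Spearman ρ (Pearson correlation of average ranks) and Kendall τ-b (which corrects for ties), and the grade values are invented to show how heavily tied five-point grades are handled. All function names are hypothetical.

```python
from itertools import combinations
from math import sqrt
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of the 1-based positions in the tie
        for k in order[start:end + 1]:
            ranks[k] = shared
        start = end + 1
    return ranks


def spearman_rho(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rho: Pearson correlation computed on average ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sxx = sum((a - mx) ** 2 for a in rx)
    syy = sum((b - my) ** 2 for b in ry)
    if sxx == 0 or syy == 0:
        raise ValueError("rho is undefined when a rater gives one constant grade")
    return sxy / sqrt(sxx * syy)


def kendall_tau_b(x: Sequence[float], y: Sequence[float]) -> float:
    """Kendall tau-b: (C - D) / sqrt((C + D + Tx) * (C + D + Ty))."""
    concordant = discordant = tied_x_only = tied_y_only = 0
    for (x1, y1), (x2, y2) in combinations(list(zip(x, y)), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied on both raters: counted in neither factor
        if dx == 0:
            tied_x_only += 1
        elif dy == 0:
            tied_y_only += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    untied = concordant + discordant
    denom = sqrt((untied + tied_x_only) * (untied + tied_y_only))
    if denom == 0:
        raise ValueError("tau-b is undefined when a rater gives one constant grade")
    return (concordant - discordant) / denom


# Invented grades from two raters for the same five students.
rater1 = [95, 90, 90, 85, 80]
rater2 = [95, 90, 85, 85, 80]
print(round(spearman_rho(rater1, rater2), 3), round(kendall_tau_b(rater1, rater2), 3))
```

With grades assigned mostly in five-point steps, many student pairs receive identical grades from a rater, which is why a tie-corrected coefficient such as τ-b is a natural companion to ρ for this kind of data.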
FINDINGS
The correlation between the two raters' grades across the 75 graded items (5 weeks × 3 criteria × 5 students each week) was 0.837 (P < .001). When rater 1 and rater 2 grades for each criterion were analyzed, the following results were obtained. The correlation for criterion 1 was 0.913 (P = .004). The correlation for criterion 2 was 0.739 (P < .001), and the correlation for criterion 3 was 0.805 (P < .001). The data were also analyzed for correlations of each of the 5 weeks. The correlation for week 9 was 0.779 (P < .001); for week 10, it was 0.749 (P < .001); for week 11, it was 0.768 (P = .005); for week 12, it was 0.827 (P = .011); and for week 13, it was 0.956 (P < .001). For criteria 2 and 3, in general, the grades of the first faculty member were slightly higher than the grades of the second faculty member.
DISCUSSION
The overall correlations of rater scores were satisfactory, indicating that the criteria were acceptable for consistency of assigning grades to the students' postings.13,15 The differences in correlations for each criterion were expected. Criterion 2 was the criterion on which high agreement was most difficult to obtain because the expectation of quality has a strong subjective component. Although it is desirable to completely remove subjectivity from scoring, the task is generally impossible to accomplish.16 The faculty members endeavored to enhance the objectivity of scoring by integrating measurable learning indicators derived from the cognitive domain of Bloom's Taxonomy12 into criterion 2. The learning indicators required the students to provide definitions, explanations, examples, and application of content of the required readings in their postings. Thus, incorporating objective learning indicators into the various levels of criterion 2 was probably advantageous in establishing a standard for the raters' appraisal of the quality of the online discussion and in producing an acceptable degree of rater agreement. In addition, the discussions held by the two faculty members that clarified use of the rubric and application of the grading criteria and the conduct of the trial runs that preceded data collection were essential procedures that likely contributed to obtaining an acceptable degree of rater agreement.
The grades assigned by the first faculty member were slightly higher than those assigned by the second faculty member, which suggests that these grades may have been influenced by errors of leniency or severity.15 The occurrence of errors of leniency or severity could be attributed to the differences in teaching experiences that existed between the two faculty members.13 The first faculty member had been teaching baccalaureate nursing students for 25 years and the nursing research course for 5 years and maintained a high degree of familiarity with course content and students, which may have contributed to an error of leniency. The second faculty member previously had been teaching associate degree nursing students for 10 years, and it was her first teaching assignment with baccalaureate students and the nursing research course. The only recent experience of the second faculty member with research was in conducting nursing research studies, which required the application of research knowledge with a high degree of precision and rigor. The rigorous application of research knowledge required for conducting research studies could have influenced the second faculty member's scores toward an error of severity.
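A systematic difference of this kind is not captured by a rank correlation, which can be perfect even when one rater is consistently higher. One simple way to look for it is the mean signed difference between paired grades. The sketch below is illustrative only: the grades are invented, the function name is hypothetical, and the study itself does not report this statistic.

```python
from statistics import mean
from typing import Sequence


def mean_signed_difference(rater1: Sequence[float],
                           rater2: Sequence[float]) -> float:
    """Mean of (rater 1 grade - rater 2 grade) over paired grades.

    A positive value means rater 1 graded higher on average, which is
    consistent with relative leniency (or with severity in rater 2).
    """
    if len(rater1) != len(rater2) or not rater1:
        raise ValueError("raters must supply the same nonzero number of grades")
    return mean(a - b for a, b in zip(rater1, rater2))


# Invented grades for five students on one criterion: the two raters rank
# the students identically, yet rater 1 is five points higher throughout.
rater1 = [95, 90, 90, 85, 80]
rater2 = [90, 85, 85, 80, 75]
print(mean_signed_difference(rater1, rater2))
```

Reporting such a difference alongside the correlations separates consistency of ranking from agreement on the level of the grades.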
The differences in correlations by week were not significant. This supports the consistency of grading by the individual faculty members.
CONCLUSIONS AND IMPLICATIONS
Grading is an important aspect of teachers' many roles. Weekly grading associated with students' self-evaluations of weekly postings in online courses is advantageous because it ensures ongoing contact with the instructor, motivates students to understand the course expectations, gives students frequent feedback, and generates active learning. Students are greatly affected by the fairness and quality of grading. Grading should indicate the teacher's expectations regarding students' learning of content and should not be too stringent or too lenient. Grading of students' online participation in course work is even more of a challenge than traditional test taking, papers, and other assignments. Reliable tools for this purpose contribute to the knowledge needed for implementation of online courses.
REFERENCES
1. Waits T, Lewis L, Greene B. Distance education at degree-granting postsecondary institutions: 2000-2001 (NCES 2003-017). National Center for Educational Statistics. 2003. http://nces.gov/pubsearch/pubsinfo.asp?pubid=2003017 . Accessed August 12, 2007.
2. Pelz B. (My) three principles of effective online pedagogy. JALN. 2004;8(3):33-45. http://www.sloan-c.org/publications/jaln/v8n3/v8n3pelz.asp . Accessed August 12, 2007.
3. Walvoord BE, Anderson VJ. Effective Grading: A Tool for Learning and Assessment. San Francisco, CA: Jossey-Bass; 1998.
4. Guskey T. "It wasn't fair!" Educators' recollections of their experiences as students with grading. 2006. http://eric.gov/ERICDocs/data/ericdocs2sql/contentstorage01/0000019b/80/1b/d5/fa.pdf . Accessed August 12, 2007.
5. Loveland TR. Writing standards-based rubrics for technology education classrooms. Technol Teach. 2005;65(2):19-22.
6. Chickering AW, Ehrmann SC. Implementation of the seven principles: technology as lever. AAHE Bull. 1996:3-6. www.tltgroup.org/programs/seven.html . Accessed August 16, 2007.
7. Astin AW, Banta TW, Cross P, et al. 9 Principles of good practice for assessing student learning. American Association for Higher Education, Assessment Forum. 2001. www.assessment.tcu.edu/assessment/aahe.pdf . Accessed August 16, 2007.
8. Andrade HG. Using rubrics to promote thinking and learning. Educ Leadersh. 2000;57(5):13-18.
9. Rovai A. Strategies for grading online discussions: effects on discussions and classroom community in Internet-based university courses. J Comput High Educ. 2003;15(1):89-107.
10. Truemper CM. Using scoring rubrics to facilitate assessment and evaluation of graduate-level nursing students. J Nurs Educ. 2004;43:562-564.
11. Cho K, Schunn CD, Wilson RW. Validity and reliability of scaffolded peer assessment of writing from instructor and students perspectives. J Educ Psychol. 2006;98:898-901.
12. Bloom B, Englehart M, Furst E, Hill W, Krathwohl D. Taxonomy of Educational Objectives: The Classification of Educational Goals. Handbook I. The Cognitive Domain. New York, NY: David McKay Company; 1956.
13. Waltz C, Strickland O, Lenz E. Measurement in Nursing and Health Research. 3rd ed. New York, NY: Springer; 2005.
14. Bryman A, Cramer D. Quantitative Data Analysis With SPSS Release 10 for Windows: A Guide for Social Scientists. Philadelphia, PA: Taylor & Francis; 2001.
15. Polit D, Beck C. Nursing Research: Principles and Methods. Philadelphia, PA: Lippincott Williams & Wilkins; 2004.
16. Nunnally J, Bernstein I. Psychometric Theory. 3rd ed. New York, NY: McGraw Hill; 1994.