Monday, May 13, 2024

Turnitin Is Selling Us Snake Oil, or Why AI Detection Cannot Work

The notion of measuring "AI-generated text" as a fixed percentage of an academic submission is fundamentally flawed. This metric implies a homogeneous substance, akin to measuring the alcohol content in a beverage. However, my recent survey suggests that judgments about the academic integrity of AI use are far from homogeneous. The survey asked educators to evaluate the ethical implications of using AI for twelve different tasks in writing an academic paper, ranging from researching and brainstorming to editing and writing full sections.

The findings revealed significant variance in responses. While many respondents were comfortable with AI aiding in brainstorming ideas, they expressed reservations or outright disapproval of AI writing entire paragraphs or papers. This disparity underscores a critical issue: there is no consensus in the academic profession on what constitutes acceptable AI assistance in learning. More strikingly, within each individual's responses, there was considerable variation in how different AI uses were assessed.

Consider the implications of a tool like Turnitin reporting "50% AI-generated" content. What does this figure actually represent? It lacks context about how the AI-generated content was incorporated. For instance, a paper could be largely original, with only minor edits made by AI at the end, potentially showing a high percentage of AI contribution. Conversely, a student might contribute minimally to an essentially AI-written paper, making slight modifications to reduce the AI-detected percentage. Both scenarios could yield vastly different percentages, yet the ethical implications are markedly divergent.

The pursuit of better detection technology misses the point. The issue is not with the detection capabilities but with the construct itself. The very idea of "AI-generated text" as a unified concept is problematic. Just as a depression inventory measures various symptoms that converge on the underlying construct of depression, our methods for evaluating AI in academic work must recognize the diverse and context-dependent nature of its use. The current approach, which treats all AI contributions as equivalent, is akin to judging a book's genre by counting its words. I wish Turnitin and other commercial "AI detectors" would show just a little more integrity and stop selling us snake oil. They must know that their claims are bogus, because "AI-generated text" is not a valid construct to be measured.

Instead of focusing obsessively on detecting AI-generated content, we need to shift our perspective. We should expect and require students to use AI as part of their learning process. The challenge then becomes developing assignments that measure not only content knowledge but also the meta-AI skills and competencies needed to navigate and leverage these tools effectively. This approach acknowledges the complexity of AI's applications and ensures it is used responsibly, promoting a learning environment that respects both the potential and the limitations of artificial intelligence.

For the most curious, I asked Claude to quantify what seems to be completely obvious if you look at the data. Feel free to recalculate. 

"To quantify this variability in responses, I calculated the within-subject variance for each survey respondent across the 12 AI use cases. Within-subject variance measures how much an individual's responses vary across different items or situations. A high within-subject variance indicates that a respondent is evaluating the different use cases distinctly, rather than applying a consistent judgment.

The average within-subject variance across all respondents was 0.85 (on a scale from 0 to 2, where 0 indicates no variance and 2 indicates maximum possible variance). This high value confirms that respondents were not simply deciding whether "AI-generated text" as a whole is acceptable, but were making nuanced distinctions between different uses of AI.

Furthermore, I calculated the intraclass correlation coefficient (ICC) for the survey responses. The ICC measures the proportion of total variance that is attributable to differences between respondents, as opposed to within-respondent variance. The ICC was only 0.28, indicating that most of the variance (72%) was within-respondent, not between-respondent. In other words, respondents showed far greater consistency in their responses to any given AI use case than they did across the range of use cases.

These psychometric indices underscore the problem with treating "AI-generated text" as a unitary construct. If it were a valid construct, we would expect to see low within-subject variance (respondents applying a consistent judgment across use cases) and high ICC (most variance being attributable to differences between respondents). Instead, we observe the opposite pattern, indicative of a construct validity problem."
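For readers who want to reproduce these indices on their own data, here is a minimal Python sketch of the two calculations described above. The file name survey_responses.csv and the 0-2 scoring are assumptions for illustration, and the ICC is computed with the standard one-way random-effects formula (ICC(1)), which may differ slightly from whichever variant Claude used.

```python
# Minimal sketch, assuming a CSV (hypothetical name: survey_responses.csv)
# with one row per respondent and one column per AI use case,
# scored 0-2 (e.g., 0 = unacceptable, 1 = depends, 2 = acceptable).
import numpy as np
import pandas as pd

ratings = pd.read_csv("survey_responses.csv")   # shape: (n_respondents, 12)
X = ratings.to_numpy(dtype=float)
n_subjects, n_items = X.shape

# Within-subject variance: how much each respondent's ratings vary
# across the 12 use cases, averaged over all respondents.
within_subject_var = X.var(axis=1, ddof=1)
print("Mean within-subject variance:", within_subject_var.mean())

# One-way random-effects ICC, i.e. ICC(1): the share of total variance
# attributable to differences between respondents rather than within them.
grand_mean = X.mean()
row_means = X.mean(axis=1)
ms_between = n_items * np.sum((row_means - grand_mean) ** 2) / (n_subjects - 1)
ms_within = np.sum((X - row_means[:, None]) ** 2) / (n_subjects * (n_items - 1))
icc1 = (ms_between - ms_within) / (ms_between + (n_items - 1) * ms_within)
print("ICC(1):", icc1, "-> within-respondent share:", 1 - icc1)
```

A low ICC together with a high mean within-subject variance is exactly the pattern the quoted analysis reports: people differ far more across use cases than they differ from one another.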
