On the trickiness of hard evidence in ELT

Recently, there's been a lot of debate about the need for evidence, method and research in ELT. I confess I'm truly fascinated by the topic. I enjoy delving into the debates as well as reading articles describing various experiments. I also love to experiment myself. However, I often catch myself having doubts, especially about the process, but also about the interpretation and implementation of the results. The other day I came across an interesting article called How Tests Make Us Smarter, in which Henry L. Roediger III discusses the benefits of regular quizzing:

"When my colleagues and I took our research out of the lab and into a Columbia, Ill., middle school class, we found that students earned an average grade of A- on material that had been presented in class once and subsequently quizzed three times, compared with a C+ on material that had been presented in the same way and reviewed three times but not quizzed. The benefit of quizzing remained in a follow-up test eight months later".

This is my concern: I suppose the author is talking about two different texts (pieces of material) and one group of students. The thing is that I'm not convinced it's possible to get reliable results with two different pieces of material. Who can guarantee that they were of exactly the same level of difficulty, or that they included comparably demanding content? We have some scientific methods and tools that can measure readability scores, for example, but there are so many other factors at play as far as the difficulty of texts is concerned. Then again, a similar problem would arise if the researcher used the same text with two different groups: there would be no guarantee that the groups (or individual students) had the same ability, intelligence, aptitude, etc.

I think the problem is the attempt to turn ELT into a rigorous science. We call for research and concrete evidence, but any research of this type involves people who react to stimuli, interact, respond, have different character traits, etc.; they simply behave differently and unpredictably in different situations and under different circumstances. Then there is the learning content: texts, images, equations, vocabulary, graphs, you name it. The people are undoubtedly a very unstable element, but the material is not a constant either, because the perceived difficulty of a piece of material depends heavily on who is perceiving it; it's not merely a property of the material itself.

I remember conducting an experiment (as part of my MA studies) with a group of intermediate students (the same age, approximately the same level). It was based on I.S.P. Nation's belief that we need to be familiar with 98% of the words in a text to be able to understand it sufficiently. First, I gave my students a paper version of Nation's Vocabulary Levels Test (which can be easily accessed online) to assess their vocabulary knowledge. My intention was to find out to what extent the result corresponded with the readability score of the text they were going to read. Then I gave them a simple authentic short story by Ernest Hemingway and asked them to underline all the unknown vocabulary while reading. To make the results more precise, I asked them to use two different markers: one for words they didn't know at all and couldn't infer from the surrounding context, the other for words whose meaning they didn't know but thought they could guess from the context and co-text. Then I asked them some comprehension questions to see the correlation between the number of unknown words and the ability to understand the story. Finally, I spot-checked at random whether they really knew the words they hadn't underlined. I got all sorts of interesting results, such as: 1) some students had 'cheated' and underlined fewer words than they should have, 2) one of the best students in the class had underlined the most unknown words, which, however, hadn't prevented him from understanding the most important message of the story, 3) some students had underlined some vocabulary only to realize later that they actually knew it, 4) another very good student hadn't underlined many words but his comprehension was rather weak, etc. Overall, I got a lot of hard evidence of how distorted the results can be if the human factor is involved. I don't intend to go into further detail here.
The point is that all the students got the same text, a fairly easy one from the linguistic point of view, but because the story was a literary text with lots of implicit and hidden messages and meanings, and each student came from a different background, with different experience and schemata, the level of comprehension didn't and couldn't correlate with the actual language knowledge.
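
For anyone who'd like to play with the 98% figure themselves, here is a minimal Python sketch of the arithmetic behind it. It is not the procedure I used in my MA experiment; the function names and all the numbers are made up for illustration, and it simply assumes that "coverage" means the share of running words in a text that a student did not mark as unknown.

```python
# A rough sketch, not the original experiment: names and numbers are hypothetical.

COVERAGE_THRESHOLD = 0.98  # the 98% figure attributed to Nation in the post


def lexical_coverage(total_words: int, unknown_words: int) -> float:
    """Return the share of running words the student did NOT mark as unknown."""
    if total_words <= 0:
        raise ValueError("total_words must be positive")
    if not 0 <= unknown_words <= total_words:
        raise ValueError("unknown_words must be between 0 and total_words")
    return (total_words - unknown_words) / total_words


def meets_threshold(total_words: int, unknown_words: int,
                    threshold: float = COVERAGE_THRESHOLD) -> bool:
    """Check whether the student's self-reported coverage reaches the threshold."""
    return lexical_coverage(total_words, unknown_words) >= threshold


# Invented example: a 500-word story and two imaginary students.
for name, underlined in [("Student A", 8), ("Student B", 30)]:
    coverage = lexical_coverage(500, underlined)
    print(f"{name}: {coverage:.1%} coverage, "
          f"meets 98%: {meets_threshold(500, underlined)}")
```

Of course, the calculation is only as trustworthy as the underlining it is based on, which is exactly where my results went astray.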

All in all, I believe that it's the human factor that complicates ELT research and undermines the validity of any evidence. No matter how much we want to experiment, some of the data we get from our experiments will often be pretty unreliable and impossible to replicate. If I say something worked for my students and I even prove it, any educator or researcher can quite easily disprove my claims if they conduct the research at a different time, in a different environment, with different students and different material. No wonder that, to some, ELT research may appear a waste of time; they prefer taking all sorts of feeble arguments for granted and simply try what others have tried before without challenging their assumptions.