Despite the growing interest in Question Generation, evaluating these systems remains notably difficult. Many authors rely on automatic metrics like BLEU or ROUGE instead of manual evaluation, as they are nearly free to compute. However, the corpora generally used as references are very incomplete, containing just a couple of hypotheses per source sentence. In this paper, we propose the MONSERRATE corpus, a dataset specifically built to evaluate Question Generation systems, with, on average, 26 questions associated with each source sentence, attempting to be an “exhaustive” reference. With MONSERRATE, we study the impact of reference size on the evaluation of Question Generation systems. We use several evaluation metrics, from more traditional lexical ones to metrics based on word embeddings, and conclude that these metrics are still a limiting factor in evaluation, as they lead to different outcomes. Finally, with MONSERRATE, we benchmark three different Question Generation systems, representing different approaches to this task.