Letter to the editor on concerns of data leakage


We recently submitted a letter to the editor of the Journal of Imaging Informatics in Medicine, and it was published online today. In the letter, we raise concerns about potential data leakage in one of the journal's papers on treatment response prediction from MRI data. The paper reports an average AUC of 0.98 for predicting pathologic complete response (pCR) to neoadjuvant chemotherapy (NAC) from a pre-treatment MRI scan. This is achieved by training a 2D convolutional neural network (CNN) on individual slices of the MRI scans.

Such strong prediction performance is unparalleled in the literature on similar prediction scenarios. We suspect that the unexpectedly good performance may be due to accidental data leakage: different MRI slices of the same patient may have ended up in both the training and test sets. This would allow the CNN to memorize each patient's response label instead of discovering salient imaging features useful for predicting treatment response in MRI scans of future patients.
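
To make the distinction concrete, here is a minimal sketch of a slice-level split next to a patient-level split. It is an illustration under our own assumptions, not the code used in the paper or in our letter: the toy data (10 patients with 5 slices each) and the helper names `split_by_patient` and `shared_patients` are hypothetical.

```python
import random


def split_by_patient(patient_ids, test_fraction=0.2, seed=0):
    """Split slice indices so that all slices of a patient land on one side.

    patient_ids[i] is the (hypothetical) patient identifier of slice i.
    """
    unique_patients = sorted(set(patient_ids))
    rng = random.Random(seed)
    rng.shuffle(unique_patients)
    n_test = max(1, round(test_fraction * len(unique_patients)))
    test_patients = set(unique_patients[:n_test])
    train_idx = [i for i, p in enumerate(patient_ids) if p not in test_patients]
    test_idx = [i for i, p in enumerate(patient_ids) if p in test_patients]
    return train_idx, test_idx


def shared_patients(patient_ids, train_idx, test_idx):
    """Return the patients that contribute slices to both sets."""
    train_patients = {patient_ids[i] for i in train_idx}
    test_patients = {patient_ids[i] for i in test_idx}
    return train_patients & test_patients


# Hypothetical toy data: 10 patients with 5 slices each.
patient_ids = [f"patient_{p}" for p in range(10) for _ in range(5)]

# Slice-level split: shuffle the slices themselves, ignoring patient identity.
# Slices of one patient can then end up on both sides of the split.
indices = list(range(len(patient_ids)))
random.Random(0).shuffle(indices)
leaky_test, leaky_train = indices[:10], indices[10:]
print("slice-level split, patients in both sets:",
      len(shared_patients(patient_ids, leaky_train, leaky_test)))

# Patient-level split: shuffle the patients, then assign all their slices.
train_idx, test_idx = split_by_patient(patient_ids)
print("patient-level split, patients in both sets:",
      len(shared_patients(patient_ids, train_idx, test_idx)))
```

With a patient-level split the overlap is empty by construction, so test performance reflects generalization to unseen patients rather than recognition of patients already seen during training.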

In the letter, we explain the reasons for our concerns and support our claim with an experiment similar to the one reported in the paper.

Unfortunately, the letter is not available under open access. However, the preprint is freely available here.