The Tower of London - Freiburg version (TOL-F) was developed in three parallel-test versions (A, B, and C) that differ only in their physical appearance, through interchanged ball colors, but not in their cognitive demands. We examined whether the test-retest reliability of an identical problem set differs from the parallel test-retest reliability of a structurally identical problem set with a marginally different physical appearance. Reliabilities were assessed in two samples of young adults over a 1-week interval. In the parallel test-retest sample (n = 93; 49 female), half of the participants completed version A in the first session and version B in the second session, while the other half started with version B and continued with version A. In the identical test-retest sample (n = 86; 48 female), half of the participants completed version A in both sessions, while the other half did the same with version B. For overall planning accuracy, intraclass correlation coefficients for absolute agreement were r = .501 for the parallel test-retest sample and r = .605 for the identical test-retest sample, with Pearson correlations of r = .559 and r = .708, respectively. Greatest lower bound estimates of reliability were adequate to high in the two samples (ranging between .765 and .854), confirming previous studies. Although the TOL-F yielded only moderate intraclass correlations for absolute agreement, its psychometric indices were among the highest reported for repeated assessments with TOL tests.
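
In both samples the intraclass correlation for absolute agreement is lower than the Pearson correlation. A minimal sketch of one general reason this can happen is given below: an absolute-agreement ICC penalizes a systematic shift between sessions, whereas the Pearson correlation does not. The scores are invented for illustration and are not the study's data, the function names are hypothetical, and the ICC form used, a two-way, single-measure, absolute-agreement coefficient often written ICC(A,1), is an assumption, since the abstract does not state which variant was computed.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation between two score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def icc_absolute_agreement(x, y):
    """ICC(A,1): two-way model, single measure, absolute agreement.

    Assumed variant (not confirmed by the source), for k = 2 sessions.
    """
    n, k = len(x), 2
    grand = (sum(x) + sum(y)) / (n * k)
    row_means = [(a + b) / k for a, b in zip(x, y)]   # per participant
    col_means = [sum(x) / n, sum(y) / n]              # per session

    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((v - grand) ** 2 for v in list(x) + list(y))
    ss_err = ss_total - ss_rows - ss_cols

    msr = ss_rows / (n - 1)              # between-participant mean square
    msc = ss_cols / (k - 1)              # between-session mean square
    mse = ss_err / ((n - 1) * (k - 1))   # residual mean square
    return (msr - mse) / (msr + (k - 1) * mse + (k / n) * (msc - mse))


# Invented scores: every participant gains exactly 3 points at retest.
session1 = [10, 12, 14, 16, 18]
session2 = [s + 3 for s in session1]

r = pearson_r(session1, session2)
icc = icc_absolute_agreement(session1, session2)
print(f"Pearson r = {r:.3f}, ICC(A,1) = {icc:.3f}")  # r = 1.000, ICC = 0.690
```

With a constant 3-point gain the rank order and spacing of participants are perfectly preserved, so the Pearson correlation is 1, but the session means differ, so the absolute-agreement ICC drops to about .69. Whether a mean shift of this kind accounts for the gap in the reported coefficients cannot be determined from the abstract alone.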