Spin crossover (SCO) plays a major role in biochemistry, catalysis, materials, and emerging technologies such as molecular electronics and sensors, and thus accurate prediction and design of SCO systems is of high priority. However, the main tool for this purpose, density functional theory (DFT), is very sensitive to the applied methodology. The most abundant SCO systems are Fe(II) and Fe(III) systems. Even with good average agreement, a functional may be significantly more accurate for either Fe(II) or Fe(III) systems, preventing balanced study of SCO candidates of both types. The present work investigates DFT's performance for well-known Fe(II) and Fe(III) SCO complexes, using various design types and customized versions of GGA, hybrid, meta-GGA, meta-hybrid, double-hybrid, and long-range-corrected hybrid functionals. We explore the limits of DFT performance and identify proficient Fe(II)-Fe(III)-balanced functionals. We identify and quantify remarkable differences in the DFT description of Fe(II) and Fe(III) systems. Most functionals become more accurate once Hartree-Fock exchange is adjusted to 10-17%, regardless of functional type. However, this typically introduces a clear Fe(II)-Fe(III) bias. The most accurate functionals, with mean absolute errors (MAE) <10 kJ/mol, are CAMB3LYP-17, B3LYP*, and B97-15 with 15-17% Hartree-Fock exchange, closely followed by CAMB3LYP and CAMB3LYP-15, OPBE, rPBE-10, and B3P86-15. While GGA functionals display a small Fe(II)-Fe(III) bias, they are generally inaccurate, with the exception of the O exchange functional. Hybrid functionals (including B2PLYP double hybrids and meta hybrids) tend to favor the high-spin (HS) state too strongly in Fe(II) relative to Fe(III), which matters in the many studies where the oxidation state of iron can vary, e.g., rational SCO design and studies of catalytic processes involving iron. The only functional with a combined bias <5 kJ/mol and a decent MAE (15 kJ/mol) is our customized PBE0-12 functional.
Alternatively, one has to sacrifice Fe(II)-Fe(III) balance and use the best functional for each group separately. We also investigated the precision (measured as the standard deviation of errors) and show that realistic targets for iron SCO are 10 kJ/mol for accuracy and 5 kJ/mol for precision, a limit that DFT is unlikely to break in the near future. Importantly, all four types of functional behavior (accurate/precise, accurate/imprecise, inaccurate/precise, inaccurate/imprecise) are observed. More generally, our work illustrates the importance not only of overall accuracy but also of balanced accuracy across the systems likely to occur together in a given context.
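
The error measures used above can be illustrated with a minimal sketch. It assumes that each error is a computed minus reference spin-state energy gap in kJ/mol, and that the Fe(II)-Fe(III) bias can be approximated as the difference between the mean signed errors of the two groups; both conventions, the function names, and the numerical values are hypothetical and chosen for illustration only, not taken from the study. The 10 kJ/mol (MAE) and 5 kJ/mol (standard deviation) thresholds are the targets stated above.

```python
from statistics import mean, stdev


def error_stats(errors):
    """Return (mean signed error, MAE, sample standard deviation) in kJ/mol."""
    mse = mean(errors)
    mae = mean([abs(e) for e in errors])
    sd = stdev(errors)  # precision: spread of the errors around their mean
    return mse, mae, sd


def classify(errors, mae_target=10.0, sd_target=5.0):
    """Label a functional as accurate/inaccurate and precise/imprecise."""
    _, mae, sd = error_stats(errors)
    accuracy = "accurate" if mae < mae_target else "inaccurate"
    precision = "precise" if sd < sd_target else "imprecise"
    return f"{accuracy}/{precision}"


def oxidation_state_bias(fe2_errors, fe3_errors):
    """Assumed bias definition: mean signed error of Fe(II) minus that of Fe(III)."""
    return mean(fe2_errors) - mean(fe3_errors)


# Hypothetical errors (kJ/mol) for one functional; not data from the study.
fe2 = [-12.0, -8.0, -10.0, -14.0]
fe3 = [2.0, -2.0, 4.0, 0.0]

print(classify(fe2))                   # label for the Fe(II) set
print(classify(fe3))                   # label for the Fe(III) set
print(oxidation_state_bias(fe2, fe3))  # negative: Fe(II) gaps lie below Fe(III) gaps
```

In this made-up case the functional is precise for both groups but accurate only for Fe(III), so its pooled MAE would hide a sizeable Fe(II)-Fe(III) bias; this is the kind of imbalance that an overall error measure alone cannot reveal.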