Balancing tests are diagnostics for propensity score methods, a widely used non-experimental approach in the evaluation literature. They provide useful information on whether plausible counterfactuals have been created. Multiple balancing tests exist in the literature, but it is unclear which is the most useful. This article highlights the poor size properties of commonly employed balancing tests and attempts to shed some light on the link between balancing test results and the bias of the evaluation estimator. The simulation results suggest that, in scenarios where the conditional independence assumption holds, a permutation version of the balancing test described in Dehejia and Wahba (Rev Econ Stat 84:151–161, 2002) can be useful in applied work. The proposed test has good size properties. In addition, it appears to have good power for detecting a misspecified link function and some power for detecting the omission of relevant non-linear terms involving variables that are included at a lower order.
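To make the idea of a permutation balancing test concrete, the following is a minimal sketch and not necessarily the procedure used in this article. It assumes that propensity scores have already been estimated, that observations are grouped into quantile strata of the score, and that the test statistic is the largest absolute within-stratum t-statistic across covariates. The function names (`max_abs_t`, `permutation_balance_test`), the number of strata, and the choice of statistic are all illustrative assumptions. Treatment labels are permuted within strata to approximate the null distribution of the statistic under balance.

```python
import numpy as np


def max_abs_t(x, d, strata):
    """Largest absolute two-sample t-statistic over strata and covariates.

    x: (n, k) float array of covariates, d: (n,) 0/1 treatment indicator,
    strata: (n,) integer stratum labels. The statistic is an assumed choice.
    """
    best = 0.0
    for s in np.unique(strata):
        in_s = strata == s
        x1, x0 = x[in_s & (d == 1)], x[in_s & (d == 0)]
        if len(x1) < 2 or len(x0) < 2:
            continue  # skip strata too thin to compare
        se = np.sqrt(x1.var(axis=0, ddof=1) / len(x1)
                     + x0.var(axis=0, ddof=1) / len(x0))
        diff = np.abs(x1.mean(axis=0) - x0.mean(axis=0))
        # Covariates with zero standard error contribute a statistic of 0.
        t = np.divide(diff, se, out=np.zeros_like(diff), where=se > 0)
        best = max(best, float(t.max()))
    return best


def permutation_balance_test(x, d, pscore, n_strata=5, n_perm=999, seed=0):
    """Return (observed statistic, permutation p-value).

    Hypothetical helper: strata are quantile bins of the estimated
    propensity score, and labels are shuffled within each stratum.
    """
    rng = np.random.default_rng(seed)
    # Interior quantile cut points define n_strata bins of the score.
    edges = np.quantile(pscore, np.linspace(0, 1, n_strata + 1)[1:-1])
    strata = np.searchsorted(edges, pscore, side="right")
    observed = max_abs_t(x, d, strata)

    exceed = 0
    for _ in range(n_perm):
        d_perm = d.copy()
        for s in np.unique(strata):
            idx = np.flatnonzero(strata == s)
            d_perm[idx] = rng.permutation(d[idx])
        exceed += max_abs_t(x, d_perm, strata) >= observed
    # Add-one correction so the p-value is never exactly zero.
    return observed, (1 + exceed) / (1 + n_perm)
```

A small p-value from `permutation_balance_test` would be read as evidence that the covariates are not balanced within strata of the estimated score. Using the maximum statistic gives a single joint test rather than many separate t-tests, which is one way to sidestep the multiple-testing problem, though other statistics could be substituted.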