Comparing trusts with one another can be useful for performance monitoring and quality improvement. Three scoring approaches were examined to determine how reliably each differentiated between trusts’ aggregated patient experience results. National survey data for a range of questions were analysed using generalizability theory applied to three scores: the Picker ‘problem score’, the partial credit scoring system used for benchmarking by the Care Quality Commission, and a ‘bottom box’ score like that used in the Care Quality Commission’s Quality and Risk Profiles. Variance estimates obtained from multilevel regression models (both with and without case-mix adjustment) were used to calculate trust-level generalizability coefficients. The problem score and partial credit approaches produced similarly high levels of reliability, supporting the use of both methods for comparing trusts’ performance and guiding service improvement, while the bottom box approach was less reliable. The meaning attached to scores needs to be considered alongside reliability when choosing a scoring approach.
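
The step from variance estimates to a trust-level generalizability coefficient can be illustrated with a minimal sketch. It assumes a simple two-level design (patients nested within trusts) and the same number of respondents in every trust, neither of which is stated in the text above. Under that assumption the coefficient is the between-trust variance divided by the sum of the between-trust variance and the residual variance of a trust mean. The function name, parameter names and input values are hypothetical and are not taken from the study.

```python
def trust_level_g_coefficient(var_trust: float, var_residual: float, n_per_trust: float) -> float:
    """Generalizability coefficient for a trust mean based on n_per_trust patients.

    Assumes patients nested within trusts and equal respondent numbers per trust
    (an illustrative simplification, not the study's stated method).
    """
    if var_trust < 0 or var_residual < 0:
        raise ValueError("variance components must be non-negative")
    if n_per_trust <= 0:
        raise ValueError("n_per_trust must be positive")
    # Error variance of a trust mean shrinks as more patients respond.
    error_variance = var_residual / n_per_trust
    total = var_trust + error_variance
    if total == 0:
        raise ValueError("total variance is zero; coefficient is undefined")
    return var_trust / total


# Hypothetical variance components, chosen only to show the calculation.
g_many = trust_level_g_coefficient(var_trust=0.02, var_residual=0.98, n_per_trust=400)
g_one = trust_level_g_coefficient(var_trust=0.02, var_residual=0.98, n_per_trust=1)
```

With a single respondent per trust the coefficient reduces to the intraclass correlation. It rises towards 1 as the number of respondents grows, so a score can have a small between-trust variance share and still separate trusts reliably in a large survey.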