Computational science's march toward the petascale demands innovations in analysis and visualization of the resulting datasets. As scientists generate terabytes and petabytes of data, it is insufficient to measure the performance of visual analysis algorithms by rendering speed alone, because performance is dominated by data movement. We take a systemwide view in analyzing the performance of software volume rendering on the IBM Blue Gene/P (BG/P) at over 10,000 cores by examining the relative costs of the I/O, rendering, and compositing portions of the volume rendering algorithm. This examination uncovers room for improvement in data input, load balancing, memory usage, image compositing, and image output. We present four improvements to the basic algorithm to address these bottlenecks. We show the benefit of an alternative rendering distribution scheme that improves load balance, and we show how to scale memory usage so that large data and image sizes do not overload system memory. To improve compositing, we experiment with a hybrid MPI-multithread programming model, and to mitigate the high cost of I/O, we implement multiple parallel pipelines that partially hide the I/O cost when rendering many time steps. Measuring the benefits of these techniques at scale reinforces the conclusion that BG/P is an effective platform for volume rendering of large datasets and that our volume rendering algorithm, enhanced by the techniques presented here, scales to large problem and system sizes.