For perfect sampling, as long as there's no aliasing, increasing the sampling rate does not improve matters. Note that perfect sampling is not a real thing.
For perfect sampling with quantization, the quantization has the effect of adding noise into the signal. If you're lucky, the added noise appears to be random.
If you assume that the quantization noise is independent from sample to sample and is uniformly distributed over one LSB, then you can model the sampling process as a bunch of samples, each of which is corrupted by noise with a variance of \$\sigma^2=\frac{1}{12}\$ in units of LSB squared (if I'm getting my math right). That variance does not change with sampling rate. However, because each sample is independent, the spectral density of the noise does change with sampling rate -- the total noise power stays the same, but with faster sampling it's spread out over a wider frequency band.
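To sanity-check that \$\frac{1}{12}\$ figure, here is a minimal sketch that pushes a "busy" input through an ideal rounding quantizer and measures the error variance. The quantizer, the input range, and the sample count are all assumptions made for illustration (a step of exactly 1 LSB, an input spread over many steps); it is not a model of any particular ADC.

```python
import random

def quantize(x: float) -> float:
    """Ideal rounding quantizer with a step of 1 LSB (an assumed model)."""
    return float(round(x))

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical input: values spread over many quantization steps, so the
# error is (approximately) uniform and independent from sample to sample.
samples = [rng.uniform(0.0, 1000.0) for _ in range(100_000)]
errors = [quantize(x) - x for x in samples]

mean_error = sum(errors) / len(errors)
noise_variance = sum((e - mean_error) ** 2 for e in errors) / len(errors)

print(f"measured variance: {noise_variance:.4f} LSB^2")
print(f"predicted 1/12:    {1 / 12:.4f} LSB^2")
```

The measured value should land close to the prediction, and nothing in the calculation depends on how fast the samples were taken.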
Now go back -- I made the implicit assumption that the input signal was band-limited, because there's a sampling rate that causes no aliasing. This means that when we sample faster than that rate, our signal occupies only part of the band below the new Nyquist frequency, while the quantization noise is spread over all of it. That, in turn, means that we can low-pass filter the sampled signal without hurting the part we care about. Depending on how you want to look at it, that filter either rejects some of the noise whose spectrum is now spread out by the faster sampling, or it averages out the noise and reduces its amplitude (those statements are equivalent, by the way -- one is the frequency domain explanation, the other is the time domain explanation).
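The time-domain version of that argument can be sketched in a few lines. This assumes the noise model from above (independent, uniform, variance \$\frac{1}{12}\$) instead of running a real quantizer, and it uses a plain block average as a crude stand-in for a proper low-pass filter. The oversampling factor of 16 is an arbitrary choice.

```python
import random

def variance(values):
    """Population variance of a list of floats."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

rng = random.Random(1)  # fixed seed so the run is repeatable

OVERSAMPLING = 16     # assumed oversampling factor
NUM_OUTPUTS = 20_000  # number of averaged output samples

# Assumed noise model: independent, uniform over one LSB (variance 1/12).
noise = [rng.uniform(-0.5, 0.5) for _ in range(OVERSAMPLING * NUM_OUTPUTS)]

# Average each block of OVERSAMPLING samples down to one output sample.
averaged = [
    sum(noise[i:i + OVERSAMPLING]) / OVERSAMPLING
    for i in range(0, len(noise), OVERSAMPLING)
]

raw_variance = variance(noise)
averaged_variance = variance(averaged)
improvement = raw_variance / averaged_variance

print(f"raw noise variance:      {raw_variance:.5f}")
print(f"averaged noise variance: {averaged_variance:.5f}")
print(f"improvement factor:      {improvement:.1f} (expect about {OVERSAMPLING})")
```

With independent noise, averaging N samples cuts the noise variance by about a factor of N, which is the same thing as saying the filter keeps only 1/N of the band the noise is spread over.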
So for the right signal, an ADC with quantization noise can perform better by "oversampling". Note, however, that for the wrong signal, you're out of luck. In this case a wrong signal is one where the quantization noise is not independent from sample to sample. An easy example of this is a perfectly constant signal that falls on the edge of a quantization step -- there's an error, but it's also constant.
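Here is a small sketch of that failure case. The input value of 0.4 LSB is a made-up number chosen to sit just short of the step edge at 0.5 LSB, and the quantizer is an assumed ideal rounding one.

```python
import math

def quantize(x: float) -> int:
    """Ideal rounding quantizer with a step of 1 LSB (an assumed model)."""
    return math.floor(x + 0.5)

TRUE_VALUE = 0.4      # hypothetical constant input, in LSBs
NUM_SAMPLES = 10_000  # "oversample" as much as you like

# With no noise on the input, every sample quantizes to the same code.
codes = [quantize(TRUE_VALUE) for _ in range(NUM_SAMPLES)]

single_error = codes[0] - TRUE_VALUE
averaged_error = sum(codes) / NUM_SAMPLES - TRUE_VALUE

print(f"error of one sample:     {single_error:+.3f} LSB")
print(f"error after averaging:   {averaged_error:+.3f} LSB")
```

Every sample carries the same error, so the average carries it too. No amount of oversampling helps here.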
However, there's another process at play -- ADCs are noisy things. A typical 16-bit ADC will have several LSBs of noise, which is often just enough to swamp out the quantization noise. The downside of this is that the noise floor is greater; the upside is that the noise is pretty much guaranteed to be independent from sample to sample. In this case, you can always improve things by oversampling.
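To see the effect of that noise, here is a sketch using a constant input that sits between two codes, this time with noise added ahead of the quantizer. The Gaussian noise model and its 2 LSB RMS level are assumptions standing in for "several LSBs of noise"; real ADC noise will differ in detail.

```python
import math
import random

def quantize(x: float) -> int:
    """Ideal rounding quantizer with a step of 1 LSB (an assumed model)."""
    return math.floor(x + 0.5)

rng = random.Random(2)  # fixed seed so the run is repeatable

TRUE_VALUE = 0.4       # hypothetical constant input, in LSBs
NOISE_RMS = 2.0        # assumed ADC noise level, in LSBs
NUM_SAMPLES = 100_000

# Each sample sees the input plus independent Gaussian noise.
codes = [quantize(TRUE_VALUE + rng.gauss(0.0, NOISE_RMS))
         for _ in range(NUM_SAMPLES)]

# RMS error of individual samples versus error of the long-run average.
single_error_rms = math.sqrt(
    sum((c - TRUE_VALUE) ** 2 for c in codes) / NUM_SAMPLES
)
averaged_error = sum(codes) / NUM_SAMPLES - TRUE_VALUE

print(f"RMS error of one sample: {single_error_rms:.3f} LSB")
print(f"error after averaging:   {averaged_error:+.4f} LSB")
```

Individual samples are worse than in the noiseless case, but the average now converges on the true value, including the fraction of an LSB that a noiseless quantizer could never resolve.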