I think both WAV samples and physical modelling (as we know it) are evolutionary dead-ends. In my opinion, the future of sampling will be neural networks. There are already several absolutely stunning neural networks which can 'hallucinate' images and even videos (I recommend following the YouTube channel 'Two Minute Papers'); it is only a matter of time before someone feeds instrumental performances and MIDI files into a neural network and gets "virtual performances" out.
The drawback to WAV samples is indeed the size. It has simply become too uneconomical to produce larger and larger libraries. Yes, there are clever techniques: most developers use wholetone sampling to halve the recording time, and there are much more complicated tricks like "tail stitching" and so on. Musicians and studio time are by far the largest expense, with sample cutting perhaps the second largest. Once those steps are done, everything else in the development process is merely a matter of time and resources. WAV samples on their own are, of course, flawless representations of an individual performance of a note; if pieces were nothing more than a single note, they would be perfect (indeed, even the most basic spiccatos and pizzicatos can sound fantastic given the right treatment and a few RR). The breakdown occurs when many notes are needed, or when samples must be crossfaded or combined. Numerous things can go wrong here and kill the playability and realism of sample-based instruments, and not just in terms of programming: the realism of the instrument relies entirely on the user's ability to understand and work with the samples to create a realistic sound. Worse still, approaching true realism, a nearly impossible task, requires a near-infinite number of samples, and thus enormous space, time and money.
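To make the crossfade problem concrete, here's a rough Python sketch (the file names are hypothetical; it assumes two phase-aligned mono 16-bit WAV layers at the same sample rate) of the equal-power crossfade a sampler performs between dynamic layers:

```python
# Rough sketch: equal-power crossfade between two dynamic layers.
# File names are hypothetical; assumes mono 16-bit WAVs at the same rate.
import numpy as np
from scipy.io import wavfile

rate, soft = wavfile.read("violin_C4_p.wav")  # piano layer
_, loud = wavfile.read("violin_C4_f.wav")     # forte layer

n = min(len(soft), len(loud))
soft = soft[:n].astype(np.float64)
loud = loud[:n].astype(np.float64)

# Equal-power curves keep perceived loudness roughly constant as the
# crossfade position x moves from 0.0 (soft) to 1.0 (loud).
x = 0.5  # in a real sampler this would track CC1 / the mod wheel
mix = np.cos(x * np.pi / 2) * soft + np.sin(x * np.pi / 2) * loud

# If the layers aren't phase-aligned, this sum comb-filters -- one of
# the many ways a crossfade can kill realism.
wavfile.write("violin_C4_mix.wav", rate,
              np.clip(mix, -32768, 32767).astype(np.int16))
```

Even this trivial two-layer case hints at why developers obsess over phase alignment between dynamic layers.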
The drawback to modelling or synthesis, imho, is that no model can truly recreate a real sound exactly. The problem is that the analog world is too messy to be fully captured in algorithms. Yes, one can have the right mix of overtones, at the correct phase relations to each other, with the correct per-overtone volume envelope, all things which can be uncovered with careful FFT analysis, but no matter what you do it won't sound real, because humans, and the instruments we play, are extremely imprecise. Even with real sampled noises and extensive, clever randomization, you can still end up in the uncanny valley. Worse still, such models tend to rely heavily on user input to behave correctly and, like purely sample-based instruments, will only 'perform' a part as well as the user is capable of. In most modelling instances I've seen, samples are used as a basis anyway, such as in the Technics WSA1 modelling "synth", which actually relies on a ROM of waveforms and some very clever (for the 90s) programming. The benefit? Much smaller memory/space requirements, at the cost of higher CPU use, and potentially less realism than the much more massive raw sample-based alternatives.
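As a toy illustration of why that FFT-analysis approach sounds sterile without deliberate imprecision, here's a little additive resynthesis sketch (the partial amplitudes are invented for illustration, not a real instrument analysis):

```python
# Toy additive resynthesis: rebuild a note from measured partials.
# The (harmonic, amplitude) pairs below are invented, not a real analysis.
import numpy as np
from scipy.io import wavfile

rate, dur, f0 = 44100, 2.0, 261.63  # middle C
t = np.linspace(0, dur, int(rate * dur), endpoint=False)
partials = [(1, 1.00), (2, 0.50), (3, 0.25), (4, 0.12)]

rng = np.random.default_rng()
out = np.zeros_like(t)
for k, amp in partials:
    # Without this tiny random detune and phase offset, every render is
    # bit-identical and machine-perfect -- precisely the sterility that
    # lands models in the uncanny valley.
    detune = 1 + rng.normal(0, 0.001)
    phase = rng.uniform(0, 2 * np.pi)
    out += amp * np.sin(2 * np.pi * f0 * k * detune * t + phase)

out *= np.exp(-3 * t)        # crude decay envelope
out /= np.max(np.abs(out))
wavfile.write("toy_additive_C4.wav", rate, (out * 32767).astype(np.int16))
```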
Neural networks, once trained, require very little storage space. A singing neural network, for example, can be trained on as little as 30 minutes of source material, over the course of a few hours on a powerful graphics card, and once trained it does not need that material included. By comparison, a sample library requires dozens if not hundreds of hours of recordings, which must be manually cut and processed, and all of which must ship in the final product. Neural networks will also accurately transition between notes, needing no recorded legato transitions or scripting. They can even perform with intelligent phrasing and shaping of individual notes, something which until now has always required an experienced user. Most powerful of all, a neural network should never perform the exact same way twice... which is also arguably its first weakness. Likewise, they are prone (at least at this time) to generating artifacts, and when they hit 'edge cases' not covered in the training data, they can produce unwanted or incorrect sounds.
The most promising future for me is the benefit to composers who prefer working in notation software. No longer will sequencing CCs and manually performing lines be essential skills. Instead, simply mark 'pizzicato' here, 'forte' there, 'col legno' over there, etc., and the neural network will interpret it and truly perform the part.
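Purely as a hand-wavy sketch of how those marks might reach the network (none of this is a real library; it only shows the idea of score marks becoming conditioning tokens):

```python
# Hypothetical sketch: notation marks as conditioning tokens for a
# notation-to-performance network. Vocabulary and offsets are made up.
from dataclasses import dataclass

VOCAB = {"pizzicato": 0, "arco": 1, "col_legno": 2, "forte": 3, "piano": 4}

@dataclass
class Note:
    pitch: int              # MIDI note number
    marks: tuple[str, ...]  # articulations/dynamics from the score

def encode(notes: list[Note]) -> list[int]:
    """Flatten score events into the integer tokens a model would consume."""
    tokens = []
    for n in notes:
        tokens.extend(VOCAB[m] for m in n.marks)
        tokens.append(128 + n.pitch)  # offset pitch tokens past the marks
    return tokens

# "pizzicato forte" on middle C:
print(encode([Note(60, ("pizzicato", "forte"))]))  # -> [0, 3, 188]
```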
The other promising bit is the idea that any user will one day be able to record, say, 10-30 minutes of playing a few pieces to a click track, then import that into their "sampler of the future" and get a real, performing instrument. One could individually sample every member of a string section in only a few hours and get a section which never repeats itself.
Even better, it may also one day be possible to train a network on, say, a close microphone position, and have it automatically 'hallucinate' further mic perspectives. More powerful still, such perspectives could be modelled and moved around, with the neural network also generating the reverb/room tone.
Listen to this example of a neural network which not only makes up songs, but sings lyrics too:
https://openai.com/blog/jukebox/
WaveNet is one of the more remarkable speech synthesis models, much better than any previous method (see the examples), with 'parametric' roughly corresponding to the "modelling" approach and 'concatenative' roughly corresponding to the "sampling" approach:
https://deepmind.com/blog/article/wavene...-raw-audio
(it should be noted that WaveNet generates waveforms sample by sample, based, in this case, only on text input)
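To make "sample by sample" concrete, here's the shape of the autoregressive loop (`model` is a stand-in for the trained network, which in the real WaveNet is a stack of dilated causal convolutions over mu-law-quantized audio):

```python
# Shape of WaveNet-style autoregressive generation: each new sample is
# drawn from a distribution conditioned on the previous samples (plus,
# in the speech case, text-derived features). `model` is a stand-in.
import numpy as np

def generate(model, n_samples, receptive_field=1024):
    audio = np.zeros(n_samples)          # seed the context with silence
    for i in range(receptive_field, n_samples):
        context = audio[i - receptive_field:i]
        probs = model(context)           # distribution over 256 mu-law levels
        level = np.random.choice(256, p=probs)
        audio[i] = level / 127.5 - 1.0   # map back to roughly [-1, 1]
    return audio
```

Generating one sample at a time like this is also why the original WaveNet was notoriously slow to sample from.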
I don't think modelling or sampling are going anywhere, at least for the next 5-10 years, and they will probably stick around for decades more as a sort of 'legacy' thing, the same way FM synthesis is still used today... or for the same reason people buy candles when light bulbs are superior in every way. However, I am fairly convinced that eventually the job of sample library developer will go extinct.
Edit: Perhaps the scariest thing is that this is 100% possible to do TODAY. It's just that no one who makes neural networks has been interested in the problem, and no one in sampling has been interested in neural networks (at least, no one big).