Scoring Central

The State of Orchestral Sample Libraries
With a little wink to Sam's sci-fi thread, here's a thought I had...

If a VO composer version of Kyle Reese had traveled back from the present day to, say, 2005 or even 2010, past me would have been baffled and even a little horrified to learn that in 2020, orchestral composing tools are STILL based on wave samples, only exponentially larger and more detailed. In the late 00's I was pretty sure this trend of brute-forcing instruments on computers using samples would be a passing thing, to be replaced by physical modeling or some kind of hybrid technology. But nope. We're still doing the Mellotron thing. Record stuff to the nth degree, make a machine play it back.

Seriously, are there no alternatives in sight? With sample libraries becoming increasingly complex and resource-demanding, it feels like they're "tools for audio playback" rather than "instruments", given the lag and latencies people seem to take for granted nowadays. It's a depressing development (if it can even be said to be that) and I find myself longing for something more clever, more efficient, more elegant.

Simply put, even the most advanced orchestral libs around today are based on a 1980's technological paradigm. And people are still buying into it, not looking for alternatives.

Why?
There are some alternatives - two companies called Sample Modeling and Audio Modeling, for example.

The problem, though, with a modeled instrument, as far as I understand it, is that it comes with exactly zero human expression built in. So it requires quite a lot of time and effort to make it sound like a performance. Samples, for all their flaws, are ultimately produced by humans - and, yes, the result is imperfect, but there is still human expression there.

There is also Aaron Venture's Infinite series - it comprises Brass and Woodwinds, though Strings and Percussion are apparently coming later this year. It's sampled, but the instruments have an insane amount of scripting going on, so they have the flexibility of modeled instruments. Not only that, but the way they're programmed means that there are no keyswitches involved. I own both instruments, and while they're not perfect - a little bit of the unique character of a performed or sampled instrument is still lost (I've heard Venture's approach called "pseudo-modeled") - I'm extremely pleased with the flexibility, playability, and tone of both, and eagerly await the strings. (There are, of course, demos on the website.)

Pedantry aside, though, I do agree with you.

As for why, I could not say. Perhaps it's the time it takes to arrive at a good modeled instrument, and then the time after that to arrive at an instrument that gets the tone of sampled instruments (and Venture's are hardly the most characterful samples, to be fair), combines it with the flexibility and low memory usage of the modeled instrument, and then on top of that devises and programs a playback system that lets you control and "perform" the instrument without needing to memorize a bunch of keyswitches or tweak it for hours and hours and hours. (Obviously, the more tweaking the better, and this is true for sampled instruments as much as for modeled instruments or Infinite series instruments, but I would think that an even more obsessive level of tweaking is needed for a modeled instrument.) That would be my best guess.

Edit: Found something Aaron Venture said about the instruments:
Aaron Venture Wrote: The physically-modeled vs sample-based discussion is... I don't think I could call Infinite (so far) physically-modeled. While it is a frank-ton of math and psycho-acoustics, it's still based on samples. So if that math and psycho-acoustics part is what you want to call "modeled", then sure.
I think wav samples and physical modelling (as we know it) are both evolutionary dead-ends. In my opinion, the future of sampling will be neural networks. There are already several absolutely stunning neural networks which can 'hallucinate' images and even videos (I recommend following the youtube channel 'Two Minute Papers'); it is only a matter of time before someone feeds instrumental performances and MIDI files into a neural network and gets "virtual performances" out.

The drawback to WAV samples is indeed the size. It has simply become too uneconomical to produce larger and larger libraries. Yes, there are clever techniques, such as the common practice of wholetone sampling to halve the recording time, or much more complicated ones like "tail stitching" and so on. Musicians and studio time are by far the largest expense, with sample cutting perhaps the second largest. Once those steps are done, everything else in the development process is merely a matter of time and resources. A WAV sample on its own is, of course, a flawless representation of an individual performance of a note; if pieces were nothing more than a single note, that would be perfect (indeed, even the most basic spiccatos and pizzicatos can sound fantastic given the right treatment and a few RR). The breakdown occurs when many notes are needed, or when samples must be crossfaded or combined. Here there are numerous things which can go wrong and kill the playability and realism of sample-based instruments, and not just in terms of programming; the realism of the instrument is fully reliant on the user's ability to understand and work with the samples to create a realistic sound. Worse still, approaching an appropriate level of realism - a task which is nearly impossible - requires a near-infinite number of samples, and therefore enormous amounts of space, time and money.
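
To make the wholetone trick concrete, here's a minimal sketch (in Python, with invented names, not any particular sampler's engine) of how a sampler might map incoming notes onto samples recorded a whole tone apart and repitch the gaps:

```python
# Minimal sketch of whole-tone sample mapping: only every second semitone
# is recorded, and in-between notes are played back from the nearest
# sample with a small pitch shift. Names here are illustrative only.

SAMPLED_NOTES = list(range(36, 97, 2))  # e.g. every whole tone from C2 to C7

def nearest_sample(midi_note: int) -> tuple[int, float]:
    """Return (sampled note, playback rate) for an incoming MIDI note."""
    root = min(SAMPLED_NOTES, key=lambda n: abs(n - midi_note))
    semitone_offset = midi_note - root
    playback_rate = 2.0 ** (semitone_offset / 12.0)  # equal-tempered repitch
    return root, playback_rate

if __name__ == "__main__":
    for note in (60, 61, 62):
        root, rate = nearest_sample(note)
        print(f"MIDI {note}: use sample {root}, rate {rate:.4f}")
```

Half the notes never get recorded at all, which is exactly where the "samples must be crossfaded or combined" problems start creeping in.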

The drawback to modelling or synthesis, imho, is that no model can truly recreate a real sound exactly. The problem is that the analog world is too messy to be fully represented in algorithms. Yes, one can have the right mix of overtones, at the correct phase relations to each other, with the correct per-overtone volume characteristic - all things which can be uncovered with careful FFT analysis - but no matter what you do, it eventually won't sound real, because humans, and the instruments we play, are extremely imprecise. Even incorporating real sampled noises and extensive use of clever noise and randomization, the chance of ending up in the Uncanny Valley still exists. Worse still, such models tend to rely heavily on user input to behave correctly and, like purely sample-based instruments, will only 'perform' a part as well as the user is capable of. In most modelling instances I've seen, samples are regularly used as a basis, such as in the Technics WSA1 modelling "synth", which actually relies on a ROM of waveforms and some very clever (for the 90's) programming. The benefit? Much smaller memory/space requirements, but higher CPU demands, and potentially not as realistic as their much more massive raw sample-based counterparts.
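
To illustrate why "the right overtones at the right levels" isn't enough on its own, here's a toy additive-synthesis sketch in Python/NumPy; the partial levels are invented, and the `jitter` term stands in for all the per-partial messiness a real player and instrument introduce:

```python
import numpy as np

SR = 44100

def additive_tone(freq, partial_amps, seconds=2.0, jitter=0.0, seed=0):
    """Sum of harmonics; `jitter` adds small random detune and level drift
    per partial - the 'messy analog' part that clean models tend to miss."""
    rng = np.random.default_rng(seed)
    t = np.arange(int(SR * seconds)) / SR
    out = np.zeros_like(t)
    for k, amp in enumerate(partial_amps, start=1):
        detune = 1.0 + jitter * rng.normal(0.0, 0.001)  # tiny per-partial detune
        drift = 1.0 + jitter * rng.normal(0.0, 0.05)    # per-partial level variance
        out += amp * drift * np.sin(2 * np.pi * freq * k * detune * t)
    return out / np.max(np.abs(out))

# Partial levels roughly like what an FFT of a sustained note might show (made up here).
static = additive_tone(220.0, [1.0, 0.5, 0.35, 0.2, 0.1], jitter=0.0)
humanised = additive_tone(220.0, [1.0, 0.5, 0.35, 0.2, 0.1], jitter=1.0)
```

The jitter-free version is essentially an organ pipe; and even the "humanised" one is only a crude stand-in for what a real bow, reed or embouchure does from note to note.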

Neural networks, once trained, require very little storage space. A singing neural network, for example, can be trained with as little as 30 minutes of source material, over the course of a few hours with a powerful graphics card, and once trained it does not need that material included. By comparison, a sample library requires dozens if not hundreds of hours of recordings, which must be manually cut and processed, and all of it must be included in the final product. Neural networks will also accurately transition between notes, without needing legato transitions or scripting. They can even perform with intelligent phrasing and shaping of individual notes, something which until now has always required an experienced user. Most powerful of all, a neural network should never perform the exact same way twice... which is also arguably its first weakness. Likewise, they are prone (at least at this time) to generating artifacts, and when they hit 'edge cases' which were not covered in the training data, they can sometimes produce unwanted or incorrect sounds.
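
For a feel of why the trained artifact is so small, here's a deliberately tiny PyTorch sketch: a toy network mapping per-note features to spectral frames. Everything in it (the feature layout, frame size, random stand-in data) is invented for illustration; real systems are vastly more sophisticated, but the point stands that what you ship is the weights, not the recordings:

```python
import torch
import torch.nn as nn

# Toy "MIDI features in, spectral frame out" model. Real systems (WaveNet-style
# or GAN/diffusion vocoders) are far more elaborate; this only shows the shape
# of the training loop and why the result is small: it's weights, not samples.
class ToyToneModel(nn.Module):
    def __init__(self, n_features=3, n_bins=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, n_bins),
        )

    def forward(self, x):
        return self.net(x)

model = ToyToneModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

# Stand-in training data: (pitch, velocity, position-in-note) -> spectral frame.
features = torch.rand(1024, 3)
targets = torch.rand(1024, 128)

for step in range(100):
    optimizer.zero_grad()
    loss = loss_fn(model(features), targets)
    loss.backward()
    optimizer.step()

n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params} parameters = roughly {n_params * 4 / 1e6:.1f} MB as float32")
```

Even scaled up a thousandfold, the weights stay in the hundreds-of-megabytes range, while the recordings that trained them can be thrown away.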

The most promising future for me is the benefit to composers who prefer working in notation software. No longer will sequencing CC's and manually performing lines be essential skills. Instead, simply mark 'pizzicato' here, 'forte' there, 'col legno' over there, etc. and the neural network will interpret it and truly perform the part.
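
In practice, those score markings would become conditioning inputs to the network. Here's a hypothetical sketch of how 'pizzicato', 'forte' and friends might be encoded alongside the notes - the vocabulary and event format are invented for illustration:

```python
# Hypothetical encoding of score markings as conditioning vectors for a
# generative model. The vocabulary and note format are invented here.
ARTICULATIONS = ["arco", "pizzicato", "col legno", "tremolo"]
DYNAMICS = ["pp", "p", "mp", "mf", "f", "ff"]

def encode_event(pitch: int, articulation: str, dynamic: str) -> list[float]:
    """One event = normalised pitch + one-hot articulation + one-hot dynamic."""
    art = [1.0 if a == articulation else 0.0 for a in ARTICULATIONS]
    dyn = [1.0 if d == dynamic else 0.0 for d in DYNAMICS]
    return [pitch / 127.0] + art + dyn

# 'Mark pizzicato here, forte there' becomes a conditioning sequence:
score = [(60, "pizzicato", "f"), (62, "pizzicato", "f"), (64, "arco", "p")]
conditioning = [encode_event(*event) for event in score]
```

The network, not the user, would then decide what a forte pizzicato actually sounds like in context.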

The other promising bit is the idea that any user will one day be able to record, say, 10-30 minutes of playing a few pieces to a click track, then import that into their "sampler of the future" and get a real, performing instrument. One could individually sample every member of a string section in only a few hours and get a section which never repeats itself.

Even better, it may one day be also possible to train a network off, say, a close microphone position, and it will automatically 'hallucinate' further mic perspectives. Even more powerful, such perspectives can be modelled and moved around, with the neural network also generating the reverb/room tone.
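
For reference, the conventional, non-neural way to "move" a close mic into a room is to convolve it with an impulse response measured at the desired position; a network hallucinating mic perspectives would have to learn an equivalent transformation plus the room tone itself. A rough sketch of the conventional version (toy signals, SciPy for the convolution):

```python
import numpy as np
from scipy.signal import fftconvolve

SR = 48000

def add_room(close_mic: np.ndarray, impulse_response: np.ndarray, wet=0.4):
    """Convolve a close-mic signal with a room impulse response and blend
    it back in - the classic way to fake a more distant perspective."""
    wet_signal = fftconvolve(close_mic, impulse_response)[: len(close_mic)]
    wet_signal /= np.max(np.abs(wet_signal)) + 1e-12
    return (1.0 - wet) * close_mic + wet * wet_signal

close = np.random.randn(SR)                                  # stand-in for a close-mic'd note
ir = np.exp(-np.linspace(0, 8, SR)) * np.random.randn(SR)    # toy decaying impulse response
distant_ish = add_room(close, ir, wet=0.5)
```

A learned model wouldn't need a measured impulse response at all, which is exactly what makes the idea so appealing.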

Listen to this example of a neural network which not only makes up songs, but sings lyrics too:
https://openai.com/blog/jukebox/

WaveNet is one of the more remarkable speech synthesis models - much better than any previous method (see the examples at the link), with 'parametric' roughly corresponding to the "modelling" approach and 'concatenative' roughly corresponding to the "sampling" approach:
https://deepmind.com/blog/article/wavene...-raw-audio

(It should be noted that WaveNet generates waveforms sample by sample, in this case based only on text input.)
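
To make "sample by sample" concrete: an autoregressive model predicts the next audio sample from the samples it has already produced (plus the conditioning, here text), appends it, and repeats, tens of thousands of times per second of audio. A toy sketch of that loop, with a stand-in predictor in place of a real WaveNet:

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_next_sample(history: np.ndarray) -> float:
    """Stand-in for the network: just a damped echo of the last sample plus
    noise. A real WaveNet outputs a distribution over 256 mu-law levels."""
    if len(history) == 0:
        return rng.normal(0.0, 0.1)
    return 0.99 * history[-1] + rng.normal(0.0, 0.01)

def generate(n_samples: int) -> np.ndarray:
    audio = np.zeros(0)
    for _ in range(n_samples):            # one prediction per output sample
        audio = np.append(audio, toy_next_sample(audio))
    return audio

clip = generate(16000)  # one second at 16 kHz, generated autoregressively
```

That per-sample loop is also why the original WaveNet was famously slow to run, and why so much follow-up work went into speeding generation up.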

I don't think modelling or sampling are going anywhere, at least for the next 5-10 years, and they will probably stick around for decades more as a sort of 'legacy' thing, the same way FM synthesis is still used today... or the same reason people buy candles when light bulbs are superior in every way. However, I am fairly convinced that eventually the job of sample library developer will become extinct.

Edit: Perhaps the scariest thing is that this is 100% possible to do TODAY. It's just that no one who makes neural networks has been interested in the problem, and no one in sampling has been interested in neural networks (at least, no one big).
I didn't think neural networks were "there" yet; I've seen the trippy AI-generated images of course, as well as the pictures of people who don't actually exist. But I assumed doing the same thing for audio was still a ways off. I believe bigcat posted some stuff here on the forum a year or two ago with orchestral samples generated by some Google AI (IIRC), and that wasn't exactly high fidelity.

Still, using neural networks would still require the use of samples, no? I mean, the AI can't possibly generate lifelike instruments in realtime on a regular PC, can it?

The stuff Bigcat posted were samples (of unknown provenance) used to train a neural network, not the output of it. The output can be of as high a quality as the training is capable of; it's just that using 32 kHz 8-bit mono samples or whatever is faster to train.
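
On the 8-bit point: WaveNet-style models typically compand audio to 256 mu-law levels, so the network only has to choose one of 256 classes per sample, which is much kinder to training than full 16-bit resolution and sounds better than linear 8-bit. A sketch of that companding step (standard mu-law formulas, toy test tone):

```python
import numpy as np

MU = 255  # 8-bit mu-law -> 256 discrete levels

def mu_law_encode(x: np.ndarray) -> np.ndarray:
    """Map audio in [-1, 1] to integer classes 0..255 (as used by WaveNet-style models)."""
    companded = np.sign(x) * np.log1p(MU * np.abs(x)) / np.log1p(MU)
    return ((companded + 1) / 2 * MU + 0.5).astype(np.int64)

def mu_law_decode(classes: np.ndarray) -> np.ndarray:
    """Inverse mapping back to floating-point audio."""
    companded = 2 * (classes.astype(np.float64) / MU) - 1
    return np.sign(companded) * ((1 + MU) ** np.abs(companded) - 1) / MU

audio = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)  # 1 s test tone
roundtrip = mu_law_decode(mu_law_encode(audio))
print(np.max(np.abs(audio - roundtrip)))  # small quantisation error
```

Nothing stops you from training the same kind of model on full-resolution audio later; the low-fi settings are a shortcut for research turnaround, not a ceiling on quality.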

Once the neural network is trained, the training data (e.g. a 30 minute clip of someone performing a few songs on said instrument, with corresponding MIDI reference files) is no longer needed and can be left behind. For example, I use a neural network based image upscaler for work; the dataset used to train it was gigabytes upon gigabytes, but the entire package now is only a few MB at most.

The big thing is, every year papers come out offering doublings or even order-of-magnitude improvements in processing time. Consider NVIDIA's new neural network upscaling, which allows a game to render at a lower resolution (e.g. 720p) and upscale each frame to a higher resolution (1080p, 1440p, 4K, etc.) in real time, saving tons of resources. By comparison, the neural network upscaler I use takes several minutes to process a single photo on my GTX 1060.

They're already doing this stuff with VIDEO, which has more dimensions and is orders of magnitude more intensive than audio. It's just a matter of people with the right expertise deciding to refine the technology in that direction, and given how saturated and expensive current sample library development is, I think any big company in this field worth their salt is going to be looking for that lifeline real soon; we can't just keep sampling ad infinitum like this, it is simply becoming uneconomical. Eventually someone will put out yet another giant orchestral library that no one wants or needs or can justify, and the market will probably crash.
I have listened to some great examples of productions with modeled instruments, and I think it is going to get a lot better as we research it further. The only problem I see now is the need to properly automate all these parameters yourself, but even this can be solved with scripting, which can be imported directly from the sampling world.
I totally agree about neural networks. To see where this is going, look at the advances in voice synthesis. Until very recently, voice synthesizers worked the same way instrument libraries do today. They recorded lots of samples of someone speaking all the different phonemes, and then strung them together with plausible attempts at blending between them. It worked, but it didn't sound very natural. It produced the canonical "computer speaking" stilted expression. And each voice required a huge library of samples to sound good, and it was very labor intensive to produce them.
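
Here's a caricature of that concatenative approach, with invented phoneme snippets standing in for a recorded voice bank - note how close it is to what a crossfaded legato patch does with note transitions:

```python
import numpy as np

SR = 22050
FADE = int(0.01 * SR)  # 10 ms crossfade at each seam

# Stand-ins for recorded phoneme snippets (names and contents invented).
phoneme_bank = {
    "HH": np.random.randn(int(0.08 * SR)) * 0.1,
    "EH": np.sin(2 * np.pi * 180 * np.arange(int(0.15 * SR)) / SR),
    "L":  np.sin(2 * np.pi * 140 * np.arange(int(0.10 * SR)) / SR),
    "OW": np.sin(2 * np.pi * 160 * np.arange(int(0.20 * SR)) / SR),
}

def concatenate(phonemes):
    """String phoneme recordings together with linear crossfades at the joins."""
    out = phoneme_bank[phonemes[0]].copy()
    fade_out = np.linspace(1.0, 0.0, FADE)
    fade_in = 1.0 - fade_out
    for p in phonemes[1:]:
        nxt = phoneme_bank[p].copy()
        out[-FADE:] = out[-FADE:] * fade_out + nxt[:FADE] * fade_in
        out = np.concatenate([out, nxt[FADE:]])
    return out

hello_ish = concatenate(["HH", "EH", "L", "OW"])
```

The seams are exactly where the "computer speaking" stiltedness comes from, and exactly what a generative model avoids by producing the whole utterance in one pass.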

Not anymore. Modern voice synthesizers use neural networks. They work so much better than the old ones. You could easily be tricked into thinking you were hearing a real person.

I'm kind of amazed this hasn't already taken over music synthesis. The same architectures used for spoken voice should work just as well for other instruments with only minor modifications.

I really look forward to the day that you can scroll through your GPS's voices and select, say, Stephen Fry. Or if you have a book, you could scroll through voices and voila! instant audiobook w/ your narrator of choice.

We're getting there, it seems! Here's the late Alec Guinness reading Lovecraft's "The Call of Cthulhu." It's definitely got its imperfections but is no less impressive for them.

https://www.youtube.com/watch?v=YU-Iwj4P...=emb_title
Wow. That's amazing.