Habr.com
Article by Alexei Ustinov
Yes, now, thanks to AI, anyone, even those who don’t sing, can sing perfectly in any language. How does it work and what can it lead to?
NVM (Neural Voice Model) - a neural model of a specific voice.
RVC (Retrieval-Based Voice Conversion) - voice conversion based on retrieval of matching voice features.
I do not pretend to cover the topic completely - it is vast and developing rapidly, with something new appearing every week or month. But I hope my experience will help those who are interested get into it faster.
One more thing. A professional is usually considered someone who has formally learned a skill and mastered it, or someone who earns money with it. But we all know people who sing beautifully yet never went to a music school, college or conservatory, never studied vocals and do not earn money by singing. Incidentally, the vocal department is the only one in the conservatory that does not require a music college diploma for admission, and the age limit is 35 (at least that was the case 20 years ago).
I have long been interested in voice synthesis, primarily for the purpose of creating vocals.
For the past 3-4 years I have been using online TTS services to voice the commentary in educational games. There used to be few Russian voices, and names like Svetlana and Nikolai suggested that a real artist, an announcer, had been brought in to record the bank. But a couple of years ago I noticed a strange thing on one of the sites: a certain Alisha Howard and Jack Bailey spoke English, Portuguese, Hindi, Russian and other languages! At the time I did not notice the "Neural Voice" label...
I was surprised that TTS technology was not being used to create vocals. After all, that only requires control over the pitch and duration of the vowels, and such parameters are available in TTS speech synthesizers - as an experiment, I even tried to stretch the vowels by typing "paaaapa iiii maaaaama".
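To make this idea concrete, here is a small sketch of how melody notes could be mapped onto standard SSML prosody tags, which many TTS engines accept (a minimal sketch: the syllables, note values and the assumed "normal" syllable length are invented for illustration, and tag support differs between services):

```python
# A sketch of singing via TTS: shift each syllable's pitch and stretch it
# to the note length using SSML <prosody>. The melody below is made up.

def note_to_ssml(syllable: str, semitones_from_base: float, duration_s: float) -> str:
    """Render one sung syllable as an SSML fragment."""
    # rate < 100% slows the syllable down, roughly stretching the vowel;
    # 0.3 s is an assumed "normal" spoken syllable length
    rate = max(10, int(100 * 0.3 / duration_s))
    return (f'<prosody pitch="{semitones_from_base:+.1f}st" rate="{rate}%">'
            f'{syllable}</prosody>')

melody = [("pa", 0, 0.6), ("pa", 2, 0.6), ("i", 4, 1.2), ("ma", 4, 0.6), ("ma", 2, 1.2)]
ssml = "<speak>" + " ".join(note_to_ssml(s, p, d) for s, p, d in melody) + "</speak>"
print(ssml)  # feed this to any SSML-capable TTS engine
```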
And so, a couple of months ago, I got answers to many of my questions thanks to McKinley Hibbits, an enthusiast working to preserve the history of audio technology. Having set himself the task of creating NVMs for Leon and the unofficially released BigAl, he has been collecting vocal examples of these banks from around the world. Since my experiments with them could also be found on the Internet, McKinley asked me to provide the original vocal tracks. I sent him everything I could find, as well as the "A Place in the Sun" CD. In further communication with McKinley I learned about RVC; moreover, he made an NVM of my voice, as well as vocal renders of 15 songs in 8 languages, before I myself began to understand how this kitchen works.
Vocoder
An electronic circuit or program built from a set of modules that essentially models the structure of the vocal tract and, with appropriate control, synthesizes speech or vocals. There are many examples of its use, most notably in Stevie Wonder's song "I Just Called to Say I Love You". (It is quite possible that the French application was built on the vocoder principle.)
In musical practice, synthesis is controlled by analyzing vocals in real time: information about formants, amplitude, tone and noise is extracted from the singer's voice. As a result, the vibration of the vocal cords is, in effect, replaced by a signal with a rich, often harmonic spectrum - for example, chords played on an organ.
The quality of the synthesized speech is low; the term "robot voice" most likely arose from how such voices are perceived.
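For those who want to see the structure in code, here is a minimal sketch of a classic channel vocoder (an illustration of the principle, not the circuit of any particular device): the voice controls the per-band envelopes, and a harmonically rich carrier, such as a sustained organ chord, supplies the spectrum.

```python
# Minimal channel vocoder: analysis filter bank + envelope followers on the
# voice, the same filter bank on the carrier, multiply and sum.
import numpy as np
from scipy.signal import butter, sosfilt

def band(sig, lo, hi, fs):
    """Band-pass one channel of the filter bank."""
    sos = butter(4, [lo, hi], btype="bandpass", fs=fs, output="sos")
    return sosfilt(sos, sig)

def envelope(sig, fs, cutoff=50.0):
    """Follow the amplitude envelope with rectification and a low-pass filter."""
    sos = butter(2, cutoff, btype="lowpass", fs=fs, output="sos")
    return np.maximum(sosfilt(sos, np.abs(sig)), 0.0)

def vocoder(voice, carrier, fs, n_bands=16, f_lo=80.0, f_hi=8000.0):
    n = min(len(voice), len(carrier))              # work on the common length
    voice, carrier = voice[:n], carrier[:n]
    edges = np.geomspace(f_lo, f_hi, n_bands + 1)  # log-spaced band edges
    out = np.zeros(n)
    for lo, hi in zip(edges[:-1], edges[1:]):
        env = envelope(band(voice, lo, hi, fs), fs)  # analysis of the voice
        out += band(carrier, lo, hi, fs) * env       # synthesis on the carrier
    return out / (np.max(np.abs(out)) + 1e-9)
```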
Vocaloid. The theory and practical testing were done in Spain (Pompeu Fabra University, Barcelona) in the early 2000s. The program was released by Yamaha (2004); voice banks were developed by several companies, in particular PowerFX.
The technology is based on concatenating recorded sound fragments, somewhat similar to wavetable synthesis and samplers. The artist records a huge number of phrases in different registers and at different volumes - according to Bill Bryant, ex-CEO of PowerFX, it is 60 pages of material. Sound engineers then process the recordings, build a bank of phonemes in spectral form and test the synthesis - in short, creating and debugging a new bank is a lot of work. The user enters a melody and lyrics in the piano roll, and the program generates a vocal track from them.
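As a toy illustration (this is not Vocaloid's actual file format), what the user effectively enters in the piano roll is a list of note events, each carrying a syllable, a pitch and a duration; the engine then assembles the track from the stored phoneme fragments:

```python
# A made-up representation of piano-roll input for a singing synthesizer.
from dataclasses import dataclass

@dataclass
class NoteEvent:
    lyric: str           # syllable to sing
    midi_pitch: int      # 60 = middle C
    start_beats: float
    length_beats: float

phrase = [
    NoteEvent("I",    67, 0.0, 0.5),
    NoteEvent("feel", 69, 0.5, 0.5),
    NoteEvent("good", 67, 1.0, 2.0),  # on a long note the engine loops the vowel spectrum
]
```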
Vocaloid's voice quality is much higher than a vocoder's. Still, all the banks I worked with have a slightly noisy character, similar to the sound of compressed formats such as low-bitrate mp3. The second noticeable drawback is that long vowels are overly static - the result of looping an averaged spectrum.
Although in Vocaloid you can draw a pitch curve and add vibrato, I lacked expressiveness and freedom in choosing vocal techniques. The weakest point is the inability to produce either a supported tone or a breathy one, no matter what settings you choose. Sampler users would say that Vocaloid banks have few layers; to use that analogy, it seems to me there is only one.
However, by generating MIDI files in our application, we managed to increase expressiveness and make Vocaloid sing in ways the creators of the banks did not expect. They were surprised by my example of "I Feel Good" (Bill asked: "Alex, how did you do that?"), as well as by the erotic shade in "Please Touch Me Lola" and the off-the-notes singing in "In The Darkness" and "PowerFX Hymn". At the Frankfurt Music Fair in 2007 I had the opportunity to talk with Hideki Kenmochi, head of the Vocaloid department at Yamaha. We discussed extending Vocaloid's functionality with our algorithms, but it never came to practical work.
UTAU. It appeared quite a while ago (2008); similar to Vocaloid, but it allows you to create your own voice banks. I've never tried it.
Synthesizer V is a new reincarnation of Vocaloid. I haven't used it myself yet, but the examples on YouTube are amazing, and control over vocal techniques has appeared: the voice can scream and sing almost in a whisper. It also seems that in the last 2-3 years the AI version has been using neural voice models, but only ones developed by the company itself (more precisely, by its partners).
RVC is already an entirely AI-based technology. Briefly, the scheme looks like this:
* A separate, clean (effect-free) recording of the vocals is extremely rare, so an AI algorithm is also used to separate the music from the voice.
In this way we get the same vocal track, but with the singer's timbre replaced by another person's.
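To make the "retrieval" part less abstract, here is a toy, runnable sketch of the idea as I understand it: each frame of features extracted from the reference vocal is blended with its nearest neighbour from a bank of features gathered from the target voice, and that blending is what pulls the timbre toward the target. The real algorithm works on learned speech features with a prebuilt index; the arrays below are random stand-ins.

```python
# Toy illustration of the retrieval step; the feature arrays are random
# stand-ins, not real speech features.
import numpy as np

rng = np.random.default_rng(0)
target_bank = rng.normal(size=(5000, 256))   # features collected from the target voice dataset
reference = rng.normal(size=(300, 256))      # frame-by-frame features of the reference vocal

def retrieve(ref_frames, bank, blend=0.75):
    out = np.empty_like(ref_frames)
    for i, frame in enumerate(ref_frames):
        nearest = bank[np.argmin(np.linalg.norm(bank - frame, axis=1))]
        out[i] = blend * nearest + (1 - blend) * frame   # mix retrieved and original features
    return out

converted = retrieve(reference, target_bank)
# these features, together with the pitch curve extracted from the reference,
# then go to the model's synthesizer, which renders audio in the target timbre
```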
Even before the fall, I had heard great examples of Synthesizer V and neural voices. I had no idea why McKinley was collecting fragments of BigAl's and Leon's vocals. He pointed out the absence of supported and breathy modes in the Vocaloid examples. I sent him my own example, where I simply imitated both... In response, McKinley sent me two tracks in which my voice was replaced by BigAl and Leon. For me it was a shock - the feeling of a young tinkerer who has found a cool piece of hardware but still has no idea where and how it can be used.
I asked how difficult it was to make a model of my voice - McKinley asked me to send 20-30 minutes of it. It was easy to collect that much from past projects; in addition I said a few things, sang as low and as high as I could, and sent it all off. The first song I wanted to try was "Splender" (about 20 years ago I did an arrangement of this song by Ravi Dattatreya, with vocals by the famous Indian artist Rajkumar Bharati), and... the next day I was already singing in Tamil!
Just as a couple of days earlier, I still had no idea what use all this was to me, but the desire to try other languages kept growing. So I would choose a new song, send it to McKinley and, like a three-year-old waiting for the next episode of Masha and the Bear, wait for a new render from him.
Two fundamental points became obvious:
I asked Rick Paul, a songwriter I have known for a long time, to share a clean vocal track and also to check my NVM's accent. That is how "You Knew Me 'Fore You Knew Me" appeared in my set. Rick first wrote that, compared to my actual accent, the model's accent was minimal. Then he sent a list of 7 comments analyzing individual syllables, and at the end a note: "...in general, from this performance I would never have guessed that a Russian was singing."
My old friend Peter Bloemendaal, a musician and journalist, found me a popular song in Dutch, "Tulpen Uit Amsterdam". After listening to the result, he assured me that I had no problems with Dutch and, most likely, would have none with German.
I sent two examples in French to a friend in France; the answer was: "Your French is much better than mine :)". I played an Armenian example for my neighbors - "Lav!", that is, good... I did not show anyone the examples in Italian, but phonetically it is considered quite close to Russian.
At the end of November I tried to figure out these technologies myself. I will try to explain everything as simply as possible, but first I want to offer an analogy which, it seems to me, makes it easier to understand what you get when a timbre is replaced.
Think of a person's gait and clothing. Both can be ordinary, or they can be bright and distinctive. If a person has an unusual gait, we easily recognize him even in ordinary clothes. Likewise, we cannot help noticing a person in bright, strange clothes, even when his gait does not stand out at all.
So, the original reference vocal track is the gait. If the vocalist has a deep and unusually fast or slow vibrato, noticeable glissandos (slides up to the tone and down from it), wide dynamics both within a single note and across a musical phrase, then replacing the original timbre with another, fairly ordinary one will not hide these features.
What happens is a kind of summation of the manner of movement and the coloring. If your timbre is ordinary and as a reference track you take, say, the vocals of G. Leps or A. Serov (for men), or Whitney Houston or Mariah Carey (for women), you will most likely not hear yourself: the spectrum will in fact be yours, but the intonation and the manner of movement will obviously be someone else's. Conversely, if you have a very distinctive timbre, then when it colors an ordinary intonation without bright details, you will be quite recognizable.
In my examples you will probably notice that G. Vitsin's manner in "Wait for the Locomotive" dominates with both McKinley's model and mine, while the BigAl model clearly prevails on whatever reference track it is applied to.
You need to collect voice samples - speech alone will do, but in our case preferably singing - with a total duration of 20-30 minutes. McKinley said he collected 12 hours of his voice; some people use only 1 minute; on online services the duration is often limited to 10 minutes. As I understand it, the point is for all sounds to be present in the material, preferably in different registers and with different delivery: loud, almost a scream, and quiet, almost a whisper.
It is important that the material be clean - without extraneous noise or room reverberation. If no such recordings exist, you can of course try to clean up what you have; those who work with sound know how this is done. There are now many online services that use AI algorithms to remove noise and reverberation, such as Noise Reducer. But such operations, as a rule, do not come without losses.
Standard processing applies to all the material: in general, you need a good level, without distortion, and without disturbing the overall dynamics of phrases and long notes. It is best to save as WAV (44.1 kHz, mono).
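A minimal sketch of that standard processing (the file names are just examples): resample to 44.1 kHz mono and apply one gain to the whole file, so the dynamics inside phrases and long notes stay untouched.

```python
# Convert a raw recording into a mono 44.1 kHz WAV, peak-normalized with a
# single gain for the whole file.
import numpy as np
import librosa
import soundfile as sf

def prepare(in_path: str, out_path: str, sr: int = 44100, peak_db: float = -1.0):
    audio, _ = librosa.load(in_path, sr=sr, mono=True)   # resample + mix down to mono
    peak = float(np.max(np.abs(audio))) or 1.0
    gain = (10 ** (peak_db / 20)) / peak                 # one gain, dynamics preserved
    sf.write(out_path, audio * gain, sr)

prepare("raw/voice_message_01.m4a", "dataset/voice_01.wav")
```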
I tried making NVMs from different material, including short recordings (2-3 minutes) and cleaned-up WhatsApp voice messages. I cannot say that increasing the amount of data definitely improves the final result - it depends on many factors. In my examples I noticed errors in the McKinley and BigAl models that my own NVM did not make. Although McKinley's NVM and BigAl were built on considerably more data, sometimes at the end of a word, where there is a slight exhalation, they replaced it with "ha-ha-ha" (laughter); there were other flaws too.
When collecting material from non-singing people, I would suggest sticking to the following:
For those who sing, the recommendations are generally the same, with one small clarification: a recording of speech is still needed, and it is not necessary to hit the notes.
As already noted, original vocal tracks are almost never available. There are many online services for separating voice and music, and they seem to use the same AI algorithm. At first I used VocalRemover, but then, on McKinley's advice, switched to mvsep - a more serious resource with many models, not only for separation but also for noise and reverb removal.
When choosing a song, you immediately have to pay attention to what surrounds the vocals. Polyphony, backing vocals (even in unison), winds and strings playing at the same time - all of this can end up in the extracted vocal track. Noticeable reverberation and delay will also hurt. Do the separation and listen to the reference track: if it is dirty, a successful timbre replacement most likely will not work with it.
From my limited experience, the voice is best extracted from acoustic recordings with a minimal set of instruments - for example, a song with a single guitar. The hardest cases are live performances and modern mixes, where the vocals are usually heavily processed (compression, exciter, etc.) and in their acoustic parameters are very far from a natural voice.
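If you prefer to separate voice and music locally rather than through an online service, one open-source option (a different tool from the services mentioned above) is Demucs. A minimal sketch; the output layout may differ between Demucs versions.

```python
# Run Demucs in two-stem mode: vocals vs. everything else.
import subprocess
from pathlib import Path

song = "song.mp3"  # example file name
subprocess.run(["demucs", "--two-stems=vocals", song], check=True)

# Demucs writes the stems under separated/<model_name>/<track_name>/
for wav in Path("separated").rglob("*.wav"):
    print(wav)  # vocals.wav is the reference-track candidate, no_vocals.wav the backing
```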
I tried several notebooks in Google Colab, but the first result I got was with the one McKinley recommended - link.
We go step by step, fulfill all the requirements, monitor the model's training in TensorBoard and wait for the notebook's messages.
Then we save the model for future use.
The time needed to create and train an NVM depends on the amount of data and the available resources - the calculations run in an external environment. A 10-15 minute dataset took me 1.5-2 hours, which I think is very good. All of this can apparently be done on your own machine as well, but the process will obviously take much longer.
On McKinley's advice, I did not set the number of epochs above 300, and I saved intermediate models every 10 or 20 epochs (in Google Colab). I did not notice a significant difference between the models at 300, 240 or 120 epochs when I tried them on test examples; most likely this comes down to the material itself. People write online that an overtrained model sounds worse than an undertrained one.
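Schematically, this checkpoint strategy looks like the toy loop below (this is not the notebook's actual training code; the model, data and loss are stand-ins): train for at most 300 epochs and save every 20, so that the 120-, 240- and 300-epoch versions can later be compared on test renders.

```python
# Toy training loop illustrating periodic checkpoints; everything here is a stand-in.
import torch
import torch.nn as nn

model = nn.Linear(16, 16)                        # stand-in for the voice model
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
data = [torch.randn(8, 16) for _ in range(10)]   # stand-in dataset

for epoch in range(1, 301):
    for x in data:
        opt.zero_grad()
        loss = nn.functional.mse_loss(model(x), x)   # stand-in loss
        loss.backward()
        opt.step()
    if epoch % 20 == 0:                          # keep intermediate models for comparison
        torch.save(model.state_dict(), f"nvm_epoch_{epoch:03d}.pth")
```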
There are online services where you do not need to understand the nuances - just upload your material. I tried kits.ai. The dataset size is limited to 10 minutes, and the quality of the model on a free account is deliberately low. Moreover, you cannot download your own model, but you can upload one made elsewhere or take one from their library.
Oddly enough, for one particular voice the result of the simplified kits.ai model was better than that of the full model trained in the Google Colab notebook.
Now we get to the magic. Once we have both the NVM and the reference track, we can apply the actual RVC algorithm. There are several Google Colab notebooks for this, in particular - link. You can run them either in the external environment or on your own machine. I used a similar one ("RVC Inference HF") on HuggingFace. Load the model and the reference track, click "Convert". A free account is allocated 2 vCPUs and 16 GB of RAM; the result for a 1.5-minute track is computed in 3-4 minutes. Most likely the time depends not only on the resources but also on the reference track. There are additional settings that change the character of the processing; I tried some of them but did not notice significant differences for a given model and track, so I left them at the defaults.
1. It is better to make the dataset at least 10 minutes long, trying to include as much of what your voice can do as possible. If the reference track contains something that has no analogue in the model, the algorithm will substitute the closest thing the model has - for example, Sh instead of Ch if Ch is missing.
2. Every defect in the reference track will show up in the render, and in such cases a richer model may even perform worse. For example, a saxophone left in the reference track may be taken for a different sound and rounded off to something else. Or there is a peak in the spectrum around 5-7 kHz from an exciter, and the algorithm may misjudge the pitch and insert fragments an octave higher on a long vowel (see the pitch-tracking sketch after this list).
3. There are many resources, and new ones appear constantly. Algorithms are updated - what worked yesterday may work differently today. Much of it is free and accessible from the Russian Federation. To stay in shape, you need to keep track of the significant changes. There are many ready-made models - about 18 thousand on weights.gg (among them more than a dozen models of Michael Jackson).
4. Naturally, after all the hoop-jumping, both with the applications and with paying for a foreign service, the thought arises: upgrade your PC and set everything up at home. How easy that is, I do not know yet.
5. I think that in the very near future there will be VST (and other format) plugins that change the timbre of a voice track directly in your favorite DAW (FL Studio, Ableton, Reaper). By the way, RVC is applicable not only to the voice but also to musical instruments - such models are also created and used (see kits.ai).
6. A couple of useful guides: "Training RVC v2 models Guide" and "RVC v2 AI Cover Guide" by kalomaze.
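And the pitch-tracking sketch promised in point 2: a rough way to see where a tracker might jump an octave on a problematic reference track (the file name is an example, and librosa's pyin is just one of many possible trackers; the conversion notebooks use their own pitch extractors).

```python
# Extract the pitch curve of a reference vocal and flag octave-sized jumps,
# the kind of error an exciter peak or a leftover instrument can provoke.
import numpy as np
import librosa

y, sr = librosa.load("reference_vocal.wav", sr=None, mono=True)
f0, voiced_flag, _ = librosa.pyin(y, fmin=librosa.note_to_hz("C2"),
                                  fmax=librosa.note_to_hz("C6"), sr=sr)

voiced_f0 = f0[voiced_flag]                       # keep only voiced frames
jumps = np.where(np.abs(np.diff(np.log2(voiced_f0))) > 0.9)[0]
print(f"{len(jumps)} suspicious octave-sized jumps in the pitch curve")
```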
Opinions online about RVC results are diametrically opposed, from enthusiastic to dismissive. But, as usually happens, time passes and everything falls into place. When, in our student years, we (a rock-pop band) replaced a brass band on the dance floor, its musicians looked at us almost in bewilderment. Ten years later we were replaced by guys with speakers who simply turned on a tape recorder. The same thing happened with the advent of mp3, the Internet and streaming. You either embrace the new or remain in the past.
Creating songs for people who do not sing was the first thing that came to mind. Doesn't sound very attractive, right? But that was how things stood before, when there was no RVC.
For about 10-15 years I had my own studio aimed at amateurs, that is, weakly singing clients. With one of them (who wore a hearing aid!) we recorded 50 of his own songs, with another - 15. If RVC had existed then, it would have been easier for me to make their NVMs, sing the songs myself and swap the timbre, rather than spend time and nerves on recording and on correcting pitch and rhythm.
There were also ladies with the attitude: "I sang very well; nowadays there is no need to study vocals or train the voice - the sound engineer does everything! How much does it cost? My man will pay for everything...". Well, if RVC really can do everything, then for the sake of one song you can create an NVM and hire a professional vocalist to record the reference track. Then the lady will be right: everything is done on the computer, and she really did sing very well.
I discussed this with Rick Paul, who doubts there is demand for such a service: "that is, a recording of a person who cannot sing (virtually singing a song) has no real value - perhaps it is just stroking that person's ego". I think Rick has not worked with amateurs as much as I have. I believe that as soon as the public discovers this possibility, everyone will start singing, especially bloggers. They will even boast about who acted as the donor, the surrogate who recorded the reference track for them.
I will offer a few ideas, without ethical or legal assessment and without forecasting the potential artistic value of the result.
It looks like the rise of RVC will seriously affect the work of musicians and producers. J Rice says that before AI voices he turned down some projects: "The worst part of music production is the waiting - be it session vocalists, studio time or meetings with other artists... AI makes collaboration easier than ever by allowing artists to train an NVM, eliminating the need for in-person meetings to record vocal tracks... It's one thing if I'm singing the lyrics, but now I can change the voice, significantly changing how it is heard, creating a more complex and improved sound."
To be honest, I have not come across any clear materials on how the use of NVM and RVC is regulated. Models are created by the thousands, and many freely available models of famous people have obviously been made without their permission.
What prevents someone from creating an NVM of, say, Sergei Chonishvili - a famous actor and sought-after announcer - and then releasing advertising with his voice? I have heard that lawsuits over something similar are already underway. The best protection, it seems to me, is for the artist to authorize the creation of an NVM himself and to receive income from its use. Some artists have already taken this path - Gwyneth Paltrow, Grimes and others. I think that in the near future the creation of high-quality models will be initiated en masse by the artists themselves and by studios that hold the original, clean vocal tracks. Legislation usually lags behind practice, but where there is demand, the parties find ways to reconcile their interests.
I repeat once again: I do not pretend to cover the topic completely. I decided to write this article a couple of months ago, and even in that short time a lot of new things have appeared.
It is clear that we are on the verge of big changes, both in the music industry and in the wider cultural sphere. Just 15 years ago, like any engineer familiar with the Kotelnikov/Nyquist theorem, I would have argued that restoring lost information (or splitting a mixed recording into separate tracks) is impossible, but... As it turns out, the brain of a person looking at a stool does not analyze every pixel coming from the retina - it simply pulls a previously seen image from memory. AI seems to mock us: "I am only doing what your brain has been doing all along - what is impossible about that?"
I would like to note only a few fundamental points of which I am convinced:
Personally, I am extremely glad to have lived to see such innovations (I am 66). And although I have not yet decided exactly how I will use RVC, past experience suggests that I will certainly use some of it in future experiments. And of course I agree with McKinley's words: "...now that the technology is so accessible, there has been a surge in the creation of NVMs, for better or worse, spanning a wide ethical spectrum... and I am aware of the moral responsibility that the use of these powerful tools entails."
Once again, a link to the demo examples.