Microsoft’s VASA-1 can deepfake a person with one photo and one audio track

Chris Remington@beehaw.org · 2 years ago

Microsoft’s VASA-1 can deepfake a person with one photo and one audio track

some_guy · 2 years ago

They mentioned one potential use that I thought has value and that I hadn’t considered. For video conferencing, this could transmit data without sending video and greatly reduce the amount of bandwidth needed by rendering people’s faces locally. I don’t think that outweighs the massive harms this technology will unleash. But at least there was some use that would be legit and beneficial.

I’m someone who has a moral compass and I don’t like that scammers will abuse this shit so I hate it. But there’s no keeping it locked away. It’s here to stay. I hate the future / now.

flora_explora@beehaw.org · 2 years ago

Wouldn’t you then have to run the AI locally on a machine (which probably draws a lot of power and memory) or use it via cloud (which depends on bandwidth just like a video call). I don’t really see where this technology could actually be useful. Sure, if it is only a minor computation just like if you take a picture/video with any modern smartphone. But computing an entire face and voice seems much more complicated than that and not really feasible for the usual home device.

Markaos@lemmy.one · 2 years ago

Yeah, it’s not practical right now, but in 10 years? Who knows, we might finally have some built-in AI accelerator capable of running big neural networks on consumer CPUs by then (we do have AI accelerators in a large chunk of current CPUs, but they’re not up to the task yet). The system memory should also go up now that memory-hungry AI is inching closer to mainstream use.

Sure, Internet bandwidth will also increase, meaning this compression will be less important, but on the other hand, it’s not like we’ve stopped improving video codecs after h.264 because it was good enough - there are better codecs now even though we have the resources to handle bigger h.264 videos.

The technology doesn’t have to be useful right now - for example, neural networks capable of learning have been studied since the 1940s, even though there would be no way to run them for many decades, and it would take even longer to run them in a useful capacity. But now that we have the technology to do so, they enjoy rapid progress building on top of that original foundation.

flora_explora@beehaw.org · 2 years ago

Fair point, I agree.

barsoap@lemm.ee · edit-2 2 years ago

A model that can only generate frontal to profile views of heads would be quite small, I can totally see that kind of thing running on current consumer GPUs, in real time. Near real time is already possible with SDXL-based models with some speedup tricks applied as long as you have a mid-range gaming GPU and those models are significantly more general. It’s not like the model would need to generate spaghetti and sports cars alongside with the head.

Lem Jukes@lemm.ee · 2 years ago

Also I would argue sending the actual video of what is happening in front of the camera is kind of the entire point of having a video call. I don’t see any utility in having a simulated face to face interaction where neither of you is even looking at an actual image of the other person.