AI that can understand video could be put to a variety of uses
Teaching AI systems to understand what’s happening in videos as completely as a human can is one of the hardest challenges — and biggest potential breakthroughs — in the world of machine learning. Today, Facebook announced a new initiative that it hopes will give it an edge in this consequential work: training its AI on Facebook users’ public videos.
Access to training data is one of the biggest competitive advantages in AI, and by collecting this resource from millions and millions of their users, tech giants like Facebook, Google, and Amazon have been able to forge ahead in various areas. And while Facebook has already trained machine vision models on billions of images collected from Instagram, it hasn’t previously announced projects of similar ambition for video understanding.
“By learning from global streams of publicly available videos spanning nearly every country and hundreds of languages, our AI systems will not just improve accuracy but also adapt to our fast moving world and recognize the nuances and visual cues across different cultures and regions,” the company said in a blog post. The project, titled Learning from Videos, is also part of Facebook’s “broader efforts toward building machines that learn like humans do.”
The resulting machine learning models will be used to create new content recommendation systems and moderation tools, says Facebook, but could do much more in the future. AI that can understand the content of videos could give Facebook unprecedented insight into users’ lives, allowing the company to analyze their hobbies and interests, preferences in brands and clothes, and countless other personal details. Of course, Facebook already has access to such information through its current ad-targeting operation, but being able to parse video through AI would add an incredibly rich (and invasive) source of data to its stores.
Facebook is vague about its future plans for AI models trained on users’ videos. The company told The Verge such models could be put to a number of uses, from captioning videos to creating advanced search functions, but did not answer a question on whether or not they would be used to collect information for ad-targeting. Similarly, when asked if users had to consent to having their videos used to train Facebook’s AI or if they could opt out, the company responded only by noting that its Data Policy says users’ uploaded content can be used for “product research and development.” Facebook also did not respond to questions asking exactly how much video will be collected for training its AI systems or how access to this data by the company’s researchers will be overseen.
In its blog post announcing the project, though, the social network did point to one future, speculative use: using AI to retrieve “digital memories” captured by smart glasses.
Facebook plans to release a pair of consumer smart glasses sometime this year. Details about the device are vague, but it’s likely these or future glasses will include integrated cameras to capture the wearer’s point of view. If AI systems can be trained to understand the content of video, then users will be able to search their past recordings, just as many photo apps allow people to search for specific locations, objects, or people. (This is information, incidentally, that has often been indexed by AI systems trained on user data.)
As recording video with smart glasses “becomes the norm,” says Facebook, “people should be able to recall specific moments from their vast bank of digital memories just as easy as they capture them.” It gives the example of a user conducting a search with the phrase “Show me every time we sang happy birthday to Grandma,” before being served relevant clips. As the company notes, such a search would require that AI systems establish connections between types of data, teaching them “to match the phrase ‘happy birthday’ to cakes, candles, people singing various birthday songs, and more.” Just as humans do, AI would need to understand rich concepts composed of different types of sensory input.
Looking to the future, the combination of smart glasses and machine learning would enable what’s referred to as “worldscraping” — capturing granular data about the world by turning wearers of smart glasses into roving CCTV cameras. As the practice was described in a report last year from The Guardian: “Every time someone browsed a supermarket, their smart glasses would be recording real-time pricing data, stock levels and browsing habits; every time they opened up a newspaper, their glasses would know which stories they read, which adverts they looked at and which celebrity beach pictures their gaze lingered on.”
This is an extreme outcome and not an avenue of research Facebook says it’s currently exploring. But it does illustrate the potential significance of pairing advanced AI video analysis with smart glasses — which the social network is apparently keen to do.
By comparison, the only use of its new AI video analysis tools that Facebook is currently disclosing is relatively mundane. Along with today’s announcement of Learning from Videos, Facebook says it has deployed a new content recommendation system, based on its video work, in Reels, its TikTok clone. “Popular videos often consist of the same music set to the same dance moves, but created and acted by different people,” says Facebook. By analyzing the content of videos, Facebook’s AI can suggest similar clips to users.
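Facebook has not published how the Reels system works, but the general idea of suggesting similar clips can be sketched in a few lines. The example below is an illustration only, not Facebook's method: it assumes each clip has already been reduced by some model to a short list of numbers (an "embedding"), and it ranks other clips by cosine similarity to the one a user just watched. The clip names, the vectors, and the function names are all hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat an all-zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def most_similar(query_id, embeddings, k=2):
    """Return the ids of the k clips most similar to query_id."""
    query = embeddings[query_id]
    scored = [
        (cosine_similarity(query, vec), clip_id)
        for clip_id, vec in embeddings.items()
        if clip_id != query_id  # never recommend the clip itself
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [clip_id for _, clip_id in scored[:k]]


# Hypothetical, hand-made embeddings; a real system would learn these
# from the audio and visual content of each video.
clip_embeddings = {
    "dance_a": [0.9, 0.1, 0.0],
    "dance_b": [0.8, 0.2, 0.1],
    "dance_c": [0.7, 0.3, 0.0],
    "cooking": [0.0, 0.1, 0.9],
}

recommendations = most_similar("dance_a", clip_embeddings)
```

In this toy data the two other dance clips rank above the cooking clip, which is the behavior Facebook describes for videos that share the same music and moves.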
Such content recommendation algorithms are not without potential problems, though. A recent report from MIT Technology Review highlighted how the social network’s emphasis on growth and user engagement has stopped its AI team from fully addressing how algorithms can spread misinformation and encourage political polarization. As the Technology Review article says: “The [machine learning] models that maximize engagement also favor controversy, misinformation, and extremism.” This creates a conflict between the duties of Facebook’s AI ethics researchers and the company’s credo of maximizing growth.
Facebook isn’t the only big tech company pursuing advanced AI video analysis, nor is it the only one to leverage users’ data to do so. Google, for example, maintains a publicly accessible research dataset containing 8 million curated and partially labeled YouTube videos in order to “help accelerate research on large scale video understanding.” The search giant’s ad operations could similarly benefit from AI that understands the content of videos, even if the end result is simply serving more relevant ads on YouTube.
Facebook, though, thinks it has one particular advantage over its competitors. Not only does it have ample training data, but it’s pushing more and more resources into an AI method known as self-supervised learning.
Usually, when AI models are trained on data, those inputs have to be labeled by humans: tagging objects in pictures or transcribing audio recordings, for example. If you’ve ever solved a CAPTCHA identifying fire hydrants or pedestrian crossings, then you’ve likely labeled data that’s helped to train AI. But self-supervised learning does away with the labels, speeding up the training process, and, some researchers believe, resulting in deeper and more meaningful analysis as the AI systems teach themselves to join the dots. Facebook is so optimistic about self-supervised learning it’s called it “the dark matter of intelligence.”
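To make the contrast with human labeling concrete, here is a minimal sketch of one common self-supervised setup, in which part of the data is hidden and the hidden part itself serves as the answer the model must predict. It is an illustration under stated assumptions, not a description of Facebook's systems: the function name, the mask token, and the toy "frames" are all hypothetical, and the training of an actual model is left out.

```python
def make_masked_examples(sequence, mask_token="<MASK>"):
    """Turn one unlabeled sequence into (input, target) training pairs.

    Each pair hides a single element of the sequence; the hidden element
    becomes the target. No human ever writes a label: the data supplies
    its own supervision.
    """
    examples = []
    for i, target in enumerate(sequence):
        masked = list(sequence)  # copy so the original stays untouched
        masked[i] = mask_token
        examples.append((masked, target))
    return examples


# Hypothetical stand-ins for consecutive video frames.
frames = ["cake", "candles", "singing"]
training_pairs = make_masked_examples(frames)
# training_pairs[1] is (["cake", "<MASK>", "singing"], "candles")
```

A model trained on pairs like these has to learn what tends to appear alongside what, which is the sense in which such systems are said to teach themselves to join the dots.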
The company says its future work on AI video analysis will focus on semi- and self-supervised learning methods, and that such techniques “have already improved our computer vision and speech recognition systems.” With such an abundance of video content available from Facebook’s 2.8 billion users, skipping the labeling part of AI training certainly makes sense. And if the social network can teach its machine learning models to understand video seamlessly, who knows what they might learn?