So far this year, Mark Zuckerberg (Facebook CEO) built a simple AI that he can talk to on his phone and computer, that can control his home, including lights, temperature, appliances, music and security, that learns his tastes and patterns, that can learn new words and concepts, and that can even entertain his daughter Max. It uses several artificial intelligence techniques, including natural language processing, speech recognition, face recognition, and reinforcement learning, written in Python, PHP and Objective-C. He explained what he built and what he learned along the way.
Before he could build any AI, he first needed to write code to connect his home systems, which all speak different languages and protocols.
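One way to picture that connective code is a thin abstraction layer that hides each vendor's protocol behind a common interface. The sketch below is illustrative only: the class names, commands, and in-memory state are assumptions, not the actual Jarvis code, and a real setup would replace each `send` body with the vendor's own API calls.

```python
class Device:
    """Common interface so the AI layer never speaks a vendor protocol directly."""

    def __init__(self, name):
        self.name = name
        self.state = {}

    def send(self, command, **params):
        # Each subclass translates the generic command into its own protocol.
        raise NotImplementedError


class CrestronLight(Device):
    def send(self, command, **params):
        # In a real home this would issue a Crestron control-system call.
        if command == "set_power":
            self.state["on"] = params["on"]
        return self.state


class SonosSpeaker(Device):
    def send(self, command, **params):
        # In a real home this would call the Sonos/Spotify API.
        if command == "play":
            self.state["track"] = params["track"]
        return self.state


class Home:
    """Routes generic commands to whichever named device should handle them."""

    def __init__(self, devices):
        self.devices = {d.name: d for d in devices}

    def command(self, device_name, command, **params):
        return self.devices[device_name].send(command, **params)


home = Home([CrestronLight("bedroom_light"), SonosSpeaker("living_room_speaker")])
home.command("bedroom_light", "set_power", on=True)
home.command("living_room_speaker", "play", track="Lullaby")
```

The payoff of this shape is that everything above the `Home` layer (speech, text, vision) can emit the same generic commands regardless of which vendor's hardware ends up executing them.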
They use a Crestron system for their lights, thermostat and doors; a Sonos system with Spotify for music; a Samsung TV; a Nest cam for Max; and, of course, his work is connected to Facebook's systems.
Vision and Face Recognition
About one-third of the human brain is dedicated to vision, and there are many important AI problems related to understanding what is happening in images and videos. These problems include tracking (e.g., is Max awake and moving around in her crib?), object recognition (e.g., is that Beast or a rug in that room?), and face recognition (e.g., who is at the door?).
Face recognition is a particularly difficult version of object recognition because most people look relatively similar to one another, compared with telling apart two random objects, such as a sandwich and a house. But Facebook has gotten very good at face recognition for identifying when your friends are in your photos. That expertise is also useful when your friends are at your door and your AI needs to determine whether to let them in.
To do this, he installed a few cameras at his door that can capture images from all angles. AI systems today cannot identify people from the back of their heads, so having a few angles ensures they see the person's face. He built a simple server that continuously watches the cameras and runs a two-step process: first, it runs face detection to see if any person has come into view, and second, if it finds a face, it runs face recognition to identify who the person is. Once it identifies the person, it checks a list to confirm he was expecting that person; if so, it lets them in and tells him they're here.
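The two-step flow can be sketched as a tiny pipeline. To keep the sketch self-contained, the detector and recognizer below are stand-in stubs (real systems would run trained models here), and the guest list and frame format are invented for illustration.

```python
# Hypothetical list of people the owner is expecting.
EXPECTED_GUESTS = {"Priscilla", "Aunt Randi"}


def detect_faces(frame):
    # Stub: a real system would run a face-detection model on the camera frame.
    return frame.get("faces", [])


def recognize_face(face):
    # Stub: a real system would match the face against known embeddings.
    return face.get("identity", "unknown")


def handle_frame(frame):
    """Run detection, then recognition, then the expected-guest check.

    Returns an (action, person) pair for a single camera frame.
    """
    for face in detect_faces(frame):
        person = recognize_face(face)
        if person in EXPECTED_GUESTS:
            return ("open_door_and_notify", person)
        return ("notify_only", person)
    return ("ignore", None)


# A visitor on the expected list gets the door opened and the owner notified:
print(handle_frame({"faces": [{"identity": "Priscilla"}]}))
# An empty frame is ignored entirely, so recognition never runs:
print(handle_frame({"faces": []}))
```

Running detection before recognition matters for efficiency: detection is cheap and runs on every frame, while the more expensive recognition step only runs when a face is actually present.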
This type of visual AI system is useful for a number of things, including knowing when Max is awake so it can start playing music or a Mandarin lesson, or solving the context problem of knowing which room in the house they're in so the AI can correctly respond to context-free requests like "turn the lights on" without being given a location. Like most aspects of this AI, vision is most useful when it informs a broader model of the world, connected with other abilities like knowing who your friends are and how to open the door when they're here. The more context the system has, the smarter it gets overall.
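The context problem for "turn the lights on" can be made concrete with a small sketch: the vision system keeps track of which room each person was last seen in, and that location fills in the missing parameter. The data structures and names here are assumptions for illustration, not the actual implementation.

```python
# Hypothetical state maintained by the vision system as people move around.
last_seen_room = {"Mark": "office"}


def resolve_request(person, request):
    """Fill in a missing room using where the requester was last seen."""
    if "room" not in request:
        room = last_seen_room.get(person, "default")
        request = dict(request, room=room)
    return request


# "Turn the lights on" arrives with no location, so the tracker supplies it:
print(resolve_request("Mark", {"action": "lights_on"}))
# → {'action': 'lights_on', 'room': 'office'}
```

This is the sense in which vision "informs a broader model of the world": the camera output is not an end in itself but a source of context that other subsystems consult.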
He plans to continue improving Jarvis, since he uses it every day.
In the near term, the clearest next steps are building an Android app, setting up Jarvis voice terminals in more rooms around his home, and connecting more appliances.
In the longer term, he’d like to explore teaching Jarvis how to learn new skills itself rather than Mark having to teach it how to perform specific tasks. If he spent another year on this challenge, he’d focus more on learning how learning works.
Finally, over time it would be interesting to find ways to make this available to the world. He would consider open-sourcing his code, but it's currently too tightly tied to his own home, appliances and network configuration. If he ever builds a layer that abstracts more home automation functionality, he may release that. Or, of course, that could be a great foundation to build a new product.