What is synthetic data? It is data that is not obtained through direct measurement or data capture but is instead generated from scratch. One example is data derived from computer simulations whose variables emulate real-world situations. Synthetic data is useful for making projections or examining the potential outcomes of changes in business processes.
OneView, a company that seeks to accelerate machine learning, provides an excellent demonstration of how useful synthetic data can be in these applications. The company currently focuses on the remote sensing industry, generating virtual synthetic datasets for large-scale imagery analytics conducted by ML algorithms. Its platform delivers fully annotated synthetic datasets designed to be fed directly into machine learning systems.
Unlimited datasets for unlimited discovery
One clear advantage of synthetic datasets is that they offer limitless possibilities. The datasets are not constrained by what can be captured in the real world, so machine learning can proceed with as much information as it needs to achieve learning targets.
MIT Technology Review published an article explaining how self-driving cars can learn by playing Grand Theft Auto. The idea sounds ridiculous, but it is not. “Hyper-realistic computer games may offer an efficient way to teach AI algorithms about the real world,” the article states, citing the results of experiments conducted by several research groups.
And so the OneView platform uses advanced gaming engines (mostly Unity) to generate detailed 3D virtual models as the basis for the resulting synthetic images, which are used as a replacement for real satellite or aerial images.
As it turns out, training ML algorithms does not necessarily require real remote sensing images. What’s especially important is to provide a wide range of information that introduces various settings and parameters, so the system can get acquainted with different scenarios.
Additionally, ML model training using real data exclusively is a highly cumbersome process. Aside from being costly – satellite and aerial images, though widely available, are still very expensive – it is lengthy as well, which poses a problem for companies that need to move fast.
Moreover, it is impossible to use real datasets to represent all possible situations or configurations. It would take forever, for example, to capture images of various car makes and models on the streets under different lighting, traffic circumstances, and orientations. Rare occurrences pose another challenge when using only real data for algorithm training; sometimes you just don’t have the data that you need. The logical solution is to use simulated visual data so machine learning systems can be fed with sufficient amounts of data that depict an extensive range of possibilities.
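To make the scale of the problem concrete, here is a minimal sketch in Python that counts the scene configurations produced by just four parameters. The parameter names and values are hypothetical and purely illustrative; they are not taken from OneView's platform.

```python
import itertools
import random

# Hypothetical scene parameters; the values are illustrative only.
VEHICLE_TYPES = ["sedan", "suv", "truck", "bus"]
LIGHTING = ["dawn", "noon", "dusk", "overcast"]
TRAFFIC = ["sparse", "moderate", "dense"]
HEADINGS_DEG = list(range(0, 360, 45))  # 8 orientations


def all_configurations():
    """Return every combination of the four parameters."""
    return list(itertools.product(VEHICLE_TYPES, LIGHTING, TRAFFIC, HEADINGS_DEG))


def sample_configurations(n, rng):
    """Pick n distinct configurations to hand to a (hypothetical) scene generator."""
    return rng.sample(all_configurations(), n)


configs = all_configurations()
print(len(configs))  # 4 * 4 * 3 * 8 = 384
```

Even this toy parameter space has 384 combinations. A simulator can draw from it on demand with `sample_configurations`, whereas capturing each combination with real imagery would require a separate acquisition.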
Adapting synthetic data to resemble remote sensing real data
As part of its synthetic data generation process, OneView adapts the data to remote sensing characteristics. This means adding textures, color smears, noise, and other attributes of remote sensing imagery, such as blur and color variation.
OneView also mimics the attributes of remote sensing systems, such as lens resolution and off-nadir range, to make the synthetic data usable. Subsequently, randomization is applied to approximate real data.
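The general idea of degrading a clean render with randomized sensor-like effects can be sketched as follows. This is not OneView's pipeline: the function names, the parameter ranges, and the choice of a box blur, a brightness gain, and Gaussian noise are all assumptions made for illustration. The image is assumed to be a grayscale grid of floats in the range 0 to 1.

```python
import random


def box_blur(img, radius):
    """Average each pixel with its neighbours in a (2*radius + 1) square window."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            total, count = 0.0, 0
            for yy in range(max(0, y - radius), min(h, y + radius + 1)):
                for xx in range(max(0, x - radius), min(w, x + radius + 1)):
                    total += img[yy][xx]
                    count += 1
            out[y][x] = total / count
    return out


def degrade(img, rng):
    """Apply randomized blur, brightness gain, and noise to a clean image."""
    # Assumed parameter ranges, chosen only for illustration.
    radius = rng.randint(0, 2)
    gain = rng.uniform(0.8, 1.2)
    sigma = rng.uniform(0.0, 0.05)
    blurred = box_blur(img, radius)
    out = [
        [min(1.0, max(0.0, p * gain + rng.gauss(0.0, sigma))) for p in row]
        for row in blurred
    ]
    return out, {"blur_radius": radius, "gain": gain, "noise_sigma": sigma}


# A clean 8x8 "render": a bright square on a dark background.
clean = [[1.0 if 2 <= x < 6 and 2 <= y < 6 else 0.0 for x in range(8)] for y in range(8)]
noisy, params = degrade(clean, random.Random(0))
```

Because the parameters are drawn fresh on every call, the same clean render yields a different degraded image each time, which is the role randomization plays in approximating the variability of real imagery.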
Annotations for guided machine learning
Just as annotation helps human learners, data annotation helps machines learn by providing guidance. Algorithms have no pre-existing knowledge, so annotation is used to define everything, especially in instances where multiple interpretations are possible. In satellite views of Manhattan, for example, annotation may entail labeling yellow cabs as they appear from different angles, at different times of day, and under different weather conditions and orientations. Annotation helps AI recognize something that it would otherwise confuse with something else.
A major advantage of annotations in synthetic data, particularly with the OneView platform, is that they are generated as an integral part of the synthetic data creation process. In contrast, when using real data, annotations have to be done manually. Such a process is not only time-consuming and tedious; it is also prone to errors. Synthetic data is generated with controlled variables, so errors in the identification and contextualization of objects are highly unlikely.
With OneView, annotation takes place alongside synthetic data generation. Since the visual data is synthesized by the OneView system, annotations are baked into the resulting data according to customer specifications.
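Why annotations come essentially for free in synthetic data can be shown with a toy generator that draws objects and records their labels in the same step. This is a minimal sketch, not OneView's implementation: `generate_scene`, the "vehicle" label, and the `[x, y, width, height]` box layout are all hypothetical choices made for illustration.

```python
import random


def generate_scene(width, height, n_objects, rng):
    """Draw rectangles on a blank grid and record a bounding box for each one."""
    image = [[0] * width for _ in range(height)]
    annotations = []
    for _ in range(n_objects):
        w = rng.randint(2, 4)
        h = rng.randint(2, 4)
        x = rng.randint(0, width - w)
        y = rng.randint(0, height - h)
        # Render the object into the image.
        for row in range(y, y + h):
            for col in range(x, x + w):
                image[row][col] = 1
        # The generator already knows where the object is, so the label is exact.
        annotations.append({"label": "vehicle", "bbox": [x, y, w, h]})
    return image, annotations


image, annotations = generate_scene(32, 32, 5, random.Random(1))
```

Each bounding box comes from the same variables that placed the object, so no separate manual labeling pass is needed. Objects may overlap in this toy version; a fuller generator would also have to decide how to annotate occluded objects.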
Harnessing the benefits of synthetic data
OneView’s utilization of synthetic data is not derived from a hunch or the isolated consensus of the company’s data scientists. It is anchored on meticulous studies and even supported by the findings of external research. Gartner, for one, predicts that synthetic data will comprise 25 percent of the data used to train AI systems by 2022.
OneView is confident in the system it has developed, especially when it comes to visual data presentation. The company’s proof of concept (POC) for object detection has produced consistent results on satellite imagery. Its results are reportedly three times better than those achieved with real data.
Interestingly, the use of synthetic data in a way connects the remote sensing and video gaming industries. OneView generates synthetic data with the help of 3D engines used in video games. This synthetic data is then used to train machine learning systems for the remote sensing industry. This convergence of technologies used in video gaming and real-world applications is a testament to how dramatically simulation technologies have progressed and to how viable a tool synthetic data can be for advancing machine learning.