Fully Simulated Training Data: How we’re using synthetic Expo Markers to train an object detection network [Part Ⅰ]

July 22, 2020 Daniel Liberman, Datagen

In this post, we’re excited to present how we’ve trained a network to identify dry-erase markers, exclusively using synthetic training data. GPU capacity and neural network sizes have increased over time, while teams’ ability to obtain the necessary quantities of training data has stagnated. Furthermore, data collection and labeling is an expensive process, and the results may not be satisfactory, especially in pixel-wise tasks such as segmentation. For some labels, such as depth maps or surface normal maps, it is nearly impossible for humans to produce annotations at all.

Despite these challenges, there is still skepticism about whether synthetic or simulated data can achieve strong training results. This is Part 1 in a series of research papers that will explore the effectiveness of training networks with fully synthetic data and how well those networks generalize to other environments and applications.

In particular, we are sharing our results from using synthetic training data to train a network for object detection/segmentation of “exact objects”: object classes that have a single, fixed 3D shape and texture. Expo Markers were chosen for this task because of their standard texture, size, and 3D shape. An additional advantage is their availability in offices around the world, which makes testing and validating results easy. The same methodology can be used for any type of exact object.

Part 1 includes a number of different elements:

  1. Our white-paper presenting our motivations, methods, and findings
  2. The synthetic dataset of 5,000 images, which was used as training data. We are glad to make this dataset free for non-commercial use. Click here to request a download link sent to your email address. You will also have the option to download a real dataset of 200 manually labeled images, used for testing, as well as pre-trained weights.
  3. Our code, which enables ‘plug and play’ training and evaluation of Mask R-CNN using the detectron2 framework. Our friendly Colab notebook includes visualizations and shows how to use the code.
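For readers unfamiliar with detectron2, here is a minimal sketch of what ‘plug and play’ Mask R-CNN training on a COCO-format dataset typically looks like. The dataset names, annotation paths, and hyperparameter choices below are placeholders for illustration, not the exact configuration in our repository; see the Colab notebook for the real entry points.

```python
# Sketch of a typical detectron2 Mask R-CNN training setup.
# Dataset names and file paths are placeholders, not our repo's actual layout.
from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.data.datasets import register_coco_instances
from detectron2.engine import DefaultTrainer

# Register the synthetic (train) and real (test) sets in COCO format
register_coco_instances("expo_synth_train", {}, "synth/annotations.json", "synth/images")
register_coco_instances("expo_real_test", {}, "real/annotations.json", "real/images")

cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file(
    "COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml"))
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url(
    "COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml")
cfg.DATASETS.TRAIN = ("expo_synth_train",)
cfg.DATASETS.TEST = ("expo_real_test",)
cfg.MODEL.ROI_HEADS.NUM_CLASSES = 1  # a single "marker" class

trainer = DefaultTrainer(cfg)
trainer.resume_or_load(resume=False)
trainer.train()
```

Starting from COCO-pretrained weights and fine-tuning on the synthetic set is a common default; the key point is that nothing framework-specific needs to change just because the training images are synthetic.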

Our Approach: Domain Randomization + Distractions

When using synthetic data as the source domain (aka training data) and real-world data as the target domain (for testing), it is not obvious that you can achieve satisfying results. The problem arises from the so-called “data distribution gap.” We would like maximal overlap between the source and target domains, so that the trained network generalizes well when presented with the testing data. But since we control only the source distribution, how can we maximize this overlap? One strategy is to make the source variance broad enough that the target domain is contained within it; this approach is called Domain Randomization. A second approach is to make the source domain similar to the target domain; the main method of accomplishing this is called Domain Adaptation.

In our synthetic dataset, we chose to pursue the Domain Randomization approach. We took photorealistic models of Expo Markers and randomized them in nonrealistic ways, forcing the neural network to learn the essential features of the object of interest. The data creation process produces scale variation, occlusions, lighting variation, and high object density; to expose the network to similar object classes, we also added large numbers of distraction items, using other simulated assets in our library.
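The randomization step above can be sketched as sampling a fresh scene configuration per rendered image. The parameter names and ranges here are purely illustrative assumptions for the sake of the example, not the values used to generate the Expo Marker dataset:

```python
import random

def sample_scene_params(num_distractors_range=(5, 30), seed=None):
    """Sample one randomized scene configuration for synthetic rendering.

    All parameter names and ranges are illustrative, not the actual
    settings behind the Expo Marker dataset.
    """
    rng = random.Random(seed)
    return {
        # Camera and lighting are varied per image
        "camera_distance_m": rng.uniform(0.2, 1.5),
        "light_intensity": rng.uniform(0.3, 2.0),
        "light_color_temp_k": rng.uniform(2500, 7500),
        # Marker pose is randomized in nonrealistic ways (e.g. floating mid-air)
        "marker_rotation_deg": [rng.uniform(0, 360) for _ in range(3)],
        # Distraction objects are drawn from a separate asset library
        "num_distractors": rng.randint(*num_distractors_range),
    }

params = sample_scene_params(seed=42)
```

Because each image is drawn independently from broad ranges, the aggregate dataset covers (and overshoots) the variation the network will see in real offices, which is the essence of Domain Randomization.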

(Domain Randomization illustration: https://lilianweng.github.io/lil-log/2019/05/05/domain-randomization.html)


To measure how well our synthetic dataset trained the network to perform instance segmentation on real images, we use mAP and mAR as metrics.

  • mAP (mean average precision) measures how accurate the predictions are, i.e. what percentage of the predictions are correct.
  • mAR (mean average recall) measures what percentage of the actual positives were identified correctly.

We achieved the following results:


These results were achieved on the real-image test set, while the network was trained exclusively on our synthetic training data, using 4,096 synthetic images (fewer than the full 5,000-image dataset because some images were held out for validation and testing rather than used to train the model). We would like to mention that these results were achieved without any special manipulations, using detectron2’s default training routine. Below you can see some visual results of our trained network working on the real-image test set:

Admittedly, these results lack meaning without additional context. In particular, they must be compared to mAP and mAR metrics achieved when the network is trained with non-synthetic data.

Part II of this series will provide this necessary context so that the efficacy of synthetic data can be quantitatively compared to the efficacy of manually gathered data. 

Click here to request a download link for our synthetic Expo Marker data set. Instructions will be sent to the email address provided. If you have feedback or questions about this research or the data we used, we invite you to reach out by emailing us at research@datagen.tech.