In this post, we are excited to present how we trained a network to identify dry-erase markers exclusively with synthetic training data and achieved results comparable to those obtained with manually gathered data. GPU capacity and neural network size have increased over time, while the ability of teams to obtain the necessary quantities of training data has stagnated. Furthermore, data collection and labeling are expensive, and the resulting labels may be unsatisfactory, especially in pixel-wise tasks such as segmentation. In some tasks, it’s almost impossible for humans to produce labels such as depth maps or surface normal maps.
Despite these challenges, there is still skepticism about the ability of synthetic or simulated data to achieve strong training results. This is Part 1 in a series of research papers that will explore the effectiveness of training networks with synthetic data and generalization to other training environments and applications.
In particular, we are sharing our results from using purely synthetic training data, and a mixture of synthetic and manually gathered training data, to train a network for object detection/segmentation of custom objects (object classes that have a single 3D shape and texture). Expo Markers were chosen for this task because of their standard texture, size, and 3D shape. An additional advantage is that they are available in offices around the world, which makes it easy to test and validate results. This same methodology can be used for any type of custom object.
Part 1 includes a number of different elements:
- Our white paper presenting our motivations, methods, and findings.
- The synthetic dataset of 5000 images, which was used as training data. We are glad to be able to make this dataset free for non-commercial uses. Click here to request a download link sent to your email address. You will also have the option to download two additional manually gathered (real) datasets of 1000 images and 250 images, used for performance comparison and evaluation, as well as pre-trained weights.
- Our code, which enables ‘plug and play’ training and evaluation of Mask R-CNN and Faster R-CNN using the detectron2 framework. You can go over our friendly Colab Notebook that includes visualizations and shows how to use the code.
To evaluate the quality of our synthetic data, we manually gathered two additional real datasets, which we call “Real A” and “Real B”. Real A was used for evaluation only. Real B was used both for training and evaluation. As we will see later on, evaluating on two different datasets enabled us to assess the networks’ ability to cope with different levels of domain gap.
The Real B test set measures performance when the domain gap is small, since its evaluation images come from the same dataset that was used for training. Real A simulates a larger domain gap, since the training images were taken in different conditions from the evaluation images (different devices, environments, etc.). Together, the two test sets let us assess the network’s generalization capabilities.
During our experiments, we used the datasets for three purposes: training, validation, and testing. Each data point was used for one purpose only.
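The "one purpose only" rule can be illustrated with a minimal sketch of a disjoint split. The function name `split_dataset`, the split fractions, and the file names are hypothetical and chosen only for illustration; they are not the split sizes used in our experiments.

```python
import random
from typing import List, Sequence, Tuple


def split_dataset(
    items: Sequence[str],
    train_frac: float,
    val_frac: float,
    seed: int = 0,
) -> Tuple[List[str], List[str], List[str]]:
    """Shuffle items once and cut them into disjoint train/val/test lists.

    The fractions are illustrative assumptions; whatever is left after the
    train and validation cuts becomes the test set.
    """
    if train_frac < 0 or val_frac < 0 or train_frac + val_frac > 1:
        raise ValueError("fractions must be non-negative and sum to at most 1")

    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded, so the split is reproducible

    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)

    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test


# Hypothetical image names, used only to show the call.
image_names = [f"image_{i:04d}.png" for i in range(20)]
train_set, val_set, test_set = split_dataset(image_names, train_frac=0.7, val_frac=0.15)

# Each data point lands in exactly one subset.
assert not (set(train_set) & set(val_set))
assert not (set(train_set) & set(test_set))
assert not (set(val_set) & set(test_set))
```

Slicing a single shuffled list is what guarantees that no image is shared between subsets, so validation and test scores are not inflated by images the network has already seen.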
Our Approach: Domain Randomization + Distractions
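To measure how well our synthetic dataset trained the network to perform instance segmentation on real images, we used mAP at different IoU thresholds as our metric.
At a high level, mAP (mean average precision) summarizes how accurate your predictions are. Precision is the percentage of your predictions that are correct, where a predicted instance counts as correct if its IoU (intersection over union) with a ground-truth instance meets the chosen threshold. Average precision combines precision and recall across prediction confidence levels, and mAP is the mean of that value over the object classes.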
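To make the metric more concrete, here is a minimal pure-Python sketch of the two building blocks: the IoU between two binary masks, and a simplified, non-interpolated average precision for a single class. This is an illustration under stated assumptions, not the evaluation code we used; COCO-style evaluators use an interpolated variant and handle the matching of predictions to ground truth, which is assumed here to have been done already. The names `mask_iou` and `average_precision` are hypothetical.

```python
from typing import List, Sequence, Tuple

Mask = Sequence[Sequence[int]]  # binary mask given as rows of 0/1 values


def mask_iou(a: Mask, b: Mask) -> float:
    """Intersection over union of two binary masks of the same shape."""
    intersection = 0
    union = 0
    for row_a, row_b in zip(a, b):
        for pixel_a, pixel_b in zip(row_a, row_b):
            if pixel_a and pixel_b:
                intersection += 1
            if pixel_a or pixel_b:
                union += 1
    return intersection / union if union else 0.0


def average_precision(detections: List[Tuple[float, bool]], num_gt: int) -> float:
    """Simplified (non-interpolated) AP for one class at one IoU threshold.

    Each detection is (confidence, is_true_positive), where is_true_positive
    means the prediction was matched to a ground-truth instance with IoU at
    or above the threshold. The matching step itself is assumed done.
    """
    if num_gt == 0:
        return 0.0
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)
    true_positives = 0
    precision_sum = 0.0
    for rank, (_, is_tp) in enumerate(ranked, start=1):
        if is_tp:
            true_positives += 1
            precision_sum += true_positives / rank  # precision at this recall step
    return precision_sum / num_gt


# Toy example: two 2x2 masks that overlap in one pixel out of three covered.
predicted = [[1, 1], [0, 0]]
ground_truth = [[1, 0], [1, 0]]
iou = mask_iou(predicted, ground_truth)
is_correct_at_50 = iou >= 0.5  # would this prediction count at IoU threshold 0.5?
```

Raising the IoU threshold makes the test for a "correct" prediction stricter, which is why reporting mAP at several thresholds says more about mask quality than a single number.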
As shown in the graph below:
- The networks that were trained on Synthetic and on Real B achieved approximately the same performance when evaluated on Real A (80.08 and 80.76, respectively).
- An improvement of about 5% was observed when the network was trained on a mixture of Synthetic and Real B and then evaluated on Real A (relative to training on only Synthetic or only Real B and then evaluating on Real A).