How “Deep Learning” Technology is Revolutionizing Sports Production
An overhead automated camera tracks the field action based on deep-learning algorithms.
Deep learning technology is more common than one might think. It is used to identify objects in images, text or audio, achieving results that were not possible before. This article examines how deep learning is revolutionizing sports production, enabling low-cost, fully automated production for semi-professional and amateur sports broadcasts.
To understand how deep learning works, consider how our brains work. A human brain is made up of nerve cells, called "neurons," which are connected to each other in adjacent layers, forming an elaborate "neural network." In an artificial neural network, signals also travel between "neurons." Instead of firing electrical signals, however, an artificial network assigns numerical "weights" to the connections between neurons.
Deep learning neural networks can comprise as many as 150 connected layers; the more layers, the "deeper" the network. Deep learning models are trained using large sets of labeled or annotated data. The network learns features directly from the data, so there is no need to hand-pick the features used to classify images. Nor are the relevant features pretrained; they are learned while the network trains on a collection of images. This automated feature extraction makes deep learning models highly accurate for computer vision tasks such as object classification.
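The idea of weighted connections between layers can be sketched in a few lines of NumPy. This is a toy illustration of the mechanism described above, not a production model; the layer sizes and random weights are arbitrary assumptions for the sake of the example.

```python
import numpy as np

def relu(x):
    # Non-linearity applied between layers.
    return np.maximum(0.0, x)

def forward(x, layers):
    # Each layer is a (weights, biases) pair; the signal "travels"
    # from layer to layer as a sequence of weighted sums.
    for w, b in layers:
        x = relu(w @ x + b)
    return x

rng = np.random.default_rng(0)
# A small 3-layer network: 4 inputs -> 8 hidden -> 8 hidden -> 2 outputs.
sizes = [4, 8, 8, 2]
layers = [(rng.standard_normal((m, n)) * 0.1, np.zeros(m))
          for n, m in zip(sizes[:-1], sizes[1:])]

out = forward(np.ones(4), layers)
print(out.shape)  # (2,)
```

Training amounts to adjusting those weights so that the outputs match the annotated ground truth; the "depth" is simply the number of such layers stacked.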
Although there is no need to manually extract each feature, there is a need to create a large enough training data set with annotations. For example, to identify a ball, you will need a data set of hundreds of thousands of unique images, which are annotated by humans and represent the "ground truth" for the deep learning model. Considering that other elements, such as players, are usually annotated as well, this can add up to millions of annotations. The result is a "trained model" that can identify the objects it was trained on.
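What one of those human annotations might look like can be sketched as a simple per-frame record of labeled bounding boxes. The field names below are purely illustrative, not any vendor's actual schema.

```python
# Hypothetical per-frame annotation record; field names are illustrative.
# Boxes are (x, y, width, height) in pixels.
frame_annotation = {
    "frame_id": 1024,
    "objects": [
        {"label": "ball",    "bbox": (640, 355, 14, 14)},
        {"label": "player",  "bbox": (410, 300, 40, 90)},
        {"label": "player",  "bbox": (705, 280, 38, 95)},
        {"label": "referee", "bbox": (520, 310, 36, 92)},
    ],
}

# A trained detector is scored against such "ground truth": each predicted
# box is checked for overlap with a labeled box of the same class.
labels = [obj["label"] for obj in frame_annotation["objects"]]
print(labels.count("player"))  # 2
```

Multiply a record like this by hundreds of thousands of frames and the "millions of annotations" figure becomes easy to see.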
Deep Learning in Sports Production
Deep learning is used to generate fully automated sports productions that look very similar to a professional sports broadcast, including camera zoom-ins on the action, panning and so on. The basis for any decent automated sports production is the ability to identify at least the ball and the players. Identifying the ball is not an easy task, considering that the ball can be on the ground or held by a goalkeeper or a player (e.g., before taking a free kick).
Deep learning technologies enable software to identify all of the required elements of a sports broadcast to automate its live production.
If you think about it, in all these different situations the ball "looks" different, yet we, as humans, have no problem identifying it as a ball from a single frame. Identifying the players is not simple either, as the system has to distinguish "real" players from referees, bench players and so on.
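Once the ball is being tracked, the "virtual camera operator" part of the job is comparatively simple. The sketch below assumes the detector already supplies a ball position per frame, and centers a fixed-size crop window on it within a wide panoramic frame; all frame and crop sizes are illustrative assumptions.

```python
def crop_window(ball_x, ball_y, frame_w=3840, frame_h=1080,
                out_w=1280, out_h=720):
    """Center a virtual-camera crop on the ball, clamped to the frame.

    Returns (left, top) of an out_w x out_h crop inside the panoramic
    frame, so the crop never pans past the frame edges.
    """
    left = min(max(ball_x - out_w // 2, 0), frame_w - out_w)
    top = min(max(ball_y - out_h // 2, 0), frame_h - out_h)
    return left, top

print(crop_window(1920, 540))  # (1280, 180) -- centered on the ball
print(crop_window(10, 10))     # (0, 0) -- clamped at the frame corner
```

A real system would also smooth the window's motion over time so the virtual pan looks like a human operator rather than a jittery tracker, but the core decision, where to point the "camera", reduces to arithmetic like this once detection works.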
Identifying the Field/Court
In sports production, one way to help identify the ball and the players is to define for the system the area that constitutes the field/court. This process -- "calibration" -- limits the scope of options for the deep learning algorithm by establishing, within each frame, which pixels are part of the field/court and which are not. It then translates these pixels into physical dimensions based on real-world coordinates.
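A common way to perform that pixel-to-metres translation for a flat playing surface is a planar homography fitted from a handful of known landmarks (corner flags, box corners). The sketch below implements the standard direct linear transform with plain NumPy; in practice one would more likely use OpenCV's `findHomography`. The pixel/metre correspondences are invented for illustration, assuming a 105 x 68 m soccer pitch.

```python
import numpy as np

def fit_homography(px_pts, field_pts):
    """Estimate the 3x3 homography mapping pixel points to field points
    via the direct linear transform (DLT). Needs >= 4 correspondences."""
    rows = []
    for (x, y), (X, Y) in zip(px_pts, field_pts):
        rows.append([x, y, 1, 0, 0, 0, -X * x, -X * y, -X])
        rows.append([0, 0, 0, x, y, 1, -Y * x, -Y * y, -Y])
    # The homography is the null vector of this system, i.e. the
    # right-singular vector for the smallest singular value.
    _, _, vt = np.linalg.svd(np.array(rows, float))
    return vt[-1].reshape(3, 3)

def to_field(H, x, y):
    # Apply the homography and dehomogenize.
    X, Y, w = H @ (x, y, 1)
    return X / w, Y / w

# Illustrative correspondences: four pixel landmarks matched to metres.
pixels = [(100, 900), (1820, 900), (1500, 200), (420, 200)]
metres = [(0, 0), (105, 0), (105, 68), (0, 68)]
H = fit_homography(pixels, metres)
print(to_field(H, 100, 900))  # approximately (0.0, 0.0)
```

With `H` in hand, every detected player or ball position can be reported in real-world coordinates, which is what makes distances, speeds and offside-style geometry possible downstream.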
By establishing the area of the field/court, it is possible to distinguish between players who are inside the field/court versus others outside of it, such as bench players, and between players on the field and spectators, who are outside the field.
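Once the field boundary is known in pixel coordinates, deciding whether a detection is "on the field" reduces to a point-in-polygon test. The ray-casting sketch below is a minimal version of that idea; the trapezoidal boundary coordinates are invented, and the convention of testing the bottom-center of a player's bounding box (the feet) is an assumption, not a stated detail of any particular product.

```python
def inside_field(px, py, boundary):
    """Ray-casting point-in-polygon test: cast a ray to the right and
    count boundary crossings; an odd count means the point is inside."""
    inside = False
    n = len(boundary)
    for i in range(n):
        x1, y1 = boundary[i]
        x2, y2 = boundary[(i + 1) % n]
        if (y1 > py) != (y2 > py):
            # x-coordinate where this edge crosses the ray's height.
            x_cross = x1 + (py - y1) * (x2 - x1) / (y2 - y1)
            if px < x_cross:
                inside = not inside
    return inside

# Illustrative field boundary in pixel coordinates (a trapezoid, as a
# pitch typically appears from an elevated camera).
field = [(100, 900), (1820, 900), (1500, 200), (420, 200)]

# Test the player's foot point (bottom-center of the bounding box).
print(inside_field(960, 500, field))  # True  -- on the pitch
print(inside_field(50, 950, field))   # False -- e.g. a bench player
```

Detections that fail this test, bench players, officials near the touchline, spectators, can then be excluded before the production logic decides where to point the virtual camera.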
Data Annotation for Sports
As mentioned above, training a deep learning model requires a large data set to establish the "ground truth" for the algorithm. Building it is a major undertaking that should be done on an ongoing basis as more data is gathered and the algorithm evolves.
There are several ways to achieve this. A minimal number of frames must be annotated by humans, but several methods require less manual effort, including:
- Google/YouTube images – it is possible to augment the data set by searching for "soccer players" on Google or YouTube. This yields frames or images that include soccer players – in other words, images that have been "pre-annotated" as soccer players.
- Unsupervised learning – this technique uses unlabeled data by first applying an additional, non-deep-learning algorithm to segment the areas where players may be. For example, known background subtractors such as MOG (Mixture of Gaussians) can roughly identify players.
- Augmentations – another commonly used technique is to modify, or augment, the images, for example by stretching them or changing their angles. These augmentations produce an additional data set that is already labeled.
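The unsupervised option above can be illustrated with a deliberately simplified background model. A real pipeline would use something like OpenCV's MOG2 mixture-of-Gaussians subtractor; the running-average version below is a NumPy-only stand-in that captures the same idea on a toy grayscale "frame".

```python
import numpy as np

def update_background(bg, frame, alpha=0.05):
    # Slowly blend each new frame into the background estimate so the
    # static pitch dominates while moving players average out.
    return (1 - alpha) * bg + alpha * frame

def foreground_mask(bg, frame, threshold=30):
    # Pixels that differ strongly from the background are likely players.
    return np.abs(frame - bg) > threshold

# Toy grayscale "video": a static pitch (value 100) with a bright 2x2
# "player" patch appearing in the current frame.
bg = np.full((8, 8), 100.0)
frame = bg.copy()
frame[3:5, 3:5] = 220.0  # the moving player

bg = update_background(bg, frame)
mask = foreground_mask(bg, frame)
print(int(mask.sum()))  # 4 -- only the player's pixels are flagged
```

Regions flagged this way can be cropped out and handed to human annotators (or a weak classifier) far faster than labeling raw frames from scratch, which is exactly the effort-saving the unsupervised approach is after.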
One key to proper camera tracking is for the system to recognize the area of the field or court. The software must distinguish between players who are inside the field/court versus others outside of it.
As we've seen, with deep learning technologies computers can understand the sports action, opening opportunities in sports production that were never possible before. At its highest level, this technology can mimic the decision-making process of a human camera operator and video editor, providing almost the same experience as a professional live sports broadcast at a fraction of the cost. This technological revolution will allow semi-professional and amateur sports clubs to broadcast their games to their fans and even monetize their content.
Yoav Liberman is Director of Computer Vision & Deep Learning Algorithms at Pixellot.