This post provides a step-by-step tutorial to run an object detection model on a drone’s live video feed. The demonstration here can easily be extended to run any deep learning model on the video captured by the drone in real time.

Drones entered the commercial space as exciting, recreational, albeit expensive, toys and have slowly transformed into a multi-billion dollar industry with myriad commercial applications ranging from asset inspection to military surveillance. Artificial Intelligence, with its recent advances, has been a game changer for the drone industry, opening avenues that were unimaginable just a few years ago. Who would have thought that “killer drones” could pose an actual threat to human life outside the world of Terminator?

A fictional video showing how AI can convert drones into horrifying killing machines. credits:

AI can replace humans at various levels of commercial drone use: it can autonomously control the drone's flight, analyse sensor data in real time, or examine the data post-flight to generate insights. At each of these levels, it is often necessary to identify and locate objects of interest around the drone from the data captured by its sensors, which makes Object Detection fundamental to giving a drone artificial intelligence. You can find a detailed explanation of object detection in another post.

Detecting people from a drone's live video feed. You will be able to achieve similar results by following this tutorial.

Let's jump right into running your own object detection model on a drone's video feed in real time. We discuss training your own object detection model in the second half. If you only want to stream and display your drone's live video on your laptop/computer, STEP 1 is all you need.


What you will need:
  1. Drone (duh!): DJI dominates the market for mid-to-long-range drones. Though the method detailed here is for a DJI drone (we used a Phantom 4 Advanced), the same principle can be applied to drones from a few other companies. You can check out the list of DJI drones and their prices on their official site.
  2. Phone: this connects to the drone's remote controller (RC). You can have a look at some of the phones that are compatible with DJI drones here.
  3. Laptop/Computer (with a WiFi adapter): this is where the deep learning models actually run. A computer with a GPU is highly recommended, as it speeds up inference significantly and makes the results truly real time. If you only want to display the drone's live video, any laptop should work.
Architectural diagram showing the flow of data for real time object detection on drones. 

The process can be broken down into 3 parts:
1. Stream the drone's video to a computer/laptop (drone -> your computer)
2. Run an object detection model on the streaming video and display results (on your computer)
3. Train your own object detection model (to detect new kinds of objects).

STEP 1: Stream the drone's video to your computer

We exploit the DJI GO 4 mobile app’s ability to live stream video. A DJI drone sends real-time HD video to its controller. The controller is connected to a smartphone, which manages the drone through the DJI GO 4 app. The app has a live streaming option that can forward the stream to any RTMP (Real-Time Messaging Protocol) server address. The idea is to set up an RTMP server on your computer and send the stream from the drone to this server. The stream can then be read frame by frame in Python (using libraries like OpenCV).

i. Install and run an RTMP server on your computer
ii. Create a WiFi hotspot on your computer (optional)
iii. Forward drone's feed to RTMP server over WiFi
iv. Access video stream from RTMP server

i. Install and run an RTMP server
Nginx is a lightweight web server that can also host RTMP streams, but it does not ship with the RTMP module by default.
If you are running macOS, you can start a local RTMP server simply by downloading and running a ready-made local RTMP server application. On Linux, you need to compile nginx from source along with the RTMP module. The steps are below:

#  Install prerequisites
sudo apt-get install build-essential libpcre3 libpcre3-dev libssl-dev zlib1g-dev

#  Download nginx

#  Download RTMP module

#  Build
tar -zxvf nginx-1.15.2.tar.gz
cd nginx-1.15.2
./configure --with-http_ssl_module --add-module=../nginx-rtmp-module-dev
sudo make install

We now need to configure nginx to use RTMP. Paste the following lines at the end of the config file, located at /usr/local/nginx/conf/nginx.conf.

rtmp {
    server {
        listen 1935;
        chunk_size 4096;
        application live {
            live on;
            record off;
        }
    }
}

Start the nginx server.

#  Start nginx
sudo /usr/local/nginx/sbin/nginx

#  Stop the nginx server using
sudo /usr/local/nginx/sbin/nginx -s stop

ii. Create a WiFi hotspot (Optional)
You now need to connect your phone and your computer to the same WiFi network.
You can do this by either:
a. Ensuring they are connected to the same WiFi network
b. Creating a WiFi hotspot on your computer and connecting the phone to this network.
Option (a) may not always be possible, and it can lead to a lagged stream (up to 5 seconds). Option (b) does not have this problem.
Option (b): Create a WiFi hotspot on your computer and connect the phone (which is attached to the controller) to it. I followed the instructions given here to start a WiFi hotspot on a Linux machine. Once the hotspot has started, find your computer's IP address using ifconfig. This is the address to which you will forward the live feed from the phone.
Note: Make sure that your firewall allows incoming TCP connections on port 1935.
Now start your nginx RTMP server:
sudo /usr/local/nginx/sbin/nginx
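Once nginx is running, it can help to confirm that port 1935 is actually reachable before touching the app. A minimal sketch using only Python's standard library; `port_is_open` is a hypothetical helper of our own, and it only checks that a TCP connection can be opened, not that nginx is serving RTMP on that port. Run it from another machine on the hotspot network to exercise the firewall rule (running it on the laptop itself bypasses the firewall).

```python
import socket


def port_is_open(host, port=1935, timeout=2.0):
    """Return True if a TCP connection to (host, port) succeeds within `timeout` seconds."""
    try:
        # create_connection resolves the host and performs the TCP handshake
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts
        return False
```

For example, `port_is_open('<your computer IP>')` should return True once nginx is listening and the firewall lets the connection through.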

iii. Forward drone's feed to RTMP server over WiFi
Ensure that your phone is connected to the WiFi hotspot you created above, and connect your drone's remote controller to your phone using the DJI GO 4 app. Assuming your drone is paired with the controller, you should see a “Choose Live Streaming Platform” option in the options menu. Select the custom RTMP option and enter the nginx RTMP server address:
rtmp:// (“drone” can be any unique string)
The drone now starts sending its live feed to our computer at the above address. If your phone is successfully forwarding the drone stream to the RTMP server it should look something like this (yellow oval):

Streaming your drone video to an RTMP server through the DJI GO 4 app.
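nginx-rtmp addresses generally take the form `rtmp://<server>[:port]/<application>/<stream key>`, where the application name comes from the `application live` block in the nginx config and the stream key is the unique string you choose. A small sketch for assembling it, assuming the `live` application configured earlier; `rtmp_url` is a hypothetical helper, and the port is left out by default because 1935 is the standard RTMP port.

```python
def rtmp_url(server_ip, stream_key='drone', application='live', port=None):
    """Build an address of the form rtmp://<ip>[:port]/<application>/<stream_key>."""
    # Only include the port when it differs from the client's default (1935)
    host = server_ip if port is None else '{}:{}'.format(server_ip, port)
    return 'rtmp://{}/{}/{}'.format(host, application, stream_key)


# Placeholder IP address; use the one reported by ifconfig
print(rtmp_url('192.168.1.10'))  # rtmp://192.168.1.10/live/drone
```

The same string is what you would pass to the app and to `cv2.VideoCapture` on the computer side.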

iv. Access video stream from RTMP server
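The Python code below reads the live feed from our RTMP server and displays it in a window.

import cv2

# Use the same RTMP address you entered in the DJI GO 4 app
cam = cv2.VideoCapture('rtmp://')
while True:
    ret, img = cam.read()
    if not ret:  # stream ended or a frame could not be read
        break
    cv2.imshow('drone', img)
    # press esc to exit
    if cv2.waitKey(1) == 27:
        break
cam.release()
cv2.destroyAllWindows()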

Once you can access the drone’s live feed programmatically, you can run deep learning inference on each frame in any framework of your choice (Theano, Keras, PyTorch, MXNet, Lasagne). The next section shows how to run an object detection model using TensorFlow.
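Whatever the framework, the per-frame loop has the same shape: read a frame, run the model, use the result. A framework-agnostic sketch; `run_on_stream` and the `infer` callable are hypothetical names, and `capture` is assumed to be any object whose `read()` returns `(ok, frame)` the way `cv2.VideoCapture.read()` does.

```python
def run_on_stream(capture, infer, max_frames=None):
    """Yield (frame, infer(frame)) pairs until the stream ends.

    capture:    any object whose read() returns (ok, frame), e.g. cv2.VideoCapture
    infer:      callable that runs your model on a single frame
    max_frames: optional limit on the number of frames processed
    """
    count = 0
    while max_frames is None or count < max_frames:
        ok, frame = capture.read()
        if not ok:  # stream ended or a frame could not be decoded
            break
        yield frame, infer(frame)
        count += 1
```

With OpenCV you would pass `cv2.VideoCapture('rtmp://...')` as `capture` and your model's prediction function as `infer`, then draw or log each result inside `for frame, result in run_on_stream(cap, infer):`. Keeping the model behind a plain callable means the streaming code does not change when you switch frameworks.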

STEP 2: Run an object detection model and display results (on your computer)

The code snippets below demonstrate how to use a trained model for inference. Look at the next section to find out how to train your own model for detecting custom objects. Make sure you have TensorFlow and OpenCV installed before you start. The code has been tested on TensorFlow 1.10.0 but should work on other 1.x versions with minimal modifications. It uses the TensorFlow 1.x API (tf.Session, tf.GraphDef, tf.gfile), which TensorFlow 2.x exposes under tf.compat.v1.

# Imports
import numpy as np
import cv2
import tensorflow as tf

# Compare numeric (major, minor) parts: a plain string comparison treats '1.9.0' as newer than '1.10.0'
if tuple(int(x) for x in tf.__version__.split('.')[:2]) < (1, 10):
  raise ImportError('Please upgrade your tensorflow installation to v1.10.* or later!')

Set the path to the frozen detection graph and load it into memory. This is the TensorFlow model used for object detection. You can find more details on creating this trained model in the next section (STEP 3). You can download the person detector that I trained on aerial images from here (frozen_inference_graph.pb).

PATH_TO_CKPT = './frozen_inference_graph.pb'

detection_graph = tf.Graph()
with detection_graph.as_default():
  od_graph_def = tf.GraphDef()
  with tf.gfile.GFile(PATH_TO_CKPT, 'rb') as fid:
    # Read the serialized GraphDef and import it into detection_graph
    serialized_graph = fid.read()
    od_graph_def.ParseFromString(serialized_graph)
    tf.import_graph_def(od_graph_def, name='')

Run the detection model frame-by-frame and display the results to a window.

THRESHOLD = 0.6  # score threshold for detections

with detection_graph.as_default():
  with tf.Session(graph=detection_graph) as sess:
    cap = cv2.VideoCapture('rtmp://')
    # Check if the drone feed is opened correctly
    if not cap.isOpened():
        raise IOError("Cannot connect to drone feed.")

    # Define input and output Tensors for detection_graph
    image_tensor = detection_graph.get_tensor_by_name('image_tensor:0')
    # Each box represents a part of the image where a particular object was detected.
    detection_boxes = detection_graph.get_tensor_by_name('detection_boxes:0')
    # Each score represents the level of confidence for each of the detected objects.
    # Scores and classes are returned alongside the boxes; only the boxes are drawn here.
    detection_scores = detection_graph.get_tensor_by_name('detection_scores:0')
    detection_classes = detection_graph.get_tensor_by_name('detection_classes:0')
    num_detections = detection_graph.get_tensor_by_name('num_detections:0')

    while True:
        ret, frame = cap.read()
        if not ret:  # stream ended or a frame could not be read
            break
        height, width, _ = frame.shape

        # OpenCV returns BGR frames; the detector expects RGB input
        rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        (boxes, scores, classes, num) = sess.run(
          [detection_boxes, detection_scores, detection_classes, num_detections],
          feed_dict={image_tensor: np.expand_dims(rgb, axis=0)})

        # Boxes are normalised [ymin, xmin, ymax, xmax]; scale them to pixel coordinates
        for i, box in enumerate(boxes[0]):
            if scores[0][i] > THRESHOLD:
                frame = cv2.rectangle(frame,(int(box[1]*width),int(box[0]*height)),(int(box[3]*width),int(box[2]*height)),(0,0,255),2)
            else:
                # model gives back detections sorted with decreasing score, hence break if score is below threshold. This only works for single class detector.
                break

        cv2.imshow('bird', frame)

        # Press esc to exit/stop
        c = cv2.waitKey(1)
        if c == 27:
            break

    cap.release()
    cv2.destroyAllWindows()
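The drawing loop relies on two properties of the detector's outputs: boxes are normalised `[ymin, xmin, ymax, xmax]` values between 0 and 1, and detections arrive sorted by decreasing score. The same post-processing can be pulled out into a small pure-Python helper that is easy to test without a model; `to_pixel_boxes` is a hypothetical name of our own, and it assumes the batch dimension has already been dropped (i.e. it receives `boxes[0]` and `scores[0]`).

```python
def to_pixel_boxes(boxes, scores, width, height, threshold=0.6):
    """Return [(x1, y1, x2, y2), ...] pixel corners for detections scoring above threshold.

    boxes:  sequence of normalised [ymin, xmin, ymax, xmax] boxes (values in 0..1)
    scores: matching confidence scores, sorted in decreasing order
    """
    pixel_boxes = []
    for box, score in zip(boxes, scores):
        if score <= threshold:
            # Scores are sorted, so every remaining detection is below the threshold too
            break
        ymin, xmin, ymax, xmax = box
        # cv2.rectangle expects (x, y) corners in integer pixels
        pixel_boxes.append((int(xmin * width), int(ymin * height),
                            int(xmax * width), int(ymax * height)))
    return pixel_boxes
```

Each returned tuple `(x1, y1, x2, y2)` maps directly onto `cv2.rectangle(frame, (x1, y1), (x2, y2), (0, 0, 255), 2)`.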

STEP 3: Train your own object detection model

Alright, you can detect pedestrians now, but what if you cared about detecting cars, or a raccoon in your backyard?
You might be tempted to use one of the many publicly available pre-trained TensorFlow models, but be forewarned! The accuracy of any deep learning model is highly dependent on the data it is trained on. Since most publicly available models are not trained on aerial images, they do not work well on images taken from a drone. Training your own object detection model is therefore inevitable.
A simple Google search will lead you to plenty of beginner to advanced tutorials delineating the steps required to train an object detection model for locating custom objects in images. Here are a few tutorial links to build your own object detection model:

Any tutorial will broadly require you to perform the following steps:
i. Gather and Annotate images.
ii. Convert training data to a format consumable by the model-train script.
iii. Select a model architecture and search for the best hyperparameters.
iv. Export and host the best model.
Step (iii) is the most time-consuming, since it involves carefully selecting and tuning a large number of parameters, each with some kind of speed or accuracy tradeoff. It is also often tedious to set up your machine for deep learning development, from installing NVIDIA GPU drivers, CUDA and cuDNN and getting the versions right, to installing TensorFlow optimised for your platform. All this can quickly turn into a nightmare, especially for a rookie.
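Step (iii) is essentially a search over configurations, each trading speed against accuracy. A minimal grid-search sketch under stated assumptions: `evaluate` is a hypothetical callable that would train a model with the given settings and return `(accuracy, latency_ms)`, and any parameter names you put in the grid are placeholders rather than options of a particular training script. It keeps the most accurate configuration that still fits a real-time latency budget.

```python
import itertools


def grid_search(param_grid, evaluate, max_latency_ms):
    """Return (best_params, best_accuracy) among configs meeting the latency budget.

    param_grid: dict mapping parameter name -> list of candidate values
    evaluate:   callable(params) -> (accuracy, latency_ms)
    Returns (None, -inf) if no configuration is fast enough.
    """
    names = sorted(param_grid)
    best_params, best_accuracy = None, float('-inf')
    # Try every combination of candidate values (the Cartesian product of the grid)
    for values in itertools.product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        accuracy, latency_ms = evaluate(params)
        # Discard configurations too slow for real-time use on the target machine
        if latency_ms <= max_latency_ms and accuracy > best_accuracy:
            best_params, best_accuracy = params, accuracy
    return best_params, best_accuracy
```

An exhaustive grid grows multiplicatively with every parameter added, which is exactly why this step dominates training time; random or Bayesian search are common replacements once the grid gets large.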

The next section describes how to build and use an object detection model through the Nanonets APIs. Give us flak for promoting our product and skip ahead, or take a few moments to play with our website and save a ton of time and effort building a model from scratch.


We at Nanonets have a goal of making working with Deep Learning super easy. Object Detection is a major focus area for us and we have made a workflow that solves a lot of the challenges of implementing Deep Learning models.

Nanonets makes building and deploying object detection models as easy as it gets. All you need to do is upload images and annotations for the objects that you want to detect. It employs Transfer Learning and intelligently selects the best architecture along with hyperparameter optimisation. This not only ensures that the final model works best on the sort of data you have, but also lowers the amount of training data required. Nanonets has automated the entire pipeline of building models (running experiments with different architectures in parallel, selecting the right hyperparameters and evaluating each model to find the best one) and then deploying them. You also do not need to worry about any of that tedious setup; once a model is trained, you can either use it through API calls over the web (in a programming language of your choice) or run it locally in a Docker image.

End-to-end flow of Nanonets API

Try building your own object detection model for free:
1. Through the Web based GUI:
2. Using Nanonets API:
Detailed steps on how to use Nanonets APIs can be found in one of our other blogs under the section "Build your Own NanoNet".

Once you have trained a model, you can download it in a Docker image by selecting the "Integrate" tab at the top.

This tab also contains instructions to install Docker, download your Docker image containing the trained model, and run the Docker container. Using Docker removes the need to set up your machine environment for deep learning. Below are the steps to download and run one of our publicly available Docker images, which contains the person detector (for aerial images) model.

Step1: Install Docker

curl -fsSL -o && sudo sh

If your computer has a GPU, we recommend installing NVIDIA Docker to get near real-time inference.

Step2: Login to the Docker Registry

sudo docker login --username --password API_KEY

Step3: Run the Docker Container

sudo nvidia-docker run -p 8081:8080

To run the docker on a computer without GPU, run:

sudo docker run -p 8081:8080

Once you have run Step3, your model should be hosted and ready to make inferences on images programmatically through web requests. The code below shows how to get detections for one image:

import requests
url = 'http://localhost:8081/'
with open('REPLACE_IMAGE_PATH.jpg', 'rb') as image_file:
    data = {'file': image_file}
    response = requests.post(url, auth=requests.auth.HTTPBasicAuth('API_KEY', ''), files=data)
print(response.text)
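The response body is JSON. Judging from the keys the complete drone example reads (`result`, `prediction`, and `xmin`/`ymin`/`xmax`/`ymax`), the boxes can be extracted with a few lines. This sketch assumes exactly that response shape; `extract_boxes` is a hypothetical helper of our own, not part of the Nanonets API, and it walks every entry in `result` rather than only the first.

```python
import json


def extract_boxes(response_text):
    """Return [(xmin, ymin, xmax, ymax), ...] from a detection response body.

    Assumes the shape used in this post:
    {"result": [{"prediction": [{"xmin": ..., "ymin": ..., "xmax": ..., "ymax": ...}, ...]}]}
    """
    body = json.loads(response_text)
    boxes = []
    for result in body.get('result', []):
        for pred in result.get('prediction', []):
            # cv2.rectangle needs integer pixel coordinates
            boxes.append((int(pred['xmin']), int(pred['ymin']),
                          int(pred['xmax']), int(pred['ymax'])))
    return boxes
```

Usage: `extract_boxes(response.text)` after the request above. For a single uploaded image there is one entry in `result`, so this matches reading `response["result"][0]["prediction"]` directly.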

Here is the complete code to run object detection on the drone's video feed using the Nanonets Docker image:

import requests
import cv2
import json

url = 'http://localhost:8081/'
cap = cv2.VideoCapture('rtmp://')
# Check if the drone feed is opened correctly
if not cap.isOpened():
    raise IOError("Cannot connect to drone feed.")

while True:
    ret, frame = cap.read()
    if not ret:  # stream ended or a frame could not be read
        break
    # Encode the frame as JPEG and send it to the locally hosted model
    _, img_encoded = cv2.imencode('.jpg', frame)
    response = requests.post(
        url, auth=requests.auth.HTTPBasicAuth('YOUR_API_KEY', ''),
        files={"file": ("frame.jpg", img_encoded.tobytes())},
    )
    response = json.loads(response.text)
    prediction = response["result"][0]["prediction"]
    for i in prediction:
        frame = cv2.rectangle(frame, (int(i['xmin']), int(i['ymin'])), (int(i['xmax']), int(i['ymax'])), (0, 0, 255), 2)
    cv2.imshow('bird', frame)
    # Press esc to exit/stop
    c = cv2.waitKey(1)
    if c == 27:
        break

cap.release()
cv2.destroyAllWindows()

Hardware based solutions for running deep learning models on-board the drone

There are other ways to run object detection on drones in real time by using additional hardware.
1. One can use high-performance embedded computers (companion computers) like DJI’s Manifold, which can be fitted to a drone. You can then run the deep learning models on board the drone by programming the Manifold using the DJI Onboard SDK.
Companion computers are small form-factor Linux systems-on-module that can be physically attached to a drone and are capable of handling computationally demanding deep learning inference. The table below compares some of the popular embedded platforms (companion computers).

Companion computers for on-board processing (** value not available) 

2. Alternatively, one can get the video output from the controller into a machine where the deep learning models can run. You might need to buy an HDMI output module (~$100) if the controller doesn’t have one, and also an HDMI-to-USB converter (~$500; cheap ones do not perform well on HD video, which can affect a model’s accuracy), since laptops do not accept HDMI input.

About Nanonets: Nanonets is building APIs to simplify deep learning for developers. Visit us at for more information.