
Detecting objects in an image with AWS Rekognition and labeling them with OpenCV

Abstract

Amazon Rekognition is a cloud-based image and video analysis service provided by Amazon Web Services (AWS). This powerful service enables developers to incorporate sophisticated computer vision capabilities into their applications without the need for extensive machine learning expertise. Amazon Rekognition makes it easy to analyze and extract valuable information from visual content, ranging from images to videos.

All the code snippets are prepared in a single Jupyter notebook that is shared in the Links section.

Key Features:

  • Object and Scene Detection
  • Facial Analysis and comparison
  • Text Detection
  • Content Moderation
  • Celebrity Recognition
  • Custom Labels based on a user-trained dataset

Image for analysis

Let’s take a random picture of a New York street, where multiple objects are present - cars, city lights, buildings, pedestrians, etc.

We will use it to perform analysis with AWS Rekognition.

gf.jpeg

Uploading the Image

The service supports two modes of image input. In the first, the image is already stored in S3: it gets uploaded, and the Rekognition logic is triggered based on a lifecycle/event rule - this is the most widely used scenario.

There is also another option: sending the image as a byte array without storing it in S3 - this is the option we will use.
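
For reference, the S3-based variant of the same call looks roughly like this (a minimal sketch; the bucket and object key names are placeholders):

import boto3

rekognition = boto3.client(service_name='rekognition', region_name='us-east-1')

# Reference the image directly in S3 instead of sending raw bytes
response = rekognition.detect_labels(
    Image={
        'S3Object': {
            'Bucket': 'my-example-bucket',  # placeholder bucket name
            'Name': 'samp.jpeg'             # placeholder object key
        }
    },
    MaxLabels=200,
    MinConfidence=95
)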

import boto3
import json

rekognition = boto3.client(service_name='rekognition', region_name='us-east-1')

# Read the image file as raw bytes
image_path = 'samp.jpeg'
with open(image_path, 'rb') as image_file:
    image_data = image_file.read()

# Call the AWS Rekognition API to detect labels in the image
max_labels = 200
min_confidence = 95
response = rekognition.detect_labels(
    Image={
        'Bytes': image_data
    },
    MaxLabels=max_labels,  # You can adjust this parameter based on your needs
    MinConfidence=min_confidence  # You can adjust this parameter based on your needs
)

# Dump the full response for inspection
print(json.dumps(response, sort_keys=True, indent=4))

# Print detected labels with their confidence
print("Detected labels:")
for label in response['Labels']:
    print(f"{label['Name']} ({label['Confidence']:.2f}%)")

As a result we get the following table, which gives a basic overview of what is represented in the picture. I have aggregated the information into a table with the category name and the confidence that this category is present in the image.

Category        Probability
Building        99.55%
Cityscape       99.55%
Light           99.42%
Traffic Light   99.42%
Tarmac          98.98%
Accessories     97.87%
Bag             97.87%
Handbag         97.87%
Person          97.64%
Car             95.22%
Transportation  95.22%
Vehicle         95.22%

Checking the response details

The actual service response contains a lot of metadata. Here is a single entity from the Transport and Logistics category only; the full response contains many such JSON objects:

{
    "Aliases": [],
    "Categories": [
        {
            "Name": "Transport and Logistics"
        }
    ],
    "Confidence": 99.41839599609375,
    "Instances": [
        {
            "BoundingBox": {
                "Height": 0.07998945564031601,
                "Left": 0.8064739108085632,
                "Top": 0.4241378605365753,
                "Width": 0.02277914620935917
            },
            "Confidence": 99.41839599609375
        },
        {
            "BoundingBox": {
                "Height": 0.03180865943431854,
                "Left": 0.7512333393096924,
                "Top": 0.6746096611022949,
                "Width": 0.010767364874482155
            },
            "Confidence": 95.53616333007812
        }
    ],
    "Name": "Traffic Light",
    "Parents": [
        {
            "Name": "Light"
        }
    ]
}

The key element here is BoundingBox, which describes the area of the source image where the object was detected. By parsing this response in a loop we can add a layer on top of the source image with rectangles marking the objects that were detected.
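
Before drawing anything, it can be useful to see which labels actually come with bounding-box instances (a minimal sketch over the response obtained above):

# List only the labels that have bounding-box instances
for label in response['Labels']:
    if label['Instances']:
        print(f"{label['Name']}: {len(label['Instances'])} instance(s)")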

Identifies the bounding box around the label, face, text, object of interest, or personal protective equipment. The left (x-coordinate) and top (y-coordinate) are coordinates representing the top and left sides of the bounding box. Note that the upper-left corner of the image is the origin (0,0).

The top and left values returned are ratios of the overall image size. For example, if the input image is 700x200 pixels, and the top-left coordinate of the bounding box is 350x50 pixels, the API returns a left value of 0.5 (350/700) and a top value of 0.25 (50/200).

The bounding box coordinates can have negative values. For example, if Amazon Rekognition is able to detect a face that is at the image edge and is only partially visible, the service can return coordinates that are outside the image bounds and, depending on the image edge, you might get negative values or values greater than 1 for the left or top values.
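
Because the coordinates are ratios and can fall outside the [0, 1] range, a small helper can convert them to pixel coordinates and clamp them to the image bounds (a minimal sketch; the function name is my own):

def bounding_box_to_pixels(box, image_width, image_height):
    # Convert the ratio-based BoundingBox into pixel coordinates
    left = int(box['Left'] * image_width)
    top = int(box['Top'] * image_height)
    right = int((box['Left'] + box['Width']) * image_width)
    bottom = int((box['Top'] + box['Height']) * image_height)
    # Clamp to the image bounds, since values can be negative or greater than 1
    left, right = max(0, left), min(image_width - 1, right)
    top, bottom = max(0, top), min(image_height - 1, bottom)
    return left, top, right, bottom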

OpenCV to mark detected objects

import cv2
import random

image = cv2.imread('samp.jpeg')

# Image width and height in pixels
print(image.shape[1])
print(image.shape[0])

for label in response['Labels']:
    if label['Instances']:
        # Pick one random color per label category
        red = random.randint(0, 255)
        green = random.randint(0, 255)
        blue = random.randint(0, 255)
        for instance in label['Instances']:
            # Convert the ratio-based bounding box into pixel coordinates
            annotation = instance['BoundingBox']
            width = int(annotation['Width'] * image.shape[1])
            height = int(annotation['Height'] * image.shape[0])
            left = int(annotation['Left'] * image.shape[1])
            top = int(annotation['Top'] * image.shape[0])

            # Draw a rectangle around the detected object
            cv2.rectangle(image, (left, top), (left + width, top + height), (red, green, blue), 2)

            # Put the label name and confidence in the middle of the box
            x = int(left + width / 2)
            y = int(top + height / 2)
            cv2.putText(image, f"{label['Name']}: {round(instance['Confidence'])}%", (x, y),
                        cv2.FONT_HERSHEY_SIMPLEX, 1, (red, green, blue), 2)

cv2.imwrite(f'augmented-{max_labels}_{min_confidence}.png', image)

We choose a random color for each label category, draw a rectangle for each bounding box of that label, and add text with the label name and confidence:
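
Since the snippets run in a Jupyter notebook, the augmented image can also be displayed inline, for example with matplotlib (a minimal sketch that reuses the variables defined above; OpenCV stores images in BGR order, hence the conversion):

import cv2
from matplotlib import pyplot as plt

augmented = cv2.imread(f'augmented-{max_labels}_{min_confidence}.png')
plt.figure(figsize=(12, 8))
plt.imshow(cv2.cvtColor(augmented, cv2.COLOR_BGR2RGB))  # convert BGR -> RGB for matplotlib
plt.axis('off')
plt.show()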

Image: min_confidence=70%, max_labels=50

gf.jpeg

When invoking Rekognition there are two parameters for tuning, as sketched below:

  • max_labels - the maximum number of labels to detect in the image
  • min_confidence - the minimum confidence threshold, in %
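
To produce the comparison images below, the same call is simply repeated with different parameter combinations (a minimal sketch that reuses rekognition and image_data from above and only reports the number of detected labels; the drawing step stays the same):

# Re-run detection for the parameter combinations used in the images below
for max_labels, min_confidence in [(50, 70), (50, 90), (200, 95), (200, 70)]:
    response = rekognition.detect_labels(
        Image={'Bytes': image_data},
        MaxLabels=max_labels,
        MinConfidence=min_confidence,
    )
    print(f"max_labels={max_labels}, min_confidence={min_confidence}: "
          f"{len(response['Labels'])} labels detected")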

Image: min_confidence=90%, max_labels=50

The previous augmented image was produced with max_labels=50, min_confidence=70. Here is the next augmented image with max_labels=50, min_confidence=90 - it changes only a little.

gf.jpeg

Image: min_confidence=95%, max_labels=200

In the same way we can add any additional info; here I’m showing the confidence of each label:

gf.jpeg

Image: min_confidence=70%, max_labels=200

Many more objects are detected, and the detection is still pretty accurate.

gf.jpeg

Conclusions

Amazon Rekognition simplifies the integration of advanced image and video analysis capabilities into applications, empowering developers to build more intelligent and interactive solutions. Whether you’re working on security systems, content moderation, or personalized user experiences, Rekognition provides a scalable and reliable set of tools for computer vision tasks. By understanding its features, use cases, and best practices, developers can harness the power of Rekognition to enhance the functionality and user experience of their applications.

AWS Rekognition use cases and scenarios

  • Security and surveillance
  • User authentication
  • Media and entertainment
  • Retail
  • Healthcare