Post

Sentimental Analysis of popular movies using Managed AWS ML Comprehend service

Abstract

In the ever-growing landscape of data, understanding and extracting valuable information from text data can be a daunting task. AWS Comprehend is a natural language processing (NLP) service provided by Amazon Web Services (AWS) that offers a powerful solution to this challenge.

In this article, we will explore the capabilities and features of AWS Comprehend, and how it can be leveraged to gain insights from text data. All the code snippets are prepared in a single Jupyter notebook that is shared at Links section.

Acknowledgements

AWS Comprehend is a fully managed NLP service that allows developers to analyze text in a variety of ways to extract insights and relationships. It supports multiple languages and provides a suite of tools for tasks such as sentiment analysis, entity recognition, language detection, key phrase extraction, and topic modeling.

Let’s delve into some of the key features and use cases of AWS Comprehend.

AWS Comprehend access

As usual Comprehend provides access to its features from AWS Console:

img.png

Also, it is possible to use AWS SDK for programmatic usage, in this post we will use boto3 library.

Detecting PII information

Using comprehend we can detect PII data, chore entities in the text, key phrases, detect languages, sentiments and syntax.

1
2
3
4
5
6
7
8
import boto3
import json

comprehend = boto3.client(service_name='comprehend', region_name='us-east-1')
text = "My name is John and I am 25 years old. My phone number is 000000000000 and my email is XXXXXXXXXXXXXXXX@gmail.com my card is 1111-1111-1111-1111 and pin was 1234"

json_data = comprehend.detect_pii_entities(Text=text, LanguageCode='en')
print(json.dumps(json_data, sort_keys=True, indent=4))

Following response after calling comprehend service we receive what PII categories were detected in the source string and its offsets.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
{
  "Entities": [
    {
      "BeginOffset": 11,
      "EndOffset": 15,
      "Score": 0.9998698234558105,
      "Type": "NAME"
    },
    {
      "BeginOffset": 25,
      "EndOffset": 33,
      "Score": 0.9999417662620544,
      "Type": "AGE"
    },
    {
      "BeginOffset": 58,
      "EndOffset": 70,
      "Score": 0.9999783039093018,
      "Type": "PHONE"
    },
    {
      "BeginOffset": 87,
      "EndOffset": 113,
      "Score": 0.9991216063499451,
      "Type": "EMAIL"
    },
    {
      "BeginOffset": 125,
      "EndOffset": 144,
      "Score": 0.9999704360961914,
      "Type": "CREDIT_DEBIT_NUMBER"
    },
    {
      "BeginOffset": 157,
      "EndOffset": 161,
      "Score": 0.9999898672103882,
      "Type": "PIN"
    }
  ],
  "ResponseMetadata": {
    "HTTPHeaders": {
      "content-length": "484",
      "content-type": "application/x-amz-json-1.1",
      "date": "Sun, 12 Nov 2023 10:27:44 GMT",
      "x-amzn-requestid": "a1d06735-0c52-4ff6-9f83-ae3b9f474383"
    },
    "HTTPStatusCode": 200,
    "RequestId": "a1d06735-0c52-4ff6-9f83-ae3b9f474383",
    "RetryAttempts": 0
  }
}

The results contain classification of PII data types detected with the probability score abd offset position in the input string.

As we can see the results are pretty accurate - all the PII information was fully identified in our test sentence.

IDScoreTypeBeginOffsetEndOffsetExtractedText
00.999870NAME1115John
10.999942AGE253325 years
20.999978PHONE5870000000000000
30.999122EMAIL87113XXXXXXXXXXXXXXXX@gmail.com
40.999970CREDIT_DEBIT_NUMBER1251441111-1111-1111-1111
50.999990PIN1571611234

Detecting sentiments

king

1
2
3
4
5
6
7
8
9
martin_luter_talk = """
    I say to you today, my friends, so even though we face the difficulties of today and tomorrow,
    I still have a dream. It is a dream deeply rooted in the American dream.
    I have a dream that one day this nation will rise up and live out the true meaning of its creed:
    'We hold these truths to be self-evident, that all men are created equal.
    """

sentiment_output = comprehend.detect_sentiment(Text=martin_luter_talk, LanguageCode='en')
print(json.dumps(sentiment_output, sort_keys=True, indent=4))

Response contains different types of sentiments with the probability for each of them. Our test speech was marked as positive with probability 86%

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
{
  "ResponseMetadata": {
    "HTTPHeaders": {
      "content-length": "164",
      "content-type": "application/x-amz-json-1.1",
      "date": "Sun, 12 Nov 2023 10:28:18 GMT",
      "x-amzn-requestid": "343f5d13-41b8-41a9-b2d3-74b7bc5542c7"
    },
    "HTTPStatusCode": 200,
    "RequestId": "343f5d13-41b8-41a9-b2d3-74b7bc5542c7",
    "RetryAttempts": 0
  },
  "Sentiment": "POSITIVE",
  "SentimentScore": {
    "Mixed": 0.0012894617393612862,
    "Negative": 0.0006538676097989082,
    "Neutral": 0.1380418837070465,
    "Positive": 0.8600147366523743
  }
}

Let’s visualise composition of sentiments and probability.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
import matplotlib.pyplot as plt

# SentimentScore data
sentiment_scores = sentiment_output['SentimentScore']

# Plotting
fig, ax = plt.subplots()
colors = ['red', 'blue', 'orange', 'green']  # Define colors for each sentiment
bars = ax.bar(sentiment_scores.keys(), sentiment_scores.values(), color=colors)

# Add labels and title
ax.set_ylabel('Sentiment Score')
ax.set_title('Sentiment Analysis')
ax.set_ylim(0, 1)  # Set the y-axis limit to 0-1

# Display the values on top of the bars
for bar in bars:
  yval = bar.get_height()
  ax.text(bar.get_x() + bar.get_width() / 2, yval, round(yval, 3), ha='center', va='bottom')

# Show the plot
plt.show()

img.png

Detect Toxic content Terminator movie

term.jpeg

Let’s try to analyse some popular movies for toxic content detection.

First one is from Terminator movie Bar scene, when T200 arrived:

1
2
3
4
5
6
7
8
9
response = comprehend.detect_toxic_content(
  TextSegments=[
    {
      'Text': 'I Need Your Clothes, Your Boots and Your Motorcycle.'
    },
  ],
  LanguageCode='en'
)
print(json.dumps(response, sort_keys=True, indent=4))
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
"ResultList": [
{
"Labels": [
{
"Name": "PROFANITY",
"Score": 0.07029999792575836
},
{
"Name": "HATE_SPEECH",
"Score": 0.07109999656677246
},
{
"Name": "INSULT",
"Score": 0.20020000636577606
},
{
"Name": "GRAPHIC",
"Score": 0.01860000006854534
},
{
"Name": "HARASSMENT_OR_ABUSE",
"Score": 0.08330000191926956
},
{
"Name": "SEXUAL",
"Score": 0.15649999678134918
},
{
"Name": "VIOLENCE_OR_THREAT",
"Score": 0.027699999511241913
}
],
"Toxicity": 0.2313999980688095
}
]

aws comprehend could not detect any violence in this sentence, how ever we remember what happened in the Bar after this phrase was said. But it is actually great because T200 was not using any hard expression and had zero emotions :).

Detect Toxic content God Father movie

gf.jpeg

1
2
3
4
5
6
7
8
9
response = comprehend.detect_toxic_content(
  TextSegments=[
    {
      'Text': 'Bonasera, Bonasera, what have I ever done to make you treat me so disrespectfully? If you’d come to me in friendship, this scum who ruined your daughter would be suffering this very day. And if by some chance an honest man like yourself made enemies, they would become my enemies. And then, they would fear you.'
    },
  ],
  LanguageCode='en'
)
print(json.dumps(response, sort_keys=True, indent=4))

Here results are showing bigger probability of INSULT

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
"ResultList": [
{
"Labels": [
{
"Name": "PROFANITY",
"Score": 0.14669999480247498
},
{
"Name": "HATE_SPEECH",
"Score": 0.1316000074148178
},
{
"Name": "INSULT",
"Score": 0.8208000063896179
},
{
"Name": "GRAPHIC",
"Score": 0.05389999970793724
},
{
"Name": "HARASSMENT_OR_ABUSE",
"Score": 0.20350000262260437
},
{
"Name": "SEXUAL",
"Score": 0.10809999704360962
},
{
"Name": "VIOLENCE_OR_THREAT",
"Score": 0.35569998621940613
}
],
"Toxicity": 0.6919000148773193
}
]

Visual representation

img.png

This post is licensed under CC BY 4.0 by the author.