Intro

Former Member · ‎09-30-2014

You can find the series and project from Real-time sentiment rating of movies on SAP HANA (part 6) - wrap-up and looking ahead

Intro

In the previous blog Real-time sentiment rating of movies on SAP HANA (part 1) - create schema and tables, we've determined APIs which we'll use in our app and create some tables like movies and tweets with CDS. In this blog, we will prepare the data. In the first version of the smart app, I just crawled the metadata of new release movies once which means there were only about 20 movies at all. With the lack of Outbound Connectivity and Job Scheduling features, I could only use Java multithreading to crawl tweets continuously. With the introduction of Outbound Connectivity since SPS06 and Job Scheduling since SPS07, now I can use SAP HANA XS to prepare data. We will discuss it in this blog.

Create roles

At the end of Real-time sentiment rating of movies on SAP HANA (part 1) - create schema and tables, we can see tables are already successfully created. However, they are owned by _SYS_REPO instead of the user which you use to add the SAP HANA system and login. So, first of all, we need to create few roles and grant these roles to some users. We can create two roles as follows. The user role has the select privilege on our design-time schema which means this role can only select data from this schema, while the administration role extends the user role and has additional insert/delete/update privileges on the schema which means the admin role can do select/insert/update/delete on this schema.

Notice: we use schema instead of catalog schema in our role definitions. You're recommended to do so since we will build a pure SAP HANA native app.

User.hdbrole



role movieRating.roles::User {

  schema movieRating.data:MOVIE_RATING.hdbschema: SELECT;

}

Admin.hdbrole



role movieRating.roles::Admin

extends role movieRating.roles::User {

  schema movieRating.data:MOVIE_RATING.hdbschema: INSERT, DELETE, UPDATE;

}

After the activation, we can use the following SQL to grant the activated roles. In this example, we grant the admin role to a user who will crawl data from APIs and insert into SAP HANA.



CALL "_SYS_REPO"."GRANT_ACTIVATED_ROLE"('movieRating.roles::Admin', '<USERNAME>');

Create HTTP/HTTPS destinations

Since we want to use Outbound Connectivity in SAP HANA XS, the first thing is to create HTTP/HTTPS destinations. We will call APIs from Rotten Tomatoes API and Twitter API, so we need to create two destinations, one for Rotten Tomatoes API, the other for Twitter API.

rottenTomatoesApi.xshttpdest



description = "rotten tomatoes api";

host = "api.rottentomatoes.com";

port = 80;

pathPrefix = "/api/public/v1.0";

useProxy = true;

proxyHost = "proxy.pal.sap.corp";

proxyPort = 8080;

authType = none;

useSSL = false;

timeout = 0;

twitterApi.xshttpdest



description = "twitter api";

host = "api.twitter.com";

port = 443;

pathPrefix = "/1.1";

useProxy = true;

proxyHost = "proxy.pal.sap.corp";

proxyPort = 8080;

authType = none;

useSSL = true;

timeout = 0;

As we can call Rotten Tomatoes API via HTTP, you don't need to configure anything. However for Twitter API, we can only call APIs via HTTPS and it is kind of complex to configure. It will take some time here. Please make sure you've finished Use XSJS outbound connectivity to search tweets before you go forward. So, for twitterApi.xshttpdest, we need to configure the trust store as showed in the red box below.

Create XSJS to crawl data

Now we can code XSJS to crawl data. We can create two XSJS files. One is for getting metadata of new release movies, the other is for searching tweets.

searchMovies.xsjs



function hashtag(title) {

  return "#" + title.split(":")[0].replace(/\W/g, "");

}

function searchMovies() {

  var baseURL = "/lists/movies/opening.json?limit=50&country=us";

  var apikey = "<your_api_key>";

  var destination = $.net.http.readDestination("movieRating.services", "rottenTomatoesApi");

  var client = new $.net.http.Client();

  var request = new $.net.http.Request($.net.http.GET, baseURL + "&apikey=" + apikey);

  var response = client.request(request, destination).getResponse();

  var movies = JSON.parse(response.body.asString()).movies;

  if (movies) {

  var movie;

  var conn = $.db.getConnection();

  var pstmtMovies = conn.prepareStatement('INSERT INTO "MOVIE_RATING"."movieRating.data::movieRating.Movies" ("id", "title", "year", "mpaa_rating", "runtime", "release_date", "synopsis", "poster", "studio", "hashtag", "timestamp", "since_id") VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)');

  var pstmtGenres = conn.prepareStatement('INSERT INTO "MOVIE_RATING"."movieRating.data::movieRating.Genres" ("movie_id", "genre") VALUES (?, ?)');

  var pstmtAbridgedCast = conn.prepareStatement('INSERT INTO "MOVIE_RATING"."movieRating.data::movieRating.AbridgedCast" ("movie_id", "cast") VALUES (?, ?)');

  var pstmtAbridgedDirectors = conn.prepareStatement('INSERT INTO "MOVIE_RATING"."movieRating.data::movieRating.AbridgedDirectors" ("movie_id", "director") VALUES (?, ?)');

  for (var i in movies) {

  request = new $.net.http.Request($.net.http.GET, "/movies/" + movies[i].id + ".json?apikey=" + apikey);

  response = client.request(request, destination).getResponse();

  movie = JSON.parse(response.body.asString());

  //Movie

  pstmtMovies.setInteger(1, movie.id);

  pstmtMovies.setString(2, movie.title);

  pstmtMovies.setInteger(3, movie.year, 10);

  pstmtMovies.setString(4, movie.mpaa_rating);

  pstmtMovies.setString(5, movie.runtime + "");

  pstmtMovies.setDate(6, movie.release_dates.theater, "YYYY-MM-DD");

  pstmtMovies.setString(7, movie.synopsis);

  pstmtMovies.setString(8, movie.posters.thumbnail);

  pstmtMovies.setString(9, movie.studio === undefined ? "" : movie.studio);

  pstmtMovies.setString(10, hashtag(movie.title));

  pstmtMovies.setTimestamp(11, new Date());

  pstmtMovies.setString(12, '0');

  pstmtMovies.execute();

  //Genres

  if (movie.genres) {

  for (var i in movie.genres) {

  pstmtGenres.setInteger(1, movie.id);

  pstmtGenres.setString(2, movie.genres[i]);

  pstmtGenres.execute();

  }

  }

  //AbridgedCast

  if (movie.abridged_cast) {

  for (var i in movie.abridged_cast) {

  pstmtAbridgedCast.setInteger(1, movie.id);

  pstmtAbridgedCast.setString(2, movie.abridged_cast[i].name);

  pstmtAbridgedCast.execute();

  }

  }

  //AbridgedDirectors

  if (movie.abridged_directors) {

  for (var i in movie.abridged_directors) {

  pstmtAbridgedDirectors.setInteger(1, movie.id);

  pstmtAbridgedDirectors.setString(2, movie.abridged_directors[i].name);

  pstmtAbridgedDirectors.execute();

  }

  }

  }

  pstmtMovies.close();

  pstmtGenres.close();

  pstmtAbridgedCast.close();

  pstmtAbridgedDirectors.close();

  conn.commit();

  conn.close();

  }

}

I don't plan to explain the whole code, since you can find the XSJS reference from JSDoc: Index. I just want to mention several key points.

1. We first call Rotten Tomatoes API - Opening Movies to get all IDs of new release movies of current week, then we use Rotten Tomatoes API - Movie Info to get the detail info of each movie and insert into SAP HANA.

2. The hashtag() function: There are two steps: First, we remove the subtitle, for example there is a new release movie named "My Little Pony: Equestria Girls - Rainbow Rocks", we just use "My Little Pony" as the movie title. Second, we just keep A-Za-z0-9_. Otherwise, some movie titles will be too long or there will be some marks in the movie title which are both not good for searching tweets.

3. For Rotten Tomatoes API - Opening Movies, you cannot get more than 50 new release movies and the default value of "limit" is 16 which is few to us. So, we set "limit" to 50.

4. We just use the release date in theaters and we change the data type from string to date.

5. since_id is 0 by default.

searchTweets.xsjs



function searchTweets() {

  var baseURL = "/search/tweets.json?lang=en&result_type=recent&count=100";

  var token = "<your_bearer_token>";

  var destination = $.net.http.readDestination("movieRating.services", "twitterApi");

  var client = new $.net.http.Client();

  var request, response, result, tweets, max_id_str;

  var conn = $.db.getConnection();

  var pstmtSelectMovies = conn.prepareStatement('SELECT "id", "hashtag", "since_id" FROM "MOVIE_RATING"."movieRating.data::movieRating.Movies" WHERE DAYS_BETWEEN("release_date", CURRENT_DATE) BETWEEN 0 AND 6 ORDER BY "release_date" DESC');

  var pstmtUpdateMovies = conn.prepareStatement('UPDATE "MOVIE_RATING"."movieRating.data::movieRating.Movies" SET "since_id" = ? WHERE "id" = ?');

  var pstmtTweets = conn.prepareStatement('INSERT INTO "MOVIE_RATING"."movieRating.data::movieRating.Tweets" ("id", "created_at", "text", "source", "user_screen_name", "user_profile_image_url", "longitude", "latitude", "movie_id", "timestamp") VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)');

  var rs = pstmtSelectMovies.executeQuery();

  while (rs.next()) {

  try {

  request = new $.net.http.Request($.net.http.GET, baseURL + "&q=" + rs.getString(2) + "&since_id=" + rs.getString(3));

  request.headers.set('Authorization', token);

  response = client.request(request, destination).getResponse();

  result = JSON.parse(response.body.asString());

  tweets = result.statuses;

  max_id_str = result.search_metadata.max_id_str;

  if (tweets.length === 1) {

  pstmtTweets.setString(1, tweets[0].id_str);

  pstmtTweets.setTimestamp(2, tweets[0].created_at.slice(4, 20) + tweets[0].created_at.slice(-4), "MON DD HH24:MI:SS YYYY");

  pstmtTweets.setString(3, tweets[0].text);

  pstmtTweets.setString(4, tweets[0].source);

  pstmtTweets.setString(5, tweets[0].user.screen_name);

  pstmtTweets.setString(6, tweets[0].user.profile_image_url);

  if (tweets[0].coordinates === null) {

  pstmtTweets.setNull(7);

  pstmtTweets.setNull(8);

  } else {

  pstmtTweets.setDecimal(7, tweets[0].coordinates.coordinates[0]);

  pstmtTweets.setDecimal(8, tweets[0].coordinates.coordinates[1]);

  }

  pstmtTweets.setInteger(9, rs.getInteger(1));

  pstmtTweets.setTimestamp(10, new Date());

  pstmtTweets.execute();

  } else if (tweets.length > 1) {

  pstmtTweets.setBatchSize(tweets.length);

  for (var i in tweets) {

  pstmtTweets.setString(1, tweets[i].id_str);

  pstmtTweets.setTimestamp(2, tweets[i].created_at.slice(4, 20) + tweets[i].created_at.slice(-4), "MON DD HH24:MI:SS YYYY");

  pstmtTweets.setString(3, tweets[i].text);

  pstmtTweets.setString(4, tweets[i].source);

  pstmtTweets.setString(5, tweets[i].user.screen_name);

  pstmtTweets.setString(6, tweets[i].user.profile_image_url);

  if (tweets[i].coordinates === null) {

  pstmtTweets.setNull(7);

  pstmtTweets.setNull(8);

  } else {

  pstmtTweets.setDecimal(7, tweets[i].coordinates.coordinates[0]);

  pstmtTweets.setDecimal(8, tweets[i].coordinates.coordinates[1]);

  }

  pstmtTweets.setInteger(9, rs.getInteger(1));

  pstmtTweets.setTimestamp(10, new Date());

  pstmtTweets.addBatch();

  }

  pstmtTweets.executeBatch();

  }

  //Update "since_id" of the movie

  pstmtUpdateMovies.setString(1, max_id_str);

  pstmtUpdateMovies.setInteger(2, rs.getInteger(1));

  pstmtUpdateMovies.executeUpdate();

  } catch (e) {

  }

  }

  conn.commit();

  rs.close();

  pstmtSelectMovies.close();

  pstmtUpdateMovies.close();

  pstmtTweets.close();

  conn.close();

}

Similar to searchMovies.xsjs, I just explain some key points.

1. Regarding the following SQL, the first thing we want to do in searchTweets.xsjs is to get new release movies, because we will search tweets based on the hashtags of movies. Since there were only about 20 movies in the first version of this app, I can always search tweets about all movies. However, in the second version of the smart app, we plan to get new release movies continuously which means we will get new release movies of each week. So, we will have huge amounts of movies in our movie table. It's impossible to crawl tweets about all these movies because of API Rate Limits | Twitter Developers. Imagine there are about 30 new release movies each week in the US, after one year there will be 1500+ movies in the movie table. The result is that we'll be not able to crawl tweets in real time! So my current solution is just searching tweets about new release movies which has been released less than one week. For example, if a movie is released on 2014-09-30, the app will crawl tweets about this movie from 2014-09-30 to 2014-10-06.



SELECT "id", "hashtag", "since_id" FROM "MOVIE_RATING"."movieRating.data::movieRating.Movies" WHERE DAYS_BETWEEN("release_date", CURRENT_DATE) BETWEEN 0 AND 6 ORDER BY "release_date" DESC

2. We use since_id as one of the parameters, because we don't expect old results which we have already crawled. And we keep since_id for each movie, so after we insert tweets into SAP HANA, we will update since_id of the movie. Next time we want to crawl tweets about this movie, we can use the updated since_id directly.

3. Parameters "lang", "result_type", "count"

- For "lang=en", we just search tweets in English for simplicity. Currently SAP HANA supports sentiment analysis the following five languages. See Voice of the Customer Content - SAP HANA Text Analysis Language Reference Guide - SAP Library

English
German
French
Spanish
Simplified Chinese

- For "result_type=recent", since we want to search tweets in real-time, we just need recent tweets. You can find other options from GET search / tweets | Twitter Developers

- For "count=100", the maximum count of each return is 100, so we set to maximum.

4. We use batch when inserting tweets if possible. If we get only one tweet, cannot use batch. You can find the logic from line 27 - 67.

5. We change the data type of "created_at" from string to timestamp.

Notice: Before going to the job scheduling step, you are highly recommended to activate all objects by now and call searchMovies.xsjs and searchTweets.xsjs manually. If everything is OK, please jump into the job scheduling part. Since I will use searchMovies.xsjs and searchTweets.xsjs in the job scheduling, I wrapped them in two functions. If you call them manually, please remove function() {}, that is the first line and the last line

Create job scheduling

Now you should be able to search new release movies and tweets manually. Since SAP HANA XS now supports the job scheduling feature, why not use this awesome feature?

searchMovies.xsjob



{

    "description": "Search movies",

    "action": "movieRating.services:searchMovies.xsjs::searchMovies",

    "schedules": [

       {

          "description": "00:00 every Tuesday (UTC)",

          "xscron": "* * * tue 0 0 0"

       }

    ]

}

Since the new release movies of current week does not change usually, we can just search new release movies once per week. Here we call the "searchMovies" function at 00:00 every Tuesday. You may ask why this specific time? I noticed that most movies are released on Friday. The reason is obvious, we can go to the theaters at weekends. See When Is Opening Your Film on Wednesday a Good Idea? - Film.com So, Tuesday is early enough.

Notice: The time in .xsjob is UTC time and you cannot change it to other timezones.

searchTweets.xsjob



{

    "description": "Search tweets",

    "action": "movieRating.services:searchTweets.xsjs::searchTweets",

    "schedules": [

       {

          "description": "every 2 minutes",

          "xscron": "* * * * * */2 0"

       }

    ]

}

The above is the job scheduling we will use to search tweets. Due to the API Rate Limits | Twitter Developers, two minutes interval is currently a good choice which means the app will crawl tweets about a range of new release movies every two minutes.

After the activation, do not forget to activate both job scheduling in the SAP HANA XS Administration Tool as follows. If you are not familiar with that, please have a look at Scheduling XS Jobs - SAP HANA Developer Guide - SAP Library

Next steps

So far we've finished the data preparation which means we are now searching new release movies and related tweets and inserting into SAP HANA in real-time! You may have crawled the similar results as follows. With the metadata of new release movies and huge amounts of tweets, we can do the sentiment analysis in the next blog.

Movies

Tweets

Hope you enjoyed reading my blog. :smile:

Real-time sentiment rating of movies on SAP HANA (part 2) - data preparation

Intro

Create roles

Create HTTP/HTTPS destinations

Create XSJS to crawl data

Create job scheduling

Next steps

Get Your SAP HANA Idea Incubator Badge Today!

SCN Mission - SAP HANA Quiz Challenge is now retired

Share your #HANAStory and Win