Reddit-analytics: Daily NLP via API on posts from various subreddits to produce data visualization

Published under a Creative Commons Licence

This post contains some reflections recorded after deploying my most recent project, reddit-analytics.

Idea

I have been meaning to get my hands dirty with a machine-learning-related project for a while now, but I kept putting it off, telling myself I would actually start one once I got a new laptop with a dedicated GPU, until I found a reason not to wait. (I could always ssh into Amazon EC2 instances dedicated to ML, but then, not developing on localhost = puke.)

Enter Rosette API, an NLP-as-a-service provider.

I found out about it on /r/python while browsing for Python-based ML libraries. I was looking forward to starting a new personal project at the time, and this seemed like the perfect gateway into machine learning, since I would get to learn and use the power of NLP without having to deploy a classifier myself (although implementing and fine-tuning the classifier would be more valuable from a knowledge-gain perspective). My rough estimation showed that the free tier provided by Rosette API, 10k free calls every month, was enough for the project. Eventually I could switch away from Rosette API by running a native NLP classifier on one of my reliable EC2 instances.
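For context, a document-level sentiment call is about as simple as it gets. The sketch below uses plain requests; the endpoint path, header name, and response fields are written from memory of the docs, so treat them as assumptions and double-check the current documentation before reusing anything.

```python
import requests

ROSETTE_KEY = "your-api-key-here"  # free tier was ~10k calls/month at the time
SENTIMENT_URL = "https://api.rosette.com/rest/v1/sentiment"  # assumed path; verify against the docs

def sentiment(text):
    """Return Rosette's document-level sentiment label and confidence for a piece of text."""
    resp = requests.post(
        SENTIMENT_URL,
        json={"content": text},
        headers={"X-RosetteAPI-Key": ROSETTE_KEY},  # assumed header name
    )
    resp.raise_for_status()
    doc = resp.json().get("document", {})
    return doc.get("label"), doc.get("confidence")  # e.g. ("pos", 0.74)

if __name__ == "__main__":
    print(sentiment("I love how easy this API makes NLP."))
```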

Thus reddit-analytics came to fruition.

Implementation

The implementation turned out to be simple. My rough time estimate: I probably spent around 150 hours on the project. Not bad, considering that somewhere around the middle of the project those hours were interrupted by a gap of ~1.5 months. I couldn't complete reddit-analytics in one sustained sprint of coding over a few weeks because I had to switch contracting jobs in between. The break slowed my velocity & my commitment to the project. Oops!

When I first started out, I had just finished my reddit-timestamp-bot project, and I was looking to explore other cool things that could be done with Reddit's API. As soon as I started looking into Rosette API, and consequently NLP, I realized I wanted to do time-based analysis on Reddit posts.

Some quick google-fu returned similar project ideas, but none of them were time-based analyses, so not exactly what I was looking for. I also double-checked /r/dataisbeautiful & /r/internetisbeautiful just to be sure I wasn't reinventing something. My project idea is simple: reddit-analytics would serve as a tool for subreddit moderators to perform easy analytics on their subreddit data by letting them query time-based metadata about their subreddit's posts.

Eventually, I would realize that time-based analysis can be done much more easily with the ELK stack by elastic.co. Kibana does an awesome job of visualizing your data while providing decent granularity in the visuals. But the trade-off of using Kibana versus implementing your own visuals goes something like this: if your data is a distant star and analytics is your telescope, would you rather buy a set of the most commonly used lenses to look at your data, or grind your own lenses from scratch so that you control every bit of the telescope's processing?

Further Implementation Details:

I wrote the scraper in Python. Why Python instead of Node.js?

  • It was batch-type scraping that didn't need Node's asynchronicity goodness.
  • praw seemed better for interacting with the Reddit API than the npm modules I found.
  • Python has a native SQLite binding, so the scraper can initialize the metadata table and store the metadata in chunks throughout the day without extra dependencies. (I indexed the timestamp field, since the only lookups are going to be based on timestamps; see the sketch after this list.)
  • Reproducibility: if a data scientist who is not a web dev by profession wants to use just the scraper, they will most likely have an easier time setting it up in Python.
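To make the storage detail concrete, here is a minimal sketch of the scraper loop: praw pulls recent submissions, and their metadata lands in SQLite with an index on the timestamp column. The table, column, and subreddit names are illustrative, not the exact ones in the repo.

```python
import sqlite3
import praw  # pip install praw

SUBREDDITS = ["politics", "worldnews", "videos", "askreddit"]  # illustrative list

def init_db(path="reddit_metadata.db"):
    db = sqlite3.connect(path)
    db.execute("""CREATE TABLE IF NOT EXISTS posts (
                      id TEXT PRIMARY KEY,
                      subreddit TEXT,
                      title TEXT,
                      score INTEGER,
                      num_comments INTEGER,
                      created_utc REAL)""")
    # The only lookups are time-based, so index the timestamp column.
    db.execute("CREATE INDEX IF NOT EXISTS idx_posts_created ON posts(created_utc)")
    return db

def scrape(db, reddit):
    for name in SUBREDDITS:
        for post in reddit.subreddit(name).new(limit=100):
            db.execute(
                "INSERT OR IGNORE INTO posts VALUES (?, ?, ?, ?, ?, ?)",
                (post.id, name, post.title, post.score, post.num_comments, post.created_utc),
            )
    db.commit()

if __name__ == "__main__":
    # Credentials come from praw.ini or the environment in the real setup.
    reddit = praw.Reddit(client_id="...", client_secret="...", user_agent="reddit-analytics scraper")
    scrape(init_db(), reddit)
```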

I wrote the API using Node.js. Why Node.js instead of Python?

  • Node.js is much better suited for writing APIs, with easy middleware integration and async tasks to query the database.
  • I implemented a simple caching layer on top of the Node.js API so the database query results get cached and refreshed every hour.
  • Eventually it would be easy to scale the API with a login system or combine it with other microservices.

I wrote the front end using Angular.js. Why?

  • While I could have stuck with vanilla JavaScript and d3.js, I wasn't looking forward to setting everything up manually. This project was intended to teach me about NLP, not front-end technologies!
  • Angular.js has awesome d3 & nvd3 bindings.
  • The application would be a single-page application (SPA).
  • Some details about the front end:
    • I spent the most time writing the factories and controllers for data visualization. The Angular app itself has logic to cache the already API-cached data in localStorage and check whether it is stale, to optimize the data flow further.
    • The Angular app is divided into factories that each do a chunk of the job: dataprocessorfactory (normalizes API data), chartfactory (converts normalized data into chart-compatible objects), datadecoratorfactory (provides green/red or other colors based on sentiment), and controllers that use those factories to load the data into the graphs.

Setting up the application:

  • The Python scraper runs as a cron job every 8 hours (yay crontab -e), dumping metadata into an SQLite database (see the snippet after this list).
  • The API runs under pm2, an awesome Node process manager.
  • The files are hosted on an Amazon EC2 instance behind an ELB.
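For reference, the glue is just a cron entry plus a pm2 process; the paths and file names below are placeholders, not the exact ones from the repo.

```
# crontab -e: run the scraper every 8 hours (placeholder path)
0 */8 * * * /usr/bin/python /home/ubuntu/reddit-analytics/scraper.py

# keep the Node API alive under pm2 (placeholder entry file)
pm2 start server.js --name reddit-analytics-api
```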

Findings about reddit comments:

Any interesting results from the weekly analysis? Yes.

  • Comments on /r/politics & /r/worldnews were categorized with a high number of neutral comments. On the other hand, /r/videos usually encouraged positive comments, while comments on /r/askreddit were mostly either negative or positive. A quick follow-up on the askreddit posts behind the negative visuals showed that the questions were somewhat morbid(!) with similar answers, or had a lot of good-natured self-deprecating humour in comments that were categorized as negative.
  • Around Thanksgiving, the number of "green" comments in /r/all was higher than in other weeks' timesets. It will be interesting to see if the trends for various subreddits that appeared in the Thanksgiving-week timeset appear again in the Christmas-week timeset!

Reflections

Looking back, this was a valuable project because it made me realize a couple of important things that might be worth sharing:

  • Machine learning is fun & powerful! I knew that; we all know that. But harnessing classifiers as an API-based service, so they are easily available to hobbyists, would be even more useful. This was a very small subset of use cases, and we were able to create such a powerful tool with only ~150 hours spent! I think one of the main things stopping people from doing ML-related projects is the computing horsepower required for a good classifier (aside from the willingness to spend time learning & fine-tuning ML techniques). If those are abstracted away as a simple API service with documentation, so that every John Doe can come up with and implement project ideas, we would have a bigger pool of ML-related projects.

    Obviously, API-based classifiers wouldn't give enthusiasts the fine-grained control that running your own classifier would, but they would get them drawing up projects nonetheless.

  • Exercising with complex JSON schema structures is good practice for identifying the possible ways you can visualize your data. Consequently, it is important to design & determine the nesting and overall structure of your data so it can eventually be used in a versatile way by referencing it directly, instead of having to clone it into a different structure (see the example after this list). Usually, analytics involves a decent amount of data, and optimizing is important if the plan is to do the data processing in the front-end browser instead of on the server.
  • The importance of modular programming, because sometimes life happens and you might have to take a break in the middle of your project!
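To make that first point concrete, a payload shaped roughly like the hypothetical example below (not the project's exact schema; all names and numbers are placeholders) lets the front end index straight into subreddit → day → sentiment counts without reshaping anything client-side:

```json
{
  "generated_at": 1480723200,
  "subreddits": {
    "politics": {
      "2016-11-24": { "pos": 120, "neu": 640, "neg": 210 },
      "2016-11-25": { "pos": 98,  "neu": 702, "neg": 185 }
    },
    "videos": {
      "2016-11-24": { "pos": 310, "neu": 150, "neg": 40 }
    }
  }
}
```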