Tuesday, December 13, 2016

What I Learned While Fighting Fake News

In a couple of upcoming blog entries I will share what I learned while building Realnews/Fakenews, a Chrome extension that adds visual indicators to sites such as Facebook and Twitter to warn readers of possible Fake News links:


[Screenshot: a link to npr.org with a "Fake News" indicator]

I will cover the extension itself, the architecture and implementation of the server, the choices in hosting plans and pricing for that server, and how machine learning can be used to predict fake news from the URL alone.

Fake news is serious. It can lead to vigilantes storming pizza parlors and to foreign parties influencing the US elections, or the upcoming UK elections. It is hard to root out because it feeds on a powerful psychological trait called confirmation bias: fake news is designed to cater to a specific target audience, and its authors write exactly the content their targets like to hear, whether it is factual, hyperbolic, or simply untrue. The goal is to increase the likelihood that readers share it, thereby strengthening the original cause or, more likely, generating advertising income.

What can we do about fake news? In a simple diagram, this is how fake news spreads:

[Diagram: fake news flows from publishers, through distribution on social media, to readers]

There is not much we can do to stop fake news publishers from creating content. What the industry, and social media outlets in particular, can do is stop giving those publishers a free distribution channel. That is a step in the right direction.

Where the Realnews/Fakenews Chrome extension helps is at the reading stage. When a shared post on Facebook or Twitter links to a third-party site, a small button is added at the top right of the post to indicate the category of that site:


In this case, the site this tweet links to is marked as real news. Of course, the subject of the tweet itself is fake news, but that's not the point. When the green button is pressed, a category selector appears:



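Under the hood, this is a content script that watches the timeline for outbound links, asks the server which category a site belongs to, and injects the button. The extension's actual source will be covered in a later entry; the sketch below only illustrates the idea, and the server endpoint and its response shape are placeholders I made up for this example.

```typescript
// content-script.ts -- illustrative sketch only, not the extension's actual source.
// The endpoint URL and its JSON shape are hypothetical placeholders.

const CATEGORY_COLORS: Record<string, string> = {
  "real news": "#2e7d32",
  "fake news": "#c62828",
  unknown: "#9e9e9e",
};

// Ask the (hypothetical) server which category a linked site belongs to,
// then attach a small colored button next to the shared link.
async function decorateLink(anchor: HTMLAnchorElement): Promise<void> {
  const target = encodeURIComponent(anchor.href);
  const resp = await fetch(`https://realnews.example/api/category?url=${target}`);
  const { category } = (await resp.json()) as { category: string };

  const badge = document.createElement("button");
  badge.textContent = category;
  badge.style.cssText =
    "position:absolute;top:2px;right:2px;color:#fff;" +
    `background:${CATEGORY_COLORS[category] ?? CATEGORY_COLORS.unknown};`;
  // Clicking the badge opens the category selector (the voting UI).
  badge.addEventListener("click", () => showCategorySelector(anchor.href, badge));
  anchor.parentElement?.appendChild(badge);
}

function showCategorySelector(url: string, near: HTMLElement): void {
  // Render a small menu of categories next to `near`; picking one
  // sends a vote for `url` back to the server, as described below.
}

// Facebook and Twitter timelines load posts dynamically, so watch for new links.
new MutationObserver(() => {
  document
    .querySelectorAll<HTMLAnchorElement>('a[href^="http"]:not([data-rn-seen])')
    .forEach((a) => {
      a.dataset.rnSeen = "1";
      void decorateLink(a);
    });
}).observe(document.body, { childList: true, subtree: true });
```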
When a reader chooses a different category, that choice counts as one vote. The server uses a consensus model, so future readers are shown whichever category wins the popular vote. Voting is unmoderated.
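For the curious, the consensus part is little more than a popular-vote tally per URL. A toy, in-memory version might look like this; the real server persists its votes, and its implementation, including the actual category names, is the subject of a follow-up post.

```typescript
// consensus.ts -- toy in-memory version of the popular-vote logic.
// The real server persists votes; the category strings here are illustrative.

// votes[url] maps each category to the number of readers who picked it.
const votes = new Map<string, Map<string, number>>();

export function recordVote(url: string, category: string): void {
  const perUrl = votes.get(url) ?? new Map<string, number>();
  perUrl.set(category, (perUrl.get(category) ?? 0) + 1);
  votes.set(url, perUrl);
}

// Future readers are shown whichever category has collected the most votes;
// URLs nobody has voted on yet fall back to "unknown".
export function popularVote(url: string): string {
  let best = "unknown";
  let bestCount = 0;
  for (const [category, count] of votes.get(url) ?? []) {
    if (count > bestCount) {
      best = category;
      bestCount = count;
    }
  }
  return best;
}
```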

Once enough pages have been voted on, features can be extracted from them, such as the hosting domain of the link, the words used in the URL, or the occurrence of specific content like a large number of ads. Machine learning can then be used to classify sites that have not been voted on yet, seeding their prediction with a meaningful starting point.
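To give a flavor of that, here is one plausible way to turn a URL into features for a classifier. The actual feature set and model are topics for a later entry.

```typescript
// features.ts -- one plausible way to turn a URL into classifier features.
// The actual feature set and model will be described in a later post.

export function urlFeatures(rawUrl: string): string[] {
  const url = new URL(rawUrl);
  const features: string[] = [];

  // The hosting domain of the link is usually the strongest signal.
  features.push(`host:${url.hostname.toLowerCase()}`);

  // Words used in the URL path, e.g. "breaking", "shocking", "exposed".
  for (const token of url.pathname.toLowerCase().split(/[^a-z0-9]+/)) {
    if (token.length > 2) features.push(`word:${token}`);
  }
  return features;
}

// urlFeatures("http://some-news-site.example/shocking-truth-exposed")
//   -> ["host:some-news-site.example", "word:shocking", "word:truth", "word:exposed"]
```

Labeled with the voted categories, such bag-of-tokens features can feed any standard text classifier, for example Naive Bayes or logistic regression, to score sites nobody has voted on yet.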

Like I said at the start, more details on the implementation will come in future blog entries. For now, try it out!
