Statistics in “NSA Files: Decoded”
On November 1st, 2013, The Guardian published the article ‘NSA Files: Decoded’, written by Ewen MacAskill and Gabriel Dance. The article incorporates strong visual elements, produced by Feilding Cage and Greg Chen, and is divided into six parts, each illustrating the conduct of the NSA – revealed by Snowden’s whistleblowing – and its impact on both a societal and an individual level. The article conveys a vast amount of information in an enticing way through a variety of interactive visual elements. The amount of statistics – in the sense of cold, hard numbers – is not excessive. However, the purpose of the article can be interpreted as an engaging rendition of what Edward Snowden’s feat exposed, and how it might affect the reader. One noteworthy aspect of the article is the lack of direct citation alongside some of the graphics. Presumably the information is embedded in the NSA files, as the article predominantly interprets the information in the files released by Snowden.
‘Three degrees of separation’ and ‘Your digital trail’ are two of the interactive graphics depicting the extensive reach of the prying arm of digital espionage. ‘Three degrees of separation’ visualises the three ‘hops’ the NSA can make from their intended target, in order to create a web of interactions and, subsequently, garner more information on a greater number of people. The interactive graphic allows you to slide a pointer to indicate how many friends you have on Facebook and see how many people the NSA could technically monitor through you – personally, my results came up as follows: tier 1 (your own friends), around 1,100; tier 2 (friends of friends) amounted to almost 180,000; and tier 3 (friends of friends of friends) to 29,000,000. The amounts are based on an estimate that the average user has 190 friends, but considering the article is almost six years old, the numbers are likely far bigger in today’s world. Presumably the graphic is based on information available in the NSA files, but no direct citation is made.
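The hop arithmetic behind the graphic can be approximated with a naive model, sketched below. This is an assumption on our part, not the Guardian’s actual formula: it treats every account as having the same number of friends and ignores overlapping friend circles, so it gives an upper bound rather than the article’s exact figures.

```python
# Naive estimate of how many accounts fall within n "hops" of a target,
# assuming a uniform friend count and no overlap between friend circles.
def hop_estimates(friends_per_user, hops=3):
    return [friends_per_user ** n for n in range(1, hops + 1)]

# With the article's stated average of 190 friends per user:
print(hop_estimates(190))  # [190, 36100, 6859000]
```

Because real friend circles overlap heavily, the figures shown in the graphic come out lower than this naive upper bound.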
‘Your digital trail’ is another interactive graphic, which allows the user to find out what information they are relinquishing through various forms of communication. The graphic allows the reader to click on logos of different services – social media (e.g. Facebook, Twitter), e-mail, Google Search, phone calls, web browsing, and the camera – to see what information the provider, and consequently the NSA, can gather through your use of these services. As with ‘Three degrees of separation’, no source is listed, but the graphic allows the user to visualise the vast amount of information surrendered through the mere use of these services.
Other noteworthy motion graphics include one calculating the amount of data gathered for review by the NSA since the reader opened the article. Like the two previous graphics, no direct source is available, but the graphic does the job of portraying how much information the NSA gathers. One motion graphic where the source is directly credited is ‘Connected by cables’, which depicts how many countries the US is directly connected to through fibre-optic cables. The graphic is used to illustrate the ‘upstream’ flow perpetrated by the NSA, meaning they directly intercept information transferred along fibre-optic cables connected to the US.
Apart from the interactive motion graphics, a variety of statistics are used throughout the article to strengthen the points made and to clarify technological jargon in picture form. Such statistics include public opinion on the government’s ability to protect privacy and opposition towards governmental monitoring of communication, among others.
Overall, the statistics in the article mostly serve the purpose of clarification. The examples mentioned above all simplify matters revealed in the NSA files – which might be hard to grasp in writing – through interactive motion graphics, allowing the reader to put things into a personal perspective.
NSA Files Decoded
Tom Westoe, Merve Aytas, Édua Varga, Tjerk de Vries, Pascal Friedrich Degenfeld, Dragoș Octavian Culcear
Problematization of this journalistic research
There has been quite an extensive debate in media scholarship and journalism about the ethics and potential dangers of social media filter bubbles. Filter bubbles are the result of personalization algorithms. These algorithms learn what a user is interested in, likes and dislikes, and accordingly show them content to keep them engaged on the platform. They are also used to target users with specific advertisements that are of interest to the user. The result is that each user gets put in his or her own filter bubble, where they are not exposed to, for instance, differing political views. The algorithms determine the kind of ads each of us encounters by constantly gathering extensive data about our previously bought or liked items. This is the reason why Facebook claims that its service shall always be “free”: the social media platform generates its revenue from the way it delivers personalized ads to different people.
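The reinforcing dynamic described above can be illustrated with a toy sketch. This is emphatically not Facebook’s actual algorithm, which is proprietary; it is a minimal, hypothetical model in which posts matching more of a user’s recorded interests are ranked higher, which is exactly how a feed drifts into a filter bubble.

```python
# Toy interest-based feed ranker (illustrative only, not Facebook's
# real system): posts sharing more topics with the user's recorded
# interests are shown first, reinforcing what the user already likes.
def rank_feed(posts, user_interests):
    def score(post):
        return len(set(post["topics"]) & set(user_interests))
    return sorted(posts, key=score, reverse=True)

feed = rank_feed(
    [{"id": 1, "topics": ["cats"]},
     {"id": 2, "topics": ["politics"]},
     {"id": 3, "topics": ["cats", "country music"]}],
    user_interests=["cats", "country music"],
)
print([p["id"] for p in feed])  # [3, 1, 2]
```

Even in this crude form, the post with no overlapping topics always sinks to the bottom – the user simply stops seeing differing views.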
Everyone gets stuck in filter bubbles. They can influence our political views by promoting fake or one-sided news materials and propaganda, as well as causing discrimination and social divisions. One of the biggest events that triggered world-wide discussion of this topic centred around Donald Trump’s election, specifically how filter bubbles may have aided his victory. Since then, many researchers have tried to uncover how the system works, but sadly none of them have fully succeeded. Here’s where Facebook Tracking Exposed comes into play.
Facebook Tracking Exposed, or simply fbTREX, is a Chrome/Firefox extension that tracks the data received by its users in their Facebook News Feeds in order to try to uncover how the algorithm actually functions.
The tool can be used by Facebook users who want to know more about their own filter bubbles, by researchers collecting data with control groups on Facebook, and by journalists interested in echo chambers and algorithm personalization. The extension displays which elements are recorded and which elements are considered too private for collection, and chooses not to collect the latter.
The tool was created specifically for Facebook because it generates so much data that it is almost impossible for users to acquire and process this information in a meaningful way. The aim of this tool was to increase transparency behind personalisation algorithms, so that people have more effective control over their Facebook experience. This then benefits Facebook users and researchers. It does not, however, benefit Facebook. This is because fbTREX is trying to expose the algorithm, which Facebook is trying to keep secret in the interest of staying competitive with other social media platforms.
We created Lucie Dvorak, a 30-year-old lesbian who grew up in Prague and now lives in Amsterdam. She works at DAF as a trucker and is interested in country music, cats (she owns cats), cars, Orange Is The New Black, Tasty, photography, Opera, and WWE. But she is also a flat-earther and anti-vaxxer. We decided to focus on these last two interests, because they give us a good angle of inquiry for investigating Facebook’s personalization algorithm and the filter bubbles it creates.
So we liked pages and posts that match Lucie’s interests to make the account as real as possible. We only received suggestions for other pages, but no ads were shown in her News Feed.
We found that Facebook’s algorithm does not do a great job of detecting fake accounts. Our group members logged in from a variety of places, but despite that the account remained active. We also received more than 400 friend requests, though most of these seem to be fake accounts as well. Facebook usually requires its members to verify their accounts through mobile phone number identification, but we easily created our fake persona with only our fake Gmail account. Those two observations reinforce the idea that Facebook seems to pay little to no attention to bots and fakes invading its platform. Throughout the week we only received one ad and two page recommendations, and they did not match Lucie’s interests. The advertisement showed us special pants “for people with reduced mobility”, when in reality our fake persona was a fully able-bodied truck driver.
Meet Lucie – a 30-year-old trucker from Prague who resides in Amsterdam and has a variety of knacks and wits about her. Lucie is a lesbian, and in a complicated stage of her romantic life. The complications compel Lucie to look for comfort through other channels, such as cats or TV series. She owns a cat and, owing to her sexuality, is drawn to the highly regarded Netflix series ‘Orange Is the New Black’, as she relates to the characters’ issues. She also spends a lot of time on social media. She feels alienated in life and seeks self-approval through strangers online. This has led Lucie down a dark path – engaging in conspiracy theories and disregarding scientific counter-evidence. She spews propaganda and feels empowered when like-minded souls lavish her with praise. It is evident to people around her that she is spiralling further day by day and embracing the warm cocoon of her own filter bubble. This is evident to a specific someone – or should I say something? Not a person, but a computer-generated mind: Facebook’s algorithm. And it feels no remorse for sending her further and further into the abyss.
Luckily, Lucie is not real. She is one of many unfortunate guinea pigs used in an attempt to break down Facebook’s News Feed-algorithm. She is the creation of six creative 20-something-year-olds, and she is quite popular.
We created Lucie about two weeks ago, and she has already garnered over 400 friends on Facebook. Granted, some of the accounts are obvious bot accounts, but the remainder seem to be genuine people. The vast majority are desperate, lonely men of various ages – some even downright distasteful, sending explicit photos and videos. Lucie’s purpose is to trick Facebook’s algorithm into feeding us content which will strengthen her views, no matter how preposterous. We like, we share, we join groups – we do everything in our power to feed as much information as possible to the algorithm, and simultaneously record the posts through the fbTREX web browser extension. Over the course of two weeks we have recorded over 1,000 posts through Lucie’s profile. When downloaded as a .csv file, the data makes little sense. When organised, some patterns start to appear, but all in all the results are underwhelming.
Most posts recorded, almost fifty percent, are from groups Lucie has joined. The groups are all based around her interests, such as cats, country music, and Orange Is the New Black, as well as the more controversial themes we were hoping to be fed – flat earth conspiracy theories and anti-vaccine propaganda. Photos, posts by friends, and videos each come in at around fifteen percent of the total, and the remaining five percent are ads. The small number of ads is a setback, as we were hoping to provoke ads related to the darker themes we instilled in Lucie’s persona.
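Organising the raw .csv export into these percentages can be done with a few lines of standard-library Python. A sketch of the kind of tally we ran is below; note that the column name `"type"` is an assumption for illustration – the actual header in an fbTREX export should be checked before running this.

```python
import csv
from collections import Counter

# Tally the share of each post type in an fbTREX .csv export.
# The column name "type" is hypothetical; inspect your export's
# header row and adjust it accordingly.
def post_type_breakdown(path, column="type"):
    with open(path, newline="", encoding="utf-8") as f:
        counts = Counter(row[column] for row in csv.DictReader(f))
    total = sum(counts.values())
    return {kind: round(100 * n / total, 1) for kind, n in counts.items()}
```

Running this over Lucie’s export is what produced the rough fifty/fifteen/five percent split described above.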
The ads do teach you things about Lucie: she’s a cat owner and lives in The Netherlands. This can be derived from the various ads promoting parties and other social events in and around Amsterdam, as well as ads for cat food. Other than that, the ads are largely arbitrary. The rest of the recorded posts paint a picture of Lucie which can just as easily be painted by looking at her profile. The posts are in direct accordance with her interests, which are clearly visible in her feed as well.
To juxtapose Lucie’s results, one of the members of our group had a very different experience with his recorded posts: almost sixty-five percent were ads. This points to a clear shortcoming in the project – the timeframe. Two weeks is apparently not enough time to ‘trick’ Facebook’s algorithm by feeding it data. This is further emphasised by my own results: my ads and feed are very clear representations of my interests. But then, Facebook has over a decade of data on me, and only two weeks’ worth on Lucie.
To conclude, there are limitations to our experiment, but we won’t stop trying. Even though at this point it seems near impossible to achieve the desired results with so little time, we keep on feeding the algorithm with data hoping we will learn more about how it works.
Filter Bubble Fallacy:
Fake Personas & Flawed Results
Research Report Data Journalism
Tjerk de Vries
Pascal Friedrich Degenfeld
1. Reverse Engineering a Data Journalism Project
Throughout the past weeks, we have explored the field of data journalism. We’ve done so by reading and analysing several articles about data journalism projects like ‘La Tierra Esclava’ (‘From Coffee to Colonialism’) (Sánchez et al., 2017) and ‘Broken Homes: A Record Year of Home Demolitions in Occupied East Jerusalem’ (O’Toole et al., 2017). Both articles showed the value of a data-driven approach to storytelling, and the way it can enrich traditional reporting by connecting stories about individuals to broader (social) phenomena.
Through reverse engineering we have also analysed a well-known Guardian article that employs data journalism elements to enhance its storytelling. ‘NSA Files: Decoded’ (MacAskill and Dance, 2013), divided into six parts, incorporates strong visual elements, produced by Feilding Cage and Greg Chen. Each of the six parts describes the NSA’s information gathering process as revealed by the well-known whistleblower Edward Snowden, and each is supported by data visualisations such as the interactive ‘Three degrees of separation’ and ‘Your digital trail’ graphics. These graphics use sliders and examples to make the extent of the NSA’s data gathering practices relatable to the everyday citizen. In ‘Three degrees of separation’, the slider and corresponding illustrations visualise the three ‘hops’ the NSA can make from their intended target, in order to create a web of interactions and, subsequently, garner more information on a greater number of people. This way, if a possible suspect has 190 friends, the NSA could gather data on people up to three hops away from them (5,072,916 people). ‘Your digital trail’ allows the user to find out what information they are relinquishing through various forms of communication. The graphic allows the reader to click on logos of different services – social media (e.g. Facebook, Twitter), e-mail, Google Search, phone calls, web browsing, and the camera – to see what information the provider, and consequently the NSA, can gather through your use of these services. The only criticism of these data visualisations is that the sources for the numbers used are not directly cited, which could negatively impact the credibility of the article.
These different forms of data visualisation allow the journalists to discuss relatively complicated subjects and still make them relatable to a general audience. In order to tell a complicated, political story, the authors used the graphics mostly for the purpose of clarification: they simplify matters and so make the story more relatable. What can be gathered from this article is that good data journalism projects need to make their often abstract data more relatable. This allows journalists to tell a more complicated story without losing the interest of a general audience. To make sure a data journalism project is also judged as credible, the data used must be cited.
2. Critical assessment about fbTREX and working with filter bubbles
Although the algorithms of social media platforms are shaped by user input, their functionality is not transparent, and users are not really in control of the information displayed in their feeds. According to the filter bubble theory, social media sites are believed to decide themselves which content seems most relevant for each user. As this is undoubtedly problematic, the Tracking Exposed project aims to empower users by giving them control over their own data. Their belief is that individuals should have the right to decide which information is deemed particularly significant for their needs.
Making use of their fbTREX extension for Google Chrome, we conducted an experiment in the hopes of understanding how Facebook’s personalization algorithms actually work. fbTREX is a Chrome/Firefox extension that tracks the data received by its users in their Facebook News Feeds in order to try to uncover how the algorithm actually functions. Using it, we gathered information about the number of posts that appeared in the feed of our bot account, and compared these results with those of our group members. Sadly, the results were pretty underwhelming.
The extension itself did not prove to substantially empower us with control over our data, allowing us only to view the frequency with which different types of items (e.g. images, ads) appeared in our accounts’ feeds. Deriving meaning from this information (which was provided in a spreadsheet document) turned out to be quite difficult, especially as we were not familiar with computer programming. As a result, we were unable to determine whether or not the filter bubble actually exists. We also experienced minor glitches with the software, mainly due to its inability to categorize certain items (some posts were displayed as “null” and were not part of any item category).
That is not to say that fbTREX does not show a decent amount of potential. Had we used the extension for a longer timeframe, our results might have been more significant. The tool itself is fairly intuitive, and it values its users’ privacy by anonymizing the collected data. It should, however, narrow the newsfeed item categorization from “images” or “ads” to something more specific, which could potentially help users become aware of how they are perceived by the algorithm. Filter bubble or not, there is still need for a better understanding of how algorithms work, and tools like fbTREX are definitely headed in the right direction for achieving this goal.
3. Critical observation and analysis of our data
For our third part we wanted to share thoughts on, and comparisons of, our fake persona’s data and that of a rather old profile – in this instance, Pascal’s own data.
To begin with, here is Pascal’s data broken down in numbers, out of 462 entries recorded using the fbTREX tool. Only four entries were videos, while 16 were posts. Groups, photos, and other types were minorities, accumulating only six entries in total. Events was the second-highest category after posts, with 12 entries. By far the largest category was advertisements, with a staggering 290 entries.
As a representative visualization here you can see the numbers as a pie chart.
In percentages, the advertisements make up roughly 63% of all 462 entries. Among these ads, three kinds were shown the most: razor blades and other shaving products from Gillette and similar companies; governmental institutions promoting a specific time to act, such as voting or making sure one has done their taxes; and events happening nearby, mostly in Amsterdam but on rare occasions in Vienna or Hamburg (two places Pascal has lived in the past). On a smaller note, advertisements included promotions of sports highlights such as the NBA, the English Premier League, and esports, and, rarer still, promotions of service-based websites such as Netflix or LinkedIn.
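The percentage shares behind the pie chart can be recomputed from the counts reported above. The sketch below assumes that the entries not covered by the named categories (for instance the “null” items mentioned earlier) are lumped together as “other”:

```python
# Percentage share of each category in Pascal's fbTREX data,
# using the counts reported in the text. Entries outside the
# named categories (e.g. "null" items) are grouped as "other".
counts = {"ads": 290, "posts": 16, "events": 12,
          "groups/photos/types": 6, "videos": 4}
total_entries = 462
counts["other"] = total_entries - sum(counts.values())

shares = {k: round(100 * v / total_entries, 1) for k, v in counts.items()}
print(shares)  # ads come out to roughly 62.8% of all 462 entries
```

On these figures, ads dominate the chart, with the uncategorized remainder forming the second-largest slice.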
Some of these promotions or advertisements were quite obvious, as Pascal has liked the football club Chelsea F.C. and is also in various esports groups on Facebook. The events, like the reminders from governmental institutions, seem to be based solely on geolocation. The one type of advertisement which stands out as quite random is the one for beard shaving and beard care: Pascal has used the same two shavers for the past seven years, and neither was purchased on the Internet. Given that Pascal uses the same email address to log in to Netflix and Facebook, the overlap there is obvious, but he does not use LinkedIn. One could speculate that Facebook has collected data about his educational progress, hence showing advertisements for LinkedIn.
Pascal’s data was collected through the fbTREX extension, and apart from our personal data we also created a bot and collected data about our creation. Our creation, Lucie, was a 30-year-old lesbian trucker from Prague who resides in Amsterdam. Lucie’s intention was to aid in proving the filter bubble exists, but unfortunately she proved rather useless. We recorded over 1,000 posts over the course of about a month, but very few patterns were evident in the data. In accordance with Pascal’s results, the advertisements recorded were mostly based on geolocation, apart from cat food advertisements. Everything visible in the data is corroborated by looking at Lucie’s profile – the data itself does not reveal anything. Generally, the experiment was flawed from the start. The time frame proved too short, and the directions unclear. In the beginning we fed the algorithm as much as possible, but the content was inconsistent, so nothing of value came out of it. Afterwards, we changed course, yet the results were the same. This was a clear indication that more time and data were needed in order to ‘trick’ the algorithm. However, one of the problems with this research is the ‘tricking’ itself – for lack of a better word. It seems unclear how merely looking at data is going to teach you how Facebook’s algorithm works. Simply put, the driving factor is money. The algorithm needs to know you and serve content you are likely to engage with – in essence, Facebook’s business model.
In Lucie’s case the results were arbitrary because Lucie herself was arbitrary; and herein lies the biggest headache of this experiment – it disregards human individuality and how it is perpetuated. Creating a fake persona and expecting significant results within a month is naïve and unrealistic, even if the persona is designed intricately, with controversial and uncontroversial interests alike. Furthermore, even if the results had indicated that the filter bubble exists, the process would be biased and one-sided – forcing the results you hope to achieve. Humans are not machines; we are erratic, and mimicking human behaviour precisely is impossible.
That being said, a while back I noticed something new in my Facebook feed: posts from pages I do not like popping up under the guise of ‘posts similar to ones you previously interacted with’. This is somewhat of a confirmation that the filter bubble exists – Facebook knows what kind of videos you like to watch and articles you like to read, and suggests content accordingly. Whether this is good or bad is debatable, although: what is the alternative? Facebook aimlessly showing content in your news feed from various sources, which may or may not entice you? What good would that do? At the end of the day, Facebook is a business, and knowing you makes it a lot of money. It is your choice how much information you relinquish.
4. Our experiences with data journalism
Data journalism has become more popular in the last few years. It is a rather new way to find data and create a journalistic story, so there are a lot of misconceptions surrounding the field. Many suggest that data journalism is another annoying trend that will fade in the next few years, but during this course we realized that it is here to stay, for a variety of reasons. To some extent it is fashionable, and yes, there are irritating aspects to it, but it is no longer considered a hype at the international level; rather, it is an integral part of journalism. Journalistic revelations built on data, such as the Snowden case we reverse engineered during the course, showed us why data journalism is such a powerful method in the online space. To make a good journalistic story with data, you need to keep in mind that you need different skills than a traditional journalist.
Many people working in the journalistic field believe that highly specific skills are required to fully immerse themselves in the world of data journalism, but this anxiety has been proven unfounded many times before. Think of skills like investigative research, statistics, programming, and visualization. Our digital environment is made up of texts, static images, and video materials, and data journalism presents itself as a way of analyzing those types of information. We realized that visualization plays a huge role in presenting a journalistic story and is often a great way to introduce a complex story in a more user-friendly way. One of the drawbacks of data journalism is that the data you search for is often proprietary, so it is important to find sources from a variety of platforms to state your case. Many of us also struggled with the timeframe, which suggests that finding and creating a meaningful story takes a lot of time.
A big challenge is that you really need to keep in mind what kind of data you need for your research. This makes it easier to gather the data you need and to filter out data you won’t use for the story. It is also important to know what the data tells you. We realized during our project that without an interesting research question or journalistic purpose, data journalism (and the methods and tools used with it) loses its significance. If the data does not tell the story you want, then you need another plan to gather more data, or another journalistic story. Throughout our journey, this was one of the challenges we faced: the data that we received from the tool was telling us nothing about a filter bubble. So we needed to search for another story in the data, which we found by analyzing the advertisements that our fake profile received. This shows that statistical and data-analysis skills are powerful tools for a data journalist.
Another important aspect of data journalism is data visualization. Many people think that data journalism is the same as data charting, but it is more a matter of system-level thinking, as well as the systematic collection and cataloguing of data; there is no requirement that the data be visualized at all. We learned through the course that visualizations really need to make the journalistic story more understandable for the reader. You always have to find a way to use your data in the most appropriate way, and it should not aim to hide the data behind the media content. For our project, we made a pie chart of the data of our fake profile and also a pie chart of the data of our own personal profiles. We could easily see the differences: our own filter bubbles had already formed over the years, because Facebook had already figured us out, while the data of our fake profile showed little, because Facebook couldn’t build a filter bubble in two weeks.
Even though data journalism seems new, cool, and easy to learn, you need to keep in mind that making a good journalistic story involves real challenges and skills. With skills like investigation, statistics, visualization, and programming you can solve different parts of the puzzle, but you need to know which data to use, and how to use it, to make a whole journalistic story.
Sánchez et al. 2017. ‘La Tierra Esclava’. April 2017.
MacAskill, Ewen, and Gabriel Dance. 2013. ‘NSA Files: Decoded: Edward Snowden’s Surveillance Revelations Explained.’ The Guardian, November 2013. https://www.theguardian.com/world/interactive/2013/nov/01/snowden-nsa-files-surveillance-revelations-decoded#section/3.
O’Toole, Megan, et al. 2017. ‘Broken Homes: A Record Year of Home Demolitions in Occupied East Jerusalem.’ Al Jazeera.
Contributions to report
1. Reverse Engineering a Data Journalism Project
by Tjerk de Vries
2. Critical assessment about fbTREX and working with filter bubbles
by Dragos Culcear
3. Critical observation and analysis of our data
by Pascal Friedrich Degenfeld (1st Half) and Tom Westö (2nd Half)
4. Our experiences with data journalism
by Merve Aytas and Edua Varga