Twitter Archives: A Discussion of Systems, Methods, Visualizations, and Ethics
Contributor: William I. Wolff
Affiliation: Saint Joseph's University
Email: wwolff at sju.edu
Released: 1 October 2017
Published: Spring 2018 Issue of Kairos (Issue 22.2)
Submitted: June 23, 2017
Initial Comments Received: July 8, 2017
Revised Version Submitted: July 13, 2017
Current Status: Published
Introduction
The last several years have seen growth in Computers and Writing scholars publishing articles based on studies using archived tweets. Brian J. McNely (2010), for example, considered how conference tweets around a hashtag can help form an information ecology. Melody A. Bowdon (2014) looked into how organizations used Twitter in 2011 in response to Hurricane Irene. John Jones investigated a network exchange initiated in 2008 by U.S. Congressman John Culberson (2014a) and tweets that used the #healthcare hashtag (2014b). Jodie Nicotra (2016) analyzed the rhetoric of public shaming on Twitter. Tracey Hayes (2017) focused on how responses to the #MyNYPD hashtag transformed Twitter into a space for public protest. And Douglas M. Walls (2017) discussed the tweets of a young African American woman technical communicator. While each of these articles (and the many I have not mentioned) adds to our field’s appreciation of the many ways authors use Twitter to compose, organize, and communicate, these studies also highlight what is missing in scholarship on Twitter: in-depth discussions of the methods and ethics of archiving tweets. By archiving, I'm referring to a collection and retrieval process that can be completed using software or manually by, for example, taking and saving screenshots. Of the articles mentioned above, only McNely (2010) and Jones (2014a; 2014b) briefly describe how they gained access to or archived the tweets. None of the authors discuss how they untangled the ethical questions they faced—questions, as Heidi A. McKee and James E. Porter (2009) observed, “that are woven throughout the research process” (p. xvii) and that no doubt each of the authors faced.
And I add my name to that list: in my articles on the composing practices of Bruce Springsteen fans on Twitter (Wolff, 2015a; 2015b), I discussed in detail neither the archiving systems I used nor the ethical decisions I made, beyond naming the archiving system and noting that I had received permission to use tweeted fan photographs.
When authors do not discuss archiving methods and ethical decisions, scholars interested in joining Twitter discussions do not have models to help navigate the daunting challenges that emerge when embarking on a new project that will require learning unfamiliar methodologies, software, coding languages, research environments, and ethical concerns. This webtext is an attempt to fill that gap by offering suggestions and tutorials based on eight years of experience archiving tweets and studying how Twitter users compose. In that time, I have often wondered: What do I need to do to prepare? How do I learn how to use the software? What do I do if I don’t understand the necessary programming languages? What help resources are available? What are the ethical concerns specific to my project? How will I feel confident enough to write up results if I am just getting started?
These concerns are very real and have the potential to halt a project before it even begins. To help others address these questions that I have struggled with myself, in this piece I describe the tools and methods I have learned to use to conduct Twitter-based research. I briefly introduce five archiving systems, present two Research Scenarios, offer tutorials for getting started, discuss the ethics of archiving and (re)publishing social media posts, and encourage scholars to prioritize discussions of archiving methods and ethical decisions. Specifically, I introduce and highlight my experiences with five current archiving systems: NodeXL, designed and maintained by the Social Media Research Foundation; yourTwapperKeeper, designed (but no longer consistently maintained) by John O’Brien; TAGS 6.x, designed and maintained by Martin Hawksey; DMI-TCAT, designed and maintained by the Digital Methods Initiative; and the Gephi Twitter Streaming Importer, designed and maintained by Matthieu Totet. My hope is that by the end of this webtext, digital media scholars will have a more nuanced understanding of what is expected and what is possible when beginning the process of archiving and visualizing tweets.
Please note, however, that my suggestions are informed by and limited to my experiences with each archiving system, as well as my comfort levels attempting new software and coding languages. When creating difficulty scales and learning curves, I have attempted to keep in mind a scholar who, like myself, has limited programming skills and experience with the command line. Each reader’s experiences and comfort levels will, undoubtedly, be different. Further, though I do offer tutorials for how to set up network visualizations, I do not go into the details of social network analysis.
The Archive Systems
In past studies, I have used both yourTwapperKeeper (Wolff, 2015a) and TAGS (Wolff, 2015b). I now use DMI-TCAT and TAGS for nearly all my archiving needs. These tools offer the greatest flexibility, ease of use, and export options; I strongly recommend them to other scholars. However, I encourage researchers to explore each of the archiving systems below to determine which is right for their project, skillset, budget, and hardware.
NodeXL and NodeXL Pro
In July 2008, the Social Media Research Foundation (SMRF) released NodeXL, a free, open source social network viewer created for Microsoft Excel on Windows; the first stable version was released in April 2013. On October 12, 2015, SMRF announced the creation of NodeXL Basic, a free version that allows limited streaming imports from Twitter, and NodeXL Pro, a tiered pay version, which provides the “ability to import social networks directly from Twitter, Facebook, Exchange, Wikis, YouTube, Flickr and email, or use one of several available plug-ins to get networks from Surveys, WWW hyperlinks and social media cloud storage lockers” (Smith, 2017). One of the benefits of NodeXL is that users import social media data and create network visualizations in a single software environment. The drawbacks, however, are that users are limited to one proprietary operating system (Windows) and one proprietary software application (Microsoft Excel), and that they must pay for NodeXL Pro because NodeXL Basic is much too limited in what can be imported (even from Twitter) and in the kinds of visualizations that can be generated (“Features: NodeXL Features Overview,” 2016). I have found setting up Twitter imports with NodeXL very easy but have had significant difficulty generating meaningful network visualizations. There is, however, a robust community of users and advocates available for help on an active discussion board, and Marc Smith, one of the principal creators of NodeXL, has been very open and responsive to questions posed to him on Twitter. Though NodeXL offers an impressive number of options, because it is only available for a fee and only usable on Windows machines with Excel installed, I have opted to use other applications instead. Please see Nasri Messarra's Simple Walkthrough to Visualizing Twitter Data on NodeXL (2015) for setting up and getting started with NodeXL.
TwapperKeeper and yourTwapperKeeper
On June 5, 2009, programmer and technology consultant John O’Brien III announced he was dedicating his weekend to experimenting with the Twitter API. Two days later, O’Brien released TwapperKeeper, a free service that allowed users to archive tweets on a server hosted and maintained by O'Brien. TwapperKeeper was an immediate sensation, with thousands of individuals using it to store millions of tweets. In August 2010, O’Brien released yourTwapperKeeper, which allows users to run the software on their own dedicated server; later, professor and social media researcher Axel Bruns released new and revised PHP scripts that extended yourTwapperKeeper's features and fixed bugs, including one that replaced the last three digits of tweet ID numbers with three zeros (Bruns, 2011). On September 16, 2011, Hootsuite announced it was buying TwapperKeeper for $3 million. On May 2, 2013, O’Brien released version 2 of yourTwapperKeeper, which reflected changes to the Twitter API (O’Brien, 2013). TwapperKeeper was revolutionary when first released and very easy to use; yourTwapperKeeper was similarly important because it gave individuals the opportunity to archive millions of tweets on their own servers. The installation, however, is extraordinarily complex for people with limited programming skills and comfort levels, and, other than a rarely used Google message board, there is little-to-no support when problems arise. As of this writing, yourTwapperKeeper exports only to Excel, HTML tables, and JSON, though the Digital Methods Initiative FAQ page provides instructions for importing yourTwapperKeeper databases into DMI-TCAT. When exported as a spreadsheet, each row contains a single tweet and its metadata.
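Because yourTwapperKeeper can export to JSON, a few lines of scripting make an archive queryable outside the web interface. The following is a minimal, hypothetical sketch: the field names (`from_user`, `text`, `id`) are assumptions based on one export format and may differ across versions, so check them against your own export.

```python
import json

# Hypothetical yourTwapperKeeper-style JSON export; real exports
# contain many more fields (timestamps, user IDs, geo data, etc.).
sample_export = """
[
  {"from_user": "user_a", "text": "First #notokay tweet", "id": 1},
  {"from_user": "user_b", "text": "Reply to @user_a #notokay", "id": 2},
  {"from_user": "user_a", "text": "Another tweet", "id": 3}
]
"""

def tweets_per_author(raw_json):
    """Tally how many archived tweets each account posted."""
    counts = {}
    for tweet in json.loads(raw_json):
        author = tweet.get("from_user", "unknown")
        counts[author] = counts.get(author, 0) + 1
    return counts

print(tweets_per_author(sample_export))
```

A tally like this is often the first sanity check that an archive captured what you expected before moving on to closer analysis.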
Twitter Archiving Google Spreadsheet (TAGS 6.x)
On June 18, 2010, Martin Hawksey, Chief Innovation, Community and Technology Officer at the Association for Learning Technology, announced the release of the free, browser-based Twitteralytics Google Spreadsheet, which would later become the Twitter Archiving Google Spreadsheet (TAGS). Originally, TAGS grabbed weekly or daily results from Twitter, compiled a summary of the results, and emailed the summary to the user. It also stored in a Google Drive Sheet “a copy of the sampled tweets which could be used for further analysis down the line but I would recommend you only use this as a backup for a separate twitter archive on TwapperKeeper” (Hawksey, 2010). Hawksey has updated and streamlined TAGS six times and has added information visualization features, including TAGSExplorer and TAGSArchive. As of this writing, users can create archives based on a keyword or hashtag, a list, or a Twitter user’s timeline. TAGS is the easiest archiving system to install and use; the most recent version takes less than five minutes to set up, and tweets appear within seconds. Hawksey has also been extremely responsive to questions and concerns when they arise. As of this writing, TAGS exports only as a .csv file. Hawksey (2011) has posted about preparing TAGS archives to be imported into NodeXL, and though that post suggests the process should also work in Gephi, I have not been able to complete it successfully.
The following tutorial, which I set up for an undergraduate course, walks you through setting up a TAGS archive:
Video 1: Setting up the TAGS 6.x Archiving Tool
Transcript
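For readers who want to work with a TAGS export outside of Google Sheets, the following minimal sketch counts hashtag frequencies in a TAGS-style .csv file. The column name `text` matches the TAGS archives I have worked with, but verify it against your own sheet; the sample rows are invented for illustration.

```python
import csv
import io
import re
from collections import Counter

# Invented rows in a TAGS-like layout (real exports have more columns).
sample_csv = """id_str,from_user,text,created_at
1,user_a,"Sharing my story #notokay",Sat Oct 08 02:34:00 +0000 2016
2,user_b,"RT @user_a: Sharing my story #notokay",Sat Oct 08 02:35:10 +0000 2016
3,user_c,"This is #NotOkay and #important",Sat Oct 08 02:36:45 +0000 2016
"""

def hashtag_counts(csv_text):
    """Count hashtags across all tweets in a TAGS-style CSV."""
    reader = csv.DictReader(io.StringIO(csv_text))
    counter = Counter()
    for row in reader:
        # Hashtags are case-insensitive on Twitter, so normalize them.
        counter.update(tag.lower() for tag in re.findall(r"#\w+", row["text"]))
    return counter

print(hashtag_counts(sample_csv).most_common(5))
```

Frequency counts like these can point to which hashtags merit closer reading before any visualization work begins.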
Digital Methods Initiative Twitter Capture and Analysis Toolset (DMI-TCAT)
On June 19, 2013, the Digital Methods Initiative released on GitHub the Twitter Capture and Analysis Toolset (DMI-TCAT), a free, open source application that allows users to archive and export tweets in a variety of forms for spreadsheets and Gephi (Borra & Rieder, 2014). DMI-TCAT requires a dedicated server and the use of the command line to successfully install it. Once they've installed it, users set up a query on a Capture page and investigate the tweets on an Analysis page. The Analysis page provides users with the ability to export tweets based on tweet statistics and activity metrics (such as overall tweet stats, overall user stats, hashtag frequencies, full datasets, random selection datasets, and all retweets) and data prepared to be imported into Gephi to create a variety of graphs (such as social graphs and co-hashtag graphs) and several experimental visualizations created by the Digital Methods Initiative. Though installation can be difficult, DMI-TCAT itself is extremely easy to use and provides the widest range of options for analyzing tweets by multiple methods.
Please see the detailed tutorial I created for installing DMI-TCAT on an Amazon EC2 server. The tutorial has been tested with an undergraduate class of 18 students. See the scenarios below for using DMI-TCAT.
Gephi Twitter Streaming Importer
On April 25, 2016, computer engineer Matthieu Totet announced the free Gephi Twitter Streaming Importer as a plug-in to enhance the free, open-source, cross-platform network visualization software, Gephi (Totet, 2016). Users install the plugin through the Gephi plugins interface, set up their Twitter credentials, and can create visualizations for full Twitter, hashtag, or user networks. Users can also save a query so that it can be analyzed in Excel or OpenOffice, added to, or visualized later. As with NodeXL, one of the benefits of the Streaming Importer is that users import social media data and create network visualizations in one software environment. One of the very cool features of this method is watching the network bloom in real time as the importer continues to grab tweets, which you can see in the below tutorial:
Figure 2 compares many of the features of the five archiving systems. On the left, the features of NodeXL, yourTwapperKeeper (YTK), TAGS, DMI-TCAT, and Gephi Streaming Importer are compared side-by-side. I borrow the ski slope difficulty scale, where a green circle means easy, a blue square means moderately difficult, a black diamond means difficult, and two black diamonds mean very difficult. Under yourTwapperKeeper, I add NS to the black diamond because there is currently No Support. On the right, I present six mini-charts, with comparisons for Setup Comfort Level, After Setup Learning Curve, Archive Size, Archive Reach, Archive Duration, and Data Usage Goals. I again borrow the ski slope difficulty scale, but also include graphics that visually represent the results of particular archiving systems. For example, once the archives are set up, the learning curve for the Gephi application is much steeper than for TAGS and yourTwapperKeeper. DMI-TCAT and yourTwapperKeeper offer a significantly larger archive size than TAGS, but TAGS and yourTwapperKeeper will reach back farther than DMI-TCAT to collect tweets posted prior to the moment the archive is created. DMI-TCAT and yourTwapperKeeper allow scholars to archive over longer periods of time. Each of these affordances and constraints should be considered prior to starting to archive tweets.
Research Scenario 1: #notokay Tweets
In this scenario, I highlight how researchers might be able to capture the start of a trending hashtag if they are able to begin archiving soon after the hashtag gains traction.
Archiving Systems Used
TAGS 6.1 and DMI-TCAT
Background
On October 7, 2016, the Washington Post released “Trump Recorded Having Extremely Lewd Conversation About Women in 2005” (Fahrenthold, 2016), which contained video and audio of then-presidential candidate Donald Trump claiming to have sexually assaulted women multiple times. On October 8, author Kelly Oxford released a tweet in which she called for women to share their sexual assault stories and shared her own. Several minutes later, she retweeted her call, this time adding the #notokay hashtag.
Oxford’s call went viral and hundreds of thousands of women began tweeting their sexual assaults, with and without the #notokay hashtag, in doing so revealing an epidemic of sexual assault against women. It was a stunning example of what Catherine Fosl (2008) has described: “The personal narrative recounted in the service of social justice also has the dimension of ‘witnessing’ or ‘author-izing’ an experience previously marginalized. Telling one’s own story thus also has a collective purpose and can work as a consciousness-raising, even a community-organizing tactic” (p. 220).
Tracing the Research Process
At approximately 10:30 p.m. on October 7, 2016, I became aware of the #notokay tweets and, thinking the hashtag could go viral, immediately created archives of tweets containing #notokay using both TAGS 6.1 and DMI-TCAT. I created both sets of archives for specific reasons: First, I created the TAGS 6.1 archive because I wanted to be sure to grab tweets prior to 10:30 p.m., and, if able, TAGS 6.1 will grab up to 10,000 tweets from the prior week. (I write “if able” because an overwhelming volume of tweets could prevent the system from reaching back very far.) I knew from prior experience that DMI-TCAT has the potential not to reach back as far as TAGS even though it searches the same Twitter API as TAGS 6.1. Second, I created the DMI-TCAT archive because I knew I would want to create visualizations in Gephi, and DMI-TCAT affords easy export to Gephi-formatted files.
I also created two additional archives on TAGS 6.1: First, I created an archive of all tweets that mentioned @kellyoxford because I was interested in seeing how people were replying to her. Second, I created an archive of all tweets from Kelly Oxford’s account (@kellyoxford) because I wanted to be sure to grab her initial tweets calling for women to share their stories. I thought there would be a possibility that there were so many #notokay tweets that TAGS 6.1 wouldn’t grab the first mentions in the #notokay archive and also I knew that her first tweet didn’t contain the #notokay hashtag. In addition, I created an archive on DMI-TCAT to collect all of @kellyoxford's mentions so I could create maps of her network on Gephi.
In summary, by 10:55 p.m. on the evening of October 7, 2016, I had five archives dedicated to collecting tweets related to the #notokay hashtag. I had three archives on TAGS 6.1 collecting all #notokay tweets, all tweets from @kellyoxford, and all tweets mentioning @kellyoxford. And I had two DMI-TCAT archives collecting all #notokay tweets and all of @kellyoxford's mentions.
Sifting Through the Archives
Because data stream into the archives in real time, researchers can see immediately whether they are capturing the data they hoped to get. It is often helpful to compare the results from each archive. For example, consider the following for the #notokay tweet archives, which were started within a minute of each other on October 7, 2016:
TAGS 6.1
First Tweet in Archive Date and Time: September 9, 2016, 11:48 p.m.
First Tweet about the Trump Video Date and Time: October 7, 2016, 2:45 p.m.
DMI-TCAT
First Tweet in Archive Date and Time: October 7, 2016, 7:34 p.m.
First Tweet about the Trump Video Date and Time: October 7, 2016, 7:34 p.m.
TAGS 6.1 collected tweets dating back to September 9, 2016, which might not initially seem important for studying this material, but I would argue that it provides crucial context for the new #notokay tweets inspired by the Trump video. Among other things, it shows how a hashtag being used to discuss a range of unrelated issues can be brought to singular focus under the right circumstances.
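The kind of archive comparison shown above is easy to script. The following sketch uses the timestamps reported in this scenario; it assumes the export stores dates in Twitter's classic timestamp format, which may not hold for every system.

```python
from datetime import datetime

# Twitter's classic created_at format, e.g. "Fri Sep 09 23:48:00 +0000 2016"
TWITTER_FORMAT = "%a %b %d %H:%M:%S %z %Y"

# Earliest known timestamps from each archive in this scenario
# (illustrative; pull these from your own exports).
archives = {
    "TAGS 6.1": ["Fri Sep 09 23:48:00 +0000 2016",
                 "Fri Oct 07 14:45:00 +0000 2016"],
    "DMI-TCAT": ["Fri Oct 07 19:34:00 +0000 2016"],
}

def earliest_tweet(timestamps):
    """Return the earliest datetime among a list of created_at strings."""
    return min(datetime.strptime(ts, TWITTER_FORMAT) for ts in timestamps)

for name, stamps in archives.items():
    print(name, "reaches back to", earliest_tweet(stamps).isoformat())
```

Running this comparison early in a project can flag, within minutes, which archive is going to provide the historical reach a study needs.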
In this case, the hashtag went viral as a result of two complementary events. First, author Laura Jingalls (@ljingalls) added the #notokay hashtag in a reply to two users she follows who had retweeted and commented on the Washington Post article, which was originally tweeted by New York Times reporter, Sopan Deb (@SopanDeb). Jingalls was the first user to apply the #notokay hashtag to the Trump video topic (Figure 4).
Second, @kellyoxford tweeted a call for women to share their sexual assaults and added her own. Eight minutes later, she repeated her call and added the #notokay hashtag (Figure 3). Three minutes later, she tweeted she was receiving "2 sex assault stories per second" and reinforced her call to have women share their stories by adding the hashtag again (Figure 5).
It is not clear from any of my archives why Oxford added the hashtag—that is, if she saw other tweets that had it or decided to add it herself. My archive of @kellyoxford's mentions did not capture the early replies to her first tweet, though there are thousands of retweets and replies to all three of the tweets I've included here. Only one tweet with the #notokay hashtag about the Trump video appears in the #notokay archive between when Jingalls and Oxford posted theirs. According to my archive of Oxford’s tweets, she did not retweet either of the tweets that used the #notokay hashtag about the Trump video. So, we don’t know if she saw them or, like the other two users, decided to add the #notokay hashtag on her own.
Though seemingly minor, Research Scenario 1 shows the importance of not relying on just one archiving system and on just one archive. If I had just relied on DMI-TCAT because I knew I would want to make visuals, I would have missed uncovering the dual origins of the hashtag. If I hadn't archived Kelly Oxford's tweets, I wouldn't have seen both of her initial tweets calling for women to share their sexual assaults, nor the several tweets about the Trump video she posted prior to her call to women. This Scenario also suggests that tweet archives, like all archives, are more than just data repositories. Rather, they are artifacts with contents subject to a multitude of constraints, such as system features, initial setup time, Twitter API constraints, and curator skillsets.
Research Scenario 2: #terencecrutcher tweets
In this scenario, I highlight how to create Directed Mention and Undirected Co-Hashtag network maps.
Archiving System and Application Used
DMI-TCAT and Gephi
Background
On September 16, 2016, 40-year-old Terence Crutcher was shot and killed by Tulsa, Oklahoma, police officer Betty Shelby during a routine investigation into Crutcher’s vehicle, which had broken down in the middle of a road (King, 2016). On Monday, September 19, 2016, the Tulsa Police Department released helicopter and dashcam video of the event, which showed Shelby shooting Crutcher after he’d been tasered by fellow officer Tyler Turnbough while Crutcher was leaning against a car, hands raised above his head (Stack, 2016). On May 16, 2017, after nine hours of deliberation, a Tulsa jury found Shelby not guilty of first-degree manslaughter (Vicent, 2017).
I began archiving the tweets because I was interested in seeing how different communities were tweeting about Mr. Crutcher's death and how Gephi might distinguish those communities, but I was not able to begin archiving immediately after the police video emerged. As a result, the visualizations discussed below were created using an archive of tweets with the #terencecrutcher hashtag captured between September 20 and September 21, 2016. Visualizations, however, should never be the end goal of a study. They are used to reveal nuances and relationships previously unseen in the data (here, the tweets), which must then be analyzed by going back to the tweets themselves. As Franco Moretti (2007) explained, in order to see patterns in a text, “we first must extract it from the narrative flow, and the only way to do so is with a map. Not, of course, that the map is already an explanation; but at least it shows us that there is something that needs to be explained” (p. 39).
Very generally, there are three initial dynamics to look for when approaching network maps. The first is to recognize one-to-one relationships between users and/or hashtags, which will lead to a better understanding about why certain Twitter users are mentioning one another and/or why certain hashtags are being used together in a tweet. The second is to search for communities (which Gephi indicates using color), which will lead to a better understanding of the different groups of people who are tweeting. The third is to look at who is tweeting and how often they are tweeting, which will lead to a better understanding of who the influencers and central figures are in a given network. Each of these dynamics should send a researcher back to the tweets to help explain the visualized phenomena.
Directed Mention Network Maps
Directed mention network maps display relationships among users who, in this case, are replying, mentioning, or retweeting another account in tweets that use a given hashtag (in this case, #terencecrutcher). DMI-TCAT provides an option to export a Gephi-ready file of the whole corpus or of only the top users; either export results in a directed graph. Directed graphs show direct connections between individual Twitter accounts. The accounts are represented as nodes, and the connections are represented as edges. The thicker the edge, the more often the Twitter accounts are connected. Gephi attempts to group similar communities by node color.
Two revealing directed graphs are in-degree and out-degree. In-degree networks reveal how often a particular user appears in tweets through mentions, replies, or retweets. The larger the node, the more that user is mentioned. Out-degree networks show how often a particular user is tweeting. The larger the node, the more that user is tweeting. It is also important to look for cohesive subgroups, which are small clusters of users often on the outskirts of the network who are communicating with one another. These can reveal smaller communities within the larger network.
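The in-degree and out-degree measures described above reduce to simple counts over a mention edge list. The sketch below uses invented account names for illustration; a real edge list would come from a DMI-TCAT Gephi export.

```python
from collections import Counter

# Each (source, target) pair is one mention, reply, or retweet edge.
# Account names here are hypothetical.
edges = [
    ("fan_1", "shaunking"),
    ("fan_2", "shaunking"),
    ("fan_3", "shaunking"),
    ("reporter", "fan_1"),
    ("fan_1", "fan_2"),
]

in_degree = Counter(target for _, target in edges)    # how often mentioned
out_degree = Counter(source for source, _ in edges)   # how often tweeting

print("most mentioned:", in_degree.most_common(1))
print("most active:   ", out_degree.most_common(1))
```

Even this toy example reproduces the pattern discussed below: the most-mentioned account need not be the most active one.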
In Figure 6, for example, New York Daily News writer and racial justice activist Shaun King is mentioned in the tweets more than any other user (left map) even though he is not tweeting as often as other users (right map). The sizeable pink cohesive subgroup that appears at the bottom left of both maps is a group of students in the same fraternity at the University of Maryland having a conversation based on one or more tweets by Shomari Stone, a reporter for Washington, D.C.'s NBC affiliate channel, NBC4. I uncovered that information by going to each cohesive subgroup user's Twitter page and reading their bios. By investigating their tweets further, we have the potential to learn about how and, perhaps, why they are communicating about Terence Crutcher's death. This single example shows how visualizations can reveal patterns in data that need to be explored in further depth.
The following tutorial, which I set up for an undergraduate course, walks a researcher through the process, from exporting a Gephi-ready file in DMI-TCAT to creating in-degree and out-degree network maps.
Co-Hashtag Undirected Network Maps
Co-hashtag network maps show how often two hashtags appear in the same tweet. In the maps, each node is a hashtag and each edge represents the fact that they appear together. The map is undirected because there is not a direct relationship being established between two users; the hashtags are merely appearing together in the same space. The larger the node, the more often a particular hashtag is being used. The thicker the edge between two nodes, the more often the hashtags appear together in tweets.
Co-hashtag maps reveal how users are connecting various issues together and also reveal issues that might be hidden within the data. Just as we look for cohesive subgroups in directed maps, in co-hashtag maps we look for modularity communities, which Gephi groups by color. These groupings can reveal how different communities within the larger network are using the hashtags and for what reasons.
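The weighting behind a co-hashtag map can be sketched in a few lines: every unordered pair of hashtags appearing in the same tweet adds one to that pair's edge weight. The tweets below are invented for illustration.

```python
from collections import Counter
from itertools import combinations

# Hashtag sets extracted from three hypothetical tweets.
tweets = [
    ["#terencecrutcher", "#blacklivesmatter"],
    ["#terencecrutcher", "#blacklivesmatter", "#tulsa"],
    ["#terencecrutcher", "#tulsa"],
]

edge_weights = Counter()
for tags in tweets:
    # sorted() makes each pair order-independent (undirected edges);
    # set() guards against a hashtag repeated within one tweet.
    edge_weights.update(combinations(sorted(set(tags)), 2))

for pair, weight in edge_weights.most_common():
    print(pair, weight)
```

The resulting pair counts correspond to edge thickness in the map: heavier pairs appear together more often, which is exactly the relationship a co-hashtag visualization asks the researcher to explain.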
In Figure 7, the larger hashtags are all related to the deaths of Black men at the hands of police officers, as well as to groups and individuals associated with efforts to bring those deaths to light and provide justice for the families of the victims. The green modularity community on the right side of the map is significant because it is composed of the names of Black men who were killed by police officers. Their grouping suggests that users are directly associating the death of Terence Crutcher with the deaths of the other Black men. We would need to investigate the tweets in further detail to see just how users are associating the individuals and, perhaps, whether they are employing any activist methods to help further their cause. There are also some unexpected hashtags, such as #donkeyoftheday, #treysongz, and #pressplay, that should be investigated to see how and why they are being used in the same tweets as #terencecrutcher.
The following tutorial, which I set up for an undergraduate course, walks a researcher through the process, from exporting a Gephi-ready file in DMI-TCAT to creating co-hashtag network maps.
Ethical Considerations
In 2012, the Association of Internet Researchers (AoIR) Ethics Working Committee released Ethical Decision-Making and Internet Research, a wide-ranging guide for scholars engaged in or interested in pursuing Internet research, which squarely locates archiving and visualizing social media data and texts within its scope (Markham & Buchanan, 2012, pp. 3–4). The guide is framed by six key principles, which I quote at length because of their importance:
- The greater the vulnerability of the community / author / participant, the greater the obligation of the researcher to protect the community / author / participant.
- Because ‘harm’ is defined contextually, ethical principles are more likely to be understood inductively rather than applied universally.
- Because all digital information at some point involves individual persons, consideration of principles related to research on human subjects may be necessary even if it is not immediately apparent how and where persons are involved in the research data.
- When making ethical decisions, researchers must balance the rights of subjects (as authors, as research participants, as people) with the social benefits of research and researchers’ rights to conduct research.
- Ethical issues may arise and need to be addressed during all steps of the research process, from planning, research conduct, publication, and dissemination.
- Ethical decision-making is a deliberative process, and researchers should consult as many people and resources as possible in this process. (Markham & Buchanan, 2012, pp. 4–5)
As the six principles suggest, Internet research exists in a murky, context-specific, and ever-changing environment that often blurs or obfuscates traditional boundaries and definitions. For example, consider the question of human subjects and informed consent. When conducting in-person interviews, the human subject is clear: the subject is the person being interviewed, a person who can provide informed consent. But what happens when a scholar is studying how humans are composing tweets through the process of analyzing, say, 5,000 tweets? Do each of the tweets represent a human subject that should provide informed consent? Just because (as of this writing) Twitter allows anyone to archive and republish tweets (“Developer Policy,” 2017), does that mean scholars have the right to do so without an individual’s consent, most especially if the content is of a sensitive nature, such as the tweets with the #notokay hashtag where women are sharing their experiences with sexual assault? Or, are the tweets merely bits of data that can be stored like any number in a spreadsheet? What happens when a researcher moves from archiving to (re-)publication, where personal identifiers (names, usernames, locations, etc.) move from the privacy of an archive to a public space? Is the tweet now a representation of a human being and not simply data to be analyzed? And if it is a representation of a human, what protections (if any) need to be in place in order to reduce the risk of harm?
These questions lead to further questions about expectations of privacy in an online space. James M. Hudson and Amy Bruckman (2004) observed that “researchers work with dichotomies such as public versus private and published versus unpublished. Works on the Internet, however, turn those dichotomies into continua” (p. 129). They suggest instead that nearly all online communications are “semipublished and semipublic” (p. 129), an argument supported by the AoIR guide, which notes that “individual and cultural definitions of privacy are ambiguous, contested, and changing. People may operate in online spaces but maintain strong perceptions or expectations of privacy” (Markham & Buchanan, 2012, p. 6). At first glance, it may seem bizarre to suggest that someone who posts a public tweet from a public account has some expectation of privacy, but most people communicating online have imagined boundaries that extend only to their followers. Most users, even those who use a trending global hashtag, don’t necessarily expect their tweets to be viewed by more than their immediate circle of followers—and most certainly don’t expect their tweets to be archived, stored, and analyzed (sometimes months or years later) by a scholar they have never met and to whom they never gave consent to store their information or republish it in another context. These concerns become even more important when the information being tweeted is sensitive. And yet, what counts as sensitive information is not clear-cut, either, because something that might seem innocuous at the time of tweeting could very well come back to haunt the person, as Justine Sacco and others who have been publicly shamed for what they considered to be innocent, if off-color, tweets have found out.
New and experienced Internet researchers might begin by asking these questions as they embark on each new study:
- What are the ethical expectations users attach to the venue in which they are interacting? Do they expect material to be archived?
- If access to an online context is publicly available, do participants/authors perceive the content to be public? What considerations might be necessary to accommodate “perceived privacy”?
- Does archiving require IRB approval? If not, is archiving still the right thing to do, especially if the material is sensitive? If it does, is approval practical? For example, how can one get IRB approval to archive a hashtag that is going viral if the IRB process takes months to complete?
- If data is housed in a repository for reuse, how might individuals be affected later? What possible risk or harm might result from reuse and publication of exact-quoted material?
- Have you asked all non-public figures for permission to use their tweets in your publication? Anonymizing tweets may not be enough, because a tweet’s author can often be found by searching for its exact text.
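The risk named in that last question is easy to demonstrate: because Twitter’s public search matches exact phrases, the verbatim text of an “anonymized” tweet can often lead straight back to its author. A minimal Python sketch of the idea (the `twitter.com/search?q=` URL format reflects Twitter’s public web interface as of this writing; the function name and example tweet are invented for illustration):

```python
from urllib.parse import quote

def search_url(tweet_text):
    """Build a public Twitter search URL for the exact wording of a tweet."""
    # Quoting the text forces an exact-phrase match in Twitter's search box.
    return "https://twitter.com/search?q=" + quote(f'"{tweet_text}"')

# Pasting this URL into a browser would typically surface the original
# tweet, and therefore its author, despite any anonymization in print.
print(search_url("just landed in Philly for #cwcon"))
```

In other words, removing a username from a republished tweet offers little protection if the tweet is quoted word for word; paraphrasing or securing permission may be the only real safeguards.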
When it comes to Internet research ethics, questions lead to more questions, which lead to still more questions—and that’s a good thing. In ever-changing environments, reflective deliberation is essential: it allows scholars to err on the side of caution, privacy, and minimizing harm.
Conclusion
While the phrase “archiving tweets” seems innocuous, the processes that precede and follow that activity—processes that include learning and then applying new programming languages, software, and ethical guidelines—are intimidating and complex: they depend on continually emerging software and are fraught with ethical ambiguities that must be clarified on a case-by-case basis. One of the most daunting parts of that process, for me at least, has been figuring out how to use the various archiving applications and when to use which one. I imagine others have shared that concern. My goal in writing this article has been to demystify, reveal, and collate what has been distributed across the web in various, often hard-to-find and hard-to-understand, blog posts, forums, and FAQs.
In Toward a Composition Made Whole, Jody Shipka (2011) observed that when researchers analyze dynamic online texts “it becomes easy to overlook the various resources and complex cycles of activity informing the production, distribution, exchange, consumption, and valuation of that focal text or collection of texts” (pp. 29–30). However, Shipka argued,
by sharing with others descriptions of the variety of tools composers employ and by highlighting how, when, and to what end those tools are employed, we are provided with opportunities to imagine still other ways of making and negotiating meaning in the world. Further, by sharing with others descriptions of the processes by which texts are produced, consumed, and ultimately valued, we are given opportunities to consider how and why certain meditational means and certain actions are deemed best or at least more appropriate in a given context than are others. (p. 53)
Using Shipka as a guide, I encourage all scholars writing about tweets that have been archived in one way or another—and the book and journal editors publishing that scholarship—to prioritize discussions not only of their methods but also of their ethical rationales. That is, when writing about and republishing tweets, to address at the very least the following questions:
- What archive method did you use and why? What did that method afford that others didn’t? Did you want to use another method but couldn’t figure out how to use it? What limitations did you find in the method? What did you want to do but couldn’t?
- What ethical questions did you face and how did you answer those questions?
- If using tweets in your publication, did you secure permission and how did you secure it? If you did not secure permission, why did you decide to use them?
As digital media scholars, we have a responsibility to make transparent all methodological, technological, and ethical decisions realized through the course of our research. And in doing so, we will help other scholars join the conversation in ways that are, hopefully, less intimidating than they are now.
Further Reading
Kozinets, R. (2015). Netnography: Redefined (2nd ed.). Los Angeles, CA: SAGE.
Markham, A., & Buchanan, E. (2012). Ethical decision-making and Internet research: Recommendations from the AoIR ethics working committee (Version 2.0). Association of Internet Researchers. Retrieved from https://aoir.org/reports/ethics2.pdf
Markham, A.N., & Baym, N.K. (2009). Internet inquiry. Los Angeles, CA: Sage.
McKee, H.A., & Porter, J.E. (2009). The ethics of Internet research: A rhetorical, case-based process. New York, NY: Peter Lang.
Scott, J. (2017). Social network analysis (4th ed.). Thousand Oaks, CA: SAGE.
Weller, K., Bruns, A., Burgess, J., Mahrt, M., & Puschmann, C. (Eds.). (2013). Twitter and society. New York: Peter Lang.
Acknowledgements
Thank you to the following Saint Joseph's University students: Elizabeth Krotulis, for testing early drafts of the DMI-TCAT set-up tutorial and helping me work out the kinks; the entire Fall 2016 COM 473 class, for having faith that the tutorials would actually work; and Caroline DeFelice, for correcting YouTube's auto-generated video closed captions. I am indebted to Kristi McDuffie, who approached me to write this article after seeing one of my Computers and Writing talks and worked with me throughout the process of putting it together. And thank you to Dave Parry for locating a gently used Thinkpad so I could learn how to use NodeXL without having to bring back to life the decade-old HP laptop languishing in the back of my home office closet. That was a huge relief.
References
Borra, Erik, & Rieder, Bernhard. (2014). Programmed method: Developing a toolset for capturing and analyzing tweets. Aslib Journal of Information Management, 66(3), 262–278. https://doi.org/10.1108/AJIM-09-2013-0094
Bowdon, Melody A. (2014). Tweeting an ethos: Emergency messaging, social media, and teaching technical communication. Technical Communication Quarterly, 23(1), 35–54. https://doi.org/10.1080/10572252.2014.850853
Bruns, Axel. (2011, June 21). Switching from Twapperkeeper to yourTwapperkeeper. Retrieved from http://mappingonlinepublics.net/2011/06/21/switching-from-twapperkeeper-to-yourtwapperkeeper/
Developer Policy. (2017, June 18). Twitter. Retrieved from https://dev.twitter.com/overview/terms/policy
Fahrenthold, David A. (2016, October 8). Trump recorded having extremely lewd conversation about women in 2005. Washington Post. Retrieved from https://www.washingtonpost.com/politics/trump-recorded-having-extremely-lewd-conversation-about-women-in-2005/2016/10/07/3b9ce776-8cb4-11e6-bf8a-3d26847eeed4_story.html
Features: NodeXL Features Overview. (2016, November 9). The Social Media Research Foundation. Retrieved from http://www.smrfoundation.org/nodexl/features/
Fosl, Catherine. (2008). Anne Braden, Fannie Lou Hamer, and Rigoberta Menchu: Using personal narrative to build activist movements. In R. Solinger, M. Fox, & K. Irani (Eds.), Telling Stories to Change the World: Global Voices on the Power of Narrative to Build Community and Make Social Justice Claims. New York: Routledge.
Hawksey, Martin. (2010, June 18). Using Google Spreadsheet to automatically monitor Twitter event hashtags and more. Retrieved from https://mashe.hawksey.info/2010/06/using-google-spreadsheet-to-automatically-monitor-twitter/
Hawksey, Martin. (2011, October 5). Using Google Spreadsheets as a data source to analyse extended Twitter conversations in NodeXL (and Gephi). Retrieved from https://mashe.hawksey.info/2011/10/using-google-spreadsheet-to-feed-nodexl/
Hayes, Tracey J. (2017). #MyNYPD: Transforming Twitter into a public place for protest. Computers and Composition, 43, 118–134. https://doi.org/10.1016/j.compcom.2016.11.003
Hudson, James M., & Bruckman, Amy. (2004). “Go away”: Participant objections to being studied and the ethics of chatroom research. The Information Society, 20, 127–139.
Jones, John. (2014a). Programming in network exchanges. Computers and Composition, 34, 23–38. https://doi.org/10.1016/j.compcom.2014.09.003
Jones, John. (2014b). Switching in Twitter’s hashtagged exchanges. Journal of Business and Technical Communication, 28(1), 83–108. https://doi.org/10.1177/1050651913502358
King, Shaun. (2016, September 19). KING: Arrest the Tulsa officer who killed Terence Crutcher. NY Daily News. Retrieved from http://www.nydailynews.com/news/national/king-arrest-tulsa-officer-killed-terence-crutcher-article-1.2798021
Markham, Annette, & Buchanan, Elizabeth. (2012). Ethical decision-making and internet research: Recommendations from the AoIR ethics working committee (Version 2.0). Association of Internet Researchers. Retrieved from https://aoir.org/reports/ethics2.pdf
McKee, Heidi A., & Porter, James E. (2009). The ethics of Internet research: A rhetorical, case-based process. New York, NY: Peter Lang.
McNely, Brian J. (2010). Exploring a sustainable and public information ecology. In Proceedings of the 28th ACM International Conference on Design of Communication (pp. 103–108). New York, NY, USA: ACM. https://doi.org/10.1145/1878450.1878468
Messarra, Nasri. (2015, April 22). Simple walkthrough to visualizing Twitter data on NodeXL. Retrieved from http://nasri.messarra.com/analyzing-twitter-data-with-nodexl/
Moretti, Franco. (2007). Graphs, maps, trees: Abstract models for literary history. New York, NY: Verso.
Nicotra, Jodie. (2016). Disgust, distributed: Virtual public shaming as epideictic assemblage. Enculturation (22). Retrieved from http://enculturation.net/disgust-distributed
O’Brien, John. (2013, May 2). Update to yourTwapperKeeper to align with Twitter API changes - Google Groups. Retrieved from https://groups.google.com/forum/#!topic/yourtwapperkeeper/5MARsa19ZDM
Shipka, Jody. (2011). Toward a composition made whole. Pittsburgh, PA: University of Pittsburgh Press.
Smith, Marc. (2017, April 4). NodeXL: Network overview, discovery and exploration for Excel. Retrieved from https://nodexl.codeplex.com/Wikipage?ProjectName=nodexl
Stack, Liam. (2016, September 19). Video released in Terence Crutcher’s killing by Tulsa police. The New York Times. Retrieved from https://www.nytimes.com/2016/09/20/us/video-released-in-terence-crutchers-killing-by-tulsa-police.html
Totet, Matthieu. (2016, April 25). Twitter streaming importer : naoyun as a Gephi plugin. Retrieved from https://matthieu-totet.fr/Koumin/2016/04/25/twitter-streaming-importer-naoyun-as-a-gephi-plugin/
Vicent, Samantha. (2017, May 17). NOT GUILTY: Betty Shelby acquitted; jurors in tears; Crutcher’s sister says police tried to cover up her brother’s “murder.” Tulsa World. Retrieved from http://www.tulsaworld.com/news/bettyshelby/not-guilty-betty-shelby-acquitted-jurors-in-tears-crutcher-s/article_cfdc970b-2b10-5f15-b4c1-711c5da4cc03.html
Walls, Douglas M. (2017). The professional work of “unprofessional” tweets. Journal of Business and Technical Communication, 1–26. https://doi.org/10.1177/1050651917713195
Wolff, William I. (2015a). Baby, we were born to tweet: Springsteen fans, the writing practices of in situ tweeting, and the research possibilities for Twitter. Kairos: A Journal of Rhetoric, Technology, and Pedagogy, 19(3). Retrieved from http://kairos.technorhetoric.net/19.3/topoi/wolff/index.html
Wolff, William I. (2015b). Springsteen fans, #bruceleeds, and the tweeting of locality. Transformative Works and Cultures, 19. Retrieved from http://journal.transformativeworks.org/index.php/twc/article/view/589