A little over a year ago, in our Collaboratory’s early days, we faced some big decisions about how to begin systematically and rigorously collecting data from social media. A good example is the question of whether to collect Twitter data from the public API, or to gain access to the firehose.
If you’re faced with this question, it’s important to learn the basics about each data source.
Few of the extant studies that have analyzed data from the Twitter API have mentioned that it provides only a sample of tweets — about a 1% sample, according to Twitter. And there has been virtually no discussion in the published research to date explaining the distinction between the streaming API and the search API, or describing whether and how the sampling methodologies differ for each. Dr. Sandra González-Bailón, an Internet researcher at the University of Oxford, recently compared the Twitter streaming and search APIs to measure sample bias; she presented that research at our Big Data in Public Health meeting.
When very large numbers of tweets about a particular topic are available through either the streaming or search API, they may provide sufficient information to draw relevant conclusions about the volume and content of tweets about that topic.
Think of it like looking for needles in a haystack. How easily you’ll find a needle depends on how many needles are in the haystack, right? So if you’re looking for tweets about Justin Bieber (according to The Fact Site, Justin gets about 60 new mentions on Twitter per second, whether he tweets or not), there are bazillions of Bieber-related needles in the Twitter haystack. Pretty much anywhere and anytime you grab, you’re likely to pull out some tweets about Bieber, and you could probably make reasonable inferences about the larger Bieber conversation from the sample of that huge pool of Bieber tweets available through the API. That makes the public API a great resource for mining tweets about popular topics.
But if you’re looking for tweets about a less popular topic like, say, a tobacco control media campaign, there will be far fewer needles in the haystack — not many people may be talking about it, but we care about what all of them (or at least most of them) are saying. For a relatively unpopular topic, you’d need a pretty good idea of where and when to grab in the haystack to get your hands on those rare needles. And since Twitter does not share its API sampling algorithms with the public, you’d have a hard time deciding when and where (i.e., with which search terms) to sample, and the inferences you could draw from such analyses would be quite limited. Based on an API sample, a researcher might conclude that no one mentions tobacco control media on Twitter, when in fact there is a substantial number of tweets about the topic, all or most of which are important to the analysis.
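To see why a fixed-rate sample hurts rare topics so much more than popular ones, here is a minimal simulation sketch in Python. The tweet volumes, the 1% rate, and the assumption of uniform random sampling are all illustrative; Twitter has not disclosed how its API actually samples.

```python
import random

random.seed(42)

SAMPLE_RATE = 0.01  # the roughly 1% public API sample

def sampled_count(topic_tweets: int) -> int:
    """Simulate how many on-topic tweets survive a 1% random sample."""
    return sum(1 for _ in range(topic_tweets) if random.random() < SAMPLE_RATE)

popular = 200_000  # hypothetical hourly volume for a celebrity topic
rare = 300         # hypothetical volume for a niche public-health topic

# The popular topic still yields thousands of sampled tweets, enough to
# support inference; the rare topic yields a handful, possibly zero.
print("popular topic sample:", sampled_count(popular))
print("rare topic sample:", sampled_count(rare))
```

With numbers like these, the popular topic's sample is large and stable, while the rare topic's sample can fluctuate between zero and a few tweets from run to run, which is exactly the "no one is talking about it" trap described above.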
One of the most important limitations of both the streaming and search APIs is that little is known about the proportion of all tweets that either sample represents. This limitation is particularly salient for topics not widely tweeted about, such as public health media campaigns or policy debates. While participants in such a discussion may be influential thought leaders, and the discussion may offer an important lens into health behaviors and emerging trends, an API sample may simply gather too few tweets about these topics to allow for informed analyses or valid inferences.
To overcome these limitations, our research team searched for a software interface that would support access to the Twitter firehose, and allow us to efficiently manage the data we obtained. We eventually identified such software, DiscoverText, which had been developed by a fellow academic who shared our perspective on rigorous and transparent data collection and analysis. DiscoverText enables access to the GNIP PowerTrack Twitter firehose — a data source that produces all publicly available tweets within a specified time frame. With DiscoverText we can search and analyze all tweets about our high-priority smoking topics, not just the limited sample provided by the API. And DiscoverText combines data mining tools with a human-based coding system to improve the accuracy of our analyses. We use the software to increase the precision and recall of our search terms.
Since December 2011, we have used 283 keywords to capture nearly 63 million smoking-related tweets through the GNIP firehose and 12 keywords to collect more than 92 million tweets through the Twitter Public Search API. Precision testing on some pilot keywords taught us that some relatively broad searches, such as ‘smoking’, can yield usable data through the API, while more precise terms, like ‘smoking commercial’ or ‘lady with hole in throat’, are best run through the firehose.
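As an illustration of the kind of precision testing described above, here is a toy sketch in Python. The tweets, their relevance labels, and the simple substring matching are hypothetical stand-ins for our actual coded data and search logic; in practice, the labels come from human coders.

```python
# Hypothetical hand-labeled pilot set: (tweet text, is it actually
# about the smoking topic we care about?)
labeled = [
    ("quit smoking today, wish me luck", True),
    ("this new mixtape is smoking hot", False),
    ("saw that anti-smoking commercial again", True),
    ("smoking brisket all afternoon", False),
    ("the lady with hole in throat ad is intense", True),
]

def precision_recall(keyword: str, data):
    """Precision: of tweets matching the keyword, how many are relevant?
    Recall: of all relevant tweets, how many does the keyword match?"""
    matched = [(text, rel) for text, rel in data if keyword in text]
    relevant = [(text, rel) for text, rel in data if rel]
    true_pos = sum(1 for _, rel in matched if rel)
    precision = true_pos / len(matched) if matched else 0.0
    recall = true_pos / len(relevant) if relevant else 0.0
    return precision, recall

# A broad term casts a wide net (better recall, worse precision);
# a precise term matches fewer tweets but more accurately.
print(precision_recall("smoking", labeled))
print(precision_recall("smoking commercial", labeled))
```

On this toy set, the broad term ‘smoking’ matches relevant and irrelevant tweets alike, while ‘smoking commercial’ matches only relevant ones but misses most of them, mirroring the broad-versus-precise trade-off that drove our API-versus-firehose decisions.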
Watch for an upcoming post where we highlight some projects we’re working on that illustrate how to decide when the API is sufficient, and when the firehose should come into play to ensure the utmost research rigor.