Welcome to the COVID-19 Twitter Pandemic Archive. The archive is a catalog of datasets containing billions of Tweet IDs for COVID-19-related tweets, together with a set of data visualizations that display high-level monthly statistics about the COVID-19 conversations on Twitter.
The datasets are being offered as-is for archiving and non-commercial research purposes and are free to download and reuse.
The tweets are collected via Twitter’s COVID-19 Streaming Endpoint (API) using a custom script developed by the Social Media Lab. According to Twitter, this new streaming endpoint has no data volume or throughput limitations and offers a real-time, full-fidelity stream of public Tweets containing the full conversation about COVID-19. For more information about which tweets are included in this collection, see Twitter’s filtering rules for this endpoint.
As per Twitter’s API Terms, each dataset only includes Tweet IDs (as opposed to the actual tweets and associated metadata). New datasets are uploaded to the web at the beginning of each month.
For each month, we prepare two data files:
- one file with Tweet IDs for all COVID-19-related tweets that we collect via the API, and
- a second file containing a subset of Tweet IDs for COVID-19-related tweets that also contain a vaccine-related word (i.e., a word starting with "vaccin", "vacin", or "vax").
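
The vaccine-related subset can be illustrated with a minimal Python sketch. This is not the Social Media Lab's actual script. It assumes that matching is case-insensitive and applied to whole-word prefixes, and the function names and the (tweet ID, text) input shape are hypothetical, chosen only for illustration.

```python
import re

# Assumption: a word counts as vaccine-related if it STARTS with one of the
# three prefixes named above, regardless of letter case.
VACCINE_WORD = re.compile(r"\b(?:vaccin|vacin|vax)\w*", re.IGNORECASE)


def mentions_vaccine(text: str) -> bool:
    """Return True if the text contains a word starting with a listed prefix."""
    return VACCINE_WORD.search(text) is not None


def vaccine_subset(tweets):
    """Keep the IDs of tweets whose text matches.

    `tweets` is a hypothetical iterable of (tweet_id, text) pairs.
    """
    return [tweet_id for tweet_id, text in tweets if mentions_vaccine(text)]
```

Under these assumptions, "Got my vaccine today" and "#Vaxxed" would match, while "antivax" would not, because the prefix does not appear at the start of the word. Whether the original collection script treats such cases the same way is not stated in this description.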