I helped the folks at #wjchat a while back by writing a quick Django app to pull tweets each week and save them to a database. The tweets were then accessible as formatted HTML that could be posted on the #wjchat WordPress blog.
But something happened around the middle of April. The script pulled in 389 tweets on April 16, but the following week only caught 100. Same for the week after.
I backfilled the missing tweets by hand for a few weeks, but the problem kept recurring.
The script — which was using the Tweepy library — was missing the first hour or so of tweets, and leaving the archive incomplete. Time for a refactor.
I’m not blaming Tweepy, but I wanted to see what else was out there. I took a run at requests with OAuth, but I don’t think I’m ready for that yet.
Tell you the truth: the library didn’t matter. It comes down to three things: how tweets are returned by the search API (descending order, newest first), the maximum number of results per page (100), and this thing called max_id.
The max_id parameter is essentially how you page through results from the Twitter search API.
The solution to the issue described above is a technique for working with streams of data called cursoring. As Twitter’s documentation on working with timelines puts it: “Instead of reading a timeline relative to the top of the list (which changes frequently), an application should read the timeline relative to the IDs of Tweets it has already processed. This is achieved through the use of the max_id request parameter.”
So the best I could get from a single search after April 16 was 100 tweets, and once I had those 100 I had no way of moving on to the next page. Thankfully, tonight I figured out pretty quickly how the max_id parameter works.
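To see why cursoring works, here’s a minimal sketch against a stubbed search function. The names `fake_search` and `collect_all` are hypothetical, not part of any library: the stub mimics a search endpoint that returns tweets newest-first, at most 100 per page, honoring an inclusive `max_id` cap. After each page you set `max_id` to one less than the lowest ID you received, so the next request picks up where the last one left off.

```python
def fake_search(tweets, max_id=None, count=100):
    """Stub for a search call: up to `count` tweets with id <= max_id, newest first."""
    page = [t for t in tweets if max_id is None or t["id"] <= max_id]
    return page[:count]

def collect_all(tweets, count=100):
    """Page through results with max_id until a request comes back empty."""
    collected = []
    max_id = None
    while True:
        page = fake_search(tweets, max_id=max_id, count=count)
        if not page:
            break
        collected.extend(page)
        # max_id is inclusive, so ask for everything strictly older
        # than the oldest tweet on this page.
        max_id = page[-1]["id"] - 1
    return collected

# 250 fake tweets, IDs descending (newest first), as the search API returns them.
timeline = [{"id": i} for i in range(250, 0, -1)]
all_tweets = collect_all(timeline)  # three pages: 100 + 100 + 50
```

The subtraction matters: because max_id is an inclusive bound, reusing the lowest ID as-is would fetch that tweet again on every page.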
The script is based on the #wjchat script, and takes a hashtag and a search start date as arguments and outputs the results as a CSV. It requires the following packages:

```
pip install python-dateutil==2.1
pip install pytz==2013b
pip install twitter==1.14.3
```
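For the output step, the standard library’s csv module is enough. This is just a sketch of that last stage, with hypothetical field names and sample tweets standing in for real search results:

```python
import csv
import io

# Hypothetical tweets as collected from the search API.
tweets = [
    {"id": 3, "user": "alice", "text": "hello #wjchat"},
    {"id": 2, "user": "bob", "text": "answering Q1 now"},
]

def tweets_to_csv(tweets, fileobj):
    """Write tweet dicts to an open file-like object as CSV with a header row."""
    writer = csv.DictWriter(fileobj, fieldnames=["id", "user", "text"])
    writer.writeheader()
    writer.writerows(tweets)

# In the real script this would be open("archive.csv", "w", newline="").
buf = io.StringIO()
tweets_to_csv(tweets, buf)
```

DictWriter handles quoting automatically, which matters for tweet text full of commas and quotes.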