an alarming national re-admittance rate(19%), with nearly 2 million Medicare beneficiaries getting readmitted in within 30 days of release each year costing $17.5 billion. It is not completely understood how to reduce the readmission rates as even highly ranked hospitals of the country have not been able to bring their rates down. hospitals have
While the Survey file contains survey data on 4606 hospitals across the country, after cleaning up missing values, "insufficient data", "errors in data collection" the number of hospitals was down to 3573.
That settles the structured data. Lets talk about unstructured data. Consider a tweet from a user @TonyHernandez whose nephew recently had successful brain surgery at @Florida Hospital. Yes this one.
7hr brain surgery, huge success! Big props to Dr Eric Trumble at #FloridaHospital #disney pavilion, a true ROCK STAR! Thanks 4 prayers!This tells how a particular patient feels about the care measure he(his nephew) received at the hospital. The sentiment of the tweet text, the hashtags, the retweet count, favorites count are simple yet powerful signals we can aggregate to get an idea about how the hospital is performing.
— Tony Tightropes (@TonyHernandez) June 7, 2013
Normally, I'd use Python to do these kind of matching but since the course had evangelized sqlite3 for such join-s, I went that route. A minor point to note here is that case insensitive string matches in sqlite3 for the text data type need an additional "collate nocase" qualification while creating the table.
Next you want to see how many matches between the two datasets do you actually get.
So we lose out 92% of the survey data and less than 8% of the hospitals we have data for were on twitter when this list was made. These 246 hospitals are definitely more proactive than the rest of the hospitals, so I already have a biased dataset. Shaks!
Moreover, apart from the twitter handle rest of the data in the list was outdated. I needed an updated count of followers, friends, listed, retweets and favorites for these handles. A quick Twython did the trick.
While the twitter api gives direct count of the friends, followers and listed, for other attributes I had to collect all the tweets that were made by these hospitals. Additionally, it is important to get the tweets that mention these hospitals on twitter.
Collecting such historic data means using the Twitter Search API and not the live Streaming API. The search API is not only more stringent as far as the rate limits are concerned, but it is thrift in terms of how many tweets it returns. Its meant to be relevant and current instead of being exhaustive.
Props to TwitterGoggles for such a nice tweet harvesting script written in Python 3.3. It allows you to run the script as jobs with list of handles and offers a very nice schema for storing tweets, hashtags, and all relevant logs.
Before I managed to submit the assignment, on two runs of TwitterGoggles I collected 21651 tweets from and to these hospitals, 10863 hashtags, 18447 mentions, and 8780 retweets from Medicare Aided Hospitals on Twitter.
Analysis: All this while, I was running with the hope that all would some how come together to form a story at the last moment. What made things even more difficult was the survey data was all in Likert Scale - and I could not think up some hardcore data science analysis for the merged data. However, my peers were extraordinarily generous to give me 20 points with the following insightful comments with the first comment nailing it.
peer 1 → The idea is promising, but the submission is clearly incomplete. Your objective is not clear: "finding patterns" is too vague as an objective. One could try to infer your objectives from the results, but you just build the dataset an don't show nor explain how you intended to use it, not to mention any result. Although you mentioned time constraints maybe you should have considered a smaller project.
peer 2 → Very promising work, but it requires further development. It's a pity that no analysis was made.
peer 4 → I thought the project was well put together and organized. I was impressed with the use of github, amazon AWS, and google docs to share everything amongst the group. The project seems helpful to gather data from multiple sources that then can hopefully be used later to help figure out why the readmission rates are so high.
peer 6 → As a strength, this solution is well-documented and interesting. As a weakness, I would like to have seen a couple of visualizations.
While there is a lot to be done I thought a quick tableau visualization of the data might be useful. Click here for an interactive version.