Sunday, February 2, 2014

Joining an Agile Team

For the last four months I have been working as a team member in a
Scrum team. I was familiar with the Agile principles and practices
before joining the team, but doing Scrum hands-on has been an
eye-opener. Having said that, rarely a day has passed when I have not
experienced impostor syndrome. I started imagining crazy scenarios
which would always end in someone saying "What? You don't know what Y
means? Who hired this guy?". All my hopes were hanging onto the "new
guy" card - but then that has its shelf-life.

After talking to some great Scrum coaches, Scrum masters and
experienced programmers, I have collected some tips that can be
applied to reduce the self-loathing when you are the new guy in an
agile team.

Individuals and Interactions

What is at the end of a process decision, a bug, an obfuscated code commit or a failing test? It is a human being. If you really want to understand why something is the way it is, you have to connect and communicate. While there are many ways to connect these days, a face-to-face introduction without a burning requirement seems to make future communication much easier. This means trying to be the first person to give.

I was very excited to see the Python wizardry at my workplace and asked my manager whether we could host the Chicago Python user group. The proposal was met with great enthusiasm and we are hosting the February ChiPy meetup.

We are better when we are connected. So don't avoid workplace White Elephant parties or potlucks.

Offer help via pairing

In an alien code base, with little domain knowledge, even if you are an
algo-wiz-design-pattern-guru, you'll not be able to check in bug-free
code. The in-house frameworks (you have to be very lucky to have good documentation accompanying them), coding standards and testing practices
will add to the learning curve. My Scrum master realized this early,
and offered a highly effective solution from Extreme Programming.

Offer to pair with a team member on a particular story even if you
have no clue about it.  Unfortunately neither you, nor the person you
are pairing with, will anticipate that you are going to slow
him/her down. So before you start working together, it's better to make
sure he/she understands how familiar you are with the language,
module, frameworks and the business problem. If possible try to read
the tests before you start pairing. Get in the driver's seat.

Effective pair programming is HARD, especially if you have been playing solo for a
long time. You'll feel like you are sharing a steering wheel while driving a
car and being forced to drive in the slowest lane.  But as Uncle Bob
would say, you can only build good software by going slow.

Keep track of your progress 

This is something I picked up after I started working out (love
my Fitbit and Nike+). Peter Drucker said "if you can't measure it, you
can not manage it". Some simple metrics that you'd find immediately usable:
- # of code commits 
- # of code reviews done 
- increase in test coverage
- # support tickets closed 
- helping others on irc, (or mailing lists)

Note taking

A Moleskine notebook and a pen are your best friends when you are the new
guy.  Personally, I am unsatisfied with only digital or only paper and use a combination of both. Additionally, have a few white sheets at your desk that
people can scribble on to explain stuff when they are at your desk.

Identify one area that needs love 

Unless you are playing with the Beatles, every team has an area that
needs some love. You'll get to learn about them during your Sprint
retrospective, which is Scrum's way of preventing broken windows.
Make an effort to develop an expertise in those areas, and try to help
the team get more productive. A good place to start is test coverage
and writing acceptance tests.

Those are some of the tips I have received in the last few
months. They are nothing super specific to Scrum or Agile for that
matter, but helpful in an extremely dynamic environment. In the end it
is a lot of common sense and desire to help your team mates. What do
you think?

Friday, June 28, 2013

Twitter Hospital Compare

While working on Coursera's Introduction to Data Science course project, a few folks on the discussion forum started exploring the possibilities of performing some Twitter data analysis for healthcare. There were a number of thought-provoking discussions on what insights about healthcare can be mined from Twitter, and I was reminded of a data set I had seen earlier.

Last Fall there was another Coursera course, Computing for Data Analysis, by Roger Peng, that I was auditing. One of its assignments required doing some statistical analysis on Medicare Aided hospitals. These hospitals have an alarming national readmission rate (19%), with nearly 2 million Medicare beneficiaries getting readmitted within 30 days of release each year, costing $17.5 billion. It is not completely understood how to reduce the readmission rates, as even highly ranked hospitals of the country have not been able to bring their rates down.

Research Questions:
I agree they are all very rudimentary, but my understanding about this country's complicated medical system is very limited. I know how to code, and take little steps at a time. Or so I thought.

Data description:
Composite Topics
While the Survey file contains survey data on 4606 hospitals across the country, after cleaning up missing values, "insufficient data", "errors in data collection" the number of hospitals was down to 3573.
That settles the structured data. Let's talk about unstructured data. Consider a tweet from a user @TonyHernandez whose nephew recently had successful brain surgery at @Florida Hospital. Yes this one.
Normally, I'd use Python to do this kind of matching, but since the course had evangelized sqlite3 for such joins, I went that route. A minor point to note here is that case-insensitive string matches in sqlite3 for the text data type need an additional "collate nocase" qualification while creating the table.
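To make the "collate nocase" point concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table names, column names and Twitter handles are made up for illustration and are not the ones from the actual project; only the collation qualifier is the point.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# COLLATE NOCASE on the column makes '=' comparisons ignore ASCII case.
conn.execute("CREATE TABLE survey (hospital_name TEXT COLLATE NOCASE)")
conn.execute("CREATE TABLE twitter (hospital_name TEXT COLLATE NOCASE, handle TEXT)")

# Hypothetical rows: the survey file shouts in upper case, the curated list does not.
conn.executemany("INSERT INTO survey VALUES (?)",
                 [("FLORIDA HOSPITAL",), ("EXAMPLE GENERAL HOSPITAL",)])
conn.executemany("INSERT INTO twitter VALUES (?, ?)",
                 [("Florida Hospital", "@handle_one"),
                  ("Another Clinic", "@handle_two")])

# The join succeeds despite the difference in case.
matches = conn.execute(
    "SELECT s.hospital_name, t.handle "
    "FROM survey s JOIN twitter t ON s.hospital_name = t.hospital_name"
).fetchall()
print(matches)
```

Without the qualifier the same join would return no rows for this data, because the default BINARY collation compares the strings byte for byte.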
Next you want to see how many matches you actually get between the two datasets.
Moreover, apart from the Twitter handle, the rest of the data in the list was outdated. I needed an updated count of followers, friends, listed, retweets and favorites for these handles. A quick Twython script did the trick.
Props to TwitterGoggles for such a nice tweet harvesting script written in Python 3.3. It allows you to run the script as jobs with a list of handles and offers a very nice schema for storing tweets, hashtags, and all relevant logs.
Before I managed to submit the assignment, on two runs of TwitterGoggles I collected 21651 tweets from and to these hospitals, 10863 hashtags, 18447 mentions, and 8780 retweets from Medicare Aided Hospitals on Twitter.
Analysis: All this while, I was running with the hope that everything would somehow come together to form a story at the last moment. What made things even more difficult was that the survey data was all on a Likert scale - and I could not think up any hardcore data science analysis for the merged data. However, my peers were extraordinarily generous in giving me 20 points with the following insightful comments, with the first comment nailing it.
peer 1 → The idea is promising, but the submission is clearly incomplete. Your objective is not clear: "finding patterns" is too vague as an objective. One could try to infer your objectives from the results, but you just build the dataset an don't show nor explain how you intended to use it, not to mention any result. Although you mentioned time constraints maybe you should have considered a smaller project.
peer 2 → Very promising work, but it requires further development. It's a pity that no analysis was made.
While there is a lot to be done, I thought a quick Tableau visualization of the data might be useful. Click here for an interactive version.

Among the various data sets available from HCAHPS, this one contains feedback about the hospitals obtained by surveying actual patients. I thought it would be interesting to study how patients and hospitals interact on twitter. 

Why do some hospitals have more followers, more favorited tweets, or more retweets? Is it because of the quality of the care they provide? Is the number of Twitter followers of a hospital affected by how the nurses and doctors communicate with their patients? Do patients feel good (sentiment analysis) when hospitals provide a clean, quiet environment and offer immediate help on request? Would proper discharge information help hospitals get more Twitter love?

The Survey of PatientsHospital Experiences HCAHPS.csv (hereafter referred to as the "survey") contains the following fields:

Nurse Communication
Doctor Communication
Responsiveness of Hospital Staff
Pain Management
Communication About Medicines
Discharge Information
Individual Items
Cleanliness of Hospital Environment
Quietness of Hospital Environment
Global Items
Overall Rating of Hospital
Willingness to Recommend Hospital

This tells us how a particular patient feels about the care he (or his nephew) received at the hospital. The sentiment of the tweet text, the hashtags, the retweet count and the favorites count are simple yet powerful signals we can aggregate to get an idea about how the hospital is performing.

Next I got a list of hospitals that were on Twitter ... thanks to the lovely folks who hand curated it. It was nicely html-ed, making it easy to scrape into a Google Doc with one line of ImportXML("", "//tr"). Unfortunately, the number of hospitals on Twitter according to this list (779) is far smaller than the total number of hospitals. But it is still a lot of human work to match the 3573 x 779 hospital names.

So we lose out on 92% of the survey data, and less than 8% of the hospitals we have data for were on Twitter when this list was made. These 246 hospitals are definitely more proactive than the rest of the hospitals, so I already have a biased dataset. Shaks!

While the twitter api gives direct count of the friends, followers and listed, for other attributes I had to collect all the tweets that were made by these hospitals. Additionally, it is important to get the tweets that mention these hospitals on twitter. 

Collecting such historic data means using the Twitter Search API and not the live Streaming API. The Search API is not only more stringent as far as the rate limits are concerned, but it is also thrifty in terms of how many tweets it returns. It's meant to be relevant and current instead of being exhaustive.

peer 4 → I thought the project was well put together and organized. I was impressed with the use of github, amazon AWS, and google docs to share everything amongst the group. The project seems helpful to gather data from multiple sources that then can hopefully be used later to help figure out why the readmission rates are so high.

peer 6 → As a strength, this solution is well-documented and interesting. As a weakness, I would like to have seen a couple of visualizations.

It appears that hospitals on the East Coast are far more active on Twitter than those on the West Coast. The data is here as a csv and the google doc spreadsheet.

Monday, June 24, 2013

Agony of Encoding on Python, MySql on Windows

So much has been written about Unicode and Python. But Unipain is the best. Although its ugly head surfaces at times, I've somehow got around unicode and python 2.7 problems and never given it the respect it deserves. But a few months ago on a Sunday morning, I found myself in deep Unipain. This is an attempt to recall how I got out of that mess.

My exploration in program source code analysis generally involves munging text files all day. Up until now, in most projects, these have been text files with ASCII strings. Most of them came from open source projects with the code being written by developers who speak English. However, while working on Entrancer, we found that the dataset that comes with TraceLab (a platform for software traceability research) contained source code from Italian developers. All my Python 2.7 scripts exception-ed miserably when they tried to chew on those files.

An example of one of the input files is here. What confused me more was that all of the *nix tools (sort, uniq etc) I had to access through Cygwin were happily operating on these input files, but the file utility appeared confused about the encoding.

After a bit of random googling, I found Chardet by David Cramer which guesses the encoding of text files.

So no help there. Why would Italian text be encoded in the Central European character set? RTFM-ing the codecs docs didn't lead anywhere. Soon I had drifted to reading hackernews.

Ok. Luckily the Internet has made this - Character Encoding Recommendation for Languages. I tried all variants like 8859-1, 8859-3, 8859-9, and 8859-15 and all had similar reactions. Thankfully, Jeff Hinrichs on the Chicago Python Mailing list pointed out "If it is in fact looking like 8859-1 then you should be using cp-1252, that is what HTML5 does". According to Wikipedia "This encoding is a superset of ISO 8859-1, but differs from the IANA's ISO-8859-1 by using displayable characters rather than control characters in the 80 to 9F (hex) range."
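The difference in the 80 to 9F range is easy to see from a REPL. The sketch below is written for Python 3 and uses a made-up byte string, not one from the TraceLab dataset. The helper first_clean_decode is a hypothetical name for illustration; it simply tries candidate encodings in order and is no substitute for a proper detector.

```python
# Made-up bytes: accented letters plus 0x93/0x94, which sit in the disputed range.
raw = b"perch\xe9 \x93citt\xe0\x94"

as_latin1 = raw.decode("latin-1")  # 0x93/0x94 become invisible C1 control characters
as_cp1252 = raw.decode("cp1252")   # 0x93/0x94 become curly quotation marks

print(repr(as_latin1))
print(repr(as_cp1252))


def first_clean_decode(data, candidates=("utf-8", "cp1252", "latin-1")):
    """Return (encoding, text) for the first candidate that decodes without error."""
    for name in candidates:
        try:
            return name, data.decode(name)
        except UnicodeDecodeError:
            continue
    raise ValueError("no candidate encoding fits")


print(first_clean_decode(raw)[0])
```

Note that latin-1 never raises, since every byte maps to some code point, so it only makes sense as the last candidate in such a list.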

While this worked for the file of a particular dataset, soon enough another file started biting at the scripts with its encoding fangs. And at this point you find yourself asking Why does Python print unicode characters when the default encoding is ASCII? and your REPL hurls a joke at you!

You are heartbroken at your failure to appreciate the arcane reasons for choosing the file system encoding as UTF-8 and leaving the default string encoding as ascii. You try coaxing Python by telling your dot profiles to use UTF-8, export PYTHONIOENCODING=UTF-8 ... but Python doesn't care!

By almost noon, you realize it's time to let the inner purist go and let the whatever-works-ugly-hacker take over.

I vim /path/to/ +491-ed and changed the goddamned "ascii" to "utf-8" in the file. In your heart you know this is the least elegant way of solving the problem, as it would break dictionary hashes, and this code should never be allowed to talk to other python systems expecting a default ascii encoding. But it's too easy to revert back. If you are interested, this is the /path/to/Python27/Lib/ file in your system. Read more on why this is a bad idea.

And lo and behold! All problems solved. But a safer way to do this might be to beg Python directly at the bangline of your script as described here.
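Another option that leaves the default encoding alone, and which is not necessarily what the linked post describes, is to decode and encode explicitly at the file boundary. The sketch below assumes a source file known to be cp1252; the file names and the helpers read_text and write_text are hypothetical. It is written to run on Python 3, and io.open is available on Python 2.7 as well.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile


def read_text(path, encoding):
    """Decode once, at the file boundary, instead of relying on the default encoding."""
    with io.open(path, "r", encoding=encoding) as handle:
        return handle.read()


def write_text(path, text, encoding="utf-8"):
    """Encode once, on the way out."""
    with io.open(path, "w", encoding=encoding) as handle:
        handle.write(text)


# Round trip with a hypothetical cp1252-encoded source file.
workdir = tempfile.mkdtemp()
source_path = os.path.join(workdir, "Esempio.java")
with open(source_path, "wb") as handle:
    handle.write(u"// citt\u00e0 e perch\u00e9\n".encode("cp1252"))

text = read_text(source_path, "cp1252")
utf8_path = os.path.join(workdir, "Esempio.utf8.java")
write_text(utf8_path, text)
print(repr(text))
```

Everything between the read and the write then works on unicode strings, so the default encoding never gets a chance to bite.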

With Python encoding out of the way, it was MySql's turn to come biting. We needed Wordnet for the Italian Language for our project and it uses MySql for storing the data. Though you have to get approval before using it, it's free and the guys maintaining it are super helpful.

While importing the data the first ouch was the following:

Well clearly it doesn't understand the difference between acute and grave accent marks. Luckily MySql Workbench is verbose enough to tell you where it is getting things wrong:

This stackoverflow post says that you have to do an ALTER schema - in MySQL Workbench you can right click on the schema and find it on the menu. It drops you in front of a drop down to change the default encoding while importing.
But it was back to square one again: how do I know the encoding of these strings embedded in the sql statements? Maybe Chardet knows?

Nice. Next all you need is to find out what you should select to enable this charset - and luckily it's in the official docs. Turns out, I needed latin2.

But unfortunately this did not change the auto-generated import sql statement that the MySql Workbench was using. It was still using --default-character-set=utf8.

Forget GUI! Back to command line. Under Plugins in MySql Workbench you'll find "Start shell for MySQL Utilities" and you'll be dropped to a shell where you can issue the above command with the password flag like this:

Note the error message saying it could not open the default file due to lack of file permissions, but that did not stop it from importing the data properly. Ok! works for me ;)

Finally everything tertiary was working. That means it was time to go back to writing the actual code!

Friday, May 11, 2012

The Third Meetup

Last Tuesday was our third meetup for Chuck Eesley's class. Instead of the Michigan Street Starbucks opposite The Chicago Tribune, we pivoted to the Wormhole for this one. Any geek who has been to this place knows what a riot it is. From the "Back to the Future" time machine retrofitted on the ceiling, old Atari cartridges as showpieces on the coffee table, super typo-friendly wifi password, stopwatch-controlled brewing, Star Wars puppets, app on iPad instead of cashbox - the bearded coffee masters had it all. Everything except a place to accommodate the thunderous 8 of Lake Effect Ventures. But of course, we have Benn Bennett - son of a lawyer and a linguist by profession. We witnessed the art of "coaxing people out of their couch" - 2 minutes later the entire team had the best possible seating arrangement.

Two hours of caffeine-drenched brainstorming spat out the following:

  1. I sketched out how the process might flow in two steps.  We are down to a pretty bare minimum concept build which is ideal both for this class and for getting something up quickly so that we can test it.
  2. I set up a Twitter account for Lake Effect Ventures so that we can tweet about progress we are making.
  3. Andy is going to jot up a positioning statement and beef up the business model canvas for the concept
  4. Leandre will use these to complete our 2-slide initial submission for our deliverable for the next deadline
  5. Leandre will also use this to start to craft a presentation deck
  6. Benn will be working on the copy for the landing page that I started.
  7. Benn will also be crafting a logo in Photoshop (Alex, Zak, Sidi if any one of you is good with design Benn would appreciate the assistance there)
  8. We need to think of a name for the concept as well
We think it is a bit premature to start on the user stories right now given that we have a good idea of what we are gonna build. Charles and I are gonna start on that and look to have something complete from a Version 1.0 standpoint by mid next week barring any setbacks. We will look to craft the user stories once we complete the MVP and use them as structure for testing features and functionality (Zak stay tuned on this)
Benn and Andy will also be working on putting together a more formal customer survey so that we structure the interviews we are having and start to compile meaningful data which we will need going forward.
It's getting exciting ….

Saturday, April 14, 2012

Advice: John Doerr on working in teams

Incredible Networking:  Collect names and emails of all the folks you meet. Be very careful about who your friends are and keep in touch - after all you become the average of the five people you spend your time with. Call them up - it's incredible what people will tell you over the phone. (This is something where I have always fallen short - I can hardly get beyond emails).

Carry Chessick, the founder and last CEO of once told me after his lecture session at UIC, that networking as it is perceived is worthless. When you meet people, make sure you finish off by saying "If I can be of any help to you, please do not hesitate to get in touch". That's the only way that business card will actually fetch you some benefit. I met a sales guy from, some time back at Chicago Urban Geeks drink ... who sent out a mail immediately after the introduction from his phone with a one line saying who he was, where we met, and that he'll keep an eye on tech internship notices for me. Brilliant.

360-s: If you want to find information about some company, of course you Google. So let's say you are gathering info about Google: you'll also want to talk to their competitors Yahoo, Bing ... and find out what they are thinking. Then you triangulate all that information to get in a good position.

Coaching: Make sure there is someone who will consistently give you advice on what's going on in your workplace.

Mentoring: Having a very trusted person outside your work who can give advice is invaluable.

Time buddy: How do you make sure that you are doing good time management? Get a time buddy, compare your calendars on how you are spending time. Bill Gates does this with Steve Ballmer.

Another interesting practice I read about some time back on Hackernews is communicating with team members in two short points at regular intervals:
(1) What I did last week/day:
(2) What I'll do next week/day:

As my dear friend Guru Devanla would put it "It's all about setting expectations ... and meeting them"!

Monday, August 15, 2011

Data loss protection for source code

Scopes of Data loss in SDLC
In a post-Wikileaks age, software engineering companies should probably start sniffing their development artifacts to protect the customer's interest. From the requirement analysis document to the source code and beyond, different software artifacts contain information that the clients will consider sensitive. The traditional development process has multiple points for potential data loss - external testing agencies, other software vendors, consulting agencies etc. Most software companies have security experts and/or business analysts redacting sensitive information from documents written in natural language. Source code is a bit different though.

A lot of companies do have people looking into the source code for trademark infringements, copyright statements that do not adhere to established patterns, and checking if previous copyright/credits are maintained, when applicable. Blackduck or Coverity are nice tools to help you with that.

Ambitious goal

I am trying to do a study on data loss protection in source code - sensitive information and/or quasi-identifiers that might have seeped into the code in the form of comments, variable names etc. The ambitious goal is to detect such leaks and automatically sanitize (probably replace all is enough) such source code while retaining code comprehensibility.

To formulate a convincing case study with motivating examples I need to mine a considerable code base and requirement specifications. But no software company would actually give you access to such artifacts. Moreover, the (academic) people who would evaluate the study are also expected to lack such facilities for reproducibility. So we turn towards Free/Open Source software. Github, Bitbucket, Google Code - huge archives of robust software written by the sharpest minds all over the globe. However there are two significant issues with using FOSS for such a study.

Sensitive information in FOSS code?

Firstly, what can be confidential in open source code? The majority of FOSS projects develop and thrive outside the corporate firewalls without the need for hiding anything. So we might be looking for the needle in the wrong haystack. However, if we are able to define WHAT sensitive information is, we can probably get around that.

There are commercial products like Identity Finder that detect information like Social Security Numbers (SSNs), Credit/Debit Card Information (CCNs), Bank Account Information, any Custom Pattern or Sensitive Data in documents. Some more regex foo should be good enough for detecting all such stuff ...

# Search every file in $SRC_DIR for each term listed in sensitive_terms_list.txt
while read -r term; do
        grep -EHn --color=always -- "$term" "$SRC_DIR"/*
done < sensitive_terms_list.txt
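For pattern-shaped data such as SSNs and card numbers, a term list is not enough. Here is a minimal Python sketch of the "regex foo" idea: one pattern for SSN-like strings, one for card-like digit runs, and a Luhn checksum to weed out long numbers that merely look like cards. The patterns, function names and sample text are illustrative assumptions, not what Identity Finder or any other product does.

```python
import re

SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
# 13 to 19 digits, optionally separated by single spaces or hyphens.
CARD_PATTERN = re.compile(r"\b(?:\d[ -]?){12,18}\d\b")


def luhn_ok(digits):
    """Luhn checksum: double every second digit from the right, sum, check mod 10."""
    total = 0
    for position, char in enumerate(reversed(digits)):
        value = int(char)
        if position % 2 == 1:
            value *= 2
            if value > 9:
                value -= 9
        total += value
    return total % 10 == 0


def find_sensitive(source_text):
    """Return (line number, kind, matched text) for every suspicious string."""
    findings = []
    for line_number, line in enumerate(source_text.splitlines(), start=1):
        for match in SSN_PATTERN.finditer(line):
            findings.append((line_number, "ssn", match.group()))
        for match in CARD_PATTERN.finditer(line):
            digits = re.sub(r"\D", "", match.group())
            if luhn_ok(digits):
                findings.append((line_number, "card", match.group()))
    return findings


# Placeholder values, chosen only to exercise the patterns.
sample = '''int retries = 3;
// test customer: SSN 078-05-1120
String card = "4111 1111 1111 1111"; // remove before release
order_id = 1234567890123456;
'''

for finding in find_sensitive(sample):
    print(finding)
```

The last sample line is a 16-digit number that matches the card pattern but fails the checksum, which is the kind of false positive a bare regex would report.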

Documentation in FOSS

Secondly, the 'release early, release often' bits of FOSS make a structured software development model somewhat redundant. Who would want to write requirements docs, design docs when you just want to scratch the itch? The nearest in terms of design or, specification documentation would be projects which have adopted the Agile model (or, Scrum, say) of development. In other words, a model that mandates extensive requirements documentation be drawn up in the form of user stories and their ilk. being a trivial example.

Still Looking
What are some of the famous Free/Open Source projects that have considerable documentation closely resembling a traditional development model (or models accepted in closed source development)? I plan to build a catalog of such software projects so that it can serve as a reference for similar work that involves traceability in source code and requirements.

Possible places to look into: (WIP)
* Repositories mentioned above

Would sincerely appreciate if you leave your thoughts, comments, poison fangs in the comments section ... :)

Monday, August 8, 2011

Hacking the newsroom

[This is part 2 of the final pitch, which talks about the newsroom and business perspective. Part 1, detailing the newsreader perspective is here.]

Before anything else, there must be a 90-second theatrical promo:

Stop laughing at my amateurish video editing! This is my first ever ... even Bergman, Godard, Fellini started somewhere to be great! Jokes apart here's what REVEAL actually is all about:

Let's consider a hypothetical newsroom which uses REVEAL. A journalist gets hold of a huge collection of classified documents that contains potentially sensitive information. Instead of painstakingly reading each line and jumping back to Google to search for relevant information - she uploads them to REVEAL and hits the pantry for her coffee. REVEAL goes to work and automatically parses out names of people, places, organizations etc. Using the names it detected, REVEAL affixes thumbnail images with the mappings of the named entities with the documents. The journalist now sits back, sips the coffee and flips through the images looking for someone/something/some place that's interesting and jumps directly to the document when she finds her target.

But that's not all. In order to make life much easier for the journalist - REVEAL uses the names and keywords from the document to aggregate semantically related content from the net - images, video, news, blog, wiki articles - using open apis. By making the background context readily available, it allows the journalist to focus solely on her analysis of the story.
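The mapping from detected names back to documents can be pictured as a small inverted index. The sketch below is only an illustration of that data structure and not REVEAL's implementation: a crude regular expression (runs of two or more capitalized words) stands in for a real named-entity recognizer, and the document ids and texts are invented.

```python
import re
from collections import defaultdict

# Naive stand-in for a real named-entity recognizer:
# runs of two or more capitalized words.
CANDIDATE = re.compile(r"\b[A-Z][a-z]+(?: [A-Z][a-z]+)+\b")


def build_entity_index(documents):
    """Map each candidate name to the sorted ids of the documents that mention it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for name in CANDIDATE.findall(text):
            index[name].add(doc_id)
    return {name: sorted(ids) for name, ids in index.items()}


# Invented documents for illustration.
documents = {
    "cable-001": "The envoy met Jane Example in New Exampleton on Monday.",
    "cable-002": "Jane Example declined to comment on the visit.",
}

index = build_entity_index(documents)
for name, doc_ids in sorted(index.items()):
    print(name, doc_ids)
```

With such an index, flipping through thumbnails amounts to iterating over the keys, and jumping to a document is a lookup of the ids stored under the chosen name.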

What follows is an over the top ambitious plan for making lots of money - I mean the business plan.

Unearthing named entities involves doing tonnes of computationally intensive text analysis, and for any sizable dataset we need a cloud based solution. While REVEAL will always be Free and Open Source Software, the business proposition is offering it as a service. Be it a startup or a news corp, whoever deploys REVEAL at their site can offer it as a service to other news agencies/organizations based on a pay-by-usage model. Different packages can be offered based on when they want to share the information dug out from their documents.

Nothing like REVEAL exists today. The cohesive bond of unknown information on well known personalities and organizations, original content (the documents), expert opinion (journalist's view), user generated content (comments) and aggregated content will make REVEAL a dream product for generating ad revenues. Features for lead generation are built into the system, and the karma-points-based reader appreciation along with the 360 degree view of the world will ensure persistent traffic.

Now get me to Berlin Hackathon!
(398 words)

Most common names detected in Wikileaks cablegate files

Link to an incomplete implementation