Much of the housing crisis story comes down to numbers, but getting stories from data is often a messy and difficult process.
A Rice University research group is working to make this task easier. Hadley Wickham, assistant professor of statistics at Rice University, wrote last week on the “Flowing Data” blog (a great resource for datasets and tips on how to use them) that a team of his students has published 13 relevant datasets, cleaned up for easier use.
Here’s what he means by “cleaned up”:
“Data related to the housing crisis exists in large, independent, and often messy data sets. So far, we have worked with subsets as large as 10 GB. The variety and size of the data creates an obstacle for effective analysis. Our first task after locating a new data source is to make it consistent with our existing data structures. We must also screen it for correctness, completeness, and conciseness.
“To facilitate sharing data, we have conducted both data cleaning and analysis with the open source statistical software R, which is available free of charge. We’ve made both the data and programming code available to the public. We hope that by keeping the code transparent and self-replicating, others are able to easily build off our work.”
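The workflow Wickham describes — make a new source consistent with existing data structures, then screen it for correctness and completeness — is concrete enough to sketch. The team worked in R; the following is an analogous sketch in Python, with hypothetical column names and sample values standing in for the real housing data:

```python
import csv
import io

# Hypothetical raw extract: inconsistent header names, a missing value,
# and prices stored as strings with thousands separators.
raw = """Zip Code,Median Price,Year
77005,"312,000",2008
77005,,2009
77030,"289,500",2008
"""

def clean(text):
    """Normalize headers, coerce types, and drop incomplete rows --
    the 'consistent, correct, complete' screening described above."""
    rows = list(csv.DictReader(io.StringIO(text)))
    cleaned = []
    for row in rows:
        # Consistency: map field names onto a shared schema style.
        rec = {k.strip().lower().replace(" ", "_"): v for k, v in row.items()}
        # Completeness: skip records with any missing value.
        if not all(rec.values()):
            continue
        # Correctness: parse numbers, stripping thousands separators.
        rec["median_price"] = float(rec["median_price"].replace(",", ""))
        rec["year"] = int(rec["year"])
        cleaned.append(rec)
    return cleaned

print(clean(raw))  # two complete records survive; the 2009 row is dropped
```

Publishing both this kind of script and its output alongside a story is exactly the transparency the Rice team is advocating: anyone can rerun the cleaning steps and verify, or extend, the result.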
This sounds like not just a great resource for covering this particular issue but a fine model for journalists to emulate more generally: whenever you gather and clean a dataset for your own work, consider its value as a standalone spin-off resource for others.