I went to csv.conf, a data science conference centered on journalism, government, open source, and more. At the event, I got to talk to so many people, including librarians, product managers, and software engineers. It was really invigorating and refreshing! Data science is truly a multi-disciplinary unicorn. Here I briefly describe the talks I went to and what I thought about them.
You can see the list of talks here. They’re supposed to release the recordings soon. Since it’s probably easier to learn about the technological aspects of data science elsewhere online, and there are plenty of blog posts about that, I decided to spend more time hearing from data scientists who work in non-tech fields. Here are the talks I attended:
- Smelly London
- Our Cities, Our Data
- Open Contracting Data in Mexico City
- Keynote by Heather Joseph
- Data and Abuse of Power
- Data Lovers in a Dangerous Time
- How to mislead the public
- How do I get that? Strategies for requesting data and negotiating for it
- DDF: Gapminder’s data model for collaborative harmonization of multidimensional statistics
Smelly London
The project is about text-mining smell data from historical records and using visualization and other analysis to navigate the “smellscape” of London in the 19th and 20th centuries. At first, I thought the word “smell” was used as a symbol for something else. But no, it really is about the smells of London as recorded by physicians.
The speaker herself studies medical history. But why smell, I asked. Apparently, physicians in the past believed that disease was caused by smells, until germ theory became widely accepted. That’s why many medical records associate smells with different diseases.
Our Cities, Our Data
This is the start of a recurrent theme in the conference: data science is key to good policy-making.
The speaker is a passionate data scientist. She started doing data science as a hobby and eventually became a full-time consultant and advocate for open data. Why should public policy be built on carefully curated data? For example, an agency made an app that lets people report accidents involving pedestrians, in order to improve road safety. They found that Capitol Hill in Portland was the most dangerous area for pedestrians. However, anyone who has been to Capitol Hill can tell you that it’s a safe haven for walking. So what happened here?
Using a mobile app to crowd-source statistics can often lead to a biased dataset, since the data is collected only from people who care about the issue and have a mobile phone. Obviously, that leaves out a large part of the population.
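To make the bias concrete, here is a toy sketch in Python. Every number in it is made up for illustration and does not come from the talk: the neighborhood labels, the incident counts, and the reporting rates are all assumptions. It only shows how the area with the most app reports can differ from the area with the most actual incidents when reporting rates are uneven.

```python
# Hypothetical numbers, for illustration only (not from the talk).
# Each neighborhood has a true number of pedestrian incidents and a
# probability that any given incident gets reported through the app.
neighborhoods = {
    "walkable, app-savvy area": {"true_incidents": 50, "report_rate": 0.8},
    "car-heavy, low-adoption area": {"true_incidents": 200, "report_rate": 0.1},
}


def expected_reports(stats):
    """Expected number of app reports for one neighborhood."""
    return stats["true_incidents"] * stats["report_rate"]


# What the app "sees" versus what actually happens on the street.
most_reported = max(neighborhoods, key=lambda n: expected_reports(neighborhoods[n]))
most_dangerous = max(neighborhoods, key=lambda n: neighborhoods[n]["true_incidents"])

print("Most reports:  ", most_reported)
print("Most incidents:", most_dangerous)
```

With these assumed rates, the app ranks the walkable area first even though it has a quarter of the incidents, which is the kind of distortion the Capitol Hill story describes.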
Open data can lead to better accountability. Making the data available lets the public (a third party) analyze it and see how public funds are being spent, which neighborhoods are getting more money, and how gentrification shows up in housing prices. These are just a few examples.
Though it’s good to have more open data, it’s also important to have “good-quality” open data. Open data is nice, but a lot of the time the data itself is not analysis-ready. For the data science folks out there, imagine your data is in a PDF or some other bad format, with many duplicated values. Oh man…
Open Contracting Data in Mexico City
This is about how open data can help with accountability. In this case, the local government releases contracting data based on the Open Contracting Data Standard, so that people know what’s going on at every stage of a contract (planning, tender, award, contract, implementation).
Personally, I see the value in this given my experience in Montreal. Around early 2010, my impression of Montreal was of a place full of never-ending, seemingly inefficient construction. People later found out that there was massive corruption involving construction companies, the mafia, and some officials.
Keynote by Heather Joseph
Heather is a director of SPARC, and she’s part of the movement to make research and data open and accessible to the public. After all, it’s taxpayers’ money that funds scientists’ research. It’s not fair that we need to pay to see the papers they publish. Things are getting better, at different speeds for different fields of science.
Things were also getting better at the government level. For example, we have the Freedom of Information Act, a federal law passed in 1966, which states that anyone has a right to access federal agency records. Obama also signed an executive order making open and machine-readable the new default for government information.
Obviously, we all know that the current administration under Trump is not supportive of open data. The phrase “open government” was removed from the White House website, the EPA was defunded, and many public datasets have become inaccessible. We now have to work through fake news and 404 errors from removed pages. The data-driven and fact-based operating model of the government is seriously threatened.
Data and Abuse of Power
While many people I know are familiar with analyzing data and with the importance of open data, I only learned about the importance of vetting your data recently, from this talk. Blindly trusting a dataset is just as bad as taking data from a source without vetting it. The speaker listed five red flags for bias.
- “We don’t collect it.”
What an organization doesn’t collect says a lot about the organization itself. For example, collecting data on police shootings was not a requirement until the Black Lives Matter movement.
- “We don’t collect it and you cannot either.”
For example, the CDC was collecting data about gun ownership as part of its homicide research. However, legislation was passed to cut funding to the CDC and prevent it from studying gun violence. Moreover, the NFL donated $30 million to the National Institutes of Health (NIH) for research on football and brain disease, and potentially tried to influence concussion research.
- “We collected it but you can’t have it.”
For example, open.whitehouse.gov was set up by the Obama administration, but all the data is gone from the website, which now shows “no result”.
- “We frame what we share.”
This often shows in how the data is aggregated, which makes it hard to understand how the numbers were collected. For example, giving the total number of drone strikes without saying which strikes were for what purpose or in which region. Or the Trump administration showing the crime rate going up recently, when it has actually been going down consistently if you look at the entire decade.
- The data is biased.
Basically, the entire dataset is made to promote a bias. For example, President Trump signed an executive order in January that includes publishing a weekly list of crimes committed by immigrants. It’s motivated by the idea that immigrants commit more crimes, without specifying what types of crimes or how the crime rate compares to that of other Americans. It’s designed to convince the public that immigrants are criminals.
I learned a lot from this talk. Not only can data be a powerful tool for the public to ensure accountability and sound policies, but when manipulated, it is also very dangerous. We cannot just blindly trust any data (this is another theme of the conference).
Data Lovers in a Dangerous Time
The speaker was a lively person who is part of DataRefuge. He and the group have organized several meetups where people (librarians, anthropologists, developers, data scientists, other scientists) work together to save government data before it’s taken down or lost forever, all in the name of preserving knowledge and information.
He talked about the challenges of doing so. For example, determining how and where to store the data, finding things that cannot be found with a web scraper, and ensuring that the data’s authenticity is preserved. This reminds me of an O’Reilly Data podcast - an interview with Maxwell Ogden, on saving the world’s scientific and government data. If you’re interested in this topic, definitely check out these guys and the podcast.
How to mislead the public
Anyone doing research understands that designing a study well is very important and that we should try our best to avoid biases. However, when people don’t understand statistics well enough or are too entrenched in their ideas, biases can really affect the results of their study.
One example: the Gates Foundation spent millions of dollars funding the small school movement, which aimed to make high school classes smaller. It was based on a study finding that smaller classrooms lead to higher-achieving students. There are several hypotheses as to why this effort failed. For Philipp, it was because small classrooms are overrepresented among high-performing high schools. So when the researchers thought they had observed P(high performing | small classroom) > P(high performing | big classroom), what they had actually observed was P(small classroom | high performing) > P(big classroom | high performing).
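The two inequalities are easy to mix up, so here is a small Python sketch with made-up counts. These are my own illustrative numbers, not the data behind the Gates Foundation study. It shows that small classrooms can make up most of the high performers simply because there are more of them, while a given small classroom is still less likely to be high performing than a big one.

```python
# Made-up counts, for illustration only (not the actual study data).
counts = {
    "small": {"total": 1000, "high_performing": 100},
    "big": {"total": 200, "high_performing": 40},
}

# Total number of high-performing classrooms across both sizes.
all_high = sum(c["high_performing"] for c in counts.values())


def p_high_given(size):
    """P(high performing | classroom size)."""
    return counts[size]["high_performing"] / counts[size]["total"]


def p_size_given_high(size):
    """P(classroom size | high performing)."""
    return counts[size]["high_performing"] / all_high


for size in counts:
    print(
        f"P(high | {size}) = {p_high_given(size):.2f}   "
        f"P({size} | high) = {p_size_given_high(size):.2f}"
    )
```

In this toy setup, small classrooms account for the majority of high performers, yet the chance that a small classroom is high performing is half that of a big one.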
Another example is Simpson’s paradox. In 1973, UC Berkeley was accused of gender discrimination in graduate school admissions. The overall admission figures showed that men were more likely to be admitted than women.
However, when admissions were examined department by department, most departments turned out to be slightly biased in favor of women. The overall gap arose because women tended to apply to more competitive programs with lower admission rates, while men tended to apply to programs with higher admission rates.
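Here is a minimal Python sketch of how that reversal can happen. The two departments and all applicant counts are hypothetical, chosen by me to make the effect obvious. They are not the real 1973 Berkeley figures.

```python
# Hypothetical two-department example (not the real 1973 Berkeley figures).
# Each entry is (applicants, admitted).
applications = {
    "less competitive dept": {"men": (800, 480), "women": (100, 65)},
    "more competitive dept": {"men": (200, 40), "women": (900, 225)},
}


def rate(applied, admitted):
    """Admission rate as a fraction between 0 and 1."""
    return admitted / applied


def overall_rate(group):
    """Admission rate for a group, pooled across all departments."""
    applied = sum(dept[group][0] for dept in applications.values())
    admitted = sum(dept[group][1] for dept in applications.values())
    return rate(applied, admitted)


for name, dept in applications.items():
    print(f"{name}: men {rate(*dept['men']):.0%}, women {rate(*dept['women']):.0%}")
print(f"overall: men {overall_rate('men'):.0%}, women {overall_rate('women'):.0%}")
```

Within each department women are admitted at a higher rate, but because most women in this example apply to the more competitive department, the pooled rate for women ends up lower than the pooled rate for men.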
Here are other examples:
- Why you can’t get a date on a Saturday night and why most suicide bombers are Muslims
- Engineers have more sons, nurses have more daughters
Looking at the data for too long is dangerous if you don’t step back and think about it. It’s easy to confirm what we think is true. Keep that in mind whenever you conduct a study, analyze your data, or even read a paper from a good journal.
How do I get that? Strategies for requesting data and negotiating for it
Say you want to get a dataset from the government. How do you do that? How do you know what they have? What if they say no? This talk is like a mix of negotiation and data science.
- How do I know what exists?
Ideally, there should be some inventory of databases. Citations in papers, proposals, reports, and audits are also good places to start. The metadata will tell you what’s in the dataset.
- How to write a data request
It’s best to use formal language (the language of the law), e.g. “I request, under the Oregon public records law, the following records…” Sometimes it helps to mention which fields (e.g. numbers and addresses) you are willing to give up in case they are concerned about privacy issues. You also want to be specific about the format you want (e.g. “please provide machine readable format …”) and to ask for any relevant documentation (e.g. “… and all documentation that describes the tables and fields included in this request, as well as the definitions of any codes used.”). Of course, provide contact information (e.g. “Please do not hesitate to contact me if you have questions.”).
- What obstacles should I anticipate
Here are some of the reasons/excuses they can give when they refuse to give you the data:
- I’m not obligated to create a record for you
  - Response: but you’re required to provide your data
- This data contains personally identifiable information
  - Response: see whether you can still do your study without these fields
- This is far too much work
  - Response: I can help!
- That would require programming
  - Response: I can help! (might just be a SQL query)
- That’s a trade secret, that’s protected by copyright …
  - Response: don’t let it go at a “No”. Appeal!
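As a rough idea of what “might just be a SQL query” can look like, here is a small Python sketch using the built-in sqlite3 module. The table, its columns, and the rows are all invented for illustration. A real agency database will look different. The point is only that leaving out personally identifiable fields can be as simple as not selecting them.

```python
import sqlite3

# Hypothetical table and column names, invented for this sketch.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE incidents (id INTEGER, name TEXT, address TEXT, "
    "incident_date TEXT, category TEXT)"
)
conn.executemany(
    "INSERT INTO incidents VALUES (?, ?, ?, ?, ?)",
    [
        (1, "Jane Doe", "123 Example St", "2017-01-05", "collision"),
        (2, "John Roe", "456 Sample Ave", "2017-02-11", "near miss"),
    ],
)

# Export only the fields needed for the analysis, leaving out name and address.
cursor = conn.execute(
    "SELECT id, incident_date, category FROM incidents ORDER BY id"
)
exported_columns = [col[0] for col in cursor.description]
rows = cursor.fetchall()

print(exported_columns)
print(rows)
```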
- How to appeal
The agency must cite a law that prevents it from complying; regulations and preferences don’t count. You can also ask to speak to their boss. Here are some resources that the speaker provided:
- The Art of Access: Strategies for Acquiring Public Records (book)
- First Amendment Coalition
- Reporters Committee for Freedom of the Press
- House Bill 3399
DDF: Gapminder’s data model for collaborative harmonization of multidimensional statistics
I have known of Gapminder since high school, so it was nostalgic to see it again and hear about their work straight from the people working on it. The mission is the same: give the public a fact-based worldview that everyone can understand.
The sad fact is that most people don’t have an accurate view of the world; people don’t update their worldview as the world changes. Take a test here to see how up to date your worldview is. For the record, I only got 2/3 of the questions right …
At the beginning of the talk, we were presented with a chart showing the number of children (< 15 years old) in the world by year. The graph rises over the years and stops at 2000. Now the question is, what is the number going to be in 2100? A) the same, B) double, or C) triple? Because I had watched a Kurzgesagt video on overpopulation, I (and very few other people) chose option A, the right answer. The number of children is not expected to grow beyond this point!
They also talked about a data model they’re adopting called DDF, which makes it flexible to transform and merge data. Great for visualizing world data!
On this first day, I learned so much about data science’s role in policy-making and government, and about what to watch out for. My friend also took some notes on the first day of the conference. Tomorrow’s talks will cover more of the tech aspects of data science.