Linear Digressions

In each episode, your hosts explore machine learning and data science through interesting (and often very unusual) applications.

http://lineardigressions.com

      Clustering with DBSCAN


      DBSCAN is a density-based clustering algorithm for doing unsupervised learning. It's pretty nifty: with just two parameters, you can specify "dense" regions in your data, and grow those regions out organically to find clusters. In particular, it can fit irregularly-shaped clusters, and it can also identify outlier points that don't belong to any of the clusters. Pretty cool!
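
      If you want to play with the algorithm yourself, here's a minimal sketch using scikit-learn's DBSCAN (the two parameters are eps, the neighborhood radius, and min_samples, the density threshold; the values below are purely illustrative):

      # DBSCAN on irregularly shaped clusters.
      import numpy as np
      from sklearn.cluster import DBSCAN
      from sklearn.datasets import make_moons

      # Two crescent-shaped clusters that centroid-based methods handle poorly.
      X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

      # eps: radius that defines each point's neighborhood.
      # min_samples: neighbors a point needs to count as "dense".
      labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

      # Points labeled -1 are the outliers that belong to no cluster.
      print("clusters found:", len(set(labels) - {-1}))
      print("outliers:", int(np.sum(labels == -1)))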

      16m

      The Kaggle Survey on Data Science


      Want to know what's going on in data science these days?  There's no better way than to analyze a survey with over 16,000 responses that Kaggle recently released.  Kaggle asked practicing and aspiring data scientists about themselves, their tools, how they find jobs, what they find challenging about their jobs, and many other questions.  Then Kaggle released an interactive summary of the data, as well as the anonymized dataset itself, to help data scientists understand the trends in the da...

      25m

      Machine Learning: The High Interest Credit Card of Technical Debt


      This week, we've got a fun paper by our friends at Google about the hidden costs of maintaining machine learning workflows. If you've worked in software before, you're probably familiar with the idea of technical debt: the inefficiencies that crop up in code when you're trying to go fast. You take shortcuts, hard-code variable values, skimp on the documentation, and generally write not-that-great code in order to get something done quickly, and then end up paying for it later on....

      22m

      Improving Upon a First-Draft Data Science Analysis


      There are a lot of good resources out there for getting started with data science and machine learning, where you can walk through starting with a dataset and ending up with a model and set of predictions. Think something like the homework for your favorite machine learning class, or your most recent online machine learning competition. However, if you've ever tried to maintain a machine learning workflow (as opposed to building it from scratch), you know that taking a simple modeling...

      15m

      Survey Raking


      It's quite common for survey respondents not to be representative of the larger population from which they are drawn. But if you're a researcher, you need to study the larger population using data from your survey respondents, so what should you do? Reweighting the survey data, so that things like demographic distributions look similar between the survey and general populations, is a standard technique, and in this episode we'll talk about survey raking, a way to calculate survey weights...
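
      As a rough illustration of the idea (a toy sketch, not necessarily the exact procedure from the episode), here's a raking loop in Python; the respondents, margins, and column names are all made up:

      # Survey raking (iterative proportional fitting) on toy data.
      import pandas as pd

      # Hypothetical respondents: one row per survey answer.
      df = pd.DataFrame({
          "sex": ["F", "F", "M", "M", "M", "F"],
          "age": ["young", "old", "young", "old", "young", "young"],
      })
      df["weight"] = 1.0  # everyone starts with equal weight

      # Known population proportions for each demographic variable.
      margins = {
          "sex": {"F": 0.5, "M": 0.5},
          "age": {"young": 0.4, "old": 0.6},
      }

      # Alternate over the variables, rescaling weights so the weighted
      # distribution of each one matches its population margin.
      for _ in range(50):
          for var, target in margins.items():
              current = df.groupby(var)["weight"].sum() / df["weight"].sum()
              df["weight"] *= df[var].map(
                  {level: target[level] / current[level] for level in target}
              )

      # The weighted sex and age distributions now match the margins.
      print(df.groupby("sex")["weight"].sum() / df["weight"].sum())
      print(df.groupby("age")["weight"].sum() / df["weight"].sum())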

      17m

      Happy Hacktoberfest


      It's the middle of October, so you've already made two pull requests to open source repos, right? If you have no idea what we're talking about, spend the next 20 minutes or so with us talking about the importance of open source software and how you can get involved. You can even get a free t-shirt! Hacktoberfest main page: https://hacktoberfest.digitalocean.com/#details

      15m
       2017-10-16

      Re-Release: Kalman Runners


      In honor of the Chicago marathon this weekend (and due in large part to Katie recovering from running in it...) we have a re-release of an episode about Kalman filters, which is part algorithm, part elaborate metaphor for figuring out how fast you're going if you're running a race but don't have a watch. Katie's Chicago race report: miles 1-13: light ankle pain, lovely cool weather, the most fun EVAR; miles 13-17: no more ankle pain but quads start getting tight, it's a little more...
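
      For the curious, here's a bare-bones one-dimensional Kalman filter in the spirit of the metaphor: estimating a steady pace from noisy speed readings (every number below is invented):

      # Scalar Kalman filter: estimate a runner's speed without a watch.
      import numpy as np

      rng = np.random.default_rng(0)
      true_speed = 6.0  # mph, assumed constant
      readings = true_speed + rng.normal(0.0, 1.0, size=20)  # noisy estimates

      estimate, variance = 5.0, 4.0      # initial guess and its uncertainty
      process_var, meas_var = 0.01, 1.0  # pace drift vs. measurement noise

      for z in readings:
          variance += process_var                  # predict: uncertainty grows
          gain = variance / (variance + meas_var)  # how much to trust the reading
          estimate += gain * (z - estimate)        # update toward the reading
          variance *= 1.0 - gain                   # uncertainty shrinks after update

      print(f"estimated speed: {estimate:.2f} mph (true: {true_speed})")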

      17m
       2017-10-09

      Neural Net Dropout


      Neural networks are complex models with many parameters and can be prone to overfitting.  There's a surprisingly simple way to guard against this: randomly destroy connections between hidden units, also known as dropout.  It seems counterintuitive that undermining the structural integrity of the neural net makes it robust against overfitting, but in the world of neural nets, weirdness is just how things go sometimes. Relevant links: https://www.cs.toronto.edu/~hinton/absps/JMLRdropout.pdf
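
      As a small illustration of the mechanism (a sketch, not the paper's exact recipe), here's "inverted" dropout applied to one layer's activations in NumPy:

      # Inverted dropout: zero out random hidden units during training.
      import numpy as np

      rng = np.random.default_rng(0)

      def dropout(activations, p_drop=0.5, training=True):
          """Drop each unit with probability p_drop; rescale the survivors."""
          if not training:
              return activations  # at test time, use the full network
          mask = rng.random(activations.shape) >= p_drop
          # Dividing by the keep probability preserves the expected
          # activation, so no extra rescaling is needed at test time.
          return activations * mask / (1.0 - p_drop)

      hidden = rng.normal(size=(4, 8))  # a toy batch of hidden activations
      print(dropout(hidden))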

      18m
       2017-10-02

      Disciplined Data Science


      As data science matures as a field, it's becoming clearer what attributes a data science team needs to have to elevate their work to the next level. Most of our episodes are about the cool work being done by other people, but this one summarizes some thinking Katie's been doing herself around how to guide data science teams toward more mature, effective practices. We'll go through five key characteristics of great data science teams, which we collectively refer to as "disciplined data...

      29m
       2017-09-25