Jul 31, 2024

Imagining a Personal Data Pipeline

I've been thinking a lot about personal data lately, the stuff that accrues as we live our lives in the presence of machines. About a year ago, I wrote down my thoughts on my own personal data: where it collects, what shape it takes, how it could be used.

> The amount of data exhaust that we're all generating is pretty staggering. I'm fairly cautious about what I sign up for but I'm still finding more and more sources of data I can export and use. It's a bit of a bind, to be honest; I love having this data available to me but I don't love allowing companies to collect it ... What I am truly cautious about, though, is how much time I spend recording and collecting and transforming ... There is a whole bunch of toil baked into this hobby and I'm wary of creating an endless source of digital chores for myself ... I want to make it easy to collect as much of my own raw data as possible, even if I don't know exactly what I want to do with it right now.

Up until the end of last year, this curiosity was limited to me finding new sources of data, downloading them, and looking through these small windows into the past. What was the first thing I bought on Amazon (2004-12-07, The Daily Show with Jon Stewart Presents America, gift for a friend)? What was my longest bike ride recorded on Strava (2016-08-27, 118 miles from Seattle to Bellingham)? What was my first tweet (2008-03-27, "twit twit twit... nothing to say! :)")? It was fun and interesting and a nice distraction from what I was supposed to be doing on my laptop at the time, probably paying bills or reading another email from school.

Quick coffee stop in Coupeville, WA on the way to Bellingham.

Earlier this year, I started thinking more about just how much data I generate and what, if anything, this data could be used for in aggregate. The idea of being a "data-driven" human being, making decisions in the name of "optimization," is not appealing to me at all, but there are a number of places where I would like to be able to combine data across services without allowing disparate companies to just vacuum it up wholesale.

All of the things above could be stand-alone products, and probably already exist in one form or another if I looked hard enough. But there are three big things that would stand in the way of me using something on the list above:

No matter how you cut it, trying to get all of your data into one service in order to pull insights out of it is problematic. Whether it's multiple apps working together or one single app that does everything, there is a fundamental problem with handing your personal data to a company and trusting them to both do the right thing and give you what you need, doubly so when you plug two apps together. For me, these issues prevent me from using a lot of different services that probably could be helpful.

The more I thought about it, the more important it felt to have as much of my data as possible immediately at my disposal. Maybe it's stored locally on my computer or in the cloud somewhere it won't be used without my consent, like an S3 bucket or a private git repo. Once it's downloaded, I could query it, transform it to another format, or filter it and send the smaller payload to a specific app or service, or even delete the data from the originating service. Not only that, I would have a backup of the original data always available in case I leave the service or it shuts down.

Data pipelines

As I started to piece together how this would work, it sounded more and more like the extract, load, transform (ELT) data pipelines I was using at work. For those unfamiliar with the concept, the basic idea is:

  1. You have a bunch of data stored in unconnected systems, some or all of which cannot be accessed directly. At work, this could be application databases containing customer-managed data, 3rd party systems like Salesforce, or something else.
  2. You set up a connector for each of the data sources you want to combine or examine to pull the raw data out and store it in a central location, often called a data lake. These connectors could be scripts, dbt, connection services like Fivetran, pre-built integrations, or any number of things.
  3. Once the raw data is in the central database, transformations can be run at whatever interval with results stored in tables that can be dropped and recreated during each run. These could be filters, combinations, calculations, or something else.
  4. Analytics, reports, and other tools can now operate on transformed data, making it easier to reuse logic.

Putting this all together looks, conceptually, something like this oversimplified but still somewhat complicated diagram:

Made with D2

Applying this system to any one of the personal data combinations I listed above seems to be a good fit:

  1. Use services that store their data in a way that you can't directly query.
  2. Extract that data into a central repository you control, like locally on your machine.
  3. Make connections, combinations, calculations, and more without hitting rate limits, dealing with pagination, or waiting for requests to complete.
  4. End up with text files or CSVs or metrics or anything else you need, ready to review, publish, or share with others.
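
As a minimal sketch of what steps 2 through 4 could look like in practice, here is a tiny script over a hypothetical service export; the directory layout, file names, and fields are made up for illustration.

```typescript
// A sketch of steps 2-4, assuming a hypothetical service whose exported JSON
// has already been pulled down into a local "data lake" directory.
import * as fs from "node:fs";
import * as path from "node:path";

const lakeDir = path.join(process.env.HOME ?? ".", "personal-data", "some-service");

// 2. The extracted raw data sits on disk, exactly as the service returned it.
const raw: { date: string; distanceMeters: number }[] = JSON.parse(
  fs.readFileSync(path.join(lakeDir, "activities.json"), "utf8")
);

// 3. Transform locally: no rate limits, no pagination, no waiting on requests.
const rows = raw
  .filter((activity) => activity.distanceMeters > 0)
  .map((activity) => `${activity.date},${(activity.distanceMeters / 1609.34).toFixed(1)}`);

// 4. End up with a CSV ready to review, publish, or share.
fs.writeFileSync(
  path.join(lakeDir, "..", "activity-miles.csv"),
  ["date,miles", ...rows].join("\n")
);
```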

I want to pause for a moment and recognize that this particular idea is not original to me. I've been thinking about how to back up and use personal data for a few years and have, recently, been picking up ideas from various tools and blog posts I've found out there: the idea of a human programming interface (HPI); a personal data search engine; outboard memory using a memex. This is just scratching the surface but I wanted to give credit where credit is very much due.

The main idea here is that we can apply some modern data infrastructure ideas to the world of personal data and, potentially, end up with a sustainable system for collecting, backing up, transforming, and querying data across multiple platforms. This puts your data in your control and reduces your reliance on companies to do the right thing on your behalf.

The system

As I started to imagine what this might look like as a complete system, I realized that what I didn't want was a complicated software system that I needed to maintain by myself for the rest of my life. There is definitely a bit of compulsion behind this type of data hoarding, and missing out on a sunny day at the park because I have to fix a bug in my data pipeline is not what I want my life to become. Whether this becomes a community of like-minded people or some kind of open-core business, I want to build something that others are able to contribute to and extend for their own purposes.

To that end, I came up with a list of attributes, or values, that need to guide the creation of whatever this might become beyond Just Another Repo. I see these as table stakes, not aspirational recommendations:

I spent some time writing and diagramming and I think I have, if not the answer, then at least a step in the right direction.

Made with D2

Each of the components is described below, in terms of its job(s) to be done. I use these terms throughout the rest of the post.

If it's not clear in the descriptions above, a critical attribute of this system is that all the different parts can be self-hosted and maintained for free. If you are wary of giving a 3rd party service access to your credentials and data (as I am), then you can spin all of this up on your own hardware and the whole thing should work as described. Everything here is designed to be both modular and extensible, making the system work for many levels of paranoia and technical ability. The "paid service" I keep referencing is just a theoretical way to support work on this project in some distant future. Always the pragmatist.

What might not be completely clear in the diagram above is where the different layers of data end up. Because this system is meant to be flexible and extensible, I don't think there needs to be a system-level opinion about how that comes together. In a data pipeline for a software company, you need to be clear and deliberate about where and how sensitive data is stored. If you're just pulling down your own data and creating new connections and combinations, this is not as big a problem. Your layers could be any of the following:

In use

So, let's take the first idea listed above, a personal CRM, and see if we can build it, conceptually, using the system above. This is a system that's specific to my workflows and includes both stored cloud data and a local system.

Here's what I'm using currently and how I want it all to fit together:

Since the output is a local file, the Processor, at least, will need to be local. To keep things simple, the rest of the components will be local as well. This will, incidentally, allow the retrieved raw JSON to be backed up in iCloud just by virtue of being saved to the right directory. Full circle!

What we're describing here looks like this:

Made with D2

With the Data Getter running, what we'll have waiting for us is a collection of events in this (filtered) format:

{
  "summary": "A fun event!",
  "start": {
    "date": "2024-05-16",
    // ...
  },
  "end": {
    "date": "2024-05-16",
    // ...
  },
  "attendees": [
    {
      "email": "josh@joshcanhelp.com",
      // ...
    },
    {
      "email": "bob@bobcanhelp.com",
      // ...
    }
  ],
  // ...
}

... and a collection of iCloud contacts looking like this:

{
  "firstName": "Josh",
  "lastName": "CanHelp",
  "emailAddresses": [
    {
      "field": "josh@joshcanhelp.com",
      // ...
    },
  ],
  // ...
}

We'll connect the event to the contact using the email address and output a line in the daily note for that date.
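
As a rough sketch, here is what that connection might look like done by hand, assuming the Data Getter has saved the (filtered) JSON above to local files; the file paths and the daily note location are hypothetical.

```typescript
// Rough sketch of the hand-rolled version. All paths and the daily note
// naming convention are assumptions for illustration.
import * as fs from "node:fs";
import * as path from "node:path";

type CalendarEvent = {
  summary: string;
  start: { date: string };
  attendees?: { email: string }[];
};
type Contact = {
  firstName: string;
  lastName: string;
  emailAddresses?: { field: string }[];
};

const events: CalendarEvent[] = JSON.parse(fs.readFileSync("google/calendar.json", "utf8"));
const contacts: Contact[] = JSON.parse(fs.readFileSync("icloud/contacts.json", "utf8"));

// Index contacts by email so each attendee lookup is a single map hit.
const byEmail = new Map<string, Contact>();
for (const contact of contacts) {
  for (const address of contact.emailAddresses ?? []) {
    byEmail.set(address.field.toLowerCase(), contact);
  }
}

const dailyNotesDir = "/path/to/obsidian/daily-notes";
for (const event of events) {
  const names = (event.attendees ?? [])
    .map((attendee) => byEmail.get(attendee.email.toLowerCase()))
    .filter((contact): contact is Contact => !!contact)
    .map((contact) => `[[${contact.firstName} ${contact.lastName}]]`);
  const line = `- ${event.summary}${names.length ? " with " + names.join(", ") : ""}\n`;
  // Append to the daily note for the event's date, creating the file if needed.
  fs.appendFileSync(path.join(dailyNotesDir, `${event.start.date}.md`), line);
}
```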

Note: I'm skipping over the details of the Data Getter because, in the end, that service is meant to "just work."

For those of you out there who can write code, this is not a terribly involved task. Parse the JSON, get all the email addresses, find them in the iCloud data, grab the names, find the daily note files, and print it all out. This is ... fine and much easier to do once you've got all the data you need sitting nearby. But the idea of these Recipes is to:

All of this, in my mind, points to some kind of declarative way to describe the data sources, their relationships, and the output that we want. The Processor would run a number of pre-flight checks against the stored data to make sure that the input sources exist, that the output destination is configured, and that there is a map of how we get from input data to output.
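
A sketch of what those pre-flight checks could look like, with an assumed recipe shape and directory layout that are only stand-ins for whatever the real format ends up being:

```typescript
// A sketch of the kind of pre-flight checks the Processor could run before
// touching any data. The recipe shape and directory layout are assumptions.
import * as fs from "node:fs";

interface Recipe {
  input: Record<string, Record<string, unknown>>; // source -> data set -> field map
  output: Record<string, unknown>;                // destination -> strategy config
  pipeline?: unknown[];                           // the input-to-output map
}

function preflight(recipe: Recipe, dataDir: string): string[] {
  const problems: string[] = [];
  // Every declared input source and data set needs saved data (or a connection).
  for (const [source, sets] of Object.entries(recipe.input)) {
    for (const set of Object.keys(sets)) {
      if (!fs.existsSync(`${dataDir}/${source}/${set}`)) {
        problems.push(`No local data found for ${source}.${set}`);
      }
    }
  }
  // There has to be at least one configured output destination ...
  if (Object.keys(recipe.output).length === 0) {
    problems.push("Recipe has no output configured");
  }
  // ... and a map of how we get from input data to output.
  if (!recipe.pipeline || recipe.pipeline.length === 0) {
    problems.push("Recipe has no pipeline defined");
  }
  return problems;
}
```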

Let's try to write a Recipe that represents all this in a way that makes sense for what we're trying to accomplish. Note that this is all just conceptual at the moment and the choice to use YAML is insignificant at this stage. Starting with the input data:

input:
  google:
    calendar:
      summary: 'event_summary'
      start.date: 'start_date'
      start.dateTime: 'start_date_time'
      location: 'location'
      'attendees[].email': 'event_emails'
  icloud:
    contacts:
      firstName: 'first_name'
      lastName: 'last_name'
      'emailAddresses[].field': 'contact_emails'

# ... more to come ...

Here, we're telling the Processor that we have two data sources, google and icloud, which can be checked against either existing data files or some configuration to figure out if we expect those sources to exist. Then, each of the sources has one set of data, which would be validated directly against the data that we either have locally or are connected to. Finally, we indicate the fields that we're using and a shorthand name to be used later in the pipeline. Our pre-check can make sure that those names are unique across all sources but, because we're not processing the data yet, whether the fields exist or not is handled later. Note that anything with a [] in it tells us that we're handling multiple values.
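
To make the [] handling concrete, here is a sketch of how the Processor might resolve one of these dotted paths against the raw JSON; the function name and exact semantics are assumptions, not part of the recipe format.

```typescript
// A sketch of resolving a dotted field path against raw JSON, treating "[]"
// as "collect this from every entry of the array".
function getValuesAtPath(data: unknown, fieldPath: string): unknown[] {
  const [head, ...rest] = fieldPath.split(".");
  if (data === null || typeof data !== "object") return [];
  if (head.endsWith("[]")) {
    const list = (data as Record<string, unknown>)[head.slice(0, -2)];
    if (!Array.isArray(list)) return [];
    // Fan out over every array entry and flatten the results.
    return list.flatMap((item) =>
      rest.length ? getValuesAtPath(item, rest.join(".")) : [item]
    );
  }
  const value = (data as Record<string, unknown>)[head];
  if (rest.length === 0) return value === undefined ? [] : [value];
  return getValuesAtPath(value, rest.join("."));
}

// getValuesAtPath(event, "attendees[].email")
//   -> ["josh@joshcanhelp.com", "bob@bobcanhelp.com"]
// getValuesAtPath(event, "start.date") -> ["2024-05-16"]
```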

Next, we're going to tell the Processor what output we want and what its shape should be. The possibilities for output are effectively infinite but, just like with the rest of this, I want to start with capable default functionality and allow for lots of places to hook into the process. Standard functionality could just be simple file system operations like writing to a CSV, appending to a text file, or creating another JSON file.

This should cover a number of use cases but, like the Data Getter, I want to also include user-contributed modules for specific services and tools. Each contributed output would have configuration of some kind that would be checked (file paths, API credentials, etc.) and allow folks to define specific templates, output locations, and metadata to use.
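
As a rough sketch, each contributed output module might implement a small contract along these lines; the interface name and method signatures are placeholders, not a defined API.

```typescript
// A sketch of the contract a contributed output module might implement.
interface OutputModule {
  // Strategies this module supports, e.g. "csv" or "daily_notes_append".
  strategies: string[];
  // Pre-flight: check file paths, API credentials, and other required config,
  // returning a list of problems (empty means ready to go).
  checkConfig(strategy: string, config: Record<string, unknown>): string[];
  // Write one processed record, using whatever template and metadata apply.
  write(strategy: string, record: Record<string, unknown>): Promise<void>;
}
```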

For our CRM use case, this could look something like:

input:
  # ...
output:
  file:
    - strategy: 'csv'
      data:
        path: '/Users/user/Downloads/'
        fields:
          - date
          - start_time
          - event_summary
          - location
          - event_emails
  obsidian:
    - strategy: 'daily_notes_append'
      data:
        date: 'date'
        template: |
          - ${event_summary} with ${:loop:contact_emails:, :[[${first_name} ${last_name}]]} at ${start_time}  

# ... more to come ...

In the example above, we're using a theoretical Obsidian-specific module that provides capability for working with daily notes. When being called from a recipe like this, it would check for necessary configuration (in this case we would need a path to the local files at least) and fail if it didn't have the information it needed. Once it passes pre-flight, the module would tell the Processor how to find the daily note for the data being provided and what to do with the data once the file is found or created.
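
A minimal sketch of what the daily_notes_append behavior could look like once pre-flight passes, assuming daily notes are named by date; the vault path and file naming would come from the module's configuration and are hypothetical here.

```typescript
// A sketch of appending a rendered line to an Obsidian daily note. The vault
// directory, file naming, and heading are assumptions; real daily note
// formats depend on each person's Obsidian configuration.
import * as fs from "node:fs";
import * as path from "node:path";

function appendToDailyNote(vaultDailyDir: string, date: string, line: string): void {
  const notePath = path.join(vaultDailyDir, `${date}.md`);
  // Create the daily note if it does not exist yet, then append the line.
  if (!fs.existsSync(notePath)) {
    fs.writeFileSync(notePath, `# ${date}\n\n`);
  }
  fs.appendFileSync(notePath, `${line}\n`);
}
```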

The last, and most important, piece is defining how the data should be modified and connected. In keeping with our "pipeline" concept, we'll define a set of actions to be taken on the data that we have:

input:
  # ...
output:
  # ...
pipeline:
  - field: 'start_date'
    toField: 'date'
  - field: 'start_date_time'
    transform:
      - 'toStandardDate'
    toFieldUpdateIfEmpty: 'date'
  - field: 'start_date_time'
    transform:
      - 'toStandardTime'
    toField: 'start_time'
  - field: 'event_summary'
    transform:
      - 'trim'
  - field: 'event_emails'
    linkTo: 'contact_emails'

Walking through this step-by-step:

  1. Copy start_date into date (all-day events only have a start date).
  2. Convert start_date_time to a standard date and use it for date if that field is still empty.
  3. Convert start_date_time to a standard time and store it as start_time.
  4. Trim any extra whitespace from event_summary.
  5. Link each event's event_emails to the matching contact_emails, which is what connects calendar events to iCloud contacts.
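
Under the hood, the Processor might apply each of these steps with something like the sketch below; the transform implementations and step handling here are assumptions about how it could work, not settled behavior.

```typescript
// A sketch of running one pipeline step over a single event record. The
// transform functions are hypothetical stand-ins for built-in transforms.
type Transform = (value: unknown) => unknown;

const transforms: Record<string, Transform> = {
  toStandardDate: (v) => new Date(String(v)).toISOString().slice(0, 10),
  toStandardTime: (v) =>
    new Date(String(v)).toLocaleTimeString("en-US", { hour: "numeric", minute: "2-digit" }),
  trim: (v) => String(v).trim(),
};

interface Step {
  field: string;
  transform?: string[];
  toField?: string;
  toFieldUpdateIfEmpty?: string;
  linkTo?: string;
}

function applyStep(record: Record<string, unknown>, step: Step): void {
  let value = record[step.field];
  if (value === undefined) return;
  for (const name of step.transform ?? []) {
    const fn = transforms[name];
    if (fn) value = fn(value);
  }
  if (step.toField) record[step.toField] = value;
  if (step.toFieldUpdateIfEmpty && !record[step.toFieldUpdateIfEmpty]) {
    record[step.toFieldUpdateIfEmpty] = value;
  }
  if (!step.toField && !step.toFieldUpdateIfEmpty && !step.linkTo) {
    record[step.field] = value; // transform in place, like the trim step above
  }
  // linkTo (joining event_emails to contact_emails) would hand off to whatever
  // logic matches records from the other data source.
}
```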

Another note about data layers here ... The Data Getter is saving raw JSON parsed out into days but, after that, we just have what we have. But, for so many things that we want to do, we're going to be filtering, augmenting, connecting, and transforming this data into what we want. In many cases, the processing that's required for one output will be the same as others and duplicating transformations and links would be a "fork bomb," as a friend described it. To help with that, we could define a set of "pre-processing" recipes that build new JSON files from our data sources, creating a new source that can be truncated and replaced for each run of the Processor. This would simplify the recipes and provide smaller datasets to work with.

Assuming everything was wired up correctly, running this recipe would output the following block on all daily notes that had events from Google:

**Google Calendar events**
- Doctor appointment at 2:00 PM
- Partner introduction with [[Alice Lake]] at 11:30 AM
- Staff meeting at 10:00 AM

Voila! We started with disparate calendar and contact data and ended up with information we can use in a helpful location!

Are you interested?

I want to give you a nod of appreciation for making it this far!

This post originally started as an explanation of a concept I wanted to explore but, as I worked on the proof of concept more, it felt like it should, and could, be working before the post saw the light of day. So, over the last few months, I wrote the two main components, the Data Getter and the Processor. I added contracts for Google Calendar and iCloud Contacts importing and an output processor for Obsidian daily notes.

Right now, PDPL is a simplified version of the system described above:

Made with D2

There is a long way to go before PDPL can do everything I described above but you can try it out using the tutorial below. This runs through how to set up the Data Getter and generate a CSV of the output.

👉 Getting started with PDPL

Whether you are someone who has written something like this before or someone who wishes this kind of thing existed, I would love to hear from you. If you want to hear updates when I have them, sign up on Substack and I'll send updates on how the system is coming along and on new releases. If you're looking for a more substantial contribution:

I'm specifically interested to hear feedback about the declarative processing, specific use cases, pitfalls from your experience, or "product" feedback from potential users, especially folks who would use something like this but do not have the ability (or desire) to build it themselves.
