
360Learning’s Data Journey: Building A Resilient & Cost-Effective Pipeline

As we scale up at 360Learning, we’re on a quest to improve our product with great data. The problems get bigger, and the guessing gets harder. Since the beginning of 2022, we've been incorporating more sources of data into the heart and veins of our company. Staying true to our core value of transparency, here's a behind-the-scenes look at how we've done that.

Once upon a time, I worked in the healthcare industry. The product was top quality and the clients were buying the upgrades. But because of the amount of testing and validation required, and the physical time needed to build new hardware, we typically had to wait two years between starting a project and getting it into the hands of doctors.

The devices weren’t connected, so we had no information on how the software was used. Did we do a good job? Was the feature we spent months building actually useful? Was there any point to the optimization that took us weeks?

To this day, I still don’t know. There is nothing we can’t build in software, so the only important question is: Should we spend our money building it? 

At 360Learning, our business is unusual among scale-ups: we cover a huge functional scope. We build a product that is used by startups, mid-market organizations, and enterprise companies for onboarding, sales training, customer training, and a host of other learning and development objectives. Building the “right” feature when you have so many different clients is a challenge. It gets very hard, very fast.

And to make the right decision, we need data on the usage of our platform. 

Thankfully, at 360Learning, we have that data. And our data pipeline is only getting stronger. 

This is the story of how it’s come to be.

How stronger privacy policies affected data collection

Why Steve Jobs failed us.

At 360Learning, our main source of truth is Amplitude. Our Product Managers chat with customers during discovery meetings to find out what they need, and based on their feedback, we create new features. Then, we implement and release those features with Amplitude trackers that monitor their usage.

It’s quite easy to see how often a button is used or how many times a page is opened, as well as to analyze a user's workflow and see where they drop out, for any of many possible reasons: loss of interest, a confusing user interface, unclear copy, etc.

Amplitude is a great tool, and it’s designed very well. It's easy to integrate into your front-end code: each time a user interacts with a widget, the browser triggers a call to the Amplitude servers. However, this behavior works exactly the same way as the ad trackers that populate the internet ecosystem to track your every move and sell your data to the world.

A few years ago, Apple decided to make a few more billions by becoming a “champion of privacy” and started waging war against those trackers. Mozilla, Microsoft, even the kings of ads, Google, were all forced to follow, and all major browsers (and ad blockers) now block all or part of those trackers, either by default or as an option.

Amplitude, as a somewhat innocent bystander, suffers the same fate as the rest of the trackers. Some events pass through, depending on the browser version or the network configuration, but some don't. At the end of the day, not all of the data is available, and therefore it is not reliable. At best, it can be used to gauge usage trends (assuming the share of blocked data stays constant across time and features).

Any attempted fix inside the browser is a losing battle: you’re stuck between the biggest companies in the world and their money.

Sourcing more data outside of Amplitude

Like chocolate, you can never have too much data.

But there is a way to correct Amplitude’s data: go server-to-server. Instead of having the browser send the data to Amplitude, we can push the information to our own servers, and then make the call to Amplitude ourselves.
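To make the idea concrete, here is a minimal sketch of what such a relay could look like, assuming Amplitude's HTTP V2 endpoint and written in Python for brevity (the function and key names are illustrative, not our actual implementation):

```python
# Minimal sketch of a server-side relay to Amplitude (illustrative, not production code).
# The browser calls our own backend; the backend forwards the event to Amplitude.
import requests

AMPLITUDE_HTTP_API = "https://api2.amplitude.com/2/httpapi"  # Amplitude's HTTP V2 endpoint
AMPLITUDE_API_KEY = "..."  # kept server-side, never shipped to the browser

def forward_event(user_id: str, event_type: str, event_properties: dict) -> None:
    """Relay a single product event from our servers to Amplitude."""
    payload = {
        "api_key": AMPLITUDE_API_KEY,
        "events": [{
            "user_id": user_id,
            "event_type": event_type,              # e.g. "course_published"
            "event_properties": event_properties,
        }],
    }
    # A call blocked by the browser is simply lost; a server-side call can be retried.
    requests.post(AMPLITUDE_HTTP_API, json=payload, timeout=5).raise_for_status()
```

Since the request now originates from our infrastructure rather than the user's browser, ad blockers and tracking protection never see it.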

We could do that. It would mean some implementation work, some infrastructure and maintenance cost, maybe losing some data quality (e.g., information about the user's browser), and definitely losing part of Amplitude's appeal: its ease of use and simplicity. We would essentially be using it as an expensive database with a very nice user interface.

But we knew we wanted more: a better event management system. Our goal is not just to track the usage of our widgets, but also to answer:

  • What is the usage of this feature per company? Per company type?
  • How much does it cost to maintain this feature? Is it worth it?
  • Can we improve our Machine Learning models?
  • Can we reduce the load on our legacy database? Can we speed it up?
  • Can we use an event-driven architecture?

We want to show the value 360Learning brings to our clients. We decided that fixing Amplitude wouldn’t be worth the investment, and that we could go further: track the data, capture as much of it as possible, store it, and use it.

How did we plug the data in?

Let’s get technical.

First, we needed to gather the data. Each feature should be able to publish an event describing the change of state that happened, and that event should be stored somewhere for further analysis. We needed to be able to create events quickly, push them at high frequency (thousands per minute), and store them for at least a year. Bonus points if the technology could later also serve as an input to streaming architectures.
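To make that concrete, an event could look something like this (a hypothetical shape, not our actual schema):

```python
# A hypothetical event describing a change of state (illustrative shape, not our real schema).
from datetime import datetime, timezone
import json
import uuid

event = {
    "event_id": str(uuid.uuid4()),          # lets downstream consumers deduplicate
    "event_type": "course_completed",       # the change of state that happened
    "occurred_at": datetime.now(timezone.utc).isoformat(),
    "company_id": "c_123",                  # lets us slice usage per company / company type
    "user_id": "u_456",
    "payload": {"course_id": "crs_789", "score": 0.92},
}

serialized = json.dumps(event).encode("utf-8")  # what actually travels through the pipeline
```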

We had just finished a migration to the Microsoft Azure cloud, and our DevOps squad still has room to grow, so building our own Kafka or HDFS cluster did not seem like the best use of our time.

Instead, we dipped into the managed services Microsoft offers.

Azure ServiceBus and Azure Logic Apps

Azure ServiceBus is a message broker with publish/subscribe capabilities. We're starting to use it as an inter-service protocol between our monolith and some new services. It was easy to set up, and it promised to handle hundreds of thousands of messages per minute, with separation of concerns by topic and a lifetime of several days for messages. All in all, it sounded like everything Kafka can do, and it was well within our requirements. It worked as promised, and we could publish our events.
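As an illustration, publishing an event to a ServiceBus topic looks roughly like this (shown with the Python SDK for brevity, though our monolith would use the JavaScript equivalent; the topic and environment variable names are made up):

```python
# Sketch of publishing an event to a ServiceBus topic. Topic and env var names are illustrative.
import json
import os
from azure.servicebus import ServiceBusClient, ServiceBusMessage

def publish(event: dict) -> None:
    client = ServiceBusClient.from_connection_string(os.environ["SERVICE_BUS_CONNECTION_STRING"])
    with client:
        # One topic per concern keeps consumers focused on the events they care about.
        sender = client.get_topic_sender(topic_name="product-events")
        with sender:
            message = ServiceBusMessage(
                json.dumps(event),
                application_properties={"event_type": event["event_type"]},  # for subscriber filters
            )
            sender.send_messages(message)
```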

To store our data, one candidate was to reuse MongoDB. However, we eventually decided against it, as we did not want to impact our current cluster. Plus, MongoDB is optimized for fast writes and fast reads, and leans heavily on RAM. We preferred to go with Azure Blob Storage, a cold storage (HDD) optimized for massive amounts of data, where the GB is so affordable you never think about deleting anything.

However, there was no built-in feature to automatically store the events flowing through the ServiceBus. The point was for the pipeline to require minimal maintenance, so we did not want to implement a service from scratch that would consume and store the data.

So, we turned to Azure Logic Apps, a “no-code” service that lets us build workflows (signals to actions): from the signal “event in the ServiceBus,” react with “write to Azure Blob.” However, the edge cases kept piling up and complicating the Logic App: different types of blob storage, steps to decrypt the messages, file name formats depending on the event type and date, etc.

In the end, we had a functioning pipeline, but also an already overly complex Logic App.

Our Logic App workflow: it got too complex way too quickly.

Azure Event Hub

We then decided to dig into the alternatives Azure proposes (there are at least three different ways of building a message pipeline). We realized that our workflow was a much better fit for Azure Event Hub, because for all intents and purposes, Azure Event Hub looks like a reimplementation of Kafka.

All the basic concepts are there (producers/consumers, topics, retention, partitions, scalability), and as far as I saw, only one additional feature: namespaces, a way to group your topics together. The key differentiator from ServiceBus was the one-toggle “Capture” feature, which stores events automatically in Blob Storage. It was as simple as advertised: our events were stored with a single line of configuration, and we could drop the Logic Apps.
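For comparison, the producing side with Event Hub is just as small (again a Python sketch with illustrative names); with Capture toggled on, the events land in Blob Storage as Avro files without any extra code on our side:

```python
# Sketch of sending events to an Event Hub. With Capture enabled on the hub, every event
# is automatically archived to Blob Storage as Avro files. Names are illustrative.
import os
from azure.eventhub import EventData, EventHubProducerClient

producer = EventHubProducerClient.from_connection_string(
    conn_str=os.environ["EVENT_HUB_CONNECTION_STRING"],
    eventhub_name="product-events",
)

with producer:
    batch = producer.create_batch()                      # respects the hub's max batch size
    batch.add(EventData(b'{"event_type": "course_completed", "company_id": "c_123"}'))
    producer.send_batch(batch)
```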

At this stage, our platform could emit events and they would be stored forever. 

Now what about actually doing something with the data?

Databricks

Databricks is a software development platform used by our Machine Learning engineers. It allows them to write Python code, read from our MongoDB, run that code on on-demand clusters, and write the results to Cosmos DB (fully compatible with our MongoDB). We tested and validated that we could access the Blob Storage as well.
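As a rough illustration of what that access looks like, here is the kind of PySpark snippet one might run in a Databricks notebook to read the Avro files produced by Capture (the storage path and the JSON schema are illustrative, and `spark` is the session Databricks provides):

```python
# Sketch of reading Event Hub Capture output from Blob Storage in a Databricks notebook.
# Capture writes Avro files; the `Body` column contains the raw event bytes.
# Assumes access to the storage account is already configured (mount or account key).
from pyspark.sql import functions as F

capture_path = "wasbs://capture@<storage-account>.blob.core.windows.net/product-events/*/*/*/*/*/*/*.avro"

raw = spark.read.format("avro").load(capture_path)

events = (
    raw.select(F.col("Body").cast("string").alias("json"), "EnqueuedTimeUtc")
       .withColumn("event", F.from_json("json", "event_type STRING, company_id STRING, user_id STRING"))
       .select("event.*", "EnqueuedTimeUtc")
)

# e.g. usage of each feature per company
events.groupBy("company_id", "event_type").count().show()
```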

We took a minute there and checked our goals. It was now possible for us to provide more data to the ML engineers. But there were additional clients: we want to empower our developers to build streaming applications and to precompute stacks of data. Python and Spark/Spark Streaming on Databricks is a good answer, but the gap with our current tech stack (JavaScript and MongoDB) would make adoption tough.

Aside from the pure tech discussions, we realized the pull for data was also strong on the Operations side of the company. How much is this feature worth to a client? And, subsequently, to us? In other words, how can we link product usage to both their revenue and ours?

Finding this information requires knowing about all of the company's databases and data sources, how to link them, how to clean up the data, and how to realize its full potential.

Luckily, we had someone who knew how to do that.

Building a Data Squad to support the data pipeline

People come first.

Enter Julie. She’s our Data Engineer and, at that time, she was working full-time for Operations, answering the kinds of business and platform-usage questions above as well as she could. But she grew frustrated by the information that was lost forever, the difficulty of bringing all the data together, and the inability to provide the product data people were asking for.

There were a lot of discussions and some time for reflection, and ultimately we realized that we needed more from her: we needed her skills to build a proper data pipeline, we needed her knowledge applied more broadly across the company, and we realized she would need some help to meet our new needs along with the old ones. So, we opened up some positions! And now, we’re looking to build out our ever-growing data team.

If all goes well, we envision the Data Squad being composed of several Data Engineers, who will focus on the data pipeline, as well as Data Analysts, who will focus on harnessing this data to improve our product and business. The Data Squad will be incubated inside the AI Squad to start (as the ML engineers are huge consumers of data) until it gets big enough to be its own independent squad.

What lies ahead for the team

At the time of this writing, we have two event hubs plugged into production. We have identified three other use cases (group membership, mail events, and subtitle usage) that should be implemented during the quarter. Our data pipeline is getting stronger, with new tools at various stages of proof of concept.

Hopefully, we will get back to you with status updates soon! And if you’re interested in joining the team, check out our open roles here.