by Ru Arup, Data Team Lead. Connect with Ru on LinkedIn.
As LendInvest has expanded to become a technology-first public business, the number, variety and volume of our data sources have grown enormously.
For each loan application we now collect tens of thousands of data points to help us analyse credit risk and make better, faster lending decisions. The operational efficiency gained from digitally collecting data that was once gathered manually is one of the core reasons the business has been able to write more loans without linearly increasing headcount. That growth has also meant managing more data pipelines, transformations and tools to feed the data models our functional analysts use downstream.
A core duty of a data platform is to pipeline all data into a single Data Warehouse where data modelling can be carried out to prepare the data for reporting and machine learning. Without adopting the appropriate technologies and applying strong Data Engineering practices, more disparate data sources can have a significant negative impact on the business’s ability to leverage the data for insights, if at all.
Our legacy data platform wasn’t able to scale, so we set out to build a replacement from the ground up. The “Nova” Data Platform was born.
This article highlights three of the core problems we were trying to solve by rebuilding our platform from the ground up. It finishes with some insight into the Data Team here at LendInvest.
- Orchestration of data transformations with dbt and dagster
- Minimising data latency
- Maintaining a usable data catalogue
Orchestration of Data Transformations with dbt and dagster
One of the pain points with our legacy data platform was the hundreds of stored procedures run on a MySQL server to prepare our Fact and Dimension tables for analysts to use downstream. We had shoehorned ourselves into maintaining an ordinal position for each procedure, which became unmanageable and a key-person dependency risk. This pattern was painful to maintain; meanwhile, dbt (the “data build tool”) had become an industry-standard solution for this exact problem.
“Dbt makes data engineering activities accessible to people with data analyst skills to transform the data in the warehouse using SQL statements”.
With strong SQL skills within the team and more business knowledge among the BI analysts we decided it would become a core tool in our Nova platform to solve our orchestration of data transformation problems.
The Business Intelligence arm of the team took on the huge task of upskilling to become part-time “Analytics Engineers” in order to migrate our Data Warehouse tables and reports into a dbt project running on Amazon Redshift. We now successfully use dagster, a Python-based cross-platform orchestration tool, together with dbt to alleviate the issues we once faced.
A simple command like `dbt run -m +tag:core` is all that’s needed to build our core reporting layer, instead of calling 20 procedures.
Below is an example of a dagster job and the dbt data transformation dependency graph used to create a core reporting model, “base_application”. The pipelines are self-aware and self-correcting when upstream data isn’t in place or hasn’t finished transforming successfully. This means the data team can focus on adding business value rather than firefighting when data hasn’t arrived in time.
A typical dagster pipeline:
Below is a dbt model dependency graph used to determine which tables to create and in which order. We now have over 1000 models in our warehouse, so having a tool that manages dependencies is priceless.
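Conceptually, dbt infers this graph from the `ref()` calls inside each model and runs models in dependency order. The idea can be sketched with a topological sort over a toy graph (the model names below are hypothetical, not our real project):

```python
from graphlib import TopologicalSorter

# Hypothetical slice of a model dependency graph:
# each model maps to the set of upstream models it depends on.
deps = {
    "stg_applications": set(),
    "stg_customers": set(),
    "base_application": {"stg_applications", "stg_customers"},
    "rpt_lending_pipeline": {"base_application"},
}

# static_order() yields models with all dependencies satisfied first,
# which is exactly the order a tool like dbt must build them in.
run_order = list(TopologicalSorter(deps).static_order())
```

With over 1000 models, maintaining an ordering like this by hand (as we once did with stored-procedure positions) is exactly the work the tool takes away.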
Minimising Data Latency
The increasing size of our loan application pipeline and lending portfolio started reducing the ability to deliver data at an acceptable latency. What once took five minutes began taking two hours. Instead of critical dashboards being updated every 30 minutes it started taking 3-4 hours. These long-running processes had a knock on effect on the performance of other data processing. We knew that our legacy data platform was not scalable and didn’t provide us the flexibility nor performance to match the growth of the business over the next 5 -10 years.
How do we minimise Data Latency in the Nova platform?
- Kafka for data streaming
- dbt and Redshift to improve processing concurrency
Kafka for data streaming
Kafka is an event streaming platform. Data is produced into this central hub as messages to be read by consumers and fed into a database, a Python function or another messaging service. Kafka has allowed us to move as fast as our data.
Kafka allows us to:
- Replicate any external database into the Nova platform in near real time for use in our Data Warehouse / Data Services
- Subscribe AWS lambdas to transform messages in real time
- Push change data into an S3 Data Lake for cheaper storage
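As an illustration of the last point, a consumer can route each change-data message to a date-partitioned key in the lake. The payload shape below is a hypothetical Debezium-style event, and the field names are illustrative rather than our actual schema:

```python
from datetime import datetime, timezone

def s3_key_for_change(event: dict) -> str:
    """Build a date-partitioned S3 key for a CDC message
    (hypothetical Debezium-style payload; fields are illustrative)."""
    payload = event["payload"]
    source = payload["source"]
    ts = datetime.fromtimestamp(payload["ts_ms"] / 1000, tz=timezone.utc)
    return (
        f"{source['db']}/{source['table']}/"
        f"{ts:%Y/%m/%d}/{payload['ts_ms']}.json"
    )

# A consumer process would poll a topic and write each message
# body to this key in the data-lake bucket.
event = {
    "payload": {
        "op": "u",
        "ts_ms": 1640995200000,  # 2022-01-01T00:00:00Z
        "source": {"db": "loans", "table": "applications"},
        "after": {"id": 42, "status": "approved"},
    }
}
key = s3_key_for_change(event)
```

Partitioning by database, table and date keeps the lake cheap to store and easy to query selectively later.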
dbt and Redshift to improve processing concurrency
Another benefit of moving to dbt and Redshift has been the ability to process more SQL transformations concurrently. What was once limited to a single SQL script running sequentially can now scale with our Redshift cluster size. We currently run four concurrent tasks during each job.
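The principle is the same as a worker pool over the dependency graph: models with no unmet dependencies can run side by side. A rough sketch of the idea (the model names are hypothetical, and `run_model` stands in for executing SQL against the warehouse):

```python
from concurrent.futures import ThreadPoolExecutor

def run_model(name: str) -> str:
    # Stand-in for executing one SQL model against the warehouse.
    return f"{name}: ok"

# Four independent staging models can run at the same time,
# mirroring the four concurrent tasks we run per job.
independent_models = [
    "stg_applications",
    "stg_customers",
    "stg_loans",
    "stg_payments",
]

with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(run_model, independent_models))
```

In dbt itself this is just a configuration knob (the thread count in the connection profile), but the effect on end-to-end latency is the same: independent branches of the graph no longer wait on each other.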
All reports are now updated at least 4x faster than on our old platform. Data models concerning applications are refreshed every 30 minutes instead of every 4 hours, which meets the SLA set out when building the Nova platform.
We are also exploring options for real-time analytics with tools like Materialize, which can sit directly on a Kafka broker with transformations managed by dbt. For the most part, though, a 30-minute delay is an acceptable latency.
Maintaining a usable data catalogue
A data catalogue is essential for any business seeking to democratise data. Without knowledge of what data exists and where it lives, there is very little chance that an analyst in Finance will be able to self-serve from the Data Warehouse. In the old world we maintained a spreadsheet of definitions which synced to a database powering a custom data dictionary web application. It worked, but the maintenance cost to the team was enormous, and as a result we couldn’t keep it updated.
Dbt, fortunately, solves this problem too.
dbt self-documents models and generates a useful, searchable catalogue for the business, served on our internal network. It also allows us to feed metadata from our project into Tableau and Metabase, our BI visualisation tools, so we can define a field once in our dbt project and have it sync downstream automatically. This was a huge win for the team and the business: we can now guarantee that our Data Catalogue is 100% up to date.
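This works because dbt writes the project’s documentation into machine-readable artifacts such as `manifest.json`, which downstream tools can read. A simplified sketch of pulling column descriptions out of a manifest-shaped dict (the snapshot below is hypothetical and heavily abridged, and the model name is illustrative):

```python
def column_descriptions(manifest: dict, model_id: str) -> dict:
    """Map column name -> description for one model in a
    dbt-manifest-shaped dict (structure abridged for illustration)."""
    columns = manifest["nodes"][model_id]["columns"]
    return {name: col.get("description", "") for name, col in columns.items()}

# Hypothetical, heavily abridged manifest snapshot.
manifest = {
    "nodes": {
        "model.nova.base_application": {
            "columns": {
                "application_id": {
                    "name": "application_id",
                    "description": "Unique loan application identifier",
                },
                "status": {
                    "name": "status",
                    "description": "Current application status",
                },
            }
        }
    }
}

descriptions = column_descriptions(manifest, "model.nova.base_application")
```

A small sync job along these lines is all it takes to push the same definitions into a BI tool, so a field described once in dbt stays consistent everywhere.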
The flow of metadata to our data catalogue and tools:
The Nova Data Catalogue web app:
We educate our data champions to use this tool to find where reports already exist, reducing the pressure on our team and helping colleagues locate data. It also lets analysts see the exact SQL logic used to generate a model. This increases transparency and paves the way for functional analysts who write SQL to one day contribute to the dbt project themselves, scaling the capability of the platform without simply adding new BI Analysts to the team.
Working in data at LendInvest…
We are a small team of seven data practitioners spanning Platform Engineering, Data Engineering and Analytics. Choosing the right tools to scale our impact on the business was absolutely critical when building out the Nova platform. Data Engineering burnout is commonplace in the industry, so every decision has been made with automation and the reduction of manual work in mind, to ensure we don’t become around-the-clock data cleaners, the fate of many Data Engineers today. We think we have built a really exciting data stack that will allow the business and the team to grow over the next 5–10 years without the pain points that plagued our pre-dbt days.
Please reach out if you’re interested in finding out more about our stack!