We've all been there: you've sat through (many!) meetings with sales reps from all the SaaS data integration companies, and now you have 14-day trial access to their products. Now you have to decide what to actually try in order to determine definitively whether a tool is the right fit for you and your team.
I wanted to put together some notes on key evaluation questions, as well as some ways to check functionality, since this is a process I'm sure I'll go through again and again, and I like having a template for these kinds of things.
These were compiled mainly with cloud-based integration platforms like Fivetran, Airbyte, and Rivery in mind, but they could apply to other cases too!
If you have a favorite way of trying out new data tools, add it in the comments!
1. Create a rubric
You can find a million articles on evaluation criteria for data integration tools (I really like this one!), but ultimately it all comes down to your data platform and the problems you're trying to solve within it.
Get the team together and figure out what those things are. There are obvious features, like the required source and destination connectors, that can be deal-breakers, but maybe you're also looking for a metadata solution that provides lineage, or trying to improve monitoring, or need to scale something built in-house that no longer holds up.
When all of that is laid out, it also becomes easier to divide the work of conducting these assessments among team members so that they run in parallel.
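Once the criteria are written down, it can help to turn them into a simple weighted scorecard so the parallel evaluations roll up into something comparable. A minimal sketch in Python; the criteria, weights, and ratings here are made-up examples, not recommendations:

```python
# Hypothetical rubric: criteria and weights are examples for illustration only.
WEIGHTS = {"connectors": 0.4, "lineage": 0.2, "monitoring": 0.2, "scalability": 0.2}

def score(ratings: dict[str, int], weights: dict[str, float] = WEIGHTS) -> float:
    """Combine 1-5 ratings per criterion into a single weighted score."""
    return sum(weights[c] * ratings[c] for c in weights)

# One evaluator's ratings for a candidate tool (made up).
candidate = {"connectors": 5, "lineage": 3, "monitoring": 4, "scalability": 4}
print(round(score(candidate), 2))  # roughly 4.2 out of 5
```

Whether you score numerically or just keep a shared checklist, the point is the same: agree on the criteria before the trial starts, not after.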
2. Get a simple pipeline running immediately
Pick something fairly simple and get it up and running from day one. This will help create an overall picture of logging, metadata, latency, CDC, and all the other things that come with a pipeline.
If you're lucky, you might even run into a platform error over the course of the 14 days and see how the vendor handles it. If it's an open source option, this can also help you understand whether you're prepared to manage these issues internally.
Key questions
- Do the documentation and user interface guide you through setting up permissions and keys, scheduling, schemas, and so on in an intuitive way, or do you have to contact the technical representative for help?
- When errors occur on the platform, are they obvious from the logs, or is it hard to tell whether the problem is you or the platform?
- How quickly are customers notified and issues resolved when the platform goes down?
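While the trial pipeline runs, it's worth putting actual numbers on latency rather than eyeballing it. One way is a small freshness check that compares the newest timestamp in the source with the newest one that has arrived in the destination. A minimal sketch with hypothetical table names and made-up timestamps; in practice the two timestamps would come from a query against each side:

```python
from datetime import datetime, timezone

def replication_lag_seconds(source_max_ts: datetime, dest_max_ts: datetime) -> float:
    """Seconds between the newest source row and the newest replicated row."""
    return (source_max_ts - dest_max_ts).total_seconds()

def freshness_report(checks: dict[str, tuple[datetime, datetime]], slo_seconds: float) -> dict[str, bool]:
    """Map each table to True if its replication lag is within the latency SLO."""
    return {
        table: replication_lag_seconds(src, dst) <= slo_seconds
        for table, (src, dst) in checks.items()
    }

# Example observations during the trial (timestamps are invented).
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
checks = {
    "orders": (now, datetime(2024, 1, 1, 11, 58, tzinfo=timezone.utc)),  # 2 min behind
    "events": (now, datetime(2024, 1, 1, 10, 0, tzinfo=timezone.utc)),   # 2 h behind
}
print(freshness_report(checks, slo_seconds=15 * 60))  # orders meets a 15-minute SLO, events doesn't
```

Running something like this a few times a day during the trial gives you a latency baseline to compare tools against.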
3. Create some end-to-end transformations
Some tools come with built-in dbt integrations; others allow fully custom Python-based transformations. Porting a few transformations, maybe even a somewhat complex one, end to end from your existing solution will give you a good sense of how heavy a lift it will be to move everything over, if it's even possible.
Key questions
- Can the tool land the data in the same shape it arrives in now, or will it change in ways that significantly affect downstream dependencies?
- Are there transformations you currently do before landing the data that can't be done in the tool (joining supplemental data sources, parsing multi-level unordered JSON, etc.) and will now have to be done in the database after landing?
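As a concrete example of a pre-landing transformation worth porting during the trial: flattening multi-level JSON into dotted column names is exactly the kind of logic a tool may or may not let you express. A sketch of that logic; the record structure and the choice to stringify lists are illustrative assumptions, not a prescription:

```python
import json

def flatten(record: dict, parent_key: str = "", sep: str = ".") -> dict:
    """Recursively flatten nested dicts into dotted keys; lists are kept as JSON strings."""
    items = {}
    for key, value in record.items():
        new_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            items.update(flatten(value, new_key, sep))
        elif isinstance(value, list):
            items[new_key] = json.dumps(value)  # defer list handling to the warehouse
        else:
            items[new_key] = value
    return items

raw = {"id": 7, "user": {"name": "Ada", "address": {"city": "Paris"}}, "tags": ["a", "b"]}
print(flatten(raw))
# {'id': 7, 'user.name': 'Ada', 'user.address.city': 'Paris', 'tags': '["a", "b"]'}
```

If the tool can't replicate something like this before landing, that work moves into post-load SQL, which is exactly what the key questions above are probing.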
4. Throw a non-native data source at it
Try ingesting something from a source or format that isn't natively supported (create some fixed-width files, or pick an internal tool that exports data in an unconventional way), or at least talk to your technical sales representative about how you could do it. Even if that's not an issue right now, it's worth understanding what the options are for implementing that functionality if something comes up.
Key questions
- When an unsupported source arises, will the tool have enough flexibility to create a solution within its framework?
- When you start adding custom features to the framework, do the same logging, error handling, state management, etc. apply?
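To manufacture a non-native source for this test, a few lines of Python can generate fixed-width files, plus a reference parser so you can check what the tool lands against what you expect. The layout and field names here are invented for the exercise:

```python
# Hypothetical fixed-width layout: (field name, column width) pairs, made up for testing.
LAYOUT = [("id", 6), ("name", 12), ("amount", 10)]

def to_fixed_width(rows: list[dict], layout=LAYOUT) -> str:
    """Render rows as left-padded fixed-width lines, truncating overlong values."""
    return "\n".join(
        "".join(str(row[name]).ljust(width)[:width] for name, width in layout)
        for row in rows
    )

def parse_fixed_width(text: str, layout=LAYOUT) -> list[dict]:
    """Reference parser: slice each line back into fields by column width."""
    records = []
    for line in text.splitlines():
        rec, pos = {}, 0
        for name, width in layout:
            rec[name] = line[pos:pos + width].strip()
            pos += width
        records.append(rec)
    return records

rows = [{"id": 1, "name": "widget", "amount": "19.99"},
        {"id": 2, "name": "gadget", "amount": "5.00"}]
print(parse_fixed_width(to_fixed_width(rows)))
```

Drop the generated file somewhere the tool can reach (S3, SFTP, wherever it supports) and compare the landed table against the parser's output.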
5. Force an error
Somewhere in one of the test processes you've created, add a poorly formatted file, add bad code in a transformation, change the schema, or wreak havoc in some other creative way to see what happens.
Third-party tools like these can be black boxes in some ways, and nothing is more frustrating when a pipeline fails than incomprehensible error messages.
Key questions
- Do the error messages and logs make it clear what went wrong and where?
- What happens to data that was in flight once a fix is in place? Does anything get lost or loaded more times than it should?
- Are there options to redirect the bad data and allow the rest of the process to continue?
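One easy way to wreak that havoc deliberately: take a clean file the pipeline already accepts and corrupt it in a few distinct ways, so you can see how each failure mode surfaces in the logs. A sketch with placeholder field names:

```python
import csv
import io

def make_bad_csv(rows: list[dict]) -> str:
    """Corrupt clean rows in three deliberate ways and render them as CSV."""
    rows = [dict(r) for r in rows]
    rows[0]["amount"] = "not-a-number"   # type mismatch
    rows[1]["surprise_col"] = "??"       # schema drift: unexpected column
    rows[2].pop("amount")                # missing required field
    buf = io.StringIO()
    fieldnames = sorted({k for r in rows for k in r})
    writer = csv.DictWriter(buf, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

clean = [{"id": i, "amount": f"{i * 1.5:.2f}"} for i in range(3)]
print(make_bad_csv(clean))
```

Feeding one failure mode at a time makes it much easier to match each corruption to the error message the platform produces.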
A couple of bonuses
Have a non-technical user ingest a Google Sheet
The need to integrate data from a manually loaded spreadsheet is a somewhat more common use case than DEs like to think. A tool should make it easy for the production team to do this without DEs getting involved at all.
Read the Reddit threads about the tool
I have found Reddit very helpful when researching tool options. People are often very reasonable in their evaluation of positive and negative experiences with a tool and open to answering questions. At the end of the day, even in an extensive testing phase you will miss things, and this can be an easy way to see if you have some blind spots.