Managing data at a company is never an easy feat. Doing so at a rapidly growing company is even more difficult: oftentimes it means adding tools and building processes as the amount of data you're working with increases exponentially. As this data and its documentation grow, so does the amount of knowledge new team members must absorb.
Since we work with data teams of all sizes, here are some of the best practices we'd recommend for teams serving startups and fast-growing companies.
1. Collect data from each source system
Collecting data from different sources can be challenging, especially when you’re in a fast-paced environment. It’s important to gather all your data into a single source of truth so that you don’t have to access disparate systems to get a complete view of your business.
When it comes to building out your data stack, there are two types of tools you should consider: ETL and warehouse tools. ETL (extract, transform, load) tools pull data from API sources and allow for some transformation before loading the data into a warehouse or database. Examples include Fivetran and Stitch. Warehouse tools—such as Snowflake, BigQuery or Redshift—are where the transformed data is stored and queried using SQL.
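To make the flow concrete, here is a minimal extract-transform-load sketch in Python. The source records are a hypothetical in-memory payload standing in for an API source, and SQLite stands in for a real warehouse like Snowflake, BigQuery, or Redshift:

```python
import sqlite3

# Extract: in practice this would be an API call to a source system;
# here a hypothetical payload stands in for the raw records.
raw_events = [
    {"id": 1, "email": "ADA@EXAMPLE.COM", "amount_cents": 1250},
    {"id": 2, "email": "grace@example.com", "amount_cents": 0},
]

# Transform: light cleanup before loading -- normalize casing,
# drop zero-amount events, convert cents to dollars.
clean = [
    (e["id"], e["email"].lower(), e["amount_cents"] / 100)
    for e in raw_events
    if e["amount_cents"] > 0
]

# Load: write into the "warehouse", where analysts can query it with SQL.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE payments (id INTEGER, email TEXT, amount REAL)")
warehouse.executemany("INSERT INTO payments VALUES (?, ?, ?)", clean)
total = warehouse.execute("SELECT SUM(amount) FROM payments").fetchone()[0]
```

In practice an ETL tool like Fivetran or Stitch handles the extract and load steps for you; the point of the sketch is only the shape of the pipeline.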
2. Use standard table and column names for your data model
When you're working with a SQL database, it helps to have some standard naming conventions for tables and columns. The following queries (MySQL syntax shown) can help you take stock of your data model as you build it:
- Use the "SHOW TABLES" query to create a list of all tables in the database. With this information, you can ensure that none of your tables are abbreviated and that there isn't another table present with almost the same name (for example, "users" and "users_2").
- Use the "DESCRIBE" query to create a list of columns for each table. If you have multiple teams using multiple databases within your company, this can be helpful in making sure that everyone is on the same page as far as how data should be structured across tables.
- Query "information_schema.key_column_usage" to create a list of all columns that are foreign keys in tables. This will make it easier to see which fields are mapped and which ones aren't.
- Use the "SHOW KEYS FROM" query to create a list of all columns that are primary keys in tables. Again, these queries will help you avoid confusion when building or modifying your database's structure later on down the line.
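The same audit can be scripted. Here is a minimal sketch using Python's sqlite3, where SQLite's introspection pragmas stand in for the MySQL statements above (the `users` and `orders` tables are hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        user_id INTEGER REFERENCES users(id)
    );
""")

# List all tables -- the SQLite analogue of MySQL's SHOW TABLES.
tables = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'")]

for table in tables:
    # Columns and primary keys -- the analogue of DESCRIBE / SHOW KEYS.
    cols = conn.execute(f"PRAGMA table_info({table})").fetchall()
    print(table, [(c[1], c[2], bool(c[5])) for c in cols])  # name, type, is_pk
    # Foreign keys -- the analogue of querying information_schema.
    for fk in conn.execute(f"PRAGMA foreign_key_list({table})"):
        print(f"  {table}.{fk[3]} -> {fk[2]}.{fk[4]}")
```

Running a script like this in CI is one way to catch abbreviated or near-duplicate names before they reach production.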
3. Disambiguate common column names between tables
As your company grows, the number of data sources available to you will also grow. It will be increasingly important for all users of your data to understand what variables mean across these different data sources. This way, a user like a finance analyst who may not be familiar with your web application can know what an event_id or session_id corresponds to in your database.
To help prevent confusion, requiring that columns always be referenced with their table names (e.g., `user.gender`) is a good practice: it removes ambiguity and can prevent errors when querying multiple tables at once.
However, if you use common column names across tables—for example, `date` or `email`—you should explicitly disambiguate them by prepending them with their table name and/or using more descriptive terms such as `creation_date`. This will prevent ambiguity in SQL queries and ensure that everyone who is writing SQL knows where the column is coming from without having to look it up elsewhere.
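Here is a small sketch of the idea, using hypothetical `users` and `orders` tables: descriptive names plus table qualification make a join self-documenting, where a bare `date` column in both tables would be ambiguous.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users  (id INTEGER, email TEXT, creation_date TEXT);
    CREATE TABLE orders (id INTEGER, user_id INTEGER, order_date TEXT);
    INSERT INTO users  VALUES (1, 'ada@example.com', '2023-01-05');
    INSERT INTO orders VALUES (10, 1, '2023-02-14');
""")

# Had both tables used a bare `date` column, `SELECT date` in this join
# would be ambiguous. Qualified, descriptive names remove the guesswork:
row = conn.execute("""
    SELECT users.creation_date, orders.order_date
    FROM users JOIN orders ON orders.user_id = users.id
""").fetchone()
```

Anyone reading the query, including that finance analyst, can tell which date is which without consulting a schema diagram.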
4. Create an ETL documentation page or an org-specific playbook
- You should strive to create an ETL documentation page or an organization-wide Wiki that explains how data is collected and transformed at your company.
- Your documentation should include:
  - Data sources: what they are, where they live, their purpose, who owns them
  - A glossary of terms used in your org or data domain (e.g., in a SaaS company, you might want to define a "segment" as distinct from a "group")
  - An overview of the architecture for ETL and analysis processes
For example, you could have a page dedicated to each of the projects below—or possibly even more granular pages within those projects:
- FinServ (finance service) — information about the usage data collection service that powers usage-based revenue intelligence products
- Segment — explain how events get into Segment and out of it into other tools
- Redshift — explain why we use Redshift and what its role is in our dashboarding/reporting processes
5. Stay up to date on the latest technology advancements in the data space
Even the best data teams are always looking for ways to improve their tech stack and processes, so be sure to stay on top of the latest advancements in technology. If your team is working with a new or unfamiliar technology, ask teammates whether they have any experience with it or know anyone who does. If not, attend meetups; if you're based in New York City like we are, there are tons of amazing ones to choose from. You can also keep an eye out for announcements of new technologies that may prove useful to you; these often come out of industry events like Strata+Hadoop World or AWS re:Invent. It may sound obvious, but reading industry blogs is also a great way to make sure you're staying up to date on the latest trends in your field!
6. Use one consistent data transformation framework
As you add engineers to your data team, you'll want everyone using the same transformation framework for their work. Doing so brings several benefits:
- Fewer bugs and other problems. A consistent transformation framework reduces the chance of individual engineers making mistakes due to not having all of the necessary context from the others.
- Easier onboarding. Engineers coming onto your data team will be able to quickly pick up how to perform data transformations because they'll know where to look for them and have a shared language with their teammates.
- Easier maintenance and scaling. If you have multiple transformations scattered throughout your codebase, it makes changing or troubleshooting them much more difficult (note: one common mistake is having transformations in SQL stored procedures). With one consistent framework, though, updating or fixing transformations can be very simple even as your company grows.
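One lightweight way to achieve this consistency is to define every transformation in one place, under one pattern. The sketch below is illustrative only (a real team would more likely adopt a framework such as dbt): a hypothetical registry where each transformation is a named, pure function, so there is a single convention to learn, find, and test.

```python
# All transformations live in one module, registered under stable names.
TRANSFORMS = {}

def transform(name):
    """Decorator that registers a transformation under a stable name."""
    def register(fn):
        TRANSFORMS[name] = fn
        return fn
    return register

@transform("normalize_email")
def normalize_email(row):
    row["email"] = row["email"].strip().lower()
    return row

@transform("cents_to_dollars")
def cents_to_dollars(row):
    row["amount"] = row.pop("amount_cents") / 100
    return row

def run_pipeline(row, steps):
    """Apply registered transformations to a row, in order."""
    for step in steps:
        row = TRANSFORMS[step](row)
    return row

result = run_pipeline(
    {"email": " ADA@Example.com ", "amount_cents": 1250},
    ["normalize_email", "cents_to_dollars"],
)
```

Because every transformation follows the same shape, onboarding engineers know exactly where to look, and there are no one-off stored procedures hiding in the database.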
7. Create a naming taxonomy for all your models and queries
As your company grows, so will the number of people who have access to your data. If you’ve done a good job securing and organizing your data warehouse, then you’ll have a lot more people using it. That’s why creating a naming taxonomy is key for keeping everything organized and understandable. What should you name things? Here are some ideas:
- Use snake case (e.g., “my_model”)
- Use descriptive names (e.g., “monthly_registration_counts_by_cohort”)
- Use a consistent prefix for all tables (e.g., “tbl_…”)
- Use a consistent prefix for all views (e.g., “vw_…”)
- Use a consistent prefix for all materialized views (e.g., “mvw_…”)
- Use a consistent prefix for all derived tables (e.g., “dtb_…”)
- Use a consistent prefix for all functions (e.g., "fn_…"); avoid hyphens, which break snake case.
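A taxonomy like this can also be enforced mechanically. Below is a minimal sketch with hypothetical prefix conventions (I use `fn_` for functions, since a hyphen would break snake case) that checks names against snake_case and the expected prefix for each object type:

```python
import re

# Hypothetical prefix conventions, one per object type.
PREFIXES = {"table": "tbl_", "view": "vw_", "materialized_view": "mvw_",
            "derived_table": "dtb_", "function": "fn_"}

SNAKE_CASE = re.compile(r"^[a-z][a-z0-9_]*$")

def is_valid_name(kind, name):
    """True if `name` is snake_case and carries the prefix for `kind`."""
    return bool(SNAKE_CASE.match(name)) and name.startswith(PREFIXES[kind])

checks = [
    is_valid_name("table", "tbl_monthly_registration_counts"),  # valid
    is_valid_name("view", "vw_active_users"),                   # valid
    is_valid_name("table", "MyModel"),                          # not snake_case
    is_valid_name("function", "fnctn-cleanup"),                 # hyphen + wrong prefix
]
```

A check like this can run in code review or CI, so naming drift gets caught before a model ships.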
8. Create a directory structure with descriptive names for all your model folders
You’re going to want a directory structure for your model folders with names that make sense. This is important because it makes it easy for others to find the information they need when working on an analysis or building new models. For example, if you are working on the “Customer Retention” project and have a “Customer Lifetime Value” model, your folder could be called something like “customer_retention/lifetime_value.”
People who are not familiar with the data team will be able to navigate easily after some basic orientation (e.g., where all Git repositories live and how many projects currently exist), and you can also build search functionality for more advanced users.
9. Make sure everyone knows where to find the documentation and which version of it is accurate
- Create a central location for all documentation.
- Use a versioning system so that users know which version is the latest.
- Notify users when changes are made to important documents.
10. Be kind to yourself when collaborating with others, and understand that not everyone learns in the same way you do!
When you’re collaborating with other teams and individuals, remember that not everyone learns in the same way as you do. Be patient when you explain your project to others: it may take a few rounds of questions and answers before they fully understand what you’re trying to accomplish.
If something isn’t clear to you when someone is explaining their data project, don’t be afraid to ask for more details. But don’t be surprised if the person on the other side of the conversation has trouble understanding why you don’t get how their project works! You each have a unique way of processing information.