The Data + AI conference in early June brought together data practitioners from across the globe, both in person and virtually. Over the course of four days, attendees had the opportunity to complete trainings, compete in hackathons, attend informational sessions, and hear about the new features coming to the Databricks Data Intelligence Platform. This year it became clear that Databricks is aiming to expand its community and extend its reach within organizations beyond the IT suite to increasingly include business users. While retaining its deep connection with core technical users, the company has introduced more features that enable a user-friendly GUI, point-and-click pipeline logic deployment, and marketplace provisioning. We can anticipate Databricks continuing to become more inclusive of all types of users as it grows, catering to its core fanbase while building a community of younger users through the new free edition and its pledge to increase training at the university level.

Governance Rules the Day

Databricks is all in on governance. Unity Catalog, Marketplace, Bricks, Metrics, writeback capabilities, and a whole host of other features are growing the organization's ability to deliver governance directly integrated with AI and data development. Some of these features, such as Metrics' ability to enforce standardization policies on data living outside of Databricks, are innovative compared to the majority of governance platforms today. However, there is still a lot of growth needed in change management and the user experience, especially as it pertains to the data stewards, governing bodies, and wider community needed to make governance effective at an enterprise scale.

Unity Catalog was made open source last year and is now integrated into every aspect of the Databricks platform. There are a variety of new features that could rival other governance programs, but Unity in and of itself is not something that can be used by non-technical users. Unity is now the central connector for the account console, metastore, lineage, discovery, access control, auditing, credential management, and monitoring. Currently the system is still focused on traditional roles like account admin and workspace admin, but there are new efforts to bring data stewards closer to the catalog and enable users through Apps and One.

Some additional new features of Unity Catalog include the following:

  • Metrics (In Preview): Engineers and stewards (theoretically) can define metrics in Unity Catalog not just as a business asset but also as a policy that can be enforced for the metric's name and its defined aliases. These can be built explicitly in code or using Genie's natural language interpretation. These metrics can also be pushed out to a variety of other data sources, including Snowflake and dbt.
  • Quality: There are some out-of-the-box quality capabilities already integrated within Unity, including Bricks that can be used for both evaluation and monitoring/auditing. This is fledgling currently but could turn into something similar to Apps, where partners and eventually clients would be able to contribute to the available frameworks. Quality capabilities also include time series, snapshot, and inference modes.
  • Traces: Models now have both bookend and internal tracing. Errors or successful completion are logged and identified not only at the input and output of the model but also at each step in between. This is a result of open-sourcing the system, and the payoff for developers is the ability to pinpoint errors within the pipeline instead of conducting extensive investigation in obscure logs or resorting to guess-and-check where logs did not exist (see the sketch after this list).
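As an illustration of step-level tracing, the following is a minimal sketch using MLflow's tracing decorator, which is one way to capture a span for each stage of a model pipeline rather than only at its bookends. The function names and the retrieval/generation logic are hypothetical placeholders, not Databricks APIs.

    import mlflow

    # Each decorated function emits its own span, so a failure can be pinned to a
    # specific step instead of being inferred from bookend logs alone.
    @mlflow.trace
    def retrieve_context(question: str) -> str:
        # Hypothetical retrieval step; swap in a real vector-search call.
        return "relevant documents for: " + question

    @mlflow.trace
    def generate_answer(question: str, context: str) -> str:
        # Hypothetical generation step; swap in a real model invocation.
        return f"Answer to '{question}' using {len(context)} characters of context"

    @mlflow.trace
    def answer_pipeline(question: str) -> str:
        context = retrieve_context(question)
        return generate_answer(question, context)

    if __name__ == "__main__":
        print(answer_pipeline("What changed in Unity Catalog this year?"))

Running this writes a trace with nested spans that can be inspected in the MLflow UI, the same pattern that surfaces step-level failures in a deployed model.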

Foundations Remain in Technical Users

While there was a significant focus on both AI governance and data governance throughout the conference, Databricks had plenty of new features for their base, including performance improvements, infrastructure updates, and streamlined engineering capabilities. There are also features like Bricks and Apps that allow for greater self-service for end users, removing that onus from developers and allowing them to focus on higher-value tasks.

Databricks has improved baseline DBSQL performance in a number of ways. Warehouse performance has been optimized to operate 25% more efficiently than before. Developers are now able to use variant and geospatial datatypes, opening the door for more complex and insightful analysis. The new Lakebridge capability eases migration from legacy systems, allowing developers to get to work quickly without overly complex migration pipelines. And finally, a partnership with Google has allowed Databricks to integrate Gemini into every aspect of the platform to assess models, provide recommendations, support code development, and more.
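To make the new semi-structured datatype concrete, the sketch below creates a VARIANT column, loads a JSON payload into it, and queries individual fields with path syntax. The table and column names are hypothetical, and availability of these functions depends on the runtime version in use.

    # Minimal sketch of the VARIANT datatype (assumes a Databricks notebook where
    # `spark` is available and the runtime supports VARIANT); table/columns are hypothetical.
    spark.sql("""
        CREATE OR REPLACE TABLE demo.events (
            event_id BIGINT,
            payload  VARIANT
        )
    """)

    spark.sql("""
        INSERT INTO demo.events
        SELECT 1, parse_json('{"device": {"os": "iOS", "version": 17}, "action": "click"}')
    """)

    # Navigate the semi-structured payload without defining a rigid schema up front.
    spark.sql("""
        SELECT event_id,
               payload:device:os      AS os,
               payload:action::string AS action
        FROM   demo.events
    """).show()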

Beyond pure performance, there are a number of new features that both ease and expand the platform's data engineering capabilities. Through the acquisition of Neon comes Lakebase, a new transactional database layer that separates storage from compute while reducing the transaction-to-analytics pipeline to effectively real time. This enables negligible- to no-latency applications built on top of transactional systems, with the purpose of supporting AI and multi-agent applications that need near-immediate input/output processing. Databricks is also moving toward a more software-development style of maintaining databases: because Lakebase is a robust, cloud-native version of Postgres with storage and compute separated, compute can branch and call up only the data it needs from low-cost storage at execution time, reducing cost and processing needs. This also allows developers to quickly stand up replicated database branches for development and testing without impacting others' code, ensure everyone is working from the same source of truth, and streamline deployment to production.
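Because Lakebase is Postgres-compatible, existing Postgres tooling should work against it. Below is a minimal sketch using the standard psycopg2 driver; the host, credentials, and table are hypothetical placeholders rather than documented Lakebase endpoints.

    # Minimal sketch: talking to a Postgres-compatible database (e.g., a Lakebase
    # instance) with a standard driver. Endpoint, credentials, and table are hypothetical.
    import psycopg2

    conn = psycopg2.connect(
        host="your-lakebase-instance.example.com",  # hypothetical endpoint
        port=5432,
        dbname="appdb",
        user="app_user",
        password="********",
        sslmode="require",
    )

    with conn, conn.cursor() as cur:
        # Ordinary OLTP write; per the platform's pitch, these rows become available
        # for analytics in near real time without a separate batch extract.
        cur.execute(
            "INSERT INTO orders (order_id, customer_id, total) VALUES (%s, %s, %s)",
            (1001, 42, 99.95),
        )
        cur.execute("SELECT count(*) FROM orders")
        print(cur.fetchone()[0])

    conn.close()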

Another big win for developers throughout the week was the set of changes and improvements to pipeline development. The company has relied upon Spark to run pipelines for years, but Spark work can become convoluted across the various applications, orchestrators, testing mechanisms, and other elements that need to be integrated together. Lakeflow remedies this disaggregation by integrating with Spark while also providing its own pipeline creation capabilities, combining them with the platform's out-of-the-box orchestration and CI/CD needs in a point-and-click platform. The new tool can also use Genie to develop technical pipelines from natural-language business logic, a feature that can be used by business users for initial pipeline development and bolstered by developers in the traditional backend. The system operates as inherently incremental to reduce cost and processing needs while ideally also speeding up pipeline runs (see the sketch below). As with all other aspects of the new Databricks infrastructure, Lakeflow pipelines integrate directly with Unity Catalog.
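For a sense of what incremental, declarative pipeline code looks like, here is a minimal sketch in the Delta Live Tables style that Lakeflow's declarative pipelines build on. The source path and table names are hypothetical, and Lakeflow's exact interfaces may differ from this sketch.

    # Minimal sketch of an incremental, declarative pipeline (DLT-style API).
    # Runs inside a Databricks pipeline, where `spark` is provided.
    import dlt
    from pyspark.sql import functions as F

    @dlt.table(comment="Raw orders ingested incrementally from cloud storage")
    def orders_raw():
        # Auto Loader picks up only new files, so reruns stay incremental.
        return (
            spark.readStream.format("cloudFiles")
            .option("cloudFiles.format", "json")
            .load("/Volumes/main/sales/orders_landing/")  # hypothetical path
        )

    @dlt.table(comment="Cleaned orders with basic quality filtering")
    @dlt.expect_or_drop("valid_total", "total >= 0")
    def orders_clean():
        return (
            dlt.read_stream("orders_raw")
            .withColumn("ingested_at", F.current_timestamp())
        )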

Finally, MLflow is the equivalent of Lakeflow for the data scientist. It brings all elements of the ML pipeline, including tracking, observability, and performance, into a single application that data scientists and analysts can use to develop new products and deploy them via the model registry. This year models have been expanded to include agents, and there is a new functions feature that modularizes the build of AI/ML tools that can be reused generically, e.g., sending a message to Slack, extracting data from a PDF, referencing a predefined connection, or accessing APIs to other models (a sketch follows below).
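As an illustration of what a reusable, tool-style function can look like, the sketch below registers a small Python function in Unity Catalog using SQL UDF syntax so it can be called from SQL, pipelines, or an agent. The catalog, schema, and function names are hypothetical, and this is only one way such modular tools can be packaged.

    # Minimal sketch: registering a reusable tool-style function in Unity Catalog as a
    # Python UDF (assumes a notebook where `spark` is available; names are hypothetical).
    spark.sql(r"""
        CREATE OR REPLACE FUNCTION main.tools.redact_email(text STRING)
        RETURNS STRING
        LANGUAGE PYTHON
        AS $$
            import re
            # Strip anything that looks like an email address before the text is
            # passed to a downstream model or application.
            return re.sub(r"[\w.+-]+@[\w-]+\.[\w.]+", "[REDACTED]", text)
        $$
    """)

    # Once registered, the function is governed like any other Unity Catalog asset.
    spark.sql(
        "SELECT main.tools.redact_email('Contact ada@example.com for details') AS cleaned"
    ).show(truncate=False)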

Other features added through MLflow include inference tables (automatically persisted prediction tables built from model endpoints) and Lakehouse Monitoring, which includes out-of-the-box monitoring built on Databricks' own schema with prebuilt dashboards and insights. With the integrations Databricks has been building into the platform, this could become significantly more comprehensive in the year to come.

Embrace of Non-Technical Users

Traditionally, Databricks has been a platform for highly technical data engineers and data scientists building sophisticated models that were often hidden from the lay user behind semantic layers or surfaced only as simple reports and dashboards. That is now changing as the organization sets its sights on enterprise adoption through Databricks Apps, Databricks One, and a new comprehensive training and partnership program.

Databricks Apps use existing Python app frameworks to support a more user-friendly experience of the Databricks platform. Overlaid on a number of the various Databricks microservices, this abstraction still does not provide the type of experience that would allow for enterprise rollout; instead it caters to more experienced business users who are uncomfortable in the highly technical environment of traditional Databricks. There are significant improvements to GUI interaction capabilities, but while this would serve a business analyst, a business user would not feel comfortable in this platform. The Apps begin to incorporate more traditional governance aspects like domains, metrics, and data stewards but still lack the ease of use expected and needed by non-technical users as they navigate an increasingly data-enriched landscape. They will begin to allow users to drive more comprehensive solutions in Databricks that incorporate not just the data and models but also the outcomes of those models. This shows that Databricks is beginning to grow into the data product space but will require additional development before full enterprise rollout.

Driving further into the business-user side of the equation, Databricks One is the new platform for end users, and Databricks is clearly aiming for the full-stack data solution. The Databricks One interface is streamlined and clean, one that non-technical users can feel comfortable interacting with on a regular basis without being intimidated or inundated with too much information.

Beyond product features, Databricks is also deploying a comprehensive training program, a free edition of the platform, and a $100M commitment to university programs to build user adoption in a new generation. This strategy will help grow the community, and together with the company's open source culture it signals an interest in joining the ranks of foundational data technologies like Python and R. These efforts will ensure there are students entering the workforce with Databricks experience who will advocate for the platform to be added to the stack of any organization they join.

Future Expectations

Databricks still has a long way to go before they can truly be called an end-to-end enterprise data solution, but they are well on their way and show no indication of pivoting from that strategy. Over the next few years we can anticipate more business user-focused features, additional point-and-click capabilities to bridge the gap between developers and end users, and an increase in partnerships to ensure that no matter an organization’s stack, Databricks is the central integrator.

With Databricks Apps and other community initiatives, we can also expect third-party development to expand and the Databricks external marketplace to become a first stop for users looking for third-party datasets, application overlays, and other prebuilt content. As Unity Catalog continues to expand and develop new features, it will also feed into those capabilities as less technical users become more comfortable with the well-documented, standardized functionality they get through Databricks, even when selecting from different vendors.

As the company continues to expand into business-user and self-serve capabilities, we will also likely see an offloading of repetitive and menial tasks that currently belong to data engineers and data scientists. That frees those more technical users to actually build new solutions and focus on strategic roadmaps. This will no doubt lead to the release of innovative and exciting features that continue to ease blockers like overly convoluted CI/CD as well as support advanced development like agentic AI and complex use-case resolution.

Databricks packed a lot into four days in the heart of San Francisco and rolled out a host of new features that will significantly impact the way users interact with the platform. More than just a promotional marketing exercise, though, the conference provided access to a community that is often disjointed and education on topics that can feel like a black box. As these new features roll out over the next few months, they will require individual testing to confirm that the smooth demos seen during the keynotes are as easy to deploy in practice, and future releases will need to be monitored to see whether the company sticks to its promoted roadmap. Overall, Data + AI was an energetic, informative meeting of the minds that I intend to turn into an annual habit.

Amanda Darcangelo is a Senior Data & Analytics Consultant at CTIData.

Contact us for more information on our Databricks Center of Excellence.
