Featured image of post Explore your DATA using AI

Explore your DATA using AI

As the world is becoming increasingly data driven, the need for powerful and intuitive tools to manage and interpret this data has never been more critical. This is where RosettaDB comes in, utilizing the advanced capabilities of OpenAI’s language models with the robustness of database management. Let’s explore how RosettaDB is setting a new standard for data exploration.

RosettaDB emerged as an open source tool in the landscape of database management, especially for migrating data across different platforms. It acts as a DDL (Data Definition Language) transpiler, enabling the translation of database schemas from one database system to another with no manual intervention.

Artificial Intelligence (AI), particularly in the form of Large Language Models (LLMs) like those developed by OpenAI, is becoming increasingly prolific. One of the most exciting applications of LLMs in the context of database management is their ability to generate SQL queries. This capability allows users, even those with minimal technical expertise, to interact with databases in a more intuitive way. By simply describing data needs in natural language, users can retrieve information without needing to know the SQL syntax.

RosettaDB introduces a groundbreaking feature: the “rosetta query …”. This functionality enables users to write queries in natural language to explore their data and even export results into a CSV file. This means that instead of wrestling with complex query syntax, users can ask questions in plain English and receive answers directly, greatly simplifying data analysis tasks.

RosettaDB doesn’t stop at queries. It also supports the generation of various DDL and DML scripts. This allows users to not only fetch data but also modify the database structure or manage the data itself through commands generated by the tool. To ensure security and integrity, RosettaDB restricts its operations to SELECT commands when executing queries generated from natural language. This prevents accidental data modifications or deletions, making it a safer choice for users who are experimenting with data queries or those in a learning phase.

In the following section we’ll learn how to use RosettaDB in practice to achieve the above mentioned capabilities.

  1. Download and configure RosettaDB on your machine. Download all the required JDBC drivers. For more details on this step please refer to the Getting Started section of RosettaDB docs https://github.com/AdaptiveScale/rosetta#getting-started
  2. Create a new rosetta project using the init command
rosetta init [PROJECT_NAME]

3. Edit the main.conf file to configure the database connection settings. At the top of the file, include your OpenAI API key and, optionally, specify the model you wish to use (default is gpt-3.5-turbo).

Note: You will need to register for an account with OpenAI if you don’t already have one.

Example:

openai_api_key: "sk-abcdefghijklmno1234567890"
openai_model: "gpt-4"
connections:
  - name: pg
    databaseName: postgres
    schemaName: rosseta_testing
    dbType: postgres
    url: jdbc:postgresql://<HOST>:<PORT>/<DATABASE>?user=<USER>&password=<PASSWORD>
    userName: <USER>
    password: <PASSWORD>

4. Run the rosetta extract command to generate the DBML models from the PostgreSQL database tables.

rosetta extract -s pg

This command analyzes your database schema and creates a workspace with the corresponding DBML model, which you can review and modify if needed.

5. Run the rosetta query command to write queries in natural language to explore the data, the output will be written in a CSV file

Examples:

rosetta query -s psg -q "Find the most borrowed book title."
rosetta query -s psg -q "Retrieve the names of students who have borrowed more than five books" --limit 10
rosetta query -s psg -q "Find the total number of books borrowed by each student" --output test.csv

As you can see in the examples we have a couple of arguments you can add for specific scenarios:

Additional Arguments:

  • -l --limit : Limit the number of rows in the response (Optional). The default value is 200.
  • --no-limit : No limit on the number of rows in the response (Optional).
  • --output : Specify the output directory or file (Optional).

By default, if you do not use the --output argument, the CSV files will be saved in the pg/data/ directory with a filename based on the query and a timestamp.

RosettaDB represents a significant leap forward in data management and exploration. By integrating OpenAI’s powerful LLM AI, RosettaDB makes it easier and more accessible for everyone to interact with data. Whether you’re a developer, a data scientist, or just someone curious about the insights hidden in your data, RosettaDB offers a versatile and user-friendly platform to explore and manipulate data efficiently. As data continues to drive decisions more than ever, tools like RosettaDB will become crucial in harnessing the power of information in the digital age.

Dive into RosettaDB and start transforming your data interaction experience today!