Connect to your data where it lives
About this Article
Working seamlessly with your unstructured and structured data is a core philosophy at Gitana.
Point Gitana at your structured and unstructured data. Gitana automatically builds a rich knowledge graph that organizes your world into entities, relationships, events, and provenance.
Gitana provides content governance and curation tools that let you quickly find, preview and approve updates. All of your changes are version controlled and audited. Push and pull the best ideas. Deploy with incremental delta-publishing and roll back, at any time, to last known good states.
The knowledge graph is then combined with our powerful axiomatic reasoning engine to generate a meaningful corpus-targeted, contextual and citation-rich dataset. This dataset is tested, verified and deployed to improve retrieval quality and answer reliability for your RAG applications and smart API endpoints.
Connect to your Data
Gitana lets you connect to your data where it lives.
Your data sources may include databases, files, wiki spaces, emails and more. They may include popular file formats such as CSVs, PDF files, Word documents and Parquet files. These data sources may live on-premises, within databases, or within data lakes or warehouses.
Gitana connects to those data sources and analyzes your data. It extracts key elements from your data and uses them to build a knowledge graph and ontology.
Alternatively, you can connect to Gitana from your external processes and tasks. For example, you may have an existing ingestion pipeline running in Apache Spark. This process can route data into Gitana for human-in-the-loop approval.
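For example, a hand-off from an external pipeline might look something like the sketch below. It assumes a hypothetical HTTP ingestion endpoint and hypothetical branch/approval parameters; the identifiers here are placeholders for illustration, not Gitana's actual API.

```python
# A minimal sketch (not Gitana's actual API): pushing extracted records from an
# external pipeline into a hypothetical ingestion endpoint for staged review.
import requests

GITANA_INGEST_URL = "https://example-gitana-host/ingest"  # hypothetical endpoint
API_TOKEN = "..."  # credentials for your tenant

def submit_for_review(records):
    """Send a batch of extracted records to a staging branch for human approval."""
    payload = {
        "branch": "spark-ingest",      # staged on a branch, not merged to main
        "requireApproval": True,       # hypothetical flag: route to a review task
        "records": records,
    }
    response = requests.post(
        GITANA_INGEST_URL,
        json=payload,
        headers={"Authorization": f"Bearer {API_TOKEN}"},
        timeout=30,
    )
    response.raise_for_status()
    return response.json()

# Example: the tail end of a Spark-style pipeline handing off rows for approval.
rows = [{"title": "Q2 supplier contract", "source": "s3://bucket/contracts/q2.pdf"}]
print(submit_for_review(rows))
```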
Understanding your Data
Once Gitana is able to connect to your data, it goes about the process of understanding it. Through a process of inference, Gitana works out the entities, relationships, fields and attributes of your content.
Your data is analyzed to determine the internal schema and how it fits into your existing content ontology or model. Duplicate information is discovered and resolved. Poorly structured or incomplete information is fixed or cleaned up.
This data is then run through a process of enrichment. Meaning is inferred from the ingested content through an axiomatic reasoning process. Metadata is attached to your ingested data to capture information such as authorship, timestamp, geographic markers and identifiers. External knowledge, such as tags, classifications, industry codes and public identifiers, is inferred and applied to your data in accordance with your knowledge graph policies. Provenance is captured at span-level and stored as metadata alongside the data itself.
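As a rough illustration, an enriched statement might carry its attached metadata and span-level provenance in a shape like the following. The field names and values here are illustrative only, not a fixed Gitana schema.

```python
# An illustrative shape for enrichment output: a statement carries descriptive
# metadata plus span-level provenance pointing back to its originating source.
enriched_statement = {
    "text": "Acme Metals supplies aluminium under the Q2 supplier contract.",
    "metadata": {
        "author": "A. Rivera",
        "timestamp": "2024-04-02T09:15:00Z",
        "geo": "DE",
        "tags": ["procurement", "supplier"],          # inferred per graph policy
        "classification": "supplier-agreement",
    },
    "provenance": [
        {
            "source": "s3://bucket/contracts/q2.pdf",
            "page": 4,
            "span": [1204, 1271],   # character offsets of the supporting text
        }
    ],
}

print(enriched_statement["provenance"][0]["source"])
```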
The result is a knowledge graph with nodes and associations that captures the entities, relationships and inferred structure of your data. The knowledge graph provides a clean, enriched view into your data. It offers full-text search, SQL-like query, graph traversal and vectorization so that your team can quickly glean insights, find things and make changes as they see fit.
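The sketch below is a simplified, in-memory stand-in for that node-and-association model. It is not Gitana's internal representation; it only shows how entities, relationships and a basic graph traversal fit together.

```python
# A simplified illustration of a node/association model and a basic traversal.
from dataclasses import dataclass, field

@dataclass
class Node:
    node_id: str
    node_type: str                      # e.g. "person", "contract", "supplier"
    properties: dict = field(default_factory=dict)

@dataclass
class Association:
    source: str                         # node_id of the source entity
    target: str                         # node_id of the target entity
    relation: str                       # e.g. "signed", "covers", "mentions"

nodes = {
    "n1": Node("n1", "person", {"name": "A. Rivera"}),
    "n2": Node("n2", "contract", {"title": "Q2 supplier contract"}),
    "n3": Node("n3", "supplier", {"name": "Acme Metals"}),
}
associations = [
    Association("n1", "n2", "signed"),
    Association("n2", "n3", "covers"),
]

def neighbors(node_id, relation=None):
    """Graph traversal: follow associations out of a node, optionally by relation."""
    return [
        nodes[a.target]
        for a in associations
        if a.source == node_id and (relation is None or a.relation == relation)
    ]

# Which suppliers are reachable from the contract that A. Rivera signed?
for contract in neighbors("n1", "signed"):
    for supplier in neighbors(contract.node_id):
        print(supplier.properties["name"])
```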
Axiomatic Reasoning
Gitana's Axiomatic Reasoning engine plays a big part in achieving everything described so far.
Our approach enables customers to define the axioms or "core truths" that hold together the world view of their data. These axioms inform the reasoning engine as it works through your unstructured data and attempts to understand and infer its inherent structure. As the reasoning engine works through your content, it consults the ontology, assertions and axioms you've declared to reason out relationships and properties that are consistent with the rest of your documents.
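As a toy illustration of the idea (not the reasoning engine itself), axioms can be thought of as small rules applied to extracted facts: some derive new relationships, others flag facts that contradict the declared world view. The facts and rules below are invented for the example.

```python
# A toy sketch of axiom-driven inference over extracted facts (illustrative only).
facts = [
    ("invoice-107", "mentions", "PO-5521"),
    ("invoice-107", "issued_by", "Acme Metals"),
    ("invoice-107", "issued_by", "Globex"),      # will conflict with an axiom below
]

# Axiom 1: mentioning a purchase-order identifier implies a relates_to link.
def infer_relates_to(facts):
    return [
        (subject, "relates_to", obj)
        for (subject, predicate, obj) in facts
        if predicate == "mentions" and obj.startswith("PO-")
    ]

# Axiom 2: an invoice is issued by exactly one supplier.
def check_single_issuer(facts):
    issuers = {}
    for (subject, predicate, obj) in facts:
        if predicate == "issued_by":
            issuers.setdefault(subject, set()).add(obj)
    return {s: v for s, v in issuers.items() if len(v) > 1}

print("inferred:", infer_relates_to(facts))
print("contradictions:", check_single_issuer(facts))
```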
Policies and Governance
Gitana connects to your data where it lives. As your data changes, your ingestion processes discover these changes and pull them in. This data is merged into or appended to your existing knowledge graph. This can include both schema/ontology changes and instance-level assertions, depending on your preference.
The imported knowledge is staged in branches. Your team can review these proposed changes, preview them and approve them for merge into the main line. Alternatively, you can use pull requests and scheduled publishing to organize the delivery of these changes as per your organizational needs.
This workflow-driven process provides the means for human-in-the-loop approval. You stay in control of what the truth is and always know where it came from and how it was acquired. Our reasoning engine automatically identifies contradictions and anomalies, flagging them via workflow. Content ingestion may be fully automated or may route to teams for human approval. Notifications (such as email or Slack) inform users of assigned tasks.
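Conceptually, the branch-and-approve flow works roughly like the sketch below. The data structures, diff and approval step are assumptions made for illustration; in practice this happens through Gitana's branching, pull-request and workflow features.

```python
# A high-level sketch of branch-based, human-in-the-loop curation:
# stage imported changes on a branch, review the diff, then merge to main.
main = {"supplier:acme": {"name": "Acme Metals", "tier": "A"}}

# Ingestion writes to a branch, never directly to main.
branch = dict(main)
branch["supplier:globex"] = {"name": "Globex", "tier": "B"}        # new entity
branch["supplier:acme"] = {"name": "Acme Metals", "tier": "A+"}    # proposed update

def diff(base, proposed):
    """What would merging this branch change?"""
    return {k: v for k, v in proposed.items() if base.get(k) != v}

pending = diff(main, branch)
print("proposed changes:", pending)

# A reviewer approves (or rejects) each proposed change before it reaches main.
approved_keys = {"supplier:globex"}     # reviewer accepted only the new entity
for key in approved_keys:
    main[key] = pending[key]

print("main after merge:", main)
```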
Every change to your Knowledge Graph is versioned and captured. A full change history is available for every document as well as for the knowledge graph as a whole. Quickly roll the entire branch forward or backward in time.
With enterprise access policies, you guarantee that only the right people work on the right content. Audit records capture every object and service-level operation so that you can assert regulatory compliance. Finally, content policies let you define records management lifecycle rules for your documents, such as redaction, expiration and deletion constraints.
Trained Data Sets
With the Knowledge Graph established and richly populated, you're now in a position to generate the corpus data sets that power your RAG applications.
The goal here is to provide your RAG applications with an ideal text/corpus data set to power precise and accurate answers for prompt building. This data set does not need to be fully inclusive of all the knowledge contained in your knowledge graph. Rather, it needs to be focused on the kinds of questions that the RAG application will need to answer.
In fact, you may find that you need multiple corpus data sets. You may have multiple RAG applications driven off of the same knowledge graph. Or you may need multiple data sets to reflect different perspectives or slants in the answers.
Data sets are generated from the source knowledge graph and then typically vectorized and written into target vector databases.
Incremental Data Set Updates
Gitana tracks the source citations so that generated corpus/data set elements can be sourced back to their originating entities and relationships.
This information allows Gitana to perform incremental updates to the vector databases that power your RAG applications. Gitana tracks every create, update and delete to your content. As such, incremental data set generation means that only the data set delta needs to be deployed out to the target vector database. Old records are purged, new records are created and updates are applied on-the-fly.
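The idea behind delta publishing can be sketched as follows. The snapshot comparison and the vector-store calls are illustrative placeholders, not the actual implementation.

```python
# A sketch of incremental delta publishing: compare the previously published data
# set with the newly generated one, then apply only the difference to the target
# vector database.
previous = {"doc-1": "v1", "doc-2": "v1", "doc-3": "v1"}   # id -> hash last published
current  = {"doc-1": "v1", "doc-2": "v2", "doc-4": "v1"}   # id -> hash just generated

created = [k for k in current if k not in previous]
updated = [k for k in current if k in previous and current[k] != previous[k]]
deleted = [k for k in previous if k not in current]

def upsert_vectors(ids):
    # In practice this would re-embed the records and write them to your vector
    # store (pgvector, Pinecone, etc.); here we just log the intent.
    print("upsert:", ids)

def purge_vectors(ids):
    print("delete:", ids)

upsert_vectors(created + updated)   # new and changed records are written
purge_vectors(deleted)              # stale records are removed from the index
```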
Powering your RAG applications
When you ingest your unstructured data into Gitana, the process of reasoning about, understanding and inferring the structure of your data provides you with a high-quality knowledge graph that is ideal for powering RAG applications.
Automated trained data sets provide your applications with a purpose-built RAG corpus containing documents that reflect meaningful statements and answers for your prompts. The corpus is indexed using hybrid techniques (including full-text search, SQL-like query, GraphQL and vector search) for precise, fast retrieval.
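As a simplified sketch of what hybrid retrieval means, the example below blends a lexical score with a vector-similarity score. The toy corpus, vectors and scoring functions are assumptions for illustration; a real deployment would use a production search index and embedding model.

```python
# A simplified sketch of hybrid retrieval: combine a keyword score with a vector
# similarity score so that both exact terms and semantic matches surface results.
import math

corpus = {
    "doc-1": {"text": "acme metals q2 supplier contract", "vector": [0.9, 0.1]},
    "doc-2": {"text": "employee travel policy",           "vector": [0.1, 0.8]},
}

def keyword_score(query, text):
    terms = set(query.lower().split())
    return len(terms & set(text.split())) / max(len(terms), 1)

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def hybrid_search(query, query_vector, alpha=0.5):
    """Blend lexical and semantic relevance; alpha weights the vector side."""
    results = []
    for doc_id, doc in corpus.items():
        score = (alpha * cosine(query_vector, doc["vector"])
                 + (1 - alpha) * keyword_score(query, doc["text"]))
        results.append((score, doc_id))
    return sorted(results, reverse=True)

print(hybrid_search("supplier contract", [0.85, 0.15]))
```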
The result is better retrieval built around context-aware, application-specific answers that are trained, scored and validated ahead of each publish. That way, you can be assured that your RAG application has the correct corpus to build the correct prompts in response to your customers.
In addition, multi-hop reasoning becomes more straightforward. The answers are precomputed during training, so complex questions that would otherwise require multi-document joins or multiple queries spanning disconnected sources are generated at publish time. They're delivered as fundamental statements of truth to the RAG application, which means they're fast and available in one hop.
Finally, contextually rich answers that are well-trained mean less hallucination and more grounding in consistent fact. Each retrieved document or statement carries explicit provenance with it, letting models cite sources and stick to the facts. The result is faster, more accurate retrieval that has improved customer impact at much lower token cost.
Putting it all to work
You can get started with Gitana today!
Point Gitana at a few representative sources. For example, you may point it at a database table, a folder of PDF files or a wiki space.
From this information, Gitana will produce a knowledge graph and ontology that reflects your entities, relationships, properties, paths and attributes.
Now, connect your existing DB and LLM endpoints. Kick off the process of training your corpus data sets from your knowledge graph. Preview and score the results and then iterate to train your corpus to be precisely what you want.
Finally, publish your trained corpus. Put it into production, assured that the corpus reflects your data properly and correctly.
Gitana turns your raw content into a living knowledge graph and a high-signal RAG corpus, so your models can retrieve less, understand more, and answer with confidence.