Code examples

HexTractor examples module for tabular data processing.

This module provides complete examples showing how to use HexTractor to transform tabular data into heterogeneous graphs. Two main cases are demonstrated:

  1. Single-table data processing where all data resides in one denormalized table
  2. Multi-table data processing where data is split across normalized tables

The examples show common patterns such as:

  - Creating node and edge type parameters
  - Handling multi-value columns
  - De-duplicating entities
  - Joining data across tables
  - Building graph specifications

Example Usage:

from hextractor.examples.single_table import create_single_table_graph
graph = create_single_table_graph()

from hextractor.examples.multi_table import create_multi_table_graph
graph = create_multi_table_graph()

See the individual modules for more detailed examples and documentation.

Data sources

Example datasets for demonstrating HexTractor functionality.

This module provides sample datasets in both single-table and multi-table formats that demonstrate common patterns in heterogeneous graph extraction.

The data represents a simple company-employee-tag relationship graph where:

  - Companies have employees and tags
  - Companies have attributes (employee count, revenue)
  - Employees have attributes (occupation, age) and a label (promotion)
  - Tags are simple identifiers

The same data is provided in two formats:

  1. Single denormalized table with all relationships
  2. Multiple normalized tables (companies, employees, tags, relationships)
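
To make the relationship between the two formats concrete, here is a minimal sketch in plain Python (no HexTractor calls) that splits a few denormalized rows into a de-duplicated company table and a company-employee junction table. The three-column tuple layout and the variable names are assumptions made for this illustration only.

```python
# Denormalized rows: (company_id, company_revenue, employee_id).
# A trimmed-down, hypothetical version of the example data.
rows = [
    (1, 1000, 0),
    (1, 1000, 1),
    (2, 100000, 4),
]

# Company attributes repeat on every row, so keep one entry per company_id.
companies = {}
for company_id, revenue, _employee_id in rows:
    companies.setdefault(company_id, revenue)

# The junction table keeps one (company_id, employee_id) pair per row.
company_employees = [(company_id, employee_id) for company_id, _, employee_id in rows]

print(companies)          # {1: 1000, 2: 100000}
print(company_employees)  # [(1, 0), (1, 1), (2, 4)]
```

The normalized form stores each company once, while the denormalized form repeats it on every employee row.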

get_multi_table_data()

Generate example data split across multiple normalized tables.

Returns:

dict of {str: pd.DataFrame}
    Dictionary containing DataFrames:
    - companies: Company information (id, employees, revenue)
    - employees: Employee information (id, occupation, age, promotion)
    - tags: Tag IDs
    - company_employees: Company-employee relationships
    - company_tags: Company-tag relationships

Source code in hextractor/examples/data.py
def get_multi_table_data() -> Dict[str, pd.DataFrame]:
    """Generate example data split across multiple normalized tables.

    Returns
    -------
    dict of {str: pd.DataFrame}
        Dictionary containing DataFrames:
        - companies: Company information (id, employees, revenue)
        - employees: Employee information (id, occupation, age, promotion)
        - tags: Tag IDs
        - company_employees: Company-employee relationships
        - company_tags: Company-tag relationships
    """
    companies = pd.DataFrame(
        {
            "company_id": [1, 2],
            "company_employees": [100, 5000],
            "company_revenue": [1000, 100000],
        }
    )

    employees = pd.DataFrame(
        {
            "employee_id": [0, 1, 3, 4, 5, 6],
            "employee_occupation": [0, 1, 3, 1, 1, 4],
            "employee_age": [25, 35, 45, 18, 20, 31],
            "employee_promotion": [0, 1, 0, 1, 1, 0],
        }
    )

    tags = pd.DataFrame({"tag": [1, 2, 3, 4]})

    company_employees = pd.DataFrame(
        {
            "company_id": [1, 1, 1, 2, 2, 2],
            "employee_id": [0, 1, 3, 4, 5, 6],
        }
    )

    company_tags = pd.DataFrame(
        {
            "company_id": [1, 1, 1, 2, 2, 2],
            "tags": [[1, 2, 3], [1, 2], [3, 4], [1, 4], [1, 1], [1, 2]],
        }
    )

    return {
        "companies": companies,
        "employees": employees,
        "tags": tags,
        "company_employees": company_employees,
        "company_tags": company_tags,
    }
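
The junction tables reference entities by ID, so reading them together amounts to a join. The sketch below uses plain dictionaries and lists as stand-ins for the `employees` and `company_employees` tables (values copied from the example data above). It only illustrates the idea and is not a description of how HexTractor resolves IDs internally.

```python
# Stand-ins for two of the normalized tables (values from the example data).
employee_age = {0: 25, 1: 35, 3: 45, 4: 18, 5: 20, 6: 31}
company_employees = [(1, 0), (1, 1), (1, 3), (2, 4), (2, 5), (2, 6)]

# Join: look up each employee's age through the junction table
# and group the results by company.
ages_by_company = {}
for company_id, employee_id in company_employees:
    ages_by_company.setdefault(company_id, []).append(employee_age[employee_id])

print(ages_by_company)  # {1: [25, 35, 45], 2: [18, 20, 31]}
```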

get_single_table_data()

Generate example data in single denormalized table format.

The table contains company data duplicated across rows, one row per company-employee relationship. Companies can have multiple tags stored as lists in the tags column.

Returns:

DataFrame
    DataFrame with columns:
    - company_id (int): Unique company identifier
    - company_employees (int): Number of employees
    - company_revenue (int): Company revenue
    - employee_id (int): Unique employee identifier
    - employee_occupation (int): Employee occupation code
    - employee_age (int): Employee age
    - employee_promotion (int): Binary promotion label
    - tags (List[int]): List of tag IDs for the company

Source code in hextractor/examples/data.py
def get_single_table_data() -> pd.DataFrame:
    """Generate example data in single denormalized table format.

    The table contains company data duplicated across rows, one row per
    company-employee relationship. Companies can have multiple tags stored
    as lists in the tags column.

    Returns
    -------
    pd.DataFrame
        DataFrame with columns:
        - company_id (int): Unique company identifier
        - company_employees (int): Number of employees
        - company_revenue (int): Company revenue
        - employee_id (int): Unique employee identifier
        - employee_occupation (int): Employee occupation code
        - employee_age (int): Employee age
        - employee_promotion (int): Binary promotion label
        - tags (List[int]): List of tag IDs for the company
    """
    return pd.DataFrame(
        [
            (1, 100, 1000, 0, 0, 25, 0, [1, 2, 3]),
            (1, 100, 1000, 1, 1, 35, 1, [1, 2]),
            (1, 100, 1000, 3, 3, 45, 0, [3, 4]),
            (2, 5000, 100000, 4, 1, 18, 1, [1, 4]),
            (2, 5000, 100000, 5, 1, 20, 1, [1, 1]),
            (2, 5000, 100000, 6, 4, 31, 0, [1, 2]),
        ],
        columns=[
            "company_id",
            "company_employees",
            "company_revenue",
            "employee_id",
            "employee_occupation",
            "employee_age",
            "employee_promotion",
            "tags",
        ],
    )
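
Because `tags` holds a list on every row, the set of distinct tag entities has to be recovered by flattening that column. A minimal plain-Python sketch of that step, using only the `company_id` and `tags` values from the rows above (the tuple layout is an assumption of this sketch):

```python
# (company_id, tags) pairs taken from the example rows above.
rows = [
    (1, [1, 2, 3]),
    (1, [1, 2]),
    (1, [3, 4]),
    (2, [1, 4]),
    (2, [1, 1]),
    (2, [1, 2]),
]

# Flatten the list-valued column and keep each tag ID once.
unique_tags = sorted({tag for _company_id, tags in rows for tag in tags})

print(unique_tags)  # [1, 2, 3, 4]
```

The result matches the contents of the `tags` table in the multi-table format.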

Commands and specs

Utility functions for creating graph specifications.

This module provides helper functions for creating node and edge parameters used in both single-table and multi-table examples. These utilities help reduce code duplication and standardize parameter creation.

create_company_employee_edge_params(company_id_col='company_id', employee_id_col='employee_id')

Create edge parameters for company-employee relationships.

Parameters:

company_id_col : str, default 'company_id'
    Column name for company IDs
employee_id_col : str, default 'employee_id'
    Column name for employee IDs

Returns:

EdgeTypeParams
    EdgeTypeParams configured for company-employee edges

Source code in hextractor/examples/utils.py
def create_company_employee_edge_params(
    company_id_col: str = "company_id", employee_id_col: str = "employee_id"
) -> structures.EdgeTypeParams:
    """Create edge parameters for company-employee relationships.

    Parameters
    ----------
    company_id_col : str
        Column name for company IDs
    employee_id_col : str
        Column name for employee IDs

    Returns
    -------
    structures.EdgeTypeParams
        EdgeTypeParams configured for company-employee edges
    """
    return structures.EdgeTypeParams(
        edge_type_name="has",
        source_name="company",
        target_name="employee",
        source_id_col=company_id_col,
        target_id_col=employee_id_col,
    )

create_company_node_params(id_col='company_id', employees_col='company_employees', revenue_col='company_revenue')

Create node parameters for company entities.

Parameters:

id_col : str, default 'company_id'
    Column name for company ID
employees_col : str, default 'company_employees'
    Column name for employee count
revenue_col : str, default 'company_revenue'
    Column name for company revenue

Returns:

NodeTypeParams
    NodeTypeParams configured for company nodes

Source code in hextractor/examples/utils.py
def create_company_node_params(
    id_col: str = "company_id",
    employees_col: str = "company_employees",
    revenue_col: str = "company_revenue",
) -> structures.NodeTypeParams:
    """Create node parameters for company entities.

    Parameters
    ----------
    id_col : str
        Column name for company ID
    employees_col : str
        Column name for employee count
    revenue_col : str
        Column name for company revenue

    Returns
    -------
    structures.NodeTypeParams
        NodeTypeParams configured for company nodes
    """
    return structures.NodeTypeParams(
        node_type_name="company",
        id_col=id_col,
        attributes=(employees_col, revenue_col),
        attr_type="float",
    )

create_company_tag_edge_params(company_id_col='company_id', tag_id_col='tags', multivalue=True)

Create edge parameters for company-tag relationships.

Parameters:

company_id_col : str, default 'company_id'
    Column name for company IDs
tag_id_col : str, default 'tags'
    Column name for tag IDs
multivalue : bool, default True
    Whether tags are stored as lists of values

Returns:

EdgeTypeParams
    EdgeTypeParams configured for company-tag edges

Source code in hextractor/examples/utils.py
def create_company_tag_edge_params(
    company_id_col: str = "company_id",
    tag_id_col: str = "tags",
    multivalue: bool = True,
) -> structures.EdgeTypeParams:
    """Create edge parameters for company-tag relationships.

    Parameters
    ----------
    company_id_col : str
        Column name for company IDs
    tag_id_col : str
        Column name for tag IDs
    multivalue : bool
        Whether tags are stored as lists of values

    Returns
    -------
    structures.EdgeTypeParams
        EdgeTypeParams configured for company-tag edges
    """
    return structures.EdgeTypeParams(
        edge_type_name="has",
        source_name="company",
        target_name="tag",
        source_id_col=company_id_col,
        target_id_col=tag_id_col,
        multivalue_target=multivalue,
    )
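
The `multivalue` flag says that the target column stores lists of tag IDs, so each row conceptually stands for several company-tag pairs. The sketch below expands such rows into individual pairs in plain Python. `expand_multivalue_edges` is a hypothetical helper written for this illustration, and dropping repeated pairs is an assumption of the sketch; the documentation above does not state how HexTractor treats duplicates.

```python
def expand_multivalue_edges(rows):
    """Turn (source_id, [target_ids]) rows into unique (source_id, target_id) pairs."""
    edges = []
    seen = set()
    for source_id, target_ids in rows:
        for target_id in target_ids:
            pair = (source_id, target_id)
            if pair not in seen:  # drop repeated pairs, keep first-seen order
                seen.add(pair)
                edges.append(pair)
    return edges


rows = [(1, [1, 2, 3]), (1, [1, 2]), (2, [1, 1])]
print(expand_multivalue_edges(rows))
# [(1, 1), (1, 2), (1, 3), (2, 1)]
```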

create_dataframe_specs(name, df, node_params=None, edge_params=None)

Create DataFrame specifications for a data source.

Parameters:

name : str, required
    Name identifier for the data source
df : DataFrame, required
    Source DataFrame
node_params : Optional[Tuple[NodeTypeParams, ...]], default None
    Tuple of NodeTypeParams for entities in the DataFrame
edge_params : Optional[Tuple[EdgeTypeParams, ...]], default None
    Tuple of EdgeTypeParams for relationships in the DataFrame

Returns:

DataFrameSpecs
    DataFrameSpecs configured with the provided parameters

Source code in hextractor/examples/utils.py
def create_dataframe_specs(
    name: str,
    df: pd.DataFrame,
    node_params: Optional[Tuple[structures.NodeTypeParams, ...]] = None,
    edge_params: Optional[Tuple[structures.EdgeTypeParams, ...]] = None,
) -> data_sources.DataFrameSpecs:
    """Create DataFrame specifications for a data source.

    Parameters
    ----------
    name : str
        Name identifier for the data source
    df : pd.DataFrame
        Source DataFrame
    node_params : Optional[Tuple[structures.NodeTypeParams, ...]]
        Tuple of NodeTypeParams for entities in the DataFrame
    edge_params : Optional[Tuple[structures.EdgeTypeParams, ...]]
        Tuple of EdgeTypeParams for relationships in the DataFrame

    Returns
    -------
    data_sources.DataFrameSpecs
        DataFrameSpecs configured with the provided parameters
    """
    return data_sources.DataFrameSpecs(
        name=name,
        node_params=node_params or tuple(),
        edge_params=edge_params or tuple(),
        data_frame=df,
    )
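
The `node_params or tuple()` idiom above replaces a `None` default with an empty tuple, so a caller may omit either argument. A stand-alone sketch of the same pattern follows. It uses a hypothetical `Specs` dataclass and `make_specs` function rather than the real `DataFrameSpecs`, and plain strings in place of parameter objects.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Specs:
    """Hypothetical stand-in for a specification object."""

    name: str
    node_params: Tuple[str, ...]
    edge_params: Tuple[str, ...]


def make_specs(
    name: str,
    node_params: Optional[Tuple[str, ...]] = None,
    edge_params: Optional[Tuple[str, ...]] = None,
) -> Specs:
    # `None or tuple()` evaluates to `()`, so downstream code can always iterate.
    return Specs(name, node_params or tuple(), edge_params or tuple())


specs = make_specs("company_employees", edge_params=("has",))
print(specs.node_params)  # ()
print(specs.edge_params)  # ('has',)
```

Using `None` as the default also avoids sharing one mutable default object between calls, although with tuples that risk would not arise anyway.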

create_employee_node_params(id_col='employee_id', occupation_col='employee_occupation', age_col='employee_age', promotion_col='employee_promotion')

Create node parameters for employee entities.

Parameters:

id_col : str, default 'employee_id'
    Column name for employee ID
occupation_col : str, default 'employee_occupation'
    Column name for occupation code
age_col : str, default 'employee_age'
    Column name for employee age
promotion_col : str, default 'employee_promotion'
    Column name for promotion label

Returns:

NodeTypeParams
    NodeTypeParams configured for employee nodes

Source code in hextractor/examples/utils.py
def create_employee_node_params(
    id_col: str = "employee_id",
    occupation_col: str = "employee_occupation",
    age_col: str = "employee_age",
    promotion_col: str = "employee_promotion",
) -> structures.NodeTypeParams:
    """Create node parameters for employee entities.

    Parameters
    ----------
    id_col : str
        Column name for employee ID
    occupation_col : str
        Column name for occupation code
    age_col : str
        Column name for employee age
    promotion_col : str
        Column name for promotion label

    Returns
    -------
    structures.NodeTypeParams
        NodeTypeParams configured for employee nodes
    """
    return structures.NodeTypeParams(
        node_type_name="employee",
        id_col=id_col,
        attributes=(occupation_col, age_col),
        label_col=promotion_col,
        attr_type="long",
    )
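
Here `attributes` and `label_col` separate the feature columns from the prediction target. As a rough, library-independent illustration (the list-of-dictionaries layout is an assumption made for this sketch), the split looks like this:

```python
# Employee records keyed by column name (first three rows of the example data).
employees = [
    {"employee_id": 0, "employee_occupation": 0, "employee_age": 25, "employee_promotion": 0},
    {"employee_id": 1, "employee_occupation": 1, "employee_age": 35, "employee_promotion": 1},
    {"employee_id": 3, "employee_occupation": 3, "employee_age": 45, "employee_promotion": 0},
]

attribute_cols = ("employee_occupation", "employee_age")
label_col = "employee_promotion"

# One feature vector per employee, plus a parallel list of labels.
features = [[row[col] for col in attribute_cols] for row in employees]
labels = [row[label_col] for row in employees]

print(features)  # [[0, 25], [1, 35], [3, 45]]
print(labels)    # [0, 1, 0]
```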

create_tag_node_params(id_col='tags', multivalue=True)

Create node parameters for tag entities.

Parameters:

id_col : str, default 'tags'
    Column name containing tag IDs
multivalue : bool, default True
    Whether tags are stored as lists of values

Returns:

NodeTypeParams
    NodeTypeParams configured for tag nodes

Source code in hextractor/examples/utils.py
def create_tag_node_params(
    id_col: str = "tags", multivalue: bool = True
) -> structures.NodeTypeParams:
    """Create node parameters for tag entities.

    Parameters
    ----------
    id_col : str
        Column name containing tag IDs
    multivalue : bool
        Whether tags are stored as lists of values

    Returns
    -------
    structures.NodeTypeParams
        NodeTypeParams configured for tag nodes
    """
    return structures.NodeTypeParams(
        node_type_name="tag",
        id_col=id_col,
        multivalue_source=multivalue,
    )

Single-table case

Single table data processing example.

This module demonstrates how to use HexTractor to extract a heterogeneous graph from a single denormalized table containing all entities and relationships. The example shows how to handle:

  - Multiple entity types in one table
  - Entity de-duplication
  - Multi-value columns (tags)

create_single_table_graph(df=None)

Extract a heterogeneous graph from a single denormalized table.

This function demonstrates the complete workflow of:

  1. Creating node type parameters
  2. Creating edge type parameters
  3. Creating DataFrame specifications
  4. Creating graph specifications
  5. Extracting the final graph

Parameters:

df : DataFrame, default None
    DataFrame containing all entities and relationships. If None, uses example data from get_single_table_data().

Returns:

HeterogeneousGraph
    Extracted heterogeneous graph

Examples:

Basic usage:

from hextractor.examples.single_table import create_single_table_graph
graph = create_single_table_graph()

With custom data:

import pandas as pd
df = pd.DataFrame({...})  # Your data
graph = create_single_table_graph(df)

Source code in hextractor/examples/single_table.py
def create_single_table_graph(df: Optional[pd.DataFrame] = None):
    """Extract a heterogeneous graph from a single denormalized table.

    This function demonstrates the complete workflow of:
    1. Creating node type parameters
    2. Creating edge type parameters
    3. Creating DataFrame specifications
    4. Creating graph specifications
    5. Extracting the final graph

    Parameters
    ----------
    df : pd.DataFrame, optional
        DataFrame containing all entities and relationships.
        If None, uses example data from get_single_table_data().

    Returns
    -------
    HeterogeneousGraph
        Extracted heterogeneous graph

    Examples
    --------
    Basic usage:
    ```python
    from hextractor.examples.single_table import create_single_table_graph
    graph = create_single_table_graph()
    ```

    With custom data:
    ```python
    import pandas as pd
    df = pd.DataFrame({...})  # Your data
    graph = create_single_table_graph(df)
    ```
    """
    specs = create_single_table_specs(df)
    return hextract.extract_data(specs)

create_single_table_specs(df=None)

Create graph specifications for single table processing.

Parameters:

df : DataFrame, default None
    DataFrame containing all entities and relationships. If None, uses example data from get_single_table_data().

Returns:

GraphSpecs
    GraphSpecs configured for single table processing

Examples:

Basic usage:

from hextractor.examples.single_table import create_single_table_specs
specs = create_single_table_specs()

With custom data:

import pandas as pd
df = pd.DataFrame({...})  # Your data
specs = create_single_table_specs(df)

Source code in hextractor/examples/single_table.py
def create_single_table_specs(
    df: Optional[pd.DataFrame] = None,
) -> data_sources.GraphSpecs:
    """Create graph specifications for single table processing.

    Parameters
    ----------
    df : pd.DataFrame, optional
        DataFrame containing all entities and relationships.
        If None, uses example data from get_single_table_data().

    Returns
    -------
    data_sources.GraphSpecs
        GraphSpecs configured for single table processing

    Examples
    --------
    Basic usage:
    ```python
    from hextractor.examples.single_table import create_single_table_specs
    specs = create_single_table_specs()
    ```

    With custom data:
    ```python
    import pandas as pd
    df = pd.DataFrame({...})  # Your data
    specs = create_single_table_specs(df)
    ```
    """
    if df is None:
        df = get_single_table_data()

    # Create node parameters
    company_params = create_company_node_params()
    employee_params = create_employee_node_params()
    tag_params = create_tag_node_params()

    # Create edge parameters
    company_employee_edges = create_company_employee_edge_params()
    company_tag_edges = create_company_tag_edge_params()

    # Create DataFrame specifications
    df_specs = create_dataframe_specs(
        name="single_table",
        df=df,
        node_params=(company_params, employee_params, tag_params),
        edge_params=(company_employee_edges, company_tag_edges),
    )

    # Create and return graph specifications
    return data_sources.GraphSpecs(data_sources=(df_specs,))
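
Since the parameter helpers are called with their defaults here, custom data needs the same column names as the example table. A small pre-flight check can catch a mismatch early. `missing_columns` and `REQUIRED_COLUMNS` below are hypothetical names written for this sketch, not part of HexTractor.

```python
# Column names the default parameter helpers refer to.
REQUIRED_COLUMNS = (
    "company_id",
    "company_employees",
    "company_revenue",
    "employee_id",
    "employee_occupation",
    "employee_age",
    "employee_promotion",
    "tags",
)


def missing_columns(columns):
    """Return the required column names absent from `columns`, in order."""
    present = set(columns)
    return [name for name in REQUIRED_COLUMNS if name not in present]


# With a pandas DataFrame you would pass `df.columns`.
print(missing_columns(["company_id", "employee_id", "tags"]))
# ['company_employees', 'company_revenue', 'employee_occupation', 'employee_age', 'employee_promotion']
```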

Multi-table case

Multi table data processing example.

This module demonstrates how to use HexTractor to extract a heterogeneous graph from multiple normalized tables. This represents a typical relational database scenario where:

  - Each entity type has its own table (companies, employees, tags)
  - Relationships are stored in separate junction tables
  - Data is normalized to avoid duplication

create_multi_table_graph(tables=None)

Extract a heterogeneous graph from multiple normalized tables.

This function demonstrates the complete workflow of:

  1. Creating node type parameters for each entity table
  2. Creating edge type parameters for each relationship table
  3. Creating DataFrame specifications for each table
  4. Creating graph specifications combining all tables
  5. Extracting the final graph

Parameters:

tables : dict of {str: pd.DataFrame}, default None
    Dictionary of DataFrames containing entities and relationships. If None, uses example data from get_multi_table_data().

Returns:

HeterogeneousGraph
    Extracted heterogeneous graph

Examples:

Basic usage:

from hextractor.examples.multi_table import create_multi_table_graph
graph = create_multi_table_graph()

With custom data:

tables = {
    'companies': companies_df,
    'employees': employees_df,
    'tags': tags_df,
    'company_employees': company_employees_df,
    'company_tags': company_tags_df
}
graph = create_multi_table_graph(tables)

Source code in hextractor/examples/multi_table.py
def create_multi_table_graph(tables: Optional[Dict[str, pd.DataFrame]] = None):
    """Extract a heterogeneous graph from multiple normalized tables.

    This function demonstrates the complete workflow of:
    1. Creating node type parameters for each entity table
    2. Creating edge type parameters for each relationship table
    3. Creating DataFrame specifications for each table
    4. Creating graph specifications combining all tables
    5. Extracting the final graph

    Parameters
    ----------
    tables : dict of {str: pd.DataFrame}, optional
        Dictionary of DataFrames containing entities and relationships.
        If None, uses example data from get_multi_table_data().

    Returns
    -------
    HeterogeneousGraph
        Extracted heterogeneous graph

    Examples
    --------
    Basic usage:
    ```python
    from hextractor.examples.multi_table import create_multi_table_graph
    graph = create_multi_table_graph()
    ```

    With custom data:
    ```python
    tables = {
        'companies': companies_df,
        'employees': employees_df,
        'tags': tags_df,
        'company_employees': company_employees_df,
        'company_tags': company_tags_df
    }
    graph = create_multi_table_graph(tables)
    ```
    """
    specs = create_multi_table_specs(tables)
    return hextract.extract_data(specs)

create_multi_table_specs(tables=None)

Create graph specifications for multi-table processing.

Parameters:

tables : dict of {str: pd.DataFrame}, default None
    Dictionary containing DataFrames:
    - companies: Company information
    - employees: Employee information
    - tags: Tag information
    - company_employees: Company-employee relationships
    - company_tags: Company-tag relationships
    If None, uses example data from get_multi_table_data().

Returns:

GraphSpecs
    GraphSpecs configured for multi-table processing

Examples:

Basic usage:

from hextractor.examples.multi_table import create_multi_table_specs
specs = create_multi_table_specs()

With custom data:

tables = {
    'companies': companies_df,
    'employees': employees_df,
    'tags': tags_df,
    'company_employees': company_employees_df,
    'company_tags': company_tags_df
}
specs = create_multi_table_specs(tables)

Source code in hextractor/examples/multi_table.py
def create_multi_table_specs(
    tables: Optional[Dict[str, pd.DataFrame]] = None,
) -> data_sources.GraphSpecs:
    """Create graph specifications for multi-table processing.

    Parameters
    ----------
    tables : dict of {str: pd.DataFrame}, optional
        Dictionary containing DataFrames:
        - companies: Company information
        - employees: Employee information
        - tags: Tag information
        - company_employees: Company-employee relationships
        - company_tags: Company-tag relationships
        If None, uses example data from get_multi_table_data().

    Returns
    -------
    data_sources.GraphSpecs
        GraphSpecs configured for multi-table processing

    Examples
    --------
    Basic usage:
    ```python
    from hextractor.examples.multi_table import create_multi_table_specs
    specs = create_multi_table_specs()
    ```

    With custom data:
    ```python
    tables = {
        'companies': companies_df,
        'employees': employees_df,
        'tags': tags_df,
        'company_employees': company_employees_df,
        'company_tags': company_tags_df
    }
    specs = create_multi_table_specs(tables)
    ```
    """
    if tables is None:
        tables = get_multi_table_data()

    # Create node parameters with appropriate column names
    company_params = create_company_node_params()
    employee_params = create_employee_node_params()
    tag_params = create_tag_node_params(
        id_col="tag",  # Different from single table
        multivalue=False,  # Tags are in their own table
    )

    # Create edge parameters
    company_employee_edges = create_company_employee_edge_params()
    company_tag_edges = create_company_tag_edge_params()

    # Create DataFrame specifications for each table
    company_specs = create_dataframe_specs(
        name="companies", df=tables["companies"], node_params=(company_params,)
    )

    employee_specs = create_dataframe_specs(
        name="employees", df=tables["employees"], node_params=(employee_params,)
    )

    tag_specs = create_dataframe_specs(
        name="tags", df=tables["tags"], node_params=(tag_params,)
    )

    company_employee_specs = create_dataframe_specs(
        name="company_employees",
        df=tables["company_employees"],
        edge_params=(company_employee_edges,),
    )

    company_tag_specs = create_dataframe_specs(
        name="company_tags", df=tables["company_tags"], edge_params=(company_tag_edges,)
    )

    # Create and return graph specifications
    return data_sources.GraphSpecs(
        data_sources=(
            company_specs,
            employee_specs,
            tag_specs,
            company_employee_specs,
            company_tag_specs,
        )
    )
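
With entities and relationships in separate tables, a junction table can reference an ID that has no row in the entity table. The documentation above does not say how HexTractor handles that case, so checking beforehand may be worthwhile. A minimal sketch, where `dangling_ids` is a hypothetical helper and the second ID list is made up to contain one bad reference:

```python
def dangling_ids(entity_ids, referenced_ids):
    """Return referenced IDs that have no matching entity, sorted."""
    return sorted(set(referenced_ids) - set(entity_ids))


employee_ids = [0, 1, 3, 4, 5, 6]           # e.g. employees["employee_id"]
junction_employee_ids = [0, 1, 3, 4, 5, 7]  # junction column with one bad reference

print(dangling_ids(employee_ids, junction_employee_ids))  # [7]
```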

LangChain integration

Langchain GraphDocument Integration Example.

This module shows, through a simple example, how to build a LangChain GraphDocument. It creates nodes and relationships that form a heterogeneous knowledge graph from a given text: nodes for two persons, a library, and a graph, plus the relationships between them.

Functions:

get_text
    Returns a sample text describing the developers of HeXtractor and its purpose.
get_example_langchain_graphdocument
    Creates an example GraphDocument using Langchain, with nodes and relationships based on the sample text.

get_example_langchain_graphdocument()

Create an example Langchain GraphDocument.

This function creates an example GraphDocument using Langchain. It defines nodes for persons (Filip Wójcik and Marcin Malczewski), a library (HeXtractor), and a graph (Heterogeneous knowledge graph). It also establishes relationships between these nodes to form a heterogeneous knowledge graph.

Returns:

list of GraphDocument
    A list containing a single GraphDocument with the defined nodes and relationships.

Source code in hextractor/examples/langchain_integration.py
def get_example_langchain_graphdocument():
    """
    Create an example Langchain GraphDocument.

    This function creates an example GraphDocument using Langchain. It defines nodes
    for persons (Filip Wójcik and Marcin Malczewski), a library (HeXtractor), and a graph
    (Heterogeneous knowledge graph). It also establishes relationships between these nodes
    to form a heterogeneous knowledge graph.

    Returns
    -------
    list of GraphDocument
        A list containing a single GraphDocument with the defined nodes and relationships.
    """
    doc = Document(page_content=get_text())
    fw_node = Node(type="Person", id="Filip Wójcik")
    mm_node = Node(type="Person", id="Marcin Malczewski")
    hx_node = Node(type="Library", id="HeXtractor")
    kg_node = Node(type="Graph", id="Heterogeneous knowledge graph")

    fw_developed_hx = Relationship(source=fw_node, target=hx_node, type="Developed")
    mm_developed_hx = Relationship(source=mm_node, target=hx_node, type="Developed")
    hx_extracts_kg = Relationship(source=hx_node, target=kg_node, type="Extracts")

    data = [
        GraphDocument(
            nodes=[fw_node, mm_node, hx_node, kg_node],
            relationships=[fw_developed_hx, mm_developed_hx, hx_extracts_kg],
            source=doc,
        )
    ]
    return data
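
What makes this graph heterogeneous is that nodes and relationships carry types. The sketch below groups the same four nodes and three relationships by type in plain Python. The `Node` and `Relationship` namedtuples are hypothetical lightweight stand-ins defined here so the example runs without LangChain; they are not the LangChain classes used above.

```python
from collections import defaultdict, namedtuple

# Hypothetical lightweight stand-ins for LangChain's Node and Relationship.
Node = namedtuple("Node", ["id", "type"])
Relationship = namedtuple("Relationship", ["source", "target", "type"])

fw = Node("Filip Wójcik", "Person")
mm = Node("Marcin Malczewski", "Person")
hx = Node("HeXtractor", "Library")
kg = Node("Heterogeneous knowledge graph", "Graph")

relationships = [
    Relationship(fw, hx, "Developed"),
    Relationship(mm, hx, "Developed"),
    Relationship(hx, kg, "Extracts"),
]

# Group node IDs by node type.
nodes_by_type = defaultdict(list)
for node in (fw, mm, hx, kg):
    nodes_by_type[node.type].append(node.id)

# Group edges by (source type, relationship type, target type).
edges_by_type = defaultdict(list)
for rel in relationships:
    key = (rel.source.type, rel.type, rel.target.type)
    edges_by_type[key].append((rel.source.id, rel.target.id))

print(sorted(nodes_by_type))  # ['Graph', 'Library', 'Person']
print(sorted(edges_by_type))
# [('Library', 'Extracts', 'Graph'), ('Person', 'Developed', 'Library')]
```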

get_text()

Get sample text.

This function returns a sample text that describes the developers of HeXtractor and its purpose. The text is used to create nodes and relationships in the graph.

Returns:

str
    Sample text describing the developers and purpose of HeXtractor.

Source code in hextractor/examples/langchain_integration.py
def get_text() -> str:
    """
    Get sample text.

    This function returns a sample text that describes the developers of HeXtractor
    and its purpose. The text is used to create nodes and relationships in the graph.

    Returns
    -------
    str
        Sample text describing the developers and purpose of HeXtractor.
    """
    return """Filip Wójcik and Marcin Malczewski are data scientists, who developed HeXtractor. It is a library
that helps in extracting heterogeneous knowledge graphs from various data sources.
Heterogeneous knowledge graphs are graphs that contain different types of nodes and edges."""