Interactions

This section describes how HeXtractor integrates with other tools and frameworks.

LangChain Integration - Graph Documents

This module contains functions that convert a LangChain GraphDocument into a PyTorch Geometric heterogeneous graph (HeteroData), so that graphs produced by LangChain LLM pipelines can be used in graph-based learning tasks with PyTorch Geometric.

convert_graph_document_to_hetero_data(graph_doc)

Convert a GraphDocument to a PyTorch Geometric heterogeneous graph.

Parameters:

Name | Type | Description | Default
graph_doc | GraphDocument | The graph document containing nodes and relationships. | required

Returns:

Type | Description
tuple[HeteroData, dict[tuple[str, str], int]] | A tuple containing: (1) a PyTorch Geometric heterogeneous graph object; (2) a dictionary mapping (node_type, original_id) tuples to type-specific indices.

Notes

This function performs the following steps:

  1. Groups nodes by their type
  2. Creates a mapping from string node IDs to numerical indices per node type
  3. Creates feature matrices for each node type
  4. Extracts all unique edge types
  5. Creates edge indices for each edge type

The resulting HeteroData object follows PyTorch Geometric's format for heterogeneous graphs, in which the nodes of each type are indexed independently, starting from 0.
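The five steps above can be followed in a small pure-Python sketch. It is not the library's implementation: the `Node`, `Relationship`, and `GraphDoc` dataclasses are hypothetical stand-ins that only mimic the `.nodes`, `.relationships`, `.id`, `.type`, `.source`, and `.target` attributes used here, and plain lists take the place of the tensors stored in HeteroData.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:  # hypothetical stand-in for a graph document node
    id: str
    type: str


@dataclass(frozen=True)
class Relationship:  # hypothetical stand-in for a graph document relationship
    source: Node
    target: Node
    type: str


@dataclass
class GraphDoc:  # hypothetical stand-in exposing .nodes and .relationships
    nodes: list
    relationships: list


def sketch_convert(graph_doc):
    """Walk through the five steps with plain lists instead of tensors."""
    # Step 1: group node IDs by type.
    nodes_by_type = defaultdict(list)
    for node in graph_doc.nodes:
        nodes_by_type[node.type].append(node.id)

    # Step 2: (node_type, original_id) -> index, starting at 0 within each type.
    mapping = {
        (node_type, node_id): idx
        for node_type, node_ids in nodes_by_type.items()
        for idx, node_id in enumerate(node_ids)
    }

    # Step 3: placeholder features, one row per node holding its own index.
    features = {
        node_type: [[float(i)] for i in range(len(node_ids))]
        for node_type, node_ids in nodes_by_type.items()
    }

    # Steps 4 and 5: one [sources, targets] pair of rows per edge type.
    # An unknown node raises KeyError here; the documented function raises ValueError.
    edge_index = defaultdict(lambda: [[], []])
    for rel in graph_doc.relationships:
        key = (rel.source.type, rel.type, rel.target.type)
        edge_index[key][0].append(mapping[(rel.source.type, rel.source.id)])
        edge_index[key][1].append(mapping[(rel.target.type, rel.target.id)])

    return features, dict(edge_index), mapping


alice, bob = Node("alice", "Person"), Node("bob", "Person")
acme = Node("acme", "Company")
doc = GraphDoc(
    nodes=[alice, bob, acme],
    relationships=[
        Relationship(alice, acme, "WORKS_AT"),
        Relationship(bob, acme, "WORKS_AT"),
        Relationship(alice, bob, "KNOWS"),
    ],
)
features, edge_index, mapping = sketch_convert(doc)
```

In this toy graph both "alice" and "acme" receive index 0, because indices are assigned separately for each node type.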

Source code in hextractor/integrations/langchain_graphdoc.py
def convert_graph_document_to_hetero_data(
    graph_doc,
) -> tuple[HeteroData, dict[tuple[str, str], int]]:
    """
    Convert a GraphDocument to a PyTorch Geometric heterogeneous graph.

    Parameters
    ----------
    graph_doc : GraphDocument
        The graph document containing nodes and relationships.

    Returns
    -------
    tuple[HeteroData, dict[tuple[str, str], int]]
        A tuple containing:

        1. A PyTorch Geometric heterogeneous graph object.
        2. A dictionary of mappings from (node_type, original_id) tuples to type-specific indices.

    Notes
    -----
    This function performs the following steps:

    1. Groups nodes by their type
    2. Creates a mapping from string node IDs to numerical indices per node type
    3. Creates feature matrices for each node type
    4. Extracts all unique edge types
    5. Creates edge indices for each edge type

    The resulting HeteroData object follows PyTorch Geometric's format for
    heterogeneous graphs, where node IDs within each type start from 0.
    """
    # Create HeteroData object
    data = HeteroData()

    # Step 1: Group nodes by type
    nodes_by_type = group_nodes_by_type(graph_doc)

    # Step 2: Create mapping from string IDs to numerical indices (per node type)
    node_id_mapping = create_node_id_mapping(nodes_by_type)

    # Step 3: Create feature matrices for each node type
    create_node_features(data, nodes_by_type)

    # Step 4: Extract all unique edge types
    edge_types = extract_edge_types(graph_doc)

    # Step 5: Create edge indices for each edge type
    create_edge_indices(data, graph_doc, edge_types, node_id_mapping)

    return data, node_id_mapping

create_edge_indices(data, graph_doc, edge_types, node_id_mapping)

Create edge indices for each edge type in the heterogeneous graph.

Parameters:

Name | Type | Description | Default
data | HeteroData | The PyTorch Geometric HeteroData object to populate. | required
graph_doc | GraphDocument | The graph document containing nodes and relationships. | required
edge_types | Set[Tuple[str, str, str]] | A set of (source_type, relation_type, target_type) tuples representing all unique edge types in the graph. | required
node_id_mapping | Dict[Tuple[str, str], int] | A dictionary mapping (node_type, original_id) tuples to type-specific indices. | required

Returns:

Type | Description
None | This function modifies the data object in-place.

Raises:

Type | Description
ValueError | If a relationship references a source or target node that has no entry in node_id_mapping.

Notes

This function creates edge indices for each edge type in the format required by PyTorch Geometric: a tensor of shape [2, num_edges] where the first row contains source node indices and the second row contains target node indices.
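To make the [2, num_edges] layout concrete, here is a minimal pure-Python sketch that uses nested lists in place of a tensor. The helper name `edge_rows` and the sample IDs are hypothetical and not part of HeXtractor; the ValueError mirrors the check on unknown nodes described above.

```python
def edge_rows(id_pairs, node_id_mapping, source_type, target_type):
    """Return [source_row, target_row] for edges given as (source_id, target_id)."""
    source_row, target_row = [], []
    for source_id, target_id in id_pairs:
        source_key = (source_type, source_id)
        target_key = (target_type, target_id)
        # Unknown endpoints are treated as an error, as in the documented function.
        for role, key in (("source", source_key), ("target", target_key)):
            if key not in node_id_mapping:
                raise ValueError(f"Unknown {role} node: {key[1]} of type {key[0]}")
        source_row.append(node_id_mapping[source_key])
        target_row.append(node_id_mapping[target_key])
    return [source_row, target_row]


mapping = {("Person", "alice"): 0, ("Person", "bob"): 1, ("Company", "acme"): 0}
rows = edge_rows([("alice", "acme"), ("bob", "acme")], mapping, "Person", "Company")
# rows[0] holds source indices and rows[1] holds target indices,
# so column k of the two rows describes edge k.
```

Source indices refer to the source node type and target indices to the target node type, so the same number can appear in both rows while pointing at different nodes.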

Source code in hextractor/integrations/langchain_graphdoc.py
def create_edge_indices(
    data: HeteroData,
    graph_doc,
    edge_types: Set[Tuple[str, str, str]],
    node_id_mapping: Dict[Tuple[str, str], int],
) -> None:
    """
    Create edge indices for each edge type in the heterogeneous graph.

    Parameters
    ----------
    data : HeteroData
        The PyTorch Geometric HeteroData object to populate.
    graph_doc : GraphDocument
        The graph document containing nodes and relationships.
    edge_types : Set[Tuple[str, str, str]]
        A set of (source_type, relation_type, target_type) tuples representing
        all unique edge types in the graph.
    node_id_mapping : Dict[Tuple[str, str], int]
        A dictionary mapping (node_type, original_id) tuples to type-specific indices.

    Returns
    -------
    None
        This function modifies the data object in-place.

    Raises
    ------
    ValueError
        If an edge references a node that doesn't exist in the graph.

    Notes
    -----
    This function creates edge indices for each edge type in the format required by
    PyTorch Geometric: a tensor of shape [2, num_edges] where the first row contains
    source node indices and the second row contains target node indices.
    """
    for source_type, rel_type, target_type in edge_types:
        # Collect all edges of this type
        edge_indices = []

        for rel in graph_doc.relationships:
            if (
                rel.source.type == source_type
                and rel.target.type == target_type
                and rel.type == rel_type
            ):
                # Validate source node exists
                source_key = (source_type, rel.source.id)
                if source_key not in node_id_mapping:
                    raise ValueError(
                        f"Unknown source node: {rel.source.id} of type {source_type}"
                    )

                # Validate target node exists
                target_key = (target_type, rel.target.id)
                if target_key not in node_id_mapping:
                    raise ValueError(
                        f"Unknown target node: {rel.target.id} of type {target_type}"
                    )

                # Get type-specific source and target indices
                source_idx = node_id_mapping[source_key]
                target_idx = node_id_mapping[target_key]
                edge_indices.append([source_idx, target_idx])

        if edge_indices:
            # Convert to tensor with shape [2, num_edges]
            edge_index = torch.tensor(edge_indices).t().contiguous()
            data[source_type, rel_type, target_type].edge_index = edge_index

create_node_features(data, nodes_by_type)

Create feature matrices for each node type in the heterogeneous graph.

Parameters:

Name | Type | Description | Default
data | HeteroData | The PyTorch Geometric HeteroData object to populate. | required
nodes_by_type | Dict[str, List[str]] | A dictionary mapping node types to lists of node IDs. | required

Returns:

Type | Description
None | This function modifies the data object in-place.

Notes

This implementation creates placeholder feature matrices of shape [num_nodes, 1], in which each node's only feature is its index stored as a float. In a real application, you would replace these with actual node features, for example values derived from the node properties in graph_doc.
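As a hedged illustration of using real features, the sketch below builds one feature row per node from a properties dictionary instead of the index placeholder. The property names (`age`, `salary`), the `feature_rows` helper, and the default of 0.0 for missing values are all assumptions made for this example. In practice the resulting rows would be converted to a float tensor and assigned to data[node_type].x.

```python
def feature_rows(node_properties, keys, default=0.0):
    """Build one numeric feature row per node, preserving the given node order."""
    return [
        [float(props.get(key, default)) for key in keys]
        for props in node_properties
    ]


# Hypothetical properties for two "Person" nodes, listed in index order.
person_properties = [{"age": 34, "salary": 5200}, {"age": 41}]
x_person = feature_rows(person_properties, keys=["age", "salary"])
```

The rows must follow the same order as the node ID list for that type, otherwise features and edge indices would refer to different nodes.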

Source code in hextractor/integrations/langchain_graphdoc.py
def create_node_features(data: HeteroData, nodes_by_type: Dict[str, List[str]]) -> None:
    """
    Create feature matrices for each node type in the heterogeneous graph.

    Parameters
    ----------
    data : HeteroData
        The PyTorch Geometric HeteroData object to populate.
    nodes_by_type : Dict[str, List[str]]
        A dictionary mapping node types to lists of node IDs.

    Returns
    -------
    None
        This function modifies the data object in-place.

    Notes
    -----
    This implementation creates simple feature matrices where each node's feature
    is just its index. In a real application, you would use actual node features
    extracted from the graph_doc's properties.
    """
    for node_type, node_ids in nodes_by_type.items():
        num_nodes = len(node_ids)
        # Create a simple feature matrix (just node indices as features)
        data[node_type].x = torch.arange(num_nodes).view(-1, 1).float()

create_node_id_mapping(nodes_by_type)

Create a mapping from string node IDs to numerical indices per node type.

Parameters:

Name | Type | Description | Default
nodes_by_type | Dict[str, List[str]] | A dictionary mapping node types to lists of node IDs. | required

Returns:

Type | Description
Dict[Tuple[str, str], int] | A dictionary mapping (node_type, original_id) tuples to type-specific indices.

Notes

Indices are assigned separately for each node type, in list order and starting from 0, as PyTorch Geometric's heterogeneous graph format expects.
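A common follow-up need is translating indices back to the original string IDs, for example when reading model output. The sketch below builds the forward mapping in the way described above and then inverts it; `index_to_id` is a hypothetical name for illustration, not something HeXtractor provides.

```python
nodes_by_type = {"Person": ["alice", "bob"], "Company": ["acme"]}

# Forward mapping: (node_type, original_id) -> index within that type.
node_id_mapping = {
    (node_type, node_id): idx
    for node_type, node_ids in nodes_by_type.items()
    for idx, node_id in enumerate(node_ids)
}

# Inverse lookup: (node_type, index) -> original string ID.
index_to_id = {
    (node_type, idx): node_id
    for (node_type, node_id), idx in node_id_mapping.items()
}
```

The node type has to be part of the inverse key, since index 0 exists once per type.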

Source code in hextractor/integrations/langchain_graphdoc.py
def create_node_id_mapping(
    nodes_by_type: Dict[str, List[str]],
) -> Dict[Tuple[str, str], int]:
    """
    Create a mapping from string node IDs to numerical indices per node type.

    Parameters
    ----------
    nodes_by_type : Dict[str, List[str]]
        A dictionary mapping node types to lists of node IDs.

    Returns
    -------
    Dict[Tuple[str, str], int]
        A dictionary mapping (node_type, original_id) tuples to type-specific indices.

    Notes
    -----
    This function ensures that node IDs within each type start from 0,
    which is required for PyTorch Geometric's heterogeneous graph format.
    """
    node_id_mapping = {}
    for node_type, node_ids in nodes_by_type.items():
        for idx, node_id in enumerate(node_ids):
            # Store mapping as (node_type, original_id) -> type_specific_idx
            node_id_mapping[(node_type, node_id)] = idx
    return node_id_mapping

extract_edge_types(graph_doc)

Extract all unique edge types from the graph document.

Parameters:

Name | Type | Description | Default
graph_doc | GraphDocument | The graph document containing nodes and relationships. | required

Returns:

Type | Description
Set[Tuple[str, str, str]] | A set of (source_type, relation_type, target_type) tuples representing all unique edge types in the graph.

Notes

Edge types in PyTorch Geometric are defined as tuples of (source_node_type, edge_type, target_node_type).
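A short sketch with made-up relationship data shows what "unique edge types" means in practice. The tuples below stand in for the (source type, relation type, target type) read from each relationship; the type names are illustrative only.

```python
# One (source_type, relation_type, target_type) tuple per relationship.
relationships = [
    ("Person", "WORKS_AT", "Company"),
    ("Person", "WORKS_AT", "Company"),     # same edge type again, collapses in the set
    ("Person", "WORKS_AT", "University"),  # same relation name, different target type
    ("Person", "KNOWS", "Person"),
]
edge_types = set(relationships)
```

The relation name alone does not identify an edge type: "WORKS_AT" pointing at a Company and "WORKS_AT" pointing at a University are two distinct edge types, and each would get its own edge index.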

Source code in hextractor/integrations/langchain_graphdoc.py
def extract_edge_types(graph_doc) -> Set[Tuple[str, str, str]]:
    """
    Extract all unique edge types from the graph document.

    Parameters
    ----------
    graph_doc : GraphDocument
        The graph document containing nodes and relationships.

    Returns
    -------
    Set[Tuple[str, str, str]]
        A set of (source_type, relation_type, target_type) tuples representing
        all unique edge types in the graph.

    Notes
    -----
    Edge types in PyTorch Geometric are defined as tuples of
    (source_node_type, edge_type, target_node_type).
    """
    edge_types = set()
    for rel in graph_doc.relationships:
        source_type = rel.source.type
        target_type = rel.target.type
        rel_type = rel.type
        edge_types.add((source_type, rel_type, target_type))
    return edge_types

group_nodes_by_type(graph_doc)

Group nodes by their type.

Parameters:

Name | Type | Description | Default
graph_doc | GraphDocument | The graph document containing nodes and relationships. | required

Returns:

Type | Description
Dict[str, List[str]] | A dictionary mapping node types to lists of node IDs.

Notes

Node IDs are kept in the order in which they appear in graph_doc.nodes. The resulting mapping is the basis for the later per-type processing steps, such as index assignment and feature creation.
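The grouping can be sketched with plain (id, type) pairs standing in for graph_doc.nodes. The sample data is hypothetical, and the final conversion to a plain dict is a suggestion for callers rather than something the function itself is documented to do.

```python
from collections import defaultdict

# (id, type) pairs standing in for the nodes of a graph document.
nodes = [("alice", "Person"), ("acme", "Company"), ("bob", "Person")]

nodes_by_type = defaultdict(list)
for node_id, node_type in nodes:
    nodes_by_type[node_type].append(node_id)

# A plain dict does not create empty entries when a missing type is looked up.
grouped = dict(nodes_by_type)
```

Judging by the source shown on this page, a node that is listed twice would appear twice in its list, so deduplicating the nodes beforehand may be worthwhile.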

Source code in hextractor/integrations/langchain_graphdoc.py
def group_nodes_by_type(graph_doc) -> Dict[str, List[str]]:
    """
    Group nodes by their type.

    Parameters
    ----------
    graph_doc : GraphDocument
        The graph document containing nodes and relationships.

    Returns
    -------
    Dict[str, List[str]]
        A dictionary mapping node types to lists of node IDs.

    Notes
    -----
    This function creates a mapping from node types to lists of node IDs,
    which is useful for further processing of nodes by their type.
    """
    nodes_by_type = defaultdict(list)
    for node in graph_doc.nodes:
        nodes_by_type[node.type].append(node.id)
    return nodes_by_type