A Graph RAG Approach to Refugee Feedback Analysis in Jordan

Shahzad Asghar
4 min read · Jul 3, 2024

1. Introduction

The UNHCR in Jordan faces the challenge of analyzing vast amounts of qualitative feedback from refugees to inform protection strategies and humanitarian assistance. This paper presents an experimental application of the Graph RAG (Retrieval-Augmented Generation) approach to refugee feedback analysis. Our work is inspired by the methodology outlined in “From Local to Global: A Graph RAG Approach to Query-Focused Summarization” (Edge et al., 2024), adapting their innovative approach to the specific context of refugee protection in Jordan.

2. Graph RAG Approach & Pipeline

Our experimental pipeline for analyzing refugee feedback in Jordan consists of the following stages:

2.1 Source Documents → Text Chunks

We extracted text from various feedback sources and split it into chunks of 600 tokens with a 100-token overlap. The chunk size C and overlap O are defined as:

C = 600, O = 100
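
The chunking step can be sketched as a sliding window over a token list. This is a minimal illustration rather than the pipeline's actual code: the function name `chunk_tokens` is hypothetical, and the placeholder token list stands in for whatever tokenizer the pipeline really used.

```python
def chunk_tokens(tokens, chunk_size=600, overlap=100):
    """Split a token list into windows of chunk_size that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # with C = 600 and O = 100, windows start 500 tokens apart
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks


# Placeholder tokens; a real run would use the tokenizer of the chosen LLM (assumption).
tokens = [f"tok{i}" for i in range(1300)]
chunks = chunk_tokens(tokens)
print([len(c) for c in chunks])
```

With C = 600 and O = 100, the last 100 tokens of one chunk are repeated as the first 100 tokens of the next, so an entity mentioned near a boundary still appears with some surrounding context in at least one chunk.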

2.2 Text Chunks → Element Instances

We used a multilingual Large Language Model (LLM) to extract entities and relationships. The extraction function E for a given text chunk t can be represented as:

E(t) = {(e_i, r_ij, e_j) | e_i, e_j ∈ Entities, r_ij ∈ Relationships}

Where e_i and e_j are entities, and r_ij is the relationship between them.
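
One way to hold the output of E(t) in code is a set of (source, relation, target) triples. The sketch below assumes the LLM is prompted to return one triple per line in the form `source | relation | target`; that output format, the `Triple` type, and the sample text are illustrative assumptions, not details from the experiment.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    source: str    # e_i
    relation: str  # r_ij
    target: str    # e_j


def parse_triples(llm_output: str) -> set[Triple]:
    """Parse lines of the form 'source | relation | target' into a set of triples."""
    triples = set()
    for line in llm_output.splitlines():
        parts = [p.strip() for p in line.split("|")]
        # Skip blank or malformed lines instead of guessing at their meaning.
        if len(parts) == 3 and all(parts):
            triples.add(Triple(*parts))
    return triples


# Invented sample standing in for a model response (not real feedback data).
sample_output = """
cash assistance | reduced for | urban refugees
urban refugees | report | rent arrears
malformed line without separators
"""
extracted = parse_triples(sample_output)
print(len(extracted))
```

Using a set means a triple that the model repeats within the same chunk is stored once; counting repeated mentions across chunks would need a list or a counter instead.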

2.3 Element Instances → Element Summaries

For each unique entity and relationship, we generated concise summaries. The summary function S for an entity e or relationship r is:

S(e) = LLM(context(e))

S(r) = LLM(context(r))

Where context() provides relevant information about the entity or relationship from all mentions in the data.
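
A minimal sketch of how context() and S() might fit together is shown here. The helper names, the prompt wording, and the stand-in `fake_llm` are all assumptions made for illustration; a real run would replace `fake_llm` with a call to the chosen model.

```python
from collections import defaultdict


def build_contexts(mentions):
    """Group every mention text under the element (entity or relationship) it refers to."""
    contexts = defaultdict(list)
    for element, text in mentions:
        contexts[element].append(text)
    return dict(contexts)


def summarize(element, contexts, llm):
    """S(element) = LLM(context(element)); `llm` is any callable from str to str."""
    prompt = (
        f"Summarize what the feedback says about '{element}':\n"
        + "\n".join(contexts[element])
    )
    return llm(prompt)


def fake_llm(prompt):
    """Stand-in for a real model call: reports how many mention lines it received."""
    return f"{len(prompt.splitlines()) - 1} mention(s) summarized"


# Invented mentions for illustration only.
mentions = [
    ("cash assistance", "Families said the monthly amount was reduced."),
    ("cash assistance", "Participants asked when payments would resume."),
    ("rent arrears", "Several households reported falling behind on rent."),
]
contexts = build_contexts(mentions)
summary = summarize("cash assistance", contexts, fake_llm)
print(summary)
```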

2.4 Element Summaries → Graph Communities

We constructed a knowledge graph G = (V, E), where V is the set of entities and E is the set of weighted edges representing relationships (here E denotes the edge set, not the extraction function of Section 2.2). The edge weight w(e) for an edge e is calculated as:

w(e) = count(e) / sqrt(count(source(e)) * count(target(e)))

We applied the Leiden algorithm for community detection, generating a hierarchical community structure with 4 levels (C0, C1, C2, C3).
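
The edge-weight formula can be sketched with two counters. The text does not say how count(source(e)) and count(target(e)) are obtained; this sketch assumes an entity's count is the number of extracted relationship instances it takes part in. Community detection would need a graph library and is not shown.

```python
from collections import Counter
from math import sqrt


def edge_weights(instances):
    """Compute w(e) = count(e) / sqrt(count(source(e)) * count(target(e)))."""
    edge_count = Counter(instances)
    entity_count = Counter()
    for source, target in instances:
        # Assumption: an entity's count is how many instances it appears in.
        entity_count[source] += 1
        entity_count[target] += 1
    return {
        (s, t): n / sqrt(entity_count[s] * entity_count[t])
        for (s, t), n in edge_count.items()
    }


# Invented relationship instances, one tuple per extracted mention.
instances = [
    ("cash assistance", "rent arrears"),
    ("cash assistance", "rent arrears"),
    ("cash assistance", "child labour"),
    ("rent arrears", "eviction"),
]
weights = edge_weights(instances)
for edge, w in sorted(weights.items()):
    print(edge, round(w, 3))
```

Under this counting rule every weight lies between 0 and 1, because an edge can never be mentioned more often than either of its endpoints. The normalization keeps frequently mentioned entities from dominating the graph purely through volume.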

2.5 Graph Communities → Community Summaries

For each detected community c_i at level l, we generated a summary:

CS(c_i, l) = LLM(aggregate(elements(c_i)))

Where elements(c_i) returns all entities and relationships in community c_i, and aggregate() combines their summaries.
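
As a rough sketch, CS(c_i, l) can be read as joining the element summaries of a community's members and passing the result to the model. The data layout (a dictionary keyed by community id and level), the function names, and the stand-in `echo_llm` are assumptions for illustration.

```python
def aggregate(element_summaries, members):
    """Join the summaries of a community's members into one prompt body."""
    return "\n".join(f"- {m}: {element_summaries[m]}" for m in sorted(members))


def community_summary(community_id, level, communities, element_summaries, llm):
    """CS(c_i, l) = LLM(aggregate(elements(c_i)))."""
    members = communities[(community_id, level)]
    return llm(aggregate(element_summaries, members))


def echo_llm(text):
    """Stand-in for a real model call: reports how many elements it was given."""
    return f"Summary of {len(text.splitlines())} elements"


# Invented element summaries and community membership.
element_summaries = {
    "cash assistance": "Monthly support that participants say was reduced.",
    "rent arrears": "Households falling behind on rent payments.",
    "school transport": "Cost of getting children to school.",
}
communities = {
    ("c1", 0): {"cash assistance", "rent arrears"},
    ("c2", 0): {"school transport"},
}
result = community_summary("c1", 0, communities, element_summaries, echo_llm)
print(result)
```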

2.6 Community Summaries → Community Answers → Global Answer

Given a query q, we generate the global answer A(q) as follows:

  1. Prepare: Divide community summaries into chunks of 8k tokens.
  2. Map: Generate intermediate answers I_i(q) for each chunk i.
  3. Reduce: Synthesize the final answer:

A(q) = LLM(aggregate(sort_by_relevance(I_i(q))))
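
The three steps above can be sketched as a small map-reduce routine. Several details are assumptions made for the sketch: tokens are counted by whitespace splitting, the map step returns a relevance score alongside each intermediate answer, and `stub_map` and `stub_reduce` stand in for real model calls.

```python
def pack_chunks(summaries, max_tokens=8000, count_tokens=lambda s: len(s.split())):
    """Prepare: group community summaries into chunks of at most max_tokens."""
    chunks, current, used = [], [], 0
    for summary in summaries:
        n = count_tokens(summary)
        if current and used + n > max_tokens:
            chunks.append(current)
            current, used = [], 0
        current.append(summary)  # an oversized summary still gets its own chunk
        used += n
    if current:
        chunks.append(current)
    return chunks


def global_answer(query, summaries, map_llm, reduce_llm, max_tokens=8000):
    """Map each chunk to an intermediate answer, sort by relevance, then reduce."""
    intermediate = [
        map_llm(query, "\n".join(chunk)) for chunk in pack_chunks(summaries, max_tokens)
    ]
    ranked = sorted(intermediate, key=lambda pair: pair[1], reverse=True)
    return reduce_llm(query, [text for text, _score in ranked])


def stub_map(query, context):
    """Stand-in map call: scores a chunk by how often it mentions 'cash'."""
    return (f"Answer from: {context}", context.lower().count("cash"))


def stub_reduce(query, answers):
    """Stand-in reduce call: concatenates the ranked intermediate answers."""
    return " | ".join(answers)


# Invented summaries; max_tokens is tiny here only to force several chunks.
summaries = [
    "Education access in camps",
    "Cash assistance cuts and cash delays",
    "Cash for rent",
]
answer = global_answer(
    "What are the effects of cash cuts?", summaries, stub_map, stub_reduce, max_tokens=5
)
print(answer)
```

Sorting before the reduce step means that if the final prompt has to be truncated to fit the context window, the least relevant intermediate answers are the ones dropped.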

3. Experimental Setup

3.1 Data Sources

Our experiment used the following data sources from UNHCR Jordan:

  • 61 Focus Group Discussion transcripts
  • 14 Community Representative Meeting reports
  • 32 Community Support Committee logs
  • 9 Information Session summaries
  • 15 Multimedia feedback entries (social media, WhatsApp)

3.2 Entity Types

We focused on extracting the following entity types:

  1. Refugee demographics
  2. Protection concerns
  3. Geographical locations
  4. Assistance types

3.3 Evaluation Metrics

We evaluated our approach using four metrics:

  1. Comprehensiveness (C)
  2. Diversity (D)
  3. Empowerment (E)
  4. Directness (Dr)

Each metric was computed using LLM-based comparisons between our approach and baselines, with scores normalized to a 0–100 scale.
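
The text does not spell out how pairwise LLM judgments were turned into a 0–100 score. One plausible reading, sketched here purely as an assumption, is a win rate in which a win counts as one point and a tie as half a point. The labels and the sample judgments are invented.

```python
def win_rate(judgments):
    """Convert pairwise judgments into a 0-100 score (win = 1, tie = 0.5, loss = 0)."""
    if not judgments:
        raise ValueError("at least one judgment is required")
    points = {"graph_rag": 1.0, "tie": 0.5, "baseline": 0.0}
    return 100.0 * sum(points[j] for j in judgments) / len(judgments)


# Invented judgments for one metric; each entry is the LLM judge's preferred answer.
judgments = ["graph_rag", "graph_rag", "baseline", "tie"]
score = win_rate(judgments)
print(score)
```

Under this scheme a score of 50 means the two systems were preferred equally often, which gives the 0–100 scale a natural midpoint for reading a comparison chart.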

4. Results

Our Graph RAG approach scored higher than the baselines on Comprehensiveness and Diversity, about the same on Empowerment, and slightly lower on Directness. Figure 1 illustrates the performance comparison across the metrics:

[Figure 1: Bar chart comparing Graph RAG performance against baselines (Naïve RAG and Global Text Summarization) across four metrics: Comprehensiveness, Diversity, Empowerment, and Directness. The y-axis shows scores from 0–100, with Graph RAG showing higher bars for Comprehensiveness and Diversity, similar heights for Empowerment, and slightly lower for Directness compared to baselines.]

The hierarchical community structure revealed interesting patterns in refugee concerns. Figure 2 shows the distribution of top-level communities:

[Figure 2: Pie chart showing the distribution of top-level (C0) communities in the refugee feedback graph. Slices represent major themes such as “Basic Needs” (30%), “Protection” (25%), “Education” (20%), “Health” (15%), and “Livelihoods” (10%).]

5. Discussion

Our results suggest that the Graph RAG approach is effective at capturing complex, interconnected refugee protection issues in Jordan. The hierarchical community structure proved particularly valuable in analyzing multi-faceted concerns.

One interesting finding was the relationship between cash assistance reductions and other protection issues. We observed a strong positive correlation (r = 0.78) between mentions of cash assistance cuts and reports of negative coping mechanisms.
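
For readers unfamiliar with the statistic, a Pearson correlation between two mention counts can be computed as sketched below. The unit of observation (one count pair per feedback document) and the toy numbers are assumptions for illustration; they are not the experiment's data and are not meant to reproduce the reported value.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length with at least 2 items")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one sequence is constant")
    return cov / sqrt(var_x * var_y)


# Toy per-document counts, invented for illustration.
cash_cut_mentions = [0, 1, 2, 3, 4]
coping_mentions = [1, 1, 2, 4, 4]
r = pearson_r(cash_cut_mentions, coping_mentions)
print(round(r, 2))
```

A high r shows that the two kinds of mention tend to appear together; on its own it does not establish that cash cuts cause negative coping mechanisms.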

6. Conclusion and Future Work

The Graph RAG approach shows promise for enhancing UNHCR’s capacity for data-driven decision-making in refugee protection. Future work will focus on:

  1. Refining entity extraction for refugee-specific terminology
  2. Incorporating temporal data to analyze trends in protection concerns
  3. Exploring multilingual capabilities to handle diverse refugee populations

By leveraging this approach, we aim to provide more comprehensive and nuanced insights into refugee needs and concerns, ultimately leading to more effective humanitarian assistance strategies.

Note: This experiment was conducted on test data mimicking original refugee feedback in Jordan. Application to real data is pending data protection and cybersecurity clearance. Full working code is available upon request.
