SEO IRL
Pipeline
Step 1: Import & Embed
Step 2: Deduplicate
Step 3: Cluster
Step 4: Denoise
Step 5: Generate Topics
Step 6: Deduplicate Topics
Step 7: Label Parents
Visualizers: Treemap View, Radial Tree
Generator: Content Brief
Presented by
Amin Foroutan
Step 2 of 7

Similarity Deduplication

Remove duplicate keywords with similar embeddings

Keywords whose embeddings are at least 95% similar are merged.

How Do We Remove Duplicates?

By comparing embedding similarity, we identify and merge semantically duplicate keywords, consolidating their search volumes.

Similarity Network → Deduplication

Example keywords (search volume):
best accounting software: 1,200
top accounting software: 800
accounting software reviews: 400
crm software comparison: 950
compare crm tools: 600
project management tools: 1,500
Similar keywords (≥95% similarity) are connected, grouped, and merged into the highest-volume keyword.
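
To make the merge rule concrete, here is a minimal Python sketch using the example keywords above. The grouping is an assumption for illustration only, since the page does not state which pairs actually cross the 95% threshold, and `merge_group` is a hypothetical helper rather than the tool's own code.

```python
# Example keywords and search volumes from the visualization.
volumes = {
    "best accounting software": 1200,
    "top accounting software": 800,
    "accounting software reviews": 400,
    "crm software comparison": 950,
    "compare crm tools": 600,
    "project management tools": 1500,
}

# ASSUMPTION: illustrative groups; the real groups depend on the embeddings.
groups = [
    ["best accounting software", "top accounting software", "accounting software reviews"],
    ["crm software comparison", "compare crm tools"],
    ["project management tools"],
]

def merge_group(group, volumes):
    """Keep the highest-volume keyword and give it the group's summed volume."""
    kept = max(group, key=lambda kw: volumes[kw])
    total = sum(volumes[kw] for kw in group)
    removed = [kw for kw in group if kw != kept]
    return kept, total, removed

merged = {}
for group in groups:
    kept, total, _removed = merge_group(group, volumes)
    merged[kept] = total

for keyword, volume in merged.items():
    print(f"{keyword}: {volume}")
```

Under the assumed grouping, "best accounting software" absorbs the other two accounting keywords, and "project management tools" is left unchanged because its group has a single member.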

Step 2: Backend Implementation

Semantic deduplication using cosine similarity on embeddings

Detailed Processing Pipeline

1. Load & Parse Embeddings
Read CSV → Parse JSON embeddings → Stack into matrix (N × 1536)
parse_embedding_array()
2. L2 Normalization
Normalize embeddings for accurate cosine similarity (unit vectors)
normalize(X, axis=1)
3. Batch Similarity Computation
Process in 1000-item batches → Compute cosine similarity matrix
cosine_similarity()
4. Threshold Filtering
Keep only pairs with similarity ≥ threshold (default: 0.95)
if sim ≥ 0.95
5. Build Similarity Graph
Create graph where edges connect similar keywords
similarity_graph[i,j]
6. Find Connected Components (DFS)
Depth-first search to identify groups of similar keywords
dfs(node, group)
7. Select Highest Volume Keyword
From each group, keep the keyword with max search volume
max(group, key=volume)
8. Consolidate Search Volumes
Sum all volumes in group → Assign to kept keyword
Σ volumes
9. Generate Output Files
Save deduplicated CSV + removed items report with statistics
to_csv()
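
Steps 2 through 6 can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the backend's code: Python lists stand in for the N × 1536 matrix, hand-written loops replace the `normalize` and `cosine_similarity` calls named above, the depth-first search is iterative rather than recursive, and the function names and toy 2-D embeddings are hypothetical.

```python
import math

def l2_normalize(vectors):
    """Scale each vector to unit length so a dot product equals cosine similarity."""
    normalized = []
    for vec in vectors:
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0  # avoid dividing by zero
        normalized.append([x / norm for x in vec])
    return normalized

def build_similarity_graph(unit_vectors, threshold=0.95, batch_size=1000):
    """Return adjacency lists linking pairs whose cosine similarity >= threshold."""
    n = len(unit_vectors)
    graph = {i: [] for i in range(n)}
    # Rows are visited in batches to mirror the 1000-item batching; a matrix
    # implementation would compute one (batch x N) similarity block per batch.
    for start in range(0, n, batch_size):
        for i in range(start, min(start + batch_size, n)):
            for j in range(i + 1, n):
                sim = sum(a * b for a, b in zip(unit_vectors[i], unit_vectors[j]))
                if sim >= threshold:
                    graph[i].append(j)
                    graph[j].append(i)
    return graph

def connected_components(graph):
    """Group node indices using an iterative depth-first search."""
    seen = set()
    groups = []
    for node in graph:
        if node in seen:
            continue
        stack, group = [node], []
        seen.add(node)
        while stack:
            current = stack.pop()
            group.append(current)
            for neighbor in graph[current]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        groups.append(sorted(group))
    return groups

# Toy 2-D embeddings (hypothetical): the first two point in nearly the same direction.
embeddings = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]]
unit = l2_normalize(embeddings)
graph = build_similarity_graph(unit, threshold=0.95)
print(connected_components(graph))  # [[0, 1], [2]]
```

Because connected components are transitive, two keywords can land in the same group without being directly similar to each other, as long as a chain of pairs at or above the threshold links them. Steps 7 and 8 then operate on each group of indices.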

Key Features

  • Cosine similarity computation
  • Graph-based grouping algorithm
  • Memory-optimized batch processing
  • L2 normalization of embeddings
  • Volume consolidation tracking
  • Detailed removal reporting

Response Model

from pydantic import BaseModel

class DeduplicationResponse(BaseModel):
    input_filename: str
    output_filename: str
    removed_filename: str
    input_download_url: str
    output_download_url: str
    removed_download_url: str
    original_count: int
    final_count: int
    removed_count: int
    groups_found: int
    volume_consolidated: float
    similarity_threshold: float
    created_at: str
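
As a rough illustration of how the count fields might relate to one another, the sketch below derives them from a list of keyword groups. The field semantics are assumptions that the page does not confirm: `groups_found` is taken to count groups with more than one member, and `volume_consolidated` to be the volume moved from removed keywords onto kept ones. `summarize_groups` is a hypothetical helper, and it returns a plain dict rather than the model itself.

```python
def summarize_groups(groups, volumes, similarity_threshold=0.95):
    """Derive response-style count fields from index groups (assumed semantics)."""
    original_count = sum(len(group) for group in groups)
    final_count = len(groups)  # one kept keyword per group
    multi_member = [group for group in groups if len(group) > 1]

    moved_volume = 0.0
    for group in multi_member:
        kept = max(group, key=lambda i: volumes[i])
        # ASSUMPTION: consolidated volume = volume of the removed keywords.
        moved_volume += sum(volumes[i] for i in group if i != kept)

    return {
        "original_count": original_count,
        "final_count": final_count,
        "removed_count": original_count - final_count,
        "groups_found": len(multi_member),
        "volume_consolidated": float(moved_volume),
        "similarity_threshold": similarity_threshold,
    }

stats = summarize_groups(
    groups=[[0, 1, 2], [3, 4], [5]],
    volumes=[1200, 800, 400, 950, 600, 1500],
)
print(stats)
```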

API Endpoints

POST /keywords/deduplicate
Deduplicate similar keywords from embedding file with configurable similarity threshold
GET /files/outputs/list
List all available output files with metadata (size, modified date)