Eth Labels: Public Dataset of Blockchain Address Labels

Project Overview

Eth Labels was a public dataset project that aggregated labeled cryptocurrency addresses from Ethereum and several other EVM-compatible chains. I worked on it with Dawson Botsford, creator of Earnifi, during May and June 2024. The project addressed a gap in blockchain research and analytics by making previously inaccessible address labeling data freely available to researchers, developers, and analysts.

My primary contribution was designing and implementing the REST API that provided programmatic access to the dataset. The project extracted address labeling information that was previously locked within individual blockchain explorers and published it in a standardized, shareable form through several interfaces: CSV files, JSON data, a SQLite database, and a REST API.

The Problem

Blockchain researchers and analysts had long struggled with identifying the entities behind cryptocurrency addresses. While blockchain explorers like Etherscan, Arbiscan, and others maintained extensive databases of labeled addresses, this valuable data was historically:

Siloed within individual platforms with no standardized access
Inaccessible for bulk research and analytical purposes
Not available in machine-readable formats for automated processing
Fragmented across multiple chains without unified access

This data fragmentation created significant barriers for legitimate research, compliance efforts, and analytical work in the blockchain space.

The Solution

Eth Labels addressed these challenges by:

Automated data extraction from multiple blockchain explorers across seven EVM chains
Standardized data formats available as CSV, JSON, and SQLite database
Public REST API for programmatic access with comprehensive documentation
Multi-chain aggregation covering Ethereum, Arbitrum, Optimism, Base, BSC, Gnosis, and Celo
Regular updates through automated scraping and data refresh processes

Project Conclusion

We eventually discovered that other established projects were already providing similar blockchain labeling services with more resources and broader adoption. Rather than competing in an already crowded space, we made the strategic decision to collaborate with these existing solutions, recognizing that our efforts would be more valuable when combined with established infrastructure and communities.

Technical Implementation

Data Collection Architecture

The project used a web scraping system written in TypeScript. Each supported explorer was registered as a class instance in a single configuration array:

// Modular chain support: each entry is a concrete subclass of a shared abstract chain class
export const scanConfig = [
  new EtherscanChain(),
  new OptimismChain(),
  new ArbiscanChain(),
  new BasescanChain(),
  new CeloChain(),
  new BscscanChain(),
  new GnosisChain(),
] as const;

REST API Development (My Primary Focus)

My main responsibility was designing and implementing the REST API using Elysia.js, a TypeScript-first web framework optimized for the Bun runtime:

High-performance backend leveraging Bun's speed advantages
Type-safe API endpoints with automatic validation using Zod schemas
SQLite database integration for efficient querying with Kysely query builder
OpenAPI documentation auto-generated through Swagger integration
Comprehensive error handling and response validation
Flexible query parameters supporting filtering, pagination, and search across multiple fields
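To illustrate how query parameters for filtering and pagination can be validated before they reach the database, here is a minimal, dependency-free sketch. The project itself used Zod schemas for this step; the hand-written checks below are only a stand-in, and the parameter names (`chainId`, `search`, `limit`, `offset`), the default page size, and the upper bound are all assumptions made for the example.

```typescript
// Validated query shape. Field names and defaults are assumptions.
interface AccountQuery {
  chainId?: number;
  search?: string;
  limit: number;
  offset: number;
}

type ParseResult =
  | { ok: true; value: AccountQuery }
  | { ok: false; status: 400; error: string };

const DEFAULT_LIMIT = 20; // assumed default page size
const MAX_LIMIT = 100; // assumed upper bound

// Accepts only plain decimal digits, so "1e3", "-1" and "" are rejected.
function parseDigits(raw: string): number | null {
  return /^\d+$/.test(raw) ? Number(raw) : null;
}

function fail(error: string): ParseResult {
  return { ok: false, status: 400, error };
}

function parseAccountQuery(params: URLSearchParams): ParseResult {
  const query: AccountQuery = { limit: DEFAULT_LIMIT, offset: 0 };

  const chainId = params.get("chainId");
  if (chainId !== null) {
    const n = parseDigits(chainId);
    if (n === null) return fail("chainId must be a non-negative integer");
    query.chainId = n;
  }

  const limit = params.get("limit");
  if (limit !== null) {
    const n = parseDigits(limit);
    if (n === null || n < 1 || n > MAX_LIMIT) {
      return fail(`limit must be between 1 and ${MAX_LIMIT}`);
    }
    query.limit = n;
  }

  const offset = params.get("offset");
  if (offset !== null) {
    const n = parseDigits(offset);
    if (n === null) return fail("offset must be a non-negative integer");
    query.offset = n;
  }

  // Blank search strings are treated as "no search".
  const search = params.get("search")?.trim();
  if (search) query.search = search;

  return { ok: true, value: query };
}
```

A route handler could call `parseAccountQuery` first and, when `ok` is false, respond with the 400 status and the error message instead of running a query.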

Data Processing Pipeline

The system implemented a robust ETL (Extract, Transform, Load) pipeline:

Web scraping using Cheerio for HTML parsing
Progress tracking with CLI progress bars for large data operations
Data validation and normalization across different chain formats
Database optimization with efficient indexing and query patterns
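The validation and normalization step can be pictured with a short sketch. This is an illustration rather than the project's actual code: the `RawRow` and `CleanRow` shapes, the label slug format, and the deduplication key are assumptions. The only fixed fact it relies on is that an EVM address is `0x` followed by 40 hexadecimal characters.

```typescript
// Hypothetical shape of a scraped row before cleaning.
interface RawRow {
  chainId: number;
  address: string;
  label: string;
  nameTag?: string;
}

// Hypothetical shape of a row ready to be loaded.
interface CleanRow {
  chainId: number;
  address: string;
  label: string;
  nameTag: string | null;
}

// An EVM address is "0x" followed by exactly 40 hex characters.
const ADDRESS_RE = /^0x[0-9a-fA-F]{40}$/;

function normalizeRows(rows: RawRow[]): CleanRow[] {
  const seen = new Set<string>();
  const out: CleanRow[] = [];

  for (const row of rows) {
    const address = row.address.trim();
    if (!ADDRESS_RE.test(address)) continue; // drop malformed addresses

    const clean: CleanRow = {
      chainId: row.chainId,
      address: address.toLowerCase(),
      // Assumed convention: labels become lowercase, hyphenated slugs.
      label: row.label.trim().toLowerCase().replace(/\s+/g, "-"),
      nameTag: row.nameTag?.trim() || null,
    };

    // Keep only the first row for each (chain, address, label) triple.
    const key = `${clean.chainId}:${clean.address}:${clean.label}`;
    if (seen.has(key)) continue;
    seen.add(key);
    out.push(clean);
  }
  return out;
}
```

Lowercasing addresses discards EIP-55 checksum casing, which keeps lookups simple at the cost of losing the checksum; a real pipeline might store both forms.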

Multi-format Data Export

Data was made available in multiple formats to serve different use cases:

CSV files (accounts.csv, tokens.csv) for spreadsheet analysis
JSON exports for web applications and APIs
SQLite database for complex queries and joins
REST API with filtering, pagination, and search capabilities
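As an illustration of the CSV export, the sketch below serializes rows with the quoting rules most spreadsheet tools expect: fields containing commas, quotes, or line breaks are wrapped in double quotes, and embedded quotes are doubled. The function names and column names are hypothetical, and the sketch uses `\n` line endings rather than the CRLF that RFC 4180 specifies.

```typescript
type CsvValue = string | number | null;

// Quote a field only when it contains a comma, quote, or line break.
function escapeCsvField(value: CsvValue): string {
  if (value === null) return "";
  const text = String(value);
  return /[",\r\n]/.test(text) ? `"${text.replace(/"/g, '""')}"` : text;
}

// Serialize rows to CSV text, emitting columns in the given order.
function toCsv(columns: string[], rows: Record<string, CsvValue>[]): string {
  const header = columns.map(escapeCsvField).join(",");
  const lines = rows.map((row) =>
    columns.map((column) => escapeCsvField(row[column] ?? null)).join(","),
  );
  return [header, ...lines].join("\n") + "\n";
}
```

Passing the column list explicitly keeps the output order stable even if the row objects are built with keys in a different order.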

Challenges and Solutions

Rate Limiting and Ethical Scraping

Challenge: Responsibly extracting large amounts of data without overwhelming source servers.

Solution: Implemented intelligent rate limiting, respectful scraping intervals, and robust error handling to ensure sustainable data collection while respecting source platforms.
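One common way to implement this kind of pacing is a fixed pause between requests combined with exponential backoff on failures. The sketch below shows that pattern under stated assumptions: the function names, the one-second interval, the 30-second cap, and the retry count are illustrative and not taken from the project.

```typescript
// Exponential backoff capped at maxMs: base, 2*base, 4*base, ...
function backoffDelayMs(attempt: number, baseMs = 1000, maxMs = 30000): number {
  return Math.min(maxMs, baseMs * 2 ** attempt);
}

const sleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

interface PoliteOptions {
  intervalMs: number; // pause after every successful task
  maxRetries: number; // retries per task before giving up
}

// Runs tasks strictly one at a time. Each task stands in for one page fetch.
// `wait` is injectable so the pacing can be replaced in tests.
async function runPolitely<T>(
  tasks: Array<() => Promise<T>>,
  options: PoliteOptions = { intervalMs: 1000, maxRetries: 3 },
  wait: (ms: number) => Promise<void> = sleep,
): Promise<T[]> {
  const results: T[] = [];
  for (const task of tasks) {
    for (let attempt = 0; ; attempt++) {
      try {
        results.push(await task());
        break;
      } catch (error) {
        // Give up on this task once the retry budget is spent.
        if (attempt >= options.maxRetries) throw error;
        await wait(backoffDelayMs(attempt, options.intervalMs));
      }
    }
    await wait(options.intervalMs);
  }
  return results;
}
```

Running requests sequentially is slower than fetching in parallel, but it puts a hard ceiling on the load placed on the source site.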

Cross-chain Data Standardization

Challenge: Different blockchain explorers use varying data formats and structures.

Solution: Developed an abstract chain interface with concrete implementations for each supported network, ensuring consistent data schema while accommodating platform-specific differences.
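The shape of such an abstraction might look like the sketch below. The class names `EtherscanChain` and `OptimismChain` appear in the configuration shown earlier, and chain IDs 1 and 10 are the real IDs for Ethereum and Optimism. Everything else is hypothetical: the `ExplorerChain` base class, the `LabeledAddress` record, the `parseRow` and `toRecord` methods, and the table column layouts.

```typescript
// Hypothetical unified record shared by every chain.
interface LabeledAddress {
  chainId: number;
  address: string;
  label: string;
  nameTag: string | null;
}

// Hypothetical base class: shared behavior lives here, and per-explorer
// differences are pushed into the abstract members.
abstract class ExplorerChain {
  abstract readonly chainId: number;

  // Each explorer lays out its table columns differently.
  protected abstract parseRow(cells: string[]): { address: string; nameTag: string };

  // Converts one scraped table row into the unified schema.
  toRecord(label: string, cells: string[]): LabeledAddress {
    const { address, nameTag } = this.parseRow(cells);
    const trimmedTag = nameTag.trim();
    return {
      chainId: this.chainId,
      address: address.trim().toLowerCase(),
      label,
      nameTag: trimmedTag === "" ? null : trimmedTag,
    };
  }
}

class EtherscanChain extends ExplorerChain {
  readonly chainId = 1;
  // Assumed layout: address first, then name tag.
  protected parseRow(cells: string[]) {
    return { address: cells[0] ?? "", nameTag: cells[1] ?? "" };
  }
}

class OptimismChain extends ExplorerChain {
  readonly chainId = 10;
  // Assumed layout: a rank column precedes the address.
  protected parseRow(cells: string[]) {
    return { address: cells[1] ?? "", nameTag: cells[2] ?? "" };
  }
}
```

With this structure, adding a network means writing one small subclass, while the output schema stays fixed in the base class.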

Scale and Performance

Challenge: Processing millions of address labels efficiently while maintaining data integrity.

Solution: Leveraged Bun's performance advantages, implemented streaming data processing to avoid memory constraints, and utilized SQLite with optimized indexing for fast queries.
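The memory benefit of streaming comes from never holding the whole dataset at once. A generator that groups a lazy source into fixed-size batches is one simple way to get that property. The sketch below is an illustration with a hypothetical `inBatches` helper, not the project's implementation.

```typescript
// Groups any iterable into arrays of at most `size` items, lazily.
// Only the batch currently being built is held in memory.
function* inBatches<T>(items: Iterable<T>, size: number): Generator<T[]> {
  if (!Number.isInteger(size) || size < 1) {
    throw new RangeError("size must be a positive integer");
  }
  let batch: T[] = [];
  for (const item of items) {
    batch.push(item);
    if (batch.length === size) {
      yield batch;
      batch = [];
    }
  }
  // Emit the final, possibly shorter, batch.
  if (batch.length > 0) yield batch;
}
```

Each yielded batch could then be written to SQLite inside a single transaction, which usually performs much better than inserting rows one statement at a time.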

API Reliability and Documentation

Challenge: Providing a stable, well-documented API for diverse user needs.

Solution: Built comprehensive OpenAPI documentation, implemented proper error handling and validation, and designed the system for reliable hosting and deployment.

Supported Blockchain Networks

The project aggregated data from seven major EVM-compatible networks:

Ethereum (Etherscan) - The primary network with the most comprehensive labeling data
Arbitrum (Arbiscan) - Leading Layer 2 scaling solution
Optimism (Optimism Explorer) - Popular optimistic rollup network
Base (BaseScan) - Coinbase's Layer 2 network
Binance Smart Chain, now BNB Smart Chain (BscScan) - High-throughput EVM-compatible chain
Gnosis Chain (Gnosis Explorer) - Ethereum sidechain whose transaction fees are paid in the xDAI stablecoin
Celo (Celo Explorer) - Mobile-first blockchain platform

API Implementation

The REST API I developed provided comprehensive access to the labeled address dataset through a clean, well-documented interface:

Endpoints

/labels - Retrieved all available labels
/labels/:address - Returned labels for a specific address
/accounts - Searched account labels with advanced filtering
/tokens - Searched token information with filtering capabilities
/health - API health check endpoint

API Features

Advanced filtering by chain ID, address, label type, and name tags
Pagination support with configurable offset and limit parameters
Flexible search across multiple data fields
Type-safe responses with comprehensive Zod validation
Automatic OpenAPI documentation generation for easy integration
Consistent error handling with meaningful HTTP status codes
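Filtering and pagination of this kind usually come down to assembling a parameterized SQL statement from whichever filters are present. The project used the Kysely query builder for this; the hand-rolled builder below is a simplified stand-in, and the table name `accounts`, the column names, and the `AccountFilter` shape are assumptions.

```typescript
// Hypothetical filter object produced after request validation.
interface AccountFilter {
  chainId?: number;
  address?: string;
  label?: string;
  nameTag?: string;
  limit: number;
  offset: number;
}

interface BuiltQuery {
  sql: string;
  params: Array<string | number>;
}

function buildAccountsQuery(filter: AccountFilter): BuiltQuery {
  const where: string[] = [];
  const params: Array<string | number> = [];

  // User input only ever goes into `params`, never into the SQL text.
  if (filter.chainId !== undefined) {
    where.push("chain_id = ?");
    params.push(filter.chainId);
  }
  if (filter.address !== undefined) {
    where.push("address = ?");
    params.push(filter.address.toLowerCase());
  }
  if (filter.label !== undefined) {
    where.push("label = ?");
    params.push(filter.label);
  }
  if (filter.nameTag !== undefined) {
    where.push("name_tag LIKE ?"); // substring match on the name tag
    params.push(`%${filter.nameTag}%`);
  }

  const clause = where.length > 0 ? ` WHERE ${where.join(" AND ")}` : "";
  params.push(filter.limit, filter.offset);

  return {
    sql:
      "SELECT chain_id, address, label, name_tag FROM accounts" +
      clause +
      " ORDER BY address LIMIT ? OFFSET ?",
    params,
  };
}
```

The explicit `ORDER BY` matters for offset pagination: without a stable ordering, consecutive pages can overlap or skip rows.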

Outcomes and Impact

During its active development period, Eth Labels successfully democratized access to blockchain address labeling data, providing:

Enhanced research capabilities for academic and industry analysts
Improved compliance tools for financial institutions and regulators
Better fraud detection through accessible address reputation data
Standardized data formats that reduced integration complexity for developers

The project demonstrated the value of making previously siloed blockchain data accessible through modern API architecture, while highlighting the importance of collaboration over competition in the open-source ecosystem.

Technologies Used

Core Runtime and Framework

Bun - High-performance JavaScript runtime for enhanced speed
TypeScript - Type-safe development with modern language features
Elysia.js - Fast, ergonomic web framework built for Bun

Data Processing and Storage

SQLite - Embedded database for efficient local data storage
Kysely - Type-safe SQL query builder for database operations
Cheerio - Server-side HTML parsing for web scraping
Viem - TypeScript interface for Ethereum interaction

Development and Operations

Docker - Containerization for consistent deployment
ESLint & Prettier - Code quality and formatting automation
Husky - Git hooks for pre-commit validation
Railway - Cloud deployment platform

Testing, Validation, and Tooling

Bun Test - Native testing framework for unit and integration tests
CLI Progress - User-friendly progress tracking for long-running operations
Zod - Runtime type validation and schema definition

This project represented a valuable learning experience in blockchain infrastructure development, demonstrating modern TypeScript development practices, efficient data processing techniques, and the importance of strategic collaboration in the open-source ecosystem.