How to Build an Auto.ria.com Parser: 5 Key Lessons from Production

Our team has been building scrapers for over 10 years. With the rise of vibe-coding tools (Cursor, Copilot and similar), many developers can now write a scraper on their own in a couple of hours. That's great. But every specific website has its own hidden pitfalls — ones that aren't visible at first glance and only surface in production. Here's what we discovered while scraping Auto.ria.com on a real project.


1. Where to Get Data: HTML, Not an API

The first question when starting any scraper — does the site have an API? Auto.ria.com runs on Vue.js with server-side rendering: data is hydrated into a Pinia store and simultaneously present in the DOM. There is no public API — data must be extracted from HTML.

Five key fields on a listing page:

Field         Selector
Car model     #sideTitleTitle span
Price         #sidePrice strong
Seller name   #sellerInfoUserName span
City          #basicInfoTableMainInfoGeo span (3rd comma-separated part)
Phone         #autoPhonePopUpResponse (after clicking the button)

To quickly inspect the structure on any listing page, paste this snippet into the DevTools console:

const get = (sel) => document.querySelector(sel)?.innerText?.trim() ?? null;

console.table({
  title:  get('#sideTitleTitle span')
       ?? document.querySelector('h1')?.innerText?.trim(),

  price:  get('#sidePrice strong')
       ?? get('#basicInfoPrice strong'),

  seller: get('#sellerInfoUserName span'),

  city: (() => {
    const geo = get('#basicInfoTableMainInfoGeo span');
    if (!geo) return null;
    const parts = geo.split(',').map(s => s.trim()).filter(Boolean);
    return parts[2] ?? parts[parts.length - 1];
  })(),

  // Phone will be null until the "Show number" button is clicked
  phone: document.querySelector('#autoPhonePopUpResponse')?.innerText?.trim()
      ?? document.querySelector('a[href^="tel:"]')?.href?.replace('tel:', ''),
});

Run it directly on a listing page — you'll immediately see which fields are available.


2. The Phone Number Extraction Flow

This is the most complex part of scraping Auto.ria.com. Phone numbers are hidden behind a modal that opens on button click. The button intentionally shows a masked number like "063 XXX XX XX" — the presence of "XXX" in the text confirms you've found the right button and the number hasn't been revealed yet.

The full phone extraction flow:

1. Verify URL contains /auto_ (it's a listing page)
2. Wait 500ms for the page to stabilize
3. Dismiss GDPR banner: button.fc-cta-consent (if present)
4. Dismiss "Switch to new version" prompt: button#switchingVersionsSet (if present)
5. Find the button: button[data-action='showBottomPopUp'] with "XXX" in its text
6. Realistic click: scrollIntoView → 100ms pause → click with mouse simulation
7. Wait up to 5000ms for #autoPhonePopUpResponse to appear
8. If the popup didn't appear — throw an exception, retry the page
9. Random delay of 900–1500ms before moving to the next page

The critical part is steps 3 and 4. If you don't dismiss the service popups before clicking the phone button, the click will miss. This is a classic bug that's hard to catch during manual testing because these popups appear rarely during normal browsing.
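A few checks in this flow reduce to tiny pure predicates. A sketch in plain JavaScript (the function names are mine, not from the project):

```javascript
// Step 1: only listing pages (URLs containing /auto_) get the phone flow.
function isListingPage(url) {
  return url.includes('/auto_');
}

// Step 5: the right button still shows a masked number like "063 XXX XX XX".
function isMaskedPhoneButton(buttonText) {
  return buttonText.includes('XXX');
}

// Step 9: random delay of 900-1500ms before moving to the next page.
function nextPageDelayMs() {
  return 900 + Math.floor(Math.random() * 601); // 900..1500 inclusive
}
```

Keeping these as pure functions makes the retry logic in step 8 easy to unit-test without a live browser.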


3. Pagination Resolver: Why "Click Next Page" Doesn't Work

Auto.ria.com uses a ?page=N URL parameter. Seems simple — just increment the number, right?

The problem is that when working through proxies, a pagination page may load, look normal, but contain the same listings as the previous page (due to caching or proxy geo-rotation). The scraper won't notice and will silently collect duplicates. Or the page may load empty, causing the scraper to skip several pages of real data.

The solution is content-dependent advancement: only move to the next page if the current one yielded listing links. No links — stop.

public string? ResolveNextPage(string sourceUrl, string html, List<string> detailLinks)
{
    if (detailLinks.Count > 0)
    {
        var currentPage = TryGetPageFromUrl(sourceUrl) ?? 0;

        // Optional page limit from config
        if (_maxPagesToParse > 0)
        {
            var parsedSoFar = currentPage - (_firstPageSeen ?? currentPage) + 1;
            if (parsedSoFar >= _maxPagesToParse)
                return null;
        }

        return SetOrReplacePageParam(sourceUrl, currentPage + 1);
    }

    return null; // No links = end of results
}

Additionally, all processed links are stored in a HashSet, so even if the same listing appears on two different listing pages, it will only be scraped once.
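The same two ideas, page-parameter replacement and link deduplication, can be sketched in JavaScript with the standard URL API and a Set standing in for the C# HashSet (helper names are mine):

```javascript
// Set or replace the ?page=N parameter on a listing URL.
function setOrReplacePageParam(sourceUrl, page) {
  const u = new URL(sourceUrl);
  u.searchParams.set('page', String(page));
  return u.toString();
}

// Dedup across listing pages: each link is scraped at most once.
const seenLinks = new Set();
function keepUnseen(detailLinks) {
  const fresh = [];
  for (const link of detailLinks) {
    if (!seenLinks.has(link)) {
      seenLinks.add(link);
      fresh.push(link);
    }
  }
  return fresh;
}
```
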


4. Excel Export: Do It at the End, Not During Scraping

A common mistake is writing data to Excel row by row as it's collected. This creates several problems: the file is constantly open for writing, deduplication is impossible, the final dataset can't be sorted, and a crash means losing the entire file.

The correct approach:

Scraping loop
  └- ExtractDataAsync() -> DataRow
       └- write one row to .csv (thread-safe via SemaphoreSlim)

End of session
  └- ConvertCsvToExcelAsync() -> .xlsx (single pass, full file)

CSV acts as an intermediate buffer — it's simple, fast, and doesn't require holding the entire dataset in memory. Excel is generated once at the end via the OpenXML SDK (no third-party libraries).
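One detail worth getting right in the intermediate CSV: fields with commas, quotes, or newlines (common in car titles and seller names) must be quoted, or the file will not convert cleanly. A minimal escaper following RFC 4180 rules (not the project's actual code):

```javascript
// Quote a field only when it contains a comma, quote, or line break (RFC 4180).
function csvField(value) {
  const s = String(value ?? '');
  return /[",\n\r]/.test(s) ? '"' + s.replace(/"/g, '""') + '"' : s;
}

// Join escaped fields into one CSV row.
function csvRow(values) {
  return values.map(csvField).join(',');
}
```
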

Output columns in the file:

Column      Description
SourceUrl   Full listing URL
CarModel    Make, model, year
City        Seller's city
Name        Seller name or company
Phone       Phone number after reveal
Price       Price with currency symbol

[SCREENSHOT: Excel result — table with columns SourceUrl, CarModel, City, Name, Phone, Price]

5. The Hidden Main Problem: Browser Fingerprint Detection

Auto.ria.com uses Cloudflare with JavaScript-level verification. A standard headless Chrome via Puppeteer gets blocked not by headers, but by browser behavior at the JS level.

Signals that reveal automation:

- navigator.webdriver returning true
- round, implausible hardwareConcurrency and deviceMemory values
- WebGL reporting a software or headless renderer instead of a real GPU
- a canvas fingerprint that is identical across every session and IP

The solution is a stealth script injected via page.EvaluateExpressionOnNewDocumentAsync() that runs before any site code:

// Remove the webdriver flag
Object.defineProperty(navigator, 'webdriver', { get: () => undefined });

// Realistic hardware (office PC: 8 cores, 8GB RAM)
Object.defineProperty(navigator, 'hardwareConcurrency', { get: () => 8 });
Object.defineProperty(navigator, 'deviceMemory', { get: () => 8 });

// Spoof GPU to Intel
const getParameterOrig = WebGLRenderingContext.prototype.getParameter;
WebGLRenderingContext.prototype.getParameter = function(p) {
    if (p === 37445) return 'Intel Inc.';                // UNMASKED_VENDOR_WEBGL
    if (p === 37446) return 'Intel(R) UHD Graphics 630'; // UNMASKED_RENDERER_WEBGL
    return getParameterOrig.call(this, p);
};

// Add canvas noise — unique per session
const toDataURLOrig = HTMLCanvasElement.prototype.toDataURL;
HTMLCanvasElement.prototype.toDataURL = function(type, quality) {
    // mutate 3 specific pixels seeded from the current session,
    // then delegate to the original implementation
    return toDataURLOrig.call(this, type, quality);
};
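The "seeded from the current session" part can be any small deterministic PRNG; one common choice is mulberry32. A sketch (the seed source and pixel count are illustrative, not the project's exact scheme):

```javascript
// mulberry32: tiny deterministic PRNG, seeded once per session.
function mulberry32(seed) {
  let a = seed >>> 0;
  return function () {
    a = (a + 0x6D2B79F5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Derive 3 pixel positions to mutate; same seed yields the same noise
// for the whole session, so the fingerprint is stable but unique.
function noisePixels(seed, width, height) {
  const rand = mulberry32(seed);
  return [0, 1, 2].map(() => ({
    x: Math.floor(rand() * width),
    y: Math.floor(rand() * height),
  }));
}
```

Stability matters here: if the canvas noise changed on every call, that itself would be a detectable anomaly.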

Important: each proxy IP gets its own unique fingerprint drawn from 6 device profiles (Intel office, NVIDIA gaming, AMD workstation, etc.) and an isolated browser cache (BrowserCache/{proxy_host}/). Without this, Auto.ria.com starts returning CAPTCHAs or pages without the phone reveal button after just the first few requests.
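Mapping each proxy to a stable profile can be as simple as hashing the proxy host into an index. A sketch (the hash choice is mine, and only the first three profile names come from the article; the rest are placeholders):

```javascript
const PROFILES = [
  'intel-office', 'nvidia-gaming', 'amd-workstation',
  'profile-4', 'profile-5', 'profile-6',
];

// FNV-1a over the proxy host: the same proxy always gets the same profile,
// so its fingerprint never changes between sessions.
function profileFor(proxyHost) {
  let h = 0x811c9dc5;
  for (const ch of proxyHost) {
    h ^= ch.charCodeAt(0);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return PROFILES[h % PROFILES.length];
}
```

The isolated cache directory follows the same key, so cookies and Cloudflare clearance tokens stay consistent per proxy.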


Stuck or Getting Blocked?

Two common situations people come to us with:

Couldn't build the scraper yourself or don't want to spend the time — we'll handle it end-to-end. Describe what you need to collect, from where, and in what format — we'll estimate the timeline and cost.

The scraper works but gets blocked after N records — this is solvable. Proxy rotation, fingerprinting, session management — that's exactly what we specialize in. Get in touch.

Contact us