Oxylabs Champions Ethical Web Data for the AI Era

Oxylabs builds ethical, scalable web data tools for AI, e-commerce, and research—fueling innovation while setting industry standards.

The Internet contains an enormous amount of valuable information, from product prices and financial data to news, research, and user-generated content.

But accessing this information at scale is difficult. Websites often block automated requests, limit access, or present data in formats that are hard to collect and structure.

For companies in e-commerce, finance, cybersecurity, and especially artificial intelligence, this creates a major barrier: they need vast, diverse, and constantly updated datasets, yet lack the infrastructure to gather them reliably and legally.

Oxylabs is a Lithuanian tech company that provides large-scale web data collection and proxy infrastructure. Its proxy infrastructure, scraping technologies, and ready-made datasets give businesses and researchers a compliant, efficient way to tap into the public web.

In doing so, Oxylabs not only supplies the raw material that fuels innovation — from AI training to market analysis — but also sets ethical standards in an industry often criticised for misuse, ensuring that data-driven progress can continue without compromising ethics or legality.

According to Grybauskas, Oxylabs started in 2015 by renting out data centre IP addresses. He recounts:

“We quickly realised there was a real need for robust, scalable public web data aggregation infrastructure. So we kept developing. Today, we offer many products related to public web data acquisition and aggregation.”

The public web as the world’s largest dataset

Grybauskas contends that the public web is the most diverse and dynamic dataset we have.

“If we want AI systems that are fair, representative, and globally relevant, access to the public web must remain available to everyone.”

This year Oxylabs launched the world’s first ethical video datasets, requiring creator consent for AI training. According to Grybauskas, “When it comes to datasets — especially YouTube datasets — we noticed that generative AI companies are very interested in video content.”

In December 2024, YouTube changed its policy to let content creators opt in to having third-party AI companies train models on their videos. In response, Oxylabs decided to build a dataset by aggregating videos whose creators have opted in to AI training or that are licensed under Creative Commons.

All datasets offered by Oxylabs include videos, transcripts, and rich metadata. While such data has many potential use cases, Oxylabs refined and prepared it specifically for AI training, which is the use that the content creators have knowingly agreed to.

“Selling picks and shovels in the data gold rush”

Grybauskas contends that there’s a misconception that the internet is only about personal data:

“In reality, there are petabytes of non-personal information — like e-commerce data — that are just as important. Datasets are a tiny part of our business. Primarily, we’re an infrastructure provider. We joke internally that we’re selling picks and shovels during the gold rush.”

The company has also invested heavily in innovation, holding over 100 patents — mostly in the US. “In fact, if you look at Lithuanian companies filing US patents over the last five years, Oxylabs accounts for about 30 per cent of them. We’re very proud of our intellectual property team and our engineers who continue to innovate,” recounts Grybauskas.

Building an ethical industry standard

The release of ethically sourced YouTube datasets continues Oxylabs’ longtime mission to establish and promote ethical industry practices. Oxylabs also stands out for its work in creating a more ethical web and making data more accessible to not-for-profits and investigative journalists.

It’s one of the founders of the Ethical Web Data Collection Initiative, a global, industry-led group advancing responsible data aggregation. The initiative defines best practices, promotes transparency, and helps organisations navigate the digital ecosystem ethically.

According to Grybauskas, “When we launched the initiative with the first group of companies, we wanted to show that not all scraping is bad, and that scraping companies don’t have to be associated with botnets or shady practices.”

“We published a set of principles that define what’s acceptable and what isn’t.

“Over time, more companies have expressed interest in joining, but we only accept a select few. As insiders, we know which players didn’t meet the standards. That selectivity helped us become a sort of guiding light for ethical practices in the industry.”

Web data for public good

The company is also behind pro bono Project 4β, which provides access to public web data gathering infrastructure, expertise, and legal/technical advisory to researchers, journalists, NGOs, academic institutions, and organisations engaged in social-impact missions.

It lowers the barrier to large-scale web data access for people and organisations that might not have the resources to build the infrastructure themselves. Through it, Oxylabs offers free masterclasses and training, provides guidance on the legal, ethical, and technical aspects of public web data gathering, and funds or advises academic and public-interest projects that tackle challenging questions requiring web data.

For example, Oxylabs collaborated with Lithuania’s Environmental Protection Department (EPD) to detect and tackle advertisements on Lithuanian online marketplaces that break environmental law. They used web crawling and scraping infrastructure to monitor listings that might violate those laws, such as offers of banned chemicals or protected species. It’s a powerful example of how public institutions can adopt web intelligence to enforce regulation.

In Germany, Project 4β partnered with CeMAS (Centre for Monitoring, Analysis, and Strategy), which used the Web Scraper API to monitor news articles and other content relevant to extremist mobilisation, especially around Pride events and counter-protests. The scraped data helps CeMAS track the behaviour and communication of far-right groups.

Ethical scraping starts with how you source proxies

Oxylabs also secured a partnership with Honeygain, the largest passive income app of its kind.

Once installed on a computer or phone, the app connects the device to Oxylabs’ proxy network, where the pooled bandwidth is used by businesses for legitimate purposes like price comparison, SEO monitoring, ad verification, and market research. Instead of relying on shady or malware-based networks, Honeygain provides a transparent, opt-in model where users are compensated for their contribution.

Grybauskas explained:

“Our infrastructure relies on proxy networks: millions of IP addresses, both data centre and residential. Some companies acquire these through malware, which is unethical. We chose a different path. We launched Honeygain, the largest passive income app of its kind.”

According to Grybauskas, “in some countries, it’s just beer money; in others, it’s a meaningful addition to income.” Users can also choose ad-free app experiences in exchange for sharing bandwidth. Consent and compensation are central to the company’s model. When it comes to residential proxies, however, Grybauskas admits that the company worries about competitors who don’t care about compliance.

“For example, after Russia’s full-scale invasion of Ukraine, we immediately cut ties with all Russian customers. Some of our competitors didn’t. For us, that was a moral decision. Ethical scraping involves the whole chain: how you get proxies, who you sell to, and how they use the data.”
