Where I Find My Data
October 5, 2025
A curated list of all the data sources that I use for my EDA, ML, and visualization experiments.
Hey there!
Throughout my studies, my professors have always emphasized the importance of choosing the right data source for any project, because it can shape the entire outcome of a data science workflow.
As a rough rule of thumb, data accounts for about 70% of a project’s predictive power, while the model contributes the remaining 30%. In other words, a strong dataset is crucial.
So, for this blog post, I decided to share some of the data sources I use most often in my projects, from classic repositories to a few niche finds.
- Kaggle Datasets - https://www.kaggle.com/datasets
- Anyone who has worked with data is probably familiar with Kaggle. I find myself coming back to it regularly, whether for class projects or just exploring for future ideas.
- For each dataset, you’ll often find user-contributed notebooks, EDA reports, ML models, and visualizations, which really help a lot!
- UC Irvine Machine Learning Repository - https://archive-beta.ics.uci.edu/
- A classic collection of datasets, domain theories, and data generators used for decades by the ML community.
- It has been widely used by students, educators, and researchers all over the world as a primary source of machine learning datasets.
- Registry of Open Data on AWS - https://registry.opendata.aws/
- Helps people discover and share datasets stored on AWS.
- When data is shared on AWS, anyone can analyze it and build services on top of it using a broad range of compute and data analytics products, including Amazon EC2, Amazon Athena, AWS Lambda, and Amazon EMR. Sharing data in the cloud lets data users spend more time on data analysis rather than data acquisition.
- You can also submit your own project and may even get featured.
- Google Dataset Search - https://datasetsearch.research.google.com/
- A search engine for datasets across thousands of web repositories.
- Great for quick discovery when you have a specific topic in mind.
- Microsoft Research Open Data - https://www.microsoft.com/en-us/research/tools/
- An index of datasets, SDKs, APIs, and open-source tools developed by Microsoft researchers.
- Particularly useful for academic and research-oriented projects.
- GitHub: awesome-public-datasets - https://github.com/awesomedata/awesome-public-datasets
- A community-curated list of topic-specific public datasets, collected and tidied from blogs, answers, and user responses.
- Most are free, and it’s a great place to stumble upon something unexpected.
- Open Government Data Platform (OGD) India - https://www.data.gov.in/
- Hosted by the National Informatics Centre (NIC) under India’s Ministry of Electronics & IT.
- Contains a wide range of datasets from Indian government sources.
- U.S. Government’s Open Data - https://data.gov/
- The U.S. federal government’s central open data hub.
- Includes everything from demographic data to environmental, health, and economic datasets.
- OpenDataNI - https://www.opendatani.gov.uk/
- Features datasets from public-sector organizations in Northern Ireland.
- The official portal for European data - https://data.europa.eu/en
- A single access point for open data from across the EU, from international, national, regional, local and geodata portals.
- Airbnb Data Portal - https://www.airroi.com/data-portal/
- One of my most interesting finds.
- Offers comprehensive Airbnb data worldwide through downloadable datasets and real-time API endpoints, which makes it great for market analysis and ROI studies.
Data sources can make or break a project. Exploring them not only sparks new project ideas but also gives you a better sense of the data’s quality and context.
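To make the "quality and context" part a bit more concrete, here is a minimal sketch of the kind of first-pass check you could run on a freshly downloaded CSV from any of the sources above. It only uses Python's standard library. The function name `quick_profile`, the list of tokens treated as missing, and the sample rows are all made up for illustration and are not tied to any particular dataset.

```python
import csv
import io


def quick_profile(csv_file, missing_tokens=("", "NA", "NaN", "null")):
    """Return row count, column names, and per-column missing-value counts.

    `csv_file` is any open text file (or file-like object) with a header row.
    `missing_tokens` is an assumed list of strings to treat as "missing".
    """
    reader = csv.DictReader(csv_file)
    columns = list(reader.fieldnames or [])
    missing = {name: 0 for name in columns}
    rows = 0
    for row in reader:
        rows += 1
        for name in columns:
            value = row.get(name)
            # Short rows yield None; otherwise compare the trimmed cell text.
            if value is None or value.strip() in missing_tokens:
                missing[name] += 1
    return {"rows": rows, "columns": columns, "missing": missing}


# Tiny made-up sample standing in for a downloaded file.
sample = io.StringIO("city,price,rating\nPune,42,4.5\nDelhi,,NA\nGoa,55,\n")
profile = quick_profile(sample)
print(profile)
```

For a real file you would pass an open handle instead of the in-memory sample, for example `with open("listings.csv", newline="") as f: quick_profile(f)`, where `listings.csv` is a hypothetical filename. A check like this will not tell you whether the data is good, but it quickly shows how many rows you have and which columns are mostly empty.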
If you have a favorite source I missed, I’d love to hear about it!
~Vibhav