Scratchpad crunchbase
1/8/2024

Crunchbase has an enormous business dataset that can be used in a variety of forms of market analytics and business intelligence. For example, the company dataset contains the company's summary details (like description, website and address), public financial information (like acquisitions and investments) as well as leadership and used technology data.

Additionally, Crunchbase data contains a lot of data points used in lead generation, like the company's contact details, leadership's social profiles and events aggregation.

For more on scraping use cases see our extensive web scraping use case article.

Project Setup

In this tutorial we'll be using Python and two major community packages:

httpx - HTTP client library which will let us communicate with Crunchbase's servers.
parsel - HTML parsing library, though we'll be doing very little HTML parsing in this tutorial and will mostly be working with JSON data directly instead.

Optionally, we'll also use loguru - a pretty logging library that'll help us keep track of what's going on via nice colorful logs.

These packages can be easily installed via the pip command:

$ pip install httpx parsel loguru

Alternatively, feel free to swap httpx out with any other HTTP client package such as requests, as we'll only need basic HTTP functions which are almost interchangeable in every library. As for parsel, another great alternative is the beautifulsoup package.

Crunchbase contains several data types: acquisitions, people, events, hubs, funding rounds and companies. In this tutorial, we'll focus on company and people data, though we'll be using generic parsing techniques which can be applied to all of the Crunchbase pages.

You can explore available data types by taking a look at the /discover page.

[Image: the discovery page shows all available dataset types]

To start scraping content we need a way to find all of the company or people URLs. Crunchbase does offer a search system; however, it's only for its premium users. Since Crunchbase wants to be crawled and indexed by search engines, it offers a sitemap directory that contains all of its target URLs.

Let's start by taking a look at the /robots.txt endpoint:

User-agent: *

The /robots.txt page indicates crawling suggestions for various web crawlers (like Google etc). We can see that there's a sitemap index that contains indexes for various target pages.
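The robots.txt and sitemap index lookup described in this article can be sketched roughly as follows. This is a minimal illustration, not the article's own code: the helper names (`sitemaps_from_robots`, `urls_from_sitemap`, `fetch`, `discover`) are hypothetical, and it assumes the site lists its sitemaps through standard `Sitemap:` directives and serves standard sitemap XML, possibly gzip-compressed. Crunchbase may also reject plain HTTP clients, so the network part may need extra headers or other workarounds.

```python
import gzip
import xml.etree.ElementTree as ET


def sitemaps_from_robots(robots_txt: str) -> list[str]:
    """Return the URLs listed in `Sitemap:` directives of a robots.txt body."""
    urls = []
    for line in robots_txt.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "sitemap":
            urls.append(value.strip())
    return urls


def urls_from_sitemap(xml_body: bytes) -> list[str]:
    """Return every <loc> value from a sitemap or sitemap index document."""
    if xml_body[:2] == b"\x1f\x8b":  # gzip magic number
        xml_body = gzip.decompress(xml_body)
    root = ET.fromstring(xml_body)
    # Compare local tag names so the sitemap XML namespace does not matter.
    return [
        el.text.strip()
        for el in root.iter()
        if el.tag.rsplit("}", 1)[-1] == "loc" and el.text
    ]


def fetch(url: str) -> bytes:
    """Download a URL with httpx (assumption: no special headers are needed)."""
    import httpx  # imported lazily so the parsing helpers work without it

    response = httpx.get(url, follow_redirects=True, timeout=30)
    response.raise_for_status()
    return response.content


def discover(base_url: str) -> dict[str, list[str]]:
    """Map each sitemap listed in robots.txt to the URLs it contains."""
    robots = fetch(base_url.rstrip("/") + "/robots.txt").decode("utf-8", "replace")
    return {url: urls_from_sitemap(fetch(url)) for url in sitemaps_from_robots(robots)}
```

Calling `discover("https://www.crunchbase.com")` would, under these assumptions, return each sitemap index URL together with the sitemap URLs it points to. The parsing helpers are kept separate from the network call so they can be tried on saved copies of the files first.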