clewis/gcp_docs_scrape

Proof of concept (PoC) for scraping the GCP documentation pages

6 commits

latest commit e6655f2ef1 moreserverless authored on 29 May
	docker	Preparing to scrape and populate db to not visit a page more than once. Restructured folders and removed some extraneous pages no longer needed.	2 months ago
	gcp_pages	Preparing to scrape and populate db to not visit a page more than once. Restructured folders and removed some extraneous pages no longer needed.	2 months ago
	sitemap_data	Preparing to scrape and populate db to not visit a page more than once. Restructured folders and removed some extraneous pages no longer needed.	2 months ago
	.gitignore	Preparing to scrape and populate db to not visit a page more than once. Restructured folders and removed some extraneous pages no longer needed.	2 months ago
	.python-version	Initial commit.	2 months ago
	gcp_docs.ipynb	Cleaning up parsed output as well as keeping the text from <a> tags that were previously stripped out. Most of the <a> contained <code> blocks or <pre> tags.	2 months ago
	gcp_products.ipynb	Preparing to scrape and populate db to not visit a page more than once. Restructured folders and removed some extraneous pages no longer needed.	2 months ago
	sample.ipynb	Initial commit.	2 months ago
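
Several commit messages above mention populating a db so that a page is not visited more than once. The repository's actual schema and code are not shown here, so the following is only a minimal sketch of that idea under stated assumptions: it uses SQLite from the Python standard library, and the table name `visited` and the helpers `open_visited_db`, `mark_visited`, and `crawl` are hypothetical names chosen for illustration.

```python
import sqlite3


def open_visited_db(path=":memory:"):
    """Open (or create) a SQLite database holding the set of visited URLs.

    The table name `visited` is an assumption made for this sketch.
    """
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE IF NOT EXISTS visited (url TEXT PRIMARY KEY)")
    return conn


def mark_visited(conn, url):
    """Record `url`; return True if it was new, False if already recorded."""
    # The PRIMARY KEY constraint plus INSERT OR IGNORE makes this idempotent:
    # a repeated URL inserts nothing and leaves rowcount at 0.
    cur = conn.execute("INSERT OR IGNORE INTO visited (url) VALUES (?)", (url,))
    conn.commit()
    return cur.rowcount == 1


def crawl(conn, urls, fetch):
    """Call `fetch(url)` only for URLs not yet recorded in the database.

    Note that a URL is marked before it is fetched, so a failed fetch is
    not retried in this simplified version.
    """
    fetched = []
    for url in urls:
        if mark_visited(conn, url):
            fetch(url)
            fetched.append(url)
    return fetched
```

Passing a file path instead of `":memory:"` would keep the visited set across runs, which is what lets a later scrape skip pages fetched earlier. Here `fetch` is any callable supplied by the caller, for example a function that downloads and parses one page.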