2025-05-29 |
Preparing to scrape and populate the db so that no page is visited more than once. Restructured folders and removed some extraneous pages that are no longer needed.
moreserverless
committed
on 29 May
|
---|---|
2025-05-26 |
Added scraped data. The filename is the H1 tag of the scraped page. _RAW.txt is the data returned from BeautifulSoup, so minor changes don't require re-scraping the page.
moreserverless
committed
on 26 May
|
Cleaned up the parsed output and kept the text from <a> tags that were previously stripped out. Most of the <a> tags contained <code> blocks or <pre> tags.
moreserverless
committed
on 26 May
|
|
Commented out an overcomplicated way of extracting only the page content.
moreserverless
committed
on 26 May
|
|
2025-05-25 |
Removed link tags from the article body.
moreserverless
committed
on 25 May
|
Initial commit.
moreserverless
committed
on 25 May
|