This repository has been archived by the owner on Dec 3, 2019. It is now read-only.

Update Handbook guidance on publication date, access date, crawl date #4

Open
tlongers opened this issue Mar 14, 2017 · 0 comments

@tlongers

TL:

In some sources, a date of publication or update is not included at all within the source itself. In these cases, it may be preferable to make use of the date a snapshot of the source was made by the Internet Archive. This needs to be worked through and new guidance included in the Handbook.

My suggestion is that where there is a stated publication date we keep the current citation format. Absent a stated publication date, we parse the crawl date out of the Internet Archive snapshot and include it as a separate field in the raw data. In the human-readable source citation, we flag it as a crawl date in the date field. For example, this source from the SFM Mexico research:

Ejército Mexicano – Regiones Militares. Secretaría de la Defensa Nacional (Mexico). Date crawled: 8 February 2004. http://sedena.gob.mx/ejercito/comandancias/reg_mil.htm Internet Archive Link: https://web.archive.org/web/20040208205506/http://sedena.gob.mx/ejercito/comandancias/reg_mil.htm

This is particularly important where the content of a web page changes over time and none of the versions in the Internet Archive carries a publication date.
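A minimal sketch of the parsing step, assuming the crawl date is read from the 14-digit timestamp (YYYYMMDDhhmmss, UTC) embedded in the Internet Archive link. The function names and the regular expression are illustrative and not part of any existing SFM tooling; only the "Date crawled:" label comes from the example citation above.

```python
import re
from datetime import datetime

# Wayback Machine links look like:
#   https://web.archive.org/web/<14-digit timestamp>/<original URL>
# An optional modifier such as "id_" may follow the timestamp.
WAYBACK_RE = re.compile(r"^https?://web\.archive\.org/web/(\d{14})[a-z_]*/(.+)$")

# Month names are spelled out here so the output does not depend on locale.
MONTHS = ("January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December")


def parse_wayback_url(archive_url):
    """Return (crawl datetime in UTC, original URL) from an Internet Archive link."""
    match = WAYBACK_RE.match(archive_url)
    if match is None:
        raise ValueError(f"not a timestamped Wayback Machine URL: {archive_url}")
    crawled = datetime.strptime(match.group(1), "%Y%m%d%H%M%S")
    return crawled, match.group(2)


def format_crawl_date(crawled):
    """Render the date field for the human-readable citation."""
    return f"Date crawled: {crawled.day} {MONTHS[crawled.month - 1]} {crawled.year}"


link = ("https://web.archive.org/web/20040208205506/"
        "http://sedena.gob.mx/ejercito/comandancias/reg_mil.htm")
crawled, original = parse_wayback_url(link)
print(format_crawl_date(crawled))
print(original)
```

The parsed datetime can be stored as the separate raw-data field, while the formatted string goes into the citation.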

TW:

Agreed - as an addition to this workflow, I suggest that the researcher always capture a snapshot of the page when they first access it. This way, even for pages that have been crawled before, we can use the crawl date as a substitute for the common "date accessed" formulation.
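One way the capture-on-first-access step might be scripted is sketched below. It assumes the Wayback Machine's public "Save Page Now" address (`https://web.archive.org/save/<url>`) and its availability API (`https://archive.org/wayback/available`), and the response shape shown is my understanding of that API; check both against the Internet Archive's current documentation before relying on them. The helper names and the sample response are hypothetical, and no network request is made here.

```python
from urllib.parse import urlencode

# Assumed endpoints; verify against current Internet Archive documentation.
SAVE_ENDPOINT = "https://web.archive.org/save/"
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def save_request_url(source_url):
    """URL a researcher (or script) would request to trigger a new snapshot."""
    return SAVE_ENDPOINT + source_url


def availability_query_url(source_url, timestamp=None):
    """URL that asks for the snapshot closest to an optional YYYYMMDD[hhmmss] timestamp."""
    params = {"url": source_url}
    if timestamp:
        params["timestamp"] = timestamp
    return AVAILABILITY_ENDPOINT + "?" + urlencode(params)


def closest_snapshot(response):
    """Return (timestamp, archive URL) from a decoded availability response, or None."""
    closest = response.get("archived_snapshots", {}).get("closest")
    if not closest or not closest.get("available"):
        return None
    return closest["timestamp"], closest["url"]


# Illustrative response only, shaped like the availability API's JSON output.
sample = {
    "archived_snapshots": {
        "closest": {
            "available": True,
            "status": "200",
            "timestamp": "20040208205506",
            "url": "http://web.archive.org/web/20040208205506/"
                   "http://sedena.gob.mx/ejercito/comandancias/reg_mil.htm",
        }
    }
}
print(closest_snapshot(sample))
print(closest_snapshot({"archived_snapshots": {}}))
```

A `None` result would indicate that no snapshot exists yet, which is the case where the researcher's own capture matters most.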

One issue is with sources that cannot be crawled and have no date. We'd need to 1) develop archiving policy, and 2) use the "date accessed" formulation in addition to any other relevant notation (such as "on file with the Monitor").
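The order of preference discussed so far (stated publication date, then crawl date, then date accessed) could be expressed roughly as follows. This is a sketch of one possible rule, not agreed Handbook guidance; the `SourceDates` type and its field names are hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional, Tuple


@dataclass
class SourceDates:
    """Dates known for one source; any of them may be missing."""
    published: Optional[date] = None  # stated in the source itself
    crawled: Optional[date] = None    # taken from an Internet Archive snapshot
    accessed: Optional[date] = None   # when the researcher opened the page


def citation_date(dates: SourceDates) -> Tuple[str, date]:
    """Pick which date the citation should show, and say what kind it is."""
    if dates.published is not None:
        return "published", dates.published
    if dates.crawled is not None:
        return "crawled", dates.crawled
    if dates.accessed is not None:
        return "accessed", dates.accessed
    raise ValueError("source has no publication, crawl, or access date")


# An undated page with a snapshot: the crawl date wins over the access date.
print(citation_date(SourceDates(crawled=date(2004, 2, 8), accessed=date(2017, 3, 14))))
# An undated page that cannot be crawled: fall back to the access date.
print(citation_date(SourceDates(accessed=date(2017, 3, 14))))
```

Returning the kind alongside the date lets the citation formatter choose the matching label ("Date crawled", "Date accessed") and add any further notation separately.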

TL:

We'd need to 1) develop archiving policy

Internet Archive has an offering called Archive-It, which allows you to run your own crawling and archiving projects. Given that some of the content we wish to store, in the public interest, is blocked by robots exclusion, it's good to know that Archive-It lets you request that the crawler ignore robots.txt. Also, Columbia University Libraries already runs a Human Rights collection on Archive-It, so there may be existing infrastructure we can plug into and benefit from here.

If that isn't the path forward, there are (probably) self-hosted tools of varying maturity that perform a similar function to the Wayback Machine and could serve as a private equivalent.

@tlongers tlongers self-assigned this Mar 14, 2017