This repository has been archived by the owner on Dec 3, 2019. It is now read-only.
TL:
In some sources, a date of publication or update is not included at all within the source itself. In these cases, it may be preferable to make use of the date a snapshot of the source was made by the Internet Archive. This needs to be worked through and new guidance included in the Handbook.
My suggestion is that where there is a stated publication date we keep the current citation format. Absent a stated publication date, we parse out the crawl date from Internet Archive and include it as a separate field in the raw data. In the human-readable source citation, we flag it as such in the date field. For example, this source from the SFM Mexico research:
This is particularly important where the content of a web page changes over time and there is no publication date on any of the various versions in the Internet Archive.
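As a rough illustration of the "parse out the crawl date" step, here is a minimal Python sketch. It relies on the fact that Wayback Machine snapshot URLs embed a 14-digit UTC timestamp (`YYYYMMDDhhmmss`) after `/web/`. The function name, the field names (`source_url`, `archive_url`, `archive_crawl_date`) and the example URL are hypothetical, not the project's actual raw data schema.

```python
import re
from datetime import datetime, timezone
from typing import Optional

# Wayback Machine snapshot URLs look like:
#   https://web.archive.org/web/YYYYMMDDhhmmss/<original URL>
# An optional modifier such as "id_" may follow the timestamp.
WAYBACK_PATTERN = re.compile(
    r"^https?://web\.archive\.org/web/(\d{14})[a-z_]*/(.+)$"
)


def parse_wayback_url(archive_url: str) -> Optional[dict]:
    """Split a snapshot URL into raw-data fields (hypothetical field names).

    Returns None when the URL is not a timestamped Wayback Machine snapshot.
    """
    match = WAYBACK_PATTERN.match(archive_url)
    if match is None:
        return None
    timestamp, original_url = match.groups()
    # The embedded timestamp is in UTC.
    crawled = datetime.strptime(timestamp, "%Y%m%d%H%M%S").replace(
        tzinfo=timezone.utc
    )
    return {
        "source_url": original_url,
        "archive_url": archive_url,
        "archive_crawl_date": crawled.date().isoformat(),
    }


if __name__ == "__main__":
    # Hypothetical snapshot URL, for illustration only.
    print(parse_wayback_url(
        "https://web.archive.org/web/20180314093012/http://example.org/page"
    ))
```

Keeping the crawl date in its own field means it is never confused with a publication date stated by the source itself.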
TW:
Agreed. As an addition to this workflow, I suggest that the researcher always capture a snapshot of the page when they first access it. This way, even for pages that have been crawled before, we can use the crawl date as a substitute for the common "date accessed" formulation.
One issue is with sources that cannot be crawled and have no date. We'd need to 1) develop archiving policy, and 2) use the "date accessed" formulation in addition to any other relevant notation (such as "on file with the Monitor").
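The fallback order discussed so far (stated publication date, then Internet Archive crawl date, then "date accessed") could be sketched roughly as below. The function name, parameter names and the exact flag wording are assumptions for illustration; the Handbook would need to settle the real notation.

```python
from typing import Optional


def citation_date(
    published: Optional[str],
    archive_crawl: Optional[str],
    accessed: Optional[str],
) -> str:
    """Pick the date shown in the human-readable citation (hypothetical helper).

    Dates are passed as ISO strings (YYYY-MM-DD) or None when unknown.
    """
    if published:
        # A stated publication date keeps the current citation format.
        return published
    if archive_crawl:
        # No stated date: use the crawl date and flag it as such.
        return f"{archive_crawl} (Internet Archive crawl date)"
    if accessed:
        # Source could not be crawled: fall back to "date accessed".
        return f"{accessed} (date accessed)"
    return "no date"


if __name__ == "__main__":
    print(citation_date(None, "2018-03-14", "2018-03-14"))
```

Any further notation, such as "on file with the Monitor", would sit alongside this date rather than replace it.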
TL:
We'd need to 1) develop archiving policy
Internet Archive has an offering called Archive-It, which allows you to run your own crawling and archiving projects. Some of the content we wish to store in the public interest is blocked by robot exclusion, so it's good to know that Archive-It lets you request that the crawler ignore robots.txt. Also, Columbia University Libraries already runs a Human Rights collection on Archive-It, so there may be some existing infrastructure we can plug into and benefit from here.
If that isn't the path forward, there are (probably) self-hosted tools of varying maturity that perform a similar function to the Wayback Machine and could serve as a private equivalent.