Utilities for working with a local dataset cache, adapted from AllenNLP

url_to_filename[source]

url_to_filename(url:str, etag:str=None)

Convert a url into a hashed filename in a repeatable way. If an etag is specified, append its hash to the url's hash, delimited by a period.
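
A minimal sketch of this scheme, assuming SHA-256 as the hash function; the name url_to_filename_sketch is illustrative, not the library's:

```python
import hashlib

def url_to_filename_sketch(url: str, etag: str = None) -> str:
    """Hash a url (and optionally an etag) into a repeatable cache filename."""
    # Hashing the URL gives the base filename, so the same URL always
    # maps to the same cache entry.
    filename = hashlib.sha256(url.encode("utf-8")).hexdigest()
    if etag:
        # Appending the etag hash lets a changed remote file get a fresh
        # cache entry instead of silently reusing a stale one.
        filename += "." + hashlib.sha256(etag.encode("utf-8")).hexdigest()
    return filename
```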

filename_to_url[source]

filename_to_url(filename:str, cache_dir:str=None)

Return the url and etag (which may be None) stored for filename. Raise FileNotFoundError if filename or its stored metadata do not exist.
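
The lookup can be sketched as follows, assuming the url and etag are stored beside the cached file in a `<filename>.json` sidecar; that layout, and the name filename_to_url_sketch, are assumptions for illustration:

```python
import json
import os

def filename_to_url_sketch(filename: str, cache_dir: str):
    """Recover (url, etag) from the metadata stored next to a cache entry."""
    cache_path = os.path.join(cache_dir, filename)
    if not os.path.exists(cache_path):
        raise FileNotFoundError(f"file {cache_path} not found")
    # Assumed layout: metadata lives beside the cached file as '<filename>.json'.
    meta_path = cache_path + ".json"
    if not os.path.exists(meta_path):
        raise FileNotFoundError(f"file {meta_path} not found")
    with open(meta_path) as meta_file:
        metadata = json.load(meta_file)
    return metadata["url"], metadata.get("etag")
```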

cached_path[source]

cached_path(url_or_filename:Union[str, Path], cache_dir:str=None)

Given something that might be a URL (or might be a local path), determine which. If it's a URL, download the file and cache it, and return the path to the cached file. If it's already a local path, make sure the file exists and then return the path.
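
The URL-vs-path dispatch at the heart of this function can be sketched as follows; the download-and-cache step is elided, and dispatch_path is a hypothetical name:

```python
import os
from urllib.parse import urlparse

def dispatch_path(url_or_filename: str) -> str:
    """Classify an input as a remote URL or an existing local path."""
    parsed = urlparse(url_or_filename)
    if parsed.scheme in ("http", "https", "s3"):
        return "remote"   # would be downloaded and cached
    if os.path.exists(url_or_filename):
        return "local"    # an existing file: returned as-is
    if parsed.scheme == "":
        # Looks like a plain path but the file is missing.
        raise FileNotFoundError(f"file {url_or_filename} not found")
    raise ValueError(f"unable to parse {url_or_filename} as a URL or local path")
```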

is_url_or_existing_file[source]

is_url_or_existing_file(url_or_filename:Union[str, Path, NoneType])

Given something that might be a URL (or might be a local path), check whether it is a URL or an existing file path.

split_s3_path[source]

split_s3_path(url:str)

Split a full s3 path into the bucket name and path.
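
A sketch of the split using urlparse (split_s3_path_sketch is an illustrative name):

```python
from urllib.parse import urlparse

def split_s3_path_sketch(url: str):
    """Split 's3://bucket/key/parts' into ('bucket', 'key/parts')."""
    parsed = urlparse(url)
    if not parsed.netloc or not parsed.path:
        raise ValueError(f"bad s3 path {url}")
    bucket_name = parsed.netloc
    s3_path = parsed.path.lstrip("/")  # drop the leading slash from the key
    return bucket_name, s3_path
```

For example, `s3://my-bucket/data/train.csv` yields `('my-bucket', 'data/train.csv')`.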

s3_request[source]

s3_request(func:Callable)

Wrapper for S3 requests that produces more helpful error messages.
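
A sketch of such a wrapper; the stand-in ClientError class replaces botocore.exceptions.ClientError so the example runs without boto3 installed:

```python
import functools
from typing import Callable

class ClientError(Exception):
    """Stand-in for botocore.exceptions.ClientError, to keep the sketch self-contained."""
    def __init__(self, response):
        super().__init__(response)
        self.response = response

def s3_request_sketch(func: Callable) -> Callable:
    @functools.wraps(func)
    def wrapper(url: str, *args, **kwargs):
        try:
            return func(url, *args, **kwargs)
        except ClientError as exc:
            # A 404 from S3 means the object does not exist; surface that as a
            # FileNotFoundError naming the URL instead of a raw client error.
            if int(exc.response["Error"]["Code"]) == 404:
                raise FileNotFoundError(f"file {url} not found")
            raise
    return wrapper
```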

get_s3_resource[source]

get_s3_resource()

s3_etag[source]

s3_etag(url:str)

Check ETag on S3 object.

s3_get[source]

s3_get(url:str, temp_file:IO)

Pull a file directly from S3.

session_with_backoff[source]

session_with_backoff()

We ran into an issue where HTTP requests to S3 were timing out, possibly because we were making too many requests too quickly. This helper function returns a requests Session that has retry-with-backoff built in. See stackoverflow.com/questions/23267409/how-to-implement-retry-mechanism-into-python-requests-library
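
A sketch of such a session, following the retry-adapter approach from the Stack Overflow answer above; the retry counts and backoff factor here are illustrative, not the library's actual values:

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def session_with_backoff_sketch() -> requests.Session:
    """Return a requests Session that retries failed requests with backoff."""
    session = requests.Session()
    # Retry up to 5 times on transient server errors, sleeping
    # backoff_factor * 2**(attempt - 1) seconds between attempts.
    retries = Retry(total=5, backoff_factor=1, status_forcelist=[502, 503, 504])
    session.mount("http://", HTTPAdapter(max_retries=retries))
    session.mount("https://", HTTPAdapter(max_retries=retries))
    return session
```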

http_get[source]

http_get(url:str, temp_file:IO)

get_from_cache[source]

get_from_cache(url:str, cache_dir:str=None)

Given a URL, look for the corresponding dataset in the local cache. If it's not there, download it. Then return the path to the cached file.

read_set_from_file[source]

read_set_from_file(filename:str)

Extract a de-duped collection (set) of text from a file. Expected file format is one item per line.
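
A sketch with a tiny demonstration file (read_set_from_file_sketch is an illustrative name):

```python
import tempfile

def read_set_from_file_sketch(filename: str) -> set:
    """Collect the unique lines of a file, one item per line."""
    collection = set()
    with open(filename, "r", encoding="utf-8") as file_:
        for line in file_:
            collection.add(line.rstrip("\n"))
    return collection

# Tiny demonstration: a throwaway file containing a duplicate line.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("apple\nbanana\napple\n")

items = read_set_from_file_sketch(tmp.name)  # duplicates collapse into a set
```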

get_file_extension[source]

get_file_extension(path:str, dot=True, lower:bool=True)
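
A sketch of what this helper plausibly does, inferred from its signature: dot keeps the leading dot and lower lowercases the extension (both behaviors are assumptions, as is the name get_file_extension_sketch):

```python
import os

def get_file_extension_sketch(path: str, dot=True, lower: bool = True) -> str:
    """Return a path's file extension, optionally without the dot or lowercased."""
    extension = os.path.splitext(path)[1]  # includes the leading dot, e.g. '.txt'
    if not dot:
        extension = extension.lstrip(".")
    return extension.lower() if lower else extension
```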

class Tqdm[source]

Tqdm()