Simple Storage Service and Amazon Glacier Storage:
Amazon Simple Storage Service (S3) serves as a versatile storage solution for a wide array of needs. It’s ideal for storing backup archives, log files, and disaster recovery images. Additionally, S3 supports analytics on static data and hosts static websites efficiently. Unlike operating system volumes, which are associated with EC2 instances, S3 offers scalable object storage. It’s a cost-effective and dependable option that seamlessly integrates with various AWS services and external operations.
Overall, Amazon S3 is a go-to platform for storing data securely and efficiently, catering to diverse storage requirements with ease.
You’re going to learn the following:
- How S3 objects are saved, managed, and accessed
- How to choose from among the various S3 storage classes to get the right balance of durability, availability, and cost
- How to manage long-term data storage lifecycles by incorporating Amazon Glacier into your design
- What other AWS services exist to help you with your data storage and access operations
S3 Service Architecture:
In S3, files are organized into buckets. Initially, you can create up to 100 buckets per AWS account, with the option to request a higher limit. Each bucket and its contents are confined to a single AWS region, but the bucket name must be globally unique across all S3. This uniqueness requirement simplifies referencing buckets while ensuring data can be found in specific regions for operational or regulatory compliance.
Here is the URL you would use to access a file called filename that’s in a bucket called bucketname over HTTP:
https://s3.amazonaws.com/bucketname/filename
Naturally, this assumes you’ll be able to satisfy the object’s permissions requirements. This is how that same file would be addressed using the AWS CLI:
s3://bucketname/filename
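As a quick illustration, here is a minimal sketch of creating a bucket and reading an object back by bucket and key, using the boto3 SDK for Python; the bucket name, key, and region below are placeholders, not values from this chapter.

```python
# A minimal sketch with placeholder names: create a bucket, upload an object,
# and read it back by bucket/key.
import boto3

s3 = boto3.client("s3", region_name="us-east-1")

# Bucket names must be globally unique across all of S3.
s3.create_bucket(Bucket="my-example-bucket")

s3.put_object(Bucket="my-example-bucket", Key="filename", Body=b"hello")

obj = s3.get_object(Bucket="my-example-bucket", Key="filename")
print(obj["Body"].read())
```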
Prefixes and Delimiters:
In S3, objects are stored flatly within buckets, lacking traditional folder hierarchies. However, you can simulate a structured organization using prefixes and delimiters.
A prefix serves as a common text string indicating a level of organization. For instance, employing the word “contracts” followed by the delimiter “/” instructs Simple Storage Service (S3) to treat files like “contracts/acme.pdf” and “contracts/dynamic.pdf” as grouped objects.
When files are uploaded with folder-like paths, S3 preserves those paths within the object keys, treating the slashes as delimiters. This mechanism ensures that folder structures are displayed correctly when you view S3 objects through the console or the API.
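Here is a brief sketch of listing only the objects that share a prefix, again using boto3 with placeholder names:

```python
# Treat "contracts/" as a folder: list only the objects sharing that prefix.
import boto3

s3 = boto3.client("s3")
resp = s3.list_objects_v2(
    Bucket="my-example-bucket",
    Prefix="contracts/",
    Delimiter="/",
)
for obj in resp.get("Contents", []):
    print(obj["Key"])  # e.g., contracts/acme.pdf, contracts/dynamic.pdf
```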
Working with Large Objects:
In S3, while there’s no practical limit to the total data you can store in a bucket, individual objects are capped at 5 TB, and a single upload operation (PUT) is limited to 5 GB.
For objects larger than 100 MB, AWS recommends using Multipart Upload to mitigate data loss or upload failures. This feature breaks large objects into smaller parts, transmitting them individually to S3. If one part fails, it can be retried without affecting others.
Multipart Upload is employed automatically by the AWS CLI and the high-level APIs, but you must split objects into parts yourself when working with the low-level APIs.
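For example, boto3’s high-level transfer manager switches to Multipart Upload on its own once a file crosses a configurable threshold; the file and bucket names below are placeholders.

```python
# A sketch of a high-level upload that uses Multipart Upload automatically.
import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client("s3")

# Use multipart transfers for anything larger than 100 MB, per AWS's guidance.
config = TransferConfig(multipart_threshold=100 * 1024 * 1024)

s3.upload_file(
    Filename="weekly-backup.tar.gz",
    Bucket="my-example-bucket",
    Key="backups/weekly-backup.tar.gz",
    Config=config,
)
```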
An application programming interface (API) is a tool that allows operations to be executed programmatically, either through code or command-line interfaces. AWS utilizes APIs as the primary means of managing its services.
For Amazon S3, AWS offers both low-level APIs, suited for customized uploads requiring manual intervention, and high-level APIs, ideal for automation purposes.
For more detailed information on utilizing these APIs for uploading objects using Multipart Upload, you can refer to this page:
https://docs.aws.amazon.com/AmazonS3/latest/dev/uploadobjusingmpu.html
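For a sense of what the low-level flow involves, here is a rough sketch of a manual Multipart Upload in boto3 (placeholder names; every part other than the last must be at least 5 MB):

```python
import boto3

s3 = boto3.client("s3")
bucket, key = "my-example-bucket", "backups/large-archive.bin"

# 1. Start the multipart upload and note the upload ID.
upload_id = s3.create_multipart_upload(Bucket=bucket, Key=key)["UploadId"]

parts = []
part_size = 100 * 1024 * 1024  # 100 MB per part

# 2. Upload each part individually; a failed part can be retried on its own.
with open("large-archive.bin", "rb") as f:
    part_number = 1
    while chunk := f.read(part_size):
        resp = s3.upload_part(
            Bucket=bucket, Key=key, PartNumber=part_number,
            UploadId=upload_id, Body=chunk,
        )
        parts.append({"ETag": resp["ETag"], "PartNumber": part_number})
        part_number += 1

# 3. Tell S3 to assemble the parts into a single object.
s3.complete_multipart_upload(
    Bucket=bucket, Key=key, UploadId=upload_id,
    MultipartUpload={"Parts": parts},
)
```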
Encryption:
For data stored on Amazon S3, encryption is crucial unless it’s meant to be publicly accessible. Encryption keys are utilized to safeguard data both at rest within S3 and during transfers between S3 and other destinations. Data at rest can be encrypted using server-side or client-side encryption methods. It’s essential to utilize Amazon’s encrypted API endpoints for data transfers to ensure comprehensive protection.
Server-side encryption in S3 involves AWS encrypting data objects as they are stored on disk and decrypting them when authenticated requests for retrieval are made. This process occurs entirely within the S3 platform, providing a seamless and secure way to manage data encryption.
There are three encryption options available for server-side encryption in Amazon Simple Storage Service:
- Server-Side Encryption with Amazon S3-Managed Keys (SSE-S3): AWS manages the encryption and decryption process using its own enterprise-standard keys.
- Server-Side Encryption with AWS KMS-Managed Keys (SSE-KMS): In addition to the features of SSE-S3, SSE-KMS introduces the use of an envelope key and provides a full audit trail for key usage tracking. Users have the option to import their own keys through the AWS KMS service.
- Server-Side Encryption with Customer-Provided Keys (SSE-C): This option allows users to provide their own encryption keys for S3 to use in the encryption process.
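Server-side encryption can be requested per object at upload time. Here is a hedged boto3 sketch; the bucket, keys, and KMS key alias are all placeholders.

```python
import boto3

s3 = boto3.client("s3")

# SSE-S3: S3 encrypts the object with keys that it manages itself.
s3.put_object(Bucket="my-example-bucket", Key="report.pdf",
              Body=b"...", ServerSideEncryption="AES256")

# SSE-KMS: encrypt with a KMS key so key usage is fully auditable.
s3.put_object(Bucket="my-example-bucket", Key="report-kms.pdf",
              Body=b"...", ServerSideEncryption="aws:kms",
              SSEKMSKeyId="alias/my-app-key")
```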
Client-Side Encryption:
Client-side encryption involves encrypting data before transferring it to Amazon S3. This can be achieved using an AWS KMS-Managed Customer Master Key (CMK) or a Client-Side Master Key provided through the Amazon S3 encryption client. While server-side encryption is generally preferred due to its simplicity, there are scenarios where maintaining full control over encryption keys is necessary, making client-side encryption the only viable option.
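As a simple illustration of the idea (not the official Amazon S3 encryption client), you could encrypt data locally with a third-party library such as cryptography before uploading it; everything below is a placeholder sketch.

```python
import boto3
from cryptography.fernet import Fernet

key = Fernet.generate_key()  # in practice, store and manage this key carefully
ciphertext = Fernet(key).encrypt(b"sensitive sales data")

# S3 only ever receives the encrypted bytes; you keep full control of the key.
boto3.client("s3").put_object(
    Bucket="my-example-bucket", Key="sales/encrypted.bin", Body=ciphertext
)
```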
Logging:
By default, tracking S3 events to log files is turned off to avoid unnecessary data generation, given the potentially high activity in S3 buckets.
When logging is enabled, you must specify both a source bucket (the one being tracked) and a target bucket (where the logs will be saved). Optionally, you can define delimiters and prefixes for better organization of logs from multiple source buckets saved to a single target bucket.
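Enabling server access logging might look something like the following boto3 sketch (the bucket names are placeholders, and the target bucket must already grant S3 permission to deliver logs):

```python
import boto3

s3 = boto3.client("s3")
s3.put_bucket_logging(
    Bucket="my-source-bucket",                     # the bucket being tracked
    BucketLoggingStatus={
        "LoggingEnabled": {
            "TargetBucket": "my-log-bucket",       # where the logs will be saved
            "TargetPrefix": "logs/my-source-bucket/",  # optional per-source prefix
        }
    },
)
```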
S3-generated logs, which may have a short delay, contain essential operation details such as the requestor’s account and IP address, source bucket name, requested action (e.g., GET, PUT), request time, and response status (including error codes).
Additionally, S3 buckets serve as storage for logs and objects from various AWS services like CloudWatch and CloudTrail, including EBS Snapshots.
S3 Durability and Availability:
Amazon S3 provides various storage classes tailored to different durability, availability, and cost requirements. The choice of storage class depends on factors such as the criticality of data survival, the urgency of retrieval, and budget constraints.
Durability:
S3’s durability is measured as a percentage, exemplified by the 99.999999999 percent durability guarantee for most S3 classes and Amazon Glacier. This percentage translates to an average annual expected loss of 0.000000001% of objects. For instance, storing 10,000,000 objects with Amazon S3 would typically result in the loss of a single object once every 10,000 years.
Availability:
Availability of objects is also expressed as a percentage, indicating the likelihood of an object being instantly accessible upon request throughout the year. For example, the Amazon S3 Standard class guarantees 99.99% availability, which works out to less than an hour of downtime annually (roughly 53 minutes). If downtime exceeds that limit within a year, you can request a service credit. It’s important to note the distinction between the durability and availability guarantees: durability means there is virtually no chance your data will be lost, while availability concerns whether the data is accessible at any given moment, so occasional brief periods of unavailability can occur even though the data itself remains intact.
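The arithmetic behind both figures is straightforward; here is a small back-of-the-envelope check:

```python
# Expected annual object loss at "eleven nines" of durability.
durability = 0.99999999999
objects = 10_000_000
print(objects * (1 - durability))   # ~0.0001 objects per year (one per 10,000 years)

# Downtime allowed by 99.99% availability.
availability = 0.9999
print((1 - availability) * 365 * 24 * 60)   # ~52.6 minutes per year
```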
S3 Object Lifecycle:
S3’s object lifecycle management is particularly useful for handling backup archives. While it’s crucial to retain previous versions, managing storage costs requires retiring and deleting older versions. S3’s versioning and lifecycle features automate this process efficiently.
Versioning:
In many file system setups, saving a file with the same name and location as an existing file will overwrite the original, ensuring you have the latest version available. However, this practice can lead to the loss of older versions, including those overwritten accidentally.
By default, objects in S3 behave the same way. However, enabling versioning at the bucket level causes older, overwritten copies to be saved and retained indefinitely, preventing accidental data loss. Nonetheless, this approach can lead to archive bloat over time.
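Turning on versioning is a single bucket-level setting; for instance, with boto3 (placeholder bucket name):

```python
import boto3

s3 = boto3.client("s3")
s3.put_bucket_versioning(
    Bucket="my-example-bucket",
    VersioningConfiguration={"Status": "Enabled"},
)
```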
Here’s where lifecycle management can help:
Lifecycle Management:
You can set up lifecycle rules for a bucket to automatically transition an object’s storage class based on a defined timeframe. For example, you can configure new objects to remain in the S3 Standard class for the first 30 days, then transition to the cheaper One Zone-IA class for another 30 days. If regulatory compliance requires retaining older versions, files can then be moved to the low-cost Glacier storage service for 365 more days before being permanently deleted. This automated process helps optimize storage costs while ensuring compliance and accessibility requirements are met.
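That rule could be expressed roughly as follows in boto3 (the bucket name and prefix are placeholders): Standard until day 30, One Zone-IA from day 30, Glacier from day 60, and deletion 365 days after that.

```python
import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="my-example-bucket",
    LifecycleConfiguration={
        "Rules": [{
            "ID": "archive-rotation",
            "Filter": {"Prefix": "backups/"},
            "Status": "Enabled",
            "Transitions": [
                {"Days": 30, "StorageClass": "ONEZONE_IA"},
                {"Days": 60, "StorageClass": "GLACIER"},
            ],
            "Expiration": {"Days": 425},   # 60 + 365 days after creation
        }]
    },
)
```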
Amazon Glacier:
Amazon Glacier may initially appear similar to another storage class within S3, but there are notable distinctions. While both offer high durability and can be integrated into S3 lifecycle configurations, Glacier supports archives up to 40 TB, exceeding S3’s 5 TB limit. Additionally, Glacier encrypts archives by default, unlike S3 where encryption is optional. Moreover, Glacier archives are identified by machine-generated IDs rather than human-readable key names used in S3.
The most significant difference lies in data retrieval times. Retrieving objects from Glacier archives can take hours, contrasting with the nearly instant access provided by S3. This key feature defines Glacier’s purpose: providing cost-effective long-term storage for data required only in rare and exceptional circumstances.
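A hedged sketch of the standalone Glacier API shows both differences, machine-generated archive IDs and asynchronous retrieval (the vault name here is hypothetical):

```python
import boto3

glacier = boto3.client("glacier")

# Uploading returns a machine-generated archive ID rather than a key name.
resp = glacier.upload_archive(
    accountId="-",                     # "-" means the account of the caller
    vaultName="my-backup-vault",
    body=b"archive bytes",
)
archive_id = resp["archiveId"]

# Retrieval is asynchronous: initiate a job, then poll; it can take hours.
glacier.initiate_job(
    accountId="-",
    vaultName="my-backup-vault",
    jobParameters={"Type": "archive-retrieval", "ArchiveId": archive_id},
)
```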
Storage Pricing:
To provide an estimate of the costs associated with S3 and Glacier, let’s consider a typical usage scenario. Suppose you generate weekly backups of your company’s sales data, resulting in 5 GB archives. You opt to store each archive in the S3 Standard Storage class for the first 30 days, then transition it to S3 One Zone-IA for the next 90 days. After 120 days, you move the archives to Glacier, where they will be stored for two years before deletion. With this rotation cycle, you’ll have approximately 20 GB in S3 Standard, 65 GB in One Zone-IA, and 520 GB in Glacier.
Here’s an overview of the estimated storage costs in the US East region:
- S3 Standard: $0.023 per GB per month
- S3 One Zone-IA: $0.0125 per GB per month
- Glacier: $0.004 per GB per month
It’s essential to note that these rates are subject to change and may vary based on factors such as region and usage.
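Putting those numbers together gives a rough monthly bill for the rotation described above:

```python
standard = 20 * 0.023        # $0.46
one_zone_ia = 65 * 0.0125    # ~$0.81
glacier = 520 * 0.004        # $2.08
print(round(standard + one_zone_ia + glacier, 2))   # ~$3.35 per month
```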
Interview Questions On S3 And Glacier:
- Can you explain what Amazon S3 is and its primary use cases?
- How does Amazon S3 store data, and what are buckets?
- What is the significance of prefixes in Amazon S3?
- Can you describe the difference between object-level storage and block-level storage?
- What are the benefits of using Amazon S3 for storing archives and data?
- How does Amazon S3 ensure reliability and availability of stored objects?
- What are the key factors to consider when designing a storage solution using Amazon S3?
- What is the purpose of lifecycle management in Amazon S3, and how does it work?
- Can you explain the key differences between Amazon S3 and Amazon Glacier in terms of storage characteristics, accessibility, and use cases?
- How does Amazon Glacier handle data retrieval compared to Amazon S3, and what factors should be considered when deciding to use Glacier for long-term data storage?