Amazon S3 Inventory
Amazon S3 inventory is one of the tools Amazon S3 provides to help manage your storage. You can use it to audit and report on the replication and encryption status of your objects for business, compliance, and regulatory needs. You can also simplify and speed up business workflows and big data jobs using Amazon S3 inventory, which provides a scheduled alternative to the Amazon S3 synchronous List API operation.
Amazon S3 inventory provides comma-separated values (CSV), Apache optimized row columnar (ORC) or Apache Parquet (Parquet) output files that list your objects and their corresponding metadata on a daily or weekly basis for an S3 bucket or a shared prefix (that is, objects that have names that begin with a common string). If weekly, a report is generated every seven days after the initial report. For information about Amazon S3 inventory pricing, see Amazon S3 Pricing.
You can configure multiple inventory lists for a bucket. You can configure what object metadata to include in the inventory, whether to list all object versions or only current versions, where to store the inventory list file output, and whether to generate the inventory on a daily or weekly basis. You can also specify that the inventory list file be encrypted.
You can query Amazon S3 inventory using standard SQL by using Amazon Athena, Amazon Redshift Spectrum, and other tools such as Presto, Apache Hive, and Apache Spark. It's easy to use Athena to run queries on your inventory files. You can use Athena for Amazon S3 inventory queries in all Regions where Athena is available.
Topics
- How Do I Set Up Amazon S3 Inventory?
- What's Included in an Amazon S3 Inventory?
- Where Are Inventory Lists Located?
- How Do I Know When an Inventory Is Complete?
- Querying Inventory with Amazon Athena
- Amazon S3 Inventory REST APIs
How Do I Set Up Amazon S3 Inventory?
This section describes how to set up an inventory, including details about the inventory source and destination buckets.
Amazon S3 Inventory Source and Destination Buckets
The bucket that the inventory lists the objects for is called the source bucket. The bucket where the inventory list file is stored is called the destination bucket.
Source Bucket
The inventory lists the objects that are stored in the source bucket. You can get inventory lists for an entire bucket or filtered by (object key name) prefix.
The source bucket:
- Contains the objects that are listed in the inventory.
- Contains the configuration for the inventory.
Destination Bucket
Amazon S3 inventory list files are written to the destination bucket. To group all the inventory list files in a common location in the destination bucket, you can specify a destination (object key name) prefix in the inventory configuration.
The destination bucket:
- Contains the inventory list files.
- Contains the manifest files that list all the inventory list files that are stored in the destination bucket. For more information, see What Is an Inventory Manifest?
- Must have a bucket policy to give Amazon S3 permission to verify ownership of the bucket and permission to write files to the bucket.
- Must be in the same AWS Region as the source bucket.
- Can be the same as the source bucket.
- Can be owned by a different AWS account than the account that owns the source bucket.
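The destination bucket policy requirement above can be sketched in Python. This is a minimal illustration, not the referenced example policy itself: the statement shape (an s3:PutObject grant to the s3.amazonaws.com service principal, restricted by source bucket ARN and source account) is an assumption about what such a policy typically looks like, and the bucket names, account ID, and Sid are hypothetical. Check it against the policy in Granting Permissions for Amazon S3 Inventory and Amazon S3 Analytics before using it.

```python
import json


def inventory_destination_policy(source_bucket, destination_bucket, source_account_id):
    """Build a destination bucket policy that lets Amazon S3 write inventory files.

    The statement layout is an assumed, typical shape; verify it against the
    example policy referenced in this guide.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "InventoryExamplePolicy",  # hypothetical statement ID
                "Effect": "Allow",
                "Principal": {"Service": "s3.amazonaws.com"},
                "Action": "s3:PutObject",
                "Resource": [f"arn:aws:s3:::{destination_bucket}/*"],
                "Condition": {
                    # Only accept writes made on behalf of this source bucket/account.
                    "ArnLike": {"aws:SourceArn": f"arn:aws:s3:::{source_bucket}"},
                    "StringEquals": {
                        "aws:SourceAccount": source_account_id,
                        "s3:x-amz-acl": "bucket-owner-full-control",
                    },
                },
            }
        ],
    }


# Hypothetical names and account ID, for illustration only.
policy = inventory_destination_policy(
    "example-source-bucket", "example-destination-bucket", "111122223333"
)
policy_json = json.dumps(policy, indent=2)
print(policy_json)
```

The resulting JSON string is what you would attach to the destination bucket as its bucket policy, through the console or an API call of your choice.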
Setting Up Amazon S3 Inventory
Amazon S3 inventory helps you manage your storage by creating lists of the objects in an S3 bucket on a defined schedule. You can configure multiple inventory lists for a bucket. The inventory lists are published to CSV, ORC, or Parquet files in a destination bucket.
The easiest way to set up an inventory is by using the AWS Management Console, but you can also use the REST API, AWS CLI, or AWS SDKs. The console performs the first step of the following procedure for you: adding a bucket policy to the destination bucket.
To set up Amazon S3 inventory for an S3 bucket
1. Add a bucket policy for the destination bucket.
You must create a bucket policy on the destination bucket to grant permissions to Amazon S3 to write objects to the bucket in the defined location. For an example policy, see Granting Permissions for Amazon S3 Inventory and Amazon S3 Analytics.
2. Configure an inventory to list the objects in a source bucket and publish the list to a destination bucket.
When you configure an inventory list for a source bucket, you specify the destination bucket where you want the list to be stored, and whether you want to generate the list daily or weekly. You can also configure what object metadata to include and whether to list all object versions or only current versions.
You can specify that the inventory list file be encrypted by using Amazon S3 managed keys (SSE-S3) or customer managed keys (CMKs) stored in AWS Key Management Service (AWS KMS). For more information about SSE-S3 and SSE-KMS, see Protecting Data Using Server-Side Encryption. If you plan to use SSE-KMS encryption, see Step 3.
- For information about how to use the console to configure an inventory list, see How Do I Configure Amazon S3 Inventory? in the Amazon Simple Storage Service Console User Guide.
- To use the Amazon S3 API to configure an inventory list, use the PUT Bucket inventory configuration REST API, or the equivalent from the AWS CLI or AWS SDKs.
3. To encrypt the inventory list file with SSE-KMS, grant Amazon S3 permission to use the CMK stored in AWS KMS.
You can configure encryption for the inventory list file by using the AWS Management Console, REST API, AWS CLI, or AWS SDKs. Whichever way you choose, you must grant Amazon S3 permission to use the AWS KMS CMK to encrypt the inventory file. You grant Amazon S3 permission by modifying the key policy for the AWS KMS CMK that is being used to encrypt the inventory file. For more information, see the next section, Grant Amazon S3 Permission to Encrypt Using Your AWS KMS CMK.
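The choices described in the configuration step (destination, schedule, object versions, metadata fields, optional prefixes) can be gathered into one structure. The following Python sketch builds such a structure using the field names that the AWS SDK for Python accepts for its put_bucket_inventory_configuration call, as far as we are aware; treat the field names as an assumption to verify, and the bucket name, prefixes, and configuration ID as hypothetical. The sketch only builds and validates the dictionary; it does not call AWS.

```python
VALID_FORMATS = {"CSV", "ORC", "Parquet"}
VALID_FREQUENCIES = {"Daily", "Weekly"}


def build_inventory_configuration(
    config_id,
    destination_bucket,
    *,
    file_format="CSV",
    frequency="Daily",
    all_versions=False,
    optional_fields=(),
    source_prefix=None,
    destination_prefix=None,
):
    """Return an inventory configuration dictionary (assumed SDK-style layout)."""
    if file_format not in VALID_FORMATS:
        raise ValueError(f"unsupported format: {file_format}")
    if frequency not in VALID_FREQUENCIES:
        raise ValueError(f"unsupported frequency: {frequency}")

    destination = {
        "Bucket": f"arn:aws:s3:::{destination_bucket}",
        "Format": file_format,
    }
    if destination_prefix:
        # Groups all inventory list files under a common location.
        destination["Prefix"] = destination_prefix

    config = {
        "Id": config_id,
        "IsEnabled": True,
        "IncludedObjectVersions": "All" if all_versions else "Current",
        "Schedule": {"Frequency": frequency},
        "OptionalFields": list(optional_fields),
        "Destination": {"S3BucketDestination": destination},
    }
    if source_prefix:
        # Limits the inventory to objects whose key names share this prefix.
        config["Filter"] = {"Prefix": source_prefix}
    return config


# Hypothetical values, for illustration only.
config = build_inventory_configuration(
    "weekly-orc-report",
    "example-destination-bucket",
    file_format="ORC",
    frequency="Weekly",
    optional_fields=["Size", "LastModifiedDate", "EncryptionStatus"],
    source_prefix="logs/",
    destination_prefix="inventory",
)
print(config)
```

Validating the format and frequency up front keeps typos from reaching the service, where they would surface only as a rejected request.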
Grant Amazon S3 Permission to Encrypt Using Your AWS KMS CMK
You must grant Amazon S3 permission to encrypt using your AWS KMS CMK with a key policy. The following procedure describes how to use the AWS Identity and Access Management (IAM) console to modify the key policy for the AWS KMS CMK that is used to encrypt the inventory file.
To grant permissions to encrypt using your AWS KMS CMK
1. Sign in to the AWS Management Console using the AWS account that owns the AWS KMS CMK, and open the IAM console at https://console.aws.amazon.com/iam/.
2. In the left navigation pane, choose Encryption keys.
3. For Region, choose the appropriate AWS Region. Do not use the Region selector in the navigation bar (upper-right corner).
4. Choose the alias of the CMK that you want to encrypt inventory with.
5. In the Key Policy section of the page, choose Switch to policy view.
6. Using the Key Policy editor, insert the following key policy statement into the existing policy, and then choose Save Changes. You might want to copy the statement to the end of the existing policy.
{
    "Sid": "Allow Amazon S3 use of the key",
    "Effect": "Allow",
    "Principal": {
        "Service": "s3.amazonaws.com"
    },
    "Action": [
        "kms:GenerateDataKey*"
    ],
    "Resource": "*"
}
You can also use the AWS KMS PUT key policy API PutKeyPolicy to copy the key policy to the CMK that is being used to encrypt the inventory file. For more information about creating and editing AWS KMS CMKs, see Getting Started in the AWS Key Management Service Developer Guide.
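If you update the key policy programmatically rather than in the editor, the statement has to be merged into the existing policy document before it is written back. The following Python sketch shows one way to do that merge. The existing policy shown is hypothetical, and the sketch stops short of calling AWS KMS; the merged JSON is what you would then pass to PutKeyPolicy.

```python
import copy
import json

# The statement from this section that lets Amazon S3 use the key.
S3_INVENTORY_STATEMENT = {
    "Sid": "Allow Amazon S3 use of the key",
    "Effect": "Allow",
    "Principal": {"Service": "s3.amazonaws.com"},
    "Action": ["kms:GenerateDataKey*"],
    "Resource": "*",
}


def add_inventory_statement(key_policy):
    """Return a copy of key_policy with the Amazon S3 statement appended once."""
    updated = copy.deepcopy(key_policy)
    statements = updated.get("Statement", [])
    if isinstance(statements, dict):
        # A policy may hold a single statement object instead of a list.
        statements = [statements]
    already_present = any(
        s.get("Sid") == S3_INVENTORY_STATEMENT["Sid"] for s in statements
    )
    if not already_present:
        statements.append(copy.deepcopy(S3_INVENTORY_STATEMENT))
    updated["Statement"] = statements
    return updated


# Hypothetical existing key policy, for illustration only.
existing_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "Enable IAM User Permissions",
            "Effect": "Allow",
            "Principal": {"AWS": "arn:aws:iam::111122223333:root"},
            "Action": "kms:*",
            "Resource": "*",
        }
    ],
}

merged_policy = add_inventory_statement(existing_policy)
print(json.dumps(merged_policy, indent=2))
```

Checking for the Sid before appending makes the merge safe to run more than once against the same key.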
What's Included in an Amazon S3 Inventory?
An inventory list file contains a list of the objects in the source bucket and metadata for each object. The inventory lists are stored in the destination bucket as a CSV file compressed with GZIP, as an Apache optimized row columnar (ORC) file compressed with ZLIB, or as an Apache Parquet (Parquet) file compressed with Snappy.
The inventory list contains a list of the objects in an S3 bucket and the following metadata for each listed object:
- Bucket name: The name of the bucket that the inventory is for.
- Key name: Object key name (or key) that uniquely identifies the object in the bucket. When using the CSV file format, the key name is URL-encoded and must be decoded before you can use it.
- Version ID: Object version ID. When you enable versioning on a bucket, Amazon S3 assigns a version number to objects that are added to the bucket. For more information, see Object Versioning. (This field is not included if the list is only for the current version of objects.)
- IsLatest: Set to True if the object is the current version of the object. (This field is not included if the list is only for the current version of objects.)
- Size: Object size in bytes.
- Last modified date: Object creation date or the last modified date, whichever is the latest.
- ETag: The entity tag is a hash of the object. The ETag reflects changes only to the contents of an object, not its metadata. The ETag may or may not be an MD5 digest of the object data. Whether it is depends on how the object was created and how it is encrypted.
- Storage class: Storage class used for storing the object. For more information, see Amazon S3 Storage Classes.
- Intelligent-Tiering access tier: Access tier (frequent or infrequent) of the object if stored in Intelligent-Tiering. For more information, see Amazon S3 Intelligent-Tiering.
- Multipart upload flag: Set to True if the object was uploaded as a multipart upload. For more information, see Multipart Upload Overview.
- Delete marker: Set to True if the object is a delete marker. For more information, see Object Versioning. (This field is automatically added to your report if you've configured the report to include all versions of your objects.)
- Replication status: Set to PENDING, COMPLETED, FAILED, or REPLICA. For more information, see Replication Status Information.
- Encryption status: Set to SSE-S3, SSE-C, SSE-KMS, or NOT-SSE. The server-side encryption status for SSE-S3, SSE-KMS, and SSE with customer-provided keys (SSE-C). A status of NOT-SSE means that the object is not encrypted with server-side encryption. For more information, see Protecting Data Using Encryption.
- Object lock Retain until date: The date until which the locked object cannot be deleted. For more information, see Locking Objects Using Amazon S3 Object Lock.
- Object lock Mode: Set to Governance or Compliance for objects that are locked. For more information, see Locking Objects Using Amazon S3 Object Lock.
- Object lock Legal hold status: Set to On if a legal hold has been applied to an object; otherwise it is set to Off. For more information, see Locking Objects Using Amazon S3 Object Lock.
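Because CSV inventory lists are GZIP-compressed and carry URL-encoded key names, reading one takes a decompress, parse, and decode step. The following Python sketch illustrates that flow on hypothetical sample data. The column order passed in is an assumption for the example; in practice it depends on the fields chosen for your inventory. The sample deliberately avoids spaces in key names, because how spaces are encoded is not stated here.

```python
import csv
import gzip
import io
from urllib.parse import unquote


def read_inventory_csv(gz_bytes, columns):
    """Parse a GZIP-compressed CSV inventory list into dictionaries.

    columns gives the field order of the list (there is no heading row).
    It must include "Key", which is URL-decoded before being returned.
    """
    text = gzip.decompress(gz_bytes).decode("utf-8")
    rows = []
    for record in csv.reader(io.StringIO(text)):
        row = dict(zip(columns, record))
        row["Key"] = unquote(row["Key"])  # key names are URL-encoded in CSV lists
        rows.append(row)
    return rows


# Hypothetical inventory content, for illustration only.
sample_csv = (
    '"example-bucket","photos/caf%C3%A9.jpg","1024"\n'
    '"example-bucket","logs/2019/app.log","2048"\n'
)
rows = read_inventory_csv(
    gzip.compress(sample_csv.encode("utf-8")), ["Bucket", "Key", "Size"]
)
for row in rows:
    print(row["Key"], row["Size"])
```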
[Figure: example CSV inventory list opened in a spreadsheet application.] The heading row in the example is shown only to help clarify the example; it is not included in the actual list.
We recommend that you create a lifecycle policy that deletes old inventory lists. For more information, see Object Lifecycle Management.
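A lifecycle rule that expires old inventory lists might look like the following Python sketch. It builds the rule in the dictionary layout that the AWS SDK for Python uses for lifecycle configurations, as far as we are aware; the prefix, rule ID, and retention period are hypothetical, and the sketch does not call AWS.

```python
def inventory_expiration_rule(inventory_prefix, days):
    """Build a lifecycle configuration that expires objects under a prefix.

    inventory_prefix is the destination prefix that holds the inventory
    lists; days is how long to keep each list before it expires.
    """
    if days < 1:
        raise ValueError("days must be at least 1")
    return {
        "Rules": [
            {
                "ID": "expire-old-inventory-lists",  # hypothetical rule name
                "Filter": {"Prefix": inventory_prefix},
                "Status": "Enabled",
                "Expiration": {"Days": days},
            }
        ]
    }


# Hypothetical prefix and retention period, for illustration only.
lifecycle = inventory_expiration_rule("inventory/", 30)
print(lifecycle)
```

Scoping the rule to the inventory prefix matters when the destination bucket also holds other data, including when the destination is the same as the source bucket.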
Inventory Consistency
All of your objects might not appear in each inventory list. The inventory list provides eventual consistency for PUTs of both new objects and overwrites, and DELETEs. Inventory lists are a rolling snapshot of bucket items, which are eventually consistent (that is, the list might not include recently added or deleted objects).
To validate the state of the object before you take action on the object, we recommend that you perform a HEAD Object REST API request to retrieve metadata for the object, or check the object's properties in the Amazon S3 console. You can also check object metadata with the AWS CLI or the AWS SDKs. For more information, see HEAD Object in the Amazon Simple Storage Service API Reference.
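The validation step can be sketched as a small check that compares an inventory row with the object's current metadata. The following Python example takes the HEAD call as a parameter so that it stays self-contained: head_object is assumed to accept Bucket and Key keyword arguments and return a dictionary with ETag and ContentLength entries, which is how the AWS SDK for Python's head_object behaves to our knowledge. The function name, the in-memory stand-in, and the sample values are hypothetical.

```python
def matches_inventory(head_object, bucket, row, missing_errors=(LookupError,)):
    """Return True if the object still matches its inventory row.

    head_object: callable(Bucket=..., Key=...) returning current metadata.
    missing_errors: exception types that mean the object no longer exists;
    with a real SDK client you would pass that SDK's client error type.
    """
    key = row["Key"]
    try:
        response = head_object(Bucket=bucket, Key=key)
    except missing_errors:
        return False  # deleted since the inventory was generated
    # HEAD responses typically wrap the ETag in double quotes.
    same_etag = response["ETag"].strip('"') == row["ETag"]
    same_size = response["ContentLength"] == int(row["Size"])
    return same_etag and same_size


# In-memory stand-in for the service, for illustration only.
fake_store = {
    ("example-bucket", "logs/app.log"): {"ETag": '"abc123"', "ContentLength": 2048},
}


def fake_head_object(Bucket, Key):
    return fake_store[(Bucket, Key)]


current = {"Key": "logs/app.log", "ETag": "abc123", "Size": "2048"}
stale = {"Key": "logs/old.log", "ETag": "def456", "Size": "10"}
print(matches_inventory(fake_head_object, "example-bucket", current))
print(matches_inventory(fake_head_object, "example-bucket", stale))
```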
Where Are Inventory Lists Located?
When an inventory list is published, the manifest files are published to the following location in the destination bucket.
- destination-prefix is the (object key name) prefix set in the inventory configuration, which can be used to group all the inventory list files in a common location within the destination bucket.
- source-bucket is the source bucket that the inventory list is for. It is added to prevent collisions when multiple inventory reports from different source buckets are sent to the same destination bucket.
- config-ID is added to prevent collisions with multiple inventory reports from the same source bucket that are sent to the same destination bucket. The config-ID comes from the inventory report configuration, and is the name for the report that is defined on setup.
- YYYY-MM-DDTHH-MMZ is the timestamp that consists of the start time and the date when the inventory report generation begins scanning the bucket; for example, 2016-11-06T21-32Z.
- manifest.json is the manifest file.
- manifest.checksum is the MD5 of the content of the manifest.json file.
- symlink.txt is the Apache Hive-compatible manifest file.
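The components listed above can be combined into object keys. The following Python sketch assumes that the components are joined with slashes in the order in which they are listed (destination-prefix, source-bucket, config-ID, timestamp, file name), and that the Hive-compatible manifests live under a hive/ folder next to the timestamped folders, as the Athena location example in this topic suggests. Both points are assumptions to confirm against your own destination bucket. All names are hypothetical.

```python
from datetime import datetime, timezone


def manifest_keys(destination_prefix, source_bucket, config_id, started_at):
    """Build the assumed manifest object keys for one inventory report.

    started_at must be a timezone-aware datetime marking when report
    generation began scanning the bucket.
    """
    if started_at.tzinfo is None:
        raise ValueError("started_at must be timezone-aware")
    # Format the start time as YYYY-MM-DDTHH-MMZ, in UTC.
    stamp = started_at.astimezone(timezone.utc).strftime("%Y-%m-%dT%H-%MZ")
    base = f"{destination_prefix}/{source_bucket}/{config_id}"
    return {
        "manifest.json": f"{base}/{stamp}/manifest.json",
        "manifest.checksum": f"{base}/{stamp}/manifest.checksum",
        "hive_prefix": f"{base}/hive/",
    }


# Hypothetical names, with the example timestamp from this section.
keys = manifest_keys(
    "inventory",
    "example-source-bucket",
    "weekly-orc-report",
    datetime(2016, 11, 6, 21, 32, tzinfo=timezone.utc),
)
print(keys["manifest.json"])
```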
The inventory lists are published daily or weekly to the following location in the destination bucket.
- destination-prefix is the (object key name) prefix set in the inventory configuration. It can be used to group all the inventory list files in a common location in the destination bucket.
- source-bucket is the source bucket that the inventory list is for. It is added to prevent collisions when multiple inventory reports from different source buckets are sent to the same destination bucket.
- example-file-name.csv.gz is one of the CSV inventory files. ORC inventory names end with the file name extension .orc, and Parquet inventory names end with the file name extension .parquet.
What Is an Inventory Manifest?
The manifest files manifest.json and symlink.txt describe where the inventory files are located. Whenever a new inventory list is delivered, it is accompanied by a new set of manifest files.
Each manifest contained in the manifest.json file provides metadata and other basic information about an inventory. This information includes the following:
- Source bucket name
- Destination bucket name
- Version of the inventory
- Creation timestamp in the epoch date format that consists of the start time and the date when the inventory report generation begins scanning the bucket
- Format and schema of the inventory files
- Actual list of the inventory files that are in the destination bucket
Whenever a manifest.json file is written, it is accompanied by a manifest.checksum file that is the MD5 of the content of the manifest.json file.
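As a sketch of how a consumer might use these two files together, the following Python example checks a manifest against its checksum and then pulls out the list of inventory files. Several details are assumptions rather than facts stated here: that manifest.checksum holds the MD5 as a hexadecimal string, that the JSON field names are sourceBucket, fileFormat, fileSchema, and files (each with a key), and that fileSchema is a comma-separated column list for CSV inventories. The manifest content is hypothetical.

```python
import hashlib
import json


def manifest_is_intact(manifest_bytes, checksum_text):
    """Compare the MD5 of manifest.json with the content of manifest.checksum.

    Assumes the checksum file stores the digest as a hexadecimal string.
    """
    return hashlib.md5(manifest_bytes).hexdigest() == checksum_text.strip().lower()


def summarize_manifest(manifest_bytes):
    """Extract the source bucket, format, columns, and inventory file keys."""
    manifest = json.loads(manifest_bytes)
    columns = None
    if manifest["fileFormat"] == "CSV":
        # Assumed: CSV schemas are a comma-separated list of column names.
        columns = [name.strip() for name in manifest["fileSchema"].split(",")]
    return {
        "source_bucket": manifest["sourceBucket"],
        "format": manifest["fileFormat"],
        "columns": columns,
        "file_keys": [entry["key"] for entry in manifest["files"]],
    }


# Hypothetical manifest content, for illustration only.
manifest_bytes = json.dumps(
    {
        "sourceBucket": "example-source-bucket",
        "fileFormat": "CSV",
        "fileSchema": "Bucket, Key, Size",
        "files": [{"key": "inventory/example-source-bucket/data/part-0001.csv.gz"}],
    }
).encode("utf-8")
checksum_text = hashlib.md5(manifest_bytes).hexdigest() + "\n"

if manifest_is_intact(manifest_bytes, checksum_text):
    summary = summarize_manifest(manifest_bytes)
    print(summary["columns"], summary["file_keys"])
```

Verifying the checksum first guards against acting on a manifest that was truncated or altered in transit.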
The following is an example of a manifest in a manifest.json file for a CSV-formatted inventory.
The following is an example of a manifest in a manifest.json file for an ORC-formatted inventory.
The following is an example of a manifest in a manifest.json file for a Parquet-formatted inventory.
The symlink.txt file is an Apache Hive-compatible manifest file that allows Hive to automatically discover inventory files and their associated data files. The Hive-compatible manifest works with the Hive-compatible services Athena and Amazon Redshift Spectrum. It also works with Hive-compatible applications, including Presto, Apache Hive, Apache Spark, and many others.
Important
The symlink.txt Apache Hive-compatible manifest file does not currently work with AWS Glue.
Reading symlink.txt with Apache Hive and Apache Spark is not supported for ORC and Parquet-formatted inventory files.
How Do I Know When an Inventory Is Complete?
You can set up an Amazon S3 event notification to receive notice when the manifest checksum file is created, which indicates that an inventory list has been added to the destination bucket. The manifest is an up-to-date list of all the inventory lists at the destination location.
Amazon S3 can publish events to an Amazon Simple Notification Service (Amazon SNS) topic, an Amazon Simple Queue Service (Amazon SQS) queue, or an AWS Lambda function. For more information, see Configuring Amazon S3 Event Notifications.
The following notification configuration defines that all manifest.checksum files newly added to the destination bucket are processed by the AWS Lambda function cloud-function-list-write.
For more information, see Using AWS Lambda with Amazon S3 in the AWS Lambda Developer Guide.
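To illustrate the notification flow described in this section, the following Python sketch builds a notification configuration that filters on manifest.checksum files, plus a handler that turns each notification into the location of the matching manifest.json. The configuration uses the dictionary layout of the AWS SDK for Python as we understand it, and the handler assumes the standard Amazon S3 event message structure, in which object keys arrive URL-encoded. The function ARN, account ID, bucket name, and prefix are hypothetical.

```python
from urllib.parse import unquote_plus

CHECKSUM_NAME = "manifest.checksum"


def checksum_notification(lambda_arn, prefix):
    """Build a notification configuration for new manifest.checksum files."""
    return {
        "LambdaFunctionConfigurations": [
            {
                "LambdaFunctionArn": lambda_arn,
                "Events": ["s3:ObjectCreated:*"],
                "Filter": {
                    "Key": {
                        "FilterRules": [
                            {"Name": "prefix", "Value": prefix},
                            {"Name": "suffix", "Value": CHECKSUM_NAME},
                        ]
                    }
                },
            }
        ]
    }


def completed_manifests(event):
    """Return (bucket, manifest.json key) pairs for each checksum notification."""
    found = []
    for record in event.get("Records", []):
        key = unquote_plus(record["s3"]["object"]["key"])  # keys are URL-encoded
        if key.endswith(CHECKSUM_NAME):
            manifest_key = key[: -len(CHECKSUM_NAME)] + "manifest.json"
            found.append((record["s3"]["bucket"]["name"], manifest_key))
    return found


# Hypothetical ARN, prefix, and event, for illustration only.
notification = checksum_notification(
    "arn:aws:lambda:us-west-2:111122223333:function:cloud-function-list-write",
    "inventory/example-source-bucket",
)
sample_event = {
    "Records": [
        {
            "s3": {
                "bucket": {"name": "example-destination-bucket"},
                "object": {
                    "key": "inventory/example-source-bucket/report/2016-11-06T21-32Z/manifest.checksum"
                },
            }
        }
    ]
}
print(completed_manifests(sample_event))
```

Waiting for the checksum file, rather than for manifest.json itself, is what the section above recommends as the completion signal.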
Querying Inventory with Amazon Athena
You can query Amazon S3 inventory using standard SQL by using Amazon Athena in all Regions where Athena is available. To check for AWS Region availability, see the AWS Region Table.
Athena can query Amazon S3 inventory files in ORC, Parquet, or CSV format. When you use Athena to query inventory, we recommend that you use ORC-formatted or Parquet-formatted inventory files. ORC and Parquet formats provide faster query performance and lower query costs. ORC and Parquet are self-describing type-aware columnar file formats designed for Apache Hadoop. The columnar format lets the reader read, decompress, and process only the columns that are required for the current query. The ORC and Parquet formats for Amazon S3 inventory are available in all AWS Regions.
To get started using Athena to query Amazon S3 inventory
1. Create an Athena table. For information about creating a table, see Creating Tables in Amazon Athena in the Amazon Athena User Guide.
The following sample query includes all optional fields in an ORC-formatted inventory report. Drop any optional field that you did not choose for your inventory so that the query corresponds to the fields chosen for your inventory. Also, you must use your bucket name and the location. The location points to your inventory destination path; for example, s3://destination-prefix/source-bucket/config-ID/hive/.
CREATE EXTERNAL TABLE your_table_name(
  `bucket` string,
  key string,
  version_id string,
  is_latest boolean,
  is_delete_marker boolean,
  size bigint,
  last_modified_date timestamp,
  e_tag string,
  storage_class string,
  is_multipart_uploaded boolean,
  replication_status string,
  encryption_status string,
  object_lock_retain_until_date timestamp,
  object_lock_mode string,
  object_lock_legal_hold_status string
)
PARTITIONED BY (dt string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
LOCATION 's3://destination-prefix/source-bucket/config-ID/hive/';
When using Athena to query a Parquet-formatted inventory report, use the following Parquet SerDe in place of the ORC SerDe in the ROW FORMAT SERDE statement.
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
2. To add new inventory lists to your table, use the following MSCK REPAIR TABLE command.
MSCK REPAIR TABLE your_table_name;
3. After performing the first two steps, you can run ad hoc queries on your inventory, as shown in the following example.
SELECT encryption_status, count(*) FROM your_table_name GROUP BY encryption_status;
For more information about using Athena, see Amazon Athena User Guide.
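Because the table definition has to match the fields chosen for your inventory, it can help to generate the CREATE EXTERNAL TABLE statement rather than edit it by hand. The following Python sketch does that for an ORC-formatted inventory, using the column names and types from the sample query in this section. The helper name, table name, and chosen fields are hypothetical, and the generated statement is only printed, not submitted to Athena.

```python
# Columns and types taken from the ORC sample query in this section.
SAMPLE_COLUMNS = [
    ("bucket", "string"),
    ("key", "string"),
    ("version_id", "string"),
    ("is_latest", "boolean"),
    ("is_delete_marker", "boolean"),
    ("size", "bigint"),
    ("last_modified_date", "timestamp"),
    ("e_tag", "string"),
    ("storage_class", "string"),
    ("is_multipart_uploaded", "boolean"),
    ("replication_status", "string"),
    ("encryption_status", "string"),
    ("object_lock_retain_until_date", "timestamp"),
    ("object_lock_mode", "string"),
    ("object_lock_legal_hold_status", "string"),
]
ALWAYS_INCLUDED = {"bucket", "key"}


def create_table_sql(table_name, location, chosen_fields):
    """Build the ORC table definition, keeping only the chosen fields."""
    known = {name for name, _ in SAMPLE_COLUMNS}
    unknown = set(chosen_fields) - known
    if unknown:
        raise ValueError(f"unknown fields: {sorted(unknown)}")
    columns = [
        f"  `{name}` {sql_type}"
        for name, sql_type in SAMPLE_COLUMNS
        if name in ALWAYS_INCLUDED or name in chosen_fields
    ]
    lines = [
        f"CREATE EXTERNAL TABLE {table_name}(",
        ",\n".join(columns),
        ")",
        "PARTITIONED BY (dt string)",
        "ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'",
        "STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'",
        "OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'",
        f"LOCATION '{location}';",
    ]
    return "\n".join(lines)


# Hypothetical table name and field selection, for illustration only.
sql = create_table_sql(
    "example_inventory",
    "s3://destination-prefix/source-bucket/config-ID/hive/",
    {"size", "encryption_status"},
)
print(sql)
```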
Amazon S3 Inventory REST APIs
The following are the REST operations used for Amazon S3 inventory.