US8296301B2 - Systems and methods for probabilistic data classification - Google Patents

Systems and methods for probabilistic data classification

Info

Publication number
US8296301B2
Authority
US
United States
Prior art keywords
file
filesystem
attributes
category
computer files
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active, expires
Application number
US12/022,676
Other versions
US20090192979A1 (en
Inventor
Norman R. LUNDE
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Commvault Systems Inc
Original Assignee
Commvault Systems Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Commvault Systems Inc
Priority to US12/022,676 (US8296301B2)
Assigned to COMMVAULT SYSTEMS, INC.: assignment of assignors interest (see document for details). Assignors: LUNDE, NORMAN R.
Publication of US20090192979A1
Priority to US13/615,084 (US20130066874A1)
Application granted
Publication of US8296301B2
Assigned to BANK OF AMERICA, N.A., AS ADMINISTRATIVE AGENT: security interest. Assignors: COMMVAULT SYSTEMS, INC.
Priority to US14/968,719 (US9740764B2)
Priority to US15/654,042 (US10628459B2)
Priority to US16/818,781 (US10783168B2)
Priority to US16/944,555 (US11256724B2)
Assigned to COMMVAULT SYSTEMS, INC.: release by secured party (see document for details). Assignors: BANK OF AMERICA, N.A.
Priority to US17/327,509 (US20210342368A1)
Assigned to JPMORGAN CHASE BANK, N.A., AS ADMINISTRATIVE AGENT: security interest (see document for details). Assignors: COMMVAULT SYSTEMS, INC.

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases
    • G06F16/285 Clustering or classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/14 Error detection or correction of the data by redundancy in operation
    • G06F11/1402 Saving, restoring, recovering or retrying
    • G06F11/1446 Point-in-time backing up or restoration of persistent data
    • G06F11/1458 Management of the backup or restore process
    • G06F11/1461 Backup scheduling policy
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/13 File access structures, e.g. distributed indices
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2201/00 Indexing scheme relating to error detection, to error correction, and to monitoring
    • G06F2201/84 Using snapshots, i.e. a logical point-in-time copy of the data

Definitions

  • the field of the invention relates to systems and methods for performing data classification operations.
  • files can be classified using an explicit set of rules defined by the user.
  • a data classification rule may be based on inclusion of a keyword or a small set of keywords.
  • One embodiment of the invention comprises a filesystem configured to store a plurality of computer files; a scanning agent configured to traverse the filesystem and compile data regarding the attributes and content of the plurality of computer files; an index configured to store the data regarding attributes and content of the plurality of computer files; and a file classifier configured to analyze the data regarding the attributes and content of the plurality of computer files and to classify the plurality of computer files into one or more categories based on the data regarding the attributes and content of the plurality of computer files.
  • Another embodiment of the invention comprises a method of traversing a filesystem and compiling data regarding attributes and content of a plurality of computer files stored in the filesystem; storing the data regarding attributes and content of the plurality of computer files in an index; analyzing the data regarding the attributes and content of the plurality of computer files; and classifying the plurality of computer files into one or more categories based on the data regarding the attributes and content of the plurality of computer files.
  • Another embodiment of the invention comprises means for traversing a filesystem and compiling data regarding attributes and content of a plurality of computer files stored in the filesystem; means for storing the data regarding attributes and content of the plurality of computer files in an index; means for analyzing the data regarding the attributes and content of the plurality of computer files; and means for classifying the plurality of computer files into one or more categories based on the data regarding the attributes and content of the plurality of computer files.
  • FIG. 1 is a schematic representation of a data classification system.
  • FIG. 2 is a flowchart for performing classification operations on data files.
  • FIG. 3 is a schematic illustration of an embodiment of a data storage system for performing data storage operations for one or more client computers into which may be integrated a data classification system.
  • Electronic data classification can be performed manually with relatively good accuracy, but the process is slow and expensive. This type of process can be referred to as supervised classification.
  • data classification can be performed in an automated manner, but if done using explicit rules only, automated classification can result in relatively poor accuracy. This can be referred to as unsupervised classification.
  • techniques can be used which result in semi-supervised classification.
  • Semi-supervised classification techniques may rely on some degree of human input to train a machine to recognize various categories of data. Once the machine has been trained, it can perform data classification operations independent of further human intervention. Semi-automated techniques of this sort can result in greater accuracy than more simplistic automated methods which rely solely on explicit rules.
  • One example of a semi-supervised data classification technique of this sort is a Naïve Bayes classifier. Naïve Bayes classifiers have found use in certain email systems to help reject unwanted, or “spam,” messages as they arrive over a network at an email server, for example, but have not been applied to existing files stored in a computer system.
  • an automated system with improved accuracy, for carrying out file classification operations on the data stored in a business' computing system.
  • an automated system would perform data classification on a substantial portion of a business' stored files on an enterprise-wide, cross-platform scope.
  • the task of data classification is to assign electronic data to one or more categories based on content or characteristics of the data.
  • files may be grouped according to common characteristics such as file size or file extension.
  • files could be grouped with more sophisticated techniques according to subject matter.
  • Many other classification schemes also exist and it should be understood that embodiments of the invention can be adapted to use a wide variety of classification schemes.
  • FIG. 1 is a schematic representation of an automated system for performing data classification on electronic files according to one embodiment of the invention.
  • the file servers 120 which can include or be coupled to electronic data storage devices, handle I/O requests to a filesystem shared by a plurality of client computers (not shown) in a business' computing system.
  • the client computers can be coupled to the file servers 120 via the Local Area Network (LAN) 190 , or in any other way known in the art.
  • the file servers 120 house a substantial portion of a business' electronic data, which is accessible to a plurality of client computers via the network 190 .
  • the shared data storage capacity could take a form other than shared file servers.
  • shared storage devices could be coupled to a plurality of client computers via a Storage Area Network (SAN) or a Network Attached Storage (NAS) unit.
  • Other shared electronic data storage configurations are also possible.
  • each file server 120 may include a filesystem scanning agent 110 .
  • the filesystem scanning agents 110 can systematically traverse data housed by a corresponding file server 120.
  • the filesystem scanning agents 110 can access electronic files and compile information about the characteristics of the files, the content of the files, or any other attribute of interest that could serve as the basis for categorizing the electronic files.
  • Filesystem scanning agents 110 can be configured to operate with any type of filesystem.
  • Although the filesystem scanning agents 110 are illustrated as modules operating within the file servers 120, in other embodiments the filesystem scanning agents 110 can be separate devices coupled to the file servers 120 via a network 190. In still other embodiments, filesystem scanning agents 110 can be made capable of directly accessing data storage devices shared by a plurality of client computers over the network 190, such as via SAN networks or NAS units. The filesystem scanning agents can be implemented in any combination of hardware and software.
  • the information can be shared with a file indexing service 150 which can maintain databases, such as a file attribute index 170 and a file content index 180 , to store the information.
  • the file attribute index 170 can be combined with the file content index 180 , or the two indexes can be implemented as a number of sub-indexes.
  • the file indexing service 150 may be a module operating on an Intelligent File Classifier (IFC) server 130 and information can be exchanged between the filesystem scanning agents 110 and the file indexing service 150 via the network 190 .
  • the IFC server 130 can include a data processor and electronic memory modules.
  • the IFC server may also include a file classifier program 140 module which can access the file attribute 170 and the file content 180 indexes and classify electronic data files as members of various categories, according to the methods described below.
  • the IFC server 130 may also include a user interface 160 to allow a user to input the characteristics or content of a category of interest and to view a listing of the designated member files of a data classification operation performed by the file classifier program 140 .
  • the user interface 160 may comprise any type of user interface known in the art, such as an I/O terminal coupled to the IFC server 130 or a web server to allow a user to remotely access the IFC server 130 .
  • FIG. 2 is a flowchart which represents an exemplary method of performing data classification operations using the system illustrated in FIG. 1 .
  • a filesystem scanning agent 110 traverses a filesystem and compiles information regarding the attributes and content of electronic files stored in the filesystem.
  • the filesystem scanning agents 110 may have access to a database which indicates the date that a particular file's attributes and content were last gathered.
  • the filesystem scanning agents 110 may determine whether this date came after the last known modification to the file, in which case the filesystem scanning agent 110 may be configured to skip the current file and move on to the next available file.
  • the filesystem scanning agents 110 may be notified any time a file is created or modified so that the new or modified file's attributes and contents can be compiled or updated.
  • the filesystem scanning agents 110 may be notified of these events by filesystem drivers whenever a filesystem I/O request is made, by a packet sniffer coupled to a network which scans the contents of data packets transmitted over the network to determine when a file is created or modified, or using any other technique known in the art.
  • File attributes compiled by the filesystem scanning agents 110 may include, but are not limited to, the file name, its full directory path, size, type, dates of last modification and access, or other types of metadata.
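The traversal and skip logic described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `scan_filesystem`, the record field names, and the `last_indexed` dictionary (standing in for the database of last-gathered dates) are all assumptions made for the example.

```python
import os
import time


def scan_filesystem(root, last_indexed):
    """Walk `root` and return attribute records for files that need indexing.

    `last_indexed` (hypothetical) maps a file path to the time, in seconds
    since the epoch, at which its attributes were last gathered.
    """
    records = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            info = os.stat(path)
            # Skip files whose data was gathered after their last modification.
            if last_indexed.get(path, float("-inf")) > info.st_mtime:
                continue
            records.append({
                "name": name,
                "path": path,
                "size": info.st_size,
                "type": os.path.splitext(name)[1].lower(),
                "modified": info.st_mtime,
                "accessed": info.st_atime,
            })
            # Remember when this file's attributes were gathered.
            last_indexed[path] = time.time()
    return records
```

A notification-driven agent, as also described above, would call the same record-building step for a single path whenever a create or modify event arrives, rather than walking the whole tree.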
  • the file attribute information may be transmitted to a file indexing service 150 to be stored in a file attribute index 170 .
  • This index may take the form of a relational database which can be searched by any attribute entry or combination of attributes.
  • the file attribute index 170 can be a centralized database managed by a file indexing service 150 which receives file attribute and content information from a plurality of sources.
  • the file attribute index 170 may also include information regarding the categories to which a particular file is presently marked as belonging to, or has been marked as having belonged to in the past.
  • the filesystem scanning agents 110 can also analyze data files to catalog their content. For example, if the file includes text, the filesystem scanning agents 110 may create a list of keywords found within the file as well as frequency counts for each of the keywords. If the file is not a text file but rather an image of a document, the classification element 312 may first perform an optical character recognition (OCR) operation before creating keyword lists and frequency counts.
  • the file content information may be transmitted to a file indexing service 150 to be stored in a file content index 180 .
  • the file content index 180 may take the form of a searchable database which contains the keyword lists and frequency counts gathered by the filesystem scanning agents 110 as well as logical mappings of keywords to the files in which they are found. Much like the file attribute index 170 , it may be advantageous for the file content index to be managed by a file indexing service 150 which receives file attribute and content information from a plurality of sources.
  • the file content index 180 may be searched by file, producing a list of keywords for the file.
  • the file content index 180 may also be searched by keyword, producing a list of files which contain that word.
  • This type of search result can include a relevance ranking which orders the list of files containing the search term by the frequency with which the term appears in each file.
  • Other methods of cataloguing and searching the file content index 180 can also be used.
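One way to picture the content index described above is a pair of in-memory mappings: keyword frequency counts per file, and an inverted mapping from keyword to files. The sketch below is illustrative only; the class name `ContentIndex`, its method names, and the simple lowercase tokenizer are assumptions, and a real index would be a persistent, searchable database.

```python
import re
from collections import Counter, defaultdict


class ContentIndex:
    """Hypothetical in-memory stand-in for the file content index."""

    def __init__(self):
        self.by_file = {}                   # file -> Counter of keyword counts
        self.by_keyword = defaultdict(set)  # keyword -> files containing it

    def add(self, path, text):
        # Drop stale keyword mappings if the file was indexed before.
        for word in self.by_file.get(path, ()):
            self.by_keyword[word].discard(path)
        counts = Counter(re.findall(r"[a-z0-9]+", text.lower()))
        self.by_file[path] = counts
        for word in counts:
            self.by_keyword[word].add(path)

    def keywords(self, path):
        """Search by file: keyword -> frequency count."""
        return dict(self.by_file.get(path, {}))

    def search(self, keyword):
        """Search by keyword: files ranked by how often the keyword appears."""
        word = keyword.lower()
        hits = self.by_keyword.get(word, set())
        return sorted(hits, key=lambda p: (-self.by_file[p][word], p))
```

Ties in the ranking are broken alphabetically by path here, which is an arbitrary choice made so the ordering is deterministic.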
  • filesystem scanning agents 110 can be used to analyze any file type for any type of content to the extent that there exists a method for performing such analysis.
  • a catalogue of the identified file content can be kept in the file content index 180 .
  • a filesystem scanning agent 110 transmits file attribute and content information to the file indexing service 150 .
  • the file indexing service 150 stores that information in the appropriate index. Files stored by the file servers 120 can be classified, or designated as members of a defined category, based on the information in these indexes. The classification of a file can be based on information from the file attribute index 170, the file content index 180, or some combination of both.
  • Blocks 240 , 250 , and 260 represent an embodiment of a method for training an automated data classification system which employs semi-supervised classification techniques.
  • Embodiments of the invention will be described below primarily in terms of a Naïve Bayes classification algorithm; however, neural networks or strict Bayesian networks are also suitable candidates. Other types of classifiers or algorithms can also be used.
  • One embodiment of the invention may use a set of explicit user-defined rules to decrease the number of files to which a more computationally expensive classification method is then applied. For example, a user may wish to identify only recent files belonging to a particular category. In such a case, an explicit rule requiring a file to have been modified no longer than thirty days previously could be used to decrease the number of candidate files to be analyzed using a Naïve Bayes algorithm, which uses a more computationally complex calculation to determine a probability that a particular file belongs to the desired category.
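The thirty-day rule in the preceding paragraph can be sketched as a cheap filter that runs before any probabilistic scoring. The function name `recent_candidates` and the `modified` field (a modification time in seconds since the epoch) are assumptions for illustration.

```python
import time

THIRTY_DAYS = 30 * 24 * 60 * 60  # thirty days, in seconds


def recent_candidates(records, now=None, max_age=THIRTY_DAYS):
    """Keep only records modified within `max_age` seconds of `now`.

    Each record is assumed to be a dict with a "modified" timestamp.
    """
    now = time.time() if now is None else now
    return [r for r in records if now - r["modified"] <= max_age]
```

Only the records returned by this filter would then be passed to the more expensive classifier.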
  • a user creates a name for a particular category of data, members of which he or she would like to locate amongst the mass of data stored in file servers 120 or some other type of shared storage device accessible to a plurality of client computers. This can be done with the user interface 160 of the IFC server.
  • the user can select sample files from the file attribute 170 and file content 180 indexes which are properly designated as members of the category of data which the user wishes to identify. These sample files can constitute a training set of data which allows the file classifier program 140 to “learn” how to identify files stored by the file servers 120 which are members of the desired category. Using this training set of data, the file classifier program 140 computes, at block 260 , a set of classification rules that can be applied to the files from the file attribute 170 and file content 180 indexes which were not included in the training set.
  • the set of test data is used to calculate a probability that a file belongs to the desired category. This can be done for each file indexed by the indexing service 150 that lies outside the training set selected by the user.
  • the user interface 160 can format the results of the classification operation and present the results to the user. For example, the user interface 160 can present a list of each file which was determined by the file classifier program 140 to belong to a desired category.
  • Some classification techniques may output a probability that a given unclassified file should be marked as belonging to a certain category.
  • the determination that a file belongs to a particular category may be based on the calculated probability of the file belonging to the category exceeding a threshold.
  • a determination can be made whether the probability is high enough to risk a mistaken classification and justify classifying the file as a member of the category in question.
  • the file classifier program 140 may be configured to mark the file as a member of the category if the probability exceeds a user-defined threshold.
  • a user might configure the classification element to mark a file as a member of a category only if the calculated probability is greater than 85%.
  • the file classifier program may be configured to mark the file as being a questionable member of the category and allow a user to view the file to determine whether it should or should not be designated as a member of the category in question.
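The threshold decision described above might look like the following sketch. The 85% threshold comes from the example in the text; the function name `label_for` and the `review_margin` band that triggers the "questionable" label are assumptions, since the text does not say how a questionable result is identified.

```python
def label_for(probability, threshold=0.85, review_margin=0.15):
    """Turn a membership probability into a classification label.

    `review_margin` is a hypothetical band below the threshold in which a
    file is flagged for manual review instead of being rejected outright.
    """
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    if probability > threshold:
        return "member"
    if probability > threshold - review_margin:
        return "questionable"
    return "non-member"
```

For instance, `label_for(0.9)` yields "member", while `label_for(0.75)` falls in the review band and yields "questionable".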
  • a file may be classified as a member of more than one category.
  • a category of files may be defined temporarily by a user query.
  • a category of files can be defined on a relatively permanent basis and new files which meet the criteria of the category previously calculated by the file classifier program 140 on the basis of a training set of data can be automatically added to the category as they are created or modified.
  • the training data includes both members (Foo.doc, Bar.doc, and Quux.doc) of the desired category, as well as non-members (Bas.doc and Qux.doc).
  • the data on whether each of the files in the training set is smaller than 1 KB can be obtained from the file attribute index 170 .
  • the data on whether each file contains the keyword “SSN” can be obtained from the file content index 180 .
  • the file classifier program 140 can calculate a probability that files smaller than 1 KB are members of the “Personnel Records” category. Based on the above training data, one out of the three files which are smaller than 1 KB is also a member of the “Personnel Records” category, for a probability of 33%. The file classifier program 140 can also calculate a probability that files which contain the keyword “SSN” are members of the “Personnel Records” category. Three out of three files which contain the keyword “SSN” are also members of the “Personnel Records” category. This leads to a calculated probability of 100% that a file belongs to the “Personnel Records” category if it contains the keyword “SSN.”
  • An overall probability that a file belongs to the desired category can also be calculated from the training set of data.
  • three out of the five files in the training set are members of the “Personnel Records” category for an overall probability of membership of 60%.
  • the file classifier program can analyze whether files outside the training set are smaller than 1 KB or contain the keyword “SSN,” and then determine the probability that the file belongs to the “Personnel Records” category using Bayes' theorem or a similar method.
  • the file classifier program 140 uses a Naïve Bayes algorithm.
  • other characteristics of a training set of data can be emphasized in embodiments of the invention which use other classification algorithms.
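The arithmetic in the “Personnel Records” example can be reproduced with a small sketch. The per-file attribute values below are an assumption: the text gives only the aggregate counts (one of three small files is a member, three of three “SSN” files are members, three of five files are members), so this is one assignment consistent with those counts, not the original training table. The function names are likewise hypothetical, and no smoothing is applied.

```python
from fractions import Fraction

# file: (smaller_than_1kb, contains_ssn, is_member)
# Hypothetical per-file values chosen to match the counts stated in the text.
TRAINING = {
    "Foo.doc":  (True,  True,  True),
    "Bar.doc":  (False, True,  True),
    "Quux.doc": (False, True,  True),
    "Bas.doc":  (True,  False, False),
    "Qux.doc":  (True,  False, False),
}


def fraction_of_members(rows):
    """Share of the given training rows that are category members."""
    rows = list(rows)
    return Fraction(sum(r[2] for r in rows), len(rows))


def naive_bayes_posterior(small, has_ssn, training=TRAINING):
    """P(member | features) under the naive independence assumption."""
    rows = list(training.values())
    scores = {}
    for label in (True, False):
        group = [r for r in rows if r[2] == label]
        score = Fraction(len(group), len(rows))      # prior P(label)
        for idx, value in ((0, small), (1, has_ssn)):
            matches = sum(1 for r in group if r[idx] == value)
            score *= Fraction(matches, len(group))   # P(feature | label)
        scores[label] = score
    total = scores[True] + scores[False]
    # None means the feature combination has zero support for both labels.
    return scores[True] / total if total else None
```

With so few training files, a single zero count decides the outcome, which is why a practical classifier would typically add smoothing; that refinement is outside what the text describes.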
  • IFC server 130 may include a database that contains a list of classification outcomes, such as “File Classified as Personnel Information,” as well as a corresponding action to be performed when the associated classification outcome occurs.
  • the IFC server 130 may include learning algorithms to independently determine what course of action to take after a classification operation is completed based on its past experience or based on a set of training data that has been provided to guide its actions.
  • One action that could be taken by the IFC server 130 based on a file classification outcome is changing access permissions on a file based on the sensitivity of the category to which it belongs. It may be desirable to limit access of the file to certain users of the host computing system for any number of reasons: the file may contain sensitive personal employee information, trade secrets, confidential financial information, etc.
  • Another action that could be taken by the IFC server 130 based on a file classification outcome is to change the backup or archive schedule for the file.
  • Certain categories of files may be classified as non-critical. It may be preferable to back up these types of files less regularly in order to conserve system resources. In addition, these files may be migrated to slower access storage sooner than would be the case for more important files, or possibly never. Other categories of files may be classified as critical data. As such, it will likely be desirable to regularly back up these files and possibly maintain them in fast access memory for an extended period of time.
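The list of classification outcomes and corresponding actions could be modeled as a simple lookup table. Everything in this sketch is hypothetical: the category names other than “Personnel Records”, the action fields, the backup intervals, and the rule for combining actions when a file belongs to several categories are all assumptions for illustration.

```python
# Hypothetical outcome-to-action table; values are illustrative only.
ACTIONS = {
    "Personnel Records": {"restrict_access": True, "backup_interval_days": 1},
    "Non-Critical": {"restrict_access": False, "backup_interval_days": 30},
}
DEFAULT_ACTION = {"restrict_access": False, "backup_interval_days": 7}


def plan_for(categories):
    """Combine the actions for every category a file belongs to.

    Assumed policy: restrict access if any category requires it, and use
    the shortest backup interval among the matching categories.
    """
    plans = [ACTIONS[c] for c in categories if c in ACTIONS] or [DEFAULT_ACTION]
    return {
        "restrict_access": any(p["restrict_access"] for p in plans),
        "backup_interval_days": min(p["backup_interval_days"] for p in plans),
    }
```

Because a file may be a member of more than one category, the combining rule errs toward the stricter setting in each field.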
  • An exemplary data storage and backup system which can be modified to include a data classification system is illustrated in FIG. 3.
  • FIG. 3 illustrates a storage cell building block of a modular data storage and backup system.
  • a storage cell 350 of a data storage system performs storage operations on electronic data for one or more client computers in a networked computing environment.
  • the storage system may comprise a Storage Area Network (SAN), a Network Attached Storage (NAS) system, a combination of the two, or any other storage system at least partially attached to a host computing system and/or storage device by a network.
  • storage operation is intended to also convey any other ancillary operation which may be advantageously performed on data that is stored for later access.
  • Storage cells of this type can be combined and programmed to function together in many different configurations to suit the particular data storage needs of a given set of users.
  • Each storage cell 350 may participate in various storage-related functions, such as backup, data migration, quick data recovery, etc.
  • storage cells can be used as modular building blocks to create scalable data storage and backup systems which can grow or shrink in storage-related functionality and capacity as a business' needs dictate.
  • This type of system is exemplary of the CommVault QiNetix system, and also the CommVault GALAXY backup system, available from CommVault Systems, Inc. of Oceanport, N.J. Similar systems are further described in U.S. patent application Ser. Nos. 09/610,738 and 11/120,619, which are hereby incorporated by reference in their entirety.
  • the storage cell 350 may generally comprise a storage manager 300 to direct various aspects of data storage operations and to coordinate such operations with other storage cells.
  • the storage cell 350 may also comprise a data agent 395 to control storage and backup operations for a client computer 385 and a media agent 305 to interface with a physical storage device 315 .
  • Each of these components may be implemented solely as computer hardware or as software operating on computer hardware.
  • the storage manager 300 may be a software module or other application that coordinates and controls storage operations performed by the storage operation cell 350 .
  • the storage manager 300 may communicate with some or all elements of the storage operation cell 350 including client computers 385 , data agents 395 , media agents 305 , and storage devices 315 , to initiate and manage system backups, migrations, and data recovery. If the storage cell 350 is simply one cell out of a number of storage cells which have been combined to create a larger data storage and backup system, then the storage manager 300 may also communicate with other storage cells to coordinate data storage and backup operations in the system as a whole.
  • the data agent 395 is a software module or part of a software module that is generally responsible for archiving, migrating, and recovering data from a client computer 385 stored in an information store 390 or other memory location.
  • Each client computer 385 may have at least one data agent 395 and the system can support multiple client computers 385 .
  • data agents 395 may be distributed between a client 385 and the storage manager 300 (and any other intermediate components (not shown)) or may be deployed from a remote location or its functions approximated by a remote process that performs some or all of the functions of data agent 395 .
  • Embodiments of the storage cell 350 may employ multiple data agents 395, each of which may back up, migrate, and recover data associated with a different application.
  • different individual data agents 395 may be designed to handle Microsoft Exchange data, Lotus Notes data, Microsoft Windows file system data, Microsoft Active Directory Objects data, and other types of data known in the art.
  • Other embodiments may employ one or more generic data agents 395 that can handle and process multiple data types rather than using the specialized data agents described above.
  • a media agent 305 may be implemented as a software module that conveys data, as directed by a storage manager 300, between a client computer 385 and one or more storage devices 315 such as a tape library, a magnetic media storage device, an optical media storage device, or any other suitable storage device.
  • the media agent 305 controls the actual physical level data storage or retrieval to and from a storage device 315 .
  • Media agents 305 may communicate with a storage device 315 via a suitable communications path such as a SCSI or fiber channel communications link.
  • the storage device 315 may be communicatively coupled to a media agent 305 via a SAN or NAS system, or a combination of the two.
  • any given storage cell in a modular data storage and backup system may comprise different combinations of hardware and software components besides the particular configuration illustrated in FIG. 3 .
  • certain components may reside and execute on the same computer.
  • a storage cell may also be adapted to include extra hardware and software for performing additional tasks in the context of a data storage and backup system.
  • storage operation cells may include hardware and software for performing file classification operations.
  • the storage cell 350 may be modified to include a filesystem scanning agent 110 and an IFC server 130 .
  • the IFC server 130 may comprise a file classifier program 140 , a file indexing service 150 , and a user interface 160 .
  • Each of these components may function substantially in accordance with the description of these components set forth above with reference to FIGS. 1 and 2. However, certain modifications to these components may be dictated by the configuration of the computing system into which they are being incorporated. In these instances it is within the ability of one of ordinary skill in the art to make these adaptations.

Abstract

A system for performing data classification operations. In one embodiment, the system comprises a filesystem configured to store a plurality of computer files and a scanning agent configured to traverse the filesystem and compile data regarding the attributes and content of the plurality of computer files. The system also comprises an index configured to store the data regarding attributes and content of the plurality of computer files and a file classifier configured to analyze the data regarding the attributes and content of the plurality of computer files and to classify the plurality of computer files into one or more categories based on the data regarding the attributes and content of the plurality of computer files. Results of the file classification operations can be used to set appropriate security permissions on files which include sensitive information or to control the way that a file is backed up or the schedule according to which it is archived.

Description

BACKGROUND OF THE INVENTION
1. Field of the Invention
The field of the invention relates to systems and methods for performing data classification operations.
2. Description of the Related Art
As modern enterprise environments trend towards a paperless workplace, electronic data is often created at a high rate. This electronic data takes a variety of forms which may include emails, documents, spreadsheets, images, databases, etc. Businesses have a need to effectively classify and organize all of this electronic data.
However, it can be extremely difficult to accurately classify large amounts of data in ways which are time and cost effective. Existing solutions have typically allowed a user to classify files in at least one of two ways. The user can manually view each file and determine the appropriate classification. While this can be a relatively accurate method of categorizing data, it quickly becomes expensive and impractical as the volume of data-to-be-classified increases.
Alternatively, files can be classified using an explicit set of rules defined by the user. For example, a data classification rule may be based on inclusion of a keyword or a small set of keywords. With this approach, the classification of files can be done by machine, but the use of explicit rules tends to be a relatively inaccurate method of classifying non-homogeneous files and can result in many false classifications.
SUMMARY OF THE INVENTION
Therefore, there is a need for more accurate automated systems for classifying and organizing the large amounts of computer data which exist in modern enterprise environments.
One embodiment of the invention comprises a filesystem configured to store a plurality of computer files; a scanning agent configured to traverse the filesystem and compile data regarding the attributes and content of the plurality of computer files; an index configured to store the data regarding attributes and content of the plurality of computer files; and a file classifier configured to analyze the data regarding the attributes and content of the plurality of computer files and to classify the plurality of computer files into one or more categories based on the data regarding the attributes and content of the plurality of computer files.
Another embodiment of the invention comprises a method of traversing a filesystem and compiling data regarding attributes and content of a plurality of computer files stored in the filesystem; storing the data regarding attributes and content of the plurality of computer files in an index; analyzing the data regarding the attributes and content of the plurality of computer files; and classifying the plurality of computer files into one or more categories based on the data regarding the attributes and content of the plurality of computer files.
Another embodiment of the invention comprises means for traversing a filesystem and compiling data regarding attributes and content of a plurality of computer files stored in the filesystem; means for storing the data regarding attributes and content of the plurality of computer files in an index; means for analyzing the data regarding the attributes and content of the plurality of computer files; and means for classifying the plurality of computer files into one or more categories based on the data regarding the attributes and content of the plurality of computer files.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is a schematic representation of a data classification system.
FIG. 2 is a flowchart for performing classification operations on data files.
FIG. 3 is a schematic illustration of an embodiment of a data storage system for performing data storage operations for one or more client computers into which may be integrated a data classification system.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS
As discussed previously, there can be tradeoffs involved in performing electronic data classification. Electronic data classification can be performed manually with relatively good accuracy, but the process is slow and expensive. This type of process can be referred to as supervised classification. In other cases, data classification can be performed in an automated manner, but if done using explicit rules only, automated classification can result in relatively poor accuracy. This can be referred to as unsupervised classification. In still other cases, techniques can be used which result in semi-supervised classification.
Semi-supervised classification techniques may rely on some degree of human input to train a machine to recognize various categories of data. Once the machine has been trained, it can perform data classification operations independent of further human intervention. Semi-supervised techniques of this sort can result in greater accuracy than more simplistic automated methods which rely solely on explicit rules. One example of a semi-supervised data classification technique of this sort is a Naïve Bayes classifier. Naïve Bayes classifiers have found use in certain email systems, for example, to help reject unwanted, or “spam,” messages as they arrive over a network at an email server, but they have not been applied to existing files stored in a computer system.
Apart from the filtering of incoming email messages, significant benefits can be had from applying the Naïve Bayes method, as well as other classification methods, to data that is already stored in a computer system. In particular, there are tremendous advantages to be had from applying data classification methods to large-scale computing systems with tremendous amounts of stored data. These advantages include, among others, using automated data classification methods to place proper security restrictions on access to certain files (this may be required by law in certain instances, such as in the case of medical records or private personnel information) or to control the location where a file is stored or backed up so that it can be located at a later date. Classification of data can also be useful in determining whether certain files should be deleted entirely, backed up in relatively fast access storage media, or permanently archived in slower access media.
Therefore, it would be advantageous to have an automated system, with improved accuracy, for carrying out file classification operations on the data stored in a business' computing system. In certain preferred embodiments of the invention, such an automated system would perform data classification on a substantial portion of a business' stored files on an enterprise-wide, cross-platform scope.
Just as there are many reasons to classify files, there are also many schemes of doing so. Generally speaking, the task of data classification is to assign electronic data to one or more categories based on content or characteristics of the data. In some cases, files may be grouped according to common characteristics such as file size or file extension. In other cases, files could be grouped with more sophisticated techniques according to subject matter. Many other classification schemes also exist and it should be understood that embodiments of the invention can be adapted to use a wide variety of classification schemes.
FIG. 1 is a schematic representation of an automated system for performing data classification on electronic files according to one embodiment of the invention. The file servers 120, which can include or be coupled to electronic data storage devices, handle I/O requests to a filesystem shared by a plurality of client computers (not shown) in a business' computing system. The client computers can be coupled to the file servers 120 via the Local Area Network (LAN) 190, or in any other way known in the art. In this way, the file servers 120 house a substantial portion of a business' electronic data, which is accessible to a plurality of client computers via the network 190.
In other embodiments, the shared data storage capacity could take a form other than shared file servers. For example, shared storage devices could be coupled to a plurality of client computers via a Storage Area Network (SAN) or a Network Attached Storage (NAS) unit. Other shared electronic data storage configurations are also possible.
In one embodiment, each file server 120 may include a filesystem scanning agent 110. The filesystem scanning agents 110 can systematically traverse data housed by a corresponding file server 120. The filesystem scanning agents 110 can access electronic files and compile information about the characteristics of the files, the content of the files, or any other attribute of interest that could serve as the basis for categorizing the electronic files. Filesystem scanning agents 110 can be configured to operate with any type of filesystem.
Furthermore, while the filesystem scanning agents 110 are illustrated as modules operating within the file servers 120, in other embodiments the filesystem scanning agents 110 can be separate devices coupled to file servers 120 via a network 190. In still other embodiments, filesystem scanning agents 110 can be made capable of directly accessing data storage devices shared by a plurality of client computers over the network 190, such as via SAN networks or NAS units. The filesystem scanning agents can be implemented in any combination of hardware and software.
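The traversal performed by a filesystem scanning agent can be illustrated with a minimal Python sketch. It is not the implementation of the described embodiment: the function name `scan_filesystem` and the record field names are assumptions chosen for illustration, and the sketch gathers only attributes available from the operating system for a locally mounted directory tree.

```python
import os
from typing import Dict, Iterator


def scan_filesystem(root: str) -> Iterator[Dict[str, object]]:
    """Walk `root` and yield one attribute record per file found.

    Hypothetical helper: the field names below are illustrative only.
    """
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                info = os.stat(path)
            except OSError:
                # The file vanished or cannot be read; move on to the next one.
                continue
            yield {
                "path": path,
                "name": name,
                "size": info.st_size,
                "extension": os.path.splitext(name)[1].lower(),
                "modified": info.st_mtime,
                "accessed": info.st_atime,
            }
```

Each yielded record could then be handed to an indexing service; a generator is used here so that a large directory tree need not be held in memory at once.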
As filesystem scanning agents 110 compile information about file characteristics, content, etc., the information can be shared with a file indexing service 150 which can maintain databases, such as a file attribute index 170 and a file content index 180, to store the information. In some embodiments, the file attribute index 170 can be combined with the file content index 180, or the two indexes can be implemented as a number of sub-indexes. In one embodiment, the file indexing service 150 may be a module operating on an Intelligent File Classifier (IFC) server 130 and information can be exchanged between the filesystem scanning agents 110 and the file indexing service 150 via the network 190.
The IFC server 130 can include a data processor and electronic memory modules. The IFC server may also include a file classifier program 140 module which can access the file attribute 170 and the file content 180 indexes and classify electronic data files as members of various categories, according to the methods described below. The IFC server 130 may also include a user interface 160 to allow a user to input the characteristics or content of a category of interest and to view a listing of the designated member files of a data classification operation performed by the file classifier program 140. The user interface 160 may comprise any type of user interface known in the art, such as an I/O terminal coupled to the IFC server 130 or a web server to allow a user to remotely access the IFC server 130.
FIG. 2 is a flowchart which represents an exemplary method of performing data classification operations using the system illustrated in FIG. 1. At block 210 a filesystem scanning agent 110 traverses a filesystem and compiles information regarding the attributes and content of electronic files stored in the filesystem. In some embodiments, the filesystem scanning agents 110 may have access to a database which indicates the date that a particular file's attributes and content were last gathered. In these embodiments, the filesystem scanning agents 110 may determine whether this date came after the last known modification to the file, in which case the filesystem scanning agent 110 may be configured to skip the current file and move on to the next available file.
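The skip test described above can be sketched as a single comparison. The function name `needs_rescan` and the use of numeric timestamps are assumptions made for this illustration; treating equal timestamps as "rescan" is a conservative choice of the sketch, not something the description specifies.

```python
from typing import Optional


def needs_rescan(last_modified: float, last_scanned: Optional[float]) -> bool:
    """Return True when a file's attributes and content should be gathered again.

    A file is skipped only if it was scanned strictly after its last known
    modification. A file with no recorded scan is always scanned.
    """
    if last_scanned is None:
        return True
    return last_scanned <= last_modified
```

For example, `needs_rescan(last_modified=100.0, last_scanned=200.0)` is false, so the agent would move on to the next available file.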
In other embodiments, the filesystem scanning agents 110 may be notified any time a file is created or modified so that the new or modified file's attributes and contents can be compiled or updated. The filesystem scanning agents 110 may be notified of these events by filesystem drivers whenever a filesystem I/O request is made, by a packet sniffer coupled to a network which scans the contents of data packets transmitted over the network to determine when a file is created or modified, or using any other technique known in the art.
File attributes compiled by the filesystem scanning agents 110 may include, but are not limited to, the file name, its full directory path, size, type, dates of last modification and access, or other types of metadata. The file attribute information may be transmitted to a file indexing service 150 to be stored in a file attribute index 170. This index may take the form of a relational database which can be searched by any attribute entry or combination of attributes. In certain embodiments, the file attribute index 170 can be a centralized database managed by a file indexing service 150 which receives file attribute and content information from a plurality of sources. The file attribute index 170 may also include information regarding the categories to which a particular file is presently marked as belonging, or has been marked as having belonged in the past.
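One way to picture a relational attribute index that can be searched by a combination of attributes is the following sketch using Python's built-in `sqlite3` module. The table names, column names, and helper functions are hypothetical and chosen only for illustration; the description does not prescribe a schema.

```python
import sqlite3
from typing import List


def create_attribute_index() -> sqlite3.Connection:
    """Create an in-memory index with an illustrative (assumed) schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE file_attributes ("
        " path TEXT PRIMARY KEY, name TEXT, size INTEGER,"
        " type TEXT, modified REAL, accessed REAL)"
    )
    # Present and past category memberships for each file.
    conn.execute(
        "CREATE TABLE file_categories ("
        " path TEXT, category TEXT, is_current INTEGER,"
        " PRIMARY KEY (path, category))"
    )
    return conn


def add_file(conn: sqlite3.Connection, path: str, name: str, size: int,
             file_type: str, modified: float, accessed: float) -> None:
    """Insert or refresh the attribute record for one file."""
    conn.execute(
        "INSERT OR REPLACE INTO file_attributes VALUES (?, ?, ?, ?, ?, ?)",
        (path, name, size, file_type, modified, accessed),
    )


def find_files(conn: sqlite3.Connection, max_size: int, file_type: str) -> List[str]:
    """Search by a combination of attributes: size below a limit and a given type."""
    rows = conn.execute(
        "SELECT path FROM file_attributes WHERE size < ? AND type = ? ORDER BY path",
        (max_size, file_type),
    )
    return [row[0] for row in rows]
```

A query such as `find_files(conn, 1024, ".doc")` would return the paths of all indexed `.doc` files smaller than 1 KB without touching the files themselves.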
The filesystem scanning agents 110 can also analyze data files to catalog their content. For example, if the file includes text, the filesystem scanning agents 110 may create a list of keywords found within the file as well as frequency counts for each of the keywords. If the file is not a text file but rather an image of a document, the filesystem scanning agent 110 may first perform an optical character recognition (OCR) operation before creating keyword lists and frequency counts. The file content information may be transmitted to a file indexing service 150 to be stored in a file content index 180. The file content index 180 may take the form of a searchable database which contains the keyword lists and frequency counts gathered by the filesystem scanning agents 110 as well as logical mappings of keywords to the files in which they are found. Much like the file attribute index 170, it may be advantageous for the file content index to be managed by a file indexing service 150 which receives file attribute and content information from a plurality of sources.
The file content index 180 may be searched by file, producing a list of keywords for the file. The file content index 180 may also be searched by keyword, producing a list of files which contain that word. This type of search result can include a relevance ranking which orders the list of files which contain the search term by the frequency with which the term appears in each file. Other methods of cataloguing and searching the file content index 180 can also be used.
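The two search directions described above, by file and by keyword with a frequency-based ranking, can be sketched with a small in-memory structure. The class name `ContentIndex`, its method names, and the simple tokenizer are assumptions made for this illustration; the sketch also assumes each file is added only once.

```python
import re
from collections import Counter, defaultdict
from typing import Dict, List


class ContentIndex:
    """Illustrative keyword index: per-file counts plus keyword-to-file mappings."""

    def __init__(self) -> None:
        self.by_file: Dict[str, Counter] = {}
        self.by_keyword: Dict[str, Dict[str, int]] = defaultdict(dict)

    def add(self, path: str, text: str) -> None:
        """Tokenize `text` (lowercased alphanumeric runs) and record frequency counts."""
        counts = Counter(re.findall(r"[a-z0-9]+", text.lower()))
        self.by_file[path] = counts
        for word, count in counts.items():
            self.by_keyword[word][path] = count

    def keywords(self, path: str) -> Dict[str, int]:
        """Search by file: the keywords found in the file and their counts."""
        return dict(self.by_file.get(path, {}))

    def search(self, keyword: str) -> List[str]:
        """Search by keyword: files containing it, most frequent occurrence first."""
        hits = self.by_keyword.get(keyword.lower(), {})
        return sorted(hits, key=lambda p: (-hits[p], p))
```

Ties in frequency are broken alphabetically by path so that the ranking is deterministic; a production index would likely use a more refined relevance measure.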
Other types of files besides text-containing documents can be analyzed for content as well. For example, digital image processing techniques can be used to scan image files for certain image features using object recognition algorithms to create a catalogue of features that are identified. Similarly, audio files could be scanned to catalogue recognizable features. In fact, the filesystem scanning agents 110 can be used to analyze any file type for any type of content to the extent that there exists a method for performing such analysis. In any case, a catalogue of the identified file content can be kept in the file content index 180.
At block 220, a filesystem scanning agent 110 transmits file attribute and content information to the file indexing service 150. At block 230, the file indexing service 150 stores that information in the appropriate index. Files stored by the file servers 120 can be classified, or designated as members of a defined category, based on the information in these indexes. The classification of a file can be based on information from the file attribute index 170, the file content index 180, or some combination of both.
As described above, some classification techniques are semi-supervised in that they rely on some degree of human input to train a machine to recognize various categories of data before classification begins. Once the machine has been trained, it can perform data classification operations substantially independent of further human intervention. Blocks 240, 250, and 260 represent an embodiment of a method for training an automated data classification system which employs semi-supervised classification techniques. Embodiments of the invention will be described below primarily in terms of a Naïve Bayes classification algorithm; however, neural networks or strict Bayesian networks are also suitable candidates. Other types of classifiers or algorithms can also be used.
For example, it should be understood that fully supervised and fully unsupervised classification techniques can be advantageously used in certain embodiments of the invention. One embodiment of the invention may use a set of explicit user-defined rules to decrease the number of files to which a more computationally expensive classification method is then applied. For example, a user may wish to identify only recent files belonging to a particular category. In such a case, an explicit rule requiring a file to have been modified no longer than thirty days previously could be used to decrease the number of candidate files to be analyzed using a Naïve Bayes algorithm, which uses a more computationally complex calculation to determine a probability that a particular file belongs to the desired category.
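The two-stage approach in this example, a cheap explicit rule followed by a more expensive probabilistic scorer, might be sketched as follows. The function name `classify_recent`, the record layout, and the `score` callable are hypothetical; `score` stands in for whatever probabilistic classifier is used.

```python
from datetime import datetime, timedelta
from typing import Callable, Dict, Iterable


def classify_recent(
    files: Iterable[Dict[str, object]],
    score: Callable[[Dict[str, object]], float],
    now: datetime,
    max_age_days: int = 30,
) -> Dict[str, float]:
    """Apply an explicit recency rule, then score only the surviving files.

    `files` is assumed to hold records with "path" and "modified" (datetime) keys.
    """
    cutoff = now - timedelta(days=max_age_days)
    # Stage 1: explicit rule, evaluated from indexed attributes alone.
    candidates = [f for f in files if f["modified"] >= cutoff]
    # Stage 2: the costlier probability calculation runs on the reduced set.
    return {f["path"]: score(f) for f in candidates}
```

The benefit of this ordering is that the scorer is never invoked for files the explicit rule has already excluded.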
At block 240, a user creates a name for a particular category of data, members of which he or she would like to locate amongst the mass of data stored in file servers 120 or some other type of shared storage device accessible to a plurality of client computers. This can be done with the user interface 160 of the IFC server. At block 250, the user can select sample files from the file attribute 170 and file content 180 indexes which are properly designated as members of the category of data which the user wishes to identify. These sample files can constitute a training set of data which allows the file classifier program 140 to “learn” how to identify files stored by the file servers 120 which are members of the desired category. Using this training set of data, the file classifier program 140 computes, at block 260, a set of classification rules that can be applied to the files from the file attribute 170 and file content 180 indexes which were not included in the training set.
At block 270, the set of test data is used to calculate a probability that a file belongs to the desired category. This can be done for each file indexed by the indexing service 150 that lies outside the training set selected by the user. Finally, at block 280, the user interface 160 can format the results of the classification operation and present the results to the user. For example, the user interface 160 can present a list of each file which was determined by the file classifier program 140 to belong to a desired category.
Some classification techniques, such as a Naïve Bayes algorithm, may output a probability that a given unclassified file should be marked as belonging to a certain category. In these embodiments, the determination that a file belongs to a particular category may be based on the calculated probability of the file belonging to the category exceeding a threshold. A determination can be made whether the probability is high enough to risk a mistaken classification and justify classifying the file as a member of the category in question. In such cases, the file classifier program 140 may be configured to mark the file as a member of the category if the probability exceeds a user-defined threshold.
For example, a user might configure the file classifier program 140 to mark a file as a member of a category only if the calculated probability is greater than 85%. In cases where the accuracy of the classification operation is critical and where the calculated probability falls short of the threshold by a relatively small margin, the file classifier program 140 may be configured to mark the file as being a questionable member of the category and allow a user to view the file to determine whether it should or should not be designated as a member of the category in question.
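The threshold logic described in this example can be sketched in a few lines. The function name `decide`, the label strings, and the `review_margin` parameter (how far below the threshold still counts as "questionable") are assumptions for illustration; the description leaves the size of the margin open.

```python
def decide(probability: float, threshold: float = 0.85,
           review_margin: float = 0.05) -> str:
    """Map a calculated membership probability to a classification outcome.

    `review_margin` is a hypothetical setting, not a value from the description.
    """
    if probability > threshold:
        return "member"
    if probability >= threshold - review_margin:
        # Close to the threshold: flag the file for a user to review.
        return "questionable"
    return "non-member"
```

With the defaults shown, a file scored at 0.90 would be marked as a member, one scored at 0.82 would be queued for review, and one scored at 0.50 would be left out of the category.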
Once the file has been classified, it may be labeled as a member of the designated category in the file attribute index. A file may be classified as a member of more than one category. In some embodiments, a category of files may be defined temporarily by a user query. In other embodiments, a category of files can be defined on a relatively permanent basis and new files which meet the criteria of the category previously calculated by the file classifier program 140 on the basis of a training set of data can be automatically added to the category as they are created or modified.
A specific example of a Naïve Bayes classifier, according to one embodiment of the invention, will now be given based on the training data in the following chart.
| File Name | File Size < 1 KB? | Contains Keyword “SSN”? | Belongs to “Personnel Records” Category? |
|---|---|---|---|
| Foo.doc | Yes | Yes | Yes |
| Bar.doc | No | Yes | Yes |
| Bas.doc | Yes | No | No |
| Qux.doc | Yes | No | No |
| Quux.doc | No | Yes | Yes |
In the above training set of data, five files have been marked by a user as belonging, or not belonging, to a category called “Personnel Records.” The training data includes both members (Foo.doc, Bar.doc, and Quux.doc) of the desired category, as well as non-members (Bas.doc and Qux.doc). In this example, the data on whether each of the files in the training set is smaller than 1 KB can be obtained from the file attribute index 170. The data on whether each file contains the keyword “SSN” can be obtained from the file content index 180.
Based on this information, the file classifier program 140 can calculate a probability that files smaller than 1 KB are members of the “Personnel Records” category. Based on the above training data, one out of the three files which are smaller than 1 KB (Foo.doc) is also a member of the “Personnel Records” category, for a probability of 33%. The file classifier program 140 can also calculate a probability that files which contain the keyword “SSN” are members of the “Personnel Records” category. Three out of three files which contain the keyword “SSN” are also members of the “Personnel Records” category. This leads to a calculated probability of 100% that a file belongs to the “Personnel Records” category if it contains the keyword “SSN.”
An overall probability that a file belongs to the desired category can also be calculated from the training set of data. In this case, three out of the five files in the training set are members of the “Personnel Records” category for an overall probability of membership of 60%. Using these probabilities, the file classifier program can analyze whether files outside the training set are smaller than 1 KB or contain the keyword “SSN,” and then determine the probability that the file belongs to the “Personnel Records” category using Bayes' theorem or a similar method.
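The figures in this worked example can be reproduced with a short Python sketch. The first helper recomputes the 33%, 100%, and 60% values directly from the training chart. The second helper, `posterior`, is one conventional way of combining such evidence with Bayes' theorem: it uses per-category feature likelihoods and add-one (Laplace) smoothing, both of which are assumptions of this sketch rather than details given in the description.

```python
from fractions import Fraction

# (file name, size < 1 KB?, contains "SSN"?, member of "Personnel Records"?)
TRAINING = [
    ("Foo.doc", True, True, True),
    ("Bar.doc", False, True, True),
    ("Bas.doc", True, False, False),
    ("Qux.doc", True, False, False),
    ("Quux.doc", False, True, True),
]


def membership_given_feature(rows, feature_index: int) -> Fraction:
    """Fraction of files having the feature that are also category members."""
    with_feature = [r for r in rows if r[feature_index]]
    return Fraction(sum(1 for r in with_feature if r[3]), len(with_feature))


def prior(rows) -> Fraction:
    """Overall fraction of training files that are category members."""
    return Fraction(sum(1 for r in rows if r[3]), len(rows))


def posterior(rows, small: bool, ssn: bool, alpha: int = 1) -> Fraction:
    """Naive Bayes posterior of membership for an unseen file.

    Assumption: add-`alpha` smoothing over two-valued features, so that a
    feature value never seen in one category does not force a zero score.
    """
    scores = {}
    for label in (True, False):
        group = [r for r in rows if r[3] == label]
        score = Fraction(len(group), len(rows))
        for index, value in ((1, small), (2, ssn)):
            matches = sum(1 for r in group if r[index] == value)
            score *= Fraction(matches + alpha, len(group) + 2 * alpha)
        scores[label] = score
    return scores[True] / (scores[True] + scores[False])
```

Here `membership_given_feature(TRAINING, 1)` gives 1/3, `membership_given_feature(TRAINING, 2)` gives 1, and `prior(TRAINING)` gives 3/5, matching the text. Exact fractions are used so that the arithmetic can be checked by hand.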
In general, the larger the training set of data and the more representative it is of a cross-section of files in the filesystem in terms of attributes, content, and membership in the desired category, the more accurate will be the results obtained from the classification operation performed by the file classifier program 140 when using a Naïve Bayes algorithm. However, other characteristics of a training set of data can be emphasized in embodiments of the invention which use other classification algorithms.
Once the file classifier program 140 has finished classifying a file, some course of action may be taken by the IFC server 130 based on the outcome of the file classification operation. In some cases the course of action may be pre-determined and user-defined. In this type of embodiment, IFC server 130 may include a database that contains a list of classification outcomes, such as “File Classified as Personnel Information,” as well as a corresponding action to be performed when the associated classification outcome occurs. In other embodiments, the IFC server 130 may include learning algorithms to independently determine what course of action to take after a classification operation is completed based on its past experience or based on a set of training data that has been provided to guide its actions.
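A pre-determined, user-defined list of outcomes and corresponding actions could be represented as a simple lookup table. In the sketch below, the category names, the action functions, and the `OUTCOME_ACTIONS` mapping are all hypothetical, and the actions only return a description of what they would do instead of changing any file.

```python
from typing import Callable, Dict, List


def restrict_access(path: str) -> str:
    """Placeholder action: a real system would change the file's permissions."""
    return f"restrict access to {path}"


def reduce_backup_frequency(path: str) -> str:
    """Placeholder action: a real system would update the backup schedule."""
    return f"back up {path} less often"


# Hypothetical outcome-to-action table; a deployment might store this in a database.
OUTCOME_ACTIONS: Dict[str, List[Callable[[str], str]]] = {
    "Personnel Records": [restrict_access],
    "Non-Critical": [reduce_backup_frequency],
}


def apply_actions(path: str, category: str) -> List[str]:
    """Run every action registered for the classification outcome, if any."""
    return [action(path) for action in OUTCOME_ACTIONS.get(category, [])]
```

Keeping the mapping as data means that new outcomes or actions can be added without changing the classifier itself.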
One action that could be taken by the IFC server 130 based on a file classification outcome is changing access permissions on a file based on the sensitivity of the category to which it belongs. It may be desirable to limit access of the file to certain users of the host computing system for any number of reasons: the file may contain sensitive personal employee information, trade secrets, confidential financial information, etc.
Another action that could be taken by the IFC server 130 based on a file classification outcome is to change the backup or archive schedule for the file. Certain categories of files may be classified as non-critical. It may be preferable to back up these types of files less regularly in order to conserve system resources. In addition, these files may be migrated to slower access storage sooner than would be the case for more important files, or possibly never. Other categories of files may be classified as critical data. As such, it will likely be desirable to regularly back up these files and possibly maintain them in fast access memory for an extended period of time.
In addition, it would be possible to carefully create and manage a schedule for permanently archiving these files due to the critical information they contain. In embodiments of the invention where the results of a data classification operation are used to influence how certain categories of information are backed up or archived, it may be beneficial to integrate a data classification system, such as the one illustrated in FIG. 1, with a data storage and backup system. Many different types of data storage and backup systems can be used for this purpose. However, an exemplary data storage and backup system which can be modified to include a data classification system is illustrated in FIG. 3.
FIG. 3 illustrates a storage cell building block of a modular data storage and backup system. A storage cell 350 of a data storage system performs storage operations on electronic data for one or more client computers in a networked computing environment. The storage system may comprise a Storage Area Network (SAN), a Network Attached Storage (NAS) system, a combination of the two, or any other storage system at least partially attached to a host computing system and/or storage device by a network. Besides operations that are directly related to storing electronic data, the phrase “storage operation” is intended to also convey any other ancillary operation which may be advantageously performed on data that is stored for later access.
Storage cells of this type can be combined and programmed to function together in many different configurations to suit the particular data storage needs of a given set of users. Each storage cell 350 may participate in various storage-related functions, such as backup, data migration, quick data recovery, etc. In this way storage cells can be used as modular building blocks to create scalable data storage and backup systems which can grow or shrink in storage-related functionality and capacity as a business' needs dictate. This type of system is exemplary of the CommVault QiNetix system, and also the CommVault GALAXY backup system, available from CommVault Systems, Inc. of Oceanport, N.J. Similar systems are further described in U.S. patent application Ser. Nos. 09/610,738 and 11/120,619, which are hereby incorporated by reference in their entirety.
As shown, the storage cell 350 may generally comprise a storage manager 300 to direct various aspects of data storage operations and to coordinate such operations with other storage cells. The storage cell 350 may also comprise a data agent 395 to control storage and backup operations for a client computer 385 and a media agent 305 to interface with a physical storage device 315. Each of these components may be implemented solely as computer hardware or as software operating on computer hardware.
Generally speaking, the storage manager 300 may be a software module or other application that coordinates and controls storage operations performed by the storage operation cell 350. The storage manager 300 may communicate with some or all elements of the storage operation cell 350 including client computers 385, data agents 395, media agents 305, and storage devices 315, to initiate and manage system backups, migrations, and data recovery. If the storage cell 350 is simply one cell out of a number of storage cells which have been combined to create a larger data storage and backup system, then the storage manager 300 may also communicate with other storage cells to coordinate data storage and backup operations in the system as a whole.
In one embodiment, the data agent 395 is a software module or part of a software module that is generally responsible for archiving, migrating, and recovering data from a client computer 385 stored in an information store 390 or other memory location. Each client computer 385 may have at least one data agent 395 and the system can support multiple client computers 385. In some embodiments, data agents 395 may be distributed between a client 385 and the storage manager 300 (and any other intermediate components (not shown)) or may be deployed from a remote location or its functions approximated by a remote process that performs some or all of the functions of data agent 395.
Embodiments of the storage cell 350 may employ multiple data agents 395 each of which may backup, migrate, and recover data associated with a different application. For example, different individual data agents 395 may be designed to handle Microsoft Exchange data, Lotus Notes data, Microsoft Windows file system data, Microsoft Active Directory Objects data, and other types of data known in the art. Other embodiments may employ one or more generic data agents 395 that can handle and process multiple data types rather than using the specialized data agents described above.
Generally speaking, a media agent 305 may be implemented as a software module that conveys data, as directed by a storage manager 300, between a client computer 385 and one or more storage devices 315 such as a tape library, a magnetic media storage device, an optical media storage device, or any other suitable storage device. The media agent 305 controls the actual physical level data storage or retrieval to and from a storage device 315. Media agents 305 may communicate with a storage device 315 via a suitable communications path such as a SCSI or fiber channel communications link. In some embodiments, the storage device 315 may be communicatively coupled to a media agent 305 via a SAN or NAS system, or a combination of the two.
It should be appreciated that any given storage cell in a modular data storage and backup system, such as the one described, may comprise different combinations of hardware and software components besides the particular configuration illustrated in FIG. 3. Furthermore, in some embodiments, certain components may reside and execute on the same computer. A storage cell may also be adapted to include extra hardware and software for performing additional tasks in the context of a data storage and backup system. In particular, storage operation cells may include hardware and software for performing file classification operations. For example, the storage cell 350 may be modified to include a filesystem scanning agent 110 and an IFC server 130.
The IFC server 130 may comprise a file classifier program 140, a file indexing service 150, and a user interface 160. Each of these components may function substantially in accordance with the description of these components set forth above with reference to FIGS. 1 and 2. However, certain modifications to these components may be dictated by the configuration of the computing system into which they are being incorporated. In these instances it is within the ability of one of ordinary skill in the art to make these adaptations.
Preferred embodiments of the claimed inventions have been described in connection with the accompanying drawings. While only a few preferred embodiments have been explicitly described, other embodiments will become apparent to those of ordinary skill in the art of the claimed inventions based on this disclosure. Therefore, the scope of the disclosed inventions is intended to be defined by reference to the appended claims and not simply with regard to the explicitly described embodiments of the inventions.

Claims (23)

1. A computer system comprising:
a filesystem configured to store a plurality of computer files in a computer memory;
a plurality of scanning agents implemented on one or more computer processors, wherein the plurality of scanning agents are configured to traverse the filesystem and compile attributes and content indexes about the plurality of computer files wherein the attributes and content indexes are stored in one or more databases that are stored separately from the filesystem; and
a file classifier comprising one or more computer processors, wherein the file classifier is configured to receive user input wherein the user selects a first set of attributes and content indexes from the one or more databases stored separately from a corresponding first set of computer files in the filesystem,
wherein the file classifier is configured to analyze the user input to determine a set of classification rules such that the classification rules are derived from accessing the first set of the attributes and content indexes in the one or more databases stored separately from the corresponding first set of computer files stored in the filesystem, wherein the set of classification rules are derived without directly accessing the first set of computer files stored in the filesystem,
wherein the file classifier is further configured to classify a second set of computer files stored in the filesystem without accessing the filesystem based on a calculated probability derived from a corresponding second set of attributes and context indexes in the one or more databases stored separately from the filesystem.
2. The system of claim 1, wherein the filesystem comprises one or more data storage devices coupled to a plurality of client computers via a Storage Area Network (SAN), a Network Attached Storage (NAS) unit, or some combination of the two.
3. The system of claim 1, further comprising an index configured to store metadata regarding the plurality of computer files.
4. The system of claim 3, wherein the metadata comprises a file size, name, path, type, or date of creation or modification.
5. The system of claim 3, wherein the metadata comprises information regarding a category that a file has been identified as being a member of.
6. The system of claim 3, wherein the content indexes comprise at least a file content index configured to store information regarding the content of the plurality of computer files.
7. The system of claim 6, wherein the content indexes comprise at least a keyword present in a computer file.
8. The system of claim 1, wherein the file classifier is configured to be trained to recognize files belonging to a category based on a user-provided training data set.
9. The system of claim 8, wherein the file classifier determines that a file should be classified as a member of a category based on a probability threshold.
10. The system of claim 9, wherein a computer file is classified as a member of a category only if the probability that it should be classified as a member of the category exceeds a user-defined threshold.
11. The system of claim 1, wherein the file classifier uses at least one of the group consisting of a Naïve Bayesian and a strict Bayesian algorithm.
12. The system of claim 1, wherein the file classifier is configured to reduce a field of possible candidate member computer files of a category based on an explicit rule prior to applying a more computationally expensive classification algorithm to identify member computer files of the category.
13. The system of claim 1, wherein the system is configured to alter security access restrictions on a file based upon a category into which the file is classified.
14. The system of claim 1, wherein the system is configured to alter a data backup schedule for a file based upon a category into which the file is classified.
15. The system of claim 1, wherein the system is configured to alter a data migration plan for a file based upon a category into which the file is classified.
16. A method comprising:
traversing a filesystem and compiling data regarding attributes and content indexes about a plurality of computer files stored in the filesystem, wherein the attributes and content indexes are stored in one or more databases that are stored separately from the filesystem;
receiving user input wherein the user selects a first set of attributes and content indexes from the one or more databases stored separately from a corresponding first set of computer files in the filesystem;
analyzing the user input and data regarding the first set of attributes and content indexes about the corresponding first set of computer files stored in the filesystem to derive a set of classification rules from the first set of attributes and content indexes stored separately from the corresponding first set of computer files without directly accessing the first set of computer files stored in the filesystem; and
classifying a second set of computer files stored in the filesystem without accessing the filesystem into one or more categories based on a calculated probability derived from a corresponding second set of attributes and context indexes in the one or more databases stored separately from the filesystem.
17. The method of claim 16, further comprising training a file classifier to recognize files belonging to a category based on a user-provided training data set.
18. The method of claim 16, further comprising comparing the calculated probability with a predefined threshold to determine whether a file should be classified as a member of a category.
19. The method of claim 18, wherein a computer file is classified as a member of a category only if the probability that it should be classified as a member of the category exceeds a user-defined threshold.
20. The method of claim 16, further comprising altering a data backup schedule for a file based upon a category into which the file is classified.
21. The method of claim 16, further comprising altering a data migration plan for a file based upon a category into which the file is classified.
22. A computer system comprising:
means for traversing a filesystem and compiling data regarding attributes and content indexes about a plurality of computer files stored in the filesystem in computer memory, wherein the attributes and content indexes are stored in one or more databases that are stored separately from the filesystem, and wherein the means for traversing the filesystem comprises one or more computer processors;
means for receiving user input wherein the user selects a first set of attributes and content indexes from the one or more databases stored separately from a corresponding first set of computer files in the filesystem;
means for analyzing the user input and data regarding the first set of attributes and content indexes about the corresponding first set of computer files stored in the filesystem to derive a set of classification rules from the first set of attributes and content indexes stored separately from the corresponding first set of computer files without directly accessing the first set of computer files stored in the filesystem; and
means for classifying a second set of computer files stored in the filesystem without accessing the filesystem into one or more categories based on a calculated probability derived from a corresponding second set of attributes and context indexes in the one or more databases stored separately from the filesystem.
23. The system of claim 22, wherein the means for classifying comprises a Naïve Bayes classifier.
US12/022,676 2008-01-30 2008-01-30 Systems and methods for probabilistic data classification Active 2030-01-16 US8296301B2 (en)

Priority Applications (7)

Application Number Priority Date Filing Date Title
US12/022,676 US8296301B2 (en) 2008-01-30 2008-01-30 Systems and methods for probabilistic data classification
US13/615,084 US20130066874A1 (en) 2008-01-30 2012-09-13 Systems and methods for probabilistic data classification
US14/968,719 US9740764B2 (en) 2008-01-30 2015-12-14 Systems and methods for probabilistic data classification
US15/654,042 US10628459B2 (en) 2008-01-30 2017-07-19 Systems and methods for probabilistic data classification
US16/818,781 US10783168B2 (en) 2008-01-30 2020-03-13 Systems and methods for probabilistic data classification
US16/944,555 US11256724B2 (en) 2008-01-30 2020-07-31 Systems and methods for probabilistic data classification
US17/327,509 US20210342368A1 (en) 2008-01-30 2021-05-21 Systems and methods for probabilistic data classification

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
US12/022,676 US8296301B2 (en) 2008-01-30 2008-01-30 Systems and methods for probabilistic data classification

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US13/615,084 Continuation US20130066874A1 (en) 2008-01-30 2012-09-13 Systems and methods for probabilistic data classification

Publications (2)

Publication Number Publication Date
US20090192979A1 US20090192979A1 (en) 2009-07-30
US8296301B2 US8296301B2 (en) 2012-10-23

Family

ID=40900238

Family Applications (7)

Application Number Title Priority Date Filing Date
US12/022,676 Active 2030-01-16 US8296301B2 (en) 2008-01-30 2008-01-30 Systems and methods for probabilistic data classification
US13/615,084 Abandoned US20130066874A1 (en) 2008-01-30 2012-09-13 Systems and methods for probabilistic data classification
US14/968,719 Active US9740764B2 (en) 2008-01-30 2015-12-14 Systems and methods for probabilistic data classification
US15/654,042 Active 2028-11-07 US10628459B2 (en) 2008-01-30 2017-07-19 Systems and methods for probabilistic data classification
US16/818,781 Active US10783168B2 (en) 2008-01-30 2020-03-13 Systems and methods for probabilistic data classification
US16/944,555 Active US11256724B2 (en) 2008-01-30 2020-07-31 Systems and methods for probabilistic data classification
US17/327,509 Abandoned US20210342368A1 (en) 2008-01-30 2021-05-21 Systems and methods for probabilistic data classification

Country Status (1)

Country Link
US (7) US8296301B2 (en)

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9275291B2 (en) 2013-06-17 2016-03-01 Texifter, LLC System and method of classifier ranking for incorporation into enhanced machine learning
US9740764B2 (en) 2008-01-30 2017-08-22 Commvault Systems, Inc. Systems and methods for probabilistic data classification
US10846180B2 (en) 2017-09-14 2020-11-24 Commvault Systems, Inc. Distributed framework for task splitting and task assignments in a content indexing system
US10846266B2 (en) 2017-09-14 2020-11-24 Commvault Systems, Inc. Distributed architecture for content indexing emails
US11036592B2 (en) 2017-09-14 2021-06-15 Commvault Systems, Inc. Distributed content indexing architecture with separately stored file previews
US11086834B2 (en) 2017-09-14 2021-08-10 Commvault Systems, Inc. Distributed framework for data proximity-based task splitting in a content indexing system
US11243832B2 (en) * 2019-10-10 2022-02-08 International Business Machines Corporation Dynamically analyzing diagnostic operations data via machine learning techniques
US11263088B2 (en) 2017-09-14 2022-03-01 Commvault Systems, Inc. Distributed architecture for tracking content indexing
US11321191B2 (en) 2014-07-01 2022-05-03 Commvault Systems, Inc. Lightweight data reconstruction based on backup data
US11392538B2 (en) 2010-09-30 2022-07-19 Commvault Systems, Inc. Archiving data objects using secondary copies
US11461016B2 (en) 2012-08-13 2022-10-04 Commvault Systems, Inc. Lightweight mounting of a secondary file system data
US11710033B2 (en) 2018-06-12 2023-07-25 Bank Of America Corporation Unsupervised machine learning system to automate functions on a graph structure

Families Citing this family (51)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8719809B2 (en) 2006-12-22 2014-05-06 Commvault Systems, Inc. Point in time rollback and un-installation of software
US9286443B2 (en) * 2007-06-04 2016-03-15 Rapid Systems, Llc Systems and methods for data aggregation and prioritization
US8671112B2 (en) * 2008-06-12 2014-03-11 Athenahealth, Inc. Methods and apparatus for automated image classification
US8346532B2 (en) * 2008-07-11 2013-01-01 International Business Machines Corporation Managing the creation, detection, and maintenance of sensitive information
US8356137B2 (en) * 2010-02-26 2013-01-15 Apple Inc. Data storage scheme for non-volatile memories based on data priority
US20120101870A1 (en) * 2010-10-22 2012-04-26 International Business Machines Corporation Estimating the Sensitivity of Enterprise Data
US9569449B2 (en) 2010-11-18 2017-02-14 International Business Machines Corporation Method and apparatus for autonomic discovery of sensitive content
WO2013020205A1 (en) * 2011-08-05 2013-02-14 Research In Motion Limited System and method for searching for text and displaying found text in augmented reality
US8996350B1 (en) 2011-11-02 2015-03-31 Dub Software Group, Inc. System and method for automatic document management
US10114843B2 (en) * 2011-11-09 2018-10-30 Sap Se Content migration framework
EP2795487A4 (en) 2011-12-23 2015-07-29 Amazon Tech Inc Scalable analysis platform for semi-structured data
US9037587B2 (en) * 2012-05-10 2015-05-19 International Business Machines Corporation System and method for the classification of storage
US11126720B2 (en) 2012-09-26 2021-09-21 Bluvector, Inc. System and method for automated machine-learning, zero-day malware detection
US9292688B2 (en) 2012-09-26 2016-03-22 Northrop Grumman Systems Corporation System and method for automated machine-learning, zero-day malware detection
GB2506879A (en) * 2012-10-10 2014-04-16 Apperception Services Ltd Document classification using latent semantic analysis
US9514007B2 (en) 2013-03-15 2016-12-06 Amazon Technologies, Inc. Database system with database engine and separate distributed storage service
US9400819B2 (en) * 2013-06-07 2016-07-26 Dell Products, Lp Updating object attributes in a lock-coupled namespace traversal
US9898372B2 (en) 2013-09-18 2018-02-20 International Business Machines Corporation Backing up a computer application
US20150347390A1 (en) * 2014-05-30 2015-12-03 Vavni, Inc. Compliance Standards Metadata Generation
CN104102748B (en) * 2014-08-08 2017-12-22 中国联合网络通信集团有限公司 File Mapping method and device and file recommendation method and device
US20160285918A1 (en) * 2015-03-29 2016-09-29 Whitebox Security Ltd. System and method for classifying documents based on access
WO2017027030A1 (en) * 2015-08-12 2017-02-16 Hewlett Packard Enterprise Development Lp Retraining a machine classifier based on audited issue data
WO2017064705A1 (en) * 2015-10-14 2017-04-20 Minereye Ltd. Method of identifying and tracking sensitive data and system thereof
US10713272B1 (en) 2016-06-30 2020-07-14 Amazon Technologies, Inc. Dynamic generation of data catalogs for accessing data
CN106326348A (en) * 2016-08-08 2017-01-11 成都四威高科技产业园有限公司 Electronic scanning processing system and electronic scanning processing method for papery documents
US11277494B1 (en) 2016-11-27 2022-03-15 Amazon Technologies, Inc. Dynamically routing code for executing
US11481408B2 (en) * 2016-11-27 2022-10-25 Amazon Technologies, Inc. Event driven extract, transform, load (ETL) processing
US11138220B2 (en) 2016-11-27 2021-10-05 Amazon Technologies, Inc. Generating data transformation workflows
US10963479B1 (en) 2016-11-27 2021-03-30 Amazon Technologies, Inc. Hosting version controlled extract, transform, load (ETL) code
US10621210B2 (en) 2016-11-27 2020-04-14 Amazon Technologies, Inc. Recognizing unknown data objects
US11036560B1 (en) 2016-12-20 2021-06-15 Amazon Technologies, Inc. Determining isolation types for executing code portions
US10545979B2 (en) 2016-12-20 2020-01-28 Amazon Technologies, Inc. Maintaining data lineage to detect data events
US10331624B2 (en) * 2017-03-03 2019-06-25 Transitive Innovation, Llc Automated data classification system
US20200152319A1 (en) * 2017-05-26 2020-05-14 Koninklijke Philips N.V. Scheduling a task for a medical professional
CN107273682A (en) * 2017-06-09 2017-10-20 广州慧扬健康科技有限公司 File organization module applied to electronic health record archive arrangement system
US10438000B1 (en) * 2017-09-22 2019-10-08 Symantec Corporation Using recognized backup images for recovery after a ransomware attack
US10628475B2 (en) 2017-10-03 2020-04-21 International Business Machines Corporation Runtime control of automation accuracy using adjustable thresholds
US10824474B1 (en) 2017-11-14 2020-11-03 Amazon Technologies, Inc. Dynamically allocating resources for interdependent portions of distributed data processing programs
US11914571B1 (en) 2017-11-22 2024-02-27 Amazon Technologies, Inc. Optimistic concurrency for a multi-writer database
US10725870B1 (en) 2018-01-02 2020-07-28 NortonLifeLock Inc. Content-based automatic backup of images
CN110109891B (en) * 2018-01-18 2023-04-21 伊姆西Ip控股有限责任公司 Method, apparatus and storage medium for data migration
US10908940B1 (en) 2018-02-26 2021-02-02 Amazon Technologies, Inc. Dynamically managed virtual server system
US11288385B2 (en) 2018-04-13 2022-03-29 Sophos Limited Chain of custody for enterprise documents
US11269911B1 (en) 2018-11-23 2022-03-08 Amazon Technologies, Inc. Using specified performance attributes to configure machine learning pipeline stages for an ETL job
FR3090927A1 (en) * 2018-12-21 2020-06-26 Orange Method for processing ephemeral messages in electronic messaging and corresponding processing system
US11017117B2 (en) 2019-01-02 2021-05-25 Bank Of America Corporation Pre-firewall data classification
US11341163B1 (en) 2020-03-30 2022-05-24 Amazon Technologies, Inc. Multi-level replication filtering for a distributed database
CN111858836B (en) * 2020-08-14 2024-02-09 连接派(杭州)互联网有限公司 Data processing and providing method, device, system and storage medium
CN112084410B (en) * 2020-09-10 2023-07-25 北京百度网讯科技有限公司 Document type recommendation method and device, electronic equipment and readable storage medium
CN113569315B (en) * 2021-07-27 2023-11-28 中铁大桥局集团有限公司 Bridge cluster dynamic evaluation method, device, equipment and readable storage medium
FR3136570A1 (en) * 2022-06-14 2023-12-15 Thales Data labeling device and associated labeling method

Citations (192)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4686620A (en) 1984-07-26 1987-08-11 American Telephone And Telegraph Company, At&T Bell Laboratories Database backup method
US4995035A (en) 1988-10-31 1991-02-19 International Business Machines Corporation Centralized management in a computer network
US5005122A (en) 1987-09-08 1991-04-02 Digital Equipment Corporation Arrangement with cooperating management server node and network service node
EP0259912B1 (en) 1986-09-12 1991-10-16 Hewlett-Packard Limited File backup facility for a community of personal computers
EP0467546A2 (en) 1990-07-18 1992-01-22 International Computers Limited Distributed data processing systems
US5093912A (en) 1989-06-26 1992-03-03 International Business Machines Corporation Dynamic resource pool expansion and contraction in multiprocessing environments
US5133065A (en) 1989-07-27 1992-07-21 Personal Computer Peripherals Corporation Backup computer program for networks
US5193154A (en) 1987-07-10 1993-03-09 Hitachi, Ltd. Buffered peripheral system and method for backing up and retrieving data to and from backup memory device
US5212772A (en) 1991-02-11 1993-05-18 Gigatrend Incorporated System for storing data in backup tape device
US5226157A (en) 1988-03-11 1993-07-06 Hitachi, Ltd. Backup control method and system in data processing system using identifiers for controlling block data transfer
US5239647A (en) 1990-09-07 1993-08-24 International Business Machines Corporation Data storage hierarchy with shared storage level
US5241670A (en) 1992-04-20 1993-08-31 International Business Machines Corporation Method and system for automated backup copy ordering in a time zero backup copy session
US5241668A (en) 1992-04-20 1993-08-31 International Business Machines Corporation Method and system for automated termination and resumption in a time zero backup copy process
US5276860A (en) 1989-12-19 1994-01-04 Epoch Systems, Inc. Digital data processor with improved backup storage
US5276867A (en) 1989-12-19 1994-01-04 Epoch Systems, Inc. Digital data storage system with improved data migration
US5287500A (en) 1991-06-03 1994-02-15 Digital Equipment Corporation System for allocating storage spaces based upon required and optional service attributes having assigned piorities
US5321816A (en) 1989-10-10 1994-06-14 Unisys Corporation Local-remote apparatus with specialized image storage modules
US5333315A (en) 1991-06-27 1994-07-26 Digital Equipment Corporation System of device independent file directories using a tag between the directories and file descriptors that migrate with the files
US5347653A (en) 1991-06-28 1994-09-13 Digital Equipment Corporation System for reconstructing prior versions of indexes using records indicating changes between successive versions of the indexes
US5410700A (en) 1991-09-04 1995-04-25 International Business Machines Corporation Computer system which supports asynchronous commitment of data
US5448724A (en) 1993-07-02 1995-09-05 Fujitsu Limited Data processing system having double supervising functions
US5491810A (en) 1994-03-01 1996-02-13 International Business Machines Corporation Method and system for automated data storage system space allocation utilizing prioritized data set parameters
US5495607A (en) 1993-11-15 1996-02-27 Conner Peripherals, Inc. Network management system having virtual catalog overview of files distributively stored across network domain
US5504873A (en) 1989-11-01 1996-04-02 E-Systems, Inc. Mass data storage and retrieval system
US5519865A (en) 1993-07-30 1996-05-21 Mitsubishi Denki Kabushiki Kaisha System and method for retrieving and classifying data stored in a database system
US5544347A (en) 1990-09-24 1996-08-06 Emc Corporation Data storage system controlled remote data mirroring with respectively maintained data indices
US5544345A (en) 1993-11-08 1996-08-06 International Business Machines Corporation Coherence controls for store-multiple shared data coordinated by cache directory entries in a shared electronic storage
US5559957A (en) 1995-05-31 1996-09-24 Lucent Technologies Inc. File system for a data storage device having a power fail recovery mechanism for write/replace operations
EP0405926B1 (en) 1989-06-30 1996-12-04 Digital Equipment Corporation Method and apparatus for managing a shadow set of storage media
US5619644A (en) 1995-09-18 1997-04-08 International Business Machines Corporation Software directed microcode state save for distributed storage controller
EP0774715A1 (en) 1995-10-23 1997-05-21 Stac Electronics System for backing up files from disk volumes on multiple nodes of a computer network
US5638509A (en) 1994-06-10 1997-06-10 Exabyte Corporation Data storage and protection system
US5673381A (en) 1994-05-27 1997-09-30 Cheyenne Software International Sales Corp. System and parallel streaming and data stripping to back-up a network
US5699361A (en) 1995-07-18 1997-12-16 Industrial Technology Research Institute Multimedia channel formulation mechanism
US5729743A (en) 1995-11-17 1998-03-17 Deltatech Research, Inc. Computer apparatus and method for merging system deltas
US5737747A (en) 1995-10-27 1998-04-07 Emc Corporation Prefetching to service multiple video streams from an integrated cached disk array
US5751997A (en) 1993-01-21 1998-05-12 Apple Computer, Inc. Method and apparatus for transferring archival data among an arbitrarily large number of computer devices in a networked computer environment
US5758359A (en) 1996-10-24 1998-05-26 Digital Equipment Corporation Method and apparatus for performing retroactive backups in a computer system
US5761677A (en) 1996-01-03 1998-06-02 Sun Microsystems, Inc. Computer system method and apparatus providing for various versions of a file without requiring data copy or log operations
US5764972A (en) 1993-02-01 1998-06-09 Lsc, Inc. Archiving file system for data servers in a distributed network environment
US5812398A (en) 1996-06-10 1998-09-22 Sun Microsystems, Inc. Method and system for escrowed backup of hotelled world wide web sites
US5813017A (en) 1994-10-24 1998-09-22 International Business Machines Corporation System and method for reducing storage requirement in backup subsystems utilizing segmented compression and differencing
US5813009A (en) 1995-07-28 1998-09-22 Univirtual Corp. Computer based records management system method
US5829046A (en) 1995-10-27 1998-10-27 Emc Corporation On-line tape backup using an integrated cached disk array
US5832510A (en) 1995-07-05 1998-11-03 Hitachi, Ltd. Information processing system enabling access to different types of files, control method for the same and storage medium for storing programs to control the same
US5875478A (en) 1996-12-03 1999-02-23 Emc Corporation Computer backup using a file system, network, disk, tape and remote archiving repository media system
EP0899662A1 (en) 1997-08-29 1999-03-03 Hewlett-Packard Company Backup and restore system for a computer network
US5887134A (en) 1997-06-30 1999-03-23 Sun Microsystems System and method for preserving message order while employing both programmed I/O and DMA operations
US5892917A (en) 1995-09-27 1999-04-06 Microsoft Corporation System for log record and log expansion with inserted log records representing object request for specified object corresponding to cached object copies
US5901327A (en) 1996-05-28 1999-05-04 Emc Corporation Bundling of write data from channel commands in a command chain for transmission over a data link between data storage systems for remote data mirroring
US5907621A (en) 1996-11-15 1999-05-25 International Business Machines Corporation System and method for session management
US5924102A (en) 1997-05-07 1999-07-13 International Business Machines Corporation System and method for managing critical files
US5950205A (en) 1997-09-25 1999-09-07 Cisco Technology, Inc. Data transmission over the internet using a cache memory file system
US5953721A (en) 1995-12-15 1999-09-14 Sony Corporation Data supplying method and device thereof
US5974563A (en) 1995-10-16 1999-10-26 Network Specialists, Inc. Real time backup system
US6021415A (en) 1997-10-29 2000-02-01 International Business Machines Corporation Storage management system with file aggregation and space reclamation within aggregated files
US6026414A (en) 1998-03-05 2000-02-15 International Business Machines Corporation System including a proxy client to backup files in a distributed computing environment
US6052735A (en) 1997-10-24 2000-04-18 Microsoft Corporation Electronic mail object synchronization between a desktop computer and mobile device
US6061692A (en) 1997-11-04 2000-05-09 Microsoft Corporation System and method for administering a meta database as an integral component of an information server
US6076148A (en) 1997-12-26 2000-06-13 Emc Corporation Mass storage subsystem and backup arrangement for digital data processing system which permits information to be backed up while host computer(s) continue(s) operating in connection with information stored on mass storage subsystem
US6094416A (en) 1997-05-09 2000-07-25 I/O Control Corporation Multi-tier architecture for control network
US6131095A (en) 1996-12-11 2000-10-10 Hewlett-Packard Company Method of accessing a target entity over a communications network
US6131190A (en) 1997-12-18 2000-10-10 Sidwell; Leland P. System for modifying JCL parameters to optimize data storage allocations
US6148412A (en) 1996-05-23 2000-11-14 International Business Machines Corporation Availability and recovery of files using copy storage pools
US6154852A (en) 1998-06-10 2000-11-28 International Business Machines Corporation Method and apparatus for data backup and recovery
US6154787A (en) 1998-01-21 2000-11-28 Unisys Corporation Grouping shared resources into one or more pools and automatically re-assigning shared resources from where they are not currently needed to where they are needed
US6161111A (en) 1998-03-31 2000-12-12 Emc Corporation System and method for performing file-handling operations in a digital data processing system using an operating system-independent file map
US6167402A (en) 1998-04-27 2000-12-26 Sun Microsystems, Inc. High performance message store
US6175829B1 (en) 1998-04-22 2001-01-16 Nec Usa, Inc. Method and apparatus for facilitating query reformulation
US6212512B1 (en) 1999-01-06 2001-04-03 Hewlett-Packard Company Integration of a database into file management software for protecting, tracking and retrieving data
US6240416B1 (en) 1998-09-11 2001-05-29 Ambeo, Inc. Distributed metadata system and method
US6260069B1 (en) 1998-02-10 2001-07-10 International Business Machines Corporation Direct data retrieval in a distributed computing system
US6269431B1 (en) 1998-08-13 2001-07-31 Emc Corporation Virtual storage and block level direct access of secondary storage for recovery of backup data
US6275953B1 (en) 1997-09-26 2001-08-14 Emc Corporation Recovery from failure of a data processor in a network server
US6301592B1 (en) 1997-11-05 2001-10-09 Hitachi, Ltd. Method of and an apparatus for displaying version information and configuration information and a computer-readable recording medium on which a version and configuration information display program is recorded
US6324581B1 (en) 1999-03-03 2001-11-27 Emc Corporation File server system using file system storage, data movers, and an exchange of meta data among data movers for file locking and direct access to shared file systems
US6330570B1 (en) 1998-03-02 2001-12-11 Hewlett-Packard Company Data backup system
US6328766B1 (en) 1997-01-23 2001-12-11 Overland Data, Inc. Media element library with non-overlapping subset of media elements and non-overlapping subset of media element drives accessible to first host and unaccessible to second host
US6330642B1 (en) 2000-06-29 2001-12-11 Bull Hn Informatin Systems Inc. Three interconnected raid disk controller data processing system architecture
US20020004883A1 (en) 1997-03-12 2002-01-10 Thai Nguyen Network attached virtual data storage subsystem
EP1174795A1 (en) 2000-07-19 2002-01-23 Hewlett-Packard Company, A Delaware Corporation Multiplexing computing apparatus
US6343324B1 (en) 1999-09-13 2002-01-29 International Business Machines Corporation Method and system for controlling access share storage devices in a network environment by configuring host-to-volume mapping data structures in the controller memory for granting and denying access to the devices
US6350199B1 (en) 1999-03-16 2002-02-26 International Game Technology Interactive gaming machine and method with customized game screen presentation
US6356801B1 (en) 2000-05-19 2002-03-12 International Business Machines Corporation High availability work queuing in an automated data storage library
USRE37601E1 (en) 1992-04-20 2002-03-19 International Business Machines Corporation Method and system for incremental time zero backup copying of data
US6374336B1 (en) 1997-12-24 2002-04-16 Avid Technology, Inc. Computer system and process for transferring multiple high bandwidth streams of data between multiple storage units and multiple applications in a scalable and reliable manner
US20020049738A1 (en) 2000-08-03 2002-04-25 Epstein Bruce A. Information collaboration and reliability assessment
US6389432B1 (en) 1999-04-05 2002-05-14 Auspex Systems, Inc. Intelligent virtual volume access
US20020069324A1 (en) 1999-12-07 2002-06-06 Gerasimov Dennis V. Scalable storage architecture
US20020083055A1 (en) 2000-09-29 2002-06-27 Francois Pachet Information item morphing system
US20020087550A1 (en) 2000-11-29 2002-07-04 James Carlyle Data storage and retrieval system
US6418478B1 (en) 1997-10-30 2002-07-09 Commvault Systems, Inc. Pipelined high speed data transfer mechanism
US6421779B1 (en) 1997-11-14 2002-07-16 Fujitsu Limited Electronic data storage apparatus, system and method
US6421683B1 (en) 1999-03-31 2002-07-16 Verizon Laboratories Inc. Method and product for performing data transfer in a computer system
US6421711B1 (en) 1998-06-29 2002-07-16 Emc Corporation Virtual ports for data transferring of a data storage system
US6430575B1 (en) 1999-09-10 2002-08-06 Xerox Corporation Collaborative document management system with customizable filing structures that are mutually intelligible
US6438586B1 (en) 1996-09-30 2002-08-20 Emc Corporation File transfer utility which employs an intermediate data storage system
US20020133476A1 (en) 1996-07-08 2002-09-19 Gert J. Reinhardt Database system
US20020174107A1 (en) 2001-03-13 2002-11-21 Poulin Christian D. Network transaction method
US6487644B1 (en) 1996-11-22 2002-11-26 Veritas Operating Corporation System and method for multiplexed data back-up to a storage tape and restore operations using client identification tags
US6487561B1 (en) 1998-12-31 2002-11-26 Emc Corporation Apparatus and methods for copying, backing up, and restoring data using a backup segment size larger than the storage block size
US20030018607A1 (en) 2000-08-04 2003-01-23 Lennon Alison Joan Method of enabling browse and search access to electronically-accessible multimedia databases
US6519679B2 (en) 1999-06-11 2003-02-11 Dell Usa, L.P. Policy based storage configuration
US6538669B1 (en) 1999-07-15 2003-03-25 Dell Products L.P. Graphical user interface for configuration of a storage system
US6542909B1 (en) 1998-06-30 2003-04-01 Emc Corporation System for determining mapping of logical objects in a computer system
US6542972B2 (en) 2000-01-31 2003-04-01 Commvault Systems, Inc. Logical view and access to physical storage in modular data and storage management system
US6564228B1 (en) 2000-01-14 2003-05-13 Sun Microsystems, Inc. Method of enabling heterogeneous platforms to utilize a universal file system in a storage area network
US6581143B2 (en) 1999-12-23 2003-06-17 Emc Corporation Data processing method and apparatus for enabling independent access to replicated data
US20030115219A1 (en) 2001-12-19 2003-06-19 International Business Machines Corporation Method, system, and program for storing data in a data store
US20030130993A1 (en) * 2001-08-08 2003-07-10 Quiver, Inc. Document categorization engine
US6625623B1 (en) 1999-12-16 2003-09-23 Livevault Corporation Systems and methods for backing up data files
US20030182583A1 (en) 2002-03-25 2003-09-25 Panareef Pty. Ltd. Electronic document classification and monitoring
US6647396B2 (en) 2000-12-28 2003-11-11 Trilogy Development Group, Inc. Classification based content management system
US6658436B2 (en) 2000-01-31 2003-12-02 Commvault Systems, Inc. Logical view and access to data managed by a modular data and storage management system
EP0981090B1 (en) 1998-08-17 2003-12-17 Connected Place Limited A method of producing a checkpoint which describes a base file and a method of generating a difference file defining differences between an updated file and a base file
US20040010487A1 (en) 2001-09-28 2004-01-15 Anand Prahlad System and method for generating and managing quick recovery volumes
US20040010493A1 (en) 1997-11-19 2004-01-15 Ns Solutions Corporation Database system and a method of data retrieval from the system
US20040015514A1 (en) 2002-04-03 2004-01-22 Austin Melton Method and system for managing data objects
US20040015468A1 (en) 2002-07-19 2004-01-22 International Business Machines Corporation Capturing data changes utilizing data-space tracking
US6732124B1 (en) 1999-03-30 2004-05-04 Fujitsu Limited Data processing system with mechanism for restoring file systems based on transaction logs
US6763351B1 (en) 2001-06-18 2004-07-13 Siebel Systems, Inc. Method, apparatus, and system for attaching search results
US20040139059A1 (en) 2002-12-31 2004-07-15 Conroy William F. Method for automatic deduction of rules for matching content to categories
US6775790B2 (en) 2000-06-02 2004-08-10 Hewlett-Packard Development Company, L.P. Distributed fine-grained enhancements for distributed table driven I/O mapping
US20040254919A1 (en) 2003-06-13 2004-12-16 Microsoft Corporation Log parser
US20040255161A1 (en) 2003-04-12 2004-12-16 Deep Nines, Inc. System and method for network edge data protection
US20040260678A1 (en) 2003-06-18 2004-12-23 Microsoft Corporation State based configuration failure detection using checkpoint comparison
US6847984B1 (en) 1999-12-16 2005-01-25 Livevault Corporation Systems and methods for backing up data files
US20050021537A1 (en) 2003-07-22 2005-01-27 Rainer Brendle Self-describing business objects
US20050033800A1 (en) 2003-06-25 2005-02-10 Srinivas Kavuri Hierarchical system and method for performing storage operations in a computer network
US20050037367A9 (en) 1994-02-10 2005-02-17 Affymetrix, Inc., A California Corporation Scanned image alignment systems and methods
US20050044114A1 (en) 2003-04-03 2005-02-24 Rajiv Kottomtharayil System and method for dynamically performing storage operations in a computer network
US20050050075A1 (en) 2003-08-29 2005-03-03 Fujitsu Limited Data classification processing apparatus, data classification processing method and storage medium
US6871163B2 (en) 2002-05-31 2005-03-22 Sap Aktiengesellschaft Behavior-based adaptation of computer systems
US6886020B1 (en) 2000-08-17 2005-04-26 Emc Corporation Method and apparatus for storage system metrics management and archive
US20050114406A1 (en) 2003-11-26 2005-05-26 Veritas Operating Corporation System and method for detecting and storing file content access information within a file system
US20050154695A1 (en) 2004-01-09 2005-07-14 Oracle International Corporation Rule-based transformation of metadata
US20050182797A1 (en) 2004-02-12 2005-08-18 International Business Machines Corporation Method and apparatus for file system snapshot persistence
US20050182773A1 (en) 2004-02-18 2005-08-18 Feinsmith Jason B. Machine-implemented activity management system using asynchronously shared activity data objects and journal data items
US20050188248A1 (en) 2003-05-09 2005-08-25 O'brien John Scalable storage architecture
US20050193128A1 (en) 2004-02-26 2005-09-01 Dawson Colin S. Apparatus, system, and method for data access management
US20050203964A1 (en) 2003-03-27 2005-09-15 Naoto Matsunami Storage device
US6947935B1 (en) 2001-04-04 2005-09-20 Microsoft Corporation Training, inference and user interface for guiding the caching of media content on local stores
US20050216453A1 (en) 2004-03-23 2005-09-29 Koichi Sasaki System and method for data classification usable for data search
US20050228794A1 (en) 2004-03-30 2005-10-13 Navas Julio C Method and apparatus for virtual content access systems built on a content routing network
US20050257083A1 (en) 2004-05-13 2005-11-17 Cousins Robert E Transaction-based storage system and method that uses variable sized objects to store data
US20050262097A1 (en) 2004-05-07 2005-11-24 Sim-Tang Siew Y System for moving real-time data events across a plurality of devices in a network for simultaneous data protection, replication, and access services
US20050289193A1 (en) 2004-06-25 2005-12-29 Yan Arrouye Methods and systems for managing data
US6983322B1 (en) 1999-10-14 2006-01-03 Al Acquisitions, Inc. System for discrete parallel processing of queries and updates
US20060004820A1 (en) 2004-07-01 2006-01-05 Claudatos Christopher H Storage pools for information management
US20060010227A1 (en) 2004-06-01 2006-01-12 Rajeev Atluri Methods and apparatus for accessing data from a primary data storage system for secondary storage
US6996616B1 (en) 2000-04-17 2006-02-07 Akamai Technologies, Inc. HTML delivery from edge-of-network servers in a content delivery network (CDN)
US20060031263A1 (en) 2004-06-25 2006-02-09 Yan Arrouye Methods and systems for managing data
US20060031287A1 (en) 2001-01-29 2006-02-09 Ulrich Thomas R Systems and methods for load balancing drives and servers
US20060031225A1 (en) 2004-08-06 2006-02-09 Grand Central Communications, Inc. Providing on-demand access to services in a wide area network
US7003519B1 (en) 1999-09-24 2006-02-21 France Telecom Method of thematic classification of documents, themetic classification module, and search engine incorporating such a module
US7035880B1 (en) 1999-07-14 2006-04-25 Commvault Systems, Inc. Modular backup and retrieval system used in conjunction with a storage area network
US20060101285A1 (en) 2004-11-09 2006-05-11 Fortiva Inc. Secure and searchable storage system and method
US20060106814A1 (en) 2004-11-17 2006-05-18 Steven Blumenau Systems and methods for unioning different taxonomy tags for a digital asset
US20060195449A1 (en) * 2005-02-28 2006-08-31 Microsoft Corporation Discoverability and enumeration mechanisms in a hierarchically secure storage system
US7103740B1 (en) 2003-12-31 2006-09-05 Veritas Operating Corporation Backup mechanism for a multi-class file system
US20060224846A1 (en) 2004-11-05 2006-10-05 Amarendran Arun P System and method to support single instance storage operations
US7130970B2 (en) 2002-09-09 2006-10-31 Commvault Systems, Inc. Dynamic storage device pooling in a computer system
US20060253495A1 (en) 2002-04-02 2006-11-09 Png Teck H Metadata database management system and method therefor
US20060259724A1 (en) * 2003-09-12 2006-11-16 Hitachi, Ltd. Backup system and method based on data characteristics
US20060259468A1 (en) 2005-05-10 2006-11-16 Michael Brooks Methods for electronic records management
US20060294094A1 (en) 2004-02-15 2006-12-28 King Martin T Processing techniques for text capture from a rendered document
US7167895B1 (en) 2000-03-22 2007-01-23 Intel Corporation Signaling method and apparatus to provide content on demand in a broadcast system
US20070027861A1 (en) * 2005-07-29 2007-02-01 Microsoft Corporation Automated content categorization
US20070033191A1 (en) 2004-06-25 2007-02-08 John Hornkvist Methods and systems for managing permissions data and/or indexes
US7181444B2 (en) 2000-11-21 2007-02-20 America Online, Inc. System and process for searching a network
US7216043B2 (en) 1997-02-12 2007-05-08 Power Measurement Ltd. Push communications architecture for intelligent electronic devices
US7240100B1 (en) 2000-04-14 2007-07-03 Akamai Technologies, Inc. Content delivery network (CDN) content server request handling mechanism with metadata framework support
US7246211B1 (en) 2003-07-22 2007-07-17 Swsoft Holdings, Ltd. System and method for using file system snapshots for online data backup
US20070179995A1 (en) 2005-11-28 2007-08-02 Anand Prahlad Metabase for facilitating data classification
US20070185926A1 (en) 2005-11-28 2007-08-09 Anand Prahlad Systems and methods for classifying and transferring information in a storage network
US20070288536A1 (en) 2006-06-07 2007-12-13 Microsoft Corporation Managing data with backup server indexing
US20080021921A1 (en) 2002-07-16 2008-01-24 Horn Bruce L Computer system for automatic organization, indexing and viewing of information from multiple sources
US7330997B1 (en) 2004-06-03 2008-02-12 Gary Odom Selective reciprocal backup
US20080059515A1 (en) 2006-09-01 2008-03-06 Fulton Michael S Method, system, and program product for organizing a database
US7346676B1 (en) 2000-07-19 2008-03-18 Akamai Technologies, Inc. Load balancing service
US7359917B2 (en) 2001-12-28 2008-04-15 Thomson Licensing Llc Method and apparatus for automatic detection of data types for data type dependent processing
US20080091655A1 (en) 2006-10-17 2008-04-17 Gokhale Parag S Method and system for offline indexing of content and classifying stored data
US20080228771A1 (en) 2006-12-22 2008-09-18 Commvault Systems, Inc. Method and system for searching stored data
US20080249999A1 (en) * 2007-04-06 2008-10-09 Xerox Corporation Interactive cleaning for automatic document clustering and categorization
US20080263029A1 (en) * 2007-04-18 2008-10-23 Aumni Data, Inc. Adaptive archive data management
US7500150B2 (en) 2005-12-30 2009-03-03 Microsoft Corporation Determining the level of availability of a computing resource
US7529748B2 (en) 2005-11-15 2009-05-05 Ji-Rong Wen Information classification paradigm
US20090177728A1 (en) * 2007-12-20 2009-07-09 Pottenger William M Peer-to-peer indexing-based marketplace
US7583861B2 (en) 2002-11-27 2009-09-01 Teramedica, Inc. Intelligent medical image management system
US7590997B2 (en) 2004-07-30 2009-09-15 Broadband Itv, Inc. System and method for managing, converting and displaying video content on a video-on-demand platform, including ads used for drill-down navigation and consumer-generated classified ads
US7627617B2 (en) 2004-02-11 2009-12-01 Storage Technology Corporation Clustered hierarchical file services
US7627598B1 (en) 1998-06-26 2009-12-01 Alexander James Burke User interface and search system for internet and other applications

Family Cites Families (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
DE69425658T2 (en) 1993-06-03 2001-04-19 Network Appliance Inc ARRANGEMENT OF A FILE SYSTEM FOR DESCRIBING ANY AREAS
EP0728333A1 (en) 1993-11-09 1996-08-28 Arcada Software Data backup and restore system for a computer network
EP1008048B1 (en) 1997-08-29 2001-10-31 Hewlett-Packard Company Data backup and recovery systems
US6067541A (en) 1997-09-17 2000-05-23 Microsoft Corporation Monitoring document changes in a file system of documents with the document change information stored in a persistent log
US6678700B1 (en) 2000-04-27 2004-01-13 General Atomics System of and method for transparent management of data objects in containers across distributed heterogenous resources
US6976039B2 (en) 2001-05-25 2005-12-13 International Business Machines Corporation Method and system for processing backup data associated with application, querying metadata files describing files accessed by the application
US7028079B2 (en) 2001-05-25 2006-04-11 Lenovo (Singapore) Pte, Ltd. Method and apparatus for the automatic migration of applications and their associated data and configuration files
JP2003248605A (en) 2002-02-26 2003-09-05 Hitachi Ltd Storage system, main storing system, sub-storing system, and its data copying method
JP3726263B2 (en) * 2002-03-01 2005-12-14 ヒューレット・パッカード・カンパニー Document classification method and apparatus
JP4255699B2 (en) 2003-01-20 2009-04-15 株式会社日立製作所 Storage device control apparatus control method and storage device control apparatus
US8280894B2 (en) * 2003-01-22 2012-10-02 Amazon Technologies, Inc. Method and system for maintaining item authority
EP1687745A2 (en) 2003-11-26 2006-08-09 Veritas Operating Corporation System and method for generating extensible file system metadata and file system content processing
US7461063B1 (en) * 2004-05-26 2008-12-02 Proofpoint, Inc. Updating logistic regression models using coherent gradient
US20060112146A1 (en) * 2004-11-22 2006-05-25 Nec Laboratories America, Inc. Systems and methods for data analysis and/or knowledge management
US20060129745A1 (en) * 2004-12-11 2006-06-15 Gunther Thiel Process and appliance for data processing and computer program product
US7213116B2 (en) 2005-03-24 2007-05-01 Hitachi, Ltd. Method and apparatus for mirroring objects between storage systems
US7849030B2 (en) * 2006-05-31 2010-12-07 Hartford Fire Insurance Company Method and system for classifying documents
US7937345B2 (en) * 2006-07-12 2011-05-03 Kofax, Inc. Data classification methods using machine learning techniques
US8370442B2 (en) 2008-08-29 2013-02-05 Commvault Systems, Inc. Method and system for leveraging identified changes to a mail server
US20080301757A1 (en) * 2007-05-31 2008-12-04 Marc Demarest Systems and methods for policy enforcement in electronic evidence management
US7792850B1 (en) * 2007-07-27 2010-09-07 Sonicwall, Inc. On-the-fly pattern recognition with configurable bounds
US8296301B2 (en) 2008-01-30 2012-10-23 Commvault Systems, Inc. Systems and methods for probabilistic data classification

Patent Citations (222)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4686620A (en) 1984-07-26 1987-08-11 American Telephone And Telegraph Company, At&T Bell Laboratories Database backup method
EP0259912B1 (en) 1986-09-12 1991-10-16 Hewlett-Packard Limited File backup facility for a community of personal computers
US5193154A (en) 1987-07-10 1993-03-09 Hitachi, Ltd. Buffered peripheral system and method for backing up and retrieving data to and from backup memory device
US5005122A (en) 1987-09-08 1991-04-02 Digital Equipment Corporation Arrangement with cooperating management server node and network service node
US5226157A (en) 1988-03-11 1993-07-06 Hitachi, Ltd. Backup control method and system in data processing system using identifiers for controlling block data transfer
US4995035A (en) 1988-10-31 1991-02-19 International Business Machines Corporation Centralized management in a computer network
US5093912A (en) 1989-06-26 1992-03-03 International Business Machines Corporation Dynamic resource pool expansion and contraction in multiprocessing environments
EP0405926B1 (en) 1989-06-30 1996-12-04 Digital Equipment Corporation Method and apparatus for managing a shadow set of storage media
US5133065A (en) 1989-07-27 1992-07-21 Personal Computer Peripherals Corporation Backup computer program for networks
US5321816A (en) 1989-10-10 1994-06-14 Unisys Corporation Local-remote apparatus with specialized image storage modules
US5504873A (en) 1989-11-01 1996-04-02 E-Systems, Inc. Mass data storage and retrieval system
US5276867A (en) 1989-12-19 1994-01-04 Epoch Systems, Inc. Digital data storage system with improved data migration
US5276860A (en) 1989-12-19 1994-01-04 Epoch Systems, Inc. Digital data processor with improved backup storage
EP0467546A2 (en) 1990-07-18 1992-01-22 International Computers Limited Distributed data processing systems
US5239647A (en) 1990-09-07 1993-08-24 International Business Machines Corporation Data storage hierarchy with shared storage level
US5544347A (en) 1990-09-24 1996-08-06 Emc Corporation Data storage system controlled remote data mirroring with respectively maintained data indices
US5212772A (en) 1991-02-11 1993-05-18 Gigatrend Incorporated System for storing data in backup tape device
US5287500A (en) 1991-06-03 1994-02-15 Digital Equipment Corporation System for allocating storage spaces based upon required and optional service attributes having assigned piorities
US5333315A (en) 1991-06-27 1994-07-26 Digital Equipment Corporation System of device independent file directories using a tag between the directories and file descriptors that migrate with the files
US5347653A (en) 1991-06-28 1994-09-13 Digital Equipment Corporation System for reconstructing prior versions of indexes using records indicating changes between successive versions of the indexes
US5410700A (en) 1991-09-04 1995-04-25 International Business Machines Corporation Computer system which supports asynchronous commitment of data
US5241668A (en) 1992-04-20 1993-08-31 International Business Machines Corporation Method and system for automated termination and resumption in a time zero backup copy process
USRE37601E1 (en) 1992-04-20 2002-03-19 International Business Machines Corporation Method and system for incremental time zero backup copying of data
US5241670A (en) 1992-04-20 1993-08-31 International Business Machines Corporation Method and system for automated backup copy ordering in a time zero backup copy session
US5751997A (en) 1993-01-21 1998-05-12 Apple Computer, Inc. Method and apparatus for transferring archival data among an arbitrarily large number of computer devices in a networked computer environment
US5764972A (en) 1993-02-01 1998-06-09 Lsc, Inc. Archiving file system for data servers in a distributed network environment
US5448724A (en) 1993-07-02 1995-09-05 Fujitsu Limited Data processing system having double supervising functions
US5519865A (en) 1993-07-30 1996-05-21 Mitsubishi Denki Kabushiki Kaisha System and method for retrieving and classifying data stored in a database system
US5544345A (en) 1993-11-08 1996-08-06 International Business Machines Corporation Coherence controls for store-multiple shared data coordinated by cache directory entries in a shared electronic storage
US5495607A (en) 1993-11-15 1996-02-27 Conner Peripherals, Inc. Network management system having virtual catalog overview of files distributively stored across network domain
US20050037367A9 (en) 1994-02-10 2005-02-17 Affymetrix, Inc., A California Corporation Scanned image alignment systems and methods
US5491810A (en) 1994-03-01 1996-02-13 International Business Machines Corporation Method and system for automated data storage system space allocation utilizing prioritized data set parameters
US5673381A (en) 1994-05-27 1997-09-30 Cheyenne Software International Sales Corp. System and parallel streaming and data stripping to back-up a network
US5638509A (en) 1994-06-10 1997-06-10 Exabyte Corporation Data storage and protection system
US5813017A (en) 1994-10-24 1998-09-22 International Business Machines Corporation System and method for reducing storage requirement in backup subsystems utilizing segmented compression and differencing
US5559957A (en) 1995-05-31 1996-09-24 Lucent Technologies Inc. File system for a data storage device having a power fail recovery mechanism for write/replace operations
US5832510A (en) 1995-07-05 1998-11-03 Hitachi, Ltd. Information processing system enabling access to different types of files, control method for the same and storage medium for storing programs to control the same
US5699361A (en) 1995-07-18 1997-12-16 Industrial Technology Research Institute Multimedia channel formulation mechanism
US5813009A (en) 1995-07-28 1998-09-22 Univirtual Corp. Computer based records management system method
US5619644A (en) 1995-09-18 1997-04-08 International Business Machines Corporation Software directed microcode state save for distributed storage controller
US5892917A (en) 1995-09-27 1999-04-06 Microsoft Corporation System for log record and log expansion with inserted log records representing object request for specified object corresponding to cached object copies
US5974563A (en) 1995-10-16 1999-10-26 Network Specialists, Inc. Real time backup system
EP0774715A1 (en) 1995-10-23 1997-05-21 Stac Electronics System for backing up files from disk volumes on multiple nodes of a computer network
US5778395A (en) 1995-10-23 1998-07-07 Stac, Inc. System for backing up files from disk volumes on multiple nodes of a computer network
US5829046A (en) 1995-10-27 1998-10-27 Emc Corporation On-line tape backup using an integrated cached disk array
US5737747A (en) 1995-10-27 1998-04-07 Emc Corporation Prefetching to service multiple video streams from an integrated cached disk array
US5729743A (en) 1995-11-17 1998-03-17 Deltatech Research, Inc. Computer apparatus and method for merging system deltas
US5953721A (en) 1995-12-15 1999-09-14 Sony Corporation Data supplying method and device thereof
US5761677A (en) 1996-01-03 1998-06-02 Sun Microsystems, Inc. Computer system method and apparatus providing for various versions of a file without requiring data copy or log operations
US6148412A (en) 1996-05-23 2000-11-14 International Business Machines Corporation Availability and recovery of files using copy storage pools
EP0809184B1 (en) 1996-05-23 2002-07-31 International Business Machines Corporation Apparatus and method for the availability and recovery of files using copy storage pools
US5901327A (en) 1996-05-28 1999-05-04 Emc Corporation Bundling of write data from channel commands in a command chain for transmission over a data link between data storage systems for remote data mirroring
US5812398A (en) 1996-06-10 1998-09-22 Sun Microsystems, Inc. Method and system for escrowed backup of hotelled world wide web sites
US20020133476A1 (en) 1996-07-08 2002-09-19 Gert J. Reinhardt Database system
US6438586B1 (en) 1996-09-30 2002-08-20 Emc Corporation File transfer utility which employs an intermediate data storage system
US5758359A (en) 1996-10-24 1998-05-26 Digital Equipment Corporation Method and apparatus for performing retroactive backups in a computer system
US5907621A (en) 1996-11-15 1999-05-25 International Business Machines Corporation System and method for session management
US6487644B1 (en) 1996-11-22 2002-11-26 Veritas Operating Corporation System and method for multiplexed data back-up to a storage tape and restore operations using client identification tags
US5875478A (en) 1996-12-03 1999-02-23 Emc Corporation Computer backup using a file system, network, disk, tape and remote archiving repository media system
US6131095A (en) 1996-12-11 2000-10-10 Hewlett-Packard Company Method of accessing a target entity over a communications network
US6328766B1 (en) 1997-01-23 2001-12-11 Overland Data, Inc. Media element library with non-overlapping subset of media elements and non-overlapping subset of media element drives accessible to first host and unaccessible to second host
US7216043B2 (en) 1997-02-12 2007-05-08 Power Measurement Ltd. Push communications architecture for intelligent electronic devices
US6658526B2 (en) 1997-03-12 2003-12-02 Storage Technology Corporation Network attached virtual data storage subsystem
US20020004883A1 (en) 1997-03-12 2002-01-10 Thai Nguyen Network attached virtual data storage subsystem
US5924102A (en) 1997-05-07 1999-07-13 International Business Machines Corporation System and method for managing critical files
US6094416A (en) 1997-05-09 2000-07-25 I/O Control Corporation Multi-tier architecture for control network
US5887134A (en) 1997-06-30 1999-03-23 Sun Microsystems System and method for preserving message order while employing both programmed I/O and DMA operations
EP0899662A1 (en) 1997-08-29 1999-03-03 Hewlett-Packard Company Backup and restore system for a computer network
US5950205A (en) 1997-09-25 1999-09-07 Cisco Technology, Inc. Data transmission over the internet using a cache memory file system
US6275953B1 (en) 1997-09-26 2001-08-14 Emc Corporation Recovery from failure of a data processor in a network server
US6052735A (en) 1997-10-24 2000-04-18 Microsoft Corporation Electronic mail object synchronization between a desktop computer and mobile device
US6021415A (en) 1997-10-29 2000-02-01 International Business Machines Corporation Storage management system with file aggregation and space reclamation within aggregated files
US6418478B1 (en) 1997-10-30 2002-07-09 Commvault Systems, Inc. Pipelined high speed data transfer mechanism
US6061692A (en) 1997-11-04 2000-05-09 Microsoft Corporation System and method for administering a meta database as an integral component of an information server
US6301592B1 (en) 1997-11-05 2001-10-09 Hitachi, Ltd. Method of and an apparatus for displaying version information and configuration information and a computer-readable recording medium on which a version and configuration information display program is recorded
US6421779B1 (en) 1997-11-14 2002-07-16 Fujitsu Limited Electronic data storage apparatus, system and method
US20040010493A1 (en) 1997-11-19 2004-01-15 Ns Solutions Corporation Database system and a method of data retrieval from the system
US6131190A (en) 1997-12-18 2000-10-10 Sidwell; Leland P. System for modifying JCL parameters to optimize data storage allocations
US6374336B1 (en) 1997-12-24 2002-04-16 Avid Technology, Inc. Computer system and process for transferring multiple high bandwidth streams of data between multiple storage units and multiple applications in a scalable and reliable manner
US6076148A (en) 1997-12-26 2000-06-13 Emc Corporation Mass storage subsystem and backup arrangement for digital data processing system which permits information to be backed up while host computer(s) continue(s) operating in connection with information stored on mass storage subsystem
US6154787A (en) 1998-01-21 2000-11-28 Unisys Corporation Grouping shared resources into one or more pools and automatically re-assigning shared resources from where they are not currently needed to where they are needed
US6260069B1 (en) 1998-02-10 2001-07-10 International Business Machines Corporation Direct data retrieval in a distributed computing system
US6330570B1 (en) 1998-03-02 2001-12-11 Hewlett-Packard Company Data backup system
US6026414A (en) 1998-03-05 2000-02-15 International Business Machines Corporation System including a proxy client to backup files in a distributed computing environment
US6161111A (en) 1998-03-31 2000-12-12 Emc Corporation System and method for performing file-handling operations in a digital data processing system using an operating system-independent file map
US6175829B1 (en) 1998-04-22 2001-01-16 Nec Usa, Inc. Method and apparatus for facilitating query reformulation
US6167402A (en) 1998-04-27 2000-12-26 Sun Microsystems, Inc. High performance message store
US6154852A (en) 1998-06-10 2000-11-28 International Business Machines Corporation Method and apparatus for data backup and recovery
US7627598B1 (en) 1998-06-26 2009-12-01 Alexander James Burke User interface and search system for internet and other applications
US6421711B1 (en) 1998-06-29 2002-07-16 Emc Corporation Virtual ports for data transferring of a data storage system
US6542909B1 (en) 1998-06-30 2003-04-01 Emc Corporation System for determining mapping of logical objects in a computer system
US6269431B1 (en) 1998-08-13 2001-07-31 Emc Corporation Virtual storage and block level direct access of secondary storage for recovery of backup data
EP0981090B1 (en) 1998-08-17 2003-12-17 Connected Place Limited A method of producing a checkpoint which describes a base file and a method of generating a difference file defining differences between an updated file and a base file
US6240416B1 (en) 1998-09-11 2001-05-29 Ambeo, Inc. Distributed metadata system and method
US6487561B1 (en) 1998-12-31 2002-11-26 Emc Corporation Apparatus and methods for copying, backing up, and restoring data using a backup segment size larger than the storage block size
US6212512B1 (en) 1999-01-06 2001-04-03 Hewlett-Packard Company Integration of a database into file management software for protecting, tracking and retrieving data
US6324581B1 (en) 1999-03-03 2001-11-27 Emc Corporation File server system using file system storage, data movers, and an exchange of meta data among data movers for file locking and direct access to shared file systems
US6350199B1 (en) 1999-03-16 2002-02-26 International Game Technology Interactive gaming machine and method with customized game screen presentation
US6732124B1 (en) 1999-03-30 2004-05-04 Fujitsu Limited Data processing system with mechanism for restoring file systems based on transaction logs
US6421683B1 (en) 1999-03-31 2002-07-16 Verizon Laboratories Inc. Method and product for performing data transfer in a computer system
US6389432B1 (en) 1999-04-05 2002-05-14 Auspex Systems, Inc. Intelligent virtual volume access
US6519679B2 (en) 1999-06-11 2003-02-11 Dell Usa, L.P. Policy based storage configuration
US7035880B1 (en) 1999-07-14 2006-04-25 Commvault Systems, Inc. Modular backup and retrieval system used in conjunction with a storage area network
US6538669B1 (en) 1999-07-15 2003-03-25 Dell Products L.P. Graphical user interface for configuration of a storage system
US6430575B1 (en) 1999-09-10 2002-08-06 Xerox Corporation Collaborative document management system with customizable filing structures that are mutually intelligible
US6343324B1 (en) 1999-09-13 2002-01-29 International Business Machines Corporation Method and system for controlling access share storage devices in a network environment by configuring host-to-volume mapping data structures in the controller memory for granting and denying access to the devices
US7003519B1 (en) 1999-09-24 2006-02-21 France Telecom Method of thematic classification of documents, themetic classification module, and search engine incorporating such a module
US6983322B1 (en) 1999-10-14 2006-01-03 Al Acquisitions, Inc. System for discrete parallel processing of queries and updates
US20020069324A1 (en) 1999-12-07 2002-06-06 Gerasimov Dennis V. Scalable storage architecture
US6847984B1 (en) 1999-12-16 2005-01-25 Livevault Corporation Systems and methods for backing up data files
US6625623B1 (en) 1999-12-16 2003-09-23 Livevault Corporation Systems and methods for backing up data files
US6581143B2 (en) 1999-12-23 2003-06-17 Emc Corporation Data processing method and apparatus for enabling independent access to replicated data
US6564228B1 (en) 2000-01-14 2003-05-13 Sun Microsystems, Inc. Method of enabling heterogeneous platforms to utilize a universal file system in a storage area network
US6542972B2 (en) 2000-01-31 2003-04-01 Commvault Systems, Inc. Logical view and access to physical storage in modular data and storage management system
US6658436B2 (en) 2000-01-31 2003-12-02 Commvault Systems, Inc. Logical view and access to data managed by a modular data and storage management system
US7167895B1 (en) 2000-03-22 2007-01-23 Intel Corporation Signaling method and apparatus to provide content on demand in a broadcast system
US7240100B1 (en) 2000-04-14 2007-07-03 Akamai Technologies, Inc. Content delivery network (CDN) content server request handling mechanism with metadata framework support
US6996616B1 (en) 2000-04-17 2006-02-07 Akamai Technologies, Inc. HTML delivery from edge-of-network servers in a content delivery network (CDN)
US6356801B1 (en) 2000-05-19 2002-03-12 International Business Machines Corporation High availability work queuing in an automated data storage library
US6775790B2 (en) 2000-06-02 2004-08-10 Hewlett-Packard Development Company, L.P. Distributed fine-grained enhancements for distributed table driven I/O mapping
US6330642B1 (en) 2000-06-29 2001-12-11 Bull Hn Information Systems Inc. Three interconnected raid disk controller data processing system architecture
EP1174795A1 (en) 2000-07-19 2002-01-23 Hewlett-Packard Company, A Delaware Corporation Multiplexing computing apparatus
US7346676B1 (en) 2000-07-19 2008-03-18 Akamai Technologies, Inc. Load balancing service
US20020049738A1 (en) 2000-08-03 2002-04-25 Epstein Bruce A. Information collaboration and reliability assessment
US20030018607A1 (en) 2000-08-04 2003-01-23 Lennon Alison Joan Method of enabling browse and search access to electronically-accessible multimedia databases
US6886020B1 (en) 2000-08-17 2005-04-26 Emc Corporation Method and apparatus for storage system metrics management and archive
US20020083055A1 (en) 2000-09-29 2002-06-27 Francois Pachet Information item morphing system
US7181444B2 (en) 2000-11-21 2007-02-20 America Online, Inc. System and process for searching a network
US20020087550A1 (en) 2000-11-29 2002-07-04 James Carlyle Data storage and retrieval system
US6647396B2 (en) 2000-12-28 2003-11-11 Trilogy Development Group, Inc. Classification based content management system
US20060031287A1 (en) 2001-01-29 2006-02-09 Ulrich Thomas R Systems and methods for load balancing drives and servers
US20020174107A1 (en) 2001-03-13 2002-11-21 Poulin Christian D. Network transaction method
US6947935B1 (en) 2001-04-04 2005-09-20 Microsoft Corporation Training, inference and user interface for guiding the caching of media content on local stores
US6763351B1 (en) 2001-06-18 2004-07-13 Siebel Systems, Inc. Method, apparatus, and system for attaching search results
US20030130993A1 (en) * 2001-08-08 2003-07-10 Quiver, Inc. Document categorization engine
US7346623B2 (en) 2001-09-28 2008-03-18 Commvault Systems, Inc. System and method for generating and managing quick recovery volumes
US20040010487A1 (en) 2001-09-28 2004-01-15 Anand Prahlad System and method for generating and managing quick recovery volumes
US20030115219A1 (en) 2001-12-19 2003-06-19 International Business Machines Corporation Method, system, and program for storing data in a data store
US7359917B2 (en) 2001-12-28 2008-04-15 Thomson Licensing Llc Method and apparatus for automatic detection of data types for data type dependent processing
US20030182583A1 (en) 2002-03-25 2003-09-25 Panareef Pty. Ltd. Electronic document classification and monitoring
US20060253495A1 (en) 2002-04-02 2006-11-09 Png Teck H Metadata database management system and method therefor
US20040015514A1 (en) 2002-04-03 2004-01-22 Austin Melton Method and system for managing data objects
US6871163B2 (en) 2002-05-31 2005-03-22 Sap Aktiengesellschaft Behavior-based adaptation of computer systems
US20080021921A1 (en) 2002-07-16 2008-01-24 Horn Bruce L Computer system for automatic organization, indexing and viewing of information from multiple sources
US20040015468A1 (en) 2002-07-19 2004-01-22 International Business Machines Corporation Capturing data changes utilizing data-space tracking
US7130970B2 (en) 2002-09-09 2006-10-31 Commvault Systems, Inc. Dynamic storage device pooling in a computer system
US7583861B2 (en) 2002-11-27 2009-09-01 Teramedica, Inc. Intelligent medical image management system
US20040139059A1 (en) 2002-12-31 2004-07-15 Conroy William F. Method for automatic deduction of rules for matching content to categories
US20050203964A1 (en) 2003-03-27 2005-09-15 Naoto Matsunami Storage device
US7246207B2 (en) 2003-04-03 2007-07-17 Commvault Systems, Inc. System and method for dynamically performing storage operations in a computer network
US20050044114A1 (en) 2003-04-03 2005-02-24 Rajiv Kottomtharayil System and method for dynamically performing storage operations in a computer network
US20040255161A1 (en) 2003-04-12 2004-12-16 Deep Nines, Inc. System and method for network edge data protection
US20050188248A1 (en) 2003-05-09 2005-08-25 O'brien John Scalable storage architecture
US20040254919A1 (en) 2003-06-13 2004-12-16 Microsoft Corporation Log parser
US20040260678A1 (en) 2003-06-18 2004-12-23 Microsoft Corporation State based configuration failure detection using checkpoint comparison
US20050033800A1 (en) 2003-06-25 2005-02-10 Srinivas Kavuri Hierarchical system and method for performing storage operations in a computer network
US7454569B2 (en) 2003-06-25 2008-11-18 Commvault Systems, Inc. Hierarchical system and method for performing storage operations in a computer network
US7246211B1 (en) 2003-07-22 2007-07-17 Swsoft Holdings, Ltd. System and method for using file system snapshots for online data backup
US20050021537A1 (en) 2003-07-22 2005-01-27 Rainer Brendle Self-describing business objects
US7533103B2 (en) 2003-07-22 2009-05-12 Sap Ag Self-describing business objects
US20050050075A1 (en) 2003-08-29 2005-03-03 Fujitsu Limited Data classification processing apparatus, data classification processing method and storage medium
US20060259724A1 (en) * 2003-09-12 2006-11-16 Hitachi, Ltd. Backup system and method based on data characteristics
US20050114406A1 (en) 2003-11-26 2005-05-26 Veritas Operating Corporation System and method for detecting and storing file content access information within a file system
US7103740B1 (en) 2003-12-31 2006-09-05 Veritas Operating Corporation Backup mechanism for a multi-class file system
US20050154695A1 (en) 2004-01-09 2005-07-14 Oracle International Corporation Rule-based transformation of metadata
US7627617B2 (en) 2004-02-11 2009-12-01 Storage Technology Corporation Clustered hierarchical file services
US20050182797A1 (en) 2004-02-12 2005-08-18 International Business Machines Corporation Method and apparatus for file system snapshot persistence
US20060294094A1 (en) 2004-02-15 2006-12-28 King Martin T Processing techniques for text capture from a rendered document
US20050182773A1 (en) 2004-02-18 2005-08-18 Feinsmith Jason B. Machine-implemented activity management system using asynchronously shared activity data objects and journal data items
US20050193128A1 (en) 2004-02-26 2005-09-01 Dawson Colin S. Apparatus, system, and method for data access management
US20050216453A1 (en) 2004-03-23 2005-09-29 Koichi Sasaki System and method for data classification usable for data search
US20050228794A1 (en) 2004-03-30 2005-10-13 Navas Julio C Method and apparatus for virtual content access systems built on a content routing network
US20050262097A1 (en) 2004-05-07 2005-11-24 Sim-Tang Siew Y System for moving real-time data events across a plurality of devices in a network for simultaneous data protection, replication, and access services
US20050257083A1 (en) 2004-05-13 2005-11-17 Cousins Robert E Transaction-based storage system and method that uses variable sized objects to store data
US20060010227A1 (en) 2004-06-01 2006-01-12 Rajeev Atluri Methods and apparatus for accessing data from a primary data storage system for secondary storage
US7330997B1 (en) 2004-06-03 2008-02-12 Gary Odom Selective reciprocal backup
US20070112809A1 (en) 2004-06-25 2007-05-17 Yan Arrouye Methods and systems for managing data
US20050289193A1 (en) 2004-06-25 2005-12-29 Yan Arrouye Methods and systems for managing data
US20070033191A1 (en) 2004-06-25 2007-02-08 John Hornkvist Methods and systems for managing permissions data and/or indexes
US20060031263A1 (en) 2004-06-25 2006-02-09 Yan Arrouye Methods and systems for managing data
US20060004820A1 (en) 2004-07-01 2006-01-05 Claudatos Christopher H Storage pools for information management
US7590997B2 (en) 2004-07-30 2009-09-15 Broadband Itv, Inc. System and method for managing, converting and displaying video content on a video-on-demand platform, including ads used for drill-down navigation and consumer-generated classified ads
US20060031225A1 (en) 2004-08-06 2006-02-09 Grand Central Communications, Inc. Providing on-demand access to services in a wide area network
US20060224846A1 (en) 2004-11-05 2006-10-05 Amarendran Arun P System and method to support single instance storage operations
US20060101285A1 (en) 2004-11-09 2006-05-11 Fortiva Inc. Secure and searchable storage system and method
US20060106814A1 (en) 2004-11-17 2006-05-18 Steven Blumenau Systems and methods for unioning different taxonomy tags for a digital asset
US20060195449A1 (en) * 2005-02-28 2006-08-31 Microsoft Corporation Discoverability and enumeration mechanisms in a hierarchically secure storage system
US20060259468A1 (en) 2005-05-10 2006-11-16 Michael Brooks Methods for electronic records management
US20070027861A1 (en) * 2005-07-29 2007-02-01 Microsoft Corporation Automated content categorization
US7529748B2 (en) 2005-11-15 2009-05-05 Ji-Rong Wen Information classification paradigm
US20070198570A1 (en) 2005-11-28 2007-08-23 Anand Prahlad Systems and methods for classifying and transferring information in a storage network
US20070198601A1 (en) 2005-11-28 2007-08-23 Anand Prahlad Systems and methods for classifying and transferring information in a storage network
US20070198612A1 (en) 2005-11-28 2007-08-23 Anand Prahlad Data classification systems and methods for organizing a metabase
US20070198611A1 (en) 2005-11-28 2007-08-23 Anand Prahlad Systems and methods for classifying and transferring information in a storage network
US20070198608A1 (en) 2005-11-28 2007-08-23 Anand Prahlad Systems and methods for classifying and transferring information in a storage network
US20070203938A1 (en) * 2005-11-28 2007-08-30 Anand Prahlad Systems and methods for classifying and transferring information in a storage network
US20070203937A1 (en) 2005-11-28 2007-08-30 Anand Prahlad Systems and methods for classifying and transferring information in a storage network
US20070179995A1 (en) 2005-11-28 2007-08-02 Anand Prahlad Metabase for facilitating data classification
US20070198613A1 (en) 2005-11-28 2007-08-23 Anand Prahlad User interfaces and methods for managing data in a metabase
US20070185925A1 (en) 2005-11-28 2007-08-09 Anand Prahlad Systems and methods for classifying and transferring information in a storage network
US20070185914A1 (en) 2005-11-28 2007-08-09 Anand Prahlad Systems and methods for using metadata to enhance data management operations
US20070192360A1 (en) 2005-11-28 2007-08-16 Anand Prahlad Systems and methods for using metadata to enhance data identification operations
US20070192385A1 (en) 2005-11-28 2007-08-16 Anand Prahlad Systems and methods for using metadata to enhance storage operations
US20070185917A1 (en) 2005-11-28 2007-08-09 Anand Prahlad Systems and methods for classifying and transferring information in a storage network
US7613752B2 (en) 2005-11-28 2009-11-03 Commvault Systems, Inc. Systems and methods for using metadata to enhance data management operations
US20070185916A1 (en) 2005-11-28 2007-08-09 Anand Prahlad Systems and methods for classifying and transferring information in a storage network
US20070185915A1 (en) 2005-11-28 2007-08-09 Anand Prahlad Systems and methods for classifying and transferring information in a storage network
US20070185921A1 (en) 2005-11-28 2007-08-09 Anand Prahlad Systems and methods for cataloging metadata for a metabase
US20070185926A1 (en) 2005-11-28 2007-08-09 Anand Prahlad Systems and methods for classifying and transferring information in a storage network
US20070198593A1 (en) 2005-11-28 2007-08-23 Anand Prahlad Systems and methods for classifying and transferring information in a storage network
US7500150B2 (en) 2005-12-30 2009-03-03 Microsoft Corporation Determining the level of availability of a computing resource
US20070288536A1 (en) 2006-06-07 2007-12-13 Microsoft Corporation Managing data with backup server indexing
US20080059515A1 (en) 2006-09-01 2008-03-06 Fulton Michael S Method, system, and program product for organizing a database
US20080294605A1 (en) 2006-10-17 2008-11-27 Anand Prahlad Method and system for offline indexing of content and classifying stored data
US20080091655A1 (en) 2006-10-17 2008-04-17 Gokhale Parag S Method and system for offline indexing of content and classifying stored data
US20080249996A1 (en) 2006-12-22 2008-10-09 Anand Prahlad Method and system for searching stored data
US20080243796A1 (en) 2006-12-22 2008-10-02 Anand Prahlad Method and system for searching stored data
US20080228771A1 (en) 2006-12-22 2008-09-18 Commvault Systems, Inc. Method and system for searching stored data
US20090287665A1 (en) 2006-12-22 2009-11-19 Anand Prahlad Method and system for searching stored data
US20080249999A1 (en) * 2007-04-06 2008-10-09 Xerox Corporation Interactive cleaning for automatic document clustering and categorization
US20080263029A1 (en) * 2007-04-18 2008-10-23 Aumni Data, Inc. Adaptive archive data management
US20090177728A1 (en) * 2007-12-20 2009-07-09 Pottenger William M Peer-to-peer indexing-based marketplace

Non-Patent Citations (55)

* Cited by examiner, † Cited by third party
Title
"Text Figures," retrieved from http://www.microsoft.com/msj/1198.nffs/ntfstextfigs.htm on Nov. 10, 2005, 7 pages.
Armstead et al.; "Implementation of a Campus-wide Distributed Mass Storage Service: The Dream vs. Reality," IEEE, 1995, pp. 190-199.
Arneson, "Mass Storage Archiving in Network Environments" IEEE, 1998, pp. 45-50.
Arneson, Control Data Corporation, Development of Omniserver; Mass Storage Systems, 1990, pp. 88-93.
Cabrera, et al., "ADSM: A Multi-Platform, Scalable, Backup and Archive Mass Storage System," Digest of Papers, Compcon '95, Proceedings of the 40th IEEE Computer Society International Conference, Mar. 5-9, 1995, pp. 420-427, San Francisco, CA.
Communication with extended European Search Report for Application No. PCT/US2006/061304, dated Dec. 30, 2008.
Cooperstein, et al., "Keeping an Eye on Your NTFS Drives: The Windows 2000 Change Journal Explained," Sep. 1999, retrieved from http://www.microsoft.com/msj/0999/journal/journal.aspx on Nov. 10, 2005, 17 pages.
Eitel, "Backup and Storage Management in Distributed Heterogeneous Environments," IEEE, 1994, pp. 124-126.
EMC Corporation, "Today's Choices for Business Continuity," 2004, 12 pages.
Examination Report for European Application No. 06 844 595.6, Mail Date: Sep. 26, 2008, 5 pages.
http://en.wikipedia.org/wiki/Machine-learning, Jun. 1, 2010.
http://en.wikipedia.org/wiki/Naive-Bayes-classifier, Jun. 1, 2010.
Jander, M., "Launching Storage-Area Net," Data Communications, US, McGraw Hill, NY, vol. 27, No. 4 (Mar. 21, 1998), pp. 64-72.
Jason Gait, "The Optical File Cabinet: A Random-Access File system for Write-Once Optical Disks," IEEE Computer, vol. 21, No. 6, pp. 11-22 (1988) (see in particular figure 5 in p. 15 and the recitation in claim 5).
Langdon, et al., "Data Classification: Getting Started," Storage Magazine, Jul. 2005, retrieved from http://storagemagazine.techtarget.com/magPrintFriendly/0,293813,sid35-gci1104445,00.html; on Aug. 25, 2005, 3 pages.
Microsoft, "GetFileAttributes," updated Sep. 2005, retrieved from http://msdn.microsoft.com/library/en-us/fileio/fs/getfileattributes.asp?frame=true on Nov. 10, 2005, 3 pages.
Microsoft, "GetFileAttributesEx," updated Sep. 2005, retrieved from http://msdn.microsoft.com/library/en-us/fileio/fs/getfileattributesex.asp?frame=true on Nov. 10, 2005, 2 pages.
Microsoft, "WIN32-File-Attribute-Data," updated Sep. 2005, retrieved from http://msdn.microsoft.com/library/en-us/fileio/fs/win32-file-attribute-data-str.asp?frame on Nov. 10, 2005, 3 pages.
O'Neill, "New Tools to Classify Data," Storage Magazine, Aug. 2005, retrieved from http://storagemagazine.techtarget.com/magPrintFriendly/0,293813,sid35-gci1114703,00.html on Aug. 25, 2005, 4 pages.
Partial International Search Results, mailed May 25, 2007, International Application No. PCT/US2006/045556, 2 pages.
PCT International Search Report and Written Opinion for International Application No. PCT/US07/81681, Mail Date: Nov. 13, 2009, 8 pages.
Richter, et al., "A File System for the 21st Century: Previewing the Windows NT 5.0 File System," Nov. 1998, retrieved from http://www.microsoft.com/msj/1198/ntfs/ntfs.aspx on Nov. 10, 2005, 17 pages.
Rosenblum, et al., "The Design and Implementation of a Log-Structured File System," Operating Systems Review SIGOPS, vol. 25, No. 5, New York, US, pp. 1-15 (May 1991).
Search Report for European Application No. 06 844 595.6, Mail Date: Sep. 26, 2008, 5 pages.
Szor, "The Art of Computer Virus Research and Defense", 2005 Symantec Corporation, part 1.
Szor, "The Art of Computer Virus Research and Defense", 2005 Symantec Corporation, part 2.
U.S. Appl. No. 11/563,940, filed Nov. 28, 2006, Prahlad et al.
U.S. Appl. No. 11/564,119, filed Nov. 28, 2006, Prahlad et al.
U.S. Appl. No. 11/564,130, filed Nov. 28, 2006, Prahlad et al.
U.S. Appl. No. 11/564,136, filed Nov. 28, 2006, Prahlad et al.
U.S. Appl. No. 11/564,153, filed Nov. 28, 2006, Prahlad et al.
U.S. Appl. No. 11/564,163, filed Nov. 28, 2006, Prahlad et al.
U.S. Appl. No. 11/564,170, filed Nov. 28, 2006, Prahlad et al.
U.S. Appl. No. 11/564,174, filed Nov. 28, 2006, Prahlad et al.
U.S. Appl. No. 11/564,177, filed Nov. 28, 2006, Prahlad et al.
U.S. Appl. No. 11/564,180, filed Nov. 28, 2006, Prahlad et al.
U.S. Appl. No. 11/564,194, filed Nov. 28, 2006, Prahlad et al.
U.S. Appl. No. 11/564,197, filed Nov. 28, 2006, Prahlad et al.
U.S. Appl. No. 11/564,215, filed Nov. 28, 2006, Prahlad et al.
U.S. Appl. No. 11/564,220, filed Nov. 28, 2006, Prahlad et al.
U.S. Appl. No. 11/564,221, filed Nov. 28, 2006, Prahlad et al.
U.S. Appl. No. 11/564,233, filed Nov. 28, 2006, Prahlad et al.
U.S. Appl. No. 11/605,931, filed Nov. 28, 2006, Prahlad et al.
U.S. Appl. No. 11/605,932, filed Nov. 28, 2006, Prahlad et al.
U.S. Appl. No. 11/605,944, filed Nov. 28, 2006, Prahlad et al.
U.S. Appl. No. 11/931,034, filed Oct. 31, 2007, Kottomtharayil et al.
U.S. Appl. No. 12/058,487, filed Mar. 28, 2008, Prahlad.
U.S. Appl. No. 12/058,575, filed Mar. 28, 2008, Prahlad et al.
U.S. Appl. No. 12/058,589, filed Mar. 28, 2008, Prahlad et al.
U.S. Appl. No. 12/511,653, filed Jul. 29, 2009, Prahlad, et al.
U.S. Appl. No. 12/548,953, filed Aug. 27, 2009, Ahn et al.
U.S. Appl. No. 12/695,203, filed Jan. 28, 2010, Prahlad et al.
U.S. Appl. No. 12/695,289, filed Jan. 28, 2010, Prahlad et al.
Witten, et al., "Data Mining, Practical Machine Learning Tools and Techniques", 2nd Edition, 2005, part 1.
Witten, et al., "Data Mining, Practical Machine Learning Tools and Techniques", 2nd Edition, 2005, part 2.

Cited By (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9740764B2 (en) 2008-01-30 2017-08-22 Commvault Systems, Inc. Systems and methods for probabilistic data classification
US10628459B2 (en) 2008-01-30 2020-04-21 Commvault Systems, Inc. Systems and methods for probabilistic data classification
US10783168B2 (en) 2008-01-30 2020-09-22 Commvault Systems, Inc. Systems and methods for probabilistic data classification
US11256724B2 (en) 2008-01-30 2022-02-22 Commvault Systems, Inc. Systems and methods for probabilistic data classification
US11768800B2 (en) 2010-09-30 2023-09-26 Commvault Systems, Inc. Archiving data objects using secondary copies
US11392538B2 (en) 2010-09-30 2022-07-19 Commvault Systems, Inc. Archiving data objects using secondary copies
US11461016B2 (en) 2012-08-13 2022-10-04 Commvault Systems, Inc. Lightweight mounting of a secondary file system data
US9275291B2 (en) 2013-06-17 2016-03-01 Texifter, LLC System and method of classifier ranking for incorporation into enhanced machine learning
US11321191B2 (en) 2014-07-01 2022-05-03 Commvault Systems, Inc. Lightweight data reconstruction based on backup data
US11656956B2 (en) 2014-07-01 2023-05-23 Commvault Systems, Inc. Lightweight data reconstruction based on backup data
US11321190B2 (en) 2017-09-14 2022-05-03 Commvault Systems, Inc. Distributed framework for task splitting and task assignments in a content indexing system
US11263088B2 (en) 2017-09-14 2022-03-01 Commvault Systems, Inc. Distributed architecture for tracking content indexing
US11086834B2 (en) 2017-09-14 2021-08-10 Commvault Systems, Inc. Distributed framework for data proximity-based task splitting in a content indexing system
US11036592B2 (en) 2017-09-14 2021-06-15 Commvault Systems, Inc. Distributed content indexing architecture with separately stored file previews
US10846266B2 (en) 2017-09-14 2020-11-24 Commvault Systems, Inc. Distributed architecture for content indexing emails
US11687511B2 (en) 2017-09-14 2023-06-27 Commvault Systems, Inc. Distributed framework for data proximity-based task splitting in a content indexing system
US10846180B2 (en) 2017-09-14 2020-11-24 Commvault Systems, Inc. Distributed framework for task splitting and task assignments in a content indexing system
US11710033B2 (en) 2018-06-12 2023-07-25 Bank Of America Corporation Unsupervised machine learning system to automate functions on a graph structure
US11243832B2 (en) * 2019-10-10 2022-02-08 International Business Machines Corporation Dynamically analyzing diagnostic operations data via machine learning techniques

Also Published As

Publication number Publication date
US20130066874A1 (en) 2013-03-14
US20200364244A1 (en) 2020-11-19
US11256724B2 (en) 2022-02-22
US20180032602A1 (en) 2018-02-01
US20200210460A1 (en) 2020-07-02
US9740764B2 (en) 2017-08-22
US20210342368A1 (en) 2021-11-04
US10628459B2 (en) 2020-04-21
US20160171079A1 (en) 2016-06-16
US10783168B2 (en) 2020-09-22
US20090192979A1 (en) 2009-07-30

Similar Documents

Publication Publication Date Title
US10783168B2 (en) Systems and methods for probabilistic data classification
CN102741803B (en) For the system and method promoting data to find
US9836541B2 (en) System and method of managing capacity of search index partitions
US8914412B2 (en) Determining file ownership of active and inactive files based on file access history
US10417265B2 (en) High performance parallel indexing for forensics and electronic discovery
US20080270462A1 (en) System and Method of Uniformly Classifying Information Objects with Metadata Across Heterogeneous Data Stores
US20120290927A1 (en) Data Classifier
US20150242452A1 (en) Early exit from table scans of loosely ordered and/or grouped relations using nearly ordered maps
US20130198161A1 (en) Monitoring content repositories, identifying misclassified content objects, and suggesting reclassification
US11030054B2 (en) Methods and systems for data backup based on data classification
CN111026961A (en) Method and system for indexing data of interest within multiple data elements
US8533150B2 (en) Search index generation apparatus
US11100048B2 (en) Methods and systems for metadata tag inheritance between multiple file systems within a storage system
WO2022081812A1 (en) Artificial intelligence driven document analysis, including searching, indexing, comparing or associating datasets based on learned representations
Manne et al. Text categorization with K-nearest neighbor approach
JP2022542920A (en) Active learning for data matching
US8819021B1 (en) Efficient and phased method of processing large collections of electronic data known as “best match first”™ for electronic discovery and other related applications
Qiao et al. StackInsights: Cognitive Learning for Hybrid Cloud Readiness
Freire Author Consolidation across European National Bibliographies.
Sridharan et al. RELEVANT UPDATED DATA RETRIEVAL ARCHITECTURAL MODEL FOR CONTINUOUS TEXT EXTRACTION

Legal Events

Date Code Title Description
AS Assignment

Owner name: COMMVAULT SYSTEMS, INC., NEW JERSEY

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:LUNDE, NORMAN R.;REEL/FRAME:020991/0936

Effective date: 20080501

STCF Information on status: patent grant

Free format text: PATENTED CASE

CC Certificate of correction
AS Assignment

Owner name: BANK OF AMERICA, N.A., AS ADMINISTRATIVE AGENT, NORTH CAROLINA

Free format text: SECURITY INTEREST;ASSIGNOR:COMMVAULT SYSTEMS, INC.;REEL/FRAME:033266/0678

Effective date: 20140630

FPAY Fee payment

Year of fee payment: 4

FEPP Fee payment procedure

Free format text: MAINTENANCE FEE REMINDER MAILED (ORIGINAL EVENT CODE: REM.); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY

FEPP Fee payment procedure

Free format text: 7.5 YR SURCHARGE - LATE PMT W/IN 6 MO, LARGE ENTITY (ORIGINAL EVENT CODE: M1555); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY

MAFP Maintenance fee payment

Free format text: PAYMENT OF MAINTENANCE FEE, 8TH YEAR, LARGE ENTITY (ORIGINAL EVENT CODE: M1552); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY

Year of fee payment: 8

AS Assignment

Owner name: COMMVAULT SYSTEMS, INC., NEW JERSEY

Free format text: RELEASE BY SECURED PARTY;ASSIGNOR:BANK OF AMERICA, N.A.;REEL/FRAME:054913/0905

Effective date: 20180209

AS Assignment

Owner name: JPMORGAN CHASE BANK, N.A., AS ADMINISTRATIVE AGENT, ILLINOIS

Free format text: SECURITY INTEREST;ASSIGNOR:COMMVAULT SYSTEMS, INC.;REEL/FRAME:058496/0836

Effective date: 20211213